Automation Suite
2023.10
false
Banner background image
Automation Suite on Linux Installation Guide
Last updated Feb 28, 2024

Management alerts

alertmanager.rules

AlertmanagerConfigInconsistent

These are internal Alertmanager errors for HA clusters with multiple AlertManager replicas. Alerts may appear and disappear intermittently. Temporarily scaling down, then scaling up Alertmanager replicas may fix the issue.

To fix the issue, take the following steps:

  1. Scale to zero. Note that it takes a moment for the pods to shut down:

    kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0
  2. Scale back to two:

    kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2
  3. Check if the Alertmanager pods started and are in the running state:

    kubectl get po -n cattle-monitoring-systemkubectl get po -n cattle-monitoring-system

If the issue persists, contact UiPath Support.

AlertmanagerFailedReload

AlertManager has failed to load or reload configuration. Please check any custom AlertManager configurations for input errors and otherwise contact UiPath Support.

AlertmanagerMembersInconsistent

These are internal Alertmanager errors for HA clusters with multiple AlertManager replicas. Alerts may appear and disappear intermittently. Temporarily scaling down, then scaling up Alertmanager replicas may fix the issue.

To fix the issue, take the following steps:

  1. Scale to zero. Note that it takes a moment for the pods to shut down:

    kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0
  2. Scale back to two:

    kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2
  3. Check if the Alertmanager pods started and are in the running state:

    kubectl get po -n cattle-monitoring-systemkubectl get po -n cattle-monitoring-system

If the issue persists, contact UiPath Support.

general.rules

TargetDown

Prometheus is not able to collect metrics from the target in the alert, which means Grafana dashboards and further alerts based on metrics from that target are not be available. Check other alerts pertaining to that target.

Watchdog

This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing. Therefore, it should always be firing in AlertManager and against a receiver. There are integrations with various notification mechanisms that notify you when this alert is not firing. For example, the DeadMansSnitch integration in PagerDuty.

prometheus-operator

PrometheusOperatorListErrors, PrometheusOperatorWatchErrors, PrometheusOperatorSyncFailed, PrometheusOperatorReconcileErrors, PrometheusOperatorNodeLookupErrors, PrometheusOperatorNotReady, PrometheusOperatorRejectedResources

Internal errors of the Prometheus operator, which controls Prometheus resources. Prometheus itself may still be healthy while these errors are present; however, this error indicates there is degraded monitoring configurability. Contact UiPath Support.

prometheus

This is the start of your concept.

PrometheusBadConfig

Prometheus has failed to load or reload configuration. Please check any custom Prometheus configurations for input errors. Otherwise contact UiPath Support.

PrometheusErrorSendingAlertsToSomeAlertmanagers, PrometheusErrorSendingAlertsToAnyAlertmanager, PrometheusNotConnectedToAlertmanagers

The connection from Prometheus to AlertManager is not healthy. Metrics may still be queryable, and Grafana dashboards may still show them, but alerts will not fire. Check any custom configuration of AlertManager for input errors and and otherwise contact UiPath Support.

PrometheusNotificationQueueRunningFull, PrometheusTSDBReloadsFailing, PrometheusTSDBCompactionsFailing, PrometheusNotIngestingSamples, PrometheusDuplicateTimestamps, PrometheusOutOfOrderTimestamps, PrometheusRemoteStorageFailures, PrometheusRemoteWriteBehind, PrometheusRemoteWriteDesiredShards

Internal Prometheus errors indicating metrics may not be collected as expected. Please contact UiPath Support.

PrometheusRuleFailures

This may happen if there are malformed alerts based on non-existent metrics or incorrect PromQL syntax. Contact UiPath Support if no custom alerts have been added.

PrometheusMissingRuleEvaluations

Prometheus is not able to evaluate whether alerts should be firing. This may happen if there are too many alerts. Please remove expensive custom alert evaluations and/or see documentation on increasing CPU limit for Prometheus. Contact UiPath Support if no custom alerts have been added.

PrometheusTargetLimitHit

There are too many targets for Prometheus to collect from. If extra ServiceMonitors have been added (see Monitoring console), you can remove them.

uipath.prometheus.resource.provisioning.alerts

PrometheusMemoryUsage, PrometheusStorageUsage

These alerts warn when the cluster is approaching the configured limits for memory and storage. This is likely to happen on clusters with a recent substantial increase in usage (usually from Robots rather than users), or when nodes are added to the cluster without adjusting Prometheus resources. This is due to an increase in the amount of metrics being collected.

The rate of increased storage utilization can be seen on the Kubernetes / Persistent Volumes dashboard:



You can adjust it by resizing the PVC as instructed here: Configuring the cluster.

The rate of increased memory utilization can be seen on the Kubernetes / Compute Resources / Pod dashboard.



You can adjust it by editing the Prometheus memory resource limits in the rancher-monitoring app from ArgoCD. The rancher-monitoring app automatically re-syncs after clicking Save.



Note that Prometheus takes some time to restart and start showing metrics in Grafana again. it usually takes less than 10 minutes, even with large clusters.

uipath.availability.alerts

UiPathAvailabilityHighTrafficUserFacing

The number of http 500 responses from UiPath services exceeds a given threshold.

Traffic level

Number of requests in 20 minutes

Error threshold (for http 500s)

High

>100,000

0.1%

Medium

Between 10,000 and 100,000

1%

Low

< 10,000

5%

Errors in user-facing services would likely result in degraded functionality that is directly observable in the Automation Suite UI, while errors in backend services would have less obvious consequences.

The alert indicates which service is experiencing a high error rate. To understand what cascading issues there may be from other services that the reporting service depends on, you can use the Istio Workload dashboard, which shows errors between services.

Please double check any recently reconfigured Automation Suite products. Detailed logs are also available with the kubectl logs command. If the error persists, please contact UiPath Support.

backup

NFSServerDisconnected

This alert indicates that the NFS server connection is lost.

You need to check the NFS server connection and mount path.

VolumeBackupFailed

This alert indicates that the backup failed for a PVC.

BackupDisabled

This alert indicates that the backup is disabled.

You need to check if the cluster is unhealthy.

cronjob-alerts

CronJobSuspended

The uipath-infra/istio-configure-script-cronjob cronjob is in suspended state.

To fix this issue, enable the cronjob by taking the following steps:

export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
kubectl -n uipath-infra patch cronjob istio-configure-script-cronjob -p '{"spec":{"suspend":false}}'
epoch=$(date +"%s")
kubectl -n uipath-infra create job istio-configure-script-cronjob-manual-$epoch --from=cronjob/istio-configure-script-cronjob
kubectl -n uipath-infra wait --for=condition=complete --timeout=300s job/istio-configure-script-cronjob-manual-$epoch
kubectl get node -o wide
#Verify if all the IP's listed by the above command are part of output of below command
kubectl -n istio-system get svc istio-ingressgateway -o json | jq '.spec.externalIPs'export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
kubectl -n uipath-infra patch cronjob istio-configure-script-cronjob -p '{"spec":{"suspend":false}}'
epoch=$(date +"%s")
kubectl -n uipath-infra create job istio-configure-script-cronjob-manual-$epoch --from=cronjob/istio-configure-script-cronjob
kubectl -n uipath-infra wait --for=condition=complete --timeout=300s job/istio-configure-script-cronjob-manual-$epoch
kubectl get node -o wide
#Verify if all the IP's listed by the above command are part of output of below command
kubectl -n istio-system get svc istio-ingressgateway -o json | jq '.spec.externalIPs'

IdentityKerberosTgtUpdateFailed

This job updates the latest Kerberos ticket to all the UiPath services. Failures in this job would cause SQL server authentication to fail. Please contact UiPath Support.

Was this page helpful?

Get The Help You Need
Learning RPA - Automation Courses
UiPath Community Forum
Uipath Logo White
Trust and Security
© 2005-2024 UiPath. All rights reserved.