Automation Suite
2021.10
false
Banner background image
Automation Suite Installation Guide
Last updated Apr 19, 2024

Alert Runbooks

Note:
  • For general instructions on using the available tools for alerts, metrics, and visualizations, see Using the monitoring stack
  • For more on how to fix issues and how to create a support bundle for UiPath Support engineers, see Troubleshooting.
  • When contacting UiPath Support, please include any alerts that are currently firing.

Alert severity key

Alert severity

Description

Info

Unexpected but harmless. Can be silenced but may be useful during diagnostics.

Warning

Indication of a targeted degradation of functionality or a likelihood of degradation in the near future, which may affect the entire cluster. Suggests prompt action (usually within days) to keep cluster healthy.

Critical

Known to cause serious degradation of functionality that is often widespread in the cluster. Requires immediate action (same day) to repair cluster.

general.rules

TargetDown

Prometheus is not able to collect metrics from the target in the alert, which means Grafana dashboards and further alerts based on metrics from that target are not be available. Check other alerts pertaining to that target.

Watchdog

This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing. Therefore, it should always be firing in AlertManager and against a receiver. There are integrations with various notification mechanisms that notify you when this alert is not firing. For example, the DeadMansSnitch integration in PagerDuty.

kubernetes-apps

KubePodCrashLooping

A pod that keeps restarting unexpectedly. This can happen due to an out-of-memory (OOM) error, in which case the limits can be adjusted. Check the pod events with kubectl describe, and logs with kubectl logs to see details on possible crashes. If the issue persists, contact UiPath® Support.

KubePodNotReady

A pod has started, but it is not responding to the health probe with success. This may mean that it is stuck and is not able to serve traffic. You can check pod logs with kubectl logs to see if there is any indication of progress. If the issue persists, contact UiPath® Support.

KubeDeploymentGenerationMismatch, KubeStatefulSetGenerationMismatch

There has been an attempted update to a deployment or statefulset, but it has failed, and a rollback has not yet occurred. Contact UiPath® Support.

KubeDeploymentReplicasMismatch, KubeStatefulSetReplicasMismatch

In high availability clusters with multiple replicas, this alert fires when the number of replicas is not optimal. This may occur when there are not enough resources in the cluster to schedule. Check resource utilization, and add capacity as necessary. Otherwise contact UiPath® Support.

KubeStatefulSetUpdateNotRolledOut

An update to a statefulset has failed. Contact UiPath® Support.

See also: StatefulSets.

KubeDaemonSetRolloutStuck

Daemonset rollout has failed. Contact UiPath® Support.

See also: DaemonSet.

KubeContainerWaiting

A container is stuck in the waiting state. It has been scheduled to a worker node, but it cannot run on that machine. Check kubectl describe of the pod for more information. The most common cause of waiting containers is a failure to pull the image. For air-gapped clusters, this could mean that the local registry is not available. If the issue persists, contact UiPath® Support.

KubeDaemonSetNotScheduled, KubeDaemonSetMisScheduled

This may indicate an issue with one of the nodes Check the health of each node, and remediate any known issues. Otherwise contact UiPath® Support.

KubeJobCompletion

A job takes more than 12 hours to complete. This is not expected. Contact UiPath® Support.

KubeJobFailed

A job has failed; however, most jobs are retried automatically. If the issue persists, contact UiPath® Support.

KubeHpaReplicasMismatch

The autoscaler cannot scale the targeted resource as configured. If desired is higher than actual, then there may be a lack of resources. If desired is lower than actual, pods may be stuck while shutting down. If the issue persists, contact UiPath® Support.

KubeHpaMaxedOut

The number of replicas for a given service has reached its maximum. This happens when the amount of requests being made to the cluster is very high. If high traffic is expected and temporary, you may silence this alert. However, this alert is a sign that the cluster is at capacity and cannot handle much more traffic. If more resource capacity is available on the cluster, you can increase the number of maximum replicas for the service by following these instructions:

# Find the horizontal autoscaler that controls the replicas of the desired resource
kubectl get hpa -A
# Increase the number of max replicas of the desired resource, replacing <namespace> <resource> and <maxReplicas>
kubectl -n <namespace> patch hpa <resource> --patch '{"spec":{"maxReplicas":<maxReplicas>}}'# Find the horizontal autoscaler that controls the replicas of the desired resource
kubectl get hpa -A
# Increase the number of max replicas of the desired resource, replacing <namespace> <resource> and <maxReplicas>
kubectl -n <namespace> patch hpa <resource> --patch '{"spec":{"maxReplicas":<maxReplicas>}}'

kubernetes-resources

KubeCPUOvercommit, KubeMemoryOvercommit

These warnings indicate that the cluster cannot tolerate node failure. For single-node evaluation clusters, this is known, and these alerts may be silenced. For multi-node HA-ready production setups, these alerts fire when too many nodes become unhealthy to support high availability, and they indicate that the nodes should be brought back to health or replaced.

KubeCPUQuotaOvercommit, KubeMemoryQuotaOvercommit, KubeQuotaAlmostFull, KubeQuotaFullyUsed, KubeQuotaExceeded

These alerts pertain to namespace resource quotas that only exist in the cluster if added through customization. Namespace resource quotas are not added as part of Automation Suite installation.

See also: Resource Quotas.

CPUThrottlingHigh

A container’s CPU utilization has been throttled according to the configured limits. This is part of normal Kubernetes operation and may provide useful information when other alerts are firing. You may silence this alert.

Kubernetes-storage

KubePersistentVolumeFillingUp

When Warning: The available space is less than 15% and is likely to fill up within four days.

When Critical: The available space is less than 3%.

For any services that run out of space, data may be difficult to recover, so volumes should be resized before hitting 0% available space. See the following instructions: Configuring the cluster.

For Prometheus-specific alerts, see PrometheusStorageUsage for more details and instructions.

KubePersistentVolumeErrors

PersistentVolume is not able to be provisioned. This means any service requiring the volume would not start. Check for other errors with Longhorn and/or Ceph storage and contact UiPath® Support.

kube-state-metrics

KubeStateMetricsListErrors, KubeStateMetricsWatchErrors

The Kube State Metrics collector is not able to collect metrics from the cluster without errors. This means important alerts may not fire. Contact UiPath® Support.

kubernetes-system-apiserver

KubeClientCertificateExpiration

When Warning: A client certificate used to authenticate to the Kubernetes API server expires in less than seven days.

When Critical: A client certificate used to authenticate to the Kubernetes API server expires in less than one day.

You must renew the certificate.

AggregatedAPIErrors, AggregatedAPIDown, KubeAPIDown, KubeAPITerminatedRequests

Indicates problems with the Kubernetes control plane. Check the health of master nodes, resolve any outstanding issues, and contact UiPath Support if the issues persist.

See also:

kubernetes-system-kubelet

KubeNodeNotReady, KubeNodeUnreachable, KubeNodeReadinessFlapping, KubeletPlegDurationHigh, KubeletPodStartUpLatencyHigh, KubeletDown

These alerts indicate a problem with a node. In multi-node HA-ready production clusters, pods would likely be rescheduled onto other nodes. If the issue persists, you should remove and drain the node to maintain the health of the cluster. In clusters without extra capacity, another node should be joined to the cluster first.

KubeletTooManyPods

There are too many pods running on the specified node.

Join another node to the cluster.

KubeletClientCertificateExpiration, KubeletServerCertificateExpiration

When Warning: A client or server certificate for Kubelet expires in less than seven days.

When Critical: A client or server certificate for Kubelet expires in less than one day.

You must renew the certificate.

KubeletClientCertificateRenewalErrors, KubeletServerCertificateRenewalErrors

Kubelet has failed to renew its client or server certificate. Contact UiPath® support.

kubernetes-system

KubeVersionMismatch

There are different semantic versions of Kubernetes components running. This can happen as a result of an unsuccessful Kubernetes upgrade.

KubeClientErrors

Kubernetes API server client is experiencing greater than 1% errors. There may be an issue with the node this client is running on, or the Kubernetes API server itself.

Kube-apiserver-slos

KubeAPIErrorBudgetBurn

The Kubernetes API server is burning too much error budget.

node-exporter

NodeFilesystemSpaceFillingUp, NodeFilesystemAlmostOutOfSpace, NodeFilesystemFilesFillingUp

The filesystem on a particular node is filling up. Provision more space by adding a disk or mounting unused disks.

NodeRAIDDegraded

RAID array is in a degraded state due to one or more disk failures. The number of spare drives

is insufficient to fix the issue automatically.

NodeRAIDDiskFailure

RAID array needs attention and possibly a disk swap.

NodeNetworkReceiveErrs, NodeNetworkTransmitErrs, NodeHighNumberConntrackEntriesUsed

There is a problem with the physical network interface on the node. If the issues persist, it may need to be replaced.

NodeClockSkewDetected, NodeClockNotSynchronising

There is a problem with the clock on the node. Make sure NTP is configured correctly.

node-network

NodeNetworkInterfaceFlapping

There is a problem with the physical network interface on the node. If the issues persist, it may need to be replaced.

uipath.prometheus.resource.provisioning.alerts

PrometheusMemoryUsage, PrometheusStorageUsage

These alerts warn when the cluster is approaching the configured limits for memory and storage. This is likely to happen on clusters with a recent substantial increase in usage (usually from Robots rather than users), or when nodes are added to the cluster without adjusting Prometheus resources. This is due to an increase in the amount of metrics being collected.

The rate of increased storage utilization can be seen on the Kubernetes / Persistent Volumes dashboard:



You can adjust it by resizing the PVC as instructed here: Configuring the cluster.

The rate of increased memory utilization can be seen on the Kubernetes / Compute Resources / Pod dashboard.



You can adjust it by editing the Prometheus memory resource limits in the rancher-monitoring app from ArgoCD. The rancher-monitoring app automatically re-syncs after clicking Save.



Note that Prometheus takes some time to restart and start showing metrics in Grafana again. it usually takes less than 10 minutes, even with large clusters.

alertmanager.rules

AlertmanagerConfigInconsistent

These are internal Alertmanager errors for HA clusters with multiple AlertManager replicas. Alerts may appear and disappear intermittently. Temporarily scaling down, then scaling up Alertmanager replicas may fix the issue.

To fix the issue, take the following steps:

  1. Scale to zero. Note that it takes a moment for the pods to shut down:

    kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0
  2. Scale back to two:

    kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2
  3. Check if the Alertmanager pods started and are in the running state:

    kubectl get po -n cattle-monitoring-systemkubectl get po -n cattle-monitoring-system

If the issue persists, contact UiPath® Support.

AlertmanagerFailedReload

AlertManager has failed to load or reload configuration. Please check any custom AlertManager configurations for input errors and otherwise contact UiPath® Support.

prometheus-operator

PrometheusOperatorListErrors, PrometheusOperatorWatchErrors, PrometheusOperatorSyncFailed, PrometheusOperatorReconcileErrors, PrometheusOperatorNodeLookupErrors, PrometheusOperatorNotReady, PrometheusOperatorRejectedResources

Internal errors of the Prometheus operator, which controls Prometheus resources. Prometheus itself may still be healthy while these errors are present; however, this error indicates there is degraded monitoring configurability. Contact UiPath® Support.

prometheus

PrometheusBadConfig

Prometheus has failed to load or reload configuration. Please check any custom Prometheus configurations for input errors. Otherwise contact UiPath® Support.

PrometheusErrorSendingAlertsToSomeAlertmanagers, PrometheusErrorSendingAlertsToAnyAlertmanager, PrometheusNotConnectedToAlertmanagers

The connection from Prometheus to AlertManager is not healthy. Metrics may still be queryable, and Grafana dashboards may still show them, but alerts will not fire. Check any custom configuration of AlertManager for input errors and and otherwise contact UiPath® Support.

PrometheusNotificationQueueRunningFull, PrometheusTSDBReloadsFailing, PrometheusTSDBCompactionsFailing, PrometheusNotIngestingSamples, PrometheusDuplicateTimestamps, PrometheusOutOfOrderTimestamps, PrometheusRemoteStorageFailures, PrometheusRemoteWriteBehind, PrometheusRemoteWriteDesiredShards

Internal Prometheus errors indicating metrics may not be collected as expected. Please contact UiPath® Support.

PrometheusRuleFailures

This may happen if there are malformed alerts based on non-existent metrics or incorrect PromQL syntax. Contact UiPath® Support if no custom alerts have been added.

PrometheusMissingRuleEvaluations

Prometheus is not able to evaluate whether alerts should be firing. This may happen if there are too many alerts. Please remove expensive custom alert evaluations and/or see documentation on increasing CPU limit for Prometheus. Contact UiPath® Support if no custom alerts have been added.

PrometheusTargetLimitHit

There are too many targets for Prometheus to collect from. If extra ServiceMonitors have been added (see Monitoring console), you can remove them.

uipath.availability.alerts

UiPathAvailabilityHighTrafficUserFacing, UiPathAvailabilityHighTrafficBackend, UiPathAvailabilityMediumTrafficUserFacing, UiPathAvailabilityMediumTrafficBackend, UiPathAvailabilityLowTrafficUserFacing, UiPathAvailabilityLowTrafficBackend

The number of http 500 responses from UiPath® services exceeds a given threshold.

Traffic level

Number of requests in 20 minutes

Error threshold (for http 500s)

High

>100,000

0.1%

Medium

Between 10,000 and 100,000

1%

Low

< 10,000

5%

Errors in user-facing services would likely result in degraded functionality that is directly observable in the Automation Suite UI, while errors in backend services would have less obvious consequences.

The alert indicates which service is experiencing a high error rate. To understand what cascading issues there may be from other services that the reporting service depends on, you can use the Istio Workload dashboard, which shows errors between services.

Please double check any recently reconfigured Automation Suite products. Detailed logs are also available with the kubectl logs command. If the error persists, please contact UiPath® Support.

uipath.cronjob.alerts.rules

UiPath CronJob "kerberos-tgt-refresh" Failed

This job obtains the latest Kerberos ticket from the AD server for SQL-integrated authentication. Failures in this job would cause SQL server authentication to fail. Please contact UiPath® Support.

UiPath CronJob Kerberos-tgt-secret-update Failed

This job updates the latest Kerberos ticket to all the UiPath services. Failures in this job would cause SQL server authentication to fail. Please contact UiPath Support.

Osd-alert.rules

CephOSDNearFull

When the alert severity is Warning, the available space is less than 25% and is likely to fill up very soon.

For any services that run out of space, data may be difficult to recover, so you should resize volumes before hitting 10% available space. See the following instructions: Configuring the cluster.

CephOSDCriticallyFull

When the alert severity is Critical, the available space is less than 20%.

For any services that run out of space, data may be difficult to recover, so you should resize volumes before hitting 10% available space. See the following instructions: Configuring the cluster.

Was this page helpful?

Get The Help You Need
Learning RPA - Automation Courses
UiPath Community Forum
Uipath Logo White
Trust and Security
© 2005-2024 UiPath. All rights reserved.