Automation Suite on EKS/AKS Installation Guide
Alert runbooks
- For general instructions on using the available tools for alerts, metrics, and visualizations, see Using the monitoring stack.
- For more on how to fix issues and how to create a support bundle for UiPath® Support engineers, see Troubleshooting.
- When contacting UiPath® Support, please include any alerts that are currently firing.
Alert severity | Description
---|---
Info | Unexpected but harmless. Can be silenced but may be useful during diagnostics.
Warning | Indication of a targeted degradation of functionality or a likelihood of degradation in the near future, which may affect the entire cluster. Suggests prompt action (usually within days) to keep the cluster healthy.
Critical | Known to cause serious degradation of functionality that is often widespread in the cluster. Requires immediate action (same day) to repair the cluster.
Prometheus is not able to collect metrics from the target in the alert, which means Grafana dashboards and further alerts based on metrics from that target will not be available. Check other alerts pertaining to that target.
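As a first check, a minimal sketch of commands for confirming whether the target's pods are running at all (the namespace and pod names depend on the target reported in the alert):
# List the pods backing the scrape target and inspect any that are not Running
kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>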
This alert is meant to ensure that the entire alerting pipeline is functional. It is always firing; therefore, it should always be firing in AlertManager and against a receiver. There are integrations with various notification mechanisms that notify you when this alert is not firing, for example, the DeadMansSnitch integration in PagerDuty.
Check the pod with kubectl describe, and its logs with kubectl logs, to see details on possible crashes. If the issue persists, contact UiPath® Support.
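For example, a minimal sketch of the commands mentioned above (pod and namespace names are placeholders):
# Show recent events and container states for the crashing pod
kubectl describe pod <pod-name> -n <namespace>
# Show the logs of the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous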
Check the pod logs with kubectl logs to see if there is any indication of progress. If the issue persists, contact UiPath® Support.
There has been an attempted update to a deployment or statefulset, but it has failed, and a rollback has not yet occurred. Contact UiPath® Support.
In high availability clusters with multiple replicas, this alert fires when the number of replicas is not optimal. This may occur when there are not enough resources in the cluster to schedule the replicas. Check resource utilization, and add capacity as necessary. Otherwise, contact UiPath® Support.
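A minimal sketch for checking resource utilization and unschedulable pods (namespace and pod names are placeholders):
# Check overall node utilization
kubectl top nodes
# List pods that are stuck in Pending, typically because they cannot be scheduled
kubectl get pods -A --field-selector=status.phase=Pending
# Inspect why a specific replica cannot be scheduled
kubectl describe pod <pod-name> -n <namespace>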
An update to a statefulset has failed. Contact UiPath® Support.
See also: StatefulSets.
Daemonset rollout has failed. Contact UiPath® Support.
See also: DaemonSet.
Check kubectl describe of the pod for more information. The most common cause of waiting containers is a failure to pull the image. For air-gapped clusters, this could mean that the local registry is not available. If the issue persists, contact UiPath® Support.
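For example, a minimal sketch for inspecting the waiting container and its image pull events (names are placeholders):
# Show container states and recent events, including image pull errors
kubectl describe pod <pod-name> -n <namespace>
# List recent warning events in the namespace
kubectl get events -n <namespace> --field-selector type=Warning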
This may indicate an issue with one of the nodes. Check the health of each node, and remediate any known issues. Otherwise, contact UiPath® Support.
A job takes more than 12 hours to complete. This is not expected. Contact UiPath® Support.
A job has failed; however, most jobs are retried automatically. If the issue persists, contact UiPath® Support.
The autoscaler cannot scale the targeted resource as configured. If desired is higher than actual, then there may be a lack of resources. If desired is lower than actual, pods may be stuck while shutting down. If the issue persists, contact UiPath® Support.
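A minimal sketch for inspecting the autoscaler's view of desired versus actual replicas (names are placeholders):
# List all horizontal pod autoscalers and their current/desired replica counts
kubectl get hpa -A
# Show scaling events and conditions for a specific autoscaler
kubectl describe hpa <hpa-name> -n <namespace>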
See also: Horizontal Pod Autoscaling.
The number of replicas for a given service has reached its maximum. This happens when the number of requests being made to the cluster is very high. If high traffic is expected and temporary, you may silence this alert. However, this alert is a sign that the cluster is at capacity and cannot handle much more traffic. If more resource capacity is available on the cluster, you can increase the maximum number of replicas for the service by following these instructions:
# Find the horizontal autoscaler that controls the replicas of the desired resource
kubectl get hpa -A
# Increase the number of max replicas of the desired resource, replacing <namespace> <resource> and <maxReplicas>
kubectl -n <namespace> patch hpa <resource> --patch '{"spec":{"maxReplicas":<maxReplicas>}}'
See also: Horizontal Pod Autoscaling.
These warnings indicate that the cluster cannot tolerate node failure. For single-node evaluation clusters, this is known, and these alerts may be silenced. For multi-node HA-ready production setups, these alerts fire when too many nodes become unhealthy to support high availability, and they indicate that the nodes should be brought back to health or replaced.
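A minimal sketch for checking node health and how much capacity is already committed:
# Confirm that all nodes are Ready
kubectl get nodes
# Compare requested CPU/memory against node capacity
kubectl describe nodes | grep -A 8 "Allocated resources"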
KubeCPUQuotaOvercommit, KubeMemoryQuotaOvercommit, KubeQuotaAlmostFull, KubeQuotaFullyUsed, KubeQuotaExceeded
These alerts pertain to namespace resource quotas that only exist in the cluster if added through customization. Namespace resource quotas are not added as part of Automation Suite installation.
See also: Resource Quotas.
When Warning: The available space is less than 30% and is likely to fill up within four days.
When Critical: The available space is less than 10%.
For any services that run out of space, data may be difficult to recover, so volumes should be resized before hitting 0% available space.
For Prometheus-specific alerts, see PrometheusStorageUsage for more details and instructions.
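A minimal sketch for finding the affected volume and checking its current usage (the claim, pod, and mount path come from the alert and your deployment):
# List persistent volume claims and their capacity
kubectl get pvc -A
# Check actual usage from inside a pod that mounts the volume
kubectl exec -n <namespace> <pod-name> -- df -h <mount-path>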
The Kube State Metrics collector is not able to collect metrics from the cluster without errors. This means important alerts may not fire. Contact UiPath® Support.
See also: Kube state metrics at release.
When Warning: A client certificate used to authenticate to the Kubernetes API server expires in less than seven days.
When Critical: A client certificate used to authenticate to the Kubernetes API server expires in less than one day.
You must renew the certificate.
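If your kubeconfig embeds a client certificate (many managed clusters use token-based authentication instead, in which case this does not apply), a minimal sketch for checking its expiry date is:
# Decode the first user's client certificate from the kubeconfig and print its expiry
kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d | openssl x509 -noout -enddate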
Indicates problems with the Kubernetes control plane. Check the health of master nodes, resolve any outstanding issues, and contact UiPath® Support if the issues persist.
This alert indicates that the Kubernetes API server is experiencing a high error rate. This issue could lead to other failures, so it is recommended that you investigate the problem proactively.
Check the logs of the api-server pod to find out the root cause of the issue, using the kubectl logs <pod-name> -n kube-system command.
KubeNodeNotReady, KubeNodeUnreachable, KubeNodeReadinessFlapping, KubeletPlegDurationHigh, KubeletPodStartUpLatencyHigh, KubeletDown
These alerts indicate a problem with a node. In multi-node HA-ready production clusters, pods would likely be rescheduled onto other nodes. If the issue persists, you should remove and drain the node to maintain the health of the cluster. In clusters without extra capacity, another node should be joined to the cluster first.
If the issues persist, contact UiPath® Support.
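A minimal sketch for draining and removing an unhealthy node, assuming the cluster has enough spare capacity to reschedule its pods:
# Stop new pods from being scheduled on the node
kubectl cordon <node-name>
# Evict the pods running on the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Remove the node from the cluster once it has been drained
kubectl delete node <node-name>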
When Warning: A client or server certificate for Kubelet expires in less than seven days.
When Critical: A client or server certificate for Kubelet expires in less than one day.
You must renew the certificate.
There are different semantic versions of Kubernetes components running. This can happen as a result of an unsuccessful Kubernetes upgrade.
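A quick way to see which versions are running on each node (a minimal sketch):
# Print the kubelet version reported by every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
# Print the client and API server versions
kubectl version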
A Kubernetes API server client is experiencing greater than 1% errors. There may be an issue with the node this client is running on, or with the Kubernetes API server itself.
This alert indicates that memory usage is very high on the Kubernetes node.
The MemoryPressure incident type occurs when a Kubernetes cluster node is running low on memory, which can be caused by a memory leak in an application. This incident type requires immediate attention to prevent any downtime and ensure the proper functioning of the Kubernetes cluster.
If this alert fires, try to identify the pod on the node that is consuming the most memory, by taking these steps:
-
Retrieve the node's CPU and memory stats:
kubectl top node
-
Retrieve the pods running on the node:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=${NODE_NAME}
-
Check the memory usage for the pods in a namespace, and inspect the logs of any pod with high usage:
kubectl top pod --namespace <namespace>
kubectl logs -f <pod-name> -n <namespace>
If you are able to identify any pod with high memory usage, check the logs of the pod and look for memory leak errors.
To address the issue, increase the memory spec for the nodes if possible.
If the issue persists, generate the support bundle and contact UiPath® Support.
This alert indicates that disk usage is very high on the Kubernetes node.
If this alert fires, try to identify which pod is consuming the most disk space:
-
Confirm whether the node is under DiskPressure using the following command:
kubectl describe node <node-name>
Look for the DiskPressure condition in the output.
-
Check the disk space usage on the affected node:
df -h
This shows disk usage on all mounted file systems. Identify where the usage is high.
-
If the disk is full and cleanup is insufficient, consider resizing the disk for the node (especially in cloud environments such as AWS or GCP). This process may involve expanding volumes, depending on your infrastructure.
The filesystem on a particular node is filling up.
If this alert fires, consider the following steps:
-
Confirm whether the node is under DiskPressure using the following command:
kubectl describe node <node-name>
Look for the DiskPressure condition in the output.
-
Clear the logs and temporary files. Check for large log files in /var/log/ and clean them, if possible; see the sketch after this list for one way to find them.
-
Check the disk space usage on the affected node:
df -h
This shows disk usage on all mounted file systems. Identify where the usage is high.
-
If the disk is full and cleanup is insufficient, consider resizing the disk for the node (especially in cloud environments such as AWS or GCP). This process may involve expanding volumes, depending on your infrastructure.
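A minimal sketch, run on the affected node, for finding the largest log directories and trimming journal logs (the retention period is an example value):
# Show the largest directories under /var/log
du -xh /var/log | sort -h | tail -n 20
# Keep only the last 7 days of journald logs
journalctl --vacuum-time=7d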
RAID array is in a degraded state due to one or more disk failures. The number of spare drives
is insufficient to fix the issue automatically.
These errors indicate that the network driver is reporting a high number of failures. This can be caused by physical hardware failures or misconfiguration in the physical network. This issue pertains to the OS and is not controlled by the UiPath® application.
This alert is based on the /proc/net/dev counter that the Linux kernel provides.
Contact your network admin and the team that manages the physical infrastructure.
The node has become unresponsive due to some issue causing broken communication between nodes in the cluster.
If the issue persists, reach out to UiPath® Support with the generated support bundle.
These alerts warn when the cluster is approaching the configured limits for memory and storage. This is likely to happen on clusters with a recent substantial increase in usage (usually from Robots rather than users), or when nodes are added to the cluster without adjusting Prometheus resources. This is due to an increase in the amount of metrics being collected. It could also be due to a large number of alerts being fired; in that case, it is important to check why so many alerts are firing.
If this issue persists, contact UiPath® Support with the generated support bundle.
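A minimal sketch for checking the current resource and storage consumption of the monitoring stack, assuming it runs in the cattle-monitoring-system namespace (adjust the namespace to your installation):
# Check CPU and memory usage of the monitoring pods
kubectl top pods -n cattle-monitoring-system
# Check the size of the persistent volumes used by Prometheus
kubectl get pvc -n cattle-monitoring-system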
Alertmanager instances within the same cluster have different configurations. This could indicate a problem with the configuration rollout, which is not consistent across all the instances of Alertmanager.
To fix the issue, take the following steps:
-
Run a diff tool between all the alertmanager.yml files that are deployed to identify the problem; see the sketch after this list for one way to retrieve them.
-
Delete the incorrect secret and deploy the correct one.
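A minimal sketch for retrieving the deployed configurations so they can be compared, assuming the configuration is stored as a secret in the monitoring namespace (secret and key names vary by installation):
# Find the Alertmanager configuration secrets
kubectl -n cattle-monitoring-system get secrets | grep alertmanager
# Decode a configuration for comparison with a diff tool
kubectl -n cattle-monitoring-system get secret <secret-name> -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d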
If the issue persists, contact UiPath® Support.
AlertManager has failed to load or reload configuration. Please check any custom AlertManager configurations for input errors and otherwise contact UiPath® Support and provide the support bundle. For details, see Using the Automation Suite support bundle.
PrometheusOperatorListErrors, PrometheusOperatorWatchErrors, PrometheusOperatorSyncFailed, PrometheusOperatorReconcileErrors, PrometheusOperatorNodeLookupErrors, PrometheusOperatorNotReady, PrometheusOperatorRejectedResources
Internal errors of the Prometheus operator, which controls Prometheus resources. Prometheus itself may still be healthy while these errors are present; however, this error indicates there is degraded monitoring configurability. Contact UiPath® Support.
Prometheus has failed to load or reload configuration. Please check any custom Prometheus configurations for input errors. Otherwise contact UiPath® Support.
PrometheusErrorSendingAlertsToSomeAlertmanagers, PrometheusErrorSendingAlertsToAnyAlertmanager, PrometheusNotConnectedToAlertmanagers
The connection from Prometheus to AlertManager is not healthy. Metrics may still be queryable, and Grafana dashboards may still show them, but alerts will not fire. Check any custom configuration of AlertManager for input errors, and otherwise contact UiPath® Support.
PrometheusNotificationQueueRunningFull, PrometheusTSDBReloadsFailing, PrometheusTSDBCompactionsFailing, PrometheusNotIngestingSamples, PrometheusDuplicateTimestamps, PrometheusOutOfOrderTimestamps, PrometheusRemoteStorageFailures, PrometheusRemoteWriteBehind, PrometheusRemoteWriteDesiredShards
Internal Prometheus errors indicating metrics may not be collected as expected. Please contact UiPath® Support.
This may happen if there are malformed alerts based on non-existent metrics or incorrect PromQL syntax. Contact UiPath® Support if no custom alerts have been added.
Prometheus is not able to evaluate whether alerts should be firing. This may happen if there are too many alerts. Please remove expensive custom alert evaluations and/or see documentation on increasing CPU limit for Prometheus. Contact UiPath® Support if no custom alerts have been added.
UiPathAvailabilityHighTrafficBackend, UiPathAvailabilityMediumTrafficUserFacing, UiPathAvailabilityMediumTrafficBackend, UiPathAvailabilityLowTrafficUserFacing, UiPathAvailabilityLowTrafficBackend
The number of http 500 responses from UiPath® services exceeds a given threshold.
Traffic level | Number of requests in 20 minutes | Error threshold (for http 500s)
---|---|---
High | >100,000 | 0.1%
Medium | Between 10,000 and 100,000 | 1%
Low | <10,000 | 5%
Errors in user-facing services would likely result in degraded functionality that is directly observable in the Automation Suite UI, while errors in backend services would have less obvious consequences.
The alert indicates which service is experiencing a high error rate. To understand what cascading issues there may be from other services that the reporting service depends on, you can use the Istio Workload dashboard, which shows errors between services.
Please double check any recently reconfigured Automation Suite products. Detailed logs are also available with the kubectl logs command. If the error persists, please contact UiPath® Support.
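For example, a minimal sketch for tailing the logs of the affected service, assuming it runs as a deployment in the uipath namespace (adjust to your installation):
# Find the deployment that backs the service named in the alert
kubectl -n uipath get deployments
# Follow its logs to look for the source of the 500 responses
kubectl -n uipath logs -f deployment/<service-name>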
The uipath-infra/istio-configure-script-cronjob cronjob is in a suspended state.
To fix this issue, enable the cronjob by taking the following steps:
# Load the cluster kubeconfig and kubectl path
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
# Unsuspend the cronjob
kubectl -n uipath-infra patch cronjob istio-configure-script-cronjob -p '{"spec":{"suspend":false}}'
# Trigger a manual run of the cronjob and wait for it to complete
epoch=$(date +"%s")
kubectl -n uipath-infra create job istio-configure-script-cronjob-manual-$epoch --from=cronjob/istio-configure-script-cronjob
kubectl -n uipath-infra wait --for=condition=complete --timeout=300s job/istio-configure-script-cronjob-manual-$epoch
# List the node IPs
kubectl get node -o wide
# Verify that all the IPs listed by the above command are part of the output of the command below
kubectl -n istio-system get svc istio-ingressgateway -o json | jq '.spec.externalIPs'
This job obtains the latest Kerberos ticket from the AD server for SQL-integrated authentication. Failures in this job would cause SQL server authentication to fail. Please contact UiPath® Support.
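Before contacting Support, a minimal sketch for inspecting the most recent run of the job, assuming it runs in the uipath namespace (the namespace and job names may differ in your installation):
# Locate the Kerberos ticket refresh cronjob and its recent jobs
kubectl -n uipath get cronjobs | grep -i kerberos
kubectl -n uipath get jobs --sort-by=.metadata.creationTimestamp | grep -i kerberos
# Inspect the logs of the failed job
kubectl -n uipath logs job/<job-name>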
Errors in the request routing layer would result in degraded functionality that is directly observable in the Automation Suite UI. The requests will not be routed to backend services.
Check the logs of the istio-ingressgateway pods in the istio-system namespace. Retrieve the pod name and its logs by running the following commands:
kubectl get pods -n istio-system
kubectl logs <istio-ingressgateway-pod-name> -n istio-system
This alert indicates that the server TLS certificate will expire in the following 30 days.
To fix this issue, update the server TLS certificate. For instructions, see Managing server certificates.
This alert indicates that the server TLS certificate will expire in the following 7 days.
To fix this issue, update the TLS certificate. For instructions, see Managing server certificates.
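To confirm how close the certificate is to expiry, a minimal sketch that inspects the certificate currently served on the cluster FQDN (replace the placeholder with your Automation Suite FQDN):
# Print the expiry date of the certificate presented at the HTTPS endpoint
echo | openssl s_client -connect <fqdn>:443 -servername <fqdn> 2>/dev/null | openssl x509 -noout -enddate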
This alert indicates that the Identity token signing certificate will expire in the following 30 days.
To fix this issue, update the Identity token signing certificate. For instructions, see Managing server certificates.
This alert indicates that the Identity token signing certificate will expire in the following 7 days.
To fix this issue, update the Identity token signing certificate. For instructions, see Managing server certificates.
This alert indicates that the etcd cluster has an insufficient number of members. Note that the cluster must have an odd number of members. The severity of this alert is critical.
Make sure that there is an odd number of server nodes in the cluster, and all of them are up and healthy.
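A minimal sketch for checking the server nodes, assuming etcd runs on the cluster's server nodes; on managed control planes, etcd is operated by the cloud provider and this check does not apply:
# List the nodes and confirm that the server/control-plane nodes are Ready and odd in number
kubectl get nodes -o wide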
This alert shows that the etcd cluster has no leader. The severity of this alert is critical.
This alert indicates that the etcd leader changes more than twice in 10 minutes. This is a warning.
This alert indicates that a certain percentage of GRPC request failures was detected in etcd.
This alert indicates that etcd GRPC requests are slow. This is a warning.
If this alert persists, contact UiPath® Support.
This alert indicates that a certain percentage of HTTP failures was detected in etcd.
This alert indicates that etcd member communication is slowing down. This is a warning.
This alert indicates that the etcd server received more than 5 failed proposals in the last hour. This is a warning.
This alert indicates that etcd WAL fsync duration is increasing. This is a warning.
This alert indicates that the free disk space for the /var/lib/rancher partition is less than:
- 35% – the severity of the alert is warning
- 25% – the severity of the alert is critical
If this alert fires, increase the size of the disk.
This alert indicates that the free disk space for the /var/lib/kubelet partition is less than:
- 35% – the severity of the alert is warning
- 25% – the severity of the alert is critical
If this alert fires, increase the size of the disk.
This alert indicates that the free disk space for the /var partition is less than:
- 35% – the severity of the alert is warning
- 25% – the severity of the alert is critical
The storage requirements for ML skills can substantially increase disk usage.
If this alert fires, increase the size of the disk.
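A minimal sketch, run on the affected node, for checking how full the partitions covered by these alerts currently are:
# Show usage of the partitions monitored by the disk size alerts
df -h /var/lib/rancher /var/lib/kubelet /var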
This alert indicates that the NFS server connection is lost.
You need to check the NFS server connection and mount path.
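A minimal sketch for verifying connectivity to the NFS server and the exported path, assuming the NFS client tools are installed on the node and you know the server host used for backups:
# Check that the NFS server is reachable and lists the expected export
ping -c 3 <nfs-server-host>
showmount -e <nfs-server-host>
# Confirm that the backup mount is still present on the node
mount | grep nfs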
This alert indicates that the backup failed for a PVC.
To address this issue, take the following steps:
-
Check the status of the PVC to ensure it is Bound to a Persistent Volume (PV):
kubectl get pvc --namespace <namespace>
The command lists all PVCs and their current status. The PVC should have a status of Bound to indicate it has successfully claimed a PV. If the status is Pending, the PVC is still waiting for a suitable PV, and further investigation is needed.
-
If the PVC is not in a Bound state, or if you need more detailed information, use the describe command:
kubectl describe pvc <pvc-name> --namespace <namespace>
Look for information on the status, events, and any error messages. For example, an issue could be related to storage class misconfigurations or quota limitations.
-
Check the health of the Persistent Volume (PV) that is bound to the PVC:
kubectl get pv <pv-name>
The status should be Bound. If the PV is in a Released or Failed state, it may indicate issues with the underlying storage.
-
If the PVC is used by a pod, check whether the pod has successfully mounted the volume:
kubectl get pod <pod-name> --namespace <namespace>
If the pod is in a Running state, the PVC is mounted successfully. If the pod is in an error state (such as InitBackOff), it might indicate issues with volume mounting.
-
If there are issues with mounting the PVC, describe the pod to check for any mounting errors:
kubectl describe pod <pod-name> --namespace <namespace>
- Alert severity key
- general.rules
- TargetDown
- Watchdog
- kubernetes-apps
- KubePodCrashLooping
- KubePodNotReady
- KubeDeploymentGenerationMismatch, KubeStatefulSetGenerationMismatch
- KubeDeploymentReplicasMismatch, KubeStatefulSetReplicasMismatch
- KubeStatefulSetUpdateNotRolledOut
- KubeDaemonSetRolloutStuck
- KubeContainerWaiting
- KubeDaemonSetNotScheduled, KubeDaemonSetMisScheduled
- KubeJobCompletion
- KubeJobFailed
- KubeHpaReplicasMismatch
- KubeHpaMaxedOut
- kubernetes-resources
- KubeCPUOvercommit, KubeMemoryOvercommit
- KubeCPUQuotaOvercommit, KubeMemoryQuotaOvercommit, KubeQuotaAlmostFull, KubeQuotaFullyUsed, KubeQuotaExceeded
- CPUThrottlingHigh
- Kubernetes-storage
- KubePersistentVolumeFillingUp
- kube-state-metrics
- KubeStateMetricsListErrors, KubeStateMetricsWatchErrors
- kubernetes-system-apiserver
- KubeClientCertificateExpiration
- AggregatedAPIErrors, AggregatedAPIDown, KubeAPIDown, KubeAPITerminatedRequests
- KubernetesApiServerErrors
- kubernetes-system-kubelet
- KubeNodeNotReady, KubeNodeUnreachable, KubeNodeReadinessFlapping, KubeletPlegDurationHigh, KubeletPodStartUpLatencyHigh, KubeletDown
- KubeletTooManyPods
- KubeletClientCertificateExpiration, KubeletServerCertificateExpiration
- KubeletClientCertificateRenewalErrors, KubeletServerCertificateRenewalErrors
- kubernetes-system
- KubeVersionMismatch
- KubeClientErrors
- KubernetesMemoryPressure
- KubernetesDiskPressure
- Kube-apiserver-slos
- KubeAPIErrorBudgetBurn
- node-exporter
- NodeFilesystemSpaceFillingUp
- NodeRAIDDegraded
- NodeRAIDDiskFailure
- NodeNetworkReceiveErrs
- NodeClockSkewDetected, NodeClockNotSynchronising
- node-network
- NodeNetworkInterfaceFlapping
- InternodeCommunicationBroken
- uipath.prometheus.resource.provisioning.alerts
- PrometheusMemoryUsage, PrometheusStorageUsage
- alertmanager.rules
- AlertmanagerConfigInconsistent
- AlertmanagerFailedReload
- prometheus-operator
- PrometheusOperatorListErrors, PrometheusOperatorWatchErrors, PrometheusOperatorSyncFailed, PrometheusOperatorReconcileErrors, PrometheusOperatorNodeLookupErrors, PrometheusOperatorNotReady, PrometheusOperatorRejectedResources
- prometheus
- PrometheusBadConfig
- PrometheusErrorSendingAlertsToSomeAlertmanagers, PrometheusErrorSendingAlertsToAnyAlertmanager, PrometheusNotConnectedToAlertmanagers
- PrometheusNotificationQueueRunningFull, PrometheusTSDBReloadsFailing, PrometheusTSDBCompactionsFailing, PrometheusNotIngestingSamples, PrometheusDuplicateTimestamps, PrometheusOutOfOrderTimestamps, PrometheusRemoteStorageFailures, PrometheusRemoteWriteBehind, PrometheusRemoteWriteDesiredShards
- PrometheusRuleFailures
- PrometheusMissingRuleEvaluations
- PrometheusTargetLimitHit
- UiPathAvailabilityHighTrafficBackend, UiPathAvailabilityMediumTrafficUserFacing, UiPathAvailabilityMediumTrafficBackend, UiPathAvailabilityLowTrafficUserFacing, UiPathAvailabilityLowTrafficBackend
- uipath.cronjob.alerts.rules
- CronJobSuspended
- UiPath CronJob "kerberos-tgt-refresh" Failed
- IdentityKerberosTgtUpdateFailed
- uipath.requestrouting.alerts
- UiPathRequestRouting
- Server TLS Certificate Alerts
- SecretCertificateExpiry30Days
- SecretCertificateExpiry7Days
- Identity Token Signing Certificate Alerts
- IdentityCertificateExpiry30Days
- IdentityCertificateExpiry7Days
- etcd Alerts
- EtcdInsufficientMembers
- EtcdNoLeader
- EtcdHighNumberOfLeaderChanges
- EtcdHighNumberOfFailedGrpcRequests
- EtcdGrpcRequestsSlow
- EtcdHighNumberOfFailedHttpRequests
- EtcdHttpRequestsSlow
- EtcdMemberCommunicationSlow
- EtcdHighNumberOfFailedProposals
- EtcdHighFsyncDurations
- EtcdHighCommitDurations
- Disk Size Alerts
- LowDiskForRancherPartition
- LowDiskForKubeletPartition
- LowDiskForVarPartition
- Backup Alerts
- NFSServerDisconnected
- VolumeBackupFailed
- BackupDisabled