- Overview
- Requirements
- Installation
- Post-installation
- Cluster administration
- Managing products
- Managing the cluster in ArgoCD
- Setting up the external NFS server
- Automated: Enabling the Backup on the Cluster
- Automated: Disabling the Backup on the Cluster
- Automated, Online: Restoring the Cluster
- Automated, Offline: Restoring the Cluster
- Manual: Enabling the Backup on the Cluster
- Manual: Disabling the Backup on the Cluster
- Manual, Online: Restoring the Cluster
- Manual, Offline: Restoring the Cluster
- Additional configuration
- Migrating objectstore from persistent volume to raw disks
- Monitoring and alerting
- Using the monitoring stack
- Alert Runbooks
- Migration and upgrade
- Migration options
- Step 1: Moving the Identity organization data from standalone to Automation Suite
- Step 2: Restoring the standalone product database
- Step 3: Backing up the platform database in Automation Suite
- Step 4: Merging organizations in Automation Suite
- Step 5: Updating the migrated product connection strings
- Step 6: Migrating standalone Insights
- Step 7: Deleting the default tenant
- B) Single tenant migration
- Product-specific configuration
- Best practices and maintenance
- Troubleshooting
- How to Troubleshoot Services During Installation
- How to Uninstall the Cluster
- How to clean up offline artifacts to improve disk space
- How to clear Redis data
- How to enable Istio logging
- How to manually clean up logs
- How to clean up old logs stored in the sf-logs bucket
- How to disable streaming logs for AI Center
- How to debug failed Automation Suite installations
- How to delete images from the old installer after upgrade
- How to automatically clean up Longhorn snapshots
- How to disable TX checksum offloading
- How to address weak ciphers in TLS 1.2
- Unable to run an offline installation on RHEL 8.4 OS
- Error in Downloading the Bundle
- Offline installation fails because of missing binary
- Certificate issue in offline installation
- First installation fails during Longhorn setup
- SQL connection string validation error
- Prerequisite check for selinux iscsid module fails
- Azure disk not marked as SSD
- Failure After Certificate Update
- Automation Suite not working after OS upgrade
- Automation Suite Requires Backlog_wait_time to Be Set 1
- Volume unable to mount due to not being ready for workloads
- RKE2 fails during installation and upgrade
- Failure to upload or download data in objectstore
- PVC resize does not heal Ceph
- Failure to Resize Objectstore PVC
- Rook Ceph or Looker pod stuck in Init state
- StatefulSet volume attachment error
- Failure to create persistent volumes
- Storage reclamation patch
- Backup failed due to TooManySnapshots error
- All Longhorn replicas are faulted
- Setting a timeout interval for the management portals
- Update the underlying directory connections
- Cannot Log in After Migration
- Kinit: Cannot Find KDC for Realm <AD Domain> While Getting Initial Credentials
- Kinit: Keytab Contains No Suitable Keys for *** While Getting Initial Credentials
- GSSAPI Operation Failed With Error: An Invalid Status Code Was Supplied (Client's Credentials Have Been Revoked).
- Alarm Received for Failed Kerberos-tgt-update Job
- SSPI Provider: Server Not Found in Kerberos Database
- Login Failed for User <ADDOMAIN><aduser>. Reason: The Account Is Disabled.
- ArgoCD login failed
- Failure to get the sandbox image
- Pods not showing in ArgoCD UI
- Redis Probe Failure
- RKE2 Server Fails to Start
- Secret Not Found in UiPath Namespace
- After the Initial Install, ArgoCD App Went Into Progressing State
- MongoDB pods in CrashLoopBackOff or pending PVC provisioning after deletion
- Unexpected Inconsistency; Run Fsck Manually
- Degraded MongoDB or Business Applications After Cluster Restore
- Missing Self-heal-operator and Sf-k8-utils Repo
- Unhealthy Services After Cluster Restore or Rollback
- RabbitMQ pod stuck in CrashLoopBackOff
- Prometheus in CrashloopBackoff state with out-of-memory (OOM) error
- Missing Ceph-rook metrics from monitoring dashboards
- Pods cannot communicate with FQDN in a proxy environment
- Using the Automation Suite Diagnostics Tool
- Using the Automation Suite support bundle
- Exploring Logs
Automation Suite Installation Guide
Alert Runbooks
- For general instructions on using the available tools for alerts, metrics, and visualizations, see Using the monitoring stack
- For more on how to fix issues and how to create a support bundle for UiPath Support engineers, see Troubleshooting.
- When contacting UiPath Support, please include any alerts that are currently firing.
Alert severity |
Description |
---|---|
Info | Unexpected but harmless. Can be silenced but may be useful during diagnostics. |
Warning | Indication of a targeted degradation of functionality or a likelihood of degradation in the near future, which may affect the entire cluster. Suggests prompt action (usually within days) to keep cluster healthy. |
Critical | Known to cause serious degradation of functionality that is often widespread in the cluster. Requires immediate action (same day) to repair cluster. |
Prometheus is not able to collect metrics from the target in the alert, which means Grafana dashboards and further alerts based on metrics from that target are not be available. Check other alerts pertaining to that target.
This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing. Therefore, it should always be firing in AlertManager and against a receiver. There are integrations with various notification mechanisms that notify you when this alert is not firing. For example, the DeadMansSnitch integration in PagerDuty.
kubectl describe
, and logs with kubectl logs
to see details on possible crashes. If the issue persists, contact UiPath® Support.
kubectl logs
to see if there is any indication of progress. If the issue persists, contact UiPath® Support.
There has been an attempted update to a deployment or statefulset, but it has failed, and a rollback has not yet occurred. Contact UiPath® Support.
In high availability clusters with multiple replicas, this alert fires when the number of replicas is not optimal. This may occur when there are not enough resources in the cluster to schedule. Check resource utilization, and add capacity as necessary. Otherwise contact UiPath® Support.
An update to a statefulset has failed. Contact UiPath® Support.
See also: StatefulSets.
Daemonset rollout has failed. Contact UiPath® Support.
See also: DaemonSet.
kubectl describe
of the pod for more information. The most common cause of waiting containers is a failure to pull the image. For air-gapped
clusters, this could mean that the local registry is not available. If the issue persists, contact UiPath® Support.
This may indicate an issue with one of the nodes Check the health of each node, and remediate any known issues. Otherwise contact UiPath® Support.
A job takes more than 12 hours to complete. This is not expected. Contact UiPath® Support.
A job has failed; however, most jobs are retried automatically. If the issue persists, contact UiPath® Support.
The autoscaler cannot scale the targeted resource as configured. If desired is higher than actual, then there may be a lack of resources. If desired is lower than actual, pods may be stuck while shutting down. If the issue persists, contact UiPath® Support.
See also: Horizontal Pod Autoscaling
The number of replicas for a given service has reached its maximum. This happens when the amount of requests being made to the cluster is very high. If high traffic is expected and temporary, you may silence this alert. However, this alert is a sign that the cluster is at capacity and cannot handle much more traffic. If more resource capacity is available on the cluster, you can increase the number of maximum replicas for the service by following these instructions:
# Find the horizontal autoscaler that controls the replicas of the desired resource
kubectl get hpa -A
# Increase the number of max replicas of the desired resource, replacing <namespace> <resource> and <maxReplicas>
kubectl -n <namespace> patch hpa <resource> --patch '{"spec":{"maxReplicas":<maxReplicas>}}'
# Find the horizontal autoscaler that controls the replicas of the desired resource
kubectl get hpa -A
# Increase the number of max replicas of the desired resource, replacing <namespace> <resource> and <maxReplicas>
kubectl -n <namespace> patch hpa <resource> --patch '{"spec":{"maxReplicas":<maxReplicas>}}'
See also: Horizontal Pod Autoscaling.
These warnings indicate that the cluster cannot tolerate node failure. For single-node evaluation clusters, this is known, and these alerts may be silenced. For multi-node HA-ready production setups, these alerts fire when too many nodes become unhealthy to support high availability, and they indicate that the nodes should be brought back to health or replaced.
KubeCPUQuotaOvercommit, KubeMemoryQuotaOvercommit, KubeQuotaAlmostFull, KubeQuotaFullyUsed, KubeQuotaExceeded
These alerts pertain to namespace resource quotas that only exist in the cluster if added through customization. Namespace resource quotas are not added as part of Automation Suite installation.
See also: Resource Quotas.
When Warning: The available space is less than 30% and is likely to fill up within four days.
When Critical: The available space is less than 10%.
For any services that run out of space, data may be difficult to recover, so volumes should be resized before hitting 0% available space.
For instructions, see Configuring the cluster.
For Prometheus-specific alerts, see PrometheusStorageUsage for more details and instructions.
The Kube State Metrics collector is not able to collect metrics from the cluster without errors. This means important alerts may not fire. Contact UiPath® Support.
See also: Kube state metrics at release.
When Warning: A client certificate used to authenticate to the Kubernetes API server expires in less than seven days.
When Critical: A client certificate used to authenticate to the Kubernetes API server expires in less than one day.
You must renew the certificate.
Indicates problems with the Kubernetes control plane. Check the health of master nodes, resolve any outstanding issues, and contact UiPath® Support if the issues persist.
See also:
This alert indicates that the Kubernetes API server is experiencing a high error rate. This issue could lead to other failures, so it is recommended that you investigate the problem proactively.
api-server
pod to find out the root cause of the issue using the kubectl logs <pod-name> -n kube-system
command.
KubeNodeNotReady, KubeNodeUnreachable, KubeNodeReadinessFlapping, KubeletPlegDurationHigh, KubeletPodStartUpLatencyHigh, KubeletDown
These alerts indicate a problem with a node. In multi-node HA-ready production clusters, pods would likely be rescheduled onto other nodes. If the issue persists, you should remove and drain the node to maintain the health of the cluster. In clusters without extra capacity, another node should be joined to the cluster first.
There are too many pods running on the specified node.
Join another node to the cluster.
When Warning: A client or server certificate for Kubelet expires in less than seven days.
When Critical: A client or server certificate for Kubelet expires in less than one day.
You must renew the certificate.
There are different semantic versions of Kubernetes components running. This can happen as a result of an unsuccessful Kubernetes upgrade.
Kubernetes API server client is experiencing greater than 1% errors. There may be an issue with the node this client is running on, or the Kubernetes API server itself.
This alert indicates that memory usage is very high on the Kubernetes node.
If this alert fires, try to see which pod is consuming more memory.
The filesystem on a particular node is filling up. Provision more space by adding a disk or mounting unused disks.
RAID array is in a degraded state due to one or more disk failures. The number of spare drives
is insufficient to fix the issue automatically.
There is a problem with the physical network interface on the node. If the issues persist, it may need to be replaced.
The node has become unresponsive due to some issue causing broken communication between nodes in the cluster.
To fix this problem, restart the affected node. If the issue persists, reach out to UiPath® Support with the Support Bundle Tool.
These alerts warn when the cluster is approaching the configured limits for memory and storage. This is likely to happen on clusters with a recent substantial increase in usage (usually from Robots rather than users), or when nodes are added to the cluster without adjusting Prometheus resources. This is due to an increase in the amount of metrics being collected.
The rate of increased storage utilization can be seen on the Kubernetes / Persistent Volumes dashboard:
You can adjust it by resizing the PVC as instructed here: Configuring the cluster.
The rate of increased memory utilization can be seen on the Kubernetes / Compute Resources / Pod dashboard.
You can adjust it by editing the Prometheus memory resource limits in the rancher-monitoring app from ArgoCD. The rancher-monitoring app automatically re-syncs after clicking Save.
Note that Prometheus takes some time to restart and start showing metrics in Grafana again. it usually takes less than 10 minutes, even with large clusters.
These are internal Alertmanager errors for HA clusters with multiple AlertManager replicas. Alerts may appear and disappear intermittently. Temporarily scaling down, then scaling up Alertmanager replicas may fix the issue.
To fix the issue, take the following steps:
-
Scale to zero. Note that it takes a moment for the pods to shut down:
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0 -
Scale back to two:
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2 -
Check if the Alertmanager pods started and are in the running state:
kubectl get po -n cattle-monitoring-system
kubectl get po -n cattle-monitoring-system
If the issue persists, contact UiPath® Support.
PrometheusOperatorListErrors, PrometheusOperatorWatchErrors, PrometheusOperatorSyncFailed, PrometheusOperatorReconcileErrors, PrometheusOperatorNodeLookupErrors, PrometheusOperatorNotReady, PrometheusOperatorRejectedResources
Internal errors of the Prometheus operator, which controls Prometheus resources. Prometheus itself may still be healthy while these errors are present; however, this error indicates there is degraded monitoring configurability. Contact UiPath® Support.
Prometheus has failed to load or reload configuration. Please check any custom Prometheus configurations for input errors. Otherwise contact UiPath® Support.
PrometheusErrorSendingAlertsToSomeAlertmanagers, PrometheusErrorSendingAlertsToAnyAlertmanager, PrometheusNotConnectedToAlertmanagers
The connection from Prometheus to AlertManager is not healthy. Metrics may still be queryable, and Grafana dashboards may still show them, but alerts will not fire. Check any custom configuration of AlertManager for input errors and and otherwise contact UiPath® Support.
PrometheusNotificationQueueRunningFull, PrometheusTSDBReloadsFailing, PrometheusTSDBCompactionsFailing, PrometheusNotIngestingSamples, PrometheusDuplicateTimestamps, PrometheusOutOfOrderTimestamps, PrometheusRemoteStorageFailures, PrometheusRemoteWriteBehind, PrometheusRemoteWriteDesiredShards
Internal Prometheus errors indicating metrics may not be collected as expected. Please contact UiPath® Support.
This may happen if there are malformed alerts based on non-existent metrics or incorrect PromQL syntax. Contact UiPath® Support if no custom alerts have been added.
Prometheus is not able to evaluate whether alerts should be firing. This may happen if there are too many alerts. Please remove expensive custom alert evaluations and/or see documentation on increasing CPU limit for Prometheus. Contact UiPath® Support if no custom alerts have been added.
UiPathAvailabilityHighTrafficUserFacing, UiPathAvailabilityHighTrafficBackend, UiPathAvailabilityMediumTrafficUserFacing, UiPathAvailabilityMediumTrafficBackend, UiPathAvailabilityLowTrafficUserFacing, UiPathAvailabilityLowTrafficBackend
The number of http 500 responses from UiPath® services exceeds a given threshold.
Traffic level |
Number of requests in 20 minutes |
Error threshold (for http 500s) |
---|---|---|
High |
>100,000 |
0.1% |
Medium |
Between 10,000 and 100,000 |
1% |
Low |
< 10,000 |
5% |
Errors in user-facing services would likely result in degraded functionality that is directly observable in the Automation Suite UI, while errors in backend services would have less obvious consequences.
The alert indicates which service is experiencing a high error rate. To understand what cascading issues there may be from other services that the reporting service depends on, you can use the Istio Workload dashboard, which shows errors between services.
Please double check any recently reconfigured Automation Suite products. Detailed logs are also available with the kubectl logs command. If the error persists, please contact UiPath® Support.
uipath-infra/istio-configure-script-cronjob
cronjob is in suspended state.
To fix this issue, enable the cronjob by taking the following steps:
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
kubectl -n uipath-infra patch cronjob istio-configure-script-cronjob -p '{"spec":{"suspend":false}}'
epoch=$(date +"%s")
kubectl -n uipath-infra create job istio-configure-script-cronjob-manual-$epoch --from=cronjob/istio-configure-script-cronjob
kubectl -n uipath-infra wait --for=condition=complete --timeout=300s job/istio-configure-script-cronjob-manual-$epoch
kubectl get node -o wide
#Verify if all the IP's listed by the above command are part of output of below command
kubectl -n istio-system get svc istio-ingressgateway -o json | jq '.spec.externalIPs'
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
kubectl -n uipath-infra patch cronjob istio-configure-script-cronjob -p '{"spec":{"suspend":false}}'
epoch=$(date +"%s")
kubectl -n uipath-infra create job istio-configure-script-cronjob-manual-$epoch --from=cronjob/istio-configure-script-cronjob
kubectl -n uipath-infra wait --for=condition=complete --timeout=300s job/istio-configure-script-cronjob-manual-$epoch
kubectl get node -o wide
#Verify if all the IP's listed by the above command are part of output of below command
kubectl -n istio-system get svc istio-ingressgateway -o json | jq '.spec.externalIPs'
This job obtains the latest Kerberos ticket from the AD server for SQL-integrated authentication. Failures in this job would cause SQL server authentication to fail. Please contact UiPath® Support.
This alert indicates that the Ceph storage cluster utilization has crossed 75% and will become read-only at 85%.
If this alert fires, free up some space in Ceph by deleting some unused datasets in AI Center or Task Mining or expand the storage available for Ceph PVC by following the instructions in Resizing PVC.
Before resizing PVC, make sure you meet the storage requirements. For details, see Evaluating your storage needs.
This alert indicates that Ceph storage cluster utilization has crossed 80% and will become read-only at 85%.
If this alert fires, free up some space in Ceph by deleting some unused datasets in AI Center or Task Mining or expand the storage available for Ceph PVC by following the instructions in Resizing PVC.
Before resizing PVC, make sure you meet the storage requirements. For details, see Evaluating your storage needs.
This alert indicates that Ceph storage cluster utilization has crossed 85% and will become read-only now. Free up some space or expand the storage cluster immediately.
If this alert fires, free up some space in Ceph by deleting some unused datasets in AI Center or Task Mining or expand the storage available for Ceph PVC by following the instructions in Resizing PVC.
Before resizing PVC, make sure you meet the storage requirements. For details, see Evaluating your storage needs.
This alert indicates that Ceph storage pool usage has crossed 90%.
If this alert fires, free up some space in CEPH by deleting some unused datasets in AI Center or Task Mining or expand the storage available for Ceph PVC by following the instructions in Resizing PVC.
Before resizing PVC, make sure you meet the storage requirements. For details, see Evaluating your storage needs.
This alert indicates that the Ceph storage cluster has been in error state for more than 10m.
rook-ceph-mgr
job has been in error state for an unacceptable amount of time. Check for other alerts that might have triggered prior to
this one and troubleshoot those first.
This alert indicates that storage cluster quorum is low.
Multiple mons work together to provide redundancy; this is possible because each keeps a copy of the metadata. The cluster is deployed with 3 mons, and requires 2 or more mons to be up and running for quorum and for the storage operations to run. If quorum is lost, access to data is at risk.
If this alert fires, check if any OSDs are in terminating state, if there are any, force delete those pods, and wait for some time for the operator to reconcile. If the issue persists, contact UiPath® support.
When the alert severity is Critical, the available space is less than 20%.
For any services that run out of space, data may be difficult to recover, so you should resize volumes before hitting 10% available space. See the following instructions: Configuring the cluster.
Errors in the request routing layer would result in degraded functionality that is directly observable in the Automation Suite UI. The requests will not be routed to backend services.
kubectl logs
command in the Istio ingress gateway pod. If the error persists, contact UiPath® Support.
This alert indicate that less than 3 nodes are running in the RabbitMQ cluster.
kubectl logs <pod-name> -n <namespace>
command
To fix the issue, delete the pod using the kubectl delete pod <pod-name> -n <namespace>
command, and check again once the new pod comes up.
This alert is fired if the MongoDB TLS certificate does not automatically rotate in the 19-day timeframe. The severity of this alert is critical.
To rotate the certificate, follow the instructions in MongoDB certificate renewal.
This alert triggers when MongoDB is down. The severity of this alert is critical.
If this alert is fired, take the following steps:
- Check the logs using the following command:
kubectl logs <pod-name> -n mongodb
; - Use the Diagnostics Tool;
- Contact UiPath Support.
The MongoDB replication set member, as seen from another member of the set, is unreachable. If the alert is fired, then most probably the node is down. The severity of this alert is critical.
If this alert is fired, take the following steps:
- Check if the node is down;
- If the node is down, restart it and find the root cause;
- If the issue persists, contact UiPath Support.
The status of the MongoDB replication set member, as seen from another member of the set, is not yet known. Is this alert is fired, one or more replicas are not in running state. The severity of this alert is critical.
If this alert is fired, take the following steps:
- Check the logs by running the following command:
kubectl logs <pod-name> -n mongodb
; - To see the details on the replica status, run the following command for describing the pod:
kubectl describe <pod-name> -n mongodb
; - If the issue persists, contact UiPath Support.
This alert indicates that MongoDB replication lag is more than 10 seconds. The severity of this alert is critical.
If this alert is fired, take the following steps:
- Check the logs by running the following command:
kubectl logs <pod-name> -n mongodb
; - To see details on the replica status, run the following command for describing the pod:
kubectl describe <pod-name> -n mongodb
- If the issue persists, contact UiPath Support.
This alert indicates that the number of connections has reached its maximum. If this is expected and temporary, you may silence the alert. However, the alert is a sign that the Mongo connection is at limit and cannot handle more. This alert is a warning.
If this alert is fired, take the following steps:
-
To query the number of connections on the node, run the following command:
db.serverStatus().connections
current
indicates existing connectionsavailable
indicates the number of available connections;
- If the issue persists, contact UiPath Support.
This alert indicates a high latency in the instance. This may mean that the traffic has increased on a node. There may be due to a replica not being healthy or traffic overloaded on a replica. If this is expected and temporary, you may silence this alert. However, this alert is a sign that the instance is at its limit and cannot handle more. The severity of this alert is critical.
If this alert is fired, take the following steps:
- Check the logs and heath of instances;
- If the issue persists, contact UiPath Support.
MongoDB replication set member either performs startup self-checks, or transitions from completing a rollback or resync. The severity of this alert is critical.
If this alert is fired, take the following steps:
- Check the status of replica by running the following command:
rs.status()
. - Check the logs using
kubectl logs <pod-name> -n mongodb
- If the issue persists, contact UiPath Support.
MongoDB replication set member is actively performing a rollback. Data is not available for reads. The severity of this alert is critical.
If this alert is fired, take the following steps:
- Check the status of the replica by running the following command:
rs.status()
; - Check the logs by running the following command:
kubectl logs <pod-name> -n mongodb
; - If the issue persists, contact UiPath Support.
MongoDB replication set member was once in a replica set but was subsequently removed. The severity of this alert is critical.
If this alert is fired, take the following steps:
- Check the status of the replica by running the following command:
rs.status()
; - Check the logs by running the following command:
kubectl logs <pod-name> -n mongodb
; - If the issue persists, contact UiPath Support.
This alert indicates that the server TLS certificate will expire in the following 30 days.
To fix this issue, update the server TLS certificate. For instructions, see Managing server certificates.
This alert indicates that the server TLS certificate will expire in the following 7 days.
To fix this issue, update the TLS certificate. For instructions, see Managing server certificates.
This alert indicates that the Identity token signing certificate will expire in the following 30 days.
To fix this issue, update the Identity token signing certificate. For instructions, see Managing server certificates.
This alert indicates that the Identity token signing certificate will expire in the following 7 days.
To fix this issue, update the Identity token signing certificate. For instructions, see Managing server certificates.
This alert indicates that the etcd cluster has an insufficient number of members. Note that the cluster must have an odd number of members. The severity of this alert is critical.
Make sure that there is an odd number of server nodes in the cluster, and all of them are up and healthy.
This alert shows that the etcd cluster has no leader. The severity of this alert is critical.
This alert indicates that the etcd leader changes more than twice in 10 minutes. This is a warning.
This alert indicates that a certain percentage of GRPC request failures was detected in etcd.
This alert indicates that a certain percentage of HTTP failures was detected in etcd.
This alert indicates that etcd member communication is slowing down. This is a warning.
This alert indicates that the etcd server received more than 5 failed proposals in the last hour. This is a warning.
This alert indicates that etcd WAL fsync duration is increasing. This is a warning.
/var/lib/rancher
partition is less than:
- 35% – the severity of the alert is warning
- 25% – the severity of the alert is critical
If this alert fires, increase the size of the disk.
/var/lib/kubelet
partition is less than:
- 35% – the severity of the alert is warning
-
25% – the severity of the alert is critical
If this alert fires, increase the size of the disk.
This alert indicates that the free space for the Longhorn disk is less than:
- 35% – the severity of the alert is warning
- 25% – the severity of the alert is critical
If this alert fires, increase the size of the disk.
/var
partition is less than:
- 35% – the severity of the alert is warning
- 25% – the severity of the alert is critical
The storage requirements for ML skills can substantially increase disk usage.
If this alert fires, increase the size of the disk.
This alert indicates that the NFS server connection is lost.
You need to check the NFS server connection and mount path.
If the cumulative number of backup or snapshot objects created by Longhorn is too high, you may encounter one of the following alerts:
To fix the issue causing these alerts to be triggered, run the following script:
#!/bin/bash
set -e
# longhorn backend URL
url=
# By default, snapshot older than 10 days will be deleted
days=-1
function display_usage() {
echo "usage: $(basename "$0") [-h] -u longhorn-url -d days"
echo " -u Longhorn URL"
echo " -d Number of days(should be >0). By default, script will delete snapshot older than 10 days."
echo " -h Print help"
}
while getopts 'hd:u:' flag "$@"; do
case "${flag}" in
u)
url=${OPTARG}
;;
d)
days=${OPTARG}
[ "$days" ] && [ -z "${days//[0-9]}" ] || { echo "Invalid number of days=$days"; exit 1; }
;;
h)
display_usage
exit 0
;;
:)
echo "Invalid option: ${OPTARG} requires an argument."
exit 1
;;
*)
echo "Unexpected option ${flag}"
exit 1
;;
esac
done
[[ -z "$url" ]] && echo "Missing longhorn URL" && exit 1
# check if URL is valid
curl -s --connect-timeout 30 ${url}/v1 >> /dev/null || { echo "Unable to connect to longhorn backend"; exit 1; }
echo "Deleting snapshots older than $days days"
# Fetch list of longhorn volumes
vols=$( (curl -s -X GET ${url}/v1/volumes |jq -r '.data[].name') )
#delete given snapshot for given volume
function delete_snapshot() {
local vol=$1
local snap=$2
[[ -z "$vol" || -z "$snap" ]] && echo "Error: delete_snapshot: Empty argument" && return 1
curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotDelete -d '{"name": "'$snap'"}' >> /dev/null
echo "Snapshot=$snap deleted for volume=$vol"
}
#perform cleanup for given volume
function cleanup_volume() {
local vol=$1
local deleted_snap=0
[[ -z "$vol" ]] && echo "Error: cleanup_volume: Empty argument" && return 1
# fetch list of snapshot
snaps=$( (curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotList | jq -r '.data[] | select(.usercreated==true) | .name' ) )
for i in ${snaps[@]}; do
echo $i
if [[ $i == "volume-head" ]]; then
continue
fi
# calculate date difference for snapshot
snapTime=$(curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotGet -d '{"name":"'$i'"}' |jq -r '.created')
currentTime=$(date "+%s")
timeDiff=$((($(date -d $snapTime "+%s") - $currentTime) / 86400))
if [[ $timeDiff -lt $days ]]; then
echo "Ignoring snapshot $i, since it is older than $timeDiff days"
continue
fi
#trigger deletion for snapshot
delete_snapshot $vol $i
deleted_snap=$((deleted_snap+1))
done
if [[ "$deleted_snap" -gt 0 ]]; then
#trigger purge for volume
curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotPurge >> /dev/null
fi
}
for i in ${vols[@]}; do
cleanup_volume $i
done
#!/bin/bash
set -e
# longhorn backend URL
url=
# By default, snapshot older than 10 days will be deleted
days=-1
function display_usage() {
echo "usage: $(basename "$0") [-h] -u longhorn-url -d days"
echo " -u Longhorn URL"
echo " -d Number of days(should be >0). By default, script will delete snapshot older than 10 days."
echo " -h Print help"
}
while getopts 'hd:u:' flag "$@"; do
case "${flag}" in
u)
url=${OPTARG}
;;
d)
days=${OPTARG}
[ "$days" ] && [ -z "${days//[0-9]}" ] || { echo "Invalid number of days=$days"; exit 1; }
;;
h)
display_usage
exit 0
;;
:)
echo "Invalid option: ${OPTARG} requires an argument."
exit 1
;;
*)
echo "Unexpected option ${flag}"
exit 1
;;
esac
done
[[ -z "$url" ]] && echo "Missing longhorn URL" && exit 1
# check if URL is valid
curl -s --connect-timeout 30 ${url}/v1 >> /dev/null || { echo "Unable to connect to longhorn backend"; exit 1; }
echo "Deleting snapshots older than $days days"
# Fetch list of longhorn volumes
vols=$( (curl -s -X GET ${url}/v1/volumes |jq -r '.data[].name') )
#delete given snapshot for given volume
function delete_snapshot() {
local vol=$1
local snap=$2
[[ -z "$vol" || -z "$snap" ]] && echo "Error: delete_snapshot: Empty argument" && return 1
curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotDelete -d '{"name": "'$snap'"}' >> /dev/null
echo "Snapshot=$snap deleted for volume=$vol"
}
#perform cleanup for given volume
function cleanup_volume() {
local vol=$1
local deleted_snap=0
[[ -z "$vol" ]] && echo "Error: cleanup_volume: Empty argument" && return 1
# fetch list of snapshot
snaps=$( (curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotList | jq -r '.data[] | select(.usercreated==true) | .name' ) )
for i in ${snaps[@]}; do
echo $i
if [[ $i == "volume-head" ]]; then
continue
fi
# calculate date difference for snapshot
snapTime=$(curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotGet -d '{"name":"'$i'"}' |jq -r '.created')
currentTime=$(date "+%s")
timeDiff=$((($(date -d $snapTime "+%s") - $currentTime) / 86400))
if [[ $timeDiff -lt $days ]]; then
echo "Ignoring snapshot $i, since it is older than $timeDiff days"
continue
fi
#trigger deletion for snapshot
delete_snapshot $vol $i
deleted_snap=$((deleted_snap+1))
done
if [[ "$deleted_snap" -gt 0 ]]; then
#trigger purge for volume
curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotPurge >> /dev/null
fi
}
for i in ${vols[@]}; do
cleanup_volume $i
done
This alert indicates that the cumulative number of backup objects created in the system by Longhorn is increasing, which may lead to potential downtime. This is a warning.
This alert is triggered when the Longhorn backup count is greater than or equal to 150 and less than 200.
This alert indicates that the cumulative number of backup objects created in the system by Longhorn is increasing, which may lead to potential downtime. This is a critical alert.
This alert is triggered when the Longhorn backup count is greater than or equal to 200 and less than 240.
This alert indicates that the cumulative number of snapshot objects created in the system by Longhorn is increasing, which may lead to potential downtime. This is a warning.
This alert is triggered if the snapshot count is higher than or equal to 150 and less than 200.
This alert indicates that the cumulative number of snapshot objects created in the system by Longhorn is increasing, which may lead to potential downtime. This alert is critical.
This alert is triggered if the snapshot count is higher than or equal to 200 and less than 240.
- Alert severity key
- general.rules
- TargetDown
- Watchdog
- kubernetes-apps
- KubePodCrashLooping
- KubePodNotReady
- KubeDeploymentGenerationMismatch, KubeStatefulSetGenerationMismatch
- KubeDeploymentReplicasMismatch, KubeStatefulSetReplicasMismatch
- KubeStatefulSetUpdateNotRolledOut
- KubeDaemonSetRolloutStuck
- KubeContainerWaiting
- KubeDaemonSetNotScheduled, KubeDaemonSetMisScheduled
- KubeJobCompletion
- KubeJobFailed
- KubeHpaReplicasMismatch
- KubeHpaMaxedOut
- kubernetes-resources
- KubeCPUOvercommit, KubeMemoryOvercommit
- KubeCPUQuotaOvercommit, KubeMemoryQuotaOvercommit, KubeQuotaAlmostFull, KubeQuotaFullyUsed, KubeQuotaExceeded
- CPUThrottlingHigh
- Kubernetes-storage
- KubePersistentVolumeFillingUp
- KubePersistentVolumeErrors
- kube-state-metrics
- KubeStateMetricsListErrors, KubeStateMetricsWatchErrors
- kubernetes-system-apiserver
- KubeClientCertificateExpiration
- AggregatedAPIErrors, AggregatedAPIDown, KubeAPIDown, KubeAPITerminatedRequests
- KubernetesApiServerErrors
- kubernetes-system-kubelet
- KubeNodeNotReady, KubeNodeUnreachable, KubeNodeReadinessFlapping, KubeletPlegDurationHigh, KubeletPodStartUpLatencyHigh, KubeletDown
- KubeletTooManyPods
- KubeletClientCertificateExpiration, KubeletServerCertificateExpiration
- KubeletClientCertificateRenewalErrors, KubeletServerCertificateRenewalErrors
- kubernetes-system
- KubeVersionMismatch
- KubeClientErrors
- KubernetesMemoryPressure
- KubernetesDiskPressure
- Kube-apiserver-slos
- KubeAPIErrorBudgetBurn
- node-exporter
- NodeFilesystemSpaceFillingUp, NodeFilesystemAlmostOutOfSpace, NodeFilesystemFilesFillingUp
- NodeRAIDDegraded
- NodeRAIDDiskFailure
- NodeNetworkReceiveErrs, NodeNetworkTransmitErrs, NodeHighNumberConntrackEntriesUsed
- NodeClockSkewDetected, NodeClockNotSynchronising
- node-network
- NodeNetworkInterfaceFlapping
- InternodeCommunicationBroken
- uipath.prometheus.resource.provisioning.alerts
- PrometheusMemoryUsage, PrometheusStorageUsage
- alertmanager.rules
- AlertmanagerConfigInconsistent
- AlertmanagerFailedReload
- prometheus-operator
- PrometheusOperatorListErrors, PrometheusOperatorWatchErrors, PrometheusOperatorSyncFailed, PrometheusOperatorReconcileErrors, PrometheusOperatorNodeLookupErrors, PrometheusOperatorNotReady, PrometheusOperatorRejectedResources
- prometheus
- PrometheusBadConfig
- PrometheusErrorSendingAlertsToSomeAlertmanagers, PrometheusErrorSendingAlertsToAnyAlertmanager, PrometheusNotConnectedToAlertmanagers
- PrometheusNotificationQueueRunningFull, PrometheusTSDBReloadsFailing, PrometheusTSDBCompactionsFailing, PrometheusNotIngestingSamples, PrometheusDuplicateTimestamps, PrometheusOutOfOrderTimestamps, PrometheusRemoteStorageFailures, PrometheusRemoteWriteBehind, PrometheusRemoteWriteDesiredShards
- PrometheusRuleFailures
- PrometheusMissingRuleEvaluations
- PrometheusTargetLimitHit
- uipath.availability.alerts
- UiPathAvailabilityHighTrafficUserFacing, UiPathAvailabilityHighTrafficBackend, UiPathAvailabilityMediumTrafficUserFacing, UiPathAvailabilityMediumTrafficBackend, UiPathAvailabilityLowTrafficUserFacing, UiPathAvailabilityLowTrafficBackend
- uipath.cronjob.alerts.rules
- CronJobSuspended
- UiPath CronJob "kerberos-tgt-refresh" Failed
- IdentityKerberosTgtUpdateFailed
- Ceph Alerts
- CephClusterNearFull
- CephClusterCriticallyFull
- CephClusterReadOnly
- CephPoolQuotaBytesCriticallyExhausted
- CephClusterErrorState
- CephMonQuorumAtRisk
- CephOSDCriticallyFull
- uipath.requestrouting.alerts
- UiPathRequestRouting
- RabbitmqNodeDown
- MongoDB Alerts
- MongodbCertExpiration
- MongodbDown
- MongodbReplicationStatusUnreachable
- MongodbReplicationStatusNotKnown
- MongodbReplicationLag
- MongodbTooManyConnections
- MongodbHighLatency
- MongodbReplicationStatusSelfCheck
- MongodbReplicationStatusRollback
- MongodbReplicationStatusRemoved
- Server TLS Certificate Alerts
- SecretCertificateExpiry30Days
- SecretCertificateExpiry7Days
- Identity Token Signing Certificate Alerts
- IdentityCertificateExpiry30Days
- IdentityCertificateExpiry7Days
- etdc Alerts
- EtcdInsufficientMembers
- EtcdNoLeader
- EtcdHighNumberOfLeaderChanges
- EtcdHighNumberOfFailedGrpcRequests
- EtcdGrpcRequestsSlow
- EtcdHighNumberOfFailedHttpRequests
- EtcdHttpRequestsSlow
- EtcdMemberCommunicationSlow
- EtcdHighNumberOfFailedProposals
- EtcdHighFsyncDurations
- EtcdHighCommitDurations
- Disk Size Alerts
- LowDiskForRancherPartition
- LowDiskForKubeletPartition
- LowDiskForLonghornPartition
- LowDiskForVarPartition
- Backup Alerts
- NFSServerDisconnected
- VolumeBackupFailed
- BackupDisabled
- longhorn-snapshot-alert
- LonghornBackupObjectThresholdExceededWarn
- LonghornBackupObjectThresholdExceededCritical
- LonghornSnapshotObjectThresholdExceededWarn
- LonghornSnapshotObjectThresholdExceededCritical