
Alert runbook

This page walks you through a series of alerts and provides context and remediation steps.

📘 Note:

For general instructions on using the available tools for alerts, metrics, and visualizations, see Using the monitoring stack.
For more information on how to fix issues and how to create a support bundle for UiPath Support engineers, see Troubleshooting.
When contacting UiPath Support, please include any alerts that are currently firing.

Alert severity key


Info

Unexpected but harmless. Can be silenced but may be useful during diagnostics.

Warning

Indicates a targeted degradation of functionality, or a likelihood of degradation in the near future, which may affect the entire cluster. Prompt action (usually within days) is suggested to keep the cluster healthy.

Critical

Known to cause serious degradation of functionality that is often widespread across the cluster. Requires immediate action (same day) to repair the cluster.

 

general.rules


TargetDown

Prometheus is not able to collect metrics from the target in the alert, which means Grafana dashboards and further alerts based on metrics from that target are not available. Check other alerts pertaining to that target.

Watchdog

This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing; therefore, it should always be firing in AlertManager against a receiver. There are integrations with various notification mechanisms that notify you when this alert is not firing, for example the DeadMansSnitch integration in PagerDuty.

 

kubernetes-apps


KubePodCrashLooping

A pod keeps restarting unexpectedly. This can happen due to an out-of-memory (OOM) error, in which case the limits can be adjusted. Check the pod events with kubectl describe and the logs with kubectl logs to see details on possible crashes. If the issue persists, contact UiPath Support.
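For reference, a minimal sketch of the inspection commands mentioned above; <namespace> and <pod-name> are placeholders you must replace:

# Show recent events and container state for the crashing pod
kubectl -n <namespace> describe pod <pod-name>
# Show logs from the previous (crashed) container instance
kubectl -n <namespace> logs <pod-name> --previous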

KubePodNotReady

A pod has started, but it is not responding to the health probe with success. This may mean that it is stuck and is not able to serve traffic. You can check pod logs with kubectl logs to see if there is any indication of progress. If the issue persists, contact UiPath Support.

KubeDeploymentGenerationMismatch, KubeStatefulSetGenerationMismatch

An update to a deployment or statefulset has been attempted, but it has failed, and a rollback has not yet occurred. Contact UiPath Support.

KubeDeploymentReplicasMismatch, KubeStatefulSetReplicasMismatch

In high-availability clusters with multiple replicas, this alert fires when the number of replicas is not optimal. This may happen when there are not enough resources in the cluster to schedule them. Check resource utilization and add capacity as needed. Otherwise, contact UiPath Support.

KubeStatefulSetUpdateNotRolledOut

A statefulset update has failed. Contact UiPath Support.
See also: StatefulSets.

KubeDaemonSetRolloutStuck

A daemonset rollout has failed. Contact UiPath Support.
See also: DaemonSet.

KubeContainerWaiting

A container is stuck in the waiting state. It has been scheduled to a worker node, but it cannot run on that machine. Check kubectl describe of the pod for more information. The most common cause of waiting containers is a failure to pull the image. For air-gapped clusters, this could mean that the local registry is not available. If the issue persists, contact UiPath Support.

KubeDaemonSetNotScheduled, KubeDaemonSetMisScheduled

This may indicate an issue with one of the nodes. Check the health of each node and remediate any known issues. Otherwise, contact UiPath Support.

KubeJobCompletion

A job is taking more than 12 hours to complete. This is not expected. Contact UiPath Support.

KubeJobFailed

A job has failed; however, most jobs are retried automatically. If the issue persists, contact UiPath Support.

KubeHpaReplicasMismatch

The autoscaler cannot scale the targeted resource as configured. If the desired count is higher than the actual count, there may be a lack of resources. If the desired count is lower than the actual count, pods may be stuck while shutting down. If the issue persists, contact UiPath Support.
See also: Horizontal Pod Autoscaling

KubeHpaMaxedOut

The number of replicas for a given service has reached its maximum. This happens when the number of requests being made to the cluster is very high. If high traffic is expected and temporary, you may silence this alert. However, this alert is a sign that the cluster is at capacity and cannot handle much more traffic. If more resource capacity is available on the cluster, you can increase the maximum number of replicas for the service by following these instructions:

# Find the horizontal autoscaler that controls the replicas of the desired resource
kubectl get hpa -A
# Increase the number of max replicas of the desired resource, replacing <namespace> <resource> and <maxReplicas>
kubectl -n <namespace> patch hpa <resource> --patch '{"spec":{"maxReplicas":<maxReplicas>}}'

See also: Horizontal Pod Autoscaling.

 

kubernetes-resources


KubeCPUOvercommit, KubeMemoryOvercommit

These warnings indicate that the cluster cannot tolerate node failure. For single-node evaluation clusters this is expected, and these alerts may be silenced. For multi-node HA-ready production setups, these alerts fire when too many nodes become unhealthy to support high availability, and they indicate that the nodes should be brought back to health or replaced.

KubeCPUQuotaOvercommit, KubeMemoryQuotaOvercommit, KubeQuotaAlmostFull, KubeQuotaFullyUsed, KubeQuotaExceeded

These alerts pertain to namespace resource quotas, which exist in the cluster only if added through customization. Namespace resource quotas are not added as part of an Automation Suite installation.
See also: Resource Quotas.

CPUThrottlingHigh

A container's CPU utilization has been throttled according to the configured limits. This is part of normal Kubernetes operation and may provide useful information when other alerts are firing. You may silence this alert.

 

kubernetes-storage


KubePersistentVolumeFillingUp

When Warning: The available space is less than 30% and is likely to fill up within four days.
When Critical: The available space is less than 10%.

For any services that run out of space, data may be difficult to recover, so volumes should be resized before hitting 0% available space. See the following instructions: Configuring the cluster.

For Prometheus-specific alerts, see PrometheusStorageUsage for more details and instructions.
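As an illustration only (the authoritative procedure is in Configuring the cluster), a PVC can be inspected and, if the StorageClass allows volume expansion, grown with commands along these lines; all bracketed values are placeholders:

# List PVCs and their current sizes to identify the volume that is filling up
kubectl get pvc -A
# Request a larger size for the PVC; this succeeds only if the StorageClass permits volume expansion
kubectl -n <namespace> patch pvc <pvc-name> -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'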

KubePersistentVolumeErrors

The persistent volume cannot be provisioned. This means that any service requiring the volume would not start. Check for other errors with Longhorn and/or Ceph storage and contact UiPath Support.

 

kube-state-metrics


KubeStateMetricsListErrors, KubeStateMetricsWatchErrors

The Kube State Metrics collector is not able to collect metrics from the cluster without errors. This means that important alerts may not fire. Contact UiPath Support.
See also: Kube state metrics at release.

 

kubernetes-system-apiserver


KubeClientCertificateExpiration

Warning: The client certificate used to authenticate to the Kubernetes API server expires in less than seven days.
Critical: The client certificate used to authenticate to the Kubernetes API server expires in less than one day.
You must renew the certificate.
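One way to confirm how soon the certificate expires is to decode the client certificate embedded in the RKE2 kubeconfig; this is a hedged example that assumes the default kubeconfig path used elsewhere in this guide:

# Print the expiry date of the client certificate from the RKE2 kubeconfig
grep 'client-certificate-data' /etc/rancher/rke2/rke2.yaml | awk '{print $2}' | base64 -d | openssl x509 -noout -enddate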

AggregatedAPIErrors, AggregatedAPIDown, KubeAPIDown, KubeAPITerminatedRequests

These alerts indicate a problem with the Kubernetes control plane. Check the health of the master nodes, resolve any outstanding issues, and contact UiPath Support if the issues persist.

See also:
The Kubernetes API
Kubernetes API Aggregation Layer.

KubernetesApiServerErrors

This alert indicates that the Kubernetes API server is experiencing a high error rate. This issue could lead to other failures, so it is recommended that you investigate the problem proactively.
Check logs for the api-server pod to find out the root cause of the issue using the kubectl logs <pod-name> -n kube-system command.
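A minimal sketch of locating the API server pods before pulling their logs; the component=kube-apiserver label is the conventional one for static control-plane pods, though it may vary by distribution:

# Locate the kube-apiserver pod(s), then inspect their logs for the source of the errors
kubectl -n kube-system get pods -l component=kube-apiserver
kubectl -n kube-system logs <api-server-pod-name>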

 

kubernetes-system-kubelet


KubeNodeNotReady, KubeNodeUnreachable, KubeNodeReadinessFlapping, KubeletPlegDurationHigh, KubeletPodStartUpLatencyHigh, KubeletDown

These alerts indicate a problem with a node. In multi-node HA-ready production clusters, pods would likely be rescheduled onto other nodes. If the issue persists, you should remove and drain the node to maintain the health of the cluster. In clusters without extra capacity, another node should be joined to the cluster first.

KubeletTooManyPods

There are too many pods running on the specified node. Join another node to the cluster.
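To confirm which node is at its pod limit, you can count running pods per node; this is an illustrative check, not an official procedure:

# Count running pods per node
kubectl get pods -A --field-selector=status.phase=Running -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c | sort -rn
# Compare the count against the node's pod capacity
kubectl describe node <node-name> | grep -i pods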

KubeletClientCertificateExpiration, KubeletServerCertificateExpiration

Warning: Kubelet's client or server certificate expires in less than seven days.
Critical: Kubelet's client or server certificate expires in less than one day.
You must renew the certificate.

KubeletClientCertificateRenewalErrors, KubeletServerCertificateRenewalErrors

Kubelet has failed to renew its client or server certificate. Contact UiPath Support.

 

kubernetes-system


KubeVersionMismatch

Running Kubernetes components have different semantic versions. This may happen as a result of an unsuccessful Kubernetes upgrade.

KubeClientErrors

A Kubernetes API server client is experiencing more than 1% errors. There may be an issue with the node where that client is running, or with the Kubernetes API server itself.

KubernetesMemoryPressure

This alert indicates that memory usage is very high on the Kubernetes node.
If this alert fires, identify which pods are consuming the most memory.
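A quick way to do this, assuming the metrics API exposed by the monitoring stack is available:

# List pods across all namespaces sorted by memory usage
kubectl top pods -A --sort-by=memory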

KubernetesDiskPressure

This alert indicates that disk usage is very high on the Kubernetes node.
If this alert fires, identify which pods are consuming the most disk space.
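A quick way to confirm the condition; <node-name> is a placeholder for the node reported by the alert, and df must be run on that node:

# Confirm the DiskPressure condition on the node, then check filesystem usage on that machine
kubectl describe node <node-name> | grep -i pressure
df -h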

 

kube-apiserver-slos


KubeAPIErrorBudgetBurn

The Kubernetes API server is burning too much error budget.

 

node-exporter


NodeFilesystemSpaceFillingUp, NodeFilesystemAlmostOutOfSpace, NodeFilesystemFilesFillingUp, NodeFilesystemAlmostOutOfFiles

The filesystem on a particular node is filling up. Provision more space by adding a disk or mounting unused disks.

NodeRAIDDegraded

The RAID array is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

NodeRAIDDiskFailure

The RAID array needs attention, and possibly a disk swap.

NodeNetworkReceiveErrs, NodeNetworkTransmitErrs, NodeHighNumberConntrackEntriesUsed

There is an issue with the physical network interface on the node. If the issues persist, it may need to be replaced.

NodeClockSkewDetected, NodeClockNotSynchronising

There is an issue with the clock on the node. Ensure that NTP is configured correctly.
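As a hedged example, on RHEL-based nodes where chrony is the default time service, you can check synchronization on the affected node like this:

# Check whether the system clock is synchronized and how far it is drifting
timedatectl status
chronyc tracking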

 

node-network


NodeNetworkInterfaceFlapping

There is an issue with the physical network interface on the node. If the issues persist, it may need to be replaced.

 

InternodeCommunicationBroken


The node has become unresponsive due to some issue causing broken communication between nodes in the cluster.
To fix this problem, restart the affected node. If the issue persists, reach out to UiPath Support with the Support Bundle Tool.

 

uipath.prometheus.resource.provisioning.alerts


PrometheusMemoryUsage, PrometheusStorageUsage

These alerts warn when the cluster is approaching the configured limits for memory and storage. This is likely to happen on clusters with a recent substantial increase in usage (usually from Robots rather than users), or when nodes are added to the cluster without adjusting Prometheus resources. This is due to an increase in the amount of metrics being collected.

The rate of increasing storage utilization can be seen on the Kubernetes / Persistent Volumes dashboard.


You can adjust it by resizing the PVC as instructed here: Configuring the cluster.

The rate of increased memory utilization can be seen on the Kubernetes / Compute Resources / Pod dashboard.


You can adjust it by editing the Prometheus memory resource limits in the rancher-monitoring app from ArgoCD. The rancher-monitoring app automatically re-syncs after clicking Save.


Note that Prometheus takes some time to restart and begin showing metrics in Grafana again. This usually takes less than 10 minutes, even for large clusters.

 

alertmanager.rules


AlertmanagerConfigInconsistent, AlertmanagerMembersInconsistent

These are internal AlertManager errors for HA clusters with multiple AlertManager replicas. Alerts may appear and disappear intermittently. Temporarily scaling down, then scaling up, the AlertManager replicas may fix the issue:

# First, scale to zero. This will take a moment for the pods to shut down.
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0
# Then scale back to two.
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2
# Check to see alertmanager pods have started and are now in the running state.
kubectl get po -n cattle-monitoring-system

If the issue persists, contact UiPath Support.

AlertmanagerFailedReload

AlertManager has failed to load or reload the configuration. Check any custom AlertManager configuration for input errors; otherwise, contact UiPath Support.

 

prometheus-operator


PrometheusOperatorListErrors, PrometheusOperatorWatchErrors, PrometheusOperatorSyncFailed, PrometheusOperatorReconcileErrors, PrometheusOperatorNodeLookupErrors, PrometheusOperatorNotReady, PrometheusOperatorRejectedResources

These are internal errors of the Prometheus operator, which controls Prometheus resources. Prometheus itself may still be healthy while these errors are present; however, they indicate that monitoring configurability is degraded. Contact UiPath Support.

 

prometheus


PrometheusBadConfig

Prometheus has failed to load or reload the configuration. Check any custom Prometheus configuration for input errors. Otherwise, contact UiPath Support.

PrometheusErrorSendingAlertsToSomeAlertmanagers, PrometheusErrorSendingAlertsToAnyAlertmanager, PrometheusNotConnectedToAlertmanagers

The connection from Prometheus to AlertManager is not healthy. Metrics may still be queried and Grafana dashboards may still show them, but alerts will not fire. Check any custom AlertManager configuration for input errors; otherwise, contact UiPath Support.

PrometheusNotificationQueueRunningFull, PrometheusTSDBReloadsFailing, PrometheusTSDBCompactionsFailing, PrometheusNotIngestingSamples, PrometheusDuplicateTimestamps, PrometheusOutOfOrderTimestamps, PrometheusRemoteStorageFailures, PrometheusRemoteWriteBehind, PrometheusRemoteWriteDesiredShards

These are internal Prometheus errors indicating that metrics may not be collected as expected. Contact UiPath Support.

PrometheusRuleFailures

This may happen if there are malformed alerts based on nonexistent metrics or incorrect PromQL syntax. Contact UiPath Support if no custom alerts have been added.

PrometheusMissingRuleEvaluations

Prometheus is not able to evaluate whether alerts should be firing. This may happen if there are too many alerts. Remove expensive custom alert evaluations and/or see the documentation on increasing the CPU limit for Prometheus. Contact UiPath Support if no custom alerts have been added.

PrometheusTargetLimitHit

There are too many targets for Prometheus to collect from. If extra ServiceMonitors have been added (see Monitoring console), you can remove them.
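A minimal sketch for finding and removing extra ServiceMonitors; the namespace and name are placeholders:

# List ServiceMonitors in all namespaces and delete any custom ones that are no longer needed
kubectl get servicemonitors -A
kubectl -n <namespace> delete servicemonitor <servicemonitor-name>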

 

uipath.availability.alerts


UiPathAvailabilityHighTrafficUserFacing, UiPathAvailabilityHighTrafficBackend, UiPathAvailabilityMediumTrafficUserFacing, UiPathAvailabilityMediumTrafficBackend, UiPathAvailabilityLowTrafficUserFacing, UiPathAvailabilityLowTrafficBackend

The number of HTTP 500 responses from UiPath services exceeds a given threshold.

Traffic level | Number of requests in 20 minutes | Error threshold (for HTTP 500s)
------------- | -------------------------------- | -------------------------------
High          | > 100,000                        | 0.1%
Medium        | Between 10,000 and 100,000       | 1%
Low           | < 10,000                         | 5%

Errors in user-facing services would likely result in degraded functionality that is directly observable in the Automation Suite UI, while errors in backend services would have less obvious consequences.

The alert indicates which service is experiencing a high error rate. To understand what cascading issues there may be in other services that the reporting service depends on, you can use the Istio Workload dashboard, which shows errors between services.

Please double check any recently reconfigured Automation Suite products. Detailed logs are also available with the kubectl logs command. If the error persists, please contact UiPath Support.
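As an illustration, assuming the UiPath services run in the uipath namespace, you can pull logs for the service named in the alert like this:

# Find the pods of the affected service and read their logs
kubectl -n uipath get pods
kubectl -n uipath logs <pod-name>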

 

uipath.cronjob.alerts.rules


CronJobSuspended

The uipath-infra/istio-configure-script-cronjob cronjob is in a suspended state.
To fix this issue, enable the cronjob by taking the following steps:

export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
kubectl -n uipath-infra patch cronjob istio-configure-script-cronjob -p '{"spec":{"suspend":false}}'
epoch=$(date +"%s")
kubectl -n uipath-infra create job istio-configure-script-cronjob-manual-$epoch --from=cronjob/istio-configure-script-cronjob
kubectl -n uipath-infra wait --for=condition=complete --timeout=300s job/istio-configure-script-cronjob-manual-$epoch
kubectl get node -o wide
# Verify that all the IPs listed by the above command are part of the output of the command below
kubectl -n istio-system get svc istio-ingressgateway -o json | jq '.spec.externalIPs'

UiPath CronJob "kerberos-tgt-refresh" Failed

This job gets the latest Kerberos ticket from the AD server for SQL integrated authentication. Failures in this job would cause SQL Server authentication to fail. Contact UiPath Support.

IdentityKerberosTgtUpdateFailed

This job updates the latest Kerberos ticket for all UiPath services. Failures in this job would cause SQL Server authentication to fail. Contact UiPath Support.

 

Ceph alerts


CephClusterNearFull, CephOSDNearFull, PersistentVolumeUsageNearFull

This alert indicates that the Ceph storage cluster utilization has crossed 75% and will become read-only at 85%.
If this alert fires, free up some space in Ceph by deleting some unused datasets in AI Center or Task Mining or expand the storage available for Ceph PVC by following the instructions in Resizing PVC.
Before resizing PVC, make sure you meet the storage requirements. For details, see Evaluating your storage needs.

CephClusterCriticallyFull

This alert indicates that Ceph storage cluster utilization has crossed 80% and will become read-only at 85%.
If this alert fires, free up some space in Ceph by deleting some unused datasets in AI Center or Task Mining or expand the storage available for Ceph PVC by following the instructions in Resizing PVC.
Before resizing PVC, make sure you meet the storage requirements. For details, see Evaluating your storage needs.

CephClusterReadOnly

This alert indicates that Ceph storage cluster utilization has crossed 85% and the cluster is now read-only. Free up some space or expand the storage cluster immediately.
If this alert fires, free up some space in Ceph by deleting some unused datasets in AI Center or Task Mining or expand the storage available for Ceph PVC by following the instructions in Resizing PVC.
Before resizing PVC, make sure you meet the storage requirements. For details, see Evaluating your storage needs.

CephPoolQuotaBytesCriticallyExhausted

This alert indicates that Ceph storage pool usage has crossed 90%.
If this alert fires, free up some space in Ceph by deleting some unused datasets in AI Center or Task Mining, or expand the storage available for the Ceph PVC by following the instructions in Resizing PVC.
Before resizing PVC, make sure you meet the storage requirements. For details, see Evaluating your storage needs.

CephClusterErrorState

This alert indicates that the Ceph storage cluster has been in error state for more than 10m.
This alert reflects that the rook-ceph-mgr job has been in error state for an unacceptable amount of time. Check for other alerts that might have triggered prior to this one and troubleshoot those first.

CephMonQuorumAtRisk

This alert indicates that storage cluster quorum is low.
Multiple mons work together to provide redundancy; this is possible because each keeps a copy of the metadata. The cluster is deployed with 3 mons, and requires 2 or more mons to be up and running for quorum and for the storage operations to run. If quorum is lost, access to data is at risk.
If this alert fires, check whether any OSDs are in a terminating state; if there are any, force delete those pods and wait some time for the operator to reconcile. If the issue persists, contact UiPath Support.

CephOSDCriticallyFull

When the alert severity is Critical, the available space is less than 20%.

For any services that run out of space, data may be difficult to recover, so you should resize volumes before hitting 10% available space. See the following instructions: Configuring the cluster.

 

uipath.requestrouting.alerts

UiPathRequestRouting

Errors in the request routing layer would result in degraded functionality that is directly observable in the Automation Suite UI. The requests will not be routed to backend services.

You can find detailed error logs of the request routing by running the kubectl logs command on the Istio ingress gateway pod. If the error persists, contact UiPath Support.
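A minimal sketch of that check, assuming the standard Istio labels and the istio-system namespace used elsewhere in this guide:

# Find the Istio ingress gateway pod and inspect its logs for routing errors
kubectl -n istio-system get pods -l app=istio-ingressgateway
kubectl -n istio-system logs <istio-ingressgateway-pod-name>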

 

RabbitmqNodeDown

This alert indicates that fewer than three nodes are running in the RabbitMQ cluster.
Check which RabbitMQ pod is down using the kubectl logs <pod-name> -n <namespace> command.
To fix the issue, delete the pod using the kubectl delete pod <pod-name> -n <namespace> command, and check again once the new pod comes up.
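An illustrative sequence, assuming RabbitMQ runs in the rabbitmq namespace; the pod name is a placeholder:

# Identify the failed RabbitMQ pod, capture its logs, recreate it, and watch the replacement come up
kubectl -n rabbitmq get pods
kubectl -n rabbitmq logs <pod-name>
kubectl -n rabbitmq delete pod <pod-name>
kubectl -n rabbitmq get pods -w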

 

MongoDB alerts


MongodbCertExpiration

This alert is fired if the MongoDB TLS certificate does not automatically rotate in the 19-day timeframe. The severity of this alert is critical.

To rotate the certificate, follow the instructions in MongoDB certificate renewal.

MongodbDown

This alert triggers when MongoDB is down. The severity of this alert is critical.

If this alert is fired, take the following steps:

  • Check the logs using the following command: kubectl logs <pod-name> -n mongodb
  • Use the Diagnostics Tool;
  • Contact UiPath Support.

MongodbReplicationStatusUnreachable

The MongoDB replication set member, as seen from another member of the set, is unreachable. If the alert is fired, then most probably the node is down. The severity of this alert is critical.

If this alert is fired, take the following steps:

  • Check if the node is down;
  • If the node is down, restart it and find the root cause;
  • If the issue persists, contact UiPath Support.

MongodbReplicationStatusNotKnown

The status of the MongoDB replication set member, as seen from another member of the set, is not yet known. If this alert is fired, one or more replicas are not in a running state. The severity of this alert is critical.

If this alert is fired, take the following steps:

  • Check the logs by running the following command: kubectl logs <pod-name> -n mongodb
  • To see the details on the replica status, run the following command for describing the pod: kubectl describe <pod-name> -n mongodb
  • If the issue persists, contact UiPath Support.

MongodbReplicationLag

This alert indicates that MongoDB replication lag is more than 10 seconds. The severity of this alert is critical.

If this alert is fired, take the following steps:

  • Check the logs by running the following command: kubectl logs <pod-name> -n mongodb
  • To see details on the replica status, run the following command for describing the pod: kubectl describe <pod-name> -n mongodb
  • If the issue persists, contact UiPath Support.

MongodbTooManyConnections

This alert indicates that the number of connections has reached its maximum. If this is expected and temporary, you may silence the alert. However, the alert is a sign that the MongoDB connection count is at its limit and cannot handle more. This alert is a warning.

If this alert is fired, take the following steps:

  • To query the number of connections on the node, run the following command (see the sketch after this list): db.serverStatus().connections
    • current indicates existing connections;
    • available indicates the number of available connections;
  • If the issue persists, contact UiPath Support.
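A hedged example of running that query from inside a MongoDB pod; the pod name is a placeholder, authentication flags may be required depending on the deployment, and on newer images the shell binary may be mongosh instead of mongo:

# Query current and available connection counts on the node
kubectl -n mongodb exec -it <mongodb-pod-name> -- mongo --eval 'db.serverStatus().connections'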

MongodbHighLatency

This alert indicates a high latency in the instance. This may mean that the traffic has increased on a node. It may be due to a replica not being healthy or to traffic being overloaded on a replica. If this is expected and temporary, you may silence this alert. However, this alert is a sign that the instance is at its limit and cannot handle more. The severity of this alert is critical.

If this alert is fired, take the following steps:

  • Check the logs and health of the instances;
  • If the issue persists, contact UiPath Support.

MongodbReplicationStatusSelfCheck

A MongoDB replication set member is either performing startup self-checks or transitioning from completing a rollback or resync. The severity of this alert is critical.

If this alert is fired, take the following steps:

  • Check the status of replica by running the following command: rs.status()
  • Check the logs using kubectl logs <pod-name> -n mongodb
  • If the issue persists, contact UiPath Support.

MongodbReplicationStatusRollback

A MongoDB replication set member is actively performing a rollback. Data is not available for reads. The severity of this alert is critical.

If this alert is fired, take the following steps:

  • Check the status of the replica by running the following command: rs.status()
  • Check the logs by running the following command: kubectl logs <pod-name> -n mongodb
  • If the issue persists, contact UiPath Support.

MongodbReplicationStatusRemoved

A MongoDB replication set member was once part of a replica set but was subsequently removed. The severity of this alert is critical.

If this alert is fired, take the following steps:

  • Check the status of the replica by running the following command: rs.status()
  • Check the logs by running the following command: kubectl logs <pod-name> -n mongodb
  • If the issue persists, contact UiPath Support.

 

Server TLS certificate alerts


SecretCertificateExpiry30Days

This alert indicates that the server TLS certificate will expire in the following 30 days.
To fix this issue, update the server TLS certificate. For instructions, see Managing server certificates.

SecretCertificateExpiry7Days

This alert indicates that the server TLS certificate will expire in the following 7 days.
To fix this issue, update the TLS certificate. For instructions, see Managing server certificates.

 

Identity token signing certificate alerts


IdentityCertificateExpiry30Days

This alert indicates that the Identity token signing certificate will expire in the following 30 days.
To fix this issue, update the Identity token signing certificate. For instructions, see Managing server certificates.

IdentityCertificateExpiry7Days

This alert indicates that the Identity token signing certificate will expire in the following 7 days.
To fix this issue, update the Identity token signing certificate. For instructions, see Managing server certificates.

 

etcd alerts


EtcdInsufficientMembers

This alert indicates that the etcd cluster has an insufficient number of members. Note that the cluster must have an odd number of members. The severity of this alert is critical.
Make sure that there is an odd number of server nodes in the cluster, and all of them are up and healthy.

EtcdNoLeader

This alert shows that the etcd cluster has no leader. The severity of this alert is critical.

EtcdHighNumberOfLeaderChanges

This alert indicates that the etcd leader changes more than twice in 10 minutes. This is a warning.

EtcdHighNumberOfFailedGrpcRequests

This alert indicates that a certain percentage of gRPC request failures was detected in etcd.

EtcdGrpcRequestsSlow

This alert indicates that etcd gRPC requests are slow. This is a warning.

EtcdHighNumberOfFailedHttpRequests

This alert indicates that a certain percentage of HTTP failures was detected in etcd.

EtcdHttpRequestsSlow

This alert indicates that HTTP requests are slowing down. This is a warning.

EtcdMemberCommunicationSlow

This alert indicates that etcd member communication is slowing down. This is a warning.

EtcdHighNumberOfFailedProposals

This alert indicates that the etcd server received more than 5 failed proposals in the last hour. This is a warning.

EtcdHighFsyncDurations

This alert indicates that etcd WAL fsync duration is increasing. This is a warning.

EtcdHighCommitDurations

This alert indicates that etcd commit duration is increasing. This is a warning.

 

Disk size alerts


LowDiskForRancherPartition

This alert indicates that the free space for the /var/lib/rancher partition is less than:

  • 35% – the severity of the alert is warning
  • 25% – the severity of the alert is critical

If this alert fires, increase the size of the disk.
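Before resizing, you can confirm the current usage of the partition named in the alert; the same check applies to the other partition alerts below, with the path adjusted accordingly:

# Check free space on the partition reported by the alert
df -h /var/lib/rancher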

LowDiskForKubeletPartition

This alert indicates that the free space for the /var/lib/kubelet partition is less than:

  • 35% – the severity of the alert is warning
  • 25% – the severity of the alert is critical

If this alert fires, increase the size of the disk.

LowDiskForLonghornPartition

This alert indicates that the free space for the Longhorn disk is less than:

  • 35% – the severity of the alert is warning
  • 25% – the severity of the alert is critical

If this alert fires, increase the size of the disk.

LowDiskForVarPartition

This alert indicates that the free space for the /var partition is less than:

  • 35% – the severity of the alert is warning
  • 25% – the severity of the alert is critical

If this alert fires, increase the size of the disk.

 

Backup alerts


NFSServerDisconnected

This alert indicates that the NFS server connection is lost.
You need to check the NFS server connection and mount path.
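A minimal sketch of that check from an affected node; the server address is a placeholder, and showmount requires the nfs-utils package:

# Verify that the NFS export is reachable and that the backup share is still mounted
showmount -e <nfs-server-address>
mount | grep nfs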

VolumeBackupFailed

This alert indicates that the backup failed for a PVC.

BackupDisabled

This alert indicates that the backup is disabled.
You need to check if the cluster is unhealthy.
