Automation Suite

2023.10

Automation Suite on Linux Installation Guide

Last updated Apr 19, 2024

Managing alerts

alertmanager.rules

AlertmanagerConfigInconsistent

These are internal Alertmanager errors for HA clusters with multiple Alertmanager replicas. The alerts may appear and disappear intermittently. Temporarily scaling the Alertmanager replicas down and then back up may fix the issue.

To fix the issue, take the following steps:

Scale to zero. Note that it takes a moment for the pods to shut down:
```
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0
```

Scale back up to 2:
```
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2
```

Check that the Alertmanager pods have started and are in the Running state:
```
kubectl get po -n cattle-monitoring-system
```

If the issue persists, contact UiPath® Support.
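
The scale-down and scale-up steps above can be combined into a single helper. The following is a minimal sketch, not an official UiPath® script. The function name `restart_alertmanager`, the polling limits, and the 300-second rollout timeout are assumptions chosen for illustration. The namespace and StatefulSet name are the ones used in the commands above.

```shell
#!/bin/sh
# Sketch: restart Alertmanager by scaling its StatefulSet to 0 and back to 2.
NS="cattle-monitoring-system"
STS="alertmanager-rancher-monitoring-alertmanager"

restart_alertmanager() {
  attempts="${1:-60}"   # assumed maximum number of polls
  interval="${2:-5}"    # assumed seconds between polls

  kubectl scale statefulset -n "$NS" "$STS" --replicas=0 || return 1

  # Pods take a while to shut down, so poll until the StatefulSet reports 0 replicas.
  i=0
  replicas=""
  while [ "$i" -lt "$attempts" ]; do
    replicas="$(kubectl get statefulset -n "$NS" "$STS" -o jsonpath='{.status.replicas}')"
    [ "$replicas" = "0" ] && break
    sleep "$interval"
    i=$((i + 1))
  done
  if [ "$replicas" != "0" ]; then
    echo "timed out waiting for $STS to scale down" >&2
    return 1
  fi

  kubectl scale statefulset -n "$NS" "$STS" --replicas=2 || return 1
  # Wait until the replicas are ready, or give up after 300 seconds (assumed timeout).
  kubectl rollout status statefulset -n "$NS" "$STS" --timeout=300s
}
```

Calling `restart_alertmanager` with no arguments polls for up to about five minutes before giving up. The final pod check with `kubectl get po` is still worth running by hand afterwards.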

AlertmanagerFailedReload

AlertManager has failed to load or reload configuration. Please check any custom AlertManager configurations for input errors and otherwise contact UiPath® Support.

AlertmanagerMembersInconsistent

To fix the issue, take the following steps:

Scale to zero. Note that it takes a moment for the pods to shut down:
```
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=0
```

Scale back up to 2:
```
kubectl scale statefulset -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager --replicas=2
```

Check that the Alertmanager pods have started and are in the Running state:
```
kubectl get po -n cattle-monitoring-system
```

If the issue persists, contact UiPath® Support.

general.rules

TargetDown

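Prometheus cannot collect metrics from the target named in the alert, which means Grafana dashboards and further alerts based on that target's metrics are unavailable. Check other alerts related to that target.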

Watchdog

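This alert exists to verify that the entire alerting pipeline is functional. It is always firing, so it should always be firing in Alertmanager and against a receiver. Various notification mechanisms offer integrations that notify you when this alert is not firing, for example the DeadMansSnitch integration in PagerDuty.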

prometheus-operator

PrometheusOperatorListErrors, PrometheusOperatorWatchErrors, PrometheusOperatorSyncFailed, PrometheusOperatorReconcileErrors, PrometheusOperatorNodeLookupErrors, PrometheusOperatorNotReady, PrometheusOperatorRejectedResources

Internal errors of the Prometheus operator, which controls Prometheus resources. Prometheus itself may still be healthy while these errors are present; however, they indicate degraded monitoring configurability. Contact UiPath® Support.

Prometheus

PrometheusBadConfig

Prometheus has failed to load or reload configuration. Please check any custom Prometheus configurations for input errors. Otherwise contact UiPath® Support.

PrometheusErrorSendingAlertsToSomeAlertmanagers, PrometheusErrorSendingAlertsToAnyAlertmanager, PrometheusNotConnectedToAlertmanagers

The connection from Prometheus to AlertManager is not healthy. Metrics may still be queryable, and Grafana dashboards may still show them, but alerts will not fire. Check any custom configuration of AlertManager for input errors and otherwise contact UiPath® Support.

PrometheusNotificationQueueRunningFull, PrometheusTSDBReloadsFailing, PrometheusTSDBCompactionsFailing, PrometheusNotIngestingSamples, PrometheusDuplicateTimestamps, PrometheusOutOfOrderTimestamps, PrometheusRemoteStorageFailures, PrometheusRemoteWriteBehind, PrometheusRemoteWriteDesiredShards

Internal Prometheus errors indicating metrics may not be collected as expected. Please contact UiPath® Support.

PrometheusRuleFailures

This may happen if there are malformed alerts based on non-existent metrics or incorrect PromQL syntax. Contact UiPath® Support if no custom alerts have been added.

PrometheusMissingRuleEvaluations

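Prometheus cannot evaluate whether alerts should be firing. This may happen if there are too many alerts. Remove expensive custom alert evaluations and/or see the documentation on increasing the CPU limit for Prometheus. Contact UiPath® Support if no custom alerts have been added.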

PrometheusTargetLimitHit

There are too many targets for Prometheus to collect. If extra ServiceMonitors have been added (see Monitoring Console), you can remove them.

uipath.prometheus.resource.provisioning.alerts

PrometheusMemoryUsage, PrometheusStorageUsage

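These alerts warn when the cluster is approaching the configured limits for memory and storage. This is likely to happen on clusters with a recent substantial increase in usage (usually from robots rather than users), or when nodes are added to the cluster without adjusting Prometheus resources. The cause is an increase in the number of metrics being collected.

The rate of increase in storage utilization is visible on the Kubernetes / Persistent Volumes dashboard.

You can adjust it by resizing the PVC as instructed here: Configuring the cluster.

The rate of increase in memory utilization is visible on the Kubernetes / Compute Resources / Pod dashboard.

You can adjust it by editing the Prometheus memory resource limits in the rancher-monitoring app in ArgoCD. The rancher-monitoring app automatically re-syncs after you click Save.

Note that Prometheus takes some time to restart and begin showing metrics in Grafana again. This typically takes less than 10 minutes, even on large clusters.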

uipath.availability.alerts

UiPathAvailabilityHighTrafficUserFacing

The number of HTTP 500 responses from UiPath® services exceeds a given threshold.

Traffic level	Number of requests in 20 minutes	Error threshold (for HTTP 500s)
High	>100,000	0.1%
Medium	Between 10,000 and 100,000	1%
Low	< 10,000	5%
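
To make the table concrete, the sketch below maps a 20-minute request count to its traffic level and error threshold, then checks an observed error count against that threshold. It is illustrative only and is not the actual alert rule. The function names `traffic_threshold` and `exceeds_threshold` are hypothetical, and treating exactly 10,000 and exactly 100,000 requests as medium traffic is an assumption about the boundaries.

```shell
# Print the traffic level and the HTTP 500 error threshold (in percent)
# for a given number of requests observed over 20 minutes.
traffic_threshold() {
  requests="$1"
  if [ "$requests" -gt 100000 ]; then
    echo "high 0.1"
  elif [ "$requests" -ge 10000 ]; then
    echo "medium 1"      # assumption: both boundaries count as medium
  else
    echo "low 5"
  fi
}

# Succeed (exit 0) if the error rate is above the threshold for the traffic level.
# Arguments: total requests, number of HTTP 500 responses.
exceeds_threshold() {
  requests="$1"
  errors="$2"
  threshold="$(traffic_threshold "$requests" | cut -d' ' -f2)"
  # awk does the fractional comparison that POSIX shell arithmetic cannot.
  awk -v e="$errors" -v r="$requests" -v t="$threshold" \
    'BEGIN { exit !(r > 0 && (e / r) * 100 > t) }'
}
```

For example, `exceeds_threshold 200000 300` succeeds because 300 errors in 200,000 requests is 0.15%, above the 0.1% threshold for high traffic.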

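Errors in user-facing services are likely to cause degraded functionality that is directly observable in the Automation Suite UI, while errors in backend services have less obvious consequences.

The alert indicates which service has a high error rate. To understand what cascading issues there may be from other services that the reporting service depends on, you can use the Istio Workload dashboard, which shows errors between services.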

Please double check any recently reconfigured Automation Suite products. Detailed logs are also available with the kubectl logs command. If the error persists, please contact UiPath® Support.

backup

NFSServerDisconnected

This alert indicates that the NFS server connection has been lost.

You need to check the NFS server connection and the mount path.
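
One way to check the mount path on a node is to look it up in the kernel's mount table. The helper below is a minimal sketch under stated assumptions: the function name `is_nfs_mounted` and the example path `/mnt/backup` are hypothetical, and the actual mount path depends on how backup was configured for your cluster.

```shell
# Succeed (exit 0) if the given path is listed as an NFS mount.
# The optional second argument is the mounts table to read; it defaults
# to /proc/mounts, the live mount table on Linux.
is_nfs_mounted() {
  mount_path="$1"
  mounts_file="${2:-/proc/mounts}"
  # Fields in /proc/mounts: device, mount point, filesystem type, options, ...
  awk -v p="$mount_path" \
    '$2 == p && ($3 == "nfs" || $3 == "nfs4") { found = 1 } END { exit !found }' \
    "$mounts_file"
}

# Hypothetical usage:
# is_nfs_mounted /mnt/backup && echo "mounted" || echo "not mounted"
```

A path that is listed in the mount table can still be unreachable if the server is down, so this check complements a connectivity check to the NFS server and does not replace it.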

VolumeBackupFailed

This alert indicates that the backup has failed for a PVC.

BackupDisabled

This alert indicates that the backup has been disabled.

You need to check whether the cluster is unhealthy.

cronjob-alerts

CronJobSuspended

The uipath-infra/istio-configure-script-cronjob cronjob is in a suspended state.

To fix the issue, enable the cronjob by taking the following steps:
```
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
kubectl -n uipath-infra patch cronjob istio-configure-script-cronjob -p '{"spec":{"suspend":false}}'
epoch=$(date +"%s")
kubectl -n uipath-infra create job istio-configure-script-cronjob-manual-$epoch --from=cronjob/istio-configure-script-cronjob
kubectl -n uipath-infra wait --for=condition=complete --timeout=300s job/istio-configure-script-cronjob-manual-$epoch
kubectl get node -o wide
# Verify that all the IPs listed by the command above appear in the output of the command below
kubectl -n istio-system get svc istio-ingressgateway -o json | jq '.spec.externalIPs'
```
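
The final verification, comparing the node IPs against the service's `externalIPs`, can be done by eye or scripted. The following is a minimal sketch, assuming that the node addresses to compare are the `InternalIP` values; the source does not say which column of `kubectl get node -o wide` is meant. The function name `missing_ips` is hypothetical.

```shell
# Print every address from the first list that is absent from the second.
# Both arguments are whitespace-separated lists of IP addresses.
missing_ips() {
  node_ips="$1"
  # Normalize whitespace so each address is surrounded by single spaces.
  external_ips=" $(printf '%s ' $2)"
  for ip in $node_ips; do
    case "$external_ips" in
      *" $ip "*) ;;        # present in externalIPs
      *) echo "$ip" ;;     # missing: report it
    esac
  done
}

# Possible wiring against the cluster (assumes InternalIP is the address to compare):
# node_ips="$(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')"
# external_ips="$(kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.spec.externalIPs[*]}')"
# missing_ips "$node_ips" "$external_ips"
```

Empty output means every node address was found in `externalIPs`. Any printed address is one to investigate.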

IdentityKerberosTgtUpdateFailed

This job updates the latest Kerberos ticket to all the UiPath® services. Failures in this job would cause SQL server authentication to fail. Please contact UiPath® Support.
