- 概述
- 要求
- 安装
- 安装后
- 集群管理
- 监控和警示
- 迁移和升级
- 特定于产品的配置
- 最佳实践和维护
- 故障排除
- 无法获取沙盒映像
- Pod 未显示在 ArgoCD 用户界面中
- Redis 探测器失败
- RKE2 服务器无法启动
- 在 UiPath 命名空间中找不到密码
- ArgoCD 在首次安装后进入“进行中”状态
- 意外不一致;手动运行 fsck
- MongoDB Pod 处于 CrashLoopBackOff 状态或在删除后处于“等待 PVC 配置”状态
- MongoDB Pod 从 4.4.4-ent 升级到 5.0.7-ent 失败
- 集群还原或回滚后服务运行状况不佳
- Pod 在 Init:0/X 中卡住
- Prometheus 处于 CrashLoopBackoff 状态,并出现内存不足 (OOM) 错误
- 监控仪表板中缺少 Ceph-rook 指标
- Pod 无法在代理环境中与 FQDN 通信
- 使用 Automation Suite 诊断工具
- 使用 Automation Suite 支持捆绑包
- 探索日志
Automation Suite 安装指南
集群还原或回滚后服务运行状况不佳
在集群还原或回滚后,AI Center、Orchestrator、Platform、Document Understanding 或 Task Mining 可能运行不正常,RabbitMQ Pod 日志显示以下错误:
[root@server0 UiPathAutomationSuite]# k -n rabbitmq logs rabbitmq-server-0
2022-10-29 07:38:49.146614+00:00 [info] <0.9223.362> accepting AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672)
2022-10-29 07:38:49.147411+00:00 [info] <0.9223.362> Connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#77049094:2100
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> Error on AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:49.147922+00:00 [info] <0.9223.362> closing AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672 - rabbitConnectionFactory#77049094:2100)
2022-10-29 07:38:55.818447+00:00 [info] <0.9533.362> accepting AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672)
2022-10-29 07:38:55.821662+00:00 [info] <0.9533.362> Connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#2100d047:4057
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> Error on AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:55.822447+00:00 [info] <0.9533.362> closing AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672 - rabbitConnectionFactory#2100d047:4057)
[root@server0 UiPathAutomationSuite]# k -n rabbitmq logs rabbitmq-server-0
2022-10-29 07:38:49.146614+00:00 [info] <0.9223.362> accepting AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672)
2022-10-29 07:38:49.147411+00:00 [info] <0.9223.362> Connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#77049094:2100
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> Error on AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:49.147922+00:00 [info] <0.9223.362> closing AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672 - rabbitConnectionFactory#77049094:2100)
2022-10-29 07:38:55.818447+00:00 [info] <0.9533.362> accepting AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672)
2022-10-29 07:38:55.821662+00:00 [info] <0.9533.362> Connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#2100d047:4057
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> Error on AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:55.822447+00:00 [info] <0.9533.362> closing AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672 - rabbitConnectionFactory#2100d047:4057)
CrashLoopBackOff
状态。
如果所有 Pod 都在运行,请执行以下步骤:
-
删除 RabbitMQ 中的用户:
kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl list_users -s --formatter json | jq '.[]|.user' | grep -v default_user | xargs -I{} kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl delete_user {}
kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl list_users -s --formatter json | jq '.[]|.user' | grep -v default_user | xargs -I{} kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl delete_user {} -
删除 UiPath 命名空间中的 RabbitMQ 应用程序密码:
kubectl -n uipath get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n uipath delete secret {}
kubectl -n uipath get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n uipath delete secret {} -
删除 RabbitMQ 命名空间中的 RabbitMQ 应用程序密码:
kubectl -n rabbitmq get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n rabbitmq delete secret {}
kubectl -n rabbitmq get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n rabbitmq delete secret {} -
通过 ArgoCD 同步
sfcore
应用程序并等待同步完成: -
对 UiPath 命名空间中的所有应用程序执行首次重新启动:
kubectl -n uipath rollout restart deploy
kubectl -n uipath rollout restart deploy
如果某些 Pod 处于 CrashLoopBackOff 状态,请执行以下步骤:
-
确定哪些 RabbitMQ Pod 卡在
CrashLoopBackOff
状态,并检查 RabbitMQCrashLoopBackOff
Pod 日志:kubectl -n rabbitmq get pods kubectl -n rabbitmq logs <CrashLoopBackOff-Pod-Name>
kubectl -n rabbitmq get pods kubectl -n rabbitmq logs <CrashLoopBackOff-Pod-Name> -
检查前面命令的输出。如果问题与 Mnesia 表数据写入有关,您应该会看到类似于以下内容的错误消息:
Mnesia('rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq'): ** ERROR ** (could not write core file: eacces) ** FATAL ** Failed to merge schema: Bad cookie in table definition rabbit_user_permission: 'rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667351034020261908,-576460752303416575,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{4,0},{'rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq',{1667,351040,418694}}}}, 'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667372429216834387,-576460752303417087,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{2,0},[]}}
Mnesia('rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq'): ** ERROR ** (could not write core file: eacces) ** FATAL ** Failed to merge schema: Bad cookie in table definition rabbit_user_permission: 'rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667351034020261908,-576460752303416575,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{4,0},{'rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq',{1667,351040,418694}}}}, 'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667372429216834387,-576460752303417087,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{2,0},[]}} -
要解决此问题,请执行以下步骤:
-
查找 RabbitMQ 副本的数量:
rabbitmqReplicas=$(kubectl -n rabbitmq get rabbitmqcluster rabbitmq -o json | jq -r '.spec.replicas')
-
缩减 RabbitMQ 副本:
kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": 0}}" --type=merge
kubectl -n rabbitmq scale sts rabbitmq-server --replicas=0
-
等待所有 RabbitMQ Pod 终止:
kubectl -n rabbitmq get pod
-
检查待处理的卷快照:
kubectl get volumesnapshot -n rabbitmq <crashloopbackupoff_pod_pvc_name> | grep false
kubectl get volumesnapshot -n rabbitmq <crashloopbackupoff_pod_pvc_name> | grep false -
如果存在任何待处理的卷快照,请将其删除:
kubectl delete volumesnapshot -n rabbitmq <pending_volume_snapshot_name>
kubectl delete volumesnapshot -n rabbitmq <pending_volume_snapshot_name> -
查找并删除卡在
CrashLoopBackOff
状态的 RabbitMQ Pod 的 PVC:kubectl -n rabbitmq get pvc
kubectl -n rabbitmq delete pvc <crashloopbackupoff_pod_pvc_name>
-
扩展 RabbitMQ 副本:
kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": $rabbitmqReplicas}}" --type=merge
-
检查所有 RabbitMQ Pod 是否运行状况良好:
kubectl -n rabbitmq get pod
-