
UiPath Automation Suite

UiPath Automation Suite guide

Troubleshooting

This page explains how to troubleshoot issues you may encounter when setting up Automation Suite.

Troubleshooting how-tos


Automation Suite generates logs that you can explore whenever you need to troubleshoot installation errors. You can find all the details about issues occurring during installation in a log file saved in the same directory as the install-uipath.sh script. Each installer run generates a new log file, which follows the install-$(date +'%Y-%m-%dT%H_%M_%S').log naming convention and which you can review whenever you run into installation issues.
If you want to troubleshoot post-installation errors, use the Support Bundle tool.
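For example, to quickly locate and scan the most recent installation log, you can use the following minimal sketch, assuming you are in the directory that contains install-uipath.sh:

LATEST_LOG=$(ls -t install-*.log | head -n 1)   # newest installer log
echo "Inspecting ${LATEST_LOG}"
tail -n 100 "${LATEST_LOG}"        # last lines of the run
grep -i "error" "${LATEST_LOG}"    # quick scan for error messages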

 

How to troubleshoot services during installation


Take the following steps on one of the cluster server nodes:

  1. Gain access to Kubernetes.
on server nodes:
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

on agent nodes:
export KUBECONFIG="/var/lib/rancher/rke2/agent/kubelet.kubeconfig"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

# To validate, execute the following command which should not return an error:
kubectl get nodes
  2. Retrieve the ArgoCD password by running the following command:
kubectl get secrets/argocd-admin-password -n argocd --template '{{ .data.password }}' | base64 -d
  3. Connect to ArgoCD:
    a. Navigate to https://alm.<fqdn>/:443
    b. Log in using admin as the username and the password obtained in step 2.

  4. Find the UiPath services application, as follows:
    a. Using the search bar available in ArgoCD, type uipath.

    b. Then open the UiPath application by clicking its card.

    c. Check for the following error: Application was not synced due to a failed job/pod

    d. If the above error exists, take the following steps.

    e. Find any components that were not synced by looking for the red broken-heart icon.

    f. Open the rightmost component (usually a pod), and click the Logs tab. The logs contain an error message indicating the reason for the pod failure.

    g. Once you address any outstanding configuration issues, go back to the home page and click the Sync button on the UiPath application.

 

How to uninstall the cluster


If you experience issues specific to Kubernetes running on the cluster, you can directly uninstall the rke2 cluster.

  1. Depending on your installation profile, run one of the following commands:
    1.1. In an online setup, run the following script with elevated privileges (i.e., sudo) on each node of the cluster. This uninstalls the node.
function remove_rke2_entry_from_exclude() {
  local current_exclude_list new_exclude_list
  YUM_CONF_FILE=$1
  if [[ ! -s "${YUM_CONF_FILE}" ]];
  then
    # File is empty
    return
  fi
  current_exclude_list=$(grep 'exclude=' "${YUM_CONF_FILE}" | tail -1)
  if echo "$current_exclude_list" | grep -q 'rke2-*';
  then
    if [[ -w ${YUM_CONF_FILE} ]];
    then
      new_exclude_list=$(printf '%s\n' "${current_exclude_list//rke2-* /}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-*,/}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-\*/}")
      sed -i "/exclude=.*rke2-\*/d" "${YUM_CONF_FILE}"
      echo "${new_exclude_list}" >> "${YUM_CONF_FILE}"
    else
      error "${YUM_CONF_FILE} file is readonly and contains rke2-* under package exclusion. Please remove the entry for AS to work."
    fi
  fi
}

function enable_rke2_package_upgrade() {
  remove_rke2_entry_from_exclude /etc/dnf/dnf.conf
  remove_rke2_entry_from_exclude /etc/yum.conf
}

enable_rke2_package_upgrade

service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/bin/rke2-killall.sh ]
then
    echo "Running rke2-killall.sh"
    /usr/bin/rke2-killall.sh > /dev/null
else
    echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/bin/rke2-uninstall.sh ]
then
    echo "Running rke2-uninstall.sh"
    /usr/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

crontab -l > backupcron
sed -i '/backupjob/d' backupcron > /dev/null
crontab backupcron > /dev/null
rm -rf backupcron > /dev/null
rm -rfv /usr/bin/backupjob > /dev/null
rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
umount -l -f /var/lib/rancher/rke2/server/db > /dev/null 2>&1 || true
rm -rfv /var/lib/rancher/* > /dev/null
umount -l -f /var/lib/rancher
rm -rfv /var/lib/rancher/* > /dev/null
while ! rm -rfv /var/lib/kubelet/* > /dev/null; do
  findmnt --list   --submounts  -n -o TARGET  --target /var/lib/kubelet | grep '/var/lib/kubelet/plugins'  | xargs -r umount -f -l
  sleep 5
done
umount -l -f /var/lib/kubelet
rm -rfv /var/lib/kubelet/* > /dev/null
rm -rfv /datadisk/* > /dev/null
umount -l -f /datadisk
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mount /var/lib/rancher
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."

    1.2. In an offline setup, run the following script with elevated privileges (i.e., sudo) on each node of the cluster. This uninstalls the node.

function remove_rke2_entry_from_exclude() {
  local current_exclude_list new_exclude_list
  YUM_CONF_FILE=$1
  if [[ ! -s "${YUM_CONF_FILE}" ]];
  then
    # File is empty
    return
  fi
  current_exclude_list=$(grep 'exclude=' "${YUM_CONF_FILE}" | tail -1)
  if echo "$current_exclude_list" | grep -q 'rke2-*';
  then
    if [[ -w ${YUM_CONF_FILE} ]];
    then
      new_exclude_list=$(printf '%s\n' "${current_exclude_list//rke2-* /}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-*,/}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-\*/}")
      sed -i "/exclude=.*rke2-\*/d" "${YUM_CONF_FILE}"
      echo "${new_exclude_list}" >> "${YUM_CONF_FILE}"
    else
      error "${YUM_CONF_FILE} file is readonly and contains rke2-* under package exclusion. Please remove the entry for AS to work."
    fi
  fi
}

function enable_rke2_package_upgrade() {
  remove_rke2_entry_from_exclude /etc/dnf/dnf.conf
  remove_rke2_entry_from_exclude /etc/yum.conf
}

enable_rke2_package_upgrade

service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/local/bin/rke2-killall.sh ]
then
  echo "Running rke2-killall.sh"
  /usr/local/bin/rke2-killall.sh > /dev/null
else
  echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/local/bin/rke2-uninstall.sh ]
then
  echo "Running rke2-uninstall.sh"
  /usr/local/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

crontab -l > backupcron
sed -i '/backupjob/d' backupcron > /dev/null
crontab backupcron > /dev/null
rm -rf backupcron > /dev/null
rm -rfv /usr/bin/backupjob > /dev/null
rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
umount -l -f /var/lib/rancher/rke2/server/db > /dev/null 2>&1 || true
rm -rfv /var/lib/rancher/* > /dev/null
umount -l -f /var/lib/rancher
rm -rfv /var/lib/rancher/* > /dev/null
while ! rm -rfv /var/lib/kubelet/* > /dev/null; do
  findmnt --list   --submounts  -n -o TARGET  --target /var/lib/kubelet | grep '/var/lib/kubelet/plugins'  | xargs -r umount -f -l
  sleep 5
done
umount -l -f /var/lib/kubelet
rm -rfv /var/lib/kubelet/* > /dev/null
rm -rfv /datadisk/* > /dev/null
umount -l -f /datadisk
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mount /var/lib/rancher
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."
  2. Reboot the node after the uninstall.

🚧

Important

When uninstalling a node from the cluster, you must run the following command: kubectl delete node <node_name>. This removes the node from the cluster.
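For example, run the following from a remaining server node, replacing <node_name> with the name reported by kubectl get nodes:

kubectl delete node <node_name>   # remove the node object from the cluster
kubectl get nodes                 # confirm the node is no longer listed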

 

How to clean up offline artifacts to improve disk space


If you run an offline installation, a larger disk space is typically needed due to the offline artifacts being used.

Once the installation completes, you can remove these local artifacts. Otherwise, they may cause unnecessary disk pressure during cluster operations.

On the primary server where the installation was performed, you can perform the cleanup using the following commands.

  1. Remove all the images loaded by Podman into the local container storage, using the following command:
podman image rm -af
  2. Then remove the temporary offline folder used with the --offline-tmp-folder flag. This parameter defaults to /tmp:
rm -rf /path/to/temp/folder
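To confirm that the space was actually reclaimed, you can compare disk usage before and after the cleanup, for example:

df -h /tmp           # default location of the temporary offline folder
podman system df     # summary of remaining container storage usage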

 

Common issues


Unable to run an offline installation on RHEL 8.4 OS


Description

If you install RHEL 8.4 and perform an offline installation, which requires Podman, the following issues may occur. They are specific to Podman and the OS installed together. See the two potential issues below.

Potential issue

  • You cannot install both of the following on the cluster at the same time:
    • podman-1.0.0-8.git921f98f.module+el8.3.0+10171+12421f43.x86_64
    • podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
  • cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch requires podman >= 1.3.0, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch

Potential issue

  • podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires containernetworking-plugins >= 0.8.1-1, but none of the providers can be installed
  • You cannot install both of the following at the same time:
    • containernetworking-plugins-0.7.4-4.git9ebe139.module+el8.3.0+10171+12421f43.x86_64
    • containernetworking-plugins-0.9.1-1.module+el8.4.0+10607+f4da7515.x86_64
  • podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires podman = 3.0.1-6.module+el8.4.0+10607+f4da7515, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
    (try adding --allowerasing to the command line to replace conflicting packages, --skip-broken to skip uninstallable packages, or --nobest to use not only best candidate packages)

Solution

You need to remove the current version of Podman and allow Automation Suite to install the required version.

  1. Remove the current Podman version using the yum remove podman command.

  2. Re-run the installer after removing the current version; it installs the correct one.

 

Offline installation failure due to a missing binary


Description

During an offline installation, the fabric phase fails with the following error message:

Error: overlay: can't stat program "/usr/bin/fuse-overlayfs": stat /usr/bin/fuse-overlayfs: no such file or directory

Solution

You need to remove the line containing the mount_program key from the Podman configuration file /etc/containers/storage.conf.
Make sure to delete the line instead of commenting it out.
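A minimal sketch of the removal using sed; it backs up the file first and assumes mount_program appears only on the line to be removed:

cp /etc/containers/storage.conf /etc/containers/storage.conf.bak    # keep a backup
sed -i '/mount_program/d' /etc/containers/storage.conf              # delete the line entirely
grep mount_program /etc/containers/storage.conf || echo "mount_program line removed"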

 

Failure to get the sandbox image


Description

You may get a specific error message when trying to fetch the following sandbox image: index.docker.io/rancher/pause:3.2

This may happen in offline installations.

Solution

Restart rke2-server or rke2-agent, depending on whether the machine the pod is scheduled on is a server or an agent node.

To check which node the pod is scheduled on, run kubectl -n <namespace> get pods -o wide.

# If machine is a Master node
systemctl restart rke2-server
# If machine is an Agent Node
systemctl restart rke2-agent

 

SQL connection string validation error


Description

You may get an error related to the connection string, such as the following:

Sqlcmd: Error: Microsoft Driver 17 for SQL Server :
Server tcp : <connection string>
Login failed for user

This error occurs even though all the credentials are correct; the connection string validation fails.

Solution

Make sure the connection string has the following structure:

Server=<Sql server host name>;User Id=<user_name for sql server>;Password=<Password>;Initial Catalog=<database name>;Persist Security Info=False;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;Max Pool Size=100;

📘

Note:

User Id is case-sensitive.
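To validate the credentials and the connection string outside of the installer, a hedged sketch using sqlcmd (assuming the sqlcmd client is installed on the machine) is:

# Placeholders match the connection string fields above
sqlcmd -S "<Sql server host name>" -U "<user_name for sql server>" -P "<Password>" -d "<database name>" -Q "SELECT 1"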

 

Pods not shown in the ArgoCD UI


Description

Sometimes, the ArgoCD UI does not display the pods, showing only the applications and their corresponding deployments.

When you click any deployment, the following error is displayed: Unable to load data: EOF

Solution

You can fix the issue by deleting all the Redis replicas from the ArgoCD namespace and waiting for them to come up again:

kubectl -n argocd delete pod argocd-redis-ha-server-0 argocd-redis-ha-server-1 argocd-redis-ha-server-2

# Wait for all 3 pods to come back up
kubectl -n argocd get pods | grep argocd-redis-ha-server

 

Certificate issues in offline installation


Description

You may get an error saying that the certificate is signed by an unknown authority.

Error: failed to do request: Head "https://sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com:30071/v2/helm/audit-service/blobs/sha256:09bffbc520ff000b834fe1a654acd089889b09d22d5cf1129b0edf2d76554892": x509: certificate signed by unknown authority

Solution

Both the CA root certificate and the server certificate need to be in the trusted store on the machine.

To investigate, run the following command:

[root@server0 ~]# find /etc/pki/ca-trust/source{,/anchors} -maxdepth 1 -not -type d -exec ls -1 {} +
/etc/pki/ca-trust/source/anchors/rootCA.crt
/etc/pki/ca-trust/source/anchors/server.crt

The provided certificates need to appear in the output of this command.

Alternatively, run the following command:

[root@server0 ~]# openssl x509 -in /etc/pki/ca-trust/source/anchors/server.crt -text -noout

Make sure the fully qualified domain name is present in the Subject Alternative Name of the output.

X509v3 Subject Alternative Name:
                DNS:sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com

You can update the CA certificates as follows:

[root@server0 ~]# update-ca-trust

 

Error downloading the bundle


Description

The documentation lists wget as an option for downloading bundles. Due to the large size, the connection may be interrupted and not recover.

Solution

One way to mitigate this could be to switch to a different download tool, such as azcopy (more information here). Run these commands, while updating the bundle URL to match the desired version/bundle combination.

wget https://aka.ms/downloadazcopy-v10-linux -O azcopy.tar.gz
tar -xvf ./azcopy.tar.gz
azcopy_linux_amd64_10.11.0/azcopy copy https://download.uipath.com/service-fabric/0.0.23-private4/sf-0.0.23-private4.tar.gz /var/tmp/sf.tar.gz --from-to BlobLocal

 

Longhorn errors


Rook Ceph or Looker pod stuck in Init state

Description

Occasionally, on node restart, an issue causes the Looker or Rook Ceph pods to get stuck in Init state, as the volume required for attaching the PVC to a pod is missing.

Verify if the problem is indeed related to Longhorn by running the following command:

kubectl get events -A -o json | jq -r '.items[] | select(.message != null) | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""))'

If it is related to Longhorn, this command should return the list of pod names affected by the issue. If the command does not return anything, the cause of the problem is different.

Solution

If the previous command returns a non-empty output, run the following script to fix the problematic pods:

#!/bin/bash


function wait_till_rollout() {
    local namespace=$1
    local object_type=$2
    local deploy=$3

    local try=0
    local maxtry=2
    local status="notready"

    while [[ ${status} == "notready" ]]  && (( try != maxtry )) ; do
        kubectl -n "$namespace" rollout status "$deploy" -w --timeout=600s; 
        # shellcheck disable=SC2181
        if [[ "$?" -ne 0 ]]; 
        then
            status="notready"
            try=$((try+1))
        else
            status="ready"
        fi
    done
    if [[ $status == "notready" ]]; then 
        echo "$deploy of type $object_type failed in namespace $namespace. Plz re-run the script once again to verify that it's not a transient issue !!!"
        exit 1
    fi
}

function fix_pv_deployments() {
    for pod_name in $(kubectl get events -A -o json | jq -r '.items[]  | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\"")) | select(.involvedObject.kind == "Pod") | .involvedObject.name + "/" + .involvedObject.namespace' | sort | uniq)
    do
        POD_NAME=$(echo "${pod_name}" | cut -d '/' -f1)
        NS=$(echo "${pod_name}" | cut -d '/' -f2)
        controller_data=$(kubectl -n "${NS}" get po "${POD_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
        [[ $controller_data == "" ]] && error "Error: Could not determine owner for pod: ${POD_NAME}" && exit 1
        CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
        CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)
        if [[ $CONTROLLER_KIND == "ReplicaSet" ]]
        then
            controller_data=$(kubectl  -n "${NS}" get "${CONTROLLER_KIND}" "${CONTROLLER_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
            CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
            CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)

            replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.replicas')
            unavailable_replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.unavailableReplicas')

            if [ -n "$unavailable_replicas" ]; then 
                available_replicas=$((replicas - unavailable_replicas))
                if [ $available_replicas -eq 0 ]; then
                    kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas=0
                    sleep 15
                    kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas="$replicas"
                    deployment_name="$CONTROLLER_KIND/$CONTROLLER_NAME"
                    wait_till_rollout "$NS" "deploy" "$deployment_name"
                fi 
            fi
        fi
    done
}

fix_pv_deployments

 

StatefulSet volume attachment error


Pods in RabbitMQ, cattle-monitoring-system, or other stateful sets are stuck in Init state.

Description

Sometimes, on node power failure or during an upgrade, pods in RabbitMQ or cattle-monitoring-system get stuck in Init state, as the volume required for attaching the PVC to the pod is missing.

Verify if the problem is indeed related to the stateful set volume attachment by running the following command:

kubectl -n <namespace> describe pod <pod-name> | grep "cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""

If it is related to the stateful set volume attachment, an error message is displayed.

Solution

To fix this issue, restart the node.
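If you want to restart the node gracefully, a sketch of the usual sequence is shown below; the drain flags mirror the manual drain script in this guide, and the kubectl commands run from a server node:

kubectl drain <node_name> --ignore-daemonsets --delete-emptydir-data   # run from a server node
sudo reboot                                                            # run on the node itself
kubectl uncordon <node_name>                                           # run once the node is back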

 

Failure to create persistent volumes

Description

Longhorn is installed successfully, but it fails to create persistent volumes.

Solution

Verify if the kernel modules were successfully loaded on the cluster, using the lsmod | grep <module_name> command.
Replace <module_name> with each of the following kernel modules:

  • libiscsi_tcp
  • libiscsi
  • iscsi_tcp
  • scsi_transport_iscsi

Load any missing modules.
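A minimal sketch that checks all four modules from the list above and loads any that are missing (run with elevated privileges):

for module in libiscsi_tcp libiscsi iscsi_tcp scsi_transport_iscsi; do
  if ! lsmod | grep -q "^${module}"; then
    echo "Loading missing kernel module: ${module}"
    modprobe "${module}"
  fi
done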

 

rke2-coredns-rke2-coredns-autoscaler pod in CrashLoopBackOff state


Description

After a node restart, rke2-coredns-rke2-coredns-autoscaler sometimes goes into CrashLoopBackOff state. This has no impact on Automation Suite.

Solution

Delete the rke2-coredns-rke2-coredns-autoscaler pod stuck in CrashLoopBackOff, using the kubectl delete pod <pod name> -n kube-system command, as in the sketch below.
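For example, to find the failing pod and delete it in one pass (a sketch; the exact pod name carries a random suffix):

kubectl -n kube-system get pods --no-headers \
  | grep rke2-coredns-rke2-coredns-autoscaler \
  | awk '{print $1}' \
  | xargs -r kubectl -n kube-system delete pod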

 

Redis probe failure


Description

Redis probes may fail if the node ID file does not exist. This may happen if the pod has not been bootstrapped yet.

There is a recovery job that automatically fixes this issue; the following steps should not be performed while that job is running.

When a Redis Enterprise cluster loses contact with more than half of its nodes (either because of failed nodes or a network split), the cluster stops responding to client connections. Pods also fail to rejoin the cluster.

Solution

  1. Delete the Redis cluster and database using the following commands:
kubectl delete redb -n redis-system redis-cluster-db --force --grace-period=0 &
kubectl delete rec -n redis-system redis-cluster --force --grace-period=0 &
kubectl patch redb -n redis-system redis-cluster-db --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"finalizer.redisenterprisedatabases.app.redislabs.com"}]'
kubectl patch rec redis-cluster -n redis-system --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"redbfinalizer.redisenterpriseclusters.app.redislabs.com"}]'
kubectl delete job redis-cluster-db-job -n redis-system
  2. Go to the ArgoCD UI and sync the redis-cluster application.

 

RKE2 server fails to start


Description

The server fails to start. There are various reasons for RKE2 not starting properly, and they can usually be found in the logs.

Solution

Check the logs using the following command:

journalctl -u rke2-server

Possible cause (based on the logs): too many learner members in the cluster

Too many etcd servers are added to the cluster, and there are two learner nodes trying to be promoted. More information here: Runtime reconfiguration.

Take the following steps:

  1. Under normal circumstances, the node should become a full member if given enough time.
  2. An uninstall-reinstall cycle can be attempted.

Alternatively, this could be caused by a networking problem. Ensure you have configured the machine to enable the necessary ports.

 

Node drain does not occur for stopped nodes


Description

If a node in the cluster stops and its corresponding pods do not get rescheduled onto available nodes after 15 minutes, run the following script to manually drain the node.

#!/bin/sh

KUBECTL="/usr/local/bin/kubectl"

# Get only nodes which are not drained yet
NOT_READY_NODES=$($KUBECTL get nodes | grep -P 'NotReady(?!,SchedulingDisabled)' | awk '{print $1}' | xargs echo)
# Get only nodes which are still drained
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}' | xargs echo)

echo "Unready nodes that are undrained: $NOT_READY_NODES"
echo "Ready nodes: $READY_NODES"


for node in $NOT_READY_NODES; do
  echo "Node $node not drained yet, draining..."
  $KUBECTL drain --ignore-daemonsets --force --delete-emptydir-data $node
  echo "Done"
done;

for node in $READY_NODES; do
  echo "Node $node still drained, uncordoning..."
  $KUBECTL uncordon $node
  echo "Done"
done;

 

Enable Istio logging


To debug Istio, you need to enable logging. To do so, take the following steps:

  1. Find the istio-ingressgateway pod by running the following command, and copy the gateway pod name. It should look something like istio-ingressgateway-r4mbx.
kubectl -n istio-system get pods
  2. Open the gateway pod shell by running the following command:
kubectl exec -it -n istio-system istio-ingressgateway-r4mbx bash
  3. Enable debug-level logging by running the following command:
curl -X POST http://localhost:15000/logging?level=debug
  4. Run the following commands from the server node:
istioctl_bin=$(find /var/lib/rancher/rke2/ -name "istioctl" -type f -perm -u+x   -print -quit)
if [[ -n ${istioctl_bin} ]]
then
echo "istioctl bin found"
  kubectl -n istio-system get cm istio-installer-base -o go-template='{{ index .data "istio-base.yaml" }}'  > istio-base.yaml
  kubectl -n istio-system get cm istio-installer-overlay  -o go-template='{{ index .data "overlay-config.yaml" }}'  > overlay-config.yaml 
  ${istioctl_bin} -i istio-system install -y -f istio-base.yaml -f overlay-config.yaml --set meshConfig.accessLogFile=/dev/stdout --set meshConfig.accessLogEncoding=JSON 
else
  echo "istioctl bin not found"
fi

 

Secrets not found in the UiPath namespace


Description

If the service installation fails, and kubectl -n uipath get pods returns failed pods, take the following steps.

Solution

  1. Check kubectl -n uipath describe pod <pod-name> and look for secrets that were not found.
  2. If a secret is not found, look for the credential manager job logs and check whether the job failed.
  3. If the credential manager job failed, and kubectl get pods -n rook-ceph|grep rook-ceph-tool returns more than one pod, do the following:
    a. Delete the rook-ceph-tool pod that is not running (see the sketch after this list).
    b. Go to the ArgoCD UI and sync the sfcore application.
    c. Once the job completes, check whether all the secrets were created in the credential manager job logs.
    d. Now sync the uipath application.
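For step 3a, a minimal sketch of identifying and removing the rook-ceph-tools pod that is not running:

kubectl -n rook-ceph get pods | grep rook-ceph-tool
# Delete any pod from the list above that is not in Running state
kubectl -n rook-ceph delete pod <non-running-rook-ceph-tools-pod-name>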

 

Unable to log in after migration


Description

An issue may affect the migration from a standalone product to Automation Suite. It prevents you from logging in and displays the following error message: Cannot find client details

Solution

To fix this issue, you need to resync the uipath application first, and then sync the platform application in ArgoCD, as in the sketch below.
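You can trigger the syncs from the ArgoCD UI or, assuming the argocd CLI is installed and logged in to the cluster, from the command line; the application names below match the ones used elsewhere in this guide:

argocd app sync uipath     # resync the uipath application first
argocd app sync platform   # then sync the platform application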

 

ArgoCD login failure


Description

You may fail to log in to ArgoCD using the admin password, or the installer may fail with an ArgoCD login error message.

Solution

To fix this issue, enter your password, create a bcrypt password, and run the commands described in the following section:

password="<enter_your_password>"
bcryptPassword=<generate bcrypt password using link https://www.browserling.com/tools/bcrypt >

# Enter your bcrypt password and run below command
kubectl -n argocd patch secret argocd-secret \
  -p '{"stringData": {
    "admin.password": "<enter you bcryptPassword here>",
    "admin.passwordMtime": "'$(date +%FT%T%Z)'"
  }}'

# Run below commands
argocdInitialAdminSecretPresent=$(kubectl -n argocd get secret argocd-initial-admin-secret --ignore-not-found )
if [[ -n ${argocdInitialAdminSecretPresent} ]]; then
   echo "Start updating argocd-initial-admin-secret"
   kubectl -n argocd patch secret argocd-initial-admin-secret \
   -p "{
      \"stringData\": {
         \"password\": \"$password\"
      }
   }"
fi

argocAdminSecretName=$(kubectl -n argocd get secret argocd-admin-password --ignore-not-found )
if [[ -n ${argocAdminSecretName} ]]; then
   echo "Start updating argocd-admin-password"
   kubectl -n argocd patch secret argocd-admin-password \
   -p "{
      \"stringData\": {
         \"password\": \"$password\"
      }
   }"
fi
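If you prefer to generate the bcrypt hash locally instead of using the browser tool, a sketch using htpasswd (assuming the httpd-tools package is installed) is:

password="<enter_your_password>"
# -B selects bcrypt, -C 10 sets the cost; strip the leading colon and the newline
bcryptPassword=$(htpasswd -nbBC 10 "" "$password" | tr -d ':\n')
echo "$bcryptPassword"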

 

ArgoCD applications stuck in Progressing state after the initial installation


Description

Whenever the cluster state drifts from the one defined in the Helm repository, argocd tries to sync the state, and a reconciliation takes place every minute. Whenever this happens, you notice that the ArgoCD applications are in Progressing state.

Solution

This is the expected behavior of ArgoCD, and it does not affect the applications in any way.

 

Automation Suite requires backlog_wait_time to be set to 1


Description

Audit events may cause instability (system freezes) if backlog_wait_time is not set to 1.
For more details, see this issue description.

Solution

If the installer fails with the Automation Suite requires backlog_wait_time to be set 1 error message, take the following steps to set backlog_wait_time to 1; a scripted sketch follows this list.

  1. Set backlog_wait_time to 1 by appending --backlog_wait_time 1 to the /etc/audit/rules.d/audit.rules file.
  2. Reboot the node.
  3. Validate that the auditctl backlog_wait_time value is set to 1 by running sudo auditctl -s | grep "backlog_wait_time" on the node.
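A minimal sketch of the three steps, run with elevated privileges on the node (the reboot interrupts the session, so run the verification afterwards):

# 1. Append the audit rule setting
echo "--backlog_wait_time 1" >> /etc/audit/rules.d/audit.rules
# 2. Reboot the node
reboot
# 3. After the reboot, verify the setting
auditctl -s | grep "backlog_wait_time"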

 

Cannot resize objectstore PVC


Description

This issue occurs when the objectstore resize-pvc operation fails with the following error:

Failed resizing the PVC: <pvc name> in namespace: rook-ceph, ROLLING BACK

Solution

To fix the issue, take the following steps:

  1. Run the following script manually:
#!/bin/sh

ROOK_CEPH_OSD_PREPARE=$(kubectl -n rook-ceph get pods | grep rook-ceph-osd-prepare-set | awk '{print $1}')
if [[ -n ${ROOK_CEPH_OSD_PREPARE} ]]; then
    for pod in ${ROOK_CEPH_OSD_PREPARE}; do
    echo "Start deleting rook ceph osd pod $pod .."
    kubectl -n rook-ceph delete pod $pod
    echo "Done"
    done;
fi
  2. Re-run the objectstore resize-pvc command.

 

PVC resize does not heal Ceph


Description

If Ceph becomes unhealthy because of low storage, resizing the objectstore PVC does not heal it.

Solution

To speed up Ceph recovery on non-HA clusters, run the following commands:

function set_ceph_pool_config_non_ha() {
  # Return if HA 
  [[ "$(kubectl -n rook-ceph get cephobjectstore rook-ceph  -o jsonpath='{.spec.dataPool.replicated.size}')" -eq 1 ]] || return
  # Set pool size and min_size
  kubectl -n "rook-ceph" exec deploy/rook-ceph-tools – ceph osd pool set  "device_health_metrics" "size" "1" --yes-i-really-mean-it || true
  kubectl -n "rook-ceph" exec deploy/rook-ceph-tools – ceph osd pool set  "device_health_metrics" "min_size" "1" --yes-i-really-mean-it || true
}
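Note that the snippet above only defines the function; to apply the change, run it in a shell that has cluster access and then invoke it:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin
set_ceph_pool_config_non_ha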

 

Failure to upload or download data in objectstore (rook-ceph)


Description

This issue may occur when the objectstore is in a degraded state due to inconsistent placement groups (PGs).
Verify if the problem is indeed related to rook-ceph PG inconsistency by running the following commands:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin
ROOK_CEPH_TOOLS=$(kubectl -n rook-ceph get pods | grep rook-ceph-tools)
kubectl -n rook-ceph exec -it $ROOK_CEPH_TOOLS -- ceph status

If the issue is related to rook-ceph PG inconsistency, the output contains the following message:

....
....
Possible data damage: X pgs inconsistent
....
....
X active+clean+inconsistent
....
....

Solution

To fix the inconsistent PGs, take the following steps:

  1. Exec into the rook-ceph tools pod:
kubectl -n rook-ceph exec -it $ROOK_CEPH_TOOLS -- sh
  2. Trigger the rook-ceph garbage collector process, and wait for it to complete:
radosgw-admin gc process
  3. Find the list of active+clean+inconsistent PGs:
ceph health detail
# output of this command be like
# ....
# pg <pg-id> is active+clean+inconsistent, acting ..
# pg <pg-id> is active+clean+inconsistent, acting ..
# ....
#
  4. Trigger a deep scrub on the PGs, one at a time. This command may take a few minutes to run, depending on the PG size:
ceph pg deep-scrub <pg-id>
  5. Watch the scrubbing status:
ceph -w | grep <pg-id>
  6. Check the PG scrub status. If the PG scrub was successful, the PG status should be active+clean+inconsistent:
ceph health detail | grep <pg-id>
  7. Repair the PG:
ceph pg repair <pg-id>
  8. Check the PG repair status. If the PG was repaired successfully, its ID should be removed from the active+clean+inconsistent list:
ceph health detail | grep <pg-id>
  9. Repeat steps 3 to 8 for the remaining inconsistent PGs.

 

Failure after certificate update


Description

This issue occurs when the certificate update step fails internally. You may not be able to access Automation Suite or Orchestrator.

Solution

  1. Run the following commands from any server node:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin

kubectl -n uipath rollout restart deployments
  2. Wait for the above command to run successfully, then run the following commands to verify the status of the previous one:
deployments=$(kubectl -n uipath get deployment -o name)
for i in $deployments; 
do 
kubectl -n uipath rollout status "$i" -w --timeout=600s; 
if [[ "$?" -ne 0 ]]; 
then
    echo "$i deployment failed in namespace uipath."
fi
done
echo "All deployments are succeeded in namespace uipath"

Once the above commands complete, you should be able to access Automation Suite and Orchestrator.

 

Mongo pods in CrashLoopBackOff or pending PVC provisioning after deletion


Mongo pods may get stuck in a crash loop because of a corrupted PVC. The most likely reason for this issue is an ungraceful shutdown.
When you encounter this issue, the logs show the following:

Common point must be at least stable timestamp
{"t":{"$date":"2022-05-18T09:37:55.053+00:00"},"s":"W",  "c":"STORAGE",  "id":22271,   "ctx":"initandlisten","msg":"Detected unclean shutdown - Lock file is not empty","attr":{"lockFile":"/data/mongod.lock"}}
    ['currentState.Running' = false]
    ['currentState.IsVCRedistCorrect' = true]
    ['desiredState.ProcessType' != mongos ('desiredState.ProcessType' = mongod)]

Recovery steps

  1. Get the name of the corrupted PVC for the failing pod:
kubectl -n mongodb get pvc
  2. Delete the failing pod:
kubectl delete pod <pod-name> -n mongodb
eg - kubectl -n mongodb delete pod mongodb-replica-set-2
  3. Delete the PVC for the failing pod:
kubectl -n mongodb delete pvc <pvc-name>
eg - kubectl -n mongodb delete pvc logs-volume-mongodb-replica-set-2
eg - kubectl -n mongodb delete pvc data-volume-mongodb-replica-set-2

📘

At this point, the PVC should sync automatically, and the pod should no longer run into any issues. If automatic provisioning does not occur, perform it manually via the following steps.

  1. Get the PVC YAML from a healthy node:
kubectl -n mongodb get pvc <pvc-name> -o yaml > pvc.yaml
  2. Edit the name and remove the uuids/pvc-ids from the YAML.
  3. Delete the volume name and UID, and rename the PVC to the name of the deleted PVC.
  4. Apply the PVC:
kubectl -n mongodb apply -f pvc.yaml
  5. The PVC should be provisioned and attached to the pod, and the pod should no longer run into any issues. If the pod does not resync, delete it.

 

Unexpected inconsistency; run fsck manually


When installing or upgrading Automation Suite, if any pod fails to mount to its PVC, the following error message is displayed:
UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY

Recovery steps

If you encounter the above error, take the following recovery steps:

  1. Connect to the system via SSH by running the following command:
ssh <user>@<node-ip>
  2. Check the events of the PVC, and validate that the issue is related to the PVC mount failing because of file errors. To do so, run the following commands:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
kubectl get events -n mongodb
kubectl get events -n longhorn-system
  3. Check the PVC volume mentioned in the events, then run the fsck command:
fsck -a <pvc-volume-name>
Eg - fsck -a /dev/longhorn/pvc-5abe3c8f-7422-44da-9132-92be5641150a
  4. Delete the failing MongoDB pod so that it mounts correctly to the PVC:
kubectl delete pod <pod-name> -n mongodb

 

Degraded MongoDB or business applications after cluster restore or rollback


描述

Occasionally, after a cluster restore or rollback to version 2022.4 or 2021.10, an issue causes MongoDB or business application (Apps) pods to get stuck in Init state. This happens because the volume required for attaching the PVC to a pod is missing.

解决方案

  1. Verify if the problem is indeed related to the MongoDB volume attachment issue:
# fetch all mongodb pods
kubectl -n mongodb get pods 
#describe pods stuck in init state
#kubectl -n mongodb describe pods mongodb-replica-set-<replica index number>
kubectl -n mongodb describe pods mongodb-replica-set-0

If the issue is related to the MongoDB volume attachment, the following events are displayed:

Events:
  Type     Reason              Age                   From                     Message
  ----     ------              ----                  ----                     -------
  Warning  FailedAttachVolume  3m9s (x65 over 133m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-66897693-e52d-4b89-aac6-ca0cc5ae9e07" : rpc error: code = Aborted desc = volume pvc-66897693-e52d-4b89-aac6-ca0cc5ae9e07 is not ready for workloads
  Warning  FailedMount         103s (x50 over 112m)  kubelet                  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[logs-volume], unattached volumes=[hooks mongodb-replica-set-keyfile tls-secret data-volume healthstatus tls-ca kube-api-access-45qcl agent-scripts logs-volume automation-config]: timed out waiting for the condition
  2. Fix the problematic MongoDB pods by running the following script:
restore_mongodb.sh:
#!/bin/bash

set -eu


FAILED_PVC_LIST=""
STORAGE_CLASS_SINGLE_REPLICA="longhorn-backup-single-replica"
STORAGE_CLASS="longhorn-backup"
LOCAL_RESTORE_PATH="restoredata"
mkdir -p ${LOCAL_RESTORE_PATH}
export LOCAL_RESTORE_PATH="${LOCAL_RESTORE_PATH}"


function delete_pvc_resources(){
  local pv_name=$1
  local volumeattachment_name=$2

  echo "deleting pv & volumes forcefully"
  delete_pv_forcefully "${pv_name}"
  sleep 2
  delete_longhornvolume_forcefully "${pv_name}"
  sleep 2
  if [ -n "${volumeattachment_name}" ]; then
    echo "deleting volumeattachments forcefully"
    delete_volumeattachment_forcefully "${volumeattachment_name}"
    sleep 5
  fi
}

function delete_longhornvolume_forcefully() {
  local pv_name=$1
  
  if ! kubectl -n longhorn-system get volumes.longhorn.io "${pv_name}" >/dev/null 2>&1 ; then
    echo "Volume ${pv_name} not found, skip deletion."
    return
  fi

  kubectl -n longhorn-system delete volumes.longhorn.io "${pv_name}" --grace-period=0 --force &
  local success=0
  local try=0
  local maxtry=10
  while (( try < maxtry )); do
    local result=""
    # update finalizer field to null
    result=$(kubectl -n longhorn-system get volumes.longhorn.io "${pv_name}" -o=json | jq '.metadata.finalizers = null' | kubectl apply -f -) || true
    echo "${result}"

    result=$(kubectl -n longhorn-system get volumes.longhorn.io "${pv_name}")|| true
    echo "${result}"
    if [[ -z "${result}" ]]; then
      success=1
      break;
    fi
    kubectl -n longhorn-system delete volumes.longhorn.io "${pv_name}" --grace-period=0 --force &
    echo "Waiting to delete volume ${pv_name} ${try}/${maxtry}..."; sleep 10
    try=$(( try + 1 ))
  done

   if [ "${success}" -eq 0 ]; then
    echo "Failed to delete volume ${pv_name}."
  fi

}

function delete_pv_forcefully() {
  local pv_name=$1

  kubectl delete pv "${pv_name}" --grace-period=0 --force &

  local success=0
  local try=0
  local maxtry=10
  while (( try < maxtry )); do
    # update finalizer field to null
    result=$(kubectl get pv "${pv_name}" -o=json | jq '.metadata.finalizers = null' | kubectl apply -f -) || true
    echo "${result}"

    result=$(kubectl get pv "${pv_name}")|| true
    echo "${result}"
    if [[ -z "${result}" ]]; then
      success=1
      break;
    fi
    kubectl delete pv "${pv_name}" --grace-period=0 --force &
    echo "Waiting to delete pv ${pv_name} ${try}/${maxtry}..."; sleep 10
    try=$(( try + 1 ))
  done

  if [ "${success}" -eq 0 ]; then
    echo "Failed to delete pv ${pv_name}."
  fi
}

function validate_pv_backup_available(){
  local pv_name=$1
  local validate_pv_backup_available_resp=0
  local backup_resp=""
  local resp_status_code=""
  local resp_message=""

  local try=0
  local maxtry=3
  while (( try != maxtry )) ; do
    backup_resp=$(curl --noproxy "*" "${LONGHORN_URL}/v1/backupvolumes/${pv_name}?") || true
    echo "Backup Response: ${backup_resp}"
    if { [ -n "${backup_resp}" ] && [ "${backup_resp}" != " " ]; }; then
      resp_status_code=$(echo "${backup_resp}"| jq -c ".status")
      resp_message=$(echo "${backup_resp}"| jq -c ".message")
      if [[ -n "${resp_status_code}" && "${resp_status_code}" != " " && "${resp_status_code}" != "null" && (( resp_status_code -gt 200 )) ]] ; then
        validate_pv_backup_available_resp=0
      else
        resp_message=$(echo "${backup_resp}"| jq -c ".messages.error")
        if { [ -z "${resp_message}" ] || [ "${resp_message}" = "null" ]; }; then
          echo "PVC Backup is available for ${pv_name}"
          # shellcheck disable=SC2034
          validate_pv_backup_available_resp=1
          break;
        fi
      fi
    fi
    try=$((try+1))
    sleep 10
  done
  export IS_BACKUP_AVAILABLE="${validate_pv_backup_available_resp}"
}

function store_pvc_resources(){
  local pvc_name=$1
  local namespace=$2
  local pv_name=$3
  local volumeattachment_name=$4

  if [[ -n "${volumeattachment_name}" && "${volumeattachment_name}" != " " ]]; then
    result=$(kubectl get volumeattachments "${volumeattachment_name}" -o json > "${LOCAL_RESTORE_PATH}/${volumeattachment_name}".json) || true
    echo "${result}"
  fi

  kubectl get pv "${pv_name}" -o json > "${LOCAL_RESTORE_PATH}/${pv_name}"-pv.json
  kubectl -n longhorn-system get volumes.longhorn.io "${pv_name}" -o json > "${LOCAL_RESTORE_PATH}/${pv_name}"-volume.json
  kubectl get pvc "${pvc_name}" -n "${namespace}" -o json > "${LOCAL_RESTORE_PATH}/${pvc_name}"-pvc.json
}

function wait_pvc_bound() {
  local namespace=$1
  local pvc_name=$2

  try=0
  maxtry=30
  while (( try < maxtry )); do
    local status=""
    status=$(kubectl -n "${namespace}" get pvc "${pvc_name}" -o json| jq -r ".status | select( has(\"phase\")).phase")
    if [[ "${status}"  == "Bound" ]]; then
      echo "PVC ${pvc_name} Bouned successfully."
      break;
    fi
    echo "waiting for PVC to bound...${try}/${maxtry}"; sleep 30
    try=$((try+1))
  done
}

function create_volume() {
  local pv_name=$1
  local backup_path=$2
  local response
  create_volume_status="succeed"
  echo "Creating Volume with PV: ${pv_name} and BackupPath: ${backup_path}"
  local accessmode
  local replicacount
  accessmode=$(< "${LOCAL_RESTORE_PATH}/${pv_name}-volume.json" jq -r ".spec.accessMode")
  replicacount=$(< "${LOCAL_RESTORE_PATH}/${pv_name}-volume.json" jq -r ".spec.numberOfReplicas")

  # create volume from backup pv
  response=$(curl --noproxy "*" "${LONGHORN_URL}/v1/volumes" -H 'Accept: application/json' -H 'Content-Type: application/json;charset=UTF-8' --data-raw "{\"name\":\"${pv_name}\", \"accessMode\":\"${accessmode}\", \"fromBackup\": \"${backup_path}\", \"numberOfReplicas\": ${replicacount}}")

  sleep 5

  if [[ -n "${response}" && "${response}" != "null" && "${response}" != " " && -n $(echo "${response}"|jq -r ".id") ]]; then
  # wait for volume to be detached , max wait 4hr
    local try=0
    local maxtry=480
    local success=0
    while (( try < maxtry ));do
      status=$(kubectl -n longhorn-system get volumes.longhorn.io "${pv_name}"  -o json |jq -r ".status.state") || true
      echo "volume ${pv_name} status: ${status}"
      if [[ "${status}" == "detached" ]]; then
        # update label
        kubectl -n longhorn-system label volumes.longhorn.io/"${pv_name}" recurring-job-group.longhorn.io/uipath-backup=enabled
        success=1
        echo "Volume ready to use"
        break
      fi
      echo "waiting for volume to be ready to use${try}/${maxtry}..."; sleep 30
      restore_status_url="${LONGHORN_URL}/v1/volumes/${pv_name}?action=snapshotList"
      try=$(( try + 1 ))
    done

    if [ "${success}" -eq 0 ]; then
      create_volume_status="failed"
      echo "${pv_name} Volume is not ready to use with status ${status}"
    fi
  else
    create_volume_status="failed"
    kubectl create -f "${LOCAL_RESTORE_PATH}/${pv_name}"-volume.json || true
    echo "${pv_name} Volume creation failed ${response} "
  fi
  echo "${create_volume_status}"
}

function restore_with_backupvolume() {
  local pvc_name=$1
  local namespace=$2
  local pv_name=$3
  local volumeattachment_name=$4
  local backup_path=$5
  create_volume_status="succeed"
  store_pvc_resources "${pvc_name}" "${namespace}" "${pv_name}" "${volumeattachment_name}"
  sleep 5
  delete_pvc_resources "${pv_name}" "${volumeattachment_name}"
  sleep 5
  create_volume "${pv_name}" "${backup_path}"
  sleep 10

  kubectl create -f "${LOCAL_RESTORE_PATH}/${pv_name}"-pv.json
  if [[ -n "${create_volume_status}" && "${create_volume_status}" != "succeed" ]]; then
    echo "Backup volume restore failed for pvc ${pvc_name} in namespace ${namespace}"
    restore_pvc_status="failed"
  else 
    local pvc_uid=""
    pvc_uid=$(kubectl -n "${namespace}" get pvc "${pvc_name}" -o json| jq -r ".metadata.uid")

    # update pv with pvc uid
    kubectl patch pv "${pv_name}" --type json -p "[{\"op\": \"add\", \"path\": \"/spec/claimRef/uid\", \"value\": \"${pvc_uid}\"}]"

    wait_pvc_bound "${namespace}" "${pvc_name}"
    echo "${result}"
  fi
  echo ${restore_pvc_status}
}

function restore_pvc(){
  local pvc_name=$1
  local namespace=$2
  local pv_name=$3
  local volumeattachment_name=$4
  local backup_path=$5
  restore_pvc_status="succeed"
  restore_with_backupvolume "${pvc_name}" "${namespace}" "${pv_name}" "${volumeattachment_name}" "${backup_path}"
  echo ${restore_pvc_status}
  # attach_volume "${pv_name}"
}

function get_backup_list_by_pvname(){
  local pv_name=$1
  local get_backup_list_by_pvname_resp=""
  local backup_list_response=""

  local try=0
  local maxtry=3
  while (( try != maxtry )) ; do

    backup_list_response=$(curl --noproxy "*" "${LONGHORN_URL}/v1/backupvolumes/${pv_name}?action=backupList" -X 'POST' -H 'Content-Length: 0' -H 'Accept: application/json')
    if [[ -n "${backup_list_response}" && "${backup_list_response}" != " " && -n $( echo "${backup_list_response}"|jq ".data" )  && -n $( echo "${backup_list_response}"|jq ".data[]" ) ]]; then
      echo "Backup List Response: ${backup_list_response}"
      # pick first backup data with url non empty
      # shellcheck disable=SC2034
      get_backup_list_by_pvname_resp=$(echo "${backup_list_response}"|jq -c '[.data[]|select(.url | . != null and . != "")][0]')
      break;
    fi
    try=$((try+1))
    sleep 10
  done
  export PV_BACKUP_PATH="${get_backup_list_by_pvname_resp}"
}

function get_pvc_resources() {

  get_pvc_resources_resp=""

  local PVC_NAME=$1
  local PVC_NAMESPACE=$2

  # PV have one to one mapping with PVC
  PV_NAME=$(kubectl -n "${PVC_NAMESPACE}" get pvc "${PVC_NAME}" -o json|jq -r ".spec.volumeName")

  VOLUME_ATTACHMENT_LIST=$(kubectl get volumeattachments -n "${PVC_NAMESPACE}" -o=json|jq -c ".items[]|\
{name: .metadata.name, pvClaimName:.spec.source| select( has (\"persistentVolumeName\")).persistentVolumeName}")

  VOLUME_ATTACHMENT_NAME=""
  for VOLUME_ATTACHMENT in ${VOLUME_ATTACHMENT_LIST};
  do
    PV_CLAIM_NAME=$(echo "${VOLUME_ATTACHMENT}"|jq -r ".pvClaimName")
    if [[ "${PV_NAME}" = "${PV_CLAIM_NAME}" ]]; then
      VOLUME_ATTACHMENT_NAME=$(echo "${VOLUME_ATTACHMENT}"|jq -r ".name")
      break;
    fi
  done

  BACKUP_PATH=""
  validate_pv_backup_available "${PV_NAME}"
  local is_backup_available=${IS_BACKUP_AVAILABLE:- 0}
  unset IS_BACKUP_AVAILABLE

  if [ "${is_backup_available}" -eq 1 ]; then
    echo "Backup is available for PVC ${PVC_NAME}"

    get_backup_list_by_pvname "${PV_NAME}"
    BACKUP_PATH=$(echo "${PV_BACKUP_PATH}"| jq -r ".url")
    unset PV_BACKUP_PATH
  fi

  get_pvc_resources_resp="{\"pvc_name\": \"${PVC_NAME}\", \"pv_name\": \"${PV_NAME}\", \"volumeattachment_name\": \"${VOLUME_ATTACHMENT_NAME}\", \"backup_path\": \"${BACKUP_PATH}\"}"
  echo "${get_pvc_resources_resp}"
}

function scale_ownerreferences() {
  local ownerReferences=$1
  local namespace=$2
  local replicas=$3

  # no operation required
  if [[ -z "${ownerReferences}" || "${ownerReferences}" == "null" ]]; then
    return
  fi

  ownerReferences=$(echo "${ownerReferences}"| jq -c ".[]")
  for ownerReference in ${ownerReferences};
  do
    echo "Owner: ${ownerReference}"
    local resourceKind
    local resourceName
    resourceKind=$(echo "${ownerReference}"| jq -r ".kind")
    resourceName=$(echo "${ownerReference}"| jq -r ".name")

    if kubectl -n "${namespace}" get "${resourceKind}" "${resourceName}" >/dev/null 2>&1; then
      # scale replicas
      kubectl  -n "${namespace}" patch "${resourceKind}" "${resourceName}" --type json -p "[{\"op\": \"replace\", \"path\": \"/spec/members\", \"value\": ${replicas} }]"
    fi
  done
}

function scale_down_statefulset() {
  local statefulset_name=$1
  local namespace=$2
  local ownerReferences=$3

  echo "Start Scale Down statefulset ${statefulset_name} under namespace ${namespace}..."

  # validate and scale down ownerreference
  scale_ownerreferences "${ownerReferences}" "${namespace}" 0

  local try=0
  local maxtry=30
  success=0
  while (( try != maxtry )) ; do
    result=$(kubectl scale statefulset "${statefulset_name}" --replicas=0 -n "${namespace}") || true
    echo "${result}"
    scaledown=$(kubectl get statefulset "${statefulset_name}" -n "${namespace}"|grep 0/0) || true
    if { [ -n "${scaledown}" ] && [ "${scaledown}" != " " ]; }; then
      echo "Statefulset scaled down successfully."
      success=1
      break
    else
      try=$((try+1))
      echo "waiting for the statefulset ${statefulset_name} to scale down...${try}/${maxtry}";sleep 30
    fi
  done

  if [ ${success} -eq 0 ]; then
    echo "Statefulset ${statefulset_name} scaled down failed"
  fi
}

function scale_up_statefulset() {
  local statefulset_name=$1
  local namespace=$2
  local replica=$3
  local ownerReferences=$4

  # Scale up statefulsets using PVCs
  echo "Start Scale Up statefulset ${statefulset_name}..."

  # validate and scale up ownerreference
  scale_ownerreferences "${ownerReferences}" "${namespace}" "${replica}"

  echo "Waiting to scale up statefulset..."

  local try=1
  local maxtry=15
  local success=0
  while (( try != maxtry )) ; do

    kubectl scale statefulset "${statefulset_name}" --replicas="${replica}" -n "${namespace}"
    kubectl get statefulset "${statefulset_name}" -n "${namespace}"

    scaleup=$(kubectl get statefulset "${statefulset_name}" -n "${namespace}"|grep "${replica}"/"${replica}") || true
    if ! { [ -n "${scaleup}" ] && [ "${scaleup}" != " " ]; }; then
      try=$((try+1))
      echo "waiting for the statefulset ${statefulset_name} to scale up...${try}/${maxtry}"; sleep 30
    else
      echo "Statefulset scaled up successfully."
      success=1
      break
    fi
  done

  if [ ${success} -eq 0 ]; then
    echo "Statefulset scaled up failed ${statefulset_name}."
  fi
}

function restore_pvc_attached_to_statefulsets() {
  local namespace
  namespace="${1}"
  local statefulset_list

  # list of all statefulset using PVC
  statefulset_list=$(kubectl get statefulset -n "${namespace}" -o=json | jq -r ".items[] | select(.spec.volumeClaimTemplates).metadata.name")

  for statefulset_name in ${statefulset_list};
  do
    local replica
    local ownerReferences
    local pvc_restore_failed
    local try=0
    local maxtry=5
    local status="notready"
    pvc_restore_failed=""
    restore_pvc_status=""

    # check if statefulset is ready
    while [[ "${status}" == "notready" ]] && (( try < maxtry )); do
      echo "fetch statefulset ${statefulset_name} metadata...  ${try}/${maxtry}"
      try=$(( try + 1 ))
      replica=$(kubectl get statefulset "${statefulset_name}" -n "${namespace}" -o=json | jq -c ".spec.replicas")

      if [[ "${replica}" != 0 ]]; then
        status="ready"
      else
        echo "statefulset ${statefulset_name} replica is not ready. Wait and retry"; sleep 30
      fi
    done

    if [[ "${status}" != "ready" ]]; then
      echo "Failed to restore pvc for Statefulset ${statefulset_name} in namespace ${namespace}. Please retrigger volume restore step."
    fi

    # Fetch ownerReferences and claim name
    ownerReferences=$(kubectl get statefulset "${statefulset_name}" -n "${namespace}" -o=json | jq -c ".metadata.ownerReferences")
    claimTemplatesName=$(kubectl get statefulset "${statefulset_name}" -n "${namespace}" -o=json | jq -c ".spec | select( has (\"volumeClaimTemplates\") ).volumeClaimTemplates[].metadata.name " | xargs)
    
    echo "Scaling down Statefulset ${statefulset_name} with ${replica} under namespace ${namespace}"
    scale_down_statefulset "${statefulset_name}" "${namespace}" "${ownerReferences}"
    for name in ${claimTemplatesName}; do
      local pvc_prefix
      pvc_prefix="${name}-${statefulset_name}"

      for((i=0;i<"${replica}";i++)); do
        local pvc_name
        pvc_name="${pvc_prefix}-${i}"

        pvc_exist=$(kubectl -n "${namespace}" get pvc "${pvc_name}") || true
        if [[ -z "${pvc_exist}" || "${pvc_exist}" == " " ]]; then
          echo "PVC not available for the statefulset ${statefulset_name}, skipping restore."
          continue;
        fi

        local pvc_storageclass
        pvc_storageclass=$(kubectl -n "${namespace}" get pvc "${pvc_name}" -o json| jq -r ".spec.storageClassName")
        if [[ ! ( "${pvc_storageclass}" == "${STORAGE_CLASS}" || "${pvc_storageclass}" == "${STORAGE_CLASS_SINGLE_REPLICA}" ) ]]; then
          echo "backup not available for pvc ${pvc_name}, storageclass: ${pvc_storageclass} "
          continue;
        fi

        # get pv, volumeattachments for pvc
        get_pvc_resources_resp=""
        get_pvc_resources "${pvc_name}" "${namespace}"

        local pv_name
        local volumeattachment_name
        local backup_path
        pv_name=$(echo "${get_pvc_resources_resp}"| jq -r ".pv_name")
        volumeattachment_name=$(echo "${get_pvc_resources_resp}"| jq -r ".volumeattachment_name")
        backup_path=$(echo "${get_pvc_resources_resp}"| jq -r ".backup_path")
        if [[ -z "${backup_path}" || "${backup_path}" == " " || "${backup_path}" == "null" ]]; then
          pvc_restore_failed="error"
          FAILED_PVC_LIST="${FAILED_PVC_LIST},${pv_name}"
          continue;
        fi

        restore_pvc_status="succeed"
        restore_pvc "${pvc_name}" "${namespace}" "${pv_name}" "${volumeattachment_name}" "${backup_path}"
        if [[ -n "${restore_pvc_status}" && "${restore_pvc_status}" != "succeed" ]]; then
          pvc_restore_failed="error"
          FAILED_PVC_LIST="${FAILED_PVC_LIST},${pv_name}"
          continue;
        fi
      done
    done

    sleep 10
    scale_up_statefulset "${statefulset_name}" "${namespace}" "${replica}" "${ownerReferences}"
    sleep 5

    if [[ -n "${pvc_restore_failed}" && "${pvc_restore_failed}" == "error" ]]; then
        echo "Failed to restore pvc for Statefulset ${statefulset_name} in namespace ${namespace}."
    fi    
  done
}

LONGHORN_URL=$(kubectl -n longhorn-system get svc longhorn-backend -o jsonpath="{.spec.clusterIP}"):9500

restore_pvc_attached_to_statefulsets "mongodb"

 

Unhealthy cluster after the automatic upgrade from 2021.10


During the automatic upgrade from Automation Suite 2021.10, the CNI provider is migrated from Canal to Cilium. This operation requires all the nodes to be restarted. In rare situations, one or more nodes may not restart successfully, causing the pods running on those nodes to remain in an unhealthy state.

Recovery steps

  1. Identify the failed restart.
    During the Ansible execution, you may see output similar to the following snippet:
TASK [Reboot the servers] ***************************************************************************************************************************

fatal: [10.0.1.6]: FAILED! =>

  msg: 'Failed to connect to the host via ssh: ssh: connect to host 10.0.1.6 port 22: Connection timed out'

Alternatively, explore the logs on the Ansible host machine, located at /var/tmp/uipathctl_<version>/_install-uipath.log. If you find any failed restarts, take steps 2 to 4 on all the nodes.

  2. Confirm that each node requires a restart.
    Connect to each node and run the following commands:
ssh <username>@<ip-address>
iptables-save 2>/dev/null | grep -i cali -c

If the result is non-zero, a restart is required.

  3. Restart the node:
sudo reboot
  4. Wait for the node to respond (you should be able to access it via SSH), then repeat steps 2 to 4 on each of the other nodes.

 

First installation fails during Longhorn setup


In rare situations, if the first attempt to install Longhorn fails, subsequent retries may throw a Helm-specific error: Error: UPGRADE FAILED: longhorn has no deployed releases

Recovery steps

Delete the Longhorn Helm release and retry the installation, by running the following command:

/opt/UiPathAutomationSuite/<version>/bin/helm uninstall longhorn --namespace longhorn-system

 

Automation Suite not working after OS upgrade


Description

After an OS upgrade, the Ceph OSD pods may sometimes get stuck in CrashLoopBackOff state, which makes Automation Suite inaccessible.

Solution

  1. Check the status of the pods by running the following command:
kubectl -n rook-ceph get pods
  2. If any pods in the previous output are in CrashLoopBackOff, recover them by running the following commands:
OSD_PODS=$(kubectl -n rook-ceph get deployment -l app=rook-ceph-rgw --no-headers | awk '{print $1}')
kubectl -n rook-ceph rollout restart deploy $OSD_PODS
  3. Wait for about 5 minutes for the pods to get back in Running state, then check their status by running the following command:
kubectl -n rook-ceph get pods

 

Missing self-heal-operator and sf-k8-utils repo

Description

This issue causes workloads to go into ImagePullBackOff or ErrImagePull state with the following error:

Failed to pull image "sf-k8-utils-rhel:<tag>": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/sf-k8-utils-rhel:<tag>": failed to resolve reference "docker.io/library/sf-k8-utils-rhel:<tag>": failed to do request: Head "https://localhost:30071/v2/library/sf-k8-utils-rhel/manifests/<tag>?ns=docker.io": dial tcp [::1]:30071: connect: connection refused
OR
Failed to pull image "self-heal-operator:<tag>": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/self-heal-operator:<tag>": failed to resolve reference "docker.io/library/self-heal-operator:<tag>": failed to do request: Head "https://localhost:30071/v2/library/self-heal-operator/manifests/<tag>?ns=docker.io": dial tcp [::1]:30071: connect: connection refused

解决方案

To fix this issue, run the following script on all the nodes in the cluster, one by one.

fix_image_project_id.sh:
#!/bin/bash

export KUBECONFIG=${KUBECONFIG:-/etc/rancher/rke2/rke2.yaml}
export PATH=$PATH:/var/lib/rancher/rke2/bin:${SCRIPT_DIR}/Fabric_Installer/bin:/usr/local/bin

function get_docker_registry_url() {
  local rancher_registries_file="/etc/rancher/rke2/registries.yaml"
  config=$(cat < ${rancher_registries_file} | grep -A1 "configs:"|tail -n1| awk '{print $0}'|tr -d ' '|tr -d '"')
  url="${config::-1}"
  echo "${url}"
}

function get_docker_registry_credentials() {
  local key="$1"
  local rancher_registries_file="/etc/rancher/rke2/registries.yaml"
  value=$(cat < ${rancher_registries_file} | grep "$key:" | cut -d: -f2 | xargs)
  echo "${value}"
}

function get_cluster_config() {
  local key=$1
  # the go template if prevents it from printing <no-value> instead of empty strings
  value=$(kubectl get secret service-cluster-configurations \
    -o "go-template={{if index .data \"${key^^}\"}}{{index .data \"${key^^}\"}}{{end}}" \
    -n uipath-infra --ignore-not-found) || true

  echo -n "$(base64 -d <<<"$value")"
}

function update_image_tag() {
    username=$(get_docker_registry_credentials username)
    password=$(get_docker_registry_credentials password)
    url=$(get_docker_registry_url)

    images=(self-heal-operator sf-k8-utils-rhel)
    for image in ${images[@]}; do
        echo "Start checking available $image tag"
        tag=$(curl -u $username:$password -X GET https://${url}/v2/$image/tags/list -k -q -s | jq -rc .tags[0] )
        if [[ "${tag}" != "null" ]]; then
            echo "$image with tag ${tag} found..."
            podman login ${url} --username $username --password $password --tls-verify=false
            podman pull ${url}/${image}:${tag} --tls-verify=false
            podman tag ${url}/${image}:${tag} ${url}/uipath/${image}:${tag}
            podman tag ${url}/${image}:${tag} ${url}/library/${image}:${tag}
            podman push ${url}/uipath/${image}:${tag} --tls-verify=false
            podman push ${url}/library/${image}:${tag} --tls-verify=false
            echo "$image is retag and push to docker registry"
        else
            echo "no tag available for $image"
        fi
    done
}

function validate_rke2_registry_config() {
  local rancher_registries_file="/etc/rancher/rke2/registries.yaml"
  local endpoint_present="false"
  endpoint=$(cat < ${rancher_registries_file} | grep -A2 "docker.io:" | grep -A1 "endpoint:"|tail -n1|xargs)

  if [[ -n "${endpoint}" ]]; then
    endpoint_present="true"
  fi

  echo "${endpoint_present}"
}

function update_rke2_registry_config() {
   local DOCKER_REGISTRY_URL=$(get_docker_registry_url)
   local DOCKER_REGISTRY_LOCAL_USERNAME=$(get_docker_registry_credentials username)
   local DOCKER_REGISTRY_LOCAL_PASSWORD=$(get_docker_registry_credentials password)
   local registriesPath="/etc/rancher/rke2/registries.yaml"
   local DOCKER_REGISTRY_NODEPORT=30071

   echo "Create temp file with name ${registriesPath}_tmp"
   cp -r ${registriesPath} ${registriesPath}_tmp

   echo "Start updating ${registriesPath}"

   cat > "${registriesPath}" <<EOF
mirrors:
  docker-registry.docker-registry.svc.cluster.local:5000:
    endpoint:
      - "https://${DOCKER_REGISTRY_URL}"
  docker.io:
    endpoint:
      - "https://${DOCKER_REGISTRY_URL}"
  ${DOCKER_REGISTRY_URL}:
    endpoint:
      - "https://${DOCKER_REGISTRY_URL}"
configs:
  "localhost:${DOCKER_REGISTRY_NODEPORT}":
    tls:
      insecure_skip_verify: true
    auth:
      username: ${DOCKER_REGISTRY_LOCAL_USERNAME}
      password: ${DOCKER_REGISTRY_LOCAL_PASSWORD}
EOF
}

function is_server_node() {
    [[ "$(systemctl is-enabled rke2-server 2>>/dev/null)" == "enabled" ]] && echo -n "true" && return
    echo "false"
}

function main() {
    local is_server_node=$(is_server_node)
    local install_type=$(get_cluster_config "INSTALL_TYPE")
    if [[ "${install_type}" != "offline" ]]; then
        echo "This script is compatible with only offline cluster. Current cluster install_type=${install_type}"
        exit 0
    fi

    if [[ "${is_server_node}" == "true" ]]; then
        echo "current node is identified as server node. Updating image tag"
        update_image_tag
    else
        echo "current node is identified as agent node."
    fi
    
    rke2_registry_config_is_valid=$(validate_rke2_registry_config)
    if [[ "${rke2_registry_config_is_valid}" == "false" ]]; then
      echo "start updating rke2 config"
      update_rke2_registry_config
      if [[ "${is_server_node}" == "true" ]]; then
        echo "Registry configuration is updated. Restarting service using command: systemctl restart rke2-server"
        systemctl restart rke2-server.service
      else
        echo "Registry configuration is updated. Restarting service using command: systemctl restart rke2-agent"
        systemctl restart rke2-agent.service
      fi
    else
      echo "rke2 config update is not required"
    fi
    
}

main

📘

The fix_image_project_id.sh script restarts the Kubernetes server (rke2 service) and all the workloads running on the nodes.

Running the fix_image_project_id.sh script is required only if you use Automation Suite 2022.4.0 or 2022.4.1.

 

Unhealthy services after cluster restore or rollback

Description

After a cluster restore or rollback, AI Center, Orchestrator, Platform, Document Understanding, or Task Mining may be unhealthy, and the RabbitMQ pod logs show the following error:

[root@server0 UiPathAutomationSuite]# k -n rabbitmq logs rabbitmq-server-0
2022-10-29 07:38:49.146614+00:00 [info] <0.9223.362> accepting AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672)
2022-10-29 07:38:49.147411+00:00 [info] <0.9223.362> Connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#77049094:2100
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> Error on AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:49.147922+00:00 [info] <0.9223.362> closing AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672 - rabbitConnectionFactory#77049094:2100)
2022-10-29 07:38:55.818447+00:00 [info] <0.9533.362> accepting AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672)
2022-10-29 07:38:55.821662+00:00 [info] <0.9533.362> Connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#2100d047:4057
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> Error on AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:55.822447+00:00 [info] <0.9533.362> closing AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672 - rabbitConnectionFactory#2100d047:4057)

Solution

To fix the issue, take the following steps:

  1. Check whether some or all of the RabbitMQ pods are in CrashLoopBackOff state due to the Mnesia table data write issue. If all the pods are running, skip to step 2.
If some pods are stuck in CrashLoopBackOff, take the following sub-steps.
    a. Identify the RabbitMQ pods stuck in CrashLoopBackOff state, and check the RabbitMQ CrashLoopBackOff pod logs:
kubectl -n rabbitmq get pods
kubectl -n rabbitmq logs <CrashLoopBackOff-Pod-Name>
    b. Check the output of the previous commands. If the issue is related to the Mnesia table data write, you should see an error message similar to the following:
Mnesia('rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq'): ** ERROR ** (could not write core file: eacces)
 ** FATAL ** Failed to merge schema: Bad cookie in table definition rabbit_user_permission: 'rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667351034020261908,-576460752303416575,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{4,0},{'rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq',{1667,351040,418694}}}}, 'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667372429216834387,-576460752303417087,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{2,0},[]}}
  1. 要解决此问题,请执行以下步骤:
    1. 查找 RabbitMQ 副本的数量:
      rabbitmqReplicas=$(kubectl -n rabbitmq get rabbitmqcluster rabbitmq -o json | jq -r '.spec.replicas')
    2. 缩减 RabbitMQ 副本:
      kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": 0}}" --type=merge
      kubectl -n rabbitmq scale sts rabbitmq-server --replicas=0
    3. 等待所有 RabbitMQ Pod 终止:
      kubectl -n rabbitmq get pod
    4. 查找并删除卡在 CrashLoopBackOff 状态的 RabbitMQ Pod 的 PVC:
      kubectl -n rabbitmq get pvc
      kubectl -n rabbitmq delete pvc <crashloopbackupoff_pod_pvc_name>
    5. 扩展 RabbitMQ 副本:
      kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": $rabbitmqReplicas}}" --type=merge
    6. 检查所有 RabbitMQ Pod 是否运行状况良好:
      kubectl -n rabbitmq get pod
  1. 删除 RabbitMQ 中的用户:
kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl  list_users -s --formatter json | jq '.[]|.user' | grep -v default_user | xargs -I{} kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl delete_user {}
  1. 删除 UiPath 命名空间中的 RabbitMQ 应用程序密码:
kubectl -n uipath get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n uipath delete secret {}
  1. 删除 RabbitMQ 命名空间中的 RabbitMQ 应用程序密码:
kubectl -n rabbitmq get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n rabbitmq delete secret {}
  1. 通过 ArgoCD 同步 sfcore 应用程序并等待同步完成:
  6. Perform a rollout restart of all the applications in the uipath namespace:
kubectl -n uipath rollout restart deploy
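
If you prefer the command line to the ArgoCD web UI for step 5, a sync along the following lines should work. This is a minimal sketch that assumes the argocd CLI is installed and already logged in to the cluster's ArgoCD server:

# Sync the sfcore application, then block until it reports healthy.
argocd app sync sfcore
argocd app wait sfcore --health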

 

RabbitMQ pod stuck in CrashLoopBackOff

Description

This issue causes the RabbitMQ pod to be stuck in CrashLoopBackOff, with the log of the failing pod showing wal_checksum_validation_failure as a reason.

To get a list of all pods, run the following command:

kubectl -n rabbitmq get pods

To get the logs of a pod, run the following command:

kubectl -n rabbitmq logs <CrashLoopBackOff-Pod-Name>

Solution

To fix this issue, take the following steps:

  1. Find the number of RabbitMQ replicas:
rabbitmqReplicas=$(kubectl -n rabbitmq get rabbitmqcluster rabbitmq -o json | jq -r '.spec.replicas')
  2. Scale down the RabbitMQ replicas:
kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": 0}}" --type=merge
kubectl -n rabbitmq scale sts rabbitmq-server --replicas=0
  3. Wait until all the RabbitMQ pods are terminated:
kubectl -n rabbitmq get pod
  4. Find and delete the PVC of the RabbitMQ pod stuck in the CrashLoopBackOff state:
kubectl -n rabbitmq get pvc
kubectl -n rabbitmq delete pvc <crashloopbackupoff_pod_pvc_name>
  5. Scale the RabbitMQ replicas back up:
kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": $rabbitmqReplicas}}" --type=merge
  6. Check that all the RabbitMQ pods are healthy:
kubectl -n rabbitmq get pod

 

Cleaning up old logs stored in the sf-logs bundle

A bug might cause log accumulation in the sf-logs object store bucket. To clean up old logs in the sf-logs bucket, follow the instructions on running the dedicated script. Make sure to follow the steps relevant to your environment type.

To clean up old logs stored in the sf-logs bundle, take the following steps:

  1. Get the version of the sf-k8-utils-rhel image available in your environment:

    • in an offline environment, run the following command: podman search localhost:30071/uipath/sf-k8-utils-rhel --tls-verify=false --list-tags
    • in an online environment, run the following command: podman search registry.uipath.com/uipath/sf-k8-utils-rhel --list-tags
  2. Update line 121 (the image field of the cleanup-old-logs pod) in the following yaml definition to use the proper image tag:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cleanup-script
  namespace: uipath-infra
data:
  cleanup_old_logs.sh:  |
    #!/bin/bash

    function parse_args() {
      CUTOFFDAY=7
      SKIPDRYRUN=0
      while getopts 'c:sh' flag "$@"; do
        case "${flag}" in
        c)
          CUTOFFDAY=${OPTARG}
          ;;
        s)
          SKIPDRYRUN=1
          ;;
        h)
          display_usage
          exit 0
          ;;
        *)
          echo "Unexpected option ${flag}"
          display_usage
          exit 1
          ;;
        esac
      done

      shift $((OPTIND - 1))
    }

    function display_usage() {
      echo "usage: $(basename "$0") -c <number> [-s]"
      echo "  -s                                                         skip dry run, Really deletes the log dirs"
      echo "  -c                                                         logs older than how many days to be deleted. Default is 7 days"
      echo "  -h                                                         help"
      echo "NOTE: Default is dry run, to really delete logs set -s"
    }

    function setS3CMDContext() {
        OBJECT_GATEWAY_INTERNAL_HOST=$(kubectl -n rook-ceph get services/rook-ceph-rgw-rook-ceph -o jsonpath="{.spec.clusterIP}")
        OBJECT_GATEWAY_INTERNAL_PORT=$(kubectl -n rook-ceph get services/rook-ceph-rgw-rook-ceph -o jsonpath="{.spec.ports[0].port}")
        AWS_ACCESS_KEY=$1
        AWS_SECRET_KEY=$2

        # Reference https://rook.io/docs/rook/v1.5/ceph-object.html#consume-the-object-storage
        export AWS_HOST=$OBJECT_GATEWAY_INTERNAL_HOST
        export AWS_ENDPOINT=$OBJECT_GATEWAY_INTERNAL_HOST:$OBJECT_GATEWAY_INTERNAL_PORT
        export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY
        export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_KEY
    }

    # Set s3cmd context by passing correct AccessKey and SecretKey
    function setS3CMDContextForLogs() {
        BUCKET_NAME='sf-logs'
        AWS_ACCESS_KEY=$(kubectl -n cattle-logging-system get secret s3-store-secret -o json | jq '.data.OBJECT_STORAGE_ACCESSKEY' | sed -e 's/^"//' -e 's/"$//' | base64 -d)
        AWS_SECRET_KEY=$(kubectl -n cattle-logging-system get secret s3-store-secret -o json | jq '.data.OBJECT_STORAGE_SECRETKEY' | sed -e 's/^"//' -e 's/"$//' | base64 -d)

        setS3CMDContext "$AWS_ACCESS_KEY" "$AWS_SECRET_KEY"
    }

    function delete_old_logs() {
        local cutoffdate=$1
        days=$(s3cmd ls  s3://sf-logs/ --host="${AWS_HOST}" --host-bucket= s3://sf-logs --no-check-certificate --no-ssl)
        days=${days//DIR}
        if [[ $SKIPDRYRUN -eq 0 ]];
        then
            echo "DRY RUN. Following log dirs are selected for deletion"
        fi
        for day in $days
        do
            day=${day#*sf-logs/}
            day=${day::-1}
            if [[ ${day} < ${cutoffdate} ]];
            then
                if [[ $SKIPDRYRUN -eq 0 ]];
                then
                    echo "s3://$BUCKET_NAME/$day"
                else
                    echo "###############################################################"
                    echo "Deleting Logs for day: {$day}"
                    echo "###############################################################"
                    s3cmd del "s3://$BUCKET_NAME/$day/" --host="${AWS_HOST}" --host-bucket= --no-ssl --recursive || true
                fi
            fi
        done
    }

    function main() {
        # Set S3 context by setting correct env variables
        setS3CMDContextForLogs
        echo "Bucket name is $BUCKET_NAME"

        CUTOFFDATE=$(date --date="${CUTOFFDAY} day ago" +%Y_%m_%d)
        echo "logs older than ${CUTOFFDATE} will be deleted"

        delete_old_logs "${CUTOFFDATE}"
        if [[ $SKIPDRYRUN -eq 0 ]];
        then
            echo "NOTE: For really deleting the old log directories run with -s option"
        fi
    }

    parse_args "$@"
    main
    exit 0
---
apiVersion: v1
kind: Pod
metadata:
  name: cleanup-old-logs
  namespace: uipath-infra
spec:
   serviceAccountName: fluentd-logs-cleanup-sa
   containers:
   - name: cleanup
     image: localhost:30071/uipath/sf-k8-utils-rhel:0.8
     command: ["/bin/bash"]
     args: ["/scripts-dir/cleanup_old_logs.sh", "-s"]
     volumeMounts:
     - name: scripts-vol
       mountPath: /scripts-dir
     securityContext:
       privileged: false
       allowPrivilegeEscalation: false
       readOnlyRootFilesystem: true
       runAsUser: 9999
       runAsGroup: 9999
       runAsNonRoot: true
       capabilities:
         drop: ["NET_RAW"]
   volumes:
   - name: scripts-vol
     configMap:
       name: cleanup-script
  3. Copy the content of the aforementioned yaml definition to a file called cleanup.yaml, and trigger a pod to clean up the old logs:
kubectl apply -f cleanup.yaml
  4. Get details on the progress:
kubectl -n uipath-infra logs cleanup-old-logs -f
  5. Delete the job:
kubectl delete -f cleanup.yaml

 

Identity Server issues

Setting a timeout interval for the Management portals

Before installation, you cannot update the expiration time of the token used for authenticating to the host-level and organization-level Management portals. As a result, user sessions do not time out.

To set a timeout interval for these portals, update the accessTokenLifetime property.
The example below sets the timeout interval to 86400 seconds (24 hours):

UPDATE [identity].[Clients] SET AccessTokenLifetime = 86400 WHERE ClientName = 'Portal.OpenId'

Update the underlying directory connections


Disabling or changing the AD integration settings does not update the underlying directory connections properly.

Run the following command to update the SQL DirectoryConnections table entries, then restart the Identity pods:

UPDATE [identity].[DirectoryConnections] SET IsDeleted = 1, DeletionTime=GETUTCDATE() WHERE (Type='ad' OR Type='ldapad') AND IsDeleted=0
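
The restart command itself is not spelled out here; as a minimal sketch, assuming the Identity deployment in the uipath namespace has identity in its name, it could look like this:

# Find the Identity deployment(s), then restart them (the name is a placeholder).
kubectl -n uipath get deploy | grep -i identity
kubectl -n uipath rollout restart deploy <identity-deployment-name>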

 

Kerberos issues


kinit: Cannot find KDC for realm while getting initial credentials

Description

This error can occur during installation (if Kerberos authentication is enabled) or during execution of the kerberos-tgt-update cron job, when the UiPath cluster cannot connect to the AD server to obtain the Kerberos ticket used for authentication.

Solution

Check the AD domain and make sure it is configured correctly and routable, as follows:

getent ahosts <AD domain> | awk '{print $1}' | sort | uniq

If this command does not return a routable IP address, the AD domain required for Kerberos authentication is not configured correctly.

You need to work with your IT administrator to add the AD domain to the DNS server and make sure this command returns a routable IP address.

 

kinit: Keytab contains no suitable keys for *** while getting initial credentials

Description

This error can be found in the log of a failed job whose name is one of the following: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update.

Solution

Make sure the AD user still exists, is active, and its password has not changed and has not expired. Reset the user's password and regenerate the keytab if needed.
Also make sure that the default Kerberos AD user parameter <KERB_DEFAULT_USERNAME> is provided in the following format: HTTP/<Service Fabric FQDN>.
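
The exact keytab regeneration command depends on your AD setup; as a minimal sketch, regenerating a keytab with the ktpass tool on a Windows domain controller (realm, domain, and account names are placeholders) could look like this:

REM Map the HTTP service principal to the AD user and write a new keytab file.
ktpass /princ HTTP/<Service Fabric FQDN>@<REALM> /mapuser <ADDOMAIN>\<aduser> /pass * /ptype KRB5_NT_PRINCIPAL /crypto AES256-SHA1 /out krb5.keytab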

 

GSSAPI operation failed with error: An invalid status code was supplied (Client's credentials have been revoked).

Description

This log can be found when Kerberos is used for SQL access and the SQL connection fails inside a service. Similarly, you might see kinit: Client's credentials have been revoked while getting initial credentials in the log of a failed job whose name is one of the following: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update.

Solution

This could be caused by the AD user account used to generate the keytab being disabled. Re-enabling the AD user account should fix the issue.
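
As a sketch, on a machine with the ActiveDirectory PowerShell module installed, re-enabling the account (the identity is a placeholder) could look like this:

# Re-enable the disabled AD account that the keytab was generated for.
Enable-ADAccount -Identity <aduser>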

Alert received about a failing kerberos-tgt-update job

Description

This happens if the uipath cluster fails to retrieve the latest Kerberos ticket.

Solution

To find the issue, check the log for a failed job whose name starts with kerberos-tgt-update. After you've identified the problem in the log, check the related troubleshooting information in this section and in the Troubleshooting section for configuring Active Directory.
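
A minimal sketch of locating that job and reading its log, assuming kubectl access to the cluster (the namespace and job name are placeholders):

# Find the failed kerberos-tgt-update job, then read its logs.
kubectl get jobs -A | grep kerberos-tgt-update
kubectl -n <namespace> logs job/<kerberos-tgt-update-job-name>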

 

SSPI Provider: Server not found in Kerberos database

Solution

Make sure that the correct SPN records are set up in the AD domain controller for the SQL server. For instructions, see SPN formats in the Microsoft SQL Server documentation.
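
For reference, a minimal sketch of registering such an SPN with the setspn tool on the domain controller (host, port, and account are placeholders):

REM Register the SQL Server SPN for the SQL service account; -S checks for duplicates first.
setspn -S MSSQLSvc/<sql-server-fqdn>:1433 <ADDOMAIN>\<sql-service-account>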

 

 

Login failed for user <ADDOMAIN>\<aduser>. Reason: The account is disabled.

Description

This log can be found when Kerberos is used for SQL access and the SQL connection fails inside a service.

Solution

This issue could be caused by the AD user losing access to the SQL server. See instructions on how to reconfigure the AD user.

 

Orchestrator-related issues


Orchestrator pod in CrashLoopBackOff or 1/2 running with multiple restarts


Description

If the Orchestrator pod is in a CrashLoopBackOff state, or is 1/2 running after multiple restarts, the failure may be related to the authentication keys of the object storage provider, Ceph.

To check whether the failure is related to Ceph, run the following command:

kubectl -n uipath get pod -l app.kubernetes.io/component=orchestrator

If the output of this command is similar to one of the following options, you need to run an additional command.

Option 1:
NAME                            READY   STATUS    RESTARTS   AGE
orchestrator-6dc848b7d5-q5c2q   1/2     Running   2          6m1s

OR

Option 2:
NAME                            READY   STATUS             RESTARTS   AGE
orchestrator-6dc848b7d5-q5c2q   1/2     CrashLoopBackOff   6          16m

Run the following command to verify whether the failure is related to the Ceph authentication keys:

kubectl -n uipath logs -l app.kubernetes.io/component=orchestrator | grep 'Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden' -o

If the output of the preceding command contains the string Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden, the failure is caused by the Ceph authentication keys.

Solution

Re-run the rook-ceph-configure-script-job and credential-manager-job jobs using the following commands:

kubectl -n uipath-infra get job "rook-ceph-configure-script-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath-infra get job "credential-manager-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath delete pod -l app.kubernetes.io/component=orchestrator

 

Test Manager-related issues


Test Manager license issues


If your license was allocated while you were logged in, your license allocation may not be detected when you open Test Manager.

If this happens, take the following steps:

  1. Navigate to Test Manager.
  2. Log out from the portal.
  3. Log back in.

 

AI Center-related issues


AI Center skill deployment issues

Sometimes, when a model is deployed for the first time, DU model skill deployments can fail intermittently with Failed to list deployments or Unknown Error. The workaround is to deploy the model again. The second deployment is faster because most of the image-building work was already done during the first attempt. The first deployment of a DU model takes about 1 to 1.5 hours; redeploying it is faster.

In rare cases, due to the cluster state, asynchronous operations such as skill deployment or package upload can be stuck for a long time. If a DU skill deployment takes more than 2 to 3 hours, try deploying a simpler model (for example, the template model). If that deployment also takes more than an hour, the mitigation is to restart the AI Center services using the following commands:

kubectl -n uipath rollout restart deployment ai-deployer-deployment
kubectl -n uipath rollout restart deployment ai-trainer-deployment
kubectl -n uipath rollout restart deployment ai-pkgmanager-deployment
kubectl -n uipath rollout restart deployment ai-helper-deployment
kubectl -n uipath rollout restart deployment ai-appmanager-deployment

Wait for the AI Center pods to restart, and verify with the following command (the pattern is quoted so the shell does not glob-expand it):

kubectl -n uipath get pods | grep 'ai-'

All of the above pods should be in the Running state, with the container status shown as 2/2.

Disabling streaming logs

To disable log streaming for an existing skill, edit the skill deployment and change the LOGS_STREAMING_ENABLED environment variable to false.
You can also use ArgoCD to add the logsStreamingEnabled global variable under the aicenter application details and set its value to false. Make sure to sync ArgoCD after making the change.
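
As a sketch of the first option, assuming the skill runs as a deployment in the uipath namespace whose name you look up first (the name is a placeholder):

# Find the skill deployment, then turn off log streaming on it.
kubectl -n uipath get deploy
kubectl -n uipath set env deployment/<skill-deployment-name> LOGS_STREAMING_ENABLED=false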


 

Document Understanding-related issues


Document Understanding is missing from the left rail of Automation Suite


Description

If you cannot find Document Understanding in the left rail of Automation Suite, be aware that Document Understanding is currently not a separate application in Automation Suite, so it is not displayed in the left rail.

Solution

The Data Manager component is part of AI Center, so make sure AI Center is enabled.

In addition, use the following public URLs to access Form Extractor, Intelligent Form Extractor (including handwriting recognition), and Intelligent Keyword Classifier:

<FQDN>/du_/svc/formextractor
<FQDN>/du_/svc/intelligentforms
<FQDN>/du_/svc/intelligentkeywords

If you get a Your license can not be validated error message when trying to use Intelligent Keyword Classifier, Form Extractor, or Intelligent Form Extractor in Studio, make sure you entered the correct endpoint, and use the API key generated for Document Understanding under the license in your Automation Suite installation, not one from cloud.uipath.com.

 

Failed status when creating a data labeling session


Description

If you cannot create a data labeling session in Data Manager in AI Center, take the following steps.

Solution 1

Double-check that Document Understanding is properly enabled. You should update the configuration file before installation and set documentunderstanding.enabled to True, or you can update it in ArgoCD after installation, as shown below.

Once that is done, you need to disable and then re-enable AI Center on the tenant where you want to use the data labeling feature, or create a new tenant.
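
For reference, a minimal sketch of the relevant configuration file fragment, assuming the standard cluster_config.json layout:

{
  "documentunderstanding": {
    "enabled": true
  }
}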


Solution 2

If Document Understanding is properly enabled in the configuration file or in ArgoCD, it is sometimes not enabled for the DefaultTenant. This manifests as not being able to create data labeling sessions.

To fix this, disable AI Center on that tenant and then re-enable it. Note that you may need to wait a few minutes before you can re-enable AI Center.

 

Failed status when trying to deploy an ML skill


Description

If you cannot successfully deploy a Document Understanding ML skill in AI Center, check the solutions below.

Solution 1

If you are installing Automation Suite offline, double-check that the Document Understanding bundle has been downloaded and installed.

The bundle contains the base images (for example, the model library) so that models run properly on AI Center after the ML packages are uploaded via the AI Center user interface.

For details about installing Document Understanding bundle, please refer to the documentation here and here. To add Document Understanding bundle, please follow the documentation to re-run the Document Understanding bundle installation.

Solution 2

Even if you have installed the Document Understanding bundle for offline installation, the following error message may occur: modulenotfounderror: no module named 'ocr.release'; 'ocr' is not a package

When creating the Document Understanding OCR ML package in AI Center, remember that it cannot be named ocr or OCR, as that would conflict with a folder inside the package. Make sure to choose a different name.

Solution 3

Sometimes, when a model is deployed for the first time, Document Understanding model skill deployments may fail intermittently with Failed to list deployment or Unknown Error.

The workaround is to deploy the model again. The second deployment is faster because most of the image-building work was already done during the first attempt. The first deployment of a Document Understanding ML package takes about 1 to 1.5 hours; redeploying it is faster.

 

Migration job fails in ArgoCD


Description

The migration job for Document Understanding fails in ArgoCD.

Solution

Document Understanding requires the Full-Text Search feature to be enabled on SQL Server. Otherwise, the installation may fail without a clear error message, because the migration job in ArgoCD will fail.
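
As a minimal sketch, you could check whether Full-Text Search is installed on the instance with the sqlcmd utility (server and credentials are placeholders); a result of 1 means it is installed:

# Returns 1 if the Full-Text and Semantic Extractions feature is installed.
sqlcmd -S <sql-server-host> -U <user> -P <password> -Q "SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled')"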

 

Handwriting recognition with Intelligent Form Extractor is not working


Description

Handwriting recognition with Intelligent Form Extractor is not working, or is working too slowly.

Solution 1

If you are using Intelligent Form Extractor offline, check that handwriting was enabled in the configuration file before installation, or enable it in ArgoCD.

To double-check, go to ArgoCD > Document Understanding > App Details > du-services.handwritingEnabled (set it to True).

In the offline scenario, the Document Understanding bundle needs to be installed before doing this; otherwise, the ArgoCD sync will fail.

Solution 2

You might still face the same issue even though handwriting is enabled in the configuration file.

Note that the maximum number of CPUs allowed for handwriting per container defaults to 2. If your handwriting processing workload is heavy, you may need to adjust the handwriting.max_cpu_per_pod parameter. You can update it in the configuration file before installation, or update it in ArgoCD; a configuration sketch follows below.

For more details on how to calculate the parameter value based on your volume, please check the documentation here.
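
For reference, a minimal sketch of the corresponding configuration file fragment, assuming the same cluster_config.json layout as above (the value 4 is only an example):

{
  "documentunderstanding": {
    "handwriting": {
      "max_cpu_per_pod": 4
    }
  }
}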

 

Insights-related issues


Navigating to the Insights home page generates a 404


In rare cases, a routing error may occur, causing a 404 on the Insights home page. You can fix this by going to the Insights application in ArgoCD and deleting the insightsprovisioning-vs virtual service. Note that you may have to click Clear filters to show X additional resources to see and delete this virtual service.
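
Alternatively, a minimal sketch of doing the same from the command line, assuming the Istio VirtualService resource is reachable via kubectl (the namespace is a placeholder):

# Locate the virtual service, then delete it.
kubectl get virtualservices -A | grep -i insights
kubectl -n <namespace> delete virtualservice insightsprovisioning-vs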

Looker fails to initialize


During Looker initialization, you might encounter an error stating RuntimeError: Error starting Looker. A Looker pod failure produced this error due to a possible system failure or loss of power. The issue persists even if you reinitialize Looker.

To fix this issue, delete the persistent volume claim (PVC), and then restart.
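
A minimal sketch of that, assuming Looker runs in the uipath namespace and its PVC name contains looker (both are assumptions):

# Find and delete the Looker PVC, then delete the Looker pod so it restarts with a fresh volume.
kubectl -n uipath get pvc | grep -i looker
kubectl -n uipath delete pvc <looker-pvc-name>
kubectl -n uipath delete pod <looker-pod-name>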
