
Troubleshooting

This page explains how to troubleshoot issues you might encounter when setting up Automation Suite.

Troubleshooting how-tos


Automation Suite generates logs that you can explore whenever you need to troubleshoot installation errors. All the details on issues occurring during installation are written to a log file saved in the directory that also contains the install-uipath.sh script. Each execution of the installer generates a new log file that follows the install-$(date +'%Y-%m-%dT%H_%M_%S').log naming convention, which you can inspect whenever you encounter installation issues.
If you want to troubleshoot post-installation errors, use the Support Bundle tool.
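To quickly locate and scan the most recent installer log, you can run something like the following from the directory that contains install-uipath.sh (a minimal sketch; the grep pattern is just an example filter):

latest_log=$(ls -1t install-*.log | head -1)
echo "Inspecting ${latest_log}"
grep -iE 'error|failed' "${latest_log}" | tail -n 20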

 

How to troubleshoot services during installation


Take the following steps on one of the cluster server nodes:

  1. Gain access to Kubernetes:
# On server nodes:
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

# On agent nodes:
export KUBECONFIG="/var/lib/rancher/rke2/agent/kubelet.kubeconfig"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

# To validate, execute the following command which should not return an error:
kubectl get nodes
  2. Retrieve the ArgoCD password by running the following command:
kubectl get secrets/argocd-admin-password -n argocd --template '{{ .data.password }}' | base64 -d
  3. Connect to ArgoCD:
    a. Navigate to https://alm.<fqdn>/:443
    b. Log in using admin as the username and the password obtained at step 2.

  4. Locate the UiPath services application, as follows:
    a. Using the search bar provided in ArgoCD, type in uipath

    b. Click its card to open the UiPath application.

    c. Check for the following: Application was not synced due to a failed job/pod

    d. If the error above is present, take the following steps.

    e. Find any components that are out of sync by looking for the red broken-heart icon.

    f. Open the right-most component (usually a pod) and click the Logs tab. The logs contain an error message indicating why the pod failed.

    g. Once any outstanding configuration issues are resolved, go back to the home page and click the Sync button on the UiPath application.
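If you prefer the command line to the ArgoCD UI, the same sync and health information can be read with kubectl (a sketch; assumes the kubectl access configured in step 1 and that the application is named uipath, so adjust the name to whatever the first command lists):

kubectl -n argocd get applications
kubectl -n argocd get application uipath -o jsonpath='{.status.sync.status}{"\n"}{.status.health.status}{"\n"}'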

 

How to uninstall the cluster


If you experience issues specific to the Kubernetes layer running on the cluster, you can uninstall the rke2 cluster directly.

  1. Depending on your installation profile, run one of the following commands:
    1.1. In an online setup, run the following script with elevated privileges, i.e. sudo, on each node of the cluster. This will uninstall the nodes.
function remove_rke2_entry_from_exclude() {
  local current_exclude_list new_exclude_list
  YUM_CONF_FILE=$1
  if [[ ! -s "${YUM_CONF_FILE}" ]];
  then
    # File is empty
    return
  fi
  current_exclude_list=$(grep 'exclude=' "${YUM_CONF_FILE}" | tail -1)
  if echo "$current_exclude_list" | grep -q 'rke2-*';
  then
    if [[ -w ${YUM_CONF_FILE} ]];
    then
      new_exclude_list=$(printf '%s\n' "${current_exclude_list//rke2-* /}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-*,/}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-\*/}")
      sed -i "/exclude=.*rke2-\*/d" "${YUM_CONF_FILE}"
      echo "${new_exclude_list}" >> "${YUM_CONF_FILE}"
    else
      error "${YUM_CONF_FILE} file is readonly and contains rke2-* under package exclusion. Please remove the entry for AS to work."
    fi
  fi
}

function enable_rke2_package_upgrade() {
  remove_rke2_entry_from_exclude /etc/dnf/dnf.conf
  remove_rke2_entry_from_exclude /etc/yum.conf
}

enable_rke2_package_upgrade

service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/bin/rke2-killall.sh ]
then
    echo "Running rke2-killall.sh"
    /usr/bin/rke2-killall.sh > /dev/null
else
    echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/bin/rke2-uninstall.sh ]
then
    echo "Running rke2-uninstall.sh"
    /usr/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

crontab -l > backupcron
sed -i '/backupjob/d' backupcron > /dev/null
crontab backupcron > /dev/null
rm -rf backupcron > /dev/null
rm -rfv /usr/bin/backupjob > /dev/null
rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
umount -l -f /var/lib/rancher/rke2/server/db > /dev/null 2>&1 || true
rm -rfv /var/lib/rancher/* > /dev/null
umount -l -f /var/lib/rancher
rm -rfv /var/lib/rancher/* > /dev/null
while ! rm -rfv /var/lib/kubelet/* > /dev/null; do
  findmnt --list   --submounts  -n -o TARGET  --target /var/lib/kubelet | grep '/var/lib/kubelet/plugins'  | xargs -r umount -f -l
  sleep 5
done
umount -l -f /var/lib/kubelet
rm -rfv /var/lib/kubelet/* > /dev/null
rm -rfv /datadisk/* > /dev/null
umount -l -f /datadisk
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mount /var/lib/rancher
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."

    1.2. In an offline setup, run the following script with elevated privileges, i.e. sudo, on each node of the cluster. This will uninstall the nodes.

function remove_rke2_entry_from_exclude() {
  local current_exclude_list new_exclude_list
  YUM_CONF_FILE=$1
  if [[ ! -s "${YUM_CONF_FILE}" ]];
  then
    # File is empty
    return
  fi
  current_exclude_list=$(grep 'exclude=' "${YUM_CONF_FILE}" | tail -1)
  if echo "$current_exclude_list" | grep -q 'rke2-*';
  then
    if [[ -w ${YUM_CONF_FILE} ]];
    then
      new_exclude_list=$(printf '%s\n' "${current_exclude_list//rke2-* /}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-*,/}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-\*/}")
      sed -i "/exclude=.*rke2-\*/d" "${YUM_CONF_FILE}"
      echo "${new_exclude_list}" >> "${YUM_CONF_FILE}"
    else
      error "${YUM_CONF_FILE} file is readonly and contains rke2-* under package exclusion. Please remove the entry for AS to work."
    fi
  fi
}

function enable_rke2_package_upgrade() {
  remove_rke2_entry_from_exclude /etc/dnf/dnf.conf
  remove_rke2_entry_from_exclude /etc/yum.conf
}

enable_rke2_package_upgrade

service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/local/bin/rke2-killall.sh ]
then
  echo "Running rke2-killall.sh"
  /usr/local/bin/rke2-killall.sh > /dev/null
else
  echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/local/bin/rke2-uninstall.sh ]
then
  echo "Running rke2-uninstall.sh"
  /usr/local/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

crontab -l > backupcron
sed -i '/backupjob/d' backupcron > /dev/null
crontab backupcron > /dev/null
rm -rf backupcron > /dev/null
rm -rfv /usr/bin/backupjob > /dev/null
rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
umount -l -f /var/lib/rancher/rke2/server/db > /dev/null 2>&1 || true
rm -rfv /var/lib/rancher/* > /dev/null
umount -l -f /var/lib/rancher
rm -rfv /var/lib/rancher/* > /dev/null
while ! rm -rfv /var/lib/kubelet/* > /dev/null; do
  findmnt --list   --submounts  -n -o TARGET  --target /var/lib/kubelet | grep '/var/lib/kubelet/plugins'  | xargs -r umount -f -l
  sleep 5
done
umount -l -f /var/lib/kubelet
rm -rfv /var/lib/kubelet/* > /dev/null
rm -rfv /datadisk/* > /dev/null
umount -l -f /datadisk
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mount /var/lib/rancher
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."
  2. Reboot the node after uninstall.

🚧

Important

When uninstalling one of the nodes from the cluster, you must run the following command: kubectl delete node <node_name>. This removes the node from the cluster.
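For example, after uninstalling a node you can look up its name and remove it from any remaining server node (a quick sketch of the command mentioned above):

kubectl get nodes
kubectl delete node <node_name>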

 

How to clean up offline artifacts to improve disk space


If you run an offline installation, a larger disk size is typically required because of the offline artifacts being used.

Once the installation completes, you can delete these local artifacts. Otherwise, they may cause unnecessary disk pressure during cluster operations.

On the primary server where the installation was performed, you can perform the cleanup by using the following commands.

  1. Remove all the images loaded by Podman into the local container storage by using the following command:
podman image rm -af
  2. Remove the temporary offline folder used with the --offline-tmp-folder flag. This parameter defaults to /tmp:
rm -rf /path/to/temp/folder
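To confirm the cleanup freed the expected space, you can optionally list any remaining local images and check disk usage (a quick sketch):

podman image ls
df -h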

 

Common issues


Unable to run an offline installation on RHEL 8.4 OS


Description

If you install RHEL 8.4 and perform an offline installation (which requires Podman), the following issues may occur. They are specific to Podman and the OS it comes installed with. See the two potential issues below.

Potential issue

  • You cannot install both of the following on the cluster:
    • podman-1.0.0-8.git921f98f.module+el8.3.0+10171+12421f43.x86_64
    • podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
  • cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch requires podman >= 1.3.0, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch

Potential issue

  • podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires containernetworking-plugins >= 0.8.1-1, but none of the providers can be installed
  • You cannot install both of the following at the same time:
    • containernetworking-plugins-0.7.4-4.git9ebe139.module+el8.3.0+10171+12421f43.x86_64
    • containernetworking-plugins-0.9.1-1.module+el8.4.0+10607+f4da7515.x86_64
  • podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires podman = 3.0.1-6.module+el8.4.0+10607+f4da7515, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
    (try to add --allowerasing to command line to replace conflicting packages or --skip-broken to skip uninstallable packages or --nobest to use not only best candidate packages)

Solution

You need to remove the current version of podman and allow Automation Suite to install the required version.

  1. Remove the current version of podman using the yum remove podman command.

  2. Rerun the installer after removing the current version; it installs the correct version.

 

Offline installation fails because of a missing binary


Description

During an offline installation, the Fabric stage fails with the following error message:

Error: overlay: can't stat program "/usr/bin/fuse-overlayfs": stat /usr/bin/fuse-overlayfs: no such file or directory

Solution

You need to remove the line containing the mount_program key from the Podman configuration file /etc/containers/storage.conf.
Make sure to delete the line instead of commenting it out.
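A quick way to do this is with sed (a sketch; back up the file first, since this deletes every line that sets mount_program):

cp /etc/containers/storage.conf /etc/containers/storage.conf.bak
sed -i '/^\s*mount_program\s*=/d' /etc/containers/storage.conf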

 

Unable to get the sandbox image


Description

You may receive an error message when trying to get the following sandbox image: index.docker.io/rancher/pause3.2

This can happen in an offline installation.

Solution

Restart either rke2-server or rke2-agent, depending on whether the machine the pod is scheduled on is a server or an agent.

To check which node the pod is scheduled on, run kubectl -n <namespace> get pods -o wide.

# If machine is a Master node
systemctl restart rke2-server
# If machine is an Agent Node
systemctl restart rke2-agent

 

SQL connection string validation error


Description

You might receive an error related to the connection string, as follows:

Sqlcmd: Error: Microsoft Driver 17 for SQL Server :
Server—tcp : <connection string>
Login failed for user

This error appears even though all the credentials are correct; the connection string validation fails.

Solution

Make sure the connection string has the following structure:

Server=<Sql server host name>;User Id=<user_name for sql server>;Password=<Password>;Initial Catalog=<database name>;Persist Security Info=False;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;Max Pool Size=100;

📘

Note:

User Id is case-sensitive.
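To verify the values outside the installer, you can attempt a direct connection with sqlcmd (a sketch; assumes the mssql-tools package is installed on the node and uses the same values as the connection string above):

sqlcmd -S "<Sql server host name>" -U "<user_name for sql server>" -P "<Password>" -d "<database name>" -Q "SELECT 1"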

 

Pods not shown in the ArgoCD UI


Description

Sometimes the ArgoCD UI does not display the pods; only the applications and the corresponding deployments are shown.


When clicking on any of the deployments, the following error is displayed: Unable to load data: EOF


Solution

You can fix this issue by deleting all the Redis replicas from the ArgoCD namespace and waiting for them to come back up.

kubectl -n argocd delete pod argocd-redis-ha-server-0 argocd-redis-ha-server-1 argocd-redis-ha-server-2

# Wait for all 3 pods to come back up
kubectl -n argocd get pods | grep argocd-redis-ha-server

 

Certificate issue in offline installation


Description

You might receive an error saying that the certificate is signed by an unknown authority.

Error: failed to do request: Head "https://sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com:30071/v2/helm/audit-service/blobs/sha256:09bffbc520ff000b834fe1a654acd089889b09d22d5cf1129b0edf2d76554892": x509: certificate signed by unknown authority

Solution

Both the root CA certificate and the server certificate need to be in the trusted store on the machine.

To investigate, execute the following command:

find /etc/pki/ca-trust/source{,/anchors} -maxdepth 1 -not -type d -exec ls -1 {} +
/etc/pki/ca-trust/source/anchors/rootCA.crt
/etc/pki/ca-trust/source/anchors/server.crt

The certificates you provided need to be present in the output of this command.

Alternatively, execute the following command:

openssl x509 -in /etc/pki/ca-trust/source/anchors/server.crt -text -noout

Make sure the fully qualified domain name is present under Subject Alternative Name in the output.

X509v3 Subject Alternative Name:
                DNS:sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com

You can update the CA certificates as follows:

update-ca-trust
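If the certificates are not yet in the trusted store, a typical sequence is to copy them into the anchors directory and then refresh the store (a sketch; assumes rootCA.crt and server.crt are in the current directory):

cp rootCA.crt server.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust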

 

Error downloading the bundle


Description

The documentation lists wget as an option for downloading the bundles. Because of the large sizes, the connection may be interrupted and not recover.

Solution

One way to mitigate this could be to switch to a different download tool, such as azcopy (more information here). Run these commands, while updating the bundle URL to match the desired version/bundle combination.

wget https://aka.ms/downloadazcopy-v10-linux -O azcopy.tar.gz
tar -xvf ./azcopy.tar.gz
azcopy_linux_amd64_10.11.0/azcopy copy https://download.uipath.com/service-fabric/0.0.23-private4/sf-0.0.23-private4.tar.gz /var/tmp/sf.tar.gz --from-to BlobLocal

 

Longhorn errors


Rook Ceph or Looker pod stuck in Init state

Description

Occasionally, on node restart, an issue causes the Looker or Rook Ceph pod to get stuck in Init state because the volume required for attaching the PVC to the pod is missing.

Verify that the problem is indeed related to Longhorn by running the following command:

kubectl get events -A -o json | jq -r '.items[] | select(.message != null) | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""))'

If it is related to Longhorn, this command returns a list of the pod names affected by the issue. If the command does not return anything, the cause of the problem is different.

Solution

If the previous command returns non-empty output, run the following script to fix the problematic pods:

#!/bin/bash


function wait_till_rollout() {
    local namespace=$1
    local object_type=$2
    local deploy=$3

    local try=0
    local maxtry=2
    local status="notready"

    while [[ ${status} == "notready" ]]  && (( try != maxtry )) ; do
        kubectl -n "$namespace" rollout status "$deploy" -w --timeout=600s; 
        # shellcheck disable=SC2181
        if [[ "$?" -ne 0 ]]; 
        then
            status="notready"
            try=$((try+1))
        else
            status="ready"
        fi
    done
    if [[ $status == "notready" ]]; then 
        echo "$deploy of type $object_type failed in namespace $namespace. Plz re-run the script once again to verify that it's not a transient issue !!!"
        exit 1
    fi
}

function fix_pv_deployments() {
    for pod_name in $(kubectl get events -A -o json | jq -r '.items[]  | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\"")) | select(.involvedObject.kind == "Pod") | .involvedObject.name + "/" + .involvedObject.namespace' | sort | uniq)
    do
        POD_NAME=$(echo "${pod_name}" | cut -d '/' -f1)
        NS=$(echo "${pod_name}" | cut -d '/' -f2)
        controller_data=$(kubectl -n "${NS}" get po "${POD_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
        [[ $controller_data == "" ]] && error "Error: Could not determine owner for pod: ${POD_NAME}" && exit 1
        CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
        CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)
        if [[ $CONTROLLER_KIND == "ReplicaSet" ]]
        then
            controller_data=$(kubectl  -n "${NS}" get "${CONTROLLER_KIND}" "${CONTROLLER_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
            CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
            CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)

            replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.replicas')
            unavailable_replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.unavailableReplicas')

            if [ -n "$unavailable_replicas" ]; then 
                available_replicas=$((replicas - unavailable_replicas))
                if [ $available_replicas -eq 0 ]; then
                    kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas=0
                    sleep 15
                    kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas="$replicas"
                    deployment_name="$CONTROLLER_KIND/$CONTROLLER_NAME"
                    wait_till_rollout "$NS" "deploy" "$deployment_name"
                fi 
            fi
        fi
    done
}

fix_pv_deployments

 

StatefulSet volume attachment error


Pods in RabbitMQ, cattle-monitoring-system, or other StatefulSets are stuck in the Init state.

Description

Occasionally, upon node power failure or during upgrade, an issue causes the pods in RabbitMQ or cattle-monitoring-system to get stuck in init state as the volume required for attaching the PVC to a pod is missing.

Verify if the problem is indeed related to the StatefulSet volume attachment by running the following command:

kubectl -n <namespace> describe pod <pod-name> | grep "cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""

If it is related to the StatefulSet volume attachment, it will show an error message.

Solution

To fix this issue, reboot the node.

 

Unable to create a persistent volume

Description

Longhorn is installed successfully, but it fails to create persistent volumes.

Solution

Verify whether the kernel modules are successfully loaded in the cluster by using the command lsmod | grep <module_name>.
Replace <module_name> with each of the kernel modules below:

  • libiscsi_tcp
  • libiscsi
  • iscsi_tcp
  • scsi_transport_iscsi

Load any modules that are missing.
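Missing modules can usually be loaded with modprobe (a sketch; run it on each affected node and then re-check with lsmod):

for m in libiscsi_tcp libiscsi iscsi_tcp scsi_transport_iscsi; do
  sudo modprobe "$m"
done
lsmod | grep iscsi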

 

rke2-coredns-rke2-coredns-autoscaler pod in CrashLoopBackOff state


Description

After a node restart, rke2-coredns-rke2-coredns-autoscaler goes into CrashLoopBackOff state. This has no impact on Automation Suite.

Solution

Delete the rke2-coredns-rke2-coredns-autoscaler pod that is in CrashLoopBackOff using the following command: kubectl delete pod <pod name> -n kube-system

 

Redis probe failure


Description

The Redis probe can fail if the node ID file does not exist. This can happen if the pod has not been bootstrapped yet.

There is a recovery job that automatically fixes this issue; do not perform the following steps while the job is running.

When a Redis Enterprise cluster loses contact with more than half of its nodes (either because of failed nodes or a network split), the cluster stops responding to client connections. Pods are also unable to rejoin the cluster.

Solution

  1. Delete the Redis cluster and database by using the following commands:
kubectl delete redb -n redis-system redis-cluster-db --force --grace-period=0 &
kubectl delete rec -n redis-system redis-cluster --force --grace-period=0 &
kubectl patch redb -n redis-system redis-cluster-db --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"finalizer.redisenterprisedatabases.app.redislabs.com"}]'
kubectl patch rec redis-cluster -n redis-system --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"redbfinalizer.redisenterpriseclusters.app.redislabs.com"}]'
kubectl delete job redis-cluster-db-job -n redis-system
  2. Go to the ArgoCD UI and sync the redis-cluster application.

 

RKE2 server fails to start


Description

The server fails to start. There can be many reasons why RKE2 does not start properly, and they can usually be found in the logs.

Solution

Check the logs by using the following command:

journalctl -u rke2-server

Possible cause (based on the logs): too many learner members in the cluster

Too many etcd servers are added to the cluster, and there are two learner nodes trying to be promoted. More information here: Runtime reconfiguration.

Take the following steps:

  1. Under normal circumstances, the node should become a full member if given enough time.
  2. You can attempt an uninstall-reinstall cycle.

Alternatively, this could be caused by a networking problem. Ensure you have configured the machine to enable the necessary ports.
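To rule out a basic connectivity problem, you can check from a joining node that the first server's RKE2 ports are reachable (a sketch; 9345 is the RKE2 supervisor port and 6443 the Kubernetes API port, and your environment may require more ports than these):

nc -zv <first-server-ip> 9345
nc -zv <first-server-ip> 6443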

 

Node drain does not occur for stopped nodes


Description

If a node in the cluster is stopped and its corresponding pods are not rescheduled to an available node after 15 minutes, run the following script to manually drain the node.

#!/bin/sh

KUBECTL="/usr/local/bin/kubectl"

# Get only nodes which are not drained yet
NOT_READY_NODES=$($KUBECTL get nodes | grep -P 'NotReady(?!,SchedulingDisabled)' | awk '{print $1}' | xargs echo)
# Get only nodes which are still drained
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}' | xargs echo)

echo "Unready nodes that are undrained: $NOT_READY_NODES"
echo "Ready nodes: $READY_NODES"


for node in $NOT_READY_NODES; do
  echo "Node $node not drained yet, draining..."
  $KUBECTL drain --ignore-daemonsets --force --delete-emptydir-data $node
  echo "Done"
done;

for node in $READY_NODES; do
  echo "Node $node still drained, uncordoning..."
  $KUBECTL uncordon $node
  echo "Done"
done;

 

Enabling Istio logging


To debug Istio, you need to enable logging. To do so, take the following steps:

  1. Find the istio-ingressgateway pod by running the following command. Copy the gateway pod name. It should be something like istio-ingressgateway-r4mbx
kubectl -n istio-system get pods
  2. Open a shell in the gateway pod by running the following command:
kubectl exec -it -n istio-system istio-ingressgateway-r4mbx bash
  3. Enable debug-level logging by running the following command:
curl -X POST http://localhost:15000/logging?level=debug
  4. Run the following command from a server node.
istioctl_bin=$(find /var/lib/rancher/rke2/ -name "istioctl" -type f -perm -u+x   -print -quit)
if [[ -n ${istioctl_bin} ]]
then
echo "istioctl bin found"
  kubectl -n istio-system get cm istio-installer-base -o go-template='{{ index .data "istio-base.yaml" }}'  > istio-base.yaml
  kubectl -n istio-system get cm istio-installer-overlay  -o go-template='{{ index .data "overlay-config.yaml" }}'  > overlay-config.yaml 
  ${istioctl_bin} -i istio-system install -y -f istio-base.yaml -f overlay-config.yaml --set meshConfig.accessLogFile=/dev/stdout --set meshConfig.accessLogEncoding=JSON 
else
  echo "istioctl bin not found"
fi

 

Secrets not found in the UiPath namespace


Description

If service installation fails, and checking kubectl -n uipath get pods returns failed pods, take the following steps.

Solution

  1. Check kubectl -n uipath describe pod <pod-name> and look for secret not found.
  2. If the secret is not found, look at the credential manager job logs and check whether it failed.
  3. If the credential manager job failed and kubectl get pods -n rook-ceph|grep rook-ceph-tool returns more than one pod, do the following:
    a. Delete the rook-ceph-tool pod that is not running.
    b. Go to the ArgoCD UI and sync the sfcore application.
    c. Once the job completes, check that all the secrets were created in the credential manager job logs.
    d. Sync the uipath application.
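As a quick aid for step 3, the following commands list the rook-ceph-tools pods and delete the one that is not running (a sketch; substitute the pod name reported by the first command):

kubectl -n rook-ceph get pods | grep rook-ceph-tool
kubectl -n rook-ceph delete pod <not-running-rook-ceph-tool-pod>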

 

Unable to log in after migration


Description

An issue might affect the migration from a standalone product to Automation Suite. It prevents you from logging in, with the following error message being displayed: Cannot find client details

Solution

To fix this problem, re-sync the uipath app first, and then sync the platform app in ArgoCD.

 

ArgoCD login failed


Description

You may fail to log in to ArgoCD when using the admin password, or the installer may fail with an error message.


Solution

To fix this issue, enter your password, create a bcrypt password, and run the commands described in the following section:

password="<enter_your_password>"
bcryptPassword=<generate bcrypt password using link https://www.browserling.com/tools/bcrypt >

# Enter your bcrypt password and run below command
kubectl -n argocd patch secret argocd-secret \
  -p '{"stringData": {
    "admin.password": "<enter you bcryptPassword here>",
    "admin.passwordMtime": "'$(date +%FT%T%Z)'"
  }}'

# Run below commands
argocdInitialAdminSecretPresent=$(kubectl -n argocd get secret argocd-initial-admin-secret --ignore-not-found )
if [[ -n ${argocdInitialAdminSecretPresent} ]]; then
   echo "Start updating argocd-initial-admin-secret"
   kubectl -n argocd patch secret argocd-initial-admin-secret \
   -p "{
      \"stringData\": {
         \"password\": \"$password\"
      }
   }"
fi

argocAdminSecretName=$(kubectl -n argocd get secret argocd-admin-password --ignore-not-found )
if [[ -n ${argocAdminSecretName} ]]; then
   echo "Start updating argocd-admin-password"
   kubectl -n argocd patch secret argocd-admin-password \
   -p "{
      \"stringData\": {
         \"password\": \"$password\"
      }
   }"
fi
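If you prefer to generate the bcrypt hash locally instead of using the web tool linked above, the htpasswd utility from the httpd-tools package can produce one (a sketch; the tr call strips the empty username prefix and the trailing newline):

bcryptPassword=$(htpasswd -nbBC 10 "" "$password" | tr -d ':\n')
echo "$bcryptPassword"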

 

ArgoCD application in Progressing state after the initial installation


Description

Whenever the cluster state deviates from what is defined in the Helm repository, ArgoCD tries to sync the state, and reconciliation happens every minute. Whenever this happens, you may notice that the ArgoCD app is in Progressing state.

Solution

This is expected ArgoCD behavior, and it does not affect the application in any way.

 

Automation Suite requires backlog_wait_time to be set 1


Description

Audit events can cause instability (system freeze) if backlog_wait_time is not set to 1.
For more details, see this issue description.

Solution

If the installer fails with the Automation Suite requires backlog_wait_time to be set 1 error message, take the following steps to set backlog_wait_time to 1:

  1. Set backlog_wait_time to 1 by appending --backlog_wait_time 1 in the /etc/audit/rules.d/audit.rules file.
  2. Reboot the node.
  3. Validate that the backlog_wait_time value is set to 1 for auditctl by running sudo auditctl -s | grep "backlog_wait_time" on the node.
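Put together, the fix and the post-reboot check look roughly like this (a sketch that mirrors the steps above):

echo "--backlog_wait_time 1" | sudo tee -a /etc/audit/rules.d/audit.rules
sudo reboot
# After the node is back up:
sudo auditctl -s | grep "backlog_wait_time"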

 

Unable to resize the objectstore PVC


Description

This issue occurs when the objectstore resize-pvc operation fails with the following error:

Failed resizing the PVC: <pvc name> in namespace: rook-ceph, ROLLING BACK

Solution

To fix this issue, take the following steps:

  1. Manually run the following script:
#!/bin/sh

ROOK_CEPH_OSD_PREPARE=$(kubectl -n rook-ceph get pods | grep rook-ceph-osd-prepare-set | awk '{print $1}')
if [[ -n ${ROOK_CEPH_OSD_PREPARE} ]]; then
    for pod in ${ROOK_CEPH_OSD_PREPARE}; do
    echo "Start deleting rook ceph osd pod $pod .."
    kubectl -n rook-ceph delete pod $pod
    echo "Done"
    done;
fi
  2. Rerun the objectstore resize-pvc command.

 

Failure to upload/download data in object-store (rook-ceph)



Description

This issue may occur when the object-store is in a degraded state due to a placement group (PG) inconsistency.
Verify if the problem is indeed related to rook-ceph PG inconsistency by running the following commands:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin
ROOK_CEPH_TOOLS=$(kubectl -n rook-ceph get pods | grep rook-ceph-tools)
kubectl -n rook-ceph exec -it $ROOK_CEPH_TOOLS -- ceph status

If the problem is related to a rook-ceph PG inconsistency, the output will contain the following messages:

....
....
Possible data damage: X pgs inconsistent
....
....
X active+clean+inconsistent
....
....

Solution

To repair the inconsistent PG, take the following steps:

  1. Exec into the rook-ceph tools pod:
kubectl -n rook-ceph exec -it $ROOK_CEPH_TOOLS -- sh
  2. Trigger the rook-ceph garbage collector process. Wait until the process is complete.
radosgw-admin gc process
  3. Find a list of active+clean+inconsistent PGs:
ceph health detail
# output of this command be like
# ....
# pg <pg-id> is active+clean+inconsistent, acting ..
# pg <pg-id> is active+clean+inconsistent, acting ..
# ....
#
  4. Trigger a deep scrub on the PGs, one at a time. This command takes a few minutes to run, depending on the PG size.
ceph pg deep-scrub <pg-id>
  5. Watch the scrubbing status:
ceph -w | grep <pg-id>
  6. Check the PG scrub status. If the PG scrub is successful, the PG status should be active+clean+inconsistent
ceph health detail | grep <pg-id>
  7. Repair the PG:
ceph pg repair <pg-id>
  8. Check the PG repair status. The PG ID should be removed from the active+clean+inconsistent list if the PG is repaired successfully.
ceph health detail | grep <pg-id>
  9. Repeat steps 3 to 8 for the rest of the inconsistent PGs.

 

Failure after certificate update


Description

This issue occurs when the certificate update step fails internally. You might not be able to access Automation Suite or Orchestrator.

Error


Solution

  1. Run the following commands from any server node:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin

kubectl -n uipath rollout restart deployments
  2. Wait for the above command to execute successfully, then run the following commands to verify the status of the previous one:
deployments=$(kubectl -n uipath get deployment -o name)
for i in $deployments; 
do 
kubectl -n uipath rollout status "$i" -w --timeout=600s; 
if [[ "$?" -ne 0 ]]; 
then
    echo "$i deployment failed in namespace uipath."
fi
done
echo "All deployments are succeeded in namespace uipath"

Once the above commands finish executing, you should be able to access Automation Suite and Orchestrator.

 

Unexpected inconsistency; run fsck manually


While installing or upgrading Automation Suite, if any pod cannot mount its PVC, the following error message is displayed:
UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY


Recovery steps

If you encounter the error above, follow the recovery steps below:

  1. SSH to the system by running the following command:
ssh <user>@<node-ip>
  2. Check the events of the PVC and verify that the issue is related to a PVC mount failure due to a file error. To do this, run the following commands:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
kubectl get events -n mongodb
kubectl get events -n longhorn-system
  3. Check the PVC volume mentioned in the event and run the fsck command.
fsck -a <pvc-volume-name>
e.g. fsck -a /dev/longhorn/pvc-5abe3c8f-7422-44da-9132-92be5641150a
  4. Delete the failing MongoDB pod to properly mount it to the PVC.
kubectl delete pod <pod-name> -n mongodb

 

Cluster unhealthy after automated upgrade from 2021.10


During the automated upgrade from Automation Suite 2021.10, the CNI provider is migrated from Canal to Cilium. This operation requires that all nodes are restarted. On rare occasions, one or more nodes might not be successfully rebooted, causing pods running on those nodes to remain unhealthy.

Recovery steps

  1. Identify failed restarts.
    During the Ansible execution, you might see output similar to the following snippet:
TASK [Reboot the servers] ***************************************************************************************************************************

fatal: [10.0.1.6]: FAILED! =>

  msg: 'Failed to connect to the host via ssh: ssh: connect to host 10.0.1.6 port 22: Connection timed out'

Alternatively, browse the logs on the Ansible host machine, located at /var/tmp/uipathctl_<version>/_install-uipath.log. If any failed restarts were identified, execute steps 2 through 4 on all nodes.

  2. Confirm a reboot is needed on each node.
    Connect to each node and run the following command:
ssh <username>@<ip-address>
iptables-save 2>/dev/null | grep -i cali -c

If the result is not zero, a reboot is needed.

  3. Reboot the node:
sudo reboot
  4. Wait for the node to become responsive (you should be able to SSH to it) and repeat steps 2 through 4 on every other node.

 

First installation fails during Longhorn setup


On rare occasions, if the first attempt to install Longhorn fails, subsequent retries might throw a Helm-specific error: Error: UPGRADE FAILED: longhorn has no deployed releases

Recovery steps

Remove the Longhorn Helm release before retrying the installation by running the following command:

/opt/UiPathAutomationSuite/<version>/bin/helm uninstall longhorn --namespace longhorn-system

 

Automation Suite not working after OS upgrade


Description

After an OS upgrade, Ceph OSD pods can sometimes get stuck in CrashLoopBackOff state. This issue causes Automation Suite not to be accessible.

Solution

  1. Check the state of the pods by running the following command:
kubectl -n rook-ceph get pods
  2. If any of the pods in the previous output are in CrashLoopBackOff, recover them by running the following commands:
OSD_PODS=$(kubectl -n rook-ceph get deployment -l app=rook-ceph-rgw --no-headers | awk '{print $1}')
kubectl -n rook-ceph rollout restart deploy $OSD_PODS
  3. Wait for approximately 5 minutes for the pods to be in running state again, and check their status by running the following command:
kubectl -n rook-ceph get pods

 

Identity Server issues

Setting a timeout interval for the Management portals

Pre-installation, you cannot update the expiration time for the token used to authenticate to the host- and organization-level Management portals. Therefore user sessions do not time out.

To set a time interval for timeout for these portals, you can update the accessTokenLifetime property.
The below example sets the timeout interval to 86400 seconds (24 hours):

UPDATE [identity].[Clients] SET AccessTokenLifetime = 86400 WHERE ClientName = 'Portal.OpenId'
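To confirm the change took effect, you can query the same table (a sketch; assumes sqlcmd is available and substitutes your identity database connection details):

sqlcmd -S "<Sql server host name>" -U "<user_name>" -P "<Password>" -d "<identity database name>" -Q "SELECT ClientName, AccessTokenLifetime FROM [identity].[Clients] WHERE ClientName = 'Portal.OpenId'"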

 

Kerberos issues


kinit: Cannot find KDC for realm while getting initial credentials

Description

This error might occur during installation (if you have Kerberos authentication enabled) or during the kerberos-tgt-update cron job execution when the UiPath cluster cannot connect to the AD server to obtain the Kerberos ticket for authentication.

Solution

Check the AD domain and make sure it is configured correctly and is routable, as follows:

getent ahosts <AD domain> | awk '{print $1}' | sort | uniq

If this command does not return a routable IP address, the AD domain required for Kerberos authentication is not configured correctly.

You need to work with your IT administrators to add the AD domain to the DNS server and make sure this command returns a routable IP address.

 

kinit: Keytab contains no suitable keys for *** while getting initial credentials

Description

This error could be found in the log of a failed job, with one of the following job names: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update

Solution

Make sure the AD user still exists, is active, and that its password has not been changed and has not expired. If needed, reset the user's password and regenerate the keytab.
Also make sure to provide the default Kerberos AD user parameter <KERB_DEFAULT_USERNAME> in the following format: HTTP/<Service Fabric FQDN>
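To confirm the keytab and the AD user are valid, you can request a ticket with the keytab directly from a node (a sketch; kinit and klist come from the krb5-workstation package, and the principal must match the one the keytab was generated for):

kinit -kt <path-to-keytab> "HTTP/<Service Fabric FQDN>"
klist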

 

GSSAPI operation failed with error: An invalid status code was supplied (Client's credentials have been revoked).

Description

This log can be found when Kerberos is used for SQL access and the SQL connection is failing inside services. Similarly, you may see kinit: Client's credentials have been revoked while getting initial credentials in the log of one of the following jobs: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update

Solution

This could be caused by the AD user account used to generate the keytab being disabled. Re-enabling the AD user account should fix the issue.

Alarm received for failed kerberos-tgt-update job

Description

This happens if the UiPath cluster fails to retrieve the latest Kerberos ticket.

Solution

To find the issue, check the log for a failed job whose name starts with kerberos-tgt-update. After you've identified the problem in the log, check the related troubleshooting information in this section and in the Troubleshooting section for configuring Active Directory.

 

SSPI Provider: Server not found in Kerberos database

Solution

Make sure that the correct SPN records are set up in the AD domain controller for the SQL server. For instructions, see SPN formats in the Microsoft SQL Server documentation.
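For reference, SPN registration is typically performed on a domain-joined Windows machine with the setspn tool (a sketch only; the exact SPN format, host, port, and service account depend on your SQL Server deployment, so follow the Microsoft documentation referenced above):

setspn -S MSSQLSvc/<sql-server-fqdn>:1433 <domain>\<sql-service-account>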

 

 

Login failed for user <ADDOMAIN>\<aduser>. Reason: The account is disabled.

Description

This log can be found when Kerberos is used for SQL access and the SQL connection is failing inside services.

Solution

This issue could be caused by the AD user losing access to the SQL server. See instructions on how to reconfigure the AD user.

 

Orchestrator-related issues


Orchestrator pod in CrashLoopBackOff or 1/2 running with multiple restarts


Description

If the Orchestrator pod is in CrashLoopBackOff state or is 1/2 running with multiple restarts, the failure may be related to the authentication keys for the object storage provider, Ceph.

To check whether the failure is related to Ceph, run the following command:

kubectl -n uipath get pod -l app.kubernetes.io/component=orchestrator

If the output of this command is similar to one of the following options, you need to run an additional command.

Option 1:
NAME                            READY   STATUS    RESTARTS   AGE
orchestrator-6dc848b7d5-q5c2q   1/2     Running   2          6m1s

OR 

Option 2
NAME                            READY   STATUS             RESTARTS   AGE
orchestrator-6dc848b7d5-q5c2q   1/2     CrashLoopBackOff   6          16m

Run the following command to verify whether the failure is related to the Ceph authentication keys:

kubectl -n uipath logs -l app.kubernetes.io/component=orchestrator | grep 'Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden' -o

If the output of the above command contains the string Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden, the failure is due to the Ceph authentication keys.

Solution

Rerun the rook-ceph-configure-script-job and credential-manager jobs using the following commands:

kubectl -n uipath-infra get job "rook-ceph-configure-script-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath-infra get job "credential-manager-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath delete pod -l app.kubernetes.io/component=orchestrator

 

Test Manager-related issues


Test Manager licensing issue


If your license was allocated while you were logged in, your license allocation may not be detected when you open Test Manager.

If this happens, take the following steps:

  1. Navigate to Test Manager.
  2. Log out of the portal.
  3. Log back in.

 

AI Center-related issues


AI Center skill deployment issue

Sometimes, intermittently, DU model skill deployments can fail with Failed to list deployment or Unknown Error when the model is deployed for the first time. The workaround is to try deploying the model again. The second deployment is faster, because most of the image-building work is done during the first attempt. The first deployment of a DU model takes approximately 1 to 1.5 hours; deploying it again is faster.

In rare cases, due to the cluster state, asynchronous operations such as skill deployment or package upload can remain stuck for a long time. If a DU skill deployment takes more than 2 to 3 hours, try deploying a simpler model (for example, the template model). If deploying that model also takes more than an hour, the mitigation is to restart the AI Center services by using the following commands:

kubectl -n uipath rollout restart deployment ai-deployer-deployment
kubectl -n uipath rollout restart deployment ai-trainer-deployment
kubectl -n uipath rollout restart deployment ai-pkgmanager-deployment
kubectl -n uipath rollout restart deployment ai-helper-deployment
kubectl -n uipath rollout restart deployment ai-appmanager-deployment

Verify with the following command and wait for the AI Center pods to come back up:

kubectl -n uipath get pods | grep ai-*

All the pods above should be in Running state, with the container state shown as 2/2.

 

Document Understanding-related issues


Document Understanding not in the left rail of Automation Suite


Description

If you cannot find Document Understanding in the left rail of Automation Suite, note that Document Understanding is currently not a separate application in Automation Suite, so it is not displayed in the left rail.

Solution

The Data Manager component is part of AI Center, so make sure AI Center is enabled.

In addition, use the following public URLs to access Form Extractor, Intelligent Form Extractor (including Handwriting Recognition), and Intelligent Keyword Classifier:

<FQDN>/du_/svc/formextractor
<FQDN>/du_/svc/intelligentforms
<FQDN>/du_/svc/intelligentkeywords

If you get the Your license can not be validated error message when trying to use Intelligent Keyword Classifier, Form Extractor, or Intelligent Form Extractor in Studio, make sure you have entered the right endpoint, and also use the API key generated for Document Understanding under License in the Automation Suite installation, not one taken from cloud.uipath.com.

 

Failed status when creating a data labeling session


Description

If you cannot create a data labeling session in Data Manager in AI Center, take the following steps.

Solution 1

Please double-check that Document Understanding is properly enabled. You should have updated the configuration file before the installation and set documentunderstanding.enabled to True, or you can update it in ArgoCD post-installation.

Once that is done, disable and then re-enable AI Center on the tenant on which you want to use the data labeling feature, or create a new tenant.


Solution 2

Even if Document Understanding is properly enabled in the configuration file or in ArgoCD, it is sometimes not enabled for the DefaultTenant. This manifests itself as the inability to create a data labeling session.

To fix the issue, disable AI Center on that tenant and then re-enable it. Note that you may need to wait a few minutes before you can re-enable AI Center.

 

Failed status when trying to deploy an ML skill


Description

If you cannot successfully deploy a Document Understanding ML skill in AI Center, check the solutions below.

Solution 1

If you are installing Automation Suite offline, double-check that the Document Understanding bundle was downloaded and installed.

The bundle contains the base images (for example, the model library) so that models run properly in AI Center after the ML package is uploaded through the AI Center UI.

For details about installing the Document Understanding bundle, refer to the documentation here and here. To add the Document Understanding bundle, follow the documentation to re-run the Document Understanding bundle installation.

Solution 2

Even if you have installed the Document Understanding bundle for offline installation, another issue might occur along with this error message: modulenotfounderror: no module named 'ocr.release'; 'ocr' is not a package

When creating the Document Understanding OCR ML package in AI Center, remember that it cannot be named ocr or OCR, as that conflicts with a folder inside the package. Make sure to choose another name.

Solution 3

Sometimes, intermittently, Document Understanding model skill deployments can fail with Failed to list deployment or Unknown Error when deploying the model for the first time.

The workaround is to try deploying the model again. The second deployment is faster, because most of the image-building work is done during the first attempt. The first deployment of a Document Understanding ML package takes approximately 1 to 1.5 hours; deploying it again is faster.

 

Migration job fails in ArgoCD


Description

The migration job for Document Understanding fails in ArgoCD.

Solution

Document Understanding requires the Full-Text Search feature to be enabled on the SQL Server. Otherwise, the installation may fail without a clear error message, because the migration job in ArgoCD fails.

 

Handwriting Recognition with the Intelligent Form Extractor not working


Description

Handwriting Recognition with the Intelligent Form Extractor is not working or is working too slowly.

Solution 1

If you are using the Intelligent Form Extractor offline, check before installation that handwriting is enabled in the configuration file, or enable it in ArgoCD.

To double-check, go to ArgoCD > Document Understanding > App Details > du-services.handwritingEnabled (set it to True).

In an offline scenario, the Document Understanding bundle needs to be installed before doing this; otherwise, the ArgoCD sync fails.

Solution 2

Even though handwriting is enabled in the configuration file, you might still face the same issue.

Note that, by default, the maximum number of CPUs each container is allowed to use for handwriting is 2. You may need to adjust the handwriting.max_cpu_per_pod parameter if you have a larger handwriting processing workload. You can update it in the configuration file before installation or update it in ArgoCD.

For more details on how to calculate the parameter value based on your volume, please check the documentation here.

 

Insights-related issues


Navigating to the Insights home page generates a 404


In rare cases, a routing error can occur, resulting in a 404 on the Insights home page. You can fix this by going to the Insights application in ArgoCD and deleting the Insightsprovisioning-vs virtual service. Note that you may have to click Clear filters to show X additional resources in order to see and delete this virtual service.

Looker fails to initialize


During Looker initialization, you might encounter an error stating RuntimeError: Error starting Looker. This error is produced by a Looker pod failure, possibly due to a system failure or loss of power. The issue persists even if you reinitialize Looker.

To solve this issue, delete the persistent volume claim (PVC) and then restart.
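A possible sequence, sketched with hypothetical resource names, is to locate the Looker PVC and the workload that uses it, delete the PVC, and restart the workload so that a fresh volume is provisioned (verify the actual namespace and names in your cluster first):

kubectl get pvc -A | grep -i looker
kubectl -n <looker-namespace> delete pvc <looker-pvc-name>
kubectl -n <looker-namespace> rollout restart statefulset <looker-statefulset-name>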
