
UiPath Automation Suite

The UiPath Automation Suite Guide

Troubleshooting

This page explains how to fix issues you might encounter when setting up Automation Suite.

Troubleshooting how-tos


Automation Suite generates logs you can explore whenever you need to troubleshoot installation errors. Details about issues occurring during installation are saved in a log file in the same directory as the install-uipath.sh script. Each execution of the installer generates a new log file that follows the install-$(date +'%Y-%m-%dT%H_%M_%S').log naming convention, which you can inspect whenever you encounter installation issues.
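
For example, a quick way to locate and inspect the most recent installation log (a minimal sketch, assuming you run it from the directory that contains install-uipath.sh):

# Show the most recent installation log file
ls -t install-*.log | head -n 1
# Search that log for errors
grep -i "error" "$(ls -t install-*.log | head -n 1)"
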
If you want to troubleshoot post-installation errors, use the Support Bundle tool.

 

How to troubleshoot services during installation


Take the following steps on one of the cluster server nodes:

  1. Obtain Kubernetes access.
on server nodes:
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

on agent nodes:
export KUBECONFIG="/var/lib/rancher/rke2/agent/kubelet.kubeconfig"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

# To validate, execute the following command which should not return an error:
kubectl get nodes
  2. Retrieve the ArgoCD password by running the following command:
kubectl get secrets/argocd-admin-password -n argocd --template '{{ .data.password }}' | base64 -d
  3. Connect to ArgoCD:
    a. Navigate to https://alm.<fqdn>/:443
    b. Log in using admin as the username and the password obtained at Step 2.

  4. Locate the UiPath Services application as follows:
    a. Using the search bar provided in ArgoCD, type in uipath.

    b. Then open the UiPath application by clicking its card.

    c. Check whether the following error appears: Application was not synced due to a failed job/pod.

    d. If the error exists, take the following steps.

    e. Locate any un-synced components by looking for the red broken heart icon.

    f. Open the right-most component (usually pods) and click the Logs tab. The Logs will contain an error message indicating the reason for the pod failure.

    g. Once you resolve any outstanding configuration issues, go back to the home page and click the Sync button on the UiPath application.

 

How to uninstall the cluster


If you experience issues specific to Kubernetes running on the cluster, you can directly uninstall the rke2 cluster.

  1. Depending on your installation profile, run one of the following commands:
    1.1. In an online setup, run the following script with elevated privileges (i.e., sudo) on each node of the cluster. This uninstalls the nodes.
function remove_rke2_entry_from_exclude() {
  local current_exclude_list new_exclude_list
  YUM_CONF_FILE=$1
  if [[ ! -s "${YUM_CONF_FILE}" ]];
  then
    # File is empty
    return
  fi
  current_exclude_list=$(grep 'exclude=' "${YUM_CONF_FILE}" | tail -1)
  if echo "$current_exclude_list" | grep -q 'rke2-*';
  then
    if [[ -w ${YUM_CONF_FILE} ]];
    then
      new_exclude_list=$(printf '%s\n' "${current_exclude_list//rke2-* /}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-*,/}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-\*/}")
      sed -i "/exclude=.*rke2-\*/d" "${YUM_CONF_FILE}"
      echo "${new_exclude_list}" >> "${YUM_CONF_FILE}"
    else
      error "${YUM_CONF_FILE} file is readonly and contains rke2-* under package exclusion. Please remove the entry for AS to work."
    fi
  fi
}

function enable_rke2_package_upgrade() {
  remove_rke2_entry_from_exclude /etc/dnf/dnf.conf
  remove_rke2_entry_from_exclude /etc/yum.conf
}

enable_rke2_package_upgrade

service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/bin/rke2-killall.sh ]
then
    echo "Running rke2-killall.sh"
    /usr/bin/rke2-killall.sh > /dev/null
else
    echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/bin/rke2-uninstall.sh ]
then
    echo "Running rke2-uninstall.sh"
    /usr/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

crontab -l > backupcron
sed -i '/backupjob/d' backupcron > /dev/null
crontab backupcron > /dev/null
rm -rf backupcron > /dev/null
rm -rfv /usr/bin/backupjob > /dev/null
rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
umount -l -f /var/lib/rancher/rke2/server/db > /dev/null 2>&1 || true
rm -rfv /var/lib/rancher/* > /dev/null
umount -l -f /var/lib/rancher
rm -rfv /var/lib/rancher/* > /dev/null
while ! rm -rfv /var/lib/kubelet/* > /dev/null; do
  findmnt --list   --submounts  -n -o TARGET  --target /var/lib/kubelet | grep '/var/lib/kubelet/plugins'  | xargs -r umount -f -l
  sleep 5
done
umount -l -f /var/lib/kubelet
rm -rfv /var/lib/kubelet/* > /dev/null
rm -rfv /datadisk/* > /dev/null
umount -l -f /datadisk
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mount /var/lib/rancher
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."

    1.2. In an offline setup, run the following script with elevated privileges (i.e., sudo) on each node of the cluster. This uninstalls the nodes.

function remove_rke2_entry_from_exclude() {
  local current_exclude_list new_exclude_list
  YUM_CONF_FILE=$1
  if [[ ! -s "${YUM_CONF_FILE}" ]];
  then
    # File is empty
    return
  fi
  current_exclude_list=$(grep 'exclude=' "${YUM_CONF_FILE}" | tail -1)
  if echo "$current_exclude_list" | grep -q 'rke2-*';
  then
    if [[ -w ${YUM_CONF_FILE} ]];
    then
      new_exclude_list=$(printf '%s\n' "${current_exclude_list//rke2-* /}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-*,/}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-\*/}")
      sed -i "/exclude=.*rke2-\*/d" "${YUM_CONF_FILE}"
      echo "${new_exclude_list}" >> "${YUM_CONF_FILE}"
    else
      error "${YUM_CONF_FILE} file is readonly and contains rke2-* under package exclusion. Please remove the entry for AS to work."
    fi
  fi
}

function enable_rke2_package_upgrade() {
  remove_rke2_entry_from_exclude /etc/dnf/dnf.conf
  remove_rke2_entry_from_exclude /etc/yum.conf
}

enable_rke2_package_upgrade

service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/local/bin/rke2-killall.sh ]
then
  echo "Running rke2-killall.sh"
  /usr/local/bin/rke2-killall.sh > /dev/null
else
  echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/local/bin/rke2-uninstall.sh ]
then
  echo "Running rke2-uninstall.sh"
  /usr/local/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

crontab -l > backupcron
sed -i '/backupjob/d' backupcron > /dev/null
crontab backupcron > /dev/null
rm -rf backupcron > /dev/null
rm -rfv /usr/bin/backupjob > /dev/null
rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
umount -l -f /var/lib/rancher/rke2/server/db > /dev/null 2>&1 || true
rm -rfv /var/lib/rancher/* > /dev/null
umount -l -f /var/lib/rancher
rm -rfv /var/lib/rancher/* > /dev/null
while ! rm -rfv /var/lib/kubelet/* > /dev/null; do
  findmnt --list   --submounts  -n -o TARGET  --target /var/lib/kubelet | grep '/var/lib/kubelet/plugins'  | xargs -r umount -f -l
  sleep 5
done
umount -l -f /var/lib/kubelet
rm -rfv /var/lib/kubelet/* > /dev/null
rm -rfv /datadisk/* > /dev/null
umount -l -f /datadisk
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mount /var/lib/rancher
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."
  2. Clean up the OSD disk:
for osddisk in $(find /dev/uipath/ceph -maxdepth 1 -mindepth 1 -type l); do
  devName=$(basename "${osddisk}")
  devPath="/dev/${devName}"
  sgdisk --zap-all "${devPath}"
  dd if=/dev/zero of="${devPath}" bs=1M count=100 oflag=direct,dsync
  blkdiscard "${devPath}"
done
ls /dev/mapper/ceph-* | xargs -I% -- dmsetup remove %  
rm -rf /dev/ceph-*
rm -rf /etc/udev/rules.d/99-ceph-raw-osd.rules

  3. Reboot the node after the uninstall.

🚧

Important!

When uninstalling one of the nodes from the cluster, you must run the following command: kubectl delete node <node_name>. This removes the node from the cluster.

 

How to clean up offline artifacts to improve disk space


If you run an offline installation, you typically need a larger disk size due to the offline artifacts that are used.

Once the installation completes, you can remove those local artifacts. Failure to do so can result in unnecessary disk pressure during cluster operations.

On the primary server, where the installation was performed, you can perform a cleanup using the following commands.

  1. Remove all images loaded by podman into the local container storage using the following command:
podman image rm -af
  2. Remove the temporary offline folder used with the --offline-tmp-folder flag. This parameter defaults to /tmp:
rm -rf /path/to/temp/folder

 

Common Issues


Unable to run an offline installation on RHEL 8.4 OS


Description

The following issues can occur if you are running the offline installation, which requires podman, on RHEL 8.4. These issues are specific to podman and the OS being installed together. The two potential issues are described below.

Potential issue

  • you cannot install both of the following on the cluster:
    • podman-1.0.0-8.git921f98f.module+el8.3.0+10171+12421f43.x86_64
    • podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
  • package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch requires podman >= 1.3.0, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch

Potential issue

  • package podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires containernetworking-plugins >= 0.8.1-1, but none of the providers can be installed
  • you cannot install both of the following:
    • containernetworking-plugins-0.7.4-4.git9ebe139.module+el8.3.0+10171+12421f43.x86_64
    • containernetworking-plugins-0.9.1-1.module+el8.4.0+10607+f4da7515.x86_64
  • package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires podman = 3.0.1-6.module+el8.4.0+10607+f4da7515, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
    (try to add --allowerasing to command line to replace conflicting packages or --skip-broken to skip uninstallable packages or --nobest to use not only best candidate packages)

Solution

You need to remove the current version of podman and allow Automation Suite to install the required version.

  1. Remove the current version of podman using the yum remove podman command.

  2. Re-run the installer after removing the current version; it should install the correct version (see the sketch below).
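
A minimal sketch of these two steps, assuming yum is your package manager and the installer script is the install-uipath.sh referenced earlier in this guide:

# Check the currently installed podman version
rpm -q podman
# Remove the current version of podman
sudo yum remove -y podman
# Re-run the installer with the same arguments you used previously
sudo ./install-uipath.sh <your original arguments>
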

 

Offline installation fails because of missing binary


Description

During offline installation, at the fabric stage, the execution fails with the following error message:

Error: overlay: can't stat program "/usr/bin/fuse-overlayfs": stat /usr/bin/fuse-overlayfs: no such file or directory

Solution

You need to remove the line containing the mount_program key from the podman configuration file /etc/containers/storage.conf.
Ensure you remove the line rather than comment it out.
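
A minimal sketch of removing that line, assuming the mount_program key starts the line in /etc/containers/storage.conf; back up the file first:

cp /etc/containers/storage.conf /etc/containers/storage.conf.bak
sed -i '/^\s*mount_program\s*=/d' /etc/containers/storage.conf
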

 

Failure to get the sandbox image


Description

You can receive an error message when trying to get the following sandbox image: index.docker.io/rancher/pause:3.2

This can happen in an offline installation.

Solution

Restart either rke2-server or rke2-agent (depending on whether the machine that the pod is scheduled on is either a server or an agent).

To check which node the pod is scheduled on, run kubectl -n <namespace> get pods -o wide.

# If machine is a Master node
systemctl restart rke2-server
# If machine is an Agent Node
systemctl restart rke2-agent

 

SQL connection string validation error


Description

You might receive an error relating to the connection strings as follows:

Sqlcmd: Error: Microsoft Driver 17 for SQL Server :
Server-tcp : <connection string>
Login failed for user

This error appears even though all credentials are correct. The connection string validation failed.

Solution

Make sure the connection string has the following structure:

Server=<Sql server host name>;User Id=<user_name for sql server>;Password=<Password>;Initial Catalog=<database name>;Persist Security Info=False;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;Max Pool Size=100;

📘

Note:

User Id is case-sensitive.
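
To manually verify the same credentials outside the installer, you can run a quick sqlcmd check. This is a minimal sketch that assumes sqlcmd is installed and the default SQL port 1433; fill in the placeholders with the values from your connection string:

sqlcmd -S tcp:<Sql server host name>,1433 -U '<user_name for sql server>' -P '<Password>' -d '<database name>' -Q "SELECT 1"
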

 

Pods not showing in ArgoCD UI


Description

Occasionally, the ArgoCD UI does not show pods, but only displays applications and their corresponding deployments.

When clicking on any of the deployments, the following error is displayed: Unable to load data: EOF.


Solution

You can fix this issue by deleting all Redis replicas from the argocd namespace and waiting for them to come back up.

kubectl -n argocd delete pod argocd-redis-ha-server-0 argocd-redis-ha-server-1 argocd-redis-ha-server-2

# Wait for all 3 pods to come back up
kubectl -n argocd get pods | grep argocd-redis-ha-server

 

Certificate issue in offline installation


Description

You might get an error that the certificate is signed by an unknown authority.

Error: failed to do request: Head "https://sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com:30071/v2/helm/audit-service/blobs/sha256:09bffbc520ff000b834fe1a654acd089889b09d22d5cf1129b0edf2d76554892": x509: certificate signed by unknown authority

Solution

Both the rootCA and the server certificates need to be in the trusted store on the machine.

To investigate, execute the following commands:

find /etc/pki/ca-trust/source{,/anchors} -maxdepth 1 -not -type d -exec ls -1 {} +

# Expected output:
/etc/pki/ca-trust/source/anchors/rootCA.crt
/etc/pki/ca-trust/source/anchors/server.crt

The provided certificates need to be in the output of those commands.

Alternatively, execute the following command:

openssl x509 -in /etc/pki/ca-trust/source/anchors/server.crt -text -noout

Ensure that the fully qualified domain name is present in the Subject Alternative Name from the output.

X509v3 Subject Alternative Name:
                DNS:sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com

You can update the CA Certificate as follows:

update-ca-trust
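
If the certificates are missing from the trust store, a minimal sketch of adding them, assuming rootCA.crt and server.crt are in your current directory:

# Copy the certificates into the trust anchors directory
cp rootCA.crt server.crt /etc/pki/ca-trust/source/anchors/
# Rebuild the trust store
update-ca-trust
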

 

Error in downloading the bundle


Description

The documentation lists wget as an option for downloading the bundles. Because of the large sizes, the connection may be interrupted and not recover.

Solution

One way to mitigate this could be to switch to a different download tool, such as azcopy (more information here). Run these commands, while updating the bundle URL to match the desired version/bundle combination.

wget https://aka.ms/downloadazcopy-v10-linux -O azcopy.tar.gz
tar -xvf ./azcopy.tar.gz
azcopy_linux_amd64_10.11.0/azcopy copy https://download.uipath.com/service-fabric/0.0.23-private4/sf-0.0.23-private4.tar.gz /var/tmp/sf.tar.gz --from-to BlobLocal

 

Longhorn errors


Rook Ceph or Looker pod stuck in Init state

Description

Occasionally, on node restart, an issue causes the Looker or Rook Ceph pod to get stuck in Init state as the volume required for attaching the PVC to a pod is missing.

Verify if the problem is indeed related to Longhorn by running the following command:

kubectl get events -A -o json | jq -r '.items[] | select(.message != null) | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""))'

If it is related to Longhorn, this command should return a list of pod names affected by the issue. If the command does not return anything, the cause of the problem is different.

Solution

Run the following script to fix the problematic pods if the previous command returns a non-empty output:

#!/bin/bash


function wait_till_rollout() {
    local namespace=$1
    local object_type=$2
    local deploy=$3

    local try=0
    local maxtry=2
    local status="notready"

    while [[ ${status} == "notready" ]]  && (( try != maxtry )) ; do
        kubectl -n "$namespace" rollout status "$deploy" -w --timeout=600s; 
        # shellcheck disable=SC2181
        if [[ "$?" -ne 0 ]]; 
        then
            status="notready"
            try=$((try+1))
        else
            status="ready"
        fi
    done
    if [[ $status == "notready" ]]; then 
        echo "$deploy of type $object_type failed in namespace $namespace. Plz re-run the script once again to verify that it's not a transient issue !!!"
        exit 1
    fi
}

function fix_pv_deployments() {
    for pod_name in $(kubectl get events -A -o json | jq -r '.items[]  | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\"")) | select(.involvedObject.kind == "Pod") | .involvedObject.name + "/" + .involvedObject.namespace' | sort | uniq)
    do
        POD_NAME=$(echo "${pod_name}" | cut -d '/' -f1)
        NS=$(echo "${pod_name}" | cut -d '/' -f2)
        controller_data=$(kubectl -n "${NS}" get po "${POD_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
        [[ $controller_data == "" ]] && error "Error: Could not determine owner for pod: ${POD_NAME}" && exit 1
        CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
        CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)
        if [[ $CONTROLLER_KIND == "ReplicaSet" ]]
        then
            controller_data=$(kubectl  -n "${NS}" get "${CONTROLLER_KIND}" "${CONTROLLER_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
            CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
            CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)

            replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.replicas')
            unavailable_replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.unavailableReplicas')

            if [ -n "$unavailable_replicas" ]; then 
                available_replicas=$((replicas - unavailable_replicas))
                if [ $available_replicas -eq 0 ]; then
                    kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas=0
                    sleep 15
                    kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas="$replicas"
                    deployment_name="$CONTROLLER_KIND/$CONTROLLER_NAME"
                    wait_till_rollout "$NS" "deploy" "$deployment_name"
                fi 
            fi
        fi
    done
}

fix_pv_deployments

 

StatefulSet volume attachment error


Pods in RabbitMQ, cattle-monitoring-system, or other StatefulSets are stuck in the Init state.

Description

Occasionally, upon node power failure or during an upgrade, an issue causes the pods in RabbitMQ or cattle-monitoring-system to get stuck in the Init state because the volume required for attaching the PVC to a pod is missing.

Verify if the problem is indeed related to the StatefulSet volume attachment by running the following command:

kubectl -n <namespace> describe pod <pod-name> | grep "cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""

If the issue is related to the StatefulSet volume attachment, the command returns an error message.

Solution

To fix this issue, reboot the node.

 

Failure to create persistent volumes

Description

Longhorn is successfully installed, but fails to create persistent volumes.

Solution

Verify if the kernel modules are successfully loaded in the cluster by using the command lsmod | grep <module_name>.
Replace <module_name> with each of the kernel modules below:

  • libiscsi_tcp
  • libiscsi
  • iscsi_tcp
  • scsi_transport_iscsi

Load any missing module.
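
For example, a minimal sketch that checks each module and loads any that are missing, assuming the module names map directly to modprobe names:

for module in libiscsi_tcp libiscsi iscsi_tcp scsi_transport_iscsi; do
  if ! lsmod | grep -q "^${module} "; then
    echo "Loading missing kernel module: ${module}"
    sudo modprobe "${module}"
  fi
done
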

 

rke2-coredns-rke2-coredns-autoscaler pod in CrashLoopBackOff


Description

After a node restart, rke2-coredns-rke2-coredns-autoscaler can go into CrashLoopBackOff. This does not have any impact on Automation Suite.

Solution

Delete the rke2-coredns-rke2-coredns-autoscaler pod that is in CrashLoopBackOff using the following command: kubectl delete pod <pod name> -n kube-system.

 

Redis probe failure


Description

Redis probe can fail if the node ID file does not exist. This can happen if the pod is not yet bootstrapped.

There is a recovery job that automatically fixes this issue, and the following steps should not be performed while the job is running.

When a Redis Enterprise cluster loses contact with more than half of its nodes (either because of failed nodes or a network split), the cluster stops responding to client connections. The pods also fail to rejoin the cluster.

Solution

  1. Delete the Redis cluster and database using the following commands:
kubectl delete redb -n redis-system redis-cluster-db --force --grace-period=0 &
kubectl delete rec -n redis-system redis-cluster --force --grace-period=0 &
kubectl patch redb -n redis-system redis-cluster-db --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"finalizer.redisenterprisedatabases.app.redislabs.com"}]'
kubectl patch rec redis-cluster -n redis-system --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"redbfinalizer.redisenterpriseclusters.app.redislabs.com"}]'
kubectl delete job redis-cluster-db-job -n redis-system
  2. Go to the ArgoCD UI and sync the redis-cluster application.

 

RKE2 server fails to start


Description

The server fails to start. There are a few different reasons for RKE2 not starting properly, which are usually found in the logs.

Solution

Check the logs using the following commands:

journalctl -u rke2-server

Possible reasons (based on logs): too many learner members in cluster

Too many etcd servers are added to the cluster, and there are two learner nodes trying to be promoted. More information here: Runtime reconfiguration.

Perform the following:

  1. Under normal circumstances, the node should become a full member if enough time is allowed.
  2. An uninstall-reinstall cycle can be attempted.

Alternatively, this could be caused by a networking problem. Ensure you have configured the machine to enable the necessary ports.

 

Node draining does not occur for stopped nodes


Description

If a node is stopped in a cluster and its corresponding pods are not rescheduled to available nodes after 15 minutes, run the following script to manually drain the node.

#!/bin/sh

KUBECTL="/usr/local/bin/kubectl"

# Get only nodes which are not drained yet
NOT_READY_NODES=$($KUBECTL get nodes | grep -P 'NotReady(?!,SchedulingDisabled)' | awk '{print $1}' | xargs echo)
# Get only nodes which are still drained
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}' | xargs echo)

echo "Unready nodes that are undrained: $NOT_READY_NODES"
echo "Ready nodes: $READY_NODES"


for node in $NOT_READY_NODES; do
  echo "Node $node not drained yet, draining..."
  $KUBECTL drain --ignore-daemonsets --force --delete-emptydir-data $node
  echo "Done"
done;

for node in $READY_NODES; do
  echo "Node $node still drained, uncordoning..."
  $KUBECTL uncordon $node
  echo "Done"
done;

 

Enable Istio logging


To debug Istio, you need to enable logging. To do that, perform the following steps:

  1. Find the istio-ingressgateway pod by running the following command. Copy the gateway pod name. It should be something like istio-ingressgateway-r4mbx.
kubectl -n istio-system get pods
  2. Open the gateway Pod shell by running the following command.
kubectl exec -it -n istio-system istio-ingressgateway-r4mbx bash
  3. Enable debug level logging by running the following command.
curl -X POST http://localhost:15000/logging?level=debug
  4. Run the following command from a server node.
istioctl_bin=$(find /var/lib/rancher/rke2/ -name "istioctl" -type f -perm -u+x   -print -quit)
if [[ -n ${istioctl_bin} ]]
then
echo "istioctl bin found"
  kubectl -n istio-system get cm istio-installer-base -o go-template='{{ index .data "istio-base.yaml" }}'  > istio-base.yaml
  kubectl -n istio-system get cm istio-installer-overlay  -o go-template='{{ index .data "overlay-config.yaml" }}'  > overlay-config.yaml 
  ${istioctl_bin} -i istio-system install -y -f istio-base.yaml -f overlay-config.yaml --set meshConfig.accessLogFile=/dev/stdout --set meshConfig.accessLogEncoding=JSON 
else
  echo "istioctl bin not found"
fi

 

Secret not found in UiPath namespace


Description

If service installation fails, and checking kubectl -n uipath get pods returns failed pods, take the following steps.

Solution

  1. Check kubectl -n uipath describe pod <pod-name> and look for secret not found.
  2. If the secret is not found, look for the credential manager job logs and see if it failed.
  3. If the credential manager job failed and kubectl get pods -n rook-ceph|grep rook-ceph-tool returns more than one pod, do the following (see the command sketch after this list):
    a. Delete the rook-ceph-tool pod that is not running.
    b. Go to the ArgoCD UI and sync the sfcore application.
    c. Once the job completes, check whether all secrets are created in the credential-manager job logs.
    d. Sync the uipath application.
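
A minimal sketch of the checks above as shell commands; the pod names are placeholders you need to adapt to your cluster:

# Look for a "secret not found" message on the failing pod
kubectl -n uipath describe pod <pod-name> | grep -i "secret"
# Check the credential manager job logs
kubectl -n uipath-infra logs job/credential-manager-job
# List the rook-ceph-tools pods; if more than one is returned, delete the one that is not Running
kubectl -n rook-ceph get pods | grep rook-ceph-tool
kubectl -n rook-ceph delete pod <non-running-rook-ceph-tool-pod-name>
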

 

Cannot log in after migration


Description

An issue might affect the migration from a standalone product to Automation Suite. It prevents you from logging in, with the following error message being displayed: Cannot find client details.

Solution

To fix this problem, you need to re-sync the uipath app first, and then sync the platform app in ArgoCD.
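
If you prefer the ArgoCD CLI over the UI, a minimal sketch of the same sequence, assuming the argocd CLI is installed and logged in and that the applications are named uipath and platform as they appear in ArgoCD:

argocd app sync uipath
argocd app sync platform
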

 

ArgoCD login failed


Description

You may fail to log in to ArgoCD when using the admin password, or the installer may fail with an ArgoCD login error message.

Solution

To fix this issue, enter your password, generate a bcrypt hash of it, and run the following commands:

password="<enter_your_password>"
bcryptPassword=<generate bcrypt password using link https://www.browserling.com/tools/bcrypt >

# Enter your bcrypt password and run below command
kubectl -n argocd patch secret argocd-secret \
  -p '{"stringData": {
    "admin.password": "<enter you bcryptPassword here>",
    "admin.passwordMtime": "'$(date +%FT%T%Z)'"
  }}'

# Run below commands
argocdInitialAdminSecretPresent=$(kubectl -n argocd get secret argocd-initial-admin-secret --ignore-not-found )
if [[ -n ${argocdInitialAdminSecretPresent} ]]; then
   echo "Start updating argocd-initial-admin-secret"
   kubectl -n argocd patch secret argocd-initial-admin-secret \
   -p "{
      \"stringData\": {
         \"password\": \"$password\"
      }
   }"
fi

argocAdminSecretName=$(kubectl -n argocd get secret argocd-admin-password --ignore-not-found )
if [[ -n ${argocAdminSecretName} ]]; then
   echo "Start updating argocd-admin-password"
   kubectl -n argocd patch secret argocd-admin-password \
   -p "{
      \"stringData\": {
         \"password\": \"$password\"
      }
   }"
fi

 

After the Initial Install, ArgoCD App went into Progressing State


Description

Whenever the cluster state deviates from what is defined in the Helm repository, ArgoCD tries to sync the state, and reconciliation happens every minute. Whenever this happens, you can notice that the ArgoCD application is in the Progressing state.

Solution

This is the expected behavior of ArgoCD, and it does not impact the application in any way.

 

Automation Suite requires backlog_wait_time to be set 1


Description

Audit events can cause instability (system freeze) if backlog_wait_time is not set to 1.
For more details, see this issue description.

Solution

If the installer fails with the Automation Suite requires backlog_wait_time to be set 1 error message, take the following steps to set backlog_wait_time to 1.

  1. Set backlog_wait_time to 1 by appending --backlog_wait_time 1 to the /etc/audit/rules.d/audit.rules file (see the sketch after this list).
  2. Reboot the node.
  3. Validate if backlog_wait_time value is set to 1 for auditctl by running sudo auditctl -s | grep "backlog_wait_time" in the node.
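
A minimal sketch of these steps as shell commands, using the audit rules path referenced above:

# Append the audit rule
echo "--backlog_wait_time 1" | sudo tee -a /etc/audit/rules.d/audit.rules
# Reboot the node
sudo reboot
# After the node comes back up, validate the setting
sudo auditctl -s | grep "backlog_wait_time"
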

 

Failure to resize objectstore PVC


Description

This issue occurs when the objectstore resize-pvc operation fails with the following error:

Failed resizing the PVC: <pvc name> in namespace: rook-ceph, ROLLING BACK

Solution

To fix this problem, take the following steps:

  1. Run the following script manually:
#!/bin/sh

ROOK_CEPH_OSD_PREPARE=$(kubectl -n rook-ceph get pods | grep rook-ceph-osd-prepare-set | awk '{print $1}')
if [[ -n ${ROOK_CEPH_OSD_PREPARE} ]]; then
    for pod in ${ROOK_CEPH_OSD_PREPARE}; do
    echo "Start deleting rook ceph osd pod $pod .."
    kubectl -n rook-ceph delete pod $pod
    echo "Done"
    done;
fi
  2. Rerun the objectstore resize-pvc command.

 

PVC resize does not heal Ceph


Description

If Ceph is unhealthy due to out-of-storage issues, Objectstore PVC resize does not heal it.

Solution

To speed up the Ceph recovery in a non-HA cluster, run the following command:

function set_ceph_pool_config_non_ha() {
  # Return if HA 
  [[ "$(kubectl -n rook-ceph get cephobjectstore rook-ceph  -o jsonpath='{.spec.dataPool.replicated.size}')" -eq 1 ]] || return
  # Set pool size and min_size
  kubectl -n "rook-ceph" exec deploy/rook-ceph-tools – ceph osd pool set  "device_health_metrics" "size" "1" --yes-i-really-mean-it || true
  kubectl -n "rook-ceph" exec deploy/rook-ceph-tools – ceph osd pool set  "device_health_metrics" "min_size" "1" --yes-i-really-mean-it || true
}

 

Failure to upload/download data in object-store (rook-ceph)



Description

This issue may occur when the object store is in a degraded state due to a placement group (PG) inconsistency.
Verify whether the problem is indeed related to a rook-ceph PG inconsistency by running the following commands:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin
ROOK_CEPH_TOOLS=$(kubectl -n rook-ceph get pods | grep rook-ceph-tools)
kubectl -n rook-ceph exec -it $ROOK_CEPH_TOOLS -- ceph status

If the problem is related to a rook-ceph PG inconsistency, the output will contain the following messages:

....
....
Possible data damage: X pgs inconsistent
....
....
X active+clean+inconsistent
....
....

Solution

To repair the inconsistent PG, take the following steps:

  1. Exec into the rook-ceph tools pod:
kubectl -n rook-ceph exec -it $ROOK_CEPH_TOOLS -- sh
  2. Trigger the rook-ceph garbage collector process. Wait until the process is complete.
radosgw-admin gc process
  3. Find a list of active+clean+inconsistent PGs:
ceph health detail
# output of this command be like
# ....
# pg <pg-id> is active+clean+inconsistent, acting ..
# pg <pg-id> is active+clean+inconsistent, acting ..
# ....
#
  4. Trigger a deep scrub on the PGs one at a time. This command takes a few minutes to run, depending on the PG size.
ceph pg deep-scrub <pg-id>
  5. Watch the scrubbing status:
ceph -w | grep <pg-id>
  6. Check the PG scrub status. If the PG scrub is successful, the PG status should be active+clean+inconsistent.
ceph health detail | grep <pg-id>
  7. Repair the PG:
ceph pg repair <pg-id>
  8. Check the PG repair status. The PG ID should be removed from the active+clean+inconsistent list if the PG is repaired successfully.
ceph health detail | grep <pg-id>
  9. Repeat steps 3 to 8 for the rest of the inconsistent PGs.

 

Failure after certificate update


Description

This issue occurs when the certificate update step fails internally. You may not be able to access Automation Suite or Orchestrator.


Solution

  1. Run the following commands from any of the server nodes:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin

kubectl -n uipath rollout restart deployments
  2. Wait for the above command to succeed, and then run the following commands to verify the status of the previous command.
deployments=$(kubectl -n uipath get deployment -o name)
for i in $deployments; 
do 
kubectl -n uipath rollout status "$i" -w --timeout=600s; 
if [[ "$?" -ne 0 ]]; 
then
    echo "$i deployment failed in namespace uipath."
fi
done
echo "All deployments are succeeded in namespace uipath"

Once the above command finishes executing, you should be able to access Automation Suite and Orchestrator.

 

Unexpected inconsistency; run fsck manually


While installing or upgrading Automation Suite, if any pods cannot mount the PVC, the following error message is displayed:
UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY


Recovery steps

If you encounter the error above, follow the recovery steps below:

  1. SSH to the system by running the following command:
ssh <user>@<node-ip>
  2. Check the events of the PVC and verify that the issue is related to the PVC mount failure due to a file error. To do this, run the following commands:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
kubectl get events -n mongodb
kubectl get events -n longhorn-system
  3. Check the PVC volume mentioned in the event and run the fsck command:
fsck -a <pvc-volume-name>
# For example: fsck -a /dev/longhorn/pvc-5abe3c8f-7422-44da-9132-92be5641150a
  4. Delete the failing MongoDB pod to properly mount it to the PVC.
kubectl delete pod <pod-name> -n mongodb

 

Mongo pods in CrashLoopBackOff or pending PVC provisioning after deletion


The Mongo pods might get stuck in a crash loop due to a corrupt PVC. The most probable cause of this issue is an unclean shutdown.
When experiencing this issue, the logs show the following:

Common point must be at least stable timestamp
{"t":{"$date":"2022-05-18T09:37:55.053+00:00"},"s":"W",  "c":"STORAGE",  "id":22271,   "ctx":"initandlisten","msg":"Detected unclean shutdown - Lock file is not empty","attr":{"lockFile":"/data/mongod.lock"}}
    ['currentState.Running' = false]
    ['currentState.IsVCRedistCorrect' = true]
    ['desiredState.ProcessType' != mongos ('desiredState.ProcessType' = mongod)]

Recovery steps

  1. Delete the failing pod. If this solution does not work, continue to the next steps.
kubectl delete pod <pod-name> -n mongodb
  2. Get the name of the corrupt PVC for the failing pods.
kubectl -n mongodb get pvc
  3. Delete the PVC.
kubectl -n mongodb delete pvc <pvc-name>
  4. At this point, the PVC should be auto-synced, and the pod should experience no issues anymore. If auto-provisioning does not happen, you need to perform the operation manually by taking the following steps.

  5. Get the PVC YAML for a healthy node.

kubectl -n mongodb get pvc <pvc-name> -o yaml > pvc.yaml
  6. Edit the name and remove uuids/pvc-ids from the YAML.
  7. Remove the volume name and UID, and rename the PVC to the deleted PVC name.
  8. Apply the PVC:
kubectl -n mongodb apply -f pvc.yaml
  9. The PVC should be provisioned and attached to the pod, and the pod should no longer experience any issues. If the pod does not resync, delete it.

 

MongoDB pod fails to upgrade from 4.4.4-ent to 5.0.7-ent


Description

When upgrading from Automation Suite 2021.10 to 2022.10, the MongoDB pod is stuck in a rolling update and cannot move from version 4.4.4-ent to 5.0.7-ent. Only one pod is available, and it is failing the readiness check.

Solution

To check the MongoDB Agent logs, run the following command:

kubectl get pods -n mongodb
# Use the pod name to get the logs for that pod
kubectl logs <pod-name> -n mongodb -c mongodb-agent
# If the Agent logs contain the following line, the upgrade has not been successful:
#   ['currentState.Version = 4.4.4-ent is equal to desiredState.Version = 5.0.7-ent' = false]

To fix the issue, take the following steps:

  1. Disable auto-sync in ArgoCD. Go to Applications > MongoDB > AppDetails > Summary, and click Disable Auto-Sync.

  2. The MongoDB sts definition is still using the old MongoDB localhost:30071/mongodb/mongodb-enterprise-appdb-database:4.4.4-ent image. Edit the definition to use version 5.0.7-ent. Save it and sync it.

  3. Delete the old MongoDB pod.

  4. The pods should come up with MongoDB version 5.0.7-ent.

 

Upgrade fails due to unhealthy Ceph

Description

When trying to upgrade to a new Automation Suite version, you might see the following error message: Ceph objectstore is not completely healthy at the moment. Inner exception - Timeout waiting for all PGs to become active+clean.

Solution

To fix this upgrade issue, verify if the OSD pods are running and healthy by running the following command:

kubectl -n rook-ceph get pod -l app=rook-ceph-osd  --no-headers | grep -P '([0-9])/\1'  -v
  • If the command does not output any pods, verify if Ceph placement groups (PGs) are recovering or not by running the following command:
function is_ceph_pg_active_clean() {
  local return_code=1
  if kubectl -n rook-ceph exec  deploy/rook-ceph-tools -- ceph status --format json | jq '. as $root | ($root | .pgmap.num_pgs) as $total_pgs | try ( ($root | .pgmap.pgs_by_state[] | select(.state_name == "active+clean").count)  // 0) as $active_pgs | if $total_pgs == $active_pgs then true else false end' | grep -q 'true';then
    return_code=0
  fi
  [[ $return_code -eq 0 ]] && echo "All Ceph Placement groups(PG) are active+clean"
  if [[ $return_code -ne 0 ]]; then
    echo "All Ceph Placement groups(PG) are not active+clean. Please wait for PGs to become active+clean"
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg dump --format json | jq -r '.pg_map.pg_stats[] | select(.state!="active+clean") | [.pgid, .state] | @tsv'
  fi
  return "${return_code}"
}
# Execute the function multiple times to get updated ceph PG status
is_ceph_pg_active_clean

📘

If none of the affected Ceph PGs recover even after waiting for more than 30 minutes, raise a ticket with UiPath Support.

  • If the command outputs pod(s), you must first fix the issue affecting them:
    • If a pod is stuck in Init:0/4, it could be a PV provider (Longhorn) issue. To debug this issue, raise a ticket with UiPath Support.
    • If a pod is in CrashLoopBackOff, fix the issue by running the following command:
function cleanup_crashing_osd() {
    local restart_operator="false"
    local min_required_healthy_osd=1
    local in_osd
    local up_osd
    local healthy_osd_pod_count
    local crashed_osd_deploy
    local crashed_pvc_name

    if ! kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail  | grep 'rook-ceph.rgw.buckets.data' | grep -q 'replicated'; then
        min_required_healthy_osd=2
    fi
    in_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status   -f json  | jq -r '.osdmap.num_in_osds')
    up_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status   -f json  | jq -r '.osdmap.num_up_osds')
    healthy_osd_pod_count=$(kubectl -n rook-ceph get pod -l app=rook-ceph-osd | grep 'Running' | grep -c -P '([0-9])/\1')
    if ! [[ $in_osd -ge $min_required_healthy_osd && $up_osd -ge $min_required_healthy_osd && $healthy_osd_pod_count -ge $min_required_healthy_osd ]]; then
        return
    fi
    for crashed_osd_deploy in $(kubectl -n rook-ceph get pod -l app=rook-ceph-osd  | grep 'CrashLoopBackOff' | cut -d'-' -f'1-4') ; do
        if kubectl -n rook-ceph logs "deployment/${crashed_osd_deploy}" | grep -q '/crash/'; then
            echo "Found crashing OSD deployment: '${crashed_osd_deploy}'"
            crashed_pvc_name=$(kubectl -n rook-ceph get deployment "${crashed_osd_deploy}" -o json | jq -r '.metadata.labels["ceph.rook.io/pvc"]')
            info "Removing crashing OSD deployment: '${crashed_osd_deploy}' and PVC: '${crashed_pvc_name}'"
            timeout 60  kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" || kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" --force --grace-period=0
            timeout 100 kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" || kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" --force --grace-period=0
            restart_operator="true"
        fi
    done
    if [[ $restart_operator == "true" ]]; then
        kubectl -n rook-ceph rollout restart deployment/rook-ceph-operator
    fi
    return 0
}
# Execute the cleanup function
cleanup_crashing_osd

After fixing the crashing OSD, verify if PGs are recovering or not by running the following command:

is_ceph_pg_active_clean

 

Cluster unhealthy after automated upgrade from 2021.10


During the automated upgrade from Automation Suite 2021.10, the CNI provider is migrated from Canal to Cilium. This operation requires that all nodes are restarted. On rare occasions, one or more nodes might not be successfully rebooted, causing pods running on those nodes to remain unhealthy.

Recovery steps

  1. Identify failed restarts.
    During the Ansible execution, you might see output similar to the following snippet:
TASK [Reboot the servers] ***************************************************************************************************************************

fatal: [10.0.1.6]: FAILED! =>

  msg: 'Failed to connect to the host via ssh: ssh: connect to host 10.0.1.6 port 22: Connection timed out'

Alternatively, browse the logs on the Ansible host machine, located at /var/tmp/uipathctl_<version>/_install-uipath.log. If any failed restarts were identified, execute steps 2 through 4 on all nodes.

  2. Confirm a reboot is needed on each node.
    Connect to each node and run the following command:
ssh <username>@<ip-address>
iptables-save 2>/dev/null | grep -i cali -c

If the result is not zero, a reboot is needed.

  3. Reboot the node:
sudo reboot
  4. Wait for the node to become responsive (you should be able to SSH to it) and repeat steps 2 through 4 on every other node.

 

First installation fails during Longhorn setup


On rare occasions, if the first attempt to install Longhorn fails, subsequent retries might throw a Helm-specific error: Error: UPGRADE FAILED: longhorn has no deployed releases.

Recovery steps

Remove the Longhorn Helm release before retrying the installation by running the following command:

/opt/UiPathAutomationSuite/<version>/bin/helm uninstall longhorn --namespace longhorn-system

 

Automation Suite not working after OS upgrade


Description

After an OS upgrade, Ceph OSD pods can sometimes get stuck in CrashLoopBackOff state. This issue causes Automation Suite not to be accessible.

Solution

  1. Check the state of the pods by running the following command:
kubectl -n rook-ceph get pods
  2. If any of the pods in the previous output are in CrashLoopBackOff, recover them by running the following commands:
OSD_PODS=$(kubectl -n rook-ceph get deployment -l app=rook-ceph-rgw --no-headers | awk '{print $1}')
kubectl -n rook-ceph rollout restart deploy $OSD_PODS
  3. Wait for approximately 5 minutes for the pods to be in the running state again, and check their status by running the following command:
kubectl -n rook-ceph get pods

 

Azure disk not marked as SSD


Description

A known Azure issue incorrectly shows the rotational flag as enabled for the Azure SSD disk. As a result, the Azure disk is not marked as SSD, and the following error occurs when trying to configure the Objectstore disk:

ERROR][2022-08-18T05:26:35+0000]: Rotational device: '/dev/sdf' is not recommended for ceph OSD

Solution

To fix this issue and mark the Azure SSD disk as non-rotational, run the following commands:

echo "0" > "/sys/block/{raw_device_name}/queue/rotational"
echo "KERNEL==\"${raw_device_name}\", ATTR{queue/rotational}=\"0\"" >> "/etc/udev/rules.d/99-ceph-raw-osd-mark-ssd.rules"
udevadm control --reload
udevadm trigger
For example:
echo "0" > "/sys/block/sdf/queue/rotational"
echo "KERNEL==\"sdf\", ATTR{queue/rotational}=\"0\"" >> "/etc/udev/rules.d/99-ceph-raw-osd-mark-ssd.rules"
udevadm control --reload
udevadm trigger

 

Antivirus causing installation issues


Description

Using an antivirus can cause Automation Suite installation issues.

Solution

To fix this issue, add the following folders to the antivirus allowlist:

  • /var/lib/rancher
  • /var/lib/kubelet
  • /opt/UiPathAutomationSuite
  • /datadisk
  • /var/lib/rancher/rke2/server/db
  • /var/lib/longhorn

 

Unhealthy services after cluster restore or rollback

Description


Following a cluster restore or rollback, AI Center, Orchestrator, Platform, Document Understanding or Task Mining might be unhealthy, with the RabbitMQ pod logs showing the following error:

kubectl -n rabbitmq logs rabbitmq-server-0
2022-10-29 07:38:49.146614+00:00 [info] <0.9223.362> accepting AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672)
2022-10-29 07:38:49.147411+00:00 [info] <0.9223.362> Connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#77049094:2100
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> Error on AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:49.147922+00:00 [info] <0.9223.362> closing AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672 - rabbitConnectionFactory#77049094:2100)
2022-10-29 07:38:55.818447+00:00 [info] <0.9533.362> accepting AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672)
2022-10-29 07:38:55.821662+00:00 [info] <0.9533.362> Connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#2100d047:4057
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> Error on AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:55.822447+00:00 [info] <0.9533.362> closing AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672 - rabbitConnectionFactory#2100d047:4057)

Solution

To fix the issue, take the following steps:

  1. Check if some or all RabbitMQ pods are in CrashLoopBackOff state due to the Mnesia table data write issue. If all pods are running, skip to step 2. If some pods are in CrashLoopBackOff state, take the following sub-steps:
    a. Identify which RabbitMQ pods are stuck in CrashLoopBackOff state, and check the RabbitMQ CrashLoopBackOff pod logs:
kubectl -n rabbitmq get pods
kubectl -n rabbitmq logs <CrashLoopBackOff-Pod-Name>
    b. Check the output of the previous commands. If the issue is related to the Mnesia table data write, you should see an error message similar to the following:
Mnesia('[email protected]'): ** ERROR ** (could not write core file: eacces)
 ** FATAL ** Failed to merge schema: Bad cookie in table definition rabbit_user_permission: '[email protected]' = {cstruct,rabbit_user_permission,set,[],['[email protected]','[email protected]','[email protected]'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667351034020261908,-576460752303416575,1},'[email protected]'},{{4,0},{'[email protected]',{1667,351040,418694}}}}, '[email protected]' = {cstruct,rabbit_user_permission,set,[],['[email protected]'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667372429216834387,-576460752303417087,1},'[email protected]'},{{2,0},[]}}
    c. To fix the issue, take the following steps:
    1. Find the number of RabbitMQ replicas:
      rabbitmqReplicas=$(kubectl -n rabbitmq get rabbitmqcluster rabbitmq -o json | jq -r '.spec.replicas')
    2. Scale down the RabbitMQ replicas:
      kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": 0}}" --type=merge
      kubectl -n rabbitmq scale sts rabbitmq-server --replicas=0
    3. Wait until all RabbitMQ pods are terminated:
      kubectl -n rabbitmq get pod
    4. Find and delete the PVC of the RabbitMQ pod that is stuck in CrashLoopBackOff state:
      kubectl -n rabbitmq get pvc
      kubectl -n rabbitmq delete pvc <crashloopbackupoff_pod_pvc_name>
    5. Scale up the RabbitMQ replicas:
      kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": $rabbitmqReplicas}}" --type=merge
    6. Check if all RabbitMQ pods are healthy:
      kubectl -n rabbitmq get pod
  2. Delete the users in RabbitMQ:
kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl  list_users -s --formatter json | jq '.[]|.user' | grep -v default_user | xargs -I{} kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl delete_user {}
  3. Delete RabbitMQ application secrets in the UiPath namespace:
kubectl -n uipath get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n uipath delete secret {}
  4. Delete RabbitMQ application secrets in the RabbitMQ namespace:
kubectl -n rabbitmq get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n rabbitmq delete secret {}
  5. Sync the sfcore application via ArgoCD and wait for the sync to complete.
  6. Perform a rollout restart on all applications in the UiPath namespace:
kubectl -n uipath rollout restart deploy

 

 

rke2 not getting started due to space issue


Description

When upgrading to Automation Suite 2022.10 and migrating your data to Azure external storage, rke2 might experience some issues. Specifically, rke2 might not get started and might fail with the following error message: Failed to reconcile with temporary etcd: write /var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db: no space left on device.

Solution

To fix this issue, take the following steps:

  1. Stop the SSSD service:
systemctl stop sssd
  2. Remove the SSSD logs:
rm /var/log/sssd/*
  3. Start the SSSD service:
systemctl start sssd
  4. We recommend changing the log rotation policy for the SSSD service from weekly to daily (see the sketch after this list).

  5. If the error message persists, try rebooting the host.
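
A minimal sketch of switching the SSSD log rotation from weekly to daily, assuming your distribution ships a logrotate configuration at /etc/logrotate.d/sssd:

# Back up the existing logrotate configuration for SSSD
cp /etc/logrotate.d/sssd /etc/logrotate.d/sssd.bak
# Change the rotation frequency from weekly to daily
sed -i 's/weekly/daily/' /etc/logrotate.d/sssd
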

 

Identity Server issues

Setting a timeout interval for the Management portals


Pre-installation, you cannot update the expiration time for the token used to authenticate to the host- and organization-level Management portals. Therefore user sessions do not time out.

To set a timeout interval for these portals, update the accessTokenLifetime property.
The example below sets the timeout interval to 86400 seconds (24 hours):

UPDATE [identity].[Clients] SET AccessTokenLifetime = 86400 WHERE ClientName = 'Portal.OpenId'

 

Kerberos issues


kinit: Cannot find KDC for realm while getting initial credentials

Description

This error might occur during installation (if you have Kerberos authentication enabled) or during the kerberos-tgt-update cron job execution when the UiPath cluster cannot connect to the AD server to obtain the Kerberos ticket for authentication.

Solution

Check the AD domain and ensure it is configured correctly and routable, as follows:

getent ahosts <AD domain> | awk '{print $1}' | sort | uniq

If this command does not return a routable IP address, then the AD domain required for Kerberos authentication is not properly configured.

You need to work with the IT administrators to add the AD domain to your DNS server and make sure this command returns a routable IP address.

 

kinit: Keytab contains no suitable keys for *** while getting initial credentials

Description

This error could be found in the log of a failed job, with one of the following job names: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update.

Solution

Make sure the AD user still exists, is active, and their password was not changed and did not expire. Reset the user's password and regenerate the keytab if needed.
Also make sure to provide the default Kerberos AD user parameter <KERB_DEFAULT_USERNAME> in the following format: HTTP/<Service Fabric FQDN>.
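
To check whether the keytab can still obtain a ticket for the expected principal, a minimal sketch; the keytab path is a placeholder, and the principal format follows the requirement above:

# List the principals stored in the keytab
klist -kt /path/to/krb5.keytab
# Try to obtain an initial ticket using the keytab
kinit -kt /path/to/krb5.keytab HTTP/<Service Fabric FQDN>
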

 

GSSAPI operation failed with error: An invalid status code was supplied (Client's credentials have been revoked).

Description

This log could be found when using Kerberos for SQL access, and the SQL connection is failing inside services. Similarly, you may see kinit: Client's credentials have been revoked while getting initial credentials in one of the following job names: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update.

Solution

This could be caused by the AD user account used to generate the keytab being disabled. Re-enabling the AD user account should fix the issue.

Alarm received for failed kerberos-tgt-update job

Description

This happens if the UiPath cluster fails to retrieve the latest Kerberos ticket.

Solution

To find the issue, check the log for a failed job whose name starts with kerberos-tgt-update. After you've identified the problem in the log, check the related troubleshooting information in this section and in the Troubleshooting section for configuring Active Directory.

 

SSPI Provider: Server not found in Kerberos database

Solution

Make sure that the correct SPN records are set up in the AD domain controller for the SQL server. For instructions, see SPN formats in the Microsoft SQL Server documentation.

 

 

Login failed for user <ADDOMAIN>\<aduser>. Reason: The account is disabled.

Description

This log could be found when using Kerberos for SQL access, and SQL connection is failing inside services.

Solution

This issue could be caused by the AD user losing access to the SQL server. See instructions on how to reconfigure the AD user.

 

Orchestrator-related issues


Orchestrator pod in CrashLoopBackOff or 1/2 running with multiple restarts


Description

If the Orchestrator pod is in CrashLoopBackOff or 1/2 running with multiple restarts, the failure could be related to the authentication keys for the object storage provider, Ceph.

To check if the failure is related to Ceph, run the following commands:

kubectl -n uipath get pod -l app.kubernetes.io/component=orchestrator

If the output of this command is similar to one of the following options, you need to run an additional command.

Option 1:
NAME                            READY   STATUS    RESTARTS   AGE
orchestrator-6dc848b7d5-q5c2q   1/2     Running   2          6m1s

OR 

Option 2
NAME                            READY   STATUS             RESTARTS   AGE
orchestrator-6dc848b7d5-q5c2q   1/2     CrashLoopBackOff   6          16m

Verify if the failure is related to Ceph authentication keys by running the following command:

kubectl -n uipath logs -l app.kubernetes.io/component=orchestrator | grep 'Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden' -o

If the output of the above command contains the string Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden, the failure is due to the Ceph authentication keys.

Solution

Rerun the rook-ceph-configure-script-job and credential-manager jobs using the following commands:

kubectl -n uipath-infra get job "rook-ceph-configure-script-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath-infra get job "credential-manager-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath delete pod -l app.kubernetes.io/component=orchestrator

 

Test Manager-related issues


Test Manager licensing issue


If you were assigned a license while being logged in, your license assignment may not be detected when you open Test Manager.

If this happens, take the following steps:

  1. Navigate to Test Manager.
  2. Log out from the portal.
  3. Log in again.

 

AI Center-related issues


AI Center Skills deployment issues

Sometimes, intermittently, DU Model Skill deployments can fail with Failed to list deployment or Unknown Error when you deploy the model for the first time. The workaround is to deploy the model again. The second attempt is faster because most of the image-building work was already done during the first attempt. DU Models take around 1-1.5 hours to deploy the first time and are faster on subsequent deployments.

In rare cases, due to the cluster state, asynchronous operations such as Skill Deployment or Package upload can be stuck for a long time. If a DU Skill deployment takes more than 2-3 hours, try deploying a simpler model (e.g., TemplateModel). If that model also takes more than an hour, the mitigation is to restart the AI Center services with the following commands:

kubectl -n uipath rollout restart deployment ai-deployer-deployment
kubectl -n uipath rollout restart deployment ai-trainer-deployment
kubectl -n uipath rollout restart deployment ai-pkgmanager-deployment
kubectl -n uipath rollout restart deployment ai-helper-deployment
kubectl -n uipath rollout restart deployment ai-appmanager-deployment

Wait for the AI Center pods to be back up by verifying with the following command:

kubectl -n uipath get pods | grep 'ai-'

All the above pods should be in the Running state, with the container readiness shown as 2/2.
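
Alternatively, you can wait for each restarted deployment to finish rolling out; the command blocks until the rollout completes or fails:

# Block until each AI Center deployment has finished rolling out
for d in ai-deployer-deployment ai-trainer-deployment ai-pkgmanager-deployment ai-helper-deployment ai-appmanager-deployment; do
  kubectl -n uipath rollout status deployment "$d"
done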

Disabling streaming logs

To disable log streaming on existing skills, edit the skill deployment and change the LOGS_STREAMING_ENABLED environment variable to false.
You can also add a logsStreamingEnabled global variable with the value set to false in ArgoCD, under the aicenter app details. Make sure to sync the application in ArgoCD after making the change.
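
If you prefer the command line to editing the manifest by hand, a minimal sketch of the same change with kubectl is shown below. It assumes the skill runs as a deployment in the uipath namespace, like the other AI Center components; the deployment name is a placeholder you need to look up first:

# Find the deployment that backs the skill (placeholder filter; adjust it to your skill's name)
kubectl -n uipath get deployments | grep <skill-name>

# Set the environment variable on that deployment (placeholder deployment name)
kubectl -n uipath set env deployment/<skill-deployment-name> LOGS_STREAMING_ENABLED=false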

Unknown error when accessing AI Center

Description

The following error can occur when accessing AI Center:
An unknown error has occurred. (#200)

Solution

To recover from this error, run the final step for installing AI Center. For more information, check the page applicable to your case:

 

Document Understanding-related issues


Document Understanding not on the left rail of Automation Suite


Description

Document Understanding is currently not a separate application in Automation Suite, so it does not appear on the left rail.

Solution

The Data Manager component is part of AI Center, so please make sure to enable AI Center.

Also, access Form Extractor, Intelligent Form Extractor (including HandwritingRecognition), and Intelligent Keyword Classifier using the following public URLs:

<FQDN>/du_/svc/formextractor
<FQDN>/du_/svc/intelligentforms
<FQDN>/du_/svc/intelligentkeywords

If you get the Your license can not be validated error message when trying to use Intelligent Keyword Classifier, Form Extractor, or Intelligent Form Extractor in Studio, make sure you entered the right endpoint, and use the API key generated for Document Understanding under License in the Automation Suite installation, not one taken from cloud.uipath.com.
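
To rule out a connectivity or routing problem before checking the license key, you can verify that the endpoints respond from the machine where Studio or the Robot runs. This is a minimal reachability check; the HTTP status returned for an unauthenticated request may vary:

# Check that the public endpoints are reachable (use -k only if the certificate is not yet trusted)
curl -k -I https://<FQDN>/du_/svc/formextractor
curl -k -I https://<FQDN>/du_/svc/intelligentforms
curl -k -I https://<FQDN>/du_/svc/intelligentkeywords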

 

Failed status when creating a data labeling session


Description

If you are not able to create data labeling sessions on Data Manager in AI Center, take the following steps.

Solution 1

Double-check that Document Understanding is properly enabled. You should have set documentunderstanding.enabled to True in the configuration file before the installation, or you can update it in ArgoCD post-installation.
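
A quick way to confirm what the installer was given is to read the flag back from the configuration file. This is a minimal sketch, assuming your configuration file is named cluster_config.json and that the flag lives at the documentunderstanding.enabled path mentioned above:

# Should print true if Document Understanding was enabled at install time
jq '.documentunderstanding.enabled' cluster_config.json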

After enabling Document Understanding, you need to disable and then re-enable AI Center on the tenant where you want to use the Data Labeling feature, or create a new tenant.

Solution 2

Even if Document Understanding is properly enabled in the configuration file or in ArgoCD, it might not be enabled for the DefaultTenant. This manifests itself as not being able to create data labeling sessions.

To fix this, disable AI Center on the tenant and re-enable it. Note that you might need to wait a few minutes before being able to re-enable it.

 

Failed status when trying to deploy an ML Skill


Description

If you are unsuccessfully trying to deploy a Document Understanding ML Skill on AI Center, check the solutions below.

Solution 1

If you are installing Automation Suite offline, double-check that the Document Understanding bundle has been downloaded and installed.

The bundle includes the base image (e.g., model library) for the models to properly run on AI Center after uploading the ML Packages via AI Center UI.

For details about installing the Document Understanding bundle, refer to the documentation here and here. To add the Document Understanding bundle, follow the documentation to re-run the Document Understanding bundle installation.

Solution 2

Even if you have installed the Document Understanding bundle for offline installation, another issue might occur along with this error message: modulenotfounderror: no module named 'ocr.release'; 'ocr' is not a package.

When creating a Document Understanding OCR ML Package in AI Center, keep in mind that it cannot be named ocr or OCR, which conflicts with a folder in the package. Please make sure to choose another name.

Solution 3

Sometimes, intermittently, Document Understanding Model Skill Deployments can fail with Failed to list deployment or Unknown Error when deploying the model for the first time.

The workaround is to deploy the model again. The second deployment is faster because most of the image-building work was already done during the first attempt. Document Understanding ML Packages take around 1-1.5 hours to deploy the first time and are faster on subsequent deployments.

 

Migration job fails in ArgoCD


Description

Migration job fails for Document Understanding in ArgoCD.

Solution

Document Understanding requires the Full-Text Search (FullTextSearch) feature to be enabled on the SQL Server. Otherwise, the installation can fail without an explicit error message in this regard, with the migration job failing in ArgoCD.
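
You can check whether Full-Text Search is installed on the SQL Server instance before retrying the sync, for example with sqlcmd; the server name and credentials below are placeholders:

# Returns 1 if the Full-Text and Semantic Extractions feature is installed, 0 otherwise
sqlcmd -S <sql-server-host> -U <sql-user> -P <sql-password> -Q "SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS IsFullTextInstalled"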

 

Handwriting Recognition with Intelligent Form Extractor not working


Description

Handwriting Recognition with Intelligent Form Extractor does not work or works too slowly.

Solution 1

If you are using Intelligent Form Extractor offline, make sure that you enabled handwriting in the configuration file before installation, or that you enabled it in ArgoCD.

To double check, please go to ArgoCD > Document Understanding > App details > du-services.handwritingEnabled (set it to True).

In an air-gapped scenario, the Document Understanding bundle needs to be installed before doing this, otherwise the ArgoCD sync fails.

Solution 2

Even if handwriting is enabled in the configuration file, you might still face the same issues.

By default, the maximum number of CPUs each container is allowed to use for handwriting is 2. You may need to adjust the handwriting.max_cpu_per_pod parameter if you have a larger handwriting processing workload. You can update it in the configuration file before installation or in ArgoCD afterwards.

For more details on how to calculate the parameter value based on your volume, please check the documentation here.

 

Insights-related issues


Navigating to Insights home page generates a 404


Rarely, a routing error can occur and result in a 404 on the Insights home page. You can resolve this by going to the Insights application in ArgoCD and deleting the virtual service insightsprovisioning-vs. Note that you may have to click clear filters to show X additional resources to see and delete this virtual service.
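
If you prefer the command line, the same virtual service can also be deleted with kubectl. This is a sketch that assumes the resource lives in the uipath namespace, so confirm the namespace first:

# Locate the virtual service and confirm its namespace
kubectl get virtualservice -A | grep insightsprovisioning-vs

# Delete it from the namespace found above (uipath is assumed here); ArgoCD re-creates it on the next sync of the Insights application
kubectl -n uipath delete virtualservice insightsprovisioning-vs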

Looker fails to initialize


During Looker initialization, you might encounter an error stating RuntimeError: Error starting Looker. The error is produced by a Looker pod failure, possibly caused by a system failure or a loss of power. The issue persists even if you reinitialize Looker.

To solve this issue, delete the Looker persistent volume claim (PVC) and then restart Looker.
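
A sketch of those steps with kubectl is shown below, assuming Looker runs in the uipath namespace. The PVC and pod names are placeholders, so list them first and adjust the commands to what you actually see in your cluster; note that a deleted PVC stays in Terminating state until the pod that mounts it is gone:

# Find the Looker persistent volume claim and the pod that mounts it
kubectl -n uipath get pvc | grep -i looker
kubectl -n uipath get pods | grep -i looker

# Delete the PVC (placeholder name), then delete the pod so it is re-created with a fresh volume
kubectl -n uipath delete pvc <looker-pvc-name>
kubectl -n uipath delete pod <looker-pod-name>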
