
Troubleshooting

This page explains how to fix issues you might encounter when setting up Automation Suite.

Troubleshooting how-tos


How to troubleshoot services during installation


Take the following steps on one of the cluster server nodes:

  1. Obtain Kubernetes access.
# On server nodes:
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

# On agent nodes:
export KUBECONFIG="/var/lib/rancher/rke2/agent/kubelet.kubeconfig"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

# To validate, execute the following command which should not return an error:
kubectl get nodes
  2. Retrieve the ArgoCD password by running the following command:
kubectl get secrets/argocd-admin-password -n argocd --template '{{ .data.password }}' | base64 -d
  3. Connect to ArgoCD:
    a. Navigate to https://alm.<fqdn>:443
    b. Log in using admin as the username and the password obtained at Step 2.

  4. Locate the UiPath Services application as follows:
    a. Using the search bar provided in ArgoCD, type in uipath.

    b. Then open the UiPath application by clicking its card.

    c. Check for the following: Application was not synced due to a failed job/pod.

    d. If the above error exists, take the following steps.

    e. Locate any un-synced components by looking for the red broken heart icon, as shown in the following image.

    f. Open the right-most component (usually pods) and click the Logs tab. The Logs will contain an error message indicating the reason for the pod failure.

    g. Once you resolve any outstanding configuration issues, go back to the home page and click the Sync button on the UiPath application.

 

How to uninstall the cluster


If you experience issues specific to Kubernetes running on the cluster, you can directly uninstall the rke2 cluster.

  1. Depending on your installation profile, run one of the following commands:
    1.1. In an online setup, run the following script with elevated privileges (i.e., sudo) on each node of the cluster. This uninstalls the node.
function remove_rke2_entry_from_exclude() {
  local current_exclude_list new_exclude_list
  YUM_CONF_FILE=$1
  if [[ ! -s "${YUM_CONF_FILE}" ]];
  then
    # File is empty
    return
  fi
  current_exclude_list=$(grep 'exclude=' "${YUM_CONF_FILE}" | tail -1)
  if echo "$current_exclude_list" | grep -q 'rke2-*';
  then
    if [[ -w ${YUM_CONF_FILE} ]];
    then
      new_exclude_list=$(printf '%s\n' "${current_exclude_list//rke2-* /}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-*,/}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-\*/}")
      sed -i "/exclude=.*rke2-\*/d" "${YUM_CONF_FILE}"
      echo "${new_exclude_list}" >> "${YUM_CONF_FILE}"
    else
      error "${YUM_CONF_FILE} file is readonly and contains rke2-* under package exclusion. Please remove the entry for AS to work."
    fi
  fi
}

function enable_rke2_package_upgrade() {
  remove_rke2_entry_from_exclude /etc/dnf/dnf.conf
  remove_rke2_entry_from_exclude /etc/yum.conf
}

enable_rke2_package_upgrade

service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/bin/rke2-killall.sh ]
then
    echo "Running rke2-killall.sh"
    /usr/bin/rke2-killall.sh > /dev/null
else
    echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/bin/rke2-uninstall.sh ]
then
    echo "Running rke2-uninstall.sh"
    /usr/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

crontab -l > backupcron
sed -i '/backupjob/d' backupcron > /dev/null
crontab backupcron > /dev/null
rm -rf backupcron > /dev/null
rm -rfv /usr/bin/backupjob > /dev/null
rm -rfv /etc/rancher/ > /dev/null
umount -l -f /var/lib/rancher/
rm -rfv /var/lib/rancher/* > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
umount -l -f /var/lib/kubelet/
rm -rfv /var/lib/kubelet/* > /dev/null
umount -l -f /datadisk/
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."

    1.2. In an offline setup, run the following script with elevated privileges (i.e., sudo) on each node of the cluster. This uninstalls the node.

function remove_rke2_entry_from_exclude() {
  local current_exclude_list new_exclude_list
  YUM_CONF_FILE=$1
  if [[ ! -s "${YUM_CONF_FILE}" ]];
  then
    # File is empty
    return
  fi
  current_exclude_list=$(grep 'exclude=' "${YUM_CONF_FILE}" | tail -1)
  if echo "$current_exclude_list" | grep -q 'rke2-*';
  then
    if [[ -w ${YUM_CONF_FILE} ]];
    then
      new_exclude_list=$(printf '%s\n' "${current_exclude_list//rke2-* /}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-*,/}")
      new_exclude_list=$(printf '%s\n' "${new_exclude_list//rke2-\*/}")
      sed -i "/exclude=.*rke2-\*/d" "${YUM_CONF_FILE}"
      echo "${new_exclude_list}" >> "${YUM_CONF_FILE}"
    else
      error "${YUM_CONF_FILE} file is readonly and contains rke2-* under package exclusion. Please remove the entry for AS to work."
    fi
  fi
}

function enable_rke2_package_upgrade() {
  remove_rke2_entry_from_exclude /etc/dnf/dnf.conf
  remove_rke2_entry_from_exclude /etc/yum.conf
}

enable_rke2_package_upgrade

service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/local/bin/rke2-killall.sh ]
then
  echo "Running rke2-killall.sh"
  /usr/local/bin/rke2-killall.sh > /dev/null
else
  echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/local/bin/rke2-uninstall.sh ]
then
  echo "Running rke2-uninstall.sh"
  /usr/local/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

crontab -l > backupcron
sed -i '/backupjob/d' backupcron > /dev/null
crontab backupcron > /dev/null
rm -rf backupcron > /dev/null
rm -rfv /usr/bin/backupjob > /dev/null
rm -rfv /etc/rancher/ > /dev/null
umount -l -f /var/lib/rancher/
rm -rfv /var/lib/rancher/* > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
umount -l -f /var/lib/kubelet/
rm -rfv /var/lib/kubelet/* > /dev/null
umount -l -f /datadisk/*
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."
  2. Reboot the node after the uninstall.

🚧

Important!

When uninstalling one of the nodes from the cluster, you must run the following command: kubectl delete node <node_name>. This removes the node from the cluster.
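
For example, a minimal sketch run from a node that still has kubectl access (see the Kubernetes access steps earlier on this page); the node name is a placeholder taken from the get nodes output:
kubectl get nodes
kubectl delete node <node_name>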

 

How to clean up offline artifacts to improve disk space


If you run an offline installation, you typically need a larger disk size due to the offline artifacts that are used.

Once the installation completes, you can remove those local artifacts. Failure to do so can result in unnecessary disk pressure during cluster operations.

On the primary server, where the installation was performed, you can perform a cleanup using the following commands.

  1. Remove all images loaded by podman into the local container storage using the following command:
podman image rm -af
  2. Remove the temporary offline folder specified with the --offline-tmp-folder flag. This parameter defaults to /tmp:
rm -rf /path/to/temp/folder

 

Common Issues


Unable to run an offline installation on RHEL 8.4 OS


Description

The following issues can occur if you are running RHEL 8.4 and performing the offline installation, which requires podman. They are specific to podman and the OS being installed together, and are described below.

Potential issue

  • you cannot install both of the following on the cluster:
    • podman-1.0.0-8.git921f98f.module+el8.3.0+10171+12421f43.x86_64
    • podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
  • package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch requires podman >= 1.3.0, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch

Potential issue

  • package podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires containernetworking-plugins >= 0.8.1-1, but none of the providers can be installed
  • you cannot install both of the following:
    • containernetworking-plugins-0.7.4-4.git9ebe139.module+el8.3.0+10171+12421f43.x86_64
    • containernetworking-plugins-0.9.1-1.module+el8.4.0+10607+f4da7515.x86_64
  • package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires podman = 3.0.1-6.module+el8.4.0+10607+f4da7515, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
    (try to add --allowerasing to command line to replace conflicting packages or --skip-broken to skip uninstallable packages or --nobest to use not only best candidate packages)

Solution

You need to remove the current version of podman and allow Automation Suite to install the required version.

  1. Remove the current version of podman using the yum remove podman command (see the example after this list).

  2. Re-run the installer after removing the current version; it should install the correct version.
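
For example, a minimal sketch of the removal step (an assumption: you have sudo access on the node):
sudo yum remove -y podman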

 

Offline installation fails because of missing binary


Description

During offline installation, at the fabric stage, the execution fails with the following error message:

Error: overlay: can't stat program "/usr/bin/fuse-overlayfs": stat /usr/bin/fuse-overlayfs: no such file or directory

Solution

You need to remove the line containing the mount_program key from the podman configuration /etc/containers/storage.conf.
Ensure you remove the line rather than comment it out.
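
A minimal sketch of this edit using sed (an assumption: you have sudo access; back up the file first and verify the result before re-running the installer):
sudo cp /etc/containers/storage.conf /etc/containers/storage.conf.bak
sudo sed -i '/mount_program/d' /etc/containers/storage.conf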

 

Failure to get the sandbox image


Description

You can receive a specific error message when trying to get the following sandbox image: index.docker.io/rancher/pause:3.2

This can happen in an offline installation.

Solution

Restart either rke2-server or rke2-agent, depending on whether the machine the pod is scheduled on is a server or an agent node.

To check which node the pod is scheduled on, run kubectl -n <namespace> get pods -o wide.

# If machine is a Master node
systemctl restart rke2-server
# If machine is an Agent Node
systemctl restart rke2-agent

 

SQL connection string validation error


Description

You might receive an error relating to the connection strings as follows:

Sqlcmd: Error: Microsoft Driver 17 for SQL Server :
Server—tcp : <connection string>
Login failed for user

This error appears even though all credentials are correct, because the connection string validation failed.

Solution

Make sure the connection string has the following structure:

Server=<Sql server host name>;User Id=<user_name for sql server>;Password=<Password>;Initial Catalog=<database name>;Persist Security Info=False;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;Max Pool Size=100;

📘

Note:

User Id is case-sensitive.
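
To verify the credentials and database outside the installer, a minimal sketch using sqlcmd (an assumption: the sqlcmd client is installed on the machine) is:
sqlcmd -S <Sql server host name> -U <user_name for sql server> -P <Password> -d <database name> -Q "SELECT 1"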

 

Pods not showing in ArgoCD UI


Description

Occasionally, the ArgoCD UI does not show pods, but only displays applications and the corresponding deployments.

When clicking on any of the deployments, the following error is displayed: Unable to load data: EOF.

Solution

You can fix this issue by deleting all Redis replicas from the ArgoCD namespace and waiting for them to come back up.

kubectl -n argocd delete pod argocd-redis-ha-server-0 argocd-redis-ha-server-1 argocd-redis-ha-server-2

# Wait for all 3 pods to come back up
kubectl -n argocd get pods | grep argocd-redis-ha-server

 

Certificate issue in offline installation


Description

You might get an error that the certificate is signed by an unknown authority.

Error: failed to do request: Head "https://sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com:30071/v2/helm/audit-service/blobs/sha256:09bffbc520ff000b834fe1a654acd089889b09d22d5cf1129b0edf2d76554892": x509: certificate signed by unknown authority

Solution

Both the rootCA and the server certificates need to be in the trusted store on the machine.

To investigate, execute the following commands:

find /etc/pki/ca-trust/source{,/anchors} -maxdepth 1 -not -type d -exec ls -1 {} +
/etc/pki/ca-trust/source/anchors/rootCA.crt
/etc/pki/ca-trust/source/anchors/server.crt

The provided certificates need to be in the output of those commands.

Alternatively, execute the following command:

openssl x509 -in /etc/pki/ca-trust/source/anchors/server.crt -text -noout

Ensure that the fully qualified domain name is present in the Subject Alternative Name from the output.

X509v3 Subject Alternative Name:
                DNS:sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com

You can update the CA Certificate as follows:

update-ca-trust
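
If either certificate is missing from the output of the find command above, copy it into the anchors directory before running update-ca-trust (a sketch; assumes rootCA.crt and server.crt are in the current directory):
sudo cp rootCA.crt server.crt /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust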

 

Error in downloading the bundle


Description

The documentation lists wget as an option for downloading the bundles. Because of the large sizes, the connection may be interrupted and not recover.

Solution

One way to mitigate this could be to switch to a different download tool, such as azcopy (more information here). Run these commands, while updating the bundle URL to match the desired version/bundle combination.

wget https://aka.ms/downloadazcopy-v10-linux -O azcopy.tar.gz
tar -xvf ./azcopy.tar.gz
azcopy_linux_amd64_10.11.0/azcopy copy https://download.uipath.com/service-fabric/0.0.23-private4/sf-0.0.23-private4.tar.gz /var/tmp/sf.tar.gz --from-to BlobLocal

 

Longhorn errors


Rook Ceph or Looker pod stuck in Init state

Description

Occasionally, on node restart, an issue causes the Looker or Rook Ceph pod to get stuck in Init state as the volume required for attaching the PVC to a pod is missing.

Verify if the problem is indeed related to Longhorn by running the following command:

kubectl get events -A -o json | jq -r '.items[] | select(.message != null) | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""))'

If it is related to Longhorn, this command should return a list of pod names affected by the issue. If the command does not return anything, the cause of the problem is different.

Solution

Run the following script to fix the problematic pods if the previous command returns a non-empty output:

#!/bin/bash


function wait_till_rollout() {
    local namespace=$1
    local object_type=$2
    local deploy=$3

    local try=0
    local maxtry=2
    local status="notready"

    while [[ ${status} == "notready" ]]  && (( try != maxtry )) ; do
        kubectl -n "$namespace" rollout status "$deploy" -w --timeout=600s; 
        # shellcheck disable=SC2181
        if [[ "$?" -ne 0 ]]; 
        then
            status="notready"
            try=$((try+1))
        else
            status="ready"
        fi
    done
    if [[ $status == "notready" ]]; then 
        echo "$deploy of type $object_type failed in namespace $namespace. Plz re-run the script once again to verify that it's not a transient issue !!!"
        exit 1
    fi
}

function fix_pv_deployments() {
    for pod_name in $(kubectl get events -A -o json | jq -r '.items[]  | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\"")) | select(.involvedObject.kind == "Pod") | .involvedObject.name + "/" + .involvedObject.namespace' | sort | uniq)
    do
        POD_NAME=$(echo "${pod_name}" | cut -d '/' -f1)
        NS=$(echo "${pod_name}" | cut -d '/' -f2)
        controller_data=$(kubectl -n "${NS}" get po "${POD_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
        [[ $controller_data == "" ]] && echo "Error: Could not determine owner for pod: ${POD_NAME}" >&2 && exit 1
        CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
        CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)
        if [[ $CONTROLLER_KIND == "ReplicaSet" ]]
        then
            controller_data=$(kubectl  -n "${NS}" get "${CONTROLLER_KIND}" "${CONTROLLER_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
            CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
            CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)

            replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.replicas')
            unavailable_replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.unavailableReplicas')

            if [ -n "$unavailable_replicas" ]; then 
                available_replicas=$((replicas - unavailable_replicas))
                if [ $available_replicas -eq 0 ]; then
                    kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas=0
                    sleep 15
                    kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas="$replicas"
                    deployment_name="$CONTROLLER_KIND/$CONTROLLER_NAME"
                    wait_till_rollout "$NS" "deploy" "$deployment_name"
                fi 
            fi
        fi
    done
}

fix_pv_deployments

 

StatefulSet volume attachment error


Pods in RabbitMQ, cattle-monitoring-system, or other StatefulSets are stuck in the Init state.

Description

Occasionally, upon node power failure or during an upgrade, an issue causes the pods in RabbitMQ or cattle-monitoring-system to get stuck in the Init state because the volume required for attaching the PVC to the pod is missing.

Verify if the problem is indeed related to the StatefulSet volume attachment by running the following command:

kubectl -n <namespace> describe pod <pod-name> | grep "cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""

If the issue is related to the StatefulSet volume attachment, the output contains an error message.

Solution

To fix this issue, reboot the node.
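
A minimal sketch of the check and reboot (the namespace and pod name are placeholders):
kubectl -n <namespace> get pod <pod-name> -o wide
# note the NODE column, then on that node:
sudo reboot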

 

Failure to create persistent volumes

Description

Longhorn is successfully installed, but fails to create persistent volumes.

Solution

Verify if the kernel modules are successfully loaded in the cluster by using the command lsmod | grep <module_name>.
Replace <module_name> with each of the kernel modules below:

  • libiscsi_tcp
  • libiscsi
  • iscsi_tcp
  • scsi_transport_iscsi

Load any missing module.
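
For example, a minimal sketch that loads any module from the list that is not already present (an assumption: you have sudo access on the node):
for module in libiscsi_tcp libiscsi iscsi_tcp scsi_transport_iscsi; do
  lsmod | grep -qw "${module}" || sudo modprobe "${module}"
done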

 

rke2-coredns-rke2-coredns-autoscaler pod in CrashLoopBackOff


Description

After a node restart, rke2-coredns-rke2-coredns-autoscaler can go into CrashLoopBackOff. This does not have any impact on Automation Suite.

Solution

Delete the rke2-coredns-rke2-coredns-autoscaler pod that is in CrashLoopBackOff using the following command: kubectl delete pod <pod name> -n kube-system.
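
For example (a sketch; copy the pod name from the get pods output):
kubectl -n kube-system get pods | grep rke2-coredns-rke2-coredns-autoscaler
kubectl delete pod <pod name> -n kube-system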

 

Redis probe failure


Description

Redis probe can fail if the node ID file does not exist. This can happen if the pod is not yet bootstrapped.

There is a recovery job that automatically fixes this issue, and the following steps should not be performed while the job is running.

When a Redis Enterprise cluster loses contact with more than half of its nodes (either because of failed nodes or network split), then the cluster stops responding to client connections. The Pods also fail to rejoin the cluster.

Solution

  1. Delete the Redis cluster and database using the following commands:
kubectl delete redb -n redis-system redis-cluster-db --force --grace-period=0 &
kubectl delete rec -n redis-system redis-cluster --force --grace-period=0 &
kubectl patch redb -n redis-system redis-cluster-db --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"finalizer.redisenterprisedatabases.app.redislabs.com"}]'
kubectl patch rec redis-cluster -n redis-system --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"redbfinalizer.redisenterpriseclusters.app.redislabs.com"}]'
kubectl delete job redis-cluster-db-job -n redis-system
  2. Go to the ArgoCD UI and sync the redis-cluster application.

 

RKE2 server fails to start


Description

The server fails to start. There are a few different reasons for RKE2 not starting properly, which are usually found in the logs.

Solution

Check the logs using the following command:

journalctl -u rke2-server

Possible reasons (based on logs): too many learner members in cluster

Too many etcd servers are added to the cluster, and there are two learner nodes trying to be promoted. More information here: Runtime reconfiguration.

Perform the following:

  1. Under normal circumstances, the node should become a full member if enough time is allowed.
  2. An uninstall-reinstall cycle can be attempted.

Alternatively, this could be caused by a networking problem. Ensure you have configured the machine to enable the necessary ports.
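
For example, a minimal sketch for verifying that the standard RKE2 ports are reachable from the joining node (an assumption based on RKE2 defaults, not this guide: 6443 for the Kubernetes API and 9345 for the RKE2 supervisor):
timeout 3 bash -c "</dev/tcp/<server-node-ip>/6443" && echo "6443 reachable"
timeout 3 bash -c "</dev/tcp/<server-node-ip>/9345" && echo "9345 reachable"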

 

Node draining does not occur for stopped nodes


Description

If a node in the cluster is stopped and its corresponding pods are not rescheduled to available nodes after 15 minutes, run the following script to manually drain the node.

#!/bin/sh

KUBECTL="/usr/local/bin/kubectl"

# Get only nodes which are not drained yet
NOT_READY_NODES=$($KUBECTL get nodes | grep -P 'NotReady(?!,SchedulingDisabled)' | awk '{print $1}' | xargs echo)
# Get only nodes which are still drained
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}' | xargs echo)

echo "Unready nodes that are undrained: $NOT_READY_NODES"
echo "Ready nodes: $READY_NODES"


for node in $NOT_READY_NODES; do
  echo "Node $node not drained yet, draining..."
  $KUBECTL drain --ignore-daemonsets --force --delete-emptydir-data $node
  echo "Done"
done;

for node in $READY_NODES; do
  echo "Node $node still drained, uncordoning..."
  $KUBECTL uncordon $node
  echo "Done"
done;

 

Enable Istio logging


To debug Istio, you need to enable logging. To do that, perform the following steps:

  1. Find the istio-ingressgateway pod by running the following command. Copy the gateway pod name. It should be something like istio-ingressgateway-r4mbx.
kubectl -n istio-system get pods
  2. Open the gateway Pod shell by running the following command.
kubectl exec -it -n istio-system istio-ingressgateway-r4mbx bash
  3. Enable debug level logging by running the following command.
curl -X POST http://localhost:15000/logging?level=debug
  4. Run the following command from a server node.
istioctl_bin=$(find /var/lib/rancher/rke2/ -name "istioctl" -type f -perm -u+x   -print -quit)
if [[ -n ${istioctl_bin} ]]
then
echo "istioctl bin found"
  kubectl -n istio-system get cm istio-installer-base -o go-template='{{ index .data "istio-base.yaml" }}'  > istio-base.yaml
  kubectl -n istio-system get cm istio-installer-overlay  -o go-template='{{ index .data "overlay-config.yaml" }}'  > overlay-config.yaml 
  ${istioctl_bin} -i istio-system install -y -f istio-base.yaml -f overlay-config.yaml --set meshConfig.accessLogFile=/dev/stdout --set meshConfig.accessLogEncoding=JSON 
else
  echo "istioctl bin not found"
fi

 

Secret not found in UiPath namespace


Description

If service installation fails, and checking kubectl -n uipath get pods returns failed pods, take the following steps.

Solution

  1. Check kubectl -n uipath describe pod <pod-name> and look for secret not found.
  2. If the secret is not found, look at the credential manager job logs and see if it failed.
  3. If the credential manager job failed and kubectl get pods -n rook-ceph|grep rook-ceph-tool returns more than one pod, do the following (see the example commands after this list):
    a. Delete the rook-ceph-tool pod that is not in the Running state.
    b. Go to the ArgoCD UI and sync the sfcore application.
    c. Once the job completes, check the credential-manager job logs to confirm that all secrets were created.
    d. Sync the uipath application.
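
A minimal sketch of the commands used in the steps above (pod names are placeholders; the credential-manager-job name and namespace follow the conventions used elsewhere in this guide):
kubectl -n uipath describe pod <pod-name>
kubectl -n uipath-infra logs job/credential-manager-job
kubectl -n rook-ceph get pods | grep rook-ceph-tool
kubectl -n rook-ceph delete pod <pod-name>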

 

Cannot log in after migration


Description

An issue might affect the migration from a standalone product to Automation Suite. It prevents you from logging in, with the following error message being displayed: Cannot find client details.

Solution

To fix this problem, re-sync the uipath app first, and then sync the platform app in ArgoCD.
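
If you prefer the ArgoCD CLI over the UI, a minimal sketch (assumptions: you are already logged in with argocd login, and the applications are named uipath and platform) is:
argocd app sync uipath
argocd app sync platform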

 

ArgoCD login failed


Description

You may fail to log in to ArgoCD when using the admin password, or the installer may fail with an ArgoCD login error.

Solution

To fix this issue, enter your password, create a bcrypt hash of it, and run the commands described below:

password="<enter_your_password>"
bcryptPassword=<generate bcrypt password using link https://www.browserling.com/tools/bcrypt >

# Enter your bcrypt password and run below command
kubectl -n argocd patch secret argocd-secret \
  -p '{"stringData": {
    "admin.password": "<enter you bcryptPassword here>",
    "admin.passwordMtime": "'$(date +%FT%T%Z)'"
  }}'

# Run below commands
argocdInitialAdminSecretPresent=$(kubectl -n argocd get secret argocd-initial-admin-secret --ignore-not-found )
if [[ -n ${argocdInitialAdminSecretPresent} ]]; then
   echo "Start updating argocd-initial-admin-secret"
   kubectl -n argocd patch secret argocd-initial-admin-secret \
   -p "{
      \"stringData\": {
         \"password\": \"$password\"
      }
   }"
fi

argocAdminSecretName=$(kubectl -n argocd get secret argocd-admin-password --ignore-not-found )
if [[ -n ${argocAdminSecretName} ]]; then
   echo "Start updating argocd-admin-password"
   kubectl -n argocd patch secret argocd-admin-password \
   -p "{
      \"stringData\": {
         \"password\": \"$password\"
      }
   }"
fi
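
Alternatively, you can generate the bcrypt hash locally instead of using the website. A minimal sketch using the htpasswd utility (an assumption: the httpd-tools package is installed on the node):
password="<enter_your_password>"
bcryptPassword=$(htpasswd -nbBC 10 "" "$password" | tr -d ':\n' | sed 's/$2y/$2a/')
echo "$bcryptPassword"
# use this value for admin.password in the patch command above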

 

After the Initial Install, ArgoCD App went into Progressing State


Description

Whenever the cluster state deviates from what is defined in the Helm repository, ArgoCD tries to sync the state, and reconciliation happens every minute. Whenever this happens, you can notice that the ArgoCD app is in the Progressing state.

Solution

This is the expected behavior of ArgoCD, and it does not impact the application in any way.

 

Automation Suite requires backlog_wait_time to be set 1


Description

Audit events can cause instability (system freeze) if backlog_wait_time is not set to 1.
For more details, see this issue description.

Solution

If the installer fails with the Automation Suite requires backlog_wait_time to be set 1 error message, take the following steps to set backlog_wait_time to 1.

  1. Set backlog_wait_time to 1 by appending --backlog_wait_time 1 to the /etc/audit/rules.d/audit.rules file (see the example after this list).
  2. Reboot the node.
  3. Validate that the backlog_wait_time value is set to 1 for auditctl by running sudo auditctl -s | grep "backlog_wait_time" on the node.
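
A minimal sketch of the steps above (an assumption: you have sudo access on the node):
echo "--backlog_wait_time 1" | sudo tee -a /etc/audit/rules.d/audit.rules
sudo reboot
# after the node is back up:
sudo auditctl -s | grep backlog_wait_time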

 

Failure to resize objectstore PVC


Description

This issue occurs when the objectstore resize-pvc operation fails with the following error:

Failed resizing the PVC: <pvc name> in namespace: rook-ceph, ROLLING BACK

Solution

To fix this problem, take the following steps:

  1. Run the following script manually:
#!/bin/sh

ROOK_CEPH_OSD_PREPARE=$(kubectl -n rook-ceph get pods | grep rook-ceph-osd-prepare-set | awk '{print $1}')
if [[ -n ${ROOK_CEPH_OSD_PREPARE} ]]; then
    for pod in ${ROOK_CEPH_OSD_PREPARE}; do
    echo "Start deleting rook ceph osd pod $pod .."
    kubectl -n rook-ceph delete pod $pod
    echo "Done"
    done;
fi
  2. Rerun the objectstore resize-pvc command.

 

Failure to upload/download data in object-store (rook-ceph)


Description

This issue may occur when the object store is in a degraded state due to a placement group (PG) inconsistency.
Verify that the problem is indeed related to a rook-ceph PG inconsistency by running the following commands:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin
ROOK_CEPH_TOOLS=$(kubectl -n rook-ceph get pods | grep rook-ceph-tools)
kubectl -n rook-ceph exec -it $ROOK_CEPH_TOOLS -- ceph status

If the problem is related to a rook-ceph PG inconsistency, the output will contain the following messages:

....
....
Possible data damage: X pgs inconsistent
....
....
X active+clean+inconsistent
....
....

Solution

To repair the inconsistent PG, take the following steps:

  1. Exec into the rook-ceph tools pod:
kubectl -n rook-ceph exec -it $ROOK_CEPH_TOOLS -- sh
  2. Trigger the rook-ceph garbage collector process and wait until it completes.
radosgw-admin gc process
  3. Find the list of active+clean+inconsistent PGs:
ceph health detail
# the output of this command looks like the following
# ....
# pg <pg-id> is active+clean+inconsistent, acting ..
# pg <pg-id> is active+clean+inconsistent, acting ..
# ....
#
  4. Trigger a deep scrub on the PGs, one at a time. This command takes a few minutes to run, depending on the PG size.
ceph pg deep-scrub <pg-id>
  5. Watch the scrubbing status:
ceph -w | grep <pg-id>
  6. Check the PG scrub status. If the PG scrub is successful, the PG status should be active+clean+inconsistent.
ceph health detail | grep <pg-id>
  7. Repair the PG:
ceph pg repair <pg-id>
  8. Check the PG repair status. The PG ID should be removed from the active+clean+inconsistent list if the PG is repaired successfully.
ceph health detail | grep <pg-id>
  9. Repeat steps 3 to 8 for the rest of the inconsistent PGs.

 

Failure after certificate update


Description

This issue occurs when the certificate update step fails internally. You may not be able to access Automation Suite or Orchestrator.

Solution

  1. Run the following commands from any server node:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin

kubectl -n uipath rollout restart deployments
  2. Wait for the above command to succeed, and then run the following commands to verify the status of the deployments:
deployments=$(kubectl -n uipath get deployment -o name)
failed=0
for i in $deployments;
do
  kubectl -n uipath rollout status "$i" -w --timeout=600s;
  if [[ "$?" -ne 0 ]];
  then
    echo "$i deployment failed in namespace uipath."
    failed=1
  fi
done
if [[ $failed -eq 0 ]]; then
  echo "All deployments succeeded in namespace uipath."
fi

Once the above commands finish executing, you should be able to access Automation Suite and Orchestrator.

 

Unexpected inconsistency; run fsck manually


While installing or upgrading Automation Suite, if any pods cannot mount the PVC, the following error message is displayed:
UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY

Recovery steps

If you encounter the error above, follow the recovery steps below:

  1. SSH to the system by running the following command:
ssh <user>@<node-ip>
  2. Check the events of the PVC and verify that the issue is related to a PVC mount failure due to a file error. To do this, run the following commands:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
kubectl get events -n mongodb
kubectl get events -n longhorn-system
  3. Check the PVC volume mentioned in the event and run the fsck command on it:
fsck -a <pvc-volume-name>
For example: fsck -a /dev/longhorn/pvc-5abe3c8f-7422-44da-9132-92be5641150a
  4. Delete the failing MongoDB pod so that it properly mounts the PVC:
kubectl delete pod <pod-name> -n mongodb

 

Cluster unhealthy after automated upgrade from 2021.10


During the automated upgrade from Automation Suite 2021.10, the CNI provider is migrated from Canal to Cilium. This operation requires that all nodes are restarted. On rare occasions, one or more nodes might not be successfully rebooted, causing pods running on those nodes to remain unhealthy.

Recovery steps

  1. Identify failed restarts.
    During the Ansible execution, you might see output similar to the following snippet:
TASK [Reboot the servers] ***************************************************************************************************************************

fatal: [10.0.1.6]: FAILED! =>

  msg: 'Failed to connect to the host via ssh: ssh: connect to host 10.0.1.6 port 22: Connection timed out'

Alternatively, browse the logs on the Ansible host machine, located at /var/tmp/uipathctl_<version>/_install-uipath.log. If any failed restarts were identified, execute steps 2 through 4 on all nodes.

  2. Confirm a reboot is needed on each node.
    Connect to each node and run the following commands:
ssh <username>@<ip-address>
iptables-save 2>/dev/null | grep -i cali -c

If the result is not zero, a reboot is needed.

  3. Reboot the node:
sudo reboot
  4. Wait for the node to become responsive (you should be able to SSH to it) and repeat steps 2 through 4 on every other node.

 

Identity Server issues

Setting a timeout interval for the Management portals

Pre-installation, you cannot update the expiration time for the token used to authenticate to the host- and organization-level Management portals. Therefore, user sessions do not time out.

To set a timeout interval for these portals, you can update the accessTokenLifetime property.
The following example sets the timeout interval to 86400 seconds (24 hours):

UPDATE [identity].[Clients] SET AccessTokenLifetime = 86400 WHERE ClientName = 'Portal.OpenId'
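
For example, a minimal sketch for applying this change with sqlcmd (assumptions: sqlcmd is available and the placeholders match your Identity Server database):
sqlcmd -S <sql server host> -d <identity database name> -U <user_name> -P <Password> -Q "UPDATE [identity].[Clients] SET AccessTokenLifetime = 86400 WHERE ClientName = 'Portal.OpenId'"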

 

Kerberos issues


kinit: Cannot find KDC for realm while getting initial credentials

Description

This error might occur during installation (if you have Kerberos authentication enabled) or during the kerberos-tgt-update cron job execution when the UiPath cluster cannot connect to the AD server to obtain the Kerberos ticket for authentication.

Solution

Check the AD domain and ensure it is configured correctly and routable, as follows:

getent ahosts <AD domain> | awk '{print $1}' | sort | uniq

If this command does not return a routable IP address, then the AD domain required for Kerberos authentication is not properly configured.

You need to work with the IT administrators to add the AD domain to your DNS server and make sure this command returns a routable IP address.

 

kinit: Keytab contains no suitable keys for *** while getting initial credentials

Description

This error could be found in the log of a failed job, with one of the following job names: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update.

Solution

Make sure the AD user still exists, is active, and their password was not changed and did not expire. Reset the user's password and regenerate the keytab if needed.
Also make sure to provide the default Kerberos AD user parameter <KERB_DEFAULT_USERNAME> in the following format: HTTP/<Service Fabric FQDN>.
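
To inspect the principals contained in the keytab, you can use klist (a sketch; the keytab path is a placeholder, and this assumes the MIT Kerberos client tools are installed):
klist -kt /path/to/<keytab-file>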

 

GSSAPI operation failed with error: An invalid status code was supplied (Client's credentials have been revoked).

Description

This log could be found when using Kerberos for SQL access and the SQL connection is failing inside services. Similarly, you may see kinit: Client's credentials have been revoked while getting initial credentials in one of the following job names: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update.

Solution

This could be caused by the AD user account used to generate the keytab being disabled. Re-enabling the AD user account should fix the issue.

Alarm received for failed kerberos-tgt-update job

Description

This happens if the UiPath cluster fails to retrieve the latest Kerberos ticket.

Solution

To find the issue, check the log for a failed job whose name starts with kerberos-tgt-update. After you've identified the problem in the log, check the related troubleshooting information in this section and in the Troubleshooting section for configuring Active Directory.

 

SSPI Provider: Server not found in Kerberos database

Solution

Make sure that the correct SPN records are set up in the AD domain controller for the SQL server. For instructions, see SPN formats in the Microsoft SQL Server documentation.
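
For example, on the AD domain controller you can list the SPNs registered for the SQL Server service account with setspn (a sketch run on Windows; the account name is a placeholder):
setspn -L <domain>\<SQL service account>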

 

 

Login failed for user <ADDOMAIN>\<aduser>. Reason: The account is disabled.

Description

This log could be found when using Kerberos for SQL access and the SQL connection is failing inside services.

Solution

This issue could be caused by the AD user losing access to the SQL server. See instructions on how to reconfigure the AD user.

 

Orchestrator-related issues


Orchestrator pod in CrashLoopBackOff or 1/2 running with multiple restarts


Description

If the Orchestrator pod is in CrashLoopBackOff or 1/2 running with multiple restarts, the failure could be related to the authentication keys for the object storage provider, Ceph.

To check if the failure is related to Ceph, run the following commands:

kubectl -n uipath get pod -l app.kubernetes.io/component=orchestrator

If the output of this command is similar to one of the following options, you need to run an additional command.

Option 1:
NAME                            READY   STATUS    RESTARTS   AGE
orchestrator-6dc848b7d5-q5c2q   1/2     Running   2          6m1s

OR 

Option 2
NAME                            READY   STATUS             RESTARTS   AGE
orchestrator-6dc848b7d5-q5c2q   1/2     CrashLoopBackOff   6          16m

Verify if the failure is related to Ceph authentication keys by running the following command:

kubectl -n uipath logs -l app.kubernetes.io/component=orchestrator | grep 'Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden' -o

If the output of the above command contains the string Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden, the failure is due to the Ceph authentication keys.

Solution

Rerun the rook-ceph-configure-script-job and credential-manager jobs using the following commands:

kubectl -n uipath-infra get job "rook-ceph-configure-script-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath-infra get job "credential-manager-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath delete pod -l app.kubernetes.io/component=orchestrator

 

Test Manager-related issues


Test Manager licensing issue


If you were assigned a license while being logged in, your license assignment may not be detected when you open Test Manager.

If this happens, take the following steps:

  1. Navigate to Test Manager.
  2. Log out from the portal.
  3. Log in again.

 

AI Center-related issues


AI Center Skills deployment issues

Sometimes, intermittently, DU Model Skill Deployments can fail with Failed to list deployment or Unknown Error when deploying the model for the first time. The workaround is to try deploying the model again. The second time is faster, as most of the image-building work was done during the first attempt. DU Models take around 1-1.5 hours to deploy the first time, and they deploy faster on subsequent attempts.

In a rare scenario, due to the cluster state, asynchronous operations such as Skill Deployment or Package upload can be stuck for a long time. If a DU Skill deployment takes more than 2-3 hours, try deploying a simpler model (e.g., TemplateModel). If that model also takes more than an hour, the mitigation is to restart the AI Center services with the following commands:

kubectl -n uipath rollout restart deployment ai-deployer-deployment
kubectl -n uipath rollout restart deployment ai-trainer-deployment
kubectl -n uipath rollout restart deployment ai-pkgmanager-deployment
kubectl -n uipath rollout restart deployment ai-helper-deployment
kubectl -n uipath rollout restart deployment ai-appmanager-deployment

Wait for the AI Center pods to be back up by verifying with the following command:

kubectl -n uipath get pods | grep ai-*

All the above pods should be in the Running state, with the container state shown as 2/2.

 

Document Understanding-related issues


Document Understanding not on the left rail of Automation Suite


Description

If Document Understanding cannot be found on the left rail of Automation Suite, note that Document Understanding is currently not a separate application on Automation Suite and is therefore not shown on the left rail.

Solution

The Data Manager component is part of AI Center, so make sure to enable AI Center.

Also, access Form Extractor, Intelligent Form Extractor (including HandwritingRecognition), and Intelligent Keyword Classifier using the following public URLs:

<FQDN>/du_/svc/formextractor
<FQDN>/du_/svc/intelligentforms
<FQDN>/du_/svc/intelligentkeywords

If you get the Your license can not be validated error message when trying to use Intelligent Keyword Classifier, Form Extractor, or Intelligent Form Extractor in Studio, make sure you have entered the right endpoint, and use the API key generated for Document Understanding under License in the Automation Suite installation, not the one from cloud.uipath.com.

 

Failed status when creating a data labeling session


Description

If you are not able to create data labeling sessions on Data Manager in AI Center, take the following steps.

Solution 1

Double-check that Document Understanding is properly enabled. You should have set documentunderstanding.enabled to True in the configuration file before the installation, or you can update it in ArgoCD post-installation.

After doing that, you need to disable and re-enable AI Center on the tenant you wish to use the Data Labeling feature on, or create a new tenant.

Solution 2

If Document Understanding is properly enabled in the configuration file or ArgoCD, sometimes Document Understanding is not enabled for DefaultTenant. This manifests itself as not being able to create data labeling sessions.

To fix this, disable AI Center on the tenant and re-enable it. Note that you might need to wait a few minutes before being able to re-enable it.

 

Failed status when trying to deploy an ML Skill


Description

If you are trying unsuccessfully to deploy a Document Understanding ML Skill on AI Center, check the solutions below.

Solution 1

If you are installing Automation Suite offline, double-check that the Document Understanding bundle has been downloaded and installed.

The bundle includes the base image (e.g., model library) for the models to properly run on AI Center after uploading the ML Packages via AI Center UI.

For details about installing the Document Understanding bundle, refer to the documentation here and here. To add the Document Understanding bundle, follow the documentation to re-run the Document Understanding bundle installation.

Solution 2

Even if you have installed the Document Understanding bundle for offline installation, another issue might occur along with this error message: modulenotfounderror: no module named 'ocr.release'; 'ocr' is not a package.

When creating a Document Understanding OCR ML Package in AI Center, keep in mind that it cannot be named ocr or OCR, which conflicts with a folder in the package. Please make sure to choose another name.

Solution 3

Sometimes, intermittently, Document Understanding Model Skill Deployments can fail with Failed to list deployment or Unknown Error when deploying the model for the first time.

The workaround is to try deploying the model again. The second time, the deployment is faster, as most of the image-building work was done during the first attempt. Document Understanding ML Packages take around 1-1.5 hours to deploy the first time, and they deploy faster on subsequent attempts.

 

Migration job fails in ArgoCD


Description

Migration job fails for Document Understanding in ArgoCD.

Solution

Document Understanding requires the FullTextSearch feature to be enabled on the SQL server. Otherwise, the installation can fail without an explicit error message in this regard, as the migration job fails in ArgoCD.
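
To verify whether Full-Text Search is installed on the SQL Server instance, a minimal sketch using sqlcmd (an assumption: sqlcmd is available) is:
sqlcmd -S <Sql server host name> -U <user_name> -P <Password> -Q "SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled')"
# A result of 1 means Full-Text Search is installed.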

 

Handwriting Recognition with Intelligent Form Extractor not working


Description

Handwriting Recognition with Intelligent Form Extractor is not working or is working too slowly.

Solution 1

If you are using Intelligent Form Extractor offline, check that you have enabled handwriting in the configuration file before installation, or enable it in ArgoCD.

To double-check, go to ArgoCD > Document Understanding > App details > du-services.handwritingEnabled (set it to True).

In an air-gapped scenario, the Document Understanding bundle needs to be installed before doing this, otherwise the ArgoCD sync fails.

Solution 2

Even if handwriting is enabled in the configuration file, you might still face the same issues.

By default, the maximum number of CPUs each container is allowed to use for handwriting is 2. You may need to adjust the handwriting.max_cpu_per_pod parameter if you have a larger handwriting processing workload. You can update it in the configuration file before installation or in ArgoCD.

For more details on how to calculate the parameter value based on your volume, please check the documentation here.

 

Insights-related issues


Navigating to Insights home page generates a 404


Rarely, a routing error can occur and result in a 404 on the Insights home page. You can resolve this by going to the Insights application in ArgoCD and deleting the insightsprovisioning-vs virtual service. Note that you may have to click clear filters to show X additional resources to see and delete this virtual service.
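
If you prefer kubectl over the ArgoCD UI, a minimal sketch (the namespace is a placeholder; find it with the first command) is:
kubectl get virtualservices -A | grep insightsprovisioning
kubectl -n <namespace> delete virtualservice insightsprovisioning-vs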
