
Troubleshooting

This page explains how to fix issues you might encounter when setting up Automation Suite.

Troubleshooting how-tos


How to troubleshoot services during installation


Take the following steps on the main server of the cluster (first server in the case of multi-node).

  1. Obtain Kubernetes access.
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

# To validate, execute the following command which should not return an error:
kubectl get nodes
  2. Retrieve the ArgoCD password by running the following command:
kubectl get secrets/argocd-admin-password -n argocd --template '{{ .data.password }}' | base64 -d
  3. Connect to ArgoCD:
    a. Navigate to https://alm.<fqdn>:443.
    b. Log in using admin as the username and the password obtained at Step 2.

  4. Locate the UiPath Services application as follows:
    a. Using the search bar provided in ArgoCD, type in uipath.

    b. Then open the UiPath application by clicking its card.

    c. Check for the following: Application was not synced due to a failed job/pod.

    d. If the above error exists, take the following steps.

    e. Locate any un-synced components by looking for the red broken heart icon, as shown in the following image.

    f. Open the right-most component (usually pods) and click the Logs tab. The Logs will contain an error message indicating the reason for the pod failure.

    g. Once you resolve any outstanding configuration issues, go back to the home page and click the Sync button on the UiPath application.
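
If you prefer the command line to the ArgoCD UI, the following minimal sketch (assuming the kubectl access configured in Step 1) checks the same things: the sync and health status of the uipath application, and any failed pods in the uipath namespace.

# Read the sync and health status reported on the ArgoCD Application resource.
kubectl -n argocd get application uipath \
  -o jsonpath='{.status.sync.status}{"\t"}{.status.health.status}{"\n"}'

# List pods in the uipath namespace that are neither Running nor Completed,
# to locate the failed job/pod mentioned above.
kubectl -n uipath get pods --field-selector=status.phase!=Running,status.phase!=Succeeded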

 

How to uninstall the cluster


If you experience issues specific to Kubernetes running on the cluster, you can directly uninstall the rke2 cluster.

  • If you are running an online setup, use the following script to uninstall. Make sure to run it with elevated privileges (sudo).
service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/bin/rke2-killall.sh ]
then
    echo "Running rke2-killall.sh"
    /usr/bin/rke2-killall.sh > /dev/null
else
    echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/bin/rke2-uninstall.sh ]
then
    echo "Running rke2-uninstall.sh"
    /usr/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
findmnt -lo target | grep /var/lib/kubelet/ | xargs umount -l -f
rm -rfv /var/lib/kubelet/ > /dev/null
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."
  • If you are running an offline setup, use the following script to uninstall. Make sure to run it with elevated privileges (sudo).
service_exists() {
    local n=$1
    if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
        return 0
    else
        return 1
    fi
}
if service_exists rke2-server; then
  systemctl stop rke2-server
  systemctl disable rke2-server
fi
if service_exists rke2-agent; then
  systemctl stop rke2-agent
  systemctl disable rke2-agent
fi
if [ -e /usr/local/bin/rke2-killall.sh ]
then
  echo "Running rke2-killall.sh"
  /usr/local/bin/rke2-killall.sh > /dev/null
else
  echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/local/bin/rke2-uninstall.sh ]
then
  echo "Running rke2-uninstall.sh"
  /usr/local/bin/rke2-uninstall.sh > /dev/null
else
    echo "File not found: rke2-uninstall.sh"
fi

rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
findmnt -lo target | grep /var/lib/kubelet/ | xargs umount -l -f
rm -rfv /var/lib/kubelet/ > /dev/null
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."

 

How to recreate the databases


If there is an issue with your databases, you can recreate them from scratch directly post-installation.

You can do this by running an SQL command to drop all the DBs and recreate them as follows:

USE [master]
ALTER DATABASE [SF_Orchestrator] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_Orchestrator]
ALTER DATABASE [SF_TestAutomation] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_TestAutomation]
ALTER DATABASE [SF_Update] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_Update]
ALTER DATABASE [SF_Identity] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_Identity]
ALTER DATABASE [SF_Location] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_Location]
ALTER DATABASE [SF_OMS] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_OMS]
ALTER DATABASE [SF_LRM] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_LRM]
ALTER DATABASE [SF_LA] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_LA]
ALTER DATABASE [SF_Audit] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_Audit]
ALTER DATABASE [SF_AOPS] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_AOPS]
ALTER DATABASE [SF_Insights] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_Insights]
ALTER DATABASE [SF_TaskMining] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_TaskMining]
ALTER DATABASE [SF_AutomationHub] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_AutomationHub]
ALTER DATABASE [SF_TestManager] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_TestManager]
ALTER DATABASE [SF_AI_Helper] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_AI_Helper]
ALTER DATABASE [SF_AI_Pkgmanager] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_AI_Pkgmanager]
ALTER DATABASE [SF_AI_Deployer] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_AI_Deployer]
ALTER DATABASE [SF_AI_Trainer] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_AI_Trainer]
ALTER DATABASE [SF_AI_Appmanager] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [SF_AI_Appmanager]
CREATE DATABASE [SF_Orchestrator]
CREATE DATABASE [SF_TestAutomation]
CREATE DATABASE [SF_Update]
CREATE DATABASE [SF_Identity]
CREATE DATABASE [SF_Location]
CREATE DATABASE [SF_OMS]
CREATE DATABASE [SF_LRM]
CREATE DATABASE [SF_LA]
CREATE DATABASE [SF_Audit]
CREATE DATABASE [SF_AOPS]
CREATE DATABASE [SF_Insights]
CREATE DATABASE [SF_TaskMining]
CREATE DATABASE [SF_AutomationHub]
CREATE DATABASE [SF_TestManager]
CREATE DATABASE [SF_AI_Helper]
CREATE DATABASE [SF_AI_Pkgmanager]
CREATE DATABASE [SF_AI_Deployer]
CREATE DATABASE [SF_AI_Trainer]
CREATE DATABASE [SF_AI_Appmanager]
GO

 

How to clean up offline artifacts to improve disk space


If you run an offline installation, you typically need a larger disk size due to the offline artifacts that are used.

Once the installation completes, you can remove those local artifacts. Failure to do so can result in unnecessary disk pressure during cluster operations.

On the primary server, where the installation was performed, you can perform a cleanup using the following commands.

  1. Remove all images loaded by podman into the local container storage using the following command:
podman image rm -af
  2. Remove the temporary offline folder specified with the --offline-tmp-folder flag. This parameter defaults to /tmp:
rm -rf /path/to/temp/folder
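
To confirm how much space was reclaimed, a quick check such as the following can help (podman system df reports podman's local storage usage; the df paths are examples and may differ on your machine):

# Show podman's local image/container storage usage.
podman system df
# Show overall disk usage for the locations typically affected.
df -h /var/lib/containers /var/tmp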

 

Common Issues


Unable to run an offline installation on RHEL 8.4 OS


Description

The following issues can occur if you use RHEL 8.4 and perform the offline installation, which requires podman. These issues are specific to the combination of podman and the OS version. See the two potential issues below.

Potential issue

  • you cannot install both of the following on the cluster:
    • podman-1.0.0-8.git921f98f.module+el8.3.0+10171+12421f43.x86_64
    • podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
  • package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch requires podman >= 1.3.0, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch

Potential issue

  • package podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires containernetworking-plugins >= 0.8.1-1, but none of the providers can be installed
  • you cannot install both of the following:
    • containernetworking-plugins-0.7.4-4.git9ebe139.module+el8.3.0+10171+12421f43.x86_64
    • containernetworking-plugins-0.9.1-1.module+el8.4.0+10607+f4da7515.x86_64
  • package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires podman = 3.0.1-6.module+el8.4.0+10607+f4da7515, but none of the providers can be installed
  • cannot install the best candidate for the job
  • problem with installed package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
    (try to add --allowerasing to command line to replace conflicting packages or --skip-broken to skip uninstallable packages or --nobest to use not only best candidate packages)

Solution

You need to remove the current version of podman and allow Automation Suite to install the required version.

  1. Remove the current version of podman using the yum remove podman command.

  2. Re-run the installer after having removed the current version, which should install the correct version.
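
For reference, a minimal sketch of step 1, run with elevated privileges (the first command only lists what is currently installed so you can confirm the package names):

# List the podman packages currently installed.
rpm -qa 'podman*'
# Remove the current podman version so the installer can bring in the required one.
yum remove -y podman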

 

Offline installation fails because of missing binary


Description

During offline installation, at the fabric stage, the execution fails with the following error message:

Error: overlay: can't stat program "/usr/bin/fuse-overlayfs": stat /usr/bin/fuse-overlayfs: no such file or directory

Solution

You need to remove the line containing the mount_program key from the podman configuration file /etc/containers/storage.conf.
Ensure you remove the line rather than comment it out.
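
A minimal sketch of that change (back up the file first; the sed expression deletes any line that sets the mount_program key):

cp /etc/containers/storage.conf /etc/containers/storage.conf.bak
sed -i '/^\s*mount_program\s*=/d' /etc/containers/storage.conf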

 

Failure to get the sandbox image


Description

You can receive an error message when the following sandbox image cannot be pulled: index.docker.io/rancher/pause3.2

This can happen in an offline installation.

Solution

Restart either rke2-server or rke2-agent, depending on whether the machine the pod is scheduled on is a server or an agent.

To check which node the pod is scheduled on, run kubectl -n <namespace> get pods -o wide.

# If machine is a Master node
systemctl restart rke2-server
# If machine is an Agent Node
systemctl restart rke2-agent

 

SQL connection string validation error


Description

You might receive an error relating to the connection strings as follows:

Sqlcmd: Error: Microsoft Driver 17 for SQL Server :
Server tcp : <connection string>
Login failed for user

This error appears even though all credentials are correct; the connection string itself failed validation.

Solution

Make sure the connection string has the following structure:

Server=<Sql server host name>;User Id=<user_name for sql server>;Password=<Password>;Initial Catalog=<database name>;Persist Security Info=False;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;Max Pool Size=100;

📘

Note:

User Id is case-sensitive.
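
If sqlcmd is available on the machine, you can test the credentials and database from the connection string manually. The values below are the same placeholders used above, and 1433 is assumed to be the default SQL Server port; adjust it if yours differs:

sqlcmd -S tcp:<Sql server host name>,1433 -U "<user_name for sql server>" -P "<Password>" -d "<database name>" -Q "SELECT 1"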

 

Certificate issue in offline installation


Description

You might get an error that the certificate is signed by an unknown authority.

Error: failed to do request: Head "https://sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com:30071/v2/helm/audit-service/blobs/sha256:09bffbc520ff000b834fe1a654acd089889b09d22d5cf1129b0edf2d76554892": x509: certificate signed by unknown authority

Solution

Both the rootCA and the server certificates need to be in the trusted store on the machine.

To investigate, execute the following command:

[root@server ~]# find /etc/pki/ca-trust/source{,/anchors} -maxdepth 1 -not -type d -exec ls -1 {} +
/etc/pki/ca-trust/source/anchors/rootCA.crt
/etc/pki/ca-trust/source/anchors/server.crt

The provided certificates need to appear in the output of this command.

Alternatively, execute the following command:

[root@server ~]# openssl x509 -in /etc/pki/ca-trust/source/anchors/server.crt -text -noout

Ensure that the fully qualified domain name is present in the Subject Alternative Name from the output.

X509v3 Subject Alternative Name:
                DNS:sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com

You can update the CA Certificate as follows:

[root@server ~]# update-ca-trust
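
If one of the certificates is missing from the trusted store, a minimal sketch for adding it (assuming your certificate files are named rootCA.crt and server.crt in the current directory):

# Copy the certificates into the trust anchors directory, then refresh the trust store.
cp rootCA.crt server.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust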

 

Error in downloading the bundle


Description

The documentation lists wget as an option for downloading the bundles. Because of the large bundle sizes, the connection may be interrupted and fail to recover.

Solution

One way to mitigate this is to switch to a different download tool, such as azcopy (more information here). Run the following commands, updating the bundle URL to match the desired version/bundle combination.

wget https://aka.ms/downloadazcopy-v10-linux -O azcopy.tar.gz
tar -xvf ./azcopy.tar.gz
azcopy_linux_amd64_10.11.0/azcopy copy https://download.uipath.com/service-fabric/0.0.23-private4/sf-0.0.23-private4.tar.gz /var/tmp/sf.tar.gz --from-to BlobLocal
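
Alternatively, if you want to keep using wget, resuming an interrupted download instead of restarting it can help; a sketch, using the same bundle URL as above (replace it with the version you need):

# --continue resumes a partially downloaded file; --tries retries on transient failures.
wget --continue --tries=10 https://download.uipath.com/service-fabric/0.0.23-private4/sf-0.0.23-private4.tar.gz -O /var/tmp/sf.tar.gz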

 

MTU configuration errors


Using the default value (1450) when configuring the MTU at the Kubernetes pod level can result in various issues on Azure Virtual Network. To prevent these problems, we recommend setting the pod MTU to 1350.

To do that, run the following commands on server nodes:

mkdir -p /var/lib/rancher/rke2/server/manifests/
cat > /var/lib/rancher/rke2/server/manifests/rke2-canal-config-override.yaml <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-canal
  namespace: kube-system
spec:
  valuesContent: |-
    calico:
      vethuMTU: "1350"
EOF

If you forgot to update the MTU configuration before the installation, you can still configure it post-installation using the following script:

function mtu_rke2_config() {
  mkdir -p /var/lib/rancher/rke2/server/manifests/
  cat > /var/lib/rancher/rke2/server/manifests/rke2-canal-config-override.yaml <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-canal
  namespace: kube-system
spec:
  valuesContent: |-
    calico:
      vethuMTU: "1350"
EOF
}

## Wait until cm MTU changes
function wait_until_cm_updates() {
  local retry_count=${1:-20}
  local sleep_seconds=15
  local veth_mtu

  if [[ $retry_count -lt 0 ]] 
  then
    echo "Timeout while setting MTU config value"
    return 
  fi

  veth_mtu=$(kubectl -n kube-system get cm rke2-canal-config -o go-template='{{ index .data "veth_mtu" }}')
  if [[ $veth_mtu == "" ]] || [[ $veth_mtu == "<no value>" ]]
  then
    echo "Pod MTU value not found"
    return
  elif [[ $veth_mtu -eq 1350 ]] 
  then
    echo "Pod MTU config value set to 1350. Please proceed to restart the rke2-canal pods"
    restart_canal_pod=true
  else
    retry_count=$((retry_count-1))
    echo "retry..."
    sleep ${sleep_seconds}
    wait_until_cm_updates "$retry_count"
  fi
}

function restart_canal_pods(){
  kubectl -n kube-system rollout restart ds/rke2-canal 
  kubectl -n kube-system rollout status ds/rke2-canal  
  restart_all_controllers_pod=true
}

function restart_all_controllers_pods(){
  local all_controllers
  local line
  local ns
  local controller

  all_controllers=$(kubectl get deploy,ds,sts -A --no-headers=true)
  while IFS="\n" read -r line
  do
    ns=$(echo "${line}" | awk '{print $1}')
    controller=$(echo "${line}" | awk '{print $2}')
    kubectl -n "${ns}" rollout restart "${controller}"
  done <<< "${all_controllers}"
}

function main() {
  mtu_rke2_config
  restart_canal_pod=false
  restart_all_controllers_pod=false
  wait_until_cm_updates
  ## restart the rke canal pods
  if [[ $restart_canal_pod == "true" ]]
  then
    restart_canal_pods
  fi

  ## Restart all pods created via controllers deployment , daemonset , statefulset
  if [[ $restart_all_controllers_pod == "true" ]]
  then
    restart_all_controllers_pods
  fi
}

main
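
To verify that the new MTU is in effect after the pods restart, a quick sanity check on a server node is to inspect the Calico/flannel-managed interfaces (interface name prefixes may vary by setup):

# Count the MTU values on cali*/flannel* interfaces; they should report 1350.
ip -o link show | grep -E 'cali|flannel' | grep -o 'mtu [0-9]*' | sort | uniq -c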

 

Longhorn errors


Description

Pods with an RWX PVC attached get stuck in the Init or ContainerCreating state. You can see the error when a specific deployment goes into a degraded state in ArgoCD; the error looks like this:

MountVolume.SetUp failed for volume "pvc-0998313e-ffd9-4b78-9085-2954008da770" : rpc error: code = InvalidArgument desc = volume pvc-0998313e-ffd9-4b78-9085-2954008da770 hasn't been attached yet

The pod gets stuck in the Init/ContainerCreating state because Longhorn is stuck in a loop attempting to detach/attach the volume to the pod.

Solution

Perform the following:
For all commands, please replace the deployment name with the name of the deployment where you noticed the error (e.g., orchestrator) and the namespace with the one that contains the deployment (e.g., uipath).

  1. Identify the number of desired replicas the deployment has.
    kubectl get deployment <deployment name> -n <namespace>
    Under the READY column you should see the ready replicas / desired replicas.
  2. Scale down the deployment to 0, and wait until all pods are deleted:
    kubectl scale deployment <deployment name> --replicas=0 -n <namespace>
  3. Wait for the RWX volumes to detach.
  4. Scale back the deployment:
    kubectl scale deployment <deployment name> --replicas=<desired replicas> -n <namespace>
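
The steps above can be combined into a short script. The deployment and namespace values below are examples only, so replace them with the ones showing the error; the 60-second wait is an approximation of step 3.

# Example values - replace with your own deployment and namespace.
DEPLOYMENT=orchestrator
NAMESPACE=uipath

# Remember the desired replica count so it can be restored afterwards.
REPLICAS=$(kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" -o jsonpath='{.spec.replicas}')
echo "Desired replicas: $REPLICAS"

# Scale down to zero and wait for the rollout to settle.
kubectl -n "$NAMESPACE" scale deployment "$DEPLOYMENT" --replicas=0
kubectl -n "$NAMESPACE" rollout status deployment "$DEPLOYMENT" --timeout=600s

# Give Longhorn time to detach the RWX volumes, then scale back up.
sleep 60
kubectl -n "$NAMESPACE" scale deployment "$DEPLOYMENT" --replicas="$REPLICAS"
kubectl -n "$NAMESPACE" rollout status deployment "$DEPLOYMENT" --timeout=600s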

 

rke2-coredns-rke2-coredns-autoscaler pod in CrashLoopBackOff


Description

After a node restart, rke2-coredns-rke2-coredns-autoscaler can go into CrashLoopBackOff. This does not have any impact on Automation Suite.

Solution

Delete the rke2-coredns-rke2-coredns-autoscaler pod that is in CrashLoopBackOff using the following command: kubectl delete pod <pod name> -n kube-system.
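
For example (the pod name below is a placeholder; the first command shows the actual name in your cluster):

kubectl -n kube-system get pods | grep coredns-autoscaler
kubectl -n kube-system delete pod <pod name>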

 

Redis probe failure


Description

The Redis probe can fail if the node ID file does not exist. This can happen if the pod is not yet bootstrapped.

There is a recovery job that automatically fixes this issue, and the following steps should not be performed while the job is running.

When a Redis Enterprise cluster loses contact with more than half of its nodes (either because of failed nodes or network split), then the cluster stops responding to client connections. The Pods also fail to rejoin the cluster.

Solution

  1. Delete the Redis cluster and database using the following commands:
kubectl delete redb -n redis-system redis-cluster-db --force --grace-period=0 &
kubectl delete rec -n redis-system redis-cluster --force --grace-period=0 &
kubectl patch redb -n redis-system redis-cluster-db --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"finalizer.redisenterprisedatabases.app.redislabs.com"}]'
kubectl patch rec redis-cluster -n redis-system --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"redbfinalizer.redisenterpriseclusters.app.redislabs.com"}]'
kubectl delete job redis-cluster-db-job -n redis-system
  2. Go to the ArgoCD UI and sync the redis-cluster application.

 

RKE2 server fails to start


Description

The server fails to start. There are a few different reasons for RKE2 not starting properly, which are usually found in the logs.

Solution

Check the logs using the following command:

journalctl -u rke2-server

Possible reason (based on logs): too many learner members in cluster

Too many etcd servers are added to the cluster, and there are two learner nodes trying to be promoted. More information here: Runtime reconfiguration.

Perform the following:

  1. Under normal circumstances, the node should become a full member if enough time is allowed.
  2. An uninstall-reinstall cycle can be attempted.

Alternatively, this could be caused by a networking problem. Ensure you have configured the machine to enable the necessary ports.
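
As a quick connectivity check from a joining node toward the first server, something like the following can be used if nc is available (6443 is the Kubernetes API server port and 9345 the RKE2 supervisor port; replace the placeholder address):

nc -zv <first-server-ip> 6443
nc -zv <first-server-ip> 9345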

 

Node draining does not occur for stopped nodes


Description

If a node is stopped in a cluster and its corresponding pods are not rescheduled to available nodes after 15 minutes, run the following script to manually drain the node.

#!/bin/sh

KUBECTL="/usr/local/bin/kubectl"

# Get only nodes which are not drained yet
NOT_READY_NODES=$($KUBECTL get nodes | grep -P 'NotReady(?!,SchedulingDisabled)' | awk '{print $1}' | xargs echo)
# Get only nodes which are still drained
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}' | xargs echo)

echo "Unready nodes that are undrained: $NOT_READY_NODES"
echo "Ready nodes: $READY_NODES"


for node in $NOT_READY_NODES; do
  echo "Node $node not drained yet, draining..."
  $KUBECTL drain --ignore-daemonsets --force --delete-emptydir-data $node
  echo "Done"
done;

for node in $READY_NODES; do
  echo "Node $node still drained, uncordoning..."
  $KUBECTL uncordon $node
  echo "Done"
done;

 

Enable Istio logging


To debug Istio, you need to enable logging. To do that, perform the following steps:

  1. Find the istio-ingressgateway pod by running the following command. Copy the gateway pod name. It should be something like istio-ingressgateway-r4mbx.
kubectl -n istio-system get pods
  2. Open the gateway Pod shell by running the following command.
kubectl exec -it -n istio-system istio-ingressgateway-r4mbx bash
  3. Enable debug-level logging by running the following command.
curl -X POST http://localhost:15000/logging?level=debug
  4. Run the following command from the first server node.
istioctl_bin=$(find /var/lib/rancher/rke2/ -name "istioctl" -type f -perm -u+x -print -quit)
if [[ -n ${istioctl_bin} ]]
then
  echo "istioctl bin found"
  kubectl -n istio-system get cm istio-installer-base -o go-template='{{ index .data "istio-base.yaml" }}' > istio-base.yaml
  kubectl -n istio-system get cm istio-installer-overlay -o go-template='{{ index .data "overlay-config.yaml" }}' > overlay-config.yaml
  ${istioctl_bin} -i istio-system install -y -f istio-base.yaml -f overlay-config.yaml --set meshConfig.accessLogFile=/dev/stdout --set meshConfig.accessLogEncoding=JSON
else
  echo "istioctl bin not found"
fi

 

Secret not found in UiPath namespace


Description

If service installation fails, and checking kubectl -n uipath get pods returns failed pods, take the following steps.

Solution

  1. Check kubectl -n uipath describe pod <pod-name> and look for secret not found.
  2. If the secret is not found, look at the credential manager job logs and check whether it failed.
  3. If the credential manager job failed and kubectl get pods -n rook-ceph | grep rook-ceph-tool returns more than one pod, do the following:
    a. Delete the rook-ceph-tool pod that is not running.
    b. Go to the ArgoCD UI and sync the sfcore application.
    c. Once the job completes, check in the credential-manager job logs whether all secrets were created.
    d. Sync the uipath application.
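
For reference, the checks above map to commands like the following (pod names are placeholders taken from the output of the get commands):

# Step 1: inspect the failed pod for a "secret not found" event.
kubectl -n uipath describe pod <pod-name> | grep -i secret

# Step 3: list the rook-ceph-tool pods and delete the one that is not running.
kubectl get pods -n rook-ceph | grep rook-ceph-tool
kubectl -n rook-ceph delete pod <rook-ceph-tool-pod-that-is-not-running>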

 

Installer cannot connect to ArgoCD to check if password was reset


Description

This issue might occur during the fabric installation. The installer might fail with an error similar to the one below.

appproject.argoproj.io/fabric created
configmap/argocd-cm configured
[INFO] [2021-09-02T09:21:15+0000]: Checking if ArgoCD password was reset, looking for secrets/argocd-admin-password.
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:16+0000]: Secret not found, trying to log in with initial password...1/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:36+0000]: Secret not found, trying to log in with initial password...2/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:56+0000]: Secret not found, trying to log in with initial password...3/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:16+0000]: Secret not found, trying to log in with initial password...4/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:36+0000]: Secret not found, trying to log in with initial password...5/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:56+0000]: Secret not found, trying to log in with initial password...6/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:17+0000]: Secret not found, trying to log in with initial password...7/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:37+0000]: Secret not found, trying to log in with initial password...8/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:57+0000]: Secret not found, trying to log in with initial password...9/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:24:17+0000]: Secret not found, trying to log in with initial password...10/10
[ERROR][2021-09-02T09:24:37+0000]: Failed to log in

Solution 1

Check all the required subdomains and ensure they are configured correctly and are routable as follows:

getent ahosts automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts alm.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts registry.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts monitoring.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts objectstore.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts insights.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq

🚧

Important!

Replace automationsuite.mycompany.com with your cluster FQDN.

If any of the above commands does not return a routable IP address, the corresponding subdomain required for Automation Suite is not configured properly.

📘

Note:

This error is encountered when the DNS is not public.

You need to add the Private DNS Zone (for Azure) or Route 53 (for AWS).

If the above commands return the proper IP addresses, follow the steps below.

Solution 2

  1. Delete the ArgoCD namespace by executing the following commands:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin

kubectl delete namespace argocd
  2. Run the following command to verify:
kubectl get namespace

The argocd namespace should no longer appear in the output of this command.

📘

Note:

Once the ArgoCD namespace is deleted, resume with the installation.

 

Cannot log in after migration


Description

An issue might affect the migration from a standalone product to Automation Suite. It prevents you from logging in, with the following error message being displayed: Cannot find client details.

Solution

To fix this problem, re-sync the uipath app first, and then sync the platform app in ArgoCD.
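
If the argocd CLI is available and logged in, the same two syncs can be triggered from the command line; the application names below mirror the ones mentioned above, so confirm the exact names in the ArgoCD UI first:

argocd app sync uipath
argocd app sync platform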

 

After the Initial Install, ArgoCD App went into Progressing State


Description

Whenever the cluster state deviates from what is defined in the Helm repository, ArgoCD tries to sync the state; reconciliation happens every minute. When this happens, you may notice that the ArgoCD app is in the Progressing state.

Solution

This is the expected behavior of ArgoCD, and it does not impact the application in any way.

Automation Suite requires backlog_wait_time to be set to 1


Description

We have noticed that audit events can cause instability (system freeze) if backlog_wait_time is not set to 1.
See more details here.

Solution

If the installer fails with the "Automation Suite require backlog_wait_time to be set 1" error message, follow the steps below to set backlog_wait_time to 1.

  1. Set backlog_wait_time to 1 by appending --backlog_wait_time 1 to the /etc/audit/rules.d/audit.rules file.
  2. Reboot the node.
  3. Validate that the backlog_wait_time value is set to 1 for auditctl by running sudo auditctl -s | grep "backlog_wait_time" on the node.
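
A minimal sketch of the steps above, run with elevated privileges (back up the rules file before editing it):

# Step 1: append the setting to the audit rules file.
echo "--backlog_wait_time 1" | sudo tee -a /etc/audit/rules.d/audit.rules

# Step 2: reboot the node.
sudo reboot

# Step 3 (after the reboot): confirm the running value.
sudo auditctl -s | grep backlog_wait_time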

Failure to resize objectstore PVC


Description

This issue occurs when the objectstore resize-pvc operation fails with the following error:

Failed resizing the PVC: <pvc name> in namespace: rook-ceph, ROLLING BACK

Solution

To fix this problem, take the following steps:

  1. Run the following script manually:
#!/bin/bash

ROOK_CEPH_OSD_PREPARE=$(kubectl -n rook-ceph get pods | grep rook-ceph-osd-prepare-set | awk '{print $1}')
if [[ -n ${ROOK_CEPH_OSD_PREPARE} ]]; then
    for pod in ${ROOK_CEPH_OSD_PREPARE}; do
        echo "Start deleting rook ceph osd pod $pod .."
        kubectl -n rook-ceph delete pod "$pod"
        echo "Done"
    done
fi
  2. Rerun the objectstore resize-pvc command.

 

Failure after certificate update


Description

This issue occurs when the certificate update step fails internally. You may not be able to access Automation Suite or Orchestrator.


Solution

  1. Run the following commands from any of the server nodes:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin

kubectl -n uipath rollout restart deployments
  2. Wait for the above command to succeed, and then run the following command to verify its status.
deployments=$(kubectl -n uipath get deployment -o name)
for i in $deployments; 
do 
kubectl -n uipath rollout status "$i" -w --timeout=600s; 
if [[ "$?" -ne 0 ]]; 
then
    echo "$i deployment failed in namespace uipath."
fi
done
echo "All deployments are succeeded in namespace uipath"

Once the above command finishes execution, you should be able to access Automation Suite and Orchestrator.

 

AI Center Skills deployment issues


Sometimes, intermittently, DU Model Skill Deployments can fail with Failed to list deployment or Unknown Error when deploying the model for the first time. The workaround is to try deploying the model again. The second attempt is faster because most of the image-building work was done during the first one. DU Models take around 1-1.5 hours to deploy the first time and are faster when deployed again.

In a rare scenario, due to cluster state, asynchronous operations such as Skill Deployment or Package upload can be stuck for a long time. If a DU Skill deployment is taking more than 2-3 hours, try deploying a simpler model (e.g., TemplateModel). If that model also takes more than an hour, the mitigation is to restart the AI Center services with the following commands:

kubectl -n uipath rollout restart deployment ai-deployer-deployment
kubectl -n uipath rollout restart deployment ai-trainer-deployment
kubectl -n uipath rollout restart deployment ai-pkgmanager-deployment
kubectl -n uipath rollout restart deployment ai-helper-deployment
kubectl -n uipath rollout restart deployment ai-appmanager-deployment

Wait for the AI Center pods to be back up by verifying with the following command:

kubectl -n uipath get pods | grep ai-*

All the above pods should be in the Running state, with the container state shown as 2/2.

 

Test Manager licensing issue


If you were assigned a license while you were logged in, your license assignment may not be detected when you open Test Manager.

If this happens, take the following steps:

  1. Navigate to Test Manager.
  2. Log out from the portal.
  3. Log in again.

 

Document Understanding-related issues


 

Document Understanding not on the left rail of Automation Suite


Description

If Document Understanding cannot be found on the left rail of Automation Suite, note that Document Understanding is currently not a separate application in Automation Suite and is therefore not shown on the left rail.

Solution

The Data Manager component is part of AI Center, so please make sure to enable AI Center.

Also, access Form Extractor, Intelligent Form Extractor (including HandwritingRecognition), and Intelligent Keyword Classifier using the public URLs below:

<FQDN>/du_/svc/formextractor
<FQDN>/du_/svc/intelligentforms
<FQDN>/du_/svc/intelligentkeywords

If you get the Your license can not be validated error message when trying to use Intelligent Keyword Classifier, Form Extractor, or Intelligent Form Extractor in Studio, make sure you have entered the right endpoint, and use the API key that you generated for Document Understanding under License in the Automation Suite installation, not one from cloud.uipath.com.

 

Failed status when creating a data labeling session


Description

If you are not able to create data labeling sessions on Data Manager in AI Center, take the following steps.

Solution 1

Double-check that Document Understanding is properly enabled. You should have set documentunderstanding.enabled to True in the configuration file before the installation, or you can update it in ArgoCD post-installation.

After doing that, you need to disable and re-enable AI Center on the tenant on which you wish to use the Data Labeling feature, or create a new tenant.

Solution 2

Even if Document Understanding is properly enabled in the configuration file or in ArgoCD, it is sometimes not enabled for DefaultTenant. This manifests as not being able to create data labeling sessions.

To fix this, disable AI Center on the tenant and re-enable it. Note that you might need to wait a few minutes before being able to re-enable it.

 

Failed status when trying to deploy an ML Skill


Description

If you are trying unsuccessfully to deploy a Document Understanding ML Skill on AI Center, check the solutions below.

Solution 1

If you are installing Automation Suite offline, double-check that the Document Understanding bundle has been downloaded and installed.

The bundle includes the base image (e.g., model library) needed for the models to run properly on AI Center after uploading the ML Packages via the AI Center UI.

For details about installing the Document Understanding bundle, refer to the documentation here and here. To add the Document Understanding bundle, follow the documentation to re-run the Document Understanding bundle installation.

Solution 2

Even if you have installed the Document Understanding bundle for offline installation, another issue might occur along with this error message: modulenotfounderror: no module named 'ocr.release'; 'ocr' is not a package.

When creating a Document Understanding OCR ML Package in AI Center, keep in mind that it cannot be named ocr or OCR, because that name conflicts with a folder in the package. Make sure to choose another name.

Solution 3

Sometimes, intermittently, Document Understanding Model Skill Deployments can fail with Failed to list deployment or Unknown Error when deploying the model for the first time.

The workaround is to try deploying the model again. The second time, the deployment is faster because most of the image-building work was done during the first attempt. Document Understanding ML Packages take around 1-1.5 hours to deploy the first time and are faster when deployed again.

 

Long time to deploy an ML Skill


Description

In a rare scenario, due to cluster state, asynchronous operations such as Skill Deployment or Package upload can be stuck for a long time.

Solution

If Document Understanding ML Skill deployment is taking more than 2-3 hours, try deploying a simpler model (e.g., TemplateModel).

If the model also takes more than an hour, then the mitigation is to restart AI Center services with the following commands:

kubectl -n uipath rollout restart deployment ai-deployer-deployment
kubectl -n uipath rollout restart deployment ai-trainer-deployment
kubectl -n uipath rollout restart deployment ai-pkgmanager-deployment
kubectl -n uipath rollout restart deployment ai-helper-deployment
kubectl -n uipath rollout restart deployment ai-appmanager-deployment

Wait for the AI Center pods to be back up by verifying with the following command:
kubectl -n uipath get pods | grep ai-*

All the above pods should be in the Running state, with the container state shown as 2/2.

 

Migration job fails in ArgoCD


Description

The migration job fails for Document Understanding in ArgoCD.

Solution

Document Understanding requires the FullTextSearch feature to be enabled on the SQL server. Otherwise, the installation can fail without an explicit error message in this regard, as the migration job fails in ArgoCD.

 

Handwriting Recognition with Intelligent Form Extractor not working


Description

Handwriting Recognition with Intelligent Form Extractor is not working or is working too slowly.

Solution 1

If you are using Intelligent Form Extractor offline, check that you enabled handwriting in the configuration file before the installation, or enable it in ArgoCD.

To double-check, go to ArgoCD > Document Understanding > App details > du-services.handwritingEnabled (set it to True).

In an air-gapped scenario, the Document Understanding bundle needs to be installed before doing this, otherwise the ArgoCD sync fails.

Solution 2

Even if handwriting is enabled in the configuration file, you might still face the same issues.

By default, the maximum number of CPUs each container is allowed to use for handwriting is 2. You may need to adjust the handwriting.max_cpu_per_pod parameter if you have a larger handwriting processing workload. You can update it in the configuration file before the installation or update it in ArgoCD.

For more details on how to calculate the parameter value based on your volume, please check the documentation: Installing Document Understanding.

 

Insights-related issues


Navigating to Insights home page generates a 404


Rarely, a routing error can occur and result in a 404 on the Insights home page. You can resolve this by going to the Insights application in ArgoCD and deleting the virtual service insightsprovisioning-vs. Note that you may have to click clear filters to show X additional resources to see and delete this virtual service.
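
Alternatively, if you prefer kubectl over the ArgoCD UI, you can locate and delete the virtual service directly; the first command discovers the namespace it lives in:

kubectl get virtualservice -A | grep insightsprovisioning-vs
kubectl -n <namespace> delete virtualservice insightsprovisioning-vs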

 

Apps-related issues


MongoDB Grafana dashboard


A custom Grafana dashboard for monitoring MongoDB resources and health is currently planned for Apps. This dashboard may also include some basic alerts.
It will be bundled as a default dashboard in the Automation Suite deployment for Apps, but it is not available out of the box yet.
We can provide the dashboard as JSON, along with simple steps to add it manually.
