Troubleshooting how-tos
How to troubleshoot services during installation
Take the following steps on one of the cluster server nodes:
- Obtain Kubernetes access.
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
# To validate, execute the following command which should not return an error:
kubectl get nodes
- Retrieve the ArgoCD password by running the following command:
kubectl get secrets/argocd-admin-password -n argocd --template '{{ .data.password }}' | base64 -d
- Connect to ArgoCD:
a. Navigate to https://alm.<fqdn>/:443
b. Log in using admin as the username and the password obtained at Step 2.
- Locate the UiPath Services application as follows:
a. Using the search bar provided in ArgoCD, type in uipath.
b. Then open the UiPath application by clicking its card.
c. Check for the following error: Application was not synced due to a failed job/pod.
d. If the above error exists, take the following steps.
e. Locate any un-synced components by looking for the red broken heart icon.
f. Open the right-most component (usually pods) and click the Logs tab. The logs will contain an error message indicating the reason for the pod failure.
g. Once you resolve any outstanding configuration issues, go back to the home page and click the Sync button on the UiPath application.
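Alternatively, if the argocd CLI is available on the node, a rough equivalent of the login and sync steps is sketched below; the host name, flags, and timeout are assumptions, and options such as --grpc-web or --insecure may be needed depending on your ingress and certificate setup:
# Sketch only: log in to ArgoCD and sync the UiPath application from the CLI
argocd login alm.<fqdn> --username admin --password '<password obtained at Step 2>' --insecure
argocd app sync uipath
argocd app wait uipath --timeout 600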
How to uninstall the cluster
If you experience issues specific to Kubernetes running on the cluster, you can directly uninstall the rke2 cluster. To do that, take the following steps:
- Depending on your installation profile, run one of the following commands:
1.1. In an online setup, run the following script with elevated privileges, i.e. sudo, on each node of the cluster. This will uninstall the nodes.
service_exists() {
local n=$1
if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
return 0
else
return 1
fi
}
if service_exists rke2-server; then
systemctl stop rke2-server
systemctl disable rke2-server
fi
if service_exists rke2-agent; then
systemctl stop rke2-agent
systemctl disable rke2-agent
fi
if [ -e /usr/bin/rke2-killall.sh ]
then
echo "Running rke2-killall.sh"
/usr/bin/rke2-killall.sh > /dev/null
else
echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/bin/rke2-uninstall.sh ]
then
echo "Running rke2-uninstall.sh"
/usr/bin/rke2-uninstall.sh > /dev/null
else
echo "File not found: rke2-uninstall.sh"
fi
rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
findmnt -lo target | grep /var/lib/kubelet/ | xargs umount -l -f
rm -rfv /var/lib/kubelet/ > /dev/null
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."
1.2. In an offline setup, run the following script with elevated privileges, i.e. sudo, on each node of the cluster. This will uninstall the nodes.
service_exists() {
local n=$1
if [[ $(systemctl list-units --all -t service --full --no-legend "$n.service" | cut -f1 -d' ') == $n.service ]]; then
return 0
else
return 1
fi
}
if service_exists rke2-server; then
systemctl stop rke2-server
systemctl disable rke2-server
fi
if service_exists rke2-agent; then
systemctl stop rke2-agent
systemctl disable rke2-agent
fi
if [ -e /usr/local/bin/rke2-killall.sh ]
then
echo "Running rke2-killall.sh"
/usr/local/bin/rke2-killall.sh > /dev/null
else
echo "File not found: rke2-killall.sh"
fi
if [ -e /usr/local/bin/rke2-uninstall.sh ]
then
echo "Running rke2-uninstall.sh"
/usr/local/bin/rke2-uninstall.sh > /dev/null
else
echo "File not found: rke2-uninstall.sh"
fi
rm -rfv /etc/rancher/ > /dev/null
rm -rfv /var/lib/rancher/ > /dev/null
rm -rfv /var/lib/rook/ > /dev/null
rm -rfv /var/lib/longhorn/ > /dev/null
findmnt -lo target | grep /var/lib/kubelet/ | xargs umount -l -f
rm -rfv /var/lib/kubelet/ > /dev/null
rm -rfv /datadisk/* > /dev/null
rm -rfv ~/.uipath/* > /dev/null
mkdir -p /var/lib/rancher/rke2/server/db/ && mount -a
rm -rfv /var/lib/rancher/rke2/server/db/* > /dev/null
echo "Uninstall RKE complete."
- Reboot the node after uninstall.
Important!
When uninstalling one of the nodes from the cluster, you must run the following command: kubectl delete node <node_name>. This removes the node from the cluster.
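If the node still hosts workloads, you may want to drain it before removing it. A minimal sketch (the flag values mirror those used by the node-draining script later in this article, not installer requirements):
# Sketch: cordon and drain the node, then remove it from the cluster
kubectl drain <node_name> --ignore-daemonsets --delete-emptydir-data --force
kubectl delete node <node_name>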
How to clean up offline artifacts to improve disk space
If you run an offline installation, you typically need a larger disk size due to the offline artifacts that are used.
Once the installation completes, you can remove those local artifacts. Failure to do so can result in unnecessary disk pressure during cluster operations.
On the primary server, where the installation was performed, you can perform a cleanup using the following commands.
- Remove all images loaded by podman into the local container storage using the following command:
podman image rm -af
- Then remove the temporary offline folder, used with the --offline-tmp-folder flag. This parameter defaults to /tmp:
rm -rf /path/to/temp/folder
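To confirm how much space was reclaimed, you can check podman storage usage and the relevant mount points afterwards; a small sketch (the mount points shown are assumptions based on the defaults above):
# Sketch: report podman storage usage and free disk space after the cleanup
podman system df
df -h /tmp /var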
Common Issues
Unable to run an offline installation on RHEL 8.4 OS
Description
The following issues can happen if you install RHEL 8.4 and are performing the offline installation, which requires podman. These issues are specific to podman and the OS being installed together. See the two potential issues below.
Potential issue
- you cannot install both of the following on the cluster:
podman-1.0.0-8.git921f98f.module+el8.3.0+10171+12421f43.x86_64
podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
- package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch requires podman >= 1.3.0, but none of the providers can be installed
- cannot install the best candidate for the job
- problem with installed package cockpit-podman-29-2.module+el8.4.0+10607+f4da7515.noarch
Potential issue
- package podman-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires containernetworking-plugins >= 0.8.1-1, but none of the providers can be installed
- you cannot install both of the following:
containernetworking-plugins-0.7.4-4.git9ebe139.module+el8.3.0+10171+12421f43.x86_64
containernetworking-plugins-0.9.1-1.module+el8.4.0+10607+f4da7515.x86_64
- package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64 requires podman = 3.0.1-6.module+el8.4.0+10607+f4da7515, but none of the providers can be installed
- cannot install the best candidate for the job
- problem with installed package podman-catatonit-3.0.1-6.module+el8.4.0+10607+f4da7515.x86_64
(try to add --allowerasing to the command line to replace conflicting packages, or --skip-broken to skip uninstallable packages, or --nobest to use not only best candidate packages)
Solution
You need to remove the current version of podman and allow Automation Suite to install the required version.
- Remove the current version of podman using the yum remove podman command.
- Re-run the installer after having removed the current version, which should install the correct version.
Offline installation fails because of missing binary
Description
During offline installation, at the fabric stage, the execution fails with the following error message:
Error: overlay: can't stat program "/usr/bin/fuse-overlayfs": stat /usr/bin/fuse-overlayfs: no such file or directory
Solution
You need to remove the line containing the mount_program key from the podman configuration file /etc/containers/storage.conf.
Ensure you remove the line rather than comment it out.
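A minimal sketch of that edit, assuming the default storage.conf layout (back the file up first; the exact value after mount_program may differ on your system):
# Sketch: back up storage.conf, then delete (not comment out) any mount_program line,
# e.g. mount_program = "/usr/bin/fuse-overlayfs"
cp /etc/containers/storage.conf /etc/containers/storage.conf.bak
sed -i '/^[[:space:]]*mount_program[[:space:]]*=/d' /etc/containers/storage.conf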
Failure to get the sandbox image
Description
You can receive a specific error message when trying to get the following sandbox image: index.docker.io/rancher/pause3.2
This can happen in an offline installation.
Solution
Restart either rke2-server or rke2-agent, depending on whether the machine the pod is scheduled on is a server or an agent.
To check which node the pod is scheduled on, run kubectl -n <namespace> get pods -o wide.
# If machine is a Master node
systemctl restart rke2-server
# If machine is an Agent Node
systemctl restart rke2-agent
SQL connection string validation error
Description
You might receive an error relating to the connection strings as follows:
Sqlcmd: Error: Microsoft Driver 17 for SQL Server :
Server—tcp : <connection string>
Login failed for user
This error appears even though all credentials are correct; the connection string validation fails.
Solution
Make sure the connection string has the following structure:
Server=<Sql server host name>;User Id=<user_name for sql server>;Password=<Password>;Initial Catalog=<database name>;Persist Security Info=False;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;Max Pool Size=100;
Note: User Id is case-sensitive.
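To test the connection string values outside the installer, a rough check with the sqlcmd client might look like the sketch below; it assumes the mssql-tools client is installed and that the default port 1433 is used, and the placeholders match those above:
# Sketch: verify the SQL credentials independently of the installer
sqlcmd -S tcp:<Sql server host name>,1433 -U '<user_name for sql server>' -P '<Password>' -d '<database name>' -Q "SELECT 1"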
Pods not showing in ArgoCD UI
Description
Occasionally, the ArgoCD UI does not show pods, but only displays applications and corresponding deployments.


When clicking on any of the deployments, the following error is displayed: Unable to load data: EOF.


Solution
You can fix this issue by deleting all Redis replicas from the ArgoCD namespace and waiting for them to come back up again.
kubectl -n argocd delete pod argocd-redis-ha-server-0 argocd-redis-ha-server-1 argocd-redis-ha-server-2
# Wait for all 3 pods to come back up
kubectl -n argocd get pods | grep argocd-redis-ha-server
Certificate issue in offline installation
Description
You might get an error that the certificate is signed by an unknown authority.
Error: failed to do request: Head "https://sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com:30071/v2/helm/audit-service/blobs/sha256:09bffbc520ff000b834fe1a654acd089889b09d22d5cf1129b0edf2d76554892": x509: certificate signed by unknown authority
Solution
Both the rootCA and the server certificates need to be in the trusted store on the machine.
To investigate, execute the following commands:
find /etc/pki/ca-trust/source{,/anchors} -maxdepth 1 -not -type d -exec ls -1 {} +
/etc/pki/ca-trust/source/anchors/rootCA.crt
/etc/pki/ca-trust/source/anchors/server.crt
The provided certificates need to be in the output of those commands.
Alternatively, execute the following command:
openssl x509 -in /etc/pki/ca-trust/source/anchors/server.crt -text -noout
Ensure that the fully qualified domain name is present in the Subject Alternative Name from the output.
X509v3 Subject Alternative Name:
DNS:sfdev1778654-9f843b23-lb.westeurope.cloudapp.azure.com
You can update the CA Certificate as follows:
update-ca-trust
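If the certificates are missing from the anchors directory, a minimal sketch of the full fix is shown below; the file names are assumptions, so use the certificates provided for your installation:
# Sketch: copy the root CA and server certificates into the trust anchors, then refresh the trust store
cp rootCA.crt server.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust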
Error in downloading the bundle
Description
The documentation lists wget
as an option for downloading the bundles. Because of the large sizes, the connection may be interrupted and not recover.
Solution
One way to mitigate this could be to switch to a different download tool, such as azcopy
(more information here). Run these commands, while updating the bundle URL to match the desired version/bundle combination.
wget https://aka.ms/downloadazcopy-v10-linux -O azcopy.tar.gz
tar -xvf ./azcopy.tar.gz
azcopy_linux_amd64_10.11.0/azcopy copy https://download.uipath.com/service-fabric/0.0.23-private4/sf-0.0.23-private4.tar.gz /var/tmp/sf.tar.gz --from-to BlobLocal
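If you prefer to keep using wget, an interrupted download can usually be resumed with the -c flag; a sketch using the same example bundle URL as above:
# Sketch: resume a partially downloaded bundle instead of restarting it from scratch
wget -c https://download.uipath.com/service-fabric/0.0.23-private4/sf-0.0.23-private4.tar.gz -O /var/tmp/sf.tar.gz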
Longhorn errors
Rook Ceph or Looker pod stuck in Init state
Description
Occasionally, on node restart, an issue causes the Looker or Rook Ceph pod to get stuck in Init state as the volume required for attaching the PVC to a pod is missing.
Verify if the problem is indeed related to Longhorn by running the following command:
kubectl get events -A -o json | jq -r '.items[] | select(.message != null) | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\""))'
If it is related to Longhorn, this command should return a list of pod names affected by the issue. If the command does not return anything, the cause of the problem is different.
Solution
Run the following script to fix the problematic pods if the previous command returns a non-empty output:
#!/bin/bash
function wait_till_rollout() {
local namespace=$1
local object_type=$2
local deploy=$3
local try=0
local maxtry=2
local status="notready"
while [[ ${status} == "notready" ]] && (( try != maxtry )) ; do
kubectl -n "$namespace" rollout status "$deploy" -w --timeout=600s;
# shellcheck disable=SC2181
if [[ "$?" -ne 0 ]];
then
status="notready"
try=$((try+1))
else
status="ready"
fi
done
if [[ $status == "notready" ]]; then
echo "$deploy of type $object_type failed in namespace $namespace. Plz re-run the script once again to verify that it's not a transient issue !!!"
exit 1
fi
}
function fix_pv_deployments() {
for pod_name in $(kubectl get events -A -o json | jq -r '.items[] | select(.message | contains("cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\"")) | select(.involvedObject.kind == "Pod") | .involvedObject.name + "/" + .involvedObject.namespace' | sort | uniq)
do
POD_NAME=$(echo "${pod_name}" | cut -d '/' -f1)
NS=$(echo "${pod_name}" | cut -d '/' -f2)
controller_data=$(kubectl -n "${NS}" get po "${POD_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
[[ $controller_data == "" ]] && error "Error: Could not determine owner for pod: ${POD_NAME}" && exit 1
CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)
if [[ $CONTROLLER_KIND == "ReplicaSet" ]]
then
controller_data=$(kubectl -n "${NS}" get "${CONTROLLER_KIND}" "${CONTROLLER_NAME}" -o json | jq -r '[.metadata.ownerReferences[] | select(.controller==true)][0] | .kind + "=" + .name')
CONTROLLER_KIND=$(echo "${controller_data}" | cut -d'=' -f1)
CONTROLLER_NAME=$(echo "${controller_data}" | cut -d'=' -f2)
replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.replicas')
unavailable_replicas=$(kubectl -n "${NS}" get "$CONTROLLER_KIND" "$CONTROLLER_NAME" -o json | jq -r '.status.unavailableReplicas')
if [ -n "$unavailable_replicas" ]; then
available_replicas=$((replicas - unavailable_replicas))
if [ $available_replicas -eq 0 ]; then
kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas=0
sleep 15
kubectl -n "$NS" scale "$CONTROLLER_KIND" "$CONTROLLER_NAME" --replicas="$replicas"
deployment_name="$CONTROLLER_KIND/$CONTROLLER_NAME"
wait_till_rollout "$NS" "deploy" "$deployment_name"
fi
fi
fi
done
}
fix_pv_deployments
Failure to create persistent volumes
Description
Longhorn is successfully installed, but fails to create persistent volumes.
Solution
Verify that the kernel modules are successfully loaded in the cluster by using the command lsmod | grep <module_name>.
Replace <module_name> with each of the kernel modules below:
libiscsi_tcp
libiscsi
iscsi_tcp
scsi_transport_iscsi
Load any missing module.
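A small sketch that checks all four modules and loads any that are missing (run it with elevated privileges):
# Sketch: load the iSCSI-related kernel modules required by Longhorn if they are not already loaded
for m in libiscsi_tcp libiscsi iscsi_tcp scsi_transport_iscsi; do
lsmod | grep -q "^${m} " || modprobe "${m}"
done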
rke2-coredns-rke2-coredns-autoscaler pod in CrashLoopBackOff
Description
After node restart, rke2-coredns-rke2-coredns-autoscaler can go into CrashLoopBackOff. This does not have any impact on Automation Suite.
Solution
Delete the rke2-coredns-rke2-coredns-autoscaler pod that is in CrashLoopBackOff using the following command: kubectl delete pod <pod name> -n kube-system.
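To find the exact pod name first, a small sketch:
# Sketch: locate the autoscaler pod that is crash-looping, then delete it so it is recreated
kubectl -n kube-system get pods | grep rke2-coredns-rke2-coredns-autoscaler
kubectl -n kube-system delete pod <pod name>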
Redis probe failure
Description
Redis probe can fail if the node ID file does not exist. This can happen if the pod is not yet bootstrapped.
There is a recovery job that automatically fixes this issue, and the following steps should not be performed while the job is running.
When a Redis Enterprise cluster loses contact with more than half of its nodes (either because of failed nodes or network split), then the cluster stops responding to client connections. The Pods also fail to rejoin the cluster.
Solution
- Delete the Redis cluster and database using the following commands:
kubectl delete redb -n redis-system redis-cluster-db --force --grace-period=0 &
kubectl delete rec -n redis-system redis-cluster --force --grace-period=0 &
kubectl patch redb -n redis-system redis-cluster-db --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"finalizer.redisenterprisedatabases.app.redislabs.com"}]'
kubectl patch rec redis-cluster -n redis-system --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"redbfinalizer.redisenterpriseclusters.app.redislabs.com"}]'
kubectl delete job redis-cluster-db-job -n redis-system
- Go to the ArgoCD UI and sync the redis-cluster application.
RKE2 server fails to start
Description
The server fails to start. There are a few different reasons for RKE2 not starting properly, which are usually found in the logs.
Solution
Check the logs using the following command:
journalctl -u rke2-server
Possible reasons (based on logs): too many learner members in cluster
Too many etcd servers are added to the cluster, and there are two learner nodes trying to be promoted. More information here: Runtime reconfiguration.
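To confirm this specific cause, you can filter the service logs for learner-related messages; a sketch (the exact wording of the log message may differ between RKE2 versions):
# Sketch: search recent rke2-server logs for etcd learner promotion errors
journalctl -u rke2-server --no-pager --since "2 hours ago" | grep -i "learner"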
Perform the following:
- Under normal circumstances, the node should become a full member if enough time is allowed.
- An uninstall-reinstall cycle can be attempted.
Alternatively, this could be caused by a networking problem. Ensure you have configured the machine to enable the necessary ports.
Node draining does not occur for stopped nodes
Description
If a node is stopped in a cluster and its corresponding pods are not rescheduled to available nodes after 15 minutes, run the following script to manually drain the node.
#!/bin/sh
KUBECTL="/usr/local/bin/kubectl"
# Get only nodes which are not drained yet
NOT_READY_NODES=$($KUBECTL get nodes | grep -P 'NotReady(?!,SchedulingDisabled)' | awk '{print $1}' | xargs echo)
# Get only nodes which are still drained
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}' | xargs echo)
echo "Unready nodes that are undrained: $NOT_READY_NODES"
echo "Ready nodes: $READY_NODES"
for node in $NOT_READY_NODES; do
echo "Node $node not drained yet, draining..."
$KUBECTL drain --ignore-daemonsets --force --delete-emptydir-data $node
echo "Done"
done;
for node in $READY_NODES; do
echo "Node $node still drained, uncordoning..."
$KUBECTL uncordon $node
echo "Done"
done;
Enable Istio logging
To debug Istio, you need to enable logging. To do that, perform the following steps:
- Find the istio-ingressgateway pod by running the following command. Copy the gateway pod name. It should be something like istio-ingressgateway-r4mbx.
kubectl -n istio-system get pods
- Open the gateway Pod shell by running the following command.
kubectl exec -it -n istio-system istio-ingressgateway-r4mbx bash
- Enable debug level logging by running the following command.
curl -X POST http://localhost:15000/logging?level=debug
- Run the following command from a server node.
istioctl_bin=$(find /var/lib/rancher/rke2/ -name "istioctl" -type f -perm -u+x -print -quit)
if [[ -n ${istioctl_bin} ]]
then
echo "istioctl bin found"
kubectl -n istio-system get cm istio-installer-base -o go-template='{{ index .data "istio-base.yaml" }}' > istio-base.yaml
kubectl -n istio-system get cm istio-installer-overlay -o go-template='{{ index .data "overlay-config.yaml" }}' > overlay-config.yaml
${istioctl_bin} -i istio-system install -y -f istio-base.yaml -f overlay-config.yaml --set meshConfig.accessLogFile=/dev/stdout --set meshConfig.accessLogEncoding=JSON
else
echo "istioctl bin not found"
fi
Secret not found in UiPath namespace
Description
If service installation fails, and checking kubectl -n uipath get pods
returns failed pods, take the following steps.
Solution
- Check kubectl -n uipath describe pod <pod-name> and look for secret not found.
- If the secret is not found, look for the credential manager job logs and see if it failed.
- If the credential manager job failed and kubectl get pods -n rook-ceph | grep rook-ceph-tool returns more than one pod, do the following:
a. Delete the rook-ceph-tool pod that is not running.
b. Go to the ArgoCD UI and sync the sfcore application.
c. Once the job completes, check the credential-manager job logs to verify that all secrets are created.
d. Now sync the uipath application.
Cannot log in after migration
Description
An issue might affect the migration from a standalone product to Automation Suite. It prevents you from logging in, with the following error message being displayed: Cannot find client details.
Solution
To fix this problem, you need to re-sync the uipath app first, and then sync the platform app in ArgoCD.
After the Initial Install, ArgoCD App went into Progressing State
Description
Whenever the cluster state deviates from what is defined in the Helm repository, argocd tries to sync the state, and reconciliation happens every minute. Whenever this happens, you can notice that the ArgoCD app is in the Progressing state.
Solution
This is the expected behavior of ArgoCD, and it does not impact the application in any way.
Automation Suite requires backlog_wait_time to be set 1
Description
Audit events can cause instability (system freeze) if backlog_wait_time is not set to 1.
For more details, see this issue description.
Solution
If the installer fails with the Automation Suite requires backlog_wait_time to be set 1 error message, take the following steps to set backlog_wait_time to 1.
- Set backlog_wait_time to 1 by appending --backlog_wait_time 1 in the /etc/audit/rules.d/audit.rules file, as shown in the sketch after these steps.
- Reboot the node.
- Validate that the backlog_wait_time value is set to 1 for auditctl by running sudo auditctl -s | grep "backlog_wait_time" on the node.
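A minimal sketch of the three steps above; the echo/tee approach is one way of appending the flag, so adjust it if you manage the audit rules differently:
# Sketch: append the flag to the audit rules, reboot, then verify the setting after the reboot
echo "--backlog_wait_time 1" | sudo tee -a /etc/audit/rules.d/audit.rules
sudo reboot
# After the node is back up:
sudo auditctl -s | grep "backlog_wait_time"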
Failure to resize objectstore PVC
Description
This issue occurs when the objectstore resize-pvc
operation fails with the following error:
Failed resizing the PVC: <pvc name> in namespace: rook-ceph, ROLLING BACK
Solution
To fix this problem, take the following steps:
- Run the following script manually:
#!/bin/sh
ROOK_CEPH_OSD_PREPARE=$(kubectl -n rook-ceph get pods | grep rook-ceph-osd-prepare-set | awk '{print $1}')
if [[ -n ${ROOK_CEPH_OSD_PREPARE} ]]; then
for pod in ${ROOK_CEPH_OSD_PREPARE}; do
echo "Start deleting rook ceph osd pod $pod .."
kubectl -n rook-ceph delete pod $pod
echo "Done"
done;
fi
- Rerun the
objectstore resize-pvc
command.
Failure after certificate update
Description
This issue occurs when the certificate update step fails internally. You may not be able to access Automation Suite or Orchestrator.


Solution
- Run the following commands from any of the server nodes:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin
kubectl -n uipath rollout restart deployments
- Wait for the above command to succeed, and then run the following command to verify the status of the previous command.
deployments=$(kubectl -n uipath get deployment -o name)
for i in $deployments;
do
kubectl -n uipath rollout status "$i" -w --timeout=600s;
if [[ "$?" -ne 0 ]];
then
echo "$i deployment failed in namespace uipath."
fi
done
echo "All deployments are succeeded in namespace uipath"
Once the above command finishes execution, you should be able to access Automation Suite and Orchestrator.
Unexpected inconsistency; run fsck manually
While installing or upgrading Automation Suite, if the MongoDB pods cannot mount the PVC, the following error message is displayed:
UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY


Recovery steps
If you encounter the error above, follow the recovery steps below:
- SSH to the system by running the following command:
ssh <user>@<node-ip>
- Check the events of the PVC and verify that the issue is related to a PVC mount failure caused by a file system error. To do this, run the following commands:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
kubectl get events -n mongodb
kubectl get events -n longhorn-system
- Check the PVC volume mentioned in the event and run the fsck command:
fsck -a <pvc-volume-name>
E.g., fsck -a /dev/longhorn/pvc-5abe3c8f-7422-44da-9132-92be5641150a
- Delete the failing MongoDB pod to properly mount it to the PVC.
kubectl delete pod <pod-name> -n mongodb
Identity Server issues
Setting a timeout interval for the Management portals
Before installation, you cannot update the expiration time for the token used to authenticate to the host- and organization-level Management portals. Therefore, user sessions do not time out.
To set a time interval for timeout for these portals, you can update the accessTokenLifetime
property.
The following example sets the timeout interval to 86400 seconds (24 hours):
UPDATE [identity].[Clients] SET AccessTokenLifetime = 86400 WHERE ClientName = 'Portal.OpenId'
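As a usage sketch, the statement can be run with sqlcmd from the command line; the port and the target database are assumptions, so adjust them to the database that holds the identity schema in your deployment:
# Sketch: apply the accessTokenLifetime update via sqlcmd (assumed port and database placeholders)
sqlcmd -S tcp:<Sql server host name>,1433 -U '<user_name>' -P '<Password>' -d '<database containing the identity schema>' -Q "UPDATE [identity].[Clients] SET AccessTokenLifetime = 86400 WHERE ClientName = 'Portal.OpenId'"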
Kerberos issues
kinit: Cannot find KDC for realm while getting initial credentials
Description
This error might occur during installation (if you have Kerberos authentication enabled) or during the kerberos-tgt-update
cron job execution when the UiPath cluster cannot connect to the AD server to obtain the Kerberos ticket for authentication.
Solution
Check the AD domain and ensure it is configured correctly and routable, as follows:
getent ahosts <AD domain> | awk '{print $1}' | sort | uniq
If this command does not return a routable IP address, then the AD domain required for Kerberos authentication is not properly configured.
You need to work with the IT administrators to add the AD domain to your DNS server and make sure this command returns a routable IP address.
kinit: Keytab contains no suitable keys for *** while getting initial credentials
Description
This error could be found in the log of a failed job, with one of the following job names: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update.
Solution
Make sure the AD user still exists, is active, and their password was not changed and did not expire. Reset the user's password and regenerate the keytab if needed.
Also make sure to provide the default Kerberos AD user parameter <KERB_DEFAULT_USERNAME> in the following format: HTTP/<Service Fabric FQDN>.
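To confirm which principals the keytab actually contains, a quick sketch (the keytab path is hypothetical; use the file you supplied to the installer):
# Sketch: list the principals and key timestamps stored in the keytab
klist -kt /path/to/your.keytab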
GSSAPI operation failed with error: An invalid status code was supplied (Client's credentials have been revoked).
Description
This log could be found when using Kerberos for SQL access, and the SQL connection is failing inside services. Similarly, you may see kinit: Client's credentials have been revoked while getting initial credential in one of the following job names: services-preinstall-validations-job, kerberos-jobs-trigger, kerberos-tgt-update.
Solution
This could be caused by the AD user account used to generate the keytab being disabled. Re-enabling the AD user account should fix the issue.
Login failed for user <ADDOMAIN>\<aduser>. Reason: The account is disabled.
Description
This log could be found when using Kerberos for SQL access, and SQL connection is failing inside services.
Solution
This issue could be caused by the AD user losing access to the SQL server. See instructions on how to reconfigure the AD user.
Orchestrator-related issues
Orchestrator pod in CrashLoopBackOff or 1/2 running with multiple restarts
Description
If the Orchestrator pod is in CrashLoopBackOff or 1/2 running with multiple restarts, the failure could be related to the authentication keys for the object storage provider, Ceph.
To check if the failure is related to Ceph, run the following commands:
kubectl -n uipath get pod -l app.kubernetes.io/component=orchestrator
If the output of this command is similar to one of the following options, you need to run an additional command.
Option 1:
NAME READY STATUS RESTARTS AGE
orchestrator-6dc848b7d5-q5c2q 1/2 Running 2 6m1s
OR
Option 2
NAME READY STATUS RESTARTS AGE
orchestrator-6dc848b7d5-q5c2q 1/2 CrashLoopBackOff 6 16m
Verify if the failure is related to Ceph authentication keys by running the following command:
kubectl -n uipath logs -l app.kubernetes.io/component=orchestrator | grep 'Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden' -o
If the output of the above command contains the string Error making request with Error Code InvalidAccessKeyId and Http Status Code Forbidden
, the failure is due to the Ceph authentication keys.
Solution
Rerun the rook-ceph-configure-script-job
and credential-manager
jobs using the following commands:
kubectl -n uipath-infra get job "rook-ceph-configure-script-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath-infra get job "credential-manager-job" -o json | jq 'del(. | .spec.selector, .spec.template.metadata.labels)' | kubectl replace --force -f -
kubectl -n uipath delete pod -l app.kubernetes.io/component=orchestrator
Test Manager-related issues
Test Manager licensing issue
If you were assigned a license while being logged in, your license assignment may not be detected when opening Test Manager.
If this happens, take the following steps:
- Navigate to Test Manager.
- Log out from the portal.
- Log in again.
AI Center-related issues
AI Center Skills deployment issues
Sometimes, intermittently, DU Model Skill Deployments can fail with a Failed to list deployment or Unknown Error message when deploying the model for the first time. The workaround is to try deploying the model again. The second attempt is faster, as most of the image-building work was done during the first attempt. DU Models take around 1-1.5 hours to deploy the first time, and they deploy faster afterwards.
In a rare scenario, due to cluster state, asynchronous operations such as Skill Deployment or Package upload can be stuck for a long time. If DU Skill deployment takes more than 2-3 hours, try deploying a simpler model (e.g., TemplateModel). If that model also takes more than an hour, the mitigation is to restart the AI Center services with the following commands:
kubectl -n uipath rollout restart deployment ai-deployer-deployment
kubectl -n uipath rollout restart deployment ai-trainer-deployment
kubectl -n uipath rollout restart deployment ai-pkgmanager-deployment
kubectl -n uipath rollout restart deployment ai-helper-deployment
kubectl -n uipath rollout restart deployment ai-appmanager-deployment
Wait for the AI Center pods to be back up by verifying with the following command:
kubectl -n uipath get pods | grep ai-*
All the above pods should be in the Running state, with the container state shown as 2/2.
Document Understanding-related issues
Document Understanding not on the left rail of Automation Suite
Description
If Document Understanding cannot be found on the left rail of Automation Suite, note that Document Understanding is currently not a separate application on Automation Suite, so it is not shown on the left rail.
Solution
The Data Manager component is part of AI Center, so please make sure to enable AI Center.
Also, please access Form Extractor, Intelligent Form Extractor (including HandwritingRecognition), and Intelligent Keyword Classifier using the following public URLs:
<FQDN>/du_/svc/formextractor
<FQDN>/du_/svc/intelligentforms
<FQDN>/du_/svc/intelligentkeywords
If you get the Your license can not be validated error message when trying to use Intelligent Keyword Classifier, Form Extractor, or Intelligent Form Extractor in Studio, besides making sure you have entered the right endpoint, also use the API key that you generated for Document Understanding under License in the Automation Suite installation, not the one from cloud.uipath.com.
Failed status when creating a data labeling session
Description
If you are not able to create data labeling sessions on Data Manager in AI Center, take the following steps.
Solution 1
Please double-check that Document Understanding is properly enabled. You should have updated the configuration file before the installation and set documentunderstanding.enabled to True, or you can update it in ArgoCD post-installation.
After doing that, you need to disable and re-enable AI Center on the tenant you wish to use the Data Labeling feature on, or create a new tenant.


Solution 2
If Document Understanding is properly enabled in the configuration file or ArgoCD, sometimes Document Understanding is not enabled for DefaultTenant. This manifests itself as not being able to create data labeling sessions.
To fix this, disable AI Center on the tenant and re-enable it. Note that you might need to wait a few minutes before being able to re-enable it.
Failed status when trying to deploy an ML Skill
Description
If you are trying unsuccessfully to deploy a Document Understanding ML Skill on AI Center, check the solutions below.
Solution 1
If you are installing the Automation Suite offline, please double check if the Document Understanding bundle has been downloaded and installed.
The bundle includes the base image (e.g., model library) for the models to properly run on AI Center after uploading the ML Packages via AI Center UI.
For details about installing the Document Understanding bundle, please refer to the documentation here and here. To add the Document Understanding bundle, please follow the documentation to re-run the Document Understanding bundle installation.
Solution 2
Even if you have installed the Document Understanding bundle for offline installation, another issue might occur along with this error message: modulenotfounderror: no module named 'ocr.release'; 'ocr' is not a package.
When creating a Document Understanding OCR ML Package in AI Center, keep in mind that it cannot be named ocr or OCR, which conflicts with a folder in the package. Please make sure to choose another name.
Solution 3
Sometimes, intermittently, Document Understanding Model Skill Deployments can fail with Failed to list deployment
or Unknown Error
when deploying the model for the first time.
The workaround is to try deploying the model again. The second time, the deployment will be faster, as most of the image-building work was done during the first attempt. Document Understanding ML Packages take around 1-1.5 hours to deploy the first time, and they deploy faster afterwards.
Migration job fails in ArgoCD
Description
Migration job fails for Document Understanding in ArgoCD.
Solution
Document Understanding requires the FullTextSearch feature to be enabled on the SQL server. Otherwise, the installation can fail without an explicit error message in this regard, as the migration job fails in ArgoCD.
Handwriting Recognition with Intelligent Form Extractor not working
Description
Handwriting Recognition with Intelligent Form Extractor is not working or is working too slowly.
Solution 1
If you are using Intelligent Form Extractor offline, please check to ensure that you have enabled handwriting in the configuration file before installation or enabled it in ArgoCD.
To double check, please go to ArgoCD > Document Understanding > App details > du-services.handwritingEnabled (set it to True).
In an air-gapped scenario, the Document Understanding bundle needs to be installed before doing this, otherwise the ArgoCD sync fails.
Solution 2
Even if handwriting is enabled in the configuration file, you might still face the same issues.
Note that, by default, the maximum number of CPUs each container is allowed to use for handwriting is 2. You may need to adjust the handwriting.max_cpu_per_pod parameter if you have a larger handwriting processing workload. You can update it in the configuration file before installation or update it in ArgoCD.
For more details on how to calculate the parameter value based on your volume, please check the documentation here.
Insights-related issues
Navigating to Insights home page generates a 404
Rarely, a routing error can occur and result in a 404 on the Insights home page. You can resolve this by going to the Insights application in ArgoCD and deleting the virtual service insightsprovisioning-vs. Note that you may have to click clear filters to show X additional resources to see and delete this virtual service.
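If you prefer kubectl over the ArgoCD UI, a rough equivalent is sketched below; the namespace is an assumption, so confirm it with the first command before deleting anything:
# Sketch: locate and delete the insightsprovisioning-vs virtual service (namespace assumed, verify via the first command)
kubectl get virtualservices -A | grep insightsprovisioning-vs
kubectl -n <namespace from previous output> delete virtualservice insightsprovisioning-vs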