- Overview
- Requirements
- Installation
- Q&A: Deployment templates
- Configuring the machines
- Configuring the external objectstore
- Configuring an external Docker registry
- Configuring the load balancer
- Configuring the DNS
- Configuring Microsoft SQL Server
- Configuring the certificates
- Online multi-node HA-ready production installation
- Offline multi-node HA-ready production installation
- Disaster recovery - Installing the secondary cluster
- Downloading the installation packages
- install-uipath.sh parameters
- Enabling Redis High Availability Add-On for the cluster
- Document Understanding configuration file
- Adding a dedicated agent node with GPU support
- Adding a dedicated agent Node for Task Mining
- Connecting Task Mining application
- Adding a Dedicated Agent Node for Automation Suite Robots
- Post-installation
- Cluster administration
- Monitoring and alerting
- Migration and upgrade
- Migration options
- Step 1: Moving the Identity organization data from standalone to Automation Suite
- Step 2: Restoring the standalone product database
- Step 3: Backing up the platform database in Automation Suite
- Step 4: Merging organizations in Automation Suite
- Step 5: Updating the migrated product connection strings
- Step 6: Migrating standalone Insights
- Step 7: Deleting the default tenant
- B) Single tenant migration
- Product-specific configuration
- Best practices and maintenance
- Troubleshooting
- How to troubleshoot services during installation
- How to uninstall the cluster
- How to clean up offline artifacts to improve disk space
- How to clear Redis data
- How to enable Istio logging
- How to manually clean up logs
- How to clean up old logs stored in the sf-logs bundle
- How to disable streaming logs for AI Center
- How to debug failed Automation Suite installations
- How to delete images from the old installer after upgrade
- How to automatically clean up Longhorn snapshots
- How to disable TX checksum offloading
- How to manually set the ArgoCD log level to Info
- How to generate the encoded pull_secret_value for external registries
- How to address weak ciphers in TLS 1.2
- Unable to run an offline installation on RHEL 8.4 OS
- Error in downloading the bundle
- Offline installation fails because of missing binary
- Certificate issue in offline installation
- First installation fails during Longhorn setup
- SQL connection string validation error
- Prerequisite check for selinux iscsid module fails
- Azure disk not marked as SSD
- Failure after certificate update
- Antivirus causes installation issues
- Automation Suite not working after OS upgrade
- Automation Suite requires backlog_wait_time to be set to 0
- GPU node affected by resource unavailability
- Volume unable to mount due to not being ready for workloads
- Failure to upload or download data in objectstore
- PVC resize does not heal Ceph
- Failure to resize PVC
- Failure to resize objectstore PVC
- Rook Ceph or Looker pod stuck in Init state
- StatefulSet volume attachment error
- Failure to create persistent volumes
- Storage reclamation patch
- Backup failed due to TooManySnapshots error
- All Longhorn replicas are faulted
- Setting a timeout interval for the management portals
- Update the underlying directory connections
- Authentication not working after migration
- Kinit: Cannot find KDC for realm <AD Domain> while getting initial credentials
- Kinit: Keytab contains no suitable keys for *** while getting initial credentials
- GSSAPI operation failed due to invalid status code
- Alarm received for failed Kerberos-tgt-update job
- SSPI provider: Server not found in Kerberos database
- Login failed for AD user due to disabled account
- ArgoCD login failed
- Failure to get the sandbox image
- Pods not showing in ArgoCD UI
- Redis probe failure
- RKE2 server fails to start
- Secret not found in UiPath namespace
- ArgoCD goes into progressing state after first installation
- Issues accessing the ArgoCD read-only account
- MongoDB pods in CrashLoopBackOff or pending PVC provisioning after deletion
- Unhealthy services after cluster restore or rollback
- Pods stuck in Init:0/X
- Prometheus in CrashloopBackoff state with out-of-memory (OOM) error
- Missing Ceph-rook metrics from monitoring dashboards
- Running High Availability with Process Mining
- Process Mining ingestion failed when logged in using Kerberos
- Unable to connect to AutomationSuite_ProcessMining_Warehouse database using a pyodbc format connection string
- Airflow installation fails with sqlalchemy.exc.ArgumentError: Could not parse rfc1738 URL from string ''
- How to add an IP table rule to use SQL Server port 1433
- Using the Automation Suite Diagnostics Tool
- Using the Automation Suite Support Bundle Tool
- Exploring Logs
Unhealthy services after cluster restore or rollback
Following a cluster restore or rollback, AI Center, Orchestrator, Platform, Document Understanding or Task Mining might be unhealthy, with the RabbitMQ pod logs showing the following error:
[root@server0 UiPathAutomationSuite]# k -n rabbitmq logs rabbitmq-server-0
2022-10-29 07:38:49.146614+00:00 [info] <0.9223.362> accepting AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672)
2022-10-29 07:38:49.147411+00:00 [info] <0.9223.362> Connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#77049094:2100
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> Error on AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:49.147922+00:00 [info] <0.9223.362> closing AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672 - rabbitConnectionFactory#77049094:2100)
2022-10-29 07:38:55.818447+00:00 [info] <0.9533.362> accepting AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672)
2022-10-29 07:38:55.821662+00:00 [info] <0.9533.362> Connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#2100d047:4057
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> Error on AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:55.822447+00:00 [info] <0.9533.362> closing AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672 - rabbitConnectionFactory#2100d047:4057)
[root@server0 UiPathAutomationSuite]# k -n rabbitmq logs rabbitmq-server-0
2022-10-29 07:38:49.146614+00:00 [info] <0.9223.362> accepting AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672)
2022-10-29 07:38:49.147411+00:00 [info] <0.9223.362> Connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#77049094:2100
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> Error on AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:49.147644+00:00 [erro] <0.9223.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:49.147922+00:00 [info] <0.9223.362> closing AMQP connection <0.9223.362> (10.42.1.161:37524 -> 10.42.0.228:5672 - rabbitConnectionFactory#77049094:2100)
2022-10-29 07:38:55.818447+00:00 [info] <0.9533.362> accepting AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672)
2022-10-29 07:38:55.821662+00:00 [info] <0.9533.362> Connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672) has a client-provided name: rabbitConnectionFactory#2100d047:4057
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> Error on AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672, state: starting):
2022-10-29 07:38:55.822058+00:00 [erro] <0.9533.362> PLAIN login refused: user 'aicenter-service' - invalid credentials
2022-10-29 07:38:55.822447+00:00 [info] <0.9533.362> closing AMQP connection <0.9533.362> (10.42.0.198:45032 -> 10.42.0.228:5672 - rabbitConnectionFactory#2100d047:4057)
CrashLoopBackOff
state due to the Mnesia table data write issue.
If all pods are running, take the following steps:
-
Delete the users in RabbitMQ:
kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl list_users -s --formatter json | jq '.[]|.user' | grep -v default_user | xargs -I{} kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl delete_user {}
kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl list_users -s --formatter json | jq '.[]|.user' | grep -v default_user | xargs -I{} kubectl -n rabbitmq exec rabbitmq-server-0 -c rabbitmq -- rabbitmqctl delete_user {} -
Delete RabbitMQ application secrets in the UiPath namespace:
kubectl -n uipath get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n uipath delete secret {}
kubectl -n uipath get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n uipath delete secret {} -
Delete RabbitMQ application secrets in the RabbitMQ namespace:
kubectl -n rabbitmq get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n rabbitmq delete secret {}
kubectl -n rabbitmq get secret --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -i rabbitmq-secret | xargs -I{} kubectl -n rabbitmq delete secret {} -
Sync
sfcore
application via ArgoCD and wait for the sync to complete: -
Perform a rollout restart on all applications in the UiPath namespace:
kubectl -n uipath rollout restart deploy
kubectl -n uipath rollout restart deploy
If some pods are in CrashLoopBackOff state, take the following steps:
-
Identify which RabbitMQ pods are stuck in
CrashLoopBackOff
state, and check the RabbitMQCrashLoopBackOff
pod logs:kubectl -n rabbitmq get pods kubectl -n rabbitmq logs <CrashLoopBackOff-Pod-Name>
kubectl -n rabbitmq get pods kubectl -n rabbitmq logs <CrashLoopBackOff-Pod-Name> -
Check the output of the previous commands. If the issue is related to the Mnesia table data write, you should see an error message similar to the following:
Mnesia('rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq'): ** ERROR ** (could not write core file: eacces) ** FATAL ** Failed to merge schema: Bad cookie in table definition rabbit_user_permission: 'rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667351034020261908,-576460752303416575,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{4,0},{'rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq',{1667,351040,418694}}}}, 'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667372429216834387,-576460752303417087,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{2,0},[]}}
Mnesia('rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq'): ** ERROR ** (could not write core file: eacces) ** FATAL ** Failed to merge schema: Bad cookie in table definition rabbit_user_permission: 'rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667351034020261908,-576460752303416575,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{4,0},{'rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq',{1667,351040,418694}}}}, 'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq' = {cstruct,rabbit_user_permission,set,[],['rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'],[],[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1667372429216834387,-576460752303417087,1},'rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq'},{{2,0},[]}} -
To fix the issue, take the following steps:
-
Find the number of RabbitMQ replicas:
rabbitmqReplicas=$(kubectl -n rabbitmq get rabbitmqcluster rabbitmq -o json | jq -r '.spec.replicas')
-
Scale down the RabbitMQ replicas:
kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": 0}}" --type=merge
kubectl -n rabbitmq scale sts rabbitmq-server --replicas=0
-
Wait until all RabbitMQ pods are terminated:
kubectl -n rabbitmq get pod
-
Find and delete the PVC of the RabbitMQ pod that is stuck in
CrashLoopBackOff
state:kubectl -n rabbitmq get pvc
kubectl -n rabbitmq delete pvc <crashloopbackupoff_pod_pvc_name>
-
Scale up the RabbitMQ replicas:
kubectl -n rabbitmq patch rabbitmqcluster rabbitmq -p "{\"spec\":{\"replicas\": $rabbitmqReplicas}}" --type=merge
-
Check if all RabbitMQ pods are healthy:
kubectl -n rabbitmq get pod
-