Step 3: Post-deployment steps
`\` may not work as expected. To ensure new lines are interpreted correctly, use the console's clipboard widget.
`installResult` (in the container) is `successful`. The contents will be `failed` if the installation failed.
The installation process generates self-signed certificates on your behalf. However, the Azure deployment template also gives you the option to provide a CA-issued server certificate at installation time instead of using an auto-generated self-signed certificate.
Self-signed certificates will expire in 90 days, and you must replace them with certificates signed by a trusted CA as soon as installation completes. If you do not update the certificates, the installation will stop working after 90 days.
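As a rough planning aid, the 90-day window can be turned into a date. The sketch below assumes the validity period starts on the installation date; the helper names are hypothetical, and the expiry date recorded in the certificate itself remains authoritative.

```python
from datetime import date, timedelta

# 90 days is the validity period stated in this section.
SELF_SIGNED_VALIDITY_DAYS = 90


def certificate_expiry(installed_on: date) -> date:
    """Estimate when the auto-generated self-signed certificate expires.

    Assumption: the validity period starts on the installation date.
    """
    return installed_on + timedelta(days=SELF_SIGNED_VALIDITY_DAYS)


def days_left(installed_on: date, today: date) -> int:
    """Days remaining before the estimated expiry (negative once expired)."""
    return (certificate_expiry(installed_on) - today).days
```

For example, `days_left(date(2023, 1, 1), date(2023, 3, 2))` gives the number of days left to replace the certificate for an installation done on 1 January 2023.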
For instructions, see Managing certificates.
If you need more information on the Automation Suite installation process or other operations, a good place to start is the storage account used to store various flags and logs during cluster deployment and maintenance.
To locate the storage account, take the following steps:
The flags container stores various flags or files needed for orchestration or just to report the status of various operations. On a new cluster, the flags container contents typically look as shown in the following example:
Files in the flags container are used to orchestrate various operations, such as the Automation Suite installation process on the cluster, or specific cluster operations, such as Instance Refresh. For example:
- `uipath-server-000000.success` denotes that the infrastructure installation was completed successfully on that specific node of the cluster;
- `installResult` reads `success` if the overall installation is successful.
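The naming convention above can be read programmatically once the blob names and the contents of `installResult` have been downloaded. The following is a minimal sketch, not part of the product: the function name and the returned dictionary keys are hypothetical, and it assumes that a successful installation is recorded as the literal text `success`.

```python
def summarize_flags(blob_names, install_result):
    """Summarize the flags container from its blob names and installResult text.

    blob_names: names of the blobs in the flags container.
    install_result: contents of the installResult blob.
    """
    suffix = ".success"
    # A "<node>.success" blob marks a node whose infrastructure install succeeded.
    nodes_ok = sorted(
        name[: -len(suffix)] for name in blob_names if name.endswith(suffix)
    )
    return {
        "nodes_with_successful_infra_install": nodes_ok,
        # Assumption: the overall result is stored as the plain word "success".
        "installation_succeeded": install_result.strip() == "success",
    }
```

How the blob names and contents are fetched (portal, CLI, or SDK) is left out on purpose; the sketch only interprets them.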
An operation typically produces a log file in the logs container when it runs. On a fresh cluster, the logs container contents typically look as shown in the following example:
Every file in the logs container represents the logs for a specific step of the installation process. For example:
- `infra-uipath-server-000000.log` stores the infrastructure installation logs;
- `fabric.log` stores the logs for the fabric installation;
- `services.log` stores the logs for the application and services installation.
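When scanning a logs container with many blobs, it can help to map each name back to the installation step it belongs to. The sketch below covers only the three naming patterns mentioned above; the function name is hypothetical, and any other blob is reported as unknown rather than guessed.

```python
def describe_log(blob_name):
    """Map a logs-container blob name to the installation step it records."""
    if blob_name == "fabric.log":
        return "fabric installation"
    if blob_name == "services.log":
        return "application and services installation"
    prefix, suffix = "infra-", ".log"
    if blob_name.startswith(prefix) and blob_name.endswith(suffix):
        # The text between "infra-" and ".log" is the node hostname.
        hostname = blob_name[len(prefix) : -len(suffix)]
        return f"infrastructure installation on {hostname}"
    return "unknown"
```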
Once the installation is complete, you need to access the Deployment Outputs in the Outputs tab.
DateTime
) → Outputs.
| Output | Description |
|---|---|
| Documentation | A link to the documentation. |
| URL | The Load Balancer URL. Can be used for direct access. If custom domains were enabled this is the domain that you would use for the CNAME binding. |
| KeyVaultURL | The Azure Portal URL for the Key Vault created by the deployment. It contains all the secrets (credentials) used in the deployment. |
| ArgoCDURL | The URL for accessing ArgoCD. This is available within the VNet. External access to this URL must be set up as described in: Step 4: Configuring the DNS. |
| ArgoCDPassword | The password used to log in to the ArgoCD portal. |
| HostAdminUsername and HostAdminPassword | The credentials used for Host Administration. |
All credentials used in the deployment are stored as secrets inside a Key Vault provisioned during the deployment. To access the secrets, filter the resources inside the Resource Group, search for Vault, and then click Secrets.
If you see the "The operation “List” is not enabled in the key vault’s access policy" warning under the Secrets tab, take the following steps:
- Go to Access policies → Add access policy → Configure the template → Secret Management → Select Principal.
- Select your user, then click Save.
- Navigate back to Secrets. The warning should be gone, and the secrets should be visible.
The VMs are provisioned inside a private VNet. You can access them through Azure Bastion by following these steps:
As mentioned in Step 1: Preparing your Azure Deployment, the Automation Suite Azure deployment creates a Load Balancer with a public IP and a DNS label associated. This DNS label is Microsoft-owned.
The deployment also provisions a Private DNS zone inside the cluster VNet and adds several records that are used during the installation and configuration process.
If you choose to connect from an external machine, you will not be able to use the private DNS zone to resolve the DNS for various services, so you need to add these records to your hosts file.
See Step 4: Configuring the DNS for more details.
You should now be able to connect to various services running on your cluster.
The general-use Automation Suite user interface serves as a portal for both organization administrators and organization users. It is a common organization-level resource from where everyone can access all Automation Suite areas: administration pages, platform-level pages, service-specific pages, and user-specific pages.
To access Automation Suite, take the following steps:
- Go to the following URL: `https://${Loadbalancer_dns}`, where `<loadbalancer_dns>` is the DNS label for the load balancer and is found under Outputs.
- Switch to the Default organization.
- The username is orgadmin.
- Retrieve the password by going to Key Vault, Secrets, and then Host Admin Password.
The host portal is where system administrators configure the Automation Suite instance. The settings configured from this portal are inherited by all your organizations, and some can be overwritten at the organization level.
See Managing system administrators for more on host administrators.
See Interface tour for more on the host portal.
To access host administration, take the following steps:
- Go to the following URL: `https://${Loadbalancer_dns}`, where `<loadbalancer_dns>` is the DNS label for the load balancer and is found under Outputs.
- Switch to the Host organization.
- Enter the username you previously specified as a value for the UiPath Admin Username parameter.
- Enter the password you previously specified as a value for the UiPath Admin Password parameter. Retrieve the password by going to Key Vault, Secrets, and then Host Admin Password.
You can use the ArgoCD console to manage installed products.
To access ArgoCD, take the following steps:
- Go to the following URL: `https://alm.${Loadbalancer_dns}`, where `<loadbalancer_dns>` is the DNS label for the load balancer and is found under Outputs. Note that you must configure external access to this URL as described in Step 4: Configuring the DNS.
- The username is admin.
- To access the password, go to the Outputs tab or the credential Key Vault.
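The access URLs in this section all derive from the load balancer DNS label. A small sketch of that derivation follows; the function name and dictionary keys are hypothetical, and only the two URL patterns given in the text (the base URL and the `alm.` prefix for ArgoCD) are covered.

```python
def automation_suite_urls(loadbalancer_dns):
    """Build the access URLs described above from the load balancer DNS label."""
    return {
        # Used for both the general interface and host administration.
        "portal": f"https://{loadbalancer_dns}",
        # ArgoCD is served under the "alm." subdomain.
        "argocd": f"https://alm.{loadbalancer_dns}",
    }
```

The DNS label itself comes from the deployment Outputs; the value passed in is whatever appears there.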
Automation Suite uses Rancher to provide cluster management tools out of the box. This helps you manage the cluster and access monitoring and troubleshooting.
See Rancher documentation for more details.
See Using the monitoring stack for more on how to use Rancher monitoring in Automation Suite.
To access the Rancher console, take the following steps:
Compute resources provisioned from the deployment consist of Azure Scale Sets, which allow for easy scaling.
You can manually add additional resources to a specific Scale Set, including adding server nodes, agent nodes, or specialized agent nodes (such as GPU nodes).
You can perform a manual scale by identifying the specific Scale Set and adding resources directly.
To do so, take the following steps:
Azure allows a 15-minute window at most to prepare for shutdown, whereas the graceful termination of an Automation Suite node varies from 20 minutes (for agent and GPU agent nodes) to hours (in the case of server nodes).
To avoid data loss, the server VMSS upgrade policy is set to manual, and the server VMs have the protection for the scale set actions enabled. As a result, we recommend managing the server lifecycle via the provided Runbooks.
`InstanceRefresh`, `RemoveNodes`, `RemoveServers`, and `CheckServerZoneResilience` are supported only for multi-node HA-ready production deployments.
The number of servers after running any runbook must be odd and greater than three (e.g., you cannot execute an Instance Refresh if you have four servers; you cannot remove a server if you have a total of five).
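The constraint on the server count can be checked before starting a runbook. The sketch below is illustrative only: the function names are hypothetical, and the default minimum of three is an assumption based on the fact that this page also describes clusters with exactly three servers; pass a different `minimum` if your deployment requires a stricter bound.

```python
def is_valid_server_count(count, minimum=3):
    """Return True if the server count is odd and not below the minimum.

    Assumption: three servers is the smallest acceptable count.
    """
    return count >= minimum and count % 2 == 1


def can_remove_servers(current, to_remove, minimum=3):
    """Check whether removing `to_remove` servers leaves a valid count."""
    return is_valid_server_count(current - to_remove, minimum)
```

With five servers, `can_remove_servers(5, 1)` is false because four servers would remain, while `can_remove_servers(5, 2)` is true.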
Running
state.
Run only one runbook at a time.
Description
The `InstanceRefresh` runbook has the following use cases:
- Update VMSS OS SKU on the server, agent, and GPU scale sets.
- Perform a node rotation operation for one or more VMSSes.
- Apply other VMSS configuration changes that were made to the VMSS beforehand.
Usage
Implementation details
The `InstanceRefresh` runbook is a wrapper for the `RemoveNodes` runbook. As a result, the status is tracked while running `RemoveNodes`. It updates all the VMSS OS versions (if needed), extracts the hostnames for the node rotation operation based on the received parameters, and forwards them to `RemoveNodes`. If the cluster has exactly three servers, the `InstanceRefresh` runbook creates three new servers; otherwise, `RemoveNodes` handles the scale-up to maintain at least one server in each Availability Zone at all times.
Description
The `RemoveNodes` runbook has the following use cases:
- Remove the specified nodes from the Automation Suite cluster.
- Perform a node rotation operation for one or two VMs.
Usage
Implementation details
The `RemoveNodes` runbook takes a recursive approach to overcome the 3-hour fair share timeout. It removes or repaves the first node or the first two nodes from the received list (the number is chosen to satisfy the odd number of servers constraint) and then starts another instance of the runbook with the remaining list.
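The batching idea can be sketched as a small recursive function. This is not the runbook's actual code: the function names are hypothetical, the recursion happens in-process here rather than by starting a new runbook instance, and the rule for choosing one or two nodes is an assumption derived from the odd-server-count constraint.

```python
def choose_batch_size(server_count, remaining):
    """Pick one or two servers so the count left after removal is odd.

    Assumption: removing two from an odd count, or one from an even count,
    is how the odd-number constraint is satisfied.
    """
    wanted = 2 if server_count % 2 == 1 else 1
    return min(wanted, remaining)


def plan_removal(servers, server_count):
    """Split the server list into batches, handling one batch per 'run'."""
    if not servers:
        return []
    size = choose_batch_size(server_count, len(servers))
    batch, rest = servers[:size], servers[size:]
    # The real runbook would process `batch`, then rerun itself with `rest`.
    return [batch] + plan_removal(rest, server_count - size)
```

Keeping each run to one or two nodes bounds the work done per run, which is the point of the recursive design with respect to the fair share timeout.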
The node repaving operation for a node requires taking the following steps:
- Scale out the VMSS with one or two VMs based on the number of nodes that will be removed.
- Perform the node removal for the old instances.
The node removal operation for a node requires taking the following steps:
- Cordon and drain the instances. The operation times out after 20 minutes for an agent and `number_of_instances * 60` minutes for servers.
- Stop the rke service on the instances. The operation times out after 5 minutes.
- Remove the nodes from the Automation Suite cluster and delete the VMs. The operation times out after 20 minutes for agents and `number_of_instances * 60` minutes for servers.
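The timeouts listed above can be expressed as a single lookup, which is handy when estimating how long a removal may take. In this sketch the function name and the step identifiers (`drain_nodes`, `stop_rke`, `remove_nodes`) are illustrative labels, and only the node types `agent` and `server` are handled.

```python
def step_timeout_minutes(step, node_type, number_of_instances=1):
    """Return the timeout, in minutes, for a node removal step."""
    if step == "stop_rke":
        # Stopping the rke service has a fixed 5-minute timeout.
        return 5
    if step in ("drain_nodes", "remove_nodes"):
        if node_type == "agent":
            return 20
        if node_type == "server":
            # Servers get 60 minutes per instance being processed.
            return number_of_instances * 60
        raise ValueError(f"unknown node type: {node_type}")
    raise ValueError(f"unknown step: {step}")
```

For two servers, draining alone can take up to `step_timeout_minutes("drain_nodes", "server", 2)` minutes, which is why removing many servers in one run risks hitting the fair share timeout.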
Description
The `RemoveServers` runbook has the following use case:
- Remove servers from the Automation Suite cluster.
Usage
- Go to the Azure Portal and search for the resource called `RemoveServers`.
- Click the start button to open the parameter list. Complete the parameters considering the following:
  - `REMOVEDSERVERSCOUNT` is the number of servers that will be removed. We recommend removing no more than 2 servers at a time in order not to hit the fair share timeout.
Implementation details
The `RemoveServers` runbook removes the number of servers received as a parameter from the Availability Zones with the most VMs.
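One way to picture "the Availability Zones with the most VMs" is a greedy selection that repeatedly takes a server from the currently fullest zone. The sketch below is an illustration under that assumption, not the runbook's implementation; the function name is hypothetical, and ties are broken by zone name only to make the result deterministic.

```python
def pick_zones_for_removal(vms_per_zone, removed_servers_count):
    """Choose, one server at a time, the zone to remove a server from.

    vms_per_zone: mapping of zone name to number of server VMs in that zone.
    """
    counts = dict(vms_per_zone)  # work on a copy
    picked = []
    for _ in range(removed_servers_count):
        # Fullest zone first; sorted() makes tie-breaking deterministic.
        zone = max(sorted(counts), key=lambda z: counts[z])
        if counts[zone] == 0:
            raise ValueError("no servers left to remove")
        counts[zone] -= 1
        picked.append(zone)
    return picked
```

Starting from an unbalanced layout such as three servers in one zone and one in each of the other two, removing two servers this way leaves one server per zone.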
Description
The `CheckServerZoneResilience` runbook scales out the server VMSS and uses the `RemoveServers` runbook to balance the servers across Availability Zones. This is part of the `InstanceRefresh` flow and should not be run manually.
- In case a VM fails to join the Automation Suite cluster, a rollback will be tried. The newly created VMs will follow the same steps as a usual node removal (cordon, drain, stop the rke service, remove the node from the cluster, and delete the VMs). You can find the logs from the joining node procedure in the storage account, inside the logs container, in blobs like `infra-<hostname>.log`.
- In case of a failure while deleting nodes, any runbook will stop and display the logs for the step that failed. Fix the issue, then complete the process manually or using the `RemoveNodes` runbook. You can find all the logs in the storage account, inside the logs container, as follows:
  - Cordon and drain – `<timestamp>-<runbook_abreviation>-drain_nodes.log`
  - Stop the rke service – `<timestamp>-<runbook_abreviation>-stop_rke.log`
  - Remove the node from the cluster – `<timestamp>-<runbook_abreviation>-remove_nodes.log`
- In case of a timeout, you should wait for the step to finish its execution, check the logs, and complete the process manually or using the `RemoveNodes` runbook. All runbooks use the Azure Run Command feature to execute code in the context of the VMs. One limitation of this method is that it does not return the status of the execution. Therefore, the steps for cordoning, draining, and stopping the rke service run asynchronously, and the status is kept in blobs named in the following format: `<timestamp>-<runbook_abreviation>-<step_name>.<success/fail>`.
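Because step status is only available through blob names, a small parser can help when checking whether an asynchronous step finished. This is a sketch under assumptions: the function name is hypothetical, the step names are taken from the log file names listed above, and the timestamp and runbook abbreviation are returned together as an opaque prefix because their exact formats are not described here.

```python
# Step names borrowed from the log file names in this section.
KNOWN_STEPS = ("drain_nodes", "stop_rke", "remove_nodes")


def parse_status_blob(name):
    """Parse a '<prefix>-<step_name>.<success|fail>' blob name.

    Returns None when the name does not look like a status blob.
    """
    stem, dot, status = name.rpartition(".")
    if not dot or status not in ("success", "fail"):
        return None
    for step in KNOWN_STEPS:
        suffix = "-" + step
        if stem.endswith(suffix):
            return {
                # "<timestamp>-<runbook_abreviation>", kept as-is.
                "prefix": stem[: -len(suffix)],
                "step": step,
                "succeeded": status == "success",
            }
    return None
```

Log blobs ending in `.log` are deliberately ignored, so the same listing of the logs container can be passed through the parser without filtering first.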