Using the monitoring stack
The monitoring stack for Automation Suite clusters includes Prometheus, Grafana, and Alertmanager, which are integrated within the Rancher Cluster Explorer UI.
Node failures might lead to a Kubernetes shutdown, which would disrupt Prometheus alerts. To prevent this, we recommend setting up a separate alert on the RKE2 server.
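One way to approach such a check is sketched below. It assumes the RKE2 server runs as the systemd unit rke2-server on the node, and that a scheduler such as cron runs the script and forwards a non-zero exit code to your notification channel. The function name check_unit_active is hypothetical and not part of Automation Suite.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a systemd unit is active.
# Assumption: the RKE2 server runs as the systemd unit "rke2-server".
check_unit_active() {
    unit="${1:-rke2-server}"
    # "systemctl is-active" prints the unit state (active, inactive, failed, ...)
    state="$(systemctl is-active "$unit" 2>/dev/null || true)"
    if [ "$state" = "active" ]; then
        echo "OK: $unit is active"
        return 0
    fi
    echo "ALERT: $unit is ${state:-unknown}"
    return 1
}
```

Calling check_unit_active rke2-server from a scheduled script returns a non-zero status whenever the unit is not active, which an external notifier can act on even if Kubernetes itself is down.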
This page describes a series of monitoring scenarios. For more details, see the official Rancher documentation on using Rancher Monitoring.
When using collectors to export metrics to third-party tools, enabling application monitoring may disrupt the functionality of Automation Suite.
In the Monitoring dashboard, check the bottom pane for currently firing alerts. The following screenshot shows several currently firing alerts.
If alerts are too noisy, you can silence them. To do that, take the following steps:
We highly recommend setting up an external receiver for alerts. This way, alerts are pushed as they happen, and you do not need to refresh the Monitoring dashboard to see the latest ones.
For details on how to send alerts to an external receiver, see the Rancher documentation on Alertmanager Receiver Configuration.
In addition to a receiver, you must configure at least one route that uses that receiver. A route defines how alerts are grouped, and which alerts are sent to the receiver. See the Rancher documentation on Alertmanager Route Configuration.
See below for an example of how alerts are displayed when using the Slack receiver. Clicking the Alertmanager link takes you to the Alertmanager console, where you can silence alerts and follow further links to the Prometheus expression that triggered the alert. Clicking the Runbook URL takes you to this page with specific remediation instructions. These links are also present when alerts are sent to other external receivers.
On the Monitoring dashboard, click the Grafana tile. The Grafana dashboard is now displayed.
You can monitor the Istio Service Mesh via the following Grafana dashboards: Istio Mesh and Istio Workload.
The Istio Mesh dashboard shows the overall request volume, as well as 400 and 500 error rates across the entire service mesh, for the time period selected in the upper-right corner of the window. See the four charts across the top for this information.
It also shows the immediate Success Rate over the past minute for each individual service. Note that a Success Rate of NaN indicates the service is not currently serving traffic.
The Istio Workload dashboard shows the traffic metrics over the time range selected in the upper-right corner of the window.
Use the selectors at the top of the dashboard to drill into specific workloads. Of particular interest is the uipath namespace.
The top section shows overall metrics, the Inbound Workloads section separates out traffic based on origin, and the Outbound Services section separates out traffic based on destination.
You can monitor persistent volumes via the Kubernetes / Persistent Volumes dashboard. You can keep track of the free and used space for each volume.
You can also check the status of each volume by clicking the PersistentVolumes item within the Storage menu of the Cluster Explorer.
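The same status information is available from the command line. The sketch below assumes kubectl is configured for the cluster; the function name list_unbound_pvs is hypothetical.

```shell
# Hypothetical helper: prints "NAME STATUS" for every PersistentVolume
# whose phase is not Bound.
# Assumption: kubectl is configured to talk to the Automation Suite cluster.
list_unbound_pvs() {
    kubectl get pv --no-headers \
        -o custom-columns=NAME:.metadata.name,STATUS:.status.phase |
        awk '$2 != "Bound" { print $1, $2 }'
}
```

Running list_unbound_pvs prints nothing when every volume is bound, so any output points at a volume worth inspecting in the Cluster Explorer.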
To check the hardware utilization per node, you can use the Nodes dashboard. Data on the CPU, Memory, Disk, and Network is available.
You can monitor the hardware utilization for specific workloads using the Kubernetes / Compute Resources / Namespace (Workloads) dashboard. Select the uipath namespace to get the needed data.
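For a quick look outside Grafana, the sketch below lists the pods using the most memory in a namespace. It assumes kubectl is configured for the cluster and that the Kubernetes metrics API is available there; the function name top_pods and its arguments are hypothetical.

```shell
# Hypothetical helper: shows the pods using the most memory in a namespace.
# Assumptions: kubectl is configured for the cluster and the metrics API
# (for example, metrics-server) is available.
# $1 = namespace (defaults to uipath), $2 = number of pods to show (defaults to 10)
top_pods() {
    ns="${1:-uipath}"
    kubectl top pods -n "$ns" --sort-by=memory --no-headers |
        head -n "${2:-10}"
}
```

For example, top_pods uipath 5 shows the five pods with the highest memory usage in the uipath namespace.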
To create a shareable visual snapshot of a Grafana chart, take the following steps:
- Click the downward-pointing arrow next to the chart title, and then select Share.
- Click the Snapshot tab, and set the Snapshot name, Expire, and Timeout.
- Click Publish to snapshot.raintank.io.
For more details, see the Grafana documentation on sharing dashboards.
For details on how to create custom persistent Grafana dashboards, see the Rancher documentation.
Admin access to Grafana is not typically needed in Automation Suite clusters, as dashboards are available for read access by default to anonymous users, and custom persistent dashboards must be created using the Kubernetes-native instructions linked above in this document.
Nonetheless, admin access to Grafana is possible with the instructions below.
The default username and password for Grafana admin access can be retrieved as follows:
kubectl get secret -n cattle-monitoring-system rancher-monitoring-grafana -o jsonpath='{.data.admin-user}' | base64 -d && echo
kubectl get secret -n cattle-monitoring-system rancher-monitoring-grafana -o jsonpath='{.data.admin-password}' | base64 -d && echo
Note that in High Availability Automation Suite clusters, there are multiple Grafana pods to enable uninterrupted read access in case of node failure, as well as a higher volume of read queries. This is incompatible with admin access because logging in requires session state, and the pods do not share it. To work around this, temporarily scale the number of Grafana replicas to 1 while admin access is needed. See below for instructions on how to scale the number of Grafana replicas:
# scale down
kubectl scale -n cattle-monitoring-system deployment/rancher-monitoring-grafana --replicas=1
# scale up
kubectl scale -n cattle-monitoring-system deployment/rancher-monitoring-grafana --replicas=2
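If you want the scale operation to block until the change has rolled out, the scale command can be combined with kubectl rollout status. This is a sketch only: the function name scale_grafana and the 120-second timeout are illustrative assumptions.

```shell
# Hypothetical wrapper: scales the Grafana deployment and waits for the rollout.
# $1 = desired number of replicas
# Assumption: a 120s timeout is long enough for the rollout in your cluster.
scale_grafana() {
    replicas="$1"
    kubectl scale -n cattle-monitoring-system \
        deployment/rancher-monitoring-grafana --replicas="$replicas" &&
    kubectl rollout status -n cattle-monitoring-system \
        deployment/rancher-monitoring-grafana --timeout=120s
}
```

For example, run scale_grafana 1 before logging in as admin, and scale_grafana 2 once admin access is no longer needed.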
Documentation on the available metrics is here:
You can create custom alerts using a Prometheus query with a Boolean expression.
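Before turning an expression into an alert, it can help to evaluate it against the Prometheus HTTP query API and confirm that it returns the series you expect. The sketch below assumes Prometheus is reachable at the URL in PROM_URL, for example through a local port-forward; the function name prom_query, the PROM_URL variable, and the port-forward target in the comment are assumptions rather than documented Automation Suite settings.

```shell
# Hypothetical helper: evaluates a PromQL expression through the
# Prometheus HTTP API (/api/v1/query) and prints the JSON response.
# Assumption: Prometheus is reachable at $PROM_URL, for example after
#   kubectl port-forward -n cattle-monitoring-system \
#       svc/rancher-monitoring-prometheus 9090:9090
# (the service name above is an assumption; verify it in your cluster).
prom_query() {
    base_url="${PROM_URL:-http://localhost:9090}"
    curl -sS -G "$base_url/api/v1/query" --data-urlencode "query=$1"
}
```

A Boolean expression such as prom_query 'up == 0' returns only the series for which the comparison holds, which are the series that would trigger an alert based on that expression.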
To see the status of pods, deployments, statefulsets, and other resources, you can use the Cluster Explorer UI. This is the landing page you reach after logging in to the rancher-server endpoint. The homepage shows a summary, with drill-downs into specific details for each resource type on the left. Note the namespace selector at the top of the page. You can also use the Lens tool instead of this dashboard.
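A rough command-line equivalent is sketched below. It assumes kubectl is configured for the cluster and relies on the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE); the function name list_unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: prints "NAME STATUS" for pods that are neither
# Running nor Completed in the given namespace (defaults to uipath).
# Assumption: default "kubectl get pods" columns, where STATUS is column 3.
list_unhealthy_pods() {
    ns="${1:-uipath}"
    kubectl get pods -n "$ns" --no-headers |
        awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Note that a pod in the Running state can still have containers that are not ready, so treat this as a first filter rather than a full health check.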
Prometheus uses its remote write feature (remote_write) to collect and export Prometheus metrics from an Automation Suite cluster to an external system.
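The sketch below does not enable remote write. It only prints whatever remote write settings are currently present on the Prometheus custom resources, assuming the monitoring stack is managed by the Prometheus Operator in the cattle-monitoring-system namespace. The function name show_remote_write is hypothetical.

```shell
# Hypothetical helper: prints the remoteWrite section of the Prometheus
# custom resources in the monitoring namespace (empty output if unset).
# Assumption: Prometheus is deployed through the Prometheus Operator in
# the cattle-monitoring-system namespace. This command is read-only.
show_remote_write() {
    kubectl get prometheus -n cattle-monitoring-system \
        -o jsonpath='{.items[*].spec.remoteWrite}'
}
```

Empty output from show_remote_write suggests that no remote write target is configured yet.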