Automation Suite
2021.10
Starting and shutting down a node
Automation Suite Installation Guide
Last updated Apr 19, 2024
This page explains the manual and automatic startup and shutdown behavior of Automation Suite.
The rke2-service starts first, followed by node-drainer and node-uncordon. node-drainer takes no action at startup; it only returns confirmation that the service is up. node-uncordon runs only once and starts /opt/node-drain.sh nodestart, which uncordons the node. As part of the drain procedure that occurs at shutdown, the node is cordoned, making it unschedulable. This cordoned state persists when the rke2 service starts again, so the node must be uncordoned after rke2-service restarts.
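The cordoned state described above can be checked directly. A minimal sketch, assuming kubectl can reach the cluster from the node (for example, via the kubeconfig that rke2 writes to /etc/rancher/rke2/rke2.yaml); the helper name is purely illustrative:

```shell
# Sketch: report whether this node is still cordoned (unschedulable).
# Assumes kubectl is configured on the node.
is_cordoned() {
  # A cordoned node has .spec.unschedulable set to true.
  kubectl get node "$(hostname)" -o jsonpath='{.spec.unschedulable}' | grep -q true
}
```

If is_cordoned succeeds, restarting node-uncordon (or running kubectl uncordon on the node) makes the node schedulable again.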
Manual startup
The service starts automatically with Automation Suite. However, if rke2-service was manually stopped, you must start it again by running the following commands:
1. Start the Kubernetes process running on the server node:
systemctl start rke2-server
2. Start the Kubernetes process running on the agent node:
systemctl start rke2-agent
3. Once the rke2 service is started, uncordon the node to ensure Kubernetes can schedule workloads on it again:
systemctl restart node-uncordon
4. Once the node is started, start the drain service:
systemctl start node-drain.service
Important: Skipping step 4 could cause the Kubelet service to shut down in an unhealthy way if the system is restarted.
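The startup steps above can be collected into a single helper. A minimal sketch, assuming it runs as root on the target node and uses the unit names from this guide; the node_start name and the server-or-agent fallback are illustrative:

```shell
# Sketch: restore a node after rke2-service was stopped manually.
node_start() {
  # Steps 1/2: start the Kubernetes process (server or agent, whichever
  # is installed on this node).
  systemctl start rke2-server 2>/dev/null || systemctl start rke2-agent
  # Step 3: uncordon the node so workloads can be scheduled on it again.
  systemctl restart node-uncordon
  # Step 4: re-arm the drain-at-shutdown hook (skipping this risks an
  # unhealthy Kubelet shutdown on the next restart).
  systemctl start node-drain.service
}
```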
During shutdown, systemd stops the services in the reverse order in which they were started. Because the node-drain service has the After=rke2-server.service or After=rke2-agent.service directive, it executes its shutdown sequence before the rke2-service shutdown. This means that, in a properly configured system, simply shutting down the node gracefully is a safe operation.
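This ordering guarantee comes from systemd's After= semantics: units listed in After= start before the declaring unit, and, symmetrically, stop after it. A minimal sketch of how such a drain unit could be declared (illustrative only; the actual unit files shipped with Automation Suite may differ):

```ini
# Illustrative drain unit: After= makes systemd stop this unit
# *before* rke2-server.service during shutdown.
[Unit]
Description=Drain node before rke2 stops
After=rke2-server.service

[Service]
Type=oneshot
RemainAfterExit=true
# ExecStop runs at shutdown while rke2 is still up, so the drain can
# still reach the API server. (Script path as referenced in this guide.)
ExecStop=/opt/node-drain.sh
TimeoutStopSec=300

[Install]
WantedBy=multi-user.target
```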
Manual restart
If you plan to stop the rke2 service and reboot the machine, take the following steps:
1. To keep the cluster healthy during node maintenance, drain the workloads running on that node to other nodes. To drain the node, run the following command:
systemctl stop node-drain.service
2. Stop the Kubernetes process running on the server node:
systemctl stop rke2-server
3. Stop the Kubernetes process running on the agent node:
systemctl stop rke2-agent
4. Kill the rke2 services, containerd, and all child processes:
/bin/rke2-killall.sh
The rke2-killall.sh script should already be in your PATH, but it is located at /bin/rke2-killall.sh.
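Conversely, the shutdown steps can be sketched as one helper, under the same assumptions (run as root, unit names as in this guide; graceful_node_stop is an illustrative name):

```shell
# Sketch: drain and stop rke2 on a node ahead of planned maintenance.
graceful_node_stop() {
  # Step 1: drain workloads off this node (node-drain has a 300 s timeout).
  systemctl stop node-drain.service
  # Steps 2/3: stop the Kubernetes process (server or agent, as applicable).
  systemctl stop rke2-server 2>/dev/null || true
  systemctl stop rke2-agent 2>/dev/null || true
  # Step 4: kill leftover rke2, containerd, and all child processes.
  /bin/rke2-killall.sh
}
```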
The following unit files are created during installation:
- rke2-server.service (server only). Starts rke2-server, which starts the server node.
- rke2-agent.service (agent only). Starts rke2-agent, which starts the agent node.
- node-drain.service. Used at shutdown time. Executes before shutting down rke2-agent or rke2-server and performs a drain. Has a timeout of 300 seconds.
- node-uncordon.service. Used at startup to uncordon a node.
- var-lib-kubelet.mount. Autogenerated by the fstab generator.
- var-lib-rancher-rke2-server-db.mount. Autogenerated by the fstab generator.
- var-lib-rancher.mount. Autogenerated by the fstab generator.
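You can confirm these units exist on a node with systemctl. A small sketch, assuming the unit names listed above (the helper name is illustrative):

```shell
# Sketch: list the rke2- and node-related units created by the installer.
list_as_units() {
  systemctl list-unit-files --no-legend \
    'rke2-*.service' 'node-*.service' 'var-lib-*.mount'
}
```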
There are no strong dependencies between the unit files. However, node-drain and node-uncordon have the After=rke2-server.service or After=rke2-agent.service directive, which means those services start after the rke2-service.