Automation Suite
2021.10
Automation Suite Installation Guide
Last updated Apr 19, 2024

Performing Node Maintenance

There are scenarios where you may want to perform a node maintenance activity, such as the following:

  • When applying security patches;
  • When performing an operating system upgrade;
  • When changing any network configuration;
  • When performing any other activity that your organization mandates.

While performing node maintenance operations, you may accidentally break the cluster. To avoid any adverse situations, follow the guidance provided here.

Note:
  • UiPath does not provide guidance on how to perform node maintenance activities. You must contact your IT team for this.
  • The following guidelines only provide instructions on the steps you must take before and after the node maintenance operation, to ensure the cluster is healthy.
  • It is good practice to perform node maintenance activities on one node at a time, verifying that the cluster is healthy before moving on to the next node (see the check below).
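
Before draining any node, confirm that the other nodes are healthy so that the drained workloads have somewhere to run. A minimal check, run from a server node and assuming kubectl is configured there (for example, via the same RKE2 kubeconfig paths that the drain-node.sh script below uses):

  # Every node should report STATUS "Ready" before you take
  # any one of them down for maintenance.
  export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
  kubectl get nodes -o wide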

Pre-node Maintenance

  1. To ensure that the cluster stays healthy while you perform the node maintenance activity, you must drain the workloads running on that node to other nodes. To drain the node, save the drain-node.sh script on the target node and run it using the following command (a verification check follows the script):
    sudo bash drain-node.sh
    

    drain-node.sh script

    #!/bin/bash
    
    # =================
    #
    #
    #
    #
    # Copyright UiPath 2021
    #
    # =================
    # LICENSE AGREEMENT
    # -----------------
    #   Use of paid UiPath products and services is subject to the licensing agreement
    #   executed between you and UiPath. Unless otherwise indicated by UiPath, use of free
    #   UiPath products is subject to the associated licensing agreement available here:
    #   https://www.uipath.com/legal/trust-and-security/legal-terms (or successor website).
    #   You must not use this file separately from the product it is a part of or is associated with.
    #
    #
    #
    # =================
    
    fetch_hostname(){
    
        HOST_NAME_NODE=$(kubectl get nodes -o name | cut -d'/' -f2 | grep "$(hostname)")
    
        if ! [[ -n ${HOST_NAME_NODE} && "$(hostname)" == "$HOST_NAME_NODE" ]]; then
            for private_ip in $(hostname --all-ip-addresses); do
                output=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | grep "$private_ip")
                ip_address=$(echo "$output" | cut -f2 -d$'\t')
    
                if [[ -n ${ip_address} && "$private_ip" == "$ip_address" ]]; then
                    HOST_NAME_NODE=$(echo "$output" | cut -f1 -d$'\t')
                    break
                fi
            done
        fi
    }
    
    set_kubeconfig(){
        export PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
        [[ -f "/var/lib/rancher/rke2/agent/kubelet.kubeconfig" ]] && export KUBECONFIG="/var/lib/rancher/rke2/agent/kubelet.kubeconfig"
        [[ -f "/etc/rancher/rke2/rke2.yaml" ]] && export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
    }
    
    is_kubectl_enabled(){
      local try=0
      local maxtry=60
      local status="notready"
      echo "Checking if node $HOST_NAME_NODE is ready to run kubectl command."
      while [[ ${status} == "notready" ]] && (( try != maxtry )) ; do
              try=$((try+1))
              kubectl cluster-info >/dev/null 2>&1  && status="ready"
              sleep 5;
      done
    
      if [[ ${status} == "notready" ]]; then
        echo "Node is not ready to accept kubectl command"
      else
        echo "Node is ready to accept kubectl command"
      fi
    }
    
    enable_ipforwarding() {
      local file_name="/etc/sysctl.conf"
      echo "Enable IP Forwarding..."
    
      if [[ ! -f "${file_name}" || ! -w "${file_name}" ]]; then
        # either the file is not available or the user doesn't have write permission
        echo "Either file ${file_name} not present or file is not writable. Enabling ip forward using /proc/sys/net/ipv4/ip_forward..."
        echo 1 > /proc/sys/net/ipv4/ip_forward
      else
        echo "File ${file_name} is available and is writable. Checking and enabling ip forward..."
        is_ipforwarding_available=$(grep "net.ipv4.ip_forward" "${file_name}") || true
        if [[ -z ${is_ipforwarding_available} ]]; then
          echo "Adding net.ipv4.ip_forward = 1 in ${file_name}..."
          echo "net.ipv4.ip_forward = 1" >> ${file_name}
        else
          echo "Updating net.ipv4.ip_forward value with 1 in ${file_name}..."
          # shellcheck disable=SC2016
          sed -i -n -e '/^net.ipv4.ip_forward/!p' -e '$anet.ipv4.ip_forward = 1' ${file_name}
        fi
        sysctl -p
      fi
    }
    
    set_kubeconfig
    is_kubectl_enabled
    fetch_hostname
    
    if [[ -n "$HOST_NAME_NODE" ]]; then
        # Pass an argument to uncordon the node. This is to cover reboot scenarios.
        if [ "$1" ]; then
            # enable ip forward
            enable_ipforwarding
            # uncordon node
            echo "Uncordon $HOST_NAME_NODE ..."
            kubectl uncordon "$HOST_NAME_NODE"
        else
            #If a PDB is enabled and there are zero available replicas on other nodes, the drain would fail for those pods, but that is not the behavior we want
            #That is when the second command comes to the rescue: it ignores the PDB and continues with the eviction of the pods whose eviction failed earlier https://github.com/kubernetes/kubernetes/issues/83307
            kubectl drain "$HOST_NAME_NODE" --delete-emptydir-data --ignore-daemonsets  --timeout=90s --skip-wait-for-delete-timeout=10 --force --ignore-errors || kubectl drain "$HOST_NAME_NODE" --delete-emptydir-data --ignore-daemonsets  --force  --disable-eviction=true --timeout=30s --ignore-errors --skip-wait-for-delete-timeout=10 --pod-selector 'app!=csi-attacher,longhorn.io/component!=instance-manager,k8s-app!=kube-dns'
            node_mounted_pv=$(kubectl get volumeattachment -o json | jq --arg node "${HOST_NAME_NODE}" -r '.items[] | select(.spec.nodeName==$node) | .metadata.name + ":" + .spec.source.persistentVolumeName')
            if [[ -n "${node_mounted_pv}" ]] ; then
              while IFS=$'\n' read -r VOL_ATTACHMENT_PV_ID
              do
                PV_ID=$(echo "${VOL_ATTACHMENT_PV_ID}" | cut -d':' -f2)
                VOL_ATTACHMENT_ID=$(echo "${VOL_ATTACHMENT_PV_ID}" | cut -d':' -f1)
                if [[ -n "${PV_ID}" ]] ; then
                  mounts=$(grep "${PV_ID}" /proc/mounts  | awk '{print $2}')
                  if [[ -n $mounts ]] ; then
                    echo "Removing dangling mounts for pvc: ${PV_ID}"
                    {
                      timeout 20s xargs umount -l <<< "${mounts}"
                      exitCode="$?"
                      if [[ $exitCode -eq 0 ]] ; then
                        echo "Command to remove dangling mounts for pvc ${PV_ID} executed successfully"
                        echo "Waiting to remove dangling mounts for pvc ${PV_ID}"
                        if timeout 1m bash -c "while grep -q '${PV_ID}' /proc/mounts ; do sleep 1 ; done"  ; then
                          kubectl delete volumeattachment "${VOL_ATTACHMENT_ID}"
                          if timeout 2m bash -c "while kubectl get node '${HOST_NAME_NODE}' -o yaml | grep -q '${PV_ID}' ; do sleep 1 ; done" ; then
                            #shellcheck disable=SC1012
                            find /var/lib/kubelet -name "${PV_ID}" -print0 | xargs -0 \rm -rf
                            echo "Removed dangling mounts for pvc: ${PV_ID} successfully"
                          else
                            echo "Timeout while waiting to remove node dangling mounts for pvc: ${PV_ID}"
                          fi
                        else
                          echo "Timeout while waiting to remove dangling mounts for pvc: ${PV_ID}"
                        fi
                      elif [[ $exitCode -eq 124 ]] ; then
                        echo "Timeout while executing remove dangling mounts for pvc: ${PV_ID}"
                      else
                        echo "Error while executing remove dangling mounts for pvc: ${PV_ID}"
                      fi
                    } &
                  fi
                fi
              done <<< "${node_mounted_pv}"
              wait
            fi
        fi
    else
      echo "Not able to fetch hostname"
    fi
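
    After the script completes, you can verify that the node was cordoned and drained before starting maintenance. A quick check, assuming kubectl access from a server node (replace <node-name> with the name reported by kubectl get nodes):

    # The drained node should report STATUS "Ready,SchedulingDisabled".
    kubectl get node <node-name>
    # Apart from DaemonSet-managed pods, no workloads should remain on it.
    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>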
  2. Stop the Kubernetes process running on the node by running the command that corresponds to your node type (a verification check follows the commands):
    • Server node:

      systemctl stop rke2-server
    • Agent node:

      systemctl stop rke2-agent
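
    To verify that the service has actually stopped before you begin maintenance, you can query systemd (rke2-server shown; use rke2-agent on agent nodes):

      # Prints "inactive" once the service has fully stopped.
      systemctl is-active rke2-server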
  3. If your maintenance activity includes upgrading the RPM packages on the machine, you must skip upgrading the rke2 package to avoid any compatibility issues.
    • It is recommended to add the rke2 package to the exclusion list of the RPM upgrade. To do so, add rke2-* to the exclude list in the /etc/yum.conf file. For details, see these instructions.
    • Alternatively, you can temporarily exclude rke2 during yum upgrade using the following command:
      yum upgrade --exclude "rke2-*"
      Important:
      If not excluded, the rke2-* packages might get upgraded to the latest version, causing issues in the Automation Suite cluster. The rke2-* package upgrade is handled via the Automation Suite upgrade.
      Updating yum overwrites the /etc/yum.conf file and removes rke2-* from the exclusion list. To prevent that, update the yum tool using the following command: yum update --exclude yum-utils.
      To check whether rke2 is excluded, review the /etc/yum.conf file.
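
      A minimal sketch of what the permanent exclusion can look like; if your /etc/yum.conf already contains an exclude line, edit that line to add rke2-* instead of appending a second one:

        # Add rke2 packages to yum's permanent exclusion list...
        echo 'exclude=rke2-*' | sudo tee -a /etc/yum.conf
        # ...then confirm the resulting exclude line.
        grep '^exclude' /etc/yum.conf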
  4. Proceed with your node maintenance activity. Once the maintenance is complete, continue with the post-node maintenance steps.

Post-node Maintenance

  1. Reboot the node either by running sudo reboot or using any other safe reboot mechanism you may prefer.
  2. Once the node is rebooted, ensure the rke2 service is started by running the command that corresponds to your node type (a verification check follows the commands):
    • Server node:

      systemctl start rke2-server
    • Agent node:

      systemctl start rke2-agent
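
    To verify that the service came up before continuing, you can query systemd and, if needed, follow its logs (rke2-server shown; use rke2-agent on agent nodes):

      # Prints "active" once the service is running.
      systemctl is-active rke2-server
      # If it is not, inspect the service logs.
      journalctl -u rke2-server -f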
  3. Once the rke2 service is started, you must uncordon the node so that workloads can be scheduled on it again. To do so, run the drain-node.sh script with an argument, which also re-enables IP forwarding:
    sudo bash drain-node.sh nodestart
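
    To confirm the node is schedulable again (replace <node-name> with the name reported by kubectl get nodes):

    # The node should report STATUS "Ready", without "SchedulingDisabled".
    kubectl get node <node-name>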
