Performing node maintenance

You may need to perform node maintenance in scenarios such as the following:

- When applying security patches;
- When upgrading the operating system;
- When changing a network configuration;
- When performing any other activity mandated by your organization.

While performing node maintenance operations, you may accidentally break the cluster. To avoid this, follow the guidance provided here.

Note:

- UiPath does not provide guidance on how to perform the node maintenance activities themselves. Contact your IT team for that.
- The following guidelines cover only the steps you must take before and after the node maintenance operation to ensure that the cluster stays healthy.
- It is recommended to perform node maintenance activities on one node at a time.

Pre-node maintenance

To keep the cluster healthy while you perform a node maintenance activity, you must drain the workloads running on that node to other nodes. To drain the node, save the drain-node.sh script on the target node and run it using the following command:

sudo bash drain-node.sh

drain-node.sh script

#!/bin/bash

# =================
#
#
#
#
# Copyright UiPath 2021
#
# =================
# LICENSE AGREEMENT
# -----------------
#   Use of paid UiPath products and services is subject to the licensing agreement
#   executed between you and UiPath. Unless otherwise indicated by UiPath, use of free
#   UiPath products is subject to the associated licensing agreement available here:
#   https://www.uipath.com/legal/trust-and-security/legal-terms (or successor website).
#   You must not use this file separately from the product it is a part of or is associated with.
#
#
#
# =================

fetch_hostname(){

    HOST_NAME_NODE=$(kubectl get nodes -o name | cut -d'/' -f2 | grep "$(hostname)")

    if ! [[ -n ${HOST_NAME_NODE} && "$(hostname)" == "$HOST_NAME_NODE" ]]; then
        for private_ip in $(hostname --all-ip-addresses); do
            output=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | grep "$private_ip")
            ip_address=$(echo "$output" | cut -f2 -d$'\t')

            if [[ -n ${ip_address} && "$private_ip" == "$ip_address" ]]; then
                HOST_NAME_NODE=$(echo "$output" | cut -f1 -d$'\t')
                break
            fi
        done
    fi
}

set_kubeconfig(){
    export PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
    [[ -f "/var/lib/rancher/rke2/agent/kubelet.kubeconfig" ]] && export KUBECONFIG="/var/lib/rancher/rke2/agent/kubelet.kubeconfig"
    [[ -f "/etc/rancher/rke2/rke2.yaml" ]] && export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
}

is_kubectl_enabled(){
  local try=0
  local maxtry=60
  local status="notready"
  echo "Checking if node $HOST_NAME_NODE is ready to run kubectl command."
  while [[ ${status} == "notready" ]] && (( try != maxtry )) ; do
          try=$((try+1))
          kubectl cluster-info >/dev/null 2>&1  && status="ready"
          sleep 5;
  done

  if [[ ${status} == "notready" ]]; then
    echo "Node is not ready to accept kubectl command"
  else
    echo "Node is ready to accept kubectl command"
  fi
}

enable_ipforwarding() {
  local file_name="/etc/sysctl.conf"
  echo "Enable IP Forwarding..."

  if [[ ! -f "${file_name}" || ! -w "${file_name}" ]]; then
    # either the file is not available or the user doesn't have write permission
    echo "Either file ${file_name} not present or file is not writable. Enabling ip forward using /proc/sys/net/ipv4/ip_forward..."
    echo 1 > /proc/sys/net/ipv4/ip_forward
  else
    echo "File ${file_name} is available and is writable. Checking and enabling ip forward..."
    is_ipforwarding_available=$(grep "net.ipv4.ip_forward" "${file_name}") || true
    if [[ -z ${is_ipforwarding_available} ]]; then
      echo "Adding net.ipv4.ip_forward = 1 in ${file_name}..."
      echo "net.ipv4.ip_forward = 1" >> ${file_name}
    else
      echo "Updating net.ipv4.ip_forward value with 1 in ${file_name}..."
      # shellcheck disable=SC2016
      sed -i -n -e '/^net.ipv4.ip_forward/!p' -e '$anet.ipv4.ip_forward = 1' ${file_name}
    fi
    sysctl -p
  fi
}

set_kubeconfig
is_kubectl_enabled
fetch_hostname

if [[ -n "$HOST_NAME_NODE" ]]; then
    # Pass an argument to uncordon the node. This is to cover reboot scenarios.
    if [ "$1" ]; then
        # enable ip forward
        enable_ipforwarding
        # uncordon node
        echo "Uncordon $HOST_NAME_NODE ..."
        kubectl uncordon "$HOST_NAME_NODE"
    else
        # If a PDB is enabled and there are zero available replicas on other nodes, the drain would fail for those pods, which is not the behaviour we want.
        # In that case the second command comes to the rescue: it ignores the PDB and continues removing the pods for which eviction failed earlier. See https://github.com/kubernetes/kubernetes/issues/83307
        kubectl drain "$HOST_NAME_NODE" --delete-emptydir-data --ignore-daemonsets  --timeout=90s --skip-wait-for-delete-timeout=10 --force --ignore-errors || kubectl drain "$HOST_NAME_NODE" --delete-emptydir-data --ignore-daemonsets  --force  --disable-eviction=true --timeout=30s --ignore-errors --skip-wait-for-delete-timeout=10 --pod-selector 'app!=csi-attacher,longhorn.io/component!=instance-manager,k8s-app!=kube-dns'
        node_mounted_pv=$(kubectl get volumeattachment -o json | jq --arg node "${HOST_NAME_NODE}" -r '.items[] | select(.spec.nodeName==$node) | .metadata.name + ":" + .spec.source.persistentVolumeName')
        if [[ -n "${node_mounted_pv}" ]] ; then
          while IFS=$'\n' read -r VOL_ATTACHMENT_PV_ID
          do
            PV_ID=$(echo "${VOL_ATTACHMENT_PV_ID}" | cut -d':' -f2)
            VOL_ATTACHMENT_ID=$(echo "${VOL_ATTACHMENT_PV_ID}" | cut -d':' -f1)
            if [[ -n "${PV_ID}" ]] ; then
              mounts=$(grep "${PV_ID}" /proc/mounts  | awk '{print $2}')
              if [[ -n $mounts ]] ; then
                echo "Removing dangling mounts for pvc: ${PV_ID}"
                {
                  timeout 20s xargs umount -l <<< "${mounts}"
                  exitCode="$?"
                  if [[ $exitCode -eq 0 ]] ; then
                    echo "Command to remove dangling mounts for pvc ${PV_ID} executed successfully"
                    echo "Waiting to remove dangling mounts for pvc ${PV_ID}"
                    if timeout 1m bash -c "while grep -q '${PV_ID}' /proc/mounts ; do sleep 1 ; done"  ; then
                      kubectl delete volumeattachment "${VOL_ATTACHMENT_ID}"
                      if timeout 2m bash -c "while kubectl get node '${HOST_NAME_NODE}' -o yaml | grep -q '${PV_ID}' ; do sleep 1 ; done" ; then
                      #shellcheck disable=SC1012
                        find /var/lib/kubelet -name "${PV_ID}" -print0 | xargs -0 \rm -rf
                        echo "Removed dangling mounts for pvc: ${PV_ID} successfully"
                      else
                       echo "Timeout while waiting to remove node dangling mounts for pvc: ${PV_ID}"
                     fi
                    else
                      echo "Timeout while waiting to remove dangling mounts for pvc: ${PV_ID}"
                    fi
                  elif [[ $exitCode -eq 124 ]] ; then
                    echo "Timeout while executing remove dangling mounts for pvc: ${PV_ID}"
                  else
                    echo "Error while executing remove dangling mounts for pvc: ${PV_ID}"
                  fi
                } &
              fi
            fi
          done <<< "${node_mounted_pv}"
          wait
        fi
    fi
else
  echo "Not able to fetch hostname"
fi

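After the drain completes, it can help to confirm that the node is cordoned before going further. The following is a minimal sketch, not part of drain-node.sh: `node_is_cordoned` and the `NODE_NAME` variable are hypothetical names, and the sketch assumes the node name matches the machine's hostname unless `NODE_NAME` is set. It relies on `kubectl get nodes` reporting `SchedulingDisabled` in the STATUS column of a cordoned node.

```shell
#!/bin/bash
# Hypothetical helper (not part of drain-node.sh): reads the output of
# `kubectl get nodes --no-headers` on stdin and succeeds if the named node
# reports SchedulingDisabled in its STATUS column, i.e. it is cordoned.
node_is_cordoned() {
  local node="$1"
  awk -v n="$node" '$1 == n && $2 ~ /SchedulingDisabled/ { found = 1 } END { exit !found }'
}

# Only query the cluster when kubectl is actually available on this machine.
if command -v kubectl >/dev/null 2>&1; then
  # Assumption: the node name matches the hostname unless NODE_NAME is set.
  node="${NODE_NAME:-$(hostname)}"
  if kubectl get nodes --no-headers 2>/dev/null | node_is_cordoned "$node"; then
    echo "Node $node is cordoned."
  else
    echo "Node $node is still schedulable or was not found."
  fi
fi
```

If the node name differs from the hostname, set `NODE_NAME` to the name shown by `kubectl get nodes` before running the check.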

Stop the Kubernetes process running on the node. Run one of the following commands:
- Server nodes:
```
systemctl stop rke2-server
```
- Agent nodes:
```
systemctl stop rke2-agent
```
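The two commands above differ only in the unit name, so the choice can be wrapped in a small helper. This is a minimal sketch under stated assumptions: `rke2_unit`, `ROLE`, and `DRY_RUN` are hypothetical names introduced for illustration and are not Automation Suite settings. By default the sketch only prints the command it would run.

```shell
#!/bin/bash
# Hypothetical helper: map a node role to the rke2 systemd unit named above.
rke2_unit() {
  case "$1" in
    server) echo "rke2-server" ;;
    agent)  echo "rke2-agent" ;;
    *)      echo "unknown role: $1 (expected server or agent)" >&2; return 1 ;;
  esac
}

# ROLE and DRY_RUN are illustrative variables, not Automation Suite settings.
ROLE="${ROLE:-server}"
DRY_RUN="${DRY_RUN:-1}"

unit="$(rke2_unit "$ROLE")" || exit 1
if [ "$DRY_RUN" = "1" ]; then
  # Default: show the command instead of stopping the service.
  echo "Would run: systemctl stop $unit"
else
  systemctl stop "$unit"
  # is-active exits non-zero once the unit is no longer running.
  systemctl is-active --quiet "$unit" || echo "$unit is stopped"
fi
```

For example, running the sketch with `ROLE=agent DRY_RUN=0` as root would stop rke2-agent and report whether it is no longer active.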
If your maintenance activity includes upgrading RPM packages on the machine, you must skip upgrading the rke2 package to avoid any compatibility issues.
- It is recommended to add the rke2 package to the RPM upgrade exclusion list. To do so, edit the /etc/yum.conf file and add rke2 to the exclusion. For details, see these instructions.
- Alternatively, you can temporarily exclude rke2 during yum upgrade using the following command:
```
yum upgrade --exclude "rke2-*"
```
  Important:
  If they are not excluded, the rke2-* packages may be upgraded to the latest version, causing issues in the Automation Suite cluster. The upgrade of the rke2-* packages is handled through the Automation Suite upgrade.
  
  Updating yum overwrites the /etc/yum.conf file and removes rke2-* from the exclusion list. To avoid this, update the yum tool using the following command: yum update --exclude yum-utils.
  
  To verify whether rke2 is excluded, check the /etc/yum.conf file.
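The check of /etc/yum.conf can be scripted. The sketch below is illustrative only: `rke2_is_excluded` is a hypothetical helper, and it assumes the exclusion is written as an `exclude=` line that mentions rke2 (for example `exclude=rke2-*`). Adjust the pattern if your file uses a different form.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given yum configuration file has an
# exclude line that mentions rke2 (for example: exclude=rke2-*).
rke2_is_excluded() {
  local conf="${1:-/etc/yum.conf}"
  [ -r "$conf" ] && grep -Eq '^[[:space:]]*exclude[[:space:]]*=.*rke2' "$conf"
}

if rke2_is_excluded /etc/yum.conf; then
  echo "rke2 is excluded from yum upgrades."
else
  echo "rke2 is NOT excluded; add it to /etc/yum.conf or pass --exclude to yum upgrade."
fi
```

Running the check again after any yum update is a cheap way to notice that the exclusion was removed.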
Proceed with your node maintenance activity. Once the upgrade is complete, continue with the post-node maintenance activity.

Post-node maintenance

Reboot the node by running sudo reboot or by using any other safe restart mechanism you prefer.
Once the node has rebooted, make sure the rke2 service is started. Run one of the following commands:
- Server nodes:
```
systemctl start rke2-server
```
- Agent nodes:
```
systemctl start rke2-agent
```
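Once the rke2 service is started, you must uncordon the node so that workloads can be scheduled on it again. When drain-node.sh receives any argument, it enables IP forwarding and uncordons the node instead of draining it. Run the following command: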
```
sudo bash drain-node.sh nodestart
```
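After uncordoning, the node may take a short while to report as Ready. The sketch below polls for that state, in the spirit of the `is_kubectl_enabled` loop in drain-node.sh. All names in it (`node_is_schedulable`, `retry`, `check_node`, `NODE_NAME`) are hypothetical, and the 12 attempts with a 5-second delay are arbitrary illustrative values. It assumes the node name matches the hostname unless `NODE_NAME` is set.

```shell
#!/bin/bash
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin and
# succeeds only if the node's STATUS column is exactly "Ready"
# (a cordoned node shows "Ready,SchedulingDisabled" instead).
node_is_schedulable() {
  awk -v n="$1" '$1 == n && $2 == "Ready" { ok = 1 } END { exit !ok }'
}

# Hypothetical helper: run a command up to <tries> times, sleeping <delay>
# seconds after each failed attempt. Usage: retry <tries> <delay> <command...>
retry() {
  local tries="$1"
  local delay="$2"
  shift 2
  local i=0
  while [ "$i" -lt "$tries" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

check_node() {
  kubectl get nodes --no-headers 2>/dev/null | node_is_schedulable "$1"
}

# Only query the cluster when kubectl is actually available on this machine.
if command -v kubectl >/dev/null 2>&1; then
  node="${NODE_NAME:-$(hostname)}"
  if retry 12 5 check_node "$node"; then
    echo "Node $node is Ready and schedulable."
  else
    echo "Node $node is not Ready and schedulable yet."
  fi
fi
```

Once the node reports Ready without SchedulingDisabled, you can move on to the next node, in line with the recommendation to maintain one node at a time.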