Automation Suite
2021.10
Automation Suite Installation Guide
Last updated April 19, 2024

Node maintenance

There are scenarios in which you may want to perform a node maintenance activity, such as the following:

  • Applying security patches;
  • Performing an OS upgrade;
  • Changing any network configuration;
  • Performing any activity mandated by your organization.

When performing node maintenance operations, you may accidentally break the cluster. To avoid any adverse situation, follow these guidelines.

Note:
  • UiPath does not provide guidance on how to perform node maintenance activities. You must contact your IT team for that.
  • The following guidelines only provide instructions on the steps you must take before and after the node maintenance operation, to ensure the cluster stays healthy.
  • It is a good practice to carry out maintenance activities one node at a time.

Pre-node maintenance

  1. To make sure the cluster stays healthy while you carry out the node maintenance activity, you must drain the workloads running on that node to other nodes. To drain the node, save the drain-node.sh script on the target node and run it with the following command (a sketch for verifying the drain follows this list):
    sudo bash drain-node.sh
    

    The drain-node.sh script

    #!/bin/bash
    
    # =================
    #
    #
    #
    #
    # Copyright UiPath 2021
    #
    # =================
    # LICENSE AGREEMENT
    # -----------------
    #   Use of paid UiPath products and services is subject to the licensing agreement
    #   executed between you and UiPath. Unless otherwise indicated by UiPath, use of free
    #   UiPath products is subject to the associated licensing agreement available here:
    #   https://www.uipath.com/legal/trust-and-security/legal-terms (or successor website).
    #   You must not use this file separately from the product it is a part of or is associated with.
    #
    #
    #
    # =================
    
    fetch_hostname(){
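        # Resolve this machine's Kubernetes node name: try an exact hostname match first,
        # then fall back to matching one of the host's IPs against each node's InternalIP.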
    
        HOST_NAME_NODE=$(kubectl get nodes -o name | cut -d'/' -f2 | grep "$(hostname)")
    
        if ! [[ -n ${HOST_NAME_NODE} && "$(hostname)" == "$HOST_NAME_NODE" ]]; then
            for private_ip in $(hostname --all-ip-addresses); do
                output=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | grep "$private_ip")
                ip_address=$(echo "$output" | cut -f2 -d$'\t')
    
                if [[ -n ${ip_address} && "$private_ip" == "$ip_address" ]]; then
                    HOST_NAME_NODE=$(echo "$output" | cut -f1 -d$'\t')
                    break
                fi
            done
        fi
    }
    
    set_kubeconfig(){
        export PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
        [[ -f "/var/lib/rancher/rke2/agent/kubelet.kubeconfig" ]] && export KUBECONFIG="/var/lib/rancher/rke2/agent/kubelet.kubeconfig"
        [[ -f "/etc/rancher/rke2/rke2.yaml" ]] && export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
    }
    
    is_kubectl_enabled(){
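      # Wait until the Kubernetes API server responds to kubectl (up to 60 tries, 5s apart).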
      local try=0
      local maxtry=60
      local status="notready"
      echo "Checking if node $HOST_NAME_NODE is ready to run kubectl command."
      while [[ ${status} == "notready" ]] && (( try != maxtry )) ; do
              try=$((try+1))
              kubectl cluster-info >/dev/null 2>&1  && status="ready"
              sleep 5;
      done
    
      if [[ ${status} == "notready" ]]; then
        echo "Node is not ready to accept kubectl commands"
      else
        echo "Node is ready to accept kubectl commands"
      fi
    }
    
    enable_ipforwarding() {
      local file_name="/etc/sysctl.conf"
      echo "Enable IP Forwarding..."
    
      if [[ ! -f "${file_name}" || ! -w "${file_name}" ]]; then
        # either file is not available or user doesn't have edit permission
        echo "Either file ${file_name} not present or file is not writable. Enabling ip forward using /proc/sys/net/ipv4/ip_forward..."
        echo 1 > /proc/sys/net/ipv4/ip_forward
      else
        echo "File ${file_name} is available and is writable. Checking and enabling ip forward..."
        is_ipforwarding_available=$(grep "net.ipv4.ip_forward" "${file_name}") || true
        if [[ -z ${is_ipforwarding_available} ]]; then
          echo "Adding net.ipv4.ip_forward = 1 in ${file_name}..."
          echo "net.ipv4.ip_forward = 1" >> ${file_name}
        else
          echo "Updating net.ipv4.ip_forward value with 1 in ${file_name}..."
          # shellcheck disable=SC2016
          sed -i -n -e '/^net.ipv4.ip_forward/!p' -e '$anet.ipv4.ip_forward = 1' ${file_name}
        fi
        sysctl -p
      fi
    }
    
    set_kubeconfig
    is_kubectl_enabled
    fetch_hostname
    
    if [[ -n "$HOST_NAME_NODE" ]]; then
        # Pass an argument to uncordon the node. This is to cover reboot scenarios.
        if [ "$1" ]; then
            # enable ip forward
            enable_ipforwarding
            # uncordon node
            echo "Uncordon $HOST_NAME_NODE ..."
            kubectl uncordon "$HOST_NAME_NODE"
        else
            # If a PodDisruptionBudget (PDB) is enabled and there are zero available replicas on other nodes, the drain would fail for those pods, which is not the behavior we want.
            # The second command then ignores the PDB and continues evicting the pods whose eviction failed earlier: https://github.com/kubernetes/kubernetes/issues/83307
            kubectl drain "$HOST_NAME_NODE" --delete-emptydir-data --ignore-daemonsets  --timeout=90s --skip-wait-for-delete-timeout=10 --force --ignore-errors || kubectl drain "$HOST_NAME_NODE" --delete-emptydir-data --ignore-daemonsets  --force  --disable-eviction=true --timeout=30s --ignore-errors --skip-wait-for-delete-timeout=10 --pod-selector 'app!=csi-attacher,longhorn.io/component!=instance-manager,k8s-app!=kube-dns'
            node_mounted_pv=$(kubectl get volumeattachment -o json | jq --arg node "${HOST_NAME_NODE}" -r '.items[] | select(.spec.nodeName==$node) | .metadata.name + ":" + .spec.source.persistentVolumeName')
            if [[ -n "${node_mounted_pv}" ]] ; then
              while IFS=$'\n' read -r VOL_ATTACHMENT_PV_ID
              do
                PV_ID=$(echo "${VOL_ATTACHMENT_PV_ID}" | cut -d':' -f2)
                VOL_ATTACHMENT_ID=$(echo "${VOL_ATTACHMENT_PV_ID}" | cut -d':' -f1)
                if [[ -n "${PV_ID}" ]] ; then
                  mounts=$(grep "${PV_ID}" /proc/mounts  | awk '{print $2}')
                  if [[ -n $mounts ]] ; then
                    echo "Removing dangling mounts for pvc: ${PV_ID}"
                    {
                      timeout 20s xargs umount -l <<< "${mounts}"
                      exitCode="$?"
                      if [[ $exitCode -eq 0 ]] ; then
                        echo "Command to remove dangling mounts for pvc ${PV_ID} executed successfully"
                        echo "Waiting to remove dangling mounts for pvc ${PV_ID}"
                        if timeout 1m bash -c "while grep -q '${PV_ID}' /proc/mounts ; do sleep 1 ; done"  ; then
                          kubectl delete volumeattachment "${VOL_ATTACHMENT_ID}"
                          if timeout 2m bash -c "while kubectl get node '${HOST_NAME_NODE}' -o yaml | grep -q '${PV_ID}' ; do sleep 1 ; done" ; then
                          #shellcheck disable=SC1012
                            find /var/lib/kubelet -name "${PV_ID}" -print0 | xargs -0 \rm -rf
                            echo "Removed dangling mounts for pvc: ${PV_ID} successfully"
                          else
                           echo "Timeout while waiting to remove node dangling mounts for pvc: ${PV_ID}"
                         fi
                        else
                          echo "Timeout while waiting to remove dangling mounts for pvc: ${PV_ID}"
                        fi
                      elif [[ $exitCode -eq 124 ]] ; then
                        echo "Timeout while executing remove dangling mounts for pvc: ${PV_ID}"
                      else
                        echo "Error while executing remove dangling mounts for pvc: ${PV_ID}"
                      fi
                    } &
                  fi
                fi
              done <<< "${node_mounted_pv}"
              wait
            fi
        fi
    else
      echo "Not able to fetch hostname"
    fi
  2. Stop the Kubernetes process running on the node. Run either of the following commands:
    • Server node:

      systemctl stop rke2-server
    • Agent node:

      systemctl stop rke2-agent
  3. If your maintenance activity includes upgrading the RPM packages on the machine, you must skip upgrading the rke2 package to avoid any compatibility issues.
    • It is recommended to add the rke2 package to the RPM upgrade exclusion list (a configuration sketch follows this list). To do so, modify the /etc/yum.conf file and add rke2 to the exclusion list. For more details, see these instructions.
    • Alternatively, you can temporarily exclude rke2 during yum upgrade by using the following command:
      yum upgrade --exclude "rke2-*"
      Important:
      If not excluded, the rke2-* packages could be upgraded to the latest version, causing issues in the Automation Suite cluster. The rke2-* package upgrade is handled through the Automation Suite upgrade.
      Upgrading yum overwrites the /etc/yum.conf file and removes rke2-* from the exclusion list. To prevent this, update the yum tool with the following command: yum update --exclude yum-utils.
      To check whether rke2 is excluded, review the /etc/yum.conf file.
  4. Continue with your node maintenance activity. Once it is complete, proceed with the post-node maintenance activity.
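
Before starting maintenance, you can optionally confirm that the drain from step 1 took effect. This is a minimal verification sketch using standard kubectl commands; <node-name> is a placeholder for the name reported by kubectl get nodes:

    # After a successful drain, the node is cordoned and reports SchedulingDisabled:
    kubectl get node <node-name>
    # Only DaemonSet-managed pods should still be running on the drained node:
    kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>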
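
For the exclusion list in step 3, the relevant entry in /etc/yum.conf typically looks like the following. This is a sketch of standard yum configuration syntax, not an Automation Suite-specific file:

    [main]
    # Skip rke2 packages during RPM upgrades; they are upgraded with Automation Suite.
    exclude=rke2-*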

Post-node maintenance

  1. Restart the node by running sudo reboot or by using any other safe restart mechanism of your choice.
  2. Once the node restarts, make sure the rke2 service has started. Run either of the following commands:
    • Server node:

      systemctl start rke2-server
    • Agent node:

      systemctl start rke2-agent
  3. Once the rke2 service has started, you must uncordon the node (resume scheduling on it) by running the following command:
    sudo bash drain-node.sh nodestart
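
Optionally, you can verify that the node rejoined the cluster and accepts workloads again. A minimal sketch using standard kubectl commands; <node-name> is a placeholder for the name reported by kubectl get nodes:

    # All nodes should report Ready, and the maintained node must no longer
    # show SchedulingDisabled after the uncordon:
    kubectl get nodes
    # Pods should gradually be scheduled on the node again:
    kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>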
