Performing node maintenance

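Scenarios in which you perform node maintenance activities include the following:

- Applying security patches
- Upgrading the operating system
- Changing the network configuration
- Performing any other activity mandated by your organization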

Node maintenance operations can accidentally break the cluster. To avoid problems, follow the guidance provided here.

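Note:

- UiPath does not provide guidance on how to perform the node maintenance activities themselves. Contact your IT team for that guidance.
- The following guidelines cover only the steps you must take before and after a node maintenance operation to keep the cluster healthy.
- We recommend performing node maintenance activities on one node at a time.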

Pre-node maintenance activities

To keep the cluster healthy during node maintenance, you must drain the workloads running on the node to other nodes. To drain the node, save the drain-node.sh script on the target node and run it with the following command:

sudo bash drain-node.sh

The drain-node.sh script:

#!/bin/bash

# =================
#
#
#
#
# Copyright UiPath 2021
#
# =================
# LICENSE AGREEMENT
# -----------------
#   Use of paid UiPath products and services is subject to the licensing agreement
#   executed between you and UiPath. Unless otherwise indicated by UiPath, use of free
#   UiPath products is subject to the associated licensing agreement available here:
#   https://www.uipath.com/legal/trust-and-security/legal-terms (or successor website).
#   You must not use this file separately from the product it is a part of or is associated with.
#
#
#
# =================

fetch_hostname(){

    HOST_NAME_NODE=$(kubectl get nodes -o name | cut -d'/' -f2 | grep "$(hostname)")

    if ! [[ -n ${HOST_NAME_NODE} && "$(hostname)" == "$HOST_NAME_NODE" ]]; then
        for private_ip in $(hostname --all-ip-addresses); do
            output=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | grep "$private_ip")
            ip_address=$(echo "$output" | cut -f2 -d$'\t')

            if [[ -n ${ip_address} && "$private_ip" == "$ip_address" ]]; then
                HOST_NAME_NODE=$(echo "$output" | cut -f1 -d$'\t')
                break
            fi
        done
    fi
}

set_kubeconfig(){
    export PATH=$PATH:/var/lib/rancher/rke2/bin:/usr/local/bin
    [[ -f "/var/lib/rancher/rke2/agent/kubelet.kubeconfig" ]] && export KUBECONFIG="/var/lib/rancher/rke2/agent/kubelet.kubeconfig"
    [[ -f "/etc/rancher/rke2/rke2.yaml" ]] && export KUBECONFIG="/etc/rancher/rke2/rke2.yaml"
}

is_kubectl_enabled(){
  local try=0
  local maxtry=60
  local status="notready"
  # HOST_NAME_NODE is not set yet at this point (fetch_hostname runs later), so report the OS host name
  echo "Checking if node $(hostname) is ready to run kubectl command."
  while [[ ${status} == "notready" ]] && (( try != maxtry )) ; do
          try=$((try+1))
          kubectl cluster-info >/dev/null 2>&1  && status="ready"
          sleep 5;
  done

  if [[ ${status} == "notready" ]]; then
    echo "Node is not ready to accept kubectl command"
  else
    echo "Node is ready to accept kubectl command"
  fi
}

enable_ipforwarding() {
  local file_name="/etc/sysctl.conf"
  echo "Enable IP Forwarding..."

  if [[ ! -f "${file_name}" || ! -w "${file_name}" ]]; then
    # either file is not available or user doesn't have edit permission
    echo "Either file ${file_name} not present or file is not writable. Enabling ip forward using /proc/sys/net/ipv4/ip_forward..."
    echo 1 > /proc/sys/net/ipv4/ip_forward
  else
    echo "File ${file_name} is available and is writable. Checking and enabling ip forward..."
    is_ipforwarding_available=$(grep "net.ipv4.ip_forward" "${file_name}") || true
    if [[ -z ${is_ipforwarding_available} ]]; then
      echo "Adding net.ipv4.ip_forward = 1 in ${file_name}..."
      echo "net.ipv4.ip_forward = 1" >> ${file_name}
    else
      echo "Updating net.ipv4.ip_forward value with 1 in ${file_name}..."
      # shellcheck disable=SC2016
      sed -i -n -e '/^net.ipv4.ip_forward/!p' -e '$anet.ipv4.ip_forward = 1' ${file_name}
    fi
    sysctl -p
  fi
}

set_kubeconfig
is_kubectl_enabled
fetch_hostname

if [[ -n "$HOST_NAME_NODE" ]]; then
    # Pass an argument to uncordon the node. This is to cover reboot scenarios.
    if [ "$1" ]; then
        # enable ip forward
        enable_ipforwarding
        # uncordon node
        echo "Uncordon $HOST_NAME_NODE ..."
        kubectl uncordon "$HOST_NAME_NODE"
    else
        # If a PDB is enabled and there are zero available replicas on other nodes, drain would fail for those pods, but that's not the behaviour we want.
        # That's when the second command comes to the rescue: it ignores the PDB and continues with the pods for which eviction failed earlier. See https://github.com/kubernetes/kubernetes/issues/83307
        kubectl drain "$HOST_NAME_NODE" --delete-emptydir-data --ignore-daemonsets  --timeout=90s --skip-wait-for-delete-timeout=10 --force --ignore-errors || kubectl drain "$HOST_NAME_NODE" --delete-emptydir-data --ignore-daemonsets  --force  --disable-eviction=true --timeout=30s --ignore-errors --skip-wait-for-delete-timeout=10 --pod-selector 'app!=csi-attacher,longhorn.io/component!=instance-manager,k8s-app!=kube-dns'
        node_mounted_pv=$(kubectl get volumeattachment -o json | jq --arg node "${HOST_NAME_NODE}" -r '.items[] | select(.spec.nodeName==$node) | .metadata.name + ":" + .spec.source.persistentVolumeName')
        if [[ -n "${node_mounted_pv}" ]] ; then
          while IFS=$'\n' read -r VOL_ATTACHMENT_PV_ID
          do
            PV_ID=$(echo "${VOL_ATTACHMENT_PV_ID}" | cut -d':' -f2)
            VOL_ATTACHMENT_ID=$(echo "${VOL_ATTACHMENT_PV_ID}" | cut -d':' -f1)
            if [[ -n "${PV_ID}" ]] ; then
              mounts=$(grep "${PV_ID}" /proc/mounts  | awk '{print $2}')
              if [[ -n $mounts ]] ; then
                echo "Removing dangling mounts for pvc: ${PV_ID}"
                {
                  timeout 20s xargs umount -l <<< "${mounts}"
                  exitCode="$?"
                  if [[ $exitCode -eq 0 ]] ; then
                    echo "Command to remove dangling mounts for pvc ${PV_ID} executed successfully"
                    echo "Waiting to remove dangling mounts for pvc ${PV_ID}"
                    if timeout 1m bash -c "while grep -q '${PV_ID}' /proc/mounts ; do sleep 1 ; done"  ; then
                      kubectl delete volumeattachment "${VOL_ATTACHMENT_ID}"
                      if timeout 2m bash -c "while kubectl get node '${HOST_NAME_NODE}' -o yaml | grep -q '${PV_ID}' ; do sleep 1 ; done" ; then
                      #shellcheck disable=SC1012
                        find /var/lib/kubelet -name "${PV_ID}" -print0 | xargs -0 \rm -rf
                        echo "Removed dangling mounts for pvc: ${PV_ID} successfully"
                      else
                       echo "Timeout while waiting to remove node dangling mounts for pvc: ${PV_ID}"
                     fi
                    else
                      echo "Timeout while waiting to remove dangling mounts for pvc: ${PV_ID}"
                    fi
                  elif [[ $exitCode -eq 124 ]] ; then
                    echo "Timeout while executing remove dangling mounts for pvc: ${PV_ID}"
                  else
                    echo "Error while executing remove dangling mounts for pvc: ${PV_ID}"
                  fi
                } &
              fi
            fi
          done <<< "${node_mounted_pv}"
          wait
        fi
    fi
else
  echo "Not able to fetch hostname"
fi
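
To confirm that the drain moved the workloads off the node, you can list the pods that are still scheduled on it. The following is a minimal sketch and not part of drain-node.sh: the function name pods_on_node is hypothetical, and it assumes kubectl is on the PATH and KUBECONFIG is set the same way the script's set_kubeconfig function sets them. DaemonSet pods are expected to remain, because the script drains with --ignore-daemonsets.

```bash
# Hypothetical helper (not part of drain-node.sh).
# Lists every pod, in all namespaces, that is still scheduled on the given node.
# Assumes kubectl is on PATH and KUBECONFIG points at the cluster.
pods_on_node() {
  kubectl get pods --all-namespaces --no-headers -o wide \
    --field-selector "spec.nodeName=$1"
}
```

For example, pods_on_node "$(hostname)" works when the Kubernetes node name matches the host name; otherwise pass the name shown by kubectl get nodes.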

Stop the Kubernetes process running on the node. Run one of the following commands:
- Server nodes:
```
systemctl stop rke2-server
```
- Agent nodes:
```
systemctl stop rke2-agent
```
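If you script this step, the unit name can be derived from the node role. The sketch below is one possible approach rather than an official check: the function name rke2_unit_for_node is hypothetical, and it assumes (as the set_kubeconfig function in drain-node.sh suggests) that /etc/rancher/rke2/rke2.yaml is present only on server nodes. Verify that assumption on your own cluster before relying on it.

```bash
# Hypothetical helper: print the rke2 systemd unit matching this node's role.
# Assumption: the rke2.yaml kubeconfig exists only on server nodes.
# An alternative path can be passed as the first argument.
rke2_unit_for_node() {
  if [ -f "${1:-/etc/rancher/rke2/rke2.yaml}" ]; then
    echo "rke2-server"
  else
    echo "rke2-agent"
  fi
}
```

With this helper, sudo systemctl stop "$(rke2_unit_for_node)" stops whichever of the two services applies to the node.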
If the maintenance activity includes upgrading RPM packages on the machine, you must skip upgrading the rke2 packages to avoid compatibility issues.
- We recommend adding the rke2 packages to the RPM upgrade exclusion list. To do so, edit the /etc/yum.conf file and add rke2 to the exclusions. For details, see these instructions.
- Alternatively, you can temporarily exclude rke2 during yum upgrade with the following command:
```
yum upgrade --exclude "rke2-*"
```
  Important:
  If they are not excluded, the rke2-* packages may be upgraded to the latest version, which can cause problems in the Automation Suite cluster. Upgrading the rke2-* packages is handled by the Automation Suite upgrade.
  
  Updating yum overwrites the /etc/yum.conf file and removes rke2-* from the exclusion list. To prevent this, update the yum tools with the yum update --exclude yum-utils command.
  
  To check whether rke2 is excluded, inspect the /etc/yum.conf file.
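
The check of /etc/yum.conf can also be scripted. The following is a rough sketch under stated assumptions: the function name rke2_is_excluded is hypothetical, and it only looks for an exclude= (or excludepkgs=) line that mentions rke2, so it does not validate the rest of the file.

```bash
# Hypothetical helper: succeed (exit status 0) when the yum configuration
# contains an exclude line that mentions rke2.
# The file defaults to /etc/yum.conf; another path can be passed as $1.
rke2_is_excluded() {
  grep -Eq '^[[:space:]]*exclude(pkgs)?[[:space:]]*=.*rke2' "${1:-/etc/yum.conf}"
}
```

For example, rke2_is_excluded || echo "rke2 is not excluded" prints a warning before you start the package upgrade.
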
Proceed with the node maintenance activity. Once the upgrade is complete, continue with the post-node maintenance activities.

Post-node maintenance activities

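Reboot the node by running sudo reboot or by using any other safe restart mechanism.
Once the node has rebooted, start the rke2 service. Run one of the following commands:
- Server nodes:
```
systemctl start rke2-server
```
- Agent nodes:
```
systemctl start rke2-agent
```
Once the rke2 service has started, you must uncordon the node so that it can accept workloads again. Run the following command, which passes an argument to drain-node.sh so that it enables IP forwarding and uncordons the node instead of draining it:
```
sudo bash drain-node.sh nodestart
```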
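After the script finishes, you may want to confirm that the node is back in service. The helpers below are an illustrative sketch, not part of drain-node.sh: the function names are hypothetical, and they assume kubectl is on the PATH with KUBECONFIG set as in the script's set_kubeconfig function.

```bash
# Hypothetical helpers (not part of drain-node.sh).

# Succeeds when the node reports the Ready condition as "True".
node_is_ready() {
  [ "$(kubectl get node "$1" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]
}

# Succeeds when the node is not cordoned, that is, spec.unschedulable is not "true".
node_is_schedulable() {
  [ "$(kubectl get node "$1" -o jsonpath='{.spec.unschedulable}')" != "true" ]
}
```

For example, node_is_ready <node-name> && node_is_schedulable <node-name> && echo "node is back in service", where <node-name> is the name shown by kubectl get nodes.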