Automation Suite
2023.10
Automation Suite on Linux Installation Guide
Last updated April 19, 2024

Upgrade fails due to unhealthy Ceph

Description

When trying to upgrade to a new version of Automation Suite, you may see the following error message: Ceph objectstore is not completely healthy at the moment. Inner exception - Timeout waiting for all PGs to become active+clean.
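Before applying the fix below, you can confirm the symptom by inspecting the current Ceph state directly. This is an illustrative check only, assuming the rook-ceph-tools deployment created by Automation Suite is reachable in the rook-ceph namespace:

# Show the overall cluster health and the PG summary (look for PGs that are not active+clean)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail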

Solution

To fix this upgrade issue, check whether the OSD pods are running and healthy by executing the following command:

kubectl -n rook-ceph get pod -l app=rook-ceph-osd --no-headers | grep -P '([0-9])/\1' -v
  • If the command does not output any pods, check whether the Ceph placement groups (PGs) are recovering by running the following command:

    function is_ceph_pg_active_clean() {
      local return_code=1
      if kubectl -n rook-ceph exec  deploy/rook-ceph-tools -- ceph status --format json | jq '. as $root | ($root | .pgmap.num_pgs) as $total_pgs | try ( ($root | .pgmap.pgs_by_state[] | select(.state_name == "active+clean").count)  // 0) as $active_pgs | if $total_pgs == $active_pgs then true else false end' | grep -q 'true';then
        return_code=0
      fi
      [[ $return_code -eq 0 ]] && echo "All Ceph Placement Groups (PG) are active+clean"
      if [[ $return_code -ne 0 ]]; then
        echo "Not all Ceph Placement Groups (PG) are active+clean. Please wait for the PGs to become active+clean"
        kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg dump --format json | jq -r '.pg_map.pg_stats[] | select(.state!="active+clean") | [.pgid, .state] | @tsv'
      fi
      return "${return_code}"
    }
    # Execute the function multiple times to get updated ceph PG status
    is_ceph_pg_active_clean
    Note: If none of the affected Ceph PGs recover, even after waiting for more than 30 minutes, raise a ticket with UiPath® Support. For one way to automate the repeated checks, see the polling sketch at the end of this page.
  • If the command does output pod(s), you must first fix the issue affecting them:

    • If a pod is stuck in Init:0/4, it could be an issue with the PV provider (Longhorn). To debug this issue, raise a ticket with UiPath® Support; the diagnostic sketch after this list shows one way to gather details to attach to the ticket.
    • If a pod is in CrashLoopBackOff, fix the issue by running the following command:
      function cleanup_crashing_osd() {
          local restart_operator="false"
          local min_required_healthy_osd=1
          local in_osd
          local up_osd
          local healthy_osd_pod_count
          local crashed_osd_deploy
          local crashed_pvc_name
      
          if ! kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail  | grep 'rook-ceph.rgw.buckets.data' | grep -q 'replicated'; then
              min_required_healthy_osd=2
          fi
          in_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status   -f json  | jq -r '.osdmap.num_in_osds')
          up_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status   -f json  | jq -r '.osdmap.num_up_osds')
          healthy_osd_pod_count=$(kubectl -n rook-ceph get pod -l app=rook-ceph-osd | grep 'Running' | grep -c -P '([0-9])/\1')
          if ! [[ $in_osd -ge $min_required_healthy_osd && $up_osd -ge $min_required_healthy_osd && $healthy_osd_pod_count -ge $min_required_healthy_osd ]]; then
              return
          fi
          for crashed_osd_deploy in $(kubectl -n rook-ceph get pod -l app=rook-ceph-osd  | grep 'CrashLoopBackOff' | cut -d'-' -f'1-4') ; do
              if kubectl -n rook-ceph logs "deployment/${crashed_osd_deploy}" | grep -q '/crash/'; then
                  echo "Found crashing OSD deployment: '${crashed_osd_deploy}'"
                  crashed_pvc_name=$(kubectl -n rook-ceph get deployment "${crashed_osd_deploy}" -o json | jq -r '.metadata.labels["ceph.rook.io/pvc"]')
                  info "Removing crashing OSD deployment: '${crashed_osd_deploy}' and PVC: '${crashed_pvc_name}'"
                  timeout 60  kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" || kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" --force --grace-period=0
                  timeout 100 kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" || kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" --force --grace-period=0
                  restart_operator="true"
              fi
          done
          if [[ $restart_operator == "true" ]]; then
              kubectl -n rook-ceph rollout restart deployment/rook-ceph-operator
          fi
          return 0
      }
      # Execute the cleanup function
      cleanup_crashing_osd
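If an OSD pod is stuck in Init:0/4, the commands below are an illustrative sketch for collecting diagnostic details before raising the support ticket. They only read cluster state and assume the default rook-ceph and longhorn-system namespaces used by Automation Suite; <osd-pod-name> is a placeholder for the pod reported as unhealthy:

# List the OSD pods and the nodes they are scheduled on
kubectl -n rook-ceph get pod -l app=rook-ceph-osd -o wide
# Describe the stuck pod to see which init container and volume it is waiting on
kubectl -n rook-ceph describe pod <osd-pod-name>
# Recent events often point to PV/PVC attach or mount failures
kubectl -n rook-ceph get events --sort-by=.lastTimestamp | tail -n 20
# Check that the Longhorn components backing the PVs are healthy
kubectl -n longhorn-system get pods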

After fixing the crashing OSD, verify whether the PGs are recovering by running the following command:

is_ceph_pg_active_clean
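If you prefer not to re-run the check by hand, the sketch below polls is_ceph_pg_active_clean (defined earlier on this page) once per minute for up to 30 minutes, matching the waiting time from the note above. It is illustrative only and assumes the function is already loaded in your current shell:

# Poll the PG status for up to 30 minutes (30 attempts, 60 seconds apart)
for attempt in $(seq 1 30); do
    if is_ceph_pg_active_clean; then
        break
    fi
    echo "Attempt ${attempt}/30: PGs are not all active+clean yet; retrying in 60 seconds..."
    sleep 60
done

If the loop ends without all PGs becoming active+clean, raise a ticket with UiPath® Support, as described in the note above.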