automation-suite
2023.4
false
- 概述
- 要求
- 安装
- 安装后
- 集群管理
- 监控和警示
- 迁移和升级
- 特定于产品的配置
- 最佳实践和维护
- 故障排除
由于 Ceph 运行状况不佳,升级失败
重要 :
请注意此内容已使用机器翻译进行了部分本地化。
Linux 版 Automation Suite 安装指南
Last updated 2024年9月5日
由于 Ceph 运行状况不佳,升级失败
尝试升级到新的 Automation Suite 版本时,您可能会看到以下错误消息:
Ceph objectstore is not completely healthy at the moment. Inner exception - Timeout waiting for all PGs to become active+clean
。
要修复此升级问题,请运行以下命令,验证 OSD Pod 是否正在运行且运行状况良好:
kubectl -n rook-ceph get pod -l app=rook-ceph-osd --no-headers | grep -P '([0-9])/\1' -v
kubectl -n rook-ceph get pod -l app=rook-ceph-osd --no-headers | grep -P '([0-9])/\1' -v
-
如果该命令未输出任何 Pod,请运行以下命令,验证 Ceph 归置组 (PG) 是否正在恢复:
function is_ceph_pg_active_clean() { local return_code=1 if kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --format json | jq '. as $root | ($root | .pgmap.num_pgs) as $total_pgs | try ( ($root | .pgmap.pgs_by_state[] | select(.state_name == "active+clean").count) // 0) as $active_pgs | if $total_pgs == $active_pgs then true else false end' | grep -q 'true';then return_code=0 fi [[ $return_code -eq 0 ]] && echo "All Ceph Placement groups(PG) are active+clean" if [[ $return_code -ne 0 ]]; then echo "All Ceph Placement groups(PG) are not active+clean. Please wait for PGs to become active+clean" kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg dump --format json | jq -r '.pg_map.pg_stats[] | select(.state!="active+clean") | [.pgid, .state] | @tsv' fi return "${return_code}" } # Execute the function multiple times to get updated ceph PG status is_ceph_pg_active_clean
function is_ceph_pg_active_clean() { local return_code=1 if kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --format json | jq '. as $root | ($root | .pgmap.num_pgs) as $total_pgs | try ( ($root | .pgmap.pgs_by_state[] | select(.state_name == "active+clean").count) // 0) as $active_pgs | if $total_pgs == $active_pgs then true else false end' | grep -q 'true';then return_code=0 fi [[ $return_code -eq 0 ]] && echo "All Ceph Placement groups(PG) are active+clean" if [[ $return_code -ne 0 ]]; then echo "All Ceph Placement groups(PG) are not active+clean. Please wait for PGs to become active+clean" kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg dump --format json | jq -r '.pg_map.pg_stats[] | select(.state!="active+clean") | [.pgid, .state] | @tsv' fi return "${return_code}" } # Execute the function multiple times to get updated ceph PG status is_ceph_pg_active_clean注意:如果受影响的 Ceph PG 在等待超过 30 分钟后仍未恢复,请通过 UiPath™ 支持团队提出工单。 -
如果命令输出 Pod,则必须首先修复影响 Pod 的问题:
- 如果 Pod 卡在
Init:0/4
中,则可能是 PV 提供程序 (Longhorn) 的问题。如要解决此问题,请向 UiPath™ 支持团队提交工单。 -
如果 Pod 位于
CrashLoopBackOff
中,请运行以下命令来解决此问题:function cleanup_crashing_osd() { local restart_operator="false" local min_required_healthy_osd=1 local in_osd local up_osd local healthy_osd_pod_count local crashed_osd_deploy local crashed_pvc_name if ! kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail | grep 'rook-ceph.rgw.buckets.data' | grep -q 'replicated'; then min_required_healthy_osd=2 fi in_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status -f json | jq -r '.osdmap.num_in_osds') up_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status -f json | jq -r '.osdmap.num_up_osds') healthy_osd_pod_count=$(kubectl -n rook-ceph get pod -l app=rook-ceph-osd | grep 'Running' | grep -c -P '([0-9])/\1') if ! [[ $in_osd -ge $min_required_healthy_osd && $up_osd -ge $min_required_healthy_osd && $healthy_osd_pod_count -ge $min_required_healthy_osd ]]; then return fi for crashed_osd_deploy in $(kubectl -n rook-ceph get pod -l app=rook-ceph-osd | grep 'CrashLoopBackOff' | cut -d'-' -f'1-4') ; do if kubectl -n rook-ceph logs "deployment/${crashed_osd_deploy}" | grep -q '/crash/'; then echo "Found crashing OSD deployment: '${crashed_osd_deploy}'" crashed_pvc_name=$(kubectl -n rook-ceph get deployment "${crashed_osd_deploy}" -o json | jq -r '.metadata.labels["ceph.rook.io/pvc"]') info "Removing crashing OSD deployment: '${crashed_osd_deploy}' and PVC: '${crashed_pvc_name}'" timeout 60 kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" || kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" --force --grace-period=0 timeout 100 kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" || kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" --force --grace-period=0 restart_operator="true" fi done if [[ $restart_operator == "true" ]]; then kubectl -n rook-ceph rollout restart deployment/rook-ceph-operator fi return 0 } # Execute the cleanup function cleanup_crashing_osd
function cleanup_crashing_osd() { local restart_operator="false" local min_required_healthy_osd=1 local in_osd local up_osd local healthy_osd_pod_count local crashed_osd_deploy local crashed_pvc_name if ! kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail | grep 'rook-ceph.rgw.buckets.data' | grep -q 'replicated'; then min_required_healthy_osd=2 fi in_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status -f json | jq -r '.osdmap.num_in_osds') up_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status -f json | jq -r '.osdmap.num_up_osds') healthy_osd_pod_count=$(kubectl -n rook-ceph get pod -l app=rook-ceph-osd | grep 'Running' | grep -c -P '([0-9])/\1') if ! [[ $in_osd -ge $min_required_healthy_osd && $up_osd -ge $min_required_healthy_osd && $healthy_osd_pod_count -ge $min_required_healthy_osd ]]; then return fi for crashed_osd_deploy in $(kubectl -n rook-ceph get pod -l app=rook-ceph-osd | grep 'CrashLoopBackOff' | cut -d'-' -f'1-4') ; do if kubectl -n rook-ceph logs "deployment/${crashed_osd_deploy}" | grep -q '/crash/'; then echo "Found crashing OSD deployment: '${crashed_osd_deploy}'" crashed_pvc_name=$(kubectl -n rook-ceph get deployment "${crashed_osd_deploy}" -o json | jq -r '.metadata.labels["ceph.rook.io/pvc"]') info "Removing crashing OSD deployment: '${crashed_osd_deploy}' and PVC: '${crashed_pvc_name}'" timeout 60 kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" || kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" --force --grace-period=0 timeout 100 kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" || kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" --force --grace-period=0 restart_operator="true" fi done if [[ $restart_operator == "true" ]]; then kubectl -n rook-ceph rollout restart deployment/rook-ceph-operator fi return 0 } # Execute the cleanup function cleanup_crashing_osd
- 如果 Pod 卡在
修复崩溃的 OSD 后,通过运行以下命令验证 PG 是否正在恢复:
is_ceph_pg_active_clean
is_ceph_pg_active_clean