Automation Suite
2022.4
Automation Suite Installation Guide
Last updated Apr 24, 2024

Prometheus in CrashLoopBackOff state with out-of-memory (OOM) error

Description

In certain situations, Prometheus pods can fail to start due to an out-of-memory (OOM) error. To identify the affected Prometheus pods, check each pod's last termination state for an OOM-related reason (Kubernetes reports this as OOMKilled).
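One way to perform this check from the command line is sketched below. It is an illustration only: the helper names last_termination_reasons and was_oom_killed are hypothetical, and the pod names and the cattle-monitoring-system namespace are taken from the solution steps in this article, so adjust them if your cluster differs.

```shell
# Print each container's name and last termination reason for one pod.
# The namespace defaults to the one used in this article.
last_termination_reasons() {
  local pod="$1"
  local namespace="${2:-cattle-monitoring-system}"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Succeed only if at least one container of the pod was last terminated as OOMKilled.
was_oom_killed() {
  last_termination_reasons "$@" | grep -q 'OOMKilled'
}
```

For example, `was_oom_killed prometheus-rancher-monitoring-prometheus-0 && echo "affected"` prints "affected" only when that pod's last termination was caused by OOM. Running `kubectl describe pod` on the same pod shows the same information under "Last State".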

Solution

To fix the issue, take the following steps:

  1. Update the argocd-cm configmap so that ArgoCD ignores ServiceMonitor resources, by setting resource.exclusions.
    1. Create a backup of the old configmap:
      kubectl get configmap argocd-cm -n argocd -o yaml >> argocd-cm-old.yaml
    2. Edit argocd-cm:
      kubectl edit configmap argocd-cm -n argocd
    3. Add a new data entry for resource.exclusions:
      data:
        resource.exclusions: |
          - apiGroups:
              - monitoring.coreos.com
            kinds:
              - ServiceMonitor
            clusters:
              - "*"
  2. Delete the envoy-stats-monitor service monitor under the istio-system namespace:
    kubectl -n istio-system delete servicemonitor envoy-stats-monitor
  3. Force-delete the Prometheus pods:
    kubectl delete pod prometheus-rancher-monitoring-prometheus-0 -n cattle-monitoring-system
    kubectl delete pod prometheus-rancher-monitoring-prometheus-1 -n cattle-monitoring-system
  4. Wait for the Prometheus pods under cattle-monitoring-system to restart successfully.
Important: Deleting the wal and chunks_head directories results in a loss of all monitoring data gathered during the days these files were accumulated. However, once Prometheus is up and running again, any new alerts and metrics data will be accessible.
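After completing the steps, it can help to confirm that the configmap change was saved and that the pods have recovered. The sketch below is one possible way to do that and is not part of the official procedure: the helper names show_resource_exclusions and wait_for_prometheus and the 300s default timeout are assumptions made for illustration, while the resource names and namespaces come from the steps above.

```shell
# Print the resource.exclusions value currently stored in argocd-cm.
# The dot inside the key name is escaped so that jsonpath treats it as one key.
show_resource_exclusions() {
  kubectl get configmap argocd-cm -n argocd \
    -o jsonpath='{.data.resource\.exclusions}'
}

# Block until both Prometheus pods report Ready, or fail once the timeout expires.
# The 300s default is an assumed value; pass another duration as the first argument.
wait_for_prometheus() {
  local timeout="${1:-300s}"
  kubectl wait --for=condition=Ready \
    pod/prometheus-rancher-monitoring-prometheus-0 \
    pod/prometheus-rancher-monitoring-prometheus-1 \
    -n cattle-monitoring-system --timeout="$timeout"
}
```

The output of `show_resource_exclusions` should list ServiceMonitor under kinds. If `wait_for_prometheus` is called immediately after the pods are deleted, it may report that a pod is not found until the pod has been recreated; in that case, run it again after a few seconds.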