Prometheus in CrashLoopBackOff state with out-of-memory (OOM) error

Description

In certain situations, Prometheus pods can fail to start due to an out-of-memory (OOM) error. To identify the affected Prometheus pods, check each pod's last termination state for a reason that indicates an OOM kill (OOMKilled).
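One way to run this check is sketched below. The helper name `list_oom_pods` is hypothetical, and the default namespace `cattle-monitoring-system` is an assumption taken from the commands later on this page; adjust both to your environment.

```shell
# Print "<pod name><TAB><last termination reason>" for every pod in the
# namespace, then keep only the pods whose last termination was an OOM kill.
# list_oom_pods is an illustrative helper, not part of kubectl.
list_oom_pods() {
  ns="${1:-cattle-monitoring-system}"   # assumed default namespace
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | grep OOMKilled
}

# Usage (requires access to the cluster):
#   list_oom_pods
#   list_oom_pods some-other-namespace
```

If the function prints nothing, no pod in that namespace currently reports an OOM kill as its last termination reason.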

Solution

To fix the issue, take the following steps:

Update the argocd-cm ConfigMap so that Argo CD ignores ServiceMonitor resources, by setting resource.exclusions:
1. Create a backup of the old configmap:
```
kubectl get configmap argocd-cm -n argocd -o yaml > argocd-cm-old.yaml
```
2. Edit argocd-cm:
```
kubectl edit configmap argocd-cm -n argocd
```
3. Add a new data entry for resource.exclusions:
```
data:
  resource.exclusions: |
    - apiGroups:
        - monitoring.coreos.com
      kinds:
        - ServiceMonitor
      clusters:
        - "*"
```
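After saving the edit, it can help to confirm that the exclusion is actually stored in the ConfigMap. The following is a minimal sketch; the helper name `show_resource_exclusions` is hypothetical, while the ConfigMap name and namespace are the ones used in the steps above.

```shell
# Print the current value of the resource.exclusions key in argocd-cm.
# The dot inside the key name is escaped so that jsonpath treats
# "resource.exclusions" as a single key rather than a nested path.
show_resource_exclusions() {
  kubectl get configmap argocd-cm -n argocd \
    -o jsonpath='{.data.resource\.exclusions}'
}

# Usage (requires access to the cluster):
#   show_resource_exclusions
```

The output should contain the ServiceMonitor entry added in step 3. An empty result suggests the edit was not saved.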

Delete the envoy-stats-monitor ServiceMonitor in the istio-system namespace:

kubectl -n istio-system delete servicemonitor envoy-stats-monitor

Force-delete the Prometheus pods:

kubectl delete pod prometheus-rancher-monitoring-prometheus-0 -n cattle-monitoring-system
kubectl delete pod prometheus-rancher-monitoring-prometheus-1 -n cattle-monitoring-system

Wait for the Prometheus pods under cattle-monitoring-system to restart successfully.
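Rather than polling by hand, the wait can be scripted with `kubectl wait`. This is a sketch under assumptions: the helper name `wait_for_prometheus` and the 300-second timeout are illustrative choices, and the pod names are the two used in the delete commands above. Note that `kubectl wait` returns an error if a pod has not been recreated yet, so rerun the helper if it fails right after the deletion.

```shell
# Block until both Prometheus pods report the Ready condition, or give up
# after the timeout. wait_for_prometheus is an illustrative helper name.
wait_for_prometheus() {
  for pod in prometheus-rancher-monitoring-prometheus-0 \
             prometheus-rancher-monitoring-prometheus-1; do
    kubectl wait --for=condition=Ready "pod/$pod" \
      -n cattle-monitoring-system --timeout=300s || return 1   # assumed timeout
  done
}

# Usage (requires access to the cluster):
#   wait_for_prometheus && echo "Prometheus pods are Ready"
```

To watch the restart interactively instead, `kubectl get pods -n cattle-monitoring-system -w` streams pod status changes as they happen.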

Important: Deleting the wal and chunks_head directories results in the loss of all monitoring data gathered during the days these files were accumulated. However, once Prometheus is up and running again, new alerts and metrics data will be accessible.
