Systematic Debugging of CrashLoopBackOff: A Field Guide From Someone Who's Been Paged Too Many Times
CrashLoopBackOff Is a Symptom, Not a Diagnosis
Here's the thing about CrashLoopBackOff — it tells you exactly one thing: your container started and then exited, and Kubernetes is restarting it with exponential backoff. That's it. The actual problem could be any of two dozen different root causes, and the approach you take to debug it matters.
I've watched engineers spend hours staring at kubectl get pods waiting for the status to change, or blindly deleting and recreating pods hoping the problem goes away. Let me tell you why a systematic approach saves you time every single time, and walk you through the decision tree I use when I get paged at 3 AM.
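Before the decision tree, it helps to know how long Kubernetes will make you wait between restarts. The kubelet's documented behavior is a 10-second initial backoff that doubles after each crash, capped at five minutes and reset after the container runs cleanly for ten minutes. The sketch below just models that arithmetic for illustration; it is not an API you can call:

```shell
# Model the kubelet's crash backoff (documented defaults: 10s initial
# delay, doubling per crash, capped at 300s). Argument: number of
# crashes already seen; output: seconds until the next restart attempt.
crashloop_delay() {
  local restarts=$1 delay=10 cap=300 i=0
  while [ "$i" -lt "$restarts" ]; do
    delay=$((delay * 2))
    [ "$delay" -gt "$cap" ] && delay=$cap
    i=$((i + 1))
  done
  echo "$delay"
}

crashloop_delay 0   # first backoff: 10 seconds
crashloop_delay 3   # 80 seconds
crashloop_delay 10  # capped at 300 seconds
```

This is why a pod that has been crashing for a while appears to "do nothing" for minutes at a time: it is sitting in that five-minute backoff, not hung.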
Step 0: Gather the Facts Before You Touch Anything
Before you start fixing, start observing. Run these commands first and read the output carefully:
# Get the pod status and restart count
kubectl get pod $POD_NAME -n $NAMESPACE -o wide
# Get detailed pod information including events and container states
kubectl describe pod $POD_NAME -n $NAMESPACE
# Get the exit code from the last termination
kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
The exit code is your most important clue. Write it down before doing anything else.
Exit Code 0 → Container exited successfully (shouldn't be restarting — check restartPolicy)
Exit Code 1 → Application error (generic failure, check logs)
Exit Code 2 → Shell/command misuse (bad entrypoint or command syntax)
Exit Code 126 → Permission denied on entrypoint
Exit Code 127 → Entrypoint or command not found
Exit Code 137 → SIGKILL (OOM kill or external termination)
Exit Code 139 → SIGSEGV (segmentation fault — native code crash)
Exit Code 143 → SIGTERM (container received a termination request and exited; common during evictions and rolling updates)
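The table above folds neatly into a small helper you can keep in a runbook script. A minimal sketch: the signal math (codes above 128 mean 128 + signal number) is the standard POSIX convention, and the labels are my own shorthand:

```shell
# Map a container exit code to a likely diagnosis. Codes above 128
# encode a fatal signal as 128 + signal number (POSIX convention).
explain_exit_code() {
  case $1 in
    0)   echo "clean exit (check restartPolicy)" ;;
    1)   echo "application error (check logs)" ;;
    2)   echo "shell/command misuse" ;;
    126) echo "entrypoint not executable" ;;
    127) echo "entrypoint not found" ;;
    137) echo "SIGKILL (signal $(( $1 - 128 ))): OOM kill or external termination" ;;
    139) echo "SIGSEGV (signal $(( $1 - 128 ))): segmentation fault" ;;
    143) echo "SIGTERM (signal $(( $1 - 128 ))): terminated externally" ;;
    *)   if [ "$1" -gt 128 ]; then
           echo "killed by signal $(( $1 - 128 ))"
         else
           echo "unknown exit code $1"
         fi ;;
  esac
}

explain_exit_code 137   # SIGKILL (signal 9): OOM kill or external termination
```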
Step 1: Check the Logs
This sounds obvious, but there's a nuance. When a pod is in CrashLoopBackOff, the current container often has no logs at all, because the pod is sitting in the backoff window waiting for its next restart. You need the previous container's logs:
# Current container logs (might be empty or very short)
kubectl logs $POD_NAME -n $NAMESPACE
# Previous container logs (this is usually what you want)
kubectl logs $POD_NAME -n $NAMESPACE --previous
# If the pod has multiple containers, specify which one
kubectl logs $POD_NAME -n $NAMESPACE -c $CONTAINER_NAME --previous
Here's the thing — if --previous returns nothing, the container is crashing before it can write any log output. This usually means the problem is at the OS/runtime level, not the application level. Skip ahead to Step 4.
Step 2: Application-Level Failures (Exit Code 1)
Exit code 1 is the most common and the least specific. Your application started, encountered an error, and exited. The logs from Step 1 should tell you what happened. Common causes:
Missing Configuration
The application expects an environment variable or config file that doesn't exist:
# Check what environment variables are actually set in the container
kubectl exec -it $POD_NAME -n $NAMESPACE -- env 2>/dev/null
# If the pod keeps crashing, use a debug container
kubectl debug -it $POD_NAME -n $NAMESPACE --image=busybox --target=$CONTAINER_NAME -- sh
Verify that ConfigMaps and Secrets referenced by the pod actually exist:
# List all ConfigMap and Secret references in the pod spec
kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{range .spec.containers[*].envFrom[*]}{.configMapRef.name}{.secretRef.name}{"\n"}{end}'
# Check if they exist
kubectl get configmap $CONFIGMAP_NAME -n $NAMESPACE
kubectl get secret $SECRET_NAME -n $NAMESPACE
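The two checks above can be stitched into one pre-deploy script. This is a sketch under a stated limitation: it only covers envFrom references; per-variable env.valueFrom references would need additional jsonpath expressions.

```shell
# Verify that every ConfigMap and Secret a pod references via envFrom
# actually exists in the namespace. Relies on kubectl being on PATH.
check_envfrom_refs() {
  local pod=$1 ns=$2 missing=0 name
  for name in $(kubectl get pod "$pod" -n "$ns" \
      -o jsonpath='{range .spec.containers[*].envFrom[*]}{.configMapRef.name}{" "}{end}'); do
    kubectl get configmap "$name" -n "$ns" >/dev/null 2>&1 \
      || { echo "missing ConfigMap: $name"; missing=1; }
  done
  for name in $(kubectl get pod "$pod" -n "$ns" \
      -o jsonpath='{range .spec.containers[*].envFrom[*]}{.secretRef.name}{" "}{end}'); do
    kubectl get secret "$name" -n "$ns" >/dev/null 2>&1 \
      || { echo "missing Secret: $name"; missing=1; }
  done
  return $missing
}
```

Run it as `check_envfrom_refs $POD_NAME $NAMESPACE`; a non-zero exit status means at least one reference is dangling, which makes it easy to wire into CI.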
Failed Database or Service Connections
The application tries to connect to a dependency at startup and fails. Look for connection timeout errors in the logs. Verify the dependency is reachable from the pod's network namespace:
# Test connectivity from within the pod's network
kubectl debug -it $POD_NAME -n $NAMESPACE --image=nicolaka/netshoot --target=$CONTAINER_NAME -- \
  nc -zv database-service.production.svc.cluster.local 5432
Missing or Incompatible Dependencies
A common one after image updates — the new version expects a library, schema, or file that isn't present:
# Shell into the container image to inspect it
kubectl run debug-shell --rm -it --image=$IMAGE_NAME -- /bin/sh
Step 3: OOM Kills (Exit Code 137)
Exit code 137 means the container received SIGKILL (137 = 128 + signal 9). In Kubernetes, this is almost always an OOM (Out of Memory) kill: the container used more memory than its limit allows.
# Confirm it was an OOM kill
kubectl describe pod $POD_NAME -n $NAMESPACE | grep -A3 "Last State"
# Look for: Reason: OOMKilled
# Check the memory limit
kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Check actual memory usage before the kill (if metrics-server is available)
kubectl top pod $POD_NAME -n $NAMESPACE
The fix depends on whether the memory usage is legitimate or a leak:
Legitimate high usage: Increase the memory limit. But don't guess — check historical usage in your monitoring system first. Set the limit to P99 usage plus 20-30% headroom.
Memory leak: The container's memory grows steadily over time until it hits the limit. Increasing the limit only delays the crash. You need to fix the leak in the application code. As a temporary measure, you can restart the container before it hits the OOM limit via the liveness probe, but only if your health endpoint actually fails under memory pressure; a plain HTTP liveness check knows nothing about memory usage.
# Temporary workaround for a memory leak: restart before OOM
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
Let me tell you why this is a workaround and not a fix: you're trading OOM kills for regular restarts, which is marginally better for your users but still causes downtime. Fix the leak.
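The "P99 plus headroom" rule from above is easy to turn into a number. A sketch with integer arithmetic in MiB; the 25% figure is one reasonable point inside the 20-30% range suggested above, and the rounding granularity is my own habit:

```shell
# Suggest a memory limit from observed P99 usage (input and output in
# MiB): add ~25% headroom, then round up to the nearest 64 MiB so the
# limit looks like a deliberate choice rather than a magic number.
suggest_memory_limit() {
  local p99_mib=$1
  local with_headroom=$(( p99_mib * 125 / 100 ))
  local rounded=$(( (with_headroom + 63) / 64 * 64 ))
  echo "${rounded}Mi"
}

suggest_memory_limit 700   # 875 MiB with headroom, rounds to 896Mi
```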
Step 4: Container Won't Start (Exit Codes 126, 127)
These exit codes mean the container runtime couldn't execute the entrypoint.
Exit code 127 — command not found:
# Check the entrypoint/command in the pod spec
kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].command}'
kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].args}'
# Common causes:
# 1. Typo in the command path
# 2. The binary exists but the base image changed (e.g., switched from debian to alpine, /bin/bash → /bin/sh)
# 3. Multi-stage Docker build didn't copy the binary
Exit code 126 — permission denied:
# The entrypoint exists but isn't executable
# Check file permissions in the image
kubectl run debug --rm -it --image=$IMAGE_NAME -- ls -la /app/entrypoint.sh
# Fix: chmod +x in the Dockerfile, or adjust securityContext
Also check if the securityContext is preventing execution:
# A restrictive securityContext can prevent binary execution
securityContext:
  readOnlyRootFilesystem: true  # App might need to write temp files
  runAsNonRoot: true            # Binary might be owned by root
  runAsUser: 1000               # User might not have permission to execute
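When the securityContext is the culprit, the fix usually pairs runAsUser with matching file ownership in the image and gives the app writable scratch space despite the read-only root. A sketch with assumed names (the tmp volume and /tmp path are illustrative):

```yaml
# In the container spec:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000               # must match file ownership set in the Dockerfile
  readOnlyRootFilesystem: true
volumeMounts:
- name: tmp
  mountPath: /tmp               # writable scratch space despite read-only root
# In the pod spec:
volumes:
- name: tmp
  emptyDir: {}
```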
Step 5: Image Pull Issues Masquerading as CrashLoop
Sometimes what looks like CrashLoopBackOff started as an image pull problem: the pod pulls a wrong or stale image, the container starts with unexpected contents, and crashes immediately.
# Verify the exact image being used (including digest)
kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].imageID}'
# Check for ImagePullBackOff in events
kubectl describe pod $POD_NAME -n $NAMESPACE | grep -A5 "Events"
A particularly nasty variant: the latest tag was updated and the new image has a breaking change. The pod restarts, pulls the new image, crashes, restarts, pulls the same broken image, crashes. This is why you should never use latest in production. Pin your image tags or use digests.
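Pinning looks like this in the container spec. The registry, repository, and digest below are placeholders, not real values:

```yaml
containers:
- name: api
  # Tag pin: better than latest, but a tag can still be force-pushed
  image: registry.example.com/team/api:1.42.3
  # Digest pin: immutable reference; replace <digest> with the real sha256
  # image: registry.example.com/team/api@sha256:<digest>
```

The digest for a running pod is exactly what the imageID jsonpath above prints, so you can copy a known-good digest from a healthy pod when you need to roll back.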
Step 6: Liveness Probe Killing Healthy Containers
Here's the thing that trips up even experienced operators: a misconfigured liveness probe can cause CrashLoopBackOff that has nothing to do with the application being unhealthy.
# Check the liveness probe configuration
kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].livenessProbe}' | jq .
Common probe misconfigurations:
# Problem: initialDelaySeconds is too short for the app to start
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5  # App takes 30 seconds to start
  periodSeconds: 10
  failureThreshold: 3

# Fix: use a startupProbe for slow-starting apps
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  # Allows up to 300 seconds (30 * 10) for startup

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
The startupProbe disables the liveness probe until the container passes the startup check. This is the correct solution for applications with variable startup times, not increasing initialDelaySeconds to some arbitrary high number.
Step 7: Volume Mount Failures
If a container depends on a volume that fails to mount, it can crash at startup with confusing errors:
# Check for volume-related events
kubectl describe pod $POD_NAME -n $NAMESPACE | grep -i -A2 "volume\|mount\|attach"
# Check if PVCs are bound
kubectl get pvc -n $NAMESPACE
# Common issues:
# - PVC is Pending (no PV available or storage class misconfigured)
# - Secret or ConfigMap referenced as volume doesn't exist
# - ReadOnlyRootFilesystem conflicts with app needing to write
The Debugging Decision Tree
When you get paged, follow this order:
1. kubectl describe pod → get exit code and events
2. kubectl logs --previous → get application error output
├── Got logs? → Read them. The answer is usually there.
└── No logs? → Problem is pre-application (image, permissions, volumes)
3. Exit code 137? → OOM kill. Check memory limits and usage.
4. Exit code 1? → App error. Check config, dependencies, connectivity.
5. Exit code 127/126? → Binary not found or not executable. Check image and securityContext.
6. No obvious exit code? → Check liveness probe. Check volume mounts. Check init containers.
7. Still stuck? → kubectl debug with a debug container and investigate from inside.
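The first branches of that tree can be scripted from the two facts you gather first: the exit code, and whether `kubectl logs --previous` returned anything. A sketch for a runbook script; it only covers the mechanical branches, not the "still stuck" cases:

```shell
# First-pass triage from the exit code and whether previous logs exist.
# has_logs: "yes" or "no".
triage() {
  local exit_code=$1 has_logs=$2
  if [ "$has_logs" = "no" ] && [ "$exit_code" != "137" ]; then
    echo "pre-application failure: check image, entrypoint, permissions, volumes"
    return
  fi
  case $exit_code in
    137)     echo "OOM kill: check memory limits and usage" ;;
    1)       echo "app error: read the logs, check config and dependencies" ;;
    126|127) echo "entrypoint problem: check image contents and securityContext" ;;
    *)       echo "check liveness probe, volume mounts, init containers" ;;
  esac
}

triage 1 yes
triage 127 no
```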
Prevention: Stop CrashLoops Before They Happen
The best debugging session is the one that never happens. Here's what I enforce on every cluster I manage:
# 1. Always use startupProbes for apps that take time to initialize
# 2. Always set resource limits (especially memory)
# 3. Never use :latest tags in production
# 4. Always have readiness probes (separate from liveness)
# 5. Run pre-deploy checks that verify ConfigMaps and Secrets exist
Set up alerts on container restart counts:
# Alert when a container has restarted more than 3 times in 15 minutes
increase(kube_pod_container_status_restarts_total[15m]) > 3
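In Prometheus alerting-rule form, that expression might look like the following. The group name, severity label, and five-minute hold are assumptions to adapt; the metric itself comes from kube-state-metrics:

```yaml
groups:
- name: pod-restarts  # example group name
  rules:
  - alert: ContainerRestartingFrequently
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m           # require the condition to persist before firing
    labels:
      severity: warning
    annotations:
      summary: >-
        {{ $labels.namespace }}/{{ $labels.pod }} has restarted more
        than 3 times in the last 15 minutes
```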
This catches CrashLoopBackOff early, often before it pages you at 3 AM.
Final Thoughts
CrashLoopBackOff feels scary because it's vague. But once you have a systematic approach — exit code, logs, then targeted investigation — it becomes a mechanical process. The exit code tells you the category, the logs tell you the specifics, and the fix follows from the diagnosis.
Let me tell you why I wrote this as a decision tree rather than a list of tips: at 3 AM, you don't want to think creatively. You want a checklist. Follow the steps, gather the data, and the root cause will present itself. Every single time.
Senior Kubernetes Architect
10+ years orchestrating containers in production. Battle-tested opinions on everything from pod scheduling to service mesh. I've seen clusters burn and helped rebuild them better.