
Kubernetes Debugging Patterns for Production

CrashLoopBackOff, OOMKilled, stuck deployments, and networking mysteries — a field guide to debugging K8s when things go wrong.

March 25, 2026 · 6 min read
kubernetes · debugging · devops · containers

Kubernetes debugging is its own skill. The error messages are often vague, the failure modes are distributed, and the logs you need are scattered across multiple layers. These are the patterns that come up most often in production.

CrashLoopBackOff

The most common Kubernetes problem. Your pod starts, crashes, restarts, crashes again, and the backoff delay grows exponentially.

diagnose-crashloop.sh
# Step 1: Check what the container is actually doing
kubectl logs pod/myapp-7b4f6d8c5-x2k9m --previous
 
# Step 2: If logs are empty, the process crashed before logging
kubectl describe pod myapp-7b4f6d8c5-x2k9m | grep -A5 "Last State"
 
# Step 3: Check the exit code
kubectl get pod myapp-7b4f6d8c5-x2k9m -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

Common exit codes and what they mean:

Exit Code | Meaning             | Common Cause
--------- | ------------------- | -----------------------------------
1         | Application error   | Unhandled exception, missing config
137       | SIGKILL (OOMKilled) | Memory limit exceeded
139       | SIGSEGV             | Segmentation fault
143       | SIGTERM             | Graceful shutdown (normal)
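Exit codes above 128 encode a fatal signal (code = 128 + signal number), so they can be decoded mechanically. A sketch of a small helper; the diagnosis strings are my own suggestions, not kubectl output:

```shell
# Map a container exit code to a likely failure category.
# Codes above 128 mean the process died from signal (code - 128).
diagnose() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error: check logs with --previous" ;;
    137) echo "SIGKILL: check for OOMKilled in kubectl describe" ;;
    139) echo "SIGSEGV: segmentation fault" ;;
    143) echo "SIGTERM: usually a normal shutdown" ;;
    *)   if [ "$1" -gt 128 ]; then
           echo "died from signal $(($1 - 128))"
         else
           echo "application-defined exit code $1"
         fi ;;
  esac
}

diagnose 137   # SIGKILL: check for OOMKilled in kubectl describe
diagnose 134   # died from signal 6
```

134 - 128 = 6 is SIGABRT, which is why an `abort()` in native code shows up as exit code 134.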

OOMKilled Specifically

When the exit code is 137, the container was killed with SIGKILL, which in practice almost always means the kernel's OOM killer enforced the memory limit:

# Confirm OOMKilled
kubectl describe pod myapp-7b4f6d8c5-x2k9m | grep -i oom
 
# Check current memory usage vs limits
kubectl top pod myapp-7b4f6d8c5-x2k9m
kubectl get pod myapp-7b4f6d8c5-x2k9m -o jsonpath='{.spec.containers[0].resources}'
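The jsonpath query above dumps the whole resources object; the same limit can be pulled out with jq. A sketch against sample pod JSON standing in for `kubectl get pod ... -o json` (the values here are made up):

```shell
# Sample output standing in for: kubectl get pod myapp -o json
pod_json='{"spec":{"containers":[{"name":"myapp","resources":{"requests":{"memory":"128Mi"},"limits":{"memory":"256Mi"}}}]}}'

# Extract the memory limit the OOM killer enforced
echo "$pod_json" | jq -r '.spec.containers[0].resources.limits.memory'
# prints: 256Mi
```

Comparing that limit against `kubectl top pod` output over time tells you whether the container is slowly leaking toward the limit or spiking past it.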

Stuck Deployments

A deployment that never completes — pods are in Pending, ContainerCreating, or Init state indefinitely.

Pending Pods

# Why is the pod pending?
kubectl describe pod myapp-pending-pod | grep -A10 "Events"
 
# Common causes:
# - Insufficient CPU/memory on nodes
# - No nodes matching nodeSelector/affinity rules
# - PersistentVolumeClaim not bound

Check cluster capacity:

# Node resource availability
kubectl describe nodes | grep -A5 "Allocated resources"
 
# Nodes that are NotReady or cordoned
# (note: grep -v Ready would also hide NotReady lines)
kubectl get nodes | grep -E 'NotReady|SchedulingDisabled'

ContainerCreating Stuck

Usually an image pull problem or volume mount issue:

# Check events for the specific pod
kubectl describe pod myapp-stuck | grep -A20 "Events"
 
# Common causes:
# - ImagePullBackOff: wrong image name, private registry auth missing
# - Volume mount failure: PVC not bound, NFS server unreachable
fix-image-pull-secret.yml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: myapp
      image: registry.example.com/myapp:latest
  # If using a private registry, you need this
  imagePullSecrets:
    - name: registry-credentials
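The `registry-credentials` secret referenced above must exist in the same namespace as the pod. Assuming username/token authentication against the registry, it can be created like this (all values are placeholders):

```shell
# Create the pull secret the Pod spec references (placeholder credentials)
kubectl create secret docker-registry registry-credentials \
  --docker-server=registry.example.com \
  --docker-username=deploy-bot \
  --docker-password='REPLACE_WITH_TOKEN'
```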

Networking Issues

Kubernetes networking problems are the hardest to debug because the symptoms are indirect — timeouts, connection refused, intermittent failures.

Service Not Reachable

# Step 1: Verify the service exists and has endpoints
kubectl get svc myapp-service
kubectl get endpoints myapp-service
 
# If endpoints are empty, the selector doesn't match any pods
kubectl get pods -l app=myapp --show-labels
 
# Step 2: Test from inside the cluster
kubectl run debug --rm -it --image=busybox -- sh
# Inside the pod:
wget -qO- http://myapp-service:8080/health
nslookup myapp-service

DNS Resolution Failures

# Check if CoreDNS is healthy
kubectl get pods -n kube-system -l k8s-app=kube-dns
 
# Test DNS resolution from a pod
kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default
 
# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
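When a bare service name fails to resolve, the fully qualified name sometimes still works, which narrows the problem to the pod's DNS search path. Service FQDNs follow a fixed pattern; a quick sketch assuming the default cluster.local domain:

```shell
# Service DNS names follow <service>.<namespace>.svc.<cluster-domain>
svc=myapp-service
ns=production
domain=cluster.local   # the default; clusters can override this

fqdn="$svc.$ns.svc.$domain"
echo "$fqdn"   # myapp-service.production.svc.cluster.local
```

If `nslookup myapp-service.production.svc.cluster.local` succeeds while `nslookup myapp-service` fails, check the `search` entries in the pod's /etc/resolv.conf rather than CoreDNS itself.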

Network Policies Blocking Traffic

# List all network policies in the namespace
kubectl get networkpolicies -n production
 
# Describe a specific policy to see what it allows/denies
kubectl describe networkpolicy default-deny -n production
 
# Quick test: temporarily delete the policy and see if traffic flows
# (only in non-production environments)

Resource Debugging Tools

Ephemeral Debug Containers

Kubernetes 1.25+ supports ephemeral containers — attach a debug container to a running pod without restarting it:

# Attach a debug container with networking tools
kubectl debug pod/myapp-7b4f6d8c5-x2k9m -it \
  --image=nicolaka/netshoot \
  --target=myapp
 
# Inside the debug container, you have full networking tools:
# tcpdump, dig, curl, netstat, ss, iperf, etc.

Resource Inspection One-Liners

# Pods sorted by CPU usage
kubectl top pods --sort-by=cpu -A
 
# Pods sorted by memory
kubectl top pods --sort-by=memory -A
 
# Events cluster-wide, sorted by time (tail shows the 30 most recent lines)
kubectl get events --sort-by='.lastTimestamp' -A | tail -30
 
# All pods not in Running state
kubectl get pods -A --field-selector status.phase!=Running
 
# Pods with restarts > 0
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(any(.status.containerStatuses[]?; .restartCount > 0)) |
  [.metadata.namespace, .metadata.name,
   (.status.containerStatuses[0].restartCount | tostring)] |
  join("\t")'

Debugging Checklist

When something goes wrong in Kubernetes, a reliable sequence to follow is:

  1. kubectl get pods -A — what's the cluster-wide state?
  2. kubectl describe pod <name> — what do the events say?
  3. kubectl logs <pod> --previous — what happened before the crash?
  4. kubectl top pods — is it a resource problem?
  5. kubectl get events --sort-by='.lastTimestamp' — what happened recently?
  6. Debug from inside the cluster — kubectl run a debug pod

Most problems are answered by steps 1-3. Steps 4-6 are for the harder cases.

Key Takeaways

  1. Always check events first — kubectl describe tells you more than kubectl get
  2. Read the exit code — it tells you the category of failure immediately
  3. OOMKilled means profile, not resize — increasing limits masks the real problem
  4. Debug networking from inside — external tools give misleading results
  5. Keep a debug pod image handy — nicolaka/netshoot has every tool you need