Troubleshooting Debugging and Incident Response

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/kubernetes/11 Troubleshooting Debugging and Incident Response.md.

Purpose: provide a practical Kubernetes troubleshooting and incident response playbook for workload, networking, storage, node, policy, and rollout failures.

This note expands on Kubernetes, 03 Deployments ReplicaSets StatefulSets DaemonSets Jobs and CronJobs, 04 Services DNS Ingress Gateway API and Traffic Routing, 07 Storage Volumes PVCs StorageClasses CSI and Stateful Data, 09 Security RBAC Pod Security Admission and Supply Chain, and 10 Observability Logging Metrics Tracing Events and Probes. Kubernetes troubleshooting works best when you follow object ownership from the symptom to the controller that makes the decision: Deployment to ReplicaSet to Pod, Service to Endpoints to Pod labels, PVC to PV to CSI events, Ingress to Service to endpoints.

First five minutes

kubectl config current-context
kubectl get ns
kubectl get deploy,rs,pod,svc,endpoints,ingress -n payments -o wide
kubectl get events -n payments --sort-by=.lastTimestamp
kubectl rollout history deploy/payments-api -n payments
kubectl rollout status deploy/payments-api -n payments
kubectl logs deploy/payments-api -n payments --since=15m --tail=300

Immediate rules:

| Rule | Why |
| --- | --- |
| Confirm context and namespace before action | Prevents wrong cluster changes |
| Capture evidence before deleting Pods | Deletion can erase useful state |
| Prefer rollout undo over manual Pod edits | Controllers recreate Pods from templates |
| Check events before deep debugging | Events often name the exact blocker |
| Distinguish no capacity from app crash | Mitigations are different |
| Communicate impact and next update time | Keeps incident coordination clear |
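
The first rule can be enforced mechanically rather than by memory. A minimal sketch, assuming a hypothetical helper named `require_context` and a context name you supply yourself; it only wraps `kubectl config current-context` from the list above:

```shell
# Hypothetical guard: refuse to continue unless kubectl points at the expected context.
require_context() {
  local expected="$1"
  local current
  current="$(kubectl config current-context)" || return 2
  if [ "$current" != "$expected" ]; then
    echo "refusing to continue: context is '$current', expected '$expected'" >&2
    return 1
  fi
  echo "context ok: $current"
}

# Usage (the context name is an example, not a real cluster):
# require_context prod-eu-1 && kubectl rollout undo deploy/payments-api -n payments
```

Chaining the mutating command with `&&` means it never runs when the guard fails.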

Debugging order

  1. Identify the user visible symptom: error rate, latency, failed deploy, missing endpoint, or unavailable node.
  2. Scope blast radius: cluster, namespace, service, version, node pool, or tenant.
  3. Check recent changes: rollout, config, secret, ingress, policy, node maintenance, dependency.
  4. Inspect desired state: Deployment, StatefulSet, DaemonSet, Job, Service, Ingress, PVC.
  5. Inspect actual state: Pods, ReplicaSets, endpoints, nodes, events.
  6. Read logs: current container, previous container, sidecars, init containers.
  7. Check resource saturation: CPU, memory, ephemeral storage, disk pressure, API rate limits.
  8. Test network path: Pod DNS, Service endpoints, Ingress routing, TLS, NetworkPolicy.
  9. Test policy path: RBAC, Pod Security Admission, validating policies, quotas.
  10. Mitigate with the lowest risk reversible action, then preserve evidence.

Diagnostic capture

Create an incident folder in your normal incident system, then capture command output. Do not store secrets in notes.

kubectl get deploy payments-api -n payments -o yaml
kubectl get rs -n payments -l app=payments-api -o wide
kubectl get pods -n payments -l app=payments-api -o wide
kubectl describe pod -n payments -l app=payments-api
kubectl get events -n payments --sort-by=.lastTimestamp
kubectl logs deploy/payments-api -n payments --all-containers --since=30m
kubectl top pods -n payments
kubectl top nodes
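
One way to turn the commands above into files is a small wrapper that records what was run and when. This is a sketch under assumptions: `capture`, `INCIDENT_DIR`, and the file naming are hypothetical and not part of kubectl. Avoid pointing it at Secrets.

```shell
# Hypothetical helper: run a command and save stdout and stderr under an incident directory.
INCIDENT_DIR="${INCIDENT_DIR:-./incident-$(date -u +%Y%m%dT%H%M%SZ)}"

capture() {
  local name="$1"; shift
  mkdir -p "$INCIDENT_DIR"
  # Write the UTC time and the exact command first, so each file is self-describing.
  {
    echo "# $(date -u +%Y-%m-%dT%H:%M:%SZ) $*"
    "$@" 2>&1
  } > "$INCIDENT_DIR/$name.txt"
}

# Example calls, reusing the commands above:
# capture deploy kubectl get deploy payments-api -n payments -o yaml
# capture events kubectl get events -n payments --sort-by=.lastTimestamp
```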

For a single Pod:

POD=payments-api-7d75b9c8b6-h9x4q
kubectl get pod "$POD" -n payments -o yaml
kubectl describe pod "$POD" -n payments
kubectl logs "$POD" -n payments --all-containers --tail=500
kubectl logs "$POD" -n payments --all-containers --previous --tail=500

Workload states

| State or reason | Meaning | First checks |
| --- | --- | --- |
| CrashLoopBackOff | Container starts then exits repeatedly | Previous logs, exit code, env, config, probes |
| ImagePullBackOff | Image pull failed and kubelet is backing off | Image name, tag or digest, registry auth, node egress |
| ErrImagePull | Immediate image pull failure | Event message, registry, secret |
| Pending | Pod not running yet | Scheduling, PVC, image pull, quota |
| Unschedulable | Scheduler cannot place Pod | Requests, taints, affinity, topology spread, node selector |
| OOMKilled | Kernel killed process for memory | Limits, memory profile, recent traffic |
| Evicted | Kubelet removed Pod under pressure | Node memory, disk, inode, ephemeral storage |
| RunContainerError | Runtime could not start container | Command, args, mounts, permissions |
| CreateContainerConfigError | Config reference invalid | Secret, ConfigMap, env, volume name |

CrashLoopBackOff

Triage:

kubectl describe pod "$POD" -n payments
kubectl logs "$POD" -n payments --previous --tail=300
kubectl get pod "$POD" -n payments -o jsonpath='{.status.containerStatuses[*].lastState}'
kubectl get deploy payments-api -n payments -o yaml

Decision table:

| Evidence | Likely cause | Action |
| --- | --- | --- |
| Exit code 1 with app error | Bad config or app bug | Compare ConfigMap, Secret, rollout diff |
| Exit code 137 | OOMKilled | Raise memory limit or rollback memory regression |
| Probe failure before exit | Probe too strict or app not ready | Inspect probe path and startup timing |
| Missing file or permission denied | Image or volume path issue | Check read only filesystem, UID, mounts |
| Cannot connect to dependency | Dependency outage or NetworkPolicy | Test DNS, Service endpoints, egress |
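
Exit codes in this table follow the usual shell convention that values above 128 mean "killed by signal (code minus 128)", so 137 is SIGKILL and is often, but not always, an OOM kill. A rough sketch of that mapping; `classify_exit_code` is a hypothetical helper and its hints are first hypotheses, not verdicts:

```shell
# Map a container exit code to a first hypothesis.
classify_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited cleanly: check why the process stops (command, args)" ;;
    1)   echo "application error: read previous logs, compare config and rollout diff" ;;
    126|127) echo "command not executable or not found: check image, command, args" ;;
    137) echo "SIGKILL (128+9): often OOMKilled, confirm with lastState.terminated.reason" ;;
    143) echo "SIGTERM (128+15): asked to stop (liveness restart, eviction, or Pod deletion)" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-specific exit code $code: read previous logs"
      fi
      ;;
  esac
}

# Usage against a live Pod (first container only):
# classify_exit_code "$(kubectl get pod "$POD" -n payments \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```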

Mitigations (alternatives, not a sequence; pick the one that matches the evidence):

kubectl rollout undo deploy/payments-api -n payments
kubectl scale deploy/payments-api -n payments --replicas=0
kubectl set env deploy/payments-api FEATURE_X_ENABLED=false -n payments

Use scale to zero only when stopping bad traffic is better than partial availability.

ImagePullBackOff and ErrImagePull

kubectl describe pod "$POD" -n payments
kubectl describe secret regcred -n payments
kubectl get pod "$POD" -n payments -o jsonpath='{.spec.imagePullSecrets}'
kubectl get events -n payments --sort-by=.lastTimestamp | tail -40

Common causes:

| Event text | Cause | Fix |
| --- | --- | --- |
| not found | Wrong image name, tag, or digest | Correct manifest or publish image |
| unauthorized | Bad registry credentials | Fix imagePullSecret or workload identity |
| manifest unknown, or no matching manifest | Tag or digest missing in the registry, or no image for the node architecture | Publish the tag, publish multi arch image, or use correct node pool |
| i/o timeout | Node cannot reach registry | Check DNS, proxy, firewall, registry status |
| ImagePullBackOff after transient error | Kubelet backoff | Fix cause, then wait or recreate Pod |

Production guidance: use image digests, keep registry credentials scoped per namespace, and alert on pull errors after deploys.

Pending and Unschedulable

kubectl describe pod "$POD" -n payments
kubectl get nodes -o wide
kubectl describe node node-1
kubectl get quota,limitrange -n payments
kubectl get pvc -n payments

Scheduling blockers:

| Blocker | Evidence | Fix |
| --- | --- | --- |
| Insufficient CPU or memory | 0/5 nodes are available: Insufficient cpu | Reduce requests, scale node pool, move workload |
| Taint not tolerated | Event names taint | Add toleration only if workload belongs there |
| Node selector mismatch | No nodes match labels | Fix selector or label nodes |
| Affinity too strict | Scheduler event mentions affinity | Relax rules |
| PVC unbound | Pod waits for PVC | Debug storage class and PV provisioning |
| Quota exceeded | exceeded quota event | Reduce replicas or increase quota |

Requests matter for scheduling. Limits matter for runtime enforcement. A Pod can be Pending even when actual node CPU usage looks low if requested CPU cannot fit.
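
The arithmetic behind that last sentence can be sketched as follows. The helper names and the numbers are illustrative assumptions; real values come from the "Allocatable" and "Allocated resources" sections of `kubectl describe node`.

```shell
# Convert a Kubernetes CPU quantity ("250m", "2", "0.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# A Pod fits only if allocatable minus already requested CPU covers its own request.
# Actual CPU usage is deliberately not an input here.
cpu_fits() {
  local allocatable requested pod
  allocatable="$(cpu_to_millicores "$1")"
  requested="$(cpu_to_millicores "$2")"
  pod="$(cpu_to_millicores "$3")"
  [ $((allocatable - requested)) -ge "$pod" ]
}

# Illustrative node: 4 CPU allocatable, 3800m already requested, Pod asks for 500m.
if cpu_fits 4 3800m 500m; then echo "fits"; else echo "Insufficient cpu"; fi
```

With these numbers only 200m remains unrequested, so the Pod stays Pending no matter how idle the node looks in `kubectl top nodes`.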

OOMKilled, CPU throttling, and resource pressure

kubectl describe pod "$POD" -n payments | rg -i 'oom|killed|reason|limits|requests'
kubectl top pod "$POD" -n payments --containers
kubectl top nodes

PromQL examples:

sum by (pod) (increase(kube_pod_container_status_restarts_total{namespace="payments"}[15m]))
sum by (pod, container) (
  rate(container_cpu_cfs_throttled_periods_total{namespace="payments"}[5m])
)
/
sum by (pod, container) (
  rate(container_cpu_cfs_periods_total{namespace="payments"}[5m])
)

Resource tradeoffs:

| Action | Benefit | Risk |
| --- | --- | --- |
| Increase memory limit | Stops OOM if app legitimately needs memory | Can increase node pressure |
| Remove memory limit | Avoids container OOM from tight limit | Pod can pressure node and be evicted |
| Increase CPU limit | Reduces throttling | May increase noisy neighbor impact |
| Increase CPU request | Better scheduling guarantee | May require more nodes |
| Add HPA | Handles traffic variation | Needs correct metrics and scaling limits |

Evictions

Eviction means the kubelet removed Pods because the node was under pressure.

kubectl describe pod "$POD" -n payments
kubectl describe node node-1 | rg -i 'pressure|evict|allocatable|capacity'
kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp

Eviction causes:

| Cause | Evidence | Fix |
| --- | --- | --- |
| Memory pressure | Node condition MemoryPressure | Reduce memory use, move workloads, add capacity |
| Disk pressure | DiskPressure, imagefs or nodefs messages | Clean images, reduce logs, add disk |
| Ephemeral storage | Pod exceeds ephemeral storage | Set requests and limits, move temp data to volume |
| Inode pressure | Many small files | Fix workload file churn |

Probe failures

kubectl describe pod "$POD" -n payments | rg -i 'liveness|readiness|startup|unhealthy'
kubectl logs "$POD" -n payments --previous
kubectl port-forward pod/"$POD" -n payments 18080:8080
curl -i http://127.0.0.1:18080/health/ready

Diagnosis:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Liveness restarts during dependency outage | Liveness checks dependency | Move dependency checks to readiness |
| Readiness never succeeds | App not bound, wrong path, missing dependency | Test endpoint inside Pod and inspect config |
| Startup fails for slow app | Threshold too low | Add or relax startup probe |
| Probe timeout under load | Handler shares saturated worker pool | Use cheap local health path |

DNS failures

kubectl run dns-debug -n payments --rm -it --image=registry.k8s.io/e2e-test-images/agnhost:2.45 --restart=Never --command -- nslookup kubernetes.default.svc.cluster.local
kubectl run dns-debug -n payments --rm -it --image=registry.k8s.io/e2e-test-images/agnhost:2.45 --restart=Never --command -- nslookup payments-api.payments.svc.cluster.local
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200

DNS decision table:

| Failure | Check |
| --- | --- |
| All cluster DNS fails | CoreDNS Pods, Service IP, kube-proxy or CNI |
| One Service name fails | Service exists, namespace, headless service, endpoints |
| External DNS fails | Node DNS, egress, NetworkPolicy, upstream resolver |
| Sporadic timeouts | CoreDNS saturation, node conntrack, UDP drops |

Service has no endpoints

kubectl get svc payments-api -n payments -o yaml
kubectl get endpoints payments-api -n payments -o yaml
kubectl get endpointslice -n payments -l kubernetes.io/service-name=payments-api -o yaml
kubectl get pods -n payments --show-labels
kubectl describe pod -n payments -l app=payments-api

Common causes:

| Cause | Evidence | Fix |
| --- | --- | --- |
| Selector mismatch | Service selector does not match Pod labels | Fix labels or selector |
| Pods not ready | Endpoints absent or ready: false | Debug readiness |
| Wrong port name | Service targetPort references missing name | Align container port name |
| Pods in another namespace | Service selects only same namespace | Create Service in correct namespace |
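
A selector mismatch is easiest to see by asking the API for exactly the Pods the Service selects. A minimal sketch, assuming a selector-based Service; `pods_for_service` is a hypothetical helper built from `kubectl get` with a go-template and a label selector:

```shell
# Print the Pods matched by a Service's selector.
pods_for_service() {
  local svc="$1" ns="$2" selector
  # The go-template renders the selector map as "k1=v1,k2=v2,".
  selector="$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')" || return 1
  selector="${selector%,}"   # drop the trailing comma
  if [ -z "$selector" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  echo "selector: $selector"
  kubectl get pods -n "$ns" -l "$selector" -o wide
}

# pods_for_service payments-api payments
```

If this lists no Pods, the labels or selector are wrong. If it lists Pods that are not Ready, the problem is readiness rather than selection.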

Ingress 404, 502, and TLS failures

kubectl get ingress -n payments -o wide
kubectl describe ingress payments-api -n payments
kubectl get svc,endpoints -n payments
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=300
curl -vk https://payments.example.com/health/ready

Ingress triage:

| Symptom | Likely cause | Check |
| --- | --- | --- |
| 404 from ingress controller | Host or path rule does not match | Ingress host, pathType, class |
| 502 or 503 | Backend has no ready endpoints or wrong port | Service endpoints, targetPort, Pod readiness |
| TLS certificate mismatch | Wrong Secret or host | Ingress TLS section and cert SAN |
| Redirect loop | App and ingress both force scheme | Headers and app trust proxy config |
| Works by port forward but not ingress | Ingress, Service, NetworkPolicy, or backend protocol | Controller logs and Service port |

PVC Pending and storage failures

kubectl get pvc -n payments
kubectl describe pvc data-payments-db-0 -n payments
kubectl get storageclass
kubectl get pv
kubectl get events -n payments --sort-by=.lastTimestamp | rg -i 'volume|mount|attach|provision'

Storage causes:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| PVC Pending | No matching StorageClass or provisioner failure | Check StorageClass, CSI controller, quota |
| Pod stuck mounting | Volume attach or node mount issue | Check CSI node logs and node status |
| Read only mount unexpectedly | Filesystem error or access mode | Inspect events and storage backend |
| StatefulSet replacement stuck | Volume bound to old zone or node | Check topology and PV node affinity |

Node NotReady

kubectl get nodes -o wide
kubectl describe node node-1
kubectl get pods -A --field-selector spec.nodeName=node-1 -o wide
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp

Node NotReady causes:

| Evidence | Likely issue |
| --- | --- |
| Kubelet stopped reporting | Kubelet crash, node down, network partition |
| DiskPressure | Disk full, log growth, image garbage collection failing |
| MemoryPressure | Workloads exceed capacity |
| PIDPressure | Process leak |
| NetworkUnavailable | CNI issue |

Cordon and drain:

kubectl cordon node-1
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
kubectl uncordon node-1

Drain tradeoffs:

| Option | Benefit | Risk |
| --- | --- | --- |
| cordon only | Stops new Pods | Existing impact may continue |
| drain | Moves eligible Pods away | Can disrupt if PDBs or replicas are weak |
| --delete-emptydir-data | Allows drain of Pods using emptyDir | Deletes local ephemeral data |
| Force delete | Unblocks stuck drain | Can violate availability assumptions |

Check PodDisruptionBudgets before draining production nodes:

kubectl get pdb -A
kubectl describe pdb payments-api -n payments
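
Before a drain, the PDBs that matter most are the ones that currently allow zero disruptions, because evictions of their Pods will be refused. A small sketch that filters for them; `blocking_pdbs` is a hypothetical helper and the column names are arbitrary labels:

```shell
# List PodDisruptionBudgets whose status currently allows zero voluntary disruptions.
blocking_pdbs() {
  kubectl get pdb -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
  | awk '$3 == 0 { print $1 "/" $2 }'
}

# blocking_pdbs
```

An empty result suggests no PDB blocks evictions right now; it says nothing about whether the remaining replicas can carry the load.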

Stuck rollouts

kubectl rollout status deploy/payments-api -n payments
kubectl rollout history deploy/payments-api -n payments
kubectl describe deploy payments-api -n payments
kubectl get rs -n payments -l app=payments-api -o wide
kubectl get pods -n payments -l app=payments-api -o wide

Causes:

| Evidence | Cause | Fix |
| --- | --- | --- |
| New Pods CrashLoop | App or config regression | Rollback |
| New Pods Pending | Capacity or scheduling | Add capacity or reduce requests |
| New Pods unready | Readiness or dependency | Fix readiness blocker |
| Old ReplicaSet not scaling down | New Pods not becoming Ready, or maxUnavailable and maxSurge too restrictive | Review rollout strategy |
| ProgressDeadlineExceeded | Deployment failed to make progress | Inspect ReplicaSet and events |

Rollback:

kubectl rollout undo deploy/payments-api -n payments
kubectl rollout status deploy/payments-api -n payments

RBAC Forbidden and admission denied

RBAC Forbidden:

kubectl auth can-i create pods/exec -n payments
kubectl auth can-i create pods/exec -n payments --as=system:serviceaccount:payments:diagnostics
kubectl get role,rolebinding -n payments
kubectl get clusterrolebinding -o wide | rg diagnostics

Admission denied:

kubectl apply -f deployment.yaml --dry-run=server
kubectl get events -n payments --field-selector reason=FailedCreate
kubectl describe namespace payments
kubectl get validatingadmissionpolicy,validatingadmissionpolicybinding
kubectl get validatingwebhookconfiguration

Interpretation:

| Message | Likely layer |
| --- | --- |
| forbidden: User cannot get resource | RBAC |
| violates PodSecurity | Pod Security Admission |
| denied by policy | ValidatingAdmissionPolicy, Gatekeeper, Kyverno, or webhook |
| exceeded quota | ResourceQuota |
| is forbidden: unable to validate against any pod security policy | Old cluster with PodSecurityPolicy |

NetworkPolicy blocks

kubectl get networkpolicy -n payments
kubectl describe networkpolicy -n payments
kubectl get pods -n payments --show-labels
kubectl run net-debug -n payments --rm -it --image=curlimages/curl:8.8.0 --restart=Never -- sh

Inside debug shell:

nslookup payments-api.payments.svc.cluster.local
curl -sv http://payments-api.payments.svc.cluster.local/health/ready
curl -skv https://10.96.0.1:443/version

NetworkPolicy checklist:

  • Does any policy select the source Pod?
  • Does any policy select the destination Pod?
  • Does the path need both an egress rule on the source and an ingress rule on the destination?
  • Do labels match exactly?
  • Is DNS egress allowed?
  • Is the namespace selector correct?
  • Does the CNI enforce NetworkPolicy?
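
The DNS item is the one most often missed once a default-deny egress policy exists. Below is a sketch of a policy that allows DNS egress, wrapped in a hypothetical `dns_egress_policy` function so it can be piped into a dry run. It assumes cluster DNS runs in kube-system with the `k8s-app=kube-dns` label used earlier in this note, and that namespaces carry the automatic `kubernetes.io/metadata.name` label. Clusters using a node-local DNS cache may need a different rule.

```shell
# Emit a NetworkPolicy that allows every Pod in a namespace to reach cluster DNS.
dns_egress_policy() {
  local ns="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Validate against the API server without persisting anything:
# dns_egress_policy payments | kubectl apply --dry-run=server -f -
```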

Exec, port-forward, and ephemeral containers

Use direct debugging tools carefully. They can expose secrets, mutate state, or bypass normal deployment paths.

kubectl exec -it "$POD" -n payments -- /bin/sh
kubectl port-forward pod/"$POD" -n payments 18080:8080
kubectl debug -it "$POD" -n payments --image=busybox:1.36 --target=api
kubectl debug node/node-1 -it --image=busybox:1.36

Tradeoffs:

| Tool | Best for | Risk |
| --- | --- | --- |
| exec | Inspect process environment and files | Can mutate running container |
| port-forward | Test local path to Pod or Service | Bypasses ingress and client network path |
| Ephemeral container | Debug distroless images | Requires powerful RBAC |
| Node debug | Node level inspection | Very high privilege |

Prefer read only commands first:

id
pwd
ls -la /app
printenv | sort
cat /etc/resolv.conf
ss -lntp

Avoid printing full environment variables in shared incident channels because they may contain secrets.
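
If environment output does have to be shared, masking values by variable name reduces the risk. A rough sketch run from the workstation; `redact_env` is a hypothetical filter and its name patterns are a heuristic assumption. It will miss secrets stored under innocuous names or embedded in URLs, and it does not handle multi-line values, so review the output before pasting it anywhere.

```shell
# Mask the value of any NAME=value line whose name looks sensitive.
redact_env() {
  awk -F= '{
    name = $1
    if (toupper(name) ~ /SECRET|TOKEN|PASSWORD|PASSWD|KEY|CREDENTIAL/)
      print name "=<redacted>"
    else
      print $0
  }'
}

# Usage: the filter runs locally, so nothing needs to be installed in the container.
# kubectl exec "$POD" -n payments -- printenv | sort | redact_env
```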

Incident response checklist

  • Declare severity and incident commander.
  • Confirm cluster, namespace, service, and user impact.
  • Start an incident timeline.
  • Freeze nonessential deploys for affected systems.
  • Capture current Deployment, ReplicaSet, Pod, Service, Ingress, PVC, and event state.
  • Identify recent changes.
  • Decide mitigation: rollback, scale, config revert, traffic shift, node cordon, dependency failover.
  • Execute one mitigation at a time with a named owner.
  • Verify user visible recovery using SLO metrics and synthetic checks.
  • Preserve logs, traces, audit evidence, and command history.
  • Rotate secrets if credentials may have been exposed.
  • Remove temporary debug permissions and tooling.
  • Write follow up actions with owners and dates in the incident system.

Common mistakes

| Mistake | Consequence | Better action |
| --- | --- | --- |
| Deleting crash looping Pods before reading previous logs | Loses root cause evidence | Capture logs and describe output first |
| Debugging only the Pod | Misses Service, endpoints, ingress, and policy | Follow the request path |
| Scaling up during dependency failure | Amplifies load on dependency | Shed load or disable feature |
| Fixing live Pod filesystem | Change disappears and hides real fix | Patch controller or rollback |
| Ignoring ResourceQuota | New Pods cannot be created | Check quota and LimitRange early |
| Assuming DNS failure is CoreDNS | Service name or NetworkPolicy may be wrong | Test both Service DNS and external DNS |
| Draining nodes without PDB review | Causes avoidable outage | Check PDBs and replicas first |
| Leaving debug RBAC in place | Creates durable privilege risk | Remove after incident |

Production review checklist

  • Deployments have reasonable requests, limits, and rollout strategy.
  • Critical workloads have PodDisruptionBudgets.
  • Services have matching selectors and named target ports.
  • Readiness probes protect traffic during startup and dependency failures.
  • Liveness probes are conservative and local.
  • Logs include correlation ids and are centrally retained.
  • Metrics cover request rate, errors, latency, saturation, restarts, and throttling.
  • Alerts have runbooks with exact commands.
  • Oncall has RBAC for read only diagnostics and controlled debug escalation.
  • Admission denials and RBAC failures are observable.
  • Node maintenance uses cordon, drain, and PDB review.
  • Incident response includes evidence capture before destructive actions.