Observability Logging Metrics Tracing Events and Probes

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/kubernetes/10 Observability Logging Metrics Tracing Events and Probes.md.

Purpose: explain how to observe Kubernetes workloads through logs, metrics, traces, events, probes, audit records, SLOs, alerts, and runbooks.

Observability, Logging, Metrics, Tracing, Events, and Probes

This note expands the Kubernetes note and builds on 03 Deployments ReplicaSets StatefulSets DaemonSets Jobs and CronJobs and 04 Services DNS Ingress Gateway API and Traffic Routing. Kubernetes observability is the ability to answer four questions: what is running, whether users are affected, why the system changed, and which component owns the fix. Logs, metrics, traces, events, probes, and audit logs each answer a different part of those questions and should be used together.

Observability signals

| Signal | Best for | Weakness | First commands or tools |
| --- | --- | --- | --- |
| Logs | Specific errors, request context, application decisions | High volume, hard to aggregate without structure | kubectl logs, log backend |
| Metrics | Rates, saturation, trends, SLOs, alerting | Poor at explaining one request | Metrics Server, Prometheus, Grafana |
| Traces | Cross service request path and latency | Requires instrumentation and sampling design | OpenTelemetry, tracing backend |
| Events | Kubernetes control plane decisions | Short retention, not a log system | kubectl get events, kubectl describe |
| Probes | Container health and traffic eligibility | Can cause outages if designed badly | Pod spec, kubelet events |
| Audit logs | API access and security investigation | High volume, sensitive, platform owned | API server audit backend |

Fast kubectl inspection

kubectl get pods -n payments -o wide
kubectl describe pod api-7d75b9c8b6-h9x4q -n payments
kubectl logs deploy/api -n payments --tail=200
kubectl logs deploy/api -n payments -c api --since=15m
kubectl logs pod/api-7d75b9c8b6-h9x4q -n payments --previous
kubectl get events -n payments --sort-by=.lastTimestamp
kubectl top pods -n payments
kubectl top nodes

Use describe when Kubernetes is making the decision: scheduling, image pull, probes, volume attach, endpoints, and rollout failures. Use logs once the container has started and the application itself is failing.

Container logs

Kubernetes expects containers to write application logs to stdout and stderr. The container runtime stores them as files on the node, the kubelet rotates them, and a log agent usually ships them to a central backend.

Recommended log fields:

| Field | Reason |
| --- | --- |
| timestamp | Reconstruct timeline |
| level | Filter noise |
| service | Group workload output |
| namespace | Preserve tenant or environment context |
| pod | Link log to runtime instance |
| container | Distinguish sidecars and app container |
| request_id or trace_id | Join logs to traces |
| user_id or tenant_id | Debug scoped impact with privacy controls |
| error.kind | Aggregate failure classes |
| duration_ms | Support latency analysis |

Structured JSON log example:

{
  "timestamp": "2026-06-15T12:00:00Z",
  "level": "error",
  "service": "payments-api",
  "namespace": "payments",
  "trace_id": "9d2c3a2f2b5d4f7a",
  "request_id": "req_123",
  "route": "POST /payments",
  "status": 502,
  "duration_ms": 1840,
  "error.kind": "upstream_timeout"
}
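
One way to produce log lines of this shape is a custom formatter on Python's standard logging module. The sketch below is a minimal illustration, not a prescribed implementation: `JsonFormatter`, `build_logger`, and the `fields` key passed through `extra` are hypothetical names chosen for this example.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str, namespace: str) -> None:
        super().__init__()
        self.service = service
        self.namespace = namespace

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "namespace": self.namespace,
            "message": record.getMessage(),
        }
        # Hypothetical convention: per-request fields arrive via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, namespace: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or the given stream)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, namespace))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payments-api", "payments")
    log.error(
        "upstream call failed",
        extra={"fields": {
            "request_id": "req_123",
            "route": "POST /payments",
            "status": 502,
            "duration_ms": 1840,
            "error.kind": "upstream_timeout",
        }},
    )
```

Writing to stdout keeps the application unaware of the log backend; the node level agent handles shipping.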

Logging tradeoffs:

| Choice | Benefit | Cost |
| --- | --- | --- |
| Plain text logs | Easy for humans locally | Hard to query reliably |
| JSON logs | Strong search and aggregation | Requires disciplined schema |
| High cardinality fields | Precise debugging | Storage and index cost |
| Debug logs always on | More context during incidents | Cost and secret leakage risk |
| Dynamic log level | Incident flexibility | Needs authentication and audit |

Common log mistakes:

| Mistake | Consequence | Fix |
| --- | --- | --- |
| Logging secrets or tokens | Credential exposure | Redact at source and test redaction |
| Logging only success paths | Incidents lack evidence | Log error class, outcome, and correlation id |
| Logging without request ids | Cannot connect services | Propagate request id or trace context |
| Logging stack traces for every user error | Noise hides real failures | Separate validation errors from server errors |
| Storing logs only on nodes | Lost after node deletion | Ship to centralized backend |
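
Redacting at source can be as small as a function applied to every structured payload before it is logged. The sketch below is one possible shape, assuming a key-based deny list; `SENSITIVE_KEYS` and `redact` are hypothetical names, and a real list would come from the service's own data classification.

```python
from typing import Any

# Hypothetical deny list; extend it to match the fields your service handles.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "card_number"}
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive keys masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    event = {"request_id": "req_123", "headers": {"Authorization": "Bearer abc"}}
    print(redact(event))
```

Because the function returns a copy, the original request object stays usable, and the redaction itself can be covered by an ordinary unit test.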

Events

Events are Kubernetes status breadcrumbs. They explain scheduler decisions, image pulls, probe failures, mount failures, and controller activity.

kubectl get events -n payments --sort-by=.lastTimestamp
kubectl get events -A --field-selector involvedObject.kind=Pod --sort-by=.lastTimestamp
kubectl describe deployment api -n payments
kubectl describe replicaset -n payments -l app=api

Event interpretation:

| Event reason | Likely issue |
| --- | --- |
| FailedScheduling | Resource shortage, taints, affinity, PVC, node selector |
| FailedMount | Secret, ConfigMap, PVC, CSI, or permission issue |
| BackOff | CrashLoopBackOff or image pull backoff |
| Unhealthy | Probe failure |
| FailedCreate | Quota, admission, RBAC, or invalid Pod spec |
| Killing | Probe restart, eviction, rollout, or deletion |

Events are not durable enough for long incident investigations: by default the API server retains them for only one hour (the --event-ttl flag). Ship them or scrape them if the organization relies on historical event timelines.

Metrics Server

Metrics Server serves recent CPU and memory usage for nodes and Pods through the Kubernetes Metrics API (metrics.k8s.io). It powers commands such as kubectl top and the HorizontalPodAutoscaler resource metrics path. It is not a long term metrics database.

kubectl top nodes
kubectl top pods -A
kubectl top pod api-7d75b9c8b6-h9x4q -n payments --containers
kubectl get apiservice v1beta1.metrics.k8s.io

Metrics Server scope:

| Good use | Poor use |
| --- | --- |
| Quick resource snapshot | Long term capacity planning |
| HPA CPU and memory metrics | Application SLO alerting |
| Node pressure triage | Historical incident analysis |

Prometheus and Grafana

Prometheus is the common Kubernetes metrics backend. Grafana is the common dashboard layer. A production setup usually scrapes cluster components, kube state metrics, node exporters, application metrics, and custom SLO metrics.

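Prometheus scrape annotations are simple but less controlled than ServiceMonitor or PodMonitor CRDs from the Prometheus Operator. The prometheus.io/* annotations are a convention rather than a built-in feature: they take effect only when the Prometheus scrape configuration contains relabeling rules that read them.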

apiVersion: v1
kind: Service
metadata:
  name: payments-api
  namespace: payments
  labels:
    app: payments-api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: payments-api
  ports:
    - name: http
      port: 80
      targetPort: 8080

Application metrics to expose:

| Metric | Type | Purpose |
| --- | --- | --- |
| http_requests_total | Counter | Request rate and error ratio |
| http_request_duration_seconds | Histogram | Latency percentiles |
| queue_depth | Gauge | Backlog and saturation |
| worker_jobs_total | Counter | Job throughput and failures |
| db_pool_in_use | Gauge | Database pool pressure |
| external_request_duration_seconds | Histogram | Dependency latency |

PromQL examples:

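# 5xx error ratio over the last 5 minutes
sum(rate(http_requests_total{namespace="payments",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{namespace="payments"}[5m]))

# p95 latency per route
histogram_quantile(
  0.95,
  sum by (le, route) (
    rate(http_request_duration_seconds_bucket{namespace="payments"}[5m])
  )
)

# CPU cores used per pod; container!="" drops the pod level cgroup series so usage is not counted twice
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="payments",container!="",container!="POD"}[5m])
)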
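
To make the p95 query less of a black box, the sketch below approximates how a quantile is estimated from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. It is an illustration of the idea under simplifying assumptions (the lowest bucket starts at 0, counts are already aggregated), not Prometheus's actual implementation; the function name and sample buckets are made up for this example.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumption: lowest bucket starts at 0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if bound == math.inf:
                # Rank lands in the overflow bucket: report the highest finite bound.
                return buckets[index - 1][0] if index > 0 else math.nan
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

The estimate can never be more precise than the bucket boundaries, which is why bucket layout should bracket the latency thresholds the SLO cares about.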

Dashboard design:

| Dashboard | Must show |
| --- | --- |
| Service overview | Request rate, error ratio, p50, p95, p99, saturation, deploy version |
| Kubernetes workload | Replicas, restarts, CPU, memory, throttling, network, probe failures |
| Dependency | Upstream latency, error rate, timeout count, circuit breaker state |
| Node pool | Allocatable, requested, used, pressure, evictions |
| SLO | Burn rate, budget remaining, user impact |

OpenTelemetry and tracing

Tracing connects one user request across services. OpenTelemetry provides APIs, SDKs, semantic conventions, and collectors.

Basic collector pipeline:

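apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            # Listen on all interfaces so other Pods can reach the collector;
            # recent collector releases bind to localhost by default.
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
      memory_limiter:
        # check_interval must be greater than zero; 1s is the commonly recommended value.
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp:
        endpoint: tracing-backend.observability.svc.cluster.local:4317
        tls:
          # Plaintext gRPC; acceptable only on a trusted in-cluster network.
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]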

Trace guidance:

| Practice | Reason |
| --- | --- |
| Propagate W3C trace context | Joins spans across services |
| Put trace_id in logs | Enables log to trace pivot |
| Sample intelligently | Controls cost while preserving rare errors |
| Capture dependency spans | Identifies slow upstreams |
| Avoid sensitive span attributes | Traces are broadly queried during incidents |
| Name routes with templates | Avoids high cardinality paths |
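
In practice an OpenTelemetry SDK handles propagation, but it helps to see what travels on the wire. The sketch below parses a W3C `traceparent` header and builds the header for an outgoing call, keeping the trace id and minting a new span id. `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical helper names, and the parser only handles the four field layout used by version 00.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), all lowercase
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None when the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace id, freshly generated span id."""
    span_id = secrets.token_hex(8)
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    if ctx is not None:
        # The same trace_id is what should appear in the service's log lines.
        print(ctx.trace_id, child_traceparent(ctx))
```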

Audit logs

Kubernetes audit logs record API server requests. They are security and change evidence, not application logs.

Audit policy example:

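apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Rules are evaluated in order and the first match wins.
  # Secrets stay at Metadata so Secret values are never written to the audit log.
  # No verbs filter, so reads (get, list, watch) are recorded as well as writes.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Exec sessions go through the pods/exec subresource, which "pods" alone does not match.
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods/exec"]
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods", "services", "configmaps"]
  - level: Metadata
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]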

Audit review questions:

| Question | Evidence |
| --- | --- |
| Who changed this Deployment | Audit log user, verb, objectRef, request body if captured |
| Who read a Secret | Audit log for get secrets, if configured |
| Was exec used | Audit log for create pods/exec |
| Did a controller make the change | User agent and username show controller identity |
| Was admission bypassed | Compare request, admission logs, and final object |
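
Questions like "who read a Secret" usually come down to filtering audit events. The sketch below assumes the audit backend writes one JSON event per line and uses field names from the audit.k8s.io/v1 Event schema (`verb`, `user.username`, `objectRef`, `requestReceivedTimestamp`, `userAgent`); the function name and the sample line are made up for illustration.

```python
import json
from typing import Iterable, Iterator

READ_VERBS = {"get", "list", "watch"}


def secret_reads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield a compact summary for each audit event that reads a Secret."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        if ref.get("resource") == "secrets" and event.get("verb") in READ_VERBS:
            yield {
                "time": event.get("requestReceivedTimestamp"),
                "user": (event.get("user") or {}).get("username"),
                "verb": event.get("verb"),
                "namespace": ref.get("namespace"),
                "name": ref.get("name"),
                "user_agent": event.get("userAgent"),
            }


if __name__ == "__main__":
    # Hypothetical sample event; real lines come from the audit log file or backend.
    sample = json.dumps({
        "verb": "get",
        "user": {"username": "jane@example.com"},
        "objectRef": {"resource": "secrets", "namespace": "payments", "name": "db-credentials"},
        "requestReceivedTimestamp": "2026-06-15T12:00:00.000000Z",
        "userAgent": "kubectl/v1.30.0",
    })
    for hit in secret_reads([sample]):
        print(hit)
```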

Probes

Probes are kubelet checks. They do not prove the whole service is healthy. They decide restart behavior and traffic readiness for individual containers.

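| Probe | Action on failure | Best for |
| --- | --- | --- |
| Startup | Holds back liveness and readiness checks until it succeeds; the container is restarted if failureThreshold is exhausted | Slow boot apps |
| Readiness | Removes Pod from Service endpoints | Dependency readiness and graceful deploys |
| Liveness | Restarts container | Deadlocks and unrecoverable local failure |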

Production HTTP probes:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  periodSeconds: 5
  failureThreshold: 24
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3

Probe timing:

| Field | Meaning | Guidance |
| --- | --- | --- |
| initialDelaySeconds | Delay before first probe | Prefer startup probes for long boot |
| periodSeconds | Probe frequency | Balance detection speed and load |
| timeoutSeconds | Single probe timeout | Must be longer than normal local response time |
| successThreshold | Successes required after failure | Must be 1 for liveness and startup; readiness can use values above 1 |
| failureThreshold | Failures before action | Avoid aggressive liveness restarts |
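
The timing fields combine into a simple budget: roughly periodSeconds times failureThreshold, plus any initial delay. The sketch below applies that arithmetic to the probe values shown earlier in this note. It is an approximation that ignores probe timeouts and kubelet scheduling jitter, and `failure_window_seconds` is a hypothetical helper name.

```python
def failure_window_seconds(
    period_seconds: int,
    failure_threshold: int,
    initial_delay_seconds: int = 0,
) -> int:
    """Approximate time the kubelet tolerates consecutive failures before acting."""
    return initial_delay_seconds + period_seconds * failure_threshold


if __name__ == "__main__":
    # Values taken from the production HTTP probes in this note.
    print("startup budget:", failure_window_seconds(5, 24), "seconds")
    print("readiness removal after about:", failure_window_seconds(5, 3), "seconds")
    print("liveness restart after about:", failure_window_seconds(10, 3), "seconds")
```

With these values the application has about two minutes to finish booting before the container is restarted, and a hung process is restarted after roughly thirty seconds of failed liveness checks.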

Health endpoint design:

| Endpoint | Should check | Should not check |
| --- | --- | --- |
| /health/live | Process is not deadlocked and can answer locally | Database, cache, third party APIs |
| /health/ready | App can serve traffic for required dependencies | Optional integrations that can degrade gracefully |
| /health/startup | Migrations, cache warmup, one time initialization | Long external calls without timeout |

Readiness can depend on database reachability if the service cannot handle requests without it. Liveness usually should not. A database outage should remove Pods from endpoints or return user errors, not restart every container in the fleet.
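
The separation between the three endpoints can be sketched with the standard library alone. In the example below, liveness only proves the process can answer, startup reflects one time initialization, and readiness additionally depends on a required database. `AppState`, `make_handler`, and `serve` are hypothetical names, and the flags stand in for whatever checks the real application performs.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AppState:
    """Flags the application flips as it boots and as dependencies come and go."""

    def __init__(self) -> None:
        self.started = threading.Event()
        self.database_ok = threading.Event()


def make_handler(state: AppState):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/health/live":
                healthy = True  # answering at all shows the process is not wedged
            elif self.path == "/health/startup":
                healthy = state.started.is_set()
            elif self.path == "/health/ready":
                healthy = state.started.is_set() and state.database_ok.is_set()
            else:
                self.send_error(404)
                return
            body = b"ok\n" if healthy else b"unavailable\n"
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            pass  # keep probe traffic out of the application log

    return HealthHandler


def serve(state: AppState, port: int = 8080) -> ThreadingHTTPServer:
    """Start the health server on a background thread and return it."""
    server = ThreadingHTTPServer(("0.0.0.0", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

When the database flag clears, readiness returns 503 and the Pod leaves the Service endpoints, while liveness keeps returning 200 so the kubelet does not restart a process that is otherwise fine.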

Probe failure triage:

kubectl describe pod api-7d75b9c8b6-h9x4q -n payments
kubectl logs pod/api-7d75b9c8b6-h9x4q -n payments --previous
kubectl get endpoints payments-api -n payments -o wide
kubectl port-forward pod/api-7d75b9c8b6-h9x4q -n payments 18080:8080
curl -i http://127.0.0.1:18080/health/ready

SLO observability

SLOs connect telemetry to user promises. They should be defined from user visible behavior, not from pod health.

Example SLO:

| Item | Definition |
| --- | --- |
| Service | Payments API |
| Objective | 99.9 percent of valid payment requests succeed over 30 days |
| Good event | HTTP 2xx or accepted async response |
| Bad event | HTTP 5xx, timeout, or dependency failure visible to user |
| Excluded | Client validation errors and explicit rate limits |
| Alert | Multi window burn rate on error budget |

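Burn rate alert shape. With a 99.9 percent objective the error budget is 0.1 percent of requests, so the 0.02 threshold fires at a burn rate of 20. The expression shows a single 5 minute window; a multi window alert additionally requires a longer window, such as 1h, to exceed its threshold: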

(
  sum(rate(http_requests_total{service="payments-api",status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{service="payments-api"}[5m]))
) > 0.02
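
The arithmetic behind a burn rate alert is short enough to check by hand. The sketch below computes the burn rate from an error ratio and an SLO target, then applies a two window rule. The function names are hypothetical, and the default threshold of 14.4 is an assumption for this example: it is the rate at which 2 percent of a 30 day budget is spent in one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    short_window_ratio: float,
    long_window_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed paging threshold, see lead-in
) -> bool:
    """Page only when both the short and the long window burn too fast."""
    return (
        burn_rate(short_window_ratio, slo_target) >= threshold
        and burn_rate(long_window_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # A 2 percent error ratio against a 99.9 percent objective burns budget about 20 times too fast.
    print(burn_rate(0.02, 0.999))
    print(should_page(short_window_ratio=0.02, long_window_ratio=0.016, slo_target=0.999))
```

Requiring both windows to agree keeps a brief spike from paging while still reacting quickly once the problem is sustained.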

Alert design:

| Alert type | Page? | Example |
| --- | --- | --- |
| User impact SLO burn | Yes | High 5xx rate for production API |
| Imminent saturation | Yes when impact is likely | Node disk pressure with evictions |
| Single pod restart | No by default | Ticket or dashboard annotation |
| Missing metrics | Yes for critical telemetry | Prometheus cannot scrape production service |
| Deployment failed | Usually ticket, page if production stuck | Rollout timeout |

Runbooks

Every actionable alert should have a runbook. A useful runbook is concrete enough for a tired responder.

Runbook template:

| Section | Content |
| --- | --- |
| Signal | Alert name, query, dashboard link |
| Impact | What users experience |
| First checks | Exact commands and dashboards |
| Triage branches | Common causes and how to distinguish them |
| Mitigation | Rollback, scale, disable feature, fail over |
| Escalation | Owning team and dependency contacts |
| Evidence | Logs, traces, events, deployment id |
| Aftercare | Follow up metrics and post incident notes |

Example first checks:

kubectl rollout status deploy/payments-api -n payments
kubectl get pods -n payments -l app=payments-api -o wide
kubectl describe deploy payments-api -n payments
kubectl get events -n payments --sort-by=.lastTimestamp | tail -40
kubectl logs deploy/payments-api -n payments --since=10m --tail=500

Review checklist

  • Application logs are structured and include correlation ids.
  • Logs do not contain secrets, tokens, or full payment data.
  • Metrics include request rate, errors, latency, and saturation.
  • Histograms use stable route labels, not raw URLs.
  • Traces propagate across ingress, service calls, queues, and workers.
  • Probe endpoints are separate for startup, readiness, and liveness.
  • Liveness does not depend on database or third party availability.
  • Readiness removes Pods from endpoints when required dependencies are unavailable.
  • Events are collected or retained long enough for incident timelines.
  • Audit logs capture RBAC, Secret, exec, and workload change activity.
  • Alerts page on user impact or imminent impact, not normal churn.
  • Every page has a concrete runbook.