Reliability Observability and Operations

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Software Engineering/08 Reliability Observability and Operations.md.

Reliability is a product property. Operations are the feedback loop that keeps reliability real. A system is reliable when users can complete the work they came to do, within a tolerable time, with acceptable correctness, even while individual components fail.

Reliability work is not the same as eliminating every fault. It is the practice of defining acceptable service behavior, detecting departures from it, limiting blast radius, restoring service quickly, and using incidents to improve the system.

Core concepts

| Concept | Meaning | Typical owner | Common mistake |
| --- | --- | --- | --- |
| SLI | Service level indicator. A measured signal of user-facing service behavior. | Service team | Measuring what is easy instead of what users experience. |
| SLO | Service level objective. A target for an SLI over a window. | Product and engineering | Setting every target to 100 percent. |
| SLA | Service level agreement. An external promise, often contractual. | Business, legal, engineering | Treating an internal SLO as a contractual promise. |
| Error budget | Allowed unreliability before the SLO is missed. | Service team | Spending it without changing release or risk posture. |
| Burn rate | Speed at which the error budget is being consumed. | On-call and service team | Alerting only after the budget is already gone. |
| Toil | Manual, repetitive, automatable operational work. | Operations and service team | Normalizing toil instead of paying it down. |
| MTTR | Mean time to recovery or repair. | Operations and service team | Optimizing restoration without fixing detection. |
| MTTD | Mean time to detect. | Observability and service team | Depending on customer reports as the primary detector. |

Good reliability language is precise:

  • Availability asks: can users reach the service and get a valid response?
  • Latency asks: can users complete the operation within a useful time?
  • Durability asks: will accepted data still exist later?
  • Correctness asks: is the response or state semantically valid?
  • Freshness asks: is the returned data recent enough?
  • Coverage asks: is the service available to all expected users, regions, tenants, and device classes?

Reliability model

Reliability models connect user journeys to technical signals. The best model starts with critical user actions, not infrastructure components.

| User journey | Example SLI | Example SLO | Failure consequence |
| --- | --- | --- | --- |
| Sign in | Successful login requests divided by valid login attempts | 99.9 percent over 30 days | Users cannot access the product. |
| Checkout | Paid orders accepted without duplicate charge or lost order | 99.95 percent over 30 days | Revenue loss and support escalations. |
| Search | Queries returning a usable result page under 800 ms | 99 percent over 7 days | Users abandon discovery. |
| File upload | Uploads durably stored and visible within 60 seconds | 99.9 percent over 30 days | Data loss perception. |
| Notification delivery | Notifications delivered within policy window | 99 percent over 24 hours | Users miss time-sensitive events. |
| Control plane update | Accepted configuration changes reconciled correctly | 99.9 percent over 30 days | Operators lose trust or cause fleet impact. |

SLI design

An SLI should be:

  • User-facing: tied to a user-visible outcome.
  • Measurable: derived from telemetry the system actually emits.
  • Bounded: scoped by service, operation, tenant class, region, and time window.
  • Resistant to gaming: not improved by dropping traffic, suppressing errors, or excluding hard cases.
  • Actionable: a bad value should point toward a response path.

Common SLI forms:

| SLI type | Formula | Notes |
| --- | --- | --- |
| Request success | good requests / valid requests | Exclude invalid client input only when it is truly outside service responsibility. |
| Latency | requests faster than threshold / valid requests | Prefer threshold SLIs over average latency. Averages hide tail pain. |
| Availability | successful probes or requests / expected probes or requests | Synthetic probes are useful but should not replace real traffic SLIs. |
| Durability | records retrievable after accepted write / accepted writes | Requires audit sampling or reconciliation. |
| Freshness | reads with data age under threshold / reads | Important for caches, streams, analytics, search, and replication. |
| Correctness | correct outcomes / sampled outcomes | Often needs domain-specific validation, shadow checks, or reconciliation jobs. |
| Queue timeliness | jobs completed within deadline / accepted jobs | Better than queue depth alone. |
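
To illustrate why a threshold SLI is preferred over average latency, here is a minimal sketch. The function name `latency_sli` and the sample data are hypothetical; the 800 ms threshold is borrowed from the search example earlier in this note.

```python
from statistics import mean
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of valid requests that finished faster than the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one valid request")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


# Hypothetical sample: 95 fast requests and 5 very slow ones.
samples = [100] * 95 + [5000] * 5

average_ms = mean(samples)           # looks healthy against an 800 ms target
sli = latency_sli(samples, 800)      # exposes that 5 percent of users waited
```

In this sample the average stays well under 800 ms while the SLI shows that one request in twenty missed the threshold, which is the tail pain an average hides.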

SLO selection

SLOs are product decisions expressed as engineering targets. A strong SLO has:

  • A clearly named service and operation.
  • An SLI definition with numerator and denominator.
  • A time window.
  • An objective.
  • Exclusions that are explicit and rare.
  • An alerting policy.
  • A consequence policy when the budget is low.

Example:

| Field | Example |
| --- | --- |
| Service | Payments API |
| Operation | Authorize payment |
| SLI | Successful authorization responses under 1500 ms divided by valid authorization requests |
| Window | Rolling 30 days |
| Objective | 99.95 percent |
| Exclusions | Synthetic test cards, invalid merchant credentials, explicit provider maintenance windows |
| Alert | Page on fast burn, ticket on slow burn |
| Budget policy | Freeze risky changes below 25 percent budget remaining |

Do not set all services to the same objective. A public checkout path and an internal report export job need different reliability levels because their user impact and cost profile differ.

SLA design

SLAs are external commitments. They should be looser than internal SLOs so the organization has time to respond before a contractual breach.

| Layer | Example target | Purpose |
| --- | --- | --- |
| Internal SLO | 99.95 percent monthly API availability | Engineering target and alerting basis. |
| Public status target | 99.9 percent monthly availability | Customer-facing reliability expectation. |
| Contractual SLA | 99.5 percent monthly availability with credits | Legal and commercial commitment. |

SLA language should define:

  • Covered services and regions.
  • Measurement source.
  • Maintenance windows.
  • Exclusions.
  • Customer credit process.
  • Support response obligations.
  • Security and data-loss obligations if applicable.

Error budgets

An error budget is the amount of unreliability available within the SLO window.

For a 99.9 percent availability SLO over 30 days:

30 days = 43,200 minutes
Allowed bad time = 0.1 percent * 43,200 = 43.2 minutes

For request-based SLOs:

Error budget = valid events * (1 - SLO target)
Budget consumed = bad events / error budget

| SLO | 30 day allowed bad time | 7 day allowed bad time | Notes |
| --- | --- | --- | --- |
| 99 percent | 432 minutes | 100.8 minutes | Suitable for non-critical or batch surfaces. |
| 99.5 percent | 216 minutes | 50.4 minutes | Common for lower-tier user-facing systems. |
| 99.9 percent | 43.2 minutes | 10.08 minutes | Common for important interactive services. |
| 99.95 percent | 21.6 minutes | 5.04 minutes | Requires strong automation and incident maturity. |
| 99.99 percent | 4.32 minutes | 1.008 minutes | Expensive and operationally demanding. |
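
The arithmetic above can be sketched as two small helpers. The function names are hypothetical, and SLO targets are passed as percentages to match the notation used in this note.

```python
def allowed_bad_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of unreliability a time-based SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_consumed(bad_events: int, valid_events: int, slo_percent: float) -> float:
    """Fraction of a request-based error budget already spent (1.0 = exhausted)."""
    error_budget = valid_events * (100.0 - slo_percent) / 100.0
    if error_budget <= 0:
        raise ValueError("no error budget: check the SLO target and event count")
    return bad_events / error_budget


# 99.9 percent over 30 days allows roughly 43.2 bad minutes.
minutes_30d = allowed_bad_minutes(99.9, 30)

# 250 bad events out of 1,000,000 valid events spends about a quarter
# of a 99.9 percent budget (the budget is about 1,000 bad events).
spent = budget_consumed(250, 1_000_000, 99.9)
```

Results are floating point, so compare them with a tolerance rather than exact equality.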

Error budget policy converts reliability into day-to-day decisions:

| Budget remaining | Engineering posture | Operational posture |
| --- | --- | --- |
| Above 75 percent | Normal delivery. | Watch trends. Improve observability opportunistically. |
| 50 to 75 percent | Normal delivery with careful rollout. | Review recurring causes. Tighten canaries. |
| 25 to 50 percent | Limit risky changes. | Prioritize reliability fixes and capacity checks. |
| 0 to 25 percent | Freeze non-essential risky changes. | Incident review, executive visibility, active mitigation. |
| Exhausted | Stop change that could worsen the SLO unless required for recovery. | Restore budget by reducing error rate and closing systemic gaps. |

Budgets should be spent intentionally. Releasing faster is reasonable when the service is healthy. Slowing delivery is reasonable when users are paying reliability debt.

Burn rates

Burn rate measures how fast a service is consuming its error budget.

Burn rate = observed error rate / allowed error rate

For a 99.9 percent SLO, the allowed error rate is 0.1 percent. If the observed bad event rate is 2 percent:

Burn rate = 2 / 0.1 = 20

At a burn rate of 20, a 30 day budget is consumed in:

30 days / 20 = 1.5 days

Multi-window burn alerts reduce noise and catch both fast outages and slow leaks.

| Alert type | Short window | Long window | Example burn rate | Action |
| --- | --- | --- | --- | --- |
| Fast page | 5 minutes | 1 hour | 14x or higher | Wake a human. User impact is severe or imminent. |
| Medium page | 30 minutes | 6 hours | 6x or higher | Page during service hours or when impact is material. |
| Slow ticket | 2 hours | 1 day | 3x or higher | Create owned work. Budget is leaking. |
| Trend review | 1 day | 3 days | 1x or higher | Investigate recurring degradation before it pages. |

Alert when both the short and long window breach. The short window confirms immediacy. The long window confirms persistence.
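
A minimal sketch of the two-window rule follows. It assumes the error rates for each window are already computed elsewhere (for example by a metrics query); `BurnRule`, `burn_rate`, and `should_fire` are hypothetical names, and the thresholds come from the example table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    name: str
    threshold: float  # burn rate multiple that triggers the rule


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by allowed error rate (both as fractions)."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("a 100 percent SLO has no allowed error rate")
    return error_rate / allowed


def should_fire(rule: BurnRule, short_error_rate: float,
                long_error_rate: float, slo_target: float) -> bool:
    """Fire only when both the short and the long window breach the threshold."""
    short = burn_rate(short_error_rate, slo_target)
    long_ = burn_rate(long_error_rate, slo_target)
    return short >= rule.threshold and long_ >= rule.threshold


fast_page = BurnRule("fast page", threshold=14.0)

# Sustained 2 percent errors against a 99.9 percent SLO: burn rate near 20.
sustained = should_fire(fast_page, 0.02, 0.02, 0.999)

# A brief spike that the long window has not confirmed does not page.
spike_only = should_fire(fast_page, 0.02, 0.005, 0.999)
```

Requiring both windows means a short spike that has already recovered stops paging quickly, while a persistent problem keeps the alert active.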

Failure modes

Failures are easier to handle when they are named. A precise failure mode makes detection, mitigation, and testing more concrete.

| Failure mode | Description | Detection signal | Common mitigation |
| --- | --- | --- | --- |
| Crash failure | Process stops or container exits. | Restart count, missing heartbeat, 5xx spike. | Supervisor restart, redundancy, graceful shutdown. |
| Omission failure | Expected response never arrives. | Timeout rate, missing events, queue age. | Timeouts, retries with budget, dead letter queue. |
| Slow failure | Component responds but too slowly. | Tail latency, saturation, queue delay. | Load shedding, scaling, timeout tuning, dependency bypass. |
| Gray failure | Component is partially unhealthy but not obviously down. | Regional skew, subset errors, odd percentile shifts. | Synthetic checks, quorum, adaptive routing, health scoring. |
| Byzantine failure | Component returns arbitrary or inconsistent results. | Invariant violations, checksum mismatch, divergent replicas. | Validation, quorum reads, circuit isolation. |
| Correlated dependency failure | Shared dependency affects many services at once. | Cross-service error spike, dependency latency. | Bulkheads, caching, fallback, dependency SLOs. |
| Overload | Demand exceeds capacity. | CPU, memory, queue, concurrency, saturation. | Backpressure, rate limits, load shedding, autoscaling. |
| Data corruption | Stored or transmitted data becomes wrong. | Reconciliation drift, checksum failure, audit failure. | Backups, write validation, repair jobs, restore tests. |
| Deadlock | Work stops because resources wait on each other. | No progress with active locks or threads. | Lock ordering, timeouts, watchdogs. |
| Livelock | System is busy but no useful progress occurs. | High activity with flat completion rate. | Retry jitter, circuit breakers, admission control. |
| Split brain | Multiple controllers believe they are primary. | Conflicting writes, dual leaders, fencing errors. | Leader election, fencing tokens, quorum. |
| Operator error | Human action degrades service. | Change correlation, audit log, sudden config drift. | Guardrails, reviews, dry run, rollback, least privilege. |
| Clock failure | Time assumptions become invalid. | Token expiry anomalies, ordering errors. | NTP monitoring, monotonic clocks, bounded skew handling. |
| Partial deploy failure | Mixed versions conflict. | Error rate by version, schema mismatch. | Compatible migrations, canaries, feature flags. |
| Resource leak | Capacity degrades over time. | Memory growth, file descriptor count, connection count. | Limits, restart policy, profiling, leak tests. |

Failure scenario matrix

| Scenario | Symptom | First question | Immediate action | Long-term prevention |
| --- | --- | --- | --- | --- |
| Database primary is slow | API p99 latency spikes and request queues grow. | Is the database saturated, locked, or network delayed? | Shed non-critical traffic, reduce concurrency, fail over only if confidence is high. | Query limits, index review, connection pooling, load testing. |
| Cache cluster fails | Read latency increases and database QPS spikes. | Can the origin absorb direct traffic? | Enable cache bypass only for critical paths, rate limit expensive endpoints. | Stale cache fallback, request coalescing, cache dependency SLO. |
| Message queue stalls | Jobs miss deadlines while API looks healthy. | Are producers still accepting work that cannot finish on time? | Stop or throttle producers, scale consumers, dead letter poison messages. | Queue timeliness SLO, backpressure from consumers to producers. |
| DNS misconfiguration | Users in some regions cannot connect. | Which resolvers and records are affected? | Revert zone change, lower impact with alternate endpoint if available. | DNS change review, staged rollout, synthetic DNS probes. |
| Certificate expiry | TLS failures begin at clients or load balancers. | Which certificates and paths are expired? | Renew and reload, bypass broken termination only if safe. | Expiry alerts, automated renewal, inventory. |
| Bad release | Errors correlate with one version. | Can the release be rolled back safely? | Roll back, disable feature flag, or route away from version. | Canary gates, automated rollback, schema compatibility. |
| Region outage | One region loses dependencies or networking. | Is failover automatic, safe, and capacity-backed? | Drain or route traffic to healthy region if data and capacity allow. | Regular failover drills, regional isolation, capacity reserve. |
| Control plane bug | Deployments or config changes mutate too much. | Is the data plane still serving? | Freeze writes, disable controllers, restore last known good state. | Dry run, staged reconciliation, audit guards, narrow privileges. |

Blast radius

Blast radius is the maximum damage a failure can cause before containment works. It is shaped by topology, permissions, rollout design, dependency coupling, and operational process.

| Blast radius control | What it limits | Example |
| --- | --- | --- |
| Cell architecture | Number of tenants affected by a cell failure | Shard tenants across independent stacks. |
| Regional isolation | Geography affected by a regional failure | Keep regional control loops independent. |
| Bulkheads | Failure propagation across pools | Separate worker pools for free and paid tiers. |
| Rate limits | Load a caller can impose | Per-tenant and per-token limits. |
| Feature flags | Scope of behavior changes | Enable feature by tenant cohort. |
| Canary releases | Scope of release defects | Start with 1 percent traffic or one cell. |
| Least privilege | Scope of human or service account mistakes | Restrict controllers to owned namespaces. |
| Quotas | Scope of resource exhaustion | Per-tenant storage and job quotas. |
| Circuit breakers | Scope of dependency failure | Stop calling a failing provider temporarily. |
| Data partitioning | Scope of corruption or hot keys | Partition writes and add tenant-aware limits. |

Blast radius questions:

  • Can one tenant exhaust shared capacity for all tenants?
  • Can one bad deploy affect all regions at once?
  • Can one operator command mutate all production resources?
  • Can one dependency outage stop every user journey?
  • Can one queue poison message block an entire worker group?
  • Can one schema migration lock a high-traffic table?
  • Can one control plane reconciliation bug delete healthy data plane resources?
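
One common way to limit the scope of a behavior change to a tenant cohort is a stable hash of the tenant identifier. The sketch below is an assumption about how such a cohort check could look, not a description of any particular feature flag product; `in_rollout` and the 0.01 percent bucket granularity are hypothetical.

```python
import hashlib


def in_rollout(tenant_id: str, feature: str, percent: float) -> bool:
    """Return True when the tenant falls inside the first `percent` of buckets.

    The bucket depends only on the feature name and tenant id, so a tenant
    stays in or out of the cohort across requests and across hosts.
    """
    digest = hashlib.sha256(f"{feature}:{tenant_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Start with 1 percent of tenants, then widen the same cohort.
tenants = [f"tenant-{i}" for i in range(10_000)]
canary = {t for t in tenants if in_rollout(t, "new-checkout", 1)}
wider = {t for t in tenants if in_rollout(t, "new-checkout", 25)}
```

Because the bucket is fixed per tenant, raising the percentage only adds tenants; nobody who already has the feature loses it mid-rollout.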

Graceful degradation

Graceful degradation preserves the most important user outcomes when some capability is impaired. It is a product and engineering design choice.

| Degradation pattern | Use when | Example |
| --- | --- | --- |
| Read-only mode | Writes are unsafe but reads are possible. | Allow users to view documents during database failover. |
| Stale data | Freshness is less important than availability. | Serve cached catalog data with a freshness label. |
| Reduced fidelity | Approximate output is acceptable. | Return simpler search ranking when personalization is down. |
| Optional feature disable | Core path must survive optional dependency failure. | Hide recommendations when recommendation service fails. |
| Async acceptance | Work can complete later. | Accept upload and process thumbnails after recovery. |
| Manual queue | Human workflow can bridge automation outage. | Queue refunds for support review when provider API is down. |
| Tiered service | Some users or paths are higher priority. | Preserve paid checkout before background exports. |

Degradation must be explicit. Silent degradation can become correctness failure.

Checklist:

  • Define the core user outcome to preserve.
  • Decide which features can be disabled first.
  • Show clear user-facing state when behavior changes.
  • Keep degraded mode observable with its own metric and event.
  • Test entry and exit from degraded mode.
  • Document data consistency implications.
  • Define who can enable and disable the mode.
  • Make degraded mode reversible.
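
As an illustration of the stale data pattern with explicit, observable degradation, here is a minimal sketch. `StaleFallbackReader`, `Read`, and the `fetch` callable are hypothetical names; the reader keeps the last good value per key in memory, which is an assumption made to keep the example self-contained.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Read:
    value: Any
    stale: bool    # lets the caller show a freshness label
    age_s: float


class StaleFallbackReader:
    def __init__(self, fetch: Callable[[str], Any], max_stale_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._max_stale_s = max_stale_s
        self._clock = clock
        self._last_good: Dict[str, Tuple[Any, float]] = {}
        self.degraded_reads = 0  # metric: how often degraded mode served a read

    def get(self, key: str) -> Read:
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._last_good.get(key)
            if cached is None:
                raise  # nothing to fall back to
            value, stored_at = cached
            age_s = self._clock() - stored_at
            if age_s > self._max_stale_s:
                raise  # too old to serve; surface the failure instead
            self.degraded_reads += 1
            return Read(value, stale=True, age_s=age_s)
        self._last_good[key] = (value, self._clock())
        return Read(value, stale=False, age_s=0.0)
```

The `stale` flag and the `degraded_reads` counter are what keep this from becoming silent degradation: the caller can label the data and the operator can see how often the fallback is used.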

Backpressure

Backpressure tells upstream callers to slow down before the system collapses. Without backpressure, overload moves downstream until it becomes an outage.

Backpressure can be:

  • Synchronous: return 429, 503, retry-after, or explicit quota error.
  • Asynchronous: reject new jobs, pause consumers, reduce producer rate.
  • Resource-based: limit concurrency, queue length, memory, file descriptors, or connections.
  • Priority-based: preserve critical traffic while throttling lower-priority work.

| Signal | Meaning | Backpressure response |
| --- | --- | --- |
| Queue age rising | Accepted work is missing freshness or deadline targets. | Stop accepting low-priority jobs and scale consumers. |
| Queue length rising | Arrival rate exceeds completion rate. | Limit producers and add admission control. |
| Thread or connection pool full | Concurrency is saturated. | Return fast failure instead of waiting indefinitely. |
| Memory pressure | Process is near OOM or garbage collection pressure. | Reduce batch size, reject expensive requests, shed caches carefully. |
| Database lock wait | Transactions are blocking useful work. | Reduce write concurrency, pause migrations, kill unsafe long queries. |
| Provider throttling | Dependency is enforcing limits. | Slow callers, use fallback, respect retry-after. |

Backpressure design rules:

  • Bound every queue.
  • Bound every retry loop.
  • Bound every concurrency pool.
  • Prefer fast, clear rejection over unbounded waiting.
  • Propagate retry guidance to callers.
  • Add jitter to retries.
  • Preserve priority classes.
  • Make overload visible in metrics and logs.
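
A minimal sketch of the "bound every queue" and "prefer fast, clear rejection" rules, using the standard library `queue.Queue`. The `BoundedIntake` and `Overloaded` names and the one-second retry hint are hypothetical.

```python
import queue
from typing import Any


class Overloaded(Exception):
    """Raised instead of blocking; carries retry guidance for the caller."""

    def __init__(self, retry_after_s: float):
        super().__init__(f"overloaded, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


class BoundedIntake:
    def __init__(self, max_pending: int, retry_after_s: float = 1.0):
        if max_pending < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, so refuse it here.
            raise ValueError("max_pending must be at least 1")
        self._jobs: "queue.Queue[Any]" = queue.Queue(maxsize=max_pending)
        self.retry_after_s = retry_after_s
        self.rejected = 0  # metric: makes overload visible

    def submit(self, job: Any) -> None:
        try:
            self._jobs.put_nowait(job)  # never waits when the queue is full
        except queue.Full:
            self.rejected += 1
            raise Overloaded(self.retry_after_s) from None

    def take(self) -> Any:
        return self._jobs.get_nowait()
```

An HTTP handler could translate `Overloaded` into a 429 or 503 response with a `Retry-After` header, which is the synchronous form of backpressure listed above.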

Load shedding

Load shedding intentionally drops or rejects work to preserve the system. It should be designed before overload happens.

| Load shedding method | Preserves | Risk |
| --- | --- | --- |
| Reject low-priority requests | Critical user journeys | Poor prioritization can harm important users. |
| Drop duplicate work | Capacity | Requires idempotency and request identity. |
| Disable expensive features | Core availability | Product experience degrades. |
| Serve stale cache | Availability | Freshness or correctness may suffer. |
| Reduce sampling or enrichment | Core path latency | Observability or analytics may lose detail. |
| Enforce per-tenant quotas | Fairness | Noisy users may see explicit errors. |
| Admission control | System stability | Requires accurate capacity signals. |

Good load shedding is:

  • Early: starts before saturation becomes collapse.
  • Fair: prevents one caller from consuming all capacity.
  • Transparent: returns explicit status and retry guidance.
  • Observable: emits shed count, reason, priority, tenant, and path.
  • Reversible: stops automatically or through a documented control.

Avoid shedding:

  • Security checks.
  • Authorization checks.
  • Data integrity validation.
  • Audit-critical events.
  • Required billing or compliance records.
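
Priority-based shedding can be sketched as an admission check that starts rejecting lower classes before the system saturates. The priority classes, the utilization thresholds, and the `admit` function are hypothetical; a real system would derive the utilization signal from its own capacity metrics.

```python
from collections import Counter
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    NORMAL = 1
    BACKGROUND = 2


# Utilization (0.0 to 1.0) at which each class starts being shed.
# Illustrative values only: lower classes are shed earlier.
SHED_AT = {
    Priority.BACKGROUND: 0.70,
    Priority.NORMAL: 0.85,
    Priority.CRITICAL: 1.00,
}

# Observable: shed count by priority and path.
shed_counts: Counter = Counter()


def admit(priority: Priority, utilization: float, path: str = "unknown") -> bool:
    """Admit the request unless utilization has reached its class threshold."""
    if utilization < SHED_AT[priority]:
        return True
    shed_counts[(priority.name, path)] += 1
    return False
```

Checks such as authorization and data integrity validation still run for every admitted request; the admission decision only controls whether work is accepted at all.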

Timeouts, retries, and circuit breakers

Timeouts, retries, and circuit breakers are reliability tools only when they are bounded and coordinated.

| Mechanism | Purpose | Failure risk | Good default |
| --- | --- | --- | --- |
| Timeout | Stop waiting for work that no longer has value. | Timeout too high causes thread exhaustion. Timeout too low causes false failure. | Set from downstream latency distribution and caller deadline. |
| Retry | Recover from transient failure. | Retry storms amplify outages. | Retry only idempotent work with exponential backoff and jitter. |
| Circuit breaker | Stop calling a failing dependency. | Opens too aggressively and causes avoidable degradation. | Open on sustained failure, half-open with limited probes. |
| Hedging | Send duplicate request to reduce tail latency. | Doubles load under stress. | Use only for idempotent reads with strict budget. |
| Deadline propagation | Share remaining time across service calls. | Missing propagation causes useless downstream work. | Pass a request deadline through all internal calls. |
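
The circuit breaker default in the table (open on sustained failure, half-open with limited probes) can be sketched as a small state machine. This is a simplified, single-threaded illustration: it uses consecutive failures as a stand-in for "sustained failure" and allows one probe at a time. All names and default values are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0                      # consecutive failures
        self._opened_at: Optional[float] = None
        self._probing = False                   # a half-open probe is in flight

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def allow(self) -> bool:
        """Ask before each call to the dependency."""
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self._probing:
            self._probing = True  # let exactly one probe through
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
        self._probing = False

    def record_failure(self) -> None:
        self._failures += 1
        if self._probing or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or reopen after a failed probe
            self._probing = False
```

Callers check `allow()` before the dependency call and report the outcome afterwards; when `allow()` returns False they fail fast or use a fallback instead of waiting on a dependency that is known to be failing.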

Retry checklist:

  • Is the operation idempotent?
  • Is there an idempotency key?
  • Is the total retry duration below the caller deadline?
  • Is there jitter?
  • Is the retry budget shared across layers?
  • Does the system stop retrying on permanent errors?
  • Does the retry path emit metrics?
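
Several items in the checklist can be made concrete in one bounded retry helper. The sketch assumes the wrapped operation is idempotent; `retry`, `PermanentError`, and the default delays are hypothetical, and the `sleep`, `clock`, and `rng` parameters exist only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class PermanentError(Exception):
    """Failure that retrying cannot fix (for example invalid input)."""


def retry(operation: Callable[[], T], *, deadline_s: float, max_attempts: int = 4,
          base_delay_s: float = 0.1, max_delay_s: float = 2.0,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic,
          rng: Callable[[], float] = random.random) -> T:
    """Retry with exponential backoff and full jitter, bounded by a deadline."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    give_up_at = clock() + deadline_s
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except PermanentError:
            raise  # stop retrying on permanent errors
        except Exception:
            if attempt == max_attempts:
                raise  # attempt budget spent
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            delay = rng() * cap  # full jitter: uniform in [0, cap)
            if clock() + delay >= give_up_at:
                raise  # the next attempt would exceed the caller deadline
            sleep(delay)
    raise AssertionError("unreachable")
```

This covers the attempt bound, the deadline bound, jitter, and permanent errors. A retry budget shared across layers and retry metrics are not shown and would need to be added around it.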

Observability

Observability is the ability to ask new questions about system behavior without shipping new code for every question. It requires high-quality telemetry, context propagation, consistent naming, and an operational workflow that turns signals into decisions.

Telemetry signals

| Signal | Best for | Weak at | Required fields |
| --- | --- | --- | --- |
| Metrics | Aggregates, alerts, trends, SLOs, capacity. | Explaining one specific request. | service, operation, status, region, version, tenant class. |
| Logs | Discrete facts, errors, local context. | Reliable alerting at high scale. | timestamp, severity, service, trace id, request id, user or tenant where safe. |
| Traces | Request path and dependency latency. | Background aggregate behavior without sampling care. | trace id, span id, parent id, operation, status, duration. |
| Profiles | CPU, memory, allocation, lock, IO cost. | User-level correctness. | service, version, host or pod, sample type, time range. |
| Events | Business and operational state changes. | High-cardinality time series math. | actor, subject, action, result, reason, timestamp. |

Signals work best together:

  • Metrics tell you that something is wrong.
  • Traces show where time or errors accumulate.
  • Logs explain local decisions and exceptions.
  • Profiles show resource cost and contention.
  • Events explain what changed.

Metrics

Metric design rules:

  • Use counters for events that only increase.
  • Use gauges for current state.
  • Use histograms for latency, size, and duration.
  • Avoid high-cardinality labels such as raw user id, request id, or full URL.
  • Keep units explicit in names or metadata.
  • Prefer service-level metrics over host-only metrics for alerting.
  • Record both accepted work and completed work.
  • Split by outcome, dependency, region, version, and priority where useful.

Important metric groups:

| Group | Examples | Operational use |
| --- | --- | --- |
| User journey | request count, success count, latency bucket | SLOs and paging. |
| Dependency | downstream errors, downstream latency, circuit state | Triage and containment. |
| Saturation | CPU, memory, queue age, pool usage, disk IO | Capacity and overload detection. |
| Change | deploy version, feature flag state, config generation | Correlate incidents with changes. |
| Data | replication lag, reconciliation drift, failed writes | Correctness and durability. |
| Control plane | reconcile count, reconcile errors, desired vs actual drift | Safe automation. |

Logs

Logs should be structured, searchable, and safe.

Good log properties:

  • One event per log record.
  • Stable field names.
  • Machine-readable severity.
  • Correlation identifiers.
  • Clear message and reason code.
  • No secrets, tokens, passwords, or raw sensitive payloads.
  • Sampling for noisy success paths.
  • Retention matched to debugging, compliance, and cost.
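
A minimal sketch of structured logging with the standard library `logging` module follows. The field set (`service`, `trace_id`, `reason`) is an assumption chosen to match the properties above; only whitelisted fields are emitted, so arbitrary payloads attached to a record are not written out.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record with stable field names."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "severity": record.levelname.lower(),  # machine-readable severity
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),  # correlation id
            "reason": getattr(record, "reason", None),      # stable reason code
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call site passes the extra fields through the `extra` argument, for example `build_logger("payments").error("authorization failed", extra={"service": "payments-api", "trace_id": trace_id, "reason": "provider_timeout"})`, where `trace_id` comes from the request context.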

Log levels:

| Level | Use |
| --- | --- |
| debug | Detailed diagnostics, usually sampled or disabled in production. |
| info | Important state transition or operational fact. |
| warn | Unexpected condition that did not break the request but may need attention. |
| error | Request or job failed and the service could not complete expected work. |
| fatal | Process cannot continue safely. |

Log antipatterns:

  • Logging the same error at every layer.
  • Logging huge payloads.
  • Logging secrets or customer data.
  • Using free-form messages where fields are needed.
  • Missing trace id or request id.
  • Treating logs as the only alert source.

Traces

Traces make distributed execution visible. They are most valuable when span names and attributes are consistent.

Trace design:

  • Propagate context across HTTP, RPC, queues, and background jobs.
  • Name spans by operation, not raw URL or user input.
  • Mark errors with reason and status.
  • Attach dependency, region, version, and retry attempt where useful.
  • Sample intelligently. Keep error traces and slow traces at higher rates.
  • Link asynchronous work back to the originating request where possible.

Trace questions:

  • Which dependency owns the latency?
  • Did retries improve or worsen the request?
  • Did a queue wait dominate execution?
  • Did one tenant, region, or version cause the tail?
  • Did a feature flag change the path?

Profiles

Profiles show where resources go. They are essential when latency or cost cannot be explained by request metrics alone.

Profile types:

| Profile | Shows | Example use |
| --- | --- | --- |
| CPU | Hot functions and expensive loops. | Identify expensive serialization or compression. |
| Allocation | Memory allocation rate. | Find allocation churn causing garbage collection. |
| Heap | Retained memory. | Find leaks and unbounded caches. |
| Lock or mutex | Contention points. | Diagnose thread stalls. |
| IO | Disk or network wait. | Separate CPU saturation from IO wait. |
| Goroutine or thread | Blocked concurrency. | Find deadlock-like states or runaway workers. |

Continuous profiling is most useful when profiles include version, region, and workload labels.

Events

Events capture meaning, not just resource behavior.

Operational events:

  • Deployment started, promoted, failed, rolled back.
  • Feature flag changed.
  • Certificate renewed.
  • DNS record changed.
  • Autoscaler decision made.
  • Controller reconciliation applied.
  • Manual override enabled.
  • Degraded mode entered or exited.

Business events:

  • Order accepted.
  • Payment authorized.
  • File uploaded.
  • Notification sent.
  • Workspace created.
  • Subscription changed.

Events are critical during incidents because they answer: what changed?

Golden signals, RED, and USE

Golden signals:

| Signal | Meaning | Example |
| --- | --- | --- |
| Latency | Time to serve a request. | p50, p95, p99 by endpoint. |
| Traffic | Demand placed on the system. | Requests per second, jobs per minute. |
| Errors | Rate of failed work. | 5xx, failed jobs, rejected writes. |
| Saturation | How full the system is. | CPU, memory, queue age, pool utilization. |

RED for request-driven services:

| Signal | Meaning |
| --- | --- |
| Rate | How many requests are arriving. |
| Errors | How many requests fail. |
| Duration | How long requests take. |

USE for resources:

| Signal | Meaning |
| --- | --- |
| Utilization | Percent of time or capacity used. |
| Saturation | Amount of queued or delayed work. |
| Errors | Resource-level failures. |

Use RED for service behavior and USE for resource debugging. A CPU graph alone rarely justifies a page, because high utilization does not by itself show user impact.

Alert quality

An alert is a request for human attention. It must be worth interrupting someone.

An alert should mean:

  • User impact exists or is imminent.
  • A human action is needed now.
  • The owner is clear.
  • The runbook is linked.
  • The alert is specific enough to triage.
  • The alert has a known severity.
  • The alert includes service, region, environment, and affected operation.
  • The alert includes recent change context when possible.

| Alert quality dimension | Good | Bad |
| --- | --- | --- |
| Impact | Tied to SLO or critical symptom. | Tied only to an internal metric. |
| Actionability | On-call can mitigate or escalate. | No known response. |
| Specificity | Names service, operation, region, dependency. | "High errors" with no scope. |
| Noise | Pages rarely and meaningfully. | Pages for every transient spike. |
| Ownership | Routes to the team that can act. | Goes to a generic channel. |
| Runbook | Current, tested, and linked. | Missing or stale. |

Severity guide:

| Severity | Impact | Response |
| --- | --- | --- |
| SEV1 | Critical user journey down, data loss risk, security-critical operational impact. | Page immediately, incident commander, active comms. |
| SEV2 | Major degradation or partial outage for important users. | Page owner, incident channel, updates on schedule. |
| SEV3 | Limited impact, workaround exists, or slow budget burn. | Ticket or business-hours response. |
| SEV4 | Minor issue, cleanup, or informational. | Backlog or routine maintenance. |

Alert review checklist:

  • Did this alert catch real user impact?
  • Did it fire early enough?
  • Did it fire too often?
  • Was the runbook correct?
  • Was the owning team correct?
  • Was the severity correct?
  • Were labels sufficient for routing and triage?
  • Should it be converted to a ticket, dashboard, or SLO burn alert?

Incident response

Incident response is a structured way to reduce impact under uncertainty. The objective is restoration first, then learning.

Incident roles

| Role | Responsibility | Not responsible for |
| --- | --- | --- |
| Incident commander | Owns coordination, priorities, status, and decisions. | Debugging every technical detail. |
| Operations lead | Executes operational mitigations. | External communication. |
| Communications lead | Sends internal and external updates. | Choosing technical fix without input. |
| Subject matter expert | Provides system-specific diagnosis and options. | Overall incident command. |
| Scribe | Captures timeline, decisions, commands, and links. | Driving mitigation. |
| Customer support liaison | Brings customer impact and support context. | Technical recovery. |

Small incidents can combine roles, but command and execution should remain conceptually separate.

Incident timeline

A useful timeline records:

  • First bad signal.
  • First alert.
  • First human acknowledgement.
  • Incident declaration time.
  • Customer impact start.
  • Mitigation attempts and results.
  • Rollback or failover decisions.
  • Recovery time.
  • Customer impact end.
  • Follow-up creation.
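
With those timestamps recorded, per-incident detection and recovery durations fall out directly; averaging them across incidents gives MTTD and MTTR. The sketch below measures both from customer impact start, which is one common convention and an assumption here. The key names and the example times are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def incident_durations(timeline: Dict[str, datetime]) -> Dict[str, timedelta]:
    """Derive detection, acknowledgement, and recovery durations for one incident."""
    return {
        "time_to_detect": timeline["first_alert"] - timeline["impact_start"],
        "time_to_acknowledge": timeline["acknowledged"] - timeline["first_alert"],
        "time_to_recover": timeline["impact_end"] - timeline["impact_start"],
    }


# Hypothetical incident, all times in UTC.
timeline = {
    name: datetime.fromisoformat(stamp)
    for name, stamp in {
        "impact_start": "2024-01-01T14:05:00+00:00",
        "first_alert": "2024-01-01T14:12:00+00:00",
        "acknowledged": "2024-01-01T14:15:00+00:00",
        "impact_end": "2024-01-01T14:50:00+00:00",
    }.items()
}

durations = incident_durations(timeline)
```

A long `time_to_detect` relative to `time_to_recover` points at alerting gaps rather than mitigation speed.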

Response priorities

  1. Protect users and data.
  2. Stop the bleeding.
  3. Restore critical paths.
  4. Communicate status and expectations.
  5. Preserve evidence.
  6. Identify root and contributing causes.
  7. Prevent recurrence.

During an incident:

  • Prefer reversible mitigations.
  • Avoid speculative broad changes.
  • Keep a written command log.
  • Assign one person per action.
  • Set update cadence.
  • Escalate when stuck.
  • Separate facts from hypotheses.
  • Do not run destructive commands without explicit review unless data safety demands immediate action.

Incident communication

Status updates should include:

  • What is impacted.
  • Who is impacted.
  • Since when.
  • Current mitigation.
  • Next update time.
  • Known workaround if any.

Example internal update:

SEV2: Checkout authorization latency is above SLO for us-east users since 14:05 UTC.
Payment provider calls are timing out. We have disabled optional fraud enrichment and are monitoring recovery.
Next update at 14:30 UTC.

Postmortems

A postmortem is a learning document, not a blame document. It should explain how the system allowed the incident and what will change.

Postmortem sections:

| Section | Purpose |
| --- | --- |
| Summary | One paragraph explaining impact and recovery. |
| Impact | User, business, data, compliance, and internal impact. |
| Timeline | Objective sequence of signals, actions, and decisions. |
| Detection | How the incident was found and how detection could improve. |
| Root cause | Proximate technical cause. |
| Contributing factors | Conditions that made the incident possible or worse. |
| What went well | Practices that reduced impact. |
| What went poorly | Gaps in design, operations, tooling, or communication. |
| Corrective actions | Owned, dated, specific changes. |
| Lessons | Standards or patterns to apply elsewhere. |

Strong corrective actions:

  • Change a guardrail, test, alert, runbook, default, or architecture.
  • Have one owner.
  • Have a due date.
  • Are small enough to finish.
  • Are linked to the incident cause.
  • Can be verified.

Weak corrective actions:

  • "Be more careful."
  • "Improve monitoring" without naming the signal.
  • "Document process" without an owner or exercise.
  • "Rewrite service" without an incremental path.
  • "Add alert" without defining impact and action.

Postmortem review questions:

  • Why did the system permit this fault?
  • Why did detection take as long as it did?
  • Why was mitigation slower than expected?
  • What assumption was wrong?
  • What automated guardrail would have stopped or limited this?
  • What similar systems have the same weakness?
  • Did the incident consume error budget?
  • Should release posture change until fixes land?

Runbooks

A runbook is operational code in prose. It should be clear enough for a tired on-call engineer to use under pressure.

Runbook structure:

| Section | Content |
| --- | --- |
| Purpose | What alert or failure this runbook covers. |
| Severity | When to declare an incident and what severity to choose. |
| Preconditions | Required access, tools, dashboards, and permissions. |
| Safety notes | Actions that are risky, irreversible, or data-affecting. |
| Triage | Steps to confirm scope, impact, and likely cause. |
| Mitigation | Reversible actions to reduce impact. |
| Recovery | Steps to return to normal operation. |
| Verification | Signals that prove recovery. |
| Escalation | Who to call and when. |
| Post-incident | Follow-up checks, tickets, and cleanup. |

Runbook quality checklist:

  • Linked from alerts.
  • Tested in a drill or real incident.
  • Uses current commands and dashboard names.
  • Names required permissions.
  • Shows expected output or decision criteria.
  • Separates diagnosis from mitigation.
  • Calls out irreversible operations.
  • Includes rollback instructions.
  • Includes verification steps.
  • Has an owner and review date.

Runbook antipatterns:

  • A list of commands with no context.
  • Tribal knowledge hidden in chat history.
  • A dashboard link with no interpretation guidance.
  • Mitigation steps that can cause data loss without warning.
  • No escalation path.
  • No recovery verification.

Control planes

Control planes manage desired state. They need stronger correctness than ordinary data paths because their errors can damage many workloads.

Examples:

  • Kubernetes API server and controllers.
  • Deployment orchestrator.
  • Feature flag control plane.
  • Service discovery.
  • Certificate automation.
  • Database failover manager.
  • Billing entitlement manager.
  • Policy engine.
  • Infrastructure provisioning system.

Control plane properties:

| Property | Why it matters |
| --- | --- |
| Authorization | Prevents broad or cross-tenant mutation. |
| Auditability | Explains who changed desired state and when. |
| Idempotency | Allows safe retries and reconciliation. |
| Dry run | Shows intended mutation before applying. |
| Drift detection | Finds desired state and actual state mismatch. |
| Staged rollout | Limits blast radius of bad desired state. |
| Rate limiting | Prevents controllers from overwhelming dependencies. |
| Backpressure | Slows reconciliation when dependencies are unhealthy. |
| Safe defaults | Prevents missing config from becoming dangerous config. |
| Break glass | Allows emergency action with extra audit and expiry. |

Control plane failure modes:

| Failure | Example | Mitigation |
| --- | --- | --- |
| Bad reconciliation | Controller deletes healthy resources. | Finalizers, dry run, canary reconciliation, delete budgets. |
| Stale desired state | Old config overwrites emergency fix. | Pause reconciliation, conflict detection, audit. |
| Privilege excess | Controller can mutate unrelated resources. | Least privilege and namespace scoping. |
| Write storm | Controller retries rapidly on dependency error. | Exponential backoff and work queue limits. |
| Split brain | Two controllers own the same resource. | Leader election, ownership labels, fencing. |
| Unsafe default | Missing policy means allow all. | Default deny and explicit enablement. |
| Control plane dependency outage | Operators cannot deploy or roll back. | Emergency data plane controls and cached config. |

Control plane readiness checklist:

  • Every mutation is authenticated and authorized.
  • Every mutation is audit logged.
  • Reconciliation is idempotent.
  • Retries are bounded and jittered.
  • Destructive actions have safeguards.
  • Deletes have budgets or confirmation rules.
  • Operators can pause reconciliation.
  • Drift is visible.
  • Last known good state is recoverable.
  • Controller metrics include reconcile count, error count, latency, queue depth, and drift.
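
Several of the properties above (dry run, idempotent reconciliation, delete budgets) can be illustrated together. The following is a minimal sketch, assuming desired and actual state are plain dictionaries keyed by resource name; `plan_reconcile`, `apply_plan`, and `DeleteBudgetExceeded` are hypothetical names, and a real controller would also need authorization, audit logging, and persistence.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    create: list = field(default_factory=list)
    update: list = field(default_factory=list)
    delete: list = field(default_factory=list)


class DeleteBudgetExceeded(Exception):
    """Raised when a plan would delete more resources than the budget allows."""


def plan_reconcile(desired: dict, actual: dict, max_delete_fraction: float = 0.1) -> Plan:
    """Dry run: compute the diff between desired and actual state without applying it."""
    plan = Plan(
        create=sorted(k for k in desired if k not in actual),
        update=sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        delete=sorted(k for k in actual if k not in desired),
    )
    # Delete budget: refuse plans that would remove too much of what exists.
    allowed = int(len(actual) * max_delete_fraction)
    if len(plan.delete) > allowed:
        raise DeleteBudgetExceeded(
            f"plan deletes {len(plan.delete)} resources, budget is {allowed}"
        )
    return plan


def apply_plan(plan: Plan, desired: dict, actual: dict) -> dict:
    """Apply a reviewed plan. Re-applying the same plan leaves the state unchanged."""
    for key in plan.create + plan.update:
        actual[key] = desired[key]
    for key in plan.delete:
        actual.pop(key, None)
    return actual
```

Separating planning from applying gives operators a diff to review, and the budget turns an empty or truncated desired state into a refused plan rather than a mass delete.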

Network operations

Networks fail partially, asymmetrically, and sometimes silently. Network operations need visibility from clients, edges, services, and dependencies.

Operational networking concerns:

  • DNS propagation and cache behavior.
  • Load balancer health checks.
  • Connection draining.
  • TLS expiry and rotation.
  • Firewall and network policy drift.
  • Packet loss and jitter.
  • Conntrack exhaustion.
  • Ephemeral port exhaustion.
  • Cross-region latency.
  • Private connectivity failure.
  • MTU mismatch.
  • BGP or route instability.
  • NAT gateway saturation.
  • IPv4 and IPv6 behavior differences.
  • Proxy and service mesh misconfiguration.

| Network issue | Symptom | Signal | Mitigation |
| --- | --- | --- | --- |
| DNS failure | Some clients cannot resolve service. | Resolver errors, synthetic DNS probes. | Revert records, reduce TTL before risky changes, use secondary path. |
| TLS expiry | Clients fail handshake. | Certificate expiry metric, handshake errors. | Renew certificate, reload termination, automate renewal. |
| Load balancer bad health check | Healthy nodes removed or bad nodes retained. | Backend health mismatch, 5xx by backend. | Fix health check path, drain bad backend. |
| Conntrack exhaustion | Random connection failures under load. | Conntrack table usage, SYN retry, reset spikes. | Increase limits, reduce connection churn, tune keepalive. |
| Ephemeral port exhaustion | Outbound calls fail from specific nodes. | Port usage, connect errors. | Connection reuse, NAT scaling, local port range tuning. |
| Packet loss | Latency and retries rise. | Retransmits, packet loss probes. | Route around, reduce load, fix physical or provider issue. |
| MTU mismatch | Large requests fail while small requests work. | Fragmentation needed, timeout on large payloads. | Set MTU, enable path MTU discovery, clamp MSS. |
| Network policy drift | Service cannot reach dependency. | Denied connection logs, policy diffs. | Reconcile policy, add tests for required flows. |
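
The certificate expiry metric mentioned for TLS expiry can be derived from the certificate's `notAfter` field. The following is a minimal sketch using Python's standard `ssl` module; the input string is assumed to be in the format returned by `SSLSocket.getpeercert()["notAfter"]`, and the 30-day and 7-day thresholds are illustrative assumptions, not recommendations from this document.

```python
import ssl
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Return days remaining before a certificate expires.

    not_after looks like "Jan 11 00:00:00 2018 GMT".
    """
    if now is None:
        now = datetime.now(timezone.utc)
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400


def expiry_status(not_after: str, now: Optional[datetime] = None,
                  warn_days: float = 30, page_days: float = 7) -> str:
    """Classify remaining lifetime into ok / warn / page (thresholds are assumptions)."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= page_days:
        return "page"
    if remaining <= warn_days:
        return "warn"
    return "ok"
```

Exporting the remaining days as a gauge, rather than alerting only on handshake errors, lets the alert fire while renewal is still a routine task.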

Network change checklist:

  • Identify client, edge, service, and dependency paths.
  • Check DNS TTL and cache behavior.
  • Verify health check behavior before rollout.
  • Confirm rollback path and propagation time.
  • Monitor errors by region, ASN, client type, and edge.
  • Validate TLS chain and certificate expiry.
  • Confirm connection draining before removing backends.
  • Test both IPv4 and IPv6 where supported.
  • Record the change as an operational event.

Production readiness

Production readiness is the evidence that a service can be operated safely. It should be reviewed before launch and after major architectural changes.

| Area | Readiness questions |
| --- | --- |
| Ownership | Who owns the service, on-call, runbooks, and backlog? |
| SLOs | What user journey is protected and what is the error budget? |
| Observability | Can operators detect impact, localize cause, and verify recovery? |
| Capacity | What is expected load, peak load, and headroom? |
| Dependencies | What happens when each dependency is slow, down, or incorrect? |
| Rollout | Can changes be canaried, rolled back, and paused? |
| Data | Are backups, restores, migrations, and retention tested? |
| Security | Are auth, secrets, audit, and least privilege in place? |
| Operations | Are alerts, runbooks, dashboards, and escalation paths ready? |
| Cost | Can runaway usage or telemetry cost be detected and limited? |
| Compliance | Are audit, retention, privacy, and data residency requirements met? |

Production readiness checklist:

  • Service has named owners and escalation path.
  • Critical user journeys have SLOs.
  • Alerts are tied to SLOs or actionable symptoms.
  • Runbooks are linked from alerts.
  • Dashboards show traffic, errors, latency, saturation, dependencies, and deploy state.
  • Logs and traces include correlation IDs.
  • Secrets are managed through approved storage.
  • Dependencies have timeouts, retries, and circuit behavior.
  • Queues are bounded and monitored by age.
  • Load shedding and degraded modes are documented.
  • Capacity test covers expected peak and failure scenarios.
  • Rollback has been tested.
  • Database migrations are backward and forward compatible.
  • Backups and restores have been tested.
  • On-call has access to required tools.
  • Cost and quota limits are visible.
  • Security and audit logs are retained.
  • Incident process is documented.

Operational dashboards

Dashboards should support decisions, not decoration. A good dashboard starts with impact, then narrows to causes.

Dashboard layout:

| Section | Panels |
| --- | --- |
| User impact | SLO compliance, burn rate, request success, latency thresholds. |
| Traffic | Requests, jobs, tenant distribution, regional distribution. |
| Errors | Error rate by operation, status, dependency, version. |
| Latency | p50, p95, p99, threshold compliance, queue time. |
| Saturation | CPU, memory, disk, queue age, pool utilization, connection count. |
| Dependencies | Downstream success, latency, throttling, circuit state. |
| Change | Deploy versions, feature flags, config generations, operational events. |
| Data | Replication lag, failed writes, reconciliation drift, backup status. |

Dashboard review checklist:

  • Can a new on-call identify user impact in 60 seconds?
  • Can the dashboard distinguish regional and global failures?
  • Can it show whether a deploy or config change correlated with the issue?
  • Can it distinguish dependency failure from internal overload?
  • Can it show if mitigation worked?
  • Are labels and units consistent?
  • Are panels still used during incidents?

Capacity and overload operations

Capacity work connects reliability, performance, and cost. Headroom without control wastes money. Efficiency without headroom causes incidents.

Capacity signals:

  • Requests per second.
  • Concurrent requests.
  • Queue age and queue length.
  • CPU utilization and throttling.
  • Memory working set and allocation rate.
  • Disk IOPS and latency.
  • Network throughput and retransmits.
  • Database connection pool usage.
  • Lock wait time.
  • Cache hit ratio.
  • Provider quota usage.

Overload response checklist:

  • Confirm whether user impact exists.
  • Identify the saturated resource.
  • Check whether the saturation is global, regional, tenant-specific, or version-specific.
  • Stop non-critical work.
  • Enable load shedding or degraded mode.
  • Reduce concurrency if downstream is saturated.
  • Increase capacity only if the bottleneck is understood.
  • Watch tail latency and queue age after mitigation.
  • Capture profiles if CPU, memory, or lock contention is unclear.
  • Create follow-up for permanent admission control.
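
The load shedding and admission control steps above can be sketched as a concurrency limit that reserves some capacity for critical work. This is a minimal illustration, assuming requests are classified as critical or non-critical before admission; the `AdmissionController` class and its parameters are hypothetical names.

```python
import threading


class AdmissionController:
    """Cap in-flight requests and shed non-critical work first."""

    def __init__(self, max_in_flight: int, reserved_for_critical: int):
        if not 0 <= reserved_for_critical <= max_in_flight:
            raise ValueError("reserved capacity must fit within the limit")
        self.max_in_flight = max_in_flight
        self.reserved = reserved_for_critical
        self.in_flight = 0
        self.rejected = 0
        self._lock = threading.Lock()

    def try_admit(self, critical: bool) -> bool:
        """Admit the request if capacity allows; otherwise count and reject it."""
        with self._lock:
            # Non-critical traffic may not use the reserved slots.
            limit = self.max_in_flight if critical else self.max_in_flight - self.reserved
            if self.in_flight >= limit:
                self.rejected += 1
                return False
            self.in_flight += 1
            return True

    def release(self) -> None:
        """Call once for each admitted request when it finishes."""
        with self._lock:
            self.in_flight = max(0, self.in_flight - 1)
```

Rejecting early keeps queue age bounded, and the `rejected` counter gives operators a direct signal of how much load is being shed.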

Change management

Many incidents trace back to a recent change, such as a deploy, a config edit, a feature flag flip, or a migration. Change management should reduce blast radius without freezing delivery.

Safe change patterns:

  • Small batches.
  • Peer review for risky changes.
  • Automated tests.
  • Canary release.
  • Progressive delivery by region, cell, tenant, or percentage.
  • Feature flags with kill switches.
  • Backward-compatible schema migrations.
  • Automated rollback on SLO regression.
  • Change events emitted to observability systems.
  • Clear ownership and rollback instructions.

Risky changes:

  • Global configuration edits.
  • Database migrations on hot tables.
  • Certificate, DNS, and load balancer changes.
  • IAM and policy changes.
  • Control plane controller changes.
  • Retry and timeout changes.
  • Queue consumer concurrency changes.
  • Shared library upgrades used by many services.

Change checklist:

  • What user journey can this affect?
  • What is the blast radius of the first rollout step?
  • What metric proves it is healthy?
  • What metric proves it is unsafe?
  • How long before the impact appears?
  • How do we roll back?
  • Is rollback safe after data changes?
  • Who is watching the rollout?
  • Is the change recorded as an event?
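
The questions about which metric proves a rollout healthy or unsafe can be turned into an automated canary decision. The following is a minimal sketch, assuming error counts are available for a baseline and a canary over the same window; the `Window` type, the `canary_verdict` function, and the thresholds are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Window:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Window, canary: Window,
                   min_requests: int = 500,
                   max_abs_increase: float = 0.01) -> str:
    """Return "wait", "rollback", or "promote" for the first rollout step."""
    # Too little traffic: the comparison would be noise, so keep watching.
    if canary.requests < min_requests:
        return "wait"
    # Unsafe: the canary's error rate exceeds the baseline by more than the margin.
    if canary.error_rate - baseline.error_rate > max_abs_increase:
        return "rollback"
    return "promote"
```

The explicit "wait" state reflects the checklist question about how long impact takes to appear: a canary with too few requests proves nothing in either direction.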

Data reliability

Data incidents often last longer than availability incidents because recovery may require repair, replay, or customer communication.

Data reliability concerns:

  • Accepted writes must be durable.
  • Reads must return correct data for the consistency model.
  • Backups must restore.
  • Replication lag must stay within product tolerance.
  • Migrations must preserve meaning.
  • Deletions must respect retention and legal requirements.
  • Reconciliation must detect drift.
  • Idempotency must prevent duplicate side effects.

Data failure scenarios:

| Scenario | Risk | Prevention |
| --- | --- | --- |
| Duplicate payment | Customer charged twice. | Idempotency keys, provider reconciliation. |
| Lost write | User action disappears. | Durable commit before success response. |
| Stale read | User sees old state as current. | Freshness SLI, cache invalidation, clear labels. |
| Bad migration | Data meaning changes incorrectly. | Backfill validation, shadow reads, rollback plan. |
| Broken restore | Backups exist but cannot recover service. | Regular restore drills and checksum checks. |
| Poison event | Bad message repeatedly fails a worker. | Dead letter queue, retry cap, message quarantine. |
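
The duplicate payment scenario shows how an idempotency key prevents a repeated side effect. The following is a minimal in-memory sketch; the `PaymentLedger` class is hypothetical, and a production version would need a durable store with an atomic insert so that two concurrent requests with the same key cannot both charge.

```python
class PaymentLedger:
    """Record the outcome of each charge under its idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> request and receipt
        self.charges_made = 0   # number of real side effects performed

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> str:
        request = (customer, amount_cents)
        prior = self._results.get(idempotency_key)
        if prior is not None:
            # A retry must match the original request exactly.
            if prior["request"] != request:
                raise ValueError("idempotency key reused with different parameters")
            return prior["receipt"]  # replay the stored result, no second charge

        self.charges_made += 1  # stands in for the call to the payment provider
        receipt = f"rcpt-{self.charges_made}"
        self._results[idempotency_key] = {"request": request, "receipt": receipt}
        return receipt
```

Returning the stored receipt on a retry means the client sees the same success response it missed the first time, rather than an error or a second charge.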

Security and reliability overlap

Security controls must remain reliable, and reliability controls must not bypass security.

Examples:

  • Load shedding must not skip authorization.
  • Degraded mode must not expose private data.
  • Emergency access must be audited and time-bound.
  • Rate limits must distinguish abuse from legitimate surge when possible.
  • Incident communication must avoid leaking sensitive details.
  • Logs and traces must not include secrets.
  • Backups must be protected and restorable.
  • Control plane privileges must be scoped.

Security-related operational checklist:

  • Secrets rotation has an operational runbook.
  • Certificate renewal is monitored.
  • Audit logs are retained and searchable.
  • Break-glass access expires automatically.
  • On-call can identify whether an incident has security implications.
  • Sensitive data is redacted from telemetry.
  • Disaster recovery procedures include access recovery.
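
Redacting sensitive data from telemetry is easiest to enforce at the point where log events are built. The following is a minimal sketch, assuming events are nested dictionaries and lists; the key list and the bearer token pattern are illustrative assumptions and would need to match the fields a real service actually emits.

```python
import re

# Assumed set of field names that must never reach logs or traces.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "secret"}
# Matches "Bearer <token>" inside free-text strings.
BEARER = re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]+")
REDACTED = "[REDACTED]"


def redact(value, key=None):
    """Return a copy of a log event with sensitive fields and tokens masked."""
    if key is not None and str(key).lower() in SENSITIVE_KEYS:
        return REDACTED
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return BEARER.sub(f"Bearer {REDACTED}", value)
    return value
```

Key-based masking catches structured fields, while the pattern catches tokens that leak into free-text messages; neither is complete on its own.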

Example reliability design

Example service: user-facing document export.

| Requirement | Design choice |
| --- | --- |
| Users need exports eventually, not instantly. | Accept request synchronously, process asynchronously. |
| Large exports can overload workers. | Per-tenant queue limits and max export size. |
| Users need status. | Export status event and polling endpoint. |
| Workers depend on object storage. | Timeout, retry with jitter, circuit breaker, degraded message. |
| Duplicate requests are common. | Idempotency key by user, document, format, and version. |
| Export freshness matters. | SLI: exports completed within 5 minutes divided by accepted exports. |
| Failures need support visibility. | Structured event for accepted, started, completed, failed, expired. |
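
Two of the design choices above, the idempotency key and the freshness SLI, are simple enough to sketch directly. This is an illustration only: the function names, the SHA-256 key encoding, and the list-of-tuples input are assumptions, and treating an empty window as fully compliant is a policy choice rather than something the design specifies.

```python
import hashlib
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def export_idempotency_key(user_id: str, document_id: str, fmt: str, version: int) -> str:
    """Derive one stable key from user, document, format, and version."""
    # The unit separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    raw = "\x1f".join([user_id, document_id, fmt, str(version)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def export_freshness_sli(
    exports: List[Tuple[datetime, Optional[datetime]]],
    threshold: timedelta = timedelta(minutes=5),
) -> float:
    """Exports completed within the threshold divided by accepted exports.

    Each item is (accepted_at, completed_at); completed_at is None if the
    export has not completed, which counts against the SLI.
    """
    if not exports:
        return 1.0  # assumption: no accepted exports means nothing was missed
    good = sum(
        1 for accepted, completed in exports
        if completed is not None and completed - accepted <= threshold
    )
    return good / len(exports)
```

Counting accepted exports in the denominator, rather than completed ones, is what makes stuck or failed exports visible in the SLI.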

Failure path:

[Diagram: failure path for the document export service. Not rendered in this snapshot.]

Operational behavior:

  • If queue age breaches the SLO threshold, alert the service owner.
  • If object storage latency spikes, open the circuit for optional thumbnail generation.
  • If tenant quota is exhausted, reject new exports for that tenant without affecting others.
  • If workers are CPU saturated, scale workers only after checking storage and queue health.
  • If the queue contains poison jobs, quarantine them rather than blocking the queue.
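
The last behavior above, quarantining poison jobs rather than letting them block the queue, can be sketched with a retry cap. This is a minimal in-memory illustration; the `drain` function and the job dictionary shape (`id`, `payload`, `attempts`) are hypothetical, and a real worker would add backoff between attempts and persist the quarantine.

```python
from collections import deque


def drain(queue: deque, handler, max_attempts: int = 3):
    """Process jobs until the queue is empty.

    Returns (done_ids, quarantined_jobs). A job that fails max_attempts
    times is moved aside instead of being retried forever.
    """
    done, quarantined = [], []
    while queue:
        job = queue.popleft()
        try:
            handler(job["payload"])
            done.append(job["id"])
        except Exception as exc:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] >= max_attempts:
                # Keep the error with the job so support can inspect it later.
                quarantined.append({**job, "error": repr(exc)})
            else:
                queue.append(job)  # retry behind the other waiting jobs
    return done, quarantined
```

Requeueing a failed job at the back, rather than retrying it in place, lets healthy jobs make progress while the failing one uses up its attempts.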

Incident examples

Scenario: retry storm

Symptoms:

  • Downstream dependency latency rises.
  • API timeout rate rises.
  • Request volume to the dependency increases even though user traffic is flat.
  • CPU and connection pools saturate.

Likely cause:

  • Multiple service layers retry the same failing operation without shared retry budget.

Immediate response:

  • Disable or reduce retries at the highest safe layer.
  • Increase timeout only if the dependency is healthy but slow and caller deadlines allow it.
  • Enable circuit breaker or degraded mode.
  • Shed low-priority traffic.

Prevention:

  • Add retry budgets.
  • Propagate deadlines.
  • Use exponential backoff with jitter.
  • Alert on dependency call amplification.
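
Two of the prevention items above, retry budgets and exponential backoff with jitter, can be sketched in a few lines. This is an illustration under stated assumptions: the `RetryBudget` class, its credit accounting, and the default values are hypothetical, and the jitter shown is the "full jitter" variant (a uniform delay between zero and the capped exponential).

```python
import random


class RetryBudget:
    """Allow retries only in proportion to first attempts.

    Credit is tracked in hundredths of a retry to avoid float rounding.
    """

    def __init__(self, retries_per_100_requests: int = 10, max_banked_retries: int = 10):
        self.credit_per_request = retries_per_100_requests
        self.max_credit = max_banked_retries * 100
        self.credit = 0

    def record_request(self) -> None:
        """Call once per first attempt (not per retry)."""
        self.credit = min(self.max_credit, self.credit + self.credit_per_request)

    def try_spend(self) -> bool:
        """Return True if a retry is allowed, consuming one retry of credit."""
        if self.credit >= 100:
            self.credit -= 100
            return True
        return False


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0, rng=random) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With a budget of 10 retries per 100 requests, a fully failing dependency sees at most about 10 percent extra load from this caller, instead of a multiple of normal traffic at every layer.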

Scenario: slow database migration

Symptoms:

  • API p99 latency spikes.
  • Database lock wait increases.
  • Error rate rises for write endpoints.
  • Migration process appears healthy but user paths degrade.

Immediate response:

  • Pause or cancel migration if safe.
  • Reduce write concurrency.
  • Route non-critical jobs away from database.
  • Verify no data corruption occurred.

Prevention:

  • Test migration against production-like data volume.
  • Use small batches.
  • Avoid long locks.
  • Add migration progress and lock metrics.
  • Require rollback or roll-forward plan.
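
The "small batches" and "avoid long locks" items can be illustrated with a backfill that commits one short transaction per batch and checks a pause signal between batches. The following is a minimal sketch using Python's built-in `sqlite3`; the `users` table, its `full_name` and `display_name` columns, and the `should_pause` callback are hypothetical, and other databases will need their own batching syntax.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 100,
                        should_pause=lambda: False) -> int:
    """Copy users.full_name into users.display_name in small transactions.

    Returns the number of rows updated. Safe to re-run: rows that are
    already filled in are skipped.
    """
    total = 0
    while not should_pause():
        with conn:  # commits at the end of each batch, keeping locks short
            cur = conn.execute(
                "UPDATE users SET display_name = full_name "
                "WHERE id IN ("
                "  SELECT id FROM users "
                "  WHERE display_name IS NULL AND full_name IS NOT NULL "
                "  LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            break  # nothing left to migrate
        total += cur.rowcount
    return total
```

Because each batch commits independently, pausing mid-migration leaves the table in a consistent, partially migrated state, and the returned count doubles as a progress metric.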

Scenario: control plane deletes healthy resources

Symptoms:

  • Data plane workloads disappear or restart.
  • Audit logs show controller-issued deletes.
  • Desired state changed recently or controller version changed.

Immediate response:

  • Pause controller reconciliation.
  • Stop further destructive actions.
  • Restore last known good desired state.
  • Recreate affected resources from source of truth.
  • Keep audit logs and controller logs for analysis.

Prevention:

  • Add delete budgets.
  • Add dry run and diff review for destructive changes.
  • Scope controller permissions.
  • Canary controller changes.
  • Alert on unusual delete volume.

Scenario: observability outage during production incident

Symptoms:

  • Dashboards are blank or delayed.
  • Alerts stop firing or flood.
  • Service health is uncertain.

Immediate response:

  • Use independent probes, load balancer metrics, cloud provider health, and customer reports.
  • Preserve core service capacity over telemetry enrichment if necessary.
  • Avoid making broad changes without evidence.
  • Assign a separate owner to observability recovery.

Prevention:

  • Keep observability isolated from production critical paths.
  • Monitor telemetry pipeline health.
  • Retain local fallback logs.
  • Use synthetic probes from independent locations.

Operational checklists

New service checklist

  • Define owner, on-call rotation, and escalation path.
  • Identify critical user journeys.
  • Define SLIs, SLOs, and error budget policy.
  • Add metrics for traffic, errors, latency, and saturation.
  • Add structured logs with trace and request IDs.
  • Add distributed tracing across dependencies.
  • Add profiling for production workloads where feasible.
  • Add deployment, config, and feature flag events.
  • Create dashboards for impact and triage.
  • Create alerts tied to SLOs or actionable symptoms.
  • Write runbooks and link them from alerts.
  • Define degraded modes and load shedding policy.
  • Bound queues, retries, timeouts, and concurrency.
  • Verify rollback and feature flag kill switch.
  • Test backup and restore if the service stores data.
  • Run a capacity test.
  • Run a game day or failure drill for the highest-risk dependency.

On-call triage checklist

  • What user journey is impacted?
  • When did impact start?
  • How many users, tenants, regions, or requests are affected?
  • Is the issue ongoing or recovered?
  • Did a deploy, config change, feature flag, DNS change, certificate change, or dependency event happen nearby?
  • Are errors, latency, traffic, or saturation abnormal?
  • Is the issue isolated to one version, region, tenant, or dependency?
  • Is data safety at risk?
  • Is a reversible mitigation available?
  • Does this require incident declaration?
  • Who needs to be notified?

Mitigation checklist

  • Prefer the smallest reversible action that reduces user impact.
  • Roll back a correlated bad change when rollback is safe.
  • Disable optional features before critical features.
  • Shed low-priority traffic before high-priority traffic.
  • Pause background jobs if they compete with interactive traffic.
  • Reduce concurrency when downstream is overloaded.
  • Scale only after identifying the bottleneck.
  • Verify mitigation with user-facing SLIs.
  • Communicate current status and next update time.
  • Record actions in the incident timeline.

Recovery checklist

  • Confirm SLO-relevant metrics returned to normal.
  • Confirm queues are draining within deadlines.
  • Confirm error rate and tail latency are stable.
  • Confirm dependency health is stable.
  • Confirm data reconciliation if data may have been affected.
  • Disable emergency overrides when no longer needed.
  • Restore paused jobs carefully.
  • Watch for relapse over at least one long alert window.
  • Close incident only after user impact is over.
  • Create postmortem and corrective actions.

Quarterly reliability review

  • Review SLO performance and error budget consumption.
  • Review incidents and recurring contributing factors.
  • Review alert noise and missed detections.
  • Review runbook freshness.
  • Review capacity headroom and cost.
  • Review dependency risks and SLAs.
  • Review backup restore evidence.
  • Review disaster recovery test evidence.
  • Review access, secrets, certificate, and audit controls.
  • Review top operational toil sources.
  • Retire unused dashboards and alerts.

Reliability tradeoffs

Reliability has costs. The goal is not maximum reliability everywhere, but appropriate reliability for the user and business need.

| Tradeoff | Risk of too little | Risk of too much |
| --- | --- | --- |
| Availability target | Users cannot rely on the service. | Excessive cost and slower delivery. |
| Alert sensitivity | Missed incidents. | Alert fatigue and ignored pages. |
| Retention | Insufficient debugging and audit evidence. | High cost and privacy risk. |
| Retry aggressiveness | Transient failures surface to users. | Retry storms and dependency collapse. |
| Caching | Slow or unavailable reads. | Stale or incorrect data. |
| Automation | Manual toil and slow response. | Automated broad damage if guardrails are weak. |
| Change velocity | Slow product learning. | More incidents from uncontrolled change. |