Distributed Systems

Reading time
29 min read
Word count
5789 words
Diagram count
12 diagrams

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Software Engineering/05 Distributed Systems.md.

Distributed Systems

Distributed systems are systems where independent components communicate over unreliable networks and fail independently. Their central difficulty is not scale by itself. It is the combination of partial failure, concurrency, uncertain ordering, independent clocks, resource limits, and human operations.

A useful definition: a distributed system is one where the failure of a component you did not directly call can still change the correctness, latency, or availability of the operation you are performing.

First principles

  • Networks are unreliable.
  • Latency is variable.
  • Clocks disagree.
  • Processes pause.
  • Messages can be lost, duplicated, reordered, corrupted, or delayed.
  • Failures are partial.
  • Retries can amplify outages.
  • Observability is incomplete.
  • Operators are part of the system.
  • Backpressure must cross service boundaries.
  • Correctness depends on explicit invariants, not on optimistic diagrams.

Core vocabulary

TermMeaningDesign implication
NodeA process, machine, pod, or participant in the system.Nodes need identity, health checks, restart behavior, and isolation boundaries.
ReplicaA node holding a copy of state or an execution role.Replicas require synchronization, divergence detection, and recovery.
PartitionA communication failure between subsets of nodes.The system must define what continues, what blocks, and what fails closed.
QuorumA subset large enough to make progress while preserving intersection.Quorums protect safety only if membership and acknowledgements are well defined.
LeaderA node with authority to order writes or coordinate work.Leader authority needs election, lease, fencing, and failover semantics.
Epoch or termA monotonically increasing leadership generation.Stale leaders can be fenced by comparing epochs.
LogOrdered sequence of commands or state changes.Durable logs allow replay, replication, recovery, and audit.
CommitPoint at which an operation is durable enough to be externally visible.Commit rules define what clients may rely on after success.
Idempotency keyClient supplied operation identity.Enables safe retry after timeout or ambiguous failure.
BackpressureA signal that downstream capacity is exhausted.Without it, callers convert local overload into systemic failure.

Partial failure

Partial failure is the defining feature of distributed systems. A local program usually fails as a unit. A distributed system can have one service running, another service down, a network path broken in one direction, a queue delayed, a DNS cache stale, and a client retrying an operation that already succeeded.

Common partial failure modes:

Failure modeExampleWhy it is dangerousTypical mitigation
Crash failureA process exits or a machine dies.In flight work may be unknown.Durable logs, idempotency, leader election, replay.
Omission failureA message is dropped.Caller cannot distinguish loss from delay.Timeouts, retries with budgets, acknowledgements.
Timing failureA response arrives too late.Late work can race with newer decisions.Deadlines, fencing tokens, monotonic epochs.
Byzantine behaviorA node lies, corrupts data, or violates protocol.Standard quorum assumptions may break.Byzantine fault tolerant protocols, signatures, isolation.
Gray failureA component is alive but degraded.Health checks may pass while requests fail.Synthetic probes, adaptive load shedding, circuit breakers.
Asymmetric partitionA can reach B, B cannot reach A, or only some protocols work.Membership views diverge.Bidirectional health checks, quorum based authority.
Dependency failureA storage, DNS, auth, or queue dependency fails.Unrelated paths can collapse through shared dependencies.Bulkheads, dependency maps, graceful degradation.
Operator errorBad config, wrong rollout, unsafe migration.Humans create correlated failures.Change control, staged deploys, automated rollback, runbooks.

Design rule: treat timeout as "unknown outcome", not as failure.

Rendering diagram...

Consistency models

Consistency defines what reads are allowed to observe after writes. Availability defines whether an operation receives a non-error response. Durability defines whether acknowledged state survives failures. These are separate properties.

ModelMeaningTypical useRisk
LinearizabilityOperations appear atomic in real time order.Locks, registers, metadata, balances, ownership transfer.Higher latency, reduced availability under partitions.
Sequential consistencyOperations appear in one global order, but not necessarily real time order.Concurrent abstractions where real time visibility is not required.Can surprise users after visible writes.
Causal consistencyCausally related operations are observed in order.Collaboration, messaging, social feeds.Concurrent writes still need conflict handling.
Read your writesA client sees its own prior writes.User profile updates, settings, dashboards.Requires session stickiness, tokens, or replica lag tracking.
Monotonic readsA client does not move backward in observed state.Sessions reading replicas.Needs version tracking across reads.
Monotonic writesA client's writes are applied in client order.User workflows with dependent updates.Requires per-client sequence or single writer routing.
Writes follow readsA write after a read is ordered after the read dependency.Collaborative edits, workflows.Requires causal metadata.
Eventual consistencyReplicas converge if writes stop and conflicts are resolved.Search indexes, caches, feeds, analytics projections.Temporary stale reads and conflict visibility.
Strong eventual consistencyReplicas that receive the same updates converge without coordination.CRDT based collaboration.Data type must encode merge semantics.

Design rule: name the consistency promise in product language, then choose the mechanism.

Examples:

  • "After the user changes billing email, they immediately see the new value" usually requires read-your-writes.
  • "Inventory cannot be sold below zero" usually requires a strongly consistent reservation path or a compensating business process.
  • "Search results update within one minute" can usually use eventual consistency with lag monitoring.
  • "Two admins cannot both own the same exclusive role" usually requires linearizable compare-and-swap or consensus backed metadata.

Invariants before mechanisms

Start with invariants. A distributed design is correct only relative to what must always be true.

Invariant typeExampleMechanisms that often fit
SafetyA payment is captured at most once.Idempotency keys, unique constraints, fencing, transactional outbox.
ConservationMoney is neither created nor destroyed.Double entry ledger, serializable transactions, reconciliation.
OwnershipA resource has one active owner.Consensus, leases with fencing, compare-and-swap.
MonotonicityOrder status never moves from shipped back to pending.State machine validation, version checks.
CapacityAllocated units do not exceed available units.Reservation service, escrow, bounded counters.
PrivacyA tenant cannot read another tenant's data.Authorization at every boundary, partitioned storage, policy tests.
DurabilityA confirmed operation survives process or node failure.Write ahead log, replicated storage, fsync policy, backups.

Invariant checklist:

  • What must never happen, even during failover?
  • What may happen temporarily but must converge?
  • What can be repaired by compensation?
  • What must be visible to users immediately?
  • What is the source of truth for each invariant?
  • Which component has authority to decide the invariant?
  • Which failures were explicitly considered?

CAP Theorem

CAP Theorem says that during a network partition, a distributed data system must choose between availability and strong consistency for affected operations.

The practical meaning is narrower and more useful than the slogan:

  • Consistency in CAP means linearizability.
  • Availability means every request to a non-failing node receives a non-error response.
  • Partition tolerance is not optional in real networks.
  • The tradeoff applies per operation and per failure domain.
Choice during partitionBehaviorSuitable whenExample
CPReject or block operations that cannot be safely ordered.Incorrect acceptance is worse than temporary unavailability.Lock service, metadata store, payment capture.
APAccept operations and reconcile later.Availability is more important than immediate global agreement.Shopping cart, likes, local-first notes.
MixedSome operations block, others continue.Invariants differ by operation.Place order blocks on inventory, browse catalog continues.

CAP design questions:

  • Which exact operation is partitioned?
  • Which nodes can still communicate?
  • Which invariant would be violated by accepting writes on both sides?
  • Is stale read acceptable?
  • Can the system expose degraded mode explicitly?
  • How will reconciliation happen after healing?

PACELC

PACELC extends CAP by describing the normal case tradeoff:

  • If partition, choose availability or consistency.
  • Else, choose latency or consistency.
System posturePartition caseNormal caseCommon result
PC/ECPrefer consistency during partitions, prefer consistency otherwise.Higher latency, stronger guarantees.Consensus metadata, strongly consistent databases.
PC/ELPrefer consistency during partitions, prefer latency otherwise.Some operations are strong, read paths may be optimized.Leader writes with follower reads under bounded staleness.
PA/ECPrefer availability during partitions, prefer consistency otherwise.Accepts partition writes but coordinates in healthy state.Some multi-region systems with conflict repair.
PA/ELPrefer availability during partitions, prefer latency otherwise.Fast local operations, more reconciliation.Highly available caches, local-first applications.

Mistake to avoid: using CAP as a slogan. The useful question is which operation, which partition, which invariant, which user-visible behavior.

Rendering diagram...

The clock problem

Distributed systems cannot rely on one perfect global clock. Physical clocks drift. NTP can step time backward or forward. Processes can pause during garbage collection, host suspension, scheduler stalls, or overloaded runtime loops. Network delays are variable and asymmetric.

Clock concepts:

Clock or mechanismWhat it givesWhat it does not giveCommon use
Wall clockHuman time.Reliable ordering.Timestamps, logs, retention, display.
Monotonic clockNon-decreasing elapsed time within one process.Cross-node ordering.Timeouts, durations, retry backoff.
Lamport clockHappens-before ordering when one event causally precedes another.Detection of concurrency.Event ordering, protocol reasoning.
Vector clockCausality and concurrency across actors.Compact metadata at high cardinality.Conflict detection, sync protocols.
Hybrid logical clockPhysical time plus logical counter.Perfect physical accuracy.Distributed databases, causally informed timestamps.
TrueTime style APIBounded time uncertainty interval.Zero uncertainty.External consistency when clocks have bounded skew.
LeaseTime bounded authority.Safety without clock assumptions and fencing.Leader authority, cache ownership, lock service.

Clock design rules:

  • Use monotonic time for durations and timeouts.
  • Treat wall clock as input data, not proof of ordering.
  • Use fencing tokens when stale holders can perform writes.
  • Do not use timestamps alone for correctness critical conflict resolution unless data loss is acceptable.
  • Record both event time and ingestion time when debugging pipelines.
  • Make clock skew visible in metrics and alerts.

Happens-before

The happens-before relation captures causal ordering:

  • Within one process, earlier events happen before later events.
  • Sending a message happens before receiving that message.
  • The relation is transitive.
  • If neither event happens before the other, the events are concurrent.
Rendering diagram...

Lamport clocks

A Lamport clock is a logical counter:

  1. Each node keeps an integer counter.
  2. Before each local event, increment the counter.
  3. Send the counter with each message.
  4. On receive, set local counter to max(local, received) + 1.

If event A happened before event B, then Lamport(A) < Lamport(B). The converse is not guaranteed. If Lamport(A) < Lamport(B), A may or may not have caused B.

BenefitLimitation
Simple and compact.Cannot detect concurrency.
Good for total ordering with tie breakers.Tie breaker order may be arbitrary, not causal.
Useful in protocols and logs.Does not represent real time.

Example tie breaker:

ordered_event_id = (lamport_counter, node_id)

This creates a deterministic total order, but it does not mean the earlier tuple happened first in physical time.

Vector clocks

A vector clock tracks one counter per participant. It can tell whether one event causally dominates another or whether two events are concurrent.

Comparison:

RelationshipConditionMeaning
A before BEvery component of A <= B and at least one component is lower.B has observed A causally.
B before AEvery component of B <= A and at least one component is lower.A has observed B causally.
ConcurrentNeither vector dominates the other.Both updates must be merged or resolved.
EqualAll components equal.Same causal history.

Vector clock example:

VersionNode ANode BNode CInterpretation
v1100A wrote once.
v2110B observed v1 and wrote.
v3101C observed v1 and wrote independently of B.

Comparison of v2 and v3:

VersionVectorRelationship
v2[1, 1, 0]Concurrent with v3.
v3[1, 0, 1]Concurrent with v2.

Vector clocks are valuable when conflict detection matters more than compact metadata. They become expensive when the number of writers is large or unbounded.

Hybrid logical clocks

Hybrid logical clocks combine physical time with a logical counter. They preserve causal ordering while keeping timestamps close to wall clock time.

Typical representation:

hlc = (physical_millis, logical_counter, node_id)

Use cases:

  • Distributed SQL timestamp ordering.
  • Change data capture ordering.
  • Multi-region replication with causality hints.
  • Debugging when physical time proximity is useful.

Design cautions:

  • HLC still depends on clock discipline for its physical component.
  • HLC does not remove the need for conflict resolution.
  • HLC does not prove a remote physical event happened before another unless the protocol supplies bounds.

Leases

A lease is time bounded authority. A holder may act until the lease expires. Leases are attractive because they reduce coordination, but they are dangerous when used as correctness proof without fencing.

Lease risks:

RiskScenarioMitigation
Clock skewHolder thinks lease is valid, coordinator thinks it expired.Conservative expiry, bounded skew assumptions, monotonic clocks.
Process pauseHolder pauses longer than lease, resumes and writes stale data.Fencing tokens checked by storage.
Network delayRenew request arrives late or duplicate arrives after new lease.Epoch based renewals and compare-and-swap.
Split brainTwo nodes believe they hold authority.Quorum lease service, fencing, single writer enforcement.

Lease checklist:

  • What clock is used for local duration measurement?
  • What maximum skew or pause assumption is required?
  • Does the downstream resource verify fencing tokens?
  • Can a stale holder still perform side effects?
  • Is renewal idempotent?
  • What happens when the lease service is unavailable?

Fencing tokens

A fencing token is a monotonically increasing value issued when authority is granted. Every operation sent to a protected resource includes the token. The resource rejects operations with tokens older than the highest token it has already seen.

Rendering diagram...

Fencing is required when stale actors can reach the resource after losing authority. A lease without fencing is often only a performance optimization, not a safety mechanism.

Consensus algorithms

Consensus lets nodes agree on a value despite failures. Replicated state machines use consensus to agree on an ordered log of commands.

Core properties:

PropertyMeaning
AgreementCorrect nodes decide the same value.
ValidityDecided values were proposed according to the protocol.
TerminationCorrect nodes eventually decide under required assumptions.
IntegrityA node decides at most once for a given consensus instance.
Quorum intersectionAny two decision quorums overlap in at least one correct participant.

Consensus is usually needed for:

  • Leader election.
  • Membership changes.
  • Metadata updates.
  • Distributed locks.
  • Exactly one active controller.
  • Replicated logs.
  • Configuration changes that must not diverge.

Consensus is usually not needed for:

  • Best effort telemetry.
  • Derived search indexes.
  • Caches that can be rebuilt.
  • Idempotent asynchronous side effects.
  • Data where merge semantics are explicit and acceptable.

Replicated state machines

A replicated state machine executes the same deterministic commands in the same order on multiple replicas.

Rendering diagram...

Requirements:

  • Deterministic command execution.
  • Same command order on all replicas.
  • Durable log.
  • Snapshot strategy.
  • Membership change protocol.
  • Idempotent client handling.
  • Clear commit index.
  • Replay compatibility across software versions.

Non-determinism to avoid:

SourceProblemFix
Local wall clockReplicas compute different values.Put timestamp in the log entry.
Random numbersReplicas diverge.Log seed or generated value.
External API callReplay changes behavior.Execute side effects outside the state machine through an outbox.
Map iteration orderDifferent runtimes produce different order.Sort keys or use deterministic structures.
Floating point differencesDifferent platforms round differently.Use integer or decimal arithmetic for critical state.

Raft

Raft is a consensus algorithm designed for understandability. It separates leader election, log replication, and safety rules.

Raft roles:

RoleBehavior
FollowerResponds to leader and candidate requests.
CandidateStarts election after election timeout.
LeaderAccepts client commands, replicates log entries, advances commit index.

Key Raft concepts:

  • Term: monotonically increasing election epoch.
  • Election timeout: randomized timeout that reduces split votes.
  • Heartbeat: AppendEntries message without log entries.
  • Log index: position in the replicated log.
  • Commit index: highest log entry known to be committed.
  • Leader completeness: a leader contains all committed entries from previous terms.
  • Joint consensus: safe membership change by overlapping old and new quorums.
Rendering diagram...

Raft failure handling:

FailureExpected behavior
Leader crashes before replicationCommand is not committed and client must retry.
Leader crashes after majority replicationNew leader must preserve the committed entry.
Follower falls behindLeader sends missing entries or snapshot.
Split voteCandidates time out and retry with higher term.
Stale leader receives requestIt steps down after observing higher term or fails to contact quorum.
Membership change interruptedJoint consensus prevents independent majorities.

Raft implementation checklist:

  • Election timeouts are randomized and larger than heartbeat interval.
  • Persistent state is flushed before responses that rely on it.
  • Log matching property is enforced with previous index and term.
  • Client commands include deduplication identity.
  • Snapshots include last included index and term.
  • Membership changes use a safe transition protocol.
  • Linearizable reads use leader lease only with valid assumptions or use quorum read.

Paxos and Multi-Paxos

Paxos is a family of consensus protocols based on ballots and quorum intersection. It is harder to understand than Raft but deeply influential.

Classic Paxos roles:

RoleResponsibility
ProposerSuggests a value with a ballot number.
AcceptorPromises and accepts values according to ballot rules.
LearnerLearns the chosen value after quorum acceptance.

Classic Paxos phases:

PhaseActionPurpose
PrepareProposer asks acceptors to promise not to accept lower ballots.Establish authority for a ballot.
PromiseAcceptors return prior accepted values, if any.Preserve safety across retries.
AcceptProposer asks acceptors to accept a value.Drive a value toward quorum.
LearnLearners observe quorum accepted value.Publish decision.

Safety intuition:

  • Any two majorities intersect.
  • A new proposer must learn accepted values from a quorum.
  • If a value might already have been chosen, later ballots must preserve it.

Multi-Paxos optimizes repeated consensus by using a stable leader. After leadership is established, the leader can skip repeated prepare phases for new log slots until leadership changes.

Paxos cautions:

  • The single decree protocol is not a full database.
  • Production systems need log management, reconfiguration, snapshots, flow control, and operator tooling.
  • Liveness can suffer under dueling proposers.
  • Correctness depends on stable persistent promises and accepted values.

Consensus family comparison

ProtocolStrengthComplexityCommon use
RaftUnderstandable replicated log.Moderate.etcd, Consul style metadata stores.
Multi-PaxosMature consensus foundation.High.Distributed databases and storage systems.
ZabAtomic broadcast with primary.Moderate to high.ZooKeeper.
Viewstamped ReplicationPrimary backup consensus family.Moderate.Academic and practical consensus design.
EPaxosLower latency for non-conflicting commands.High.Specialized leaderless consensus workloads.
Byzantine fault tolerant consensusTolerates malicious or arbitrary faults.Very high.Permissioned ledgers, adversarial settings.

Replication

Replication copies data or commands across nodes. It improves availability, durability, locality, and read scale, but it introduces lag, conflict, failover complexity, and operational ambiguity.

Replication forms:

FormDescriptionStrengthRisk
Leader followerOne leader accepts writes, followers replicate.Simple write ordering.Leader bottleneck and failover window.
Multi-leaderMultiple leaders accept writes.Local writes across regions.Conflicts and complex reconciliation.
LeaderlessClients coordinate reads and writes across replicas.High availability.Read repair, hinted handoff, conflict resolution.
Statement replicationReplicate SQL or commands.Compact.Non-determinism can diverge replicas.
Row replicationReplicate changed rows.More deterministic.Schema and ordering concerns.
Physical replicationReplicate storage pages or log bytes.Close copy of source.Less flexible across versions or formats.
Logical replicationReplicate semantic changes.Flexible integrations.Requires schema and ordering discipline.
Event sourced projectionReplicate events into derived views.Replayable and auditable.Projection lag and poison events.

Replication topology:

Rendering diagram...

Replication questions:

  • Can replicas diverge?
  • Who accepts writes?
  • What happens on failover?
  • What reads can go to replicas?
  • How is lag measured?
  • How is conflict resolved?
  • Is acknowledgement synchronous, asynchronous, or semi-synchronous?
  • Are schema changes replicated safely?
  • How is bootstrap performed for a new replica?
  • How are corrupted replicas detected and rebuilt?

Synchronous and asynchronous replication

ModeCommit conditionBenefitFailure tradeoff
SynchronousLeader waits for required replicas before success.Stronger durability and failover safety.Higher latency and lower availability.
Semi-synchronousLeader waits for at least one replica or limited condition.Middle ground.Edge cases still need clear commit semantics.
AsynchronousLeader acknowledges before replicas apply.Low latency.Acknowledged writes may be lost on failover.

Design rule: client success must mean something precise. If success means "accepted by leader memory", "fsynced locally", "replicated to quorum", or "visible in all regions", say so explicitly.

Quorums

Quorum systems require agreement from a subset of nodes.

Common formula:

  • N: replica count.
  • W: write acknowledgements.
  • R: read acknowledgements.
  • If R + W > N, reads intersect with writes.

Examples:

NWRBehavior
322Majority reads and writes, tolerates one failed replica.
331Fast reads, slow writes, no write availability if one replica fails.
533Majority quorum, tolerates two failed replicas for reads or writes.
515Fast writes, expensive reads, weak write durability until repair.
522R + W <= N, reads may miss recent writes.

Quorum risks:

  • Sloppy quorum can acknowledge writes outside the normal replica set.
  • Read repair can hide inconsistency until reads occur.
  • Clock based last-write-wins can lose causally valid writes.
  • Quorum success does not guarantee application invariant correctness.
  • Membership changes can break intersection if old and new quorums do not overlap.
  • Correlated failures can defeat nominal replica counts.
Rendering diagram...

Conflict resolution

Conflicts occur when two or more updates are accepted without a single ordering authority.

Conflict strategies:

StrategyHow it worksGood forRisk
Last write winsPick highest timestamp or version.Caches, presence, replaceable values.Silently loses writes.
Application mergeDomain code combines concurrent versions.Documents, carts, preferences.Requires careful UX and tests.
CRDTData type guarantees convergence under merge.Counters, sets, collaborative structures.More complex data modeling.
Operational transformTransform concurrent edits.Text collaboration.Hard correctness model.
Manual resolutionHuman chooses or edits final value.Rare business conflicts.Operational cost and delay.
Reject on conflictRequire client to reread and retry.Admin forms, inventory, metadata.Lower availability under contention.

Conflict design checklist:

  • Can concurrent writes happen?
  • Is silent loss acceptable?
  • Does the merge preserve business invariants?
  • Is conflict metadata stored durably?
  • Can users understand and repair conflicts?
  • Are retries safe after conflict response?
  • Are conflict rates measured?

CRDTs

Conflict free replicated data types encode merge rules that are associative, commutative, and idempotent.

CRDTMeaningExample use
G-counterGrow only counter.Likes or local increments where decrement is not needed.
PN-counterPositive and negative counters.Distributed counters with increments and decrements.
G-setGrow only set.Observed event IDs.
OR-setObserved remove set.Add and remove with causal tags.
LWW-registerRegister resolved by timestamp.Replaceable value where loss is acceptable.
MV-registerMulti-value register keeps concurrent values.Conflict surfacing.

CRDTs are not magic. They move complexity into data type design and domain semantics.

Idempotency and retries

Distributed systems turn timeout into ambiguity. If a client times out, the operation may have failed, succeeded, or still be running.

Required tools:

  • Idempotency key.
  • Client generated operation ID.
  • Deduplication table.
  • Stable response replay.
  • Outbox for side effects.
  • Inbox for consumers.
  • Retry budget.
  • Exponential backoff with jitter.
  • Deadlines propagated across calls.
  • Circuit breakers for persistent dependency failure.

Idempotency states:

StateMeaningResponse
MissingNo record for key.Attempt operation and persist result.
In progressAnother attempt is executing.Return 409, 202, or wait with deadline.
SucceededOperation completed.Replay stable success response.
Failed retryableAttempt failed before commit.Allow retry with same key.
Failed finalOperation rejected by business rule.Replay stable failure response.

Retry checklist:

  • Is the operation safe to retry with the same key?
  • Does the server deduplicate before side effects?
  • Is the response stable across duplicate attempts?
  • Is there a retry budget?
  • Is jitter used to avoid synchronized retries?
  • Are deadlines shorter than upstream timeouts?
  • Are non-retryable errors classified clearly?
  • Does the caller stop retrying when cancellation is requested?

Retry storm pattern:

Rendering diagram...

Outbox and inbox patterns

The outbox pattern records state changes and outbound messages in the same local transaction. A relay later publishes the messages.

The inbox pattern records consumed message IDs before or while applying consumer effects, allowing duplicate delivery to be ignored.

Rendering diagram...

Use these patterns when exactly-once delivery is requested. Most practical systems implement at-least-once delivery plus idempotent effects.

Split brain

Split brain occurs when two or more nodes believe they have exclusive authority. It is usually caused by partitions, stale leases, unsafe failover, or independent control planes.

Split brain examples:

SystemSplit brain symptomConsequence
Database primaryTwo primaries accept writes.Divergent data and hard reconciliation.
Job schedulerTwo schedulers run exclusive job.Duplicate side effects.
Lock serviceTwo clients hold same lock.Corrupted shared resource.
Cluster controllerMultiple controllers mutate same objects.Flapping and lost updates.
Storage volumeTwo nodes mount writeable volume.Filesystem corruption.

Prevention techniques:

  • Quorum based leader election.
  • Fencing tokens on all writes.
  • Storage level single writer enforcement.
  • STONITH or power fencing in infrastructure clusters.
  • Monotonic epochs embedded in commands.
  • Fail closed when quorum is lost.
  • Avoid manual force promotion unless the old leader is fenced.

Split brain checklist:

  • What proves the old leader cannot write?
  • What rejects stale epochs?
  • Can both sides reach a shared dependency?
  • Can an operator accidentally force two primaries?
  • Are clients pinned to stale endpoints?
  • Is DNS or load balancer state slower than failover?

Network partitions

A partition is not always a clean cut. It can be asymmetric, protocol specific, intermittent, or regional.

Partition scenarios:

ScenarioDescriptionDesign concern
Clean partitionGroup A cannot talk to group B.Which side has quorum?
Asymmetric reachabilityA can reach B but B cannot reach A.Heartbeats can lie.
Partial port failureHTTP works, replication port fails.Health checks miss data plane failure.
DNS partitionName resolution fails while IP connectivity works.Clients fail differently by cache state.
Regional isolationA region cannot reach another region.Local availability vs global invariants.
Packet lossMessages arrive sometimes with high latency.Retries and retransmits increase load.
BlackholeTraffic is accepted but never answered.Timeouts dominate recovery time.

Partition response matrix:

OperationPartition behaviorReason
Read public catalogServe stale local copy.Availability matters and data is low risk.
Update payment methodRequire authoritative region or quorum.Incorrect update can affect billing.
Place inventory reservationCoordinate with inventory authority.Oversell violates capacity invariant.
Add item to cartAccept locally and merge later.User value is high and conflicts are manageable.
Acquire cluster leadershipRequire quorum.Split brain is worse than downtime.

Advanced networking

Advanced networking topics every Staff engineer should understand:

  • TCP handshake, congestion control, retransmission, slow start.
  • TLS handshake, certificates, mTLS, session resumption.
  • DNS caching, TTLs, split horizon, negative caching.
  • HTTP/1.1 vs HTTP/2 vs HTTP/3.
  • gRPC transport, deadlines, cancellation, streaming flow control.
  • Load balancing: L4, L7, client side, server side, consistent hashing.
  • NAT, conntrack, ephemeral port exhaustion.
  • Backpressure across network boundaries.
  • Head-of-line blocking.
  • Packet loss vs latency vs jitter.
  • Anycast and regional routing.
  • Service discovery and health checking.
  • Network partitions and asymmetric reachability.
  • MTU, fragmentation, path MTU discovery.
  • Keepalive behavior and idle connection timeouts.
  • SYN backlog, accept queue, connection pool exhaustion.
  • Kernel socket buffers and application read loops.

Networking failure table:

SymptomPossible causeInvestigation
Periodic request spikesDNS TTL expiry or synchronized refresh.Compare latency with DNS cache events.
Random connection resetsLoad balancer idle timeout or deploy churn.Inspect connection age and upstream logs.
Slow first requestTLS handshake, cold connection pool, DNS lookup.Break down timing by phase.
High tail latencyPacket loss, queueing, overloaded dependency.Check retransmits, saturation, queue depth.
Works by IP not nameDNS, search suffix, split horizon, stale cache.Query authoritative and local resolvers.
Some clients failNAT, conntrack, regional route, MTU.Segment by source, path, and protocol.
gRPC stream stallsFlow control, proxy timeout, head-of-line blocking.Inspect stream windows and proxy config.

Failure patterns

PatternDescriptionMitigation
Split brainMultiple authorities act at once.Quorum, fencing, fail closed.
Retry stormRetrying callers overload a weak dependency.Retry budgets, jitter, circuit breakers.
Thundering herdMany clients wake or refresh at once.Jitter, request coalescing, leases.
Cache stampedeHot cache key expires and all callers recompute.Soft TTL, single flight, early refresh.
Hot partitionOne shard or key receives disproportionate traffic.Sharding, key salting, adaptive routing.
Poison messageOne message repeatedly fails consumer processing.Dead letter queue, quarantine, schema validation.
Slow consumerConsumer lag grows until recovery becomes hard.Backpressure, autoscaling, compaction, lag alerts.
Gray failureComponent passes health checks but fails real work.Synthetic checks, brownout detection, load shedding.
Cascading failureOne failure causes dependent failures.Bulkheads, graceful degradation, dependency budgets.
Coordinated omissionMeasurements hide time spent waiting to send work.External load generation, corrected latency histograms.
Clock skew incidentTime assumptions break ordering or expiry.Clock monitoring, monotonic durations, fencing.
Control plane overloadManagement path fails and blocks recovery.Rate limits, priority queues, emergency access.

Failure scenario analysis

Use scenario analysis before productionizing a distributed workflow.

ScenarioQuestions to answer
Client times out after server commitsCan the client retry without duplicate side effects?
Leader commits then crashes before responseWill the new leader preserve and reveal the committed command?
Follower serves stale readIs stale data acceptable for this endpoint?
Message broker redeliversIs the consumer idempotent?
Outbox relay publishes twiceCan subscribers dedupe?
Region is isolatedWhich operations continue locally?
Clock moves backwardDo expirations, leases, and ordering still behave?
Schema deploy is halfway doneCan old and new versions both process replicated data?
Queue backlog growsIs there backpressure or admission control?
Operator force promotes a replicaWhat prevents the old primary from writing?

Practical design example: payment capture

Payment capture is a safety critical workflow. Duplicate capture is unacceptable. Delayed capture may be acceptable. Availability is less important than preventing double charge.

Recommended posture:

  • Use an idempotency key per capture attempt.
  • Persist payment state transition and outbox message in one transaction.
  • Use a unique constraint on provider operation ID or business operation ID.
  • Treat provider timeout as unknown.
  • Reconcile with provider by operation ID.
  • Never retry with a new key unless creating a distinct business operation.
Rendering diagram...

Correctness target:

InvariantMechanism
Capture at most onceIdempotency key, unique operation record, provider idempotency.
Local state matches provider eventuallyReconciliation job.
User sees stable resultOperation status endpoint.
Side effects are not lostTransactional outbox.

Practical design example: inventory reservation

Inventory reservation has a capacity invariant: confirmed reservations must not exceed available stock.

Design options:

OptionBehaviorTradeoff
Single authoritative regionAll reservations go through one region.Simple invariant, higher remote latency.
Consensus backed counterReservation requires quorum.Stronger availability than single node, still latency sensitive.
Escrow per regionAllocate regional quotas.Fast local reservations, complex rebalancing.
Optimistic accept and compensateAccept orders then cancel if oversold.High availability, poor user experience for scarce goods.

Reservation checklist:

  • Is oversell impossible or merely compensated?
  • Is stock decremented at reserve, pay, or ship time?
  • Do reservations expire?
  • Are expirations safe under clock skew?
  • Can cancellation and payment race?
  • Is each state transition monotonic?

Practical design example: distributed job scheduler

An exclusive job scheduler must avoid duplicate execution when jobs have non-idempotent side effects.

Recommended posture:

  • Use consensus or database compare-and-swap for job ownership.
  • Issue fencing token for each lease acquisition.
  • Include token in writes to job result storage.
  • Make job steps idempotent where possible.
  • Store heartbeats and progress durably.
  • Prefer resumable jobs over long invisible critical sections.

Scheduler state machine:

Rendering diagram...

Practical design example: multi-region user profile

User profile writes often want low latency, but different fields have different invariants.

FieldSuggested consistencyReason
Display nameLast write wins or version check.Low safety risk.
Email addressStronger validation and uniqueness check.Identity and notifications depend on it.
Marketing preferencesPer-field merge.Independent toggles can merge safely.
Account deletionStrong global marker.Must dominate later writes.
Security settingsAuthoritative region or quorum.Incorrect stale writes are dangerous.

Pattern: classify fields by invariant instead of assigning one consistency model to the whole object.

Operational checklists

Design review checklist

  • The system names its safety invariants.
  • Each operation has a documented consistency model.
  • Timeouts are shorter than caller deadlines.
  • Retries have budgets and jitter.
  • All non-idempotent operations have idempotency keys.
  • Every async side effect has an outbox, inbox, or equivalent dedupe mechanism.
  • Replication lag is measured and visible.
  • Failover semantics are documented.
  • Split brain prevention has been tested.
  • Clock assumptions are explicit.
  • Backpressure exists at every queue and RPC boundary.
  • Reconciliation exists for every eventually consistent path.

Production readiness checklist

  • Dashboards show request rate, error rate, saturation, and tail latency.
  • Dashboards show quorum health, leader changes, and replication lag.
  • Alerts distinguish user impact from internal redundancy loss.
  • Runbooks cover leader loss, partition, lag, replay, and bad deploy.
  • Load tests include dependency latency and packet loss.
  • Chaos tests include process pause and asymmetric partition where feasible.
  • Backups are restored regularly.
  • Schema migrations are compatible with old and new writers.
  • Idempotency storage has retention and capacity planning.
  • Dead letter queues have ownership and replay procedure.
  • Manual override procedures include fencing checks.

Incident debugging checklist

  • Establish timeline using monotonic durations where available.
  • Separate event time from ingestion time.
  • Identify first failing dependency, not only loudest symptom.
  • Check recent deploys, config changes, and traffic shifts.
  • Check leader changes, elections, and quorum loss.
  • Check DNS, load balancer, TLS, and connection pool metrics.
  • Check retries, queue depth, and consumer lag.
  • Check clock skew and host pauses.
  • Preserve logs before replay or compaction removes evidence.
  • Record which operations may have unknown outcome.

Rules of thumb

  • Prefer one writer when the invariant is strict and write volume permits it.
  • Prefer explicit merge when availability matters and conflicts are acceptable.
  • Prefer rejecting ambiguous writes over corrupting core invariants.
  • Prefer operation IDs over trying to infer duplicate intent from payloads.
  • Prefer monotonic state machines over ad hoc status updates.
  • Prefer bounded queues with backpressure over unbounded memory growth.
  • Prefer boring consensus implementations over custom protocols.
  • Prefer durable reconciliation over assuming every message arrives once.
  • Prefer testing failover paths before the incident.