Databases Storage and Transactions

Reading time
25 min read
Word count
4841 words
Diagram count
11 diagrams

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Software Engineering/04 Databases Storage and Transactions.md.

Databases Storage and Transactions

Databases are correctness systems, not only persistence tools. A database is a contract between application invariants, storage media, concurrency control, recovery logic, and operational discipline. Most production failures come from a mismatch between what the application assumes and what the database actually guarantees.

Core mental model

Rendering diagram...

Every write path answers four questions:

QuestionWhy it mattersCommon implementation
Where is the new value first made durable?Defines the recovery point after a crash.WAL append, journal, consensus log, object storage manifest.
When is the client acknowledged?Defines what an application may safely assume.Local fsync, quorum commit, async replica receipt, leader memory only.
How are readers isolated from writers?Defines visible anomalies under concurrency.Locks, MVCC snapshots, timestamp ordering, serial validation.
How is old state reclaimed?Defines storage growth and read amplification.Vacuum, compaction, page pruning, tombstone garbage collection.

Advanced database map

AreaConceptsProduction risk
Storage enginesB-Trees, LSM Trees, heaps, column stores, compression, page layout.Latency spikes, write amplification, fragmentation, read amplification.
RecoveryWAL, checkpoints, redo, undo, torn pages, checksums.Acknowledged commits lost, long restart time, silent corruption.
ConcurrencyLocks, latches, MVCC, predicate locks, deadlock detection.Lost updates, write skew, lock convoying, starvation.
Query processingStatistics, cardinality estimation, join algorithms, indexes.Bad plans, table scans, unstable latency after data drift.
DistributionReplication, sharding, consensus, quorums, distributed transactions.Split brain, stale reads, hot shards, partial commits.
OperationsMigrations, backups, CDC, rebalancing, failover drills.Irreversible schema changes, broken restore, duplicate events.

Storage engine mental model

Storage engines optimize the physical representation of logical tables. The same SQL interface may sit on very different persistence structures.

Engine styleWrite behaviorRead behaviorBest fitWatchouts
Heap plus secondary indexesAppend or update heap pages, update indexes.Index lookup then heap fetch unless covering.General OLTP.Bloat, random I/O, vacuum pressure.
B-Tree clustered tableRows stored in primary key order.Range scans are efficient on clustering key.Ordered OLTP, point lookups, range queries.Random primary keys fragment pages and reduce cache locality.
LSM TreeAppend to WAL and memtable, flush immutable SSTables.Search memtable plus multiple sorted runs, helped by Bloom filters.High write throughput, time series, key-value workloads.Compaction stalls, tombstone buildup, read amplification.
ColumnarStore columns separately, often compressed by segment.Scan only needed columns, vectorized execution.OLAP, analytics, large scans.Single row updates are expensive.
Log structured heapAppend records, maintain indirection or compaction.Reads follow latest pointer or scan logs.Event stores, write heavy systems.Compaction and point lookup indexing are essential.
In-memory with persistenceKeep data in memory, persist log or snapshots.Very low read latency.Caches, queues, low latency state.Recovery time, memory pressure, persistence mode confusion.
Rendering diagram...

Pages, buffer pools, and durability

Most row-oriented engines operate in fixed-size pages. A page contains records, free space, line pointers, checksums, and metadata. The buffer pool caches pages and coordinates dirty page flushing.

ComponentResponsibilityFailure scenario
PageSmallest logical unit for many reads and writes.Torn page writes half old and half new bytes after power loss.
Buffer poolCaches pages, tracks dirty pages, pins pages during use.Cache churn makes an index look slow even when the plan is correct.
LatchProtects in-memory structures for very short critical sections.Hot pages cause CPU contention without showing as transaction locks.
LockProtects logical data for transaction isolation.Long transaction blocks writers or DDL.
ChecksumDetects corruption in stored pages.Storage silently returns corrupt bytes; checksum catches the mismatch.
FsyncForces log or page bytes to durable media.Misconfigured durability acknowledges commits that die with the host.

WAL and recovery

Write-ahead logging means the database must durably record enough information to redo or undo changes before the corresponding data pages are considered durable.

Rendering diagram...
Recovery phasePurposeExample
AnalysisDetermine transaction and dirty page state at crash time.Find committed transactions whose pages might not be flushed.
RedoReapply idempotent changes that may be missing from data files.Rebuild a page update from WAL after the dirty page was lost.
UndoRoll back uncommitted changes if the engine uses undo logging.Remove a debit from a transaction that never committed.
CheckpointBound how far back recovery must read.Flush dirty pages and persist a recovery marker.

Correct WAL reasoning:

  • The WAL record must reach stable storage before the changed data page can be flushed.
  • A commit is only as durable as the commit record and the policy used to flush it.
  • Group commit batches many commits behind one fsync to improve throughput.
  • Checkpoints reduce restart time but can create I/O bursts.
  • Replicated WAL still needs a clear acknowledgement rule, such as local durable, replica received, or quorum committed.

B-Trees

B-Trees keep sorted keys in a shallow tree. Internal pages route searches, leaf pages hold keys, row pointers, or clustered rows. Most database B-Trees are B+Trees, where full records or row references live in leaves and leaves are linked for range scans.

Rendering diagram...
OperationCost intuitionImportant details
Point lookupO(log n), usually 2 to 4 page reads when cached internal pages are hot.Secondary indexes may require heap lookup.
Range scanO(log n) to first key plus sequential leaf traversal.Excellent for ordered queries and pagination by keyset.
InsertLocate leaf, insert, sometimes split.Random keys scatter writes; monotonic keys can hot spot the rightmost page.
DeleteMark or remove entry, sometimes merge or rebalance.Space may not return to the OS without vacuum or rebuild.
UpdateIn-place if same page and size allow, otherwise move or create new version.Indexed column updates usually delete and insert index entries.

Practical B-Tree tuning:

  • Choose primary keys that balance locality and hot spot risk.
  • Avoid indexing every column. Each secondary index increases write cost and migration cost.
  • Prefer keyset pagination over large OFFSET scans.
  • Use covering indexes for hot read paths only when the write cost is justified.
  • Watch index bloat after heavy churn.
  • Rebuild or reorganize indexes only with a measured problem and a safe maintenance window.

LSM Trees

Log structured merge trees turn random writes into sequential writes. New writes go to a WAL and an in-memory sorted structure. When the memtable fills, it becomes an immutable sorted string table. Background compaction merges sorted runs.

Rendering diagram...
ConceptMeaningFailure mode
MemtableMutable in-memory sorted table.Large memtables improve write batching but increase recovery replay time.
SSTableImmutable sorted file.Too many files increase read amplification.
Bloom filterProbabilistic test that a key is absent from an SSTable.False positives cause extra reads; false negatives indicate corruption or bug.
TombstoneDelete marker written like any other value.Range deletes or expired data can poison reads until compaction removes tombstones.
CompactionMerge and discard obsolete records.Stalls, high I/O, or disk full if compaction cannot keep up.
Leveled compactionKeeps levels non-overlapping.Lower read amplification, higher write amplification.
Size-tiered compactionMerges similar sized runs.Higher read amplification, lower write amplification for some workloads.

LSM correctness notes:

  • Deletes are not immediate physical deletes. Tombstones must survive until all older versions they cover are gone.
  • Snapshots and long reads can prevent old SSTables from being reclaimed.
  • Disk usage can temporarily exceed logical data size by a large factor during compaction.
  • Read repair and compaction may surface latent corruption long after the original write.

Indexes and access paths

Index design determines both read speed and write cost. An index is not only a lookup helper. It is a materialized ordering with maintenance obligations.

Index typeUseRisk
Primary indexEntity identity and clustering.Bad key choice creates hot spots.
Secondary indexLookup by alternate fields.Write amplification.
Composite indexMulti-column filtering and ordering.Column order matters.
Covering indexQuery served from index only.Larger index, more write cost.
Partial indexIndex subset of rows.Query predicate must match.
Full text indexSearch over text.Relevance and update complexity.
Bloom filterAvoid unnecessary reads.False positives only, never false negatives when correct.
Expression indexIndex a computed expression.Query must use a matching expression or planner may not use it.
Hash indexEquality lookup.Poor for ranges and ordering.
Spatial indexGeometric and nearest-neighbor queries.Complex selectivity and bounding box false matches.
Inverted indexToken or element to document mapping.Expensive updates and large posting lists.

Composite index order matters:

IndexEfficient predicatesInefficient predicates
(tenant_id, created_at)tenant_id = ?, tenant_id = ? ORDER BY created_at, tenant-scoped time range.Global created_at range without tenant filter.
(status, priority, id)Status queue by priority with stable keyset pagination.Lookup by priority alone.
(user_id, lower(email))User-scoped normalized email lookup.Global normalized email lookup unless user is supplied.

Index correctness examples:

  • A unique index on (tenant_id, slug) is stronger than an application-side "check then insert" because concurrent inserts cannot both commit.
  • A partial unique index such as (user_id) WHERE active = true can enforce "one active subscription per user" without serializing unrelated historical rows.
  • A foreign key prevents orphaned rows, but it can also create lock contention on parent rows during high write volume.
  • A missing index on a foreign key can make parent deletes or updates unexpectedly expensive.

Links:

Query planning and execution

The optimizer chooses a physical plan for a logical query. It estimates row counts, selectivity, I/O cost, CPU cost, sort cost, memory use, and join order. Bad estimates produce bad plans.

Rendering diagram...
Plan nodeGood whenPathology
Sequential scanLarge percentage of table needed, table small, or index not selective.Accidentally scans huge table due to missing predicate or stale stats.
Index scanPredicate is selective and row fetches are limited.Many random heap fetches make it slower than a scan.
Index only scanIndex covers projected columns and visibility map allows it.Falls back to heap checks if pages are not marked all visible.
Nested loop joinOuter side small, inner lookup indexed.Catastrophic if outer side is much larger than estimated.
Hash joinEqui-join with enough memory.Spills to disk if hash table exceeds memory.
Merge joinInputs already sorted or sorting is cheap.Sort cost dominates for large unsorted inputs.
SortNeeded for order, grouping, merge join, distinct.Spills to disk when work memory is too small.

Planner failure scenarios:

ScenarioSymptomFix direction
Stale statistics after bulk loadGood query suddenly uses nested loop or table scan.Analyze table, improve autovacuum or stats schedule.
Correlated predicatesPlanner multiplies independent selectivities and underestimates rows.Extended statistics, composite index, query rewrite.
Parameter-sensitive planOne cached plan is bad for some tenant sizes.Plan hints where available, query split, custom plans, tenant-aware routing.
Function on indexed columnIndex ignored for lower(email) unless expression index exists.Add expression index or store normalized field.
Implicit castIndex not used because column and parameter types differ.Fix parameter type and schema consistency.

Transactions and ACID

ACID is a set of guarantees, not a performance feature.

PropertyMeaningCommon misconception
AtomicityTransaction effects commit together or not at all.It does not mean the transaction is small or indivisible internally.
ConsistencyDeclared constraints and application invariants are preserved if transactions are correct.The database cannot enforce invariants it does not know.
IsolationConcurrent transactions observe behavior allowed by the isolation level.Default isolation is rarely fully serializable.
DurabilityCommitted state survives the promised failure class.Durability depends on fsync, replication, storage, and configuration.

Transaction boundaries should surround a complete invariant change:

BEGIN;

UPDATE accounts
SET balance = balance - 100
WHERE id = 'checking'
  AND balance >= 100;

UPDATE accounts
SET balance = balance + 100
WHERE id = 'savings';

COMMIT;

Correctness checks for transactional code:

  • Does every statement in the transaction participate in the same connection and transaction context?
  • Are external calls avoided inside the transaction, or bounded with timeouts and retries?
  • Are retries safe after serialization failures, deadlocks, and connection drops?
  • Are uniqueness, foreign key, and check constraints used instead of only application validation?
  • Does the code know whether a timeout happened before or after commit?
  • Is idempotency present at the API boundary?

MVCC

Multi-version concurrency control stores multiple versions of rows so readers and writers can often proceed without blocking each other. A snapshot defines which versions are visible.

Rendering diagram...
MVCC conceptMeaningOperational impact
SnapshotSet of committed versions visible to a transaction.Long snapshots retain old versions and delay cleanup.
Tuple versionOld and new physical row versions.Updates can increase table and index bloat.
Vacuum or cleanupReclaims versions no snapshot can see.Blocked by long transactions and idle-in-transaction sessions.
Visibility metadataTracks creator and deleter transaction information.Extra heap checks can prevent index-only scans.
Serialization validationDetects dangerous dependency cycles in stronger isolation.Transactions may abort and must be retried.

MVCC failure scenario:

  1. A report transaction opens a repeatable snapshot and runs for 4 hours.
  2. The application updates millions of rows during that window.
  3. Vacuum cannot reclaim old versions because the report might still need them.
  4. Table bloat grows, indexes become less efficient, and normal queries slow down.
  5. The fix is not only "run vacuum"; it is to shorten snapshots, route analytics to replicas, paginate reports, and enforce idle transaction timeouts.

Isolation levels and anomalies

IsolationPreventsStill allows
Read uncommittedAlmost nothing.Dirty reads, nonrepeatable reads, phantoms.
Read committedDirty reads.Nonrepeatable reads, phantoms, write skew.
Repeatable readNonrepeatable reads.Phantoms in some systems, write skew under snapshot isolation.
Snapshot isolationReads a consistent snapshot.Write skew.
SerializableEquivalent to serial execution.Lower concurrency or aborts under conflict.

Common anomalies:

AnomalyExamplePrevention
Dirty readTransaction reads a payment row that later rolls back.Read committed or stronger.
Nonrepeatable readSame row read twice returns different values.Repeatable read, snapshot isolation, or explicit lock.
Phantom readQuery for available seats returns different matching rows later.Serializable, predicate locks, range locks, or constraint design.
Lost updateTwo users read counter 5 and both write 6.Atomic update, row lock, version check, serializable.
Write skewTwo doctors both go off call after each sees the other is on call.Serializable or materialized constraint with locking.
Read skewTransfer updates two rows; reader sees debit but not credit.Consistent snapshot.

Correctness examples

Lost update with unsafe read-modify-write:

-- Both sessions read quantity = 10.
SELECT quantity FROM inventory WHERE sku = 'A';

-- Both compute 9 in application code.
UPDATE inventory SET quantity = 9 WHERE sku = 'A';

Safer atomic update:

UPDATE inventory
SET quantity = quantity - 1
WHERE sku = 'A'
  AND quantity > 0;

Optimistic version check:

UPDATE documents
SET body = $1,
    version = version + 1
WHERE id = $2
  AND version = $3;

Write skew under snapshot isolation:

-- Invariant: at least one doctor must remain on call.
-- T1 sees doctor B on call and turns A off.
-- T2 sees doctor A on call and turns B off.
-- Both update different rows, so row-level write conflict may not occur.

Stronger designs:

  • Use serializable isolation and retry serialization failures.
  • Represent the invariant as one row that both transactions update.
  • Use an exclusion or check constraint when the database can express it.
  • Lock the predicate range or parent invariant row before updating child rows.

Correctness tools:

  • Unique constraints.
  • Foreign keys.
  • Check constraints.
  • Transaction boundaries.
  • Optimistic version columns.
  • Pessimistic locks.
  • Idempotency keys.
  • Outbox and inbox tables.
  • Reconciliation jobs.

Locking, latching, and deadlocks

Locks protect logical database objects. Latches protect in-memory data structures. Confusing them hides root causes.

MechanismScopeDurationExample
LatchInternal memory page or structure.Microseconds or milliseconds.Protect B-Tree page split.
Row lockLogical row.Transaction duration.SELECT ... FOR UPDATE.
Predicate or range lockSet of possible rows.Transaction duration.Prevent insert into queried range.
Advisory lockApplication-defined key.Session or transaction.Serialize per-tenant maintenance job.
DDL lockSchema object.Statement or transaction.ALTER TABLE waits for active queries.

Deadlock example:

Rendering diagram...

Deadlock mitigation:

  • Lock rows in a stable order.
  • Keep transactions short.
  • Avoid user input and network calls while holding locks.
  • Use lock timeouts and retry deadlock victims.
  • Prefer single-statement atomic updates when possible.

Distributed transactions

Distributed transactions coordinate state changes across failure domains. The hard part is not writing two rows. The hard part is knowing what happened when a process, network, or coordinator fails between steps.

PatternGuaranteeFailure modeUse when
Two-phase commitParticipants commit or abort as one unit if coordinator completes.Blocking if coordinator dies after prepare.Small set of trusted participants with operational control.
Consensus transactionCommit decision replicated through Raft, Paxos, or similar.Higher latency and quorum dependency.Strongly consistent distributed databases.
SagaSequence of local commits with compensating actions.Compensation may fail or be semantically incomplete.Business processes can tolerate visible intermediate state.
OutboxState change and message intent committed atomically in one database.Relay lag or duplicate publish.Need reliable event publication without XA.
InboxConsumer records processed message IDs atomically with side effects.Inbox table growth and dedupe retention mistakes.At-least-once delivery consumers.
EscrowPre-allocate bounded rights to shards.Rebalancing rights and handling expired reservations.Counters, inventory, quota where overuse must be bounded.

Two-phase commit:

Rendering diagram...

The dangerous window is after participants prepare but before they learn the final decision. Participants must retain locks and prepared state. If the coordinator is unavailable, they cannot safely decide alone.

Outbox flow:

Rendering diagram...

Links:

Replication

Replication copies data to improve durability, availability, locality, or read scale. It also creates ambiguity: which copy is authoritative, how stale can reads be, and what happens during failover?

TopologyWrite pathStrengthRisk
Single leader, async followersLeader commits locally, followers catch up later.Simple and fast.Acknowledged writes may be lost if leader dies before replication.
Single leader, sync followerCommit waits for one or more replicas.Better durability.Higher latency and write unavailability if replicas fail.
Multi-leaderMultiple nodes accept writes.Regional writes and availability.Conflict detection and resolution are application-visible.
Leaderless quorumClients read and write to N replicas with R and W thresholds.Tunable consistency and availability.Sloppy quorum, hinted handoff, and read repair complexity.
Consensus replicated logLeader orders writes through quorum agreement.Strong consistency for committed log entries.Requires quorum for progress.

Replication lag scenarios:

ScenarioImpactMitigation
Read-your-writes missingUser creates project then replica read says it does not exist.Sticky leader reads, session tokens, or wait for replica LSN.
Stale authorizationPermission revoked on leader but replica still allows action.Leader reads for security checks, bounded staleness, cache invalidation.
Failover data lossOld leader accepted write that never reached promoted replica.Synchronous replication, quorum commit, recovery reconciliation.
Replica overloadAnalytics query delays replay.Dedicated analytics replicas, workload isolation, query limits.
Clock confusionLast-write-wins uses skewed timestamps.Logical clocks, version vectors, server-side ordering.

Quorum databases

Quorum systems store each item on N replicas and require enough acknowledgements to consider operations successful. A common rule of thumb is R + W > N, where R is read quorum and W is write quorum. This only gives useful consistency under specific assumptions: stable replica sets, no sloppy quorum surprises, correct conflict resolution, and reads that reconcile divergent versions.

ParameterMeaningTradeoff
NReplication factor.Higher durability and availability, higher storage and repair cost.
WWrite acknowledgements required.Higher W improves durability, increases write latency and failures.
RRead acknowledgements required.Higher R improves freshness, increases read latency and failures.
Sloppy quorumAccept writes on substitute nodes during outages.Improves availability, weakens intuitive quorum reasoning.
Hinted handoffLater deliver substitute writes to intended replicas.Helps repair, can resurrect old values if conflict handling is weak.
Read repairFix stale replicas during reads.Repairs hot keys faster than cold keys.
Merkle repairCompare tree summaries between replicas.Efficient anti-entropy for large ranges.

Quorum failure example:

  1. N is 3 and W is 2.
  2. Network partition isolates replica C.
  3. Write X reaches A and B and is acknowledged.
  4. A fails before C is repaired.
  5. A read with R equals 1 from C can return the old value.
  6. If clients require monotonic reads, the application needs stronger read policy, session consistency, or a different database guarantee.

Sharding and partitioning

Partitioning splits data into smaller physical units. Sharding distributes those units across nodes.

StrategyGood forRisk
Range partitioningTime series, ordered archival, range scans.Hot latest partition, manual split planning.
Hash partitioningEven distribution for point lookups.Range scans across many partitions.
List partitioningExplicit tenant, region, or category placement.Operational complexity as lists grow.
Composite partitioningTime plus tenant, region plus hash.More complex routing and indexing.
Consistent hashingDynamic node membership.Hot keys still hot, range queries poor.
Directory based shardingFlexible tenant-to-shard mapping.Directory is critical metadata.

Shard key selection:

CandidateProsCons
tenant_idTenant isolation, easy per-tenant moves.Large tenants become hot shards.
user_idGood for user-local queries.Cross-user collaboration queries scatter.
created_atEasy retention and archival.New writes concentrate on newest shard.
Random UUID hashBalanced writes.Poor locality and expensive range queries.
Composite tenant plus hashBalances large tenants with controlled scatter.Requires fanout for tenant-wide scans.

Cross-shard operations:

  • Local transaction: all touched data lives on one shard.
  • Fanout read: query all relevant shards and merge results.
  • Scatter-gather write: high risk unless idempotent and reconciled.
  • Global secondary index: index entry and base row may live on different shards.
  • Resharding: requires copy, dual write or change stream catch-up, validation, and cutover.

Hot partition scenario:

  1. Orders are partitioned by day.
  2. All current writes go to today's partition.
  3. One partition receives nearly all insert, index, and autovacuum pressure.
  4. Adding nodes does not help because the hot key range is still single-owned.
  5. A better design may hash within current time buckets or route by tenant plus time.

Change data capture

CDC turns database changes into a stream for search indexes, caches, warehouses, audit logs, and event-driven systems.

CDC sourceAdvantagesRisks
WAL or binlog decodingCaptures committed database order with low application coupling.Schema changes and decoder lag need careful handling.
TriggersFlexible and close to data.Adds write latency and can miss changes if disabled or misdeployed.
Polling updated timestampsSimple.Clock issues, missed same-timestamp updates, deletes are hard.
Application eventsRich semantic events.Can diverge from database commit unless outbox is used.

CDC correctness checklist:

  • Is the stream based on committed changes only?
  • Are events ordered per aggregate, table, shard, or globally?
  • Can consumers handle duplicates and reordering?
  • Is delete represented as a tombstone or explicit event?
  • Are schema versions included?
  • Is consumer offset committed atomically with side effects?
  • Is lag monitored against retention so WAL or binlog is not discarded before consumption?

Schema migrations

Online migrations must support mixed versions of application code and database schema. The safe pattern is expand, backfill, switch, contract.

Rendering diagram...
Migration typeSafer approachDangerous approach
Add nullable columnAdd column, deploy code, backfill in chunks, then constrain.Add non-null with table rewrite during peak traffic.
Rename columnAdd new column, dual write, backfill, switch reads, drop old later.Rename in place while old app version still runs.
Change typeAdd new typed column and convert gradually.Alter type on huge table without lock and rewrite analysis.
Add indexBuild concurrently or online where supported.Blocking index build on high traffic table.
Add constraintAdd not valid, backfill or clean data, validate later.Immediate validation that locks or fails production writes.
Split tableDual write or CDC to new table, validate, then switch.Big bang copy with no drift detection.

Migration failure scenario:

  1. Version A writes full_name.
  2. Migration renames full_name to display_name.
  3. Version A pods still run during rolling deploy and begin failing writes.
  4. Some requests retry and create duplicate side effects.
  5. Correct pattern is to add display_name, make code compatible with both, backfill, switch reads, stop old writes, then drop full_name after the compatibility window.

Backups, restore, and disaster recovery

Backups are only real if restores are tested. A backup file that cannot meet recovery objectives is an artifact, not a recovery plan.

Backup typeCapturesStrengthRisk
Logical dumpSchema and rows through database interface.Portable, easy to inspect.Slow for large databases, may miss roles or extensions.
Physical base backupData files at a consistent point.Fast restore for large systems.Version and storage layout dependent.
WAL archivingChanges after base backup.Point-in-time recovery.Useless if WAL gaps exist.
SnapshotStorage volume point in time.Fast and cheap.Must coordinate with database flush or recovery rules.
Replica backupOffloads primary.Reduces production impact.Replica lag or corruption can be backed up.

Recovery concepts:

TermMeaning
RPOMaximum acceptable data loss.
RTOMaximum acceptable time to restore service.
PITRRestore to a specific point using base backup plus WAL.
Restore drillScheduled proof that backups can be restored and queried.
Integrity checkVerify checksums, row counts, constraints, and application invariants.

Restore validation should include:

  • Restore into an isolated environment.
  • Confirm database starts without recovery errors.
  • Run migration history checks.
  • Verify row counts and critical checksums.
  • Verify application can boot against the restored database.
  • Exercise representative reads and writes.
  • Document elapsed restore time and recovered timestamp.
  • Alert if backups are stale, missing, too small, too large, or unrestorable.

Storage correctness checklist:

  • What is the acknowledged durability point?
  • Can acknowledged writes be lost on failover?
  • Are replicas used for reads?
  • What is the maximum acceptable replication lag?
  • Can stale reads violate product invariants?
  • How are conflicts detected and resolved?
  • How is corruption detected?
  • How are backups restored and verified?

Distributed databases

Distributed databases combine storage with 05 Distributed Systems problems:

  • CAP Theorem constraints.
  • PACELC latency and consistency tradeoffs.
  • Quorum reads and writes.
  • Shard placement.
  • Hot partition detection.
  • Cross-shard transaction limits.
  • Global secondary index complexity.
  • Clock assumptions.
  • Consensus for metadata.
  • Rebalancing and resharding.

Additional distributed concerns:

ConcernWhy it mattersExample
Metadata consensusShard ownership must be consistent.Two nodes both believe they own writes for the same range.
Fencing tokensPrevent old leaders from writing after failover.Old primary resumes and writes stale state to shared storage.
Clock modelSome systems depend on bounded clock uncertainty.External consistency requires commit wait or logical ordering.
RebalancingMoving data competes with foreground traffic.Node add causes latency spike due to uncontrolled streaming.
Global indexesSecondary index update may be cross-shard.Index says row exists before base row commit is visible.
Tenant placementNoisy neighbors and data residency.Large tenant overloads shard, requiring live move.

Data migration playbook

  • Define source of truth.
  • Define invariant checks.
  • Add backward compatible schema.
  • Dual write only with outbox or repair strategy.
  • Backfill idempotently.
  • Validate counts and checksums.
  • Shift reads gradually.
  • Monitor error rate, lag, and drift.
  • Keep rollback path until confidence is real.
  • Remove old path after compatibility window.

Detailed migration controls:

ControlPurposeExample
Idempotent backfillAllow safe resume after crash.Update rows where new column is null.
ChunkingBound locks, WAL growth, and replica lag.Process 5,000 rows per transaction.
ThrottlingProtect production latency.Pause when replica lag exceeds threshold.
ChecksumsDetect drift beyond row counts.Compare hash aggregates per partition.
Dual read validationCompare old and new read models before switching.Shadow query new index or table.
Kill switchStop new path quickly.Feature flag for read routing.
Rollback planKnow which writes are reversible.Keep old column populated until confidence window closes.

Practical failure scenarios

FailureRoot causeBetter design
User sees 404 after creating resourceCreate committed on leader, read served from lagging replica.Read leader for session, use replica LSN wait, or sticky routing.
Inventory oversoldCheck and decrement done in separate statements outside a transaction.Atomic conditional update or serializable transaction.
Duplicate chargePayment API retried after timeout without idempotency key.Idempotency key with unique constraint and stored outcome.
Migration locks tableBlocking DDL on large table during traffic.Online DDL, concurrent index build, expand-contract rollout.
WAL disk fillsLong replica slot or CDC consumer prevents WAL recycling.Lag alerts, retention limits, consumer recovery runbook.
Backup restores but app failsBackup excluded extensions, roles, secrets, or migration metadata.Full restore drill with application boot validation.
Query degrades overnightData distribution changed and stats became stale.Analyze, extended stats, plan regression tests for critical queries.
Compaction causes latency spikeLSM write amplification and background I/O saturation.Tune compaction, provision I/O headroom, isolate workloads.
Split brain writesFailover lacks fencing and old leader accepts writes.Consensus, fencing tokens, lease discipline, client routing controls.
Old tombstones resurrect dataReplica missed delete and tombstone expired before repair.Repair within tombstone retention and monitor anti-entropy.

Design review questions

QuestionStrong answer
What invariant must never be violated?It is enforced by a database constraint, serializable transaction, or single-writer design.
What happens if the request times out after commit?Retry path uses idempotency and can return the committed outcome.
What happens if failover occurs after acknowledgement?Acknowledgement policy makes data loss impossible for that failure class, or reconciliation is explicit.
Can stale reads cause harm?Read routing separates harmless stale reads from security and money decisions.
How do migrations run during rolling deploys?New and old app versions are schema-compatible throughout rollout.
How is backup quality proven?Automated restore drills verify data and application behavior.
How is CDC replay handled?Consumers are idempotent and offsets are managed with side effects.
How are hot tenants handled?Placement, throttling, split strategy, and shard movement are planned.