Data Structures Algorithms and Complexity

Reading time
20 min read
Word count
3881 words
Diagram count
0 diagrams

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Software Engineering/03 Data Structures Algorithms and Complexity.md.

Data Structures Algorithms and Complexity

This note connects algorithmic fundamentals to production engineering decisions. In production systems, algorithms are not only interview exercises. They shape latency, memory pressure, database cost, incident blast radius, scaling limits, and the ability to reason about correctness under load.

Existing anchors

Complexity as an engineering tool

Big O is a starting point. It tells you how an algorithm grows as input size grows, but production engineering also asks what resource is being consumed, where the resource limit is, and whether the system fails gradually or abruptly.

Complexity reasoning is most useful when it is tied to a real workload:

  • Number of users, tenants, rows, messages, files, nodes, edges, keys, or partitions.
  • Operation mix: reads, writes, deletes, scans, joins, merges, retries, and background maintenance.
  • Distribution: uniform, Zipfian, time clustered, tenant skewed, adversarial, or bursty.
  • Constraints: p50 latency, p95 latency, p99 latency, memory ceiling, disk budget, network budget, CPU budget, and operational simplicity.
  • Failure mode: slow response, memory exhaustion, retry storm, compaction backlog, lock convoy, hot shard, or data loss.
Complexity signalWhat it means in productionExample question
Time complexityCPU work per operation or batch.Does this request loop over every tenant, every row, or only the affected keys?
Space complexityMemory retained during and after the operation.Does this export stream rows or materialize the whole result set?
IO complexityDisk reads, disk writes, fsyncs, page reads, and seeks.Does this query use one index seek or scan millions of pages?
Network complexityNumber and size of remote calls.Is this endpoint O(1) database calls or O(n) service calls?
Contention complexityWork amplification caused by shared locks, atomics, queues, or hot keys.Does adding workers improve throughput or only increase waiting?
Tail complexityWork at p95 and p99, not only average work.What happens when a tenant has 100x more records than the median tenant?
Maintenance complexityBackground work caused by the foreground operation.Does each write create compaction, reindexing, cache invalidation, or fanout?

Practical rule: state complexity in the same units that fail in production. "This is O(n)" is less useful than "this allocates one object per row, performs one remote call per row, and blocks the request thread until all calls finish."

Complexity beyond Big O

Big O intentionally ignores details that can dominate real systems.

DimensionWhy it mattersProduction example
Constant factorsA theoretically better algorithm may have a larger fixed cost.A regex engine, JSON parser, or crypto primitive can dominate an O(n) pass.
Cache localitySequential memory access is often much faster than pointer chasing.A sorted array can beat a tree for small or medium read-heavy indexes.
Allocation rateFrequent allocation increases garbage collection, fragmentation, and allocator contention.Building temporary maps inside every request raises latency under load.
Branch predictionData-dependent branches can reduce CPU pipeline efficiency.A tight loop with unpredictable branches may be slower than a branchless version.
VectorizationContiguous homogeneous data can use SIMD or optimized library paths.Columnar analytics scans can process many values per CPU instruction.
Lock contentionSerialized critical sections cap throughput.A global cache lock turns a many-core service into a single-lane service.
AmortizationAverage cost can hide occasional expensive operations.Hash table resize, LSM compaction, and vector growth create latency spikes.
BackpressureAlgorithms that accept work faster than they process it create queues.A consumer with O(n) per message accumulates lag during bursts.
Data skewAverage cardinality hides large tenants and hot keys.A query fast for most customers times out for the largest customer.
Adversarial inputInputs may be malicious or accidentally worst case.Poor hash choices can turn expected O(1) into O(n) chains.

Complexity should be documented with assumptions. For example: "Expected O(1) lookup with a good hash and bounded load factor; worst case O(n) if many keys collide."

Data structure selection

The right structure follows the dominant operation and the invariant the system must preserve.

NeedUsually considerAvoid when
Fast lookup by exact keyData Structures/Hash Tables, Data Structures/Hash SetsYou need ordered scans, prefix queries, or predictable worst-case latency.
Ordered iteration and range queriesData Structures/Binary Search Trees, Data Structures/AVL Trees, Data Structures/Red-Black Trees, Data Structures/B-Trees, Data Structures/Skip ListsYou only need membership and memory is tight.
Append and indexed accessData Structures/Arrays or vectorsFrequent inserts or deletes in the middle dominate.
Frequent insert or delete at known nodesData Structures/Linked ListsYou need cache locality, random access, or compact memory layout.
FIFO workData Structures/QueuesPriority, deadline, or fairness ordering is required.
LIFO workData Structures/StacksWork must preserve arrival order.
Top k or next deadlineData Structures/HeapsYou need efficient search, delete arbitrary item, or ordered full scan.
Prefix lookupData Structures/TriesKeys are long, sparse, and memory is constrained.
Interval or prefix sumsData Structures/Segment Trees, Data Structures/Fenwick TreesData is small enough for direct recomputation or updates are rare.
ConnectivityData Structures/Disjoint Set (Union-Find)You need shortest paths, deletion of edges, or path reconstruction.
Eviction by recencyData Structures/LRU CacheFrequency, cost, size, or TTL should dominate eviction.

Selection checklist:

  • What is the invariant: membership, order, priority, connectivity, recency, prefix, uniqueness, or reachability?
  • What is the read to write ratio?
  • Are keys fixed size or variable size?
  • Is the data bounded or unbounded?
  • Is iteration order important?
  • Is worst-case latency important, or is expected latency acceptable?
  • Does the structure live in memory, on disk, across a network, or inside a database index?
  • Can the structure be rebuilt, or must every mutation be durable?
  • Is concurrent mutation required?

Production examples

ScenarioReasonable choiceReasoning
API authorization checks for user permissionsHash set per user or compressed bitmap per permission domainExact membership dominates. Keep checks O(1) or word-level bit operations.
Search suggestions by prefixTrie, compressed trie, or sorted array with binary search rangePrefix lookup is the invariant. Sorted arrays may be simpler and more cache friendly for static dictionaries.
Job scheduler by next run timeMin heap or timing wheelPop the next deadline quickly. Timing wheels help when deadlines are bucketed.
Tenant billing aggregationStreamed fold plus database indexAvoid materializing all events. Complexity is bounded by scanned events and grouping cardinality.
Dependency build orderGraph plus topological sortDetect cycles and produce an execution order.
Cache with bounded memoryLRU, LFU, TinyLFU, or size-aware evictionRecency alone may be wrong when entries have different costs and sizes.
Duplicate detection in a pipelineHash set, Bloom filter, or external sortExactness, memory budget, and replay semantics determine the choice.
Time-series downsamplingRing buffers, sketches, reservoir sampling, or segment treePreserve useful aggregates without storing every raw point forever.
Database secondary indexB-Tree or LSM-backed indexWorkload decides between range-read friendliness and write amplification tradeoffs.

Storage oriented structures

Storage structures optimize for the cost model of disks, SSDs, pages, block caches, write-ahead logs, and background maintenance. The dominant question is often not "how many comparisons" but "how many page reads, page writes, and compactions."

StructureProduction use
B-TreeDatabase indexes, filesystem metadata, range queries.
LSM TreeWrite-heavy storage, compaction based databases, log structured systems.
Bloom FilterNegative lookup acceleration, storage engine read reduction, cache membership checks.
Skip ListOrdered in-memory indexes, memtables, concurrent maps.
Hash TableCaches, lookup tables, joins, dedupe stores.
TriePrefix search, routing tables, autocomplete, IP matching variants.
HeapSchedulers, timers, priority queues.
GraphDependency analysis, reachability, planning, workflow systems.
StructureStrengthsTradeoffsOperational signals
B-Tree and B+TreeOrdered keys, range scans, stable read latency, page-oriented layout.Random writes update pages in place and may split pages.Page cache hit ratio, index bloat, random IO, fill factor.
LSM TreeHigh write throughput, sequential writes, good compression, natural write-ahead flow.Read amplification, write amplification, compaction stalls, tombstone buildup.Compaction backlog, level sizes, read amplification, write stalls.
Log structured fileAppend-only writes, simple crash recovery, fast sequential IO.Requires indexing, garbage collection, and segment cleaning.Segment count, reclaimable bytes, replay time.
Columnar storageEfficient scans, compression, vectorized analytics.Point updates are harder, row reconstruction costs more.Scan throughput, compression ratio, skipped row groups.
Inverted indexText search, tag search, document lookup by term.Updates and deletes require segment management.Segment count, merge backlog, postings list size.
Bitmap indexLow-cardinality filters, fast set operations.High-cardinality raw bitmaps can be large without compression.Bitmap density, compression ratio, filter selectivity.
Spatial indexGeospatial and multidimensional lookup.More complex balancing and query planning.Bounding box selectivity, false candidate count.

Storage examples:

  • A B-Tree index is often best for "all orders for tenant X between two dates" because it supports ordered range scans.
  • An LSM design is often best for "accept many writes per second and tolerate background compaction" because it turns random writes into sequential writes.
  • A Bloom filter in an LSM can avoid reading disk files that definitely do not contain a key.
  • An inverted index is better than scanning all documents when the query starts with terms rather than primary keys.

Cache locality

Modern CPUs are fast relative to memory. Data layout can dominate asymptotic complexity for realistic sizes.

LayoutGood forWeakness
Contiguous arrayScans, binary search, vectorization, compact storage.Middle insertions and deletions require shifting.
Array of structsRow-like access where all fields are used together.Wastes bandwidth if only one field is read across many records.
Struct of arraysColumn-like scans and SIMD.More awkward when complete objects are frequently accessed.
Linked nodesStable references and cheap local insertion once position is known.Pointer chasing, poor locality, per-node allocation overhead.
Packed trie or radix treePrefix lookup with less pointer overhead than naive trie.More complex implementation and update logic.
Ring bufferPredictable bounded memory and sequential access.Fixed capacity and overwrite policy must be explicit.

Common pattern: a sorted vector plus binary search can outperform a tree for small to medium collections because it has fewer allocations and better locality, even though insertion is O(n). This is especially true when reads dominate writes.

Memory allocation and ownership

Allocation behavior is part of algorithmic complexity.

PatternRiskMitigation
Allocate per item in a hot loopHigh allocator overhead and garbage collection pressure.Preallocate, reuse buffers, use arenas, batch work.
Build whole response in memoryMemory spikes and out-of-memory failures.Stream results, paginate, use bounded buffers.
Store duplicate keys and stringsHidden memory multiplier.Intern strings, use references, normalize schemas, compress.
Unbounded cacheSlow memory leak disguised as optimization.Set maximum entries, maximum bytes, TTL, and admission rules.
Per-object metadata overheadMany small objects cost much more than payload size.Pack data, use arrays, use specialized primitive collections.
Fragmented long-lived heapMore memory retained and slower allocation.Separate lifetimes, pool carefully, compact where runtime allows.

When choosing a structure, estimate memory as concretely as possible:

  • Number of entries.
  • Key size.
  • Value size.
  • Pointer and object overhead.
  • Load factor or spare capacity.
  • Index overhead.
  • Replication factor.
  • Cache duplication.

Example: a hash table with 10 million entries is not just the payload. It also includes buckets, pointers, hash metadata, allocator overhead, spare capacity for load factor, and possibly duplicate key storage.

Concurrent data structures

Concurrent structures add correctness dimensions:

  • Linearization point: the instant an operation appears to take effect.
  • Progress guarantee: blocking, lock-free, wait-free.
  • Memory reclamation: safe deletion while other threads may read.
  • Contention behavior: throughput under many writers.
  • Fairness: whether some actors starve.

Examples:

  • Concurrent queue: producer consumer systems, work stealing, async runtimes.
  • Concurrent hash map: shared caches and registries.
  • Ring buffer: low latency streams and logging.
  • Copy-on-write map: read-heavy configuration snapshots.
  • RCU list: read-heavy kernel and infrastructure workloads.
StructureUseful whenWatch for
Mutex-protected mapSimplicity and moderate contention matter more than maximum throughput.Lock convoying, long critical sections, blocking under callbacks.
Sharded mapMany independent keys are updated concurrently.Hot shards, expensive resize, iteration across shards.
Concurrent queueProducers and consumers run independently.Backpressure, fairness, memory growth, cancellation semantics.
Work-stealing dequeTask schedulers and fork-join workloads.Subtle correctness and load balancing under uneven tasks.
Ring bufferBounded low-latency handoff.Overflow policy, single producer vs multiple producer assumptions.
Copy-on-write snapshotReads dominate and updates are rare.Update cost copies large state, stale readers may observe old versions.
RCU style structureExtremely read-heavy systems.Memory reclamation after readers finish, complex lifecycle reasoning.
Lock-free stack or queueBlocking is unacceptable in a narrow hot path.ABA problems, memory ordering, reclamation, harder testing.

Progress terms:

  • Blocking: a stalled thread can prevent others from completing.
  • Lock-free: at least one thread makes progress even if others stall.
  • Wait-free: every operation completes in a bounded number of steps.
  • Obstruction-free: a thread makes progress if it eventually runs alone.

Production guidance:

  • Prefer simple locks until measurement shows contention is material.
  • Keep critical sections small and avoid remote calls while holding locks.
  • Make ownership and cancellation explicit for queued work.
  • Bound queues or apply backpressure before memory becomes the queue.
  • Test under skew, not only equal producer and consumer rates.
  • Treat memory ordering bugs as correctness bugs, not performance bugs.

Probabilistic structures

Probabilistic structures trade exactness for lower memory, lower IO, or bounded latency. They are valuable when the system can tolerate a known error shape.

StructureAnswersError typeProduction use
Data Structures/Bloom Filters"Might contain key?"False positives, no false negatives if used correctly.Avoid unnecessary disk reads or remote lookups.
Counting Bloom filter"Might contain key?" with deletion supportFalse positives and counter overflow risk.Membership with removals.
Cuckoo filter"Might contain key?"False positives, supports deletion.Compact membership filters.
HyperLogLog"How many distinct items?"Approximate cardinality.Unique users, unique IPs, distinct keys at scale.
Count-Min Sketch"How frequent is this item?"Overestimates counts.Heavy hitters, abuse detection, traffic analysis.
Reservoir sampling"Representative sample from stream?"Sampling error.Observability, debugging, analytics sampling.
Quantile sketch"What is the p95 or p99?"Approximate quantiles.Latency telemetry and large metric streams.

Example decisions:

  • Use a Bloom filter to skip a disk lookup when a negative answer is common and false positives only waste work.
  • Do not use a Bloom filter when a false positive would grant access, charge money, or claim data exists.
  • Use HyperLogLog when exact distinct count would require storing too many identifiers.
  • Use Count-Min Sketch when overcounting is acceptable but undercounting is not.

Graph algorithms

Graphs model dependencies, networks, permissions, workflows, ownership, fraud rings, package trees, and infrastructure topology. Production graph work is usually constrained by graph size, update frequency, and whether queries are online or offline.

ProblemAlgorithm or structureUse caseCaveat
ReachabilityBFS or DFS over Data Structures/GraphsCan service A reach service B?Large graphs need pruning and visited sets.
Shortest unweighted pathBFSFewest hops in dependency or social graph.Path count can explode in dense graphs.
Shortest weighted pathDijkstraRouting, cost-based planning.Requires non-negative weights.
Negative weightsBellman-FordCurrency arbitrage style reasoning, constraint systems.More expensive than Dijkstra.
All-pairs shortest pathFloyd-Warshall or repeated DijkstraSmall dense graphs, precomputed routing.O(n^3) is impractical for large graphs.
Minimum spanning treeKruskal or PrimNetwork design, clustering, cheap connection sets.Does not solve shortest path between all pairs.
ConnectivityData Structures/Disjoint Set (Union-Find)Dynamic union of components.Edge deletion is not handled well.
Topological orderKahn algorithm or DFSBuild systems, migrations, workflow scheduling.Cycles must be reported clearly.
Strongly connected componentsTarjan or KosarajuCycle groups, service dependency clusters.Component graph should be inspected after condensation.
Max flowEdmonds-Karp, DinicAllocation, matching, capacity planning.Modeling source, sink, and capacities matters more than code cleverness.

Practical graph modeling:

  • Define node identity carefully. A service, deployment, pod, endpoint, and tenant are different nodes.
  • Define edge meaning. "Calls", "owns", "depends on", "can access", and "contains" should not be mixed without labels.
  • Decide whether edges are directed.
  • Decide whether edges are weighted and what the weight means.
  • Bound traversal depth for online requests unless full traversal is required.
  • Record why a path exists, not only that it exists, when the result drives security or operations.

Anti-pattern: running recursive graph traversal inside every request without a maximum depth, cache, or precomputed closure. This often becomes a latency incident when the graph grows or cycles appear.

Streaming algorithms

Streaming algorithms process data one item at a time with bounded memory. They are used when data is too large, too fast, or too continuous to load fully.

GoalTechniqueProduction example
Running aggregateFold with accumulatorSum bytes per tenant in a log stream.
Sliding window countRing buffer, deque, or windowed bucketsRequests per minute for rate limiting.
Approximate distinct countHyperLogLogUnique devices per day.
Heavy hittersCount-Min Sketch plus heapTop abusive IPs in a high-volume stream.
Approximate quantilesQuantile sketchp99 latency without storing every sample.
SamplingReservoir samplingKeep representative examples for debugging.
DeduplicationHash set, Bloom filter, or windowed stateDrop duplicate events within 24 hours.
Stream joinWindowed state by keyMatch payment event and fulfillment event.

Streaming design questions:

  • Is event time or processing time authoritative?
  • How late can events arrive?
  • What is the window: tumbling, sliding, session, or global?
  • Is state bounded by count, time, key cardinality, or disk?
  • Are results exact or approximate?
  • What happens on replay?
  • Are operations idempotent?
  • How are checkpoints and offsets committed?

Example: a rate limiter that stores every request timestamp forever has unbounded memory. A fixed-size ring of per-second buckets gives O(1) update and bounded memory for a fixed window, with known resolution loss.

Algorithmic design patterns

  • Divide and conquer.
  • Dynamic programming.
  • Greedy algorithms.
  • Backtracking.
  • Graph traversal.
  • Shortest path.
  • Topological sort.
  • Union find for connectivity.
  • Approximation and probabilistic algorithms.
  • Streaming algorithms with bounded memory.
PatternUseful forProduction caution
Divide and conquerSorting, search, parallel processing.Recursive overhead and merge memory can dominate.
Dynamic programmingOverlapping subproblems and optimal substructure.Tables can become large. Compress state when possible.
GreedyLocal choices that are provably sufficient.Needs proof or strong domain argument. Greedy scheduling can be unfair.
BacktrackingConstraint search.Explodes exponentially without pruning and limits.
Branch and boundOptimization with pruning.Worst case can still be exponential.
Graph traversalReachability and dependency reasoning.Requires visited sets and cycle handling.
Union findIncremental connectivity.Does not naturally support edge deletion.
Binary search on answerMonotonic feasibility problems.Requires a true monotonic predicate.
Two pointersOrdered arrays and window problems.Input ordering assumptions must be explicit.
Sliding windowBounded recent history.Late events and out-of-order streams complicate semantics.
MemoizationCache repeated work.Cache key correctness, memory bounds, and invalidation matter.

Interview vs production differences

Interviews often compress the problem until the main algorithmic idea is visible. Production expands the problem until all resource and correctness edges are visible.

Interview framingProduction framing
Input fits in memory.Input may exceed memory and must be streamed, paginated, indexed, or partitioned.
Single process.Work may span services, queues, databases, caches, and retries.
Average case is enough.Tail latency and worst tenants matter.
Simple data types.Objects have serialization, schema evolution, ownership, and validation cost.
One correct answer.There may be tradeoffs between correctness, freshness, cost, simplicity, and operability.
No failures.Disk, network, dependency, and process failures are part of the algorithm.
No concurrency.Threads, async tasks, transactions, and distributed actors race.
Clean input.Input may be malformed, malicious, duplicated, late, or skewed.
Big O dominates.Constants, locality, allocation, and IO often dominate.

Interview skill is still useful. It provides vocabulary and baseline reasoning. Production skill adds measurement, workload modeling, safety limits, and maintainable implementation.

Correctness before cleverness

Algorithmic choices should be justified by workload:

  • Access pattern.
  • Mutation frequency.
  • Cardinality.
  • Distribution.
  • Latency objective.
  • Memory budget.
  • Concurrency model.
  • Durability requirement.

Avoid optimizing a data structure before naming the invariant it protects.

Correctness questions:

  • What must always be true before and after each operation?
  • What happens if the process crashes mid-operation?
  • What happens if the same message is processed twice?
  • What happens if two writers update the same key at once?
  • Is stale data acceptable?
  • Can the operation be retried safely?
  • Is the result exact, approximate, or best effort?
  • How is a partial result detected?

Anti-patterns

Anti-patternWhy it failsBetter approach
N plus 1 queriesNetwork and database round trips grow with result count.Batch, prefetch, join, or use a data loader.
Full table scan in request pathLatency grows with database size.Add an index, precompute, paginate, or move to background job.
Unbounded in-memory aggregationMemory grows with input and can crash the process.Stream, spill to disk, use bounded sketches, or partition.
Sorting when selection is enoughO(n log n) work for top k.Use a heap or selection algorithm.
Global lock around hot mapThroughput collapses under concurrency.Shard, reduce critical section, or use immutable snapshots.
Linked list for general collectionsPoor cache locality and high allocation overhead.Use arrays, vectors, deques, or intrusive lists only when justified.
Recursive traversal without depth or visited checksStack overflow, cycles, and repeated work.Use iterative traversal, visited set, and explicit limits.
Cache without invalidation strategyServes stale or inconsistent data.Define TTL, versioning, dependency invalidation, or write-through semantics.
Hashing untrusted input without protectionCollision attacks can degrade performance.Use hardened hash functions, caps, and request limits.
Approximate structure in authorization pathFalse positives or approximation can violate correctness.Use exact checks for security and money movement.

Measurement and validation

Complexity claims should be verified with representative data.

Validation methodFinds
Unit testsInvariants, boundary cases, exact behavior.
Property testsUnexpected inputs and invariant violations.
BenchmarksCPU, allocation, throughput, and scaling curves.
Load testsQueueing, contention, remote dependency limits.
ProfilingHot functions, allocation sites, lock waits.
TracingNetwork fanout, retries, dependency latency.
Production metricsTail latency, memory growth, cache hit ratio, compaction, backpressure.

Useful benchmark inputs:

  • Empty, one item, and small collections.
  • Median production size.
  • Large tenant or largest known partition.
  • Skewed keys.
  • Duplicate-heavy input.
  • Already sorted input.
  • Reverse sorted input.
  • Random input.
  • Malformed or adversarial input.

Decision records for algorithm choices

For important choices, record:

  • Workload assumptions.
  • Data size now and expected data size later.
  • Chosen structure or algorithm.
  • Alternatives considered.
  • Complexity in CPU, memory, IO, and network terms.
  • Correctness invariant.
  • Failure mode.
  • Observability signals.
  • Reason the decision should be revisited.

Example:

FieldExample
ProblemTrack recently seen event IDs for deduplication.
Workload50,000 events per second, duplicates usually arrive within 10 minutes.
ChoiceTime-windowed hash sets rotated by minute.
ComplexityExpected O(1) lookup and insert, bounded memory by retention window and event rate.
AlternativeBloom filter.
Why not alternativeFalse positives would drop valid events, which is not acceptable.
Revisit whenEvent rate or retention window increases beyond memory budget.