Performance Capacity and Cost

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Software Engineering/11 Performance Capacity and Cost.md.

Performance engineering is the discipline of predicting, measuring, and controlling how a system consumes scarce resources while serving real demand. Capacity engineering asks whether the system can meet its service objectives at expected and abnormal load. Cost engineering asks whether the same outcome is being delivered with acceptable economic efficiency.

Performance is not one number. A service can have excellent average latency and still be unusable for the slowest 1 percent of requests. A system can have high throughput in a benchmark and still fail under production contention. A platform can be cheap at idle and expensive under retry storms, excess observability cardinality, or poor cache behavior.

Core Vocabulary

| Term | Meaning | Common mistake | Better practice |
|---|---|---|---|
| Latency | Time for one operation to complete | Reporting only average latency | Track percentile latency by operation, dependency, and tenant class |
| Throughput | Completed work per unit time | Treating peak throughput as sustainable capacity | Report sustained throughput at an SLO-bound latency target |
| Service time | Time a worker actively spends on a request | Confusing it with end-to-end latency | Separate processing time from queueing and network waits |
| Queueing delay | Time spent waiting before service starts | Ignoring it until saturation | Model queues explicitly and alert on growing wait time |
| Utilization | Fraction of a resource in use | Assuming 90 percent CPU is always efficient | Compare utilization with latency, run queue, throttling, and error rate |
| Saturation | Demand approaches available capacity | Treating saturation as a binary state | Detect early via queue growth, retries, pool waits, and scheduler delay |
| Headroom | Spare capacity before unacceptable behavior | Keeping fixed percent headroom without scenario analysis | Reserve headroom for spikes, failover, deployments, and noisy neighbors |
| Tail latency | High percentile behavior such as p95, p99, p999 | Optimizing median only | Budget every layer so tails do not multiply |
| Cost per unit | Cost per request, job, tenant, GB, or model call | Looking only at monthly cloud spend | Attribute cost to the unit that drives demand |

Latency

Latency is the elapsed time observed by a caller. It includes local compute, network hops, queueing, dependency calls, serialization, retries, lock waits, garbage collection pauses, scheduler delay, and client-side connection behavior.

Useful decomposition:

end_to_end_latency = client_wait
                   + network_time
                   + ingress_queue_time
                   + application_queue_time
                   + service_time
                   + dependency_time
                   + retry_time
                   + response_transfer_time

Latency should be measured from multiple perspectives:

| Perspective | Captures | Misses |
|---|---|---|
| Client side | DNS, TLS, network, retries, load balancer behavior, real perceived latency | Internal span detail unless propagated |
| Edge or ingress | Routing, WAF, TLS termination, upstream selection | Browser behavior and deep application detail |
| Application | Handler time, queue time, dependency time, business operation labels | Network before ingress and client retry behavior |
| Dependency | Database or cache service time | Caller-side pool waits and serialization |
| Synthetic probe | Availability and known path latency | Tenant-specific hot paths and workload variety |

Tail Percentiles

Percentiles describe distribution shape. p50 is the median. p95 means 95 percent of observations are at or below that value. p99 means 1 in 100 observations are slower. p999 means 1 in 1000 observations are slower.

Tail latency matters because users and workflows often experience many operations, not one.

probability_all_fast = fast_probability_per_call ^ number_of_calls

Example:
If each call is within target 99 percent of the time and a page requires 40 calls:
0.99 ^ 40 = 0.669

Only about 66.9 percent of page loads have all 40 calls within target.
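
The arithmetic above can be reproduced with a short sketch. The function name is illustrative, and the calculation assumes calls are slow independently of each other, which real fan-out only approximates.

```python
def probability_all_fast(fast_probability_per_call: float, number_of_calls: int) -> float:
    """Chance that every call in one page load meets its latency target.

    Assumes calls are independent; correlated slowness changes the result.
    """
    return fast_probability_per_call ** number_of_calls


# Show how the same per-call target erodes as fan-out grows.
for calls in (1, 10, 40, 100):
    print(calls, round(probability_all_fast(0.99, calls), 3))
```

The same per-call target that looks comfortable for a single call becomes a minority outcome at high fan-out, which is why budgets are usually set per layer rather than per page.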

Tail latency amplifiers:

  • Fan-out to many dependencies.
  • Queueing near saturation.
  • Noisy neighbor effects.
  • Stop-the-world garbage collection.
  • Cold starts and autoscaler lag.
  • Cache misses on hot paths.
  • Lock convoys.
  • Database lock waits.
  • Retry storms.
  • Packet loss and TCP retransmission.
  • Large payloads mixed into latency-sensitive queues.

Percentile measurement pitfalls:

| Pitfall | Why it misleads | Fix |
|---|---|---|
| Averaging percentiles | p99 values are not additive or safely averageable | Aggregate raw histograms or use mergeable sketches |
| Low sample count | p999 with 1000 requests is unstable | Require minimum sample volume per window |
| Coordinated omission | Load generator waits for slow response before issuing next request | Preserve intended arrival rate and measure client-observed delay |
| Too-wide labels | One metric blends fast and slow operations | Partition by operation, dependency, status, tenant tier, and region |
| Too-many labels | Cardinality cost and query instability | Bound label values and use exemplars for deep traces |
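
A small sketch of the first pitfall, using made-up latencies for two instances with very different traffic. Merging raw samples stands in here for merging histograms or sketches; the nearest-rank percentile helper and the data are assumptions for illustration only.

```python
import math


def nearest_rank(values, pct):
    """Nearest-rank percentile of raw samples (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in ms.
busy_instance = [10] * 980 + [900] * 20  # 1000 requests, 2 percent slow
idle_instance = [10] * 100               # 100 requests, none slow

p99_busy = nearest_rank(busy_instance, 99)
p99_idle = nearest_rank(idle_instance, 99)

# Wrong: averaging per-instance percentiles ignores request volume.
averaged_p99 = (p99_busy + p99_idle) / 2

# Better: compute the percentile over the combined distribution.
merged_p99 = nearest_rank(busy_instance + idle_instance, 99)

print(f"averaged={averaged_p99} ms merged={merged_p99} ms")
```

The averaged figure corresponds to no request any caller experienced, while the merged figure reflects what the slowest 1 percent of all requests actually saw.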

Throughput

Throughput is completed work per unit time. It is usually reported as requests per second, jobs per minute, bytes per second, messages per second, or transactions per second.

Throughput is only meaningful with latency and error constraints:

sustainable_capacity = max_throughput where
                       latency_percentile <= target
                       and error_rate <= target
                       and resource_saturation <= target
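
One way to apply this definition to stepped load-test results is sketched below. The `LoadStep` fields, the thresholds, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadStep:
    throughput_rps: float
    p99_latency_ms: float
    error_rate: float
    saturation: float  # 0.0 to 1.0 for the tightest resource


def sustainable_capacity(steps, max_p99_ms, max_error_rate, max_saturation):
    """Highest measured throughput whose step met every constraint (0.0 if none)."""
    passing = [
        s.throughput_rps
        for s in steps
        if s.p99_latency_ms <= max_p99_ms
        and s.error_rate <= max_error_rate
        and s.saturation <= max_saturation
    ]
    return max(passing, default=0.0)


# Illustrative results from four load steps.
steps = [
    LoadStep(1000, 120, 0.000, 0.35),
    LoadStep(2000, 180, 0.001, 0.60),
    LoadStep(3000, 650, 0.004, 0.85),   # latency is fine, saturation is not
    LoadStep(3400, 2400, 0.090, 0.99),  # peak throughput, but unusable
]
capacity = sustainable_capacity(
    steps, max_p99_ms=800, max_error_rate=0.01, max_saturation=0.80
)
print(capacity)
```

In this made-up data the peak of 3400 requests/second is not the capacity; the highest step that satisfies all three constraints is.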

Throughput ceilings often come from the narrowest resource:

| Bottleneck | Symptom | Typical corrective action |
|---|---|---|
| CPU | High run queue, high CPU time, low idle, throttling | Optimize hot code, increase cores, reduce serialization, scale out |
| Memory | Paging, OOM kills, high GC time, allocator pressure | Reduce working set, tune heap, pool carefully, fix retention |
| Disk I/O | High await, low IOPS headroom, compaction lag | Batch, change access pattern, provision IOPS, separate workloads |
| Network | Retransmits, bandwidth saturation, connection resets | Reduce payloads, compress carefully, shard traffic, colocate services |
| Database | Lock waits, slow queries, connection pool exhaustion | Index, partition, tune queries, reduce transactions, use read replicas |
| Queue | Rising age, consumer lag, uneven partitions | Add consumers, rebalance partitions, reduce per-message cost |
| External API | Rate limit errors, long dependency spans | Cache, batch, budget calls, degrade gracefully |

Throughput optimization anti-patterns:

  • Increasing concurrency without measuring queueing delay.
  • Adding replicas while the database is the bottleneck.
  • Using async I/O to hide blocking work without bounding inflight requests.
  • Treating benchmark throughput as production capacity.
  • Ignoring request mix and payload size distribution.
  • Measuring accepted requests instead of completed successful work.

Little's Law

Little's Law is a core anchor for capacity reasoning and queue sizing:

L = lambda * W

L      = average number of items in the system
lambda = average arrival rate
W      = average time an item spends in the system

Equivalent forms:

W = L / lambda
lambda = L / W

Example:

arrival_rate = 200 requests/second
average_latency = 0.150 seconds

average_inflight = 200 * 0.150 = 30 requests

If average latency rises to 0.600 seconds at the same arrival rate:

average_inflight = 200 * 0.600 = 120 requests

The system now needs 4 times as many concurrent request slots, database connections, memory buffers, and downstream capacity just to keep up. Little's Law also exposes why retries are dangerous: retries increase effective arrival rate, which increases queue length, which increases latency, which triggers more retries.
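
The example and the retry feedback loop can be expressed as a short sketch. The function names are illustrative, and the retry model is a simplification that assumes a fixed fraction of requests is retried exactly once.

```python
def average_inflight(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * avg_time_in_system_s


def effective_arrival_rate(original_rate_per_s: float, retry_fraction: float) -> float:
    """Arrival rate once a fraction of requests is retried exactly once (assumed model)."""
    return original_rate_per_s * (1 + retry_fraction)


baseline = average_inflight(200, 0.150)
degraded = average_inflight(200, 0.600)
# If a quarter of callers also retry once during the slowdown:
with_retries = average_inflight(effective_arrival_rate(200, 0.25), 0.600)

print(baseline, degraded, with_retries)
```

Under these assumptions the retries alone add another 30 inflight requests on top of the 120 caused by the latency increase.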

Queueing and Utilization

Utilization above roughly 70 to 80 percent often causes unstable queueing latency for variable workloads. The exact threshold depends on arrival variance, service-time variance, batching behavior, and scheduling discipline.

Simplified intuition:

utilization = arrival_rate / service_rate

as utilization approaches 1.0:
queueing_delay grows nonlinearly
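
To make the nonlinearity concrete, the sketch below uses the textbook M/M/1 model (one server, Poisson arrivals, exponential service times). That model is an assumption chosen for simplicity, not a description of any particular system; real workloads with different variance will bend earlier or later.

```python
def mm1_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean time in an M/M/1 system: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


service_rate = 100.0  # requests/second one server can complete (illustrative)
for utilization in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    w = mm1_time_in_system(utilization * service_rate, service_rate)
    print(f"utilization={utilization:.2f} mean_time_in_system={w * 1000:.0f} ms")
```

In this model, moving from 50 to 90 percent utilization multiplies mean time in system by five even though the service time per request never changes.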

Queueing controls:

| Control | Use when | Risk |
|---|---|---|
| Bounded queue | Work should wait briefly but not indefinitely | Dropped work if capacity is too low |
| Load shedding | Protecting the service is more important than accepting all requests | Requires clear caller contracts |
| Backpressure | Upstream can slow down safely | Can propagate latency across systems |
| Priority queue | Some work is more important or time-sensitive | Starvation if priorities are not aged |
| Separate pools | Slow work must not block fast work | Poor pool sizing can waste capacity |
| Rate limiting | Demand must be shaped per actor or global budget | Bad limits can punish healthy clients |
| Circuit breaker | Dependency failure should not consume all workers | Aggressive breakers can reduce availability |
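
A minimal sketch of the first two controls combined: a bounded queue that sheds work once it is full instead of blocking the caller. The class name and the depth are hypothetical; it relies only on the standard library `queue.Queue`.

```python
import queue


class BoundedWorkQueue:
    """Accepts work up to a fixed depth and sheds the rest instead of blocking."""

    def __init__(self, max_depth: int):
        self._queue = queue.Queue(maxsize=max_depth)
        self.shed_count = 0  # not synchronized; fine for a single producer sketch

    def offer(self, item) -> bool:
        """Return True if accepted, False if shed because the queue is full."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.shed_count += 1
            return False

    def take(self):
        """Remove and return the oldest item; raises queue.Empty if none."""
        return self._queue.get_nowait()


work = BoundedWorkQueue(max_depth=3)
accepted = [work.offer(n) for n in range(5)]
print(accepted, work.shed_count)
```

Returning a boolean makes the shed decision visible to the caller, which is the "clear caller contract" the table asks for: the caller can translate it into a retry-after response or a degraded result.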

CPU Performance

CPU performance depends on instruction count, instruction-level parallelism, branch predictability, memory access, cache behavior, vectorization, scheduling, and synchronization.

CPU Hot Path Checklist

  • Identify where CPU time is spent with a profiler before changing code.
  • Distinguish user CPU, system CPU, steal time, and throttled time.
  • Measure both wall time and CPU time.
  • Check whether the hot path is allocation-heavy.
  • Check branch misprediction if tight loops are unexpectedly slow.
  • Check whether data layout causes pointer chasing.
  • Check whether serialization or compression dominates request time.
  • Check whether TLS, JSON parsing, regex, hashing, or logging is unexpectedly expensive.
  • Check whether CPU limits cause CFS throttling in containers.
  • Validate that optimization improves representative workloads, not microbenchmarks only.

Common CPU Bottlenecks

| Bottleneck | Pattern | Better approach |
|---|---|---|
| Excess serialization | Repeated JSON encode or decode across layers | Pass structured values internally, encode once at boundaries |
| Regex on hot path | Complex expressions per request | Precompile, simplify, or use parser/state machine |
| Logging cost | Formatting large logs before level check | Guard expensive fields and sample high-volume logs |
| Compression everywhere | CPU spent compressing tiny payloads | Use size thresholds and appropriate algorithms |
| Hash map churn | Allocate, hash, resize per request | Reuse structures carefully, pre-size, use arrays for dense keys |
| Virtual dispatch in loops | Branchy polymorphism in tight path | Specialize outside the loop or use data-oriented layout |
| Container throttling | High latency despite moderate average CPU | Inspect throttled periods and raise limits or reduce burst CPU |

Memory Performance

Memory affects latency through allocation cost, cache misses, garbage collection, page faults, memory bandwidth, NUMA effects, and OOM behavior.

Memory questions to answer:

| Question | Signal |
|---|---|
| What is the working set? | Resident memory, heap live set, cache size, page faults |
| What is the allocation rate? | Allocations per request, bytes allocated per second |
| What is retained? | Heap profile, dominator tree, object graph retention |
| What is copied? | Buffer copies, serialization boundaries, compression buffers |
| Is memory locality good? | Cache miss counters, pointer chasing, CPU stalls |
| Is the allocator contended? | Allocator CPU, thread-cache misses, lock contention |
| Is GC affecting tails? | Pause time, concurrent marking CPU, promotion rate |
| Is the system paging? | Major faults, swap in, swap out, reclaim stalls |

Memory anti-patterns:

  • Treating cache as free memory.
  • Using unbounded maps keyed by tenant, user, request, or trace id.
  • Retaining request-scoped objects in global structures.
  • Returning slices or views that retain large backing buffers.
  • Building huge intermediate arrays instead of streaming.
  • Copying payloads across every layer.
  • Pooling objects without measuring retention and contention.
  • Ignoring memory overhead of observability labels and exemplars.

Cache Locality

Cache locality is the tendency to access nearby data close together in time. CPUs are much faster when data is in L1 or L2 cache than when it must be fetched from main memory.

| Locality type | Meaning | Example |
|---|---|---|
| Temporal locality | Reuse the same data soon | Reusing a parsed routing table |
| Spatial locality | Access nearby data | Iterating over a contiguous array |
| Instruction locality | Execute nearby instructions repeatedly | Tight loop with predictable branches |
| Data ownership locality | Same core repeatedly mutates same data | Per-worker counters |

Design techniques:

  • Prefer compact contiguous structures for hot loops.
  • Keep hot fields together and cold fields separate.
  • Avoid pointer-heavy object graphs on critical paths.
  • Batch operations to amortize cache misses.
  • Use per-thread or per-shard state for frequently updated counters.
  • Keep lock metadata away from frequently read immutable data.
  • Align heavily contended fields if false sharing is proven.

Cache Coherency and False Sharing

Modern CPUs keep per-core caches coherent. When one core writes to a cache line, other cores may need to invalidate or reload that line. This is correct behavior, but it can become expensive when many cores mutate data that happens to share a cache line.

False sharing occurs when independent variables live on the same cache line and are written by different cores.

cache_line:
[ counter_a ][ counter_b ][ counter_c ][ counter_d ]

Thread 1 writes counter_a.
Thread 2 writes counter_b.

The variables are logically independent, but the cache line bounces between cores.

Mitigations:

| Mitigation | Use when | Cost |
|---|---|---|
| Per-core counters | Very frequent increments | Requires aggregation |
| Sharded state | Hot key or global counter contention | More complex reads and rebalancing |
| Padding or alignment | Proven false sharing on adjacent fields | More memory usage |
| Immutable snapshots | Many readers and few writers | Snapshot freshness and copy cost |
| Ownership transfer | One worker owns mutation | Queueing and routing complexity |

Avoid padding as a superstition. Use hardware counters or profiler evidence first.

Contention

Contention happens when many workers compete for the same resource. The resource can be a mutex, CPU core, cache line, database row, connection pool, queue partition, memory allocator, rate limiter bucket, log sink, or hot cache key.

Examples:

  • Mutex hot path.
  • Global atomic counter.
  • Database row lock.
  • Queue partition.
  • Thread pool.
  • Connection pool.
  • Rate limiter.
  • Cache key.
  • Allocator arena.
  • Logger append lock.
  • Metrics registry lock.

Contention Diagnosis

| Symptom | Likely cause | Evidence |
|---|---|---|
| CPU low, latency high | Waiting on locks, I/O, or pools | Thread dumps, blocked time, pool wait histograms |
| CPU high, throughput flat | Spin loops, cache-line bouncing, serialization | CPU profile, perf counters, flame graph |
| p99 spikes under concurrency | Lock convoy or queue saturation | Mutex wait histogram, queue age, scheduler delay |
| One partition lags | Hot key or uneven routing | Per-partition throughput and lag |
| Database CPU moderate, requests slow | Lock waits or pool exhaustion | DB wait events, connection pool wait time |
| More workers make it slower | Shared bottleneck or coherency storm | Scaling curve with throughput vs concurrency |

Mutex Hot Paths

A mutex is often the right tool. The problem is not that a lock exists, but that it protects a hot path for too long or at too high a frequency.

Mutex hot path checklist:

  • Is the lock acquired per request, per item, or per batch?
  • What is the p50, p95, and p99 wait time for the lock?
  • What is the critical section duration?
  • Does the critical section perform I/O, logging, allocation, or callbacks?
  • Does the lock protect multiple unrelated fields?
  • Does one slow holder block all other callers?
  • Is lock ordering documented for nested locks?
  • Does the lock cause priority inversion?
  • Does the lock interact with async runtimes or event loops incorrectly?

Mitigations:

| Mitigation | Good fit | Caution |
|---|---|---|
| Reduce critical section | Expensive work can move outside lock | Must preserve invariants |
| Shard lock | Many independent keys | Hot keys can still dominate |
| Read-write lock | Many readers, rare writers | Writer starvation or reader overhead |
| Copy-on-write | Reads dominate and snapshots are acceptable | Write amplification |
| Actor ownership | One worker owns mutable state | Mailbox can become queue bottleneck |
| Batching | High-frequency small mutations | Adds latency and failure semantics |
| Local aggregation | Metrics, counters, statistics | Reads are approximate or require merge |
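
The "shard lock" row can be sketched as a striped counter map: keys hash to one of a fixed number of locks, so unrelated keys rarely contend. The class name and stripe count are illustrative, and as the caution column notes, a single hot key still serializes on its stripe.

```python
import threading


class StripedCounter:
    """Per-key counters guarded by a fixed set of locks instead of one global lock."""

    def __init__(self, stripes: int = 16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._shards = [{} for _ in range(stripes)]

    def _index(self, key) -> int:
        return hash(key) % len(self._locks)

    def increment(self, key, amount: int = 1) -> None:
        i = self._index(key)
        with self._locks[i]:  # only callers hashing to this stripe contend
            shard = self._shards[i]
            shard[key] = shard.get(key, 0) + amount

    def get(self, key) -> int:
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].get(key, 0)


counter = StripedCounter(stripes=4)


def worker(key: str) -> None:
    for _ in range(1000):
        counter.increment(key)


# Four threads, two per hypothetical tenant key.
threads = [threading.Thread(target=worker, args=(f"tenant-{n % 2}",)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.get("tenant-0"), counter.get("tenant-1"))
```

The critical section stays short (one dictionary read and write), which keeps the sketch consistent with the "reduce critical section" row as well.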

Lock-Free and Wait-Free Tradeoffs

Lock-free algorithms guarantee that at least one thread makes progress in a finite number of steps, even if other threads are delayed or suspended. Wait-free algorithms guarantee that every operation completes in a bounded number of steps.

| Technique | Progress guarantee | Benefit | Cost |
|---|---|---|---|
| Blocking lock | None if holder stalls | Simple invariants and maintainability | Convoys, deadlocks, priority inversion |
| Try-lock with fallback | Depends on fallback | Avoids blocking in some paths | Complex retry behavior |
| Lock-free | System-wide progress | Avoids stalled lock holder | ABA problems, memory ordering, per-thread starvation, hard testing |
| Wait-free | Per-operation bounded progress | Strong tail-latency guarantee | Very complex, often high memory or copy cost |
| RCU-style reads | Readers avoid locks | Excellent read scalability | Reclamation complexity and stale reads |

Lock-free tradeoffs:

  • Correct memory ordering is hard to prove and hard to review.
  • Performance may be worse under contention due to repeated compare-and-swap failures.
  • Busy retry loops can burn CPU and harm neighboring workloads.
  • Memory reclamation is often the hardest part.
  • Debugging rare interleavings can dominate the value of the optimization.
  • Lock-free code can improve p99 only if the lock was actually the bottleneck.

Wait-free tradeoffs:

  • Strong bounded progress can be valuable in schedulers, real-time systems, telemetry hot paths, and safety-critical control loops.
  • General-purpose application code rarely needs wait-free algorithms.
  • Complexity and maintenance risk are usually higher than the performance win.
  • Proof obligations matter. If the team cannot explain the bound, it should not be called wait-free.

Decision rule:

Use the simplest synchronization primitive that meets the measured latency target.
Escalate from lock to sharding, batching, ownership, lock-free, and wait-free only with evidence.

Profiling

Profiling turns performance work from opinion into evidence. The minimum useful loop is measure, hypothesize, change one thing, remeasure, and compare under the same workload.

Profiler types:

| Profiler | Answers | Watch for |
|---|---|---|
| CPU sampling | Where CPU time goes | Sampling bias, missing wall-clock waits |
| Wall-clock profiling | Where elapsed time goes | Needs enough labels to separate waiting from work |
| Allocation profiling | What allocates and how much | Sampling may miss short-lived bursts |
| Heap profiling | What is retained | Snapshot timing changes interpretation |
| Lock profiling | Who waits on synchronization | Instrumentation overhead |
| I/O profiling | Disk, network, and dependency waits | Correlate with caller-level latency |
| eBPF or kernel profiling | Scheduler, TCP, filesystem, syscalls | Requires careful symbolization and permissions |
| Database profiling | Queries, locks, plans, buffer usage | Production plans differ by parameters and data shape |

Profiling checklist:

  • Capture request rate, request mix, data size, and feature flags with the profile.
  • Record hardware, container limits, runtime version, and deployment shape.
  • Use production-like data distributions.
  • Compare p50, p95, p99, throughput, errors, and resource usage.
  • Preserve profiles and flame graphs for later regression comparison.
  • Look for missing time: if spans show 100 ms but client sees 500 ms, instrumentation is incomplete.
  • Recheck after compiler, runtime, kernel, or dependency upgrades.

Load Testing

Load testing validates behavior under controlled demand. It should be tied to service objectives, not just a maximum requests-per-second number.

Test types:

| Test | Purpose | Pass signal |
|---|---|---|
| Baseline load | Establish normal operating behavior | Meets SLO with expected headroom |
| Peak load | Validate known high-demand period | Meets SLO at forecast peak |
| Stress to failure | Find breaking point and failure mode | Fails predictably and recovers cleanly |
| Soak test | Find leaks, compaction issues, and slow degradation | Stable latency, memory, and error rate over time |
| Spike test | Validate sudden demand changes | Autoscaling and queues stabilize before SLO breach budget is consumed |
| Failover load | Validate loss of zone, region, node, or dependency | Remaining capacity handles redirected load |
| Dependency degradation | Validate partial failure behavior | Backpressure, timeouts, and fallbacks protect the system |

Avoid coordinated omission by measuring from the client perspective and preserving intended arrival rate. A load generator that waits for each response before scheduling the next request hides the exact latency growth that matters.
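
The effect can be shown with a deterministic simulation rather than a real load tool. The sketch below models a single-connection generator that waits for each response, against a server that stalls once; every parameter is an illustrative assumption.

```python
def simulate(num_requests=200, interval_ms=20, service_ms=10,
             stall_index=50, stall_ms=1000):
    """Single-connection load generator against a server with one long stall.

    Returns (naive, corrected) latency lists in ms:
    naive measures from when the request was actually sent,
    corrected measures from when it was scheduled to be sent.
    """
    naive, corrected = [], []
    clock = 0.0  # time at which the generator is free to send again
    for i in range(num_requests):
        intended_start = i * interval_ms
        # The generator cannot send until the previous response arrived.
        actual_start = max(intended_start, clock)
        cost = stall_ms if i == stall_index else service_ms
        finish = actual_start + cost
        naive.append(finish - actual_start)
        corrected.append(finish - intended_start)
        clock = finish
    return naive, corrected


naive, corrected = simulate()
slow_naive = sum(1 for ms in naive if ms > 100)
slow_corrected = sum(1 for ms in corrected if ms > 100)
print(f"requests over 100 ms: naive={slow_naive} corrected={slow_corrected}")
```

With these parameters the naive view reports 1 slow request out of 200, while measuring from the intended send time reports 90, because every request scheduled during and after the stall waited behind it.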

Load test design checklist:

  • Use realistic arrival patterns, not only closed-loop workers.
  • Model read/write mix, payload sizes, tenant skew, cache warmness, and burstiness.
  • Include authentication, authorization, serialization, and observability overhead.
  • Include slow and failing dependency responses.
  • Measure queue wait, pool wait, retry count, timeout count, and dropped work.
  • Run long enough to observe GC, compaction, rotation, autoscaling, and storage effects.
  • Verify that generated load is not bottlenecked by the test client.
  • Define stop conditions to protect shared environments.
  • Compare results to a previous baseline, not only to the absolute target.

Capacity Planning

Capacity planning connects demand, resources, SLOs, and failure scenarios.

Inputs:

  • Demand forecast.
  • Peak to average ratio.
  • Growth rate.
  • SLO target.
  • Dependency limits.
  • Backpressure behavior.
  • Failover headroom.
  • Regional loss scenario.
  • Deployment surge capacity.
  • Batch and cron overlap.
  • Cost ceiling.

Capacity worksheet:

| Item | Example question | Output |
|---|---|---|
| Demand unit | What drives load? | Requests, jobs, tenants, GB, active sessions |
| Current baseline | What do we serve today? | p50, p95, p99, throughput, CPU, memory, cost |
| Peak multiplier | How high is peak vs average? | Peak factor by hour, day, season, event |
| Growth | How fast is demand changing? | Monthly or quarterly multiplier |
| SLO | What must remain true at peak? | Latency, availability, correctness, freshness |
| Bottleneck | Which resource saturates first? | CPU, DB, queue, memory, network, external API |
| Scaling unit | What is added when scaling? | Pod, VM, shard, partition, queue consumer |
| Failover | What if one zone or region is lost? | N+1, N+2, or active-active capacity target |
| Cost | What is the acceptable unit cost? | Cost per request, tenant, GB, job |

Common formulas:

peak_demand = average_demand * peak_to_average_ratio

future_peak = current_peak * (1 + growth_rate) ^ periods

required_capacity = future_peak * safety_factor

headroom_percent = (capacity - load) / capacity * 100

unit_cost = total_cost / business_units_served

effective_arrival_rate = original_arrival_rate + retry_arrival_rate

Example:

current_peak = 8,000 requests/second
growth_rate = 0.12 per quarter
periods = 4
safety_factor = 1.35

future_peak = 8,000 * (1 + 0.12) ^ 4 = 12,588 requests/second
required_capacity = 12,588 * 1.35 = 16,994 requests/second
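
The formulas and the worked example translate directly into code. The function names mirror the formulas above; nothing else is assumed.

```python
def future_peak(current_peak: float, growth_rate: float, periods: int) -> float:
    """Compound growth: current_peak * (1 + growth_rate) ^ periods."""
    return current_peak * (1 + growth_rate) ** periods


def required_capacity(peak: float, safety_factor: float) -> float:
    return peak * safety_factor


def headroom_percent(capacity: float, load: float) -> float:
    return (capacity - load) / capacity * 100


peak = future_peak(8_000, 0.12, 4)
needed = required_capacity(peak, 1.35)
print(round(peak), round(needed))
print(f"headroom at today's peak: {headroom_percent(needed, 8_000):.1f} percent")
```

Keeping the inputs as named parameters makes it easy to rerun the plan when the growth rate or safety factor is revised.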

Autoscaling

Autoscaling changes capacity in response to demand or saturation signals. It reduces idle cost but introduces control-loop risk.

Autoscaling signal comparison:

| Signal | Good for | Weakness |
|---|---|---|
| CPU utilization | CPU-bound stateless services | Slow for bursty traffic, misleading under I/O waits |
| Memory utilization | Memory-bound workers and caches | Scaling out may not reduce per-process memory |
| Request rate | Predictable per-request cost | Ignores heavy requests and dependency delays |
| Queue depth | Background workers | Needs arrival rate and processing time context |
| Queue age | User-visible backlog freshness | May react after backlog is already harmful |
| p95 latency | User experience | Can be noisy and late |
| Inflight requests | Concurrency-bound services | Needs per-instance limit discipline |
| Custom saturation metric | Known bottleneck | Requires maintenance and validation |

Autoscaling failure modes:

  • Scaling on average CPU while p99 latency is driven by queueing.
  • Scaling pods while the database, queue partition, or external API is saturated.
  • Cold starts consume the entire spike budget.
  • Long stabilization windows react too late.
  • Short windows cause oscillation.
  • New replicas receive traffic before caches, JIT, or connection pools are warm.
  • Downscaling removes capacity during slow drains.
  • Per-pod connection pools multiply database connections beyond safe limits.

Autoscaling checklist:

  • Define minimum capacity for baseline and failover.
  • Define maximum capacity to protect dependencies and cost.
  • Use readiness checks that represent real ability to serve.
  • Pre-warm expensive caches or accept a warmup budget.
  • Keep per-instance concurrency limits explicit.
  • Coordinate autoscaling with rate limits and connection pool limits.
  • Test scale-up and scale-down behavior during load tests.
  • Alert on desired replicas hitting max replicas.
  • Alert when queue age grows while replicas are already scaling.
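
Two of these checks can be sketched together: a proportional scaling rule clamped to explicit bounds, and a replica ceiling derived from the database connection limit. The rule is a simplified assumption, not the exact algorithm of any particular autoscaler, and all numbers are illustrative.

```python
import math


def desired_replicas(current_replicas, observed_metric, target_metric,
                     min_replicas, max_replicas):
    """Proportional scaling rule, clamped to explicit bounds."""
    raw = math.ceil(current_replicas * observed_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


def max_replicas_for_database(db_max_connections, pool_size_per_replica,
                              reserved_connections=0):
    """Largest replica count whose pools cannot exceed the database connection limit."""
    return (db_max_connections - reserved_connections) // pool_size_per_replica


# Hypothetical limits: 500 connections, 50 reserved for admin and migrations.
ceiling = max_replicas_for_database(
    db_max_connections=500, pool_size_per_replica=20, reserved_connections=50
)
normal = desired_replicas(10, observed_metric=90, target_metric=60,
                          min_replicas=4, max_replicas=ceiling)
surge = desired_replicas(10, observed_metric=180, target_metric=60,
                         min_replicas=4, max_replicas=ceiling)
print(ceiling, normal, surge)
```

When the surge case is clamped at the ceiling, that is the moment the "desired replicas hitting max replicas" alert should fire, because further demand has to be absorbed by queues, shedding, or a larger database budget.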

Cost Engineering

Cost is a system property. It is shaped by architecture, data retention, request mix, tenant behavior, deployment topology, observability, development workflow, and failure handling.

Cost dimensions:

| Dimension | Examples | Optimization lever |
|---|---|---|
| Compute | VMs, containers, serverless, CI runners | Right-size, bin-pack, optimize CPU hot paths |
| Memory | Large nodes, cache fleets, heap overhead | Reduce working set, tune caches, fix leaks |
| Storage | Databases, object storage, logs, backups | Lifecycle policies, compression, retention classes |
| Network | Egress, cross-zone traffic, CDN misses | Colocate services, cache at edge, reduce payloads |
| Observability | Metrics, traces, logs, profiles | Cardinality budgets, sampling, retention tiers |
| Database | Reads, writes, indexes, replicas, IOPS | Query tuning, partitioning, connection discipline |
| AI inference | Tokens, model calls, embeddings, reranking | Prompt budgets, caching, batching, smaller models |
| Developer workflow | Builds, tests, preview environments | Cache builds, prune stale envs, parallelize selectively |

Unit economics examples:

cost_per_request = monthly_service_cost / monthly_successful_requests

cost_per_tenant = monthly_allocated_cost / active_tenants

cost_per_gb_processed = monthly_pipeline_cost / gb_processed

cost_per_model_answer = (input_tokens_cost + output_tokens_cost + retrieval_cost + orchestration_cost)
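
A short sketch of the first formula, using made-up figures, shows why the denominator should be successful requests rather than accepted ones.

```python
def cost_per_request(monthly_service_cost: float, monthly_successful_requests: int) -> float:
    """Unit cost over successful work only; failed and retried calls still cost money."""
    if monthly_successful_requests <= 0:
        raise ValueError("no successful requests to attribute cost to")
    return monthly_service_cost / monthly_successful_requests


# Illustrative numbers, not taken from any real bill.
spend = 42_000.0
healthy = cost_per_request(spend, 120_000_000)
degraded = cost_per_request(spend, 108_000_000)  # same spend, 10 percent fewer successes
print(f"per 1000 requests: healthy={healthy * 1000:.3f} degraded={degraded * 1000:.3f}")
```

The monthly bill is identical in both cases, so only the unit metric reveals that the degraded month delivered less value per unit of spend.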

Cost anti-patterns:

  • Scaling every tier together when only one tier is saturated.
  • Keeping full-fidelity logs forever.
  • High-cardinality metrics for unbounded user, request, or trace identifiers.
  • Cross-zone or cross-region chatter inside tight request paths.
  • Oversized database instances compensating for missing indexes.
  • Running large caches with poor hit rates.
  • Using premium storage for cold data.
  • Keeping preview environments alive indefinitely.
  • Retrying expensive external calls without budgets.
  • Optimizing cloud bill totals without tracking unit cost and user value.

Performance Budgets

A performance budget turns performance into an explicit design constraint. It should be assigned before implementation and enforced during review and release.

Budget examples:

| Budget | Example |
|---|---|
| API latency | p95 below 200 ms and p99 below 800 ms for read path |
| Page load | LCP below 2.5 seconds on target device and network |
| CPU | Handler uses less than 20 ms CPU at p95 |
| Memory | Worker live heap below 512 MB after 6 hour soak |
| Database | No request path performs more than 3 queries |
| Payload | Response body below 100 KB for common case |
| Cache | Hit ratio above 90 percent for hot catalog reads |
| Cost | Cost per 1000 successful requests below target |
| Observability | New metric labels must use bounded cardinality |

Budget review checklist:

  • What user or business outcome owns the budget?
  • Which percentile is the budget based on?
  • Which device, region, tenant tier, or workload is in scope?
  • Is the budget enforced in CI, canary, load test, or production alerting?
  • What happens when the budget is exceeded?
  • Which dependency consumes the largest portion?
  • What is the regression threshold?
  • Is the budget still realistic after product changes?
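
Enforcement in CI or a load-test gate can be as small as the sketch below, which checks the API latency budget from the table (p95 below 200 ms, p99 below 800 ms). The helper names, the minimum sample count, and the sample data are assumptions.

```python
import math


def nearest_rank(values, pct):
    """Nearest-rank percentile of raw samples (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_latency_budget(samples_ms, budgets_ms, min_samples=1000):
    """Return a list of violation messages; an empty list means the budget holds."""
    if len(samples_ms) < min_samples:
        return [f"only {len(samples_ms)} samples, need {min_samples}"]
    violations = []
    for pct, limit in budgets_ms.items():
        observed = nearest_rank(samples_ms, pct)
        if observed >= limit:  # budget is "below limit", so equality fails
            violations.append(f"p{pct} is {observed} ms, budget is below {limit} ms")
    return violations


# Hypothetical load-test samples in ms.
samples = [50] * 940 + [250] * 55 + [900] * 5
violations = check_latency_budget(samples, {95: 200, 99: 800})
for message in violations:
    print(message)
```

Refusing to evaluate below a minimum sample count answers the checklist question about which percentile the budget is based on: a p99 computed from a handful of requests should not pass or fail a release.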

System Design Patterns

| Pattern | Performance value | Capacity or cost risk |
|---|---|---|
| Caching | Reduces latency and backend load | Staleness, invalidation complexity, memory cost |
| Read replicas | Increases read capacity | Replica lag and inconsistent reads |
| Partitioning | Spreads load and data | Hot partitions and operational complexity |
| Batching | Improves throughput and amortizes overhead | Adds latency and larger failure units |
| Async processing | Moves slow work out of request path | Backlog freshness and retry semantics |
| Backpressure | Prevents collapse under overload | Requires caller cooperation |
| Load shedding | Preserves critical capacity | Visible errors and product tradeoffs |
| CDN or edge cache | Reduces origin and network cost | Cache invalidation and personalization limits |
| Connection pooling | Reduces setup cost | Pool exhaustion and multiplied downstream load |
| Compression | Reduces bandwidth | CPU cost and latency for small payloads |

Examples

Example 1: API p99 Regression

Observed:

  • p50 latency remains 45 ms.
  • p95 rises from 180 ms to 420 ms.
  • p99 rises from 600 ms to 2.4 seconds.
  • CPU is 55 percent.
  • Database CPU is 40 percent.
  • Connection pool wait p99 is 1.8 seconds.

Interpretation:

The bottleneck is not average CPU. The application is waiting for database connections. Adding application replicas may make the database connection problem worse if each replica opens its own pool.

Better response:

  • Inspect pool size, query count, and transaction duration.
  • Add pool wait histograms and query traces.
  • Remove long work from transactions.
  • Fix slow queries or missing indexes.
  • Cap per-instance concurrency so requests queue before consuming all downstream resources.
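
The last item can be sketched with a semaphore that bounds inflight work per instance and rejects callers that cannot get a slot quickly. The class name and limits are hypothetical.

```python
import threading


class ConcurrencyLimiter:
    """Caps inflight work per instance; callers that cannot get a slot in time are rejected."""

    def __init__(self, max_inflight: int, max_wait_s: float):
        self._slots = threading.BoundedSemaphore(max_inflight)
        self._max_wait_s = max_wait_s

    def run(self, fn, *args):
        # Wait briefly for a slot instead of piling onto the connection pool.
        if not self._slots.acquire(timeout=self._max_wait_s):
            raise TimeoutError("no capacity: shed or queue upstream")
        try:
            return fn(*args)
        finally:
            self._slots.release()


limiter = ConcurrencyLimiter(max_inflight=1, max_wait_s=0.01)
result = limiter.run(lambda: "ok")
print(result)
```

Setting `max_inflight` at or below the instance's database pool size means excess requests wait or fail at the application edge, where the wait is visible and bounded, rather than inside the pool.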

Example 2: Throughput Drops After Adding Threads

Observed:

  • 8 workers: 40,000 operations/second.
  • 16 workers: 42,000 operations/second.
  • 32 workers: 31,000 operations/second.
  • CPU is high.
  • Lock profiling shows a global metrics lock.

Interpretation:

More workers increased contention. The global lock became a serialization point.

Better response:

  • Replace global counter updates with per-worker counters.
  • Aggregate periodically.
  • Keep metric label sets bounded.
  • Re-run the scaling curve.
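
The first two steps can be sketched structurally as follows. This is only an illustration of the ownership pattern: Python lists give no cache-line guarantees, so the sketch shows the single-writer-per-slot design rather than the hardware effect, and the class name is hypothetical.

```python
import threading


class PerWorkerCounter:
    """Each worker increments its own slot; a reader sums the slots on demand."""

    def __init__(self, workers: int):
        self._slots = [0] * workers

    def increment(self, worker_id: int, amount: int = 1) -> None:
        # Exactly one writer per slot, so no lock is needed on the hot path.
        self._slots[worker_id] += amount

    def total(self) -> int:
        # Approximate while workers are running, exact once they have stopped.
        return sum(self._slots)


counter = PerWorkerCounter(workers=4)


def worker(worker_id: int) -> None:
    for _ in range(10_000):
        counter.increment(worker_id)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.total())
```

The tradeoff matches the mitigation tables earlier in the document: writes stop contending, and reads become a merge that is only approximate while writers are active.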

Example 3: Cache Saves Latency but Raises Cost

Observed:

  • Cache hit ratio is 35 percent.
  • Cache memory fleet is expensive.
  • Origin database still handles most traffic.
  • p99 improves only for a narrow path.

Interpretation:

The cache may be storing low-value or poorly reusable entries. More memory is not automatically better.

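Better response:

  • Measure hit ratio, eviction rate, and memory use per key class.
  • Cache by measured reuse and value instead of caching everything.
  • Shrink or remove cache capacity that does not reduce origin load.
  • Track cost per unit alongside p99 for the affected path.
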
Anti-Patterns

| Anti-pattern | Why it fails | Prefer |
|---|---|---|
| Optimizing without a profile | Time is spent where the team guesses, not where the system hurts | Baseline profile and target metric |
| Average-only dashboards | Hide tail behavior and overload | Percentiles, histograms, and saturation signals |
| Unbounded concurrency | Turns overload into queue explosion | Explicit limits, backpressure, and shedding |
| Infinite retries | Amplifies incidents | Retry budgets, jitter, deadlines, and circuit breakers |
| One shared worker pool | Slow work blocks fast work | Separate pools by class and priority |
| Global mutable state | Forces serialization and coherency traffic | Sharding, ownership, or local aggregation |
| Huge critical sections | One slow operation blocks all callers | Minimize protected state |
| Lock-free by default | Complexity without measured need | Simple locks plus evidence-driven escalation |
| Autoscaling as a fix for all bottlenecks | Scales the wrong layer | Identify the saturated resource |
| Cache everything | Memory cost, staleness, low hit ratio | Cache by measured reuse and value |
| No capacity failover model | Normal load passes but failure load collapses | Plan for zone, node, and dependency loss |
| Observability without budgets | Telemetry becomes the cost driver | Sampling, retention tiers, and cardinality controls |
| Ignoring deployment effects | Rollouts consume capacity and warmup time | Surge capacity, warmup, and canary budgets |
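
As one illustration of the "infinite retries" row, the sketch below bounds retries three ways: a per-call attempt cap, capped exponential backoff with full jitter, and an overall deadline. It is a per-call sketch, not a fleet-wide retry budget, and the function name and defaults are assumptions. Real code would retry only errors known to be transient.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay_s=0.05, max_delay_s=1.0,
                      deadline_s=2.0, sleep=time.sleep, clock=time.monotonic,
                      rng=random.random):
    """Retry with capped exponential backoff, full jitter, and an overall deadline."""
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # sketch only: narrow this to retryable errors
            if attempt == max_attempts:
                raise  # attempt cap exhausted
            # Full jitter: a random delay up to the capped exponential bound.
            delay = rng() * min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            if clock() - start + delay > deadline_s:
                raise  # sleeping would exceed the caller's deadline
            sleep(delay)


attempts = {"count": 0}


def flaky():
    """Hypothetical dependency call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = call_with_retries(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
print(result, attempts["count"])
```

Injecting `sleep`, `clock`, and `rng` keeps the policy testable without waiting on real time, and the deadline check ties the retry loop back to the caller's latency budget.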

Operational Checklists

Performance Investigation

  • State the user-visible symptom and affected operation.
  • Capture p50, p95, p99, throughput, error rate, and saturation for the same time window.
  • Compare against a known good baseline.
  • Check deploys, traffic mix, data size, and dependency changes.
  • Break latency into queue time, service time, dependency time, and retry time.
  • Inspect CPU, memory, disk, network, database, queue, and external API signals.
  • Generate a profile under representative load.
  • Make one change at a time and remeasure.
  • Document the result, including negative findings.

Release Readiness

  • Load test covers baseline, peak, spike, soak, and failover scenarios.
  • SLOs are defined for latency, errors, freshness, and availability.
  • Capacity model includes growth, peak multiplier, and failover.
  • Autoscaling limits protect dependencies and cost.
  • Backpressure and load shedding behavior is explicit.
  • Retry budgets and deadlines are configured.
  • Dashboards include percentiles, saturation, queue age, pool wait, and unit cost.
  • Alerts fire before user-visible exhaustion when possible.
  • Rollback path is tested.
  • Performance budgets are reviewed with product and operations owners.

Cost Review

  • Identify the business unit of demand.
  • Attribute costs by service, tenant class, route, job type, or data volume.
  • Separate fixed, variable, idle, and failure-mode cost.
  • Check top cost drivers over the last 30 to 90 days.
  • Review storage retention and lifecycle policies.
  • Review log, metric, trace, and profile cardinality.
  • Check cross-region and cross-zone transfer.
  • Check caches for hit ratio, eviction rate, and memory waste.
  • Check database indexes, query plans, replicas, and IOPS.
  • Check CI, build cache, preview environment, and artifact retention cost.
  • Track cost per unit alongside latency and throughput.