Engineering Fundamentals

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Software Engineering/01 Engineering Fundamentals.md.

Engineering fundamentals are the ideas that let you predict system behavior below the framework level. They connect source code to runtime behavior: state ownership, memory layout, synchronization, scheduling, resource lifetime, failure handling, and performance under load.

The practical goal is not to know every primitive by name. The goal is to design systems where correctness can be explained before production traffic tests it.

Mental model

| Layer | Main question | Common failure |
| --- | --- | --- |
| Source code | What does this operation mean? | Ambiguous ownership, hidden side effects. |
| Compiler and runtime | What can be optimized, reordered, suspended, or collected? | Assuming source order is execution order. |
| OS scheduler | Who runs, blocks, wakes, or gets preempted? | Latency spikes, starvation, priority inversion. |
| CPU and memory | Which core sees which writes, and when? | Data races, stale reads, false sharing. |
| Distributed system | Which node owns the truth, and how is failure observed? | Split brain, duplicate effects, lost updates. |

Core invariants:

  • Every mutable state cell needs exactly one ownership story.
  • Every concurrent interaction needs a synchronization story.
  • Every resource needs an acquisition, transfer, and release story.
  • Every retryable operation needs an idempotency story.
  • Every failure path needs an observability story.

Advanced programming

Advanced programming is control over abstraction, state, effects, resource lifetime, concurrency, and failure. It is not syntax volume.

Main concerns

| Concern | Design pressure | Useful question |
| --- | --- | --- |
| Data representation | Layout, identity, value semantics, aliasing, mutability. | Can two references mutate the same object? |
| Control flow | Sync calls, async tasks, callbacks, continuations, cancellation. | Where can this operation pause or reenter? |
| Error handling | Typed errors, exceptions, result values, retries, compensations, panic boundaries. | Which errors are expected and which are fatal? |
| Resource management | Memory, file descriptors, sockets, transactions, locks, thread pools. | Who releases the resource on every path? |
| Type systems | Nominal types, structural types, generics, variance, algebraic data types, phantom types. | Can invalid states be represented? |
| Runtime behavior | GC, JIT, event loop, scheduler, stack, heap, CPU cache, syscalls. | What work is hidden behind this abstraction? |
| API design | Minimal surface, explicit ownership, stable contracts, impossible states. | What misuse does the API make easy? |

Abstraction boundaries

Good abstractions hide implementation details, not important effects. A storage API can hide SQL syntax, but it should not hide transaction semantics, consistency level, timeout behavior, idempotency requirements, or whether callbacks can run while a lock is held.

Checklist for an abstraction:

  • State which component owns mutation.
  • State whether calls are synchronous, asynchronous, blocking, or cancellable.
  • State whether callbacks can be reentrant.
  • State whether operations are idempotent.
  • State what ordering is guaranteed.
  • State what happens after partial failure.
  • State how resources are released.

Example: explicit ownership

type BufferOwner:
    buffer
    closed = false

    write(bytes):
        require not closed
        buffer.append(bytes)

    close():
        if closed:
            return
        flush(buffer)
        release(buffer)
        closed = true

The owner is the only code allowed to mutate buffer or call release. Other code may receive snapshots or borrowed views, but not shared mutable authority.
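
The same idea can be sketched in Python. This is a minimal illustration, not a definitive design: the `sink` callable and the `snapshot` method are assumptions added for the example, and this version releases the buffer even when the flush fails, which is one possible policy among several.

```python
class BufferOwner:
    """Sole owner of a mutable buffer; callers only ever receive immutable snapshots."""

    def __init__(self, sink):
        self._buffer = bytearray()   # private: only this class mutates it
        self._sink = sink            # hypothetical flush target: any callable taking bytes
        self._closed = False

    def write(self, data: bytes) -> None:
        if self._closed:
            raise ValueError("write after close")
        self._buffer.extend(data)

    def snapshot(self) -> bytes:
        # bytes(...) copies, so later writes cannot change what the caller holds.
        return bytes(self._buffer)

    def close(self) -> None:
        if self._closed:             # idempotent: a second close is a no-op
            return
        try:
            self._sink(bytes(self._buffer))   # flush
        finally:
            self._buffer.clear()              # release
            self._closed = True
```

A caller that holds a snapshot keeps a consistent view no matter what the owner does afterwards, which is the difference between a borrowed view and shared mutable authority.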

State, identity, and mutability

Hard bugs often come from unclear ownership.

| Concept | Meaning | Engineering consequence |
| --- | --- | --- |
| Value | Replaceable by equal content. | Safe to copy, compare, and persist. |
| Entity | Identity persists across state changes. | Needs versioning and conflict control. |
| Snapshot | Immutable state at a point in time. | Safe to share between threads or tasks. |
| Command | Request to change state. | Must validate intent and permissions. |
| Event | Fact that state changed. | Should be immutable and append-only. |
| Projection | Derived read model. | Can be stale and rebuilt. |
| Capability | Authority to perform an action. | Should be explicit and revocable where possible. |

Design rule: write down the owner for every mutable state cell. If no owner exists, the design is incomplete.

Ownership patterns

| Pattern | Use when | Risk |
| --- | --- | --- |
| Single writer | One actor owns mutation. | Bottleneck if the owner is too broad. |
| Immutable snapshot | Many readers need consistent state. | Copy cost or stale reads. |
| Borrowed reference | Temporary access without ownership transfer. | Lifetime bugs if the reference outlives the owner. |
| Message passing | Ownership moves between tasks. | Backpressure and queue growth. |
| Shared lock-protected state | Multiple threads need coordinated mutation. | Deadlock and contention. |
| Atomic state | State fits into independent machine words. | Memory ordering mistakes. |

State transition table

| Current state | Input | Next state | Guard | Side effect |
| --- | --- | --- | --- | --- |
| Open | Close | Closing | No active writers. | Flush outstanding data. |
| Closing | Flush complete | Closed | All buffers persisted. | Release descriptor. |
| Closing | Flush failed | Failed | Error is not retryable. | Publish failure event. |
| Failed | Retry | Closing | Retry budget remains. | Reopen descriptor. |
| Closed | Write | Closed | Always false. | Reject request. |

State machines make concurrency easier because illegal transitions become visible.
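
One way to make illegal transitions visible in code is to store the table as data and reject anything that is not in it. The sketch below covers only the state and input columns; guards and side effects are left out, and the event names and the `IllegalTransition` exception are assumptions made for the example.

```python
from enum import Enum, auto


class State(Enum):
    OPEN = auto()
    CLOSING = auto()
    CLOSED = auto()
    FAILED = auto()


# (current state, input) -> next state. Anything missing is an illegal transition.
TRANSITIONS = {
    (State.OPEN, "close"): State.CLOSING,
    (State.CLOSING, "flush_complete"): State.CLOSED,
    (State.CLOSING, "flush_failed"): State.FAILED,
    (State.FAILED, "retry"): State.CLOSING,
}


class IllegalTransition(Exception):
    pass


def step(state: State, event: str) -> State:
    """Return the next state, or raise if the table has no such transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise IllegalTransition(f"{event!r} is not allowed in {state.name}") from None
```

A write against a closed resource is then an explicit error rather than a silent no-op: `step(State.CLOSED, "write")` raises `IllegalTransition`.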

Concurrency fundamentals

Concurrency is overlapping work. Parallelism is simultaneous execution. Asynchrony is a control-flow style where work may suspend and resume later. These are related but not interchangeable.

| Model | What it optimizes | Typical primitive | Main risk |
| --- | --- | --- | --- |
| Threads | CPU parallelism and blocking IO tolerance. | Mutex, condition variable, thread pool. | Data races and scheduling nondeterminism. |
| Event loop | Many mostly idle IO operations. | Future, promise, callback, task. | Blocking the loop, cancellation leaks. |
| Actor model | Local ownership with message passing. | Mailbox, channel, supervisor. | Mailbox overload, ordering assumptions. |
| Data parallelism | Same operation over many items. | Work stealing pool, SIMD, GPU kernel. | Shared accumulator contention. |
| Pipeline | Staged processing. | Bounded queue, backpressure. | Head-of-line blocking. |

Synchronization decision table

| Need | Prefer | Avoid when |
| --- | --- | --- |
| Protect small shared state | Mutex | Critical section performs blocking IO. |
| Limit concurrent access to a pool | Semaphore | Tasks can be cancelled without releasing permits. |
| Wait for a predicate | Condition variable | Predicate is not protected by the same lock. |
| Publish immutable data once | Atomic pointer or once primitive | Data lifetime is unclear. |
| Transfer work between owners | Bounded channel | Producers cannot handle backpressure. |
| Count events at high frequency | Sharded counters | Exact instant reads are required. |
| Coordinate phases | Barrier | Participants may fail independently. |

Concurrency primitives

| Primitive | Purpose | Correctness invariant | Failure mode |
| --- | --- | --- | --- |
| Mutex | Exclusive access to shared state. | The protected data is accessed only while locked. | Deadlock, priority inversion, lock convoy, hidden contention. |
| Semaphore | Bound concurrent access to a finite resource. | Every acquired permit is released exactly once. | Permit leak, starvation, overload when the limit is wrong. |
| Read-write lock | Allow many readers or one writer. | Readers do not mutate, writers exclude all others. | Writer starvation, upgrade deadlock, excessive reader optimism. |
| Condition variable | Wait until a predicate changes. | Waiters check the predicate while holding the lock. | Lost wakeup, spurious wakeup, predicate checked outside lock. |
| Atomic variable | Single-location synchronization. | All shared accesses follow the atomic protocol. | Incorrect memory ordering, ABA problem, false sharing. |
| Channel or queue | Transfer ownership or messages between tasks. | Sender and receiver agree on backpressure and close semantics. | Unbounded memory, blocked producers, dropped work. |
| Barrier | Coordinate phases across workers. | Every participant reaches the barrier or the barrier is broken. | Stragglers, stuck participants, cancellation complexity. |
| Latch | Allow waiters to proceed after a one-time signal. | The signal is monotonic. | Waiters block forever if the signal path fails. |
| Once | Run initialization once. | Initialization result is safely published. | Recursive initialization deadlock. |

Rule: shared mutable state needs a synchronization story. Message passing does not remove shared state, because the channel or mailbox is itself shared. It moves the synchronization to the ownership boundary.

Mutexes

A mutex serializes access to a critical section. It protects an invariant, not a line of code.

mutex m
state balance = 0

deposit(amount):
    lock(m)
    try:
        require amount > 0
        balance = balance + amount
    finally:
        unlock(m)

The lock and the protected data should have the same scope. A global mutex protecting unrelated data creates artificial contention and makes deadlocks harder to reason about.
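
A minimal Python sketch of the same pattern, using `threading.Lock` from the standard library. The `Account` class and the `run_deposits` helper are illustrative names, and validation is moved ahead of the lock to keep the critical section as small as possible.

```python
import threading


class Account:
    """The lock lives next to the data it protects and guards one invariant."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._balance = 0            # invariant: equals the sum of all accepted deposits

    def deposit(self, amount: int) -> None:
        if amount <= 0:              # validate before locking to keep the critical section small
            raise ValueError("amount must be positive")
        with self._lock:             # released on every exit path, including exceptions
            self._balance += amount

    def balance(self) -> int:
        with self._lock:
            return self._balance


def run_deposits(account: Account, threads: int, per_thread: int) -> None:
    """Hammer the account from several threads, one unit per deposit."""
    def work() -> None:
        for _ in range(per_thread):
            account.deposit(1)

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Because the lock is a private field of the object that owns the balance, no other code can touch the balance without going through it.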

Mutex checklist

  • Name the invariant protected by the lock.
  • Keep the critical section small and nonblocking.
  • Do not call unknown callbacks while holding the lock.
  • Do not perform network IO while holding the lock.
  • Use try/finally, RAII, defer, or scoped guards to guarantee unlock.
  • Define a global order for multiple locks.
  • Document whether lock acquisition is fair, timed, interruptible, or cancellable.

Lock granularity

| Strategy | Benefit | Cost |
| --- | --- | --- |
| Coarse lock | Simple invariants. | Lower concurrency, convoy risk. |
| Fine-grained locks | Higher concurrency. | More deadlock surfaces. |
| Lock striping | Reduces contention by partitioning state. | Cross-stripe operations are complex. |
| Immutable copy and swap | Readers avoid locking. | Copy cost and atomic publication concerns. |
| Single owner task | No shared mutable state across threads. | Queue latency and owner bottleneck. |

Semaphores

A semaphore controls access to a finite resource. It does not protect state by itself.

semaphore permits = 8

handle_request(req):
    acquire(permits)
    try:
        return call_downstream(req)
    finally:
        release(permits)

Use semaphores for concurrency limits: database connections, outbound requests, file handles, GPU slots, or expensive CPU work.
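
As a rough Python sketch, `threading.BoundedSemaphore` can cap in-flight calls. `LimitedCaller`, the `downstream` callable, and the `peak` gauge are assumptions made for illustration. The gauge has its own lock, since the semaphore limits concurrency but does not protect state.

```python
import threading
import time


class LimitedCaller:
    """Caps the number of concurrent calls to a downstream dependency."""

    def __init__(self, limit: int, downstream) -> None:
        self._permits = threading.BoundedSemaphore(limit)
        self._downstream = downstream        # hypothetical callable standing in for the remote call
        self._gauge_lock = threading.Lock()  # protects the two counters below
        self.in_flight = 0
        self.peak = 0

    def handle(self, req):
        with self._permits:                  # acquire; release is guaranteed on exit
            with self._gauge_lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return self._downstream(req)
            finally:
                with self._gauge_lock:
                    self.in_flight -= 1
```

With a limit of 3, `peak` never exceeds 3 regardless of how many threads call `handle` at once.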

Semaphore failure modes

| Failure | Cause | Prevention |
| --- | --- | --- |
| Permit leak | Cancellation or exception skips release. | Release in finalization scope. |
| Thundering herd | Too many waiters wake at once. | Use bounded queues and fair scheduling. |
| Wrong limit | Limit ignores downstream capacity. | Size from measured bottlenecks. |
| Hidden deadlock | Task holds permit while waiting for work needing another permit. | Avoid nested semaphores or define ordering. |
| Starvation | Unfair wake policy or hot tenant. | Per-tenant limits, fair queues. |

Condition variables

A condition variable lets threads sleep until a predicate may have changed. The predicate is the important part.

Correct pattern:

mutex m
condition not_empty
queue q

take():
    lock(m)
    try:
        while q.is_empty():
            wait(not_empty, m)
        return q.pop_front()
    finally:
        unlock(m)

put(item):
    lock(m)
    try:
        q.push_back(item)
        notify_one(not_empty)
    finally:
        unlock(m)

The waiter uses while, not if, because wakeups can be spurious and other consumers may take the item first.
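
In Python the same pattern can be sketched with `threading.Condition`, which bundles the lock and the condition into one object. `BlockingQueue` is an illustrative name. In real code the standard library's `queue.Queue` already provides this behavior.

```python
import threading
from collections import deque


class BlockingQueue:
    """Unbounded queue whose take() sleeps until an item is available."""

    def __init__(self) -> None:
        self._items = deque()
        self._not_empty = threading.Condition()   # owns its own lock

    def put(self, item) -> None:
        with self._not_empty:
            self._items.append(item)
            self._not_empty.notify()              # wake one waiter

    def take(self):
        with self._not_empty:
            while not self._items:                # re-check the predicate after every wakeup
                self._not_empty.wait()            # releases the lock while sleeping
            return self._items.popleft()
```

The predicate check, the wait, and the pop all happen under the same lock, so no notification can slip in between the check and the wait.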

Lost wakeup pattern

bad_take():
    if q.is_empty():
        wait(not_empty)
    return q.pop_front()

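This is wrong because the predicate is checked without holding the lock, so the queue can change between the check and the wait. A notification sent in that window is lost, and the waiter can sleep forever. It also uses if rather than while, so a spurious wakeup would pop from an empty queue.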

Atomics

Atomics provide indivisible operations on a memory location. Atomicity and ordering are different properties.

| Operation | Typical use | Caveat |
| --- | --- | --- |
| Load | Read a shared flag or pointer. | Ordering determines what other data is visible. |
| Store | Publish a flag or pointer. | Must pair with a compatible read. |
| Exchange | Swap state. | Can drop ownership if old value is ignored. |
| Compare and swap | Conditional update. | ABA and retry loops. |
| Fetch add | Counters, ticket locks, sequence numbers. | Contention and overflow. |
| Fence | Ordering without data access. | Easy to misuse, prefer higher-level primitives. |

Atomic counter

atomic_int count = 0

record_event():
    count.fetch_add(1, relaxed)

read_count():
    return count.load(relaxed)

Relaxed ordering is acceptable for a statistical counter when no other data depends on the count. It is not acceptable for publishing object initialization.

Publish and read initialized data

data payload
atomic_bool ready = false

producer():
    payload = build_payload()
    ready.store(true, release)

consumer():
    if ready.load(acquire):
        use(payload)

The release store makes prior writes visible to an acquire load that observes true.

Memory models and ordering

Memory ordering defines what writes become visible to which threads and in what order. Code that works on one CPU, compiler, or runtime can be wrong under a weaker memory model.

Key concepts

| Concept | Meaning | Practical implication |
| --- | --- | --- |
| Program order | Order written in source code before optimization. | Compilers and CPUs may reorder when allowed. |
| Visibility | Whether a write by one thread can be read by another. | Requires synchronization, not hope. |
| Happens-before | Formal relationship that makes memory effects visible. | Use this as the proof language. |
| Data race | Conflicting accesses without synchronization. | Behavior may be undefined or runtime-specific. |
| Acquire | Prevents following reads and writes from moving before the acquire. | Used by consumers. |
| Release | Prevents preceding reads and writes from moving after the release. | Used by producers. |
| Acquire-release (acq_rel) | Combines acquire and release for read-modify-write operations. | Useful for queues and state machines. |
| Sequential consistency | Operations appear in one global order. | Easiest to reason about, often more expensive. |
| Relaxed | Atomicity without cross-location ordering. | Good for independent counters and IDs. |
| Fence | Explicit ordering constraint. | Last resort when operations cannot carry ordering. |

Memory ordering table

| Ordering | Guarantees | Common use | Common mistake |
| --- | --- | --- | --- |
| Relaxed | Atomic access to one location only. | Metrics counters, unique IDs. | Assuming it publishes other data. |
| Acquire | Later operations stay after the load. | Reading a readiness flag or pointer. | Loading the wrong flag. |
| Release | Earlier operations stay before the store. | Publishing initialized data. | Writing data after the release. |
| Acquire-release | Both sides on read-modify-write. | Lock-free queue indexes. | Forgetting failed CAS ordering. |
| Sequential consistency | Single global order for seq-cst operations. | Simple correctness-first atomics. | Assuming it fixes non-atomic races. |

Happens-before proof template

Use this checklist when reviewing atomic code:

  1. Identify every shared memory location.
  2. Mark each access as atomic or protected by a lock.
  3. Find the write that initializes the data.
  4. Find the release operation after initialization.
  5. Find the acquire operation that observes the release.
  6. Confirm the consumer reads data only after the acquire.
  7. Confirm no non-atomic access races with atomic access.
  8. Confirm object lifetime extends through all readers.

Incorrect publication

# producer thread
payload = build_payload()
ready.store(true, relaxed)

# consumer thread
if ready.load(relaxed):
    use(payload)

The flag is atomic, but the payload is not safely published. The consumer can observe ready without a happens-before edge that makes payload visible.

Cache coherency

Cache coherency is the hardware property that keeps multiple CPU caches consistent for the same memory location. It does not make programs automatically safe.

| Concept | Meaning | Design implication |
| --- | --- | --- |
| Cache line | Unit moved between memory and CPU cache, often 64 bytes. | Independent hot fields can interfere. |
| MESI-style protocols | Modified, exclusive, shared, invalid cache-line states. | Shared writes cause invalidation traffic. |
| Store buffer | Writes may sit in a per-core buffer before becoming globally visible. | Source order is not enough for visibility. |
| NUMA | Memory access cost depends on CPU and memory locality. | Pinning and locality can matter. |
| Coherency traffic | Protocol work to keep caches consistent. | Hot atomics can become bottlenecks. |
| Memory barrier | Prevents specific reorderings. | Should match the language memory model. |

False sharing

False sharing happens when independent variables share a cache line and different cores write them frequently.

struct BadCounters:
    atomic_int worker0
    atomic_int worker1
    atomic_int worker2
    atomic_int worker3

struct BetterCounters:
    padded_atomic_int worker0
    padded_atomic_int worker1
    padded_atomic_int worker2
    padded_atomic_int worker3

The BadCounters fields may live on the same cache line. Each write invalidates the line for other cores even though workers are updating logically independent counters.

Cache-aware design

  • Put hot counters on separate cache lines when contention matters.
  • Prefer sharded counters over one global atomic counter.
  • Batch updates before touching shared state.
  • Keep read-mostly state immutable and publish snapshots.
  • Avoid writing to shared progress indicators in tight loops.
  • Measure under realistic core counts and CPU topology.

Lock-free and wait-free programming

Nonblocking algorithms make progress without ordinary locks, but they are not automatically faster or simpler.

| Class | Guarantee | Meaning |
| --- | --- | --- |
| Obstruction-free | One thread makes progress if it runs alone. | Weak progress guarantee. |
| Lock-free | At least one thread makes progress system-wide. | System progresses, individual starvation possible. |
| Wait-free | Every operation finishes in a bounded number of steps. | Strongest guarantee, hardest to design. |

Building blocks

| Building block | Use | Risk |
| --- | --- | --- |
| Compare and swap | Conditional pointer or state update. | ABA and retry storms. |
| Fetch and add | Counters and ticket allocation. | Hot cache line contention. |
| Atomic pointer swap | Publish replacement structure. | Reclamation of old structure. |
| Version counter | Detect changed state. | Overflow and torn protocols. |
| Hazard pointer | Announce node currently being read. | Per-thread cleanup complexity. |
| Epoch reclamation | Reclaim after all readers leave old epochs. | Stalled readers delay memory reuse. |
| Read-copy-update (RCU) | Readers run without locks over old versions. | Writer and reclamation complexity. |

Compare and swap loop

push(stack, node):
    loop:
        old_head = stack.head.load(acquire)
        node.next = old_head
        if stack.head.compare_exchange(old_head, node, release, relaxed):
            return

This is only a sketch. A real stack also needs safe memory reclamation. Without it, another thread can read a node that has already been freed and reused.

ABA problem

The ABA problem occurs when a location changes from A to B and back to A. A compare-and-swap sees A and assumes nothing changed.

Mitigations:

  • Pair pointers with version counters.
  • Use tagged pointers where alignment leaves spare bits.
  • Use hazard pointers to prevent reuse while readers exist.
  • Use epoch-based reclamation.
  • Prefer tested library algorithms over custom lock-free structures.
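
The first mitigation can be illustrated with a small simulation. Python has no hardware compare-and-swap, so this sketch assumes a lock standing in for the atomic instruction. `VersionedCell` and its method names are hypothetical. The point is only the comparison logic: a value-only check is fooled by A to B to A, while a value-plus-version check is not.

```python
import threading


class VersionedCell:
    """Simulated CAS cell. A lock stands in for the hardware atomic instruction."""

    def __init__(self, value) -> None:
        self._lock = threading.Lock()
        self._value = value
        self._version = 0

    def load(self):
        with self._lock:
            return self._value, self._version

    def cas_value_only(self, expected, new) -> bool:
        # Plain CAS: compares the value alone, so A -> B -> A goes unnoticed.
        with self._lock:
            if self._value != expected:
                return False
            self._value, self._version = new, self._version + 1
            return True

    def cas_versioned(self, expected, expected_version, new) -> bool:
        # Versioned CAS: also compares the counter, which never repeats (ignoring overflow).
        with self._lock:
            if (self._value, self._version) != (expected, expected_version):
                return False
            self._value, self._version = new, self._version + 1
            return True
```

A reader that loaded `("A", 0)` and then slept through two updates will fail `cas_versioned("A", 0, ...)` because the version is now 2, even though the value is `"A"` again.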

When lock-free is appropriate

Use lock-free algorithms when:

  • Profiling shows lock contention is a real bottleneck.
  • Blocking inside a critical path is unacceptable.
  • The data structure is small enough to reason about formally.
  • Memory reclamation is solved.
  • There is a stress test that runs under high contention.

Avoid lock-free algorithms when:

  • A simple mutex meets latency requirements.
  • The team cannot maintain the memory ordering proof.
  • Object lifetime is complex.
  • Fairness matters more than aggregate throughput.

Deadlocks, livelocks, starvation, and priority inversion

| Failure | Definition | Typical cause | Detection |
| --- | --- | --- | --- |
| Deadlock | Participants wait forever for each other. | Cyclic lock acquisition, blocking while holding a lock. | Thread dumps, wait-for graph, stalled progress metrics. |
| Livelock | Participants keep acting but no useful progress occurs. | Repeated retries, conflict symmetry, polite backoff. | High activity with no throughput. |
| Starvation | One participant rarely or never gets service. | Unfair locks, priority scheduling, hot partition. | Per-tenant or per-worker latency histograms. |
| Priority inversion | Low priority work blocks high priority work. | Locks across priority classes. | Scheduler traces, blocked high priority queues. |

Deadlock example

thread_a:
    lock(accounts[1])
    lock(accounts[2])
    transfer()

thread_b:
    lock(accounts[2])
    lock(accounts[1])
    transfer()

Fix by acquiring locks in a stable global order:

transfer(from, to, amount):
    require from.id != to.id        # locking the same account twice would self-deadlock
    first = min(from.id, to.id)
    second = max(from.id, to.id)

    lock(accounts[first])
    try:
        lock(accounts[second])
        try:
            move_money(from, to, amount)
        finally:
            unlock(accounts[second])
    finally:
        unlock(accounts[first])
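
A runnable Python sketch of the ordered-lock fix follows. `BankAccount` and the insufficient-funds check are assumptions added for the example. Each account carries its own lock, and the global order is ascending account id.

```python
import threading


class BankAccount:
    def __init__(self, account_id: int, balance: int) -> None:
        self.id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: BankAccount, dst: BankAccount, amount: int) -> None:
    if src.id == dst.id:
        # A plain Lock is not reentrant, so locking it twice would self-deadlock.
        raise ValueError("cannot transfer to the same account")
    first, second = sorted((src, dst), key=lambda a: a.id)   # global order: ascending id
    with first.lock:
        with second.lock:
            if src.balance < amount:
                raise ValueError("insufficient funds")
            src.balance -= amount
            dst.balance += amount
```

Two threads transferring in opposite directions now both lock the lower id first, so neither can hold one lock while waiting for the other.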

Liveness prevention checklist

  • Define global lock ordering.
  • Avoid blocking IO while holding locks.
  • Keep critical sections small.
  • Avoid nested locks unless the order is documented.
  • Add timeouts to prevent permanent waits, but do not treat timeouts as correctness proof.
  • Use bounded retries with jitter for optimistic concurrency.
  • Use fair queues when per-request latency matters.
  • Enable priority inheritance or avoid cross-priority locks in real-time systems.
  • Track queue age, not only queue length.
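
The "bounded retries with jitter" item can be sketched as below. The function name, its parameters, and the choice of full jitter (a uniformly random delay up to an exponentially growing cap) are assumptions made for illustration. The `sleep` and `rng` hooks exist only so the behavior can be tested without waiting.

```python
import random
import time


def retry_with_jitter(op, attempts=5, base=0.05, cap=1.0,
                      sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds or the retry budget is spent.

    Each delay is drawn uniformly from [0, min(cap, base * 2**n)), so callers
    that collided once are unlikely to retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return op()
        except Exception:              # real code should catch only retryable conflict errors
            if attempt == attempts - 1:
                raise                  # budget exhausted: surface the last error
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The bound matters as much as the jitter: without it, symmetric retries can turn a transient conflict into a livelock.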

Wait-for graph

A cycle in a wait-for graph is a deadlock.
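
Detecting such a cycle is a depth-first search for a back edge. The sketch below assumes the graph is available as a dictionary mapping each participant to the participants it is blocked on. How that graph is collected (thread dumps, lock tables) is outside its scope, and `find_cycle` is an illustrative name. It is recursive, so it suits small graphs.

```python
def find_cycle(wait_for: dict) -> list:
    """Return one cycle of a wait-for graph as a list of nodes, or [] if none exists."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the current DFS path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in wait_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:             # back edge: nxt is already on the current path
                return path[path.index(nxt):]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return []

    for start in list(wait_for):
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return []
```

For example, `find_cycle({"T1": ["T2"], "T2": ["T3"], "T3": ["T2"]})` returns `["T2", "T3"]`: T1 is blocked but is not itself part of the cycle.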

Async runtimes

Async runtimes multiplex many logical tasks onto a smaller set of OS threads. They are powerful for IO-bound workloads and dangerous when blocking work sneaks into the scheduler.

| Runtime concept | Meaning | Failure mode |
| --- | --- | --- |
| Event loop | Polls readiness and schedules tasks. | Blocked by CPU work or sync IO. |
| Task | Suspendable unit of work. | Detached tasks outlive their owner. |
| Future or promise | Represents eventual completion. | Never polled, never awaited, or silently dropped. |
| Executor | Runs tasks. | Starvation from unfair scheduling. |
| Reactor | Watches IO readiness. | Readiness event not drained. |
| Work stealing | Idle workers take tasks from others. | Poor locality or surprising execution thread. |
| Backpressure | Producers slow when consumers lag. | Unbounded memory if absent. |

Async rules

  • Do not block the event loop with CPU-heavy work.
  • Move blocking calls to a dedicated blocking pool.
  • Await every task or intentionally detach it with a lifecycle owner.
  • Use bounded queues by default.
  • Propagate cancellation through child tasks.
  • Treat cancellation as a normal control path.
  • Avoid holding a mutex across await unless the lock is async-aware and the design is deliberate.
  • Prefer structured concurrency for request-scoped work.

Async pipeline

async handle_request(req):
    with cancellation_scope(req.deadline):
        user = await load_user(req.user_id)
        permit = await outbound_limit.acquire()
        try:
            quote = await call_pricing_service(user)
        finally:
            permit.release()
        return render_response(user, quote)

The permit release must happen even when the task is cancelled. Cancellation can arrive at almost any await point.
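
With Python's asyncio, a permit held through `async with` is released when the task is cancelled, including cancellation caused by a deadline. This is a minimal sketch. `call_with_limit`, `call`, and `deadline_s` are hypothetical names, and `asyncio.wait_for` stands in for the cancellation scope in the pseudocode.

```python
import asyncio


async def call_with_limit(limit: asyncio.Semaphore, call, deadline_s: float):
    """Run `call()` under a concurrency permit and a deadline."""

    async def guarded():
        async with limit:            # the permit is released even if this task is cancelled
            return await call()

    # wait_for cancels guarded() when the deadline passes, then raises a timeout error.
    return await asyncio.wait_for(guarded(), timeout=deadline_s)
```

After a timed-out call, the semaphore is immediately available to the next caller, so a slow dependency cannot leak permits one timeout at a time.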

Structured concurrency

Structured concurrency means child tasks cannot outlive the scope that created them unless explicitly transferred to another owner.

async build_page(id):
    async with task_group() as group:
        profile_task = group.spawn(load_profile(id))
        orders_task = group.spawn(load_orders(id))
        recommendations_task = group.spawn(load_recommendations(id))

    return render(
        await profile_task,
        await orders_task,
        await recommendations_task
    )

If one child fails, the group can cancel siblings and join them before leaving the scope. That prevents background work from mutating state after the request is gone.
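
The cancel-and-join behavior can be sketched with a small helper built on `asyncio.gather`. `run_scoped` is a hypothetical name, and this is an approximation of a task group rather than a replacement for one. Python 3.11 and later ship `asyncio.TaskGroup`, which provides this behavior directly.

```python
import asyncio


async def run_scoped(*coros):
    """Run coroutines as children of this call; none of them outlives it.

    On the first failure, or if the caller is cancelled, the remaining children
    are cancelled and joined before the error propagates.
    """
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        return await asyncio.gather(*tasks)
    except BaseException:
        for t in tasks:
            t.cancel()
        # Join: wait until every child has actually finished unwinding.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise
```

The second `gather` is the join step. Without it, the error would propagate while cancelled siblings were still running their cleanup.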

Cancellation

Cancellation is a protocol, not a signal to randomly stop code. It must preserve invariants and release resources.

| Cancellation style | Meaning | Risk |
| --- | --- | --- |
| Cooperative token | Code checks a token at safe points. | Long CPU loops ignore cancellation. |
| Timeout | Deadline triggers cancellation. | Work may continue if not propagated. |
| Interrupt | Runtime interrupts blocking wait. | Cleanup may be skipped in unsafe APIs. |
| Drop future | Future is abandoned. | Destructors or finalizers must release resources. |
| Context cancellation | Parent scope cancels children. | Detached children escape unless owned. |

Cancellation-safe code

async copy_stream(input, output, cancel):
    buffer = acquire_buffer()
    try:
        while not cancel.is_set():
            chunk = await input.read(buffer)
            if chunk.is_empty():
                break
            await output.write(chunk)
    finally:
        release_buffer(buffer)
        await output.flush_or_abort()

Checklist:

  • All acquired resources are released on cancellation.
  • Partial writes are either committed, rolled back, or marked incomplete.
  • Child tasks are cancelled and joined.
  • Permits, locks, leases, and transactions have finalization paths.
  • Cancellation does not convert user-visible state into an impossible state.
  • Timeout errors include enough context to debug the blocked dependency.
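
Two items from the checklist, releasing resources and marking partial work as incomplete, can be sketched with asyncio. The `copy_items` function and the `status` dictionary are assumptions made for illustration. The `released` flag stands in for returning a buffer or closing a handle.

```python
import asyncio


async def copy_items(source, sink: list, status: dict) -> None:
    """Copy items from an async iterator into sink, recording how the copy ended."""
    status["state"] = "in_progress"
    try:
        async for item in source:
            sink.append(item)
        status["state"] = "complete"
    except asyncio.CancelledError:
        status["state"] = "incomplete"   # partial output is labeled, never left ambiguous
        raise                            # re-raise so the canceller sees the cancellation
    finally:
        status["released"] = True        # stands in for returning a buffer or closing a handle
```

Re-raising `CancelledError` matters: swallowing it would make the task look as if it had finished normally and break cancellation for the parent scope.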

Resource lifetime

Resources are anything finite: memory, descriptors, sockets, tasks, permits, locks, timers, temporary files, transactions, leases, and external reservations.

Lifetime table

| Phase | Question | Common technique |
| --- | --- | --- |
| Acquire | What can fail during acquisition? | Factory function, constructor, open call. |
| Validate | Is the resource usable? | Health check, handshake, version check. |
| Transfer | Who owns release after transfer? | Move semantics, unique handle, explicit owner. |
| Use | What invariants must hold? | Scoped guard, lease, transaction. |
| Release | Is release guaranteed on all paths? | RAII, defer, finally, context manager. |
| Observe | How do we know release failed? | Logs, metrics, finalizer alerts. |

Practical resource patterns

| Pattern | Benefit | Example |
| --- | --- | --- |
| RAII or scoped guard | Release bound to lexical scope. | Mutex guard unlocks on exit. |
| Context manager | Explicit block controls lifetime. | Open file inside with. |
| Reference counting | Shared ownership with automatic release. | Shared immutable buffer. |
| Lease with TTL | External resource expires if owner dies. | Distributed lock lease. |
| Pool | Reuse expensive resources. | Database connections. |
| Finalizer | Last-resort cleanup. | Warning when descriptor leaked. |

Lifetime checklist

  • Can acquisition partially succeed?
  • Can release fail?
  • Is release idempotent?
  • Can ownership transfer after acquisition?
  • Can a reference outlive the owner?
  • What happens if cancellation occurs during use?
  • What metric detects leaked resources?
  • What limit prevents unbounded acquisition?
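
The first and fourth questions, partial acquisition and ownership transfer, can be sketched with `contextlib.ExitStack` from the Python standard library. The `resource` context manager and `acquire_all` are hypothetical helpers. The log exists only to make acquisition and release observable.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def resource(name: str, log: list, fail: bool = False):
    """Hypothetical resource: records acquire and release so lifetime is observable."""
    if fail:
        raise RuntimeError(f"cannot acquire {name}")
    log.append(f"acquire {name}")
    try:
        yield name
    finally:
        log.append(f"release {name}")    # runs on success, error, and cancellation


def acquire_all(specs, log):
    """Acquire several resources; if one fails, release those already held."""
    with ExitStack() as stack:
        held = [stack.enter_context(resource(n, log, fail)) for n, fail in specs]
        # All acquired: move ownership of the cleanup to the caller.
        return held, stack.pop_all()
```

On success the caller receives a new `ExitStack` and becomes responsible for calling `close()` on it. On failure the original stack unwinds and releases whatever was already acquired, in reverse order.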

Practical examples

Bounded worker pool

queue jobs capacity 1000
semaphore active = 32

submit(job):
    if not jobs.try_push(job):
        return rejected("queue full")

worker():
    loop:
        job = jobs.pop()
        active.acquire()
        try:
            process(job)
        finally:
            active.release()

Key properties:

  • The queue provides backpressure.
  • The semaphore limits expensive work.
  • Release happens in a finalization path.
  • Rejection is explicit instead of unbounded memory growth.
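
A thread-based Python sketch of these properties follows. `WorkerPool` and the `process` handler are illustrative names. In this variant the fixed number of worker threads bounds concurrent work, so it plays the role of the semaphore in the pseudocode, and a `None` sentinel is assumed as the shutdown signal.

```python
import queue
import threading


class WorkerPool:
    def __init__(self, workers: int, capacity: int, process) -> None:
        self._jobs = queue.Queue(maxsize=capacity)   # bounded: a full queue means reject
        self._process = process                      # hypothetical job handler
        self._threads = [threading.Thread(target=self._run, daemon=True)
                         for _ in range(workers)]
        for t in self._threads:
            t.start()

    def submit(self, job) -> bool:
        try:
            self._jobs.put_nowait(job)
            return True
        except queue.Full:
            return False                             # explicit rejection, no unbounded growth

    def _run(self) -> None:
        while True:
            job = self._jobs.get()
            if job is None:                          # sentinel: shut this worker down
                return
            try:
                self._process(job)
            except Exception:
                pass                                 # a real pool would log or report the failure
            finally:
                self._jobs.task_done()

    def close(self) -> None:
        self._jobs.join()                            # wait for accepted jobs to finish
        for _ in self._threads:
            self._jobs.put(None)
        for t in self._threads:
            t.join()
```

Callers see backpressure as a `False` return from `submit` and can decide whether to retry, shed load, or report the overload upstream.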

Double-checked locking

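Double-checked locking is correct only when the publication is properly synchronized. Without an acquire load paired with a release store, a reader can see the pointer before the object it points to is fully initialized.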

atomic_ptr instance = null
mutex init_lock

get_instance():
    p = instance.load(acquire)
    if p != null:
        return p

    lock(init_lock)
    try:
        p = instance.load(relaxed)
        if p == null:
            p = new_object()
            instance.store(p, release)
        return p
    finally:
        unlock(init_lock)

The release store publishes initialization. The acquire load observes it. A language-level once primitive is usually better.

Read-mostly configuration

atomic_ptr current_config

reload_config():
    new_config = parse_and_validate_config()
    old = current_config.exchange(new_config, acq_rel)
    retire_after_readers_finish(old)

read_config():
    cfg = current_config.load(acquire)
    return cfg.snapshot_view()

This works only if old configurations are not freed while readers still use them.

Avoiding hot global counters

array shard_counts[num_workers]

record(worker_id):
    shard_counts[worker_id].fetch_add(1, relaxed)

read_total():
    total = 0
    for shard in shard_counts:
        total = total + shard.load(relaxed)
    return total

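The total is not an atomic snapshot, because shards can change while they are being summed, but throughput is much better than with a single hot counter. Each shard should sit on its own cache line. Otherwise adjacent shards suffer the false sharing described earlier.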
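
Python has no relaxed atomics, so the sketch below assumes one small lock per shard in place of `fetch_add`. The structure is the same: writers touch only their own shard, and the reader sums all shards. `ShardedCounter` is an illustrative name, and cache-line placement is not something Python code can control.

```python
import threading


class ShardedCounter:
    """One counter per shard, each with its own lock, so writers to different shards do not contend."""

    def __init__(self, shards: int) -> None:
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [0] * shards

    def record(self, worker_id: int) -> None:
        i = worker_id % len(self._counts)
        with self._locks[i]:
            self._counts[i] += 1

    def read_total(self) -> int:
        # Not an atomic snapshot: shards may change while they are being summed.
        total = 0
        for i, lock in enumerate(self._locks):
            with lock:
                total += self._counts[i]
        return total
```

Once all writers have stopped, `read_total` is exact. While they run, it is a close approximation, which is usually what a metric needs.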

Backpressure with async channels

async producer(ch):
    for item in input:
        await ch.send(item)       # waits when channel is full

async consumer(ch):
    async for item in ch:
        await process(item)

An unbounded channel converts downstream slowness into memory growth. A bounded channel converts it into producer waiting or explicit rejection.
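
With asyncio, a bounded `asyncio.Queue` gives the producer-waiting behavior. This is a minimal sketch. `run_pipeline`, the sentinel object, and the peak-depth measurement are assumptions added so the bound can be observed.

```python
import asyncio


async def run_pipeline(items, capacity: int, process):
    """Producer and consumer joined by a bounded queue; returns the peak queue depth."""
    q = asyncio.Queue(maxsize=capacity)
    done = object()                       # sentinel marking end of input
    peak = 0

    async def producer():
        nonlocal peak
        for item in items:
            await q.put(item)             # suspends while the queue is full
            peak = max(peak, q.qsize())
        await q.put(done)

    async def consumer():
        while True:
            item = await q.get()
            if item is done:
                return
            await process(item)

    await asyncio.gather(producer(), consumer())
    return peak
```

However slow `process` is, the queue depth never exceeds `capacity`: the producer simply stops reading input until the consumer catches up.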

Engineering review checklists

Shared state review

  • What data is shared?
  • Is it mutable?
  • Who owns mutation?
  • Which primitive protects it?
  • Are all accesses protected by the same protocol?
  • Is object lifetime longer than all readers?
  • Is there a test that creates real contention?

Locking review

  • What invariant does each lock protect?
  • Can locks be acquired in different orders?
  • Can code block, await, or call callbacks while holding a lock?
  • Are timeouts used as mitigation rather than correctness proof?
  • Is fairness required?
  • Are metrics available for wait time and hold time?

Atomic review

  • Why is a lock insufficient?
  • What exact memory ordering is required?
  • What is the happens-before proof?
  • Are atomic and non-atomic accesses mixed?
  • Is ABA possible?
  • How is removed memory reclaimed?
  • Does the code behave correctly on weak memory architectures?

Async review

  • Can any operation block the event loop?
  • Are all spawned tasks awaited, joined, or owned?
  • Does cancellation release locks, permits, buffers, and transactions?
  • Are queues bounded?
  • Is backpressure visible to callers?
  • Are deadlines propagated to downstream calls?
  • Does the runtime have enough threads for blocking work?

Performance review

  • Is the bottleneck measured or assumed?
  • Is contention visible through wait-time metrics?
  • Are hot fields sharing cache lines?
  • Are counters sharded or batched?
  • Does optimization preserve the correctness proof?
  • Has behavior been tested under realistic core counts?

Mermaid: concurrency design flow

[Diagram not rendered.]

Quick reference

| Problem | First tool to consider | Escalate to |
| --- | --- | --- |
| Protecting compound state | Mutex | Actor, transaction, lock striping. |
| Limiting concurrent IO | Semaphore | Adaptive limiter, per-tenant quotas. |
| Waiting for state change | Condition variable or channel | Event stream, actor. |
| High-frequency metrics | Relaxed atomic sharded counter | Per-core buffers. |
| Read-mostly state | Immutable snapshot | RCU-style publication. |
| Cross-task ownership | Bounded channel | Durable queue. |
| Request-scoped async work | Structured task group | Supervisor with explicit ownership. |
| Low-latency shared queue | Library lock-free queue | Custom nonblocking algorithm only with proof. |