Software Engineering

Staff and Principal-level notes on architecture, distributed systems, reliability, security, delivery, leadership, and AI-native engineering.

Study map

This is the canonical entry point for the Software Engineering knowledge base. Use it to move from broad system judgment to focused topic notes without losing the whole-system context. The goal is Staff and Principal engineering depth: concept mastery, design judgment, operational correctness, verification discipline, and the ability to change complex systems without creating hidden risk.

This note is a map, not a textbook. Leaf notes own depth, proofs, examples, checklists, code, and operational playbooks. This index owns routing, coverage, study order, and the relationships between domains.

How to use this index

Use this page in four modes:

| Mode | Use when | Start here | What good output looks like |
| --- | --- | --- | --- |
| Orientation | You need the shape of the field before diving deeper. | 00 Staff Principal Software Engineering and the #Domain map. | You can explain how correctness, reliability, security, performance, and execution fit together. |
| Design review | You are evaluating an architecture, RFC, migration, incident fix, or platform change. | #Staff and Principal standard and #System review lens. | You identify invariants, failure modes, tradeoffs, verification gates, and owner boundaries. |
| Topic study | You need to master a specific area such as consensus, memory ordering, TLA+, queues, or release safety. | #Required topic coverage matrix. | You know the primary note, adjacent notes, and the question the topic helps answer. |
| Execution planning | You need to sequence learning or project work at senior depth. | #Staff and Principal study path. | You have a staged path from fundamentals to cross-org technical leadership. |

Navigation rules:

  • Start with the numbered notes when you need breadth. Each one is a dense map of content (MOC) for its domain.
  • Jump to existing vault anchors when they already own a topic, especially Data Structures/Data Structures, Design Patterns/Design patterns, Event-Driven Architectures and Event Sourcing, Software testing, Software Supply Chain Security, kubernetes/Kubernetes, AI-Enhanced Software Development, Indexing Large Codebases for AI-Assisted Development, Context-Aware Systems and MCP Protocols, and LLMOps and Model Deployment.
  • Prefer the domain note before creating a new leaf note. If a topic only needs a routing sentence, keep it in the MOC. If it needs examples, proofs, diagrams, incident stories, or implementation detail, split it into an atomic note.
  • Read laterally. Most important software engineering problems cross boundaries: a queueing issue can be a product SLO issue, a database isolation issue can be a security issue, and a deployment strategy can be an organizational design issue.
  • Treat every note as a tool for decisions. Ask: what decision does this help me make, what invariant does it protect, and what evidence would show it is working?

Core map

  1. 00 Staff Principal Software Engineering: Staff and Principal mental model, execution loop, system-property thinking, and review prompts.
  2. 01 Engineering Fundamentals: programming models, concurrency, memory ordering, cache coherency, nonblocking algorithms, liveness, and mutability.
  3. 02 Architecture and Design: code architecture, boundaries, state machines, architecture governance, ADRs, and design review.
  4. 03 Data Structures Algorithms and Complexity: complexity analysis, storage-oriented structures, concurrent data structures, and algorithmic patterns.
  5. 04 Databases Storage and Transactions: storage engines, indexes, isolation, transactions, replication, distributed databases, and migrations.
  6. 05 Distributed Systems: consistency, time, CAP, PACELC, consensus, replicated state machines, quorums, retries, networking, and failure patterns.
  7. 06 Caching Queues and Streaming: caching, invalidation, queueing theory, delivery semantics, retry design, streaming, and Kafka.
  8. 07 APIs Contracts and Integration: API contracts, idempotency, event contracts, schema evolution, and integration risk.
  9. 08 Reliability Observability and Operations: failure modes, observability, alerts, incidents, control planes, and network operations.
  10. 09 Security and Supply Chain: threat modeling, access boundaries, secrets, supply chain controls, application security, and security review.
  11. 10 Testing Verification and Quality Bars: test layers, formal methods, TLA+, concurrency testing, quality bars, and review checklists.
  12. 11 Performance Capacity and Cost: latency, throughput, CPU, memory, contention, capacity planning, load testing, and cost engineering.
  13. 12 Delivery Migrations and Release Engineering: release strategies, migration safety, rollback, CI/CD, and GitOps.
  14. 13 Technical Leadership and Execution: Conway's Law, strategy, leverage, review systems, decisions, and operating rhythm.
  15. 14 AI Native Software Engineering: AI-assisted development, agentic systems, retrieval, context engineering, LLMOps, and model deployment.

Domain map

| Domain | Primary note | Core question | Typical artifacts |
| --- | --- | --- | --- |
| Engineering judgment | 00 Staff Principal Software Engineering | What system property are we changing, and who owns the risk? | Review prompts, decision frames, escalation criteria, learning plans. |
| Programming foundations | 01 Engineering Fundamentals | What behavior does the code have under concurrency, memory effects, mutation, and failure? | Invariants, concurrency contracts, liveness analysis, correctness notes. |
| Architecture | 02 Architecture and Design | What boundaries, states, and dependencies make change safer or more expensive? | ADRs, context diagrams, state machines, interface contracts. |
| Algorithms and structures | 03 Data Structures Algorithms and Complexity | What complexity, access pattern, and data shape does the system rely on? | Complexity budgets, data-structure choices, proof sketches, benchmark plans. |
| Data systems | 04 Databases Storage and Transactions | What guarantees does persistent state actually provide? | Isolation analysis, migration plans, schema evolution rules, backup and restore criteria. |
| Distributed systems | 05 Distributed Systems | What happens when time, networks, membership, and partial failure become unreliable? | Consistency choices, quorum design, consensus notes, retry policies. |
| Flow control | 06 Caching Queues and Streaming | How do we absorb load, hide latency, and move work without violating correctness? | Cache policy, invalidation model, queue topology, stream processing contract. |
| Integration | 07 APIs Contracts and Integration | What do producers and consumers depend on, and how does that contract evolve? | API specs, event schemas, compatibility rules, idempotency keys. |
| Reliability | 08 Reliability Observability and Operations | How does the system fail, and how do humans detect and recover it? | SLOs, dashboards, alerts, runbooks, incident reviews. |
| Security | 09 Security and Supply Chain | What trust boundary can be crossed, and what prevents abuse or compromise? | Threat models, permission matrices, secret handling rules, supply chain attestations. |
| Verification | 10 Testing Verification and Quality Bars | What evidence is strong enough to trust this behavior? | Test strategy, model checks, fuzz tests, quality gates, review checklists. |
| Performance and cost | 11 Performance Capacity and Cost | Where are the bottlenecks, limits, and economic tradeoffs? | Capacity models, latency budgets, load tests, cost envelopes. |
| Delivery | 12 Delivery Migrations and Release Engineering | How do we ship, migrate, and recover without depending on luck? | Release plans, rollback plans, migration scripts, progressive delivery controls. |
| Leadership | 13 Technical Leadership and Execution | How do people, ownership, and incentives shape the technical system? | Strategy docs, decision logs, team topology, operating rhythms. |
| AI-native engineering | 14 AI Native Software Engineering | How do AI tools and systems change build, test, operate, and govern loops? | Prompt protocols, evals, retrieval design, agent boundaries, LLMOps controls. |

Knowledge graph

[Diagram: knowledge graph, not rendered in this export]

System review lens

Use this lens for architecture reviews, incident follow-ups, migration plans, and production readiness checks.

| Lens | Question | Evidence to look for | Related notes |
| --- | --- | --- | --- |
| Correctness | What invariant must remain true under retries, concurrency, partial failure, deploys, and repair jobs? | Explicit invariants, idempotency keys, transaction boundaries, model checks, race tests. | 01 Engineering Fundamentals, 04 Databases Storage and Transactions, 10 Testing Verification and Quality Bars |
| Reliability | What does the user observe when dependencies fail, slow down, enter split brain, or return stale data? | SLOs, graceful degradation, retry budgets, circuit breakers, alert quality, runbooks. | 05 Distributed Systems, 08 Reliability Observability and Operations, 11 Performance Capacity and Cost |
| Operability | Can humans detect, understand, mitigate, and repair bad behavior quickly? | Dashboards, structured logs, traces, safe admin actions, incident playbooks, rollback commands. | 08 Reliability Observability and Operations, 12 Delivery Migrations and Release Engineering |
| Evolvability | Can the design absorb new requirements without hidden coupling or data breakage? | Stable contracts, compatibility tests, bounded contexts, ADRs, schema evolution policy. | 02 Architecture and Design, 07 APIs Contracts and Integration, 13 Technical Leadership and Execution |
| Security | What trust boundary exists, and what prevents privilege escalation, data exposure, or supply chain compromise? | Threat model, least privilege, secret controls, dependency provenance, abuse cases. | 09 Security and Supply Chain, Software Supply Chain Security |
| Performance | What happens at peak, during contention, and when the hot path shifts? | Latency budgets, throughput targets, load tests, profiles, queue depth, capacity model. | 06 Caching Queues and Streaming, 11 Performance Capacity and Cost, Littles law and efficient queue strategy |
| Delivery safety | How is the change deployed, observed, rolled back, and cleaned up? | Progressive rollout, migration phases, rollback plan, release gates, ownership. | 12 Delivery Migrations and Release Engineering, 10 Testing Verification and Quality Bars |

Example review prompt:

> If this change is retried, deployed halfway, processed out of order, observed through stale caches, or rolled back after partial writes, what invariant still holds?

Required topic coverage matrix

This matrix is the minimum advanced-topic routing table for this knowledge base. "Mastery signal" names the behavior that shows the topic is not just memorized.

| Topic | Primary location | Related locations | Mastery signal |
| --- | --- | --- | --- |
| Consistency | 05 Distributed Systems#Consistency models | 04 Databases Storage and Transactions#Isolation and correctness, 07 APIs Contracts and Integration#Event contracts | You can choose and defend a consistency model for a user-visible workflow. |
| Linearizability | 05 Distributed Systems#Consistency models | 10 Testing Verification and Quality Bars#Formal methods and model checking | You can distinguish real-time ordering from serializability and test the claim. |
| Serializability | 04 Databases Storage and Transactions#Isolation and correctness | 04 Databases Storage and Transactions#Transactions | You can explain anomalies and pick isolation levels based on invariants. |
| Eventual consistency | 05 Distributed Systems#Consistency models | 06 Caching Queues and Streaming#Message delivery semantics, 07 APIs Contracts and Integration#Event contracts | You can design reconciliation, read-your-writes expectations, and conflict handling. |
| Mutex | 01 Engineering Fundamentals#Concurrency primitives | 11 Performance Capacity and Cost#Contention | You can explain mutual exclusion, convoying, priority inversion, and lock scope. |
| Semaphores | 01 Engineering Fundamentals#Concurrency primitives | 06 Caching Queues and Streaming#Queueing fundamentals, 11 Performance Capacity and Cost#Contention | You can use permits to bound concurrency without hiding overload. |
| Condition variables | 01 Engineering Fundamentals#Concurrency primitives | 10 Testing Verification and Quality Bars#Concurrency testing | You can reason about wait predicates, missed wakeups, and spurious wakeups. |
| Memory ordering | 01 Engineering Fundamentals#Memory models and ordering | 11 Performance Capacity and Cost#CPU and memory performance | You can explain acquire, release, fences, and why data races invalidate reasoning. |
| Atomic operations | 01 Engineering Fundamentals#Memory models and ordering | 01 Engineering Fundamentals#Nonblocking algorithms | You can use CAS or fetch-add while preserving a clear invariant. |
| Lock free programming, also written lock-free programming | 01 Engineering Fundamentals#Lock-free and wait-free programming | 11 Performance Capacity and Cost#Lock-free and wait-free tradeoffs | You can separate progress guarantees from raw speed and identify ABA risks. |
| Wait-free algorithms | 01 Engineering Fundamentals#Nonblocking algorithms | 03 Data Structures Algorithms and Complexity#Concurrent data structures | You can explain bounded per-thread progress and when the complexity is justified. |
| Deadlocks | 01 Engineering Fundamentals#Liveness failures | 10 Testing Verification and Quality Bars#Concurrency testing | You can identify circular wait and remove it through ordering, timeouts, or ownership changes. |
| Livelocks | 01 Engineering Fundamentals#Liveness failures | 08 Reliability Observability and Operations#Failure modes | You can detect systems doing work without progress and add backoff or coordination. |
| Starvation | 01 Engineering Fundamentals#Liveness failures | 11 Performance Capacity and Cost#Contention | You can identify unfair scheduling and design bounded waiting. |
| Cache coherency | 01 Engineering Fundamentals#Cache coherency | 11 Performance Capacity and Cost#CPU and memory performance | You can connect false sharing, cache lines, and memory visibility to latency. |
| Algorithmic complexity | 03 Data Structures Algorithms and Complexity#Complexity as an engineering tool | 11 Performance Capacity and Cost#Capacity planning | You can turn asymptotic complexity into an operational capacity limit. |
| Storage structures | 03 Data Structures Algorithms and Complexity#Storage oriented structures | 04 Databases Storage and Transactions#Storage engine mental model | You can choose between B-trees, LSM trees, hash indexes, and log structures by workload. |
| Advanced databases | 04 Databases Storage and Transactions#Advanced databases | 03 Data Structures Algorithms and Complexity#Storage oriented structures | You can explain storage, indexing, transactions, replication, and recovery as one system. |
| MVCC | 04 Databases Storage and Transactions#Isolation and correctness | 04 Databases Storage and Transactions#Transactions | You can explain snapshot visibility, write skew, vacuum pressure, and anomaly boundaries. |
| Replication | 05 Distributed Systems#Replication | 04 Databases Storage and Transactions#Replication and storage correctness | You can choose sync, async, leader, follower, and conflict models based on loss tolerance. |
| Quorum | 05 Distributed Systems#Quorums | 04 Databases Storage and Transactions#Distributed databases | You can reason about read and write quorums, failure tolerance, and stale reads. |
| CAP theorem | 05 Distributed Systems#CAP PACELC and failure tradeoffs | 04 Databases Storage and Transactions#Distributed databases | You can apply CAP only under partition and avoid using it as a vague slogan. |
| PACELC | 05 Distributed Systems#CAP PACELC and failure tradeoffs | 11 Performance Capacity and Cost#Latency and throughput | You can connect normal-case latency choices to failure-case consistency choices. |
| Consensus algorithms | 05 Distributed Systems#Consensus algorithms | 08 Reliability Observability and Operations#Control planes | You can explain leader election, log replication, commit, membership, and split brain. |
| Replicated state machines | 05 Distributed Systems#Replicated state machines | 02 Architecture and Design#State machines | You can model commands, deterministic application, replay, and recovery. |
| The clock problem | 05 Distributed Systems#Time clocks and ordering | 10 Testing Verification and Quality Bars#Formal methods and model checking | You can explain wall clocks, monotonic clocks, logical clocks, and clock skew failure modes. |
| Idempotency | 07 APIs Contracts and Integration#Idempotent APIs | 05 Distributed Systems#Idempotency and retries, 12 Delivery Migrations and Release Engineering#Migration safety | You can design duplicate-safe side effects across APIs, jobs, queues, and deploys. |
| Advanced networking | 05 Distributed Systems#Advanced networking | 08 Reliability Observability and Operations#Network operations, kubernetes/Kubernetes | You can debug latency, loss, DNS, load balancing, connection pools, and service discovery. |
| Caching | 06 Caching Queues and Streaming#Caching patterns | 06 Caching Queues and Streaming#Cache invalidation, 11 Performance Capacity and Cost#Latency and throughput | You can state what is cached, why it is safe, how it expires, and how it is invalidated. |
| Queueing theory | 06 Caching Queues and Streaming#Queueing fundamentals | Littles law and efficient queue strategy, 11 Performance Capacity and Cost#Capacity planning | You can use Little's Law to connect arrival rate, service time, queue depth, and latency. |
| Streaming | 06 Caching Queues and Streaming#Streaming systems | Event-Driven Architectures and Event Sourcing, 07 APIs Contracts and Integration#Event contracts | You can reason about ordering, replay, partitions, offsets, schema evolution, and poison records. |
| API contracts | 07 APIs Contracts and Integration#Contract types | 07 APIs Contracts and Integration#Schema evolution | You can evolve producers and consumers independently without breaking compatibility. |
| Observability | 08 Reliability Observability and Operations#Observability pillars | 08 Reliability Observability and Operations#Alert quality | You can design telemetry around user-visible symptoms and debuggable causes. |
| Incident response | 08 Reliability Observability and Operations#Incident response | 13 Technical Leadership and Execution#Leadership operating rhythm | You can coordinate mitigation, communicate clearly, and produce useful learning. |
| Threat modeling | 09 Security and Supply Chain#Threat modeling | 09 Security and Supply Chain#Access boundaries | You can name assets, actors, trust boundaries, abuse cases, and mitigations. |
| Supply chain security | 09 Security and Supply Chain#Supply chain controls | Software Supply Chain Security | You can defend dependency provenance, builds, artifacts, secrets, and CI/CD boundaries. |
| TLA+ | 10 Testing Verification and Quality Bars#Formal methods and model checking | 05 Distributed Systems#Consensus algorithms, 02 Architecture and Design#State machines | You can model a small state machine and find a counterexample before code exists. |
| Concurrency testing | 10 Testing Verification and Quality Bars#Concurrency testing | 01 Engineering Fundamentals#Liveness failures | You can combine stress, schedule control, race detection, and invariant checks. |
| Performance profiling | 11 Performance Capacity and Cost#CPU and memory performance | 11 Performance Capacity and Cost#Load testing | You can distinguish CPU, IO, allocation, lock contention, and queueing bottlenecks. |
| Capacity planning | 11 Performance Capacity and Cost#Capacity planning | 06 Caching Queues and Streaming#Queueing fundamentals | You can forecast load, saturation, headroom, and degradation behavior. |
| Migration safety | 12 Delivery Migrations and Release Engineering#Migration safety | 04 Databases Storage and Transactions#Data migration playbook | You can split schema, code, and data changes into reversible phases. |
| Rollback and rollforward | 12 Delivery Migrations and Release Engineering#Rollback and rollforward | 08 Reliability Observability and Operations#Incident response | You can choose when to revert, when to roll forward, and what data cleanup is needed. |
| GitOps | 12 Delivery Migrations and Release Engineering#GitOps operating model | kubernetes/Kubernetes, 08 Reliability Observability and Operations#Control planes | You can keep desired state, live state, drift, and reconciliation clear. |
| Conway's Law | 13 Technical Leadership and Execution#Conways Law | 02 Architecture and Design#Architecture and organization | You can connect team topology to module boundaries, ownership, and communication cost. |
| Technical strategy | 13 Technical Leadership and Execution#Technical strategy | 00 Staff Principal Software Engineering#The execution loop | You can turn ambiguous technical direction into sequenced bets and decision points. |
| AI-assisted development | 14 AI Native Software Engineering#AI assisted development quality bar | AI-Enhanced Software Development, 10 Testing Verification and Quality Bars#Quality bars | You can raise throughput without lowering review, evidence, or ownership standards. |
| Retrieval and context | 14 AI Native Software Engineering#Retrieval and context | Indexing Large Codebases for AI-Assisted Development, Context-Aware Systems and MCP Protocols | You can design context systems with freshness, relevance, permissions, and auditability. |
| LLMOps | 14 AI Native Software Engineering#LLMOps | LLMOps and Model Deployment, 08 Reliability Observability and Operations#Observability pillars | You can evaluate, deploy, monitor, and roll back model behavior with production discipline. |

Cross-domain trails

Use these trails when the question is broader than one note.

| Question | Trail |
| --- | --- |
| How do I design a reliable stateful service? | 02 Architecture and Design#State machines -> 04 Databases Storage and Transactions#Transactions -> 05 Distributed Systems#Replication -> 08 Reliability Observability and Operations#Observability pillars -> 12 Delivery Migrations and Release Engineering#Migration safety |
| How do I make retries safe? | 07 APIs Contracts and Integration#Idempotent APIs -> 05 Distributed Systems#Idempotency and retries -> 06 Caching Queues and Streaming#Poison messages and retries -> 10 Testing Verification and Quality Bars#Quality bars |
| How do I debug production latency? | 11 Performance Capacity and Cost#Latency and throughput -> 06 Caching Queues and Streaming#Queueing fundamentals -> 08 Reliability Observability and Operations#Observability pillars -> 05 Distributed Systems#Advanced networking |
| How do I ship a risky data change? | 04 Databases Storage and Transactions#Data migration playbook -> 12 Delivery Migrations and Release Engineering#Migration safety -> 10 Testing Verification and Quality Bars#Quality bars -> 08 Reliability Observability and Operations#Incident response |
| How do I evaluate a platform architecture? | 00 Staff Principal Software Engineering#System property checklist -> 02 Architecture and Design#Boundaries -> 13 Technical Leadership and Execution#Conways Law -> 09 Security and Supply Chain#Threat modeling |
| How do I design event-driven systems? | Event-Driven Architectures and Event Sourcing -> 06 Caching Queues and Streaming#Streaming systems -> 07 APIs Contracts and Integration#Event contracts -> 05 Distributed Systems#Time clocks and ordering |
| How do I review AI-native engineering work? | 14 AI Native Software Engineering#AI assisted development quality bar -> Indexing Large Codebases for AI-Assisted Development -> Context-Aware Systems and MCP Protocols -> 10 Testing Verification and Quality Bars#Quality bars |

Staff and Principal standard

A senior engineer can implement a feature. A Staff or Principal engineer can reason about the system property the feature changes.

  • Correctness: the invariant remains true under retries, concurrency, partial failure, deploys, backfills, and repair jobs.
  • Reliability: users get predictable behavior when dependencies fail, slow down, enter split brain, or return stale data.
  • Operability: humans can detect, understand, mitigate, and repair production behavior with bounded confusion.
  • Evolvability: future product changes do not require unplanned rewrites or unsafe coupling across ownership lines.
  • Simplicity: the design minimizes concepts, states, owners, and failure modes while still meeting the real requirement.
  • Verification: tests, simulations, model checks, reviews, and runtime signals are strong enough for the blast radius.
  • Leadership: the decision improves the technical system and the human system that owns it.

Staff and Principal depth is visible in the questions asked before implementation:

  • What is the smallest durable invariant?
  • What is the largest plausible blast radius?
  • What state can become inconsistent, orphaned, duplicated, stale, or unowned?
  • What happens if the operation runs twice, runs halfway, runs out of order, or runs during deploy?
  • Which dependencies are trusted, which are only best effort, and which must fail closed?
  • What must be observable before rollout, during rollout, after rollback, and after cleanup?
  • What future change would this design make easier, and what future change would it make harder?

Staff and Principal study path

This path is ordered by dependency, not by difficulty. Move forward when you can use the topic in a real design review.

[Diagram: study path, not rendered in this export]

1. Foundations: reason about local correctness

Read:

  • 01 Engineering Fundamentals
  • 03 Data Structures Algorithms and Complexity

Practice:

  • Explain mutexes, semaphores, atomics, memory ordering, and liveness failures without relying on framework behavior.
  • Turn a complex function into explicit invariants, state transitions, and failure cases.
  • Estimate the operational impact of an algorithmic choice under realistic load.
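
As a starting point for the first practice item, here is a minimal Python sketch of mutual exclusion stated as an invariant. The `Counter` class and the `run` helper are hypothetical names used only for illustration. The invariant is that the value equals the number of completed increments, and the lock scope covers only the read-modify-write that could break it.

```python
import threading


class Counter:
    """Shared counter. Invariant: value == number of completed increments."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value = 0

    def increment(self) -> None:
        # The read-modify-write is the critical section. Without the lock it
        # is not guaranteed to be atomic, so concurrent updates could be lost.
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def run(threads: int, increments_per_thread: int) -> int:
    """Hammer one Counter from several threads and return the final value."""
    counter = Counter()

    def worker() -> None:
        for _ in range(increments_per_thread):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

A review question that follows from the sketch: which state does the lock protect, and is anything slow (IO, callbacks, other locks) ever executed while it is held?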

Exit standard:

  • You can identify when a bug is caused by mutation, concurrency, aliasing, hidden state, or complexity growth.

2. Architecture: design boundaries that survive change

Read:

  • 02 Architecture and Design

Practice:

  • Draw module boundaries and name the contracts between them.
  • Model a workflow as a state machine before choosing tables, queues, or APIs.
  • Write an ADR that states the rejected alternatives and the operating consequences.
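
One way to practice modeling a workflow as a state machine before choosing storage or transport is to write the transition table first. The sketch below is illustrative only: the order workflow, the `OrderState` values, and the event names are assumptions and do not come from any note in this vault.

```python
from enum import Enum


class OrderState(Enum):
    CREATED = "created"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# Every legal (state, event) pair is listed; anything absent is illegal.
TRANSITIONS = {
    (OrderState.CREATED, "pay"): OrderState.PAID,
    (OrderState.CREATED, "cancel"): OrderState.CANCELLED,
    (OrderState.PAID, "ship"): OrderState.SHIPPED,
    (OrderState.PAID, "cancel"): OrderState.CANCELLED,
}


class IllegalTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def transition(state: OrderState, event: str) -> OrderState:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise IllegalTransition(f"{event!r} not allowed in {state.name}") from None


def replay(events: list[str], start: OrderState = OrderState.CREATED) -> OrderState:
    """Deterministically rebuild the current state from an event history."""
    state = start
    for event in events:
        state = transition(state, event)
    return state
```

For example, `replay(["pay", "ship"])` yields `OrderState.SHIPPED`, while cancelling a shipped order raises `IllegalTransition`. Making the table explicit tends to surface the missing cases, such as refunds after shipping, before they become schema or API decisions.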

Exit standard:

  • You can explain how a design changes failure modes, team ownership, migration paths, and future options.

3. Data and distributed systems: handle partial failure honestly

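Read:

  • 04 Databases Storage and Transactions
  • 05 Distributed Systems
  • 06 Caching Queues and Streaming
  • 07 APIs Contracts and Integration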

Practice:

  • Compare isolation levels through concrete anomaly examples.
  • Design idempotent APIs and queue consumers that tolerate duplicate delivery.
  • Explain when to use cache invalidation, leases, quorums, logical clocks, or consensus.
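
The duplicate-delivery item above can be sketched as a consumer that deduplicates by idempotency key. This is a minimal in-memory illustration with hypothetical `Message` and `Ledger` types. A production version would have to record the key and apply the side effect atomically, for example in one database transaction, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    idempotency_key: str  # chosen by the producer, stable across redeliveries
    account: str
    amount: int


@dataclass
class Ledger:
    balances: dict[str, int] = field(default_factory=dict)
    applied: set[str] = field(default_factory=set)

    def handle(self, msg: Message) -> bool:
        """Apply the message at most once. Returns False for a duplicate."""
        if msg.idempotency_key in self.applied:
            return False  # redelivery: acknowledge without repeating the effect
        self.balances[msg.account] = self.balances.get(msg.account, 0) + msg.amount
        self.applied.add(msg.idempotency_key)
        return True
```

Usage: delivering the same `Message` twice changes the balance once. The invariant being protected is that each idempotency key contributes to the balance at most one time, regardless of how often the queue redelivers.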

Exit standard:

  • You can make correctness claims under stale reads, retries, partitions, clock skew, replica lag, and reprocessing.
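
For the quorum case, the stale-read claim can be checked with simple arithmetic. The sketch below assumes a strict quorum over `n` replicas with read quorum `r` and write quorum `w`; the function names are hypothetical. Overlap (`r + w > n`) means every read quorum intersects every write quorum. On its own it does not provide linearizability, for example under concurrent writes or sloppy quorums.

```python
def _validate(n: int, r: int, w: int) -> None:
    if n < 1 or not (1 <= r <= n) or not (1 <= w <= n):
        raise ValueError("require n >= 1 and 1 <= r, w <= n")


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when any read quorum shares at least one replica with any write quorum."""
    _validate(n, r, w)
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> int:
    """Replicas that can be unavailable while reads and writes both still reach quorum."""
    _validate(n, r, w)
    return n - max(r, w)
```

With `n=3, r=2, w=2` the quorums overlap and one replica may be down. With `n=3, r=1, w=1` latency is lower but a read can miss the latest acknowledged write.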

4. Reliability, security, and verification: prove enough before trust

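Read:

  • 08 Reliability Observability and Operations
  • 09 Security and Supply Chain
  • 10 Testing Verification and Quality Bars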

Practice:

  • Build a test strategy that matches blast radius rather than code volume.
  • Write a small TLA+ model or state-machine model for a workflow with concurrency or retries.
  • Create a threat model that covers assets, actors, trust boundaries, abuse paths, and controls.
  • Design alerts around user-visible symptoms and actionable causes.
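
Before reaching for TLA+, the state-machine modeling item can be practiced with a small explicit-state search. The sketch below is not TLA+ and is not a substitute for a real model checker. It is a hypothetical Python illustration that explores every interleaving of processes incrementing a shared value and returns the first trace that violates the invariant.

```python
from collections import deque

# State: (shared, procs), where procs is a tuple of (pc, local) per process.
# pc 0 = not started, 1 = has read, 2 = done.


def next_states(state, atomic):
    shared, procs = state
    for i, (pc, local) in enumerate(procs):
        if pc == 0 and atomic:
            new_shared, new_proc = shared + 1, (2, local)
            label = f"p{i}: atomic increment"
        elif pc == 0:
            new_shared, new_proc = shared, (1, shared)
            label = f"p{i}: read {shared}"
        elif pc == 1:
            new_shared, new_proc = local + 1, (2, local)
            label = f"p{i}: write {local + 1}"
        else:
            continue  # this process has finished
        yield label, (new_shared, procs[:i] + (new_proc,) + procs[i + 1:])


def invariant(state):
    """Once every process is done, the shared value equals the process count."""
    shared, procs = state
    done = all(pc == 2 for pc, _ in procs)
    return (not done) or shared == len(procs)


def check(n_procs=2, atomic=False):
    """Breadth-first search; returns a violating trace or None if none exists."""
    init = (0, tuple((0, 0) for _ in range(n_procs)))
    queue = deque([(init, [])])
    seen = {init}
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for label, nxt in next_states(state, atomic):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [label]))
    return None
```

`check(atomic=False)` returns a four-step counterexample in which both processes read before either writes (the classic lost update), while `check(atomic=True)` returns `None`. The same shape carries over to retries: add a retry action to `next_states` and state the invariant that must survive it.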

Exit standard:

  • You can say what evidence is sufficient, what evidence is missing, and what residual risk remains.

5. Performance, capacity, and delivery: ship under real constraints

Read:

  • 11 Performance Capacity and Cost
  • 12 Delivery Migrations and Release Engineering

Practice:

  • Build a capacity model using arrival rate, service time, utilization, and queue depth.
  • Profile before optimizing and separate CPU, IO, allocation, lock, and network bottlenecks.
  • Plan a reversible database migration with expand, migrate, and contract phases.
  • Define rollout, rollback, observability, and cleanup gates for a production change.
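
A first-pass capacity model for the arrival-rate item can be written in a few lines. The sketch below assumes a single server with Poisson arrivals and exponential service times (the M/M/1 model), which is a simplification chosen for illustration and rarely matches production exactly. The `mm1_capacity` function and `CapacityEstimate` type are hypothetical names. Little's Law, L = λW, ties the quantities together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityEstimate:
    utilization: float     # rho = arrival rate * service time
    mean_latency_s: float  # W: mean time in system (waiting + service)
    in_system: float       # L: mean requests in system, by Little's Law
    queue_depth: float     # Lq: mean requests waiting, excluding the one in service


def mm1_capacity(arrival_rate_per_s: float, service_time_s: float) -> CapacityEstimate:
    if arrival_rate_per_s < 0 or service_time_s <= 0:
        raise ValueError("arrival rate must be >= 0 and service time > 0")
    rho = arrival_rate_per_s * service_time_s
    if rho >= 1:
        raise ValueError("utilization >= 1: the queue grows without bound")
    latency = service_time_s / (1 - rho)      # M/M/1 mean time in system
    in_system = arrival_rate_per_s * latency  # Little's Law: L = lambda * W
    return CapacityEstimate(rho, latency, in_system, in_system - rho)
```

At 80 requests per second with a 10 ms mean service time, the model gives 80% utilization, about 50 ms mean latency, and roughly 3.2 requests waiting. Raising the arrival rate to 95 per second doubles mean latency again, which is why headroom is usually planned well below full utilization.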

Exit standard:

  • You can ship changes with measurable safety rather than optimism.

6. Technical leadership: scale judgment through people and systems

Read:

  • 13 Technical Leadership and Execution

Practice:

  • Translate ambiguous business pressure into technical strategy and decision points.
  • Use Conway's Law to reason about ownership, interfaces, and review boundaries.
  • Run reviews that improve the system without turning every concern into a blocker.

Exit standard:

  • You can make high-leverage technical decisions legible to engineers, managers, security, operations, and product leaders.

7. AI-native engineering: use AI with production discipline

Read:

  • 14 AI Native Software Engineering
  • AI-Enhanced Software Development
  • Indexing Large Codebases for AI-Assisted Development
  • Context-Aware Systems and MCP Protocols
  • LLMOps and Model Deployment

Practice:

  • Define evals before using an AI behavior in a critical workflow.
  • Treat retrieval and context as governed systems with freshness, relevance, permissions, and auditability.
  • Review generated code by invariants, tests, threat model, and operational behavior, not by surface plausibility.

Exit standard:

  • You can use AI to improve throughput while preserving evidence, ownership, and production accountability.

Existing vault anchors

Use these notes as established source nodes instead of duplicating depth in this MOC:

  • Data Structures/Data Structures
  • Design Patterns/Design patterns
  • Event-Driven Architectures and Event Sourcing
  • Software Engineering glossary
  • Software testing
  • Software Supply Chain Security
  • SWE Review topics
  • Littles law and efficient queue strategy
  • kubernetes/Kubernetes
  • kubernetes/One-Day Kubernetes Crash Course
  • AI-Enhanced Software Development
  • Indexing Large Codebases for AI-Assisted Development
  • Context-Aware Systems and MCP Protocols
  • LLMOps and Model Deployment

Maintenance rules

  • Keep this note canonical: every major Software Engineering domain should be reachable from here in one hop.
  • Keep leaf depth out of this file unless the example improves routing or decision quality.
  • Preserve wikilinks when renaming or splitting notes.
  • Add new topics to the coverage matrix only when they are required for Staff or Principal judgment.
  • Prefer domain notes for intermediate routing and atomic notes for deep worked examples.

Ordered notes

Staff Principal Software Engineering

This note defines the operating model for senior individual contributor engineering at Staff and Principal scope. The rest of the folder breaks this model into specific disciplines....

Engineering Fundamentals

Engineering fundamentals are the ideas that let you predict system behavior below the framework level. They connect source code to runtime behavior: state ownership, memory layout,...

Architecture and Design

Architecture is the set of hard-to-change decisions that shape a system's behavior, constraints, economics, and ability to evolve. It is not only diagrams, frameworks, or service counts. It is...

Data Structures Algorithms and Complexity

This note connects algorithmic fundamentals to production engineering decisions. In production systems, algorithms are not only interview exercises. They shape latency, memory...

Databases Storage and Transactions

Databases are correctness systems, not only persistence tools. A database is a contract between application invariants, storage media, concurrency control, recovery logic, and...

Distributed Systems

Distributed systems are systems where independent components communicate over unreliable networks and fail independently. Their central difficulty is not scale by itself. It is the combination of...

Caching Queues and Streaming

Caches, queues, and streams are coordination tools. They move work across time, space, and process boundaries. They improve latency, cost, throughput, and resilience, but they also create...

APIs Contracts and Integration

APIs are long-lived contracts. Integration quality determines how safely systems can evolve, how quickly teams can ship, and how much production risk appears at service boundaries. A good...

Reliability Observability and Operations

Reliability is a product property. Operations are the feedback loop that keeps reliability real. A system is reliable when users can complete the work they came to do, within a...

Security and Supply Chain

Security engineering is the disciplined reduction of exploitable risk under adversarial conditions. Supply chain security extends that discipline across the path from source code to running...

Testing Verification and Quality Bars

Testing is not only bug detection. It is evidence for system properties: correctness, compatibility, resilience, performance, security, operability, and maintainability. A good...

Performance Capacity and Cost

Performance engineering is the discipline of predicting, measuring, and controlling how a system consumes scarce resources while serving real demand. Capacity engineering asks whether the...

Delivery Migrations and Release Engineering

High-quality software depends on safe change, not only good design. Release engineering is the discipline of turning code, configuration, database changes, infrastructure...

Technical Leadership and Execution

Technical leadership converts judgment into repeatable organizational capability. It is not the act of making every hard decision personally. It is the work of shaping strategy,...

AI Native Software Engineering

AI-native software engineering applies normal engineering rigor to systems where language models assist, decide, retrieve, generate, test, review, operate, or act through tools. The...
