Software Engineering
Staff and Principal-level notes on architecture, distributed systems, reliability, security, delivery, leadership, and AI-native engineering.
Study map
This is the canonical entry point for the Software Engineering knowledge base. Use it to move from broad system judgment to focused topic notes without losing the whole-system context. The goal is Staff and Principal engineering depth: concept mastery, design judgment, operational correctness, verification discipline, and the ability to change complex systems without creating hidden risk.
This note is a map, not a textbook. Leaf notes own depth, proofs, examples, checklists, code, and operational playbooks. This index owns routing, coverage, study order, and the relationships between domains.
How to use this index
Use this page in four modes:
| Mode | Use when | Start here | What good output looks like |
|---|---|---|---|
| Orientation | You need the shape of the field before diving deeper. | 00 Staff Principal Software Engineering and the #Domain map. | You can explain how correctness, reliability, security, performance, and execution fit together. |
| Design review | You are evaluating an architecture, RFC, migration, incident fix, or platform change. | #Staff and Principal standard and #System review lens. | You identify invariants, failure modes, tradeoffs, verification gates, and owner boundaries. |
| Topic study | You need to master a specific area such as consensus, memory ordering, TLA+, queues, or release safety. | #Required topic coverage matrix. | You know the primary note, adjacent notes, and the question the topic helps answer. |
| Execution planning | You need to sequence learning or project work at senior depth. | #Staff and Principal study path. | You have a staged path from fundamentals to cross-org technical leadership. |
Navigation rules:
- Start with the numbered notes when you need breadth. They are dense MOCs for each domain.
- Jump to existing vault anchors when they already own a topic, especially Data Structures/Data Structures, Design Patterns/Design patterns, Event-Driven Architectures and Event Sourcing, Software testing, Software Supply Chain Security, kubernetes/Kubernetes, AI-Enhanced Software Development, Indexing Large Codebases for AI-Assisted Development, Context-Aware Systems and MCP Protocols, and LLMOps and Model Deployment.
- Prefer the domain note before creating a new leaf note. If a topic only needs a routing sentence, keep it in the MOC. If it needs examples, proofs, diagrams, incident stories, or implementation detail, split it into an atomic note.
- Read laterally. Most important software engineering problems cross boundaries: a queueing issue can be a product SLO issue, a database isolation issue can be a security issue, and a deployment strategy can be an organizational design issue.
- Treat every note as a tool for decisions. Ask: what decision does this help me make, what invariant does it protect, and what evidence would show it is working?
Core map
- 00 Staff Principal Software Engineering: Staff and Principal mental model, execution loop, system-property thinking, and review prompts.
- 01 Engineering Fundamentals: programming models, concurrency, memory ordering, cache coherency, nonblocking algorithms, liveness, and mutability.
- 02 Architecture and Design: code architecture, boundaries, state machines, architecture governance, ADRs, and design review.
- 03 Data Structures Algorithms and Complexity: complexity analysis, storage-oriented structures, concurrent data structures, and algorithmic patterns.
- 04 Databases Storage and Transactions: storage engines, indexes, isolation, transactions, replication, distributed databases, and migrations.
- 05 Distributed Systems: consistency, time, CAP, PACELC, consensus, replicated state machines, quorums, retries, networking, and failure patterns.
- 06 Caching Queues and Streaming: caching, invalidation, queueing theory, delivery semantics, retry design, streaming, and Kafka.
- 07 APIs Contracts and Integration: API contracts, idempotency, event contracts, schema evolution, and integration risk.
- 08 Reliability Observability and Operations: failure modes, observability, alerts, incidents, control planes, and network operations.
- 09 Security and Supply Chain: threat modeling, access boundaries, secrets, supply chain controls, application security, and security review.
- 10 Testing Verification and Quality Bars: test layers, formal methods, TLA+, concurrency testing, quality bars, and review checklists.
- 11 Performance Capacity and Cost: latency, throughput, CPU, memory, contention, capacity planning, load testing, and cost engineering.
- 12 Delivery Migrations and Release Engineering: release strategies, migration safety, rollback, CI/CD, and GitOps.
- 13 Technical Leadership and Execution: Conway's Law, strategy, leverage, review systems, decisions, and operating rhythm.
- 14 AI Native Software Engineering: AI-assisted development, agentic systems, retrieval, context engineering, LLMOps, and model deployment.
Domain map
| Domain | Primary note | Core question | Typical artifacts |
|---|---|---|---|
| Engineering judgment | 00 Staff Principal Software Engineering | What system property are we changing, and who owns the risk? | Review prompts, decision frames, escalation criteria, learning plans. |
| Programming foundations | 01 Engineering Fundamentals | What behavior does the code have under concurrency, memory effects, mutation, and failure? | Invariants, concurrency contracts, liveness analysis, correctness notes. |
| Architecture | 02 Architecture and Design | What boundaries, states, and dependencies make change safer or more expensive? | ADRs, context diagrams, state machines, interface contracts. |
| Algorithms and structures | 03 Data Structures Algorithms and Complexity | What complexity, access pattern, and data shape does the system rely on? | Complexity budgets, data-structure choices, proof sketches, benchmark plans. |
| Data systems | 04 Databases Storage and Transactions | What guarantees does persistent state actually provide? | Isolation analysis, migration plans, schema evolution rules, backup and restore criteria. |
| Distributed systems | 05 Distributed Systems | What happens when time, networks, membership, and partial failure become unreliable? | Consistency choices, quorum design, consensus notes, retry policies. |
| Flow control | 06 Caching Queues and Streaming | How do we absorb load, hide latency, and move work without violating correctness? | Cache policy, invalidation model, queue topology, stream processing contract. |
| Integration | 07 APIs Contracts and Integration | What do producers and consumers depend on, and how does that contract evolve? | API specs, event schemas, compatibility rules, idempotency keys. |
| Reliability | 08 Reliability Observability and Operations | How does the system fail, and how do humans detect the failure and recover from it? | SLOs, dashboards, alerts, runbooks, incident reviews. |
| Security | 09 Security and Supply Chain | What trust boundary can be crossed, and what prevents abuse or compromise? | Threat models, permission matrices, secret handling rules, supply chain attestations. |
| Verification | 10 Testing Verification and Quality Bars | What evidence is strong enough to trust this behavior? | Test strategy, model checks, fuzz tests, quality gates, review checklists. |
| Performance and cost | 11 Performance Capacity and Cost | Where are the bottlenecks, limits, and economic tradeoffs? | Capacity models, latency budgets, load tests, cost envelopes. |
| Delivery | 12 Delivery Migrations and Release Engineering | How do we ship, migrate, and recover without depending on luck? | Release plans, rollback plans, migration scripts, progressive delivery controls. |
| Leadership | 13 Technical Leadership and Execution | How do people, ownership, and incentives shape the technical system? | Strategy docs, decision logs, team topology, operating rhythms. |
| AI-native engineering | 14 AI Native Software Engineering | How do AI tools and systems change build, test, operate, and govern loops? | Prompt protocols, evals, retrieval design, agent boundaries, LLMOps controls. |
Knowledge graph
System review lens
Use this lens for architecture reviews, incident follow-ups, migration plans, and production readiness checks.
| Lens | Question | Evidence to look for | Related notes |
|---|---|---|---|
| Correctness | What invariant must remain true under retries, concurrency, partial failure, deploys, and repair jobs? | Explicit invariants, idempotency keys, transaction boundaries, model checks, race tests. | 01 Engineering Fundamentals, 04 Databases Storage and Transactions, 10 Testing Verification and Quality Bars |
| Reliability | What does the user observe when dependencies fail, slow down, enter split brain, or return stale data? | SLOs, graceful degradation, retry budgets, circuit breakers, alert quality, runbooks. | 05 Distributed Systems, 08 Reliability Observability and Operations, 11 Performance Capacity and Cost |
| Operability | Can humans detect, understand, mitigate, and repair bad behavior quickly? | Dashboards, structured logs, traces, safe admin actions, incident playbooks, rollback commands. | 08 Reliability Observability and Operations, 12 Delivery Migrations and Release Engineering |
| Evolvability | Can the design absorb new requirements without hidden coupling or data breakage? | Stable contracts, compatibility tests, bounded contexts, ADRs, schema evolution policy. | 02 Architecture and Design, 07 APIs Contracts and Integration, 13 Technical Leadership and Execution |
| Security | What trust boundary exists, and what prevents privilege escalation, data exposure, or supply chain compromise? | Threat model, least privilege, secret controls, dependency provenance, abuse cases. | 09 Security and Supply Chain, Software Supply Chain Security |
| Performance | What happens at peak, during contention, and when the hot path shifts? | Latency budgets, throughput targets, load tests, profiles, queue depth, capacity model. | 06 Caching Queues and Streaming, 11 Performance Capacity and Cost, Littles law and efficient queue strategy |
| Delivery safety | How is the change deployed, observed, rolled back, and cleaned up? | Progressive rollout, migration phases, rollback plan, release gates, ownership. | 12 Delivery Migrations and Release Engineering, 10 Testing Verification and Quality Bars |
Example review prompt:
> If this change is retried, deployed halfway, processed out of order, observed through stale caches, or rolled back after partial writes, what invariant still holds?
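
One way to turn this prompt into an executable check is to replay the same change under duplication and reordering and assert that the end state does not move. The sketch below assumes every update carries a version number assigned by a single writer; `apply_update`, `replay`, and the sample updates are hypothetical names used only for illustration. It does not cover partial writes or rollback.

```python
from itertools import permutations


def apply_update(state, update):
    """Apply an update only if it is newer than what the state already holds."""
    key, version, value = update
    current = state.get(key)
    if current is None or version > current[0]:
        state[key] = (version, value)
    return state


def replay(updates):
    """Apply a sequence of updates to an empty state and return the result."""
    state = {}
    for update in updates:
        apply_update(state, update)
    return state


# Hypothetical updates for one record; versions are assumed to be monotonic.
updates = [("record-1", 1, "created"), ("record-1", 2, "paid"), ("record-1", 3, "shipped")]
expected = replay(updates)

# Invariant: every ordering, with every update delivered twice, converges.
for order in permutations(updates):
    assert replay(list(order) + list(order)) == expected
```

The version guard makes the operation insensitive to both duplicates and ordering, which is why the invariant can be checked by brute force over all permutations.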
Required topic coverage matrix
This matrix is the minimum advanced-topic routing table for this knowledge base. "Mastery signal" names the behavior that shows the topic is not just memorized.
Cross-domain trails
Use these trails when the question is broader than one note.
Staff and Principal standard
A senior engineer can implement a feature. A Staff or Principal engineer can reason about the system property the feature changes.
- Correctness: the invariant remains true under retries, concurrency, partial failure, deploys, backfills, and repair jobs.
- Reliability: users get predictable behavior when dependencies fail, slow down, enter split brain, or return stale data.
- Operability: humans can detect, understand, mitigate, and repair production behavior with bounded confusion.
- Evolvability: future product changes do not require unplanned rewrites or unsafe coupling across ownership lines.
- Simplicity: the design minimizes concepts, states, owners, and failure modes while still meeting the real requirement.
- Verification: tests, simulations, model checks, reviews, and runtime signals are strong enough for the blast radius.
- Leadership: the decision improves the technical system and the human system that owns it.
Staff and Principal depth is visible in the questions asked before implementation:
- What is the smallest durable invariant?
- What is the largest plausible blast radius?
- What state can become inconsistent, orphaned, duplicated, stale, or unowned?
- What happens if the operation runs twice, runs halfway, runs out of order, or runs during deploy?
- Which dependencies are trusted, which are only best effort, and which must fail closed?
- What must be observable before rollout, during rollout, after rollback, and after cleanup?
- What future change would this design make easier, and what future change would it make harder?
Staff and Principal study path
This path is ordered by dependency, not by difficulty. Move forward when you can use the topic in a real design review.
1. Foundations: reason about local correctness
Read:
- 01 Engineering Fundamentals
- 03 Data Structures Algorithms and Complexity
- Data Structures/Data Structures
- Design Patterns/Design patterns
Practice:
- Explain mutexes, semaphores, atomics, memory ordering, and liveness failures without relying on framework behavior.
- Turn a complex function into explicit invariants, state transitions, and failure cases.
- Estimate the operational impact of an algorithmic choice under realistic load.
Exit standard:
- You can identify when a bug is caused by mutation, concurrency, aliasing, hidden state, or complexity growth.
2. Architecture: design boundaries that survive change
Read:
- 02 Architecture and Design
- 07 APIs Contracts and Integration
- Event-Driven Architectures and Event Sourcing
Practice:
- Draw module boundaries and name the contracts between them.
- Model a workflow as a state machine before choosing tables, queues, or APIs.
- Write an ADR that states the rejected alternatives and the operating consequences.
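
As an illustration of modeling a workflow as a state machine before choosing storage or transport, the sketch below writes the allowed transitions down as data. The order workflow, its states, and its events are assumptions made up for this example, not part of any note in this knowledge base.

```python
from enum import Enum


class OrderState(Enum):
    CREATED = "created"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# Hypothetical transition table: (current state, event) -> next state.
# Anything not listed here is illegal by construction.
TRANSITIONS = {
    (OrderState.CREATED, "pay"): OrderState.PAID,
    (OrderState.CREATED, "cancel"): OrderState.CANCELLED,
    (OrderState.PAID, "ship"): OrderState.SHIPPED,
    (OrderState.PAID, "cancel"): OrderState.CANCELLED,
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def step(state, event):
    """Return the next state, or raise if the transition is not defined."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} is not allowed in state {state.name}") from None
```

With the table explicit, questions such as "can a shipped order be cancelled?" are answered by inspection, and tables, queues, or APIs can be chosen afterwards to enforce the same transitions.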
Exit standard:
- You can explain how a design changes failure modes, team ownership, migration paths, and future options.
3. Data and distributed systems: handle partial failure honestly
Read:
- 04 Databases Storage and Transactions
- 05 Distributed Systems
- 06 Caching Queues and Streaming
- Littles law and efficient queue strategy
Practice:
- Compare isolation levels through concrete anomaly examples.
- Design idempotent APIs and queue consumers that tolerate duplicate delivery.
- Explain when to use cache invalidation, leases, quorums, logical clocks, or consensus.
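
A minimal sketch of a consumer that tolerates duplicate delivery, assuming each message carries a unique `id` chosen by the producer. `CreditConsumer` and the message fields are hypothetical. The state is kept in memory for brevity; in a real system the deduplication record and the side effect would need to commit atomically, which this sketch does not show.

```python
class CreditConsumer:
    """Applies credit messages at most once per message id."""

    def __init__(self):
        self.balances = {}      # account -> balance
        self.processed = set()  # message ids that have already been applied

    def handle(self, message):
        """Apply the message once; return False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self.processed:
            return False  # duplicate delivery: acknowledge without re-applying
        account = message["account"]
        self.balances[account] = self.balances.get(account, 0) + message["amount"]
        self.processed.add(message_id)
        return True


consumer = CreditConsumer()
message = {"id": "m-1", "account": "acct-1", "amount": 50}
first = consumer.handle(message)
second = consumer.handle(message)  # redelivery of the same message
```

Unlike a version-guarded overwrite, a credit is not naturally idempotent, so the consumer has to remember which messages it has already applied.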
Exit standard:
- You can make correctness claims under stale reads, retries, partitions, clock skew, replica lag, and reprocessing.
4. Reliability, security, and verification: prove enough before trust
Read:
- 08 Reliability Observability and Operations
- 09 Security and Supply Chain
- 10 Testing Verification and Quality Bars
- Software testing
- Software Supply Chain Security
Practice:
- Build a test strategy that matches blast radius rather than code volume.
- Write a small TLA+ model or state-machine model for a workflow with concurrency or retries.
- Create a threat model that covers assets, actors, trust boundaries, abuse paths, and controls.
- Design alerts around user-visible symptoms and actionable causes.
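
For the state-machine modeling exercise, a few lines of exhaustive search can stand in for a model checker on very small problems. The sketch below is not TLA+; it is a hand-rolled Python exploration, under the assumption that each worker increments a shared counter with a non-atomic read followed by a write. The function name and state encoding are illustrative.

```python
def explore(workers=2):
    """Enumerate every interleaving of non-atomic read-then-write increments.

    Returns the set of final counter values reachable across all schedules.
    """
    # State: (shared counter, per-worker program counter, per-worker local copy)
    initial = (0, (0,) * workers, (None,) * workers)
    stack, seen, finals = [initial], {initial}, set()
    while stack:
        counter, pcs, local = stack.pop()
        if all(pc == 2 for pc in pcs):
            finals.add(counter)  # every worker has finished
            continue
        for w in range(workers):
            if pcs[w] == 0:
                # Step 1: read the shared counter into a local copy.
                nxt = (counter, pcs[:w] + (1,) + pcs[w + 1:],
                       local[:w] + (counter,) + local[w + 1:])
            elif pcs[w] == 1:
                # Step 2: write local + 1 back to the shared counter.
                nxt = (local[w] + 1, pcs[:w] + (2,) + pcs[w + 1:], local)
            else:
                continue  # this worker is already done
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return finals


final_values = explore()
# The invariant "final counter == number of workers" fails if 1 is reachable.
lost_update_possible = 1 in final_values
```

The search finds the schedule where both workers read before either writes, which is the lost update that a lock or an atomic increment would rule out.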
Exit standard:
- You can say what evidence is sufficient, what evidence is missing, and what residual risk remains.
5. Performance, capacity, and delivery: ship under real constraints
Read:
- 11 Performance Capacity and Cost
- 12 Delivery Migrations and Release Engineering
- kubernetes/Kubernetes
- kubernetes/One-Day Kubernetes Crash Course
Practice:
- Build a capacity model using arrival rate, service time, utilization, and queue depth.
- Profile before optimizing and separate CPU, IO, allocation, lock, and network bottlenecks.
- Plan a reversible database migration with expand, migrate, and contract phases.
- Define rollout, rollback, observability, and cleanup gates for a production change.
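
A back-of-the-envelope capacity model can be sketched from arrival rate and service time alone. The version below assumes a single server with Poisson arrivals and exponential service times (the M/M/1 model), which real services rarely match exactly, so treat the outputs as rough bounds rather than predictions. The function name and the example numbers are illustrative.

```python
def mm1_capacity(arrival_rate, service_time):
    """Estimate utilization, latency, and queue depth for a single M/M/1 server.

    arrival_rate: requests per second; service_time: seconds per request.
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        raise ValueError("arrival rate exceeds capacity; the queue grows without bound")
    time_in_system = service_time / (1.0 - utilization)  # M/M/1 mean response time
    in_system = arrival_rate * time_in_system            # Little's law: L = lambda * W
    waiting = in_system - utilization                    # requests queued, not in service
    return {
        "utilization": utilization,
        "time_in_system": time_in_system,
        "in_system": in_system,
        "waiting": waiting,
    }


# Illustrative numbers: 80 requests/s against a 10 ms mean service time.
model = mm1_capacity(arrival_rate=80, service_time=0.01)
```

Rerunning the model at higher arrival rates shows how quickly response time and queue depth grow as utilization approaches 1, which is the main reason to keep headroom.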
Exit standard:
- You can ship changes with measurable safety rather than optimism.
6. Technical leadership: scale judgment through people and systems
Read:
- 13 Technical Leadership and Execution
- 00 Staff Principal Software Engineering
- SWE Review topics
Practice:
- Translate ambiguous business pressure into technical strategy and decision points.
- Use Conway's Law to reason about ownership, interfaces, and review boundaries.
- Run reviews that improve the system without turning every concern into a blocker.
Exit standard:
- You can make high-leverage technical decisions legible to engineers, managers, security, operations, and product leaders.
7. AI-native engineering: use AI with production discipline
Read:
- 14 AI Native Software Engineering
- AI-Enhanced Software Development
- Indexing Large Codebases for AI-Assisted Development
- Context-Aware Systems and MCP Protocols
- LLMOps and Model Deployment
Practice:
- Define evals before using an AI behavior in a critical workflow.
- Treat retrieval and context as governed systems with freshness, relevance, permissions, and auditability.
- Review generated code by invariants, tests, threat model, and operational behavior, not by surface plausibility.
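
Defining evals first can be as small as a fixed case list plus a pass-rate gate that is written before the behavior is wired in. The sketch below is an assumption-heavy illustration: `run_evals`, the cases, and the 0.9 threshold are hypothetical, exact-match scoring is a simplification, and `keyword_router` is a plain function standing in for the AI behavior under test.

```python
def run_evals(behavior, cases, threshold):
    """Score a callable against fixed cases and gate on a minimum pass rate."""
    failures = [case for case in cases if behavior(case["input"]) != case["expected"]]
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate, "approved": pass_rate >= threshold, "failures": failures}


# Hypothetical eval cases for a support-ticket routing behavior.
CASES = [
    {"input": "refund order 17", "expected": "refund"},
    {"input": "where is my parcel", "expected": "tracking"},
    {"input": "cancel my plan", "expected": "cancel"},
    {"input": "refund and cancel", "expected": "escalate"},
]


def keyword_router(text):
    """Stand-in for the AI behavior under test; not a real model call."""
    for label, word in (("refund", "refund"), ("cancel", "cancel"), ("tracking", "parcel")):
        if word in text:
            return label
    return "escalate"


report = run_evals(keyword_router, CASES, threshold=0.9)
```

Here the stand-in misroutes the ambiguous ticket, so the gate withholds approval and the failing case is preserved as evidence for review.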
Exit standard:
- You can use AI to improve throughput while preserving evidence, ownership, and production accountability.
Existing vault anchors
Use these notes as established source nodes instead of duplicating depth in this MOC:
- Data Structures/Data Structures
- Design Patterns/Design patterns
- Event-Driven Architectures and Event Sourcing
- Software Engineering glossary
- Software testing
- Software Supply Chain Security
- SWE Review topics
- Littles law and efficient queue strategy
- kubernetes/Kubernetes
- kubernetes/One-Day Kubernetes Crash Course
- AI-Enhanced Software Development
- Indexing Large Codebases for AI-Assisted Development
- Context-Aware Systems and MCP Protocols
- LLMOps and Model Deployment
Maintenance rules
- Keep this note canonical: every major Software Engineering domain should be reachable from here in one hop.
- Keep leaf depth out of this file unless the example improves routing or decision quality.
- Preserve wikilinks when renaming or splitting notes.
- Add new topics to the coverage matrix only when they are required for Staff or Principal judgment.
- Prefer domain notes for intermediate routing and atomic notes for deep worked examples.
Ordered notes
Staff Principal Software Engineering
This note defines the operating model for senior individual contributor engineering at Staff and Principal scope. The rest of the folder breaks this model into specific disciplines....
Engineering Fundamentals
Engineering fundamentals are the ideas that let you predict system behavior below the framework level. They connect source code to runtime behavior: state ownership, memory layout,...
Architecture and Design
Architecture is the set of hard to change decisions that shape a system's behavior, constraints, economics, and ability to evolve. It is not only diagrams, frameworks, or service counts. It is...
Data Structures Algorithms and Complexity
This note connects algorithmic fundamentals to production engineering decisions. In production systems, algorithms are not only interview exercises. They shape latency, memory...
Databases Storage and Transactions
Databases are correctness systems, not only persistence tools. A database is a contract between application invariants, storage media, concurrency control, recovery logic, and...
Distributed Systems
Distributed systems are systems where independent components communicate over unreliable networks and fail independently. Their central difficulty is not scale by itself. It is the combination of...
Caching Queues and Streaming
Caches, queues, and streams are coordination tools. They move work across time, space, and process boundaries. They improve latency, cost, throughput, and resilience, but they also create...
APIs Contracts and Integration
APIs are long lived contracts. Integration quality determines how safely systems can evolve, how quickly teams can ship, and how much production risk appears at service boundaries. A good...
Reliability Observability and Operations
Reliability is a product property. Operations are the feedback loop that keeps reliability real. A system is reliable when users can complete the work they came to do, within a...
Security and Supply Chain
Security engineering is the disciplined reduction of exploitable risk under adversarial conditions. Supply chain security extends that discipline across the path from source code to running...
Testing Verification and Quality Bars
Testing is not only bug detection. It is evidence for system properties: correctness, compatibility, resilience, performance, security, operability, and maintainability. A good...
Performance Capacity and Cost
Performance engineering is the discipline of predicting, measuring, and controlling how a system consumes scarce resources while serving real demand. Capacity engineering asks whether the...
Delivery Migrations and Release Engineering
High quality software depends on safe change, not only good design. Release engineering is the discipline of turning code, configuration, database changes, infrastructure...
Technical Leadership and Execution
Technical leadership converts judgment into repeatable organizational capability. It is not the act of making every hard decision personally. It is the work of shaping strategy,...
AI Native Software Engineering
AI native software engineering applies normal engineering rigor to systems where language models assist, decide, retrieve, generate, test, review, operate, or act through tools. The...