Kubernetes Mastery Roadmap
- Reading time
- 8 min read
- Word count
- 1484 words
- Diagram count
- 0 diagrams
Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/kubernetes/00 Kubernetes Mastery Roadmap.md.
Purpose: This note gives a rigorous learning path for mastering Kubernetes from core mental models to production operations without confusing local practice with production readiness.
Kubernetes Mastery Roadmap
This roadmap is organized around the sequence that makes Kubernetes easier to reason about: API first, reconciliation second, workloads third, networking and storage fourth, then security, operations, and platform engineering. Use Kubernetes as the root map and 01 Kubernetes Mental Model and Architecture as the foundation note.
Mastery Outcomes
By the end of this path, you should be able to:
- Explain Kubernetes as a declarative API and reconciliation system.
- Trace a workload from
kubectl applythrough admission, storage, scheduling, kubelet execution, networking, and status reporting. - Debug Pods, Deployments, Services, DNS, scheduling, image pulls, probes, quotas, permissions, and node pressure.
- Design label, namespace, RBAC, resource, and policy conventions for a multi-team cluster.
- Separate local-cluster learning from production evidence.
- Know where Kubernetes stops and where surrounding platform choices begin.
Phase 1: The Core Model
Goal: Build the mental model before memorizing commands.
| Learn | Why it matters | Practice |
|---|---|---|
| Desired state | Everything starts with API objects that describe intent. | Apply a Pod, Deployment, Service, and ConfigMap. Read metadata, spec, and status. |
| Reconciliation loops | Kubernetes is made of controllers converging state. | Scale a Deployment and watch ReplicaSet and Pod changes. |
| API server and etcd | kube-apiserver is the durable front door, etcd is the backing store. | Use kubectl get --raw /api and inspect API groups. |
| Control plane vs worker nodes | Decisions and execution are separate. | Compare kubectl get nodes -o wide with running Pods. |
| Events and conditions | Troubleshooting begins with controller feedback. | Run kubectl describe pod on a failing Pod and read Events before logs. |
Read: 01 Kubernetes Mental Model and Architecture
Review questions:
- What changes when you edit
spec.replicason a Deployment? - Why does a ReplicaSet create Pods instead of the Deployment directly acting forever on each Pod?
- What is stored in etcd, and why should ordinary components not bypass kube-apiserver?
Phase 2: Object Metadata and API Shape
Goal: Treat Kubernetes manifests as API contracts, not as inert YAML.
Core topics:
apiVersion,kind,metadata,spec, andstatus.- Names, UIDs, resource versions, generations, and managed fields.
- API groups and versions such as
apps/v1,batch/v1,networking.k8s.io/v1, andrbac.authorization.k8s.io/v1. - Namespaced vs cluster-scoped resources.
- Labels and selectors.
- Annotations.
- Owner references and garbage collection.
- Finalizers.
- Server-side apply and field ownership.
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
name: web
namespace: apps
labels:
app.kubernetes.io/name: web
app.kubernetes.io/component: frontend
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: web
template:
metadata:
labels:
app.kubernetes.io/name: web
app.kubernetes.io/component: frontend
spec:
containers:
- name: web
image: ghcr.io/example/web:1.0.0
ports:
- containerPort: 8080
Common mistake: changing a Deployment selector after creation. Selectors are identity contracts. If selector and template labels do not match, the controller cannot safely manage the intended Pods.
Phase 3: Workloads
Goal: Know which workload API matches which application shape.
| Workload | Use for | Avoid when |
|---|---|---|
| Pod | Smallest schedulable unit, debugging, static manifests. | Long-lived app management by humans. Prefer higher-level controllers. |
| Deployment | Stateless replicated services and rolling updates. | Stable network identity or ordered storage is required. |
| ReplicaSet | Deployment implementation detail. | Direct user management in normal app delivery. |
| StatefulSet | Stable identity, ordered rollout, persistent identity. | Stateless apps that do not need stable names. |
| DaemonSet | One Pod per selected node, agents, log collectors, CNIs. | Arbitrary application scaling. |
| Job | Run-to-completion batch work. | Long-running services. |
| CronJob | Scheduled batch work. | Work that must run continuously. |
Production guidance:
- Use Deployments for most stateless services.
- Use StatefulSets only when the application actually needs stable identity or storage semantics.
- Use Jobs for migrations carefully. Make them idempotent, observable, and bounded.
- Use DaemonSets for node agents, not for convenience scaling.
Phase 4: Scheduling and Resources
Goal: Understand why Pods do or do not land on nodes.
Core topics:
- Resource requests and limits.
- QoS classes.
- Node selectors, affinity, anti-affinity, taints, and tolerations.
- Topology spread constraints.
- Priority classes and preemption.
- Quotas and LimitRanges.
- Node pressure and evictions.
Troubleshooting sequence:
- Read
kubectl describe pod. - Inspect scheduling Events.
- Check resource requests against node allocatable capacity.
- Check taints, tolerations, affinity, and topology constraints.
- Check namespace quota.
- Check Pod Security Admission, admission webhooks, and image policy.
Phase 5: Networking and DNS
Goal: Build a precise model of Pod IPs, Services, DNS, ingress, and Gateway API.
Core topics:
- Every Pod gets an IP in the cluster network.
- Services provide stable virtual endpoints over changing Pod backends.
- kube-proxy or an equivalent dataplane programs Service routing.
- CoreDNS resolves Service and Pod DNS names.
- NetworkPolicy is enforced by a capable CNI plugin, not by the API object alone.
- Ingress is an API object that needs an ingress controller.
- Gateway API is an official add-on API family that requires CRDs and a controller.
Common mistake: creating a NetworkPolicy and assuming it works without confirming that the CNI enforces it.
Phase 6: Configuration, Secrets, and Storage
Goal: Separate configuration delivery, sensitive material, and durable data.
| Area | Kubernetes object | Production concern |
|---|---|---|
| App config | ConfigMap | Rollout triggers, validation, size limits, config drift. |
| Sensitive values | Secret | Encryption at rest, external secret managers, RBAC, rotation, audit. |
| Persistent data | PersistentVolumeClaim | StorageClass behavior, reclaim policy, snapshots, backup, restore. |
| Driver integration | CSI | Topology, expansion, snapshots, mount options, failure modes. |
Guidance:
- Do not treat Kubernetes Secrets as complete secret management. They are API objects with access controlled by RBAC and cluster storage configuration.
- Keep restore drills close to storage design. A backup that has never been restored is only an assumption.
- For stateful systems, understand application consistency, not only volume consistency.
Phase 7: Security and Policy
Goal: Build defense in depth.
Core topics:
- Authentication and authorization.
- RBAC verbs, resources, subresources, namespaced roles, and cluster roles.
- ServiceAccounts and projected tokens.
- Admission control.
- Pod Security Admission, stable since Kubernetes v1.25.
- PodSecurityPolicy removal in Kubernetes v1.25.
- NetworkPolicy.
- Image provenance, registry policy, and runtime hardening.
- Audit logs.
Tradeoff table:
| Control | Strength | Limitation |
|---|---|---|
| RBAC | Controls API access precisely. | Does not control in-Pod network or filesystem behavior. |
| Pod Security Admission | Built-in baseline for Pod security levels. | Namespace-label driven and less expressive than custom policy engines. |
| NetworkPolicy | Limits traffic paths when enforced by CNI. | Default behavior depends on policies and plugin support. |
| Admission policy | Blocks invalid or risky objects before persistence. | Requires careful rollout to avoid breaking delivery. |
| Runtime detection | Detects behavior after start. | Reactive unless paired with prevention and response. |
Phase 8: Observability and Troubleshooting
Goal: Debug from API state outward.
Recommended order:
- Object status and conditions.
- Events.
- Controller logs.
- Pod logs.
- Probe results.
- Node status and kubelet logs.
- Metrics.
- Network traces and DNS checks.
- Admission and audit logs.
Useful commands:
kubectl get deploy,rs,pod,svc,endpointslice -n apps
kubectl describe pod -n apps web-abc123
kubectl get events -n apps --sort-by=.lastTimestamp
kubectl auth can-i create pods --as system:serviceaccount:apps:web
kubectl explain deployment.spec.strategy
kubectl diff -f manifests/
Phase 9: Production Platform Practice
Goal: Move from "I can run a Pod" to "this cluster is operable."
Production areas:
- Cluster lifecycle: upgrades, version skew, node replacement, etcd backup.
- GitOps: desired state is reviewed, applied, and audited.
- Policy: admission checks before runtime failures.
- Capacity: requests, limits, autoscaling, quotas, and headroom.
- Networking: CNI, DNS, Service routing, ingress, Gateway, certificates.
- Storage: CSI, snapshots, backups, restore tests, reclaim policies.
- Security: RBAC, Pod Security Admission, network policy, image controls, audit.
- Reliability: SLOs, alerts, runbooks, incident drills.
Production-vs-local check:
| Claim | Local evidence | Production evidence |
|---|---|---|
| The manifest is valid | kubectl apply --dry-run=server or local apply. | Server-side dry run against the target cluster plus admission policy result. |
| The app starts | Pod runs in kind or minikube. | Pod runs with production security context, resources, CNI, DNS, registry, and storage. |
| The rollout is safe | Manual test passes. | Rollout strategy, probes, alerts, rollback plan, and capacity headroom are verified. |
| The service is reachable | Port-forward works. | Service, Gateway or Ingress, DNS, TLS, firewall, and client paths work. |
| The data is safe | PVC exists. | Backup, restore, retention, encryption, and failure-domain behavior are tested. |
Suggested Study Loop
For each topic:
- Read the concept.
- Create the smallest object that demonstrates it.
- Break it intentionally.
- Read Events and status before changing anything.
- Fix it.
- Explain which controller reconciled the state.
- Write one production caveat.
Example loop for Deployments:
- Create a Deployment with three replicas.
- Change the image to a missing tag.
- Observe
ImagePullBackOff. - Inspect ReplicaSet, Pods, Events, and Deployment conditions.
- Fix the tag.
- Explain how the Deployment controller, ReplicaSet controller, scheduler, kubelet, and runtime each participated.
Mastery Checklist
- I can explain Kubernetes without saying "it runs Docker."
- I can draw the control plane and worker node components from memory.
- I can explain why kube-apiserver, not etcd, is the integration boundary.
- I can describe scheduler decisions using requests, constraints, and node state.
- I can explain CoreDNS and Service discovery.
- I can distinguish Ingress from Gateway API.
- I can explain why Gateway API is an add-on official project, not a core object installed in every cluster.
- I can explain server-side apply and field ownership.
- I can debug a stuck deletion caused by a finalizer.
- I can debug a Deployment that is not progressing.
- I can explain why local cluster success is not production proof.
Cross Links
- Root map: Kubernetes
- Architecture model: 01 Kubernetes Mental Model and Architecture
- Crash course: 00 Kubernetes Mastery Roadmap