Linux Ecosystem Tools and Learning Projects

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/linux-systems-engineering/18 Linux Ecosystem Tools and Learning Projects.md.

Purpose: Turn Linux systems engineering into a practical tool map and project sequence that is safe on local learning machines, disciplined on production Linux hosts, and realistic for production clusters.

Related notes: 09 cgroups Namespaces Containers and Runtime Isolation, 17 Production Operations Troubleshooting and Runbooks

Operating Position

Tools are only useful when you know their failure domain. A local learning machine is for destructive experiments, kernel feature exploration, and repeatable break-fix drills. A production Linux host is for narrow, audited observation and reversible repair. A production cluster is for declarative changes, node isolation, orchestrator-native evidence, and controlled replacement.

The learning objective is not to memorize commands. It is to recognize which layer owns the symptom:

[Diagram: which layer owns the symptom; not rendered in this snapshot.]

Tool Map

| Domain | First tools | Deeper tools | Production caution |
| --- | --- | --- | --- |
| Process state | ps, top, pidstat, pstree, /proc/$pid | strace, perf, gdb, /proc/$pid/stack | Tracing can slow or perturb critical processes. |
| CPU | uptime, mpstat, pidstat, top | perf top, perf record, flame graphs, eBPF profilers | Sampling overhead and symbol handling matter on busy hosts. |
| Memory | free, vmstat, /proc/meminfo, pmap | PSI, heap dumps, cgroup memory files, smem | Heap dump collection can be large and sensitive. |
| Disk space | df, du, findmnt, lsof +L1 | filesystem debug tools, runtime GC tools | Deleting unknown files can corrupt services or erase evidence. |
| Block IO | iostat, lsblk, blkid | blktrace, bpftrace, vendor telemetry | Storage commands can be destructive; read man pages before repair modes. |
| Networking | ip, ss, dig, curl, tcpdump | nft, iptables-save, conntrack, ethtool, eBPF datapath tools | Packet capture can expose secrets and customer data. |
| TLS | openssl s_client, curl -v, date | CA store inspection, service mesh tooling | Do not disable verification as a fix. |
| systemd | systemctl, journalctl, systemd-analyze | unit drop-ins, coredumpctl, resource controls | Emergency drop-ins must be removed after incident. |
| Containers | podman, docker, nerdctl, crictl, ctr | runc, nsenter, cgroup files, runtime logs | ctr and runtime internals can bypass orchestrator expectations. |
| Kubernetes node | kubectl describe, events, logs | crictl, kubelet logs, CNI logs, node shell | Prefer cordon and drain before node-level mutation. |
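
Most first-line process tools are readers of /proc. As a minimal sketch of that idea, the helper names below are hypothetical and the code assumes a Linux host with /proc mounted; it parses the same `/proc/$pid/status` fields that ps and top summarize.

```python
from pathlib import Path


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def read_status(pid="self"):
    # Assumption: Linux with /proc mounted; reading other users' processes
    # may need extra privileges.
    return parse_status(Path(f"/proc/{pid}/status").read_text())
```

Calling `read_status()` on a lab machine and comparing `Name`, `State`, and `VmRSS` with `ps` output is one way to tie a command to the kernel object it reads.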

Local Learning Machine vs Production Host vs Cluster

| Practice | Local learning machine | Production Linux host | Production cluster |
| --- | --- | --- | --- |
| Namespace experiments | Use unshare, nsenter, rootless Podman, disposable VMs. | Observe namespaces with /proc and nsenter only when needed. | Use debug containers or node debug workflow under policy. |
| cgroup experiments | Use systemd-run, stress-ng, direct cgroup files. | Prefer systemd unit properties or runtime-owned cgroups. | Change pod resources and node config declaratively. |
| Network breaks | Add bad routes, DNS failures, MTU mismatch in lab. | Use timed rollback and console access. | Use scoped NetworkPolicy or test namespaces. |
| Storage breaks | Fill disks, exhaust inodes, corrupt throwaway filesystems. | Preserve evidence and escalate before repair. | Replace or cordon nodes when runtime storage is suspect. |
| Runtime internals | Use runc, ctr, crictl freely on lab nodes. | Use read-only inspection first. | Avoid bypassing kubelet except during node incident response. |

Containers And Runtime Tools

Use 09 cgroups Namespaces Containers and Runtime Isolation as the conceptual base. The tools below reveal different layers of the same container.

| Tool | Best use | Trap |
| --- | --- | --- |
| podman | Rootless local containers, pods, image builds, systemd integration. | Rootless behavior can differ from production rootful clusters. |
| docker | Common developer workflow and image packaging. | Desktop environments hide Linux VM details. |
| nerdctl | containerd-native Docker-like workflow. | It exposes containerd concepts that may differ from Docker assumptions. |
| ctr | Low-level containerd inspection and debugging. | Not a friendly operational interface; can bypass higher-level contracts. |
| crictl | Kubernetes CRI node inspection. | Requires runtime endpoint and node access; not a replacement for kubectl. |
| runc | OCI runtime learning and reproduction. | Too low-level for routine production operations. |
| nsenter | Enter existing process namespaces. | Enter only the namespaces needed; full entry can distort context. |
| unshare | Create namespaces for experiments. | Lab-only unless you are writing a runtime or controlled service. |
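
The nsenter advice to enter only the namespaces needed presupposes knowing which namespaces actually differ. A rough sketch, with hypothetical helper names and assuming a Linux host, compares the `/proc/$pid/ns` links of two processes:

```python
import os


def read_namespaces(pid):
    """Map namespace name to link target, e.g. 'net' -> 'net:[4026531840]'."""
    ns_dir = f"/proc/{pid}/ns"
    # Assumption: Linux; reading another process's links may require root.
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


def differing_namespaces(ns_a, ns_b):
    """Names of namespaces present in both maps whose identities differ."""
    shared = ns_a.keys() & ns_b.keys()
    return sorted(name for name in shared if ns_a[name] != ns_b[name])
```

If `differing_namespaces(read_namespaces(1), read_namespaces(target_pid))` reports only `mnt` and `net`, then `nsenter -m -n` is probably enough and the PID, UTS, and other contexts can stay on the host side.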

Local project: build an OCI bundle by hand.

  1. Create a rootfs from a tiny image or directory.
  2. Generate or write an OCI config.
  3. Run it with runc.
  4. Inspect /proc/$pid/ns, /proc/$pid/cgroup, and mounts from host and inside.
  5. Repeat with added network namespace, read-only rootfs, dropped capabilities, and a cgroup memory limit.
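
As a sketch of steps 2 and 5, the function below builds a trimmed OCI config as a Python dict. The field names follow the OCI runtime specification, but the particular values (command, memory limit, `rootfs` path) are assumptions for a lab bundle, and the result has not been validated against any specific runc version. `runc spec` generates a fuller default that is a safer starting point.

```python
import json

CAPABILITY_SETS = ("bounding", "effective", "inheritable", "permitted", "ambient")


def hardened_config(rootfs="rootfs", memory_limit_bytes=64 * 1024 * 1024):
    """Trimmed OCI config showing the knobs toggled in step 5 (illustrative only)."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            # Assumption: the rootfs contains /bin/sh and cat (e.g. busybox).
            "args": ["/bin/sh", "-c", "cat /proc/self/cgroup"],
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
            # Dropped capabilities: every set is empty.
            "capabilities": {name: [] for name in CAPABILITY_SETS},
            "noNewPrivileges": True,
        },
        # Read-only root filesystem.
        "root": {"path": rootfs, "readonly": True},
        "mounts": [{"destination": "/proc", "type": "proc", "source": "proc"}],
        "linux": {
            # "network" here is the added network namespace.
            "namespaces": [
                {"type": t} for t in ("pid", "mount", "ipc", "uts", "network")
            ],
            # cgroup memory limit in bytes.
            "resources": {"memory": {"limit": memory_limit_bytes}},
        },
    }


def to_json(config):
    return json.dumps(config, indent=2)
```

Writing `to_json(hardened_config())` to `config.json` next to the `rootfs` directory gives a bundle to try with `runc run` in a disposable VM. Removing one restriction at a time shows which field maps to which kernel primitive.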

Production lesson: the same primitives are present under kubelet, containerd, CRI-O, and runc, but the right control surface is usually Kubernetes or systemd, not manual runtime commands.

cgroup And Resource Projects

Project 1: CPU throttling lab.

  1. Run a CPU-bound process under systemd-run --scope or a transient service.
  2. Apply a CPU quota.
  3. Watch cpu.stat, latency, and throughput.
  4. Remove the quota and compare.
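
Step 3 is easier to reason about as a difference between two cpu.stat snapshots. The sketch below assumes cgroup v2 with the cpu controller enabled for the cgroup (otherwise the throttling fields are absent); the function names are illustrative.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines, the format of cgroup v2 files such as cpu.stat."""
    pairs = (line.split() for line in text.splitlines() if line.strip())
    return {key: int(value) for key, value in pairs}


def throttle_summary(before, after):
    """Compare two parsed cpu.stat snapshots taken around a test run."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return {
        "periods": periods,
        "throttled_periods": throttled,
        # Fraction of enforcement periods in which the quota ran out.
        "throttled_fraction": throttled / periods if periods else 0.0,
        "throttled_ms": (after["throttled_usec"] - before["throttled_usec"]) / 1000,
    }
```

A throttled fraction near 1.0 means the workload exhausted its quota in almost every period, which is usually where the latency in step 3 comes from.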

Project 2: cgroup OOM lab.

  1. Run a memory allocator under a low memory max.
  2. Watch memory.current, memory.events, dmesg, and exit code.
  3. Repeat with swap enabled, if the host has swap, and compare the outcome.
  4. Compare the log wording for a process-level OOM, a cgroup OOM, and a host OOM.
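
For step 2, memory.events is a set of counters, so the useful signal is what changed during the run. A minimal sketch, assuming cgroup v2 and hypothetical helper names:

```python
def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    pairs = (line.split() for line in text.splitlines() if line.strip())
    return {key: int(value) for key, value in pairs}


def new_events(before, after):
    """Counters that increased between two snapshots, with their deltas."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }
```

A rising `max` means usage kept hitting memory.max, while `oom_kill` counts processes in the cgroup that were actually killed. The victim is killed with SIGKILL, which a shell reports as exit status 137.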

Project 3: pids controller lab.

  1. Run a controlled fork or thread generator.
  2. Set pids.max.
  3. Observe failed forks and pids.events.
  4. Connect the behavior to web workers, thread pools, and runaway process supervisors.
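
The pids controller files are small enough to read by hand, but a sketch makes the semantics explicit. It assumes cgroup v2, where pids.max holds either a number or the literal `max`, and the function names are illustrative.

```python
def pids_headroom(pids_max_text, pids_current_text):
    """Remaining task slots, or None when pids.max is 'max' (no limit)."""
    limit = pids_max_text.strip()
    if limit == "max":
        return None
    return int(limit) - int(pids_current_text.strip())


def limit_hits(pids_events_text):
    """Value of the 'max' counter in pids.events (0 if the key is missing)."""
    for line in pids_events_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "max":
            return int(parts[1])
    return 0
```

When headroom reaches zero, fork and clone fail with EAGAIN, which is the error a thread pool or process supervisor sees in step 3.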

Tradeoff table:

| Control | Good for | Risk |
| --- | --- | --- |
| CPU quota | Hard tenant ceiling. | Latency from throttling. |
| CPU weight | Fair sharing under contention. | No strict cap. |
| Memory max | Protect host and neighbors. | OOM if working set bursts. |
| Memory low or min | Protect important workload memory. | Can starve lower-priority work. |
| IO max | Prevent noisy IO tenant. | Can stretch recovery and batch jobs. |
| cpuset | Isolation and NUMA-aware workloads. | Fragmented capacity and poor scheduler flexibility. |
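
The difference between a quota and a weight is arithmetic. The sketch below assumes the cgroup v2 cpu.max format of `quota period` in microseconds and treats weights as proportional shares among siblings that are all busy; the function names are hypothetical.

```python
def cpu_max_to_cores(cpu_max_text):
    """Convert a cpu.max value such as '50000 100000' into a CPU ceiling."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None  # no hard ceiling
    return int(quota) / int(period)


def weight_shares(weights):
    """Share of CPU each sibling gets when all of them are runnable."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}
```

`cpu_max_to_cores("50000 100000")` is 0.5 of a CPU no matter how idle the host is. `weight_shares({"web": 100, "batch": 300})` gives 0.25 and 0.75, but only under contention; either tenant can use more when the other is idle.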

systemd Learning Projects

systemd is the normal production entry point for Linux services. Learn it as a service manager, dependency graph, cgroup manager, log index, and boot coordinator.

Project: production-shaped service.

  1. Write a small service with ExecStart, Restart, RestartSec, User, Group, WorkingDirectory, and environment file.
  2. Add ReadWritePaths, ProtectSystem, PrivateTmp, NoNewPrivileges, and capability restrictions.
  3. Add resource controls such as MemoryMax and CPUQuota.
  4. Break the binary path and observe status=203/EXEC.
  5. Create a restart loop and observe start-limit behavior.
  6. Use a drop-in override, then remove it.
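
Steps 1 to 3 can be sketched as a unit file rendered from data. The directive names are real systemd options, but the service name, paths, user, and every value below are hypothetical placeholders for a lab VM.

```python
def render_unit(sections):
    """Render {'Section': [(key, value), ...]} as systemd unit file text."""
    lines = []
    for section, directives in sections.items():
        lines.append(f"[{section}]")
        lines.extend(f"{key}={value}" for key, value in directives)
        lines.append("")
    return "\n".join(lines)


# Hypothetical lab service; adjust names and paths before use.
LAB_UNIT = {
    "Unit": [
        ("Description", "Lab service (hypothetical)"),
        ("StartLimitIntervalSec", "60"),
        ("StartLimitBurst", "5"),
    ],
    "Service": [
        ("ExecStart", "/opt/labsvc/bin/labsvc"),
        ("Restart", "on-failure"),
        ("RestartSec", "2"),
        ("User", "labsvc"),
        ("Group", "labsvc"),
        ("WorkingDirectory", "/var/lib/labsvc"),
        ("EnvironmentFile", "/etc/labsvc/env"),
        ("ProtectSystem", "strict"),
        ("ReadWritePaths", "/var/lib/labsvc"),
        ("PrivateTmp", "yes"),
        ("NoNewPrivileges", "yes"),
        ("CapabilityBoundingSet", ""),  # empty value drops all capabilities
        ("MemoryMax", "256M"),
        ("CPUQuota", "50%"),
    ],
    "Install": [("WantedBy", "multi-user.target")],
}
```

Printing `render_unit(LAB_UNIT)` gives text to save as a unit file in the lab. Changing `ExecStart` to a path that does not exist is then a one-line way to reproduce step 4.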

Operational lesson: a production service is more than a process. It has dependencies, restart semantics, resource policy, sandboxing, logs, and boot ordering. See 17 Production Operations Troubleshooting and Runbooks for incident handling.

Filesystem And Storage Projects

Project: disk full without data loss.

  1. Create a loopback filesystem in a file.
  2. Mount it in a lab VM or container with appropriate privileges.
  3. Fill it with large files and many small files separately.
  4. Compare df -h and df -ih.
  5. Open a file, delete it, and observe lsof +L1.
  6. Practice logrotate on the mounted filesystem.
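
Steps 3 and 4 contrast two different exhaustion modes: blocks and inodes. A small sketch using `os.statvfs` shows both numbers side by side. The helper names are hypothetical, and the percentages can differ slightly from df because df accounts for reserved blocks.

```python
import os


def percent_used(total, free):
    """Percentage used, or None when the filesystem reports no total."""
    if total == 0:
        return None
    return round(100 * (total - free) / total, 1)


def fs_usage(path):
    """Block and inode usage for the filesystem that contains path."""
    st = os.statvfs(path)
    return {
        "blocks_used_pct": percent_used(st.f_blocks, st.f_bfree),
        "inodes_used_pct": percent_used(st.f_files, st.f_ffree),
    }
```

On the loopback mount, large files push `blocks_used_pct` toward 100 while many small files push `inodes_used_pct` there first; either one makes writes fail.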

Project: mount failure.

  1. Add a bad test fstab entry in a disposable VM.
  2. Boot into rescue or emergency mode.
  3. Fix the entry, run findmnt --verify, and reboot.

Production lesson: storage is where "quick cleanup" becomes data loss. Before deleting, know ownership, service semantics, backup state, and whether the file is still open.
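
The point about files that are still open can be demonstrated safely with a temporary file. This sketch, which assumes a POSIX filesystem such as those on a Linux lab VM, deletes a file while holding its descriptor and shows that the data is still allocated.

```python
import os
import tempfile


def deleted_but_open_demo(size=1024 * 1024):
    """Unlink a file while it is open and report what the kernel still holds."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * size)
        os.unlink(path)  # the name is gone, the inode is not
        st = os.fstat(fd)
        return {
            "path_exists": os.path.exists(path),
            "link_count": st.st_nlink,
            "size": st.st_size,
        }
    finally:
        os.close(fd)  # space is released only when the last descriptor closes
```

A link count of zero with a nonzero size is the state that `lsof +L1` lists, and it is why deleting a large log does not free space until the owning process closes or reopens the file.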

Network Projects

Project: namespace network path.

  1. Create a network namespace.
  2. Add a veth pair.
  3. Assign IP addresses and routes.
  4. Enable forwarding and NAT in a lab.
  5. Break DNS, route, MTU, and firewall one at a time.
  6. Observe with ip, ss, tcpdump, nft, and conntrack.
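
Steps 1 to 4 can be written down as a command plan before anything is executed. The sketch below only builds and renders iproute2 commands; the namespace name, interface names, and subnet are hypothetical lab values, and the NAT rule from step 4 is deliberately left out because it depends on the firewall in use.

```python
import ipaddress
import shlex


def netns_plan(ns="lab", subnet="10.200.0.0/24",
               host_if="veth-host", ns_if="veth-lab"):
    """Return argv lists that wire a namespace to the host over a veth pair."""
    net = ipaddress.ip_network(subnet)
    hosts = iter(net.hosts())
    host_ip, ns_ip = next(hosts), next(hosts)
    prefix = net.prefixlen
    in_ns = ["ip", "netns", "exec", ns]
    return [
        ["ip", "netns", "add", ns],
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        ["ip", "link", "set", ns_if, "netns", ns],
        ["ip", "addr", "add", f"{host_ip}/{prefix}", "dev", host_if],
        ["ip", "link", "set", host_if, "up"],
        in_ns + ["ip", "addr", "add", f"{ns_ip}/{prefix}", "dev", ns_if],
        in_ns + ["ip", "link", "set", ns_if, "up"],
        in_ns + ["ip", "link", "set", "lo", "up"],
        in_ns + ["ip", "route", "add", "default", "via", str(host_ip)],
        ["sysctl", "-w", "net.ipv4.ip_forward=1"],
    ]


def render(plan):
    """Shell-quoted text form of the plan, one command per line."""
    return "\n".join(shlex.join(argv) for argv in plan)
```

Reviewing `render(netns_plan())` first, then running the commands as root in a disposable VM, keeps the lab repeatable. Each break in step 5 then becomes a single changed or removed line.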

Project: TLS failure matrix.

  1. Run a local HTTPS server with a self-signed certificate.
  2. Test wrong name, expired cert, missing CA, and wrong SNI.
  3. Use openssl s_client and curl -v.
  4. Add a reverse proxy and compare frontend vs backend TLS.
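
The failure matrix in step 2 can also be driven from a script so that each case yields a comparable result. This sketch uses Python's standard `ssl` module with verification left on; `classify_tls` is a hypothetical helper, and the exact error text depends on the local OpenSSL build.

```python
import socket
import ssl


def classify_tls(host, port, server_name=None, cafile=None, timeout=3.0):
    """Attempt a verified TLS handshake and summarize how it ended."""
    # With cafile set, only that CA bundle is trusted (the "missing CA" case
    # is the same call without the lab CA).
    ctx = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # server_hostname drives both SNI and hostname verification.
            with ctx.wrap_socket(raw, server_hostname=server_name or host) as tls:
                return f"ok: {tls.version()}"
    except ssl.SSLCertVerificationError as exc:
        return f"verify failed: {exc.verify_message}"
    except ssl.SSLError as exc:
        return f"tls error: {exc.reason}"
    except OSError as exc:
        return f"connect error: {exc}"
```

Running it once per row of the matrix (wrong name, expired certificate, missing CA, wrong SNI) and recording the returned string next to the `openssl s_client` output makes the failure modes easier to tell apart.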

Production lesson: packet path evidence is namespace-specific. In clusters, pod namespace, node namespace, CNI datapath, service translation, DNS, ingress, and egress policy can all produce the same application timeout.

Observability And Evidence Tools

| Evidence | Tooling | What it answers |
| --- | --- | --- |
| Kernel messages | dmesg, journalctl -k | OOM, blocked tasks, device errors, panics, LSM denials. |
| Unit logs | journalctl -u, app logs | Service lifecycle and application failure. |
| Metrics | Prometheus node exporter, cAdvisor, kubelet metrics, systemd-exporter | Trends, saturation, pressure, restarts. |
| Traces | OpenTelemetry, service mesh telemetry | Request path and dependency latency. |
| Profiles | perf, eBPF profilers, language profilers | Hot code paths and kernel time. |
| Events | Kubernetes events, audit logs, cloud events | Scheduling, eviction, policy, infrastructure change. |

Do not treat any one signal as the whole of the evidence. Metrics show shape and timing. Logs show events and messages. Profiles show where time goes. Runtime state shows what the kernel is enforcing now.
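
Pressure is one case where runtime state can be read directly rather than through an exporter. As a sketch, assuming a kernel with PSI enabled, the parser below handles the format of `/proc/pressure/memory` and of the per-cgroup `memory.pressure` file; the function name is illustrative.

```python
def parse_psi(text):
    """Parse PSI lines such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        result[kind] = {
            key: float(value) for key, value in (pair.split("=") for pair in pairs)
        }
    return result
```

The `some` line covers time in which at least one task was stalled on the resource, and `full` covers time in which all non-idle tasks were stalled. The `avg` fields are percentages over 10, 60, and 300 seconds.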

Cluster Learning Projects

Use a disposable cluster such as kind, minikube, k3d, or a temporary VM-based cluster. Do not use a shared production cluster for failure drills.

Project: Kubernetes resource behavior.

  1. Deploy a workload with requests but no limits.
  2. Add a CPU limit and observe throttling.
  3. Add a memory limit and trigger an OOMKilled termination.
  4. Inspect pod events, kubelet logs if available, and cgroup files on the node.
  5. Compare Guaranteed, Burstable, and BestEffort behavior.
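
For step 5 it helps to predict the QoS class before deploying. The function below is a simplified sketch of the classification rules: it looks only at cpu and memory and compares quantities as plain strings, whereas Kubernetes compares parsed quantities, so treat it as a reasoning aid rather than a reimplementation.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' maps."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits") or {}
        # An unset request defaults to the limit for that resource.
        requests = {**limits, **(c.get("requests") or {})}
        for resource in RESOURCES:
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"
```

The workload from step 1, with requests but no limits, comes out as Burstable. It becomes Guaranteed only once every container has cpu and memory limits equal to its requests.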

Project: runtime and image path.

  1. Pull an image by tag and by digest.
  2. Inspect image layers and node image cache.
  3. Break registry credentials.
  4. Fill image filesystem in a lab node and observe eviction signals.
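
Step 1 depends on telling a tag reference from a digest reference. The sketch below is a deliberately simplified splitter, not the full image reference grammar used by container runtimes; its only subtlety is not mistaking a registry port for a tag.

```python
def split_reference(ref):
    """Split an image reference into name, tag, and digest (simplified)."""
    name, _, digest = ref.partition("@")
    last_component = name.rsplit("/", 1)[-1]
    tag = None
    # A colon counts as a tag separator only in the last path component;
    # earlier colons belong to a registry host:port.
    if ":" in last_component:
        name, _, tag = name.rpartition(":")
    return {"name": name, "tag": tag, "digest": digest or None}
```

A reference with a digest pins exact content, while a tag is a mutable pointer, so the two pulls in step 1 can resolve to different images over time.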

Project: CNI and DNS.

  1. Deploy two namespaces and services.
  2. Apply NetworkPolicy to break selected paths.
  3. Break DNS search assumptions with intentionally ambiguous names.
  4. Capture from pod namespace and node namespace.
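
The ambiguous names in step 3 come from search-list expansion. The sketch below approximates the glibc resolver's ordering rule; other resolvers, musl in particular, behave differently, and the search list and `ndots` value in the usage note are the usual pod defaults rather than something to rely on without checking `/etc/resolv.conf` in the pod.

```python
def candidate_queries(name, search, ndots=1):
    """Order in which a stub resolver would try fully qualified names."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is skipped
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + searched
    return searched + [absolute]
```

With `search = ["team-a.svc.cluster.local", "svc.cluster.local", "cluster.local"]` and `ndots=5`, the name `api.team-b` is tried inside the pod's own namespace first and as an external name last, which is why a short name can silently resolve to a different service than intended.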

Production lesson: Kubernetes is a desired-state and scheduling system over Linux primitives. When the kernel refuses memory, IO, pids, mounts, or syscalls, the cluster reports symptoms but the node enforces the boundary.

Common Mistakes

| Mistake | Better habit |
| --- | --- |
| Learning only distro commands, not /proc, /sys, and kernel semantics. | Tie every command to the kernel object it reads or changes. |
| Practicing destructive repairs on a daily workstation. | Use disposable VMs, snapshots, and loopback filesystems. |
| Treating Docker Desktop as equivalent to a production Linux host. | Learn on a real Linux VM or lab node too. |
| Assuming Kubernetes hides Linux. | Use Kubernetes to manage Linux primitives, then inspect the node when needed. |
| Memorizing one CNI or runtime. | Learn the generic model, then map each implementation. |
| Using low-level tools in production because they worked in the lab. | Start with supported control planes and read-only evidence. |
| Disabling TLS, firewall, SELinux, seccomp, or AppArmor to "test". | Narrow the hypothesis and restore the control immediately after evidence collection. |

Production Tool Discipline

Use this policy:

  • Observation commands are allowed when they are read-only, scoped, and logged.
  • Diagnostic attachment requires owner awareness for critical services.
  • Mutation requires a rollback plan, a defined impact scope, and evidence preservation.
  • Destructive storage operations require service-owner approval.
  • Node-level cluster operations require cordon, drain, or an explicit reason not to.
  • Runtime-internal commands require a note explaining why orchestrator-native tools are insufficient.

Suggested Mastery Sequence

  1. Linux process and /proc literacy.
  2. systemd services, logs, units, resource controls, and rescue workflows.
  3. Filesystems, mounts, disk pressure, inodes, loopback labs.
  4. Network namespaces, routing, DNS, firewall, packet capture.
  5. cgroups v2 controllers and pressure signals.
  6. Container image layers, overlayfs, OCI runtime, rootless containers.
  7. Kubernetes CRI, pod resources, CNI, DNS, and node pressure.
  8. Incident runbooks, post-incident reviews, and production change discipline.

Official Reference Anchors