Purpose: Build a production performance engineering playbook for Linux systems that connects capacity models, bottleneck discipline, sampling, perf, flamegraphs, off-CPU analysis, and host or cluster profiling into repeatable decisions.

11 Performance Engineering: perf, Flamegraphs, and Capacity

Performance engineering is not making every function faster. It is preserving user-visible service quality under expected and unexpected load with enough evidence to know which constraint matters. Linux performance work must connect workload demand, resource supply, queueing, kernel behavior, application architecture, and failure modes. The fastest local benchmark can be irrelevant if production bottlenecks are cgroup CPU quota, remote storage latency, DNS retries, lock contention, noisy neighbors, packet loss, or a database queue outside the host.

On a local learning machine, use synthetic benchmarks to learn tools and mechanics. On production Linux hosts, measure the real workload with bounded overhead before changing code or kernel settings. In production clusters, include pod limits, scheduling, autoscaling, service mesh, storage classes, CNI dataplane, and node pressure in the capacity model.

Bottleneck Discipline

A bottleneck is the resource, queue, lock, dependency, or policy that limits useful work at the moment. It can move after each mitigation. Performance work fails when engineers optimize a visible cost that is not limiting throughput or latency.

Discipline:

State the symptom in user or SLO terms.
Define the workload window and affected population.
Identify the first saturated resource or queue.
Gather evidence with sampling before tracing heavily.
Change one meaningful variable.
Verify against the original metric and a control.
Update the capacity model.

Bad question	Better question
Why is the server slow?	Which resource or queue explains p95 latency from 14:05 to 14:20?
Can we optimize this function?	Does this function consume enough on-CPU time to affect the SLO?
Is CPU high?	Is useful throughput limited by CPU, run queue latency, throttling, or another wait?
Is memory used?	Is memory pressure causing reclaim, swap, OOM, or cgroup stalls?
Is disk busy?	Are writes queued, slow to complete, throttled, or waiting on a remote backend?

Capacity Model

A capacity model describes how much useful work the system can handle before the next constraint becomes unacceptable. It should be simple enough to update during incidents.

Core variables:

arrival rate: requests, jobs, packets, messages, queries, bytes
service time: CPU time, IO time, upstream time, lock wait, queue time
concurrency: active requests, worker threads, connections, queue depth
resource budget: cores, memory, IO operations, bandwidth, file descriptors, ephemeral ports
limits: cgroup quota, memory limit, connection pools, rate limits, max workers
SLO: latency percentile, error rate, freshness, throughput, recovery time

Useful equations are approximations, not truth:

Model	Use	Caution
utilization = demand / capacity	quick saturation estimate	ignores burstiness and queues
concurrency = arrival rate x latency	estimate in-flight work	latency includes wait and service time
headroom = capacity - peak demand	planning buffer	capacity changes when bottleneck moves
p95 latency budget = sum of stage budgets	service-level budget	tails are not additive in a simple way
work per request = CPU seconds or IO ops per request	sizing and regression detection	requires stable workload mix
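
As a worked instance of the first three rows, the sketch below evaluates utilization, in-flight concurrency, and headroom with awk. The function name `capacity_estimate` and every input number are hypothetical; the result is only as good as the measured inputs and inherits the cautions in the table.

```shell
# capacity_estimate ARRIVAL_RPS LATENCY_S CPU_S_PER_REQ CORES
# Hypothetical helper: all inputs are illustrative assumptions, not measurements.
capacity_estimate() {
  awk -v rps="$1" -v lat="$2" -v cpu="$3" -v cores="$4" 'BEGIN {
    demand = rps * cpu                            # CPU-seconds demanded per second
    printf "utilization=%.2f\n", demand / cores   # utilization = demand / capacity
    printf "concurrency=%.1f\n", rps * lat        # concurrency = arrival rate x latency
    printf "headroom_cores=%.2f\n", cores - demand
  }'
}
```

For example, `capacity_estimate 400 0.05 0.012 8` models 400 requests per second at 50 ms latency and 12 ms of CPU per request on 8 cores. Replace `CORES` with the cgroup quota in cores when the workload runs under a limit.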

Production capacity guidance:

track peak and sustained demand separately
model cgroup limits, not only host hardware
document the current primary bottleneck for each service at current scale
include dependency budgets such as database connections and remote storage IOPS
validate autoscaling lag and cold-start cost
reserve incident headroom for retries and failover

Local, Host, Cluster

Environment	Performance goal	Trap
Local learning machine	learn tools, reproduce micro-behavior, inspect symbols	extrapolating laptop benchmarks to production
Production Linux host	protect SLO and isolate the active bottleneck	changing tunables based on folklore
Production cluster	maintain service capacity across nodes, pods, quotas, and dependencies	treating pod latency as only application code

Cluster performance is often capacity policy. A pod can be slow because it has a CPU quota lower than its burst demand, a memory limit that drives GC and reclaim, a noisy node, remote volume latency, a sidecar with its own queue, or service routing that sends traffic across zones.

Sampling

Sampling observes a subset of events to infer where time or events concentrate. It is the default production profiling approach because its overhead can be bounded and is usually lower than tracing every event.

Sampling choices:

Choice	Good for	Tradeoff
fixed frequency CPU sampling	hot code paths	may miss rare latency events
event-based sampling	cache misses, page faults, branches	hardware and permission dependent
tracepoint sampling	kernel subsystem events	event volume can be high
wall-clock profiling	language runtime latency	runtime-specific support
off-CPU sampling	wait time and blocked stacks	needs scheduler or BPF support
allocation sampling	heap growth and churn	may miss short-lived or native allocations

Production rules:

capture during the symptom window
record duration, sample rate, PID, cgroup, host, and kernel
prefer multiple short samples over one huge sample
avoid coordinated profiling across every node unless capacity is planned
annotate whether the result is on-CPU, off-CPU, allocation, IO, network, or lock data
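
One way to follow the recording rule is to write a small metadata file next to every capture. The sketch below is a minimal example, assuming a POSIX shell and a readable procfs; the function name `write_profile_meta` and the key names are hypothetical, not a standard format.

```shell
# write_profile_meta OUT_FILE PID DURATION_S SAMPLE_HZ PROFILE_TYPE
# Hypothetical helper: stores capture context beside a profile.
write_profile_meta() {
  out=$1 pid=$2 duration=$3 hz=$4 kind=$5
  # cgroup membership comes from procfs; falls back to "unknown" if unreadable.
  cgroup=$(cat "/proc/$pid/cgroup" 2>/dev/null | tr '\n' ' ')
  {
    echo "captured_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "host=$(uname -n)"
    echo "kernel=$(uname -r)"
    echo "pid=$pid"
    echo "cgroup=${cgroup:-unknown}"
    echo "duration_s=$duration"
    echo "sample_hz=$hz"
    echo "profile_type=$kind"
  } > "$out"
}
```

A call such as `write_profile_meta /tmp/perf.meta 1234 30 99 on-cpu` before `perf record` keeps the annotation with the raw data, so later comparisons do not rely on memory.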

perf Mental Model

perf is the command-line front end to the Linux perf_events subsystem. It can count events, sample events, trace selected events, and read kernel tracepoints. It connects hardware PMU counters, software events, and kernel instrumentation through a consistent CLI.

Command	Primary job	Typical output
`perf stat`	count events over a time window	cycles, instructions, faults, context switches
`perf record`	sample events and write `perf.data`	captured samples and call graphs
`perf report`	inspect recorded profile interactively	symbols, percentages, call chains
`perf script`	dump samples for tooling	stack lines for flamegraphs
`perf top`	live hot symbol view	live sample table
`perf sched`	scheduler recording and latency analysis	wakeup and scheduling delay
`perf list`	show available events	event names and PMU support

Important constraints:

perf_event_paranoid and capabilities control access
container profiling may require host PID namespace or cgroup filters
hardware events vary across CPUs and virtual machines
call graphs need frame pointers, DWARF unwind data, LBR, or kernel unwind support
symbol resolution depends on binaries, debug packages, build IDs, and JIT maps
high-frequency sampling adds overhead

perf stat

perf stat counts events for a command, process, CPU, or system. It is a fast first step because counters can show whether work is CPU-bound, syscall-heavy, fault-heavy, migration-heavy, or context-switch-heavy.

Examples:

perf stat -- sleep 10
perf stat -p 1234 -- sleep 10
perf stat -a -- sleep 10
perf stat -e cycles,instructions,cache-misses,context-switches,cpu-migrations,page-faults -p 1234 -- sleep 10
perf stat -d -- /usr/local/bin/example-benchmark

Interpretation:

Signal	Direction
high cycles with low instructions per cycle	stalls, cache misses, branch misses, memory latency, virtualization
high context switches	locks, IO waits, thread oversubscription, event loop wakeups
high CPU migrations	scheduler movement, poor affinity, cache locality loss
high page faults	cold start, mmap behavior, memory pressure, file-backed demand paging
high cache misses	working set too large, poor locality, shared data contention

Production guidance:

use perf stat to compare before and after a change
run on the same workload window when possible
avoid treating one counter as root cause
pair counters with latency and throughput metrics
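
To make before-and-after comparisons less error-prone, instructions per cycle can be computed from the machine-readable output of `perf stat -x,`. The sketch below is a minimal example; the function name `ipc_from_perf_csv` is hypothetical, and it assumes the counter value is in the first CSV field and the event name in the third, with plain event names such as `cycles` and `instructions` (hybrid CPUs may report PMU-prefixed names that this pattern does not match).

```shell
# ipc_from_perf_csv: read `perf stat -x,` lines on stdin, print instructions per cycle.
# Hypothetical helper; assumes value in field 1 and event name in field 3.
ipc_from_perf_csv() {
  awk -F, '
    $3 ~ /^cycles/       { cycles = $1 + 0 }   # "+ 0" turns "<not supported>" into 0
    $3 ~ /^instructions/ { insns  = $1 + 0 }
    END {
      if (cycles > 0) printf "ipc=%.2f\n", insns / cycles
      else            print  "ipc=unavailable"
    }'
}
```

Because perf stat writes its report to stderr, a typical invocation is `perf stat -x, -e cycles,instructions -p 1234 -- sleep 10 2>&1 | ipc_from_perf_csv`. Treat the ratio as a direction to investigate, not a root cause.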

perf record and perf report

perf record captures samples into perf.data. perf report reads that file and attributes samples to symbols and call chains.

CPU profile:

PID=1234
perf record -F 99 -g -p "$PID" -- sleep 30
perf report

System-wide short profile:

sudo perf record -a -F 99 -g -- sleep 30
sudo perf report

Container-aware direction:

# Run from the host: PID must be the host-namespace PID of the containerized process.
perf record -F 99 -g -p "$PID" -- sleep 30
perf report --stdio

If symbols are poor:

install debug symbols for the package
ensure binaries are not stripped without build IDs
preserve deployed artifacts for symbolization
enable frame pointers in performance-critical services where acceptable
configure language runtime symbol support for JITs

Common mistakes:

profiling outside the incident window
recording too briefly for a bursty symptom
comparing samples with different load shape
ignoring kernel samples because the service is "application code"
interpreting flat profile percentages without call chains
optimizing a startup path for a steady-state issue

Flamegraph Workflow

A flamegraph compresses stack samples into a visual shape. Wider frames mean more samples in that stack context. The horizontal axis is not time order: identical stacks are merged and sorted, so the shape shows aggregate cost rather than a timeline.

Workflow:

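PID=1234
perf record -F 99 -g -p "$PID" -- sleep 30
perf script > /tmp/perf.script
# Fold stacks and render an SVG. Assumes stackcollapse-perf.pl and flamegraph.pl
# from the FlameGraph scripts are on PATH.
stackcollapse-perf.pl /tmp/perf.script > /tmp/perf.folded
flamegraph.pl /tmp/perf.folded > /tmp/cpu-flamegraph.svg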

Reading rules:

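width is sample count; it corresponds to wall-clock time only when the profile was sampled on wall-clock time
vertical position is stack depth
the topmost frame in each column is the leaf: the function that was executing when the sample was taken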
wide plateaus indicate broad cost centers
narrow towers may be deep but not important
missing symbols can hide the real owner
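
The reading rules can be cross-checked numerically against folded stacks, where each line is a semicolon-separated stack followed by a sample count. The sketch below is a minimal example that ranks leaf frames by samples; the function name `top_leaf_frames` is hypothetical, and it aggregates a leaf across all of its callers, which the graph itself keeps separate.

```shell
# top_leaf_frames N: read folded stacks ("a;b;c 12") on stdin and print the
# N leaf frames with the most samples as "percent samples frame".
top_leaf_frames() {
  awk '{
      count = $NF                      # last field is the sample count
      stack = $0
      sub(/ [0-9]+$/, "", stack)       # drop the trailing count, keep spaces in frames
      n = split(stack, frames, ";")
      leaf[frames[n]] += count
      total += count
    }
    END {
      for (f in leaf) printf "%.1f %d %s\n", 100 * leaf[f] / total, leaf[f], f
    }' | sort -k2,2nr | head -n "${1:-10}"
}
```

Running `top_leaf_frames 20 < /tmp/perf.folded` gives a text view of the widest top edges, which is useful when an SVG cannot be shared or when comparing two captures in a terminal.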

Tradeoffs:

Graph type	Answers	Does not answer
CPU flamegraph	where on-CPU samples landed	why tasks waited off CPU
off-CPU flamegraph	where blocked time accumulated	what consumed CPU while waiting
differential flamegraph	what changed between two profiles	whether change improved user SLO by itself
allocation flamegraph	where allocations originate	retained memory without heap retention data
lock flamegraph	where lock wait accumulates	whether the lock protects necessary design state

Production guidance:

always label graph type, host, PID or cgroup, time, sample rate, and workload
keep raw perf.data or folded stacks when allowed
restrict access because stack symbols can reveal code and business logic
compare against a healthy control when possible

Off-CPU Analysis

Off-CPU time is time a task is not running on CPU because it is sleeping, blocked, waiting for IO, waiting for a lock, throttled, or waiting to be scheduled. Many latency incidents are off-CPU incidents.

Evidence:

pidstat -w -p 1234 1
ps -L -p 1234 -o pid,tid,state,pcpu,wchan:32,comm
sudo cat /proc/1234/stack
cat /proc/pressure/cpu
cat /proc/pressure/io
cat /proc/pressure/memory
perf sched record -- sleep 10
perf sched latency

Common wait classes:

Wait	Typical source	Evidence
futex	userspace locks, runtime scheduler, condition variables	`strace`, off-CPU stacks, high context switches
disk IO	sync writes, reads, filesystem journal, remote volumes	`iostat`, IO PSI, `D` state, block tracepoints
network IO	upstream slow, TCP loss, DNS, connection pool	`ss`, `tcpdump`, app spans
scheduler	CPU saturation, affinity, cgroup quota	CPU PSI, run queue latency, throttling
memory reclaim	pressure, swap, compaction	memory PSI, `vmstat`, kernel logs

Do not answer off-CPU questions with only CPU flamegraphs. A service can be slow because every worker is waiting on futex or ep_poll while CPU looks healthy.
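
The PSI files listed under Evidence use a simple `key=value` layout on lines that start with `some` or `full`, so a single value can be extracted for dashboards or incident notes. The sketch below is a minimal example; the function name `psi_value` is hypothetical, and the files exist only on kernels built with PSI support.

```shell
# psi_value KIND FIELD: read PSI text on stdin and print one value.
# KIND is "some" or "full"; FIELD is avg10, avg60, avg300, or total.
psi_value() {
  awk -v kind="$1" -v field="$2" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == field) print kv[2]
      }
    }'
}
```

For example, `psi_value some avg10 < /proc/pressure/io` prints the share of the last ten seconds in which at least one task was stalled on IO. A rising value while CPU utilization stays flat is a hint to look at waits rather than compute.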

CPU Performance

CPU capacity is not just percent busy. A core can be busy doing useful work, kernel overhead, interrupts, spin, bad speculation, cache misses, or work that will be discarded due to retries.

Checklist:

mpstat -P ALL 1
pidstat -t -u -p ALL 1
perf stat -a -- sleep 10
perf top
cat /proc/softirqs
cat /proc/pressure/cpu

Production questions:

Is high CPU in user, system, irq, softirq, steal, or guest?
Is one thread hot or all workers hot?
Is work evenly distributed across CPUs?
Is the process CPU-throttled by cgroup quota?
Are interrupts concentrated on one CPU?
Did retries, logging, compression, encryption, or serialization increase?

CPU mitigations:

Cause	Short mitigation	Long fix
traffic spike	rate limit, scale out, shed low-priority work	capacity and autoscaling model
hot code path	rollback, disable feature flag	optimize proven hot path
excessive retries	reduce retry storm, circuit break	bounded retry and backoff design
softirq pressure	rebalance queues, reduce packet rate	NIC, kernel, CNI, or architecture tuning
cgroup throttling	adjust request and limit, scale replicas	resource model based on real demand

Memory Performance

Memory capacity failures include allocation latency, reclaim stalls, swap storms, OOM kills, allocator fragmentation, page cache churn, memcg limits, kernel slab growth, and NUMA locality issues.

Commands:

free -h
vmstat 1
cat /proc/meminfo
cat /proc/pressure/memory
ps -eo pid,comm,rss,vsz,%mem --sort=-rss | head
cat /proc/1234/oom_score
cat /proc/1234/smaps_rollup
numastat -p 1234 2>/dev/null

Performance reading:

memory use is not bad; memory pressure is bad
page cache improves performance until churn or reclaim dominates
swap configured is not the same as swap storm
GC-heavy runtimes can turn memory limits into CPU and latency spikes
cgroup memory limits change OOM behavior from host-wide to workload-local

Capacity questions:

What is steady RSS per unit of concurrency?
How much memory is cache, heap, stack, mmap, and kernel?
What happens at peak traffic plus failover traffic?
Does the service degrade before OOM?
Are memory limits aligned with runtime heap sizing?
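
A first cut at the breakdown question is to split a process's memory into anonymous and file-backed portions from `smaps_rollup`. The sketch below is a minimal example; the function name `mem_breakdown` is hypothetical, and it assumes the `Pss_Anon` and `Pss_File` fields are present, which is not the case on older kernels.

```shell
# mem_breakdown: read smaps_rollup text on stdin; print selected fields in MiB.
# Hypothetical helper; values in the file are reported in kB.
mem_breakdown() {
  awk '
    $1 == "Rss:" || $1 == "Pss_Anon:" || $1 == "Pss_File:" || $1 == "Swap:" {
      name = $1
      sub(/:$/, "", name)
      printf "%s_mib=%.1f\n", name, $2 / 1024
    }'
}
```

Running `mem_breakdown < /proc/1234/smaps_rollup` at two concurrency levels and dividing the difference by the change in active requests gives a rough per-request memory cost. Kernel memory charged to the cgroup is not visible here and needs the cgroup's own accounting.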

IO Performance

Block IO performance is a queueing problem. A storage path includes filesystem, page cache, block layer, scheduler, device, hypervisor, remote backend, and application sync behavior.

Commands:

iostat -xz 1
pidstat -d -p ALL 1
cat /proc/pressure/io
vmstat 1
journalctl -k --since '1 hour ago'

IO questions:

Is latency read, write, flush, discard, or metadata?
Is the workload buffered or direct IO?
Is the pain per-device or per-filesystem?
Are writes synchronous with request latency?
Are containers writing through overlay layers?
Is the backing volume remote, throttled, burst-limited, or shared?

Tradeoffs:

Tactic	Helps	Risk
batching writes	fewer syscalls and flushes	higher loss window and tail latency
async IO	better concurrency	harder backpressure
caching	lower read latency	stale data and memory pressure
more queue depth	higher throughput	worse tail latency under saturation
faster disk	more headroom	bottleneck may move to locks or CPU

Network Performance

Network capacity includes socket buffers, kernel queues, NIC queues, TCP congestion, DNS, TLS, routing, conntrack, firewall rules, overlays, service mesh, and upstream service time.

Commands:

ss -s
ss -tin
ip -s link
nstat -az
sar -n DEV,TCP,ETCP 1
tcpdump -i eth0 -nn host 198.51.100.10 and port 443

Network questions:

Is latency before connect, during connect, during TLS, during request, or during response body?
Are retransmits or zero windows visible?
Are send or receive queues growing?
Are drops increasing on interfaces or qdiscs?
Is conntrack full or expensive?
Does cluster routing cross zones or nodes unnecessarily?
Is DNS latency hidden inside application request timing?
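
The retransmit question can be turned into a number from `nstat -az` output. The sketch below is a minimal example; the function name `tcp_retrans_pct` is hypothetical, and it assumes the `TcpOutSegs` and `TcpRetransSegs` counters appear with the value in the second column. With `-a` the counters are absolute since boot, so the result is a lifetime average unless two snapshots are differenced.

```shell
# tcp_retrans_pct: read `nstat -az` output on stdin, print retransmitted
# segments as a percentage of segments sent.
tcp_retrans_pct() {
  awk '
    $1 == "TcpOutSegs"     { out = $2 + 0 }
    $1 == "TcpRetransSegs" { retrans = $2 + 0 }
    END {
      if (out > 0) printf "retrans_pct=%.2f\n", 100 * retrans / out
      else         print  "retrans_pct=unavailable"
    }'
}
```

A typical invocation is `nstat -az | tcp_retrans_pct`. This is a host-wide figure; per-connection detail still comes from `ss -tin`.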

Mitigation examples:

Cause	Short mitigation	Long fix
DNS latency	cache, reduce resolver retries, pin known-good resolver	resolver architecture and observability
TCP retransmits	shift traffic, fix path, reduce overload	network path and congestion control review
connection pool exhaustion	raise carefully, shed load	pool sizing by concurrency model
sidecar queueing	bypass, scale, or tune sidecar	service mesh capacity model
conntrack saturation	reduce churn, increase limits with care	architecture to reduce connection churn

Lock Contention

Lock contention is a queue with ownership. It can live in application mutexes, runtime locks, kernel locks, filesystem locks, database locks, or distributed locks.

Signals:

pidstat -w -p 1234 1
strace -p 1234 -f -tt -T -e trace=futex
perf lock record -- sleep 10
perf lock report

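Not every futex wait is a problem. Event loops and runtimes use futexes normally. A lock is suspect when wait time, queue length, or tail latency rises with load and throughput stops scaling. Note that perf lock reports kernel lock events and depends on kernel lock instrumentation being available; for userspace mutexes, rely on futex tracing and off-CPU stacks.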

Design fixes:

reduce critical section length
shard state
replace global locks with per-core or per-tenant structures
avoid blocking IO while holding locks
add backpressure before lock queues explode
remove unnecessary cross-request shared state

Scheduler and Run Queue Latency

Run queue latency is the time a runnable task waits before getting CPU. It matters when CPUs are saturated enough to delay runnable work, or when policy such as quota, affinity, or priority prevents work from running.

Commands:

uptime
pidstat -w -p ALL 1
cat /proc/pressure/cpu
perf sched record -- sleep 10
perf sched latency
cat /proc/1234/sched

Common causes:

too many runnable threads for available cores
cgroup CPU quota too low for bursty demand
CPU affinity confines work to a subset of cores
realtime or high-priority tasks starve normal work
host noisy neighbor or hypervisor steal
GC or runtime worker policies interact badly with quotas

Production cluster note: Kubernetes CPU requests affect scheduling, while limits create quota. A workload can be placed on a node because requests fit, then throttle under real bursts because limits are lower than demand. Host CPU can have headroom while the pod is throttled.
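
Throttling of this kind is visible in the cgroup's `cpu.stat` file. The sketch below is a minimal example assuming cgroup v2 field names (`nr_periods`, `nr_throttled`, `throttled_usec`); the function name `throttle_ratio` is hypothetical, and the path to the pod's cgroup directory varies by runtime and is not shown here.

```shell
# throttle_ratio: read a cgroup v2 cpu.stat on stdin, print the share of
# enforcement periods that were throttled and total throttled seconds.
throttle_ratio() {
  awk '
    $1 == "nr_periods"     { periods = $2 + 0 }
    $1 == "nr_throttled"   { throttled = $2 + 0 }
    $1 == "throttled_usec" { usec = $2 + 0 }
    END {
      if (periods > 0)
        printf "throttled_periods_pct=%.1f throttled_s=%.2f\n", 100 * throttled / periods, usec / 1000000
      else
        print "throttled_periods_pct=unavailable"
    }'
}
```

The counters are cumulative, so sample the file twice during the symptom window and compare. A high throttled share alongside idle host CPU supports the quota explanation over a code regression.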

Capacity Testing

Capacity tests should answer a specific question, not produce a vanity requests-per-second number.

Test design:

Dimension	Production-grade choice
workload mix	match real endpoints, payloads, tenants, cache states
ramp	gradual enough to observe knee points
duration	long enough to hit caches, GC, compaction, rotation, and autoscaling
success metric	SLO, error rate, saturation, and recovery
environment	same limits, kernel, storage, network, and sidecars where possible
failure	include dependency slowness, retries, failover, and partial outage

Capacity artifacts:

maximum sustainable load at SLO
first saturation point
second bottleneck after mitigation if known
per-request CPU, memory, IO, and network cost
autoscaling lag and overshoot
rollback or load-shedding threshold
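
The first two artifacts can be derived mechanically from ramp results. The sketch below is a minimal example, assuming a hypothetical three-column input of offered load, p95 latency in milliseconds, and error percentage, ordered by increasing load; the function name `max_sustainable_load` is also hypothetical.

```shell
# max_sustainable_load P95_MAX_MS ERR_MAX_PCT
# Reads "rps p95_ms err_pct" rows (ascending load) on stdin and reports the
# last step that met the SLO and the first step that violated it.
max_sustainable_load() {
  awk -v p95_max="$1" -v err_max="$2" '
    /^#/ || NF < 3 { next }                     # skip comments and short rows
    $2 + 0 <= p95_max + 0 && $3 + 0 <= err_max + 0 { best = $1; next }
    { violation = $1; exit }                    # stop at the first SLO violation
    END {
      print "max_sustainable_rps=" (best == "" ? "none" : best)
      print "first_violation_rps=" (violation == "" ? "none" : violation)
    }'
}
```

Stopping at the first violation is deliberate: a later step that happens to pass again does not make the earlier load level sustainable. Record which resource saturated at the violation step alongside these two numbers.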

Performance Hardening

Performance hardening means making overload predictable and survivable.

Patterns:

explicit resource requests and limits
bounded queues
timeouts and budgets per dependency
retry budgets with jittered backoff
load shedding before collapse
connection pool limits tied to downstream capacity
cache limits and eviction policy
log rate limits
batch size caps
memory-aware runtime configuration
graceful degradation paths

Anti-patterns:

unlimited workers
unlimited request body size
unlimited logging during errors
retry loops without deadlines
queues that hide overload until OOM
autoscaling only on CPU when bottleneck is IO, memory, or downstream latency
treating restarts as capacity management

Common Mistakes

Mistake	Consequence	Correction
optimizing before measuring	time spent on non-bottlenecks	sample under real load first
using local microbenchmarks as proof	wrong workload and missing production limits	reproduce production constraints
ignoring cgroups	host looks fine while workload is throttled	inspect cgroup CPU, memory, IO, and PSI
reading CPU flamegraphs for waiting problems	misses lock, IO, and network waits	use off-CPU and scheduler analysis
comparing profiles with poor symbols	wrong owner assignment	preserve symbols and build IDs
tuning kernel knobs blindly	destabilizes host	tie every tuning to measured evidence and rollback
no capacity artifact	repeat incidents	record bottleneck, limit, and next threshold

Troubleshooting Playbooks

CPU-Bound Service

pidstat -t -u -p 1234 1
perf stat -p 1234 -- sleep 10
perf record -F 99 -g -p 1234 -- sleep 30
perf report

Actions:

Confirm service throughput rises with more CPU.
Check cgroup throttling.
Identify hot stacks.
Compare with healthy version or previous release.
Mitigate with scale-out, feature flag, rollback, or traffic shedding.
Optimize only proven hot paths.

High Latency With Low CPU

cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory
pidstat -w -p 1234 1
ps -L -p 1234 -o tid,state,wchan:32,pcpu,comm
ss -tin
iostat -xz 1

Actions:

Look for waiting, not compute.
Split lock, IO, network, memory reclaim, and scheduler delay.
Capture off-CPU evidence.
Reduce queueing or dependency latency.
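
To split wait classes quickly, the `ps -L` output above can be tallied by thread state and wait channel. The sketch below is a minimal example; the function name `wait_tally` is hypothetical, and it assumes the column order used in this playbook (`tid,state,wchan:32,pcpu,comm`) with a single header line.

```shell
# wait_tally: read `ps -L -o tid,state,wchan:32,pcpu,comm` output on stdin
# and count threads per (state, wait channel), largest group first.
wait_tally() {
  awk 'NR > 1 { n[$2 " " $3]++ }               # NR > 1 skips the header line
       END { for (k in n) print n[k], k }' | sort -k1,1nr -k2
}
```

A typical invocation is `ps -L -p 1234 -o tid,state,wchan:32,pcpu,comm | wait_tally`. Many threads in one futex or IO wait channel point at which off-CPU capture to take next; a single snapshot is only a hint, so repeat it a few times during the symptom window.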

Slow Disk

iostat -xz 1
pidstat -d -p ALL 1
cat /proc/pressure/io
journalctl -k --since '1 hour ago'

Actions:

Map path to filesystem and device.
Separate read, write, flush, and metadata latency.
Identify responsible process or cgroup.
Check remote volume or cloud throttling.
Mitigate by reducing sync writes, shedding load, moving workload, or increasing provisioned storage.

Network Bottleneck

ss -s
ss -tin
sar -n DEV,TCP,ETCP 1
ip -s link
tcpdump -i eth0 -nn -G 60 -W 1 -w /tmp/net.pcap host 198.51.100.10 and port 443

Actions:

Split DNS, connect, TLS, request, and response.
Check retransmits and queues.
Compare node, pod, and upstream views.
Mitigate with route shift, pool tuning, traffic reduction, or dependency failover.

Reference Anchors

perf, perf-stat, perf-record, and perf-report man pages define the perf CLI workflows.
Linux kernel ftrace, tracefs, scheduler, and PSI documentation define low-level tracing and stall evidence.
Linux man pages for syscalls, ptrace, and related tools explain the syscall boundary that many profiles and traces expose.
systemd journal documentation supports correlating performance symptoms with unit restarts and host logs.
bpftool and BPF documentation support production inspection of BPF-based profilers and tracers.