Performance Engineering perf Flamegraphs and Capacity

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/linux-systems-engineering/11 Performance Engineering perf Flamegraphs and Capacity.md.

Purpose: Build a production performance engineering playbook for Linux systems that connects capacity models, bottleneck discipline, sampling, perf, flamegraphs, off-CPU analysis, and host or cluster profiling into repeatable decisions.

Related notes: 10 Observability Logs Metrics Tracing and Debugging, 02 Processes Threads Scheduling Signals and Jobs, 03 Memory Virtual Memory Paging Allocators and OOM, 04 Filesystems VFS Block IO Page Cache and Storage, 05 Linux Networking TCP IP Routing Firewalling and DNS, 06 System Calls ABI libc and User Kernel Boundaries, 09 cgroups Namespaces Containers and Runtime Isolation, 17 Production Operations Troubleshooting and Runbooks

Performance engineering is not about making every function faster. It is about preserving user-visible service quality under expected and unexpected load, with enough evidence to know which constraint matters. Linux performance work must connect workload demand, resource supply, queueing, kernel behavior, application architecture, and failure modes. The fastest local benchmark can be irrelevant if the production bottleneck is a cgroup CPU quota, remote storage latency, DNS retries, lock contention, noisy neighbors, packet loss, or a database queue outside the host.

On a local learning machine, use synthetic benchmarks to learn tools and mechanics. On production Linux hosts, measure the real workload with bounded overhead before changing code or kernel settings. In production clusters, include pod limits, scheduling, autoscaling, service mesh, storage classes, CNI dataplane, and node pressure in the capacity model.

Bottleneck Discipline

A bottleneck is the resource, queue, lock, dependency, or policy that limits useful work at the moment. It can move after each mitigation. Performance work fails when engineers optimize a visible cost that is not limiting throughput or latency.

Discipline:

  1. State the symptom in user or SLO terms.
  2. Define the workload window and affected population.
  3. Identify the first saturated resource or queue.
  4. Gather evidence with sampling before tracing heavily.
  5. Change one meaningful variable.
  6. Verify against the original metric and a control.
  7. Update the capacity model.

| Bad question | Better question |
|---|---|
| Why is the server slow? | Which resource or queue explains p95 latency from 14:05 to 14:20? |
| Can we optimize this function? | Does this function consume enough on-CPU time to affect the SLO? |
| Is CPU high? | Is useful throughput limited by CPU, run queue latency, throttling, or another wait? |
| Is memory used? | Is memory pressure causing reclaim, swap, OOM, or cgroup stalls? |
| Is disk busy? | Are writes queued, slow to complete, throttled, or waiting on a remote backend? |

Capacity Model

A capacity model describes how much useful work the system can handle before the next constraint becomes unacceptable. It should be simple enough to update during incidents.

Core variables:

  • arrival rate: requests, jobs, packets, messages, queries, bytes
  • service time: CPU time, IO time, upstream time, lock wait, queue time
  • concurrency: active requests, worker threads, connections, queue depth
  • resource budget: cores, memory, IO operations, bandwidth, file descriptors, ephemeral ports
  • limits: cgroup quota, memory limit, connection pools, rate limits, max workers
  • SLO: latency percentile, error rate, freshness, throughput, recovery time

Useful equations are approximations, not truth:

| Model | Use | Caution |
|---|---|---|
| utilization = demand / capacity | quick saturation estimate | ignores burstiness and queues |
| concurrency = arrival rate × latency | estimate in-flight work | latency includes wait and service time |
| headroom = capacity - peak demand | planning buffer | capacity changes when bottleneck moves |
| p95 latency budget = sum of stage budgets | service-level budget | tails are not additive in a simple way |
| work per request = CPU seconds or IO ops per request | sizing and regression detection | requires stable workload mix |
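
As a rough sketch of the first three models, the arithmetic can be scripted with awk. The function names and the planning numbers (1200 req/s peak, 2000 req/s capacity, 50 ms latency) are hypothetical and only illustrate the formulas; they are not measurements.

```shell
# Capacity-model arithmetic. All inputs are hypothetical planning numbers.

# utilization = demand / capacity (same unit for both, e.g. requests/s)
utilization() { awk -v d="$1" -v c="$2" 'BEGIN { printf "%.2f\n", d / c }'; }

# concurrency = arrival rate x latency (Little's law); latency in seconds
inflight() { awk -v r="$1" -v l="$2" 'BEGIN { printf "%.1f\n", r * l }'; }

# headroom = capacity - peak demand
headroom() { awk -v c="$1" -v p="$2" 'BEGIN { printf "%.0f\n", c - p }'; }

utilization 1200 2000   # prints 0.60
inflight 1200 0.05      # prints 60.0
headroom 2000 1200      # prints 800
```

The in-flight estimate is only as good as the latency fed into it: using service time alone understates concurrency once requests start to queue.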

Production capacity guidance:

  • track peak and sustained demand separately
  • model cgroup limits, not only host hardware
  • document one bottleneck per service at current scale
  • include dependency budgets such as database connections and remote storage IOPS
  • validate autoscaling lag and cold-start cost
  • reserve incident headroom for retries and failover

Local, Host, Cluster

| Environment | Performance goal | Trap |
|---|---|---|
| Local learning machine | learn tools, reproduce micro-behavior, inspect symbols | extrapolating laptop benchmarks to production |
| Production Linux host | protect SLO and isolate the active bottleneck | changing tunables based on folklore |
| Production cluster | maintain service capacity across nodes, pods, quotas, and dependencies | treating pod latency as only application code |

Cluster performance is often a matter of capacity policy rather than code. A pod can be slow because it has a CPU quota lower than its burst demand, a memory limit that drives GC and reclaim, a noisy node, remote volume latency, a sidecar with its own queue, or service routing that sends traffic across zones.

Sampling

Sampling observes a subset of events to infer where time or events concentrate. It is the default production profiling approach because its overhead can be bounded and is lower than tracing every event.

Sampling choices:

| Choice | Good for | Tradeoff |
|---|---|---|
| fixed frequency CPU sampling | hot code paths | may miss rare latency events |
| event-based sampling | cache misses, page faults, branches | hardware and permission dependent |
| tracepoint sampling | kernel subsystem events | event volume can be high |
| wall-clock profiling | language runtime latency | runtime-specific support |
| off-CPU sampling | wait time and blocked stacks | needs scheduler or BPF support |
| allocation sampling | heap growth and churn | may miss short-lived or native allocations |

Production rules:

  • capture during the symptom window
  • record duration, sample rate, PID, cgroup, host, and kernel
  • prefer multiple short samples over one huge sample
  • avoid coordinated profiling across every node unless capacity is planned
  • annotate whether the result is on-CPU, off-CPU, allocation, IO, network, or lock data
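
One way to follow the recording rule above is to write a small label file next to each capture. The sketch below assumes a hypothetical helper named profile_label and a key=value layout chosen for illustration; it only gathers metadata and does not start a profiler.

```shell
# Emit a metadata label for a capture so it can be interpreted later.
# $1 = profile type (on-cpu, off-cpu, allocation, ...), $2 = PID,
# $3 = sample rate in Hz, $4 = duration in seconds
profile_label() {
  printf 'type=%s\n' "$1"
  printf 'pid=%s\n' "$2"
  printf 'sample_hz=%s\n' "$3"
  printf 'duration_s=%s\n' "$4"
  printf 'host=%s\n' "$(uname -n)"
  printf 'kernel=%s\n' "$(uname -r)"
  printf 'captured_at=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # cgroup membership as seen in /proc; left empty if the PID is gone or unreadable
  printf 'cgroup=%s\n' "$(cat "/proc/$2/cgroup" 2>/dev/null | tr '\n' ' ')"
}

# Example use next to a capture:
#   profile_label on-cpu "$PID" 99 30 > /tmp/perf.label
```

Keeping the label beside the raw capture makes it harder to confuse an on-CPU profile with an off-CPU one weeks later.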

perf Mental Model

perf is the command-line front end to the Linux perf_events subsystem. It can count events, sample events, trace selected events, and read kernel tracepoints. It exposes hardware PMU counters, software events, and kernel instrumentation through one consistent CLI.

| Command | Primary job | Typical output |
|---|---|---|
| perf stat | count events over a time window | cycles, instructions, faults, context switches |
| perf record | sample events and write perf.data | captured samples and call graphs |
| perf report | inspect recorded profile interactively | symbols, percentages, call chains |
| perf script | dump samples for tooling | stack lines for flamegraphs |
| perf top | live hot symbol view | live sample table |
| perf sched | scheduler recording and latency analysis | wakeup and scheduling delay |
| perf list | show available events | event names and PMU support |

Important constraints:

  • perf_event_paranoid and capabilities control access
  • container profiling may require host PID namespace or cgroup filters
  • hardware events vary across CPUs and virtual machines
  • call graphs need frame pointers, DWARF unwind data, LBR, or kernel unwind support
  • symbol resolution depends on binaries, debug packages, build IDs, and JIT maps
  • high-frequency sampling adds overhead
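
The access constraint can be checked before a capture is attempted. The sketch below maps a perf_event_paranoid level to the restriction it places on users without CAP_PERFMON, following the levels described in the kernel sysctl documentation. The function name paranoid_meaning is hypothetical, and levels above 2, which some distributions patch in, are treated like level 2 here as a simplifying assumption.

```shell
# Describe what a perf_event_paranoid level restricts for users
# without CAP_PERFMON. Restrictions accumulate as the level rises.
paranoid_meaning() {
  level=$1
  if [ "$level" -le -1 ]; then
    echo "unrestricted: almost all events allowed for all users"
  elif [ "$level" -eq 0 ]; then
    echo "raw tracepoint access disallowed"
  elif [ "$level" -eq 1 ]; then
    echo "CPU event access also disallowed"
  else
    # Assumption: distribution-specific levels above 2 are not distinguished.
    echo "kernel profiling also disallowed; user-space only"
  fi
}

# On a real host:
#   paranoid_meaning "$(cat /proc/sys/kernel/perf_event_paranoid)"
```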

perf stat

perf stat counts events for a command, process, CPU, or system. It is a fast first step because counters can show whether work is CPU-bound, syscall-heavy, fault-heavy, migration-heavy, or context-switch-heavy.

Examples:

perf stat -- sleep 10
perf stat -p 1234 -- sleep 10
perf stat -a -- sleep 10
perf stat -e cycles,instructions,cache-misses,context-switches,cpu-migrations,page-faults -p 1234 -- sleep 10
perf stat -d -- /usr/local/bin/example-benchmark

Interpretation:

| Signal | Direction |
|---|---|
| high cycles with low instructions per cycle | stalls, cache misses, branch misses, memory latency, virtualization |
| high context switches | locks, IO waits, thread oversubscription, event loop wakeups |
| high CPU migrations | scheduler movement, poor affinity, cache locality loss |
| high page faults | cold start, mmap behavior, memory pressure, file-backed demand paging |
| high cache misses | working set too large, poor locality, shared data contention |
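
The first row depends on instructions per cycle, which can be derived from the machine-readable output of perf stat. The sketch below assumes the CSV layout produced by `perf stat -x,`, with the counter value in field 1 and the event name in field 3; the function name ipc_from_csv is hypothetical, and event naming can differ across CPUs and perf versions.

```shell
# Compute instructions per cycle from `perf stat -x,` CSV read on stdin.
# Assumes field 1 is the counter value and field 3 is the event name.
ipc_from_csv() {
  awk -F, '
    $1 ~ /^[0-9]+$/ && $3 ~ "(^|/)cycles(:|/|$)"       { cycles = $1 }
    $1 ~ /^[0-9]+$/ && $3 ~ "(^|/)instructions(:|/|$)" { instr = $1 }
    END {
      if (cycles > 0) printf "%.2f\n", instr / cycles
      else print "n/a"
    }'
}

# perf stat writes its counters to stderr, so redirect before piping:
#   perf stat -x, -e cycles,instructions -p 1234 -- sleep 10 2>&1 | ipc_from_csv
```

A single IPC value is a direction, not a diagnosis; it is most useful when compared between a healthy and an unhealthy window of the same workload.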

Production guidance:

  • use perf stat to compare before and after a change
  • run on the same workload window when possible
  • avoid treating one counter as root cause
  • pair counters with latency and throughput metrics

perf record and perf report

perf record captures samples into perf.data. perf report reads that file and attributes samples to symbols and call chains.

CPU profile:

PID=1234
perf record -F 99 -g -p "$PID" -- sleep 30
perf report

System-wide short profile:

sudo perf record -a -F 99 -g -- sleep 30
sudo perf report

Container-aware direction:

# Run from the host; PID must be the process ID as seen in the host PID namespace.
perf record -F 99 -g -p "$PID" -- sleep 30
perf report --stdio

If symbols are poor:

  • install debug symbols for the package
  • ensure binaries are not stripped without build IDs
  • preserve deployed artifacts for symbolization
  • enable frame pointers in performance-critical services where acceptable
  • configure language runtime symbol support for JITs

Common mistakes:

  • profiling outside the incident window
  • recording too briefly for a bursty symptom
  • comparing samples with different load shape
  • ignoring kernel samples because the service is "application code"
  • interpreting flat profile percentages without call chains
  • optimizing a startup path for a steady-state issue

Flamegraph Workflow

A flamegraph compresses stack samples into a visual shape. Wider frames mean more samples in that stack context. It does not directly show time order. It shows aggregation.

Workflow:

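PID=1234
perf record -F 99 -g -p "$PID" -- sleep 30
perf script > /tmp/perf.script
# Convert perf script output to folded stacks, then render a flamegraph SVG.
# Assumes Brendan Gregg's FlameGraph scripts are checked out in ./FlameGraph.
./FlameGraph/stackcollapse-perf.pl /tmp/perf.script > /tmp/perf.folded
./FlameGraph/flamegraph.pl /tmp/perf.folded > /tmp/perf.svg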

Reading rules:

  • width is sample count, not wall-clock duration unless the sampling source means that
  • vertical position is stack depth
  • top frames are leaf work
  • wide plateaus indicate broad cost centers
  • narrow towers may be deep but not important
  • missing symbols can hide the real owner

Tradeoffs:

| Graph type | Answers | Does not answer |
|---|---|---|
| CPU flamegraph | where on-CPU samples landed | why tasks waited off CPU |
| off-CPU flamegraph | where blocked time accumulated | what consumed CPU while waiting |
| differential flamegraph | what changed between two profiles | whether change improved user SLO by itself |
| allocation flamegraph | where allocations originate | retained memory without heap retention data |
| lock flamegraph | where lock wait accumulates | whether the lock protects necessary design state |

Production guidance:

  • always label graph type, host, PID or cgroup, time, sample rate, and workload
  • keep raw perf.data or folded stacks when allowed
  • restrict access because stack symbols can reveal code and business logic
  • compare against a healthy control when possible

Off-CPU Analysis

Off-CPU time is time a task is not running on CPU because it is sleeping, blocked, waiting for IO, waiting for a lock, throttled, or waiting to be scheduled. Many latency incidents are off-CPU incidents.

Evidence:

pidstat -w -p 1234 1
ps -L -p 1234 -o pid,tid,state,pcpu,wchan:32,comm
sudo cat /proc/1234/stack    # kernel stack of the task; readable by root only
cat /proc/pressure/cpu
cat /proc/pressure/io
cat /proc/pressure/memory
perf sched record -- sleep 10
perf sched latency

Common wait classes:

| Wait | Typical source | Evidence |
|---|---|---|
| futex | userspace locks, runtime scheduler, condition variables | strace, off-CPU stacks, high context switches |
| disk IO | sync writes, reads, filesystem journal, remote volumes | iostat, IO PSI, D state, block tracepoints |
| network IO | upstream slow, TCP loss, DNS, connection pool | ss, tcpdump, app spans |
| scheduler | CPU saturation, affinity, cgroup quota | CPU PSI, run queue latency, throttling |
| memory reclaim | pressure, swap, compaction | memory PSI, vmstat, kernel logs |

Do not answer off-CPU questions with only CPU flamegraphs. A service can be slow because every worker is waiting on futex or ep_poll while CPU looks healthy.
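
The PSI files listed under Evidence are easy to reduce to a single number for comparison across hosts. The sketch below extracts the 10-second average from PSI text, assuming the usual "some" and "full" line layout of /proc/pressure files; the function name psi_avg10 is hypothetical.

```shell
# Print the avg10 value for the "some" or "full" line of a PSI file.
# $1 = "some" or "full"; the PSI text is read from stdin.
psi_avg10() {
  awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }'
}

# On a live host:
#   psi_avg10 some < /proc/pressure/io
#   psi_avg10 full < /proc/pressure/memory
```

A non-zero "full" value means every non-idle task was stalled on that resource for part of the window, which usually points at waiting rather than compute.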

CPU Performance

CPU capacity is not just percent busy. A busy core may be doing useful work, or it may be spending cycles on kernel overhead, interrupts, lock spinning, bad speculation, cache-miss stalls, or work that will be discarded due to retries.

Checklist:

mpstat -P ALL 1
pidstat -t -u -p ALL 1
perf stat -a -- sleep 10
perf top
cat /proc/softirqs
cat /proc/pressure/cpu

Production questions:

  • Is high CPU in user, system, irq, softirq, steal, or guest?
  • Is one thread hot or all workers hot?
  • Is work evenly distributed across CPUs?
  • Is the process CPU-throttled by cgroup quota?
  • Are interrupts concentrated on one CPU?
  • Did retries, logging, compression, encryption, or serialization increase?
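
The first question can be answered from two snapshots of the aggregate cpu line in /proc/stat. The sketch below assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal) and reports each class as a share of elapsed ticks; the function name cpu_breakdown is hypothetical, and "user" here includes nice time.

```shell
# Report CPU time shares between two /proc/stat snapshots.
# stdin: two aggregate "cpu" lines, oldest first.
cpu_breakdown() {
  awk '$1 == "cpu" {
      n++
      # fields 2..9: user nice system idle iowait irq softirq steal
      for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; prev[i] = $i }
    }
    END {
      if (n < 2) { print "need two samples"; exit 1 }
      for (i = 2; i <= 9; i++) total += d[i]
      if (total <= 0) { print "no elapsed ticks"; exit 1 }
      printf "user=%.1f system=%.1f irq=%.1f softirq=%.1f steal=%.1f\n",
        100 * (d[2] + d[3]) / total, 100 * d[4] / total,
        100 * d[7] / total, 100 * d[8] / total, 100 * d[9] / total
    }'
}

# On a live host, sample five seconds apart:
#   { grep '^cpu ' /proc/stat; sleep 5; grep '^cpu ' /proc/stat; } | cpu_breakdown
```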

CPU mitigations:

| Cause | Short mitigation | Long fix |
|---|---|---|
| traffic spike | rate limit, scale out, shed low-priority work | capacity and autoscaling model |
| hot code path | rollback, disable feature flag | optimize proven hot path |
| excessive retries | reduce retry storm, circuit break | bounded retry and backoff design |
| softirq pressure | rebalance queues, reduce packet rate | NIC, kernel, CNI, or architecture tuning |
| cgroup throttling | adjust request and limit, scale replicas | resource model based on real demand |

Memory Performance

Memory capacity failures include allocation latency, reclaim stalls, swap storms, OOM kills, allocator fragmentation, page cache churn, memcg limits, kernel slab growth, and NUMA locality issues.

Commands:

free -h
vmstat 1
cat /proc/meminfo
cat /proc/pressure/memory
ps -eo pid,comm,rss,vsz,%mem --sort=-rss | head
cat /proc/1234/oom_score
cat /proc/1234/smaps_rollup
numastat -p 1234 2>/dev/null

Performance reading:

  • memory use is not bad; memory pressure is bad
  • page cache improves performance until churn or reclaim dominates
  • having swap configured is not the same as being in a swap storm
  • GC-heavy runtimes can turn memory limits into CPU and latency spikes
  • cgroup memory limits change OOM behavior from host-wide to workload-local

Capacity questions:

  • What is steady RSS per unit of concurrency?
  • How much memory is cache, heap, stack, mmap, and kernel?
  • What happens at peak traffic plus failover traffic?
  • Does the service degrade before OOM?
  • Are memory limits aligned with runtime heap sizing?

IO Performance

Block IO performance is a queueing problem. A storage path includes filesystem, page cache, block layer, scheduler, device, hypervisor, remote backend, and application sync behavior.

Commands:

iostat -xz 1
pidstat -d -p ALL 1
cat /proc/pressure/io
vmstat 1
journalctl -k --since '1 hour ago'

IO questions:

  • Is latency read, write, flush, discard, or metadata?
  • Is the workload buffered or direct IO?
  • Is the pain per-device or per-filesystem?
  • Are writes synchronous with request latency?
  • Are containers writing through overlay layers?
  • Is the backing volume remote, throttled, burst-limited, or shared?

Tradeoffs:

| Tactic | Helps | Risk |
|---|---|---|
| batching writes | fewer syscalls and flushes | higher loss window and tail latency |
| async IO | better concurrency | harder backpressure |
| caching | lower read latency | stale data and memory pressure |
| more queue depth | higher throughput | worse tail latency under saturation |
| faster disk | more headroom | bottleneck may move to locks or CPU |

Network Performance

Network capacity includes socket buffers, kernel queues, NIC queues, TCP congestion, DNS, TLS, routing, conntrack, firewall rules, overlays, service mesh, and upstream service time.

Commands:

ss -s
ss -tin
ip -s link
nstat -az
sar -n DEV,TCP,ETCP 1
tcpdump -i eth0 -nn host 198.51.100.10 and port 443

Network questions:

  • Is latency before connect, during connect, during TLS, during request, or during response body?
  • Are retransmits or zero windows visible?
  • Are send or receive queues growing?
  • Are drops increasing on interfaces or qdiscs?
  • Is conntrack full or expensive?
  • Does cluster routing cross zones or nodes unnecessarily?
  • Is DNS latency hidden inside application request timing?

Mitigation examples:

| Cause | Short mitigation | Long fix |
|---|---|---|
| DNS latency | cache, reduce resolver retries, pin known-good resolver | resolver architecture and observability |
| TCP retransmits | shift traffic, fix path, reduce overload | network path and congestion control review |
| connection pool exhaustion | raise carefully, shed load | pool sizing by concurrency model |
| sidecar queueing | bypass, scale, or tune sidecar | service mesh capacity model |
| conntrack saturation | reduce churn, increase limits with care | architecture to reduce connection churn |

Lock Contention

Lock contention is a queue with ownership. It can live in application mutexes, runtime locks, kernel locks, filesystem locks, database locks, or distributed locks.

Signals:

pidstat -w -p 1234 1
strace -p 1234 -f -tt -T -e trace=futex
perf lock record -- sleep 10
perf lock report

Not every futex wait is a problem. Event loops and runtimes use futexes normally. A lock is suspect when wait time, queue length, or tail latency rises with load and throughput stops scaling.

Design fixes:

  • reduce critical section length
  • shard state
  • replace global locks with per-core or per-tenant structures
  • avoid blocking IO while holding locks
  • add backpressure before lock queues explode
  • remove unnecessary cross-request shared state

Scheduler and Run Queue Latency

Run queue latency is time a runnable task waits before getting CPU. Scheduler latency matters when CPU looks busy enough to delay work or when policy prevents work from running.

Commands:

uptime
pidstat -w -p ALL 1
cat /proc/pressure/cpu
perf sched record -- sleep 10
perf sched latency
cat /proc/1234/sched

Common causes:

  • too many runnable threads for available cores
  • cgroup CPU quota too low for bursty demand
  • CPU affinity confines work to a subset of cores
  • realtime or high-priority tasks starve normal work
  • host noisy neighbor or hypervisor steal
  • GC or runtime worker policies interact badly with quotas

Production cluster note: Kubernetes CPU requests affect scheduling, while limits create quota. A workload can be placed on a node because requests fit, then throttle under real bursts because limits are lower than demand. Host CPU can have headroom while the pod is throttled.
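
Throttling of this kind is visible in the workload's cgroup even when host CPU looks idle. The sketch below assumes cgroup v2, where cpu.stat exposes nr_periods and nr_throttled, and reports the share of enforcement periods in which the group was throttled. The function name throttle_pct is hypothetical, and the cgroup path in the usage comment is a placeholder that depends on the runtime.

```shell
# Percentage of CPU quota periods in which the cgroup was throttled.
# stdin: contents of a cgroup v2 cpu.stat file.
throttle_pct() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) printf "%.1f\n", 100 * throttled / periods
         else print "n/a"
       }'
}

# On a live host (placeholder path):
#   throttle_pct < /sys/fs/cgroup/<workload-cgroup>/cpu.stat
```

The counters are cumulative, so comparing two readings taken during the symptom window gives a more honest picture than a single read.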

Capacity Testing

Capacity tests should answer a specific question, not produce a vanity requests-per-second number.

Test design:

| Dimension | Production-grade choice |
|---|---|
| workload mix | match real endpoints, payloads, tenants, cache states |
| ramp | gradual enough to observe knee points |
| duration | long enough to hit caches, GC, compaction, rotation, and autoscaling |
| success metric | SLO, error rate, saturation, and recovery |
| environment | same limits, kernel, storage, network, and sidecars where possible |
| failure | include dependency slowness, retries, failover, and partial outage |

Capacity artifacts:

  • maximum sustainable load at SLO
  • first saturation point
  • second bottleneck after mitigation if known
  • per-request CPU, memory, IO, and network cost
  • autoscaling lag and overshoot
  • rollback or load-shedding threshold
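
The first artifact, maximum sustainable load at SLO, can be read directly from ramp results. The sketch below assumes a hypothetical results format of one "offered_rps p95_ms" pair per ramp step in ascending load order, and stops at the first step that breaks the SLO; the function name max_load_at_slo is also hypothetical.

```shell
# Print the highest offered load whose p95 stayed within the SLO,
# stopping at the first ramp step that violates it.
# $1 = p95 SLO in ms; stdin: "offered_rps p95_ms" per step, ascending load.
max_load_at_slo() {
  awk -v slo="$1" '
    $2 + 0 > slo + 0 { exit }     # first violation ends the search
    { best = $1 }
    END { if (best != "") print best; else print "none" }'
}

# Example:
#   printf '%s\n' '100 20' '200 25' '400 48' '800 130' | max_load_at_slo 50
```

Stopping at the first violation is a deliberate choice: a later step that happens to meet the SLO again is more likely noise than recovered capacity.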

Performance Hardening

Performance hardening means making overload predictable and survivable.

Patterns:

  • explicit resource requests and limits
  • bounded queues
  • timeouts and budgets per dependency
  • retry budgets with jittered backoff
  • load shedding before collapse
  • connection pool limits tied to downstream capacity
  • cache limits and eviction policy
  • log rate limits
  • batch size caps
  • memory-aware runtime configuration
  • graceful degradation paths

Anti-patterns:

  • unlimited workers
  • unlimited request body size
  • unlimited logging during errors
  • retry loops without deadlines
  • queues that hide overload until OOM
  • autoscaling only on CPU when bottleneck is IO, memory, or downstream latency
  • treating restarts as capacity management
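
The retry pattern and its anti-pattern can be contrasted with a small wrapper that bounds both attempts and total time. The sketch below uses exponential backoff with full jitter and an overall deadline; the function name retry_with_backoff and its argument order are hypothetical, and it assumes a sleep that accepts fractional seconds (GNU coreutils does). Variables are not function-local, which is acceptable for a sketch but not for a shared library.

```shell
# Retry a command with a capped attempt count, jittered exponential backoff,
# and an overall deadline.
# $1 = max attempts, $2 = base delay in ms, $3 = deadline in seconds; rest = command
retry_with_backoff() {
  max_attempts=$1 base_ms=$2 deadline_s=$3
  shift 3
  attempt=1
  start=$(date +%s)
  while :; do
    "$@" && return 0
    now=$(date +%s)
    if [ "$attempt" -ge "$max_attempts" ] || [ $(( now - start )) -ge "$deadline_s" ]; then
      return 1
    fi
    # Full jitter: sleep a random time between 0 and base * 2^(attempt-1) ms.
    delay_s=$(awk -v b="$base_ms" -v a="$attempt" -v seed="$(( $$ + now + attempt ))" \
      'BEGIN { srand(seed); printf "%.3f", rand() * b * 2 ^ (a - 1) / 1000 }')
    sleep "$delay_s"
    attempt=$(( attempt + 1 ))
  done
}

# Example: at most 5 attempts, 100 ms base delay, give up after 10 s
#   retry_with_backoff 5 100 10 some-client-command --flag
```

The deadline matters as much as the attempt cap: without it, slow failures can hold a worker far longer than the caller's own timeout.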

Common Mistakes

| Mistake | Consequence | Correction |
|---|---|---|
| optimizing before measuring | time spent on non-bottlenecks | sample under real load first |
| using local microbenchmarks as proof | wrong workload and missing production limits | reproduce production constraints |
| ignoring cgroups | host looks fine while workload is throttled | inspect cgroup CPU, memory, IO, and PSI |
| reading CPU flamegraphs for waiting problems | misses lock, IO, and network waits | use off-CPU and scheduler analysis |
| comparing profiles with poor symbols | wrong owner assignment | preserve symbols and build IDs |
| tuning kernel knobs blindly | destabilizes host | tie every tuning to measured evidence and rollback |
| no capacity artifact | repeat incidents | record bottleneck, limit, and next threshold |

Troubleshooting Playbooks

CPU-Bound Service

pidstat -t -u -p 1234 1
perf stat -p 1234 -- sleep 10
perf record -F 99 -g -p 1234 -- sleep 30
perf report

Actions:

  1. Confirm service throughput rises with more CPU.
  2. Check cgroup throttling.
  3. Identify hot stacks.
  4. Compare with healthy version or previous release.
  5. Mitigate with scale-out, feature flag, rollback, or traffic shedding.
  6. Optimize only proven hot paths.

High Latency With Low CPU

cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory
pidstat -w -p 1234 1
ps -L -p 1234 -o tid,state,wchan:32,pcpu,comm
ss -tin
iostat -xz 1

Actions:

  1. Look for waiting, not compute.
  2. Split lock, IO, network, memory reclaim, and scheduler delay.
  3. Capture off-CPU evidence.
  4. Reduce queueing or dependency latency.

Slow Disk

iostat -xz 1
pidstat -d -p ALL 1
cat /proc/pressure/io
journalctl -k --since '1 hour ago'

Actions:

  1. Map path to filesystem and device.
  2. Separate read, write, flush, and metadata latency.
  3. Identify responsible process or cgroup.
  4. Check remote volume or cloud throttling.
  5. Mitigate by reducing sync writes, shedding load, moving workload, or increasing provisioned storage.

Network Bottleneck

ss -s
ss -tin
sar -n DEV,TCP,ETCP 1
ip -s link
tcpdump -i eth0 -nn -G 60 -W 1 -w /tmp/net.pcap host 198.51.100.10 and port 443

Actions:

  1. Split DNS, connect, TLS, request, and response.
  2. Check retransmits and queues.
  3. Compare node, pod, and upstream views.
  4. Mitigate with route shift, pool tuning, traffic reduction, or dependency failover.

Reference Anchors

  • perf, perf-stat, perf-record, and perf-report man pages define the perf CLI workflows.
  • Linux kernel ftrace, tracefs, scheduler, and PSI documentation define low-level tracing and stall evidence.
  • Linux man pages for syscalls, ptrace, and related tools explain the syscall boundary that many profiles and traces expose.
  • systemd journal documentation supports correlating performance symptoms with unit restarts and host logs.
  • bpftool and BPF documentation support production inspection of BPF-based profilers and tracers.