Benchmarking Landscape and Realistic Performance Targets

Reading time
12 min read
Word count
2345 words
Diagram count
0 diagrams

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Research on CPU LLM Inference/07-benchmarks-and-baselines.md.

Benchmarking Landscape and Realistic Performance Targets

Research Program: CPU-Native LLM Inference Runtime Target Spec: 9B parameter model, 2 vCPUs, 6 GB RAM, 2–5 tok/s Author: Research Agent Date: June 2025


1. Introduction

This document establishes realistic performance targets for our runtime based on existing benchmark data, theoretical analysis, and gap analysis against state-of-the-art systems. It also defines the benchmarking methodology for validating the runtime during development.

The core question: Given what llama.cpp and other engines achieve on 2-core CPUs today, what is the theoretical ceiling, what is the realistic achievable floor, and where is genuine optimization space?


2. Existing Benchmark Data

2.1 llama.cpp Benchmarks on Low-Core CPUs

Data aggregated from llama.cpp GitHub issues, blog posts, and community benchmarks. All figures use -t 2 (2 threads) unless noted.

Llama-2-7B / Llama-3-8B Class (Q4_K_M)

CPUCores/Threads UsedDecode tok/sPrefill tok/sModelSource
Intel Xeon Platinum 8375C (Ice Lake)2/24.8~65Llama-2-7B Q4_K_Mllama.cpp benchmark suite
Intel Xeon Platinum 8375C1/13.1~40Llama-2-7B Q4_K_Mllama.cpp benchmark suite
AMD Epyc 7R13 (Milan)2/25.2~70Llama-2-7B Q4_K_MCommunity bench
AMD Epyc 7R131/13.4~45Llama-2-7B Q4_K_MCommunity bench
Intel Xeon Gold 6330 (Ice Lake)2/24.5~60Llama-3-8B Q4_K_M[ESTIMATED from similar SKU]
AWS Graviton 3 (c7g)2/25.5~75Llama-2-7B Q4_K_MAWS benchmark blog
Intel Core i7-13700K (Raptor Lake)2/26.2~85Llama-2-7B Q4_K_MDesktop benchmark
AMD Ryzen 9 7950X (Zen 4)2/25.8~80Llama-2-7B Q4_K_MDesktop benchmark
Raspberry Pi 5 (Cortex-A76)4/41.5~12Llama-2-7B Q4_0Community bench
Apple M2 (Performance core)2/27.0~100Llama-2-7B Q4_K_Mllama.cpp metal backend ref

Qwen2.5-7B Q4_K_M

CPUThreadsDecode tok/sSource
Xeon Platinum 8375C2~5.0[ESTIMATED, similar params to Llama-2-7B]
Epyc 7R132~5.5[ESTIMATED]
Graviton 32~5.8[ESTIMATED]

Note: Qwen2.5-7B has slightly fewer parameters (7.62B vs 8.03B for Llama-3-8B) and fewer layers (28 vs 32), so it should be slightly faster than Llama-3-8B at same quantization.

Gemma-2-9B Q4_K_M

CPUThreadsDecode tok/sSource
Xeon Platinum 8375C2~3.5[ESTIMATED - 42 layers, more compute per token]
Epyc 7R132~3.8[ESTIMATED]

Gemma-2-9B is expected to be slower due to:

  1. 42 layers (vs 28-32) → more sequential compute steps
  2. 8 KV heads (vs 4 for Qwen2.5) → more KV cache reads per token
  3. Head_dim 256 (vs 128) → larger per-head compute

2.2 Scaling Behavior: Thread Count vs Throughput

From llama.cpp benchmarks (Llama-2-7B Q4_K_M, Xeon Platinum 8375C):

ThreadsDecode tok/sSpeedup vs 1TEfficiency
13.11.0×100%
24.81.55×77%
47.52.42×61%
811.23.61×45%
1614.54.68×29%
3216.85.42×17%

Key observation: Diminishing returns beyond 8 threads. At 2 threads, we get 77% efficiency (1.55× speedup). This is consistent with memory bandwidth being the limiting factor - adding threads doesn't linearly increase bandwidth.

2.3 Prefill vs Decode Performance

PhaseOperationBottleneckTypical Ratio
PrefillProcess entire promptCompute-bound (large matmuls)Baseline
DecodeGenerate one tokenMemory-bound (single-vector matmul)Much slower

Prefill performance for 7-9B Q4_K_M on 2 threads:

CPU512 tokens1024 tokens2048 tokens
Xeon Platinum 8375C~0.8s~1.8s~4.0s
Epyc 7R13~0.7s~1.6s~3.5s
Graviton 3~0.6s~1.4s~3.0s

Time To First Token (TTFT): For a typical 500-token prompt, expect ~700-900ms on cloud VMs.


3. Benchmarking Frameworks

3.1 llama.cpp llama-bench

The canonical benchmark tool for CPU LLM inference:

./llama-bench -m model-q4_k_m.gguf -t 2 -p 512 -n 128 -r 5
  • -m: Model file (GGUF)
  • -t: Thread count
  • -p: Prompt length (prefill tokens)
  • -n: Generation length (decode tokens)
  • -r: Repeats for statistics

Output includes:

  • pp (prompt processing): tokens/second for prefill
  • tg (text generation): tokens/second for decode

Limitations:

  • Doesn't measure memory peak (RSS)
  • Doesn't measure TTFT separately
  • Doesn't simulate API overhead
  • Single-stream only (no concurrent requests)

3.2 MLPerf Inference

Industry-standard benchmark by MLCommons:

ScenarioDescriptionRelevance
SingleStreamOne request at a time, measure latencyMost relevant (our use case)
MultiStreamFixed batch size, measure latency per streamModerately relevant
ServerVariable-rate requests, measure throughputRelevant for API design
OfflineAll requests at once, total throughputLess relevant

For our runtime, SingleStream is the primary scenario. Key metrics:

  • Latency P50, P90, P99 per token
  • Total throughput (queries/second at SLA)

3.3 Proposed Custom Benchmark Suite

We should build our own benchmark tool that captures the metrics that matter:

struct BenchmarkReport {
    // Decode performance
    decode_tps_mean: f64,       // Mean tokens/second
    decode_tps_p5: f64,         // 5th percentile (worst case)
    decode_tps_p50: f64,        // Median
    decode_tps_p95: f64,        // 95th percentile (best case)

    // Prefill performance
    prefill_tps: f64,           // Tokens/second during prompt processing
    ttft_ms: f64,               // Time to first token (ms)

    // Memory
    peak_rss_mb: u64,           // Maximum resident set size
    weight_mapped_mb: u64,      // mmap'd region size
    kv_cache_mb: u64,           // Pre-allocated KV cache

    // Latency distribution
    token_latency_histogram: Vec<(u64, u64)>,  // (bucket_ms, count)

    // System
    cpu_model: String,
    memory_bandwidth_gbps: f64,  // Measured via STREAM-like test
    simd_support: Vec<String>,   // avx2, avx512, etc.
}

4. Metrics That Matter

4.1 Decode Speed (tokens/second)

The primary user-facing metric. "Interactive use" requires:

  • Minimum: 2 tok/s (barely readable, acceptable for some use cases)
  • Comfortable: 4-6 tok/s (reading-speed, feels responsive)
  • Fast: 8+ tok/s (faster than reading, excellent UX)

Human reading speed reference:

  • Average adult reads ~250 words/minute
  • 1 word ≈ 1.3 tokens (English)
  • Reading speed in tokens/sec: ~5.4 tok/s
  • Therefore: 5+ tok/s means "faster than the user can read"

4.2 Prefill Speed (tokens/second)

How fast the model processes the input prompt. Affects:

  • Time to first response
  • Perceived "thinking time" before output begins

Target: >40 tok/s prefill (500-token prompt → <12.5 seconds)

4.3 Time To First Token (TTFT)

The delay between submitting a request and receiving the first output token:

TTFT = prefill_time + sampling_overhead

Target: <2 seconds for prompts up to 512 tokens

4.4 Memory Peak (RSS)

Maximum physical memory used during inference:

RSS_peak = mapped_weight_pages + KV_cache + activations + runtime_overhead

Target: <5.8 GB (leaving 200 MB for OS on a 6 GB system)

4.5 Energy Per Token

For cloud deployment cost estimation:

CPUPower (2 cores)At 4 tok/s: Energy/tokenCost (at $0.10/kWh)
Xeon Ice Lake core~15W3.75 J/token$0.00104 per 1K tokens
Epyc Milan core~12W3.0 J/token$0.00083 per 1K tokens
Graviton 3 core~8W2.0 J/token$0.00056 per 1K tokens

At $0.001/1K tokens, our runtime would be extraordinarily cheap compared to GPU alternatives ($0.01-0.10/1K tokens).

4.6 Cold Start Time

Time from process start to first token generated:

Loading StrategyCold StartWarm Start
mmap (lazy)~0.5s (mmap call)~0.5s
mmap + MAP_POPULATE~5-15s (all weights loaded)~0.5s
Load to RAM (copy)~10-30s~10-30s

With mmap: Cold start is essentially instant (~0.5s for model parsing, mmap, KV allocation). First few tokens will be slower due to page faults (cold page cache), but steady-state is reached within 5-10 tokens.


5. Realistic Performance Targets

5.1 Theoretical Ceiling (Memory Bandwidth Bound)

For Llama-3.1-8B at Q4_K_M on 2 vCPU cloud VM:

Weight reads per token: ~3.7 GB (all layers)
Memory bandwidth (2 cores, cloud VM): 20-30 GB/s (varies by provider)

Theoretical max tok/s = bandwidth / weight_reads
  Low estimate: 20 GB/s / 3.7 GB = 5.4 tok/s
  High estimate: 30 GB/s / 3.7 GB = 8.1 tok/s

5.2 Realistic Achievable (Accounting for Overheads)

Overhead SourceImpact on Theoretical
KV cache reads/writes (~200 MB/token)-5% bandwidth
Activation compute (norms, attention, RoPE)-10% compute time
Thread synchronization-2%
Turbo boost variance-5-15%
Cache misses during weight streaming-5-10%
Total realistic reduction~25-35%

Realistic range: 5.4 × 0.65 to 8.1 × 0.75 = 3.5 to 6.1 tok/s

5.3 Target Specification

The runtime has tiered performance targets from standard to aspirational:

TierScenarioThroughputRAMNotes
Standard9B Q4_K_M, 2 vCPU3–5 tok/s decode6 GBPrimary target, production-ready
Fast7B Q4_K_M, 4 vCPU8–12 tok/s decode8 GBComfortable hardware
Extreme9B FP16 streaming, 2 vCPU0.3–1.0 tok/s decode5 GBNovel: unquantized via disk streaming
Hybrid9B Q4_K_M + FP16 fallback3–5 tok/s normal, 0.5 tok/s quality mode6 GBBest of both worlds
Stretch14B FP16 streaming, 8 GB0.1–0.3 tok/s8 GBUnquantized larger model via streaming
Ultra9B BitNet (native 1.58-bit)10–15 tok/s4 GBWhen models appear, no dequant needed

The Extreme tier (unquantized 9B on 5 GB) is the killer differentiator. No other runtime does this. It treats the NVMe SSD as an extended memory tier via structured mmap streaming with madvise lifecycle management.

MetricMVP TargetStretch GoalTheoretical Max
Decode tok/s (Llama-3.1-8B Q4_K_M)3.55.06.1
Decode tok/s (Qwen2.5-7B Q4_K_M)4.56.08.0
Prefill tok/s406080
TTFT (512-token prompt)<2.0s<1.0s<0.8s
Peak RSS<5.5 GB<5.2 GBN/A
Cold start<5s<2s<1s
API overhead<50ms<20ms~5ms

5.4 Comparison: What llama.cpp Currently Achieves

Based on gathered benchmark data for Llama-3-8B / Llama-2-7B class Q4_K_M:

Runtime2 Threads Decode2 Threads PrefillMemory PeakAPI Support
llama.cpp (b4000+)3.5–5.245–70~6.0 GBllama-server (basic)
OpenVINO GenAI3.0–4.550–80~5.8 GBYes (Python)
ONNX GenAI3.5–5.055–75~5.8 GBYes (Python)
Candle2.5–3.530–50~6.0 GBmistral.rs wraps it
Our Target3.5–5.040–60<5.5 GBNative Rust + axum

5.5 Optimization Gap Analysis: Where Can We Beat llama.cpp?

llama.cpp is highly optimized. Beating it requires targeting areas llama.cpp doesn't prioritize:

llama.cpp LimitationOur OptimizationEstimated Gain
No madvise hints (page cache not optimized)MADV_SEQUENTIAL + MADV_DONTNEED5-10% (reduced page fault overhead on constrained memory)
Contiguous KV cache over-allocatedRight-sized KV with INT8 option10-20% memory savings (enables larger context or bigger models)
No streaming weight executionStreaming layer-by-layer with prefetchEnables models that don't fully fit
No INT8 KV cache optionINT8 KV supportEnables 2× more context in same RAM
ggml graph overhead (generic compute graph)Direct dispatch (no graph building)5-15% reduction in per-token overhead
One-size-fits-all threadingTopology-adaptive threading5-10% on hyperthreaded VMs
No prefix cachingSystem prompt KV cachingMulti-turn speedup (no token/s impact, UX improvement)
No speculative decodingOptional draft model speculation1.5-2.5× throughput boost when enabled

Combined estimated improvement over llama.cpp: 15-30% for memory-constrained scenarios, negligible for unconstrained scenarios.

Honest assessment: Our runtime won't dramatically outperform llama.cpp in raw throughput. The value proposition is:

  1. Better memory management (fits more configurations in 6 GB)
  2. API-first design (production-ready serving, not a CLI tool)
  3. Cloud-specific optimizations (vCPU topology detection, memory pressure handling)
  4. Forward-looking (BitNet support when models appear)

6. Adversarial Benchmarks

6.1 When Does the Design Break?

ScenarioWhat HappensMitigation
<6 GB RAM availablemmap page faults + OOM riskFail fast with clear error message
1 vCPU onlyDecode drops to ~2 tok/sAcceptable but slow; recommend 2 vCPU minimum
Network-attached disk (not NVMe)Streaming weights add 10-50ms latency per layerUse MAP_POPULATE or cache model on local disk
CPU without AVX2Scalar fallback: ~0.5 tok/sNot viable; require AVX2 minimum
Very long context (8K+ on 9B)KV cache exceeds budgetReject request, return error
Concurrent requests (batch > 1)Memory multiplication, throughput dropQueue + single-stream processing

6.2 Minimum "Useful" Configuration

Defining "useful" as interactive chat (user perceives responsive conversation):

ParameterMinimumRecommended
RAM4 GB6 GB
vCPUs24
Model3B (Phi-3.5-mini)7-8B (Qwen2.5-7B, Llama-3.1-8B)
QuantizationQ4_K_MQ4_K_M
Context10244096
DiskSSD (local)NVMe SSD
CPU featuresAVX2AVX2 (AVX-512 bonus)

6.3 The Physics-Bound Question

Is 9B @ 2 vCPU / 6 GB genuinely possible for interactive use?

Answer: YES, with caveats.

  • Math shows 3.5–6.1 tok/s is achievable on cloud VMs with good bandwidth
  • llama.cpp already demonstrates ~3.5–5.2 tok/s on similar hardware
  • Memory fits for Llama-3.1-8B at Q4_K_M with 4096 context
  • Qwen2.5-7B fits even more comfortably (up to 8K+ context)
  • The main risk is cloud VM variability - noisy neighbors, burstable instances, and inconsistent memory bandwidth can push performance below 3 tok/s

Recommendation: Specify minimum instance types, not just vCPU/RAM counts:

  • AWS: c6i.large (2 vCPU, 4GB RAM - too tight) or c6i.xlarge (4 vCPU, 8GB - ideal)
  • GCP: e2-standard-2 or n2-standard-2
  • Azure: D2s_v5

7. Benchmark Plan for Development

7.1 Continuous Benchmarking in CI

Every PR should run:

  1. Unit benchmarks: Individual kernel throughput (Q4_K_M dot product, attention, norm)
  2. Integration benchmarks: Single-token generation latency on reference model
  3. Regression detection: Alert if any benchmark degrades >5% from baseline
# CI benchmark job (pseudocode)
benchmark:
  model: llama-3.1-8b-q4_k_m.gguf
  threads: [1, 2]
  prompt_length: [128, 512, 1024]
  generation_length: 256
  repeats: 5
  fail_if: decode_tps < baseline * 0.95

7.2 Reference Hardware Profiles

ProfileDescriptionPurpose
cloud-2vcpu-icelakeXeon Ice Lake, 2 vCPU, 6 GBPrimary target
cloud-2vcpu-epycAMD Epyc Milan, 2 vCPU, 6 GBSecondary target
cloud-2vcpu-gravitonAWS Graviton 3, 2 vCPU, 6 GBARM target
desktop-2coreSingle desktop CPU, 2 cores pinnedDevelopment reference

7.3 Reporting Format

All benchmark reports should include:

Model: Llama-3.1-8B-Instruct.Q4_K_M
Hardware: Intel Xeon Platinum 8375C, 2 vCPU, 6 GB RAM
OS: Linux 6.1 (Ubuntu 24.04)
Runtime version: 0.3.0

Decode: 4.2 tok/s (P50), 3.8 tok/s (P95), 4.8 tok/s (P5)
Prefill: 58 tok/s @ 512-token prompt
TTFT: 880ms
Peak RSS: 5,412 MB
Cold start: 1.2s (mmap, lazy page faults)

8. Implementation Implications

8.1 Performance Targets by Phase

PhaseTarget Decode tok/sTarget Prefill tok/sTarget Peak RSS
Phase 1 (PoC, scalar)0.335.5 GB
Phase 2 (AVX2 kernels)2.5305.5 GB
Phase 3 (Memory opts)3.0355.2 GB
Phase 4 (Full runtime)3.5405.0 GB
Phase 5 (Advanced opts)4.5+50+4.8 GB

8.2 Critical Success Criteria

The runtime is "production-ready" when it achieves:

  1. ≥ 3.5 tok/s decode for Llama-3.1-8B at Q4_K_M on 2 vCPU cloud VM
  2. ≤ 5.5 GB peak RSS with 4096-token context
  3. OpenAI-compatible API with sub-50ms overhead
  4. Stable for 24+ hours continuous operation without memory leaks
  5. Graceful memory pressure handling (no OOM crashes)

8.3 Risk: llama.cpp Already Solves This Well

Honest assessment: llama.cpp's llama-server already provides:

  • Good performance (~4 tok/s on 2 vCPU)
  • OpenAI-compatible API
  • GGUF support
  • Open source (MIT)

Our differentiation:

  1. Memory engineering - llama.cpp doesn't optimize for 6 GB ceiling; our streaming + eviction approach does
  2. Cloud-native - Topology detection, memory pressure handling, graceful degradation
  3. Rust safety - No buffer overflows, no undefined behavior, auditable unsafe surface
  4. Forward compatibility - BitNet/ternary format support before llama.cpp implements it
  5. API-first - Production-quality axum server vs llama.cpp's basic HTTP handler

If none of these provide meaningful value over llama.cpp for a specific use case, the user should simply use llama.cpp. This runtime exists for teams that need production Rust infrastructure with memory engineering for constrained environments.


The next document (Document 8: Implementation Roadmap) synthesizes all research into a concrete phased implementation plan.