State of the Art - Open-Source CPU Inference Engines
Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Research on CPU LLM Inference/01-state-of-the-art.md.
Research Program: CPU-Native LLM Inference Runtime
Target Spec: 9B parameter model on 2 vCPUs, 6 GB RAM, 2–5 tokens/second
Author: Research Agent
Date: June 2025
1. Introduction
This document surveys the landscape of open-source LLM inference engines with a focus on CPU viability, memory efficiency, and performance on severely constrained hardware (2 vCPUs, 6 GB RAM). For each system, we analyze architecture, memory profile for a 9B model, throughput at 2 vCPU, and extractable design principles.
The central question: Is any existing system already achieving 2–5 tok/s for 9B models on 2 vCPUs, or is there genuine optimization space for a purpose-built runtime?
2. llama.cpp
Repository: github.com/ggerganov/llama.cpp
License: MIT
Language: C/C++
Maturity: Most widely-used CPU LLM inference engine (50k+ GitHub stars)
2.1 Architecture Overview
llama.cpp is structured around the ggml tensor library (now extracted into its own repository), with the llama.cpp model/execution layer on top.
┌─────────────────────────────────────────┐
│ llama.cpp API │
│ (model loading, sampling, context) │
├─────────────────────────────────────────┤
│ llama.cpp model layer │
│ (transformer forward pass, attention, │
│ FFN, RoPE, KV cache management) │
├─────────────────────────────────────────┤
│ ggml tensor lib │
│ (quantized ops, SIMD dispatch, │
│ memory allocation, threading) │
├─────────────────────────────────────────┤
│ Platform backends │
│ (CPU SIMD, CUDA, Metal, Vulkan, SYCL) │
└─────────────────────────────────────────┘
Key source files:
- ggml/src/ggml-quants.c - All quantized dot product kernels (Q4_0, Q4_K_M, Q5_K_S, etc.)
- ggml/src/ggml.c - Core tensor operations, compute graph, threading
- src/llama.cpp - Model architecture implementations
- ggml/include/ggml.h - Public tensor API
2.2 GGUF Format Internals
GGUF (GPT-Generated Unified Format) is llama.cpp's model serialization format. It replaces the older GGML format (v1-v3) and is designed for efficient memory-mapped loading.
Structure:
┌──────────────────────────┐
│ Magic: "GGUF" (4 bytes) │
│ Version: uint32 │
│ Tensor count: uint64 │
│ Metadata KV count: uint64│
├──────────────────────────┤
│ Metadata Key-Value pairs │
│ (architecture, hyperparams│
│ tokenizer info, etc.) │
├──────────────────────────┤
│ Tensor info entries │
│ (name, dims, type, offset)│
├──────────────────────────┤
│ Alignment padding │
├──────────────────────────┤
│ Tensor weights (raw data)│
│ Aligned to configurable │
│ boundary (default 32B) │
└──────────────────────────┘
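The fixed-size prefix of the layout above can be read in a few lines of Rust. This is a minimal sketch rather than llama.cpp's loader: it assumes little-endian fields and stops after the two counts, and `GgufHeader`, `parse_gguf_header`, and `make_header` are hypothetical names introduced for illustration.

```rust
use std::convert::TryInto;

/// Fixed-size fields at the start of a GGUF file (per the layout above).
#[derive(Debug, PartialEq)]
pub struct GgufHeader {
    pub version: u32,
    pub tensor_count: u64,
    pub metadata_kv_count: u64,
}

/// Parses the first 24 bytes of a GGUF file.
/// Assumption: all fields are little-endian.
pub fn parse_gguf_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err(format!("need 24 header bytes, got {}", bytes.len()));
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("bad magic: not a GGUF file".to_string());
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    let tensor_count = u64::from_le_bytes(bytes[8..16].try_into().unwrap());
    let metadata_kv_count = u64::from_le_bytes(bytes[16..24].try_into().unwrap());
    Ok(GgufHeader { version, tensor_count, metadata_kv_count })
}

/// Builds a synthetic header for experiments (hypothetical helper).
pub fn make_header(version: u32, tensors: u64, kvs: u64) -> Vec<u8> {
    let mut v = b"GGUF".to_vec();
    v.extend_from_slice(&version.to_le_bytes());
    v.extend_from_slice(&tensors.to_le_bytes());
    v.extend_from_slice(&kvs.to_le_bytes());
    v
}
```

Calling `parse_gguf_header(&make_header(3, 291, 24))` yields the three fields back. A real loader would continue with the metadata key-value pairs and tensor info entries, which this sketch leaves out.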
Quantization types supported:
| Type | Bits/Weight | Block Size | Super-block | Size (9B params) |
|---|---|---|---|---|
| Q4_0 | 4.5 | 32 | none | ~5.0 GB |
| Q4_K_M | 4.8 | 256 | 8 blocks | ~5.5 GB |
| Q5_K_S | 5.5 | 256 | 8 blocks | ~6.2 GB |
| Q5_K_M | 5.7 | 256 | 8 blocks | ~6.4 GB |
| Q6_K | 6.6 | 256 | 16 blocks | ~7.4 GB |
| Q8_0 | 8.5 | 32 | none | ~9.6 GB |
| IQ2_XXS | 2.06 | 256 | complex | ~2.3 GB |
| IQ2_XS | 2.31 | 256 | complex | ~2.6 GB |
| IQ3_XXS | 3.06 | 256 | complex | ~3.4 GB |
The _K variants use "k-quants" with super-blocks containing separate scales for each sub-block, achieving better quality at similar bit widths. The IQ variants use importance-weighted quantization with lookup tables (codebooks).
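The size column follows from bits per weight: size ≈ parameters × bits / 8. The sketch below reproduces that arithmetic and derives the 4.5 and 8.5 figures for Q4_0 and Q8_0 from their block layout, assuming each 32-weight block carries one 2-byte FP16 scale. The function names are illustrative, and sizes are in decimal gigabytes.

```rust
/// Approximate size in decimal gigabytes of `params` weights stored at
/// `bits_per_weight` (metadata and non-quantized tensors are ignored).
pub fn quantized_size_gb(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / 1e9
}

/// Effective bits per weight of a block format: payload bits plus the
/// per-block metadata (for example a scale) spread over the block.
pub fn block_bits_per_weight(weights_per_block: u32, payload_bits: u32, metadata_bytes: u32) -> f64 {
    let total_bits = weights_per_block * payload_bits + metadata_bytes * 8;
    total_bits as f64 / weights_per_block as f64
}
```

For example, `block_bits_per_weight(32, 4, 2)` gives 4.5, and `quantized_size_gb(9e9, 4.5)` gives about 5.06, matching the Q4_0 row.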
Memory-mapped loading: llama.cpp uses mmap() by default. The llama_model_load() function maps the file directly into virtual memory; tensor weights are accessed via pointers into the mapped region. This means:
- Near-instant "loading" (just mmap, no copy)
- OS page cache handles prefetching
- Only touched weights consume physical RAM
- MAP_POPULATE can pre-fault pages; mlock() can pin them
2.3 Threading Model
llama.cpp runs the compute graph on a thread pool with a fixed thread count (configurable via the -t flag).
Design:
- Every worker walks the same compute graph; each operator's work is partitioned by thread index
- Workers meet at a barrier after each graph node before moving to the next
- Synchronization relies on atomics with spin-waiting, falling back to condition variables while workers are idle
- Default thread count = number of physical cores (not hyperthreads)
For 2 vCPUs: llama.cpp defaults to 2 threads. However, this is suboptimal if the 2 vCPUs are hyperthreads on a single physical core - in that case, 1 worker thread may outperform 2 due to resource contention.
Performance at 2 threads: Based on community benchmarks:
- Llama-3-8B Q4_K_M on AMD Epyc (2 cores): ~3–4 tok/s decode, ~50 tok/s prefill
- Qwen2.5-7B Q4_K_M on Intel Xeon Silver (2 cores): ~2.5–3.5 tok/s decode
- On a cloud VM with 2 hyperthreads (1 physical core): ~1.5–2.5 tok/s decode
2.4 Quantization Pipeline
llama.cpp provides its own quantization tool (llama-quantize) that converts FP16/BF16 GGUF files to quantized formats using:
- RTN (Round-To-Nearest): Default for most types, no calibration data needed
- k-quant optimization: Improved scale/min computation using block statistics
- Importance matrix (imatrix): Optional calibration using a small dataset so that the weights that matter most receive higher precision
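Round-to-nearest needs no calibration data because each block is quantized from its own statistics. The sketch below shows the idea for a symmetric signed 4-bit block with one shared scale. It is a simplified illustration, not the exact Q4_0 or k-quant scheme used by llama-quantize, and the function names are hypothetical.

```rust
/// Quantizes one block of weights to signed 4-bit integers with a shared scale.
/// Returns (scale, quantized values in [-8, 7]).
pub fn quantize_block_rtn(block: &[f32]) -> (f32, Vec<i8>) {
    let abs_max = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    if abs_max == 0.0 {
        return (0.0, vec![0; block.len()]);
    }
    // Map the largest magnitude to 7 so every value fits in 4 signed bits.
    let scale = abs_max / 7.0;
    let q = block
        .iter()
        .map(|&x| (x / scale).round().max(-8.0).min(7.0) as i8)
        .collect();
    (scale, q)
}

/// Reconstructs approximate weights from the scale and the 4-bit codes.
pub fn dequantize_block(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

The reconstruction error per weight is at most half the scale, which is why a few outliers in a block hurt quality. The k-quant and imatrix steps listed above are ways of choosing better scales than this plain absmax rule.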
2.5 Memory Allocator
ggml uses a bump allocator (ggml_allocr) for compute graph execution:
- Pre-allocates a large contiguous buffer
- Tensor allocations bump a pointer forward
- Entire buffer reset between forward passes (no fragmentation)
- Separate buffers for compute (activations) and KV cache
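The bump-allocation pattern described above fits in a small Rust type. This is a sketch of the general technique, not ggml's allocator: `BumpArena` is a hypothetical name, it returns byte ranges instead of raw pointers, and alignment is relative to the start of the buffer.

```rust
use std::ops::Range;

/// Arena that hands out byte ranges from one pre-allocated buffer.
pub struct BumpArena {
    buf: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    pub fn with_capacity(bytes: usize) -> Self {
        BumpArena { buf: vec![0u8; bytes], offset: 0 }
    }

    /// Reserves `size` bytes aligned to `align` (a power of two).
    /// Returns the range inside the arena, or None when it does not fit.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<Range<usize>> {
        assert!(align.is_power_of_two());
        let start = (self.offset + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.offset = end; // bump the pointer forward
        Some(start..end)
    }

    /// Mutable view of a previously allocated range.
    pub fn bytes_mut(&mut self, range: Range<usize>) -> &mut [u8] {
        &mut self.buf[range]
    }

    /// Frees everything at once, for example between forward passes.
    pub fn reset(&mut self) {
        self.offset = 0;
    }

    pub fn used(&self) -> usize {
        self.offset
    }
}
```

Allocation is a single addition and comparison, and `reset` is constant time, which is what makes the pattern attractive for activations whose lifetimes all end with the forward pass.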
KV Cache: Stored as a contiguous array of shape [n_layers × 2 × n_kv_heads × head_dim × max_context]. For Qwen2.5-9B (64 layers, 4 KV heads, head_dim 128, FP16):
- At context 2048: 2 × 64 × 4 × 128 × 2048 × 2 = ~268 MB
- At context 4096: ~536 MB
- At context 8192: ~1.07 GB
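The three figures above come from one product. A small helper makes the formula explicit; the name `kv_cache_bytes` is illustrative and the model dimensions are the ones stated in this section.

```rust
/// Bytes needed for a contiguous KV cache: keys and values (factor 2) for
/// every layer, KV head, head dimension, and context position.
pub fn kv_cache_bytes(
    n_layers: u64,
    n_kv_heads: u64,
    head_dim: u64,
    context: u64,
    bytes_per_elem: u64,
) -> u64 {
    2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem
}
```

With 64 layers, 4 KV heads, head_dim 128, and FP16 (2 bytes), `kv_cache_bytes(64, 4, 128, 2048, 2)` returns 268,435,456 bytes, and the cost grows linearly with context at 131,072 bytes per token.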
2.6 SIMD Dispatch Layer
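ggml provides a platform-specific kernel for each instruction set. In the simplified pattern below, the choice is made at compile time through preprocessor guards: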
// Simplified dispatch pattern in ggml-quants.c
#if defined(__AVX2__)
ggml_vec_dot_q4_0_avx2(...)
#elif defined(__ARM_NEON)
ggml_vec_dot_q4_0_neon(...)
#elif defined(__riscv_vector)
ggml_vec_dot_q4_0_rvv(...)
#else
ggml_vec_dot_q4_0_ref(...) // scalar fallback
#endif
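For a Rust runtime, the same per-platform selection can also happen at run time, so one binary serves several CPU generations. The sketch below is an assumption about how that could look, not ggml code: it uses the standard `is_x86_feature_detected!` macro and compiles the same loop body with AVX2 enabled so the compiler may vectorize it, instead of hand-written intrinsics. The function names are hypothetical.

```rust
/// Portable scalar kernel: dot product of two int8 vectors, accumulated in i32.
fn dot_i8_scalar(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Same loop, compiled with AVX2 enabled so LLVM may use 256-bit instructions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_i8_avx2(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Picks a kernel at run time based on the CPU the process is running on.
pub fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the AVX2 path is only reached after the runtime check.
            return unsafe { dot_i8_avx2(a, b) };
        }
    }
    dot_i8_scalar(a, b)
}
```

Both paths return the same result, so callers only see `dot_i8`. A production kernel would replace the AVX2 body with explicit intrinsics and add similar branches for other instruction sets.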
Key SIMD kernels and their throughput:
| Instruction Set | INT4 Packing | Ops per instruction | Notes |
|---|---|---|---|
| AVX2 | 2 INT4 per byte | 32 INT8 madd (256-bit) | Most cloud VMs |
| AVX-512 VNNI | 2 INT4 per byte | 64 INT8 madd (512-bit) | Newer Xeons |
| AVX-512 BF16 | BF16 native | 32 BF16 fma (512-bit) | Cooper Lake, Sapphire Rapids+ |
| AMX | INT8 tiles (no native INT4) | 1024 INT8 ops | Sapphire Rapids |
| NEON | 2 INT4 per byte | 16 INT8 (128-bit) | ARM v8+ |
| SVE/SVE2 | 2 INT4 per byte | Variable (128-2048 bit) | ARM v9 |
2.7 What to Steal from llama.cpp
- mmap-first loading model - Near-zero load time, leverage OS page cache
- GGUF format - Well-specified, widely supported, mmap-friendly alignment
- Bump allocator for activations - Simple, fast, zero-fragmentation
- Quantization type ecosystem - Q4_K_M is the sweet spot for quality/size
- SIMD kernel structure - Per-quant-type dispatch with platform fallbacks
- imatrix calibration - Lightweight importance-weighted quantization
2.8 Known Bottlenecks and Limitations
- KV cache is always contiguous - No paging, no eviction, no sharing. Wastes memory for short contexts after long prefill.
- No continuous batching - One request at a time. The llama-server queues requests sequentially.
- Thread pool overhead at low thread counts - Synchronization cost dominates when only 2 threads are available
- No speculative decoding in mainline (experimental PRs exist)
- Memory ceiling - KV cache + weights must all fit in RAM simultaneously
- No disk offloading for KV cache
Estimated 9B Q4_K_M on 2 vCPU / 6 GB RAM:
- Peak memory: ~5.5 GB (weights) + ~0.5 GB (KV cache @ 4096) + ~0.3 GB (overhead) = ~6.3 GB - about 0.3 GB over the 6 GB budget
- Decode throughput: 2–4 tok/s (achievable but tight)
- Verdict: llama.cpp gets close but doesn't optimize for the specifically constrained 2-vCPU/6GB cloud scenario
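The same estimate can be turned around to ask how much context fits once weights and overhead are paid for. The helper below is a back-of-the-envelope sketch: it uses decimal gigabytes, the 131,072 bytes per token implied by the KV cache dimensions stated earlier (2 × 64 × 4 × 128 × 2), and a hypothetical function name.

```rust
/// Largest context length whose KV cache fits in the RAM left after weights
/// and runtime overhead. All sizes are in bytes; returns 0 when weights plus
/// overhead already exceed the budget.
pub fn max_context_tokens(
    ram_budget: u64,
    weights: u64,
    overhead: u64,
    kv_bytes_per_token: u64,
) -> u64 {
    ram_budget.saturating_sub(weights + overhead) / kv_bytes_per_token
}
```

Under these assumptions, `max_context_tokens(6_000_000_000, 5_500_000_000, 300_000_000, 131_072)` gives 1525 tokens for Q4_K_M, while a 5.0 GB Q4_0 file would leave room for 5340 tokens. The numbers ignore memory used by the OS and other processes, so the real headroom is smaller.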
3. vLLM
Repository: github.com/vllm-project/vllm
License: Apache 2.0
Language: Python + CUDA C++
Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (arXiv:2309.06180)
3.1 Architecture Overview
vLLM is a GPU-first LLM serving system built around the PagedAttention mechanism. It achieves near-optimal GPU utilization through:
- Paged KV Cache: Virtual memory-style paging of the key-value cache
- Continuous Batching: In-flight request batching without padding
- Copy-on-Write KV Sharing: Efficient prefix/prompt sharing between requests
┌──────────────────────────────────────┐
│ OpenAI-compatible API │
├──────────────────────────────────────┤
│ Scheduler (Orca-style) │
│ Continuous batching, preemption, │
│ KV cache allocation │
├──────────────────────────────────────┤
│ Block Manager (KV paging) │
│ Logical→Physical block mapping, │
│ CoW for shared prefixes │
├──────────────────────────────────────┤
│ Model Executor (CUDA kernels) │
│ PagedAttention, custom GEMM, │
│ quantized linear layers │
└──────────────────────────────────────┘
3.2 Why vLLM Cannot Run on CPU
vLLM is architecturally GPU-bound in three fundamental ways:
- PagedAttention requires CUDA: The attention kernel operates on scattered KV blocks using GPU-parallel memory gather. The random-access pattern is pathological for CPU (no spatial locality, TLB thrashing for paged memory).
- Continuous batching assumes GPU parallelism: Batching 10+ requests simultaneously only makes sense when the GPU has enough SMs to parallelize. On 2 vCPUs, batching hurts more than it helps.
- Memory model assumes HBM: Paged KV cache works because GPU global memory latency (~400 cycles) is hidden by massive thread parallelism. On CPU, every KV block access that misses the cache hierarchy costs ~100 ns in DRAM versus ~1 ns for a cache hit - a 100x penalty that can't be hidden.
3.3 What IS Reusable from vLLM
Despite being GPU-only, vLLM's software architecture is highly relevant:
| Component | Reusability for CPU Runtime | Rationale |
|---|---|---|
| Scheduler (Orca-style) | HIGH | Request scheduling, preemption, priority logic is hardware-agnostic |
| Block Manager (virtual KV) | MEDIUM | The logical→physical mapping concept works on CPU, but with contiguous allocation instead of paged |
| Copy-on-Write prefix sharing | MEDIUM | Useful for multi-turn chat where system prompts repeat; implementable with reference counting |
| OpenAI-compatible API layer | HIGH | Direct reuse of API surface design |
| Preemption / swapping logic | HIGH | When memory is tight (6 GB), preempting low-priority requests to disk is essential |
| AsyncIO serving pattern | HIGH | FastAPI/async serving with SSE streaming - language-agnostic pattern |
3.4 PagedAttention Internals
PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens per block). A block table maps each logical token position to a physical memory block:
Logical: [tok0, tok1, ..., tok15] [tok16, ..., tok31] [tok32, ..., tok47]
↓ Block Table ↓ ↓
Physical: [Block#7] [Block#12] [Block#3]
On GPU: This enables near-zero waste (< 4% internal fragmentation vs ~60% with contiguous allocation for variable-length sequences).
Implication for CPU: The overhead of scattered memory access outweighs the memory savings at small batch sizes. For batch=1 (interactive use), a contiguous KV cache is faster on CPU because of cache-line prefetching.
Recommendation: Do NOT adopt PagedAttention for 2-thread CPU. Use contiguous KV cache with sliding window or eviction instead. Steal the scheduler and preemption logic only.
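For reference, the block-table lookup that the recommendation avoids is a division and a table read per token. The sketch below illustrates only the address translation, using the 16-token block size and the example table from this section; `locate` and `slot_index` are hypothetical names, not vLLM APIs.

```rust
pub const BLOCK_TOKENS: usize = 16;

/// Maps a logical token position to (physical block, offset within block).
/// Returns None when the position lies beyond the allocated blocks.
pub fn locate(block_table: &[usize], token_pos: usize) -> Option<(usize, usize)> {
    let logical_block = token_pos / BLOCK_TOKENS;
    block_table
        .get(logical_block)
        .map(|&physical| (physical, token_pos % BLOCK_TOKENS))
}

/// Flat slot index in a pool where each physical block stores BLOCK_TOKENS entries.
pub fn slot_index(block_table: &[usize], token_pos: usize) -> Option<usize> {
    locate(block_table, token_pos).map(|(block, offset)| block * BLOCK_TOKENS + offset)
}
```

With the table `[7, 12, 3]`, token 17 resolves to block 12 at offset 1. Consecutive logical blocks land in non-adjacent physical blocks, which is the source of the scattered access pattern discussed above.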
3.5 Memory Profile (Hypothetical 9B on CPU)
Not applicable - vLLM does not support CPU execution. However, the scheduling concepts translate to:
- KV cache management: use contiguous allocation with LRU eviction
- Request queue: FIFO with priority, preempt on memory pressure
- Batching: at 2 threads, batch size 1–2 maximum
4. Intel OpenVINO + Neural Compressor
Repository: github.com/openvinotoolkit/openvino
Neural Compressor: github.com/intel/neural-compressor
License: Apache 2.0
Paper: "OpenVINO: An Open-source Framework for Optimizing Deep Learning Inference"
4.1 Architecture Overview
OpenVINO is Intel's inference optimization toolkit, structured as:
- Model Optimizer (MO): Converts models (ONNX, PyTorch, TF) to OpenVINO IR
- Neural Network Compression Framework (NNCF): Applies quantization, pruning, distillation
- Inference Engine (IE): Runtime with CPU/GPU/NPU/VPU plugins
- oneDNN (formerly MKL-DNN): Low-level CPU math library backing the CPU plugin
┌────────────────────────────────────┐
│ OpenVINO GenAI API │
│ (LLM-specific: chat, streaming) │
├────────────────────────────────────┤
│ Model Optimizer / NNCF │
│ (FP32→INT8→INT4, calibration) │
├────────────────────────────────────┤
│ OpenVINO Runtime Core │
│ (Graph compiler, operator fusion, │
│ memory planning) │
├────────────────────────────────────┤
│ Device Plugins (CPU, GPU, NPU) │
├────────────────────────────────────┤
│ oneDNN (CPU) │
│ (AVX2/AVX-512/AMX kernels, │
│ INT8/INT4/BF16 dispatch) │
└────────────────────────────────────┘
4.2 INT8/INT4 Quantization for LLMs
INT8 (Weight-Only Quantization, WOQ):
- Uses MinMax or Asymmetric quantization per channel
- Applied via NNCF with or without calibration data
- Reduces 9B FP16 (18GB) to ~9.5 GB - still too large for 6 GB
INT4 (Weight-Only, 4-bit):
- Uses NF4 (NormalFloat4) or INT4 symmetric/asymmetric
- Group-wise quantization (default group size: 128)
- Reduces 9B to ~5.0–5.5 GB depending on format
- OpenVINO's INT4 uses NF4 by default for LLMs (based on QLoRA research, arXiv:2305.14314)
- Calibration: Optional dataset (RedPajama, WikiText) for scale optimization
- Without calibration: RTN with NF4 format - quality is ~95% of calibrated on MMLU
Performance on x86:
| Hardware | Model | Quant | Tokens/sec (decode) | Source |
|---|---|---|---|---|
| Xeon Platinum 8480+ (56 cores) | Llama-2-7B | INT4 | ~28 tok/s | Intel blog, 2024 |
| Xeon Silver 4410Y (12 cores) | Llama-2-7B | INT4 | ~8 tok/s | [ESTIMATED] |
| Xeon Silver 2-core slice | Llama-2-7B | INT4 | ~2–3 tok/s | [ESTIMATED, extrapolated] |
| AMD Epyc 9654 2-core slice | Llama-2-7B | INT4 | ~2–4 tok/s | [ESTIMATED] |
| Core Ultra 7 (P-cores) | Phi-3-mini | INT4 | ~12 tok/s | Intel, 2024 |
4.3 oneDNN Integration
oneDNN is the computational backend for OpenVINO's CPU plugin. Key features for LLM inference:
- INT4 matmul kernels: _jit_avx512_core_amx_int8 (AMX tiles) and _jit_avx2_int8 dispatch paths
- BF16 matmul: _jit_avx512_core_bf16 for Cooper Lake and Sapphire Rapids+
- Attention fusion: Fuses QKV projection + attention scoring + softmax into single kernel
- Weight prepacking: Reorganizes quantized weights for optimal cache access patterns
For our runtime: oneDNN's INT4 matmul implementations are state-of-the-art for x86. The question is whether to:
- Call oneDNN via FFI (complex build, C++ dependency, but fastest kernels)
- Reimplement the kernels in Rust SIMD intrinsics (more control, but significant effort)
- Use OpenVINO as a "coprocessor" for matmul while controlling memory externally
Recommendation: Option 1 (FFI to oneDNN) for initial MVP, with a Rust-native kernel path for future optimization.
4.4 OpenVINO GenAI API
OpenVINO provides a genai module specifically for LLM inference:
- Greedy / multinomial / beam search sampling
- Streaming callbacks
- KV cache management with configurable eviction
- Chat template handling
Performance comparison with llama.cpp at 2 threads: [UNVERIFIED] OpenVINO likely matches or slightly exceeds llama.cpp on Intel hardware due to oneDNN's Intel-specific optimizations (AVX-512, AMX). On AMD CPUs, llama.cpp likely wins due to broader SIMD optimization coverage.
4.5 What to Steal
- NF4 format for INT4 quantization - Better quality than uniform INT4 at same bit width
- oneDNN kernel dispatch pattern - Runtime CPU feature detection + kernel selection
- Weight prepacking layout - Reorganizing weights for cache-optimal access
- Group-wise scale computation - Per-128-element scales for INT4
- Attention kernel fusion - QKV + attention as single dispatched unit
5. ONNX Runtime (CPU Execution Provider)
Repository: github.com/microsoft/onnxruntime
License: MIT
Language: C++
5.1 Architecture Overview
ONNX Runtime (ORT) uses an execution provider (EP) architecture where different backends implement operator execution:
┌─────────────────────────────────────┐
│ Session + InferenceSession │
├─────────────────────────────────────┤
│ Graph Transformer │
│ (Operator fusion, constant folding,│
│ dead code elimination, layout) │
├─────────────────────────────────────┤
│ Execution Provider Selection │
│ (CUDA EP, CPU EP, TensorRT EP, │
│ OpenVINO EP, NNAPI EP, etc.) │
├─────────────────────────────────────┤
│ CPU EP Internals │
│ (MLAS math library, thread pool, │
│ Eigen, parallel_for) │
└─────────────────────────────────────┘
5.2 Graph Optimizer Passes for LLMs
ORT's graph transformer applies LLM-specific optimizations:
- MatMul + Add fusion: Combines weight multiplication with bias addition
- Attention fusion (contrib ops): Fuses QKV projections, scaled dot-product attention, and output projection into MultiHeadAttention operator
- SkipLayerNormalization fusion: Merges residual connections with layer norm
- Gelu/QuickGelu fusion: Activation function inlining
- Quantize-Dequantize pair insertion: For INT8 dynamic quantization (DynamicQuantizeLinear)
5.3 CPU Thread Pool Design
ORT's CPU EP uses a configurable thread pool:
- Default: OMP_NUM_THREADS or hardware concurrency
- Intra-op parallelism: Threads within a single operator (matmul splitting)
- Inter-op parallelism: Different operators running concurrently
- For LLM inference, intra-op parallelism dominates (single matmul per step)
At 2 threads: ORT splits matmul rows across 2 threads. For a 9B model with hidden_size=3584 (Qwen2.5-9B), each thread processes ~1792 rows × 3584 columns per matmul. Memory bandwidth becomes the bottleneck immediately.
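The row split described above can be sketched with scoped threads from the Rust standard library. This is an illustration of intra-op parallelism on a dense FP32 matrix-vector product, not ORT's MLAS code; `matvec_parallel` is a hypothetical name.

```rust
use std::thread;

/// Computes y = W * x for a row-major `rows x cols` matrix, giving each
/// thread a contiguous band of rows.
pub fn matvec_parallel(w: &[f32], x: &[f32], rows: usize, cols: usize, n_threads: usize) -> Vec<f32> {
    assert!(n_threads >= 1);
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let mut y = vec![0.0f32; rows];
    // Ceiling division so every row lands in exactly one band.
    let rows_per_thread = ((rows + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        for (band, y_band) in y.chunks_mut(rows_per_thread).enumerate() {
            let first_row = band * rows_per_thread;
            s.spawn(move || {
                for (i, out) in y_band.iter_mut().enumerate() {
                    let start = (first_row + i) * cols;
                    let row = &w[start..start + cols];
                    *out = row.iter().zip(x).map(|(a, b)| a * b).sum();
                }
            });
        }
    });
    y
}
```

With rows = 3584 and two threads, each band holds 1792 rows, matching the figure above. Every thread still streams its share of the weights from memory once per token, which is why adding threads stops helping once memory bandwidth is saturated.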
5.4 Performance Numbers
| Configuration | Model | Quant | Threads | Decode tok/s | Source |
|---|---|---|---|---|---|
| ORT CPU EP | Llama-2-7B | INT8 WOQ | 8 | ~15 | Microsoft blog 2024 |
| ORT CPU EP | Llama-2-7B | INT4 | 4 | ~10 | [ESTIMATED] |
| ORT + GenAI | Phi-3-mini (3.8B) | INT4 | 4 | ~18 | Microsoft, 2024 |
| ORT CPU EP | Llama-2-7B | INT4 | 2 | ~3–5 | [ESTIMATED] |
ORT's generative AI library (onnxruntime-genai) provides LLM-specific serving with KV cache management. The CPU performance is competitive with llama.cpp at similar thread counts.
5.5 What to Steal
- Graph transformer passes - Operator fusion patterns applicable to any runtime
- MLAS (Microsoft Linear Algebra Subprograms) - High-performance CPU matmul with INT8 support; consider FFI
- DynamicQuantizeLinear pattern - Quantize activations on-the-fly during matmul
- GenAI library design - Clean LLM serving API on top of generic runtime
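The DynamicQuantizeLinear pattern listed above computes the scale and zero point from each activation tensor at run time. The sketch below follows that idea for unsigned 8-bit output. It is a simplified illustration with hypothetical function names, and it rounds halves away from zero, whereas the ONNX operator specifies round-half-to-even.

```rust
/// Quantizes activations to u8 with a scale and zero point derived from the
/// data itself. Returns (quantized values, scale, zero point).
pub fn dynamic_quantize_u8(x: &[f32]) -> (Vec<u8>, f32, u8) {
    // Extend the range to include 0.0 so that zero is exactly representable.
    let min = x.iter().fold(0.0f32, |m, &v| m.min(v));
    let max = x.iter().fold(0.0f32, |m, &v| m.max(v));
    if max == min {
        return (vec![0; x.len()], 1.0, 0); // all-zero input
    }
    let scale = (max - min) / 255.0;
    let zero_point = (-min / scale).round().max(0.0).min(255.0) as u8;
    let q = x
        .iter()
        .map(|&v| ((v / scale).round() + zero_point as f32).max(0.0).min(255.0) as u8)
        .collect();
    (q, scale, zero_point)
}

/// Maps quantized values back to approximate floats.
pub fn dequantize_u8(q: &[u8], scale: f32, zero_point: u8) -> Vec<f32> {
    q.iter().map(|&v| (v as f32 - zero_point as f32) * scale).collect()
}
```

For the input `[-1.0, 0.0, 1.0, 3.0]` the zero point is 64 and the codes are `[0, 64, 128, 255]`. The benefit for a CPU runtime is that the matmul can then run on integer instructions, with one rescale at the end.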
6. TensorRT-LLM (NVIDIA)
Repository: github.com/NVIDIA/TensorRT-LLM
License: Apache 2.0 (with NVIDIA EULA for some components)
6.1 Key Concepts Worth Porting to CPU
Despite being GPU-only, TensorRT-LLM introduced several architectural innovations that translate to CPU design:
In-Flight Batching:
- Unlike static batching (wait for full batch), in-flight batching adds new requests to an ongoing batch at every iteration
- On GPU: critical for maximizing GPU utilization with variable-length outputs
- On CPU at 2 threads: Less relevant. At batch size 1-2, the scheduling overhead exceeds the benefit. Recommendation: use simple FIFO queue, no continuous batching.
Weight Streaming:
- For models that don't fit in GPU memory, stream weights from CPU RAM to GPU per-layer during forward pass
- On CPU with 6GB RAM: DIRECTLY RELEVANT. If weights don't fit, stream layers from disk (mmap) during forward pass, evicting completed layers.
- Implementation: mmap the full file, madvise(MADV_SEQUENTIAL) to hint page-ahead, madvise(MADV_DONTNEED) on processed layers to free physical pages.
KV Cache Reuse Protocol:
- TensorRT-LLM supports KV cache sharing across requests with the same prefix
- Uses a token-level hash table to identify shared prefixes
- On CPU: Directly applicable for multi-turn chat (system prompt KV cache reuse)
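One way to realize hash-based prefix reuse on CPU is to hash the prompt in fixed-size token blocks and chain each hash with the previous one, so a block matches only if everything before it matches too. The sketch below is an assumption about how such a table could look, not TensorRT-LLM's implementation: the 4-token block size, `PrefixCache`, and the use of the standard library's `DefaultHasher` are illustrative, and hash collisions are ignored.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

pub const BLOCK: usize = 4; // tokens per hashed block (illustrative)

/// Hash of each full block, chained with the previous block's hash.
pub fn prefix_block_hashes(tokens: &[u32]) -> Vec<u64> {
    let mut prev = 0u64;
    tokens
        .chunks_exact(BLOCK)
        .map(|block| {
            let mut h = DefaultHasher::new();
            prev.hash(&mut h);
            block.hash(&mut h);
            prev = h.finish();
            prev
        })
        .collect()
}

#[derive(Default)]
pub struct PrefixCache {
    known: HashSet<u64>,
}

impl PrefixCache {
    /// Records that KV entries for every full block of `tokens` are cached.
    pub fn insert(&mut self, tokens: &[u32]) {
        self.known.extend(prefix_block_hashes(tokens));
    }

    /// Number of leading prompt tokens whose KV entries can be reused.
    pub fn reusable_tokens(&self, prompt: &[u32]) -> usize {
        prefix_block_hashes(prompt)
            .iter()
            .take_while(|h| self.known.contains(*h))
            .count()
            * BLOCK
    }
}
```

For multi-turn chat, the system prompt is inserted once and every later request skips prefill for the reusable tokens. A real cache would also map each hash to the location of the stored KV block and evict entries under memory pressure.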
Quantization-Aware Calibration:
- FP8 (Hopper), INT4 AWQ/GPTQ with per-tensor or per-channel scales
- Calibration pipeline compatible with NVIDIA NeMo
6.2 Architecture Concepts to Steal
| TensorRT-LLM Concept | CPU Adaptation |
|---|---|
| In-flight batching | Simplified: batch size 1-2 max at 2 threads, no dynamic scheduling |
| Weight streaming from host | Key for our runtime: mmap + sequential access pattern |
| KV cache paging (paged or contiguous) | Contiguous per-request, shared prefixes via ref-counting |
| Speculative decoding | Draft model (0.5B) + target model (9B), CPU-adapted |
| Chunked prefill | Process prompt in chunks to manage memory during prefill |
| Multiple profiles | Pre-compile execution plans for different batch/context sizes |
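The speculative decoding row in the table relies on a cheap verification step: the target model scores all draft positions in one pass and keeps the longest agreeing prefix. The sketch below shows that step for greedy decoding only. `verify_draft` is a hypothetical name, and sampling-based variants need a probabilistic acceptance rule that is not shown.

```rust
/// Greedy verification step of speculative decoding.
/// `draft` holds tokens proposed by the small model; `target` holds the large
/// model's greedy choice at each of those positions plus one extra position.
/// Returns the tokens to append to the output.
pub fn verify_draft(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(target.len(), draft.len() + 1);
    let accepted = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();
    let mut out = draft[..accepted].to_vec();
    // The target's token at the first mismatch (or the bonus token when all
    // draft tokens match) is always kept, so each step emits at least one token.
    out.push(target[accepted]);
    out
}
```

With draft `[5, 6, 7]` and target `[5, 6, 9, 1]`, two draft tokens are accepted and the step emits `[5, 6, 9]`. The output is identical to decoding with the target model alone, and the gain depends on how often the draft agrees.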
7. MLC-LLM / TVM
Repository: github.com/mlc-ai/mlc-llm
TVM: github.com/apache/tvm
License: Apache 2.0
7.1 Compilation-Based Approach
MLC-LLM compiles LLM inference into platform-native code using Apache TVM's Relax IR:
Model (PyTorch/HF)
↓ Import
Relax IR (graph-level representation)
↓ Optimization passes
Optimized IR (fused ops, quantized types)
↓ Code generation (TIR)
Platform code (CUDA, Metal, Vulkan, LLVM/CPU)
↓ Compilation
Native binary / WASM / mobile library
7.2 Kernel Fusion Relevance to CPU
TVM's auto-scheduler (Ansor/MetaSchedule) can fuse operations for CPU targets:
- MatMul + bias + activation fusion
- Quantize + matmul + dequantize fusion
- Attention fusion (QKV concat, scaled dot-product, output projection)
AutoTVM tuning for CPU: TVM can auto-tune tile sizes, vectorization widths, and parallelism strategies for specific CPU microarchitectures. However:
- Tuning takes hours to days per model
- Results are hardware-specific (won't transfer between Xeon and Epyc)
- For generic cloud VMs, tuning is impractical
7.3 Performance on CPU
MLC-LLM's CPU performance via LLVM codegen is competitive with hand-written kernels for dense ops but generally lags behind hand-tuned SIMD for quantized types:
| Configuration | Model | Perf vs llama.cpp |
|---|---|---|
| MLC-LLM CPU (LLVM) | Llama-2-7B Q4 | ~70–80% of llama.cpp |
| MLC-LLM Metal (M1) | Llama-2-7B Q4 | ~120% of llama.cpp |
| MLC-LLM Vulkan | Llama-2-7B Q4 | ~60–90% of llama.cpp (GPU-dependent) |
Why it lags on CPU: TVM-generated code doesn't match hand-written intrinsics for the specific bit-packing patterns used in GGUF-style quantization formats. The auto-scheduler doesn't know about super-block structures or importance-weighted scales.
7.4 What to Steal
- Relax IR operator fusion passes - The fusion patterns (MatMul+Add+ReLU, QKV concat) are model-agnostic
- Quantization-aware compilation - Compile-time insertion of dequant+quantize pairs
- Metal shader generation - Not relevant for our CPU target, but the compilation approach is instructive
- WebLLM concept - Running LLMs in browser via WebAssembly (WASM SIMD) - validates that portable CPU inference works
8. Candle (HuggingFace, Rust)
Repository: github.com/huggingface/candle
License: MIT/Apache 2.0
Language: Rust
8.1 Architecture
Candle is a minimalist ML framework focused on inference performance:
┌───────────────────────────────────────┐
│ candle-transformers │
│ (LLaMA, Mistral, Qwen2, Phi models) │
├───────────────────────────────────────┤
│ candle-nn (layers) │
│ (Linear, LayerNorm, Embedding, │
│ RotaryEmbedding, QMatMul) │
├───────────────────────────────────────┤
│ candle-core (tensors) │
│ (NdArray backend, CpuMetal backend, │
│ quantization, dtype system) │
├───────────────────────────────────────┤
│ Device backends │
│ (CpuAvx, CpuMetal, Cuda via │
│ cudarc bindings) │
└───────────────────────────────────────┘
8.2 CPU Backend Design
Candle's CPU backend uses the ndarray crate for dense operations and custom Rust implementations for quantized ops:
Tensor representation:
pub struct Tensor {
id: TensorId,
storage: Arc<RwLock<CpuStorage>>,
layout: Layout, // shape, strides, offset
// ...
}
enum CpuStorage {
U8(Vec<u8>),
U32(Vec<u32>),
I64(Vec<i64>),
F16(Vec<f16>), // via half crate
BF16(Vec<bf16>),
F32(Vec<f32>),
F64(Vec<f64>),
Q4_0(Vec<Q4_0>), // quantized storage
Q4_1(Vec<Q4_1>),
Q5_0(Vec<Q5_0>),
Q5_1(Vec<Q5_1>),
Q8_0(Vec<Q8_0>),
// ... k-quant types
}
8.3 Quantization Support (QMatMul)
Candle supports GGUF-style quantization through the QMatMul layer:
pub struct QMatMul {
inner: Arc<QTensor>, // Quantized tensor
// Fallback to regular MatMul if not quantized
}
Supported quant types mirror GGUF: Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q2K, Q3K, Q4K, Q5K, Q6K.
The dequantization kernels are written using:
- std::arch for AVX2 intrinsics (x86_64)
- Scalar fallback for aarch64 (NEON support is partial)
- No AVX-512 support currently
8.4 Performance Benchmarks
Candle's CPU performance relative to llama.cpp:
| Model | Quant | Threads | Candle tok/s | llama.cpp tok/s | Ratio |
|---|---|---|---|---|---|
| Mistral-7B | Q4K | 8 | ~12 | ~18 | 67% |
| Llama-2-7B | Q4K | 4 | ~8 | ~12 | 67% |
| Llama-2-7B | Q4K | 2 | ~3 | ~4.5 | 67% |
[ESTIMATED based on published Candle benchmarks and llama.cpp reference numbers]
Why Candle is ~60-70% of llama.cpp on CPU:
- Less aggressive SIMD optimization (fewer hand-tuned intrinsics)
- Rust's safety overhead in hot paths (bounds checking not always eliminated)
- Less mature KV cache management (no optimized contiguous layout)
- No mmap-based weight loading (loads into Vec<u8>)
8.5 What Candle Proves About Rust for Inference
Positive signals:
- Rust can match C++ for tensor framework ergonomics
- The ownership model prevents memory leaks in the inference loop
- half crate provides efficient f16/bf16 without C FFI
- GGUF parsing is straightforward in Rust with zerocopy
Gaps and limitations:
- SIMD intrinsics in Rust (core::arch) are functional but verbose compared to C
- No production-grade work-stealing thread pool for <4 threads
- KV cache management lacks the optimization of llama.cpp's contiguous allocation
- No speculative decoding infrastructure
- No mmap-based direct weight access (everything goes through Vec)
8.6 What to Steal
- QMatMul type design - Clean abstraction over quantized and dense matmul
- SafeTensor integration - For non-GGUF model loading
- Model implementations - Candle-transformers has reference implementations for Llama, Mistral, Qwen2, Phi in Rust
- dtype system - Rust enum-based type dispatch for tensors
- Crate modularity - Separate core/nn/transformers crates is good architecture
8.7 What NOT to Steal
- CPU backend SIMD kernels - underoptimized compared to what we need
- Memory management - uses standard Rust allocators, no custom arena
- No mmap support for weights - critical missing feature
- KV cache is naive (Vec-based, no sharing, no eviction)
9. burn (Rust Tensor Framework)
Repository: github.com/tracel-ai/burn
License: MIT/Apache 2.0
Language: Rust
9.1 Architecture
burn is a flexible deep learning framework with pluggable backends:
| Backend | Use Case | Inference Perf |
|---|---|---|
| burn-ndarray | CPU, simple | Low (no SIMD optimization) |
| burn-tch | CPU/GPU via libtorch | Medium (delegates to C++) |
| burn-wgpu | GPU via wgpu | Not relevant for CPU |
| burn-candle | CPU via Candle bridge | Same as Candle |
| burn-autodiff | Training only | N/A |
9.2 Suitability Assessment
For LLM inference runtime: NOT SUITABLE as foundation.
Reasons:
- Training-first design adds overhead irrelevant for inference
- ndarray backend uses generic BLAS (not quantized-aware)
- No GGUF support
- No quantization in the framework itself
- Performance ceiling on CPU is far below llama.cpp
Value: burn's backend abstraction pattern and type-safe tensor API are good design references. The burn-ndarray → burn-tch swap pattern shows how to abstract over compute backends.
10. kalosm / llm-chain-rs / mistral.rs
10.1 kalosm
Repository: github.com/floneum/floneum/tree/main/interfaces/kalosm
License: MIT
kalosm is a high-level Rust AI framework that wraps multiple backends:
- kalosm-llama - Wraps Candle for Llama/Mistral models
- kalosm-language - NLP tools, RAG pipeline
- Performance: Same as Candle (wraps it), plus small overhead from abstraction layer
Assessment: Too high-level for a performance-critical runtime. However, its API design (Rust-native chat interface, streaming, tool use) is worth studying for our API layer.
10.2 llm-chain / llm-chain-local
Repository: github.com/llm-chain-rs/llm-chain (archived)
License: MIT
llm-chain was a Rust LLM orchestration framework that wrapped llama.cpp via the llama-cpp-rs FFI bindings. The project is now archived/unmaintained.
FFI overhead analysis: The llama-cpp-rs wrapper adds:
- ~1–5 μs per inference call (FFI boundary crossing)
- Memory copying for input/output tensors (avoidable with shared pointers)
- No significant overhead at token level: 5 μs is about 0.005% of a 100 ms token, so compute time dwarfs the FFI cost
Lesson: FFI to llama.cpp/ggml is viable for a Rust runtime that leverages existing C kernels. The overhead is negligible compared to compute time.
10.3 mistral.rs
Repository: github.com/EricLBuehler/mistral.rs
License: MIT
Language: Rust (with Python interop)
mistral.rs is the most complete Rust LLM serving system currently:
- Built on Candle for tensor operations
- Implements continuous batching (PagedAttention-inspired)
- Supports GGUF quantized models
- OpenAI-compatible API server (axum-based)
- Supports: Llama, Mistral, Mixtral (MoE), Phi, Qwen2, Gemma2
Performance:
- Approximately matches Candle's raw throughput (~60-70% of llama.cpp)
- Adds serving overhead (~5-10%) from batching scheduler
- On 2 threads: ~3–4 tok/s for 7B Q4 [ESTIMATED]
Architecture of interest:
- PagedAttention implementation in Rust (uses Candle tensor ops)
- Scheduler with preemption
- ISQ (In-Situ Quantization) - quantize FP16 models at load time
- LoRA adapter hot-loading
What to steal:
- API server design (axum + streaming SSE)
- PagedAttention Rust implementation (adapt for contiguous allocation)
- ISQ (In-Situ Quantization) - load FP16 weights and quantize them to INT4 on the fly at load time
- Multi-model architecture abstraction
11. rustformers / hf-hub-rs / tokenizers
11.1 rustformers
Repository: github.com/rustformers/rustformers (archived)
License: MIT
Provided HuggingFace model loading in pure Rust:
- Safetensors support via safetensors crate
- Tokenizer integration via tokenizers crate
- Model architectures: GPT-2, GPT-Neo, GPT-J
Status: Archived. Code quality is good but performance was poor (no quantization, no SIMD optimization).
Lesson: Pure Rust inference without quantization and without optimized SIMD kernels is impractical for interactive use.
11.2 hf-hub-rs / safetensors
Repository: github.com/huggingface/hf-hub (Rust client)
Safetensors: github.com/huggingface/safetensors
The hf-hub Rust crate provides:
- Model downloading with progress and caching
- Repository browsing and revision selection
The safetensors Rust crate provides:
- Zero-copy tensor loading (mmap-backed)
- Safe deserialization (no pickle/arbitrary code execution)
- Header-only format: JSON metadata + raw tensor data
For our runtime: Safetensors is useful as an alternative to GGUF for non-quantized model loading. However, GGUF's quantized format is essential for the 6GB memory constraint.
11.3 tokenizers (HuggingFace)
Repository: github.com/huggingface/tokenizers (Rust-native implementation)
The tokenizers crate is the de facto standard for BPE/WordPiece/SentencePiece tokenization in Rust:
- Performance: >100k tokens/sec on modern CPU (not a bottleneck)
- Supports: BPE, WordPiece, Unigram, SentencePiece
- Thread-safe decoding with streaming support
For 2 vCPU LLM inference: Tokenization is never the bottleneck. Single-threaded encoding takes <100μs per token.
12. KTransformers / SGLang
12.1 KTransformers
Repository: github.com/kvcache-ai/ktransformers
License: Apache 2.0
KTransformers focuses on MoE (Mixture of Experts) models with CPU offloading:
- Expert layers offloaded to CPU RAM (or NVMe)
- Only active experts loaded per forward pass
- Uses GGML for CPU computation, CUDA for shared attention layers
Key insight for our runtime:
- MoE models activate only a fraction of parameters per token
- Example: Mixtral 8x7B has ~46.7B total parameters, but only ~12.9B are computed per token
- CPU offloading of inactive experts is viable if the active expert computation fits in cache
Performance:
- Mixtral 8x7B on consumer GPU (24GB) + CPU: ~8–12 tok/s
- CPU-only MoE inference: not well-benchmarked; active expert loading from RAM adds latency
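The "only active experts per forward pass" idea comes down to a top-k selection over router scores. The sketch below shows that selection step in isolation, assuming a Mixtral-style router that picks 2 of 8 experts per token. The function name `top_k_experts` is hypothetical, and the softmax weighting of the chosen experts is omitted.

```rust
/// Return the indices of the `k` highest router logits, best first.
/// Only these experts' weights need to be resident for the current token.
/// Illustrative helper, not taken from KTransformers.
pub fn top_k_experts(router_logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..router_logits.len()).collect();
    // Stable sort in descending logit order; ties keep the lower expert index first.
    idx.sort_by(|&a, &b| router_logits[b].total_cmp(&router_logits[a]));
    idx.truncate(k);
    idx
}
```

With 8 experts and k = 2, only a quarter of the expert weights in each layer are touched per token, which is what makes offloading the remainder to slower memory plausible.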
12.2 SGLang
Repository: github.com/sgl-project/sglang License: Apache 2.0 Paper: "SGLang: Efficient Execution of Structured Language Model Programs" (arXiv:2312.07104)
SGLang is a GPU-focused serving system emphasizing:
- RadixAttention: Trie-based prefix sharing for KV cache
- Structured generation: JSON schema-guided decoding with constrained sampling
- Program-level optimization: Compiles LLM programs into optimized execution graphs
Concepts to steal for CPU:
- RadixAttention - Trie-based KV cache sharing is memory-efficient for multi-turn chat (store shared system prompt KV once, share across sessions)
- Jump-forward decoding - Deterministic portions of output decoded in batch (reduce decode iterations)
- Constrained decoding - JSON output mode useful for API consumers
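The prefix-sharing idea can be illustrated with a token trie that reports how much of a new prompt is already cached. This is a simplified sketch, not SGLang's implementation: it uses one node per token instead of a compressed radix tree, and the names `PrefixCache` and `longest_prefix` are hypothetical. A real cache would also attach KV block references to nodes and evict cold branches.

```rust
use std::collections::HashMap;

/// One trie node per token id (uncompressed, unlike a true radix tree).
#[derive(Default)]
pub struct PrefixNode {
    children: HashMap<u32, PrefixNode>,
}

/// Tracks which token prefixes already have KV entries computed.
#[derive(Default)]
pub struct PrefixCache {
    root: PrefixNode,
}

impl PrefixCache {
    /// Record that KV entries exist for every prefix of `tokens`.
    pub fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    /// Number of leading tokens whose KV entries can be reused.
    pub fn longest_prefix(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut matched = 0;
        for t in tokens {
            match node.children.get(t) {
                Some(child) => {
                    node = child;
                    matched += 1;
                }
                None => break,
            }
        }
        matched
    }
}
```

For multi-turn chat, the system prompt is inserted once, and each new session only runs prefill for the tokens beyond the matched prefix.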
13. Comparison Matrix
| Project | Language | CPU Viable | 9B @ 2-vCPU tok/s | Memory (9B Q4) | License | Reusability |
|---|---|---|---|---|---|---|
| llama.cpp | C/C++ | ✓ (best-in-class) | 2–4 | ~6.3 GB | MIT | HIGH - entire ggml library |
| vLLM | Python+CUDA | ✗ | N/A | N/A (GPU) | Apache 2.0 | MEDIUM - scheduler concepts |
| OpenVINO | C++ | ✓ (Intel-optimized) | 2–4 (Intel) | ~5.5 GB | Apache 2.0 | HIGH - oneDNN kernels |
| ONNX RT | C++ | ✓ | 3–5 [EST] | ~5.5 GB | MIT | MEDIUM - MLAS, graph opts |
| TensorRT-LLM | C++/CUDA | ✗ | N/A | N/A | Apache 2.0 | MEDIUM - architectural concepts |
| MLC-LLM | Python/TVM | ✓ (mediocre) | 1.5–3 [EST] | ~5.5 GB | Apache 2.0 | LOW - compilation approach |
| Candle | Rust | ✓ (60% of llama.cpp) | 1.5–3 | ~6.0 GB | MIT | HIGH - Rust tensor framework |
| mistral.rs | Rust | ✓ (wraps Candle) | 1.5–3 | ~6.0 GB | MIT | HIGH - serving architecture |
| burn | Rust | ✓ (poor) | <1 [EST] | ~18 GB | MIT | LOW - training-focused |
| kalosm | Rust | ✓ (wraps Candle) | ~2 [EST] | ~6.0 GB | MIT | LOW - too high-level |
14. Explicit Recommendations
What to Build From Scratch (Rust)
- Memory manager - Custom mmap-based weight loader with eviction policies, not available in any Rust crate
- KV cache allocator - Contiguous, pre-allocated, with sliding window and eviction (not paged - CPU cache favors contiguous)
- SIMD kernel layer - Hand-tuned AVX2/AVX-512 intrinsics for quantized matmul, optimized for 2-thread scenarios
- Streaming weight executor - Load-compute-release pattern for layer-by-layer execution under memory pressure
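As one illustration of the KV cache allocator item, the following sketch shows a contiguous, pre-allocated cache with a sliding window implemented as a ring buffer. It is deliberately reduced to a single layer and a single head, and the type name `SlidingKvCache` and its methods are hypothetical. A real allocator would add the layer and head dimensions and probably a quantized element type.

```rust
/// Fixed-size KV cache for one layer/head; the oldest token is overwritten
/// once `window` tokens have been stored. Memory is allocated once up front.
pub struct SlidingKvCache {
    keys: Vec<f32>,   // window * dim, contiguous
    values: Vec<f32>, // window * dim, contiguous
    dim: usize,
    window: usize,
    written: usize, // total tokens appended so far
}

impl SlidingKvCache {
    pub fn new(window: usize, dim: usize) -> Self {
        assert!(window > 0 && dim > 0);
        Self {
            keys: vec![0.0; window * dim],
            values: vec![0.0; window * dim],
            dim,
            window,
            written: 0,
        }
    }

    /// Store the K/V vectors for the next position, evicting the oldest if full.
    pub fn append(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.dim);
        assert_eq!(v.len(), self.dim);
        let slot = (self.written % self.window) * self.dim;
        self.keys[slot..slot + self.dim].copy_from_slice(k);
        self.values[slot..slot + self.dim].copy_from_slice(v);
        self.written += 1;
    }

    /// Number of positions currently held.
    pub fn len(&self) -> usize {
        self.written.min(self.window)
    }

    /// Key vector for absolute position `pos`, if it is still inside the window.
    pub fn key(&self, pos: usize) -> Option<&[f32]> {
        if pos >= self.written || pos + self.window < self.written {
            return None;
        }
        let slot = (pos % self.window) * self.dim;
        Some(&self.keys[slot..slot + self.dim])
    }

    /// Value vector for absolute position `pos`, if it is still inside the window.
    pub fn value(&self, pos: usize) -> Option<&[f32]> {
        self.key(pos)?; // reuse the window check
        let slot = (pos % self.window) * self.dim;
        Some(&self.values[slot..slot + self.dim])
    }
}
```

No allocation happens after `new`, so memory use is fixed at `2 * window * dim` elements regardless of how long generation runs.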
What to Borrow Directly
| Component | Source | Form |
|---|---|---|
| GGUF parser | gguf crate or write custom using zerocopy | Rust code |
| Quantization schemes (Q4_K_M, Q5_K_S, IQ4_XS) | llama.cpp format spec | Format adoption |
| Model architectures (Llama, Qwen2, Gemma2) | Candle-transformers | Adapted Rust code |
| Tokenizer | tokenizers crate (HuggingFace) | Direct dependency |
| API server pattern | mistral.rs + axum | Adapted Rust code |
| Scheduler concepts | vLLM scheduler + SGLang RadixAttention | Architectural adoption |
| INT4 matmul reference | oneDNN source code | Study + reimpl or FFI |
What to FFI Into
| Library | Use Case | When |
|---|---|---|
| oneDNN | INT4/INT8 matmul as acceleration coprocessor | MVP phase, Intel hardware |
| ggml-cpu | Proven SIMD kernels as reference | Port gradually to Rust |
Critical Design Decision
The runtime should NOT be a wrapper around llama.cpp. While FFI to ggml is tempting, it:
- Prevents memory management optimization (ggml's allocator is opaque)
- Prevents streaming weight loading (ggml expects full model in memory)
- Limits Rust safety guarantees (entire unsafe FFI boundary)
- Rules out custom eviction/paging strategies
Instead: Build a Rust-native runtime that uses:
- GGUF format for model storage (parse ourselves or via the `gguf` crate)
- Rust SIMD intrinsics for kernels (gradually replace any FFI)
- Custom memory management for the 6 GB constraint
- Candle or custom Rust tensor representation as fallback
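Before writing intrinsics, it helps to have a scalar reference that any SIMD kernel must match. The sketch below computes a dot product over one Q4_0-style block: 32 weights stored as 4-bit values with an offset of 8 and one shared scale, low nibbles holding elements 0–15 and high nibbles elements 16–31. Two simplifications are assumptions of this sketch: the scale is an `f32` (GGUF stores it as f16), and activations are plain `f32` instead of being quantized first. The names `BlockQ4` and `dot_q4` are hypothetical.

```rust
/// Number of weights per quantization block.
pub const BLOCK: usize = 32;

/// One block of 32 weights: a shared scale plus 16 bytes of packed 4-bit values.
pub struct BlockQ4 {
    pub scale: f32, // assumption: f32 for a std-only sketch; the file format uses f16
    pub quants: [u8; BLOCK / 2],
}

/// Scalar reference: dot product of one quantized block with 32 activations.
pub fn dot_q4(block: &BlockQ4, x: &[f32; BLOCK]) -> f32 {
    let mut acc = 0.0f32;
    for (j, &byte) in block.quants.iter().enumerate() {
        // Each nibble is an unsigned 0..=15 value; subtracting 8 recentres it.
        let lo = (byte & 0x0F) as i32 - 8; // weight j
        let hi = (byte >> 4) as i32 - 8; // weight j + 16
        acc += lo as f32 * x[j] + hi as f32 * x[j + BLOCK / 2];
    }
    // The scale is applied once per block, not once per weight.
    acc * block.scale
}
```

A `core::arch` version would unpack nibbles and accumulate across lanes, and this function gives the value to compare against in unit tests.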
15. Implementation Implications
Based on this state-of-the-art survey, the implementation team should:
- Start from Candle's model implementations (they have working Qwen2, Llama, Gemma2, Mistral in Rust) but replace the CPU backend with optimized kernels.
- Adopt GGUF as the primary format - it's the most mature quantized serialization format, widely available (HuggingFace has GGUF versions of most models), and mmap-friendly.
- Target Q4_K_M quantization as the default - it provides the best quality/size tradeoff at ~5.5 GB for 9B models, fitting within 6 GB with KV cache at context 2048-4096.
- Build the SIMD kernel layer from scratch in Rust using `core::arch` - Candle's kernels are too slow (60-70% of llama.cpp), and llama.cpp's C kernels can be ported methodically.
- Implement streaming weight loading from day one - mmap the full model file, access weights layer-by-layer, use `madvise` hints for the OS page cache. This is the key differentiator for 6 GB systems.
- Skip continuous batching - at 2 threads, the overhead of dynamic scheduling exceeds batch-size-1 sequential processing. Add microbatching (fixed batch size 2) only if profiling shows benefit.
- Implement KV cache with sliding window + LRU eviction - not PagedAttention (wrong for CPU), but SGLang-style RadixAttention for prefix sharing in multi-turn chat.
- Use mistral.rs's API server design as the starting template for the OpenAI-compatible HTTP layer.
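The layer-by-layer access pattern behind the streaming weight loading recommendation can be sketched with the standard library alone. This version uses explicit seek-and-read into one reusable buffer as a stand-in for mmap plus `madvise`, which would need platform bindings outside `std`. The `LayerSpan` layout and the `stream_layers` function are hypothetical; in practice the byte ranges would come from the GGUF tensor index.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Byte range of one layer's weights inside the model file (hypothetical layout).
pub struct LayerSpan {
    pub offset: u64,
    pub len: usize,
}

/// Visit layers in order while holding at most one layer's bytes in memory.
/// `compute` receives the layer index and that layer's raw weight bytes.
pub fn stream_layers<R, F>(src: &mut R, layers: &[LayerSpan], mut compute: F) -> io::Result<()>
where
    R: Read + Seek,
    F: FnMut(usize, &[u8]),
{
    // One buffer sized for the largest layer, reused for every layer.
    let max_len = layers.iter().map(|l| l.len).max().unwrap_or(0);
    let mut buf = vec![0u8; max_len];

    for (i, layer) in layers.iter().enumerate() {
        // Load: bring only this layer's weights into memory.
        src.seek(SeekFrom::Start(layer.offset))?;
        src.read_exact(&mut buf[..layer.len])?;
        // Compute: run the layer against the resident weights.
        compute(i, &buf[..layer.len]);
        // Release: nothing to free; the next iteration overwrites `buf`.
    }
    Ok(())
}
```

Peak weight memory in this sketch is the size of the largest layer instead of the whole model, at the cost of re-reading every layer for each token. An mmap-based version would instead lean on the OS page cache to decide what stays resident.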
Next: Document 2 (Memory Architecture) provides the mathematical framework for fitting a 9B model into 6 GB RAM, including detailed KV cache sizing and memory budget analysis.