Implementation Roadmap and Open Problems
- Reading time
- 15 min read
- Word count
- 2825 words
- Diagram count
- 0 diagrams
Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Research on CPU LLM Inference/08-implementation-roadmap.md.
Implementation Roadmap and Open Problems
Research Program: CPU-Native LLM Inference Runtime Target Spec: 9B parameter model, 2 vCPUs, 6 GB RAM, 2–5 tok/s Author: Research Agent Date: June 2025
1. Recommended Architecture (Synthesis)
Based on the research across Documents 1–7 and 9, the following architecture is recommended:
1.1 Core Design Decisions
| Decision | Choice | Justification |
|---|---|---|
| Language | Rust | Memory safety, zero-cost abstractions, no GC pauses, strong SIMD support |
| Quantization format | GGUF (parse) → internal CQR-4 aligned representation | GGUF ecosystem + CPU-optimized internal layout |
| Default quantization | Q4_K_M | Best quality/size tradeoff at ~5 GB for 8-9B models |
| KV cache | Contiguous, pre-allocated, FP16 default + INT8 option | CPU cache-friendly; INT8 doubles context headroom |
| KV eviction | Sliding window + LRU session eviction | Bounded memory; simple; covers multi-session |
| Threading | Topology-adaptive (1 or 2 threads) | Avoid hyperthreading contention on cloud VMs |
| Batching | Batch size 1, sequential processing | Memory + thread constraints make batching counterproductive |
| Weight loading | mmap + madvise(MADV_SEQUENTIAL) | Near-instant load; OS page cache management |
| Memory allocator | Bump arena (256 MB) for activations | O(1) alloc/free; zero fragmentation |
| API | axum + tokio, OpenAI-compatible | Production-grade; async streaming |
| SIMD kernels | Pure Rust core::arch (AVX2 primary, scalar fallback) | No C FFI build complexity |
| Speculative decoding | Optional, draft model (Qwen2.5-0.5B) | Enable when base throughput < 3 tok/s |
| Prefix caching | System prompt KV cache | Multi-turn chat optimization |
1.2 Primary Target Model
Llama-3.1-8B-Instruct at Q4_K_M
- Fits 6 GB with 4096-token context (FP16 KV) or 8192 (INT8 KV)
- Standard architecture (no exotic features)
- Massive community support and GGUF model availability
- 4:1 GQA ratio - reasonable KV cache
Secondary: Qwen2.5-7B (more memory headroom, 7:1 GQA) Stretch: Gemma-2-9B (highest quality but tight fit, requires INT8 KV)
1.3 What to Borrow vs Build
Borrow (reuse with minimal modification):
- GGUF format specification and parsing logic (from
ggufcrate or write from spec) - Quantization scheme definitions (Q4_K_M block structure, scale encoding)
- Model architecture implementations (adapted from Candle-transformers)
- Tokenizer (
tokenizerscrate from HuggingFace) - HTTP API patterns (from mistral.rs axum server)
Build from scratch (no adequate crate exists):
- SIMD kernel layer (Q4_K_M dot product, attention, RoPE, RMSNorm)
- Memory manager (mmap + madvise + bump allocator + memory budget tracking)
- KV cache manager (contiguous allocation, sliding window, INT8 quantization)
- Streaming weight executor (layer-by-layer with prefetch hints)
- Topology-adaptive thread pool
Borrow via FFI (optional, for performance validation):
- oneDNN matmul (as benchmark reference, not runtime dependency)
- OpenBLAS/BLIS (for FP16 fallback matmul, if Rust kernels too slow initially)
1.4 License Verification
| Component | License | Reuse Safety |
|---|---|---|
| ggml.cpp (llama.cpp's tensor lib) | MIT | ✅ Safe to study, safe to port algorithms |
| llama.cpp source | MIT | ✅ Safe to borrow algorithms and re-implement |
| Candle | MIT/Apache-2.0 | ✅ Safe to adapt model implementations |
| mistral.rs | MIT | ✅ Safe to adapt server patterns |
| GGUF format spec | Public domain | ✅ No license concerns |
| Model weights (Llama, Qwen, Gemma) | Various (community, research) | ⚠️ Check per-model license |
Recommendation: All MIT/Apache-2.0 licensed code can be freely studied and re-implemented. No copyleft concerns (no GPL dependencies in critical path).
2. Phased Implementation Plan
Phase 1: Proof of Concept (3 weeks)
Goal: Load a GGUF model, run a single forward pass, generate 1 token. Everything built from scratch.
Deliverables:
- Custom GGUF file parser (zero-copy mmap access, ~400 lines)
- Custom
f16type with IEEE-754 conversions (~200 lines) - Custom
libc::mmapwrapper for model file loading (~100 lines) - Model architecture implementation (Llama-3.1-8B: attention, FFN, RMSNorm, RoPE)
- Scalar (non-SIMD) matmul reference implementation
- Custom BPE tokenizer (~500 lines)
- Basic token sampling (greedy + temperature)
- CLI tool:
cargo run -- generate <model.gguf> --prompt "Hello"
Technical notes:
libccrate is the ONLY external dependency - rawmmap,munmap,madvisecalls- Custom f16 struct: wrapping
u16withto_f32()/from_f32()using bit manipulation - Custom
cast_slice::<T>(bytes: &[u8]) -> &[T]for zero-copy tensor access - Tokenizer parses HuggingFace
tokenizer.jsonusing our custom JSON parser - No performance optimization - correctness first
Build vs reuse:
- GGUF parser: from scratch (alignment-sensitive, tied to streaming design)
- Model architectures: from scratch (tightly coupled to our memory layout)
- Tokenizer:
tokenizerscrate or from scratch (either works, not a bottleneck) - HTTP API:
axum+serde(Phase 5 - not blocking)
Success criteria:
- Output matches llama.cpp reference output for the same model + seed
- Model loads from GGUF file (or raw safetensors for streaming mode)
Dependencies:
libc: syscall bindings (mmap, madvise, sched_getaffinity)memmap2: mmap convenience (optional, can use raw libc)half: f16 type (optional, can build custom)- Rust
stdandcore::arch- SIMD intrinsics are in the standard library
Phase 2: Optimized Kernels (3 weeks)
Goal: AVX2 SIMD kernels for quantized matmul and attention. Throughput reaches 2.5+ tok/s.
Deliverables:
q4k_matmul.rs: Q4_K_M × FP16 vector dot product using AVX2 intrinsics- 256-weight block processing
- Fused dequant + multiply + accumulate
- Scalar fallback for non-AVX2 CPUs
attention.rs: Multi-head attention with GQA support- QKV projection (3 matmuls or fused wider matmul)
- Scaled dot-product attention
- RoPE application (fused with Q/K projection if possible)
norm.rs: RMSNorm (AVX2 vectorized)activation.rs: SiLU element-wise (AVX2)- Benchmark suite: kernel-level throughput measurement
Technical approach:
// AVX2 kernel detection at compile time
#[cfg(target_feature = "avx2")]
pub unsafe fn q4k_dot_avx2(weights: &[u8], scales: &[u8], x: &[f16], n: usize) -> f32 {
// _mm256_load_si256, _mm256_and_si256, _mm256_maddubs_epi16, etc.
}
#[cfg(not(target_feature = "avx2"))]
pub fn q4k_dot_scalar(weights: &[u8], scales: &[u8], x: &[f16], n: usize) -> f32 {
// Plain Rust loop
}
// Runtime dispatch
pub fn q4k_dot(weights: &[u8], scales: &[u8], x: &[f16], n: usize) -> f32 {
#[cfg(target_feature = "avx2")]
if is_avx2_available() {
return unsafe { q4k_dot_avx2(weights, scales, x, n) };
}
q4k_dot_scalar(weights, scales, x, n)
}
Success criteria:
- AVX2 Q4_K_M dot product matches scalar reference within 0.1% (numerical accuracy)
- Achieves ≥2.5 tok/s decode on reference hardware (Xeon Ice Lake 2-core)
- Kernel throughput >10 GFLOPs (INT4 ops)
Phase 3: Memory Management (2 weeks)
Goal: mmap-based weight loading, KV cache management, memory budget enforcement.
Deliverables:
mmap.rs: Model weight loading via mmap with madvise hintsmmap(MAP_PRIVATE)for the model filemadvise(MADV_SEQUENTIAL)after loading- Optional
madvise(MADV_DONTNEED)per layer (streaming mode)
kv_cache.rs: Pre-allocated contiguous KV cache- Initialization with configurable max_context
- Append (write K/V for new token)
- Read (get K/V for attention computation)
- Sliding window support (circular buffer mode)
arena.rs: Bump allocator for activation tensors- Pre-allocate 256 MB
- O(1) alloc and reset
- Memory budget monitor
- Track RSS via
/proc/self/status - Alert/reject when approaching limit
- Track RSS via
Success criteria:
- Model "loads" in <1 second (mmap is instant)
- Peak RSS < 5.5 GB for Llama-3.1-8B Q4_K_M at context 4096
- Sliding window correctly bounds KV cache (no unbounded growth)
Phase 4: Threading and Batching (1 week)
Goal: Optimal threading for 2 vCPU systems, simple request queue.
Deliverables:
- Topology detection (physical vs logical cores via CPUID on x86, sys on ARM)
- Thread pool with adaptive sizing (1 or 2 threads)
- Row-sliced parallel matmul (split output rows across threads)
- Micro-benchmark: 1 thread vs 2 threads, pick winner
- Simple FIFO request queue with backpressure
Success criteria:
- Correctly detects 1 physical core + 1 hyperthread → uses 1 worker thread
- Correctly detects 2 physical cores → uses 2 worker threads
- 2-thread configuration achieves ≥1.4× speedup over 1-thread (when applicable)
Phase 5: API Server (1 week)
Goal: Production-ready OpenAI-compatible HTTP API with SSE streaming.
Deliverables:
axum-based HTTP server with SSE streaming support- OpenAI-compatible endpoints:
POST /v1/chat/completions,POST /v1/completions,GET /v1/models,GET /health serde-based JSON request/response handling- Request queue with backpressure (bounded channel, 429 on overflow)
clap-based CLI withtomlconfig file supporttracing-based structured logging- Graceful shutdown on SIGTERM/SIGINT
Success criteria:
- API overhead < 50ms per request
- Streaming delivers tokens within 100ms of generation
- Compatible with openai-python, curl, and standard OpenAI client libraries
Phase 6: Streaming Weight Execution (2 weeks)
Goal: Run unquantized 9B FP16 models on 5 GB RAM via mmap streaming from NVMe disk. This is the killer differentiator.
Deliverables:
- Streaming executor - Layer-by-layer forward pass with
madvise(DONTNEED)after each layer - Intelligent readahead - Prefetch next layer weights from disk while computing current layer
MADV_SEQUENTIALhints - Optimize kernel readahead pattern for sequential layer access- Hybrid mode - Auto-select quantized (fast) vs streaming FP16 (quality) based on RAM and request
- Partial resident caching - Keep embedding/lm_head + frequently accessed small layers pinned
- Disk speed benchmarking - Auto-detect SSD vs HDD, adjust streaming parameters
Success criteria:
- Unquantized Llama-3.1-8B FP16 generates tokens on 5 GB RAM machine
- Achieves ≥0.2 tok/s decode (with NVMe)
- Prefill throughput ≥50 tok/s after initial weight load
- No OOM kills, graceful degradation on slow disk
Why this works: The model file is mmap'd into virtual memory (16 GB virtual, 0 physical). Each layer (~500 MB FP16) is page-faulted from SSD as needed, computed, then released via madvise(DONTNEED). Peak physical RAM: ~1.3 GB (current layer + KV cache + overhead). The SSD acts as extended memory - essentially a "memory tier between RAM and network storage."
Phase 7: Advanced Features (2 weeks)
Goal: Optional performance and UX enhancements.
Deliverables:
- INT8 KV cache - Quantize KV cache entries to INT8 (half memory, ~0.2 ppl loss)
- Prefix caching - Cache system prompt KV for multi-turn speedup
- Speculative decoding - Optional draft model (Qwen2.5-0.5B) for throughput boost
- AVX-512 VNNI kernels - When available, 30-50% per-core speedup
- Memory pressure handling - Dynamic context reduction when RAM is tight
Success criteria:
- INT8 KV cache: no quality regression >0.5 ppl on standard benchmarks
- Prefix caching: >50% TTFT reduction for multi-turn with shared system prompt
- Speculative decoding: >1.5× speedup when enabled (with compatible draft model)
3. Timeline Summary
| Phase | Duration | Cumulative | Deliverable |
|---|---|---|---|
| Phase 1: PoC | 3 weeks | 3 weeks | Scalar forward pass, GGUF loader, basic generation |
| Phase 2: Kernels | 3 weeks | 6 weeks | AVX2/AVX-512 SIMD kernels, ≥2.5 tok/s |
| Phase 3: Memory | 2 weeks | 8 weeks | mmap streaming, KV cache, arena allocator |
| Phase 4: Threading | 1 week | 9 weeks | Adaptive threading, request queue |
| Phase 5: API | 1 week | 10 weeks | axum HTTP server, OpenAI-compatible API |
| Phase 6: Streaming | 2 weeks | 12 weeks | Unquantized 9B FP16 on 5 GB RAM via disk streaming |
| Phase 7: Advanced | 2 weeks | 14 weeks | INT8 KV, prefix cache, speculative decoding |
| Total | ~14 weeks | MVP + streaming milestone |
Buffer: Add 2-3 weeks for debugging, profiling, and optimization → 16-17 weeks (~4 months) for a single engineer.
Milestone targets:
| Milestone | When | Achievement |
|---|---|---|
| M1: First Token | Week 3 | 9B model loads, generates 1 token (Q4_K_M, 6 GB) |
| M2: Interactive | Week 6 | 3-5 tok/s decode, usable for chat |
| M3: Production | Week 10 | OpenAI API, deployment-ready |
| M4: Extreme | Week 12 | 9B unquantized FP16 on 5 GB RAM (streaming) |
| M5: Advanced | Week 14 | Speculative decoding, INT8 KV, prompt cache |
4. Open Research Problems
4.1 Is 2 vCPU Genuinely Viable or Physics-Bound?
Status: RESOLVED (conditionally viable)
From Document 7 analysis:
- Theoretical max: 5.4–8.1 tok/s for Llama-3.1-8B Q4_K_M on good cloud VMs
- llama.cpp achieves: 3.5–5.2 tok/s on similar hardware
- Our target: 3.5–5.0 tok/s (matching llama.cpp baseline)
Conclusion: 2 vCPU IS viable for interactive use (>2 tok/s) on properly provisioned cloud VMs (non-burstable, local SSD, AVX2+). The physics bound is memory bandwidth, not compute - and cloud VM memory bandwidth (~20-30 GB/s) is sufficient.
Risk: Burstable instances (t3, e2-micro) or noisy neighbor scenarios can drop below 2 tok/s. The runtime should detect and report this condition.
4.2 Is There a CPU-Optimied Quantization Format That Doesn't Exist Yet?
Status: PARTIALLY RESOLVED
From Document 3 analysis:
- Current best: Q4_K_M (GGUF) with super-block scales - already well-optimized
- Proposed CQR-4 format: 64-byte aligned blocks, optimized for AVX2 sequential access
- Potential improvement: 64-byte alignment + precomputed weight ordering for stride-1 access → 5-10% bandwidth improvement over GGUF
Open question: Can we design a format that encodes weights in the access order of the compute graph (not just tensor-major order), enabling true zero-overhead sequential streaming? This would require encoding the execution plan into the weight file layout.
4.3 Context Lengths >4096 on 6 GB Without Disk
Status: PARTIALLY RESOLVED
Solutions identified:
- INT8 KV cache: doubles effective context (4096→8192 for Llama-3.1-8B)
- INT4 KV cache: quadruples (but quality loss ~1-2% accuracy)
- Sliding window: bounds context at model-native window (4096 for Gemma-2-9B)
- Streaming weight execution: frees memory for larger KV cache
For >8K context on 6 GB: Not achievable without either INT4 KV (significant quality loss) or disk offloading (throughput penalty). The runtime should document this limitation clearly.
4.4 NUMA-Unaware 2-Hyperthread Cloud VM: Distinct Optimization Target?
Status: RESOLVED
On 2 hyperthreads sharing 1 physical core:
- 1 thread outperforms 2 threads for SIMD workloads (resource contention)
- Memory bandwidth is shared - no benefit from 2 threads for memory-bound work
- Distinct optimization: always use 1 thread on hyperthreaded single-core VMs
This is already captured in our topology-adaptive threading design.
5. Risks and Mitigations
5.1 Risk Matrix
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| llama.cpp already solves this well | HIGH | MEDIUM | Differentiate on memory engineering + API-first design; see differentiation strategy below |
| Memory budget impossible at 6 GB for interactive chat | LOW (proven achievable) | HIGH | Fallback to 7B model (Qwen2.5-7B) or reduce context; runtime auto-detects and warns |
| 2 vCPU throughput < 1 tok/s on bad VMs | MEDIUM | HIGH | Document minimum instance specs; runtime benchmark + warning on startup |
| Rust SIMD performance < C++ parity | LOW | MEDIUM | Profile extensively; fallback to FFI oneDNN for hot kernels if needed |
| GGUF format changes breaking compatibility | LOW | LOW | Pin GGUF reader to specific version; abstract format layer |
| Quantization quality insufficient at Q4 | LOW | MEDIUM | Support Q5_K_M as premium option; implement INT8 KV to free memory for larger quant |
| Build complexity (SIMD intrinsics) | MEDIUM | LOW | Feature-flag AVX2/AVX-512/NEON; extensive unit tests against scalar reference |
5.2 Differentiation Strategy: Why Not Just Use llama.cpp?
Legitimate reasons to build this runtime instead of wrapping llama.cpp:
| llama.cpp Limitation | Our Solution | User Impact |
|---|---|---|
| No memory pressure handling | Proactive budget management + graceful degradation | No OOM crashes in production |
| Basic HTTP server | Production axum server with connection pooling, health checks, metrics | Production-ready deployment |
| No INT8 KV cache | INT8 KV option doubles context headroom | Longer conversations fit in RAM |
| C/C++ (memory safety) | Rust (memory safe, auditable) | No buffer overflows; security audits easier |
| Monolithic architecture | Modular crate-based design | Embeddable in larger Rust applications |
| No speculative decoding | Draft model speculative decoding | 1.5-2.5× throughput when enabled |
| Not cloud-native | Topology detection, memory-adaptive, cloud-optimized | Better performance on constrained VMs |
Honest assessment: For most users, llama.cpp's llama-server is sufficient. Our runtime targets teams that:
- Need Rust integration (embedded in a larger Rust service)
- Need production-grade serving infrastructure
- Need aggressive memory engineering for constrained environments
- Want forward-looking features (BitNet) not yet in llama.cpp
6. Required Follow-Up Research
6.1 Empirical Experiments to Run
These experiments require hardware access and should be performed during Phase 2:
-
Memory bandwidth measurement on target cloud VMs:
- Run STREAM benchmark on AWS c6i.large, GCP e2-standard-2, Azure D2s_v5
- Establish actual bandwidth baseline (vs theoretical)
-
llama.cpp comparison benchmark:
- Run
llama-bench -m llama-3.1-8b-q4_k_m.gguf -t 1 -t 2 -r 10on target VMs - Establish baseline to beat
- Run
-
AVX2 vs AVX-512 matmul throughput:
- Implement reference Q4_K_M kernel in both instruction sets
- Measure on VMs with/without AVX-512
-
INT8 KV cache quality measurement:
- Run MMLU benchmark with FP16 KV vs INT8 KV
- Quantify quality loss
-
Hyperthreading impact measurement:
- Pin to 1 physical core (2 HT) vs 2 physical cores
- Measure throughput difference
6.2 Papers to Read
| Paper | Relevance | Priority |
|---|---|---|
| "FlashAttention-3: Fast and Accurate Attention with Asynchronism and Low-precision Math" (2024) | Attention optimization applicable to CPU | MEDIUM |
| "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache" (2024) | INT2 KV cache techniques | HIGH (if INT4 KV is insufficient) |
| "EfficientDM: An Efficient Diffusion Model Accelerator Using 4-bit Quantization" (2024) | Quantization techniques generalizable to LLMs | LOW |
| "Atom: Low-bit Quantization for Efficient and Accurate LLM Serving" (2024) | INT4 activation quantization | MEDIUM |
| "FlatAttention: Efficient KV Cache Compression for Long Context LLMs" (2025) | KV cache compression techniques | HIGH |
| "SpinQuant: LLM Quantization with Learned Rotations" (2024) | Rotation-based quantization (QuIP# variant) | MEDIUM |
| "SageAttention: Lossless 8-bit Attention" (2024) | INT8 attention without quality loss | HIGH |
7. Implementation Implications
7.1 Critical Path
The critical path to production-ready MVP:
Phase 1 (2w) → Phase 2 (3w) → Phase 3 (2w) → Phase 4 (1w) → Phase 5 (1w)
Phase 6 (advanced features) can be deferred without blocking the MVP.
Earliest MVP date (single engineer full-time): 9 weeks from start.
7.2 First Week Action Items
- Create Cargo workspace with crate structure from Document 5
- Implement GGUF parser (use
ggufcrate or write from spec) - Download Llama-3.1-8B-Instruct.Q4_K_M.gguf for development
- Implement RMSNorm scalar reference (simplest layer to validate pipeline)
7.3 Success Metrics Dashboard
Track continuously during development:
- Decode tok/s on reference hardware per commit
- Peak RSS per commit
- Kernel test pass rate (correctness)
- Unsafe line count (minimize, track)
- API latency overhead per commit
This roadmap synthesizes the complete research into actionable implementation steps. The next document (Document 9: Rust Ecosystem) provides the definitive crate landscape and gap analysis for implementation.