Runtime Architecture - Scheduling, Batching, and Request Handling
Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Research on CPU LLM Inference/05-inference-engine-architecture.md.
Research Program: CPU-Native LLM Inference Runtime
Target Spec: 9B parameter model, 2 vCPUs, 6 GB RAM, 2–5 tok/s
Author: Research Agent
Date: June 2025
1. Introduction
This document proposes the complete runtime architecture for our CPU-native LLM inference engine. It covers component boundaries, the Rust crate ecosystem decisions, continuous batching applicability, KV cache management, speculative decoding, prompt caching, and the HTTP API layer.
Zero-dependency philosophy: everything is built from scratch. No third-party crates except libc for syscall wrappers (mmap, madvise, sched_getaffinity). Every component - HTTP server, tokenizer, thread pool, memory allocator, SIMD kernels, JSON parser - is custom-implemented. This gives:
- Total control over every allocation and cache line
- Zero supply chain risk
- Maximum optimization surface (no opaque crate behavior)
- Full auditability of the entire codebase
- Binary size <10 MB (no dependency bloat)
2. Overall Runtime Architecture
2.1 Component Diagram
┌─────────────────────────────────────────────────────────────┐
│  API Server (axum)                                          │
│  OpenAI-compatible HTTP + SSE streaming + WebSocket         │
├─────────────────────────────────────────────────────────────┤
│  Request Queue                                              │
│  Priority queue with backpressure, timeout handling         │
├──────────────┬──────────────┬───────────────┬───────────────┤
│ Scheduler    │ KV Cache     │ Tokenizer     │ Sampler       │
│ (FIFO +      │ Manager      │ (HF tok.      │ (temperature, │
│ priority     │ (contiguous, │ crate)        │ top-p, etc.)  │
│ + preempt)   │ sliding,     │               │               │
│              │ eviction)    │               │               │
├──────────────┴──────────────┴───────────────┴───────────────┤
│  Executor (Core Loop)                                       │
│  Layer-by-layer forward pass, SIMD kernel dispatch          │
├─────────────────────────────────────────────────────────────┤
│  Kernel Layer (Rust + SIMD)                                 │
│  Q4_K_M matmul, attention, RoPE, RMSNorm, activations       │
├─────────────────────────────────────────────────────────────┤
│  Memory Manager                                             │
│  mmap weights, bump allocator, KV pre-allocation            │
└─────────────────────────────────────────────────────────────┘
2.2 Crate Organization (Cargo Workspace)
cpu-llm-runtime/
├── Cargo.toml                  # Workspace root
├── crates/
│   ├── runtime/                # Main binary (API server + orchestration)
│   │   └── src/
│   │       ├── main.rs
│   │       ├── server.rs       # axum HTTP API
│   │       ├── scheduler.rs
│   │       └── config.rs
│   ├── executor/               # Core inference loop
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── forward.rs      # Layer-by-layer execution
│   │       ├── attention.rs
│   │       ├── sampling.rs
│   │       └── kv_cache.rs
│   ├── kernels/                # SIMD-optimized operations
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── q4k_matmul.rs   # Q4_K_M dot product (AVX2/AVX-512)
│   │       ├── matmul_f16.rs   # FP16 matmul (for small ops)
│   │       ├── attention.rs
│   │       ├── rope.rs
│   │       ├── norm.rs
│   │       └── activation.rs
│   ├── model/                  # Model loading + architecture
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── gguf.rs         # GGUF parser
│   │       ├── arch.rs         # Architecture configs (Llama, Qwen2, etc.)
│   │       ├── loader.rs       # mmap-based model loading
│   │       └── tokenizer.rs
│   └── memory/                 # Memory management
│       └── src/
│           ├── lib.rs
│           ├── mmap.rs         # mmap wrapper with madvise
│           ├── arena.rs        # Bump allocator
│           └── budget.rs       # Memory budget tracking
3. Core Execution Loop
3.1 Token Generation Cycle
The core loop executes once per generated token:
/// Generate one token given current context
fn generate_step(
    model: &Model,
    kv_cache: &mut KVCache,
    context: &[u32],                // token IDs in context
    last_token: u32,
    sampler_config: &SamplerConfig, // temperature, top-p, etc. (type name assumed)
) -> u32 {
    let mut hidden = model.embedding.lookup(last_token); // [hidden_dim]

    // Apply layer-by-layer with RoPE position encoding
    let position = context.len();
    for (layer_idx, layer) in model.layers.iter().enumerate() {
        // 1. RMSNorm (pre-attention)
        let normed = rms_norm(&hidden, &layer.input_norm);

        // 2. Multi-head attention with KV cache
        let attn_out = attention_forward(
            &normed,
            &layer.attn_weights, // Q, K, V, O projections
            kv_cache,
            layer_idx,
            position,
            &model.config,
        );

        // 3. Residual connection
        hidden.add_inplace(&attn_out);

        // 4. RMSNorm (pre-FFN)
        let normed = rms_norm(&hidden, &layer.post_attn_norm);

        // 5. Feed-forward network (SwiGLU for most models)
        let ffn_out = ffn_forward(&normed, &layer.ffn_weights);

        // 6. Residual connection
        hidden.add_inplace(&ffn_out);
    }

    // Final norm + LM head (logits)
    let normed = rms_norm(&hidden, &model.final_norm);
    let logits = matmul(&model.lm_head, &normed); // [vocab_size]

    // Sample next token (sampler_config is passed in rather than read from an undefined global)
    sample_token(&logits, sampler_config)
}
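The `rms_norm` call in the loop above is left undefined in this document. A minimal scalar sketch of what such a function might compute is shown here; the FP32 slices, the explicit `eps` parameter, and the three-argument signature are assumptions made for illustration and differ from the two-argument call in the loop.

```rust
/// RMSNorm over one hidden vector: y[i] = x[i] / rms(x) * weight[i].
/// `eps` guards against division by zero (its value is an assumption; 1e-5 is common).
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty(), "hidden vector must not be empty");
    assert_eq!(x.len(), weight.len(), "weight must match hidden size");

    // Mean of squares across the hidden dimension
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();

    // Scale each element by 1/rms and by its learned gain
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

A production kernel would likely write into a pre-allocated buffer from the activation arena instead of returning a fresh `Vec`.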
3.2 Memory Access Pattern During One Token
For Llama-3.1-8B Q4_K_M, one token generation involves:
| Phase | Memory Reads | Memory Writes | Notes |
|---|---|---|---|
| Embedding lookup | 8 KB | 8 KB | 4096 × FP16 |
| Per-layer (×32): RMSNorm | 8 KB | 8 KB | In-place |
| Per-layer: Q projection | 8 MB | 8 KB | 4096² × 4.625/8 |
| Per-layer: K projection | 2 MB | 2 KB | GQA: 8 KV heads × 128 dim |
| Per-layer: V projection | 2 MB | 2 KB | GQA: 8 KV heads × 128 dim |
| Per-layer: KV cache update | - | 4 KB | Append K, V for current token |
| Per-layer: Attention | 512 KB | 8 KB | Read full K,V from cache |
| Per-layer: O projection | 8 MB | 8 KB | |
| Per-layer: FFN gate+up | 45 MB | 22 KB | 11008 × 4096 × 2 |
| Per-layer: FFN down | 22 MB | 8 KB | 4096 × 11008 |
| LM head | ~0.5 GB | 300 KB | 128K vocab × 4096 |
| TOTAL per token | ~3.7 GB | ~2 MB | |
This confirms the memory-bandwidth-bound nature: ~3.7 GB reads per token.
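To make the bandwidth bound concrete, the ceiling on decode speed is simply sustained memory bandwidth divided by bytes read per token. The sketch below encodes that arithmetic; the function name is hypothetical and any bandwidth figure passed in is an assumption about the target machine, not a measurement from this research.

```rust
/// Upper bound on decode throughput when every generated token must stream
/// `bytes_per_token` from memory at `bandwidth_bytes_per_sec`.
/// Ignores compute time and cache hits, so real throughput will be lower.
pub fn max_tokens_per_sec(bandwidth_bytes_per_sec: f64, bytes_per_token: f64) -> f64 {
    assert!(bytes_per_token > 0.0);
    bandwidth_bytes_per_sec / bytes_per_token
}
```

For example, with ~3.7 GB read per token, a machine that sustains an assumed 14.8 GB/s cannot exceed 4 tok/s no matter how fast the kernels are.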
4. Language Decision: Rust Ecosystem
4.1 Zero-Dependency Inference Engine Core
The inference engine core is built from scratch. Infrastructure crates are allowed.
| Component | Approach | Why |
|---|---|---|
| SIMD matmul kernels | FROM SCRATCH (core::arch) | The hot path - maximum control, cache-optimal |
| Quantization formats | FROM SCRATCH | Custom CQR-4 format optimized for streaming |
| KV cache manager | FROM SCRATCH | Novel streaming + eviction design |
| Weight memory manager | FROM SCRATCH | mmap streaming with madvise - no crate does this |
| Model execution loop | FROM SCRATCH | Layer-by-layer streaming design |
| Sampling / decoding | FROM SCRATCH | Tightly integrated with execution |
| BPE tokenizer | FROM SCRATCH or tokenizers crate | Either works, not a bottleneck |
| F16/BF16 types | FROM SCRATCH or half crate | Either works, simple type |
| Arena allocator | FROM SCRATCH (80 lines) | Trivial to build, full control |
| HTTP server | axum + tokio | Production-grade, not inference-related |
| JSON serialization | serde + serde_json | Standard, not inference-related |
| Config parsing | toml or clap | Standard, not inference-related |
| Logging | tracing | Standard, not inference-related |
The principle: Everything that touches model weights, KV cache, attention computation, or memory bandwidth is from scratch. Everything that's standard infrastructure uses the best available crate.
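As an illustration of why the arena allocator is listed as trivial to build, here is a minimal bump-arena sketch. The type and method names are hypothetical, and alignment is computed relative to the start of the buffer rather than the absolute address, which a real implementation would need to handle.

```rust
/// Minimal bump arena: hands out byte ranges from one pre-allocated buffer
/// and frees everything at once with `reset`.
pub struct BumpArena {
    buf: Vec<u8>,
    used: usize,
}

impl BumpArena {
    pub fn new(capacity: usize) -> Self {
        Self { buf: vec![0u8; capacity], used: 0 }
    }

    /// Reserve `size` bytes at an offset that is a multiple of `align`
    /// (a power of two). Returns None when the arena is exhausted.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        assert!(align.is_power_of_two());
        // Round the current offset up to the requested alignment
        let start = (self.used + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.used = end;
        Some(&mut self.buf[start..end])
    }

    /// Release every allocation in O(1), e.g. at the end of a forward pass.
    pub fn reset(&mut self) {
        self.used = 0;
    }

    pub fn used(&self) -> usize {
        self.used
    }
}
```

Resetting once per token keeps activation memory at a fixed high-water mark with no fragmentation.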
4.2 FFI Boundaries
Call oneDNN from Rust?
Option: Use onednn-sys (bindgen-generated bindings) to call oneDNN matmul kernels.
| Factor | FFI to oneDNN | Pure Rust kernels |
|---|---|---|
| Performance on Intel | ⭐⭐⭐⭐⭐ (optimal) | ⭐⭐⭐⭐ (close) |
| Performance on AMD | ⭐⭐⭐ (not AMD-optimized) | ⭐⭐⭐⭐ (AVX2-tuned) |
| Build complexity | High (C++ build, linker issues) | Low (pure Cargo) |
| Debuggability | Hard (C++ symbols, opaque) | Easy (Rust backtraces) |
| Portability | Linux x86_64 only | Cross-platform |
| Maintenance | External dep version tracking | Self-contained |
Decision: Pure Rust kernels. The marginal perf gain from oneDNN on Intel doesn't justify the build complexity and portability cost. Rust core::arch provides equivalent AVX2 throughput for the specific patterns we need (quantized matmul).
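A pure-Rust kernel built on `core::arch` usually pairs a runtime feature check with a `#[target_feature]` function and a scalar fallback. The sketch below shows that dispatch pattern on a plain FP32 dot product; it is not the Q4_K_M kernel, and the function names are hypothetical.

```rust
/// Portable fallback used when AVX2 is unavailable or on non-x86 targets.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// 8-wide dot product. Only call after confirming AVX2 support at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let chunks = a.len() / 8;
    let mut lanes = [0.0f32; 8];
    unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in 0..chunks {
            // Unaligned loads of 8 floats from each input
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        }
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    }
    // Horizontal sum of the 8 lanes, then the scalar tail
    let mut sum: f32 = lanes.iter().sum();
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

/// Public entry point: picks the best available implementation at runtime.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on the line above.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

In a real runtime the feature check would likely be done once at startup and cached as a function pointer, so the hot path avoids the branch.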
Call OpenBLAS? For FP32/FP16 matmul (embedding layer, LM head): openblas-src or intel-mkl-src could help. However, these are only needed for the small number of non-quantized operations. Decision: Use a simple Rust FP16 dot product for small matrices, avoid BLAS dependency.
5. Continuous Batching on CPU
5.1 What Continuous Batching Achieves (on GPU)
On GPU, continuous batching (in-flight batching, from Orca/vLLM) allows:
- Multiple requests to share the same batch at different decode stages
- New requests inserted into ongoing batch without waiting for current batch to complete
- Maximizes GPU utilization (fill idle SMs with concurrent work)
5.2 Why Continuous Batching Matters Less on CPU at 2 Threads
Analysis for 2-vCPU:
| Factor | GPU (continuous batching) | CPU 2-thread |
|---|---|---|
| Parallelism available | 100+ SMs | 2 cores |
| Batch size sweet spot | 8–64 | 1–2 |
| Scheduling overhead | Amortized over 100s of threads | ~5-15% of total time |
| Memory overhead per request | KV cache only | KV cache + activation buffers |
| Implementation complexity | High | Still high, lower payoff |
At batch size 1: Single-threaded decode. No batching needed. Maximum per-request throughput.
At batch size 2: Two requests processed simultaneously.
- Memory: 2× KV cache (e.g., 2 × 512 MB = 1 GB for Llama-3.1-8B at context 4096)
- Compute: Split 2 threads → each thread handles one batch element
- Throughput: ~1.5× total throughput (not 2× due to shared bandwidth)
- Per-request latency: same as batch-1 (no benefit to requester)
Verdict: At 2 threads with 6 GB RAM, continuous batching is not worth the complexity. The memory overhead of maintaining multiple KV caches and the scheduling logic is not justified by the minimal throughput gain.
5.3 Recommended Approach: Micro-batch Size 1 with FIFO Queue
Request Queue (FIFO, priority-aware)
│
▼
Take 1 request → Process fully (prefill → decode stream) → Return response
│
▼ (if next request waiting)
Take next request → Process → Return
Simplification: One request at a time. Sequential processing. The API server queues requests and returns them as they complete. This eliminates:
- Batch tensor dimension management
- Per-request KV cache sizing complexity
- Scheduling fairness logic
- Attention masking for heterogeneous sequences
When to add batching: Only if profiling shows the API is frequently backlogged (multiple concurrent users exceeding sequential processing rate). At 3–4 tok/s per user and typical chat patterns (10–30 second thinking time between messages), sequential processing supports ~6-12 concurrent casual users.
6. KV Cache Management
6.1 Contiguous Allocation (Not Paged)
Based on Document 2's analysis, we reject PagedAttention for CPU:
use half::f16; // FP16 type from the `half` crate (section 4.1 also allows an in-house type)

pub struct KVCache {
    // Single contiguous allocation per model
    keys: Vec<f16>,   // Shape: [num_layers × max_context × n_kv_heads × head_dim]
    values: Vec<f16>, // Same shape
    num_layers: usize,
    n_kv_heads: usize,
    head_dim: usize,
    max_context: usize,
    current_len: usize, // Number of tokens in current context
}

impl KVCache {
    pub fn new(config: &ModelConfig, max_context: usize) -> Self {
        let per_layer_size = max_context * config.n_kv_heads * config.head_dim;
        let total = config.num_layers * per_layer_size;
        Self {
            // f16 has no float-literal form, so use the type's zero constant
            keys: vec![f16::ZERO; total],
            values: vec![f16::ZERO; total],
            num_layers: config.num_layers,
            n_kv_heads: config.n_kv_heads,
            head_dim: config.head_dim,
            max_context,
            current_len: 0,
        }
    }

    /// Write K, V for the current token at the given layer.
    /// Does not move the position: call `advance()` once after every layer has appended.
    pub fn append(&mut self, layer: usize, k: &[f16], v: &[f16]) {
        let stride = self.n_kv_heads * self.head_dim;
        assert!(layer < self.num_layers && self.current_len < self.max_context);
        assert!(k.len() == stride && v.len() == stride);
        let offset = (layer * self.max_context + self.current_len) * stride;
        self.keys[offset..offset + stride].copy_from_slice(k);
        self.values[offset..offset + stride].copy_from_slice(v);
    }

    /// Move to the next token position (once per generated token).
    pub fn advance(&mut self) {
        self.current_len += 1;
    }
}
Advantages for CPU:
- Sequential memory access during attention (stride-1 within a head)
- No page table lookups (direct pointer arithmetic)
- Prefetcher-friendly (hardware stride detection works on contiguous data)
- Simple implementation, no memory fragmentation
Disadvantages:
- Must pre-allocate for max_context (wastes memory if context is short)
- Cannot share KV between requests (no paging/sharing)
- Fixed maximum context (cannot grow beyond allocation)
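Because the whole cache is pre-allocated, its size can be computed up front and checked against the memory budget. A small sketch of that arithmetic follows; the function name is hypothetical, and the Llama-3.1-8B shape used in the usage note (32 layers, 8 KV heads, head dimension 128) is the one quoted elsewhere in this document.

```rust
/// Bytes needed for a contiguous K + V cache.
/// The leading factor of 2 accounts for keys and values.
pub fn kv_cache_bytes(
    num_layers: usize,
    n_kv_heads: usize,
    head_dim: usize,
    max_context: usize,
    bytes_per_elem: usize, // 2 for FP16, 1 for INT8
) -> usize {
    2 * num_layers * max_context * n_kv_heads * head_dim * bytes_per_elem
}
```

With the Llama-3.1-8B shape, a 4096-token FP16 cache comes to 512 MiB, matching the figure used in the batching analysis, and switching to INT8 would halve it.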
6.2 Sliding Window Attention (Mistral-style)
For models like Gemma-2-9B with native sliding window attention:
impl KVCache {
/// Get attention key slice for the sliding window
pub fn get_keys_window(&self, layer: usize, window_size: usize) -> &[f16] {
let start = if self.current_len > window_size {
self.current_len - window_size
} else {
0
};
let offset = layer * self.max_context * self.n_kv_heads * self.head_dim;
let start_offset = offset + start * self.n_kv_heads * self.head_dim;
let end_offset = offset + self.current_len * self.n_kv_heads * self.head_dim;
&self.keys[start_offset..end_offset]
}
}
Memory benefit: With a sliding window of 4096, the KV cache only needs to store the last 4096 tokens' K and V, regardless of total conversation length. For an 8192-token conversation this saves ~50% of KV memory, and the saving grows as the conversation gets longer.
Circular buffer implementation for sliding window:
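impl KVCache {
    /// Circular append: once max_context tokens are stored, overwrite the oldest slot.
    /// Assumes max_context was set to the sliding-window size at allocation time.
    fn append_circular(&mut self, layer: usize, k: &[f16], v: &[f16]) {
        // Elements per token; the struct has no stride field, so derive it here
        let stride_per_token = self.n_kv_heads * self.head_dim;
        let write_pos = self.current_len % self.max_context; // Circular write position
        let offset = (layer * self.max_context + write_pos) * stride_per_token;
        self.keys[offset..offset + k.len()].copy_from_slice(k);
        self.values[offset..offset + v.len()].copy_from_slice(v);
    }
}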
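Once writes wrap around, the live window is no longer one contiguous slice, so the reader has to stitch two pieces together in logical order. The single-layer sketch below illustrates that index mapping; `KvRing` and its fields are hypothetical names, and `f32` stands in for FP16 to keep the example dependency-free.

```rust
/// Single-layer ring of fixed-size token slots.
pub struct KvRing {
    data: Vec<f32>,
    stride: usize,   // elements per token (n_kv_heads * head_dim)
    capacity: usize, // window size in tokens
    len: usize,      // total tokens ever appended
}

impl KvRing {
    pub fn new(capacity: usize, stride: usize) -> Self {
        Self { data: vec![0.0; capacity * stride], stride, capacity, len: 0 }
    }

    /// Store one token's entries, overwriting the oldest slot once full.
    pub fn append(&mut self, token_kv: &[f32]) {
        assert_eq!(token_kv.len(), self.stride);
        let slot = self.len % self.capacity;
        self.data[slot * self.stride..(slot + 1) * self.stride].copy_from_slice(token_kv);
        self.len += 1;
    }

    /// Live window as (older, newer) slices; reading `older` then `newer`
    /// visits tokens from oldest to most recent.
    pub fn window(&self) -> (&[f32], &[f32]) {
        if self.len <= self.capacity {
            // No wrap-around yet: everything is in one piece
            (&self.data[..self.len * self.stride], &self.data[..0])
        } else {
            // The next write position marks the boundary between newest and oldest
            let split = (self.len % self.capacity) * self.stride;
            let (newer, older) = self.data.split_at(split);
            (older, newer)
        }
    }
}
```

An attention kernel can process the two slices back to back, which keeps each piece stride-1 and prefetcher-friendly.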
6.3 KV Cache Eviction Policy
For multi-session support (future), eviction policies:
| Policy | Description | Overhead | Effectiveness |
|---|---|---|---|
| LRU (time-based) | Evict oldest session's KV | O(1) | Good for bursty traffic |
| LFU (token-count) | Evict session with fewest tokens | O(1) | Good for skewed usage |
| Context-length-based | Evict longest context (frees most memory) | O(1) | Emergency memory recovery |
| Attention-score-based | Evict tokens with lowest cumulative attention | O(n) per eviction | Best quality preservation |
For single-session deployment (our primary case): No eviction needed. Just pre-allocate for the session's expected max context.
For multi-session (stretch goal): LRU eviction. When memory pressure exceeds threshold, drop the oldest session's KV cache and reallocate.
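The LRU policy for the multi-session stretch goal could be tracked with a small table of per-session footprints. The sketch below is one possible shape, with hypothetical names throughout; it uses a linear scan to find the victim for brevity, so eviction is O(sessions) here rather than O(1), and it only does bookkeeping (the caller would free the actual KV buffers).

```rust
use std::collections::HashMap;

/// Tracks per-session KV footprints and evicts least-recently-used sessions
/// when admitting a new one would exceed `budget_bytes`.
pub struct SessionTable {
    budget_bytes: usize,
    used_bytes: usize,
    clock: u64,
    sessions: HashMap<u64, (usize, u64)>, // session id -> (kv bytes, last-use tick)
}

impl SessionTable {
    pub fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, used_bytes: 0, clock: 0, sessions: HashMap::new() }
    }

    /// Mark a session as used, admitting it if new. Returns the ids evicted
    /// to make room (empty when nothing had to go).
    pub fn touch(&mut self, id: u64, kv_bytes: usize) -> Vec<u64> {
        self.clock += 1;
        let tick = self.clock;
        if let Some(entry) = self.sessions.get_mut(&id) {
            entry.1 = tick; // existing session: just refresh recency
            return Vec::new();
        }
        let mut evicted = Vec::new();
        while self.used_bytes + kv_bytes > self.budget_bytes {
            // Pick the session with the oldest last-use tick
            let victim = match self.sessions.iter().min_by_key(|(_, v)| v.1) {
                Some((&vid, _)) => vid,
                None => break, // nothing left to evict
            };
            let (bytes, _) = self.sessions.remove(&victim).unwrap();
            self.used_bytes -= bytes;
            evicted.push(victim);
        }
        self.sessions.insert(id, (kv_bytes, tick));
        self.used_bytes += kv_bytes;
        evicted
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }
}
```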
7. Speculative Decoding on CPU
7.1 Concept
Speculative decoding uses a small "draft" model to generate K candidate tokens cheaply, then verifies them against the large "target" model in a single batched forward pass:
1. Draft model (0.5B) generates K tokens: [t1, t2, ..., tk]
2. Target model (9B) verifies all K tokens in ONE forward pass (batched)
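3. Accept tokens until the first rejection; the target model supplies the token at that position, and drafting resumes from there
4. Net effect: each target forward pass yields between 1 and K+1 tokens, so the gain grows with the acceptance rate
Speedup formula (Leviathan et al., arXiv:2211.17192):
speedup = (1 - α^(K+1)) / ((1 - α) × (K × c + 1))
where α = acceptance probability per token, K = number of draft tokens, and c = draft_time / target_time per forward pass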
7.2 CPU-Specific Analysis
Draft model candidates (for 9B target):
- Qwen2.5-0.5B at Q4_K_M: ~0.3 GB weights, ~30 tok/s decode on 2 vCPU [ESTIMATED]
- Phi-3.5-mini at Q4_K_M: ~2.2 GB weights, ~8 tok/s decode
Timing analysis:
For draft K=4 tokens with Qwen2.5-0.5B:
- Draft time: 4 × 33 ms = 132 ms (sequential on 1 thread)
- Target verification: 1 batched forward pass over the 4 draft tokens
- A naive estimate would be 4× the single-token time, but the batched pass behaves like a short prefill: all K tokens go through each layer together
- Memory reads: same weight reads (3.7 GB) regardless of batch size
- Compute: K× more MACs
- On a memory-bound workload, batch-4 therefore costs about the same wall time as batch-1 (roughly 250 ms at the 4 tok/s base rate), because memory bandwidth dominates
Key insight for CPU: Speculative decoding is MORE effective on CPU than GPU because:
- The target model's forward pass is memory-bound - batch-1 and batch-4 take similar wall time
- The draft model runs at ~30 tok/s (fast)
- Acceptance rate: ~70-85% for aligned draft/target models (same architecture family)
Estimated speedup:
- Base decode: 4 tok/s
- With speculative (K=4, 80% acceptance): effective speedup ~1.5-2.5×
- Result: 6-10 tok/s effective
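The estimate above can be sanity-checked with the walltime form of the Leviathan et al. result, written with an explicit cost ratio c (draft step time divided by target step time). The function name is hypothetical, and the check assumes that a batched verification pass costs the same as one decode step, as argued above.

```rust
/// Expected walltime speedup of speculative decoding:
/// (1 - alpha^(K+1)) / ((1 - alpha) * (K * c + 1)).
/// `alpha`: per-token acceptance probability, `k`: draft tokens per round,
/// `c`: time of one draft step divided by time of one target step.
pub fn expected_speedup(alpha: f64, k: u32, c: f64) -> f64 {
    assert!((0.0..1.0).contains(&alpha), "alpha must be in [0, 1)");
    // Expected tokens produced per target forward pass
    let expected_tokens = (1.0 - alpha.powi(k as i32 + 1)) / (1.0 - alpha);
    // Each round costs K draft steps plus one target step
    expected_tokens / (k as f64 * c + 1.0)
}
```

With α = 0.8, K = 4, and c = 33 ms / 250 ms = 0.132, this gives about 2.2×, or roughly 8.8 tok/s from a 4 tok/s base, which sits inside the 6-10 tok/s range quoted above.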
7.3 Self-Speculation (Medusa-like)
Medusa (arXiv:2401.10774) adds extra "heads" to the target model that predict multiple future tokens simultaneously, without a separate draft model.
For our runtime:
- Requires additional trained heads (not available for standard models)
- Memory overhead: extra parameters
- Not viable without pre-trained Medusa variants of our target models
Verdict: Skip Medusa. Use standard speculative decoding with a small draft model if throughput is below target.
7.4 When Speculative Decoding is Worth It
| Scenario | Worth It? |
|---|---|
| Base decode ≥ 5 tok/s | ❌ Not needed |
| Base decode 2-4 tok/s | ✅ Try if draft model fits in memory |
| Memory headroom available (≥ 1 GB) | ✅ Can afford draft model |
| Tight memory (< 500 MB free) | ❌ No room for draft model |
Recommendation: Implement speculative decoding as an optional feature. Enable when:
- Base throughput < 3 tok/s
- A compatible draft model is available (same tokenizer)
- Memory budget allows (target + draft < 6 GB)
8. Prompt Caching / Prefix Sharing
8.1 Concept
Multi-turn chat always reuses a system prompt. If we cache the KV entries for the system prompt, we avoid recomputing them on every turn.
Turn 1: System prompt (500 tokens) + User message (50 tokens) → Compute KV for all 550 tokens
Turn 2: System prompt (500 tokens) + User msg 1 (50) + Asst reply (200) + User msg 2 (30)
└─ KV for first 500 tokens cached ─┘ Only compute new 280 tokens
Savings: With a 500-token system prompt, every later turn skips prefill for those 500 tokens (roughly 500 × the per-token prefill time), which adds up quickly in multi-turn chat.
8.2 Implementation for 6 GB RAM
Challenge: The KV cache for the system prompt (~500 tokens at 8 KV heads × 128 dim × FP16, K and V across 32 layers) takes:
2 × 32 × 500 × 8 × 128 × 2 = 65,536,000 bytes = ~63 MB
This is modest. We can maintain a separate "prefix KV cache" that persists across sessions.
Design:
pub struct PrefixCache {
prefix_tokens: Vec<u32>, // The cached prefix token IDs
prefix_kv: Vec<f16>, // KV cache for prefix (same layout as main KV)
prefix_len: usize, // Number of cached tokens
}
impl PrefixCache {
/// Check if current input starts with cached prefix
pub fn match_prefix(&self, input: &[u32]) -> usize {
input.iter().zip(self.prefix_tokens.iter())
.take_while(|(a, b)| a == b)
.count()
}
/// Copy cached KV into the main KV cache
pub fn restore_to(&self, layer: usize, kv_cache: &mut KVCache) {
// Copy prefix_kv for this layer into kv_cache
}
}
Memory cost: ~63 MB for a 500-token system prompt (Llama-3.1-8B, FP16). Small relative to the 6 GB budget.
Recommendation: Implement prefix caching for the system prompt. This is a simple, high-value optimization for multi-turn chat use cases.
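The restore step left as a stub in `restore_to` amounts to one block copy per layer, because the prefix occupies the first slots of each layer's region in the contiguous layout. A free-standing sketch of that copy follows; the function name and flat-slice signature are hypothetical, and `u16` stands in for raw FP16 bits to avoid a dependency.

```rust
/// Copy the first `prefix_len` token slots of every layer from a compact
/// prefix buffer into a layer-major cache buffer.
///
/// prefix layout: [num_layers × prefix_len × stride]
/// cache layout:  [num_layers × max_context × stride]
pub fn restore_prefix(
    prefix: &[u16],
    cache: &mut [u16],
    num_layers: usize,
    prefix_len: usize,
    max_context: usize,
    stride: usize, // elements per token (n_kv_heads * head_dim)
) {
    assert!(prefix_len <= max_context);
    assert_eq!(prefix.len(), num_layers * prefix_len * stride);
    assert_eq!(cache.len(), num_layers * max_context * stride);

    let n = prefix_len * stride; // elements to copy per layer
    for layer in 0..num_layers {
        let src = &prefix[layer * n..(layer + 1) * n];
        let dst_start = layer * max_context * stride;
        cache[dst_start..dst_start + n].copy_from_slice(src);
    }
}
```

The same call would be made once for keys and once for values, after which the cache position can be set to the matched prefix length.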
9. Tokenizer (Custom Built)
9.1 Custom BPE Tokenizer
We build our own BPE (Byte-Pair Encoding) tokenizer rather than depending on the tokenizers crate. This is straightforward - BPE is a well-understood algorithm (~500 lines of Rust).
Design:
pub struct BpeTokenizer {
vocab: HashMap<Vec<u8>, u32>, // byte sequence → token ID
merges: HashMap<(u32, u32), u32>, // (token_a, token_b) → merged_token
id_to_bytes: Vec<Vec<u8>>, // token ID → byte sequence
special_tokens: HashMap<String, u32>,
}
impl BpeTokenizer {
/// Load from tokenizer.json (HuggingFace format) - parsed by our custom JSON parser
pub fn load(tokenizer_json: &[u8]) -> Self { ... }
/// Encode text → token IDs
pub fn encode(&self, text: &str) -> Vec<u32> {
// 1. UTF-8 → bytes
// 2. Apply byte-level pre-tokenization (split on whitespace/punctuation)
// 3. For each pre-token: greedy BPE merge using merges map
// 4. Return concatenated token IDs
}
/// Decode token IDs → text (for streaming output)
pub fn decode(&self, ids: &[u32]) -> String { ... }
/// Decode single token incrementally (for SSE streaming)
pub fn decode_incremental(&mut self, token_id: u32) -> &str { ... }
}
Vocabulary loading: We parse HuggingFace tokenizer.json files using our custom JSON parser. No Python dependency, no pickle, no unsafe deserialization of external formats.
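The merge step in `encode` can be sketched as a loop that keeps applying the best-ranked adjacent pair until none remains. This is a minimal illustration, not the planned implementation: the function name is hypothetical, and it assumes the merges map carries a rank (merge priority) alongside the merged token ID, which the struct above does not show.

```rust
use std::collections::HashMap;

/// Apply BPE merges to one pre-token.
/// `merges` maps (left, right) -> (rank, merged id); lower rank = higher priority.
pub fn bpe_merge(mut ids: Vec<u32>, merges: &HashMap<(u32, u32), (u32, u32)>) -> Vec<u32> {
    loop {
        // Find the adjacent pair with the lowest rank (leftmost on ties)
        let best = ids
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| merges.get(&(w[0], w[1])).map(|&(rank, id)| (rank, i, id)))
            .min();

        match best {
            Some((_, i, merged)) => {
                // Replace the pair at positions i, i+1 with the merged token
                ids[i] = merged;
                ids.remove(i + 1);
            }
            None => return ids, // no mergeable pair left
        }
    }
}
```

This version rescans the sequence after every merge, which is quadratic in the pre-token length; that is usually acceptable because pre-tokens are short, and a priority queue could replace the scan if profiling says otherwise.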
9.2 Performance
| Operation | Custom BPE | HuggingFace tokenizers crate |
|---|---|---|
| Encode 500 tokens | ~5–8 ms | ~5 ms |
| Decode 1 token | <1 μs | <1 μs |
| Vocab load time | ~50 ms | ~20 ms (with C parsing) |
At 2 vCPU: Tokenization is never the bottleneck even with a custom implementation. Encoding a 500-token prompt takes ~5–8 ms against ~1 second for prefill compute.
10. HTTP API Server Design
10.1 OpenAI-Compatible API
Endpoints:
POST /v1/chat/completions # Chat completion (main endpoint)
POST /v1/completions # Text completion
GET /v1/models # List available models
GET /health # Health check
GET /v1/memory # Memory usage stats
Chat Completions Request:
{
"model": "llama-3.1-8b-q4km",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 512,
"stream": true
}
10.2 Streaming Output via SSE
use std::{convert::Infallible, sync::Arc};
use axum::{
    extract::State,
    response::sse::{Event, Sse},
    routing::{get, post},
    Json, Router,
};
use serde::{Deserialize, Serialize};
use tokio::sync::mpsc;
use tokio_stream::{wrappers::ReceiverStream, Stream};
#[derive(Deserialize)]
struct ChatRequest {
model: String,
messages: Vec<ChatMessage>,
#[serde(default)]
stream: bool,
#[serde(default = "default_max_tokens")]
max_tokens: usize,
temperature: Option<f32>,
}
#[derive(Serialize)]
struct ChatChunk {
id: String,
object: &'static str,
choices: Vec<ChunkChoice>,
}
async fn chat_completions_stream(
State(runtime): State<Arc<Runtime>>,
Json(request): Json<ChatRequest>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
let (tx, rx) = mpsc::channel(32);
// Spawn inference on the dedicated compute thread pool (not tokio workers)
runtime.submit_inference(move |engine| {
let mut generator = engine.create_generator(&request);
while let Some(token) = generator.next_token() {
let chunk = ChatChunk::from_token(token);
let json = serde_json::to_string(&chunk).unwrap();
if tx.blocking_send(Ok(Event::default().data(json))).is_err() {
break; // Client disconnected
}
}
let _ = tx.blocking_send(Ok(Event::default().data("[DONE]")));
});
Sse::new(ReceiverStream::new(rx))
}
fn build_router(runtime: Arc<Runtime>) -> Router {
Router::new()
.route("/v1/chat/completions", post(chat_completions_stream))
.route("/v1/models", get(list_models))
.route("/health", get(health_check))
.with_state(runtime)
}
Key design choice: Inference runs on a dedicated thread pool (spawn_blocking or custom), completely isolated from tokio's async worker threads. This ensures the compute-heavy forward pass never blocks HTTP connections or other concurrent requests.
10.3 Backpressure Handling
When the CPU can't keep up (e.g., long prompt prefill + slow decode):
- Queue depth limit: Reject requests if queue > N (return 429 Too Many Requests)
- Timeout: Cancel inference if single request takes > T seconds
- Token streaming: Send tokens as soon as they're generated (no buffering)
- Graceful degradation: If memory pressure detected, reduce max_context for new requests
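The queue depth limit can be sketched with a bounded channel whose non-blocking send doubles as the admission check. The example uses the standard library's `sync_channel` to stay self-contained (the server code above uses tokio's channel instead), and the `RequestQueue` and `Admission` names are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Outcome of trying to enqueue a request.
#[derive(Debug, PartialEq)]
pub enum Admission {
    Queued,
    Rejected429, // queue full: map to HTTP 429 Too Many Requests
    Unavailable, // inference worker has shut down
}

/// Producer side of a bounded request queue.
pub struct RequestQueue<T> {
    tx: SyncSender<T>,
}

impl<T> RequestQueue<T> {
    /// `max_depth` is the number of requests allowed to wait.
    pub fn new(max_depth: usize) -> (Self, Receiver<T>) {
        let (tx, rx) = sync_channel(max_depth);
        (Self { tx }, rx)
    }

    /// Never blocks: either the request is queued or the caller is told why not.
    pub fn try_submit(&self, req: T) -> Admission {
        match self.tx.try_send(req) {
            Ok(()) => Admission::Queued,
            Err(TrySendError::Full(_)) => Admission::Rejected429,
            Err(TrySendError::Disconnected(_)) => Admission::Unavailable,
        }
    }
}
```

The inference thread owns the `Receiver` and pulls one request at a time, which matches the sequential processing model chosen in section 5.3.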
11. Configuration and Startup
11.1 Configuration File (TOML)
[model]
path = "/models/llama-3.1-8b-instruct.Q4_K_M.gguf"
max_context = 4096
kv_cache_precision = "fp16" # or "int8"
[runtime]
threads = 0 # 0 = auto-detect
memory_budget_mb = 5900 # Leave 100MB for OS
[server]
host = "0.0.0.0"
port = 8080
max_queue_size = 4
request_timeout_secs = 300
[sampling]
default_temperature = 0.7
default_top_p = 0.9
repetition_penalty = 1.1
[optimization]
streaming_weights = true # Use madvise hints
prefix_cache = true # Cache system prompt KV
speculative_decoding = false
speculative_draft_model = ""
11.2 Startup Sequence
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Parse config
    let config = Config::load("config.toml");

    // 2. Detect CPU topology
    let topology = detect_topology(); // Physical cores, SIMD capabilities
    // `threads` lives under the [runtime] table of the config file
    let num_threads = if config.runtime.threads == 0 {
        topology.physical_cores.min(2)
    } else {
        config.runtime.threads
    };

    // 3. Load model (mmap, zero-copy)
    let model = Model::load_mmap(&config.model.path)?;

    // 4. Pre-allocate KV cache
    let kv_cache = KVCache::new(&model.config, config.model.max_context);

    // 5. Pre-allocate activation arena
    let arena = BumpAllocator::new(256 * MB);

    // 6. Initialize tokenizer (tokenizer_path is not part of the sample config in 11.1)
    let tokenizer = Tokenizer::from_file(&config.model.tokenizer_path)?;

    // 7. Start API server
    let runtime = Runtime::new(model, kv_cache, arena, tokenizer, num_threads);
    start_server(runtime, &config.server).await;
    Ok(())
}
Startup time: <1 second (mmap is instantaneous, pre-allocation is fast).
12. Implementation Implications
12.1 Architecture Decisions Summary
| Decision | Choice | Rationale |
|---|---|---|
| Batching | Batch size 1 (sequential) | Memory + thread constraints |
| KV cache layout | Contiguous, pre-allocated | CPU cache-friendly |
| KV cache precision | FP16 (default), INT8 (optional) | Quality vs memory tradeoff |
| Sliding window | Yes (for applicable models) | Bounds KV cache growth |
| Prompt caching | Yes (system prompt KV) | Multi-turn chat optimization |
| Speculative decoding | Optional, draft model based | Enable if throughput < 3 tok/s |
| API framework | axum + tokio | Production-grade, async |
| Thread model | 1-2 based on topology detection | Avoid hyperthreading contention |
| Memory allocator | Bump arena for activations | No fragmentation, O(1) alloc/free |
12.2 Development Phases
| Phase | Scope | Duration |
|---|---|---|
| Phase 1 | GGUF loader + scalar forward pass + sampling | 2 weeks |
| Phase 2 | AVX2 kernels (Q4_K_M matmul, attention) | 3 weeks |
| Phase 3 | mmap streaming + KV cache + memory manager | 2 weeks |
| Phase 4 | API server (axum) + streaming SSE | 1 week |
| Phase 5 | Multi-model support + prefix caching | 2 weeks |
| Phase 6 | Speculative decoding + AVX-512 kernels | 2 weeks |
Total: ~12 weeks for a single engineer to reach production-ready MVP.
Next: Document 6 covers target model architectures in detail - Qwen2.5, Gemma-2, Llama-3.1, Phi, and BitNet models with compatibility matrices.