Testing Model Registry - From Tiny to Maximum
Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Research on CPU LLM Inference/11-testing-model-registry.md.
Research Program: CPU-Native LLM Inference Runtime
Date: June 2025
Philosophy: Progressive Testing
As we implement the runtime, we test against increasingly larger models. Each tier validates the previous work and stretches the next capability:
> Start tiny, prove correctness, optimize, then scale.
> A bug found on a 135M model takes 2 seconds to reproduce.
> The same bug on a 32B model takes 10 minutes.
Tier 0: Bring-Up (135M - 360M params)
Purpose: Prove the architecture compiles and produces a single correct token. Instant iteration loop.
| Model | Params | Layers | Hidden | Heads | KV Heads | GQA Ratio | Head Dim | Intermediate | Vocab | Arch | Weight Sizes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SmolLM2-135M-Instruct | 135M | 30 | 576 | 9 | 3 | 3:1 | 64 | 1,536 | 49,152 | Llama | FP16: 270 MB · Q4_K_M: 85 MB |
| SmolLM2-360M-Instruct | 360M | 32 | 960 | 15 | 5 | 3:1 | 64 | 2,560 | 49,152 | Llama | FP16: 720 MB · Q4_K_M: 220 MB |
Source: HuggingFaceTB/SmolLM2-135M-Instruct, HuggingFaceTB/SmolLM2-360M-Instruct (open, Apache 2.0)
Why these models:
- Load instantly on any machine - no mmap streaming needed
- Forward pass in milliseconds - test iteration loop is effectively instant
- Small enough to hand-debug: dump every intermediate tensor in a fraction of a second
- Llama architecture - same code path as 8B Llama models
- GQA ratio 3:1 - tests the grouped query attention path from day one
- KV cache per token: ~22.5 KB for the 135M model and ~40 KB for the 360M model at FP16 (negligible)
Success criteria for Tier 0:
- GGUF parser loads model successfully
- Forward pass produces output (even garbage is OK - we're testing plumbing)
- Token output matches llama.cpp reference output
- Streaming mode works (even though all weights trivially fit in RAM)
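The "matches llama.cpp reference output" criterion can be checked mechanically by comparing token ID sequences from greedy decoding of the same prompt. A minimal sketch follows; the function name and the token IDs are hypothetical, and how the reference IDs are captured from llama.cpp is left out.

```python
def first_divergence(ours, reference):
    """Return the index of the first differing token, or None if both match."""
    for i, (a, b) in enumerate(zip(ours, reference)):
        if a != b:
            return i
    # One sequence is a strict prefix of the other: diverges where it ends.
    if len(ours) != len(reference):
        return min(len(ours), len(reference))
    return None


# Hypothetical token IDs standing in for real decoder output.
reference = [1, 15043, 3186, 29991]
print(first_divergence([1, 15043, 3186, 29991], reference))  # None
print(first_divergence([1, 15043, 42, 29991], reference))    # 2
```

Reporting the first divergent position, rather than a plain pass/fail, points directly at the decode step whose intermediate tensors are worth dumping.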
Tier 1: Small Models (0.5B - 1.7B params)
Purpose: Validate SIMD kernels at real matrix sizes. First meaningful throughput measurements.
| Model | Params | Layers | Hidden | Heads | KV Heads | GQA | Head Dim | Intermediate | Vocab | Arch | FP16 | Q4_K_M | KV @2K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-0.5B-Instruct | 0.49B | 24 | 896 | 14 | 2 | 7:1 | 64 | 4,864 | 151,936 | Qwen2 | 980 MB | 310 MB | 24 MB |
| Llama-3.2-1B-Instruct | 1.24B | 16 | 2048 | 32 | 8 | 4:1 | 64 | 8,192 | 128,256 | Llama | 2.5 GB | 780 MB | 64 MB |
| Qwen2.5-1.5B-Instruct | 1.54B | 28 | 1536 | 12 | 2 | 6:1 | 128 | 8,960 | 151,936 | Qwen2 | 3.1 GB | 970 MB | 56 MB |
| SmolLM2-1.7B-Instruct | 1.7B | 24 | 2048 | 32 | 32 | 1:1 | 64 | 8,192 | 49,152 | Llama | 3.4 GB | 1.1 GB | 384 MB |
Sources: Qwen/Qwen2.5-0.5B-Instruct, meta-llama/Llama-3.2-1B-Instruct, Qwen/Qwen2.5-1.5B-Instruct, HuggingFaceTB/SmolLM2-1.7B-Instruct
Key testing notes:
- SmolLM2-1.7B has NO GQA (32 kv heads = 32 attention heads = full MHA). This is a critical test to ensure our attention kernel handles the no-grouping case correctly. Large KV cache relative to model size: 2 × 24 layers × 32 KV heads × 64 × 2 bytes × 2,048 tokens = 384 MB at context 2K (FP16).
- Llama-3.2-1B tests the Llama architecture family directly - validates we can run both the small and large Llama variants with the same code.
- Qwen2.5-0.5B tests extreme GQA (7:1 ratio) - our attention kernel must handle broadcasting a single KV head to 7 query heads efficiently.
- Qwen2.5-1.5B tests head_dim=128 (vs 64 in the smaller models) - validates our RoPE and attention kernels at the same head dimension used by 7B+ models.
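The GQA ratios in these notes come down to one index mapping from query head to KV head. The sketch below assumes contiguous grouping (consecutive query heads share a KV head), which is the convention used by the Hugging Face Llama and Qwen2 reference implementations; `kv_head_for` is a hypothetical helper name, not part of any runtime.

```python
def kv_head_for(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the KV head it reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # the GQA ratio, e.g. 7 for 7:1
    return q_head // group_size


# (attention heads, KV heads) taken from the Tier 1 table.
configs = {
    "SmolLM2-1.7B": (32, 32),  # 1:1, plain MHA
    "Llama-3.2-1B": (32, 8),   # 4:1
    "Qwen2.5-1.5B": (12, 2),   # 6:1
    "Qwen2.5-0.5B": (14, 2),   # 7:1
}
for name, (n_heads, n_kv_heads) in configs.items():
    mapping = [kv_head_for(q, n_heads, n_kv_heads) for q in range(n_heads)]
    print(name, mapping)
```

With a 1:1 ratio the mapping is the identity, so the MHA case needs no separate code path as long as the kernel does not assume `group_size > 1`.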
Expected throughput at 2 vCPU: 10-30 tok/s (fast enough for comfortable interactive testing)
Success criteria for Tier 1:
- AVX2 kernels match scalar reference output within 0.1% relative error
- Throughput ≥ 10 tok/s on 2 vCPU (Q4_K_M)
- GQA broadcasting works correctly for ratios 1:1, 3:1, 4:1, 6:1, 7:1
- RoPE handles both head_dim=64 and head_dim=128
- KV cache grows/shrinks correctly across turns
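The 0.1% criterion needs a concrete definition of relative error. The sketch below assumes a norm-wise definition, ‖test − ref‖ / ‖ref‖; an element-wise check would be stricter and is an equally valid choice. The function names are hypothetical.

```python
import math


def relative_error(test, ref):
    """Norm-wise relative error: ||test - ref||_2 / ||ref||_2."""
    diff = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    norm = math.sqrt(sum(r * r for r in ref))
    # Fall back to the absolute error when the reference is all zeros.
    return diff / norm if norm > 0 else diff


def matches_reference(test, ref, tol=1e-3):
    """True if the outputs have equal length and differ by at most tol (0.1%)."""
    return len(test) == len(ref) and relative_error(test, ref) <= tol


scalar_out = [1.0, 2.0, 3.0]  # stand-in for the scalar reference kernel
print(matches_reference([1.0, 2.0, 3.001], scalar_out))  # True
print(matches_reference([1.0, 2.0, 3.1], scalar_out))    # False
```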
Tier 2: Medium-Small (3B - 4B params)
Purpose: First models that feel "somewhat intelligent." Memory management begins to matter.
| Model | Params | Layers | Hidden | Heads | KV Heads | GQA | Head Dim | Intermediate | Vocab | Arch | FP16 | Q4_K_M | KV @4K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 3.09B | 36 | 2048 | 16 | 2 | 8:1 | 128 | 11,008 | 151,936 | Qwen2 | 6.2 GB | 1.9 GB | 144 MB |
| Llama-3.2-3B-Instruct | 3.21B | 28 | 3072 | 24 | 8 | 3:1 | 128 | 8,192 | 128,256 | Llama | 6.4 GB | 2.0 GB | 448 MB |
| SmolLM3-3B | 3.0B | 36 | 2048 | 16 | 4 | 4:1 | 128 | 11,008 | 128,256 | SmolLM3 | 6.0 GB | 1.9 GB | 288 MB |
| Phi-3.5-mini-instruct | 3.8B | 32 | 3072 | 32 | 32 | 1:1 | 96 | 8,192 | 32,064 | Phi3 | 7.6 GB | 2.4 GB | 1,536 MB |
| Phi-4-mini-instruct | 3.8B | 32 | 3072 | 24 | 8 | 3:1 | 128 | 8,192 | 200,064 | Phi3 | 7.6 GB | 2.4 GB | 512 MB |
Sources: Qwen/Qwen2.5-3B-Instruct, meta-llama/Llama-3.2-3B-Instruct, HuggingFaceTB/SmolLM3-3B, microsoft/Phi-3.5-mini-instruct, microsoft/Phi-4-mini-instruct
Key testing notes:
- Phi-3.5-mini has NO GQA (full MHA) and uses the Phi3 architecture - our first non-Llama/non-Qwen architecture. Must implement the Phi3 attention variant, including its different handling of rotary embeddings.
- Phi-3.5 KV cache is ENORMOUS - 1,536 MB at context 4K (FP16), because all 32 heads are KV heads: 2 × 32 layers × 32 KV heads × head_dim 96 × 2 bytes × 4,096 tokens. This is the first model where KV cache memory management matters.
- Qwen2.5-3B at Q4_K_M (1.9 GB + 144 MB KV) fits comfortably on any machine. Excellent for daily development testing.
- Phi-4-mini has the Phi3 architecture + large vocab (200K tokens) - tests that our tokenizer handles large vocabularies efficiently.
Expected throughput at 2 vCPU: 5-15 tok/s (Q4_K_M)
Success criteria for Tier 2:
- Multiple architectures supported: Llama, Qwen2, Phi3
- Phi3 architecture correctly implements its attention variant
- KV cache manager handles large KV models (Phi-3.5: 1,536 MB at context 4K)
- Memory budget monitoring activates (>1 GB KV cache)
- Output quality subjectively "usable" for simple conversations
Tier 3: Medium - Primary Target (7B - 9B params)
Purpose: This is where the runtime proves its value. Must run at 3-5 tok/s on 2 vCPU / 6 GB.
| Model | Params | Layers | Hidden | Heads | KV Heads | GQA | Head Dim | Intermediate | Vocab | Arch | FP16 | Q4_K_M | KV @4K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 7.62B | 28 | 3584 | 28 | 4 | 7:1 | 128 | 18,944 | 152,064 | Qwen2 | 15.2 GB | 4.1 GB | 229 MB |
| Llama-3.1-8B-Instruct | 8.03B | 32 | 4096 | 32 | 8 | 4:1 | 128 | 14,336 | 128,256 | Llama | 16.1 GB | 4.9 GB | 512 MB |
| Mistral-7B-v0.3-Instruct | 7.25B | 32 | 4096 | 32 | 8 | 4:1 | 128 | 14,336 | 32,768 | Mistral | 14.5 GB | 4.5 GB | 512 MB |
| Gemma-2-9B-IT | 9.24B | 42 | 3584 | 16 | 8 | 2:1 | 256 | 14,336 | 256,000 | Gemma2 | 18.5 GB | 5.3 GB | 1,344 MB |
Sources: Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.3, google/gemma-2-9b-it
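The KV @4K column can be reproduced from the Layers, KV Heads, and Head Dim columns. The sketch below assumes an FP16 cache (2 bytes per value) and counts both K and V; it reproduces the 512 MB entries for Llama-3.1-8B and Mistral-7B and the 1,344 MB entry for Gemma-2-9B when "MB" is read as MiB. The INT8 figure ignores any per-block scale storage, which is an assumption.

```python
MIB = 1024 ** 2


def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_value=2):
    """Total KV cache size: K and V, for every layer, for every cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context


# (layers, KV heads, head_dim) from the Tier 3 table.
models = {
    "Llama-3.1-8B": (32, 8, 128),
    "Mistral-7B-v0.3": (32, 8, 128),
    "Gemma-2-9B": (42, 8, 256),
}
for name, (layers, kv_heads, head_dim) in models.items():
    fp16 = kv_cache_bytes(layers, kv_heads, head_dim, 4096) / MIB
    int8 = kv_cache_bytes(layers, kv_heads, head_dim, 4096, bytes_per_value=1) / MIB
    print(f"{name}: FP16 {fp16:.0f} MiB, INT8 {int8:.0f} MiB")
```

The same function gives the per-token cost by passing `context=1`, which is handy when sizing a context window against a fixed memory budget.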
Key testing notes:
Qwen2.5-7B - best overall test target:
- Fits 6 GB at Q4_K_M (4.1 GB weights + 229 MB KV @ 4K + 300 MB overhead = 4.6 GB)
- 7:1 GQA ratio → smallest KV cache of any 7B+ model
- Embeddings are untied at this size (only the 0.5B-3B Qwen2.5 models share embed and lm_head), so the loader must handle a separate lm_head tensor
- Fastest decode of any target model (fewest layers: 28, smallest KV reads)
Llama-3.1-8B - our reference benchmark:
- Fits 6 GB at Q4_K_M (4.9 GB weights + 512 MB KV @ 4K + 300 MB overhead = 5.7 GB)
- 32 layers → heavier sequential compute
- Most community benchmarks available for comparison
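The "fits 6 GB" claims are sums of weights, KV cache, and runtime overhead. The sketch below redoes that arithmetic with the table values and the 300 MB overhead figure quoted for Qwen2.5-7B; applying the same overhead to every model is an assumption, and MB are treated as decimal (1 GB = 1,000 MB), as the bullets do.

```python
def memory_total_gb(weights_gb, kv_mb, overhead_mb=300):
    """Resident memory estimate in GB: weights + KV cache + fixed overhead."""
    return weights_gb + (kv_mb + overhead_mb) / 1000


def fits(weights_gb, kv_mb, budget_gb, overhead_mb=300):
    return memory_total_gb(weights_gb, kv_mb, overhead_mb) <= budget_gb


# (Q4_K_M weights in GB, FP16 KV @ 4K in MB) from the Tier 3 table.
tier3 = {
    "Qwen2.5-7B": (4.1, 229),
    "Llama-3.1-8B": (4.9, 512),
    "Gemma-2-9B": (5.3, 1344),
}
for name, (weights_gb, kv_mb) in tier3.items():
    total = memory_total_gb(weights_gb, kv_mb)
    print(f"{name}: {total:.1f} GB, fits 6 GB: {fits(weights_gb, kv_mb, 6.0)}")
```

Under these assumptions Qwen2.5-7B and Llama-3.1-8B fit, while Gemma-2-9B with an FP16 cache at context 4K does not.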
Mistral-7B-v0.3 - different family, similar arch:
- Mistral architecture variant (mostly compatible with Llama)
- Smaller vocab (32K) → faster LM head matmul, smaller embedding
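- Sliding window attention option (4096 tokens); the v0.3 checkpoint ships with it disabled, so the windowed path needs an explicit test configuration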
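When the sliding-window option is exercised, one common way to bound memory is a circular KV cache that overwrites the oldest entry once the window is full. The class below is a minimal sketch of that idea, not the runtime's design; `RingKVCache` is a hypothetical name and the stored items stand in for per-token K/V tensors.

```python
class RingKVCache:
    """Fixed-size cache that keeps only the most recent `window` entries."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.length = 0  # total number of tokens appended so far

    def append(self, kv):
        # Position p always lands in slot p % window, overwriting the oldest.
        self.slots[self.length % self.window] = kv
        self.length += 1

    def visible(self):
        """Cached entries in chronological order (at most `window` of them)."""
        if self.length <= self.window:
            return self.slots[: self.length]
        start = self.length % self.window  # slot holding the oldest entry
        return self.slots[start:] + self.slots[:start]


cache = RingKVCache(window=4)
for position in range(6):
    cache.append(position)  # store the position itself as a stand-in
print(cache.visible())  # [2, 3, 4, 5]
```

A real kernel would attend over the slots in place and skip the reordering in `visible()`.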
Gemma-2-9B - the stress test:
- head_dim = 256 (double all others!) - tests our attention kernel at 2× width
- 42 layers - deepest model, most sequential compute
- Low GQA ratio (2:1) - massive KV cache: 1,344 MB at context 4K FP16
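- Requires INT8 KV cache to fit in 6 GB, and then only at context 2K (5.3 GB weights + 336 MB KV + 300 MB overhead ≈ 5.9 GB)
- Sliding window attention (4096 fixed) on alternating layers - tests circular KV cache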
- Post-normalization + logit softcapping - unique Gemma-2 quirks
- This model will be last to pass all tests - it pushes every limit
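Of the Gemma-2 quirks listed above, logit softcapping is the easiest to test in isolation: logits are squashed through `cap * tanh(x / cap)` so they stay within ±cap. The sketch below uses caps of 50.0 for attention scores and 30.0 for final logits, which match the published Gemma-2 configuration; confirm them against the model's GGUF metadata before relying on them.

```python
import math


def softcap(logits, cap):
    """Soft-clamp each logit to the range [-cap, cap]."""
    return [cap * math.tanh(x / cap) for x in logits]


ATTN_SOFTCAP = 50.0   # applied to attention scores before softmax
FINAL_SOFTCAP = 30.0  # applied to the output logits

raw = [-1000.0, -5.0, 0.0, 5.0, 1000.0]
capped = softcap(raw, FINAL_SOFTCAP)
print([round(v, 3) for v in capped])
```

Small logits pass through almost unchanged, while extreme values saturate at the cap. A kernel that skips this step can still produce plausible text, which is why a token-level comparison against a reference is needed to catch it.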
Expected throughput at 2 vCPU (Q4_K_M):
- Qwen2.5-7B: 4-6 tok/s
- Llama-3.1-8B: 3-5 tok/s
- Mistral-7B: 3-5 tok/s
- Gemma-2-9B: 2-3 tok/s
Streaming FP16 (unquantized, from disk, 5+ GB RAM):
- All models: 0.2-0.5 tok/s (Phase 6 milestone)
Success criteria for Tier 3:
- Llama-3.1-8B achieves ≥3 tok/s decode on 2 vCPU / 6 GB
- Qwen2.5-7B achieves ≥4 tok/s decode on 2 vCPU / 6 GB
- Gemma-2-9B works with INT8 KV cache (context 2048)
- Streaming FP16 mode: any model generates tokens on 5 GB RAM
- Gemma-2 quirks work: softcapping, post-norm, pre-attention scaling
- Memory budget enforced: no OOM at Q4_K_M + context 4096
Tier 4: Large (14B - 32B params)
Purpose: Push streaming architecture to its limits. Test that the runtime can handle models several times larger than RAM (at FP16 on 8 GB: about 3.7× for the 14B models, 8.2× for Qwen2.5-32B).
| Model | Params | Layers | Hidden | Heads | KV Heads | GQA | Head Dim | Intermediate | Vocab | Arch | FP16 | Q4_K_M | KV @4K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 14.77B | 48 | 5120 | 40 | 8 | 5:1 | 128 | 13,824 | 152,064 | Qwen2 | 29.5 GB | 8.2 GB | 768 MB |
| Phi-4 | 14.7B | 40 | 5120 | 40 | 10 | 4:1 | 128 | 17,920 | 100,352 | Phi3 | 29.4 GB | 8.2 GB | 800 MB |
| Qwen2.5-32B-Instruct | 32.8B | 64 | 5120 | 40 | 8 | 5:1 | 128 | 27,648 | 152,064 | Qwen2 | 65.6 GB | 17.8 GB | 1,024 MB |
Sources: Qwen/Qwen2.5-14B-Instruct, microsoft/phi-4, Qwen/Qwen2.5-32B-Instruct
Key testing notes:
Qwen2.5-14B:
- Too large for Q4_K_M in 6 GB (8.2 GB weights alone)
- Viable at Q3_K_M (~6.8 GB) on an 8 GB machine, or at INT2 (~4.3 GB) within 6 GB
- Streaming FP16: 29.5 GB on NVMe → ~10 seconds/token → 0.1 tok/s
- Tests streaming architecture for models that DON'T fit in RAM
Phi-4 (14B):
- Phi3 architecture at scale - tests that our Phi implementation generalizes
- 10 KV heads (4:1 GQA) - intermediate KV cache size
- Q4_K_M: 8.2 GB → must use streaming
Qwen2.5-32B:
- 64 layers - twice the depth of the 8B models, and about 4× their streaming read volume at FP16 (65.6 GB vs 16.1 GB)
- At Q4_K_M: 17.8 GB → streaming requires ~6 seconds/token on NVMe
- Tests memory management under extreme pressure
- If the streaming architecture works at 32B on 8 GB, it works anywhere
Expected throughput (streaming FP16, 8 GB RAM, NVMe):
- Qwen2.5-14B: ~0.1 tok/s
- Qwen2.5-32B: ~0.05 tok/s
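These estimates follow from a bandwidth bound: if every weight byte is read from disk once per token and nothing stays cached, decode speed is at most read bandwidth divided by model size. The sketch below assumes a sustained 3 GB/s NVMe read rate, which is the rate implied by "29.5 GB → ~10 seconds/token" above, and ignores compute time and any layers that stay resident.

```python
def streaming_tok_per_s(model_gb, read_gb_per_s=3.0):
    """Upper bound on decode speed when all weights stream from disk per token."""
    seconds_per_token = model_gb / read_gb_per_s
    return 1.0 / seconds_per_token


# FP16 sizes in GB from the Tier 4 table.
for name, size_gb in {"Qwen2.5-14B": 29.5, "Qwen2.5-32B": 65.6}.items():
    rate = streaming_tok_per_s(size_gb)
    print(f"{name}: {size_gb / 3.0:.1f} s/token, {rate:.2f} tok/s")
```

Passing a lower `read_gb_per_s` (for example 0.5 for a SATA SSD) gives the corresponding bound for the slower-disk criterion.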
Success criteria for Tier 4:
- Qwen2.5-14B generates tokens via streaming (any speed)
- No OOM during streaming of 32B model on 8 GB RAM
- Streaming works on SATA SSD (slower disk, adjusted prefetch)
- Graceful degradation: system remains responsive during heavy streaming
Summary: The Testing Progression
| Tier | Models | Goal | When to Start |
|---|---|---|---|
| T0: Bring-Up | SmolLM2-135M, SmolLM2-360M | Architecture compiles, produces 1 token | Phase 1, Day 1 |
| T1: Small | Qwen2.5-0.5B, Llama-1B, Qwen2.5-1.5B, SmolLM2-1.7B | SIMD kernels correct, 10+ tok/s | Phase 2 |
| T2: Medium-Small | Qwen2.5-3B, Llama-3B, SmolLM3-3B, Phi-3.5-mini, Phi-4-mini | Multi-arch support, memory management | Phase 3-4 |
| T3: Target | Qwen2.5-7B, Llama-8B, Mistral-7B, Gemma-2-9B | Production quality, 3-5 tok/s | Phase 5-6 |
| T4: Large | Qwen2.5-14B, Phi-4, Qwen2.5-32B | Streaming at scale, stretch goals | Phase 6-7 |
Model Architecture Variants to Support
| Architecture | Models Using It | Key Differences |
|---|---|---|
| Llama | SmolLM2, Llama-3.x, TinyLlama | Standard: pre-norm, SwiGLU, RoPE |
| Qwen2 | Qwen2.5 all sizes, SmolLM3 | Similar to Llama, NTK-aware RoPE, tied embeddings on the smaller sizes (0.5B-3B) |
| Mistral | Mistral-7B | Sliding window attention option |
| Gemma2 | Gemma-2 family | Post-norm, logit softcap, pre-attention Q scaling, sliding window, head_dim=256 |
| Phi3 | Phi-3.5-mini, Phi-4, Phi-4-mini | Different attention pattern, vocab from 32K to 200K, MHA or GQA |
Implementation order: Llama → Qwen2 → Mistral → Phi3 → Gemma2 (each adds complexity)
Special Model Categories
Draft Models (for Speculative Decoding)
| Model | Params | Use As Draft For |
|---|---|---|
| Qwen2.5-0.5B | 0.49B | Qwen2.5-3B or 7B |
| Llama-3.2-1B | 1.24B | Llama-3.1-8B |
| Phi-3.5-mini | 3.8B | Phi-4 (14B) |
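Draft model must share a tokenizer with the target model. The Phi pairing does not meet this as listed: Phi-3.5-mini (vocab 32,064) and Phi-4 (vocab 100,352) tokenize differently, so it needs a different draft model or a token-remapping step.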
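The tokenizer requirement follows from how draft tokens are verified: the target model's choices are compared with the draft's token IDs position by position, which only means something if both models use the same ID space. Below is a minimal sketch of the greedy verification step; `verify_draft` is a hypothetical name, and sampling-based acceptance is not covered.

```python
def verify_draft(draft_tokens, target_tokens):
    """Accept the longest matching prefix of the draft, then one target token.

    target_tokens[i] is the target model's greedy choice at position i, given
    the prompt plus draft_tokens[:i]; it has one more entry than the draft.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score every draft position plus one")
    accepted = 0
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break
        accepted += 1
    # The target's own token at the first mismatch (or the bonus token if the
    # whole draft was accepted) is always emitted.
    return draft_tokens[:accepted] + [target_tokens[accepted]]


print(verify_draft([5, 7, 9], [5, 7, 8, 3]))  # [5, 7, 8]
print(verify_draft([5, 7, 9], [5, 7, 9, 3]))  # [5, 7, 9, 3]
```

Each verification pass therefore yields between one and `len(draft_tokens) + 1` tokens for a single target forward pass.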
BitNet Candidates (Future)
| Model | Params | Notes |
|---|---|---|
| BitNet-b1.58-2B-4T | 2B | Only public BitNet model (Microsoft, April 2025) |
| Community fine-tunes | 1-3B | Community 1-bit models via bitnet.cpp |
When 7B+ BitNet models appear, they become top-priority test targets (1.58-bit weights, no dequantization, potential 10+ tok/s).
Memory Map: What Fits Where
| RAM | Q4_K_M | Q4_K_M + INT8 KV | Streaming FP16 |
|---|---|---|---|
| 4 GB | SmolLM2 (all), Qwen2.5-0.5B, Qwen2.5-1.5B | + Llama-3.2-1B | SmolLM2-1.7B, Qwen2.5-0.5B |
| 5 GB | + Qwen2.5-3B, Llama-3.2-3B, Phi-3.5-mini | + Qwen2.5-3B context 8K | Qwen2.5-1.5B, Llama-1B |
| 6 GB | + Qwen2.5-7B, Llama-8B, Mistral-7B | + Llama-8B context 8K, Gemma-2-9B ctx 2K | Qwen2.5-3B, Llama-3B |
| 8 GB | + Phi-4 (tight) | + Gemma-2-9B context 4K | Qwen2.5-7B, Llama-8B, Mistral-7B |
| 16 GB | + Qwen2.5-14B, Phi-4 comfortably | + Qwen2.5-32B at Q4_K_M | Qwen2.5-14B, Phi-4 |
| 32 GB | + Qwen2.5-32B, Gemma-2-27B | All models full | Qwen2.5-32B |
Download Commands for Test Suite
```bash
# Tier 0: Bring-up (instant download)
huggingface-cli download HuggingFaceTB/SmolLM2-135M-Instruct --include "*Q4_K_M*" --local-dir models/smol-135m
huggingface-cli download HuggingFaceTB/SmolLM2-360M-Instruct --include "*Q4_K_M*" --local-dir models/smol-360m

# Tier 1: Small models (~0.3-1 GB each at Q4_K_M)
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/qwen-0.5b
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/llama-1b
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/qwen-1.5b

# Tier 2: Medium-small (~2-2.5 GB each at Q4_K_M)
huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/qwen-3b
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/llama-3b
huggingface-cli download bartowski/Phi-3.5-mini-instruct-GGUF --include "*Q4_K_M*" --local-dir models/phi3.5-mini
huggingface-cli download bartowski/Phi-4-mini-instruct-GGUF --include "*Q4_K_M*" --local-dir models/phi4-mini

# Tier 3: Primary targets (~4-5.5 GB each at Q4_K_M)
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/qwen-7b
huggingface-cli download bartowski/Llama-3.1-8B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/llama-8b
huggingface-cli download bartowski/Mistral-7B-Instruct-v0.3-GGUF --include "*Q4_K_M*" --local-dir models/mistral-7b
huggingface-cli download bartowski/gemma-2-9b-it-GGUF --include "*Q4_K_M*" --local-dir models/gemma-9b

# Tier 3 FP16 (for streaming tests)
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --include "*.safetensors" --local-dir models/qwen-7b-fp16
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --include "*.safetensors" --local-dir models/llama-8b-fp16

# Tier 4: Large models (streaming only)
huggingface-cli download Qwen/Qwen2.5-14B-Instruct --include "*.safetensors" --local-dir models/qwen-14b-fp16
huggingface-cli download microsoft/phi-4 --include "*.safetensors" --local-dir models/phi4-fp16
huggingface-cli download Qwen/Qwen2.5-32B-Instruct --include "*.safetensors" --local-dir models/qwen-32b-fp16
```
This testing registry ensures every phase of development has concrete models to validate against, from a 135M warm-up model that runs in milliseconds to a 32B streaming stress test.