Testing Model Registry - From Tiny to Maximum


Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Research on CPU LLM Inference/11-testing-model-registry.md.


Research Program: CPU-Native LLM Inference Runtime
Date: June 2025


Philosophy: Progressive Testing

As we implement the runtime, we test against progressively larger models. Each tier validates the work done for the previous tier and stretches the next capability:

> Start tiny, prove correctness, optimize, then scale.
> A bug found on a 135M model takes 2 seconds to reproduce.
> The same bug on a 32B model takes 10 minutes.


Tier 0: Bring-Up (135M - 360M params)

Purpose: Prove the architecture compiles and produces a single correct token. Instant iteration loop.

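| Model | Params | Layers | Hidden | Heads | KV Heads | GQA Ratio | Head Dim | Intermediate | Vocab | Arch | Weight Sizes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SmolLM2-135M-Instruct | 135M | 30 | 576 | 9 | 3 | 3:1 | 64 | 1,536 | 49,152 | Llama | FP16: 270 MB · Q4_K_M: 85 MB |
| SmolLM2-360M-Instruct | 360M | 32 | 960 | 15 | 5 | 3:1 | 64 | 2,560 | 49,152 | Llama | FP16: 720 MB · Q4_K_M: 220 MB |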

Source: HuggingFaceTB/SmolLM2-135M-Instruct, HuggingFaceTB/SmolLM2-360M-Instruct (open, Apache 2.0)

Why these models:

  • Load instantly on any machine - no mmap streaming needed
  • Forward pass in milliseconds - test iteration loop is effectively instant
  • Small enough to hand-debug: dump every intermediate tensor in a fraction of a second
  • Llama architecture - same code path as 8B Llama models
  • GQA ratio 3:1 - tests the grouped query attention path from day one
  • KV cache per token: ~0.5 KB (negligible)

Success criteria for Tier 0:

  • GGUF parser loads model successfully
  • Forward pass produces output (even garbage is OK - we're testing plumbing)
  • Token output matches llama.cpp reference output
  • Streaming mode works (even though all weights trivially fit in RAM)
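
The "matches llama.cpp reference output" criterion is easiest to act on when a mismatch reports where the two runs diverge. A minimal sketch, assuming both runs use greedy decoding and have already been reduced to lists of token IDs; the helper name and the sample IDs are hypothetical, not part of the runtime or of llama.cpp:

```python
from typing import Optional, Sequence


def first_divergence(ours: Sequence[int], reference: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token ID, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(ours, reference)):
        if a != b:
            return i
    if len(ours) != len(reference):
        # One sequence is a strict prefix of the other.
        return min(len(ours), len(reference))
    return None


# Hypothetical token IDs, for illustration only.
reference_ids = [1, 4321, 77, 9]
runtime_ids = [1, 4321, 78, 9]
print(first_divergence(runtime_ids, reference_ids))  # prints 2
```

On a Tier 0 model the index points directly at the decode step whose intermediate tensors are worth dumping.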

Tier 1: Small Models (0.5B - 1.7B params)

Purpose: Validate SIMD kernels at real matrix sizes. First meaningful throughput measurements.

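| Model | Params | Layers | Hidden | Heads | KV Heads | GQA | Head Dim | Intermediate | Vocab | Arch | FP16 | Q4_K_M | KV @2K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-0.5B-Instruct | 0.49B | 24 | 896 | 14 | 2 | 7:1 | 64 | 4,864 | 151,936 | Qwen2 | 980 MB | 310 MB | 6 MB |
| Llama-3.2-1B-Instruct | 1.24B | 16 | 2048 | 32 | 8 | 4:1 | 64 | 8,192 | 128,256 | Llama | 2.5 GB | 780 MB | 32 MB |
| Qwen2.5-1.5B-Instruct | 1.54B | 28 | 1536 | 12 | 2 | 6:1 | 128 | 8,960 | 151,936 | Qwen2 | 3.1 GB | 970 MB | 28 MB |
| SmolLM2-1.7B-Instruct | 1.7B | 24 | 2048 | 32 | 32 | 1:1 | 64 | 8,192 | 49,152 | Llama | 3.4 GB | 1.1 GB | 96 MB |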

Sources: Qwen/Qwen2.5-0.5B-Instruct, meta-llama/Llama-3.2-1B-Instruct, Qwen/Qwen2.5-1.5B-Instruct, HuggingFaceTB/SmolLM2-1.7B-Instruct

Key testing notes:

  • SmolLM2-1.7B has NO GQA (32 KV heads = 32 attention heads = full MHA). This is a critical test to ensure our attention kernel handles the no-grouping case correctly. Its KV cache (96 MB at context 2K) is also large relative to the model size.
  • Llama-3.2-1B tests the Llama architecture family directly - validates we can run both the small and large Llama variants with the same code.
  • Qwen2.5-0.5B tests extreme GQA (7:1 ratio) - our attention kernel must handle broadcasting a single KV head to 7 query heads efficiently.
  • Qwen2.5-1.5B tests head_dim=128 (vs 64 in the smaller models) - validates our RoPE and attention kernels at the same head dimension used by 7B+ models.
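
The grouping these notes exercise can be written down as an index mapping from query heads to KV heads. A minimal sketch, assuming consecutive query heads share one KV head (the layout commonly used by Llama-style GQA implementations); the function name is illustrative, and the head counts come from the Tier 1 table:

```python
def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the KV head it reads, for grouped query attention."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # the GQA ratio, e.g. 7 for Qwen2.5-0.5B
    return q_head // group_size


# (model, attention heads, KV heads) from the Tier 1 table.
tier1 = [
    ("Qwen2.5-0.5B", 14, 2),
    ("Llama-3.2-1B", 32, 8),
    ("Qwen2.5-1.5B", 12, 2),
    ("SmolLM2-1.7B", 32, 32),
]
for name, n_heads, n_kv_heads in tier1:
    mapping = [kv_head_for_query_head(q, n_heads, n_kv_heads) for q in range(n_heads)]
    print(name, mapping)
```

With a 1:1 ratio the mapping is the identity, so the MHA case falls out of the same code path rather than needing a separate branch.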

Expected throughput at 2 vCPU: 10-30 tok/s (fast enough for comfortable interactive testing)

Success criteria for Tier 1:

  • AVX2 kernels match scalar reference output within 0.1% relative error
  • Throughput ≥ 10 tok/s on 2 vCPU (Q4_K_M)
  • GQA broadcasting works correctly for ratios 1:1, 3:1, 4:1, 6:1, 7:1
  • RoPE handles both head_dim=64 and head_dim=128
  • KV cache grows/shrinks correctly across turns
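
The 0.1% tolerance needs a concrete definition before it can be asserted in a test. A minimal sketch, assuming "relative error" means the L2 norm of the difference divided by the L2 norm of the scalar reference (an assumption, since per-element relative error becomes unstable for values near zero); the function name is hypothetical:

```python
import math
from typing import Sequence


def relative_l2_error(candidate: Sequence[float], reference: Sequence[float],
                      eps: float = 1e-12) -> float:
    """||candidate - reference||_2 / ||reference||_2, guarded against a zero reference."""
    if len(candidate) != len(reference):
        raise ValueError("vectors must have the same length")
    diff_norm = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    return diff_norm / max(ref_norm, eps)


TOLERANCE = 1e-3  # 0.1% relative error

scalar_out = [1.0, -2.0, 2.0]     # reference kernel output (illustrative values)
simd_out = [1.0, -2.0, 2.0015]    # SIMD kernel output (illustrative values)
err = relative_l2_error(simd_out, scalar_out)
print(f"relative error = {err:.6f}, pass = {err <= TOLERANCE}")
```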

Tier 2: Medium-Small (3B - 4B params)

Purpose: First models that feel "somewhat intelligent." Memory management begins to matter.

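| Model | Params | Layers | Hidden | Heads | KV Heads | GQA | Head Dim | Intermediate | Vocab | Arch | FP16 | Q4_K_M | KV @4K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 3.09B | 36 | 2048 | 16 | 2 | 8:1 | 128 | 11,008 | 151,936 | Qwen2 | 6.2 GB | 1.9 GB | 115 MB |
| Llama-3.2-3B-Instruct | 3.21B | 28 | 3072 | 24 | 8 | 3:1 | 128 | 8,192 | 128,256 | Llama | 6.4 GB | 2.0 GB | 192 MB |
| SmolLM3-3B | 3.0B | 36 | 2048 | 16 | 4 | 4:1 | 128 | 11,008 | 128,256 | SmolLM3 | 6.0 GB | 1.9 GB | 172 MB |
| Phi-3.5-mini-instruct | 3.8B | 32 | 3072 | 32 | 32 | 1:1 | 96 | 8,192 | 32,064 | Phi3 | 7.6 GB | 2.4 GB | 576 MB |
| Phi-4-mini-instruct | 3.8B | 32 | 3072 | 24 | 8 | 3:1 | 128 | 8,192 | 200,064 | Phi3 | 7.6 GB | 2.4 GB | 256 MB |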

Sources: Qwen/Qwen2.5-3B-Instruct, meta-llama/Llama-3.2-3B-Instruct, HuggingFaceTB/SmolLM3-3B, microsoft/Phi-3.5-mini-instruct, microsoft/Phi-4-mini-instruct

Key testing notes:

  • Phi-3.5-mini has NO GQA (full MHA) and uses the Phi3 architecture - our first non-Llama/non-Qwen architecture. We must implement the Phi-specific attention variant, which uses rotary embeddings differently.
  • Phi-3.5 KV cache is ENORMOUS - 576 MB at context 4K due to 32 KV heads × head_dim 96. This is the first model where KV cache memory management matters.
  • Qwen2.5-3B at Q4_K_M (1.9 GB + 115 MB KV) fits comfortably on any machine. Excellent for daily development testing.
  • Phi-4-mini has the Phi3 architecture + large vocab (200K tokens) - tests that our tokenizer handles large vocabularies efficiently.

Expected throughput at 2 vCPU: 5-15 tok/s (Q4_K_M)

Success criteria for Tier 2:

  • Multiple architectures supported: Llama, Qwen2, Phi3
  • Phi3 architecture correctly implements its attention variant
  • KV cache manager handles large KV models (Phi-3.5: 576 MB)
  • Memory budget monitoring activates (>1 GB KV cache)
  • Output quality subjectively "usable" for simple conversations

Tier 3: Medium - Primary Target (7B - 9B params)

Purpose: This is where the runtime proves its value. Must run at 3-5 tok/s on 2 vCPU / 6 GB.

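| Model | Params | Layers | Hidden | Heads | KV Heads | GQA | Head Dim | Intermediate | Vocab | Arch | FP16 | Q4_K_M | KV @4K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 7.62B | 28 | 3584 | 28 | 4 | 7:1 | 128 | 18,944 | 152,064 | Qwen2 | 15.2 GB | 4.1 GB | 229 MB |
| Llama-3.1-8B-Instruct | 8.03B | 32 | 4096 | 32 | 8 | 4:1 | 128 | 14,336 | 128,256 | Llama | 16.1 GB | 4.9 GB | 512 MB |
| Mistral-7B-v0.3-Instruct | 7.25B | 32 | 4096 | 32 | 8 | 4:1 | 128 | 14,336 | 32,768 | Mistral | 14.5 GB | 4.5 GB | 512 MB |
| Gemma-2-9B-IT | 9.24B | 42 | 3584 | 16 | 8 | 2:1 | 256 | 14,336 | 256,000 | Gemma2 | 18.5 GB | 5.3 GB | 1,344 MB |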

Sources: Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.3, google/gemma-2-9b-it

Key testing notes:

Qwen2.5-7B - best overall test target:

  • Fits 6 GB at Q4_K_M (4.1 GB weights + 229 MB KV @ 4K + 300 MB overhead = 4.6 GB)
  • 7:1 GQA ratio → smallest KV cache of any 7B+ model
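  • Tied embeddings (embed = lm_head shared) → saves ~200 MB weight memory when the checkpoint enables them; confirm tie_word_embeddings in the model config, since the Qwen2.5 family ties embeddings only in its smaller sizes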
  • Fastest decode of any target model (fewest layers: 28, smallest KV reads)

Llama-3.1-8B - our reference benchmark:

  • Fits 6 GB at Q4_K_M (4.9 GB weights + 512 MB KV @ 4K + 300 MB overhead = 5.7 GB)
  • 32 layers → heavier sequential compute
  • Most community benchmarks available for comparison

Mistral-7B-v0.3 - different family, similar arch:

  • Mistral architecture variant (mostly compatible with Llama)
  • Smaller vocab (32K) → faster LM head matmul, smaller embedding
  • Sliding window attention option (4096 tokens)

Gemma-2-9B - the stress test:

  • head_dim = 256 (double all others!) - tests our attention kernel at 2× width
  • 42 layers - deepest model, most sequential compute
  • Low GQA ratio (2:1) - massive KV cache: 1,344 MB at context 4K FP16
  • Requires INT8 KV cache to fit in 6 GB
  • Sliding window attention (4096 fixed) - tests circular KV cache
  • Post-normalization + logit softcapping - unique Gemma-2 quirks
  • This model will be last to pass all tests - it pushes every limit
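
The KV cache figures quoted for Tier 3 follow from the model shape. A minimal sketch, assuming a dense cache that stores one K and one V vector per layer, per KV head, per position, with no sliding-window truncation; the function name is illustrative. It reproduces the 512 MB (Llama-3.1-8B) and 1,344 MB (Gemma-2-9B) entries at context 4K in FP16; other entries in this registry may rest on different assumptions:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Dense KV cache size: K and V tensors for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


MIB = 1024 ** 2

llama_fp16 = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)
gemma_fp16 = kv_cache_bytes(n_layers=42, n_kv_heads=8, head_dim=256, context_len=4096)
gemma_int8_2k = kv_cache_bytes(n_layers=42, n_kv_heads=8, head_dim=256,
                               context_len=2048, bytes_per_elem=1)

print(f"Llama-3.1-8B, FP16, 4K: {llama_fp16 / MIB:.0f} MiB")
print(f"Gemma-2-9B,  FP16, 4K: {gemma_fp16 / MIB:.0f} MiB")
print(f"Gemma-2-9B,  INT8, 2K: {gemma_int8_2k / MIB:.0f} MiB")
```

Gemma-2-9B has the same number of KV heads as Llama-3.1-8B; its larger cache comes from the doubled head_dim and the ten extra layers.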

Expected throughput at 2 vCPU (Q4_K_M):

  • Qwen2.5-7B: 4-6 tok/s
  • Llama-3.1-8B: 3-5 tok/s
  • Mistral-7B: 3-5 tok/s
  • Gemma-2-9B: 2-3 tok/s

Streaming FP16 (unquantized, from disk, 5+ GB RAM):

  • All models: 0.2-0.5 tok/s (Phase 6 milestone)

Success criteria for Tier 3:

  • Llama-3.1-8B achieves ≥3 tok/s decode on 2 vCPU / 6 GB
  • Qwen2.5-7B achieves ≥4 tok/s decode on 2 vCPU / 6 GB
  • Gemma-2-9B works with INT8 KV cache (context 2048)
  • Streaming FP16 mode: any model generates tokens on 5 GB RAM
  • Gemma-2 quirks work: softcapping, post-norm, pre-attention scaling
  • Memory budget enforced: no OOM at Q4_K_M + context 4096
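
The memory-budget criterion can be checked with the same arithmetic used in the Tier 3 notes (weights + KV cache + a fixed overhead). A minimal sketch, assuming the 300 MB overhead quoted for Qwen2.5-7B applies to every model and that 1 GB = 1000 MB as in those sums; the function name and defaults are illustrative, not part of the runtime:

```python
def total_memory_gb(weights_gb: float, kv_mb: float, overhead_mb: float = 300.0) -> float:
    """Weights + KV cache + fixed runtime overhead, in decimal GB."""
    return weights_gb + (kv_mb + overhead_mb) / 1000.0


BUDGET_GB = 6.0

# (Q4_K_M weights in GB, KV cache in MB) taken from the Tier 3 table.
candidates = {
    "Qwen2.5-7B @ 4K, FP16 KV": (4.1, 229.0),
    "Llama-3.1-8B @ 4K, FP16 KV": (4.9, 512.0),
    "Gemma-2-9B @ 4K, FP16 KV": (5.3, 1344.0),
    # INT8 halves the element size and 2K halves the context: 1344 / 4 = 336 MB.
    "Gemma-2-9B @ 2K, INT8 KV": (5.3, 336.0),
}

for name, (weights_gb, kv_mb) in candidates.items():
    total = total_memory_gb(weights_gb, kv_mb)
    verdict = "fits" if total <= BUDGET_GB else "exceeds budget"
    print(f"{name}: {total:.2f} GB -> {verdict}")
```

Under these assumptions Gemma-2-9B only fits with the INT8 cache at context 2K, and with little headroom, which is consistent with it being listed as the stress test.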

Tier 4: Large (14B - 32B params)

Purpose: Push the streaming architecture to its limits. Test that the runtime can handle models 3-4× larger than available RAM.

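| Model | Params | Layers | Hidden | Heads | KV Heads | GQA | Head Dim | Intermediate | Vocab | Arch | FP16 | Q4_K_M | KV @4K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 14.77B | 48 | 5120 | 40 | 8 | 5:1 | 128 | 13,824 | 152,064 | Qwen2 | 29.5 GB | 8.2 GB | 640 MB |
| Phi-4 | 14.7B | 40 | 5120 | 40 | 10 | 4:1 | 128 | 17,920 | 100,352 | Phi3 | 29.4 GB | 8.2 GB | 800 MB |
| Qwen2.5-32B-Instruct | 32.8B | 64 | 5120 | 40 | 8 | 5:1 | 128 | 27,648 | 152,064 | Qwen2 | 65.6 GB | 17.8 GB | 640 MB |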

Sources: Qwen/Qwen2.5-14B-Instruct, microsoft/phi-4, Qwen/Qwen2.5-32B-Instruct

Key testing notes:

Qwen2.5-14B:

  • Too large for Q4_K_M in 6 GB (8.2 GB weights alone)
  • Viable at Q3_K_M (~6.8 GB) or INT2 (~4.3 GB)
  • Streaming FP16: 29.5 GB on NVMe → ~10 seconds/token → 0.1 tok/s
  • Tests streaming architecture for models that DON'T fit in RAM

Phi-4 (14B):

  • Phi3 architecture at scale - tests that our Phi implementation generalizes
  • 10 KV heads (4:1 GQA) - intermediate KV cache size
  • Q4_K_M: 8.2 GB → must use streaming

Qwen2.5-32B:

  • 64 layers - doubles the streaming read volume vs 8B models
  • At Q4_K_M: 17.8 GB → streaming requires ~6 seconds/token on NVMe
  • Tests memory management under extreme pressure
  • If the streaming architecture works at 32B on 8 GB, it works anywhere
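
The seconds-per-token figures in these notes are consistent with a simple bandwidth bound. A minimal sketch, assuming every weight byte is read from disk once per generated token, nothing is served from the page cache, compute is fully hidden behind I/O, and the NVMe drive sustains about 3 GB/s (a value inferred from the figures above, not a measurement); the function name is illustrative:

```python
def streaming_seconds_per_token(weights_gb: float, read_gb_per_s: float = 3.0) -> float:
    """Lower bound on decode latency when all weights stream from disk for each token."""
    if read_gb_per_s <= 0:
        raise ValueError("read bandwidth must be positive")
    return weights_gb / read_gb_per_s


# Weight sizes in GB from the Tier 4 table.
cases = {
    "Qwen2.5-14B FP16": 29.5,
    "Qwen2.5-32B Q4_K_M": 17.8,
    "Qwen2.5-32B FP16": 65.6,
}
for name, weights_gb in cases.items():
    seconds = streaming_seconds_per_token(weights_gb)
    print(f"{name}: {seconds:.1f} s/token, {1.0 / seconds:.3f} tok/s")
```

Passing a lower read_gb_per_s gives the corresponding estimate for a SATA SSD.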

Expected throughput (streaming FP16, 8 GB RAM, NVMe):

  • Qwen2.5-14B: ~0.1 tok/s
  • Qwen2.5-32B: ~0.05 tok/s

Success criteria for Tier 4:

  • Qwen2.5-14B generates tokens via streaming (any speed)
  • No OOM during streaming of 32B model on 8 GB RAM
  • Streaming works on SATA SSD (slower disk, adjusted prefetch)
  • Graceful degradation: system remains responsive during heavy streaming

Summary: The Testing Progression

| Tier | Models | Goal | When to Start |
|---|---|---|---|
| T0: Bring-Up | SmolLM2-135M, SmolLM2-360M | Architecture compiles, produces 1 token | Phase 1, Day 1 |
| T1: Small | Qwen2.5-0.5B, Llama-1B, Qwen2.5-1.5B, SmolLM2-1.7B | SIMD kernels correct, 10+ tok/s | Phase 2 |
| T2: Medium-Small | Qwen2.5-3B, Llama-3B, SmolLM3-3B, Phi-3.5-mini, Phi-4-mini | Multi-arch support, memory management | Phase 3-4 |
| T3: Target | Qwen2.5-7B, Llama-8B, Mistral-7B, Gemma-2-9B | Production quality, 3-5 tok/s | Phase 5-6 |
| T4: Large | Qwen2.5-14B, Phi-4, Qwen2.5-32B | Streaming at scale, stretch goals | Phase 6-7 |

Model Architecture Variants to Support

| Architecture | Models Using It | Key Differences |
|---|---|---|
| Llama | SmolLM2, Llama-3.x, TinyLlama | Standard: pre-norm, SwiGLU, RoPE |
| Qwen2 | Qwen2.5 all sizes, SmolLM3 | Similar to Llama, NTK-aware RoPE, tied embeddings |
| Mistral | Mistral-7B | Sliding window attention option |
| Gemma2 | Gemma-2 family | Post-norm, logit softcap, pre-attention Q scaling, sliding window, head_dim=256 |
| Phi3 | Phi-3.5-mini, Phi-4, Phi-4-mini | Different attention pattern, small vocab, MHA or GQA |

Implementation order: Llama → Qwen2 → Mistral → Phi3 → Gemma2 (each adds complexity)
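
Of the Gemma2 differences listed above, logit softcapping is the simplest to isolate and test on its own. A minimal sketch of the usual tanh formulation, cap · tanh(x / cap); the cap values 30.0 (final logits) and 50.0 (attention scores) are the ones in the published Gemma-2 configuration as far as we know and should be confirmed against the checkpoint's config before being hard-coded:

```python
import math


def softcap(x: float, cap: float) -> float:
    """Smoothly squash x into (-cap, cap); close to the identity when |x| << cap."""
    return cap * math.tanh(x / cap)


FINAL_LOGIT_CAP = 30.0  # assumed value, verify against the model config
ATTN_LOGIT_CAP = 50.0   # assumed value, verify against the model config

for logit in (0.5, 10.0, 100.0):
    capped_final = softcap(logit, FINAL_LOGIT_CAP)
    capped_attn = softcap(logit, ATTN_LOGIT_CAP)
    print(f"x={logit}: final={capped_final:.3f}, attention={capped_attn:.3f}")
```

Because small values pass through almost unchanged, a missing softcap may only show up as divergence on prompts that produce large logits, which is worth covering explicitly in the reference comparison.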


Special Model Categories

Draft Models (for Speculative Decoding)

| Model | Params | Use As Draft For |
|---|---|---|
| Qwen2.5-0.5B | 0.49B | Qwen2.5-3B or 7B |
| Llama-3.2-1B | 1.24B | Llama-3.1-8B |
| Phi-3.5-mini | 3.8B | Phi-4 (14B) |

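The draft model must share its tokenizer with the target model. Note that the vocabulary sizes listed in the tier tables differ sharply between Phi-3.5-mini (32,064) and Phi-4 (100,352), so that pairing needs its tokenizer compatibility confirmed before use.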

BitNet Candidates (Future)

| Model | Params | Notes |
|---|---|---|
| BitNet-b1.58-2B-4T | 2B | Only public BitNet model (Microsoft, Oct 2024) |
| Community fine-tunes | 1-3B | Community 1-bit models via bitnet.cpp |

When 7B+ BitNet models appear, they become top-priority test targets (1.58-bit weights, no dequantization, potential 10+ tok/s).


Memory Map: What Fits Where

| RAM | Q4_K_M | Q4_K_M + INT8 KV | Streaming FP16 |
|---|---|---|---|
| 4 GB | SmolLM2 (all), Qwen2.5-0.5B, Qwen2.5-1.5B | + Llama-3.2-1B | SmolLM2-1.7B, Qwen2.5-0.5B |
| 5 GB | + Qwen2.5-3B, Llama-3.2-3B, Phi-3.5-mini | + Qwen2.5-3B context 8K | Qwen2.5-1.5B, Llama-1B |
| 6 GB | + Qwen2.5-7B, Llama-8B, Mistral-7B | + Llama-8B context 8K, Gemma-2-9B ctx 2K | Qwen2.5-3B, Llama-3B |
| 8 GB | + Phi-4 (tight) | + Gemma-2-9B context 4K | Qwen2.5-7B, Llama-8B, Mistral-7B |
| 16 GB | + Qwen2.5-14B, Phi-4 comfortably | + Qwen2.5-32B at Q4_K_M | Qwen2.5-14B, Phi-4 |
| 32 GB | + Qwen2.5-32B, Gemma-2-27B | All models full | Qwen2.5-32B |

Download Commands for Test Suite

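# Tier 0: Bring-up (instant download)
# Note: the --include filter only matches if the repo actually hosts Q4_K_M GGUF files.
# These two IDs are the base model repos, so confirm GGUF files exist there (or convert locally) first.
huggingface-cli download HuggingFaceTB/SmolLM2-135M-Instruct --include "*Q4_K_M*" --local-dir models/smol-135m
huggingface-cli download HuggingFaceTB/SmolLM2-360M-Instruct --include "*Q4_K_M*" --local-dir models/smol-360m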

# Tier 1: Small models (~0.3-1.1 GB each at Q4_K_M)
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/qwen-0.5b
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/llama-1b
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/qwen-1.5b

# Tier 2: Medium-small (~1.9-2.4 GB each at Q4_K_M)
huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/qwen-3b
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/llama-3b
huggingface-cli download bartowski/Phi-3.5-mini-instruct-GGUF --include "*Q4_K_M*" --local-dir models/phi3.5-mini
huggingface-cli download bartowski/Phi-4-mini-instruct-GGUF --include "*Q4_K_M*" --local-dir models/phi4-mini

# Tier 3: Primary targets (~4.1-5.3 GB each at Q4_K_M)
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/qwen-7b
huggingface-cli download bartowski/Llama-3.1-8B-Instruct-GGUF --include "*Q4_K_M*" --local-dir models/llama-8b
huggingface-cli download bartowski/Mistral-7B-Instruct-v0.3-GGUF --include "*Q4_K_M*" --local-dir models/mistral-7b
huggingface-cli download bartowski/gemma-2-9b-it-GGUF --include "*Q4_K_M*" --local-dir models/gemma-9b

# Tier 3 FP16 (for streaming tests)
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --include "*.safetensors" --local-dir models/qwen-7b-fp16
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --include "*.safetensors" --local-dir models/llama-8b-fp16

# Tier 4: Large models (streaming only)
huggingface-cli download Qwen/Qwen2.5-14B-Instruct --include "*.safetensors" --local-dir models/qwen-14b-fp16
huggingface-cli download microsoft/phi-4 --include "*.safetensors" --local-dir models/phi4-fp16
huggingface-cli download Qwen/Qwen2.5-32B-Instruct --include "*.safetensors" --local-dir models/qwen-32b-fp16

This testing registry ensures every phase of development has concrete models to validate against, from a 135M warm-up model that runs in milliseconds to a 32B streaming stress test.