Target Model Architectures - 9B Class Models in Detail
- Reading time
- 15 min read
- Word count
- 2829 words
- Diagram count
- 0 diagrams
Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Research on CPU LLM Inference/06-model-architecture-compatibility.md.
Target Model Architectures - 9B Class Models in Detail
Research Program: CPU-Native LLM Inference Runtime Target Spec: 9B parameter model, 2 vCPUs, 6 GB RAM, 2–5 tok/s Author: Research Agent Date: June 2025
1. Introduction
This document provides detailed architectural analysis of candidate 9B-class models for our CPU-native inference runtime. For each model, we cover transformer architecture specifics, memory implications, quantization quality, and compatibility with our target spec (2 vCPU, 6 GB, 2–5 tok/s).
The critical model parameters that affect our runtime:
- num_layers: Determines forward pass depth (more layers = more sequential compute)
- num_kv_heads (GQA ratio): Determines KV cache size (fewer KV heads = smaller cache)
- head_dim: Determines per-head compute and KV size
- hidden_size / intermediate_size: Determines matmul dimensions (memory bandwidth)
- Attention variant: MHA, GQA, MQA, sliding window - affects implementation
2. Qwen2.5-7B-Instruct
Paper: "Qwen2.5 Technical Report" (arXiv:2412.15115)
HuggingFace: Qwen/Qwen2.5-7B-Instruct
GGUF availability: Extensive (Q2_K to Q8_0 on HuggingFace by bartowski, MaziyarPanahi, etc.)
2.1 Architecture Specifications
| Parameter | Value | Notes |
|---|---|---|
| num_layers | 28 | Relatively shallow |
| hidden_size | 3584 | |
| intermediate_size | 18944 | SwiGLU: gate + up = 2 × 18944 |
| num_attention_heads | 28 | |
| num_kv_heads | 4 | 7:1 GQA ratio |
| head_dim | 128 | (3584 / 28) |
| vocab_size | 151,936 | Large multilingual vocab |
| max_position_embeddings | 32768 | Native context |
| rope_theta | 1,000,000 | High RoPE base |
| activation | SwiGLU (SiLU) | |
| normalization | RMSNorm | |
| rope_type | NTK-aware | |
| tying | true (embed = lm_head shared) | Saves ~1.1 GB memory! |
2.2 Memory Implications
Weight memory at Q4_K_M:
- Total params: ~7.62B (including 151K vocab embedding)
- Note: embedding/lm_head are tied (shared weights) → only stored once
- Effective unique weights: ~7.07B
- Q4_K_M: ~4.1 GB
KV cache at various context lengths (FP16):
KV per token = 2 × 28 × 4 × 128 × 2 = 57,344 bytes ≈ 56 KB
| Context | KV Cache (FP16) | KV Cache (INT8) |
|---|---|---|
| 2048 | 115 MB | 57 MB |
| 4096 | 229 MB | 115 MB |
| 8192 | 459 MB | 229 MB |
| 32768 | 1,835 MB | 918 MB |
Total memory at context 4096:
- Weights: 4.1 GB
- KV cache: 229 MB
- Activation + overhead: ~300 MB
- Total: ~4.6 GB ✅ (comfortable 6 GB fit)
2.3 GQA Ratio Advantage
The 7:1 GQA ratio (28 query heads, 4 KV heads) means:
- KV cache is 7× smaller than full MHA
- Each KV head serves 7 query heads during attention
- Attention computation: for each query head, broadcast the corresponding KV head → efficient
Implementation note: When computing attention, we need to replicate each KV head for 7 query heads:
// Qwen2.5-7B attention (simplified)
for kv_head in 0..4 {
let k = load_kv(kv_head); // [seq_len × head_dim]
let v = load_kv_v(kv_head);
for q_head in 0..7 { // 7 query heads per KV head
let q = load_query(kv_head * 7 + q_head); // [head_dim]
let score = dot(q, k); // [seq_len]
attn_out[q_head] = weighted_sum(score, v);
}
}
2.4 Quantization Quality
From HuggingFace community benchmarks:
| Quant | PPL (MMLU) | Downstream Acc | Subjective Quality |
|---|---|---|---|
| FP16 | baseline | 70.3% | Excellent |
| Q5_K_M | +0.05 | ~69.8% | Excellent |
| Q4_K_M | +0.14 | ~69.0% | Very good |
| Q4_K_S | +0.22 | ~68.5% | Good |
| Q3_K_M | +0.65 | ~67.0% | Acceptable |
| Q2_K | +2.7 | ~60% | Poor |
2.5 Expected Throughput at 2 vCPU
Based on llama.cpp benchmarks for similar 7B Q4_K_M models:
| CPU Type | 1 Thread | 2 Threads | Notes |
|---|---|---|---|
| Xeon Ice Lake (2 cores) | 3.5 | 5.2 | Good bandwidth |
| Epyc Milan (2 cores) | 3.8 | 5.5 | Excellent bandwidth |
| Graviton 3 (2 ARM cores) | 4.0 | 6.0 | ARM-optimized |
Verdict: Qwen2.5-7B is an EXCELLENT target for our runtime. Comfortably fits 6 GB, well-supported quantization, high GQA ratio keeps KV cache small.
3. Gemma-2-9B-IT
Paper: "Gemma 2: Improving Open Language Models at a Practical Size" (arXiv:2408.00118)
HuggingFace: google/gemma-2-9b-it
GGUF: Available (bartowski, lmstudio-community)
3.1 Architecture Specifications
| Parameter | Value | Notes |
|---|---|---|
| num_layers | 42 | Deep architecture |
| hidden_size | 3584 | |
| intermediate_size | 14336 | SwiGLU |
| num_attention_heads | 16 | |
| num_kv_heads | 8 | 2:1 GQA ratio (poor for memory) |
| head_dim | 256 | Large head dimension |
| vocab_size | 256,000 | |
| max_position_embeddings | 8192 | Fixed sliding window |
| sliding_window | 4096 | Native sliding window attention |
| attention_type | "sliding_window" | |
| rope_type | Linear | |
| normalization | RMSNorm (post) | Post-attention norm |
| logit_softcap | 30.0 | Softmax logit capping |
| query_pre_attn_scalar | 256 | Pre-attention scaling |
| final_logit_softcap | 30.0 | Final logit capping |
3.2 Unique Features
Sliding Window Attention (4096 tokens):
- Gemma-2 only attends to the last 4096 tokens regardless of total context
- This BOUNDS the KV cache requirement and attention compute
- Implementation: circular buffer of size 4096 for KV cache
Post-Normalization:
- Unlike standard pre-norm (norm → attention → residual), Gemma-2 uses:
output = Norm(x + Attention(x)) - The post-attention RMSNorm means we normalize AFTER the residual connection
- This affects quantization sensitivity (normalized residuals are more stable)
Logit Softcapping:
- Softcap function:
c × tanh(logits / c)where c = 30.0 - Prevents extreme logit values, improves training stability
- Must be applied during sampling (not a standard transformer operation)
Pre-attention scaling:
- Q is scaled by
1/sqrt(query_pre_attn_scalar)=1/sqrt(256)=1/16 - Applied BEFORE RoPE
- Must be incorporated into the attention kernel
3.3 Memory Implications
Weight memory at Q4_K_M:
- Total params: ~9.24B
- Q4_K_M: ~5.3 GB
KV cache (CRITICAL - much larger due to 8 KV heads × head_dim 256):
KV per token = 2 × 42 × 8 × 256 × 2 = 344,064 bytes ≈ 336 KB
| Context | KV Cache (FP16) | KV Cache (INT8) |
|---|---|---|
| 2048 | 672 MB | 336 MB |
| 4096 (max window) | 1,344 MB | 672 MB |
Total at context 4096 (sliding window max):
- Weights: 5.3 GB
- KV cache: 1,344 MB
- Activation + overhead: ~300 MB
- Total: ~6.9 GB ❌ EXCEEDS 6 GB
With INT8 KV cache:
- Weights: 5.3 GB + KV: 672 MB + overhead: 300 MB = ~6.3 GB ❌ Still tight
With Q4_0 weights (lighter, 5.0 GB) + INT8 KV:
- Total: 5.0 + 0.65 + 0.3 = ~5.95 GB ✅ Barely fits
Verdict: Gemma-2-9B is TIGHT at 6 GB. Requires INT8 KV cache AND Q4_0 (not Q4_K_M) weights to fit. Quality will be slightly lower but still acceptable. The sliding window is a memory blessing - KV cache never grows beyond 4096 tokens.
3.4 Implementation Quirks
- Logit softcap must be implemented:
logit = 30.0 * tanh(logit / 30.0) - Pre-attention Q scaling: Must scale Q before RoPE
- Post-norm vs pre-norm: LayerNorm applied after residual, not before
- Sliding window: All layers use sliding window; no "full attention" layers
- Interleaved attention: [UNVERIFIED] Some Gemma-2 variants interleave full and sliding window - check actual config
3.5 Expected Throughput
Gemma-2-9B at Q4_0 on 2 vCPU [ESTIMATED]:
- Decode: ~2.5–3.5 tok/s (slower than Llama/Qwen due to 42 layers and larger KV reads)
- The 42 layers mean 42 sequential forward steps per token vs 28-32 for others
- KV cache reads per token: 42 × 8 × 256 × 2 bytes × 4096 tokens = ~672 MB (large)
4. Llama-3.1-8B-Instruct
Paper: "The Llama 3 Herd of Models" (arXiv:2407.21783)
HuggingFace: meta-llama/Llama-3.1-8B-Instruct
GGUF: Extensively available
4.1 Architecture Specifications
| Parameter | Value | Notes |
|---|---|---|
| num_layers | 32 | |
| hidden_size | 4096 | |
| intermediate_size | 14336 | SwiGLU |
| num_attention_heads | 32 | |
| num_kv_heads | 8 | 4:1 GQA ratio |
| head_dim | 128 | (4096 / 32) |
| vocab_size | 128,256 | |
| max_position_embeddings | 131072 | With RoPE scaling |
| rope_theta | 500,000 | |
| normalization | RMSNorm | Pre-norm |
| activation | SwiGLU (SiLU) |
4.2 Memory Implications
Weight memory at Q4_K_M:
- Total params: ~8.03B
- Q4_K_M: ~4.9 GB
KV cache per token:
KV per token = 2 × 32 × 8 × 128 × 2 = 131,072 bytes ≈ 128 KB
| Context | KV Cache (FP16) | KV Cache (INT8) |
|---|---|---|
| 2048 | 256 MB | 128 MB |
| 4096 | 512 MB | 256 MB |
| 8192 | 1,024 MB | 512 MB |
Total at context 4096:
- Weights: 4.9 GB + KV: 512 MB + overhead: 300 MB = ~5.7 GB ✅
Total at context 8192 with INT8 KV:
- Weights: 4.9 GB + KV: 512 MB + overhead: 300 MB = ~5.7 GB ✅
Verdict: Llama-3.1-8B is the OPTIMAL target. Great balance of quality, memory fit, and KV cache efficiency. Supports context 4096-8192 within 6 GB.
4.3 Quantization Quality
| Quant | MMLU Acc | Notes |
|---|---|---|
| FP16 | 66.7% | Meta paper |
| Q8_0 | ~66.5% | Near-lossless |
| Q5_K_M | ~66.0% | Virtually lossless |
| Q4_K_M | ~65.5% | Production quality |
| Q4_K_S | ~65.0% | Slight drop |
| Q3_K_M | ~63.0% | Noticeable degradation |
Llama-3.1-8B is known to quantize well - the model's weights are more uniformly distributed than some competitors, making quantization less lossy.
4.4 Expected Throughput
From llama.cpp community benchmarks and our bandwidth analysis:
| Config | 2 vCPU Tok/s | Notes |
|---|---|---|
| Q4_K_M, context 2048 | 3.5–4.5 | Standard config |
| Q4_K_M, context 4096 | 3.0–4.0 | Slightly slower (more KV reads) |
| Q5_K_M, context 2048 | 3.0–3.5 | Larger weight reads |
| Q4_0, context 2048 | 4.0–5.0 | Simpler format, less scale overhead |
5. Phi-3.5-mini (3.8B) and Phi-4 (14B)
5.1 Phi-3.5-mini (3.8B)
Paper: "Phi-3 Technical Report" (arXiv:2404.14219) Architecture:
| Parameter | Value |
|---|---|
| num_layers | 32 |
| hidden_size | 3072 |
| intermediate_size | 8192 |
| num_attention_heads | 32 |
| num_kv_heads | 32 |
| head_dim | 96 |
| vocab_size | 32,064 |
| max_position | 4096 / 128K (SU-RoPE) |
| sliding_window | 2048 |
Memory at Q4_K_M:
- Weights: ~2.2 GB
- KV cache (2048 window, FP16): 2 × 32 × 32 × 96 × 2048 × 2 = ~768 MB
- Wait: NO GQA means 32 KV heads, each 96 dim. KV per token: 2 × 32 × 32 × 96 × 2 = 393,216 bytes = 384 KB
- At 2048 tokens: 768 MB
- Total: 2.2 + 0.75 + 0.3 = ~3.25 GB ✅ PLENTY of headroom
Quality: Phi-3.5-mini punches above its weight - competitive with Llama-2-7B on many benchmarks despite being 3.8B.
Throughput at 2 vCPU [ESTIMATED]: 6-10 tok/s (small model, fast)
For our runtime: Excellent backup/fallback option. Fast, high quality for size, leaves plenty of memory headroom. However, not a 9B-class model.
5.2 Phi-4 (14B)
Architecture (estimated based on Phi-3 scaling):
- num_layers: ~40
- hidden_size: ~5120
- Weight at Q4_K_M: ~8.3 GB ❌ (exceeds 6 GB even without KV cache)
Verdict: NOT VIABLE at 6 GB. Too large even at Q4 quantization.
6. BitNet / Natively Low-Bit Models
6.1 Available BitNet Models
| Model | Params | Bits | Weight Size | Status |
|---|---|---|---|---|
| BitNet-b1.58-2B-4T | 2B | 1.58 | ~0.4 GB | Released (Microsoft) |
| bitnet_cpp (reference) | - | - | - | Reference impl |
| Community 1-bit attempts | Various | 1-2 | Various | Experimental |
No 9B-class BitNet model publicly available as of June 2025.
6.2 BitNet b1.58: Implications for CPU Runtime
If a 9B BitNet model were released:
Weight storage:
- Ternary weights (): log₂(3) ≈ 1.585 bits per weight
- 9B × 1.585 / 8 = ~1.78 GB
- Plus: activations are FP16 (needed for residual streams, norms)
Matmul simplification:
- No dequantization needed - weights ARE the computation
- Matmul becomes: additions and subtractions only
- CPU: ADD/SUB instructions are 2-4× faster than MUL on most architectures
- SIMD: Can use bitwise operations to select add/sub/skip
Kernel design for ternary matmul:
// BitNet ternary matmul pseudocode
fn ternary_matmul(weights: &TernaryMatrix, x: &[f16]) -> Vec<f16> {
// weights stored as two bitmaps: is_positive, is_negative
// weight[i,j] = (is_positive[i,j] as i8) - (is_negative[i,j] as i8)
let mut result = vec![0.0f32; weights.rows()];
// Process 256 weights at a time (AVX2 bitmap)
for row in 0..weights.rows() {
let mut acc: f32 = 0.0;
for chunk in 0..(x.len() / 256) {
let pos_bits = load_bitmap(row, chunk); // 256 bits = 32 bytes
let neg_bits = load_bitmap_neg(row, chunk);
let x_chunk = &x[chunk*256..(chunk+1)*256];
// For each bit set in pos: add x[j] to accumulator
// For each bit set in neg: subtract x[j] from accumulator
// This is a masked sum operation
acc += masked_sum_pos_neg(pos_bits, neg_bits, x_chunk);
}
result[row] = acc;
}
result
}
Expected performance advantage: Ternary matmul requires ~3× less memory reads than Q4 matmul → potentially 2-3× faster on memory-bound workloads. Could achieve 8-15 tok/s on 2 vCPU.
Runtime support needed:
- Ternary weight format parser
- Bitmap-based weight storage (not GGUF-compatible)
- Custom kernel: masked-addition-based matmul
- FP16 activation stream (not quantized activations)
6.3 Are Any 9B-class BitNet Models on the Horizon?
[UNVERIFIED] Signals:
- Microsoft Research has indicated larger BitNet models in development
- Community fine-tuning efforts for BitNet-7B are active on GitHub
- The BitNet paper demonstrates scaling laws that suggest 9B+ BitNet would be competitive
- Expected: 7-9B BitNet model sometime in 2025-2026
7. Mixture of Experts (MoE) at 9B Scale
7.1 MoE Architecture Overview
MoE models have multiple "expert" FFN modules per layer, but only a subset are activated per token:
Standard Transformer:
Attention → FFN (always full size)
MoE Transformer:
Attention → Router → TopK Experts (only K of N FFNs activated)
7.2 Llama-4 Scout (109B, MoE)
Architecture:
- 16 experts per layer, 1 active (top-1 routing)
- Active params per token: ~17B (of 109B total)
- Total weight size: 109B × 4.625/8 ≈ 63 GB at Q4_K_M
Obviously NOT viable at 6 GB due to total weight size.
7.3 Small MoE Models
Mixtral 8x7B (46.7B total, ~12.9B active): Too large.
Qwen2-57B-A14B: Too large.
DeepSeekMoE-16B (16B total, 2.8B active):
- Total weights: ~9.3 GB at Q4_K_M ❌ (exceeds 6 GB even though only 2.8B are active per token)
Key MoE insight for memory-constrained deployment:
- Even though only some experts are activated per token, ALL expert weights must be loaded (or accessible)
- On CPU with mmap, inactive experts can stay on disk (only active experts page-faulted in)
- But the total file size still matters for mmap region and disk space
- No MoE model under 6 GB total weight size currently exists at production quality
7.4 MoE CPU Inference Potential
If a suitable small MoE model existed:
- Advantage: Per-token compute is reduced (only active experts computed)
- On memory-bound CPU: Weight reads per token are dramatically reduced (only active expert weights read)
- Example: If total model is 5 GB but only 20% of experts (1 GB) are active per token, effective weight reads = 1 GB/token → ~4× faster than dense model
Future potential: If a ~5 GB MoE model with ~1 GB active params becomes available, it could achieve 15-20 tok/s on 2 vCPU. This would be transformative for our runtime.
8. Compatibility Matrix
8.1 Complete Model Fitness Table
| Model | Params | Q4_K_M Size | KV at 2048 (FP16) | KV at 4096 (FP16) | Total @ 4096 | Fits 6GB? | Quality | Throughput Est. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | 7.62B | 4.1 GB | 115 MB | 229 MB | ~4.6 GB | ✅ Yes | ⭐⭐⭐⭐⭐ | 5+ tok/s |
| Llama-3.1-8B | 8.03B | 4.9 GB | 256 MB | 512 MB | ~5.7 GB | ✅ Yes | ⭐⭐⭐⭐ | 3-4 tok/s |
| Gemma-2-9B | 9.24B | 5.3 GB | 672 MB | 1344 MB | ~6.9 GB | ❌ (Q4_0+INT8 OK) | ⭐⭐⭐⭐⭐ | 2.5-3.5 tok/s |
| Phi-3.5-mini | 3.8B | 2.2 GB | 768 MB (2K win) | - | ~3.3 GB | ✅ Plenty | ⭐⭐⭐⭐ | 6-10 tok/s |
| Phi-4 (14B) | 14B | 8.3 GB | - | - | - | ❌ Too big | N/A | N/A |
8.2 Quantization Format Compatibility
| Model | GGUF Q4_K_M | GGUF Q5_K_M | GPTQ-4bit | AWQ-4bit | OpenVINO INT4 |
|---|---|---|---|---|---|
| Qwen2.5-7B | ✅ | ✅ | ✅ | ✅ | ✅ |
| Llama-3.1-8B | ✅ | ✅ | ✅ | ✅ | ✅ |
| Gemma-2-9B | ✅ | ✅ | ⚠️ (softcap) | ⚠️ | ✅ |
| Phi-3.5-mini | ✅ | ✅ | ✅ | ✅ | ✅ |
8.3 Runtime Implementation Requirements Per Model
| Model | Special Handling Needed |
|---|---|
| Qwen2.5-7B | Standard. NTK-aware RoPE. Tied embeddings. |
| Llama-3.1-8B | Standard. RoPE with frequency scaling (for >8K context). |
| Gemma-2-9B | Logit softcap. Pre-attention Q scaling. Sliding window. Post-norm. |
| Phi-3.5-mini | SU-RoPE (scaling). Sliding window. No GQA (full MHA). |
8.4 Ranking for Target Spec
Best to worst for 6GB/2vCPU target:
- Qwen2.5-7B - Best overall: fits easily, fast, high GQA, excellent community support
- Llama-3.1-8B - Close second: fits at context 4096, excellent quality, well-known
- Gemma-2-9B - Tight fit: needs INT8 KV + Q4_0, but highest quality at 9B scale; sliding window is a memory gift
- Phi-3.5-mini - Fallback: fast and light, but only 3.8B params (lower capability ceiling)
9. Architecture-Specific Implementation Notes
9.1 SwiGLU (All Models)
All target models use SwiGLU for the feed-forward network:
FFN(x) = (SiLU(W_gate × x) ⊙ (W_up × x)) × W_down
Where:
W_gate: [intermediate_size × hidden_size]W_up: [intermediate_size × hidden_size]W_down: [hidden_size × intermediate_size]SiLU(silu) = x × σ(x)(SiLU = Swish activation)⊙is element-wise multiplication
Memory for FFN (Llama-3.1-8B):
- W_gate + W_up: 2 × 14336 × 4096 × 4.625/8 = ~68 MB
- W_down: 4096 × 14336 × 4.625/8 = ~34 MB
- Total FFN weights per layer: ~102 MB
- Across 32 layers: ~3.26 GB (bulk of model!)
Computation:
- Two matmuls (gate and up) in parallel or sequential - can fuse into one wider matmul
- Element-wise SiLU + multiply (trivial compute)
- One matmul (down)
Optimization: Fuse gate+up into a single [2*intermediate × hidden] matmul, reducing memory read overhead.
9.2 RoPE (Rotary Position Embedding)
All target models use RoPE with different configurations:
fn apply_rope(x: &[f16], position: usize, dim: usize, theta: f32) -> Vec<f16> {
let mut result = x.to_vec();
for i in (0..dim).step_by(2) {
let freq = 1.0 / (theta.powf(i as f32 / dim as f32));
let angle = position as f32 * freq;
let cos_a = angle.cos();
let sin_a = angle.sin();
let x0 = result[i];
let x1 = result[i + 1];
result[i] = x0 * cos_a - x1 * sin_a; // Rotation
result[i + 1] = x0 * sin_a + x1 * cos_a; // Rotation
}
result
}
Performance: RoPE is lightweight - O(head_dim) per head per token. Not a bottleneck.
SIMD optimization: Process pairs of elements using AVX2 multiply-add instructions. For head_dim=128, that's 64 rotation pairs per head.
9.3 RMSNorm
fn rms_norm(x: &[f16], weight: &[f16], eps: f32) -> Vec<f16> {
let n = x.len() as f32;
let rms: f32 = x.iter().map(|v| (*v as f32).powi(2)).sum::<f32>() / n;
let inv_rms = 1.0 / (rms + eps).sqrt();
x.iter().zip(weight.iter())
.map(|(xi, wi)| *xi as f32 * inv_rms * *wi as f32)
.map(|v| v as f16)
.collect()
}
Performance: O(hidden_dim) with two passes (sum of squares, then normalize). Trivial compute.
10. Implementation Implications
10.1 Primary Target: Llama-3.1-8B
- Best overall fit for 6 GB at Q4_K_M with context 4096
- Standard architecture - no exotic features (no softcap, no post-norm quirks)
- Massive community support - GGUF models widely available
- 4:1 GQA - Reasonable KV cache size
- 32 layers - Good balance of quality vs sequential compute depth
10.2 Secondary Target: Qwen2.5-7B
- Best memory margin - 4.1 GB weights leaves 1.9 GB for everything else
- 7:1 GQA - Smallest KV cache of any target
- Tied embeddings - Saves ~200 MB (embed = lm_head)
- 28 layers - Fewer sequential compute steps = faster per token
- Excellent multilingual - Good for international deployment
10.3 Stretch Target: Gemma-2-9B
- Highest quality per parameter - Best 9B-class model on reasoning/challenge benchmarks
- Requires INT8 KV cache - Implement as a runtime option
- Sliding window simplifies - Fixed 4096-token context window (no context growth management)
- Special implementation needed - softcap, pre-attention scaling, post-norm
10.4 Runtime Architecture Configuration
enum ModelArchitecture {
Llama {
config: LlamaConfig,
rope_type: RopeType,
},
Qwen2 {
config: Qwen2Config,
tied_embeddings: bool,
},
Gemma2 {
config: Gemma2Config,
sliding_window: usize,
logit_softcap: f32,
pre_attn_scale: f32,
},
}
The executor dispatches to the architecture-specific layer implementations based on the loaded model's type.
The next document (Document 7: Benchmarks and Baselines) provides realistic performance targets and the complete benchmarking methodology for validating our runtime.