"WHY"
The 2026 AI economy charges rent on every token. The rent flows through Python on rented GPU, through a serving stack the operator does not own, against a model-weight file the operator does not control. The canon names this position the spread-scalper position. Anti-Edison 17 lays out the audit. vllm-zig is the appliance-layer answer.
vllm-zig is a forward pass. It reads a safetensors blob into typed tensors, runs the prompt through 22 transformer blocks, samples a token, streams it out. It does this in a single Zig binary, on a CPU, without Python in the serving path, against a weights file the operator already owns. It is the minimum viable substrate for inference that does not pay rent to a third-party token meter.
It is not a vLLM clone. It does not page attention, it does not continuous-batch, it does not quantise. Those are the next phases, gated on the substrate being correct first per ARCHITECTURE.md. The reason the page exists at v0.0.6 rather than waiting is that the correct-first work is the load-bearing claim, and the canon is graded against shipped substrate, not against intent.
<div class="benchmark-chart" style="padding: 0; overflow: hidden; background: #050505; position: relative; margin: 2rem 0; border-radius: 8px; border: 1px solid rgba(255,255,255,0.1);"> <canvas id="vllm-canvas" style="display: block; width: 100%; border-radius: 6px;"></canvas> <div style="position: absolute; inset: 0; box-shadow: inset 0 0 20px rgba(0,0,0,0.8); pointer-events: none;"></div> </div> <script src="{{site_url}}/js/vllm-visualizer.js"></script>
"WHAT"
The forward pass on TinyLlama-1.1B-Chat, end-to-end, in this order:
- Weight load via safetensors-zig. The 2.2 GB blob, 201 tensors in BF16, parses in 241 microseconds into typed views with no intermediate copies.
- Tokenization via tokenizers-zig. The prompt enters the same tokenizer.json contract the upstream model was trained against.
- Twenty-two transformer blocks. Each block is rotary position embeddings (RoPE), grouped-query attention with KV cache, SwiGLU FFN, RMSNorm. The matmul is multi-threaded against a persistent pool. The persistent pool replaced the ad-hoc spawn path in v0.0.5; the spawn cost was eating roughly 30 percent of decode time at the v0.0.4 ceiling.
- Sampling. Greedy at default; temperature, top-k, top-p configurable.
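The sampling stage at the end of that list can be sketched in a few lines of Zig. This is a minimal illustration under stated assumptions, not the repository's code: the function names `argmax`, `softmaxWithTemperature`, and `sampleFromProbs` are hypothetical, and the uniform draw `u` is passed in by the caller so the sketch does not depend on any particular random-number API.

```zig
const std = @import("std");

/// Greedy decoding: return the index of the largest logit.
/// Ties resolve to the lowest index.
pub fn argmax(logits: []const f32) usize {
    std.debug.assert(logits.len > 0);
    var best: usize = 0;
    for (logits, 0..) |v, i| {
        if (v > logits[best]) best = i;
    }
    return best;
}

/// Temperature softmax, written into `probs` (same length as `logits`).
/// Subtracting the max logit keeps @exp from overflowing.
pub fn softmaxWithTemperature(logits: []const f32, temperature: f32, probs: []f32) void {
    std.debug.assert(logits.len == probs.len and temperature > 0);
    const max = logits[argmax(logits)];
    var sum: f32 = 0;
    for (logits, probs) |l, *p| {
        p.* = @exp((l - max) / temperature);
        sum += p.*;
    }
    for (probs) |*p| p.* /= sum;
}

/// Draw a token id by inverse-CDF sampling. `u` is a uniform draw in [0, 1)
/// supplied by the caller, so this function stays deterministic and testable.
pub fn sampleFromProbs(probs: []const f32, u: f32) usize {
    var acc: f32 = 0;
    for (probs, 0..) |p, i| {
        acc += p;
        if (u < acc) return i;
    }
    return probs.len - 1; // guard against rounding leaving acc slightly below 1
}
```

Greedy decoding is `argmax` alone. Top-k and top-p would zero out and renormalise entries of `probs` before the draw; they are left out here to keep the sketch short.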
v0.0.5 landed the tiled-routing optimisation on the B matrix for decode. v0.0.6 closed three load-bearing breaks from that refactor: the page_manager 4-arg signature was restored, the build graph wiring lost in the engine rewrite was rebuilt around the test and bench surface, and the LM head was migrated to a raw BF16 matmul through matmulBF16SIMD. The LM head is the single largest matmul in TinyLlama decode at M=1, K=2048, N=32000 (about 131 MB of BF16 versus 262 MB of F32), so halving the B-matrix bytes pulled through L2 and L3 is exactly the bandwidth-bound case the BF16 kernel was built for.
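The byte counts and the BF16 widening step behind that LM-head claim can be checked with a short scalar sketch. It is an illustration only: `bf16ToF32` and `matvecBF16` are hypothetical names, the row-major N-by-K weight layout is an assumption, and the loop is scalar where the repository's matmulBF16SIMD is vectorised and threaded.

```zig
const std = @import("std");

// Shapes quoted in the text for the TinyLlama LM head at decode (M = 1).
const K: usize = 2048; // hidden size
const N: usize = 32000; // vocabulary size

// Weight bytes streamed per decoded token.
pub const bf16_bytes: usize = K * N * 2; // 131_072_000, about 131 MB
pub const f32_bytes: usize = K * N * 4; // 262_144_000, about 262 MB

/// BF16 is the top 16 bits of an IEEE-754 f32, so widening is a shift and a bitcast.
pub inline fn bf16ToF32(bits: u16) f32 {
    return @bitCast(@as(u32, bits) << 16);
}

/// y[n] = sum over k of x[k] * W[n][k], with W stored row-major as raw BF16 bits.
/// Assumes one contiguous row per output; the real kernel's layout may differ.
pub fn matvecBF16(x: []const f32, w: []const u16, y: []f32) void {
    std.debug.assert(w.len == x.len * y.len);
    for (y, 0..) |*out, n| {
        const row = w[n * x.len ..][0..x.len];
        var acc: f32 = 0;
        for (x, row) |xv, wv| acc += xv * bf16ToF32(wv);
        out.* = acc;
    }
}
```

Because the weights are read as 16-bit values and widened in registers, each decoded token streams half as many weight bytes as an F32 copy would, while accumulation still happens in F32.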
The architecture document ARCHITECTURE.md in the repo names what is in-scope and what is deferred. GPU kernels for Ampere and later are Phase 2. Quantised inference (int8, nf4) is Phase 3. Multi-model support beyond the Llama family is Phase 4. The current CPU-first substrate is Phase 1, and Phase 1 is the substrate the rest is built on.
"MILESTONES"
- 2026-05-27 · v0.0.6 · tested. Page-manager 4-arg signature restored, build graph rebuilt around test + bench surface, BF16 LM head wired through matmulBF16SIMD. 80 unit tests pass.
- 2026-05-22 · v0.0.5 · benched. Tiled-routing + scalar/SIMD matmulBF16 land. Persistent thread pool replaces ad-hoc spawn.
- 2026-05-21 · v0.0.4 · tested. Multi-thread matmul with ad-hoc spawn. First end-to-end TinyLlama decode on Ice Lake.
- 2026-05-20 · v0.0.3 · tested. RoPE + GQA + KV cache wired. Forward pass green against safetensors-zig weight load.
- 2026-05-18 · v0.0.1 · tested. Repository scaffold. RMSNorm + linear projection unit tests pass.
"DEPENDENCIES"
- safetensors-zig. Pure-Zig safetensors reader. Provides the typed-tensor views vllm-zig consumes at weight load.
- tokenizers-zig. BPE / WordPiece / Unigram tokenizer with HuggingFace parity. Provides the prompt-to-token-id stage.
"ADAPTER TARGETS"
- faiss-zig. Composes on top for RAG retrieval; vllm-zig accepts a retrieved-context prefix at the forward-pass entry.
- agent-fleet. BlastFleet substrate uses vllm-zig as a hosted token source for local-model lifecycle on Tier-2 lanes.
"RELATED CANON"
- Anti-Edison 17 — The AI Wrapper Question. The merchant-lens audit that names this position.
- The Mercantile Thesis. The appliance-layer claim this substrate instantiates.
- Doctrine 01 — Field Statement. The discipline.
"RELATED LAB NOTES"
- AI inference in Zig — a 4-repo stack from weights to tokens. The composition write-up.
"RELATED WORKSHOP"
The v0.0.6 to v0.0.7 work (forward-pass wiring of tiled-routing plus bench publication) is paused on the bench-publish gate. Workshop entry forthcoming.
"LIMITS"
Pre-1.0 substrate, named honestly.
- CPU only. No CUDA kernel. No MLX backend. GPU work is Phase 2, gated on the CPU substrate being correct first.
- Llama-family only. RoPE plus GQA plus SwiGLU is the Llama shape. Qwen and Mistral are present; Gemma and Phi are not yet wired.
- No quantisation. F32 and BF16 only. The token-latency number does not compete with llama.cpp Q4. Phase 3.
- No continuous batching. Single-stream decode. The PagedAttention runtime is Phase 2.
- Not vLLM-equivalent. The forward pass runs; the architecture is auditable end-to-end in an afternoon; the bench is reproducible. That is the substrate claim. It is not a vLLM replacement, and the README states this in the same words.
- Zig 0.16 ceiling. Standard-library API churn each release. The repo pins 0.16.0 and the migration tax travels with the substrate.
"SOURCE"
- AGPL-3.0-or-later. The canonical public surface is mirrored at github.com/SMC17/vllm-zig. The architecture is open. Verify the claims.