"SUBSTRATE / vllm-zig"

vllm-zig

pre-1.0 · v0.0.6 · shipped 2026-05-27 · AGPL-3.0

80 unit tests · lane: inference

"WHY"

The 2026 AI economy charges rent on every token. The rent flows through Python on rented GPUs, through a serving stack the operator does not own, against a model-weight file the operator does not control. The canon calls this the spread-scalper position. Anti-Edison 17 lays out the audit. vllm-zig is the appliance-layer answer.

vllm-zig is a forward pass. It reads a safetensors blob into typed tensors, runs the prompt through 22 transformer blocks, samples a token, streams it out. It does this in a single Zig binary, on a CPU, without Python in the serving path, against a weights file the operator already owns. It is the minimum viable substrate for inference that does not pay rent to a third-party token meter.
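
To make "reads a safetensors blob into typed tensors" concrete, here is a minimal sketch of the public safetensors file layout: an 8-byte little-endian length, a JSON header mapping tensor names to dtype, shape and byte offsets, then the raw data. It is written in Python purely for readability and is not the project's Zig code. The function names and the `demo.weight` tensor are hypothetical.

```python
import json
import struct

def parse_safetensors_header(blob):
    """Return (tensor_index, data_start) for a safetensors byte buffer."""
    # First 8 bytes: little-endian u64 giving the JSON header length.
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(bytes(blob[8:8 + header_len]).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header, 8 + header_len

def tensor_view(blob, index, data_start, name):
    """Slice one tensor's raw bytes without copying (memoryview slice)."""
    begin, end = index[name]["data_offsets"]  # offsets are relative to the data section
    return memoryview(blob)[data_start + begin:data_start + end]

def build_demo_blob():
    """Build a tiny in-memory file holding a hypothetical tensor 'demo.weight'."""
    payload = bytes([0x80, 0x3F, 0x00, 0x40])  # BF16 1.0 and 2.0, little-endian
    header = json.dumps({
        "demo.weight": {"dtype": "BF16", "shape": [2], "data_offsets": [0, 4]}
    }).encode("utf-8")
    return struct.pack("<Q", len(header)) + header + payload
```

Only the header is decoded; each tensor is a slice over the original buffer, so cost scales with the number of tensors rather than with the size of the weights.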

It is not a vLLM clone. It does not page attention, it does not continuous-batch, it does not quantise. Those are the next phases, gated on the substrate being correct first, per ARCHITECTURE.md. This page exists at v0.0.6, rather than waiting, because the correct-first work is the load-bearing claim, and the canon is graded against shipped substrate, not against intent.

<div class="benchmark-chart" style="padding: 0; overflow: hidden; background: #050505; position: relative; margin: 2rem 0; border-radius: 8px; border: 1px solid rgba(255,255,255,0.1);"> <canvas id="vllm-canvas" style="display: block; width: 100%; border-radius: 6px;"></canvas> <div style="position: absolute; inset: 0; box-shadow: inset 0 0 20px rgba(0,0,0,0.8); pointer-events: none;"></div> </div> <script src="{{site_url}}/js/vllm-visualizer.js"></script>

"WHAT"

The forward pass on TinyLlama-1.1B-Chat, end-to-end, in this order:

  1. Weight load via safetensors-zig. The 2.2 GB blob holds 201 tensors in BF16 and parses in 241 microseconds into typed views with no intermediate copies.
  2. Tokenization via tokenizers-zig. The prompt enters the same tokenizer.json contract the upstream model was trained against.
  3. Twenty-two transformer blocks. Each block is rotary position embeddings (RoPE), grouped-query attention with KV cache, SwiGLU FFN, RMSNorm. The matmul is multi-threaded over a persistent thread pool. The persistent pool replaced the ad-hoc spawn path in v0.0.5; the spawn cost was eating roughly 30 percent of decode time at the v0.0.4 ceiling.
  4. Sampling. Greedy at default; temperature, top-k, top-p configurable.
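
The sampling step at the end of the list can be sketched as follows. This is a hedged reference in Python showing one common way to combine greedy, temperature, top-k and top-p. The function name, parameter names, default values and the filter order (top-k before top-p) are assumptions for illustration, not the project's actual interface.

```python
import math
import random

def sample_token(logits, temperature=0.0, top_k=0, top_p=1.0, rng=None):
    """Pick a token id from raw logits.

    temperature <= 0 means greedy (argmax). top_k == 0 and top_p == 1.0
    disable those filters. All defaults here are illustrative assumptions.
    """
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax over temperature-scaled logits, shifted by the max for stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        mass += prob
        if mass >= top_p:
            break

    # Draw from the survivors; random.choices renormalises the weights.
    rng = rng or random.Random()
    return rng.choices([i for _, i in kept], weights=[p for p, _ in kept], k=1)[0]
```

Calling `sample_token(logits)` with no other arguments gives the greedy path; passing a seeded `random.Random` as `rng` makes the stochastic paths reproducible.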

v0.0.5 landed the tiled-routing optimisation on the B matrix for decode. v0.0.6 closed three load-bearing breaks from that refactor: the page_manager 4-arg signature was restored, the build graph wiring lost in the engine rewrite was rebuilt around the test and bench surface, and the LM head was migrated to a raw BF16 matmul through matmulBF16SIMD. The LM head is the single largest matmul in TinyLlama decode: at M=1, K=2048, N=32000 its weight matrix is about 131 MB in BF16 versus 262 MB in F32, so halving the B-matrix bytes pulled through L2 and L3 is exactly the bandwidth-bound case the BF16 kernel was built for.
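
The byte counts quoted for the LM head follow directly from its shape. The sketch below reproduces that arithmetic and shows, in scalar Python, what a raw BF16 matrix-vector product does: each weight is widened to F32 on the fly by placing its 16 bits in the high half of a 32-bit float. It is illustrative only. The helper names are hypothetical, the conversion truncates rather than rounds, and F32 accumulation is an assumption rather than a statement about how matmulBF16SIMD is written.

```python
import struct

K, N = 2048, 32000  # LM head shape quoted in the text (M=1 during decode)

def weight_bytes(k, n, bytes_per_element):
    """Bytes streamed for one full pass over a k x n weight matrix."""
    return k * n * bytes_per_element

def f32_to_bf16_bits(value):
    """Truncate an F32 to BF16 by keeping its top 16 bits (no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits >> 16

def bf16_bits_to_f32(bits):
    """Widen BF16 to F32 by placing the 16 bits in the high half."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return value

def matvec_bf16(x, w_bits, k, n):
    """y = x @ W for M=1, with W stored row-major as BF16 bit patterns."""
    y = [0.0] * n
    for row in range(k):
        xr = x[row]
        for col in range(n):
            y[col] += xr * bf16_bits_to_f32(w_bits[row * n + col])
    return y

bf16_mb = weight_bytes(K, N, 2) / 1e6
f32_mb = weight_bytes(K, N, 4) / 1e6
print(f"BF16: {bf16_mb:.0f} MB, F32: {f32_mb:.0f} MB")  # BF16: 131 MB, F32: 262 MB
```

At M=1 every weight is read once and used once, so the bytes moved per token track the weight size almost exactly, which is why storing the matrix in BF16 halves the traffic.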

ARCHITECTURE.md in the repo names what is in scope and what is deferred. GPU kernels for Ampere and later are Phase 2. Quantised inference (int8, nf4) is Phase 3. Multi-model support beyond the Llama family is Phase 4. The current CPU-first substrate is Phase 1, and Phase 1 is the substrate the rest is built on.

"MILESTONES"

"DEPENDENCIES"

"ADAPTER TARGETS"

"RELATED CANON"

"RELATED LAB NOTES"

"RELATED WORKSHOP"

The v0.0.6 to v0.0.7 work (forward-pass wiring of tiled-routing plus bench publication) is paused on the bench-publish gate. Workshop entry forthcoming.

"LIMITS"

Pre-1.0 substrate, named honestly.

"SOURCE"