"WHY"
The tokenizer is the contract the model was trained against. Mis-tokenise a prompt and every subsequent layer in the forward pass is computing against a different distribution than the model expects. The Python tokenizers package is a Rust core with PyO3 bindings, which means the operator who claims a sovereign-Zig inference path still depends on a Rust binary loaded into a Python process at the prompt-encoding stage. The wedge is small in compute terms; it is load-bearing in supply-chain terms.
tokenizers-zig closes the wedge. It loads the same tokenizer.json files HuggingFace ships against BERT, T5, and TinyLlama. The encode path returns the canonical Encoding struct (ids, type_ids, attention_mask, offsets, tokens, special_tokens_mask, sequence_ids) so downstream code that was written against the HuggingFace contract works without modification at the type boundary.
"WHAT"
Three tokenizer families with HuggingFace tokenizer.json parity.
- BPE. Byte-Pair Encoding. The TinyLlama / Llama family path. Both v1 string-merges and v2 array-merges formats. ByteLevel pre-tokenisation per GPT-2.
- WordPiece. The BERT family path. Greedy longest-match with the
##continuation marker. - Unigram. The T5 / SentencePiece family path. Viterbi-decoded best path.
The v0.2 ship at 2026-05-21 replaced the linear-scan BPE merge with a priority-queue O(n log n) merge per Sennrich 2016. The encode-time win was 491 microseconds to 56 microseconds, an 8.8x speedup on the TinyLlama prompt fixture. The gap to the upstream Rust tokenizer closed from roughly 9x slower to roughly 1.2x slower at the same single-thread budget. The point is not to outrun the Rust path; the point is that the Zig path is within the same operating order of magnitude on the same hardware while removing the Python and Rust dependencies from the deployment envelope.
v0.26 added a property-based invariant fuzzer. The fuzzer generates 200 random ASCII texts per real fixture (BERT, T5, TinyLlama) and verifies five canonical invariants: non-empty input produces at least one token; all offsets lie within bounds; offsets are monotonic non-decreasing; the parallel arrays (attention_mask, type_ids, offsets, tokens, special_tokens_mask) stay parallel to ids; decode round-trips without error. The fuzz pass caught a real monotonicity bug in the TinyLlama BPE+SentencePiece-marker offset remap when adjacent normalised bytes were all synthetic (the ▁▁ double-space sequence). The fix tracks a running_end cursor through the encode loop and assigns zero-width offsets to all-synthetic tokens at the current running position.
"MILESTONES"
- 2026-05-22 · v0.26.0 · fuzz-tested. Property-based invariant fuzzer. Caught and fixed the synthetic-byte monotonicity bug. 191 tests plus 600-iteration fuzz pass across the three real fixtures.
- 2026-05-22 · v0.25.0 · tested. Full canonical HF
Encodingstruct.encodeFull/encodePairFullone-shot APIs. 186 tests pass. - 2026-05-21 · v0.2.0 · benched. Priority-queue BPE merge per Sennrich 2016. 491 us to 56 us on the TinyLlama prompt fixture. Gap to upstream Rust closed from 9x to 1.2x.
- 2026-05-21 · v0.0.1 · tested. Scaffold. ByteLevel pre-tokenisation plus BPE merge. HuggingFace
tokenizer.jsonround-trip.
"DEPENDENCIES"
- Zig 0.16 standard library. No external dependencies. The point is the single-binary deployment envelope.
"ADAPTER TARGETS"
vllm-zig. Tokenisation stage in the inference pipeline. The prompt enters the sametokenizer.jsoncontract the upstream model was trained against.
"RELATED CANON"
- Anti-Edison 17 — The AI Wrapper Question. The merchant-lens audit. The tokenisation wedge is small but load-bearing on supply chain.
- The Mercantile Thesis. The appliance-layer claim this substrate is one component of.
"RELATED LAB NOTES"
- AI inference in Zig — a 4-repo stack from weights to tokens. The composition write-up.
"RELATED WORKSHOP"
The v0.26 to v0.27 work (regex pre-tokenisation parity with the GPT-2 / GPT-3.5 / GPT-4 tiktoken set) is queued. Workshop entry forthcoming.
"LIMITS"
Pre-1.0 substrate, named honestly.
- No tiktoken parity yet. The regex pre-tokeniser path used by GPT-2 and later families is queued for v0.27. Llama, BERT, and T5 are present; OpenAI-family BPE is not yet wired.
- Encode is single-threaded. Batch encode in v0.28.
- Numbers measured at single-thread parity. The 1.2x gap to upstream Rust is single-thread to single-thread; the upstream
tokenizerscrate batches differently and the multi-thread comparison is not the substrate claim. - Zig 0.16 ceiling. Standard-library API churn each release. The repo pins
0.16.0.
"SOURCE"
- AGPL-3.0-or-later. This substrate page is the canonical public surface; the source mirror is gated by current posture and not advertised as publicly reachable.
"INSTALL"
git clone https://github.com/SMC17/tokenizers-zig.git
cd tokenizers-zig
zig build -Doptimize=ReleaseFast
zig build test
Zig 0.16.0 required. No external dependencies, no Rust toolchain, no Python wrapper.
"DOWNLOAD"
- Release tarball:
v0.26.0— BPE / WordPiece / Unigram, 189 tests, 600-iteration property fuzzer, canonical Encoding parity with HuggingFace. - Source archive:
v0.26.0.tar.gz.
"CITATION"
@software{collins_tokenizers_zig_2026,
author = {Collins, Sean},
title = {{tokenizers-zig: Pure-Zig HuggingFace Tokenizers}},
version = {v0.26.0},
year = {2026},
month = {5},
url = {https://sunlitmoon.online/substrate/tokenizers-zig.html},
note = {AGPL-3.0-or-later. Substrate page: sunlitmoon.online/substrate/tokenizers-zig.}
}