"SUBSTRATE / tokenizers-zig"

tokenizers-zig

pre-1.0 . v0.26.0 . shipped 2026-05-22 . AGPL-3.0

191 tests + 600-iter fuzz . lane: inference

"WHY"

The tokenizer is the contract the model was trained against. Mis-tokenise a prompt and every subsequent layer in the forward pass is computing against a different distribution than the model expects. The Python tokenizers package is a Rust core with PyO3 bindings, which means the operator who claims a sovereign-Zig inference path still depends on a Rust binary loaded into a Python process at the prompt-encoding stage. The wedge is small in compute terms; it is load-bearing in supply-chain terms.

tokenizers-zig closes the wedge. It loads the same tokenizer.json files HuggingFace ships against BERT, T5, and TinyLlama. The encode path returns the canonical Encoding struct (ids, type_ids, attention_mask, offsets, tokens, special_tokens_mask, sequence_ids) so downstream code that was written against the HuggingFace contract works without modification at the type boundary.

"WHAT"

Three tokenizer families with HuggingFace tokenizer.json parity.

The v0.2 ship at 2026-05-21 replaced the linear-scan BPE merge with a priority-queue O(n log n) merge per Sennrich 2016. The encode-time win was 491 microseconds to 56 microseconds, an 8.8x speedup on the TinyLlama prompt fixture. The gap to the upstream Rust tokenizer closed from roughly 9x slower to roughly 1.2x slower at the same single-thread budget. The point is not to outrun the Rust path; the point is that the Zig path is within the same operating order of magnitude on the same hardware while removing the Python and Rust dependencies from the deployment envelope.

v0.26 added a property-based invariant fuzzer. The fuzzer generates 200 random ASCII texts per real fixture (BERT, T5, TinyLlama) and verifies five canonical invariants: non-empty input produces at least one token; all offsets lie within bounds; offsets are monotonic non-decreasing; the parallel arrays (attention_mask, type_ids, offsets, tokens, special_tokens_mask) stay parallel to ids; decode round-trips without error. The fuzz pass caught a real monotonicity bug in the TinyLlama BPE+SentencePiece-marker offset remap when adjacent normalised bytes were all synthetic (the ▁▁ double-space sequence). The fix tracks a running_end cursor through the encode loop and assigns zero-width offsets to all-synthetic tokens at the current running position.

"MILESTONES"

"DEPENDENCIES"

"ADAPTER TARGETS"

"RELATED CANON"

"RELATED LAB NOTES"

"RELATED WORKSHOP"

The v0.26 to v0.27 work (regex pre-tokenisation parity with the GPT-2 / GPT-3.5 / GPT-4 tiktoken set) is queued. Workshop entry forthcoming.

"LIMITS"

Pre-1.0 substrate, named honestly.

"SOURCE"

"INSTALL"

git clone https://github.com/SMC17/tokenizers-zig.git
cd tokenizers-zig
zig build -Doptimize=ReleaseFast
zig build test

Zig 0.16.0 required. No external dependencies, no Rust toolchain, no Python wrapper.

"DOWNLOAD"

"CITATION"

@software{collins_tokenizers_zig_2026,
  author       = {Collins, Sean},
  title        = {{tokenizers-zig: Pure-Zig HuggingFace Tokenizers}},
  version      = {v0.26.0},
  year         = {2026},
  month        = {5},
  url          = {https://sunlitmoon.online/substrate/tokenizers-zig.html},
  note         = {AGPL-3.0-or-later. Substrate page: sunlitmoon.online/substrate/tokenizers-zig.}
}