Lego bricks that don't quite fit — a metaphor for bad tokenization

Beyond Tokenization: The Four Taxes and the Path Forward

By Omar Kamali • March 23, 2026 • In AI, NLP, Tokenization, Multilingual, LLM, Research

This is Part 3 of a three-part series on tokenization and multilingual AI. Part 1 introduced the problem. Part 2 covered how bad tokenization wastes model capacity.


The Four Compounding Taxes

Low-resource languages don't just have less data. They carry a multiplicative penalty stack, where each problem amplifies the next.

Tax 1: Fertility overhead. More tokens per word. Shorter effective context window. Higher attention compute per sentence. And subword splits that generalize poorly because they do not correspond to any meaningful unit of the language.

Tax 2: Morphological incoherence. Boundaries don't respect morphemes. The model spends middle-layer depth reconstructing what the tokenizer destroyed, instead of doing the task it was asked to do.

Tax 3: No variant recovery. Insufficient data to learn orthographic correspondences. Every typo, diacritic variant, normalization mismatch, and case variation is a cold start: an entirely unrelated sequence in embedding space, forever.

Tax 4: Capacity spillover. Taxes 1–3 consume context positions, layer depth, and embedding dimensions. What remains for actual reasoning is systematically smaller than what a high-resource language gets from an equivalent model.
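Tax 3 is easy to reproduce concretely. A minimal sketch, using only the standard library: two Unicode normalization forms of the same word display identically but are different codepoint sequences, and a lookup-based tokenizer keyed on bytes (the `toy_token_id` helper below is illustrative, not any real tokenizer's API) assigns them unrelated IDs.

```python
import unicodedata

# Two renderings of the same word: precomposed "é" (NFC) versus
# "e" plus a combining accent (NFD). They display identically but
# are different codepoint sequences, hence different byte sequences.
nfc = unicodedata.normalize("NFC", "café")
nfd = unicodedata.normalize("NFD", "café")

print(nfc == nfd)          # False: distinct strings under the hood
print(len(nfc), len(nfd))  # 4 codepoints vs. 5

# A lookup-based tokenizer keyed on bytes assigns the two variants
# unrelated IDs: a cold start for whichever form the training data lacked.
vocab: dict[bytes, int] = {}
def toy_token_id(s: str) -> int:
    return vocab.setdefault(s.encode("utf-8"), len(vocab))

print(toy_token_id(nfc) == toy_token_id(nfd))  # False
```

Whichever normalization form is rarer in the training data pays the cold-start cost, and nothing in the discrete lookup lets it borrow strength from its twin.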

It's a runaway effect. The less data a language has, the worse its tokenization quality. The worse its tokenization, the more data it needs to compensate. The standard prescription, collecting more data, treats tokenization overhead as a fixed cost that low-resource languages are never positioned to pay off. You cannot data-scale your way out of a broken input pipeline. The tax compounds on the language that can least afford it.


The Second Deepseek Moment

But it's not a lost cause.

DeepSeek's OCR results demonstrated that feeding text to a vision encoder as a rendered image can outperform feeding the same text as tokens on character-level tasks. Practitioners independently rediscovered this as a "hack": literally screenshotting text before passing it to vision-language models to sidestep tokenization artifacts.

Why does it work? Vision encoders define a continuous latent space. A slightly shifted edge is still an edge. A slightly different pixel is still part of the same gradient. The representation absorbs variation by design, the structural opposite of what text tokenization does. There is no discrete lookup and no out-of-vocabulary problem. There is no Unicode normalization trap. There is just a continuous signal with smooth geometry.

Which raises a question critical for multilingualism: what would it mean to give a language model the same kind of perceptual front-end that vision models already take for granted? What if text, just a sequence of bytes at its most primitive level, could be consumed as a continuous signal rather than a lookup in a discrete symbol table? What if we could get rid of tokenization altogether?

The most robust tokenizer in production today might be a JPEG encoder.
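The "continuous signal" framing can be made concrete in a few lines. A toy sketch, not a proposed architecture (the `byte_signal` helper is an assumption for illustration): scale raw UTF-8 bytes into floats, and a one-letter spelling variant moves the signal slightly instead of jumping to an unrelated token ID.

```python
import numpy as np

def byte_signal(text: str) -> np.ndarray:
    """Map text onto a continuous signal: raw UTF-8 bytes scaled into [0, 1]."""
    return np.frombuffer(text.encode("utf-8"), dtype=np.uint8) / 255.0

a = byte_signal("tokenization")
b = byte_signal("tokenisation")  # one-letter orthographic variant (z -> s)

# The two signals differ in exactly one position, and only slightly there;
# a discrete vocabulary would instead map the pair to unrelated token IDs.
print(int(np.count_nonzero(a != b)))  # 1
print(float(np.abs(a - b).max()))     # ~0.027
```

Bytes are still a crude coordinate system, but the point stands: small edits produce small movements, which is exactly the smoothness tokenization throws away.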


Where Do We Go from Here?

Tokenization-free architectures are gaining serious traction: ByT5 and other byte-level models, and Meta FAIR's Large Concept Model, which operates in a concept embedding space rather than at the token level. These are genuine advances. But they require training from scratch, trade sequence efficiency for robustness, and cannot be deployed as improvements to the models that already serve the 340 languages Wikilangs covers.

What doesn't exist yet is a continuous pre-tokenization layer, a component that sits between raw text and the LLM's attention and MLP layers, mapping the brittle discrete token space into a smooth representation space where orthographic variants, diacritics, normalization forms, and morphological fragments collapse to nearby regions before the model ever sees them.
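To make "nearby regions" concrete, here is a deliberately crude stand-in for such a layer, using fastText-style character n-gram hashing rather than anything learned (`ngram_embed` and the 256-dimension choice are illustrative assumptions). Strings that share most of their character n-grams land close together, so an orthographic variant sits near its base form while an unrelated word does not:

```python
import numpy as np

DIM = 256  # illustrative embedding width

def ngram_embed(word: str, n: int = 3) -> np.ndarray:
    """Hash character n-grams into a fixed-width unit vector (fastText-style)."""
    padded = f"<{word}>"          # boundary markers, as in fastText
    v = np.zeros(DIM)
    for i in range(len(padded) - n + 1):
        v[hash(padded[i:i + n]) % DIM] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

base = ngram_embed("tokenization")
variant = ngram_embed("tokenisation")    # orthographic variant
unrelated = ngram_embed("hippopotamus")  # different word entirely

# The variant shares most trigrams with the base form, so it lands nearby;
# the unrelated word shares (almost) none and lands far away.
print(cos(base, variant) > cos(base, unrelated))  # True
```

A learned continuous layer would replace the hash with trained parameters, but the geometric goal is the same: variants collapse toward each other before the model ever sees them.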

"Tokenization Falling Short" (EMNLP 2024) explicitly names perturbation-invariant tokenization strategies as future work. That future work is what this post is calling for.

Several concrete empirical questions follow and remain open:

  • Does the degree of morphological misalignment at token boundaries predict, at fixed parameter count, measurable degradation on downstream reasoning tasks? Scale-sensitivity evidence suggests yes, but no study has controlled for this directly.

  • Can a continuous pre-tokenization layer trained contrastively to embed orthographic variants and morphological fragments to nearby regions recover the performance gap without retraining the LLM itself?

  • Does such a layer generalize across language families, or does it require per-family inductive biases? Wikilangs provides evaluation infrastructure across 340 languages to test this at scale. Sawalni is the proving ground for Moroccan languages specifically.
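The second question can at least be stated precisely. Below is a minimal numpy sketch of a contrastive (InfoNCE-style) objective, assuming the layer emits unit vectors: variants of a word act as positives pulled toward their anchor, unrelated strings as negatives pushed away. This illustrates the loss, not a training recipe.

```python
import numpy as np

def info_nce(anchor, positives, negatives, temp=0.1):
    """Contrastive loss over unit vectors: low when the anchor's variants
    (positives) capture most of the softmax mass against the negatives."""
    logits = np.concatenate([positives @ anchor, negatives @ anchor]) / temp
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[: len(positives)].sum())

anchor = np.array([1.0, 0.0])
variants = np.array([[0.98, 0.199]])  # nearby: an orthographic variant
unrelated = np.array([[0.0, 1.0]])    # orthogonal: a different word

good = info_nce(anchor, variants, unrelated)
bad = info_nce(anchor, unrelated, variants)  # roles swapped
print(good < bad)  # True: the objective rewards collapsing variants
```

Whether optimizing this objective over real orthographic variant pairs recovers the performance gap, without touching the frozen LLM behind it, is exactly the open experiment.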

Tokenization is not a solved problem. It is the one structural barrier that compounds every other disadvantage low-resource languages carry. It is a leaky bucket, draining resources no matter how far you scale the model.

If you work on multilingual representations, input encoding architectures, or low-resource NLP, these are concrete open experiments. Reach out and let's figure it out.



Interested in collaboration?

I'm open to research partnerships, compute collaborations, or contributing to low-resource language AI.
