Wikilangs
NLP models for 340+ languages
Wikilangs provides bedrock NLP models (word embeddings, tokenizers, and base models) for 340+ languages, all derived from Wikipedia. It enables NLP research and applications for languages that lack resources from major providers.
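As a rough illustration of how per-language word embeddings like these are typically consumed, here is a minimal sketch: a vocabulary mapped to dense vectors, queried by cosine similarity. The tiny 4-dimensional vectors and the vocabulary below are made up for illustration only; real Wikilangs embeddings are trained on each language's Wikipedia and are far larger.

```python
import numpy as np

# Hypothetical toy embeddings (real ones would be loaded from a trained model).
embeddings = {
    "casablanca": np.array([0.9, 0.1, 0.0, 0.2]),
    "rabat":      np.array([0.8, 0.2, 0.1, 0.3]),
    "couscous":   np.array([0.1, 0.9, 0.3, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word):
    """Return the vocabulary word most similar to `word` (excluding itself)."""
    query = embeddings[word]
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine(query, embeddings[w]),
    )

print(nearest("casablanca"))  # the two city vectors end up closest to each other
```

The same lookup pattern works regardless of language, which is what makes per-language embeddings a useful bedrock layer for downstream tasks.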
Related posts
Beyond Tokenization: The Four Taxes and the Path Forward
The compounding tax stack low-resource languages carry, why vision encoders might hold the key, and the open research questions.
Tokenization is Killing Our Multilingual LLM Dream
Why tokenization is the hidden bottleneck blocking truly multilingual AI — lessons from building Sawalni and Wikilangs.
Why I stopped trusting the official Wikipedia dataset, and what I did about it
It all started with a DM from a friend, a member of and contributor to the Moroccan Wikipedia community. "Are you using the current version of Wikipedia? The official dataset is severely outdated. We added so many cool articles that are nowhere on huggingface." He was right. I was running a 2023 snapshot in 2025.
A Wordle for the Worldle
I built a word game for more than 300 languages, each drawing on its own Wikipedia as the source. Here's the thing nobody tells you: building a simple word game for most of these languages meant building things that didn't exist.