Wikilangs
NLP models for 340+ languages
Wikilangs provides bedrock NLP models (word embeddings, tokenizers, and base models) for 340+ languages, all derived from Wikipedia. It enables NLP research and applications for languages that lack resources from major providers.
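As a rough illustration of how per-language word embeddings like these are typically consumed, here is a minimal sketch: a vocabulary mapped to dense vectors, queried by cosine similarity. The tiny 4-dimensional vectors and the vocabulary below are made up for illustration only; real Wikilangs embeddings are trained on each language's Wikipedia and are far larger.

```python
import numpy as np

# Hypothetical toy embeddings (real ones would be loaded from a trained model).
embeddings = {
    "casablanca": np.array([0.9, 0.1, 0.0, 0.2]),
    "rabat":      np.array([0.8, 0.2, 0.1, 0.3]),
    "couscous":   np.array([0.1, 0.9, 0.3, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word):
    """Return the vocabulary word most similar to `word` (excluding itself)."""
    query = embeddings[word]
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine(query, embeddings[w]),
    )

print(nearest("casablanca"))  # the two city vectors end up closest to each other
```

The same lookup pattern works regardless of language, which is what makes per-language embeddings a useful bedrock layer for downstream tasks.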
Related posts
Beyond Tokenization: The Four Taxes and the Path Forward
The compounding tax stack low-resource languages carry, why vision encoders might hold the key, and the open research questions.
Tokenization is Killing Our Multilingual LLM Dream
Why tokenization is the hidden bottleneck blocking truly multilingual AI — lessons from building Sawalni and Wikilangs.
Why I stopped trusting the official Wikipedia dataset, and what I did about it
It all started with a DM from a friend, a member of and contributor to the Moroccan Wikipedia community. "Are you using the current version of Wikipedia? The official dataset is severely outdated. We added so many cool articles that are nowhere on huggingface." He was right. I was running a 2023 snapshot in 2025.
A Wordle for the Worldle
I built a word game for more than 300 languages, each drawing on its own Wikipedia as the source. Here's the thing nobody tells you: building a simple word game for most of these languages meant building things that didn't exist.