Beyond Tokenization: The Four Taxes and the Path Forward
The compounding tax stack low-resource languages carry, why vision encoders might hold the key, and the open research questions.
How bad tokenization forces language models to waste capacity on reconstruction instead of reasoning.
Why tokenization is the hidden bottleneck blocking truly multilingual AI — lessons from building Sawalni and Wikilangs.
It all started with a DM from a friend who is a member of and contributor to the Moroccan Wikipedia community. "Are you using the current version of Wikipedia? The official dataset is severely outdated. We added so many cool articles nowhere on huggingface." He was right. I was running a 2023 snapshot in 2025.
I built a word game for more than 300 languages, each drawing on its own Wikipedia as the source. Here's the thing nobody tells you: building a simple word game for most of these languages meant building things that didn't exist.
Earlier this month, I whipped up a Python script with an LLM that parsed amd-smi output. It was ugly. It worked. I called it picomon.
Announcing Wikipedia Monthly, an always-fresh dataset to support research on low-resource languages.
If you've ever struggled to get consistent JSON output from large language models, I have a simple and clever solution for you.
As we leave 2024 behind, I found myself reflecting over the holidays on a transformative year that reshaped my grasp of technology's role in human connection.
Exporting & importing Hugging Face datasets to spreadsheets and various file formats.
OpenAI's data integrations in Assistant and GPTs are causing ripples in the AI world. Beyond the excitement, let's look critically at OpenAI's strategy, the tension in its ecosystem, questionable comparisons to Apple, and the impending threat of commodification that OpenAI itself may face.
We know what Data is, but where does it come from? With Web Scraping you can collect data from any website. This article will get you started in this world.
An introduction to ease you into the world of data: what it is, what it is useful for, and the privacy concerns around it, as a preamble to the Data Series.
I've been asked multiple times, "Why are you creating a Moroccan AI?" Today I want to share the story behind Sawalni, the first AI in history to speak our beautiful Moroccan Darija, with all of you.
Have you ever wondered how AI systems make sense of the vast amount of information they encounter? Let's look at AI tokens and why you should care.
New websites are launched every day and others stop existing overnight. Servers and IP addresses get repurposed. How does DNS keep it all together?
I've recently wanted to add a newsletter to my blog. I decided to build it instead of using a marketing tool.
There are over 22 billion machines connected to the internet. How do these machines communicate with each other to fulfil their purpose?
Do you ever stop and think about the countless invisible components working behind the scenes to bring an app like TikTok to life on your phone?
Read this article to learn more about How the Internet Works: The Essential Guide to Digital Infrastructure
Read this article to learn more about The Market Decides Your Project Scope
Read this article to learn more about Should you automate this task?