Omar Kamali
Independent AI Researcher & Builder
I build AI for the languages the industry ignores.
I grew up in Morocco, speaking Darija in a world where every piece of technology spoke back in someone else's language. That experience is the origin of everything I build.
In 2023, I built Sawalni, the first conversational AI for Moroccan Darija, supporting both Arabic and Latin script. Corpus, pipeline, model architecture: all built from scratch, with no prior art to lean on. Thousands of users. Presented at international conferences on Moroccan Arabic linguistics. Featured on Moroccan national television.
I run Omneity Labs on a personal basis, as a private R&D lab focused on low-resource language AI. Current research: training base language models for underrepresented language families, and building the data pipelines, tokenization methods, and evaluation infrastructure that don't yet exist for these languages. I've published peer-reviewed work on multilingual phonetics, and I maintain open-source datasets and models used by researchers working on similar problems.
I also lead GenAI engineering at Blue Yonder, where I build and deploy LLM systems at enterprise scale. That work informs how I think about training infrastructure and production deployment, not just research prototypes.
My first line of code was BASIC on an Amstrad CPC at age six. I built my first website at nine (it's still online). My first serious research question: why can't I talk to a computer in the language I actually think in? I'm still working on the answer.
I'm interested in collaborating with researchers, communities, and organizations working on language equity, multilingual AI, and open-source NLP. If that's you, I'd like to hear from you.
Work
Sawalni
The first conversational AI for Moroccan Darija & Amazigh. Arabic, Latin, and Tifinagh script. Built from scratch.
sawalni.com
WikiLLM
In development
Open base models for low-resource language families, trained on Wikipedia data.
Coming 2026
wikilangs.org
NLP models derived from 340+ Wikipedia language editions to bootstrap LLM development.
HuggingFace
Open source
Tools, datasets, and infrastructure for multilingual NLP: tokenizers, embeddings, training frameworks, and data pipelines.
github.com/omarkamali
Recent writing
Picomon 0.2.0: From AMD Crash Fix to GPU Monitoring That Doesn’t Suck
Earlier this month, I whipped up a Python script with an LLM that parsed amd-smi output. It was ugly. It worked. I called it picomon.
Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research
Announcing Wikipedia Monthly, an always-fresh dataset to support research on low-resource languages.
Getting Perfectly Structured Data from LLMs
If you've ever struggled to get consistent JSON output from large language models, I have a simple and clever solution for you.
2024: A Year of Growth, Innovation, and Community
As we leave 2024 behind, I found myself reflecting over the holidays on a transformative year that reshaped my grasp of technology's role in human connection.
Datapluck: Portability Tool for Huggingface Datasets
Exporting & importing Hugging Face datasets to spreadsheets and various file formats.