Omar Kamali
Independent AI Researcher & Builder · Berlin
I build AI for the languages the industry ignores.
Origin
I grew up multilingual in Morocco. Darija at home, French at school, Arabic in religion, English on the internet. My parents teach French and Arabic and had a running bet on whether I'd end up more Arabic- or French-minded. Little did they expect me to get pulled in by the English-speaking world on a daily basis.
Technology was the catalyst. My first line of code was BASIC on an Amstrad CPC at age six, from 1980s magazine listings. I built my first website at nine (it's still online). But here's the kicker: every computer I touched spoke a language that wasn't mine. Many times I asked myself why I couldn't simply use a computer, with all its games and cool programs, in the language I actually think in. Moroccan Arabic was never in the language selector dropdown, Google Translate didn't support it (or any other, for that matter), and the few resources that existed were scattered and low-quality, treating Darija as a bastardized language, as if it weren't the mother tongue of millions of people and the language they think in, me included.
I decided to start coding more professionally to make software that is mine, where I have full say. Blissful in my ignorance, I worked on an OS that never saw the light of day with a fellow from Tunisia whom I met on an (ironically) French-speaking forum. What started as a simple question led me through over a decade of software engineering, product leadership, and eventually to the work I do now: building the AI infrastructure that makes this possible.
Research & infrastructure
I founded Omneity Labs in late 2022 as a private R&D lab focused on low-resource language AI. The lab produces open-source tools, datasets, and models: the infrastructure that doesn't exist yet for underrepresented languages.
In 2023, I built Sawalni, the first conversational AI for Moroccan Darija. Arabic and Latin script. Corpus, pipeline, model architecture: all from scratch, with no prior art. Thousands of users in controlled deployments. Featured on Moroccan national television (Al Aoula). Presented at the 7th International Congress for Moroccan Arabic (University of Navarra, 2024) and TIM'24 (University Hassan II, Morocco).
My peer-reviewed research on multilingual phonetic alignment, Sawtone, was published in Lingua Posnaniensis (2025). It introduces a universal framework for cross-script phonetic alignment, validated on Moroccan Arabic data: 88% BLEU transliteration, 87 to 95% phonetic alignment accuracy. Published as an independent researcher with no institutional affiliation.
Current focus: training base language models for underrepresented language families. Building the data pipelines, tokenization methods, and evaluation benchmarks that don't exist yet.
Open data & models
Wikipedia Monthly: Automated, monthly-refreshed Wikipedia dumps for 341+ languages. 64M+ articles, 89GB+ of clean text. The official Wikipedia datasets haven't been updated since 2023. Native HuggingFace integration, one-line loading.
Gherbal: State-of-the-art Moroccan Arabic language identification model. Used to extract 75M tokens of high-quality Darija from FineWeb 2, in the first comprehensive analysis of Moroccan digital content.
All datasets and models are available on HuggingFace. Academic work on ORCID.
Professional context
Sr. Director of GenAI Engineering at Blue Yonder (Berlin), building and deploying LLM systems at enterprise scale for supply chain. That work informs how I approach training infrastructure and production deployment.
Previously: Co-founder & CTO at Vita Digita (Morocco, 2011 to 2014), engineering and product leadership at Matterway (Berlin, 2014 to 2020). Over a decade spanning full-stack engineering, technical product ownership, and team leadership.
Products shipped
- Sawalni: First conversational AI and LLM for Moroccan Darija. Arabic and Latin script.
- Tarjamli: Translation app and NMT model for Moroccan Darija, Arabic, and international languages.
- Herd: Browser superpowers and MCP for AI agents.
- Monitoro: Real-time notification and integration platform. Millions of notifications delivered.
- QR8R: Dynamic and trackable QR codes.
- Datapluck: Dataset import/export for HuggingFace, CSV, JSON, Parquet.
Speaking
The impact of AI on education
Beyond Tech With Soufyan · Moroccan Darija
Can We Trust Artificial Intelligence?
Beyond Tech With Soufyan · Moroccan Darija
#201 Special Episode: Sawalni.ma, the first Moroccan LLM
GeeksBlaBla · Moroccan Darija
Sawalni.ma Interview on Moroccan National TV
Al Aoula - Moroccan TV · Moroccan Darija
Interview, Mentoring at ThinkAI @1337 Benguerir
La Vie Éco · Moroccan Darija
Publications
Papers
Sawtone: A universal framework for phonetic similarity and alignment across languages and scripts
Lingua Posnaniensis, Vol. 67, Issue 1, pp. 165-200 (2025)
Processing text across different scripts presents significant hurdles in natural language processing, especially when dealing with non-standardized orthographies and informal writing systems common in low-resource languages. To address this, we introduce Sawtone, an integrated framework designed to enable consistent cross-script phonetic alignment and text normalization. At its heart is an architecture built for interoperability, combining a unified phonological feature space rooted in linguistic principles with modular, language-specific adapters. This structure allows for robust mapping and comparison between any pair of scripts.
Conference presentations
Moroccan Darija and Generative AI
7th International Congress for Moroccan Arabic · University of Navarra, Spain (2024)
Research presentation on building generative AI systems for Moroccan Darija
TIM'24 Presentation
TIM'24 Conference · University Hassan II, Morocco (2024)
Presentation on multilingual AI and language technology
Interested in collaborating on language equity, multilingual AI, or open-source NLP.
Get in touch