
Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research
I'm excited to announce Wikipedia Monthly, a project born out of necessity while building multilingual AI systems, particularly for my work on Sawalni, the first LLM for Moroccan Darija.
The Problem
The official Wikipedia dataset on Hugging Face Hub is severely outdated – last updated in 2023. That's 18+ months of missing content, cultural shifts, and knowledge updates. When you're building AI systems for low-resource languages or need current information, this becomes a real bottleneck.
Here's what I was dealing with:
- Stale data: Missing recent events, cultural developments, and knowledge updates
- Limited languages: 29 new language editions have been added since the last dataset update on Hugging Face
- Significant content gaps: 10-50% growth in most languages since the last update
- No flexibility: Can't customize the cleaning pipeline for specific use cases
The Solution
Wikipedia Monthly solves this by providing:
- Monthly updates of Wikipedia content across 341+ languages
- Clean, ready-to-use text with MediaWiki markup already parsed
- Raw MediaWiki source included for custom post-processing needs
- One-line dataset loading through Hugging Face Hub
- 29 additional languages not available in the official HF dataset
- Current content reflecting recent knowledge and cultural developments
How It Works
The pipeline is straightforward but robust (see the sketch after this list):
- Downloads the latest Wikipedia dumps from Wikimedia
- Filters out everything except main articles
- Parses MediaWiki syntax into clean text
- Uploads to Hugging Face Hub with smart configuration naming
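To make this concrete, here's a minimal sketch of what such a pipeline can look like, using mwxml to stream the XML dump and mwparserfromhell to strip the markup. The dump file name, output columns, and config name below are illustrative assumptions rather than the exact production code:

import bz2
import mwxml
import mwparserfromhell
from datasets import Dataset

# Illustrative dump file; the real pipeline fetches the latest dump from dumps.wikimedia.org
DUMP_PATH = "enwiki-20250701-pages-articles.xml.bz2"

records = []
with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as f:
    for page in mwxml.Dump.from_file(f):
        # Keep only main-namespace articles and skip redirects
        if page.namespace != 0 or page.redirect:
            continue
        revision = next(iter(page))  # pages-articles dumps carry one revision per page
        wikitext = revision.text or ""
        # Strip templates, links, and formatting down to plain text
        text = mwparserfromhell.parse(wikitext).strip_code()
        records.append({"id": page.id, "title": page.title, "text": text, "wikitext": wikitext})

# Push with a date-stamped config name so each monthly snapshot stays addressable
Dataset.from_list(records).push_to_hub("omarkamali/wikipedia-monthly", config_name="20250701.en")

In practice you would shard the work rather than hold an entire language in memory, but the flow is the same: filter, parse, push.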
Getting Started
Using the dataset is as simple as:
from datasets import load_dataset
# Load English Wikipedia from the latest dump
# Better to stream it, it's 20GB+ in size!
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train", streaming=True)
# Or load a specific date
dataset = load_dataset("omarkamali/wikipedia-monthly", "20250701.en", split="train")
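The same pattern works for any language edition. For example, the Moroccan Darija Wikipedia (language code ary) can be streamed the same way, assuming it follows the same config naming:

# Moroccan Darija Wikipedia, streamed from the latest dump
darija = load_dataset("omarkamali/wikipedia-monthly", "latest.ary", split="train", streaming=True)
print(next(iter(darija))["title"])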
What You Get
Each article includes (see the quick peek after this list):
- Clean plain text content
- Original MediaWiki source (if you need it)
- Article title and Wikipedia URL
- Unique page identifier
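As a quick sanity check, you can peek at a single record. The column names below (id, url, title, text, wikitext) are my shorthand for the fields listed above; check the dataset viewer for the authoritative schema:

article = next(iter(dataset))
print(article["id"], article["url"])      # page identifier and Wikipedia URL
print(article["title"])                   # article title
print(article["text"][:200])              # clean plain text
print(article["wikitext"][:200])          # raw MediaWiki source for custom post-processing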
Current Status
You can check the dataset page for live statistics and available languages.
- 🌍 341 Languages Available
- 📄 64.5M Articles in Total
- 💾 205.54 GB of Data in Total
Why This Matters
In my work on AI for low-resource languages, I've experienced firsthand how outdated data can limit what's possible. When building Sawalni, Tarjamli, and other multilingual projects, having current, culturally relevant content isn't just a nice-to-have; it's essential for creating AI that truly serves diverse communities.
Fresh, accessible Wikipedia data opens up new possibilities for:
- Low-resource language AI: Building models that reflect current cultural contexts
- Multilingual research: Training systems with up-to-date cross-lingual knowledge
- Cultural preservation: Capturing evolving linguistic patterns and cultural references
- Information retrieval: Systems that know about recent events and developments
- Educational applications: Learning tools with current, accurate information
What's Next
The system runs automatically each month, ensuring you always have access to the latest Wikipedia content. No more dealing with stale dumps or complex preprocessing pipelines. Let me know if you would like to sponsor the compute.
Check out Wikipedia Monthly on Hugging Face and let me know what you build with it!
---
Wikipedia Monthly is built on top of the incredible work by the Wikimedia Foundation and the open-source community. All content maintains the original CC-BY-SA-4.0 license.

