Wikipedia in your favorite data format!

Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research

By Omar Kamali • July 19, 2025 • In Datasets, NLP, Open Source

I'm excited to announce Wikipedia Monthly, a project born out of necessity while building multilingual AI systems, particularly for my work on Sawalni, the first LLM for Moroccan Darija.

The Problem

The official Wikipedia dataset on Hugging Face Hub is severely outdated – last updated in 2023. That's 18+ months of missing content, cultural shifts, and knowledge updates. When you're building AI systems for low-resource languages or need current information, this becomes a real bottleneck.

Here's what I was dealing with:

  • Stale data: Missing recent events, cultural developments, and knowledge updates
  • Limited languages: 29 languages added to Wikipedia since the last dataset update are missing on Hugging Face
  • Significant content gaps: 10-50% growth in most languages since the last update
  • No flexibility: Can't customize the cleaning pipeline for specific use cases

The Solution

Wikipedia Monthly solves this by providing:

  • Monthly updates of Wikipedia content across 341+ languages
  • Clean, ready-to-use text with MediaWiki markup already parsed
  • Raw MediaWiki source included for custom post-processing needs
  • One-line dataset loading through Hugging Face Hub
  • 29 additional languages not available in the official HF dataset
  • Current content reflecting recent knowledge and cultural developments

How It Works

The pipeline is straightforward but robust (a rough sketch in code follows these steps):

  1. Downloads the latest Wikipedia dumps from Wikimedia
  2. Filters out everything except main articles
  3. Parses MediaWiki syntax into clean text
  4. Uploads to Hugging Face Hub with smart configuration naming
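
For a concrete picture, here is a rough sketch of what one monthly run could look like for a single language. This is illustrative only, not the actual Wikipedia Monthly code: it assumes the third-party mwxml and mwparserfromhell packages for dump and markup parsing, a hypothetical output repo name, and the datasets library for the upload.

import bz2
import requests
import mwxml                 # dump (XML) parsing
import mwparserfromhell      # wikitext -> plain text
from datasets import Dataset

# Latest English pages-articles dump (adjust URL per language/date)
DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
DUMP_FILE = "enwiki-latest-pages-articles.xml.bz2"

def iter_articles(path):
    """Yield main-namespace articles (namespace 0), skipping redirects."""
    dump = mwxml.Dump.from_file(bz2.open(path, "rb"))
    for page in dump:
        if page.namespace != 0 or page.redirect:
            continue                     # 2. filter out everything except main articles
        revision = next(iter(page))      # pages-articles dumps carry one revision per page
        wikitext = revision.text or ""
        yield {
            "id": page.id,
            "title": page.title,
            "url": f"https://en.wikipedia.org/wiki/{page.title.replace(' ', '_')}",
            "wikitext": wikitext,
            # 3. parse MediaWiki syntax into clean text
            "text": mwparserfromhell.parse(wikitext).strip_code(),
        }

# 1. download the latest dump from Wikimedia
with requests.get(DUMP_URL, stream=True) as resp, open(DUMP_FILE, "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        f.write(chunk)

# 4. upload to the Hugging Face Hub with a date.language configuration name
Dataset.from_generator(lambda: iter_articles(DUMP_FILE)).push_to_hub(
    "your-username/wikipedia-sketch", config_name="20250701.en"
)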

Getting Started

Using the dataset is as simple as:

from datasets import load_dataset

# Load English Wikipedia from the latest dump
# Better to stream it, it's 20GB+ in size!
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train", streaming=True)

# Or load a specific date
dataset = load_dataset("omarkamali/wikipedia-monthly", "20250701.en", split="train")
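
Since the streaming dataset is an iterable, you can peek at a few records without downloading everything. A minimal sketch; the field names used here (id, url, title, text) are assumptions based on the schema described below, so check the dataset card for the exact names:

from itertools import islice
from datasets import load_dataset

stream = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train", streaming=True)

# Print a short preview of the first three articles
for article in islice(stream, 3):
    print(article["id"], "|", article["title"], "|", article["url"])
    print(article["text"][:200].replace("\n", " "), "...")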

What You Get

Each article includes:

  • Clean plain text content
  • Original MediaWiki source (if you need it)
  • Article title and Wikipedia URL
  • Unique page identifier
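
Because the raw MediaWiki source ships alongside the clean text, you can run your own post-processing instead of relying on the pre-cleaned output. Here is a small sketch using the third-party mwparserfromhell package; the field name for the raw source (wikitext below) is an assumption, so verify it against the dataset card:

import mwparserfromhell
from datasets import load_dataset

stream = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train", streaming=True)
article = next(iter(stream))

# Parse the raw markup yourself, e.g. to extract internal links and templates
code = mwparserfromhell.parse(article["wikitext"])
links = [str(link.title) for link in code.filter_wikilinks()]
templates = [str(tpl.name).strip() for tpl in code.filter_templates()]
print(article["title"], "-", len(links), "links,", len(templates), "templates")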

Current Status

You can check the dataset page for live statistics and available languages.

  • 🌍 341 Languages Available
  • 📄 64.5M Articles in Total
  • 💾 205.54 GB of Data in Total
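
You can also enumerate what is available programmatically. The datasets library exposes the configuration names (date.language pairs, plus latest.<language> aliases like the latest.en used above); listing hundreds of configurations can take a moment:

from datasets import get_dataset_config_names

configs = get_dataset_config_names("omarkamali/wikipedia-monthly")
print(len(configs), "configurations available, e.g.", configs[:5])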

Why This Matters

In my work on AI for low-resource languages, I've experienced firsthand how outdated data can limit what's possible. When building Sawalni, Tarjamli, and other multilingual projects, having current, culturally relevant content isn't just a nice-to-have; it's essential for creating AI that truly serves diverse communities.

Fresh, accessible Wikipedia data opens up new possibilities for:

  • Low-resource language AI: Building models that reflect current cultural contexts
  • Multilingual research: Training systems with up-to-date cross-lingual knowledge
  • Cultural preservation: Capturing evolving linguistic patterns and cultural references
  • Information retrieval: Systems that know about recent events and developments
  • Educational applications: Learning tools with current, accurate information

What's Next

The system runs automatically each month, ensuring you always have access to the latest Wikipedia content. No more dealing with stale dumps or complex preprocessing pipelines. If you'd like to sponsor the compute that keeps this running, let me know.

Check out Wikipedia Monthly on Hugging Face and let me know what you build with it!

---

Wikipedia Monthly is built on top of the incredible work by the Wikimedia Foundation and the open-source community. All content maintains the original CC-BY-SA-4.0 license.

Written by Omar Kamali

Tech Founder & AI Strategist
