Wikipedia in your favorite data format!

Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research

By Omar Kamali • July 19, 2025 • In Datasets, NLP, Open Source

I'm excited to announce Wikipedia Monthly, a project born out of necessity while building multilingual AI systems, particularly for my work on Sawalni, the first LLM for Moroccan Darija.

The Problem

The official Wikipedia dataset on Hugging Face Hub is severely outdated – last updated in 2023. That's 18+ months of missing content, cultural shifts, and knowledge updates. When you're building AI systems for low-resource languages or need current information, this becomes a real bottleneck.

Here's what I was dealing with:

  • Stale data: Missing recent events, cultural developments, and knowledge updates
  • Limited languages: 29 languages added to Wikipedia since the last dataset update are missing on Hugging Face
  • Significant content gaps: 10-50% growth in most languages since the last update
  • No flexibility: Can't customize the cleaning pipeline for specific use cases

The Solution

Wikipedia Monthly solves this by providing:

  • Monthly updates of Wikipedia content across 341+ languages
  • Clean, ready-to-use text with MediaWiki markup already parsed
  • Raw MediaWiki source included for custom post-processing needs
  • One-line dataset loading through Hugging Face Hub
  • 29 additional languages not available in the official HF dataset
  • Current content reflecting recent knowledge and cultural developments

How It Works

The pipeline is straightforward but robust (a rough sketch in code follows these steps):

  1. Downloads the latest Wikipedia dumps from Wikimedia
  2. Filters out everything except main articles
  3. Parses MediaWiki syntax into clean text
  4. Uploads to Hugging Face Hub with smart configuration naming
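
For a concrete picture, here is a rough sketch of what one monthly run could look like for a single language. This is illustrative only, not the actual Wikipedia Monthly code: it assumes the third-party mwxml and mwparserfromhell packages for dump and markup parsing, a hypothetical output repo name, and the datasets library for the upload.

import bz2
import requests
import mwxml                 # dump (XML) parsing
import mwparserfromhell      # wikitext -> plain text
from datasets import Dataset

# Latest English pages-articles dump (adjust URL per language/date)
DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
DUMP_FILE = "enwiki-latest-pages-articles.xml.bz2"

def iter_articles(path):
    """Yield main-namespace articles (namespace 0), skipping redirects."""
    dump = mwxml.Dump.from_file(bz2.open(path, "rb"))
    for page in dump:
        if page.namespace != 0 or page.redirect:
            continue                     # 2. filter out everything except main articles
        revision = next(iter(page))      # pages-articles dumps carry one revision per page
        wikitext = revision.text or ""
        yield {
            "id": page.id,
            "title": page.title,
            "url": f"https://en.wikipedia.org/wiki/{page.title.replace(' ', '_')}",
            "wikitext": wikitext,
            # 3. parse MediaWiki syntax into clean text
            "text": mwparserfromhell.parse(wikitext).strip_code(),
        }

# 1. download the latest dump from Wikimedia
with requests.get(DUMP_URL, stream=True) as resp, open(DUMP_FILE, "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        f.write(chunk)

# 4. upload to the Hugging Face Hub with a date.language configuration name
Dataset.from_generator(lambda: iter_articles(DUMP_FILE)).push_to_hub(
    "your-username/wikipedia-sketch", config_name="20250701.en"
)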

Getting Started

Using the dataset is as simple as:

from datasets import load_dataset

# Load English Wikipedia from the latest dump
# Better to stream it, it's 20GB+ in size!
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train", streaming=True)

# Or load a specific date
dataset = load_dataset("omarkamali/wikipedia-monthly", "20250701.en", split="train")
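
Since the streaming dataset is an iterable, you can peek at a few records without downloading everything. A minimal sketch; the field names used here (id, url, title, text) are assumptions based on the schema described below, so check the dataset card for the exact names:

from itertools import islice
from datasets import load_dataset

stream = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train", streaming=True)

# Print a short preview of the first three articles
for article in islice(stream, 3):
    print(article["id"], "|", article["title"], "|", article["url"])
    print(article["text"][:200].replace("\n", " "), "...")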

What You Get

Each article includes:

  • Clean plain text content
  • Original MediaWiki source (if you need it)
  • Article title and Wikipedia URL
  • Unique page identifier
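
Because the raw MediaWiki source ships alongside the clean text, you can run your own post-processing instead of relying on the pre-cleaned output. Here is a small sketch using the third-party mwparserfromhell package; the field name for the raw source (wikitext below) is an assumption, so verify it against the dataset card:

import mwparserfromhell
from datasets import load_dataset

stream = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train", streaming=True)
article = next(iter(stream))

# Parse the raw markup yourself, e.g. to extract internal links and templates
code = mwparserfromhell.parse(article["wikitext"])
links = [str(link.title) for link in code.filter_wikilinks()]
templates = [str(tpl.name).strip() for tpl in code.filter_templates()]
print(article["title"], "-", len(links), "links,", len(templates), "templates")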

Current Status

You can check the dataset page for live statistics and available languages.

  • 🌍 341 Languages Available
  • 📄 64.5M Articles in Total
  • 💾 205.54 GB of Data in Total
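
You can also enumerate what is available programmatically. The datasets library exposes the configuration names (date.language pairs, plus latest.<language> aliases like the latest.en used above); listing hundreds of configurations can take a moment:

from datasets import get_dataset_config_names

configs = get_dataset_config_names("omarkamali/wikipedia-monthly")
print(len(configs), "configurations available, e.g.", configs[:5])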

Why This Matters

In my work on AI for low-resource languages, I've experienced firsthand how outdated data can limit what's possible. When building Sawalni, Tarjamli, and other multilingual projects, having current, culturally relevant content isn't just a nice-to-have; it's essential for creating AI that truly serves diverse communities.

Fresh, accessible Wikipedia data opens up new possibilities for:

  • Low-resource language AI: Building models that reflect current cultural contexts
  • Multilingual research: Training systems with up-to-date cross-lingual knowledge
  • Cultural preservation: Capturing evolving linguistic patterns and cultural references
  • Information retrieval: Systems that know about recent events and developments
  • Educational applications: Learning tools with current, accurate information

What's Next

The system runs automatically each month, ensuring you always have access to the latest Wikipedia content. No more dealing with stale dumps or complex preprocessing pipelines. If you'd like to sponsor the compute that keeps this running, let me know.

Check out Wikipedia Monthly on Hugging Face and let me know what you build with it!

---

Wikipedia Monthly is built on top of the incredible work by the Wikimedia Foundation and the open-source community. All content maintains the original CC-BY-SA-4.0 license.

Written by Omar Kamali

Tech Founder & AI Strategist
