
Datapluck: Portability Tool for Huggingface Datasets

by Omar Kamali / September 05, 2024 / in Data, Huggingface, Google Sheets, Python, Announcement, Datapluck

I recently found myself whipping up notebooks just to pull Huggingface datasets locally, annotate or modify them, and upload them again. While Huggingface uses open formats, I found the official toolchain relatively low-level and poorly suited to quick operations like these. This happened often enough that I turned it into a CLI tool, which I've been using successfully for the last few months.

To that end, I'm happy to announce the release of `datapluck`, a tool to export (download) datasets from Huggingface into CSV, TSV, JSON, JSONL, Parquet, Google Sheets and SQLite (SQLite is super cool!). It can also import (upload) from any of these formats back into Huggingface. It's perfect for moving your Huggingface datasets across different media.

It's as simple as these commands:

# Install datapluck
pip install datapluck

# Export a dataset to csv
datapluck export team/dataset --format csv --output_file data.csv

# Import data to your account
datapluck import user/new-or-existing-dataset --input_file data.csv --format csv --private

The idea is to make data wrangling and annotation workflows much easier, and to enable automated dataset updates from the command line, which is ideal for CI/CD scenarios. As a bonus, `datapluck` is also a package that exposes two functions, `export_dataset` and `import_dataset`, for programmatic use.
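For CI/CD jobs, one way to script the commands above is to wrap them with Python's `subprocess` module. The sketch below is only an illustration, not datapluck's own API: the helper names `build_export_command`, `build_import_command` and `run` are hypothetical, and it relies solely on the subcommands and flags shown earlier in this post. It also assumes the CLI signals failure with a non-zero exit code.

```python
import shlex
import subprocess
from typing import List


def build_export_command(dataset: str, fmt: str, output_file: str) -> List[str]:
    """Build the argument list for `datapluck export`, using the flags shown in this post."""
    return ["datapluck", "export", dataset, "--format", fmt, "--output_file", output_file]


def build_import_command(dataset: str, fmt: str, input_file: str, private: bool = False) -> List[str]:
    """Build the argument list for `datapluck import`; `--private` is added only on request."""
    cmd = ["datapluck", "import", dataset, "--input_file", input_file, "--format", fmt]
    if private:
        cmd.append("--private")
    return cmd


def run(cmd: List[str]) -> None:
    """Execute the command; check=True raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


# Print the commands instead of running them, so the sketch is safe to try anywhere.
print(shlex.join(build_export_command("team/dataset", "csv", "data.csv")))
print(shlex.join(build_import_command("user/new-or-existing-dataset", "csv", "data.csv", private=True)))
```

In a real pipeline step you would pass each list to `run(...)` rather than printing it. Passing the arguments as a list avoids shell quoting issues with unusual dataset or file names.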

I plan to open-source the code under a permissive license when I get around to it and will update this post accordingly (speaking of which, feel free to use Monitoro to track this post!).

Usage ideas

- Download a Huggingface dataset into a Google Sheet, annotate it in the spreadsheet, then upload it back to Huggingface.

- Scrape a data source, and update a Huggingface dataset every N documents.

- Assuming you have an app that collects user feedback, create a batch job to upload all new feedback entries to your Huggingface dataset.

- Download a Huggingface dataset into a Parquet file, then work on it in a notebook before pushing it back.
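The "every N documents" idea could be sketched roughly as follows. This is a hypothetical helper, not part of datapluck: `BatchedUploader` and `datapluck_upload` are names made up for illustration, and `jsonl` as a value for `--format` is an assumption based on the list of supported formats. The local file holds every document collected so far, and the whole file is handed to the upload callback after each batch. How `datapluck import` treats rows already in the dataset is not covered in this post, so check that before relying on this pattern.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Dict


class BatchedUploader:
    """Append documents to a JSONL file and trigger an upload every `batch_size` additions."""

    def __init__(self, path: str, batch_size: int, upload: Callable[[str], None]) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.path = Path(path)
        self.batch_size = batch_size
        self.upload = upload
        self.pending = 0  # documents written since the last upload
        self.path.write_text("", encoding="utf-8")  # start from an empty file

    def add(self, doc: Dict) -> None:
        # JSONL: one JSON object per line.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(doc) + "\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Upload only if something new was written since the last upload.
        if self.pending:
            self.upload(str(self.path))
            self.pending = 0


def datapluck_upload(path: str) -> None:
    # Assumption: "jsonl" is accepted by --format; the dataset name is a placeholder.
    subprocess.run(
        ["datapluck", "import", "user/new-or-existing-dataset",
         "--input_file", path, "--format", "jsonl"],
        check=True,
    )
```

A scraper would create `BatchedUploader("docs.jsonl", 100, datapluck_upload)`, call `add(doc)` for each scraped document, and call `flush()` once at the end to push any remainder. Keeping the upload step as a callback makes the batching logic easy to test without touching the network.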

Documentation

Read datapluck's user documentation on PyPI.


Written by Omar Kamali, Founder, CEO @ Monitoro, & Strategic Technology Advisor.