I recently found myself whipping up notebooks just to pull Huggingface datasets locally, annotate or modify them, and push them back up. And while Huggingface uses open formats, I found the official toolchain relatively low-level and poorly suited to quick operations like these. This happened often enough that I made a CLI tool out of it, which I've been using successfully for the last few months.
To that end, I'm happy to announce that I released `datapluck`, a tool to export (download) datasets from Huggingface into CSV, TSV, JSON, JSONL, Parquet, Google Sheets and SQLite (SQLite is super cool!). It also lets you import (upload) from any of these formats back into Huggingface. It's perfect for moving your Huggingface datasets across different media.
It's as simple as these commands:
    # Install datapluck
    pip install datapluck

    # Export a dataset to csv
    datapluck export team/dataset --format csv --output_file data.csv

    # Import data to your account
    datapluck import user/new-or-existing-dataset --input_file data.csv --format csv --private
The idea is to make data wrangling and annotation workflows much easier, and to enable automated dataset updates from the command line, which is ideal for CI/CD scenarios. As a bonus, `datapluck` is also a package that exposes two functions, `export_dataset` and `import_dataset`, for use in programmatic settings.
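As a rough sketch of the CI/CD idea, a small Python script can assemble the same commands shown above and run them as subprocesses. It uses only the flags that appear in this post (`--format`, `--output_file`, `--input_file`, `--private`). The helper names `build_export_cmd`, `build_import_cmd` and `roundtrip` are made up for illustration and are not part of `datapluck`. The signatures of `export_dataset` and `import_dataset` are not described here, so this sketch goes through the CLI instead.

```python
import subprocess
from typing import List


def build_export_cmd(dataset: str, fmt: str, output_file: str) -> List[str]:
    """Build the `datapluck export` command as shown in the post."""
    return ["datapluck", "export", dataset,
            "--format", fmt, "--output_file", output_file]


def build_import_cmd(dataset: str, fmt: str, input_file: str,
                     private: bool = False) -> List[str]:
    """Build the `datapluck import` command as shown in the post."""
    cmd = ["datapluck", "import", dataset,
           "--input_file", input_file, "--format", fmt]
    if private:
        cmd.append("--private")
    return cmd


def roundtrip(source: str, target: str, path: str = "data.csv") -> None:
    """Hypothetical helper: export `source` to CSV, then import it as `target`.

    Requires datapluck to be installed and Huggingface credentials to be set up.
    check=True makes a failing command raise, so a CI job fails loudly.
    """
    subprocess.run(build_export_cmd(source, "csv", path), check=True)
    # ... modify or annotate the CSV file here ...
    subprocess.run(build_import_cmd(target, "csv", path, private=True), check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when dataset names or file paths contain unusual characters.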
I plan to open-source the code under a permissive license when I get around to it, and I will update this post accordingly (speaking of which, feel free to use monitoro to track this post!).
- Download a Huggingface dataset into a Google Sheet, annotate it in the spreadsheet, then upload it back to Huggingface.
- Scrape a data source, and update a Huggingface dataset every N documents.
- If you have an app that collects user feedback, create a batch job that uploads all new feedback entries to your Huggingface dataset.
- Download a Huggingface dataset into a Parquet file, then work on it in a notebook before pushing it back.
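The "every N documents" workflow could look roughly like the sketch below. `BatchUploader` and `upload_with_datapluck` are hypothetical names, not part of `datapluck`; the upload step simply shells out to the `datapluck import` command shown earlier. The sketch rewrites the full CSV on each flush, on the assumption that an import replaces the dataset's contents. This post does not say whether import appends or replaces, so if it appends you would write only the new rows instead.

```python
import csv
import subprocess
from typing import Callable, Dict, List


def upload_with_datapluck(csv_path: str, dataset: str) -> None:
    """Upload a CSV file using the `datapluck import` command from the post."""
    subprocess.run(
        ["datapluck", "import", dataset,
         "--input_file", csv_path, "--format", "csv"],
        check=True,
    )


class BatchUploader:
    """Hypothetical helper: collect rows and upload after every `batch_size` rows."""

    def __init__(self, dataset: str, csv_path: str, batch_size: int,
                 upload: Callable[[str, str], None] = upload_with_datapluck):
        self.dataset = dataset
        self.csv_path = csv_path
        self.batch_size = batch_size
        self.upload = upload
        self.rows: List[Dict[str, str]] = []  # every row collected so far

    def add(self, row: Dict[str, str]) -> None:
        self.rows.append(row)
        # Trigger an upload each time another full batch has accumulated.
        if len(self.rows) % self.batch_size == 0:
            self.flush()

    def flush(self) -> None:
        if not self.rows:
            return
        # Assumption: all rows share the keys of the first row.
        with open(self.csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(self.rows[0].keys()))
            writer.writeheader()
            writer.writerows(self.rows)
        self.upload(self.csv_path, self.dataset)
```

In a scraper loop you would call `add(...)` for each scraped document and `flush()` once at the end to push any leftover rows. The `upload` parameter is injectable so the batching logic can be exercised without network access.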
Read datapluck's user documentation on PyPI.