Datapluck: Portability Tool for Huggingface Datasets

I recently found myself whipping up notebooks just to pull Huggingface datasets locally, annotate or modify them, and upload them again. While Huggingface uses open formats, I found the official toolchain relatively low-level and poorly suited to quick operations like these. This happened often enough that I turned the workflow into a CLI tool, which I've been using successfully for the last few months.

To that end, I'm happy to announce that I've released `datapluck`, a tool to export (download) datasets from Huggingface into CSV, TSV, JSON, JSONL, Parquet, Google Sheets and SQLite (SQLite is super cool!). It can also import (upload) data from any of these formats back into Huggingface. It's a good fit when you need to move your Huggingface datasets between different media.

It's as simple as these commands:

# Install datapluck
pip install datapluck

# Export a dataset to csv
datapluck export team/dataset --format csv --output_file data.csv

# Import data to your account
datapluck import user/new-or-existing-dataset --input_file data.csv --format csv --private

The idea is to make data wrangling and annotation workflows much easier, and to allow automated dataset updates from the command line, which is ideal for CI/CD scenarios. As a bonus, `datapluck` is also a package that exposes two functions, `export_dataset` and `import_dataset`, for use in programmatic settings.
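To give a feel for the automated case, here is a minimal sketch of an export, annotate and re-import round trip that a CI job could run. It shells out to the CLI with only the flags shown in the commands above, rather than calling `export_dataset` and `import_dataset`, because their signatures aren't covered in this post. The helper names (`build_export_command`, `build_import_command`, `add_constant_column`, `round_trip`), the file names and the `reviewed` column are all hypothetical, chosen for illustration.

```python
import csv
import subprocess


def build_export_command(dataset, fmt, output_file):
    """Build the argv list for `datapluck export` with the flags shown above."""
    return ["datapluck", "export", dataset,
            "--format", fmt, "--output_file", output_file]


def build_import_command(dataset, fmt, input_file, private=False):
    """Build the argv list for `datapluck import` with the flags shown above."""
    cmd = ["datapluck", "import", dataset,
           "--input_file", input_file, "--format", fmt]
    if private:
        cmd.append("--private")
    return cmd


def add_constant_column(src_path, dst_path, column, value):
    """Copy a CSV file, appending one column holding the same value in every row.

    Stands in for a real annotation step. Returns the number of data rows written.
    """
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        if column in fieldnames:
            raise ValueError(f"column {column!r} already exists")
        rows = [{**row, column: value} for row in reader]
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames + [column])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def round_trip(source_dataset, target_dataset):
    """Export a dataset to CSV, annotate it locally, and upload the result.

    Requires `datapluck` on PATH and Huggingface credentials already set up.
    """
    subprocess.run(build_export_command(source_dataset, "csv", "data.csv"),
                   check=True)
    add_constant_column("data.csv", "annotated.csv", "reviewed", "no")
    subprocess.run(build_import_command(target_dataset, "csv", "annotated.csv",
                                        private=True),
                   check=True)
```

Calling `round_trip("team/dataset", "user/new-or-existing-dataset")` would then run the whole cycle. Because `check=True` raises on a non-zero exit code, a failed export or import fails the CI job instead of silently uploading stale data.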

I plan to open-source the code under a permissive license when I get around to it and will update here accordingly (speaking of which, feel free to use monitoro to track this post!).