
Datapluck: Portability Tool for Huggingface Datasets
I recently found myself whipping up notebooks just to pull Huggingface datasets locally, annotate or modify them, and upload them again. And while Huggingface uses open formats, I found the official toolchain relatively low-level and poorly suited to quick operations like these. This happened often enough that I made a CLI tool out of it, which I've been using successfully for the last few months.
To that end, I'm happy to announce that I have released datapluck, a tool to export (download) datasets from Huggingface into CSV, TSV, JSON, JSONL, Parquet, Google Sheets and SQLite (SQLite is super cool!). It can also import (upload) from any of these formats back into Huggingface. It's perfect for moving your Huggingface datasets across different media.
It's as simple as these commands:
<div style="background-color: rgb(31, 31, 31); color: rgb(204, 204, 204); font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; line-height: 18px;"><pre><code># Install datapluck
pip install datapluck

# Export a dataset to CSV
datapluck export team/dataset --format csv --output_file data.csv

# Import data to your account
datapluck import user/new-or-existing-dataset --input_file data.csv --format csv --private</code></pre></div>
The idea is to make data wrangling and annotation workflows much easier, and to enable automated dataset updates from the command line, which is ideal for CI/CD scenarios. As a bonus, datapluck is also a package that exposes two functions, export_dataset and import_dataset, for use in programmatic settings.
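For a CI/CD job written in Python, one option is to wrap the command line shown above. The following is only a sketch: it uses just the subcommands and flags that appear in this post, and the helper names (`build_export_cmd`, `build_import_cmd`, `run`) are hypothetical. It deliberately does not call export_dataset or import_dataset, because their signatures are not shown here; see the documentation for those.

```python
import subprocess
from typing import List


def build_export_cmd(dataset: str, fmt: str, output_file: str) -> List[str]:
    """Build the `datapluck export` invocation shown in the post."""
    return ["datapluck", "export", dataset,
            "--format", fmt, "--output_file", output_file]


def build_import_cmd(dataset: str, fmt: str, input_file: str,
                     private: bool = False) -> List[str]:
    """Build the `datapluck import` invocation shown in the post."""
    cmd = ["datapluck", "import", dataset,
           "--input_file", input_file, "--format", fmt]
    if private:
        # --private is the only visibility flag shown in the post.
        cmd.append("--private")
    return cmd


def run(cmd: List[str]) -> None:
    """Run a command; a non-zero exit raises CalledProcessError,
    which makes the surrounding CI step fail."""
    subprocess.run(cmd, check=True)


# Example usage (requires datapluck to be installed and authenticated):
#   run(build_export_cmd("team/dataset", "csv", "data.csv"))
#   run(build_import_cmd("user/new-or-existing-dataset", "csv",
#                        "data.csv", private=True))
```

Passing the command as a list rather than a single string avoids shell quoting issues when dataset names or file paths contain unusual characters.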
I plan to open-source the code under a permissive license when I get around to it and will update here accordingly (speaking of which, feel free to use <a href="https://monitoro.co">monitoro</a> to track this post!).
Usage ideas
- Download a Huggingface dataset into a Google Sheet, annotate it in the spreadsheet, and then upload it back to Huggingface.
- Scrape a data source, and update a Huggingface dataset every N documents.
- Assuming you have an app that collects user feedback, create a batch job to upload all new feedback entries to your Huggingface dataset.
- Download a Huggingface dataset into a Parquet file, then operate on it in a notebook before pushing it again.
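The "every N documents" idea could be sketched roughly as follows. `BatchUploader` and its `upload` callback are hypothetical names for illustration, not part of datapluck; in practice the callback would run the `datapluck import` command on the file. This post does not say whether an import appends to or replaces an existing dataset, so the sketch assumes nothing either way and writes every document collected so far at each flush. Check the documentation before relying on either behavior.

```python
import csv
from typing import Callable, Dict, List


class BatchUploader:
    """Collect documents and trigger an upload every `batch_size` additions."""

    def __init__(self, batch_size: int, path: str,
                 upload: Callable[[str], None]) -> None:
        self.batch_size = batch_size
        self.path = path          # CSV file handed to the upload callback
        self.upload = upload      # hypothetical hook, e.g. runs `datapluck import`
        self.rows: List[Dict[str, str]] = []
        self.pending = 0          # documents added since the last flush

    def add(self, doc: Dict[str, str]) -> None:
        self.rows.append(doc)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write all collected rows to CSV and call the upload hook."""
        if self.pending == 0:
            return  # nothing new since the last upload
        # Assumes every document has the same keys as the first one.
        fieldnames = list(self.rows[0].keys())
        with open(self.path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(self.rows)
        self.upload(self.path)
        self.pending = 0
```

Calling `flush()` once more when the scraper finishes uploads any remainder smaller than a full batch.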
Documentation
Read datapluck's <a href="https://pypi.org/project/datapluck/">user documentation on PyPI</a>.