Our notebooks and experiment scripts frequently repeat a pattern:
Download a reference data set (if not already present)
Read the data set with one of our reader functions
Convert everything in the data set to DataFrames
We should wrap these three steps into a single function so that we and our users don't need to write this code over and over again.
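The three steps above can be sketched as follows. This is a hypothetical illustration of the repeated pattern, not the real API: read_documents stands in for one of our reader functions, and the file name, column names, and download call are placeholders.

```python
import os

import pandas as pd


def read_documents(path):
    # Stand-in for one of our reader functions: yields (doc_id, tokens) pairs.
    yield "doc1", ["Hello", "world"]
    yield "doc2", ["Goodbye"]


def load_dataset(path):
    # Step 1: download the raw data set if it is not already present.
    if not os.path.exists(path):
        pass  # e.g. urllib.request.urlretrieve(source_url, path)
    # Step 2: read the data set with a reader function.
    # Step 3: convert every document in the data set to a DataFrame.
    return [
        pd.DataFrame({"doc": doc_id, "token": tokens})
        for doc_id, tokens in read_documents(path)
    ]


frames = load_dataset("corpus.raw")
```

Every notebook that needs the data currently repeats some variant of this boilerplate by hand.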
Suggested API:
Main entry point tp.dataset.download_<data set name>(), with optional arguments to specify:
cache directory
fold name
whether to return a DataFrame per document or a single stacked DataFrame
Each download_<name>() function performs the following steps:
If the raw data set isn't present, download it
Convert the entire raw data set into DataFrames
Stack the DataFrames into a single large DataFrame (adding a leading column with the fold name), and write this DataFrame as a single Parquet file in the cache directory
Use the cached Parquet file for subsequent reads of the data set
If the user requested a DataFrame per document, split the single large DataFrame into multiple smaller ones