Our notebooks and experiment scripts frequently repeat a pattern:
Download a reference data set (if not already present)
Read the data set with one of our reader functions
Convert everything in the data set to DataFrames
We should wrap these three steps into a single function so that we and our users don't need to write this code over and over again.
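The three steps above can be sketched as follows. This is a hypothetical illustration of the repeated pattern, not the real API: read_documents stands in for one of our reader functions, and the file name, column names, and download call are placeholders.

```python
import os

import pandas as pd


def read_documents(path):
    # Stand-in for one of our reader functions: yields (doc_id, tokens) pairs.
    yield "doc1", ["Hello", "world"]
    yield "doc2", ["Goodbye"]


def load_dataset(path):
    # Step 1: download the raw data set if it is not already present.
    if not os.path.exists(path):
        pass  # e.g. urllib.request.urlretrieve(source_url, path)
    # Step 2: read the data set with a reader function.
    # Step 3: convert every document in the data set to a DataFrame.
    return [
        pd.DataFrame({"doc": doc_id, "token": tokens})
        for doc_id, tokens in read_documents(path)
    ]


frames = load_dataset("corpus.raw")
```

Every notebook that needs the data currently repeats some variant of this boilerplate by hand.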
Suggested API:
Main entry point tp.dataset.download_<data set name>(), with optional arguments to specify:
cache directory
fold name
whether to return a DataFrame per document or a single stacked DataFrame
Each download_<name>() function performs the following steps:
If the raw data set isn't present, download it
Convert the entire raw data set into DataFrames
Stack the DataFrames into a single large DataFrame (adding a leading column with the fold name), and write this DataFrame as a single Parquet file in the cache directory
Use the cached Parquet file for subsequent reads of the data set
If the user requested a DataFrame per document, split the single large DataFrame into multiple smaller ones