csvsdataset is a Python library designed to simplify working with multiple CSV files as a single dataset. The primary functionality is provided by the CsvsDataset class in the csvsdataset.py module.
This library was written by ChatGPT4, as mentioned here. Issues will be cut and pasted into a session. It is an experiment in semi-autonomous code maintenance.
To install the csvsdataset library, run:
pip install csvsdataset
from csvsdataset.csvsdataset import CsvsDataset
# Initialize the CsvsDataset instance
dataset = CsvsDataset(
    folder_path="path/to/your/csv/folder",
    file_pattern="*.csv",
    x_columns=["column1", "column2"],
    y_column="target_column",
)
# Iterate over the dataset
for x_data, y_data in dataset:
    # Your processing code here
    pass
# Access a specific item in the dataset
x_data, y_data = dataset[42]
Only the data from a small number of CSV files is kept in memory at any time; the rest is discarded on a least-recently-used (LRU) basis. This class is intended for cases where there are too many data files to load into memory at once.
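To illustrate the LRU idea described above, here is a minimal sketch of a per-file cache. It is not the library's actual implementation: the class name LruFileCache, the max_files parameter, and the fake_loader stand-in are all hypothetical and exist only for this example.

```python
from collections import OrderedDict


class LruFileCache:
    """Keep at most `max_files` loaded files in memory (hypothetical helper)."""

    def __init__(self, loader, max_files=2):
        self.loader = loader         # callable: file path -> parsed rows
        self.max_files = max_files
        self._cache = OrderedDict()  # key order tracks recency of use

    def get(self, path):
        if path in self._cache:
            # Cache hit: mark this file as the most recently used.
            self._cache.move_to_end(path)
            return self._cache[path]
        # Cache miss: load the file, then evict the least recently
        # used entry if the cache is now over capacity.
        rows = self.loader(path)
        self._cache[path] = rows
        if len(self._cache) > self.max_files:
            self._cache.popitem(last=False)
        return rows

    def cached_paths(self):
        return list(self._cache)


# Stand-in loader so the sketch runs without real CSV files (assumption).
loads = []


def fake_loader(path):
    loads.append(path)
    return [[path, 0]]


cache = LruFileCache(fake_loader, max_files=2)
cache.get("a.csv")
cache.get("b.csv")
cache.get("a.csv")  # hit: "a.csv" becomes the most recently used
cache.get("c.csv")  # miss: "b.csv" is evicted as least recently used
```

In a real setting the loader would parse a CSV file from disk, and max_files would be chosen so that the cached files fit comfortably in memory.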