DataHarvest is a toolkit specifically designed for building datasets for large language models. It provides a series of pipelines for data acquisition, cleaning, and processing, aiming to deliver high-quality training data for Chinese large language models.
- Data Acquisition: Automatically fetch data from various sources, including websites, APIs, documents, and more.
- Data Cleaning: Clean and preprocess data by removing noise, duplicates, and irrelevant content.
- Data Transformation: Convert data into formats suitable for training, such as text, sequences, etc.
- Customizable Pipelines: Flexible pipeline configurations allow users to customize data processing workflows as needed.
-
Clone the repository:
git clone https://github.com/yourusername/dataharvest.git
-
Install the dependencies:
pip install -r requirements.txt
from dataharvest import data_fetcher
# Fetch data
data = data_fetcher.fetch_data(source='web', url='https://example.com')
# Save the data
data.to_csv('data.csv')
from dataharvest import data_cleaner
# Load data
data = data_cleaner.load_data('data.csv')
# Clean the data
clean_data = data_cleaner.clean(data)
# Save the cleaned data
clean_data.to_csv('clean_data.csv')
from dataharvest import Pipeline
# Initialize the pipeline
pipeline = Pipeline([
('fetch', data_fetcher.fetch_data),
('clean', data_cleaner.clean),
('transform', lambda x: x) # Add additional transformation functions
])
# Execute the pipeline
processed_data = pipeline.execute(source='web', url='https://example.com')
# Save the processed data
processed_data.to_csv('processed_data.csv')
If you have any suggestions or would like to contribute to the project, feel free to submit an issue or send a pull request. We welcome all forms of contributions!
This project is licensed under the MIT License. For more details, please refer to the LICENSE file.
We hope DataHarvest helps you build high-quality Chinese language model datasets! If you have any questions or suggestions, feel free to reach out to us.
Let me know if you'd like any adjustments!