DataHarvest

DataHarvest is a toolkit for building datasets for large language models. It provides pipelines for data acquisition, cleaning, and processing, with the aim of delivering high-quality training data for Chinese large language models.

Features

  • Data Acquisition: Automatically fetch data from a variety of sources, including websites, APIs, and documents.
  • Data Cleaning: Clean and preprocess data by removing noise, duplicates, and irrelevant content.
  • Data Transformation: Convert data into formats suitable for training, such as plain text or serialized sequences.
  • Customizable Pipelines: Flexible pipeline configuration lets users tailor data-processing workflows to their needs.

Installation

  1. Clone the repository:

    git clone https://github.com/hexixiang/dataharvest.git
    cd dataharvest
  2. Install the dependencies:

    pip install -r requirements.txt

Usage

Data Acquisition

from dataharvest import data_fetcher

# Fetch data
data = data_fetcher.fetch_data(source='web', url='https://example.com')

# Save the data
data.to_csv('data.csv')
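
This README only shows the 'web' source, while the Features list also mentions APIs and documents. If fetch_data accepts those sources as well, calls might look like the sketch below; the 'api' and 'document' source names and the params and path keywords are assumptions for illustration, not confirmed parts of the API.

from dataharvest import data_fetcher

# Hypothetical: fetch from a REST API (the 'api' source name and the
# 'params' keyword are assumptions based on the Features list).
api_data = data_fetcher.fetch_data(
    source='api',
    url='https://example.com/api/v1/articles',
    params={'page': 1},
)

# Hypothetical: fetch from a local document (the 'document' source name
# and the 'path' keyword are likewise assumptions).
doc_data = data_fetcher.fetch_data(source='document', path='corpus.pdf')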

Data Cleaning

from dataharvest import data_cleaner

# Load data
data = data_cleaner.load_data('data.csv')

# Clean the data
clean_data = data_cleaner.clean(data)

# Save the cleaned data
clean_data.to_csv('clean_data.csv')
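
The Features list says cleaning removes noise, duplicates, and irrelevant content. As a rough illustration of that kind of step, here is a minimal pandas sketch (not the library's actual implementation), assuming the CSV has a text column:

import pandas as pd

# Load the raw data (assumes a 'text' column; adjust to your schema).
df = pd.read_csv('data.csv')

# Drop rows with missing text before applying string operations.
df = df.dropna(subset=['text'])

# Normalize whitespace so near-identical rows compare equal.
df['text'] = df['text'].str.strip().str.replace(r'\s+', ' ', regex=True)

# Drop empty rows and exact duplicates (simple stand-ins for the noise
# and duplicate removal described under Features).
df = df[df['text'] != ''].drop_duplicates(subset='text')

df.to_csv('clean_data.csv', index=False)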

Building a Pipeline

from dataharvest import Pipeline

# Initialize the pipeline
pipeline = Pipeline([
    ('fetch', data_fetcher.fetch_data),
    ('clean', data_cleaner.clean),
    ('transform', lambda x: x)  # Identity placeholder; replace with a real transform (see the sketch below)
])

# Execute the pipeline
processed_data = pipeline.execute(source='web', url='https://example.com')

# Save the processed data
processed_data.to_csv('processed_data.csv')
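
The identity lambda above is only a placeholder. A concrete transform step that turns cleaned rows into a training-ready format could look like the sketch below; the JSONL record shape and the 'text' column are assumptions for illustration, not something this README specifies. ensure_ascii=False keeps Chinese text readable in the output.

import json

import pandas as pd

from dataharvest import Pipeline, data_fetcher, data_cleaner

def to_training_records(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transform: wrap each row's text in the {'text': ...}
    # records commonly used for LLM pretraining data.
    with open('processed_data.jsonl', 'w', encoding='utf-8') as f:
        for text in df['text'].dropna():
            if text.strip():
                f.write(json.dumps({'text': text}, ensure_ascii=False) + '\n')
    return df  # Pass the DataFrame through so later steps still work.

# Plug it into the pipeline in place of the identity lambda:
pipeline = Pipeline([
    ('fetch', data_fetcher.fetch_data),
    ('clean', data_cleaner.clean),
    ('transform', to_training_records),
])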

Contributions

If you have suggestions or would like to contribute to the project, feel free to open an issue or submit a pull request. Contributions of all kinds are welcome!

License

This project is licensed under the MIT License. For more details, please refer to the LICENSE file.


We hope DataHarvest helps you build high-quality Chinese language model datasets! If you have any questions or suggestions, feel free to reach out to us.


