DataHarvest

DataHarvest is a toolkit specifically designed for building datasets for large language models. It provides a series of pipelines for data acquisition, cleaning, and processing, aiming to deliver high-quality training data for Chinese large language models.

Features

Data Acquisition: Automatically fetch data from various sources, including websites, APIs, documents, and more.
Data Cleaning: Clean and preprocess data by removing noise, duplicates, and irrelevant content.
Data Transformation: Convert data into formats suitable for training, such as text, sequences, etc.
Customizable Pipelines: Flexible pipeline configurations allow users to customize data processing workflows as needed.

Installation

Clone the repository:

git clone https://github.com/yourusername/dataharvest.git

Install the dependencies:
```
pip install -r requirements.txt
```

Usage

Data Acquisition

from dataharvest import data_fetcher

# Fetch data
data = data_fetcher.fetch_data(source='web', url='https://example.com')

# Save the data
data.to_csv('data.csv')

Data Cleaning

from dataharvest import data_cleaner

# Load data
data = data_cleaner.load_data('data.csv')

# Clean the data
clean_data = data_cleaner.clean(data)

# Save the cleaned data
clean_data.to_csv('clean_data.csv')

Building a Pipeline

from dataharvest import Pipeline

# Initialize the pipeline
pipeline = Pipeline([
    ('fetch', data_fetcher.fetch_data),
    ('clean', data_cleaner.clean),
    ('transform', lambda x: x)  # Add additional transformation functions
])

# Execute the pipeline
processed_data = pipeline.execute(source='web', url='https://example.com')

# Save the processed data
processed_data.to_csv('processed_data.csv')

Contributions

If you have any suggestions or would like to contribute to the project, feel free to submit an issue or send a pull request. We welcome all forms of contributions!

License

This project is licensed under the MIT License. For more details, please refer to the LICENSE file.

We hope DataHarvest helps you build high-quality Chinese language model datasets! If you have any questions or suggestions, feel free to reach out to us.

Let me know if you'd like any adjustments!

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
convert		convert
tmp		tmp
utils		utils
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataHarvest

Features

Installation

Usage

Data Acquisition

Data Cleaning

Building a Pipeline

Contributions

License

About

Releases

Packages

Languages

hexixiang/dataharvest

Folders and files

Latest commit

History

Repository files navigation

DataHarvest

Features

Installation

Usage

Data Acquisition

Data Cleaning

Building a Pipeline

Contributions

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages