DataHarvest

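DataHarvest is a toolkit for building datasets for large language models. It provides pipelines for data acquisition, cleaning, and processing, with the goal of supplying high-quality training data for Chinese large language models.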

Features

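Data acquisition: automatically fetch data from multiple sources, including websites, APIs, and documents.
Data cleaning: clean and preprocess data, removing noise, duplicates, and irrelevant content.
Data conversion: convert data into formats suitable for training, such as text and sequences.
Custom pipelines: flexible pipeline configuration lets users tailor the data processing flow to their needs.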
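
To make the "data conversion" feature concrete, here is a minimal sketch of one common training format, JSON Lines. The function name `to_jsonl` and the `{"text": ...}` record layout are assumptions for illustration and are not part of DataHarvest's documented API.

```python
import json
from typing import Iterable


def to_jsonl(texts: Iterable[str]) -> str:
    """Serialize text records as JSON Lines: one {"text": ...} object per line.

    Hypothetical helper, not a DataHarvest function.
    """
    lines = []
    for text in texts:
        # ensure_ascii=False keeps Chinese characters readable
        # instead of emitting \uXXXX escapes.
        lines.append(json.dumps({"text": text}, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_jsonl(["你好，世界", "hello world"]))
```

Writing one JSON object per line lets a training job stream records without loading the whole file into memory.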

Installation

Clone the repository:

```
git clone https://github.com/yourusername/dataharvest.git
```

Install the dependencies:
```
pip install -r requirements.txt
```

Usage

Data acquisition

```
from dataharvest import data_fetcher

# Fetch data from a web source
data = data_fetcher.fetch_data(source='web', url='https://example.com')

# Save the fetched data
data.to_csv('data.csv')
```
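
As an illustration of what a web acquisition step typically involves once a page has been downloaded, the sketch below extracts visible text from HTML using only the Python standard library. `html_to_text` and `_TextExtractor` are hypothetical names; how `data_fetcher.fetch_data` works internally is not described in this README.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = "<body><p>Hello</p><script>var x = 1;</script><p>世界</p></body>"
    print(html_to_text(page))
```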

Data cleaning

```
from dataharvest import data_cleaner

# Load the raw data
data = data_cleaner.load_data('data.csv')

# Clean the data
clean_data = data_cleaner.clean(data)

# Save the cleaned data
clean_data.to_csv('clean_data.csv')
```
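
The README does not say what `data_cleaner.clean` does internally. The sketch below shows one plausible reading of "removing noise and duplicates": collapse whitespace, drop very short fragments, and remove exact duplicates. The function `clean_texts` and the `min_chars` threshold are assumptions for illustration.

```python
import re


def clean_texts(texts, min_chars=5):
    """Normalize whitespace, drop short fragments, and remove exact duplicates.

    Hypothetical helper; input order of first occurrences is preserved.
    """
    seen = set()
    cleaned = []
    for text in texts:
        # Collapse any run of whitespace (including newlines) into one space.
        text = re.sub(r"\s+", " ", text).strip()
        if len(text) < min_chars:
            continue  # treat very short fragments as noise
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    print(clean_texts(["Hello   world", "Hello world", "ok"]))
```

Exact-match deduplication is the simplest option; large corpora usually also need near-duplicate detection, which is outside the scope of this sketch.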

Building a pipeline

```
from dataharvest import Pipeline, data_cleaner, data_fetcher

# Initialize the pipeline with named steps
pipeline = Pipeline([
    ('fetch', data_fetcher.fetch_data),
    ('clean', data_cleaner.clean),
    ('transform', lambda x: x)  # placeholder; add other transform functions here
])

# Run the pipeline
processed_data = pipeline.execute(source='web', url='https://example.com')

# Save the processed data
processed_data.to_csv('processed_data.csv')
```
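
The example above implies that the first step receives the keyword arguments passed to `execute` and each later step receives the previous step's output. A minimal sketch of that behavior follows, under the assumption that this is the intended semantics; `SimplePipeline` is a hypothetical stand-in and not the actual `dataharvest.Pipeline` implementation.

```python
from typing import Any, Callable, List, Tuple


class SimplePipeline:
    """Runs named steps in order.

    The first step is called with the keyword arguments given to execute();
    every later step is called with the previous step's return value.
    """

    def __init__(self, steps: List[Tuple[str, Callable[..., Any]]]):
        if not steps:
            raise ValueError("a pipeline needs at least one step")
        self.steps = steps

    def execute(self, **kwargs: Any) -> Any:
        (_, first), *rest = self.steps
        result = first(**kwargs)
        for _name, step in rest:
            result = step(result)
        return result


if __name__ == "__main__":
    pipeline = SimplePipeline([
        ("fetch", lambda n: list(range(n))),         # stand-in for a fetch step
        ("clean", lambda xs: [x for x in xs if x]),  # drop falsy items
        ("transform", lambda xs: [x * 2 for x in xs]),
    ])
    print(pipeline.execute(n=4))
```

Keeping step names alongside the callables makes it straightforward to add logging or per-step error reporting later.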

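Contributing

If you have suggestions or would like to contribute to the project, feel free to open an issue or send a pull request. Contributions of any kind are welcome.

License

This project is released under the MIT License. See the LICENSE file for details.

We hope DataHarvest helps you build high-quality Chinese language model datasets. If you have any questions or suggestions, feel free to contact us.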
