Stars

Data processing

11 repositories
Jupyter Notebook · 407 stars · 66 forks · Updated Aug 15, 2024

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Python · 3,563 stars · 201 forks · Updated Feb 8, 2025

Repository for analysis and experiments in the BigCode project.

Jupyter Notebook · 117 stars · 20 forks · Updated Mar 20, 2024

Script for downloading GitHub.

Python · 90 stars · 32 forks · Updated Jul 1, 2024

Reliable project license detector.

Go · 236 stars · 39 forks · Updated Jun 9, 2023

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW

Python · 2,646 stars · 297 forks · Updated Jun 4, 2024

What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets

Python · 205 stars · 20 forks · Updated Nov 16, 2024

Scalable data preprocessing and curation toolkit for LLMs

Jupyter Notebook · 775 stars · 108 forks · Updated Feb 8, 2025

Distilabel is a framework for synthetic data and AI feedback, aimed at engineers who need fast, reliable, and scalable pipelines based on verified research papers.

Python · 2,330 stars · 168 forks · Updated Feb 4, 2025

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets

Python · 4,300 stars · 405 forks · Updated Feb 6, 2025

Easily embed, cluster and semantically label text datasets

Python · 499 stars · 40 forks · Updated Mar 28, 2024