Data Processing
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Repository for analysis and experiments in the BigCode project.
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
Scalable data preprocessing and curation toolkit for LLMs
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
Easily embed, cluster and semantically label text datasets
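
For readers unfamiliar with MinHash, which one of the entries above lists among its sketches, here is a minimal pure-Python illustration of how it estimates Jaccard similarity for near-duplicate detection. It is a sketch under stated assumptions, not the API of any tool listed here: the function names (`shingles`, `minhash_signature`, `estimate_jaccard`), the 3-word shingle size, and the use of salted BLAKE2b as the hash family are all choices made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_perm=128):
    """Build a MinHash signature: the minimum hash value under each of
    num_perm differently salted hash functions."""
    signature = []
    for seed in range(num_perm):
        # Assumption: a salted BLAKE2b stands in for a random permutation.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
doc_c = "completely unrelated sentence about data curation pipelines"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
sig_c = minhash_signature(shingles(doc_c))

print("a vs b:", estimate_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimate_jaccard(sig_a, sig_c))  # unrelated texts score low
```

In practice, signatures like these are usually bucketed with LSH so that candidate pairs can be found without comparing every document against every other one.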