Skip to content

Latest commit

 

History

History

DataSet

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

Open LLM datasets for pre-training

Name Release Date Paper/Blog Dataset Tokens (T) License
Anthropic HH Anthropic HH
HC3 How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection HC3 数据集
koala-test-set koala-test-set
MTP(massive text pairs) 2023/09 智源发布超3亿对面向中英文语义向量模型训练数据集 BAAI-MTP 1.3
OpenAI WebGPT OpenAI WebGPT
OpenAI Summarization OpenAI Summarization
RedPajama 2023/04 RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens RedPajama-Data
ShareGPT ShareGPT
starcoderdata 2023/05 StarCoder: A State-of-the-Art LLM for Code starcoderdata 0.25 Apache 2.0
Stanford Alpaca Stanford Alpaca Alpaca Dataset

Open LLM datasets for instruction-tuning

Name Release Date Paper/Blog Dataset Tokens (T) License
Baize
Dolly
databricks-dolly-15k 2023/04 Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM databricks-dolly-15k 15 CC BY-SA-3.0
Evol-Instruct
Flan 2021
LIMA
MPT-7B-Instruct 2023/05 Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs dolly_hhrlhf 59 CC BY-SA-3.0
MetaMathQA 2023/09 MetaMath: Bootstrap Your Own Mathematical Questions for Large Language ModelsMetaMathQA blog MetaMathQA --- ---
Natural Instructions
OIG (Open Instruction Generalist) 2023/03 THE OIG DATASET OIG 44,000 Apache 2.0
OpenAssistant Conversations
P3 (Public Pool of Prompts)
Self-Instruct
Super-Natural Instructions
Unnatural Instructions
UltraFeedback:大规模、多样化、细粒度的偏好数据集
UltraFeedback Code
UltraChat:高质量对话数据集,包含 150 余万条多轮指令数据 UltraChat Code
WildChat 2024/05 WILDCHAT: 1M CHATGPT INTERACTION LOGS IN THE WILD allenai/WildChat-1M 1M AI2 ImpACT
xP3

Open LLM datasets for alignment-tuning

Name Release Date Paper/Blog Dataset Tokens (T) License
OpenAssistant Conversations Dataset 2023/04 OpenAssistant Conversations - Democratizing Large Language Model Alignment oasst1 161 Apache 2.0

LLM Evaluation Benchmark

Name Paper/Blog Dataset Samples (K) License
C-Eval C-Eval
Gaokao Gaokao
AGIEval AGIEval
MMLU MMLU
LawBench LawBench: Benchmarking Legal Knowledge of Large Language Models LawBench Code

LLM DataSets

Some examples of DataSets as follows:

Description Paper Code Blog
最全《大型语言模型数据集》全面综述pdf及444个数据集获取地址 Awesome-LLMs-Datasets
一篇关于LLM指令微调的综述 paper blog
智源研究院发布国内首个大规模、可商用中文开源指令数据集COIG:最大规模中文多任务指令集,上新千个中文数据集 paper blogCOIG-PC数据下载地址COIG数据下载地址
总结当前开源可用的Instruct/Prompt Tuning数据 blog
GPT-4平替版:MiniGPT-4,支持图像理解和对话,现已开源 dataset
多模态C4:一个开放的、10亿规模的、与文本交错的图像语料库 paper code
Mind2Web: 首个全面衡量大模型上网能力的数据集 blog
该数据集是一个由人工生成、人工注释的助理式对话语料库,覆盖了广泛的主题和写作风格,由 161443 条消息组成,分布在 66497 个会话树中,使用 35 种不同的语言。该语料库是全球众包工作的产物,涉及超过 13500 名志愿者。为了证明 OpenAssistant Conversations 数据集的有效性,该研究还提出了一个基于聊天的助手 OpenAssistant,其可以理解任务、与第三方系统交互、动态检索信息。 paper code dataset
为了让Panda LLM在中文数据集上获得强大的性能,作者使用了强大的指令微调instruction-tuning技术,将LLaMA基础模型在五个开源的中文数据集进行混合训练,其中包括来自各种语言领域的1530万个样本,例如维基百科语料,新闻语料,百科问答语料,社区问答语料,和翻译语料。 blog
RedPajama开源项目|复制超过1.2万亿个令牌的LLaMA训练数据集 code 原始blog中文blogdataset

长文本数据集

Description Paper Code Blog
ZeroSCROLLS ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding ZeroSCROLLS Blog
L-Eval L-EVAL: INSTITUTING STANDARDIZED EVALUATION FOR LONG CONTEXT LANGUAGE MODELS
LongBench LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LooGLE LooGLE: CAN LONG-CONTEXT LANGUAGE MODELS UNDERSTAND LONG CONTEXTS?
CLongEval CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models CLongEval Code CLongEval Blog