Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
---|---|---|---|---|---|
Anthropic HH | Anthropic HH | ||||
HC3 | How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection | HC3 数据集 | |||
koala-test-set | koala-test-set | ||||
MTP(massive text pairs) | 2023/09 | 智源发布超3亿对面向中英文语义向量模型训练数据集 | BAAI-MTP | 1.3 | |
OpenAI WebGPT | OpenAI WebGPT | ||||
OpenAI Summarization | OpenAI Summarization | ||||
RedPajama | 2023/04 | RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens | RedPajama-Data | ||
ShareGPT | ShareGPT | ||||
starcoderdata | 2023/05 | StarCoder: A State-of-the-Art LLM for Code | starcoderdata | 0.25 | Apache 2.0 |
Stanford Alpaca | Stanford Alpaca | Alpaca Dataset |
Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
---|---|---|---|---|---|
Baize | |||||
Dolly | |||||
databricks-dolly-15k | 2023/04 | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | databricks-dolly-15k | 15 | CC BY-SA-3.0 |
Evol-Instruct | |||||
Flan 2021 | |||||
LIMA | |||||
MPT-7B-Instruct | 2023/05 | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs | dolly_hhrlhf | 59 | CC BY-SA-3.0 |
MetaMathQA | 2023/09 | MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models,MetaMathQA blog | MetaMathQA | --- | --- |
Natural Instructions | |||||
OIG (Open Instruction Generalist) | 2023/03 | THE OIG DATASET | OIG | 44,000 | Apache 2.0 |
OpenAssistant Conversations | |||||
P3 (Public Pool of Prompts) | |||||
Self-Instruct | |||||
Super-Natural Instructions | |||||
Unnatural Instructions | |||||
UltraFeedback:大规模、多样化、细粒度的偏好数据集 | |||||
UltraFeedback Code | |||||
UltraChat:高质量对话数据集,包含 150 余万条多轮指令数据 | UltraChat Code | ||||
WildChat | 2024/05 | WILDCHAT: 1M CHATGPT INTERACTION LOGS IN THE WILD | allenai/WildChat-1M | 1M | AI2 ImpACT |
xP3 |
Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
---|---|---|---|---|---|
OpenAssistant Conversations Dataset | 2023/04 | OpenAssistant Conversations - Democratizing Large Language Model Alignment | oasst1 | 161 | Apache 2.0 |
Name | Paper/Blog | Dataset | Samples (K) | License |
---|---|---|---|---|
C-Eval | C-Eval | |||
Gaokao | Gaokao | |||
AGIEval | AGIEval | |||
MMLU | MMLU | |||
LawBench | LawBench: Benchmarking Legal Knowledge of Large Language Models | LawBench Code |
Some examples of DataSets as follows:
Description | Paper | Code | Blog |
---|---|---|---|
最全《大型语言模型数据集》全面综述pdf及444个数据集获取地址 | Awesome-LLMs-Datasets | ||
一篇关于LLM指令微调的综述 | paper | blog | |
智源研究院发布国内首个大规模、可商用中文开源指令数据集COIG:最大规模中文多任务指令集,上新千个中文数据集 | paper | blog,COIG-PC数据下载地址,COIG数据下载地址 | |
总结当前开源可用的Instruct/Prompt Tuning数据 | blog | ||
GPT-4平替版:MiniGPT-4,支持图像理解和对话,现已开源 | dataset | ||
多模态C4:一个开放的、10亿规模的、与文本交错的图像语料库 | paper | code | |
Mind2Web: 首个全面衡量大模型上网能力的数据集 | blog | ||
该数据集是一个由人工生成、人工注释的助理式对话语料库,覆盖了广泛的主题和写作风格,由 161443 条消息组成,分布在 66497 个会话树中,使用 35 种不同的语言。该语料库是全球众包工作的产物,涉及超过 13500 名志愿者。为了证明 OpenAssistant Conversations 数据集的有效性,该研究还提出了一个基于聊天的助手 OpenAssistant,其可以理解任务、与第三方系统交互、动态检索信息。 | paper | code | dataset |
为了让Panda LLM在中文数据集上获得强大的性能,作者使用了强大的指令微调instruction-tuning技术,将LLaMA基础模型在五个开源的中文数据集进行混合训练,其中包括来自各种语言领域的1530万个样本,例如维基百科语料,新闻语料,百科问答语料,社区问答语料,和翻译语料。 | blog | ||
RedPajama开源项目|复制超过1.2万亿个令牌的LLaMA训练数据集 | code | 原始blog,中文blog,dataset |