A curated list of open-source instruction tuning datasets, models, papers, and repositories.
Following Longpre et al., we list existing instruction tuning datasets derived from traditional NLP tasks.
Release | Datasets | Number of Tasks | Number of Instances | Model_name | Base | Model_Size |
---|---|---|---|---|---|---|
2020-05 | UnifiedQA | 46 | 750k | UnifiedQA | RoBERTa | 110-340M |
2021-04 | CrossFit | 159 | 7.1M | BART-CrossFit | BART | 140M |
2021-04 | Natural Inst v1.0 | 61 | 620k | Gen. BART | BART | 140M |
2021-09 | Flan 2021 | 62 | 4.4M | Flan-LaMDA | LaMDA | 137B |
2021-10 | P3 | 62 | 12M | T0, T0+, T0++ | T5-LM | 3-11B |
2021-10 | MetaICL | 142 | 3.5M | MetaICL | GPT-2 | 770M |
2021-11 | ExMix | 107 | 500k | ExT5 | T5 | 220M-11B |
2022-04 | Super-Natural Inst. | 1613 | 5M | Tk-Instruct | T5-LM, mT5 | 17M-13B |
2022-10 | GLM | 77 | 12M | GLM-130B | GLM | 130B |
2022-10 | Flan 2022 | 1836 | 15M | Flan-T5, Flan-PaLM | T5-LM, PaLM | 10M-540B |
2022-11 | xP3 | 71 | 81M | BLOOMz, mT0 | BLOOM, mT5 | 13-176B |
2022-12 | Unnatural Inst. | 117 | 64k | T5-LM-Unnat. Inst. | T5-LM | 11B |
2023-06 | tasksource-instruct | 475 | 4.58 | - | - | - |
Release | Model_name | Base | Model_Size | Datasets | Number of Instances | Language |
---|---|---|---|---|---|---|
2022-12 | GPT-3 Self Inst. | GPT-3 | 175B | Self-Instruct | 82k | En
2023-03-03 | alpaca | LLaMA | 7B | alpaca_data | 52k | En
2023-03-19 | alpaca-lora | LLaMA | 7B 13B 30B | alpaca_data, alpaca_data_cleaned | 52k | En
2023-03-23 | Chinese-Vicuna | LLaMA | 7B 13B | BELLE, GuanacoDataset | 1M | Zh
2023-03-24 | Alpaca-CoT | LLaMA | 7B | dataset | ---- | En Zh |
2023-03-25 | dolly | GPT-J | 6B | alpaca_data | 52k | En
2023-03-25 | guanaco | LLaMA | 7B | GuanacoDataset | 534k | En Zh Ja De
2023-03-28 | Chinese-LLaMA-Alpaca | LLaMA | 7B | alpaca_data_zh, pCLUE, translation2019zh, alpaca_data, Self-Instruct | 2M | Zh
2023-03-29 | ColossalChat | LLaMA | 7B 13B | InstructionWild | 104k | En Zh
2023-03-31 | Luotuo | LLaMA ChatGLM | 7B 6B | trans_chinese_alpaca_data | 52k | Zh |
2023-03-31 | cerebras-lora-alpaca | Cerebras-GPT | 2.7B | AlpacaDataCleaned | 52k | En |
Most existing datasets are in English, yet most of the world's population is under-served by the data available in their languages. How can we ensure that everyone, everywhere, benefits from generative AI? We have developed a straightforward, open-source translation tool based on Helsinki-NLP models that can translate English datasets into 100+ languages at no cost. Although these translated datasets may contain some noise, they serve as a viable alternative to costly, high-quality data. Usage is shown below.
python translator.py model_name source_data_path
python translator.py Helsinki-NLP/opus-mt-en-zh alpaca_data.json
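For reference, here is a minimal sketch of what such a translation script could look like. It is not the exact `translator.py` from this repository: it assumes an Alpaca-style JSON file (a list of records with "instruction", "input", and "output" fields) and uses the Hugging Face `transformers` MarianMT classes to translate each text field with the model named on the command line.

```python
# Sketch of an Alpaca-style dataset translator (assumption: not the repo's exact translator.py).
import json
import sys

from transformers import MarianMTModel, MarianTokenizer

# e.g. python translator.py Helsinki-NLP/opus-mt-en-zh alpaca_data.json
model_name, source_path = sys.argv[1], sys.argv[2]
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text: str) -> str:
    """Translate one string; empty fields are passed through unchanged."""
    if not text:
        return text
    batch = tokenizer([text], return_tensors="pt", truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

with open(source_path, encoding="utf-8") as f:
    records = json.load(f)  # a list of {"instruction", "input", "output"} dicts

translated = [{key: translate(value) for key, value in record.items()} for record in records]

with open("translated_" + source_path, "w", encoding="utf-8") as f:
    json.dump(translated, f, ensure_ascii=False, indent=2)
```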
Our tool is designed to work with Alpaca-format data and the Helsinki-NLP/opus-mt-en-zh model; different datasets or Helsinki-NLP models yield different results. Constrained by the model's capabilities, the translation quality is not always optimal. For example, we observed repeated words in the English-to-Chinese translations, which led us to develop "process.py" to remove translated prompts containing a string of any length repeated three consecutive times. We provide the final cleaned data in "translated_alpaca_data.json".
python process.py unprocessed_data_path
python process.py translated_data.json
# The Helsinki-NLP models have a maximum input sentence length; we discard prompts that exceed this limit before translating them.
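As a rough illustration of this cleaning step (again, a sketch rather than the repository's exact `process.py`), the script below drops any record in which some field contains a substring of any length repeated three consecutive times, the artifact described above.

```python
# Sketch of the repetition filter (assumption: not the repo's exact process.py).
import json
import re
import sys

# (.+?)\1\1 matches any non-empty substring immediately followed by two more copies of itself.
REPEAT = re.compile(r"(.+?)\1\1")

def has_repetition(record: dict) -> bool:
    """True if any string field contains a substring repeated three times in a row."""
    return any(REPEAT.search(value) for value in record.values() if isinstance(value, str))

# e.g. python process.py translated_data.json
source_path = sys.argv[1]
with open(source_path, encoding="utf-8") as f:
    records = json.load(f)

cleaned = [record for record in records if not has_repetition(record)]
print(f"kept {len(cleaned)} of {len(records)} records")

with open("processed_" + source_path, "w", encoding="utf-8") as f:
    json.dump(cleaned, f, ensure_ascii=False, indent=2)
```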
We have extensively reviewed papers in this field and list the most valuable ones below:
Finetuned Language Models Are Zero-Shot Learners 2021.9
Multitask Prompted Training Enables Zero-Shot Task Generalization 2021.10
Training language models to follow instructions with human feedback 2022.3
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks 2022.4
Unsupervised Cross-Task Generalization via Retrieval Augmentation 2022.4
Instruction Induction: From Few Examples to Natural Language Task Descriptions 2022.5
Scaling Instruction-Finetuned Language Models 2022.10
Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners 2022.10
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor 2022.12
Self-Instruct: Aligning Language Models with Self-Generated Instructions 2022.12
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning 2022.12
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning 2023.1
In-Context Instruction Learning 2023.2
Additionally, we have provided a list of related repositories for further reference.