Open Datasets

Datasets open-sourced by the BELLE project, along with a record of some other open-source data.

1. See BELLE/data/1.5M: Chinese datasets (1M + 0.5M) generated with reference to Stanford Alpaca.

2. Continuously released datasets; see BELLE/data/10M.

3. Awesome Open Instruct Data for Chinese, which records open-source Chinese instruction datasets; the individual datasets are listed below, and a minimal loading sketch follows the list.

  • Guanaco Dataset
  • Summary: The dataset for the Guanaco model is designed to enhance multilingual capabilities and address a variety of linguistic tasks. It builds upon the 175 seed tasks from the Alpaca model by rewriting the seed tasks in different languages and adding new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition. The dataset comprises a total of 534,530 entries.

  • COIG Dataset
  • Summary: The COIG project provides diverse Chinese instruction corpora. Researchers can contribute to the corpus set and collaborate. COIG has released its first chip to aid the development of Chinese LLMs and encourages more researchers to join. It includes translated, exam, human value alignment, counterfactual correction, and Leetcode instruction corpora.

  • Firefly Dataset
  • Summary: They constructed a large collection of data related to Chinese culture, drawing on 23 commonly used Chinese datasets. For each task, multiple instruction templates were written by hand to ensure the quality and richness of the data, resulting in a training set of 1.15 million Chinese samples. The tasks covered include couplet writing, poetry composition, Classical Chinese translation, prose writing, Jin Yong's novels, and others.
  • License: [Apache 2.0]
  • InstructWild Dataset
  • Summary: The project aims to build a more diverse and richer set of instructions. They crawled over 700 noisy instructions from ChatGPT screenshots shared on Twitter, screened out the useless ones, and kept 429 clean seed instructions (released in English and Chinese versions) to ensure high quality. Following Alpaca's method, they then generated 52,000 instructions and their responses, but without the need for human intervention, so the generated prompts are more diverse and cover a wider range of topics. They provide 5 sample prompts for generating new instructions through the OpenAI API, and the corresponding responses were likewise collected from the OpenAI API. The English and Chinese datasets were generated independently and cost a total of $880; each contains 52K instructions (the English set is about 24 million tokens).
  • License: [Apache 2.0]
  • HC3 Dataset
  • Summary: In this work, they collected tens of thousands of comparison responses from both human experts and ChatGPT, with questions spanning open-domain, financial, medical, legal, and psychological areas. The HC3 dataset is a valuable resource for analyzing the linguistic and stylistic characteristics of both humans and ChatGPT, which helps to investigate future improvement directions for LLMs. The comparison data is constructed mainly from two sources: publicly available question-answering datasets, where answers are given by experts in specific domains or are the highly voted answers from web users, and wiki text. Based on the collected human question-answering datasets, they use ChatGPT to generate answers to the same questions; to make the answers better aligned with human answers, they add additional instructions to ChatGPT for specific datasets.
  • License: [cc-by-sa-4.0]
  • alpaca_gpt4_zh Dataset
  • Summary: This project presents the first attempt to use GPT-4 to generate instruction-following data for LLM fine-tuning. They also collect feedback and comparison data from GPT-4 to enable comprehensive evaluation and reward-model training. The dataset contains 52K instruction-following examples generated by GPT-4 from Alpaca prompts translated into Chinese by ChatGPT. Experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks compared to the instruction-following data generated by previous state-of-the-art models.
  • License: [cc-by-nc-4.0]
  • pCLUE Dataset
  • Summary: This is a large-scale pre-training dataset based on prompts for multi-task and zero-shot learning. It contains 1.2 million training samples and 73 prompts, covering 9 datasets: text classification (tnews), text classification (iflytek), natural language inference (ocnli), semantic matching (afqmc), coreference resolution (cluewsc2020), keyword recognition (csl), free-form reading comprehension (c3), extractive reading comprehension (cmrc2018), and idiom-filling reading comprehension (chid).

  • CSL Dataset
  • Summary: Scientific literature serves as a high-quality corpus, supporting much Natural Language Processing (NLP) research. However, existing datasets are centered on English, which restricts the development of Chinese scientific NLP. In this work, they present CSL, a large-scale Chinese Scientific Literature dataset containing the titles, abstracts, keywords, and academic fields of 396k papers. To their knowledge, CSL is the first scientific document dataset in Chinese. CSL can serve as a Chinese corpus, and this semi-structured data provides natural annotations that can support many supervised NLP tasks. Based on CSL, they present a benchmark to evaluate model performance across scientific-domain tasks, i.e., summarization, keyword generation, and text classification.

  • MOSS SFT Dataset
  • Summary: MOSS is an open-sourced, plugin-augmented conversational language model. The multi-turn conversational data (moss-002-sft-data) used to train MOSS-002 covers helpfulness, honesty, and harmlessness, and consists of 570K English and 590K Chinese conversations generated by text-davinci-003. They have open-sourced a small portion of the other data (moss-003-sft-data, moss-003-sft-plugin-data, moss-003-pm-data) and will release the full data in the near future.
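
Most of the corpora above are distributed as Alpaca-style instruction records, and many are mirrored on the Hugging Face Hub. As a minimal sketch of how one might inspect such a dataset, the Python snippet below loads one and prints its schema; the repo id "BelleGroup/train_0.5M_CN" and the field names are assumptions for illustration only (this README does not specify them), so check each dataset card for the actual schema and license before use.

```python
from datasets import load_dataset

# Minimal sketch: load one instruction dataset from the Hugging Face Hub.
# The repo id below is an assumption for illustration; substitute whichever
# dataset from the list above you actually want to use.
ds = load_dataset("BelleGroup/train_0.5M_CN", split="train")

# Inspect the schema before relying on field names; Alpaca-style data usually
# carries "instruction", "input", and "output" columns, but this varies per dataset.
print(ds.column_names)
print(ds[0])
```

Printing the column names and the first record is usually enough to see whether a given dataset follows the instruction/input/output convention or uses its own fields.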