Dataset processing flow

Constructing a dataset is the data preprocessing stage of NLP: its main purpose is to rearrange scattered raw files into data with a unified structure that a language model can consume directly. The main process of constructing a sample dataset is as follows (taking the CommitmentBank dataset as an example):

At present, the project supports three kinds of data preprocessing: fine-tuning for classification tasks, pre-training, and fine-tuning for generation tasks. Each is described separately below.

Data processing: fine-tuning for classification tasks

There are two forms of tuning for classification tasks: fine-tuning and prompt-tuning. Prompt-tuning requires an additional cloze template for the task and is better suited to settings with limited data. We take prompt-tuning as an example to introduce the data processing for classification tasks:

Application code

import torch
from flagai.data.tokenizer import Tokenizer
from flagai.data.dataset import SuperGlueDataset
from flagai.test_utils import CollateArguments
from flagai.data.dataset import ConstructSuperglueStrategy

# get default parameters
cl_args = CollateArguments()

# Create tokenizer
tokenizer = Tokenizer.from_pretrained("GLM-large-en")
            
# Initially read and process the dataset
dataset = SuperGlueDataset(task_name='cb',
                           data_dir='./datasets/',
                           dataset_type='train',
                           tokenizer=tokenizer)

# Construct collate function
collate_fn = ConstructSuperglueStrategy(cl_args, tokenizer, task_name="cb")

# create loader
loader = torch.utils.data.DataLoader(dataset,
                                    batch_size=1,
                                    shuffle=False,
                                    num_workers=1,
                                    drop_last=False,
                                    pin_memory=False,
                                    collate_fn=collate_fn)

Initially read and process the dataset

The corresponding code module is shown below. It consists of two steps: automatically loading the dataset, and unifying the structure of all datasets.

dataset = SuperGlueDataset(task_name='cb',
                           data_dir='./datasets/',
                           dataset_type='train',
                           tokenizer=tokenizer)

SuperGlueDataset is used in this step, and its major parameters are introduced below:

task_name: identifier of the dataset. Supported datasets and their identifiers are listed in the table in 1. Load the dataset below.

data_dir: data will be automatically downloaded to the data_dir directory, which is ./dataset by default.

dataset_type: one of train/dev/test, indicating whether the training, validation, or test set is to be preprocessed.

tokenizer: the tokenizer constructed as introduced in Tutorial 1.

1. Load the dataset

FlagAI currently supports the following classification datasets:

| Dataset name | Identifier | Language | Benchmark | Auto-download | Fully tested |
|---|---|---|---|---|---|
| Boolean Questions (BoolQ) | boolq | English | SuperGLUE | | |
| CommitmentBank | cb | English | SuperGLUE | | |
| Choice of Plausible Alternatives | copa | English | SuperGLUE | | |
| Multi-Sentence Reading Comprehension | muiltirc | English | SuperGLUE | | |
| Recognizing Textual Entailment | rte | English | SuperGLUE | | |
| Words in Context | wic | English | SuperGLUE | | |
| The Winograd Schema Challenge | wsc | English | SuperGLUE | | |
| Ant Financial Question Matching Corpus | afqmc | Chinese | CLUE | | |
| Short Text Classification for News | tnews | Chinese | CLUE | | |
| Broadcoverage Diagnostics | ax-b | English | SuperGLUE | | |
| Winogender Schema Diagnostics | ax-g | English | SuperGLUE | | |
| The Corpus of Linguistic Acceptability | cola | English | GLUE | | |
| The Stanford Sentiment Treebank | sst2 | English | GLUE | | |
| Microsoft Research Paraphrase Corpus | mrpc | English | GLUE | | |
| Quora Question Pairs | qqp | English | GLUE | | |
| MultiNLI Matched | mnli | English | GLUE | | |
| MultiNLI Mismatched | mnli-mm | English | GLUE | | |
| Question NLI | qnli | English | GLUE | | |
| X-Stance | xstance | English | | | |
| X-Stance (German) | xstance-de | German | | | |
| X-Stance (French) | xstance-fr | French | | | |
| RACE | race | English | | | |
| AG News | agnews | English | | | |
| Yahoo Answers | yahoo | English | | | |
| Yelp Review Polarity | yelp-polarity | English | | | |
| Yelp Open Dataset | yelp-full | English | | | |
| The Stanford Question Answering Dataset (SQuAD) | squad | English | | | |
| CLUEWSC2020 | cluewsc | Chinese | CLUE | | |

The downloaded dataset directory will contain three files, corresponding to the training set, the validation set, and the test set. Taking the CommitmentBank dataset as an example, train.jsonl in the directory corresponds to the training set, val.jsonl to the validation set, and test.jsonl to the test set. Generally, the training set and validation set contain label information, while the test set does not. These files are processed separately in the next step.

Different datasets may have different file formats and structures. Taking the CommitmentBank dataset as an example, a sample record looks like the following.
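The raw file is in JSON Lines format; using the field values from the table below, a record looks roughly like this (the exact formatting is illustrative):

{
    "premise": "Mary is a high school student.",
    "hypothesis": "Mary is a student",
    "label": "entailment",
    "idx": 10
}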

It contains four parts, as shown below

| key | meaning | value |
|---|---|---|
| premise | the premise text | Mary is a high school student. |
| hypothesis | the hypothesis text | Mary is a student |
| label | the label describing the relationship between the premise and the hypothesis; one of entailment, neutral, and contradiction | entailment |
| idx | the sequence number of the sample in the dataset | 10 |

The specific structure of all FlagAI supported datasets can be viewed here.

2. Unified dataset structure

In this step, we will unify the data structures of different datasets to facilitate subsequent processing. The details of this structure are as follows:

| key | meaning | format |
|---|---|---|
| guid | a unique textual identifier | str |
| text_a | the sequence of text | str |
| text_b | an optional, second sequence of text | str |
| label | an optional label | str |
| logits | an optional list of per-class logits | list |
| meta | an optional dictionary to store arbitrary meta information | dict |
| ids | an optional numeric index | int |

When the dataset is built, you can view one of the samples directly in the code by indexing:

example = dataset[3]  # the fourth example in the dataset (index 3)

For instance, the CommitmentBank example from the previous step will be processed into the following form.
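A sketch of the unified example, built from the CommitmentBank fields above (the guid format and the assignment of premise/hypothesis to text_a/text_b are illustrative):

{
    "guid": "train-10",                           # unique identifier (illustrative format)
    "text_a": "Mary is a high school student.",   # the premise
    "text_b": "Mary is a student",                # the hypothesis
    "label": "entailment",
    "meta": {},
    "ids": 10
}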

Note that if text_a and text_b cannot hold all of the background text because the data structure is too complex, the remaining information can be placed in meta.

Organize the data into input to the model

The corresponding logic is implemented in the following collate function, which consists of two steps: constructing the cloze template, then tokenizing and constructing the input sample.

collate_fn = ConstructSuperglueStrategy(cl_args,
                                        tokenizer,
                                        task_name=task_name)

1. Build the cloze template

A cloze template contains background text, slots, and options provided to the slots. Models need to find the right options and fill in the blanks.

For each task, we need to construct cloze questions of a different structure for the model to answer. Taking the CommitmentBank dataset as an example, the task is to decide whether the hypothesis can be inferred from the premise, with only three possible results: entailment/contradiction/neutral. We can then construct the following cloze problem, where entailment/contradiction/neutral correspond to true/false/neither respectively.
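One plausible template for CommitmentBank (the exact wording FlagAI uses may differ) places the premise and hypothesis around a blank, for example:

"Mary is a high school student." Question: Mary is a student? Answer: [MASK].

where the blank is filled with true (entailment), false (contradiction), or neither (neutral).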

This can roughly be divided into two steps: the first is to combine the existing text so that it takes the form of a cloze question; the second is to convert the original label texts into new label words that can be filled into the blank.

2. Tokenize and construct input samples

Next, we need to construct the model input. The first step is tokenization; after that, two cases must be distinguished. In the first case, the label categories contained in the dataset are limited: for example, the CommitmentBank dataset has only three label texts (entailment/contradiction/neutral), which is common in classification tasks. In the second case, each cloze question provides its own options, usually longer texts: for example, in some reading comprehension datasets each option is a different interpretation of the passage. The two cases are handled as follows:

a) Cloze for a single token

| key | dimension | meaning | construction method |
|---|---|---|---|
| input_ids | torch.Size([seq_length¹]) | input matrix | composed of the cloze text from the previous step, plus some special characters² |
| labels | torch.Size([1]) | label | the corresponding numeric label, such as 0, 1, 2, ... |
| position_ids | torch.Size([2, seq_length]) | position encoding | refer to the [GLM process](GLM.md); the first row is the absolute position of each token, the second row is the relative position within the masked span |
| attention_mask | torch.Size([1]) | separator position | |
| target_ids | torch.Size([num_labels³]) | full label list | every label text corresponds to a single token; the token ids of these label words are put into target_ids |
| logit_mask | torch.Size([seq_length]) | whether the corresponding token is part of the answer | for each token, 1 if it is part of the answer, 0 otherwise |

¹: seq_length represents the specified maximum length of each input vector.

²: Special characters are added as follows: a [CLS] symbol is added at the beginning of the sentence and [EOS] symbols are appended at the end until the length reaches seq_length. If the cloze output consists of two text segments, a [SEP] symbol is inserted between them.

³: num_labels represents the number of options in the cloze problem.
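For intuition, a single collated sample in this case carries tensors with the following shapes (a sketch based on the table above, assuming illustrative values seq_length = 256 and num_labels = 3):

import torch

seq_length, num_labels = 256, 3                         # illustrative values, not FlagAI defaults
sample_shapes = {
    "input_ids":      torch.Size([seq_length]),         # cloze text plus special tokens, padded
    "labels":         torch.Size([1]),                  # numeric label, e.g. 0, 1 or 2
    "position_ids":   torch.Size([2, seq_length]),      # absolute positions / positions in the masked span
    "attention_mask": torch.Size([1]),                  # separator position
    "target_ids":     torch.Size([num_labels]),         # token ids of the single-token label words
    "logit_mask":     torch.Size([seq_length]),         # 1 where the token is part of the answer, else 0
}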

b) Cloze for multiple tokens

| key | dimension | meaning | construction method |
|---|---|---|---|
| input_ids | torch.Size([num_labels, seq_length]) | input matrix | copy the corresponding text num_labels times |
| labels | torch.Size([1]) | label | |
| position_ids | torch.Size([num_labels, 2, seq_length]) | position encoding | copy the original position encoding num_labels times |
| attention_mask | torch.Size([num_labels]) | separator position | copy num_labels times |
| target_ids | torch.Size([num_labels, seq_length]) | information about each option | each row of the matrix corresponds to one option and contains the tokens of that option's text |
| logit_mask | torch.Size([num_labels, seq_length]) | whether the corresponding token is part of the answer | copy num_labels times |

Create loader

Finally, wrap the dataset in a PyTorch DataLoader.

loader = torch.utils.data.DataLoader(dataset,
                                    batch_size=1,
                                    shuffle=False,
                                    num_workers=1,
                                    drop_last=False,
                                    pin_memory=False,
                                    collate_fn=collate_fn)

The data in the DataLoader can be inspected as follows:

for data_iterator in loader:
    for key, value in data_iterator.items():
        print(key, value)
    # break  # uncomment to stop after the first batch

Once the loader is constructed, it can be used for the subsequent training and prediction process.


Data processing: GLM pre-training task

Sample pre-training task data format:

{
    "RECORDS": [
        {
            "id": 0,
            "uniqueKey": "",
            "titleUkey": "",
            "dataType": "",
            "title": "",
            "content": "平安信用卡上门激活失败是为啥?平安信用卡上门激
            活失败的主要原因有这些:申卡人是在网上申请的新用户,上门激活时
            携带资料不足,需要申卡人携带身份证、信用卡、预留号码等去平安银
            行网点验证身份才能激活;申卡人在平安银行预留的手机号码有误;申卡
            人在激活时输入的相关信息不正确;申卡人在回答上门激活工作人员的问
            题是,跟之前提交的资料有一些出入,有违规嫌疑。平安信用卡上门激活失
            败的主要原因有这些:申卡人是在网上申请的新用户,上门激活时携带资料不
            足,需要申卡人携带身份证、信用卡、预留号码等去平安银行网点验证身份才能
            激活;申卡人在平安银行预留的手机号码有误;申卡人在激活时输入的相关信息不
            正确;申卡人在回答上门激活工作人员的问题是,跟之前提交的资料有一
            些出入,有违规嫌疑。"
        },
    ]
}

Example code for pre-training data processing:

from flagai.data.tokenizer import Tokenizer
from flagai.test_utils import PretrainDatasetArguments
from flagai.data.dataset.block.data_utils import split_ds, get_dataset_lazy, add_args
from flagai.data.dataset import BlockDataset

ds_args = PretrainDatasetArguments()

tokenizer = Tokenizer.from_pretrained("GLM-large-ch")

ds_args = add_args(ds_args, tokenizer)

def create_dataset(tokenizer, should_split):
    dataset = get_dataset_lazy("./examples/glm_pretrain/data", # lazily load the raw data from this directory
                               tokenizer=tokenizer,
                               pre_tokenize=True,
                               num_processes=10,
                               no_lazy_loader=True)
    if should_split:
        datasets = split_ds(dataset, split=[.8, .2, .0], shuffle=True) # manually split into train/validation/test
    else:
        datasets = [dataset]

    datasets = [
        BlockDataset(ds,
                     tokenizer,
                     max_seq_len=512,
                     sample_across_doc=True,
                     non_sentence_start=0.0) if ds is not None else None
        for ds in datasets
    ]
    return datasets

datasets = create_dataset(tokenizer, should_split=True)
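With split=[.8, .2, .0], the call above should return one dataset per split, with the empty test split left as None; a minimal usage sketch (variable names are illustrative):

train_dataset, valid_dataset, test_dataset = datasets
print(len(train_dataset), len(valid_dataset))  # test_dataset is None because its split ratio is 0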

Pre-training data processing follows the same overall process, with the following differences:

  1. The pre-training dataset is not split into training, validation, and test sets by default, so it needs to be split manually.
  2. Since pre-training datasets are generally large, lazy loading is used. Lazy loading only instantiates an object when it is actually used, which saves resources.
  3. During pre-training, the collate function randomly processes data according to three different modes: bert mode (mask random spans), sentence mode (mask complete sentences), and gpt mode (mask a single long span); see the sketch after the table below. The model input therefore has one more key, mode, than in the general generation task.
  4. No template needs to be added for pre-training; the model input is built following the table below.
| key | dimension | meaning | construction method |
|---|---|---|---|
| input_ids | torch.Size([seq_length¹]) | input matrix | composed of the input text from the previous step, plus some special characters |
| position_ids | torch.Size([2, seq_length]) | position encoding | refer to the GLM process; the first row is the absolute position of each token, the second row is the relative position within the masked span |
| attention_mask | torch.Size([1]) | delimiter position | for generation-style modes, the position where the source text ends; otherwise the position where the input text ends |
| target_ids | torch.Size([num_labels³]) | full label list | the masked text |
| logit_mask | torch.Size([seq_length]) | loss mask: the model only computes the loss on the target text | for each token, 1 if it belongs to the target, 0 otherwise |
| mode | str | data processing mode | |
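As a rough illustration of the three masking modes mentioned above (the spans are chosen randomly by the collate function, so this sketch is only schematic):

original text : tok1 tok2 tok3 tok4 tok5 tok6 tok7 tok8
bert mode     : tok1 tok2 [MASK] tok5 tok6 [MASK] tok8   # several short random spans are masked
sentence mode : sentence1. [MASK] sentence3.              # whole sentences are masked
gpt mode      : tok1 tok2 tok3 [MASK ..............]      # one long span is masked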

Data processing: fine-tuning for generation tasks

The code implementation is as follows:

import torch
from flagai.data.dataset import Seq2SeqDataset
from flagai.data.tokenizer import Tokenizer
from flagai.test_utils import Seq2SeqCollateArguments
from flagai.data.dataset import ConstructSeq2seqStrategy

# get default parameters
cl_args = Seq2SeqCollateArguments()

# create tokenizer
tokenizer = Tokenizer.from_pretrained("GLM-large-ch")
            
# Initially read and process the dataset
dataset = Seq2SeqDataset(task_name='cmrc',
                           data_dir='./datasets/',
                           dataset_type='train',
                           tokenizer=tokenizer)

# build collate function
collate_fn = ConstructSeq2seqStrategy(cl_args, tokenizer, task_name="cmrc")

# Create a loader
loader = torch.utils.data.DataLoader(dataset,
                                    batch_size=1,
                                    shuffle=False,
                                    num_workers=1,
                                    drop_last=False,
                                    pin_memory=False,
                                    collate_fn=collate_fn)

Supported Tasks

| Supported Tasks | Identifier | Language | Auto-download | Fully-tested |
|---|---|---|---|---|
| Reading Comprehension for Simplified Chinese (CMRC 2018) | cmrc | Chinese | | |
| The Winograd Schema Challenge | wsc | English | | |
| English Gigaword | gigaword | English | | |
| CNN/Daily Mail | cnn_dm | English | | |
| Lang-8 and HSK | lang8_hsk | Chinese | | |
| XSum | xsum | English | | |
| The Stanford Question Answering Dataset (SQuAD) | squad_generation | English | | |

Initially read and process the dataset

Currently, the CMRC2018 task is supported. CMRC is a reading comprehension task in which a series of questions must be answered based on a background text. An example of its data structure is as follows:

{'paragraphs': 
    [{'id': 'TRAIN_186', 
    'context': '范廷颂枢机(,),圣名保禄·若瑟(),是越南罗马天主教枢机。1963年被任
    为主教;1990年被擢升为天主教河内总教区宗座署理;1994年被擢升为总主教,同年年底被擢
    升为枢机;2009年2月离世。范廷颂于1919年6月15日在越南宁平省天主教发艳教区出生;童年
    时接受良好教育后,被一位越南神父带到河内继续其学业。范廷颂于1940年在河内大修道院完
    成神学学业。范廷颂于1949年6月6日在河内的主教座堂晋铎;及后被派到圣女小德兰孤儿院服
    务。1950年代,范廷颂在河内堂区创建移民接待中心以收容到河内避战的难民。1954年,法越
    战争结束,越南民主共和国建都河内,当时很多天主教神职人员逃至越南的南方,但范廷颂仍然
    留在河内。翌年管理圣若望小修院;惟在1960年因捍卫修院的自由、自治及拒绝政府在修院设
    政治课的要求而被捕。1963年4月5日,教宗任命范廷颂为天主教北宁教区主教,同年8月15日就
    任;其牧铭为「我信天主的爱」。由于范廷颂被越南政府软禁差不多30年,因此他无法到所属堂
    区进行牧灵工作而专注研读等工作。范廷颂除了面对战争、贫困、被当局迫害天主教会等问题外
    ,也秘密恢复修院、创建女修会团体等。1990年,教宗若望保禄二世在同年6月18日擢升范廷颂
    为天主教河内总教区宗座署理以填补该教区总主教的空缺。1994年3月23日,范廷颂被教宗若望
    保禄二...', 
    'qas': [{'question': '范廷颂是何时去世的?', 'id': 'TRAIN_186_QUERY_4', 
    'answers': [{'text': '范廷颂于2009年2月22日清晨在河内离世', 'answer_start': 759}]}]}], 
    'id': 'TRAIN_186', 'title': '范廷颂'}

To use it, set the task_name parameter to cmrc. The implementation process is similar to fine-tuning for classification tasks, and the dataset is eventually processed into the same unified structure. The corresponding code is as follows:

dataset = Seq2SeqDataset(task_name='cmrc', data_dir='./datasets/', 
                            dataset_type='train', tokenizer=tokenizer) 

Organize the data into input to the model

The code is shown below. As with the classification task, this step builds the template and the model input; the difference lies in how they are constructed.

collate_fn = ConstructSeq2seqStrategy(cl_args,
                                        tokenizer,
                                        task_name=task_name) 

1. Build a fill-in-the-blank template

Since this is a reading comprehension task, the template needs to reflect answering the specified reading comprehension question; refer to the following construction method.
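A plausible shape for such a template (the exact wording used by FlagAI may differ) is to concatenate the question with the passage and leave a blank for the answer, using the CMRC example above:

Question: 范廷颂是何时去世的? Passage: 范廷颂枢机(,),圣名保禄·若瑟()…… Answer: [MASK]

The model is then trained to generate the answer text in place of the blank.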

2. Tokenize and construct input samples

This step is similar to pre-training; the difference is that there is no mode key.