Tasks and datasets used in our experiments

Chinese word segmentation (CWS):

MSR dataset from the SIGHAN 2005 Chinese Word Segmentation Bakeoff.

Part-of-speech (POS) tagging:

CTB5 dataset with standard splits.

Named entity recognition (NER):

MSRA dataset from the International Chinese Language Processing Bakeoff 2006.

Document classification (DC):

THUCNews dataset, drawn from Sina News, with 10 evenly distributed classes.

Sentiment analysis (SA):

The ChnSentiCorp dataset, with 12,000 documents from three domains: book, computer, and hotel.

Sentence pair matching (SPM):

The LCQMC (Large-scale Chinese Question Matching Corpus) dataset, where each instance is a pair of sentences with a label indicating whether their intents match.

Natural language inference (NLI):

The Chinese portion of XNLI.
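
As an illustration of the sentence pair matching setup described above, here is a minimal Python sketch of how one instance (two sentences plus a match label) might be represented and parsed. The on-disk layout is an assumption, not something this README specifies: the sketch supposes one instance per line, tab-separated as sentence, sentence, label, with `1` meaning the intents match. The names `SentencePair` and `parse_pair_line` are hypothetical, and the example sentences are made up rather than taken from LCQMC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One sentence-pair instance: two sentences plus a binary match label."""
    sentence_a: str
    sentence_b: str
    matched: bool


def parse_pair_line(line: str) -> SentencePair:
    """Parse one line of a hypothetical tab-separated sentence-pair file.

    ASSUMPTION: the line looks like "sentence_a<TAB>sentence_b<TAB>label",
    where label is "1" (intents match) or "0" (they do not). Check the
    actual files before relying on this layout.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    sentence_a, sentence_b, label = fields
    if label not in {"0", "1"}:
        raise ValueError(f"unexpected label: {label!r}")
    return SentencePair(sentence_a, sentence_b, label == "1")


# Made-up example line, for illustration only.
example = parse_pair_line("我想订机票\t我要买机票\t1\n")
print(example.matched)
```

The same shape would carry over to NLI with a three-way label in place of the boolean, again assuming a comparable file layout.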