MSR dataset from the SIGHAN 2005 Chinese Word Segmentation Bakeoff.
CTB5 dataset with standard splits.
MSRA dataset from the International Chinese Language Processing Bakeoff 2006.
THUCNews dataset from Sina news with 10 evenly distributed classes.
The ChnSentiCorp dataset with 12,000 documents from three domains: book, computer, and hotel.
The LCQMC (Large-scale Chinese Question Matching Corpus) dataset, in which each instance is a pair of sentences with a label indicating whether their intents match.
The Chinese portion of the XNLI dataset.