Chinese Idiom Paraphrasing (CIP) aims to rephrase the idioms in an input sentence, generating a fluent, meaning-preserving sentence that contains no idioms.
This repository contains the CIP dataset and implementations of several approaches:
- LSTM approach
- Transformer approach
- mt5-seq2seq approach
- mt5-infill approach
- mt5-knowledge approach
- Python>=3.6
- torch>=1.7.1
- transformers==4.8.0
- fairseq==0.10.2
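The dependencies above can be installed with pip (versions taken from the list; the exact torch build you need depends on your CUDA setup):

```shell
# Install the pinned dependencies listed above; the torch build you need
# may vary with your CUDA version.
pip install "torch>=1.7.1" "transformers==4.8.0" "fairseq==0.10.2"
```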
You can download all pre-trained models here and put them into the model directory.
If you want to train models from scratch, you need to download the pre-trained language model mt5-base (from Hugging Face) and place it under the model directory.
To train the LSTM and Transformer models with fairseq, you first need to preprocess the data by tokenizing sentences with jieba and BPE. We use scripts from Subword-nmt:
git clone https://github.com/rsennrich/subword-nmt
Then run
sh prepare.sh
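As a rough illustration of the BPE step that prepare.sh applies after jieba segmentation (the merge table below is invented for the example; the real merges are learned from the training corpus by subword-nmt's learn_bpe):

```python
# Toy illustration of applying BPE merges to one word. The merge list here is
# invented for the example; subword-nmt learns the real merges from data.
def apply_bpe(word, merges):
    # Start from individual characters, as subword-nmt does.
    symbols = list(word)
    for a, b in merges:  # merges applied in priority order (simplified)
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    # subword-nmt marks every non-final subword with a trailing "@@".
    return "@@ ".join(symbols)

merges = [("l", "o"), ("lo", "w")]    # hypothetical learned merges
print(apply_bpe("lower", merges))     # -> low@@ e@@ r
```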
Train the LSTM, Transformer, mt5-seq2seq, mt5-infill, and mt5-knowledge models:
sh train_lstm.sh
sh train_transformer.sh
sh train_t5.sh
sh train_t5_fill.sh
sh train_t5_knowledge.sh
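How the mt5-infill training pairs might be built is sketched below. This is an assumption about the approach, not code from the scripts: the idea would be to replace the idiom with an mT5 sentinel token and train the model to generate the replacement.

```python
# Hypothetical construction of an mT5 infilling pair: mask the idiom with a
# sentinel token; the target restates the sentinel followed by the paraphrase.
# This is an assumed sketch of the mt5-infill setup, not the repo's exact code.
def make_infill_pair(sentence, idiom, paraphrase):
    source = sentence.replace(idiom, "<extra_id_0>")
    target = "<extra_id_0>" + paraphrase
    return source, target

src, tgt = make_infill_pair("他做事总是三心二意。", "三心二意", "不专心")
print(src)  # 他做事总是<extra_id_0>。
print(tgt)  # <extra_id_0>不专心
```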
Run the following commands to evaluate:
sh evaluate_base.sh
sh evaluate_t5.sh
sh evaluate_t5_knowledge.sh
sh evaluate_t5_fill.sh
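The evaluation scripts presumably compare system outputs against reference paraphrases with n-gram overlap metrics such as BLEU (an assumption; the scripts define the actual metrics). As a toy illustration, clipped unigram precision between a hypothesis and a reference:

```python
from collections import Counter

# Toy unigram (BLEU-1-style) precision with clipped counts; real evaluation
# would use a full BLEU implementation over the whole test set.
def unigram_precision(hyp_tokens, ref_tokens):
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    clipped = sum(min(c, ref[t]) for t, c in hyp.items())
    return clipped / max(1, sum(hyp.values()))

hyp = list("他做事总是不专心")  # character-level tokens for the example
ref = list("他做事不专心")
print(round(unigram_precision(hyp, ref), 3))  # -> 0.75
```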
@article{qiang2022chinese,
title={Chinese Idiom Paraphrasing},
author={Jipeng Qiang and Yang Li and Chaowei Zhang and Yun Li and YunHao Yuan and Yi Zhu and Xindong Wu},
journal={},
year={2022},
}