WMSeg

This is the implementation of "Improving Chinese Word Segmentation with Wordhood Memory Networks", published at ACL 2020.

Please contact us at [email protected] or [email protected] if you have any questions.

Citation

If you use or extend our work, please cite our paper at ACL2020.

@inproceedings{tian-etal-2020-improving,
    title = "Improving Chinese Word Segmentation with Wordhood Memory Networks",
    author = "Tian, Yuanhe and Song, Yan and Xia, Fei and Zhang, Tong and Wang, Yonggang",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    pages = "8274--8285",
}

Requirements

Our code works with the following environment.

python=3.6
pytorch=1.1
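
A quick way to confirm that an environment matches these versions is sketched below. The helper names `matches_version` and `check_environment` are ours, not part of WMSeg, and the check assumes that comparing only the major and minor version components is enough.

```python
import sys


def matches_version(actual, expected):
    """Return True if `actual` starts with the dotted components of `expected`.

    For example, "1.1.0" matches "1.1", while "1.10.0" does not.
    """
    # Drop local build suffixes such as "+cu100" before comparing.
    actual_parts = actual.split("+")[0].split(".")
    expected_parts = expected.split(".")
    return actual_parts[:len(expected_parts)] == expected_parts


def check_environment():
    """Report whether the interpreter and PyTorch match the versions listed above."""
    python_version = "{}.{}.{}".format(*sys.version_info[:3])
    status = "ok" if matches_version(python_version, "3.6") else "differs from 3.6"
    print("python", python_version, status)
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        print("pytorch is not installed")
        return
    status = "ok" if matches_version(torch.__version__, "1.1") else "differs from 1.1"
    print("pytorch", torch.__version__, status)


if __name__ == "__main__":
    check_environment()
```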

Downloading BERT, ZEN and WMSeg

In our paper, we use BERT (paper) and ZEN (paper) as encoders.

For BERT, please download the pre-trained BERT-Base Chinese model from Google or from HuggingFace. If you download it from Google, you need to convert the model from the TensorFlow version to the PyTorch version.

For ZEN, you can download the pre-trained model from here.

For WMSeg, you can download the models we trained in our experiments from here.

Run on Sample Data

Run run_sample.sh to train a model on the small sample data under the sample_data directory.

Datasets

We use SIGHAN2005 and CTB6 in our paper.

To obtain and pre-process the data, go to the data_preprocessing directory and run getdata.sh. This script downloads and processes the official data from SIGHAN2005. For CTB6, you need to obtain the official data first and then put the LDC07T36 folder under the data_preprocessing directory.

All processed data will appear in the data directory.

Training and Testing

You can find the command lines to train and test models on a specific dataset in run.sh.

Here are some important parameters:

--do_train: train the model.
--do_test: test the model.
--use_bert: use BERT as encoder.
--use_zen: use ZEN as encoder.
--bert_model: the directory of the pre-trained BERT/ZEN model.
--use_memory: use key-value memory networks.
--decoder: use crf or softmax as the decoder.
--ngram_flag: use av, dlg, or pmi to construct the lexicon N.
--av_threshold: when using av to construct the lexicon N, n-grams whose AV score is lower than the threshold will be excluded from the lexicon N.
--ngram_num_threshold: n-grams whose frequency is lower than the threshold will be excluded from the lexicon N. Note that when the threshold is set to 1, no n-gram is filtered out by its frequency. We therefore do NOT recommend using 1 as the n-gram frequency threshold.
--model_name: the name of the model to save.
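
To make the two thresholds concrete, the sketch below builds a toy lexicon from raw sentences. It is not the code in this repository: the function names, the maximum n-gram length, and the AV definition used here (the smaller of the numbers of distinct left and right neighbouring characters, with sentence boundaries ignored) are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def count_ngrams(sentences, max_len=3):
    """Count every character n-gram (1 <= n <= max_len) and record its neighbours."""
    freq = Counter()
    left = defaultdict(set)   # n-gram -> distinct characters seen right before it
    right = defaultdict(set)  # n-gram -> distinct characters seen right after it
    for sent in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                freq[gram] += 1
                if i > 0:
                    left[gram].add(sent[i - 1])
                if i + n < len(sent):
                    right[gram].add(sent[i + n])
    return freq, left, right


def build_lexicon(sentences, ngram_num_threshold=2, av_threshold=2, max_len=3):
    """Keep n-grams whose frequency and AV score both reach their thresholds."""
    freq, left, right = count_ngrams(sentences, max_len)
    lexicon = {}
    for gram, count in freq.items():
        if count < ngram_num_threshold:
            continue  # frequency lower than the threshold: excluded
        av = min(len(left[gram]), len(right[gram]))
        if av < av_threshold:
            continue  # AV score lower than the threshold: excluded
        lexicon[gram] = (count, av)
    return lexicon


toy_corpus = ["xaby", "zabw", "qab"]
print(build_lexicon(toy_corpus, ngram_num_threshold=2, av_threshold=2))
```

On this toy corpus only `ab` survives: it occurs three times and has at least two distinct neighbours on each side. Calling `build_lexicon` with `ngram_num_threshold=1` and `av_threshold=0` keeps every n-gram, which illustrates why a frequency threshold of 1 filters nothing.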

Predicting

run_sample.sh contains the command line to segment the sentences in an input file (./sample_data/sentence.txt).

Here are some important parameters:

--do_predict: segment the sentences using a pre-trained WMSeg model.
--input_file: the file containing the sentences to be segmented. Each line contains one sentence; you can refer to a sample input file for the input format.
--output_file: the path of the output file. Words are separated by a space.
--eval_model: the pre-trained WMSeg model to be used to segment the sentences in the input file.
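
The flags above can be combined as in the following sketch, which only assembles and prints a command and then shows how the space-separated output could be read back. It rests on assumptions: that `wmseg_main.py` is the entry point, that `--do_predict` is a switch without a value, and that the output and model paths shown (both hypothetical) exist. The command in run_sample.sh remains the reference.

```python
import shlex


def build_predict_command(input_file, output_file, eval_model, script="wmseg_main.py"):
    """Assemble the argument list for a prediction run (nothing is executed here)."""
    return [
        "python", script,
        "--do_predict",
        "--input_file", input_file,
        "--output_file", output_file,
        "--eval_model", eval_model,
    ]


def read_segmented(lines):
    """Turn output lines (words separated by a space) into lists of words."""
    return [line.split() for line in lines if line.strip()]


cmd = build_predict_command(
    "./sample_data/sentence.txt",
    "./sample_data/sentence.txt.seg",  # hypothetical output path
    "./models/my_model/model.pt",      # hypothetical model path
)
# Print a shell-safe version of the command for copying into a terminal.
print(" ".join(shlex.quote(part) for part in cmd))
```

Once a real run has produced the output file, `read_segmented(open(path, encoding="utf-8"))` yields one list of words per input sentence.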

To-do List

Release a toolkit using WMSeg with necessary APIs

If you want us to implement any functions, you can leave a comment in the Issues section.

You can check our updates at updates.md.
