This directory contains the available hooks (see the documentation), which extend OpenNMT's capabilities.
SentencePiece is a sentence-level tokenizer implemented by Taku Kudo.
To use SentencePiece, you need to install sentencepiece and the sentencepiece Lua rock; see here for detailed instructions.
Train a model using spm_train
and then use it as follows:
echo "It is a test-sample" | th tools/tokenize.lua -hook_file hooks/sentencepiece -sentencepiece myspmodel.model -mode aggressive -joiner_annotate
Note that SentencePiece can be combined with regular tokenization; in that case, you need to train the SentencePiece model on text that was tokenized the same way.
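The note above implies a two-step preparation: tokenize the training corpus with the regular tokenizer, then run spm_train on the result. The following is a minimal sketch that only builds the two shell commands as strings. The helper name `build_commands`, the file names, and the vocabulary size are hypothetical. The choice of tokenization options passed to the first step is an assumption and should match whatever you use at inference time.

```python
import shlex


def build_commands(corpus, tokenized, model_prefix,
                   tokenize_opts="-mode aggressive -joiner_annotate",
                   vocab_size=8000):
    """Return [tokenize_cmd, train_cmd] as shell command strings.

    Hypothetical helper: it runs nothing, it only formats the commands.
    """
    # Step 1: apply the regular tokenization alone (no hook) to the corpus.
    # tokenize_opts is an assumption; keep it identical to the options used
    # later together with -hook_file hooks/sentencepiece.
    tokenize_cmd = (
        f"th tools/tokenize.lua {tokenize_opts} "
        f"< {shlex.quote(corpus)} > {shlex.quote(tokenized)}"
    )
    # Step 2: train SentencePiece on the tokenized text.
    # spm_train writes <model_prefix>.model and <model_prefix>.vocab.
    train_cmd = (
        f"spm_train --input={shlex.quote(tokenized)} "
        f"--model_prefix={shlex.quote(model_prefix)} "
        f"--vocab_size={vocab_size}"
    )
    return [tokenize_cmd, train_cmd]


if __name__ == "__main__":
    for cmd in build_commands("train.txt", "train.tok", "myspmodel"):
        print(cmd)
```

The resulting myspmodel.model file is what the -sentencepiece option expects in the tokenization command shown earlier.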
A simple character-level tokenization hook.
echo "It is a test-sample" | th tools/tokenize.lua -hook_file hooks/chartokenization -mode char
I t ▁ i s ▁ a ▁ t e s t - s a m p l e
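To make the sample output above concrete, here is a small Python sketch that reproduces the same transformation: each character becomes its own token and each space is replaced by the "▁" placeholder. It is not the hook's Lua implementation, only an illustration. The function name `char_tokenize` is hypothetical, and treating only the plain space character as a separator is an assumption.

```python
SPACE_MARK = "\u2581"  # the placeholder character shown for spaces in the sample output


def char_tokenize(line: str) -> str:
    """Split a line into single characters, marking spaces with SPACE_MARK."""
    # Assumption: only the plain space character is replaced; tabs and other
    # whitespace are passed through unchanged in this sketch.
    return " ".join(SPACE_MARK if ch == " " else ch for ch in line)


if __name__ == "__main__":
    print(char_tokenize("It is a test-sample"))
    # prints: I t ▁ i s ▁ a ▁ t e s t - s a m p l e
```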
This hook interfaces with TreeTagger and provides POS and/or lemma annotation during tokenization.
First start TreeTagger as a REST service using the provided script:
python -u hooks/ -model /TREE-TAGGER-lib-dir/your-language.par -path /TREE-TAGGER-bin-dir/
This runs the REST service on localhost, port 3000. To see all available options, run:
python hooks/ -h
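Because tokenization with this hook depends on the service being up, it can be useful to check that something is listening on the port before starting a long run. The sketch below only tests whether a TCP connection can be opened. It makes no assumption about the service's REST endpoints, and the function name `service_is_listening` is hypothetical.

```python
import socket


def service_is_listening(host: str = "localhost", port: int = 3000,
                         timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("TreeTagger service reachable:", service_is_listening())
```

A True result only shows that the port accepts connections, not that the TreeTagger model loaded correctly.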
Once the REST server is running, you can use it during tokenization as follows:
th tools/tokenize.lua -hook_file hooks.tree-tagger -pos_feature < file