Machine translation systems make many assumptions, missing a couple words that they could have translated - if only they were a little bit more lenient.
This is a soft dictionary lookup solution that can post-process machine translation output to translate some of the words that were not translated. This works by splitting OOVs (out-of-vocabulary words, unknown words) into multiple parts and then performing a fuzzy lookup in the dictionary for all these parts, yielding english compounds, scoring them according to some handcrafted features and outputing the best option, as shown in these pictures from explanatory slides:
Fun bonus: a glosser using this "guessing" architecture.
- Python 3, must be executable/in path as
python3
- morfessor 2.0, must be executable/in path as
morfessor
- the AGILE tokenizer has to be cloned into
tools/agile_tokenizer
Your "static data" directory must contain the following files with these exact names (you can specify $LEX
yourself):
description | filename |
---|---|
lexicon (3-col TSV: source [ignored] target ) |
${LEX} |
leidos-unigrams (i.e. output of `sort | uniq -c |
Ulf's grammar (uig-specific, sorry) | grammar.uig-v04.txt |
Ulf's pertainym list | english.pertainyms.txt |
binary Morfessor model | binary-baseline-model |
ELISA packages (used for ELISA package output) | package/elisa.*-eng.${set}.y?r?.*.xml.gz |
bleu.pl
for BLEU calculation (will be safely skipped, if not on HPC)packagesbmt.sh
(will be safely skipped, if not on HPC)CALL_{SYSTEM,SET,SETPART}
are set to useqsub -q isi
inrun.system.sh
(replace by e.g.bash
if not on HPC, seerun.singlefile.sh
)
bash run.system.sh SBMTSYSTEMPATH SETS STATICDATAPATH LEXICONNAME REFERENCES
, e.g.:
/home/nlg-05/sjm_445/pyguess/run.system.sh \
/home/nlg-02/pust/elisa-trial/isi-sbmt-v5-uzb \
"dev test syscomb domain domain2 eval" \
/home/nlg-05/sjm_445/uyghur/on_top_of/__staticdata \
guessing_input_lexicon.v14 \
/home/nlg-05/sjm_445/uyghur/elisa.il3.package.y1r2.v1
Run in the directory where you want to store all resulting data. The script will create a folder with the system name (isi-sbmt-v5-uzb
in this case) and store all results (finished ELISA packages and guess-dictionaries for each set) in a outputdata
subfolder.
The references directory is expected to hold all elisa-...set...xml.gz
files from the package. If it does not yet contain the xtract
ed versions of all files, these will be created, so make sure you're allowed to write there!
bash run.singlefile.sh INFILE STATICDATAPATH LEXICONNAME REFERENCES
, e.g.:
/home/nlg-05/sjm_445/pyguess/run.system.sh \
some.oovs.txt \
/home/nlg-05/sjm_445/uyghur/on_top_of/__staticdata \
guessing_input_lexicon.v14
/home/nlg-05/sjm_445/uyghur/elisa.il3.package.y1r2.v1
Run anywhere. Will create $INFILE.guessed.1best.hyp
and $INFILE.guessed.nbest.json
(in the same folder as $INFILE
).
Behind the scenes the script will make a "set" out of the file, create a folder in the $TMPDIR
match the whole set and then continue guessing the set just like run.system.sh
, so if you have expensive matching to do, consider wrapping the call in a qsub
-script yourself.
run.{singlefile,system}.sh
source run.functions.sh
. This is where preprocessing, matching, guessing and postprocessing calls are defined. The python programs are controlled using a JSON config file pyguess.config
. This file is generated by all the shell scripts on each invocation using the given parameters.
For each set individually:
tools/extract_tokens.py
extracts an OOV list from the SBMT output (obtained by runningnbest2json
inrun.system.sh
).- All pipes (
|
) in that list are replaced with slashes to avoid clashes with morfessor-flatcat... which is not yet included, but better be prepared. - Morfessor segments that input.
guess_phrases.py
generates a bunch of phrase parts.
For all these phraseparts together:
- Pipe all phraseparts (types) into
guess_matching.py
, this generates one bigallmatches
dictionary
For each set individually:
thirdeye.py
calculates 1best and nbest translations for the OOV list using the (now shared)allmatches
dictionary
For each set individually:
tools/rejoin_oovs.py
re-inserts the 1best translations of the OOVs into the SBMT output- This file is detokenized, BLEU scores are calculated and it is packaged into a nice ELISA package.
- Many things are probably super inefficient.
- My parallelism (create lots of scripts and jobs) works, but...
- Storing all matches in one big
allmatches
file per system is a giant race condition disaster waiting to happen. Therun.singlefile.sh
script circumvents that problem by not sharing anything and doing all calculations in a temporary directory.