Translate blindly copied OOVs into reasonable English.

sjmielke/secondguess

secondguess

Machine translation systems make many assumptions and therefore miss a couple of words that they could have translated, if only they were a little bit more lenient.

This is a soft dictionary lookup solution that post-processes machine translation output to translate some of the words that were left untranslated. It works by splitting OOVs (out-of-vocabulary words, unknown words) into multiple parts, performing a fuzzy lookup in the dictionary for each of these parts, combining the results into English compounds, scoring those compounds according to some handcrafted features, and outputting the best option, as shown in these pictures from explanatory slides:

architecture views
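
The split, fuzzy lookup, compound and score idea described above can be illustrated with a small sketch. This is not the code in this repository: the toy lexicon, the restriction to two-way splits, the use of `difflib` for fuzzy matching and the mean-similarity score are all assumptions made for illustration.

```python
import difflib

# Hypothetical toy lexicon (source word -> English); not the project's real data.
LEXICON = {"kitab": "book", "xana": "house", "su": "water"}


def fuzzy_lookup(part, cutoff=0.6):
    """Return (english, similarity) pairs for lexicon entries close to `part`."""
    hits = difflib.get_close_matches(part, list(LEXICON), n=3, cutoff=cutoff)
    return [(LEXICON[h], difflib.SequenceMatcher(None, part, h).ratio()) for h in hits]


def guess(oov):
    """Try the whole word and every two-way split; return candidates, best first."""
    candidates = list(fuzzy_lookup(oov))
    for i in range(1, len(oov)):
        for left, left_score in fuzzy_lookup(oov[:i]):
            for right, right_score in fuzzy_lookup(oov[i:]):
                # Assumed score: mean similarity of the two parts.
                candidates.append((f"{left} {right}", (left_score + right_score) / 2))
    return sorted(candidates, key=lambda c: -c[1])


best = guess("kitabxana")[0]
print(best)
```

The real system scores compounds with handcrafted features rather than a single similarity value; the sketch only shows the overall shape of the search.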

Fun bonus: a glosser using this "guessing" architecture.

glosser screenshot

Installation

Dependencies

  • Python 3, must be executable/in path as python3
  • morfessor 2.0, must be executable/in path as morfessor
  • the AGILE tokenizer has to be cloned into tools/agile_tokenizer

Data

Your "static data" directory must contain the following files with these exact names (you can specify $LEX yourself):

description                                       filename
lexicon (3-col TSV: source [ignored] target)      ${LEX}
leidos-unigrams (i.e. output of `sort uniq -c
Ulf's grammar (uig-specific, sorry)               grammar.uig-v04.txt
Ulf's pertainym list                              english.pertainyms.txt
binary Morfessor model                            binary-baseline-model
ELISA packages (used for ELISA package output)    package/elisa.*-eng.${set}.y?r?.*.xml.gz
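
As a rough illustration of the lexicon format listed above, the following sketch reads a 3-column TSV and ignores the middle column. The function name, the choice to collect several targets per source word and the sample rows are assumptions, not the project's actual loader.

```python
import csv
import io


def load_lexicon(handle):
    """Read a 3-column TSV (source, ignored, target) into {source: [targets]}."""
    lexicon = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip malformed or empty lines
        source, _ignored, target = row[0], row[1], row[2]
        lexicon.setdefault(source, []).append(target)
    return lexicon


# Made-up sample rows standing in for the ${LEX} file.
sample = io.StringIO("kitab\tN\tbook\nkitab\tN\tscripture\nsu\tN\twater\n")
lexicon = load_lexicon(sample)
print(lexicon)
```

With a real file, `open(path, encoding="utf-8", newline="")` would take the place of the `io.StringIO` object.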

Things that depend on the ISI HPC cluster:

  • bleu.pl for BLEU calculation (will be safely skipped, if not on HPC)
  • packagesbmt.sh (will be safely skipped, if not on HPC)
  • CALL_{SYSTEM,SET,SETPART} are set to use qsub -q isi in run.system.sh (replace by e.g. bash if not on HPC, see run.singlefile.sh)

Usage examples

run.system.sh

bash run.system.sh SBMTSYSTEMPATH SETS STATICDATAPATH LEXICONNAME REFERENCES, e.g.:

/home/nlg-05/sjm_445/pyguess/run.system.sh \
	/home/nlg-02/pust/elisa-trial/isi-sbmt-v5-uzb \
	"dev test syscomb domain domain2 eval" \
	/home/nlg-05/sjm_445/uyghur/on_top_of/__staticdata \
	guessing_input_lexicon.v14 \
	/home/nlg-05/sjm_445/uyghur/elisa.il3.package.y1r2.v1

Run in the directory where you want to store all resulting data. The script will create a folder with the system name (isi-sbmt-v5-uzb in this case) and store all results (finished ELISA packages and guess-dictionaries for each set) in an outputdata subfolder.

The references directory is expected to hold all elisa-...set...xml.gz files from the package. If it does not yet contain the xtracted versions of all files, these will be created, so make sure you're allowed to write there!

run.singlefile.sh

bash run.singlefile.sh INFILE STATICDATAPATH LEXICONNAME REFERENCES, e.g.:

/home/nlg-05/sjm_445/pyguess/run.singlefile.sh \
	some.oovs.txt \
	/home/nlg-05/sjm_445/uyghur/on_top_of/__staticdata \
	guessing_input_lexicon.v14 \
	/home/nlg-05/sjm_445/uyghur/elisa.il3.package.y1r2.v1

Run anywhere. Will create $INFILE.guessed.1best.hyp and $INFILE.guessed.nbest.json (in the same folder as $INFILE).

Behind the scenes the script will make a "set" out of the file, create a folder in $TMPDIR, match the whole set there, and then continue guessing the set just like run.system.sh does. If you have expensive matching to do, consider wrapping the call in a qsub script yourself.

Architecture

run.{singlefile,system}.sh source run.functions.sh. This is where preprocessing, matching, guessing and postprocessing calls are defined. The python programs are controlled using a JSON config file pyguess.config. This file is generated by all the shell scripts on each invocation using the given parameters.
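
To make the config handoff concrete, here is a minimal sketch of writing and reading a JSON file named pyguess.config. The key names `static_data_path` and `lexicon_name` are invented for illustration; the real keys are whatever the shell scripts actually write.

```python
import json
import os
import tempfile


def write_config(path, static_data, lexicon_name):
    """Write a JSON config file; both key names are hypothetical."""
    config = {
        "static_data_path": static_data,  # hypothetical key
        "lexicon_name": lexicon_name,     # hypothetical key
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)


def read_config(path):
    """Load the JSON config back into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Write into a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "pyguess.config")
    write_config(config_path, "/path/to/staticdata", "guessing_input_lexicon.v14")
    config = read_config(config_path)

print(config["lexicon_name"])
```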

Preprocessing

For each set individually:

  • tools/extract_tokens.py extracts an OOV list from the SBMT output (obtained by running nbest2json in run.system.sh).
  • All pipes (|) in that list are replaced with slashes to avoid clashes with morfessor-flatcat... which is not yet included, but better be prepared.
  • Morfessor segments that input.
  • guess_phrases.py generates a bunch of phrase parts.
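
Two of the preprocessing steps above can be sketched briefly: the pipe replacement, and one plausible way of turning a segmentation into phrase parts. The definition of a phrase part as any contiguous run of morphs is an assumption, as is the example segmentation; guess_phrases.py may do something different.

```python
def sanitize(oov):
    """Replace pipes with slashes, as described in the preprocessing step."""
    return oov.replace("|", "/")


def phrase_parts(morphs):
    """Join every contiguous run of morphs into a string (assumed definition)."""
    parts = []
    for start in range(len(morphs)):
        for end in range(start + 1, len(morphs) + 1):
            parts.append("".join(morphs[start:end]))
    return parts


oovs = ["a|b", "kitabxana"]
clean = [sanitize(w) for w in oovs]

# Hypothetical segmentation, standing in for Morfessor's output.
parts = phrase_parts(["kitab", "xana", "lar"])
print(clean, parts)
```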

Matching

For all these phraseparts together:

  • Pipe all phraseparts (types) into guess_matching.py, this generates one big allmatches dictionary
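
The point of matching all sets together is that each distinct phrase part (type) is looked up only once. A minimal sketch of that pooling, with a hypothetical `toy_matcher` standing in for the expensive fuzzy lookup done by guess_matching.py:

```python
def match_all(parts_per_set, matcher):
    """Match every distinct phrase part once and return one shared dictionary."""
    types = set()
    for parts in parts_per_set.values():
        types.update(parts)
    return {part: matcher(part) for part in sorted(types)}


calls = []


def toy_matcher(part):
    """Hypothetical stand-in for an expensive fuzzy dictionary lookup."""
    calls.append(part)
    return [part.upper()]


# "xana" occurs in both sets but is matched only once.
parts_per_set = {"dev": ["kitab", "xana"], "test": ["xana", "su"]}
allmatches = match_all(parts_per_set, toy_matcher)
print(len(calls), allmatches)
```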

Guessing

For each set individually:

  • thirdeye.py calculates 1best and nbest translations for the OOV list using the (now shared) allmatches dictionary
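
Picking a 1best and an nbest list from scored candidates could look roughly like this. The `(translation, score)` pair layout and the example scores are assumptions; thirdeye.py's real data structures and features are not shown here.

```python
import heapq


def rank(candidates, n=3):
    """Return (1best translation, nbest list) from (translation, score) pairs."""
    nbest = heapq.nlargest(n, candidates, key=lambda c: c[1])
    one_best = nbest[0][0] if nbest else None
    return one_best, nbest


# Hypothetical scored candidates for a single OOV.
candidates = [("book", 0.7), ("book house", 0.9), ("water", 0.1), ("house", 0.6)]
one_best, nbest = rank(candidates)
print(one_best, nbest)
```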

Postprocessing

For each set individually:

  • tools/rejoin_oovs.py re-inserts the 1best translations of the OOVs into the SBMT output
  • This file is detokenized, BLEU scores are calculated and it is packaged into a nice ELISA package.
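
Re-inserting the guesses amounts to a token-level substitution. The sketch below assumes whitespace-tokenized output and a plain dict of 1best guesses; tools/rejoin_oovs.py may handle more cases than this.

```python
def rejoin(line, one_best):
    """Replace each OOV token in a tokenized line with its 1best guess."""
    return " ".join(one_best.get(tok, tok) for tok in line.split())


# Hypothetical 1best guesses for the OOVs of one set.
guesses = {"kitabxana": "book house"}
output = rejoin("he went to the kitabxana .", guesses)
print(output)
```

Tokens without a guess pass through unchanged, so the function is safe to run over lines that contain no OOVs.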

Notes and TODOs

  • Many things are probably super inefficient.
  • My parallelism (create lots of scripts and jobs) works, but...
  • Storing all matches in one big allmatches file per system is a giant race condition disaster waiting to happen. The run.singlefile.sh script circumvents that problem by not sharing anything and doing all calculations in a temporary directory.
