Efficient Low-Memory Aligner
This work is a fork of Robert Östling's eflomal with a few fixes and additional features:
- when building on Mac OS, remove `-lrt` from `LDFLAGS`
- add the `mkmodel.py` script for computing translation probabilities directly from a parallel corpus; it first computes an alignment using `eflomal`, then derives probabilities from it
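To illustrate the idea behind deriving translation probabilities from an alignment, here is a minimal sketch using plain relative-frequency counts over aligned word pairs. It is not the code of `mkmodel.py`: the function name `translation_probabilities` is hypothetical, and the sketch assumes the links are given one sentence per line as `i-j` pairs of zero-based source and target token indices.

```python
from collections import Counter, defaultdict


def translation_probabilities(src_sents, tgt_sents, link_lines):
    """Estimate p(target word | source word) by relative frequency.

    Assumption: each entry of link_lines holds space-separated "i-j" links,
    where i indexes the source tokens and j the target tokens (zero-based).
    """
    pair_counts = Counter()
    src_counts = Counter()
    for src, tgt, links in zip(src_sents, tgt_sents, link_lines):
        s_tok, t_tok = src.split(), tgt.split()
        for link in links.split():
            i, j = map(int, link.split("-"))
            pair_counts[(s_tok[i], t_tok[j])] += 1
            src_counts[s_tok[i]] += 1

    # Normalize the pair counts by how often each source word was aligned.
    probs = defaultdict(dict)
    for (s, t), count in pair_counts.items():
        probs[s][t] = count / src_counts[s]
    return dict(probs)


# Toy data, made up for illustration only.
src = ["the house", "the car"]
tgt = ["das haus", "das auto"]
links = ["0-0 1-1", "0-0 1-1"]
table = translation_probabilities(src, tgt, links)
```

Unaligned words simply receive no entry in the table; a real script may smooth or prune the estimates differently.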
`eflomal` is a word alignment tool based on efmaral, with the following main differences:
- More compact data structures are used, so memory requirements are much lower (by orders of magnitude).
- The estimation of alignment variable marginals is done one sentence at a time, which also saves a lot of memory at no detectable cost in accuracy.
Technical details relevant to both `efmaral` and `eflomal` can be found in the following article:
To compile and install the C binary and the Python bindings:
make
sudo make install
python3 setup.py install
Edit `Makefile` manually if you want to install somewhere other than the default `/usr/local/bin`. Note that the `align.py` script now uses the `eflomal` executable in the same directory as `align.py`, rather than the one in `$PATH`.
On Mac you will need to compile using `gcc`, because the system `clang` does not support OpenMP:
brew install gcc
export CC=/usr/local/bin/gcc-8
Change `CC` to match your settings if necessary. Then proceed to build and install normally.
There are three main ways of using `eflomal`:
- Directly call the `eflomal` binary. Note that this requires some preprocessing.
- Use the `align.py` command-line interface, which is partly compatible with that of `efmaral`. Run `python3 align.py --help` for instructions.
- Use the Cython module to call the `eflomal` binary; this takes care of the necessary preprocessing and file conversions. See the docstrings in `eflomal.pyx` for documentation.
In addition, there are convenience scripts for aligning and symmetrizing (with the `atools` program from `fast_align`) as well as evaluating with data from the WPT shared task datasets. These work the same way as in `efmaral`; please see its README for details.
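As a rough illustration of what symmetrization does, the sketch below merges a forward and a reverse alignment of the same sentence pair by intersection or union. It is not a replacement for `atools`, which also offers more elaborate heuristics such as grow-diag-final-and. The function names are hypothetical, and the sketch assumes both alignments use `i-j` links in the same source-target orientation.

```python
def parse_links(line):
    """Parse a line such as "0-0 1-2" into a set of (i, j) index pairs."""
    return {tuple(map(int, link.split("-"))) for link in line.split()}


def symmetrize(forward_line, reverse_line, method="intersect"):
    """Combine two alignments of one sentence pair.

    Assumption: both lines list links as source-target index pairs.
    """
    fwd, rev = parse_links(forward_line), parse_links(reverse_line)
    if method == "intersect":
        merged = fwd & rev   # keep only links both directions agree on
    elif method == "union":
        merged = fwd | rev   # keep every link proposed by either direction
    else:
        raise ValueError(f"unknown method: {method}")
    return " ".join(f"{i}-{j}" for i, j in sorted(merged))


inter = symmetrize("0-0 1-1 2-1", "0-0 1-1 2-2")
union = symmetrize("0-0 1-1 2-1", "0-0 1-1 2-2", method="union")
```

Intersection tends to favour precision and union recall; the heuristics in between trade the two off.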
The `align.py` interface expects one sentence per line with space-separated tokens, similar to most word alignment software.
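A quick sanity check of a parallel corpus in this format might look like the following sketch. The helper names are hypothetical and not part of this repository. Whether the aligner tolerates empty lines is not stated here, so the sketch only reports them instead of deciding what to do.

```python
def normalize(line):
    """Collapse runs of whitespace so tokens are separated by single spaces."""
    return " ".join(line.split())


def check_parallel_corpus(src_lines, tgt_lines):
    """Return a list of (line_number, problem) tuples; line numbers start at 1.

    A line count mismatch is reported with line number 0.
    """
    problems = []
    if len(src_lines) != len(tgt_lines):
        problems.append(
            (0, f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}")
        )
    for n, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        if not src.split():
            problems.append((n, "empty source sentence"))
        if not tgt.split():
            problems.append((n, "empty target sentence"))
    return problems


# Toy data, made up for illustration only.
src_lines = ["the  house\n", "\n"]
tgt_lines = ["das haus\n", "auto\n"]
problems = check_parallel_corpus(src_lines, tgt_lines)
cleaned = [normalize(line) for line in src_lines]
```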
This is a comparison between eflomal, efmaral and fast_align.
The difference between efmaral and eflomal is in part due to different default parameters, in particular the number of iterations and the number of independent samplers.
Note that all timing figures below include alignments in both directions (run in parallel) and symmetrization.

eflomal:

Languages | Sentences | AER | CPU time (s) | Real time (s) |
---|---|---|---|---|
English-French | 1,130,551 | 0.081 | 1,232 | 337 |
English-Inuktitut | 340,601 | 0.203 | 161 | 44 |
Romanian-English | 48,681 | 0.298 | 159 | 33 |
English-Hindi | 3,530 | 0.467 | 31 | 6 |

efmaral:

Languages | Sentences | AER | CPU time (s) | Real time (s) |
---|---|---|---|---|
English-Swedish | 1,862,426 | 0.133 | 1,719 | 620 |
English-French | 1,130,551 | 0.085 | 763 | 279 |
English-Inuktitut | 340,601 | 0.235 | 122 | 46 |
Romanian-English | 48,681 | 0.287 | 161 | 46 |
English-Hindi | 3,530 | 0.483 | 98 | 10 |

fast_align:

Languages | Sentences | AER | CPU time (s) | Real time (s) |
---|---|---|---|---|
English-Swedish | 1,862,426 | 0.205 | 11,090 | 672 |
English-French | 1,130,551 | 0.153 | 3,840 | 241 |
English-Inuktitut | 340,601 | 0.287 | 477 | 47 |
Romanian-English | 48,681 | 0.325 | 208 | 17 |
English-Hindi | 3,530 | 0.672 | 24 | 2 |
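
For readers unfamiliar with the AER column, the following sketch computes the alignment error rate in its usual definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the sure gold links and P the possible gold links (taken to include S). This is an illustration with a hypothetical function name, not the evaluation script used to produce the figures above.

```python
def alignment_error_rate(predicted, sure, possible):
    """Compute AER from sets of (i, j) links; lower is better.

    Assumption: sure links also count as possible links, so the two
    gold sets are merged before comparing.
    """
    possible = possible | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing predicted and nothing expected
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Toy data, made up for illustration only.
predicted = {(0, 0), (1, 1), (2, 2)}
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
aer = alignment_error_rate(predicted, sure, possible)
```

In this toy case two of the three predicted links are sure links and the third matches nothing, so the result is 1 - 4/5. For a whole test set, the counts are normally summed over all sentence pairs before the ratio is taken.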