
Commit 25376ab: Update readme
kpu committed Nov 1, 2012 · 1 parent e9955a3
Showing 1 changed file (README.md) with 21 additions and 9 deletions.

Language model inference code by Kenneth Heafield (kenlm at kheafield.com)

I do development in master on https://github.com/kpu/kenlm/. Normally it works, but I do not guarantee it will compile, give correct answers, or generate non-broken binary files. For a more stable release, get http://kheafield.com/code/kenlm.tar.gz .

The website http://kheafield.com/code/kenlm/ has more documentation. If you're a decoder developer, please download the latest version from there instead of copying from another decoder.

Binary format via mmap is supported. Run `./build_binary` to make one, then pass the binary file name wherever the ARPA file name was used.
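As a sketch of the conversion step (file names here are illustrative, not fixed):

```shell
# Convert an ARPA language model to KenLM's mmapable binary format.
# 'text.arpa' and 'text.binary' are example file names.
./build_binary text.arpa text.binary
```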
## Platforms
`murmur_hash.cc` and `bit_packing.hh` perform unaligned reads and writes that make the code architecture-dependent.
It has been successfully tested on x86\_64, x86, and PPC64.
ARM support is reportedly working, at least on the iPhone.

Runs on Linux, OS X, Cygwin, and MinGW.

Hideo Okuma and Tomoyuki Yoshimura from NICT contributed ports to ARM and MinGW.

## Compile-time configuration
There are a number of macros you can set on the g++ command line or in `util/have.hh`.

- `KENLM_MAX_ORDER` is the maximum order that can be loaded. This is done to make state an efficient POD rather than a vector.
- `HAVE_BOOST` enables Boost-style hashing of `StringPiece`. This is only needed if you intend to hash `StringPiece` in your code.
- `HAVE_ICU` replaces the internal `StringPiece` with ICU's copy, avoiding naming conflicts when your code links against ICU.

ARPA files can be read in compressed format with these options:

- `HAVE_ZLIB` supports gzip; link with `-lz`. Enabled by default.
- `HAVE_BZLIB` supports bzip2; link with `-lbz2`.
- `HAVE_XZLIB` supports xz; link with `-llzma`.

Note that these macros impact only `read_compressed.cc` and `read_compressed_test.cc`. The bjam build system will auto-detect bzip2 and xz support.
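As an illustrative sketch only (not the project's official build command), a manual compile enabling some of these macros might look like:

```shell
# Illustrative: cap the order at 6 and enable gzipped ARPA reading,
# linking zlib as HAVE_ZLIB requires. 'main.cc' is a placeholder.
g++ -O3 -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -I. main.cc -lz -o main
```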

## Decoder developers
- I recommend copying the code and distributing it with your decoder. However, please send improvements upstream.

- It does not depend on Boost or ICU. If you use ICU, define `HAVE_ICU` in `util/have.hh` (uncomment the line) to avoid a name conflict. Defining `HAVE_BOOST` will let you hash `StringPiece`.
- Omit the lm/filter directory if you do not want the language model filter. Only that and tests depend on Boost.

- Most people have zlib. If you don't want to depend on that, comment out `#define HAVE_ZLIB` in `util/have.hh`. This will disable loading gzipped ARPA files.
- Select the macros you want, listed in the previous section.

- There are two build systems: `compile.sh` and Jamroot+Jamfile. They're pretty simple and are intended to be reimplemented in your build system.

- Use either the interface in `lm/model.hh` or `lm/virtual_interface.hh`. Interface documentation is in comments of `lm/virtual_interface.hh` and `lm/model.hh`.

- There are several possible data structures in `model.hh`. Use `RecognizeBinary` in `binary_format.hh` to determine which one a user has provided. You probably already implement feature functions as an abstract virtual base class with several children. I suggest you co-opt this existing virtual dispatch by templatizing the language model feature implementation on the KenLM model identified by `RecognizeBinary`. This is the strategy used in Moses and cdec.

- See `lm/config.hh` for run-time tuning options.

## Contributors
Contributions to KenLM are welcome. Please base your contributions on https://github.com/kpu/kenlm and send pull requests (or I might give you commit access). Downstream copies in Moses and cdec are maintained by overwriting them, so do not make changes there.

## Python module
Contributed by Victor Chahuneau.

### Installation

```bash
pip install -e git+https://github.com/kpu/kenlm.git#egg=kenlm
```

### Basic Usage
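The usage text itself is truncated in this view. As a minimal, hedged sketch (the class name and the ARPA file path are assumptions for illustration, not verified against this module's API):

```python
import kenlm  # the module installed by the pip command above

# 'lm.arpa' is an illustrative path to an ARPA-format language model.
model = kenlm.LanguageModel('lm.arpa')
print(model.score('this is a sentence .'))  # sentence log probability
```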
