Commit
Documenting log10, query output format. Fixes kpu#38.
kpu committed Feb 22, 2016
1 parent c716841 commit d3b212f
Showing 2 changed files with 8 additions and 1 deletion.
2 changes: 2 additions & 0 deletions README.md
@@ -49,6 +49,8 @@ and see http://kheafield.com/code/kenlm/filter/ for more documentation.

Two data structures are supported: probing and trie. Probing is a probing hash table with keys that are 64-bit hashes of n-grams and floats as values. Trie is a fairly standard trie but with bit-level packing so it uses the minimum number of bits to store word indices and pointers. The trie node entries are sorted by word index. Probing is the fastest and uses the most memory. Trie uses the least memory and is a bit slower.
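
To make the probing layout concrete, here is a minimal sketch of a linear-probing table that maps a 64-bit key to a float. It is not KenLM's implementation: the class name `ProbingSketch`, the use of key 0 as the empty-slot marker, the modulo slot choice, and the fixed capacity are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of a probing hash table; not KenLM's actual code.
class ProbingSketch {
 public:
  // capacity is fixed; at least one slot is always kept empty so lookups terminate.
  explicit ProbingSketch(std::size_t capacity) : slots_(capacity), used_(0) {
    assert(capacity > 1);
  }

  // Insert a new key or overwrite an existing one.
  // Assumes key != 0, because 0 marks an empty slot in this sketch.
  void Insert(std::uint64_t key, float value) {
    assert(key != 0);
    std::size_t i = key % slots_.size();
    // Walk forward until we hit the key itself or an empty slot.
    while (slots_[i].key != 0 && slots_[i].key != key) {
      i = (i + 1) % slots_.size();
    }
    if (slots_[i].key == 0) {
      assert(used_ + 1 < slots_.size());  // keep one slot free
      ++used_;
    }
    slots_[i].key = key;
    slots_[i].value = value;
  }

  // Returns true and fills value if key is present.
  bool Find(std::uint64_t key, float &value) const {
    std::size_t i = key % slots_.size();
    while (slots_[i].key != 0) {
      if (slots_[i].key == key) {
        value = slots_[i].value;
        return true;
      }
      i = (i + 1) % slots_.size();
    }
    return false;  // reached an empty slot: key is absent
  }

 private:
  struct Slot {
    std::uint64_t key = 0;
    float value = 0.0f;
  };
  std::vector<Slot> slots_;
  std::size_t used_;
};
```

A lookup touches one slot plus any collisions that follow it, which is where the speed comes from; the spare capacity needed to keep collision runs short is one reason such a table costs more memory than a packed trie.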

As is the custom in language modeling, all probabilities are log base 10.
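
Because scores are log base 10, converting them for other tools takes a little arithmetic. The helpers below are a sketch with hypothetical names (`Log10ToLn`, `Log10ToProb`, `Perplexity`); they are not part of KenLM. Whether the token count passed to `Perplexity` includes `</s>` is a convention the caller has to choose.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helpers for working with log10 scores; not part of KenLM.

// Convert a log10 probability to a natural-log probability.
double Log10ToLn(double log10_prob) {
  return log10_prob * std::log(10.0);
}

// Convert a log10 probability to a plain probability.
double Log10ToProb(double log10_prob) {
  return std::pow(10.0, log10_prob);
}

// Perplexity from a summed log10 probability and the number of scored tokens.
double Perplexity(double total_log10_prob, std::size_t token_count) {
  assert(token_count > 0);
  return std::pow(10.0, -total_log10_prob / static_cast<double>(token_count));
}
```

For example, a total score of -6 over 3 tokens is an average of -2 per token, which corresponds to a perplexity of 10^2 = 100.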

With trie, resident memory is 58% of IRST's smallest version and 21% of SRI's compact version. Simultaneously, trie's CPU use is 81% of IRST's fastest version and 84% of SRI's fast version. KenLM's probing hash table implementation goes even faster at the expense of using more memory. See http://kheafield.com/code/kenlm/benchmark/.

Binary format via mmap is supported. Run `./build_binary` to make one then pass the binary file name to the appropriate Model constructor.
7 changes: 6 additions & 1 deletion lm/query_main.cc
@@ -15,7 +15,12 @@ void Usage(const char *name) {
"-n: Do not wrap the input in <s> and </s>.\n"
"-v summary|sentence|word: Level of verbosity\n"
"-l lazy|populate|read|parallel: Load lazily, with populate, or malloc+read\n"
"The default loading method is populate on Linux and read on others.\n";
"The default loading method is populate on Linux and read on others.\n\n"
"Each word in the output is formatted as:\n"
" word=vocab_id ngram_length log10(p(word|context))\n"
"where ngram_length is the length of n-gram matched. A vocab_id of 0 indicates\n"
"the unknown word. Sentence-level output includes log10 probability of\n"
"the sentence and OOV count.\n";
exit(1);
}
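
The per-word format described in the usage text can be read back with a small parser. This is a sketch only: `WordRecord` and `ParseWordRecord` are hypothetical names, and it assumes one record has already been isolated as a string with its three fields separated by whitespace. The word is split from its id at the last `=`, since a word may itself contain `=`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical holder for one "word=vocab_id ngram_length log10prob" record.
struct WordRecord {
  std::string word;
  unsigned long vocab_id;
  unsigned ngram_length;
  float log10_prob;
};

// Parse a single record. Returns false if the text does not match the format;
// in that case the contents of out are unspecified.
bool ParseWordRecord(const std::string &record, WordRecord &out) {
  std::istringstream in(record);
  std::string head;  // the "word=vocab_id" part
  if (!(in >> head >> out.ngram_length >> out.log10_prob)) return false;

  // Split at the last '=' so words containing '=' stay intact.
  const std::size_t eq = head.rfind('=');
  if (eq == std::string::npos || eq + 1 == head.size()) return false;

  std::istringstream id_in(head.substr(eq + 1));
  if (!(id_in >> out.vocab_id)) return false;

  out.word = head.substr(0, eq);
  return true;
}
```

A record whose `vocab_id` parses as 0 is the unknown word, so counting those records gives the OOV count for a sentence.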
