This directory contains a collection of simple adaptive language models that are cheap enough, memory- and processor-wise, to train in a browser on the fly.
Prediction by Partial Matching (PPM) character language model, described in the following publications (a short illustrative sketch follows the list):
- Cleary, John G. and Witten, Ian H. (1984): “Data Compression Using Adaptive Coding and Partial String Matching”, IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402.
- Moffat, Alistair (1990): “Implementing the PPM data compression scheme”, IEEE Transactions on Communications, vol. 38, no. 11, pp. 1917–1921.
- Kneser, Reinhard and Ney, Hermann (1995): “Improved backing-off for M-gram language modeling”, Proc. of Acoustics, Speech, and Signal Processing (ICASSP), May, pp. 181–184. IEEE.
- Chen, Stanley F. and Goodman, Joshua (1999): “An empirical study of smoothing techniques for language modeling”, Computer Speech & Language, vol. 13, no. 4, pp. 359–394, Elsevier.
- Ward, David J. and Blackwell, Alan F. and MacKay, David J. C. (2000): “Dasher – A Data Entry Interface Using Continuous Gestures and Language Models”, UIST '00 Proceedings of the 13th annual ACM symposium on User interface software and technology, pp. 129–137, November, San Diego, USA.
- Drinic, Milenko and Kirovski, Darko and Potkonjak, Miodrag (2003): “PPM Model Cleaning”, Proc. of Data Compression Conference (DCC'2003), pp. 163–172, March, Snowbird, UT, USA. IEEE.
- Huang, Jin Hu and Powers, David (2004): “Adaptive Compression-based Approach for Chinese Pinyin Input”, Proceedings of the Third SIGHAN Workshop on Chinese Language Processing, pp. 24–27, Barcelona, Spain. ACL.
- Cowans, Phil (2005): “Language Modelling In Dasher – A Tutorial”, June, Inference Lab, Cambridge University (presentation).
- Steinruecken, Christian and Ghahramani, Zoubin and MacKay, David (2015): “Improving PPM with dynamic parameter updates”, Proc. of Data Compression Conference (DCC'2015), pp. 193–202, April, Snowbird, UT, USA. IEEE.
- Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.
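To make the backoff-with-escape mechanism concrete, here is a minimal, self-contained sketch of a PPM-style predictor. It is not the implementation or API of this directory; the class and method names are made up, and the escape estimator (method C, without exclusions) is deliberately simplified.

```javascript
// TinyPPM: illustrative PPM-style predictor (hypothetical names, simplified).
class TinyPPM {
  constructor(maxOrder, alphabet) {
    this.maxOrder = maxOrder;
    this.alphabet = alphabet;  // array of symbols
    this.counts = new Map();   // context string -> Map(symbol -> count)
  }
  // Record `symbol` under every context suffix of `history` up to maxOrder.
  update(history, symbol) {
    for (let k = 0; k <= Math.min(this.maxOrder, history.length); ++k) {
      const ctx = history.slice(history.length - k);
      if (!this.counts.has(ctx)) this.counts.set(ctx, new Map());
      const m = this.counts.get(ctx);
      m.set(symbol, (m.get(symbol) || 0) + 1);
    }
  }
  // P(symbol | history): blend orders from shortest to longest, reserving
  // an escape mass of #distinct / (total + #distinct) at each order.
  prob(history, symbol) {
    let p = 1.0 / this.alphabet.length;  // order -1: uniform over the alphabet
    for (let k = 0; k <= Math.min(this.maxOrder, history.length); ++k) {
      const m = this.counts.get(history.slice(history.length - k));
      if (!m) continue;
      let total = 0;
      for (const c of m.values()) total += c;
      const escape = m.size / (total + m.size);
      p = (m.get(symbol) || 0) / (total + m.size) + escape * p;
    }
    return p;
  }
}

// Adaptive usage: update one symbol at a time, then query the distribution.
const model = new TinyPPM(3, ['a', 'b', 'c', 'd', 'r']);
let hist = '';
for (const ch of 'abracadabra') { model.update(hist, ch); hist += ch; }
console.log(model.prob(hist, 'c').toFixed(3));  // ≈ 0.799: 'c' followed "bra" in training
```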
Very simple context-less histogram character language model, described in the following publications (an illustrative sketch follows the list):
- Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.
- Pitman, Jim and Yor, Marc (1997): “The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator”, The Annals of Probability, vol. 25, no. 2, pp. 855–900.
- Chen, Stanley F. and Goodman, Joshua (1999): “An empirical study of smoothing techniques for language modeling”, Computer Speech & Language, vol. 13, no. 4, pp. 359–394, Elsevier.
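As a rough illustration of such a model (again with made-up names, and a plain additive smoother standing in for the more refined estimators from the papers above): it is just an adaptive symbol histogram whose smoothing keeps unseen symbols at non-zero probability.

```javascript
// TinyHistogramLM: illustrative context-less histogram model (hypothetical names).
class TinyHistogramLM {
  constructor(numSymbols, beta = 0.5) {
    this.counts = new Array(numSymbols).fill(0);
    this.total = 0;
    this.beta = beta;  // additive smoothing constant
  }
  update(symbol) {
    this.counts[symbol] += 1;
    this.total += 1;
  }
  // P(s) = (count(s) + beta) / (total + beta * numSymbols)
  probs() {
    const n = this.counts.length;
    return this.counts.map((c) => (c + this.beta) / (this.total + this.beta * n));
  }
}

const lm = new TinyHistogramLM(3);
[0, 0, 1].forEach((s) => lm.update(s));
console.log(lm.probs());  // [ 0.5555..., 0.3333..., 0.1111... ]
```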
Context-less predictive distribution based on balanced binary search trees, following the Pólya tree constructions described in the publications below (an illustrative sketch follows the list). A tentative implementation is here.
- Gleave, Adam and Steinruecken, Christian (2017): “Making compression algorithms for Unicode text”, arXiv preprint arXiv:1701.04047.
- Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.
- Mauldin, R. Daniel and Sudderth, William D. and Williams, S. C. (1992): “Polya Trees and Random Distributions”, The Annals of Statistics, vol. 20, no. 3, pp. 1203–1221.
- Lavine, Michael (1992): “Some aspects of Polya tree distributions for statistical modelling”, The Annals of Statistics, vol. 20, no. 3, pp. 1222–1235.
- Neath, Andrew A. (2003): “Polya Tree Distributions for Statistical Modeling of Censored Data”, Journal of Applied Mathematics and Decision Sciences, vol. 7, no. 3, pp. 175–186.
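The construction can be sketched as follows (illustrative code with assumed names; the Beta(1/2, 1/2) branch prior is one common choice from the Pólya tree literature, not necessarily the one used here): symbols sit at the leaves of a balanced binary tree, each internal node keeps left/right counts, and a symbol's probability is the product of the branch probabilities along its root-to-leaf path, so both updates and queries cost O(log n).

```javascript
// TinyPolyaTree: illustrative Pólya-tree-style model over indices 0..n-1.
class TinyPolyaTree {
  constructor(n) {
    this.n = n;
    this.branchCounts = new Map();  // "lo,hi" -> [leftCount, rightCount]
  }
  _counts(lo, hi) {
    const key = lo + ',' + hi;
    if (!this.branchCounts.has(key)) this.branchCounts.set(key, [0, 0]);
    return this.branchCounts.get(key);
  }
  // Walk from the root interval [0, n) to the symbol's leaf, counting turns.
  update(symbol) {
    let lo = 0, hi = this.n;
    while (hi - lo > 1) {
      const mid = (lo + hi) >> 1;
      const c = this._counts(lo, hi);
      if (symbol < mid) { c[0] += 1; hi = mid; } else { c[1] += 1; lo = mid; }
    }
  }
  // P(symbol) = product of posterior-mean branch probabilities on its path.
  prob(symbol) {
    let lo = 0, hi = this.n, p = 1.0;
    while (hi - lo > 1) {
      const mid = (lo + hi) >> 1;
      const [l, r] = this._counts(lo, hi);
      const pLeft = (l + 0.5) / (l + r + 1.0);  // Beta(1/2, 1/2) posterior mean
      if (symbol < mid) { p *= pLeft; hi = mid; } else { p *= 1.0 - pLeft; lo = mid; }
    }
    return p;
  }
}

const pt = new TinyPolyaTree(4);
[0, 0, 1].forEach((s) => pt.update(s));
console.log([0, 1, 2, 3].map((s) => pt.prob(s).toFixed(3)));  // [ '0.547', '0.328', '0.063', '0.063' ]
```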
Please see example.js for a simple example of the model API usage.
The example takes no command-line arguments. To run it using NodeJS, invoke:
> node example.js
A simple test driver, language_model_driver.js, can be used under NodeJS to check that the model behaves as expected. The driver takes three parameters: the maximum order for the language model, a training file, and a test file, both in plain-text format. Currently only the PPM model is supported.
Example:
> node --max-old-space-size=4096 language_model_driver.js 7 training.txt test.txt
Initializing vocabulary from training.txt ...
Created vocabulary with 212 symbols.
Constructing 7-gram LM ...
Created trie with 21502513 nodes.
Running over test.txt ...
Results: numSymbols = 69302, ppl = 6.047012997396163, entropy = 2.5962226799087356 bits/char, OOVs = 0 (0%).
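Note that the two reported quality metrics are consistent with each other: perplexity is two raised to the power of the cross-entropy, here 2^2.5962 ≈ 6.047 (equivalently, entropy = log2(ppl)).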