
Perezosa

The lazy way to learn a language.

stopwords
- vocabulary: exclude them, but study stopwords separately too
- bigrams: filter them out to drop overly common or meaningless pairs
- trigrams+: filtering would remove useful phrases, so better keep stopwords in
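
A minimal sketch of the bigram filtering described above, using only the standard library. The stopword set and the `count_bigrams` helper are hypothetical, and dropping every pair that contains a stopword is an assumption; the scripts in this repository may filter differently.

```python
from collections import Counter

# Hypothetical, tiny Spanish stopword list, for illustration only.
STOPWORDS = {"de", "la", "el", "en", "y", "a", "que"}

def count_bigrams(tokens, stopwords=STOPWORDS):
    """Count adjacent word pairs, skipping any pair that contains a stopword."""
    pairs = zip(tokens, tokens[1:])
    return Counter(p for p in pairs if not any(w in stopwords for w in p))

tokens = "la casa de la ciudad es muy grande".split()
counts = count_bigrams(tokens)
```

Here `("de", "la")` is dropped while `("muy", "grande")` is kept. For trigrams and longer, the same function with an empty stopword set keeps everything in.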
ngram metrics
- PMI (pointwise mutual information)
  - tried only for bigrams; it ranks lots of proper names (name + surname) at the top rather than real grammatical phrases
- raw_freq
  - good enough to score bigrams or higher ngrams
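
To illustrate why PMI favours rare pairs such as names, here is a rough sketch that computes both metrics for bigrams with the standard library only. The `bigram_scores` helper and the toy sentence are made up for this example, and the probability estimates are the simplest possible ones, not necessarily those used by the training scripts.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {bigram: (raw_freq, pmi)} for all adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        raw_freq = count / n_bi                # relative frequency of the pair
        p1 = unigrams[w1] / n_uni              # probability of the first word
        p2 = unigrams[w2] / n_uni              # probability of the second word
        pmi = math.log2(raw_freq / (p1 * p2))  # pointwise mutual information
        scores[(w1, w2)] = (raw_freq, pmi)
    return scores

tokens = "de la casa de la ciudad juan perez".split()
scores = bigram_scores(tokens)
```

In this toy input the frequent pair `("de", "la")` wins on raw frequency, while the one-off name `("juan", "perez")` gets the higher PMI, because both of its words occur only together.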
vocabulary (unigrams)
- how many words to learn to cover a given proportion of the corpus (wiki)
  - 1.00: 770000
  - 0.95: 48226
  - 0.9: 17000
  - 0.8: 4000
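
The coverage numbers above can be obtained roughly like this: sort words by frequency and count how many of the most frequent ones are needed until their occurrences add up to the target share of all tokens. The `words_needed` helper is a hypothetical sketch, not the code from `list-vocabulary.py`.

```python
from collections import Counter

def words_needed(tokens, proportion):
    """Number of most frequent words that cover `proportion` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= proportion * total:
            return rank
    return len(counts)

# Toy corpus: 10 tokens, 4 distinct words.
tokens = ["a"] * 5 + ["b"] * 3 + ["c"] + ["d"]
needed_80 = words_needed(tokens, 0.8)
needed_100 = words_needed(tokens, 1.0)
```

In the toy corpus two words cover 80% of the tokens, but all four are needed for 100%, which mirrors the long tail seen in the wiki numbers.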
order of ngrams
- bigrams show some common phrases but not complete grammar
- trigrams show fragments of some grammar rules, but are too short to capture them well
- quadgrams -- no data, corpus too big to process (Python :-/)
