This repo contains frequency dictionaries that are text files with one word per line.
It is composed of two folders:
freq_dicts_dirty
: in this folder the dictionaries contain words that are not in a "standard" dictionary.
freq_dicts_clean
: in this folder the dictionaries have been cleaned (and completed) to contain only words that are in a "standard" dictionary.
The files in the freq_dicts_dirty folder were obtained from the LuminosoInsight/wordfreq project.
The corresponding dictionaries were transformed to .txt files with one word per line (most frequent words first), keeping only the words longer than 2 characters.
The transformation was done using the jakm/msgpack-cli tool to convert the .msgpack files to .json files; sed and grep were then used to transform these into .txt files with one word per line.
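The second half of that transformation can be sketched in Python as an alternative to the sed and grep commands, which are not reproduced here. This is only an illustration: it assumes the converted .json file is made of nested lists of words, already ordered from most to least frequent, and it skips anything that is neither a list nor a string. The function names flatten_words and json_to_word_lines are hypothetical.

```python
import json


def flatten_words(node):
    """Yield every string found in nested lists, in order of appearance."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for child in node:
            yield from flatten_words(child)
    # Anything else (for example a header object) is skipped.


def json_to_word_lines(json_text, min_length=3):
    """Return the words of a JSON document, keeping those longer than 2 characters."""
    data = json.loads(json_text)
    return [word for word in flatten_words(data) if len(word) >= min_length]
```

Writing the returned list to a file with one word per line would give the same shape as the .txt files described above.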
The files in the freq_dicts_clean folder were obtained from the files in the freq_dicts_dirty folder by removing all words that are not in the corresponding dictionary of titoBouzout/Dictionaries.
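A minimal Python sketch of this filtering step, not the script actually used for the repo. It assumes the "standard" dictionary has already been reduced to a plain list of words; the function name keep_standard_words is hypothetical.

```python
def keep_standard_words(freq_words, standard_words):
    """Keep the frequency-ordered words that also appear in the standard dictionary."""
    standard = set(standard_words)  # set lookup keeps the filter fast
    # The list comprehension preserves the original frequency order.
    return [word for word in freq_words if word in standard]
```

For example, filtering ["the", "lol", "house"] against a standard list containing "house", "the" and "zebra" keeps "the" and "house", in that order.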
- every short_xx.txt file kept the same name
- every long_xx.txt file was renamed to medium_xx.txt
- the long_xx.txt files are obtained from medium_xx.txt (or short_xx.txt) by completing them (at the end, in alphabetical order) with all words that are in the "standard" dictionary but not present in the "frequency" dictionary.
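The completion step for the long_xx.txt files can be sketched as follows. This is an illustration under assumptions, not the original procedure: both inputs are plain lists of words, "alphabetical order" is approximated by Python's default string ordering, and the function name complete_with_standard is hypothetical.

```python
def complete_with_standard(freq_words, standard_words):
    """Append the standard words missing from the frequency list, sorted, at the end."""
    present = set(freq_words)
    # Words known to the standard dictionary but absent from the frequency list.
    missing = sorted(set(standard_words) - present)
    # The frequency-ordered words stay first, untouched.
    return list(freq_words) + missing
```

A locale-aware sort key may be preferable for languages with accented characters, since the default ordering compares code points.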