Releases: avidale/compress-fasttext
v0.1.5: update numpy
Bump the version.
v0.1.4: remove pqkmeans dependency
Closes #19 (comment) by removing the pqkmeans dependency, which has uninstallable dependencies of its own.
Add feature extraction and finalize migration to gensim 4.0.0
- Wrap up the refactoring related to the new `gensim` version
- Add `FastTextTransformer`, a scikit-learn-like wrapper for feature extraction
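To illustrate what a scikit-learn-like wrapper for feature extraction typically does, here is a minimal sketch. It is not the library's `FastTextTransformer`: the class name `MeanVectorTransformer`, the toy vectors, and the averaging rule are all assumptions made for illustration, and unknown tokens are simply skipped, whereas a real FastText model can build vectors for them from character n-grams.

```python
class MeanVectorTransformer:
    """Hypothetical toy featurizer: averages word vectors per text."""

    def __init__(self, vectors, dim):
        self.vectors = vectors  # mapping: token -> list of floats
        self.dim = dim          # embedding dimensionality

    def fit(self, texts, y=None):
        # Nothing to learn here: the embeddings are assumed to be pre-trained.
        return self

    def transform(self, texts):
        features = []
        for text in texts:
            total = [0.0] * self.dim
            count = 0
            for token in text.lower().split():
                vec = self.vectors.get(token)
                if vec is None:
                    continue  # toy behaviour: ignore unknown tokens
                total = [t + v for t, v in zip(total, vec)]
                count += 1
            # Texts without known tokens map to the zero vector.
            features.append([t / count for t in total] if count else total)
        return features


toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
featurizer = MeanVectorTransformer(toy_vectors, dim=2).fit([])
features = featurizer.transform(["cat dog", "unknown words", "Cat"])
```

Because the object exposes `fit` and `transform`, a featurizer of this shape can be placed in front of any classifier that accepts fixed-length numeric rows.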
Migrate to gensim>=4.0.0
Support gensim>=4.0.0 and deprecate earlier gensim versions.
New released models
Russian models based on geowac_tokens_none_fasttextskipgram_300_5_2020 from RusVectores (1.9 GB):
Model | RAM size, MB | Similarity to the original model |
---|---|---|
geowac_tokens_sg_300_5_2020-100K-20K-100.bin | 26 | 0.9619 |
geowac_tokens_sg_300_5_2020-400K-100K-300.bin | 202 | 0.9990 |
English models based on cc.en.300.bin from the Facebook website (7.2 GB):
Model | RAM size, MB | Similarity to the original model |
---|---|---|
ft_cc.en.300_freqprune_50K_5K_pq_100.bin | 12 | 0.3570 |
ft_cc.en.300_freqprune_100K_20K_pq_100.bin | 25 | 0.6081 |
ft_cc.en.300_freqprune_100K_20K_pq_300.bin | 48 | 0.6268 |
ft_cc.en.300_freqprune_400K_100K_pq_300.bin | 199 | 0.8782 |
Many more small models for various languages can be found at https://zenodo.org/record/4905385.
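The tables above do not say how the "similarity to the original model" is computed. One plausible reading, shown here purely as an assumption, is the mean cosine similarity between the original and compressed vectors over a sample of words. The function names and toy vectors below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity(original, compressed, words):
    """Average cosine similarity between two embedding tables over `words`."""
    scores = [cosine(original[w], compressed[w]) for w in words]
    return sum(scores) / len(scores)


# Toy embedding tables standing in for the original and compressed models.
original = {"cat": [1.0, 0.0], "dog": [0.0, 2.0]}
compressed = {"cat": [2.0, 0.0], "dog": [1.0, 1.0]}
score = mean_similarity(original, compressed, ["cat", "dog"])
```

Under this reading, a score near 1 means the compressed vectors point in almost the same directions as the originals, regardless of their lengths.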
Adding more compressed models
- Publish more compressed models and compare their quality
- Make the compressed models downloadable
Handle optional dependencies better
- Require `sklearn` and `pqkmeans` only in the `[full]` setup mode
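A common way to keep heavy packages optional is to import them lazily and fail with a helpful message only when a feature actually needs them. The sketch below shows that general pattern; it is not the project's actual code, the helper names are hypothetical, and the `pip` hint assumes the package is published as `compress-fasttext`.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require_optional(module_name, extra="full"):
    """Import a module or fail with a hint about which extra to install."""
    module = optional_import(module_name)
    if module is None:
        raise ImportError(
            f"{module_name} is needed for this feature; "
            f"try: pip install compress-fasttext[{extra}]"  # assumed package name
        )
    return module


json_module = optional_import("json")  # standard library, always present
missing = optional_import("surely_not_installed_module_xyz")  # returns None
```

With this shape, importing the package itself never fails; only the code paths that call `require_optional` depend on the extra being installed.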
Convert compressed matrices to `numpy` on math operations
Attempting arithmetic operations on compressed matrices no longer raises errors. However, such operations convert these matrices to `numpy.array`, which costs time and memory.
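The trade-off can be seen in a toy model of a compressed matrix. The class below is hypothetical and dependency-free: it stores rows as indices into a small codebook and returns plain nested lists from arithmetic, whereas the library converts to `numpy.array`.

```python
class ToyQuantizedMatrix:
    """Hypothetical toy: each row is an index into a shared codebook."""

    def __init__(self, codebook, codes):
        self.codebook = codebook  # list of distinct row vectors
        self.codes = codes        # codes[i] selects the codebook row for row i

    def to_dense(self):
        # Materialize every row; this is the step that costs time and memory.
        return [list(self.codebook[c]) for c in self.codes]

    def __mul__(self, scalar):
        # Arithmetic falls back to the dense form instead of raising an error.
        return [[x * scalar for x in row] for row in self.to_dense()]

    __rmul__ = __mul__


matrix = ToyQuantizedMatrix(codebook=[[1.0, 2.0], [3.0, 4.0]], codes=[0, 1, 0])
scaled = matrix * 2  # a dense result, no longer compressed
```

The result of the multiplication holds one full row per code, so repeated arithmetic on large compressed matrices gives up the memory savings.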
Add pruning by embedding norm
The `prune_ft_freq` method now takes into account not only n-gram frequency but also the norm of its embedding.
This improves model compression accuracy for the same model size.
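The note does not specify how frequency and norm are combined. As a rough sketch under the assumption that they are multiplied into a single score, pruning could look like the following; the function name, the scoring rule, and the toy data are all hypothetical.

```python
import math


def prune_by_freq_and_norm(ngrams, keep):
    """ngrams maps an id to (frequency, vector); return the ids to keep."""

    def score(item):
        _, (freq, vec) = item
        norm = math.sqrt(sum(x * x for x in vec))
        return freq * norm  # assumed scoring rule, not taken from the library

    ranked = sorted(ngrams.items(), key=score, reverse=True)
    return [ngram_id for ngram_id, _ in ranked[:keep]]


ngrams = {
    "ab": (100, [0.1, 0.0]),  # frequent, but its embedding is almost zero
    "bc": (10, [3.0, 4.0]),   # rarer, but with a large embedding norm
    "cd": (50, [0.0, 0.5]),
}
kept = prune_by_freq_and_norm(ngrams, keep=2)
```

In this toy data, pruning by frequency alone would keep "ab" and "cd"; adding the norm drops "ab", whose near-zero vector contributes little to any word embedding.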
Initial release
We publish the code for compressing Gensim FastText models and using their small versions.
We also publish 4 compressed versions of the ruscorpora_none_fasttextskipgram_300_2_2019 model from RusVectores.
Model | RAM, MB | Similarity to the original | Intrinsic evaluation (relative to the original) |
---|---|---|---|
ft_freqprune_50K_5K_pq_100.bin | 13 | 92.7% | 89.9% |
ft_freqprune_100K_20K_pq_100.bin | 28 | 96.1% | 96.6% |
ft_freqprune_100K_20K_pq_300.bin | 51 | 98.2% | 97.9% |
ft_freqprune_400K_100K_pq_300.bin | 180 | 99.7% | 99.9% |