
Releases: avidale/compress-fasttext

v0.1.5: update numpy

03 May 06:37

v0.1.4: remove pqkmeans dependency

14 Oct 20:33

Closes #19 (comment) by removing the pqkmeans dependency, which has uninstallable dependencies of its own.

Add feature extraction and finalize migration to gensim 4.0.0

14 Dec 07:02
  • Wrap up the refactoring related to the new gensim version
  • Add FastTextTransformer, a scikit-learn-like wrapper for feature extraction
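
To illustrate what a scikit-learn-like wrapper for feature extraction does, here is a minimal sketch of the fit/transform pattern. It is not the library's FastTextTransformer: `ToyAveragingTransformer` and `TOY_VECTORS` are hypothetical, the vectors are made-up values, and unknown tokens are simply mapped to zeros, whereas a real fastText model can compose vectors for unseen words from character n-grams.

```python
# Toy word vectors standing in for a compressed fastText model (made-up values).
TOY_VECTORS = {
    "banana": [1.0, 0.0],
    "soup": [0.8, 0.2],
    "car": [0.0, 1.0],
}


class ToyAveragingTransformer:
    """Hypothetical fit/transform wrapper that averages token vectors per text."""

    def __init__(self, vectors, dim):
        self.vectors = vectors
        self.dim = dim

    def fit(self, texts, y=None):
        # Nothing to learn here: the embeddings are assumed to be pre-trained.
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            known = [self.vectors[t] for t in text.lower().split() if t in self.vectors]
            if not known:
                rows.append([0.0] * self.dim)  # no known token: zero vector
                continue
            # Column-wise mean of the token vectors.
            rows.append([sum(col) / len(known) for col in zip(*known)])
        return rows

    def fit_transform(self, texts, y=None):
        return self.fit(texts, y).transform(texts)


features = ToyAveragingTransformer(TOY_VECTORS, dim=2).fit_transform(
    ["banana soup", "car", "unknown"]
)
```

Because the class exposes `fit` and `transform`, an object of this shape can in principle be placed in front of a classifier in a pipeline, which is the usual reason for wrapping embeddings this way.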

Migrate to gensim>=4.0.0

11 Dec 11:12

Support for gensim>=4.0.0; earlier gensim versions are deprecated.

New released models

Russian models based on geowac_tokens_none_fasttextskipgram_300_5_2020 from RusVectores, 1.9GB:

| Model | RAM size, MB | Similarity to the original model |
| --- | --- | --- |
| geowac_tokens_sg_300_5_2020-100K-20K-100.bin | 26 | 0.9619 |
| geowac_tokens_sg_300_5_2020-400K-100K-300.bin | 202 | 0.9990 |

English models based on cc.en.300.bin from the Facebook website, 7.2GB:

| Model | RAM size, MB | Similarity to the original model |
| --- | --- | --- |
| ft_cc.en.300_freqprune_50K_5K_pq_100.bin | 12 | 0.3570 |
| ft_cc.en.300_freqprune_100K_20K_pq_100.bin | 25 | 0.6081 |
| ft_cc.en.300_freqprune_100K_20K_pq_300.bin | 48 | 0.6268 |
| ft_cc.en.300_freqprune_400K_100K_pq_300.bin | 199 | 0.8782 |
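
The tables above report a similarity to the original model without saying how it is computed. One plausible reading, given here only as an assumption, is the mean cosine similarity between the original and compressed vectors of the same words. The sketch below shows that computation on made-up two-dimensional vectors; `mean_cosine_similarity` and the data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_cosine_similarity(original, compressed, words):
    """Average per-word cosine similarity between two word-to-vector mappings."""
    return sum(cosine(original[w], compressed[w]) for w in words) / len(words)


# Made-up vectors: the "compressed" ones are slightly distorted copies.
original = {"cat": [1.0, 0.0], "dog": [0.0, 2.0]}
compressed = {"cat": [0.9, 0.1], "dog": [0.0, 1.5]}

score = mean_cosine_similarity(original, compressed, ["cat", "dog"])
```

Under this reading, a score near 1 means the compressed vectors point in almost the same directions as the originals, regardless of any change in their length.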

Many more small models for various languages can be found at https://zenodo.org/record/4905385.

Adding more compressed models

10 Nov 13:34
  • Publish more compressed models and compare their quality
  • Make the compressed models downloadable

Handle optional dependencies better

12 Apr 08:42
  • Require sklearn and pqkmeans only in the [full] setup mode
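
A common way to keep a dependency optional is to import it only inside the feature that needs it and to raise a readable error when it is missing. The sketch below shows that general pattern, not this package's actual code: `require_optional` and `cluster_vectors` are hypothetical names, and the hint text assumes the extra is called `full`, as in the note above.

```python
import importlib


def require_optional(module_name, extra="full"):
    """Import an optional dependency, or explain how to get it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is an optional dependency; "
            f"install the package with the [{extra}] extra to use this feature."
        ) from exc


def cluster_vectors(vectors, n_clusters=2):
    """Hypothetical feature that needs scikit-learn only when it is called."""
    # The heavy import is deferred until the optional feature is actually used.
    sklearn_cluster = require_optional("sklearn.cluster")
    model = sklearn_cluster.KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)
    return model.labels_
```

With this layout, importing the package itself never touches the optional modules, so a minimal installation keeps working until one of the optional features is called.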

Convert compressed matrices to `numpy` on math operations

12 Mar 13:42

Arithmetic operations on compressed matrices no longer raise errors. Instead, they convert the matrices to numpy.array, which costs time and memory.
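
The trade-off described above can be sketched with a toy compressed matrix. This is an illustration under assumptions, not the library's implementation: `ToyQuantizedMatrix` is hypothetical, the "compression" is a simple codebook lookup, and plain Python lists stand in for numpy.array to keep the example dependency-free.

```python
class ToyQuantizedMatrix:
    """Hypothetical compressed matrix: each row is an index into a small codebook."""

    def __init__(self, codebook, codes):
        self.codebook = codebook    # a few distinct rows, stored once
        self.codes = codes          # one small integer per matrix row
        self.materializations = 0   # counts full decodes, for illustration

    def __getitem__(self, row):
        # Row lookup decodes a single row and stays cheap.
        return list(self.codebook[self.codes[row]])

    def to_dense(self):
        # Full decode: allocates every row, which costs time and memory.
        self.materializations += 1
        return [list(self.codebook[c]) for c in self.codes]

    def __mul__(self, scalar):
        # Arithmetic falls back to the dense form instead of raising an error.
        return [[x * scalar for x in row] for row in self.to_dense()]

    __rmul__ = __mul__


matrix = ToyQuantizedMatrix(codebook=[[1.0, 2.0], [3.0, 4.0]], codes=[0, 1, 1, 0])
row = matrix[2]        # decodes one row only
doubled = matrix * 2   # decodes all four rows and returns a dense result
```

The result of the multiplication is no longer compressed, so code that cares about memory should prefer row lookups over whole-matrix arithmetic.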

Add pruning by embedding norm

28 Feb 20:43

The prune_ft_freq method now takes into account not only n-gram frequency but also the norm of each n-gram's embedding.
This improves the accuracy of the compressed model at the same model size.
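
The note does not state how frequency and norm are combined. As a sketch of the idea only, the example below ranks n-grams by the product of frequency and embedding norm and keeps the top ones; the product rule, the `select_ngrams` helper, and all numbers are assumptions made for illustration.

```python
import math


def select_ngrams(frequencies, embeddings, keep):
    """Return the ids of the `keep` n-grams with the highest frequency * norm score."""

    def score(ngram_id):
        norm = math.sqrt(sum(x * x for x in embeddings[ngram_id]))
        return frequencies[ngram_id] * norm  # assumed scoring rule

    ranked = sorted(frequencies, key=score, reverse=True)
    return ranked[:keep]


# Made-up data: "ab" is frequent but its embedding is nearly zero.
frequencies = {"ab": 100, "bc": 90, "cd": 10}
embeddings = {"ab": [0.01, 0.0], "bc": [0.5, 0.5], "cd": [3.0, 4.0]}

kept = select_ngrams(frequencies, embeddings, keep=2)
by_frequency_only = sorted(frequencies, key=frequencies.get, reverse=True)[:2]
```

In this toy data, ranking by frequency alone would keep "ab", whose near-zero vector contributes almost nothing to word vectors, while the combined score keeps the rarer but larger "cd" instead.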

Initial release

21 Feb 22:21

We publish the code for compressing Gensim FastText models and for using the resulting small models.

We also publish 4 compressed versions of the ruscorpora_none_fasttextskipgram_300_2_2019 model from RusVectores.

| Model | RAM, MB | Similarity to the original | Intrinsic evaluation (relative to the original) |
| --- | --- | --- | --- |
| ft_freqprune_50K_5K_pq_100.bin | 13 | 92.7% | 89.9% |
| ft_freqprune_100K_20K_pq_100.bin | 28 | 96.1% | 96.6% |
| ft_freqprune_100K_20K_pq_300.bin | 51 | 98.2% | 97.9% |
| ft_freqprune_400K_100K_pq_300.bin | 180 | 99.7% | 99.9% |