Use Wikipedia to train your fastText embedding model!
- Download a Wikipedia database dump from https://dumps.wikimedia.org/enwiki/ (replace 20XXXXXX with the date of the dump you want)
wget https://dumps.wikimedia.org/enwiki/20XXXXXX/enwiki-20XXXXXX-pages-articles-multistream.xml.bz2
- Process the Wikipedia dump with wikiextractor
git clone https://github.com/attardi/wikiextractor.git
python3 wikiextractor/WikiExtractor.py enwiki-20XXXXXX-pages-articles-multistream.xml.bz2 -o <wiki_dir> --processes 8 --no-templates
According to the official documentation, --no-templates significantly speeds up the extractor. A quick sanity check on the extracted output is sketched below.
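wikiextractor writes plain-text files (wiki_00, wiki_01, ...) into subdirectories of <wiki_dir> such as AA/, with each article wrapped in a <doc ... title="..."> ... </doc> block. Here is a minimal sketch for listing the article titles in one of those files; the script name and command-line path are illustrative assumptions, not part of this repository:

# peek_titles.py -- hypothetical sanity check on wikiextractor output.
import re
import sys

DOC_OPEN = re.compile(r'<doc [^>]*title="([^"]+)"')

# Pass a path such as <wiki_dir>/AA/wiki_00.
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        m = DOC_OPEN.match(line)
        if m:
            print(m.group(1))  # one article title per <doc> block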
- Merge the extracted wiki data
python3 merge.py <wiki_dir> <output_file>
See setup.sh for an example; a sketch of what the merge step typically does follows below.
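For reference, merging wikiextractor output usually amounts to stripping the <doc> wrapper lines and concatenating the article text into a single training file. The sketch below is an assumption about what merge.py does, not the repository's actual code; the file and function names are illustrative:

# merge_sketch.py -- hypothetical stand-in for merge.py, for illustration only.
import os
import sys

def merge(wiki_dir, output_file):
    """Concatenate all extracted article text into one training file."""
    with open(output_file, "w", encoding="utf-8") as out:
        for root, _dirs, files in os.walk(wiki_dir):
            for name in sorted(files):
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    for line in f:
                        # Drop the <doc id=...> / </doc> wrapper lines,
                        # keeping only the article text.
                        if line.startswith("<doc") or line.startswith("</doc>"):
                            continue
                        out.write(line)

if __name__ == "__main__":
    merge(sys.argv[1], sys.argv[2])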
- Build the fastText package (see the fastText repository at https://github.com/facebookresearch/fastText for reference)
wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip
cd fastText-0.1.0
make
- Start training your model! (point -input at the merged file produced above)
./fasttext skipgram -input data.txt -output model -dim 300 -minn 1 -maxn 6
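Training produces model.bin and model.vec; the .vec file is plain text with a "vocab_size dim" header line followed by one word and its 300 values per line. Here is a minimal sketch for loading those vectors and querying nearest neighbors by cosine similarity, assuming the vocabulary fits in memory (the script name and query word are illustrative):

# nearest_neighbors.py -- illustrative use of the trained .vec file.
import numpy as np

def load_vectors(path):
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "vocab_size dim" header line
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    mat = np.vstack(vecs)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize rows
    return words, mat

words, mat = load_vectors("model.vec")
index = {w: i for i, w in enumerate(words)}

query = "king"                      # any in-vocabulary word
sims = mat @ mat[index[query]]      # cosine similarity to every word
for i in np.argsort(-sims)[1:6]:    # top 5 neighbors, skipping the query itself
    print(words[i], float(sims[i]))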