willywsm1013/EnglishFastText

EnglishFastText

Use wikipedia to train your fastText embedding model!

Preprocess wiki data

  1. Download a Wikipedia database dump from https://dumps.wikimedia.org/enwiki/

wget https://dumps.wikimedia.org/enwiki/20XXXXXX/enwiki-20XXXXXX-pages-articles-multistream.xml.bz2

  2. Extract plain text from the dump using wikiextractor

git clone https://github.com/attardi/wikiextractor.git
python3 wikiextractor/WikiExtractor.py enwiki-20XXXXXX-pages-articles-multistream.xml.bz2 -o <wiki_dir> --processes 8 --no-templates

According to the official wikiextractor documentation, "--no-templates" significantly speeds up the extractor.

  3. Merge the extracted wiki data into a single training file

python3 merge.py <wiki_dir> <output_file name>

See setup.sh for a complete example.
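The source of merge.py is not shown here, but the merge step it performs is straightforward: wikiextractor writes its output as many small files (wiki_00, wiki_01, ...) under subdirectories, with each article wrapped in <doc ...> ... </doc> tags. A minimal sketch of such a merge, assuming that layout (the function name and behavior here are illustrative, not the repository's actual implementation):

```python
# Sketch of a merge step over wikiextractor output (illustrative only;
# merge.py's real source may differ). wikiextractor writes files named
# wiki_00, wiki_01, ... under subdirectories (AA/, AB/, ...), with every
# article wrapped in <doc id=... title=...> ... </doc> tags.
import os
import sys

def merge_wiki(wiki_dir, output_file):
    """Concatenate all extracted files into one plain-text corpus,
    dropping the <doc>/<\u200bdoc> wrapper lines."""
    with open(output_file, "w", encoding="utf-8") as out:
        for root, _, files in os.walk(wiki_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                with open(path, encoding="utf-8") as f:
                    for line in f:
                        # Skip the XML-like wrapper lines, keep article text.
                        if line.startswith("<doc") or line.startswith("</doc>"):
                            continue
                        out.write(line)

if __name__ == "__main__" and len(sys.argv) == 3:
    merge_wiki(sys.argv[1], sys.argv[2])
```

The resulting single file is what fastText expects as its -input corpus in the training step below.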

FastText

  1. Build the fastText package (see the official fastText repository for reference)

wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip
cd fastText-0.1.0
make

  2. Start training your model!

./fasttext skipgram -input data.txt -output model -dim 300 -minn 1 -maxn 6
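Training writes two files: model.bin (the full binary model) and model.vec, a plain-text file in word2vec text format (a "vocab_size dim" header line, then one "word v1 v2 ... vdim" line per word). As a sketch of consuming that output, the snippet below parses the .vec format and compares words by cosine similarity; the tiny inline "model" is synthetic stand-in data, not real trained vectors:

```python
# Sketch: parsing fastText's text output (model.vec) and comparing word
# vectors by cosine similarity. The three 4-dimensional vectors below are
# synthetic placeholders; a real model.vec has the same line format.
import io
import math

def load_vec(fileobj):
    """Parse word2vec text format: header 'vocab_size dim', then one
    'word v1 v2 ... vdim' line per word."""
    vocab_size, dim = map(int, fileobj.readline().split())
    vectors = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], [float(x) for x in parts[1:]]
        assert len(vec) == dim  # sanity-check each row against the header
        vectors[word] = vec
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Synthetic 3-word, 4-dimensional "model.vec" for illustration.
sample = io.StringIO(
    "3 4\n"
    "king 0.1 0.2 0.3 0.4\n"
    "queen 0.1 0.2 0.3 0.5\n"
    "apple -0.4 0.1 -0.2 0.0\n"
)
vecs = load_vec(sample)
```

Since model.vec is standard word2vec text format, it can also be loaded with off-the-shelf tools such as gensim's KeyedVectors.load_word2vec_format.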
