Training

Python Environment

Set up a Python environment with gensim installed. More detailed instructions here. You can also follow this video tutorial about Python virtualenv.

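Clone the repository first, then install its dependencies from inside the cloned folder:

git clone https://github.com/ml5js/training-word2vec/
cd training-word2vec
pip install -r requirements.txt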

The script supports training from a single text file or from a directory of files. Create a text file, or a folder containing multiple files, then run train.py with the name of the file or folder.

Example:

python train.py file.txt
python train.py files/
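
How the script tells a file from a folder is not shown here. The following is a minimal sketch of one way to handle both cases; it is not the code in train.py. The function names `collect_text_files` and `read_corpus`, and the choice to read only top-level `.txt` files in a folder, are assumptions made for illustration.

```python
from pathlib import Path


def collect_text_files(path):
    """Return a sorted list of text files for a file or directory path.

    Hypothetical helper: train.py may resolve its input differently.
    """
    p = Path(path)
    if p.is_file():
        return [p]
    if p.is_dir():
        # Assumption: only top-level .txt files count as training data.
        return sorted(f for f in p.iterdir() if f.is_file() and f.suffix == ".txt")
    raise FileNotFoundError(f"No such file or directory: {path}")


def read_corpus(path):
    """Yield each non-empty line, stripped, from every input file."""
    for file in collect_text_files(path):
        with file.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line
```

With this sketch, `read_corpus("file.txt")` and `read_corpus("files/")` both produce a stream of lines, so the rest of a training pipeline would not need to care which form the input took.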

By default, the script writes two files, vectors.txt and vectors.json. To specify a different output file name, use the additional -o argument.

python train.py data.txt -o output.json
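
The layout of the two output files is not described here. As a rough illustration only, the sketch below assumes the text file uses the plain-text word2vec format (a header line with vocabulary size and dimension, then one word and its values per line) and turns it into a JSON object mapping each word to a list of numbers. The function name `text_vectors_to_json` and the JSON layout are assumptions, not the script's documented behavior.

```python
import json


def text_vectors_to_json(txt_path, json_path):
    """Convert word2vec text-format vectors to a {word: [floats]} JSON file.

    Hypothetical helper; the real output of train.py may be laid out differently.
    """
    vectors = {}
    with open(txt_path, encoding="utf-8") as handle:
        # Header line: "<vocabulary size> <vector dimension>"
        header = handle.readline().split()
        vocab_size, dim = int(header[0]), int(header[1])
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError(
            f"header promised {vocab_size} words, found {len(vectors)}"
        )
    with open(json_path, "w", encoding="utf-8") as out:
        json.dump(vectors, out)
    return vectors
```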

The default tokenizer is very basic. To use NLTK's tokenizer instead, pass nltk to the --tokenizer argument (shown below in its short form, -t).

The script can also remove stop words when given the --remove-stop-words flag.

python train.py files/ -t nltk --remove-stop-words
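
To show what stop-word removal does to the training text, here is a small sketch that uses only the standard library. It is not the script's tokenizer: the regular expression, the function name `tokenize`, and the tiny `STOP_WORDS` set are all assumptions for illustration. NLTK ships a much longer English stop-word list.

```python
import re

# Assumption: a deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "on", "in", "to"}


def tokenize(text, remove_stop_words=False):
    """Lowercase the text and split it into runs of letters, digits and apostrophes."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if remove_stop_words:
        # Drop very common words that carry little meaning on their own.
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```

For the sentence "The cat sat on the mat.", this sketch returns six tokens without filtering and only `cat`, `sat`, `mat` with `remove_stop_words=True`. Removing stop words shrinks the vocabulary, but it also means those words get no vectors of their own.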