Training

Python Environment

Requirements

pip install -r requirements.txt

Train the model

  1. Clone this repository or download the Python script.
git clone https://github.com/ml5js/training-word2vec/
  2. The script trains from either a single text file or a directory of files. Create a text file, or a folder containing several files, then run train.py with the name of that file or folder.

Example:

python train.py file.txt
python train.py files/
  3. By default the script writes vectors.txt and vectors.json. To choose a different output file name, use the additional -o argument.
python train.py data.txt -o output.json
  4. The output JSON file can now be used with the ml5.js word2vec examples.
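
Before loading the output into ml5.js, it can help to sanity-check the vectors from Python. The sketch below is not part of train.py. It assumes the JSON file holds a mapping from each word to a list of numbers, either at the top level or under a "vectors" key. That layout is an assumption, so inspect your own file and adjust. The helper names (load_vectors, cosine_similarity, nearest) are made up for this example.

```python
import json
import math


def load_vectors(path):
    """Load word vectors from a JSON file.

    Assumption: the file is a word -> list-of-floats mapping, either at
    the top level or nested under a "vectors" key.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict) and isinstance(data.get("vectors"), dict):
        data = data["vectors"]
    return data


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def nearest(vectors, word, n=5):
    """Return the n words most similar to `word`, best match first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

For example, `nearest(load_vectors("vectors.json"), "some_word")` lists the closest neighbours of a word that appears in your training text. If the neighbours look unrelated, the corpus may be too small.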

Advanced tokenization

The default tokenizer is very basic. You can ask the script to use NLTK's tokenizer with the --tokenizer argument.

Additionally, the script can remove stop words with the --remove-stop-words argument.

python train.py files/ -t nltk --remove-stop-words
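
To see what tokenization and stop-word removal do to the training text, here is a minimal sketch using only the standard library. It is not the tokenizer train.py uses. The regular expression and the tiny EXAMPLE_STOP_WORDS set are assumptions chosen for illustration. NLTK provides a fuller tokenizer (nltk.word_tokenize) and stop-word lists (nltk.corpus.stopwords), both of which need NLTK's data packages to be downloaded first.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
EXAMPLE_STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def basic_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=EXAMPLE_STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


tokens = basic_tokenize("The cat sat in the hat.")
filtered = remove_stop_words(tokens)
```

Here `tokens` keeps every word, while `filtered` drops "the" and "in". Removing stop words shrinks the vocabulary, which can matter for small corpora where very frequent words would otherwise dominate every context window.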