This directory contains models for unsupervised training of word embeddings using the model described in:
(Mikolov, et. al.) Efficient Estimation of Word Representations in Vector Space, ICLR 2013.
Detailed instructions on how to get started and use them are available in the tutorials. Brief instructions are below.
Assuming you have cloned the git repository, navigate into this directory. To download the example text and evaluation data:
curl http://mattmahoney.net/dc/text8.zip > text8.zip
unzip text8.zip
curl https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip > source-archive.zip
unzip -p source-archive.zip word2vec/trunk/questions-words.txt > questions-words.txt
rm text8.zip source-archive.zip
You will need to compile the ops as follows (See Adding a New Op to TensorFlow for more details).:
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 -shared word2vec_ops.cc word2vec_kernels.cc -o word2vec_ops.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} -O2 -D_GLIBCXX_USE_CXX11_ABI=0
On Mac, add -undefined dynamic_lookup
to the g++ command. The flag -D_GLIBCXX_USE_CXX11_ABI=0
is included to support newer versions of gcc. However, if you compiled TensorFlow from source using gcc 5 or later, you may need to exclude the flag. Specifically, if you get an error similar to the following: word2vec_ops.so: undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumES3_S3_S3_
then you likely need to exclude the flag.
Once you've successfully compiled the ops, run the model as follows:
python word2vec_optimized.py \
--train_data=text8 \
--eval_data=questions-words.txt \
--save_path=/tmp/
Here is a short overview of what is in this directory.
File | What's in it? |
---|---|
word2vec.py |
A version of word2vec implemented using TensorFlow ops and minibatching. |
word2vec_test.py |
Integration test for word2vec. |
word2vec_optimized.py |
A version of word2vec implemented using C ops that does no minibatching. |
word2vec_optimized_test.py |
Integration test for word2vec_optimized. |
word2vec_kernels.cc |
Kernels for the custom input and training ops. |
word2vec_ops.cc |
The declarations of the custom ops. |