
word2vec-visualization

Word Vectors Visualization in Tree Form

Authors: Phi Van Thuy and Taishi Ikeda.

Supervisor: Assistant Professor Kevin Duh.

Two distance metrics are supported: cosine distance and Euclidean distance.
Eight different models in total cover the English and Japanese data.
Run a simple HTTP server: "python -m SimpleHTTPServer 8888" (Python 2; on Python 3 the equivalent is "python3 -m http.server 8888").
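The two distance metrics named above can be illustrated with a minimal Python 3 sketch. It assumes the usual definitions (cosine distance as one minus cosine similarity, Euclidean distance as the straight-line distance); the function names and the toy vectors are made up for illustration and are not taken from the repository's scripts.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Return the straight-line distance between the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 2-dimensional vectors (illustrative only, not real word vectors).
vec_a = [3.0, 4.0]
vec_b = [6.0, 8.0]  # same direction as vec_a, twice the length

print(cosine_distance(vec_a, vec_b))     # 0.0
print(euclidean_distance(vec_a, vec_b))  # 5.0
```

The example shows why the two metrics can rank neighbours differently: vectors that point in the same direction have zero cosine distance even when their Euclidean distance is large.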

![fig1](demo_en.png)

![fig2](demo_ja.png)

Main files and folders:
- backend
  - HiraganaTimes_English
    Implementation of the continuous bag-of-words (CBOW) and skip-gram architectures for computing vector representations of English words. Skip-gram is slower but better for infrequent words; CBOW is faster.
  - HiraganaTimes_Japanese
    Implementation of the continuous bag-of-words (CBOW) and skip-gram architectures for computing vector representations of Japanese words. Skip-gram is slower but better for infrequent words; CBOW is faster.
  - Convert_to_JSON
    Scripts for converting word2vec models to JSON databases.
- frontend
  - data
    Contains all the data for searching and visualizing words. "data_cosine.json" and "data_euclidean.json" are the databases; the flare-format data is generated from the database when the web page runs.
  - js
    Contains the D3.js library (a JavaScript visualization library).
  - word2vec_tree_final.html
    The main web page.
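The database-to-tree step mentioned for the data folder can be sketched roughly as follows. This is only an illustration in Python 3: the README does not describe the schema of "data_cosine.json", so the word-to-neighbours dictionary used here is an assumption, "flare format" is taken to mean the nested name/children layout common in D3 hierarchy examples, and `to_flare` is a hypothetical helper rather than the code the web page actually runs.

```python
import json


def to_flare(database, root, depth=2, seen=None):
    """Build a nested {"name", "children"} tree of neighbours starting at root.

    database: assumed layout, a dict mapping each word to a list of neighbour words.
    depth: how many levels of neighbours to expand below the root.
    """
    seen = set() if seen is None else seen
    seen.add(root)
    node = {"name": root}
    if depth > 0:
        children = []
        for word in database.get(root, []):
            if word not in seen:  # show each word only once in the tree
                children.append(to_flare(database, word, depth - 1, seen))
        if children:
            node["children"] = children
    return node


# Toy database (illustrative only).
toy_database = {
    "cat": ["dog", "kitten"],
    "dog": ["cat", "puppy"],
    "kitten": ["cat"],
}

tree = to_flare(toy_database, "cat", depth=2)
print(json.dumps(tree, indent=2))
```

Tracking the words already placed in the tree keeps mutual neighbours such as "cat" and "dog" from expanding into each other indefinitely.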

Visualize your own data
- Converting the word2vec models to JSON files requires the Gensim library (https://radimrehurek.com/gensim/install.html). Quick install: "easy_install -U gensim" or, alternatively, "pip install --upgrade gensim".
- For the cosine distance metric, use the script "create_database_cosine.py".
- For the Euclidean distance metric, use the script "create_database_euclidean.py" and copy the file "word2vec.py" into the Gensim library's models directory, e.g., "/Library/Python/2.7/site-packages/gensim-0.10.3-py2.7-macosx-10.10-intel.egg/gensim/models". This modified implementation adds a method, most_similar_euclidean(), that computes the distance between pairs of words/phrases using the Euclidean metric.
- Special characters must be removed from the JSON files so that they remain valid JSON. See the file "Remove_Special_Characters.txt" for details.
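The conversion steps above might look roughly like the Python 3 sketch below. It is not the repository's "create_database_cosine.py": the output layout, the `build_database` and `is_clean` helpers, and the rule for what counts as a special character are all assumptions (the actual rule is in "Remove_Special_Characters.txt"). A small stand-in class replaces the real model so the example runs without Gensim; it only mimics the `most_similar(word, topn=...)` call, which returns (word, similarity) pairs.

```python
import json
import re

# Assumption: a "special character" is anything that is not a letter, digit,
# or underscore. Japanese kana and kanji count as letters under this rule.
_SPECIAL = re.compile(r"[^\w]")


def is_clean(word):
    """Return True if the word is non-empty and has no special characters."""
    return bool(word) and _SPECIAL.search(word) is None


def build_database(model, vocabulary, topn=10):
    """Map each clean word to its nearest neighbours (hypothetical layout)."""
    database = {}
    for word in vocabulary:
        if not is_clean(word):
            continue
        database[word] = [
            {"word": other, "similarity": round(float(score), 4)}
            for other, score in model.most_similar(word, topn=topn)
            if is_clean(other)
        ]
    return database


class ToyModel:
    """Stand-in that mimics the most_similar(word, topn=...) call shape."""

    def __init__(self, table):
        self.table = table

    def most_similar(self, word, topn=10):
        return self.table[word][:topn]


toy = ToyModel({"tokyo": [("osaka", 0.91), ('kyo"to', 0.88), ("nagoya", 0.8)]})
db = build_database(toy, ["tokyo", "bad{word"], topn=2)
print(json.dumps(db, ensure_ascii=False))
```

With a real model, the object passed as `model` would be the loaded Gensim word vectors instead of `ToyModel`; that path is not exercised here.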

For Vietnamese:
- I am testing the UI by swapping the English dictionary for a Vietnamese one. It does not work yet, so there are probably encoding problems. Feel free to test with /frontend/data/vi_....json
- Please send a pull request with any changes that fix the bug. If you don't know how to fix this problem, feel free to use the simple UI instead.
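One thing worth checking, assuming the problem really is file encoding (the cause has not been confirmed), is whether the Vietnamese JSON survives a UTF-8 round trip. The Python 3 sketch below writes and re-reads a small sample; the file name "vi_sample.json", the sample data, and the `roundtrip_utf8` helper are hypothetical.

```python
import json
import os
import tempfile


def roundtrip_utf8(data, path):
    """Write data as UTF-8 JSON without ASCII escaping, then read it back."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Illustrative sample with Vietnamese diacritics (not taken from the repository).
sample = {"Hà Nội": ["Sài Gòn", "Huế"]}
path = os.path.join(tempfile.mkdtemp(), "vi_sample.json")
restored = roundtrip_utf8(sample, path)
print(restored == sample)
```

If the round trip succeeds for the real data, the file itself is probably fine and the problem is more likely in how the page loads or displays it.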
