Showing 20 changed files with 3,763 additions and 0 deletions.
@@ -0,0 +1,381 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gensim word vector visualization of various word vectors"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Get the interactive Tools for Matplotlib\n",
"%matplotlib notebook\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('ggplot')\n",
"\n",
"from sklearn.decomposition import PCA\n",
"\n",
"from gensim.test.utils import datapath, get_tmpfile\n",
"from gensim.models import KeyedVectors\n",
"from gensim.scripts.glove2word2vec import glove2word2vec"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For looking at word vectors, I'll use Gensim. We also use it in hw1 for word vectors. Gensim isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient and scalable, and quite widely used."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our homegrown Stanford offering is GloVe word vectors. Gensim doesn't give them first-class support, but it lets you convert a file of GloVe vectors into word2vec format. You can download the GloVe vectors from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(I use the 100d vectors below as a compromise between speed and size versus quality. If you try out the 50d vectors, they basically work for similarity but clearly aren't as good for analogy problems. If you load the 300d vectors, they're even better than the 100d vectors.)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(400000, 100)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"glove_file = datapath('/Users/Rocha/Documents/Github/cs224n/glove.6B/glove.6B.100d.txt')\n",
"word2vec_glove_file = get_tmpfile(\"glove.6B.100d.word2vec.txt\")\n",
"glove2word2vec(glove_file, word2vec_glove_file)"
]
},
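{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Side note, not part of the original workflow: in gensim 4.0+, `glove2word2vec` is deprecated, and my understanding is that you can load a GloVe text file directly with `KeyedVectors.load_word2vec_format(..., no_header=True)`. The next cell is a hedged sketch of that alternative; it assumes gensim >= 4.0 and is unnecessary if the conversion above worked.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative load path, assuming gensim >= 4.0 (the no_header parameter\n",
"# was added in 4.0; skip this cell on older gensim). GloVe text files lack\n",
"# the word2vec header line, so no_header=True makes gensim infer the\n",
"# vocabulary size and dimensionality from the file itself.\n",
"model_direct = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)"
]
},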
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"model = KeyedVectors.load_word2vec_format(word2vec_glove_file)"
]
},
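{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick sanity check, added here as an illustration: indexing the model with a word returns its vector, so each entry should be a 100-dimensional numpy array.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each word maps to a 100-dimensional vector; peek at the first few entries.\n",
"print(model['king'].shape)\n",
"print(model['king'][:10])"
]
},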
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('barack', 0.937216579914093),\n",
" ('bush', 0.9272854328155518),\n",
" ('clinton', 0.8960003852844238),\n",
" ('mccain', 0.8875634074211121),\n",
" ('gore', 0.8000321388244629),\n",
" ('hillary', 0.7933663129806519),\n",
" ('dole', 0.7851964235305786),\n",
" ('rodham', 0.7518897652626038),\n",
" ('romney', 0.7488930225372314),\n",
" ('kerry', 0.7472623586654663)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar('obama')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('coconut', 0.7097253799438477),\n",
" ('mango', 0.7054824233055115),\n",
" ('bananas', 0.6887733340263367),\n",
" ('potato', 0.6629636287689209),\n",
" ('pineapple', 0.6534532308578491),\n",
" ('fruit', 0.6519854664802551),\n",
" ('peanut', 0.6420576572418213),\n",
" ('pecan', 0.6349173188209534),\n",
" ('cashew', 0.629442036151886),\n",
" ('papaya', 0.6246591210365295)]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar('banana')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('keyrates', 0.7173939347267151),\n",
" ('sungrebe', 0.7119239568710327),\n",
" ('þórður', 0.7067720293998718),\n",
" ('zety', 0.7056615352630615),\n",
" ('23aou94', 0.6959497928619385),\n",
" ('___________________________________________________________',\n",
"  0.694915235042572),\n",
" ('elymians', 0.6945434808731079),\n",
" ('camarina', 0.6927202939987183),\n",
" ('ryryryryryry', 0.6905654072761536),\n",
" ('maurilio', 0.6865653395652771)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar(negative='banana')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"queen: 0.7699\n"
]
}
],
"source": [
"result = model.most_similar(positive=['woman', 'king'], negative=['man'])\n",
"print(\"{}: {:.4f}\".format(*result[0]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def analogy(x1, x2, y1):\n",
"    # Solve the analogy x1 : x2 :: y1 : ?  (e.g. man : king :: woman : ?)\n",
"    result = model.most_similar(positive=[y1, x2], negative=[x1])\n",
"    return result[0][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Analogy](imgs/word2vec-king-queen-composition.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"analogy('japan', 'japanese', 'australia')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"analogy('australia', 'beer', 'france')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"analogy('obama', 'clinton', 'reagan')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"analogy('tall', 'tallest', 'long')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"analogy('good', 'fantastic', 'bad')"
]
},
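{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The `analogy` helper above uses gensim's default additive (3CosAdd) objective. Gensim also implements the multiplicative 3CosMul objective of Levy and Goldberg (2014) via `most_similar_cosmul`, which can be more robust when one term dominates the sum. The cell below is a small illustrative sketch, not part of the original notebook.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The same king - man + woman analogy, scored with the 3CosMul objective.\n",
"result = model.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])\n",
"print(\"{}: {:.4f}\".format(*result[0]))"
]
},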
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print(model.doesnt_match(\"breakfast cereal dinner lunch\".split()))"
]
},
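{
"cell_type": "markdown",
"metadata": {},
"source": [
"(As a companion to `doesnt_match`, KeyedVectors also exposes pairwise comparisons directly; the cell below is a brief added illustration.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cosine similarity between two words, and its complement (distance = 1 - similarity).\n",
"print(model.similarity('woman', 'man'))\n",
"print(model.distance('woman', 'man'))"
]
},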
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def display_pca_scatterplot(model, words=None, sample=0):\n",
"    # If no word list is given, plot a random sample of the vocabulary\n",
"    # (or the whole vocabulary if sample == 0).\n",
"    if words is None:\n",
"        if sample > 0:\n",
"            words = np.random.choice(list(model.vocab.keys()), sample)\n",
"        else:\n",
"            words = [word for word in model.vocab]\n",
"\n",
"    word_vectors = np.array([model[w] for w in words])\n",
"\n",
"    # Project the word vectors onto their first two principal components.\n",
"    twodim = PCA().fit_transform(word_vectors)[:, :2]\n",
"\n",
"    plt.figure(figsize=(6, 6))\n",
"    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')\n",
"    for word, (x, y) in zip(words, twodim):\n",
"        plt.text(x + 0.05, y + 0.05, word)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"display_pca_scatterplot(model,\n",
"    ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',\n",
"     'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',\n",
"     'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',\n",
"     'frog', 'toad', 'ape', 'kangaroo', 'wombat', 'wolf',\n",
"     'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',\n",
"     'homework', 'assignment', 'problem', 'exam', 'test', 'class',\n",
"     'school', 'college', 'university', 'institute'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"display_pca_scatterplot(model, sample=300)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,11 @@
Welcome to CS224N!

We'll be using Python throughout the course. If you've got a good Python setup already, great! But make sure that it is at least Python version 3.5. If not, the easiest thing to do is to make sure you have at least 3GB free on your computer, then head over to https://www.anaconda.com/download/ and install the Python 3 version of Anaconda. It will work on any operating system.

After you have installed conda, close any open terminals you might have. Then open a new terminal and run the following command:

conda install gensim

Homework 1 (only) is a Jupyter Notebook. With the above done, you should be able to get underway by typing:

jupyter notebook exploring_word_vectors.ipynb
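
If you want to sanity-check the install first (an optional step, not part of the assignment), this one-liner should print gensim's version without errors:

python -c "import gensim; print(gensim.__version__)"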