Introduced by Mikolov et al. in two papers in 2013 (Mikolov et al., Efficient Estimation of Word Representations in Vector Space, and Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality), Word2Vec is a widely popular way of getting vector representations for words. The core assumption of Word2Vec is that if two words are used in similar contexts, they should share a similar meaning and therefore a similar vector representation. These vector representations can then be used for clustering a set of documents or for text classification tasks.
In 2014, Omer Levy and Yoav Goldberg demonstrated that Word2Vec can be approximated by "factorizing a word-context matrix whose cells are the pointwise mutual information (PMI) of the respective word and context pairs" (Neural Word Embedding as Implicit Matrix Factorization). Unlike Word2Vec, which is based on a neural network trained with gradient descent, Levy and Goldberg's method relies only on word counts, information theory, and the factorization of a matrix with the well-known Singular Value Decomposition (SVD). They also show that this method produces word embeddings that achieve performance comparable to Word2Vec's.
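To make the idea concrete, here is a minimal sketch of the SPPMI-SVD construction, not the exact implementation used in this project. It assumes a hypothetical `cooc` dictionary mapping (word id, context id) pairs to co-occurrence counts; the embedding dimension and the negative-sampling shift are illustrative values.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def sppmi_svd(cooc, vocab_size, dim=100, neg=5):
    """Sketch of SPPMI-SVD word embeddings (Levy & Goldberg, 2014)."""
    rows, cols, vals = zip(*((w, c, n) for (w, c), n in cooc.items()))
    counts = csr_matrix((vals, (rows, cols)), shape=(vocab_size, vocab_size))

    total = float(counts.sum())                          # total number of (word, context) pairs
    word_sums = np.asarray(counts.sum(axis=1)).ravel()   # #(w)
    ctx_sums = np.asarray(counts.sum(axis=0)).ravel()    # #(c)

    # PMI(w, c) = log( #(w,c) * total / (#(w) * #(c)) ), shifted by log(neg)
    # and clipped at zero to obtain the shifted positive PMI (SPPMI) matrix.
    sppmi = counts.tocoo()
    pmi = np.log(sppmi.data * total / (word_sums[sppmi.row] * ctx_sums[sppmi.col]))
    sppmi.data = np.maximum(pmi - np.log(neg), 0)

    # Truncated SVD of the sparse SPPMI matrix; word vectors are U * sqrt(S).
    u, s, _ = svds(sppmi.tocsr(), k=dim)
    return u * np.sqrt(s)
```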
This project compares the performance of Gensim's popular implementation of Word2Vec with my own implementation of Levy and Goldberg's model on a sentiment analysis task.
The data comes from Ahmed Besbes' blog post Sentiment Analysis on twitter using word2vec and keras. You can find the data here.
The data consists of more than a million tweets; each tweet comes with its text and a binary sentiment label (positive / negative feeling).
The code is organized as follows: a light pre-processing of the data, the creation of both a Word2Vec and a SPPMI-SVD model, the training of a classifier for each resulting word embedding, and the comparison of the two resulting sentiment classifications. Everything lives in a Python Jupyter Notebook.
Briefly pre-processed the tweets, removing hyperlinks and hashtags and keeping only tweets long enough to learn word embeddings from.
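The sketch below shows the kind of cleaning this step refers to, assuming the tweets are in a pandas DataFrame with a `text` column; the exact regexes and the minimum-length threshold are illustrative, not necessarily those used in the notebook.

```python
import re
import pandas as pd

def clean_tweet(text):
    text = re.sub(r"https?://\S+", "", text)   # drop hyperlinks
    text = re.sub(r"#\w+", "", text)           # drop hashtags
    return text.strip()

def preprocess(df, min_tokens=3):
    df = df.copy()
    df["text"] = df["text"].astype(str).map(clean_tweet)
    # keep only tweets long enough to learn word embeddings from
    return df[df["text"].str.split().str.len() >= min_tokens]
```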
Using Gensim's Word2Vec implementation, and my own implementation of SPPMI-SVD, built two different vector representations of the corpus' words. With these vector representations, transformed each tweet into a vector using spaCy.
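As a hedged sketch of this step: Word2Vec can be trained with Gensim as below, and each tweet turned into a vector by averaging its word vectors. The project itself uses spaCy for the tweet-to-vector step; plain averaging is shown here only to make the idea concrete, and the hyperparameters and the `df` DataFrame (from the pre-processing sketch) are assumptions.

```python
import numpy as np
from gensim.models import Word2Vec

# Tokenize the cleaned tweets (whitespace split for illustration).
tokenized = [tweet.split() for tweet in df["text"]]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=5, workers=4)

def tweet_vector(tokens, keyed_vectors, dim=100):
    # Average the vectors of in-vocabulary tokens; zero vector if none.
    vecs = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_w2v = np.vstack([tweet_vector(toks, w2v.wv) for toks in tokenized])
# X_sppmi would be built the same way from the SPPMI-SVD word vectors.
```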
Trained two Stochastic Gradient Descent classifiers, one for each word embedding.
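A minimal sketch of this training step, assuming the tweet vectors `X_w2v` and `X_sppmi` from the previous step and binary labels `y`; the split and the classifier hyperparameters are illustrative.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Same random_state so both embeddings are split on the same tweets.
X_w2v_train, X_w2v_test, y_train, y_test = train_test_split(
    X_w2v, y, test_size=0.2, random_state=42)
X_sppmi_train, X_sppmi_test, _, _ = train_test_split(
    X_sppmi, y, test_size=0.2, random_state=42)

clf_w2v = SGDClassifier(max_iter=1000, random_state=42).fit(X_w2v_train, y_train)
clf_sppmi = SGDClassifier(max_iter=1000, random_state=42).fit(X_sppmi_train, y_train)
```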
Compared different metrics of the sentiment classification task between the two classifiers.
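An illustrative way to run this comparison, continuing from the sketch above; the exact metrics reported in the notebook may differ.

```python
from sklearn.metrics import accuracy_score, f1_score

for name, clf, X_test in [("Word2Vec", clf_w2v, X_w2v_test),
                          ("SPPMI-SVD", clf_sppmi, X_sppmi_test)]:
    preds = clf.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"F1={f1_score(y_test, preds):.3f}")
```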
Obtained decent sentiment classification performance from both Word2Vec and SPPMI-SVD. Even better, SPPMI-SVD's results came very close to Word2Vec's, making it a solid alternative to Word2Vec.