Word2Vec
Nuha Mohammed
April 2020
1 Introduction
Natural Language Processing (NLP) is a subfield of machine learning directed
toward reading and deriving meaning from human languages. NLP has a wide
array of applications, from spell checking to machine translation to semantic
analysis.
2 One-Hot Encoding
The easiest way to keep track of each word-index pair is to assign the words
to indices in alphabetical order. Then, in each word vector, a 1 is placed at the
index of that word and a 0 is placed at all other indices.
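As a minimal sketch of this encoding, assuming a small illustrative vocabulary
(the words here are simply the ones that appear in the weight-matrix example
later in these notes):

import numpy as np

# Illustrative vocabulary; in practice this would be every unique word in the corpus.
vocab = sorted(["kite", "always", "who", "there", "should"])
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's alphabetical index.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("kite"))  # [0. 1. 0. 0. 0.] for this vocabulary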
3 Word2Vec
Word2Vec models create continuous embedding vectors that represent the words
of a large text in an n-dimensional vector space.
In the following example, a Word2Vec model with 3 features is created, and
for each feature/category, each word is scored on its likelihood of belonging to
that category.
The embedding vectors show that “Daisy” and “Donald” are closest together
along the first dimension, so that dimension probably captures a feature that
“Daisy” and “Donald” share but “Rose” does not, and the same reasoning
applies to the other dimensions.
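For illustration only, the vectors below are made-up 3-feature embeddings for
the three words; the snippet just reads off which pair is closest along the first
dimension:

import numpy as np

# Hypothetical 3-feature embeddings (made-up values for illustration only).
embeddings = {
    "Daisy":  np.array([0.9, 0.2, 0.4]),
    "Donald": np.array([0.8, 0.7, 0.1]),
    "Rose":   np.array([0.1, 0.3, 0.9]),
}

# Distance along the first feature only.
for a, b in [("Daisy", "Donald"), ("Daisy", "Rose"), ("Donald", "Rose")]:
    gap = abs(embeddings[a][0] - embeddings[b][0])
    print(a, b, round(gap, 2))
# "Daisy" and "Donald" have the smallest gap on dimension 0, suggesting
# they share whatever feature that dimension encodes.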
Figure 2: Depiction of how context-word pairs are created
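A minimal sketch of how such context-word pairs might be generated, assuming
a tokenized sentence and a window size of 2 (both choices are illustrative, not
prescribed by the lecture):

def make_context_pairs(tokens, window=2):
    # For each center word, pair it with every word within `window` positions.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
print(make_context_pairs(sentence))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]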
3.2 Algorithms
Word2Vec includes two possible algorithms: Skip-Gram and Continuous Bag-
of-Words (CBOW). CBOW works by trying to predict the center word from its
neighboring context words, while Skip-Gram does the opposite and attempts to
predict the context words, given the center word.
Whether Skip-Gram or CBOW works better depends on the type and amount
of data. However, I am only going to go into detail on the Skip-Gram model in
this lecture, as the two are conceptually similar.
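For a sense of how the two variants are selected in practice, here is a short
sketch assuming the gensim library (not the lecture's own code, which is a
NumPy implementation linked in Section 4); in gensim 4.x the sg flag chooses
Skip-Gram (sg=1) or CBOW (sg=0):

from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences.
sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the dog sleeps all day".split(),
]

# sg=1 trains a Skip-Gram model; sg=0 would train CBOW instead.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["dog"][:5])           # first few entries of the learned embedding
print(model.wv.most_similar("dog"))  # nearest neighbours in the embedding space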
3.3 Training
In Skip-Gram models, the input is a one-hot encoded vector of the center word.
Each neuron within the output layer gives the probability that the word at that
position is a context word of that center word.
Since the output layer applies a softmax, it produces values between 0 and 1
that sum to 1, giving us probabilities. We want this predicted probability vector
to match the true distribution, which is represented by the one-hot encoded
vector of the actual context word.
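A minimal sketch of the softmax step with made-up output-layer scores, showing
that the resulting values lie between 0 and 1 and sum to 1:

import numpy as np

def softmax(scores):
    # Subtracting the max keeps the exponentials numerically stable.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])  # made-up output-layer scores
probs = softmax(scores)
print(probs, probs.sum())  # each entry is in [0, 1] and the entries sum to 1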
Figure 5: Weights of the hidden layer in the neural network
W and W’ are the weight matrices between the input and hidden layers and
between the hidden and output layers, respectively. The model seeks to optimize
these weight matrices by maximizing the probability of correctly predicting the
context words.
The weight matrices are passed into the cost function as variables and op-
timized through gradient descent. As with any other neural network, the model
aims to minimize its cost function:
h = W^T x
u_c = W'^T h = W'^T W^T x
y_c = \mathrm{softmax}(u_c)
L = -\log P(w_{c,1}, w_{c,2}, \ldots, w_{c,C} \mid w_0) = -\log \prod_{i=1}^{C} P(w_{c,i} \mid w_0)
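A minimal NumPy sketch of this forward pass and loss for a single (center,
context) pair, assuming a tiny vocabulary and randomly initialized weights (all
sizes and names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 3

# Randomly initialized weights: W (input -> hidden), W_prime (hidden -> output).
W = rng.normal(size=(vocab_size, hidden_size))
W_prime = rng.normal(size=(hidden_size, vocab_size))

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

x = np.zeros(vocab_size)
x[1] = 1.0                 # one-hot vector for the center word (index 1)
context_index = 3          # index of the true context word

h = W.T @ x                # hidden layer: the center word's embedding (a row of W)
u = W_prime.T @ h          # scores over the vocabulary
y = softmax(u)             # predicted probability of each word being a context word

loss = -np.log(y[context_index])  # negative log-likelihood of the true context word
print(loss)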
After training, each row of the input weight matrix W is the embedding vector
for one word:

W =
    always   0.5   0.3   7
    kite     8    -0.9   2.2
    should   0.3   5     2.1
    ...
    there    0.6   7     3.2
    who      2     0.5   4
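As a sketch of how these rows are used, the snippet below looks up embeddings
from the illustrative matrix above and compares two words with cosine
similarity:

import numpy as np

words = ["always", "kite", "should", "there", "who"]
W = np.array([
    [0.5, 0.3, 7.0],
    [8.0, -0.9, 2.2],
    [0.3, 5.0, 2.1],
    [0.6, 7.0, 3.2],
    [2.0, 0.5, 4.0],
])
word_to_index = {w: i for i, w in enumerate(words)}

def embedding(word):
    # Each word's embedding is simply its row of W.
    return W[word_to_index[word]]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embedding("should"), embedding("there")))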
The word embeddings serve as a foundation for many NLP tasks such as
topic modeling and extracting meaning from passages.
With the word embeddings themselves, you can discover underlying patterns
in data. One interesting application is in proteomics and genomics: BioVectors
encode n-grams of biological sequences such as DNA, RNA, and protein
sequences. Using techniques such as Skip-Gram modeling, scientists have been
able to group biological sequences based on underlying biochemical properties
and interactions.
4 Code Sample
Check out this implementation of Word2Vec with the Skip-Gram model:
https://github.com/DerekChia/word2vec_numpy/blob/master/wordtovec.py
5 Further Exploration
For large amounts of data, the Word2Vec network becomes inefficient because
it contains a huge number of weights. Gradient descent on such a network is
extremely slow, and over-fitting is hard to avoid. Therefore, the authors of
Word2Vec also introduced a technique called Negative Sampling, which addresses
this issue. I encourage you to look into their second paper to find out more
about these optimizations; a rough sketch of the idea follows below.
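As a rough sketch only (a simplification, not the authors' exact formulation;
the names and the sampling scheme here are assumptions), Negative Sampling
replaces the full softmax with a few binary classifications per training pair: one
for the true context word and a handful for randomly sampled "negative" words:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    # Push the score of the true (center, context) pair up...
    loss = -np.log(sigmoid(np.dot(center_vec, context_vec)))
    # ...and push the scores of a few sampled negative words down.
    for neg_vec in negative_vecs:
        loss += -np.log(sigmoid(-np.dot(center_vec, neg_vec)))
    return loss

dim = 3
center = rng.normal(size=dim)
context = rng.normal(size=dim)
negatives = [rng.normal(size=dim) for _ in range(5)]  # k = 5 sampled negatives
print(negative_sampling_loss(center, context, negatives))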
6 References
The information and images I used in these lectures came from multiple sources.
Credits go to their respective owners.
• Stanford CS224d Lecture: Deep Learning for NLP
• Word2Vec Paper: Efficient Estimation of Word Representations in Vector
Space