How Exactly Does Word2vec Work?
David Meyer
dmm@{1-4-5.net,uoregon.edu,brocade.com,...}
1 Introduction
The word2vec model [4] and its applications have recently attracted a great deal of attention from the machine learning community. The dense vector representations of words learned by word2vec have been shown, remarkably, to carry semantic meaning and are useful in a wide range of use cases, ranging from natural language processing to network flow data analysis.
Perhaps the most amazing property of these word embeddings is that the vector encodings somehow capture the semantic meanings of the words. How, or why, is this so? The answer is that the vectors adhere surprisingly well to our intuition. For instance, words that we know to be synonyms tend to have similar vectors in terms of cosine similarity, and antonyms tend to have dissimilar vectors. Even more surprisingly, word vectors tend to obey the laws of analogy. For example, consider the analogy "Woman is to queen as man is to king". It turns out that

v_{queen} - v_{woman} + v_{man} \approx v_{king}

where v_{queen}, v_{woman}, v_{man} and v_{king} are the word vectors for queen, woman, man, and king respectively. These observations strongly suggest that word vectors encode valuable semantic information about the words that they represent.
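To make this concrete, here is a minimal numpy sketch of the analogy computation. The embedding table below is a tiny made-up example used only to illustrate the arithmetic; the semantic regularities themselves only emerge from vectors trained on a large corpus.

import numpy as np

# Toy embedding table standing in for trained word2vec vectors.
vectors = {
    "king":  np.array([0.80, 0.60, 0.10]),
    "queen": np.array([0.78, 0.62, 0.92]),
    "man":   np.array([0.75, 0.55, 0.05]),
    "woman": np.array([0.73, 0.57, 0.87]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# v_king - v_man + v_woman should land close to v_queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # 'queen' for these toy vectors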
Note that there are two main word2vec models: Continuous Bag of Words (CBOW) and
Skip-Gram. In the CBOW model, we predict a word given a context (a context can be
something like a sentence). Skip-Gram is the opposite: predict the context given an input
word. Each of these models is examined below.
This document contains my notes on word2vec. NB: there are probably lots of mistakes in this...
Figure 1: Simple CBOW Model
In the scenario depicted in Figure 1, V is the vocabulary size and the hyper-parameter N is the hidden layer size. The input vector x = (x_1, x_2, \ldots, x_V) is one-hot encoded, that is, some x_k = 1 and all other x_{k'} = 0 for k' \neq k.

The weights between the input layer and the hidden layer can be represented by the V \times N matrix W. Each row of W is the N-dimensional vector representation v_w of the associated word w in the input layer. Given a context (here a single word), and assuming again that x_k = 1 and x_{k'} = 0 for k' \neq k (one-hot encoding), then

h = x^T W = W_{(k,\cdot)} := v_{w_I}    (1)

This essentially copies the k-th row of W to h (this is due to the one-hot encoding of x).
Here v_{w_I} is the vector representation of the input word w_I. Note that this implies that the link (aka activation) function of the hidden units is linear (g(x) = x). Recall that

W_{V \times N} =
\begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1N} \\
w_{21} & w_{22} & \cdots & w_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
w_{V1} & w_{V2} & \cdots & w_{VN}
\end{pmatrix}    (2)
and
x =
\begin{pmatrix}
x_1 \\
x_2 \\
\vdots \\
x_V
\end{pmatrix}    (3)
so that
h = x^T W = \begin{pmatrix} x_1 & x_2 & \cdots & x_k & \cdots & x_V \end{pmatrix}
\begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1N} \\
w_{21} & w_{22} & \cdots & w_{2N} \\
\vdots & \vdots &        & \vdots \\
w_{k1} & w_{k2} & \cdots & w_{kN} \\
\vdots & \vdots &        & \vdots \\
w_{V1} & w_{V2} & \cdots & w_{VN}
\end{pmatrix}    (4)

= \begin{pmatrix} x_k w_{k1} & x_k w_{k2} & \cdots & x_k w_{kN} \end{pmatrix}    # for x_k = 1, x_{k'} = 0 for k' \neq k    (5)

= \begin{pmatrix} w_{k1} & w_{k2} & \cdots & w_{kN} \end{pmatrix}    # the k-th row of W    (6)

= W_{(k,\cdot)}    (7)

:= v_{w_I}    (8)
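As a sanity check, the following numpy sketch (with arbitrary toy dimensions) confirms that multiplying a one-hot vector by W simply selects the k-th row:

import numpy as np

V, N = 5, 3                      # toy vocabulary and hidden-layer sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))      # input -> hidden weights

k = 2                            # index of the (single) context word
x = np.zeros(V)
x[k] = 1.0                       # one-hot input vector

h = x @ W                        # Equation 4: h = x^T W
assert np.allclose(h, W[k])      # Equations 5-8: h is just the k-th row of W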
From the hidden layer to the output layer there is a different weight matrix W' = \{w'_{ij}\}, which is an N \times V matrix. Using this we can compute a score for each word in the vocabulary:

u_j = v'^{T}_{w_j} \cdot h    (9)

where v'_{w_j} is the j-th column of the matrix W'. Note that the score u_j is a measure of the match between the context and the next word¹ and is computed by taking the dot product between the predicted representation (v'_{w_j}) and the representation of the candidate target word (h = v_{w_I}).
Now we can use a softmax (log-linear) classification model to obtain the posterior distribution over words (which turns out to be multinomial):

p(w_j | w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}    (10)
¹ Though almost all statistical language models predict the next word, it is also possible to model the distribution of the word preceding the context or surrounded by the context.
where y_j is the output of the j-th unit of the output layer. Substituting Equations 1 and 9 into Equation 10 we get

p(w_j | w_I) = \frac{\exp(v'^{T}_{w_j} v_{w_I})}{\sum_{j'=1}^{V} \exp(v'^{T}_{w_{j'}} v_{w_I})}    (11)
Notes
• Both the input vector x and the output y are one-hot encoded.
• v_w and v'_w are two different representations of the word w (v_w is a row of W, while v'_w is a column of W').
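Putting Equations 1, 9, and 10 together, a minimal numpy sketch of the CBOW forward pass (toy sizes, random weights) looks like this:

import numpy as np

V, N = 5, 3                         # toy vocabulary and hidden-layer sizes
rng = np.random.default_rng(1)
W = rng.normal(size=(V, N))         # input -> hidden weights (rows are v_w)
W_prime = rng.normal(size=(N, V))   # hidden -> output weights (columns are v'_w)

k = 2                               # index of the input (context) word w_I
h = W[k]                            # Equation 1: h = v_{w_I}

u = W_prime.T @ h                   # Equation 9: u_j = v'_{w_j}^T h for every j
y = np.exp(u - u.max())             # subtract the max for numerical stability
y /= y.sum()                        # Equation 10: softmax posterior p(w_j | w_I)
print(y, y.sum())                   # a proper distribution over the V words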
The training objective is to maximize Equation 11, the conditional probability of observing the actual output word w_O given the input context word w_I; equivalently, we minimize E = - log p(w_O | w_I), our loss function, where j^* is the index of the actual output word in the output layer. Note (again) that this loss function can be understood as a special case of the cross-entropy measurement between two probability distributions.
The next step is to derive the update equation for the weights between the hidden and output layers. Taking the derivative of E with respect to the j-th unit's net input u_j, we obtain

\frac{\partial E}{\partial u_j} = y_j - t_j := e_j    (15)

where t_j = 1(j = j^*) is the indicator function, i.e., t_j is 1 when the j-th unit is the actual output word and 0 otherwise. Interestingly, this derivative is simply the prediction error e_j of the output layer.
The next step is to take the derivative with respect to w'_{ij} to obtain the gradient on the hidden → output weights, which (by the chain rule) is

\frac{\partial E}{\partial w'_{ij}} = \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial w'_{ij}} = e_j \cdot h_i    (16)
Now, using stochastic gradient descent, we obtain the weight update equation for the hidden → output weights:

w'^{(new)}_{ij} = w'^{(old)}_{ij} - \eta \cdot e_j \cdot h_i    (21)

or, equivalently, in vector form

v'^{(new)}_{w_j} = v'^{(old)}_{w_j} - \eta \cdot e_j \cdot h    (22)

where \eta > 0 is the learning rate (standard SGD), e_j = y_j - t_j (Equation 15), h_i is the i-th unit in the hidden layer, and v'_{w_j} is the output vector for word w_j.
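A hedged numpy sketch of this hidden → output update (toy sizes, a single training pair, and a made-up learning rate) might look as follows; Equation 22 is applied to every column of W' at once via an outer product:

import numpy as np

V, N, eta = 5, 3, 0.1               # toy sizes and an assumed learning rate
rng = np.random.default_rng(2)
W = rng.normal(size=(V, N))         # input -> hidden weights
W_prime = rng.normal(size=(N, V))   # hidden -> output weights

k, j_star = 2, 4                    # input word index and actual output word index
h = W[k]                            # Equation 1

u = W_prime.T @ h                   # Equation 9
y = np.exp(u - u.max())
y /= y.sum()                        # Equation 10

t = np.zeros(V)
t[j_star] = 1.0
e = y - t                           # Equation 15: e_j = y_j - t_j

# Equations 16, 21, 22: the gradient of E wrt w'_{ij} is e_j * h_i, so the
# full gradient matrix is the outer product of h and e
W_prime -= eta * np.outer(h, e)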
2.2 Updating Weights: Input to hidden layers
Now that we have the update equations for W', we can look at W. Here we take the derivative of E with respect to the output of the hidden layer:

\frac{\partial E}{\partial h_i} = \sum_{j=1}^{V} \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial h_i}    (23)

= \sum_{j=1}^{V} e_j \cdot w'_{ij}    (24)

:= EH_i    (25)
where h_i is the output of the i-th unit in the hidden layer, u_j is defined in Equation 9 (the net input of the j-th unit of the output layer), and e_j = y_j - t_j is the prediction error of the j-th word in the output layer. EH is an N-dimensional vector that is the sum of the output vectors of all words in the vocabulary, weighted by their prediction errors e_j.
The next job is to take the derivative of E with respect to W. First recall that the hidden layer performs a linear computation on the values coming from the input layer, specifically

h_i = \sum_{k=1}^{V} x_k \cdot w_{ki}    # see Equation 1    (26)
Now we can use the chain rule to get the derivative of E with respect to W, as follows (chain rule again):

\frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial w_{ki}}    (27)

= EH_i \cdot x_k    (28)

and so, written as an outer product,

\frac{\partial E}{\partial W} = x \otimes EH    (29)

From this we obtain a V \times N matrix. Notice that since only one component of x is non-zero, only one row of \frac{\partial E}{\partial W} is non-zero, and the value of that row is EH (a 1 \times N vector).
The weight update equation for the input → hidden weights is therefore

v^{(new)}_{w_I} = v^{(old)}_{w_I} - \eta \cdot EH    (30)

Here v_{w_I} is a row of W (namely the input vector of the (only) context word), and because of the one-hot encoding it is the only row of W whose derivative is non-zero.
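Continuing the sketch above, the input → hidden update touches only the row of W belonging to the input word:

import numpy as np

V, N, eta = 5, 3, 0.1
rng = np.random.default_rng(3)
W = rng.normal(size=(V, N))
W_prime = rng.normal(size=(N, V))

k, j_star = 2, 4
h = W[k]
u = W_prime.T @ h
y = np.exp(u - u.max())
y /= y.sum()
t = np.zeros(V)
t[j_star] = 1.0
e = y - t                           # Equation 15

EH = W_prime @ e                    # Equations 23-25: EH_i = sum_j e_j * w'_{ij}

# Equations 29-30: only the row of W for the input word w_I has a non-zero
# gradient, and that gradient is EH
W[k] -= eta * EH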
Figure 2: General CBOW Model
4 Skip-Gram Model
The Skip-Gram model was introduced in Mikolov et al. [3] and is depicted in Figure 3. As you can see, Skip-Gram is the opposite of CBOW: in Skip-Gram we predict the context C given an input word, whereas in CBOW we predict the word from C. Basically, the training objective of the Skip-Gram model is to learn word vector representations that are good at predicting nearby words in the associated context(s).
For the Skip-Gram model we continue to use v_{w_I} to denote the input vector of the only word on the input layer, and as a result we have the same definition of the hidden-layer output h as in Equation 1 (again this means that h copies the row of the input → hidden weight matrix W associated with the input word w_I). Recall that the definition of h was

h = W_{(k,\cdot)} := v_{w_I}    (31)
Now, at the output layer, instead of outputting one multinomial distribution, we output
C multinomial distributions. Each output is computed using the same hidden → output
matrix as follows:
Figure 3: Skip-Gram Model
p(w_{c,j} = w_{O,c} | w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})}    (32)
where w_{c,j} is the j-th word on the c-th panel of the output layer, w_{O,c} is the actual c-th word in the output context, and w_I is the (only) input word. y_{c,j} is the output of the j-th unit on the c-th panel of the output layer, and u_{c,j} is the input of the j-th unit on the c-th panel of the output layer.

Said in words, Equation 32 is the probability that our prediction of the j-th word on the c-th panel, w_{c,j}, equals the actual c-th output word, w_{O,c}, conditioned on the input word w_I.
Now, because the output layer panels share the same weights, we have

u_{c,j} = u_j = v'^{T}_{w_j} \cdot h, \quad for c = 1, 2, \ldots, C    (33)

where v'_{w_j} is again a column of the hidden → output weight matrix W'.
Given all of this, the parameter update equations are not so different from those of the one-context-word CBOW model. The loss function is changed to
E = - \log p(w_{O,1}, w_{O,2}, \ldots, w_{O,C} | w_I)    (34)

= - \log \prod_{c=1}^{C} \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{V} \exp(u_{j'})}    (35)

= - \sum_{c=1}^{C} u_{c,j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})    (36)

where j_c^* is the index of the actual c-th output context word in the vocabulary.³
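As an illustration, the following sketch evaluates the Skip-Gram loss of Equation 36 for a toy model with an assumed context of two words:

import numpy as np

V, N, C = 5, 3, 2                   # toy vocabulary, hidden size, context size
rng = np.random.default_rng(4)
W = rng.normal(size=(V, N))
W_prime = rng.normal(size=(N, V))

k = 2                               # the input word w_I
context = [0, 4]                    # assumed indices j_c^* of the context words

h = W[k]                            # Equation 1 / 31
u = W_prime.T @ h                   # shared scores u_j (Equation 33)

# Equation 36: E = - sum_c u_{j_c^*} + C * log sum_{j'} exp(u_{j'})
log_Z = np.log(np.sum(np.exp(u)))
E = -sum(u[j] for j in context) + C * log_Z
print(E)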
Now, if we take the derivative of E with respect to the net input of every unit on every panel of the output layer (i.e., u_{c,j}), we get

\frac{\partial E}{\partial u_{c,j}} = y_{c,j} - t_{c,j} := e_{c,j}    (37)
which again is the prediction error on the unit (the same form as Equation 15). Now define EI = \{EI_1, EI_2, \ldots, EI_V\} (a V-dimensional vector) to be the sum of the prediction errors over all context words, that is,

EI_j = \sum_{c=1}^{C} e_{c,j}    (38)
Next, taking the derivative of E with respect to w'_{ij} (the chain rule again) gives

\frac{\partial E}{\partial w'_{ij}} = \sum_{c=1}^{C} \frac{\partial E}{\partial u_{c,j}} \cdot \frac{\partial u_{c,j}}{\partial w'_{ij}} = EI_j \cdot h_i    (39)
Given all of this machinery we can now get the update equation for the hidden → output matrix W', which might look familiar by this time:

w'^{(new)}_{ij} = w'^{(old)}_{ij} - \eta \cdot EI_j \cdot h_i    (40)

or

v'^{(new)}_{w_j} = v'^{(old)}_{w_j} - \eta \cdot EI_j \cdot h, \quad j = 1, 2, \ldots, V    (41)
³ Recall that \log(AB) = \log A + \log B.
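For completeness, here is a sketch of the corresponding Skip-Gram update of W' using EI (Equations 37-41), again with toy sizes and an assumed context:

import numpy as np

V, N, C, eta = 5, 3, 2, 0.1
rng = np.random.default_rng(5)
W = rng.normal(size=(V, N))
W_prime = rng.normal(size=(N, V))

k, context = 2, [0, 4]              # input word and assumed context word indices
h = W[k]
u = W_prime.T @ h
y = np.exp(u - u.max())
y /= y.sum()                        # the same distribution on every panel

# Equations 37-38: EI_j = sum_c (y_j - t_{c,j})
EI = C * y
for j in context:
    EI[j] -= 1.0

# Equations 39-41: v'_{w_j} <- v'_{w_j} - eta * EI_j * h, for every j
W_prime -= eta * np.outer(h, EI)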
4.1 Revisiting Learning In Neural Probabilistic Language Models
We saw in previous sections that both the CBOW and Skip-Gram Neural Probabilistic Language Models (NPLMs) have well-defined posterior distributions and loss functions. First, the basic form of these posteriors is as follows (roughly following the notation in [6]): in a neural language model, either CBOW or Skip-Gram, the conditional distribution corresponding to context c, P_c(w), is defined to be

P_c(w) = \frac{\exp(s_\theta(w, c))}{\sum_{w'} \exp(s_\theta(w', c))}    (42)

where s_\theta(w, c) is a scoring function (i.e., Equation 9 above) with parameters \theta which quantifies the compatibility of word w with context c.⁴
Note that an NPLM represents each word in the vocabulary using a real-valued vector and defines the scoring function (s_\theta, or equivalently u_j) in terms of the vectors of the context words and the next word. The important point here is that these vectors account for most of the parameters in neural language models, which in turn means that their memory requirements are linear in the vocabulary size. More generally, the time complexity of these models is O(|V| \cdot n), where |V| is the size of the input vocabulary and n is the size of the input vectors x \in R^n. If x is one-hot encoded (as in the CBOW and Skip-Gram models), then this complexity is O(|V|^2).
Given these complexity considerations, [4] describes two alternative training objectives for the Skip-Gram model: Hierarchical softmax and Skip-Gram Negative Sampling (SGNS); we will focus on SGNS here. The rest of this section is organized as follows: Section 4.3 reviews
⁴ E = -s_\theta(w, c) is sometimes referred to as the energy function [1].
the basics of Parametric Density Estimation, and Section 4.4 describes Noise-Contrastive Estimation (NCE), in which we consider the situation where the model probability density function is unnormalized (recall that the problem with the softmax is that it is computationally expensive, due to the requirement for explicit normalization). Finally, Section 4.6 looks at a modification of NCE called Negative Sampling.
With these definitions we can describe the Parametric Density Estimation (PDE) problem. In particular, the PDE problem is about finding \theta^* from the observed sample X. Note also that any estimate \hat{\theta} must yield a normalized pdf p_m(\cdot; \hat{\theta}) which satisfies two properties:

\int p_m(u; \hat{\theta}) \, du = 1    (43)
and

p_m(u; \hat{\theta}) \geq 0 \quad for all u    (44)
Now, if both constraints hold for all θ (and not only θ̂), then we say that the model is
normalized. If the constraint in Equation 44 holds but Equation 43 does not, we say that
the model is unnormalized. The assumption, however, is that there is at least one value of
the parameters for which an unnormalized model integrates to one (Equation 43), namely
θ∗ .
Next, denote an unnormalized model, parameterized by some \alpha, as p^0_m(\cdot; \alpha). Then the partition function Z(\alpha) is defined as

Z(\alpha) = \int p^0_m(u; \alpha) \, du    (45)
Z(\alpha) can be used to convert the unnormalized model p^0_m(u; \alpha) into a normalized one: p^0_m(u; \alpha)/Z(\alpha) integrates to one for every value of \alpha (as required by Equation 43).
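A small numerical sketch of this idea, using an unnormalized Gaussian shape as an assumed p^0_m and a simple Riemann sum in place of the integral:

import numpy as np

# An unnormalized model p0_m(u; alpha): a Gaussian shape with the normalizing
# 1/sqrt(2*pi*alpha) factor deliberately left out.
def p0_m(u, alpha):
    return np.exp(-u**2 / (2.0 * alpha))

alpha = 2.0
grid = np.linspace(-20.0, 20.0, 200001)
du = grid[1] - grid[0]

# Equation 45: Z(alpha) = integral of p0_m(u; alpha) du (a Riemann sum here)
Z = np.sum(p0_m(grid, alpha)) * du

# p0_m / Z now integrates to (approximately) one, as Equation 43 requires
print(Z, np.sum(p0_m(grid, alpha) / Z) * du)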
For the language models considered here, the corresponding partition function and normalized distribution are

Z(\theta) = \sum_{w'} \exp(s_\theta(w', h))    (46)

P^h_\theta(w) = \frac{\exp(s_\theta(w, h))}{Z(\theta)}    (47)
Note that I have changed the symbol used for the context from c to h (which is also sometimes used for the context) to avoid a name clash with the normalization parameter c below.
For an unnormalized model, NCE parameterizes the log-density as

\ln p_m(\cdot; \theta) = \ln p^0_m(\cdot; \alpha) + c    (48)

with parameters \theta = (\alpha, c). The estimate \hat{\theta} = (\hat{\alpha}, \hat{c}) is intended to make the unnormalized model p^0_m(\cdot; \hat{\alpha}) match the shape of p_d, while \hat{c} provides the proper scaling so that the constraints (Equations 43 and 44) hold. Note that separating the estimation of shape and scale in this way is not possible for maximum likelihood estimation (MLE), since the likelihood can be made arbitrarily large by setting the normalizing parameter c to ever larger values.
The key observation underlying NCE is that density estimation is largely about characterizing properties of the observed data X = (x_1, x_2, \ldots, x_{T_d}), and a convenient way to describe properties is to describe them relative to the properties of some reference data Y. Now, assume that the reference (noise) data Y is an iid sample⁵ (y_1, y_2, \ldots, y_{T_n}) of a random variable y \in R^n with probability density function (pdf) p_n, and let the (unknown) pdf of X be p_d. Then a relative description of the data X can be given by the ratio p_d/p_n of the two density functions. If the reference distribution p_n is known (which it is), we can recover p_d from the ratio p_d/p_n. That is, since we know the differences between X and Y and also the properties of Y, we can deduce the properties of X. Finally, following [2], we assume that the noise samples are k times more frequent than the data samples, so that a point drawn from the combined data comes from the mixture

\frac{1}{k+1} P^h_d(w) + \frac{k}{k+1} P_n(w)    (49)
NCE connects the problem of PDE to supervised learning, in particular to logistic regression, and provides a hint as to how the proposed estimator works: by discriminating, or comparing, between data and noise, NCE can learn properties of the data in the form of a statistical model. That is, the key idea behind noise-contrastive estimation is "learning by comparison".
So how does this supervised learning work, and what exactly does it estimate? Consider first the following notation (which, with minor modifications, largely follows [2]): let U = X \cup Y = (u_1, u_2, \ldots, u_{T_d+T_n}). NCE then converts the problem of density estimation into a binary classification problem as follows: for each u_t \in U, assign a class label C_t such that C_t = 1 if u_t \in X and C_t = 0 if u_t \in Y. Now we can use logistic regression to estimate the posterior probabilities, since
P(A) = \sum_n P(A \cap B_n)    # by the Sum Rule    (50)

= \sum_n P(A, B_n)    # alternate notation    (51)

= \sum_n P(A | B_n) P(B_n)    # by the Product Rule    (52)
so that the posterior distribution P(C_1 | x) for two classes C_1 and C_2, given an input vector x, would look like
⁵ Independent and identically distributed.
P(C_1 | x) = \frac{P(x | C_1) P(C_1)}{P(x | C_1) P(C_1) + P(x | C_2) P(C_2)}    (53)

= \frac{1}{1 + \frac{P(x | C_2) P(C_2)}{P(x | C_1) P(C_1)}}    (54)
Now, if we set

a = \ln \frac{P(x | C_1) P(C_1)}{P(x | C_2) P(C_2)}    (55)

then

P(C_1 | x) = \frac{1}{1 + e^{-a}} = \sigma(a)    (56)
that is, the sigmoid function. This starts to give us a sense that the sigmoid function is related to the log of the ratio of the likelihoods p(x | C_1) and p(x | C_2), or, in our context, p_d/p_n.
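A quick numerical check of Equations 53-56, with arbitrary made-up likelihoods and priors:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Arbitrary likelihoods and priors for the two classes at some input x
px_c1, px_c2 = 0.30, 0.05           # p(x | C1), p(x | C2)
pc1, pc2 = 0.4, 0.6                 # p(C1), p(C2)

posterior = px_c1 * pc1 / (px_c1 * pc1 + px_c2 * pc2)   # Equation 53
a = np.log((px_c1 * pc1) / (px_c2 * pc2))               # Equation 55
print(posterior, sigmoid(a))                            # identical (Equation 56)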
Now, since the pdf p_d of x is unknown, we can model the class-conditional probability p(\cdot | C = 1) with p_m(\cdot; \theta). The class-conditional probability densities are then

p(u | C = 1) = p_m(u; \theta)    (57)

p(u | C = 0) = p_n(u)    (58)

and the prior probabilities are
p(C = 1) = \frac{T_d}{T_d + T_n}    (59)

p(C = 0) = \frac{T_n}{T_d + T_n}    (60)
and the posteriors are therefore
P(C = 1 | u; \theta) = \frac{p_m(u; \theta)}{p_m(u; \theta) + k \cdot p_n(u)}    (61)

P(C = 0 | u; \theta) = \frac{k \cdot p_n(u)}{p_m(u; \theta) + k \cdot p_n(u)}    (62)
where k is the ratio P(C = 0)/P(C = 1) = T_n/T_d (remembering that noise samples y_i are k times more frequent than data samples x_i).
Note that the class labels C_t are Bernoulli-distributed. Recall the details of the Bernoulli distribution: the random variable Y takes values y_i \in \{0, 1\}, and the Bernoulli distribution is a Binomial(1, p) distribution, where 0 < p < 1 and P(Y = y) = p^y (1 - p)^{1-y}. The probability that Y_i = y_i for i = 1, 2, \ldots, n is then

P(Y) = \prod_{i=1}^{n} p^{y_i} (1 - p)^{1-y_i}    (63)
The conditional log-likelihood of the class labels is then

\ell(\theta) = \sum_{t=1}^{T_d+T_n} \left[ C_t \ln P(C_t = 1 | u_t; \theta) + (1 - C_t) \ln P(C_t = 0 | u_t; \theta) \right]    (65)

= \sum_{t=1}^{T_d} \ln h(x_t; \theta) + \sum_{t=1}^{T_n} \ln \left[ 1 - h(y_t; \theta) \right]    (66)
where G(u; \theta) is the log-ratio of the model and noise densities and h(u; \theta) is the corresponding class posterior:

G(u; \theta) = \ln p_m(u; \theta) - \ln p_n(u)    (67)

h(u; \theta) = P(C = 1 | u; \theta) = \frac{1}{1 + k \exp(-G(u; \theta))}    (68)
Optimizing \ell(\theta) with respect to \theta leads to an estimate G(\cdot; \theta) of the log ratio \ln(p_d/p_n) (see Equation 55 for the derivation). That is, an approximate description of X relative to Y is given by Equation 66. Interestingly, the sign-inverted objective function, -\ell(\theta), is also known as the cross entropy (or cross-entropy error function).
Now, given an unnormalized statistical model p^0_m(\cdot; \alpha), the NCE technique adds an additional normalization parameter c to the model, and defines

\ln p_m(\cdot; \theta) = \ln p^0_m(\cdot; \alpha) + c    (69)

where \theta = (\alpha, c). The parameter c scales the unnormalized model so that Equation 43 holds. After learning, \hat{c} provides an estimate of \ln \frac{1}{Z(\hat{\alpha})} (this is closely related to Equation 55).
The (scaled) NCE objective is then

J_T(\theta) = \frac{1}{T_d} \left\{ \sum_{t=1}^{T_d} \ln h(x_t; \theta) + \sum_{i=1}^{T_n} \ln \left[ 1 - h(y_i; \theta) \right] \right\}    (70)
and the NCE estimator is defined to be the argument \hat{\theta} which minimizes -J_T(\theta) (or, alternatively, maximizes J_T(\theta)), where h(\cdot; \theta) is the nonlinearity defined in Equation 68.
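The following is a rough numpy sketch of the NCE objective for a one-dimensional toy problem (Gaussian data, Gaussian noise). The model p^0_m here is an unnormalized Gaussian shape and c is the extra normalization parameter of Equation 69; the specific distributions and parameter values are illustrative assumptions, not anything prescribed by [2].

import numpy as np

rng = np.random.default_rng(6)
Td, k = 1000, 5
Tn = k * Td

# Noise distribution p_n: standard normal. Data: normal with mean 2.
X = rng.normal(loc=2.0, scale=1.0, size=Td)       # observed data
Y = rng.normal(loc=0.0, scale=1.0, size=Tn)       # noise sample, k times larger

def log_pn(u):
    return -0.5 * u**2 - 0.5 * np.log(2.0 * np.pi)

def log_pm(u, theta):
    mu, c = theta                                  # alpha = mu, plus the scaling c
    return -0.5 * (u - mu)**2 + c                  # ln p0_m(u; mu) + c (Equation 69)

def h(u, theta):
    # Equation 68 (as reconstructed): h = 1 / (1 + k * exp(-G)), G = ln p_m - ln p_n
    G = log_pm(u, theta) - log_pn(u)
    return 1.0 / (1.0 + k * np.exp(-G))

def J(theta):
    # Equation 70: the NCE objective
    return (np.sum(np.log(h(X, theta))) + np.sum(np.log(1.0 - h(Y, theta)))) / Td

theta = (2.0, -0.5 * np.log(2.0 * np.pi))          # near the true (mu, c)
print(J(theta))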
Mikolov et al. [5] define the negative sampling objective

\log \sigma(v'^{T}_{w_O} v_{w_I}) + \sum_{i=1}^{k} E_{w_i \sim P_n(w)} \left[ \log \sigma(-v'^{T}_{w_i} v_{w_I}) \right]    (71)

which they claim is used to replace every occurrence of \log P(w_O | w_I) in the Skip-Gram objective (though exactly how this works isn't discussed).
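Finally, a minimal sketch of evaluating the SGNS term in Equation 71 for a single (input, context) pair. The uniform choice of noise distribution here is a simplification; the distribution actually used in [5] is a smoothed unigram distribution.

import numpy as np

rng = np.random.default_rng(7)
V, N, k = 10, 4, 3
W = rng.normal(size=(V, N))        # input vectors v_w
W_prime = rng.normal(size=(V, N))  # output vectors v'_w (stored as rows here)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

wI, wO = 2, 7                      # input word and observed context word
# k negative words drawn from the noise distribution P_n(w)
# (uniform here, purely to keep the sketch short)
negatives = rng.integers(0, V, size=k)

h = W[wI]
pos = np.log(sigmoid(W_prime[wO] @ h))
neg = np.sum(np.log(sigmoid(-W_prime[negatives] @ h)))
objective = pos + neg              # Equation 71 (to be maximized)
print(objective)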
5 Acknowledgements
References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, March 2003.

[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. 2013.

[4] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. word2vec, 2014.

[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.

[6] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. 2012.