
How exactly does word2vec work?

David Meyer

dmm@{1-4-5.net,uoregon.edu,brocade.com,...}

July 31, 2016

1 Introduction
The word2vec model [4] and its applications have recently attracted a great deal of attention
from the machine learning community. The dense vector representations of words learned
by word2vec have been shown to carry semantic meaning and are useful in a wide range of
applications, from natural language processing to network flow data analysis.

Perhaps the most remarkable property of these word embeddings is that the vector
encodings effectively capture the semantic meanings of the words. How, or why? Part of
the answer is that the vectors adhere surprisingly well to our intuition. For instance, words
that we know to be synonyms tend to have similar vectors in terms of cosine similarity,
while antonyms tend to have dissimilar vectors. Even more surprisingly, word vectors tend
to obey the laws of analogy. For example, consider the analogy "Woman is to queen as man
is to king". It turns out that

$$v_{\text{queen}} - v_{\text{woman}} + v_{\text{man}} \approx v_{\text{king}}$$

where v_queen, v_woman, v_man, and v_king are the word vectors for queen, woman, man, and
king respectively. These observations strongly suggest that word vectors encode valuable
semantic information about the words they represent.
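As a concrete illustration, the analogy above can be checked directly with cosine similarity over a set of trained embeddings. The sketch below is illustrative only: the `vectors` dictionary is a hypothetical stand-in for embeddings produced by a trained word2vec model (random vectors are used here, so the similarity will not actually be high); in practice one would load real trained vectors.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical embeddings; a real word2vec model would supply these
vectors = {w: np.random.randn(300) for w in ["queen", "woman", "man", "king"]}

# for real embeddings, v_queen - v_woman + v_man should land close to v_king
candidate = vectors["queen"] - vectors["woman"] + vectors["man"]
print(cosine(candidate, vectors["king"]))
```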

Note that there are two main word2vec models: Continuous Bag of Words (CBOW) and
Skip-Gram. In the CBOW model, we predict a word given a context (a context can be
something like a sentence). Skip-Gram is the opposite: predict the context given an input
word. Each of these models is examined below.

This document contains my notes on word2vec. NB: there are probably lots of mistakes
in this...

Figure 1: Simple CBOW Model

2 Continuous Bag-of-Words Model


The simplest version of the continuous bag-of-words model (CBOW) is a "single context
word" version [3]. This is shown in Figure 1. Here we assume that there is only one word
considered per context, which means the model predicts one target word given one
context word (this is similar to a bigram language model).

In the scenario depicted in Figure 1, V is the vocabulary size and the hyper-parameter N
is the hidden layer size. The input vector x = (x_1, x_2, ..., x_V) is one-hot encoded; that is,
some x_k = 1 and all other x_{k'} = 0 for k' ≠ k.

The weights between the input layer and the hidden layer can be represented by
the V × N matrix W. Each row of W is the N-dimensional vector representation v_w of the
associated word w in the input layer. Given a context (here a single word), and assuming
again x_k = 1 and x_{k'} = 0 for k' ≠ k (one-hot encoding), then

$$h = x^T W = W_{(k,\cdot)} := v_{w_I} \qquad (1)$$

This essentially copies the k-th row of W to h (due to the one-hot encoding of x).
Here v_{w_I} is the vector representation of the input word w_I. Note that this implies that
the link (aka activation) function of the hidden units is linear (g(x) = x). Recall that

$$W_{V \times N} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{V1} & w_{V2} & \cdots & w_{VN} \end{pmatrix} \qquad (2)$$

and

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_V \end{pmatrix} \qquad (3)$$
so that

$$h = x^T W = \begin{pmatrix} x_1 & x_2 & \cdots & x_k & \cdots & x_V \end{pmatrix} \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k1} & w_{k2} & \cdots & w_{kN} \\ \vdots & \vdots & \ddots & \vdots \\ w_{V1} & w_{V2} & \cdots & w_{VN} \end{pmatrix} \qquad (4)$$

$$= \begin{pmatrix} x_k w_{k1} & x_k w_{k2} & \cdots & x_k w_{kN} \end{pmatrix} \qquad \text{(for } x_k = 1,\ x_{k'} = 0,\ k' \neq k\text{)} \qquad (5)$$

$$= \begin{pmatrix} w_{k1} & w_{k2} & \cdots & w_{kN} \end{pmatrix} \qquad \text{(the $k$-th row of $W$)} \qquad (6)$$

$$= W_{(k,\cdot)} \qquad (7)$$

$$:= v_{w_I} \qquad (8)$$

From the hidden layer to the output layer there is a different weight matrix W' = {w'_{ij}},
which is an N × V matrix. Using this we can compute a score u_j for each word in the vocabulary:

$$u_j = {v'_{w_j}}^T h \qquad (9)$$

where v'_{w_j} is the j-th column of the matrix W'. Note that the score u_j is a measure of the
match between the context and the next word^1 and is computed by taking the dot product
between the predicted representation (v'_{w_j}) and the representation of the candidate target
word (h = v_{w_I}).

^1 Though almost all statistical language models predict the next word, it is also possible to model the
distribution of the word preceding the context or surrounded by the context.

Now we can use a softmax (log-linear) classification model to obtain the posterior distribution
over words (which turns out to be a multinomial distribution):

$$p(w_j \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (10)$$

where y_j is the output of the j-th unit of the output layer. Substituting Equations 1 and
9 into Equation 10 we get

$$p(w_j \mid w_I) = \frac{\exp\big({v'_{w_j}}^T v_{w_I}\big)}{\sum_{j'=1}^{V} \exp\big({v'_{w_{j'}}}^T v_{w_I}\big)} \qquad (11)$$

Notes

• Both the input vector x and the output y are one-hot encoded.

• v_w and v'_w are two representations of the input word w.

• v_w comes from the rows of W.

• v'_w comes from the columns of W'.

• v_w is usually called the input vector.

• v'_w is usually called the output vector.

2.1 Updating Weights: hidden layer to output layer


The training objective (for one training sample) is to maximize Equation 11, the conditional
probability of observing the actual output word wO (denote its index in the output layer
as j ∗ ) given the input context word wI (and with regard to the weights). That is,

$$\max p(w_O \mid w_I) = \max y_{j^*} \qquad (12)$$

$$= \max \log y_{j^*} \qquad (13)$$

$$= u_{j^*} - \log \sum_{j'=1}^{V} \exp(u_{j'}) := -E \qquad (14)$$

where E = − log p(w_O | w_I) is our loss function (which we want to minimize), and j* is
the index of the actual output word in the output layer. Note (again) that this loss
function can be understood as a special case of the cross-entropy between two probability
distributions.

The next step is to derive the update equation for the weights between the hidden and output
layers. Taking the derivative of E with respect to the j-th unit's net input u_j, we obtain

$$\frac{\partial E}{\partial u_j} = y_j - t_j := e_j \qquad (15)$$

where t_j = 1(j = j*) is the indicator function, i.e., t_j is 1 when the j-th unit is the
actual output word and 0 otherwise. Interestingly, this derivative is simply the prediction
error e_j of the output layer.
The next step is to take the derivative with respect to w'_{ij} to obtain the gradient on the hidden → output
weights, which (by the chain rule) is

$$\frac{\partial E}{\partial w'_{ij}} = \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial w'_{ij}} = e_j \cdot h_i \qquad (16)$$

since (sketch of the proof)

$$-\frac{\partial E}{\partial u_j} = \frac{\partial}{\partial u_j}\left( u_{j^*} - \log \sum_{j'=1}^{V} \exp(u_{j'}) \right) \qquad (17)$$

$$= \frac{\partial u_{j^*}}{\partial u_j} - \frac{\partial}{\partial u_j} \log \sum_{j'=1}^{V} \exp(u_{j'}) \qquad (18)$$

$$= t_j - \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (19)$$

$$= t_j - y_j \qquad \text{(by Equations 10 and 15)} \qquad (20)$$

Now, using stochastic gradient descent, we obtain the weight update equation for the hidden →
output weights:

$$w'^{(new)}_{ij} = w'^{(old)}_{ij} - \eta \cdot e_j \cdot h_i \qquad (21)$$

or, in vector form,

$${v'_{w_j}}^{(new)} = {v'_{w_j}}^{(old)} - \eta \cdot e_j \cdot h \qquad (22)$$

where η > 0 is the learning rate (standard SGD), e_j = y_j − t_j (Equation 15), h_i is the i-th
unit in the hidden layer, and v'_{w_j} is the output vector for word w_j.

2.2 Updating Weights: input layer to hidden layer

Now that we have the update equations for W', we can look at W. Here we take the
derivative of E with respect to the output of the hidden layer:

$$\frac{\partial E}{\partial h_i} = \sum_{j=1}^{V} \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial h_i} \qquad (23)$$

$$= \sum_{j=1}^{V} e_j \cdot w'_{ij} \qquad (24)$$

$$:= EH_i \qquad (25)$$

where h_i is the output of the i-th unit in the hidden layer, u_j is defined in Equation 9 (the
net input of the j-th unit of the output layer), and e_j = y_j − t_j is the prediction error of the
j-th word in the output layer. EH is an N-dimensional vector that is the sum of the output
vectors of all words in the vocabulary, weighted by their prediction errors e_j.

The next job is to take the derivative of E with respect to W. First recall that the
hidden layer performs a linear computation on the values from the input layer, specifically

$$h_i = \sum_{k=1}^{V} x_k \cdot w_{ki} \qquad \text{(see Equation 1)} \qquad (26)$$

Now we can use the chain rule to get the derivative of E with respect to W:

$$\frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial w_{ki}} \qquad (27)$$

$$= EH_i \cdot x_k \qquad (28)$$

and so, in matrix form,

$$\frac{\partial E}{\partial W} = x \otimes EH \qquad (29)$$

the outer product of x and EH. From this we obtain a V × N matrix. Notice that since only one component of x is non-zero,
only one row of ∂E/∂W is non-zero, and the value of that row is EH (a 1 × N dimensional
vector).

Now we can write the update equation for W as

$$v_{w_I}^{(new)} = v_{w_I}^{(old)} - \eta \cdot EH \qquad (30)$$

Here v_{w_I} is a row of W (namely the input vector of the (only) context word), and because
of the one-hot encoding it is the only row of W whose derivative is non-zero.
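Putting Sections 2.1 and 2.2 together, the following is a minimal sketch (in the same illustrative NumPy notation as the earlier snippet) of one SGD step for the single-context CBOW model; the function name, the default learning rate, and the index arguments are hypothetical.

```python
import numpy as np

def cbow_single_context_step(W, W_prime, k, j_star, eta=0.025):
    """One SGD step for the single-context CBOW model.

    W       : V x N input -> hidden weights
    W_prime : N x V hidden -> output weights
    k       : index of the input context word w_I
    j_star  : index of the actual output word w_O
    eta     : learning rate (illustrative default)
    """
    V = W.shape[0]

    # forward pass (Equations 1, 9, 10)
    h = W[k]
    u = W_prime.T @ h
    y = np.exp(u - u.max())
    y /= y.sum()

    # prediction error e_j = y_j - t_j (Equation 15)
    t = np.zeros(V)
    t[j_star] = 1.0
    e = y - t

    # backprop to the hidden layer: EH_i = sum_j e_j * w'_{ij} (Equations 23-25),
    # computed before W' is modified
    EH = W_prime @ e

    # hidden -> output update (Equations 21-22)
    W_prime -= eta * np.outer(h, e)

    # input -> hidden update (Equation 30): only row k of W changes
    W[k] -= eta * EH
    return W, W_prime
```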

Figure 2: General CBOW Model

3 General Continuous Bag-of-Words Model

4 Skip-Gram Model
The Skip-Gram model was introduced in Mikolov [3] and is depicted in Figure 3. As you
can see, Skip-Gram is the opposite of CBOW; in Skip-Gram we predict the context C given
an input word, whereas in CBOW we predict the word from C. Basically, the training
objective of the Skip-Gram model is to learn word vector representations that are good at
predicting nearby words in the associated context(s).

For the Skip-Gram model we continue to use v_{w_I} to denote the input vector of the only
word on the input layer, and as a result have the same definition of the hidden-layer outputs
h as in Equation 1 (again this means h copies a row from the input → hidden weight matrix
W associated with the input word w_I). Recall that the definition of h was

$$h = W_{(k,\cdot)} := v_{w_I} \qquad (31)$$

Now, at the output layer, instead of outputting one multinomial distribution, we output
C multinomial distributions. Each output is computed using the same hidden → output
matrix as follows:

Figure 3: Skip-Gram Model

$$p(w_{c,j} = w_{O,c} \mid w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (32)$$

where w_{c,j} is the j-th word on the c-th panel of the output layer, w_{O,c} is the actual c-th
word in the output context, and w_I is the (only) input word. y_{c,j} is the output of
the j-th unit on the c-th panel of the output layer, and u_{c,j} is the net input of the j-th unit
on the c-th panel of the output layer.

Said in words, this is the probability that our prediction of the j-th word on the c-th panel,
w_{c,j}, equals the actual c-th output word, w_{O,c}, conditioned on the input word w_I.

Now, because the output layer panels share the same weights, we have

$$u_{c,j} = u_j = {v'_{w_j}}^T h, \qquad \text{for } c = 1, 2, \dots, C \qquad (33)$$

where v'_{w_j} is the output vector of the j-th word w_j of the vocabulary^2.

^2 v'_{w_j} is again a column of the hidden → output weight matrix W'.

Given all of this, the parameter update equations are not so different from those of the one-context-word
CBOW model. The loss function is changed to

$$E = -\log p(w_{O,1}, w_{O,2}, \dots, w_{O,C} \mid w_I) \qquad (34)$$

$$= -\log \prod_{c=1}^{C} \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (35)$$

$$= -\sum_{c=1}^{C} u_{c,j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'}) \qquad (36)$$

where j_c^* is the index of the actual c-th output context word in the vocabulary^3.

^3 Recall that log(AB) = log A + log B.

Now, if we take the derivative of E with respect to the net input of every unit on every panel of
the output layer (i.e., u_{c,j}), we get

$$\frac{\partial E}{\partial u_{c,j}} = y_{c,j} - t_{c,j} := e_{c,j} \qquad (37)$$

which again is the prediction error on the unit (same as in Equation 15). Now define EI =
(EI_1, EI_2, ..., EI_V) (a V-dimensional vector) to be the sum of the prediction errors over
all context words, that is,

$$EI_j = \sum_{c=1}^{C} e_{c,j} \qquad (38)$$

Now you can find the derivative of E with respect to W' as follows:

$$\frac{\partial E}{\partial w'_{ij}} = \sum_{c=1}^{C} \frac{\partial E}{\partial u_{c,j}} \cdot \frac{\partial u_{c,j}}{\partial w'_{ij}} = EI_j \cdot h_i \qquad (39)$$

Given all of this machinery we can now get the update equation for the hidden → output
matrix W', which might look familiar by this time:

$$w'^{(new)}_{ij} = w'^{(old)}_{ij} - \eta \cdot EI_j \cdot h_i \qquad (40)$$

or

$${v'_{w_j}}^{(new)} = {v'_{w_j}}^{(old)} - \eta \cdot EI_j \cdot h, \qquad j = 1, 2, \dots, V \qquad (41)$$
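Analogously, here is a minimal sketch of one Skip-Gram SGD step that accumulates the per-panel errors into EI (Equations 37-41) before updating the shared weights. As before, the names and the learning rate are illustrative, and the input → hidden update follows the same pattern as Equation 30 in the CBOW case.

```python
import numpy as np

def skipgram_step(W, W_prime, k, context_indices, eta=0.025):
    """One SGD step for the Skip-Gram model with C context (output) words.

    W               : V x N input -> hidden weights
    W_prime         : N x V hidden -> output weights (shared across all C panels)
    k               : index of the input word w_I
    context_indices : the C actual output word indices j*_1, ..., j*_C
    eta             : learning rate (illustrative)
    """
    # forward pass: all panels share the same scores u_j (Equation 33)
    h = W[k]
    u = W_prime.T @ h
    y = np.exp(u - u.max())
    y /= y.sum()

    # EI_j = sum_c e_{c,j} = C * y_j - (number of panels whose target is j)  (Equations 37-38)
    EI = len(context_indices) * y
    for j_star in context_indices:
        EI[j_star] -= 1.0

    # backprop to the hidden layer before touching W'
    EH = W_prime @ EI

    # hidden -> output update (Equation 40) and input -> hidden update (as in Equation 30)
    W_prime -= eta * np.outer(h, EI)
    W[k] -= eta * EH
    return W, W_prime
```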

4.1 Revisiting Learning in Neural Probabilistic Language Models
We saw in previous sections that both CBOW and Skip-Gram Neural Probabilistic Language
Models (NPLMs) have well-defined posterior distributions and loss functions. First,
the basic form of these posteriors is as follows (roughly following the notation in [6]).
In a neural language model, either CBOW or Skip-Gram, the conditional distribution
corresponding to context c, P_θ^c(w), is defined to be

$$P_\theta^c(w) = \frac{\exp(s_\theta(w, c))}{\sum_{w'} \exp(s_\theta(w', c))} \qquad \text{(see, e.g., Equation 32)} \qquad (42)$$

where s_θ(w, c) is a scoring function (i.e., Equation 9 above) with parameters θ which
quantifies the compatibility of word w with context c.^4

^4 E = −s_θ(w, c) is sometimes referred to as the energy function [1].

Note that an NPLM represents each word in the vocabulary using a real-valued vector
and defines the scoring function (s_θ, or equivalently u_j) in terms of the vectors of the context
words and the next word. The important point here is that these vectors account for
most of the parameters in neural language models, which in turn means that their memory
requirements are linear in the vocabulary size. More generally, the time complexity of
these models is O(|V| · n), where |V| is the size of the input vocabulary and n is the size
of the input vectors x ∈ R^n. If x is one-hot encoded (as in the CBOW and Skip-Gram
models), then this complexity is O(|V|^2).
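As a rough worked example (the numbers are illustrative and not from the text): with |V| = 10^5 words and N = 300-dimensional vectors, the input and output embedding matrices alone hold

$$2 \cdot |V| \cdot N = 2 \times 10^{5} \times 300 = 6 \times 10^{7}$$

parameters, and an explicitly normalized softmax must compute and sum all |V| = 10^5 scores for every single training example.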

4.2 An Obvious Question


A question one might ask is why, given the reported superior accuracy of NPLMs, they had
until recently been far less widely used than n-gram models. The answer lies in their
notoriously long training times, which had been measured in weeks even for moderately
sized datasets. So what makes training so expensive? The answer is that
training NPLMs is computationally expensive because they are explicitly normalized.
For example, consider the denominator in Equation 42, which essentially requires that we
consider all words in the vocabulary when computing the posterior distributions (or, in the
training context, the log-likelihood gradients, such as in Equation 34). This points to a
problem with softmax classifiers in general, namely, that they are explicitly normalized.

Given these complexity considerations, [4] describes two alternative training objectives for
the Skip-Gram model: hierarchical softmax and Skip-Gram Negative Sampling (SGNS); we
will focus on SGNS here. The rest of this section is organized as follows: Section 4.3 reviews
the basics of Parametric Density Estimation, and Section 4.4 describes Noise-Contrastive
Estimation (NCE), where we consider the situation in which the model probability density
function is unnormalized (recall that the problem with softmax is that it is computationally
expensive due to the requirement for explicit normalization). Finally, Section 4.6 looks at
a modification of NCE called Negative Sampling.

4.3 Basics of Parametric Density Estimation


The basic setup for parametric density estimation is that we observe a sample X = (x_1, x_2, ..., x_{T_d})
of a random vector x ∈ R^n. This observed data follows an unknown probability
density function (pdf) p_d. The data pdf p_d is modeled by a parameterized family
of functions {p_m(·; θ)}_θ, where θ is a parameter vector. It is generally assumed (but not
required) that p_d comes from this family, that is, p_d(·) = p_m(·; θ*) for some parameter value θ*.

With these definitions we can describe the Parametric Density Estimation (PDE) problem.
In particular, the PDE problem is about finding θ* from the observed sample X. Note also
that any estimate θ̂ must yield a normalized pdf p_m(·; θ̂) which satisfies two properties:

$$\int p_m(u; \hat{\theta})\, du = 1 \qquad (43)$$

and

$$p_m(\cdot; \hat{\theta}) \geq 0 \qquad (44)$$

These are the two constraints on the estimate.

Now, if both constraints hold for all θ (and not only θ̂), then we say that the model is
normalized. If the constraint in Equation 44 holds but Equation 43 does not, we say that
the model is unnormalized. The assumption, however, is that there is at least one value of
the parameters for which an unnormalized model integrates to one (Equation 43), namely
θ∗ .

Next, denote an unnormalized model, parameterized by some α, as p^0_m(·; α). Then the
partition function Z(α) is defined as

$$Z(\alpha) = \int p^0_m(u; \alpha)\, du \qquad (45)$$

Z(α) can be used to convert the unnormalized model p^0_m(u; α) into a normalized one,
p^0_m(u; α)/Z(α), which integrates to one for every value of α (as required by Equation 43).

If we rewrite Equation 42 in terms of the partition function Z, we can see that

$$Z(\theta) = \sum_{w'} \exp(s_\theta(w', h)) \qquad (46)$$

$$P_\theta^h(w) = \frac{\exp(s_\theta(w, h))}{Z(\theta)} \qquad (47)$$

Note that I changed the symbol used for the context from c to h (h is also sometimes used for
the context) to avoid name clashes below.

Unfortunately, the function α ↦ Z(α) is defined by the integral in Equation 45 which,
unless p^0_m(·; α) has a particularly convenient form, is likely intractable and/or doesn't have
a nice closed form. In particular, the integral will generally not be amenable to analytic computation,
so a closed form for Z(α) can't be found. For low-dimensional problems,
numerical methods (MCMC, Gibbs, or other sampling techniques) can be used to approximate Z(α) to
very high accuracy, but for high-dimensional problems numerical methods
are computationally expensive. Since we are considering the Skip-Gram model here, we
are dealing with a PDE problem in high dimension where computation of the partition
function is analytically intractable and/or computationally expensive.

4.4 Noise Contrastive Estimation


Noise Contrastive Estimation (NCE) was introduced in [2]. The basic idea is to treat Z
(or alternatively c = ln 1/Z) not as a function of α but rather as an additional parameter
of the model. Here the unnormalized model p^0_m(·; α) is extended with an additional
normalizing parameter c (note the change in meaning of c from context to the normalizing
parameter), and we then estimate

$$\ln p_m(\cdot; \theta) = \ln p^0_m(\cdot; \alpha) + c \qquad (48)$$

with parameters θ = (α, c). The estimate θ̂ = (α̂, ĉ) is intended to make the unnormalized
model p^0_m(·; α̂) match the shape of p_d, while ĉ provides the proper scaling so that the
constraints (Equations 43 and 44) hold. Note that separating the estimation of shape and scale is
not possible with maximum likelihood estimation (MLE), since the likelihood can be made
arbitrarily large by setting the normalizing parameter c to successively larger values.

The key observation underlying NCE is that density estimation is largely about charac-
terizing properties of the observed data X = (x1 , x2 , . . . , xTd ), and a convenient way to
describe properties is to describe them relative to the properties of some reference data Y .

Now, assume that the reference (noise) data Y is an iid sample^5 (y_1, y_2, ..., y_{T_n}) of a random
variable y ∈ R^n with probability density function (pdf) p_n, and let the (unknown)
pdf of X be p_d. Then a relative description of the data X can be given by the ratio p_d/p_n
of the two density functions. If the reference distribution p_n is known (which it is), we can
recover p_d from the ratio p_d/p_n. That is, since we know the differences between X and Y
and also the properties of Y, we can deduce the properties of X. Finally, following [2], we
assume that noise samples are k times more frequent than data samples, so that data points
come from the mixture

$$\frac{1}{k+1} P_d^h(w) + \frac{k}{k+1} P_n(w) \qquad (49)$$

^5 Independent, identically distributed.

NCE connects the problem of PDE to supervised learning, in particular to logistic regression,
and provides a hint as to how the proposed estimator works: by discriminating, or
comparing, between data and noise, NCE can learn properties of the data in the form of
a statistical model. That is, the key idea behind noise-contrastive estimation is "learning
by comparison".

So how does this supervised learning work, and what exactly does it estimate? Consider
first the following notation (which, with minor modifications, largely follows [2]): let
U = (X ∪ Y) = (u_1, u_2, ..., u_{T_d + T_n}). NCE then converts the problem of density estimation
into a binary classification problem as follows: for each u_t ∈ U assign a class label C_t such
that C_t = 1 if u_t ∈ X and C_t = 0 if u_t ∈ Y. Now we can use logistic regression to estimate
the posterior probabilities, since

$$P(A) = \sum_n P(A \cap B_n) \qquad \text{(by the Sum Rule)} \qquad (50)$$

$$= \sum_n P(A, B_n) \qquad \text{(alternate notation)} \qquad (51)$$

$$= \sum_n P(A \mid B_n) P(B_n) \qquad \text{(by the Product Rule)} \qquad (52)$$

so that the posterior distribution P(C_1 | x) for two classes C_1 and C_2 given input vector x
would look like

$$P(C_1 \mid x) = \frac{P(x \mid C_1) P(C_1)}{P(x \mid C_1) P(C_1) + P(x \mid C_2) P(C_2)} \qquad (53)$$

Interestingly, the posterior distribution is related to logistic regression as follows. First
recall that the posterior P(C_1 | x) is

$$P(C_1 \mid x) = \frac{P(x \mid C_1) P(C_1)}{P(x \mid C_1) P(C_1) + P(x \mid C_2) P(C_2)} \qquad (54)$$

Now, if we set

$$a = \ln \frac{P(x \mid C_1) P(C_1)}{P(x \mid C_2) P(C_2)} \qquad (55)$$

we can see that

$$P(C_1 \mid x) = \frac{1}{1 + e^{-a}} = \sigma(a) \qquad (56)$$

that is, the sigmoid function. This starts to give us a sense that the sigmoid function is
related to the log of the ratio of the likelihoods p(x | C_1) and p(x | C_2), or, in our context, p_d/p_n.

Now, since the pdf p_d of x is unknown, we can model the class-conditional probability
p(· | C = 1) with p_m(·; θ), so the class-conditional probability densities are

$$p(u \mid C = 1; \theta) = p_m(u; \theta) \qquad (57)$$

$$p(u \mid C = 0; \theta) = p_n(u) \qquad (58)$$

So the prior probabilities are

$$P(C = 1) = \frac{T_d}{T_d + T_n} \qquad (59)$$

$$P(C = 0) = \frac{T_n}{T_d + T_n} \qquad (60)$$

and the posteriors are therefore

$$P(C = 1 \mid u; \theta) = \frac{p_m(u; \theta)}{p_m(u; \theta) + k \cdot p_n(u)} \qquad (61)$$

$$P(C = 0 \mid u; \theta) = \frac{k \cdot p_n(u)}{p_m(u; \theta) + k \cdot p_n(u)} \qquad (62)$$

where k is the ratio P(C = 0)/P(C = 1) = T_n/T_d (remembering that noise samples y_i are
k times more frequent than data samples x_i).

Note that the class labels C_t are Bernoulli-distributed. Recall the details of the Bernoulli
distribution: the random variable Y takes values y_i ∈ {0, 1}, and the Bernoulli
distribution is a Binomial(1, p) distribution, where 0 < p < 1 and P(Y = y) = p^y (1 − p)^{1−y}.
The probability that Y_i = y_i for i = 1, 2, ..., n is then

$$P(Y) = \prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i} \qquad (63)$$

and the log-likelihood ℓ_n(p) is

$$\ell_n(p) = \sum_{i=1}^{n} \Big[ Y_i \log p + (1 - Y_i) \log(1 - p) \Big] \qquad (64)$$

Returning to NCE, the log-likelihood of the parameters θ is then

$$\ell(\theta) = \sum_{t=1}^{T_d + T_n} \Big[ C_t \ln P(C_t = 1 \mid u_t; \theta) + (1 - C_t) \ln P(C_t = 0 \mid u_t; \theta) \Big] \qquad (65)$$

$$= \sum_{t=1}^{T_d} \ln h(x_t; \theta) + \sum_{t=1}^{T_n} \ln\big(1 - h(y_t; \theta)\big) \qquad (66)$$

where

$$G(u; \theta) = \ln p_m(u; \theta) - \ln p_n(u) \qquad (67)$$

$$h(u; \theta) = \sigma(G(u; \theta)), \qquad \text{with } \sigma(x) = 1/(1 + e^{-x}) \qquad (68)$$

Optimizing ℓ(θ) with respect to θ leads to an estimate G(·; θ) of the log ratio ln(p_d/p_n)
(see Equation 55 for the derivation). That is, an approximate description of X relative to
Y is given by Equation 66. Interestingly, the sign-inverted objective function, −ℓ(θ), is
also known as the cross entropy (or cross-entropy error function).

So density estimation, which is an unsupervised learning task, can be carried out with a
supervised learning technique, namely logistic regression. The important result here is
that even unnormalized models can be estimated using the same principle.

Now, given an unnormalized statistical model p^0_m(·; α), the NCE technique adds an
additional normalization parameter c to the model, and defines

$$\ln p_m(\cdot; \theta) = \ln p^0_m(\cdot; \alpha) + c \qquad (69)$$

where θ = (α, c). The parameter c scales the unnormalized model so that Equation
43 holds. After learning, ĉ provides an estimate of ln(1/Z(α̂)) (this is closely related to
Equation 55).

4.5 NCE Cost Function


Let X = (x_1, x_2, ..., x_{T_d}) consist of T_d independent observations of x ∈ R^n. Similarly, let
Y = (y_1, y_2, ..., y_{T_n}) be an artificially generated data set that consists of T_n = kT_d independent
observations of y ∈ R^n with known distribution p_n. The cost function J_T(θ) is defined to
be (look familiar?):

$$J_T(\theta) = \frac{1}{T_d} \left\{ \sum_{t=1}^{T_d} \ln h(x_t; \theta) + \sum_{t=1}^{T_n} \ln\big(1 - h(y_t; \theta)\big) \right\} \qquad (70)$$

and the NCE estimator is defined to be the argument θ̂ that maximizes J_T(θ) (or,
equivalently, minimizes −J_T(θ)), where h(·; θ) is the nonlinearity defined in Equation 68.
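As a minimal sketch of this cost, assuming the unnormalized model and the noise density are available as callables returning log densities (the names log_p0m, log_pn, and the argument c are illustrative, not from the original text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_cost(X, Y, log_p0m, log_pn, c):
    """NCE cost J_T(theta) of Equation 70.

    X       : array of T_d data samples
    Y       : array of T_n = k*T_d noise samples drawn from p_n
    log_p0m : callable, log of the unnormalized model ln p^0_m(u; alpha)
    log_pn  : callable, log of the (known) noise density ln p_n(u)
    c       : the extra normalizing parameter of Equation 69
    """
    # G(u; theta) = ln p_m(u; theta) - ln p_n(u), with ln p_m = ln p^0_m + c (Equations 67, 69)
    G_data = log_p0m(X) + c - log_pn(X)
    G_noise = log_p0m(Y) + c - log_pn(Y)

    # h(u; theta) = sigma(G(u; theta)) (Equation 68)
    h_data = sigmoid(G_data)
    h_noise = sigmoid(G_noise)

    # Equation 70
    return (np.sum(np.log(h_data)) + np.sum(np.log(1.0 - h_noise))) / len(X)
```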

4.6 Skip-Gram Negative Sampling


Mikolov et al. [5] introduce Skip-Gram Negative Sampling as an alternative to the
hierarchical softmax method outlined there. Negative Sampling is a form of NCE (see Section
4.4). The key assertion underlying NCE is that a good model should be able to differentiate
data from noise by means of logistic regression. And while NCE can be shown to
approximately maximize the log probability of the softmax, the authors point out that the
Skip-Gram model is only concerned with learning high-quality vector representations, and
as such they were free to simplify NCE as long as the vector representations retained
their quality. The negative sampling (NEG) objective is defined to be

$$\log \sigma\big({v'_{w_O}}^T v_{w_I}\big) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \Big[ \log \sigma\big(-{v'_{w_i}}^T v_{w_I}\big) \Big] \qquad (71)$$

which they use to replace every occurrence of log P(w_O | w_I) in the Skip-Gram
objective (though exactly how this works isn't discussed).
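As a sketch of how an objective of this form is typically evaluated in practice, with the expectation replaced by k sampled noise-word indices (the function and variable names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_objective(W, W_prime, i_input, j_output, noise_indices):
    """Negative-sampling objective of Equation 71 for one (w_I, w_O) pair.

    W             : V x N matrix of input vectors v_w
    W_prime       : N x V matrix of output vectors v'_w
    i_input       : index of the input word w_I
    j_output      : index of the output (context) word w_O
    noise_indices : k word indices drawn from the noise distribution P_n(w)
    """
    v_wI = W[i_input]                 # v_{w_I}
    v_out = W_prime[:, j_output]      # v'_{w_O}

    # positive term: log sigma(v'_{w_O}^T v_{w_I})
    obj = np.log(sigmoid(v_out @ v_wI))

    # negative (noise) terms: sum_i log sigma(-v'_{w_i}^T v_{w_I})
    for i in noise_indices:
        obj += np.log(sigmoid(-W_prime[:, i] @ v_wI))
    return obj
```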

5 Acknowledgements

References
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural
probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003.

[2] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized
statistical models, with applications to natural image statistics. J. Mach. Learn.
Res., 13:307–361, February 2012.

[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[4] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. word2vec, 2014.

[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed
representations of words and phrases and their compositionality. In C.J.C. Burges,
L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in
Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.,
2013.

[6] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural
probabilistic language models. In Proceedings of the 29th International Conference on
Machine Learning (ICML), 2012.
