07 Word Embeddings Notes
Herman Kamper
• One-hot vectors
• Word2vec
• Skip-gram
Motivation
Word embeddings are continuous vector representations of words.
Contrast this with one-hot word vectors, for example:

$$\texttt{cat} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^\top$$

$$\texttt{feline} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}^\top$$

The dot product between any two distinct one-hot vectors is zero, so related words like "cat" and "feline" look no more similar than any other pair of words.
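A minimal NumPy sketch of this (with a made-up 15-word vocabulary and word indices, not code from the notes):

```python
import numpy as np

V = 15                      # hypothetical vocabulary size
cat = np.zeros(V)
feline = np.zeros(V)
cat[7] = 1.0                # assumed index of "cat" in the vocabulary
feline[10] = 1.0            # assumed index of "feline"

# Distinct one-hot vectors always have zero dot product (and zero
# cosine similarity), so no notion of word similarity is captured.
print(cat @ feline)         # 0.0
print(cat @ cat)            # 1.0
```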
Word2vec
Word2vec (Mikolov et al., 2013a) is a framework for learning word
embeddings.
Word2vec: Skip-gram
Example windows in skip-gram:
[Figure: example skip-gram windows over a sentence, with centre word w_t = "loves" and its surrounding context words.]
Skip-gram loss function
Assumptions:

• For now, let's pretend that our training set consists of a single long sequence w_{1:T}.
• Each context word depends only on its centre word, and the context words are conditionally independent of one another given the centre word.
$$
\begin{aligned}
J(\theta) &= -\log \prod_{t=1}^{T} \prod_{\substack{-M \le j \le M \\ j \ne 0}} P_\theta(w_{t+j} \mid w_t) \\
&= -\sum_{t=1}^{T} \sum_{\substack{-M \le j \le M \\ j \ne 0}} \log P_\theta(w_{t+j} \mid w_t)
\end{aligned}
$$

where M is the context window size.
This is not the probability of the training sequence w_{1:T} as in a language model, but rather the probability of all the windows. With the two assumptions above, it becomes a product of the individual word-pair probabilities.
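To make the windows concrete, here is a minimal sketch (my own illustrative helper, not from the notes) that collects the (centre, context) pairs appearing in the double sum from a single token sequence:

```python
def skipgram_pairs(tokens, M=2):
    """Collect (centre, context) pairs: for every position t, pair w_t
    with each w_{t+j} for -M <= j <= M, j != 0 (clipped at the edges)."""
    pairs = []
    for t, centre in enumerate(tokens):
        for j in range(-M, M + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((centre, tokens[t + j]))
    return pairs

print(skipgram_pairs("she loves his dog".split(), M=2))
# [('she', 'loves'), ('she', 'his'), ('loves', 'she'), ('loves', 'his'), ...]
```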
Skip-gram model structure
How do we model Pθ (wt+j |wt )? What structure do we use?
For centre word c and context word o we use the following model:
$$P_\theta(w_{t+j} = o \mid w_t = c) = P(o \mid c) = \frac{e^{u_o^\top v_c}}{\sum_{k=1}^{V} e^{u_k^\top v_c}}$$

where v_c is the centre (input) embedding of word c, u_o is the context (output) embedding of word o, and V is the vocabulary size.

Stacking the probabilities of all possible context words into a single output vector:

$$
f_\theta(w_t = c) =
\begin{bmatrix} P(1 \mid c) \\ P(2 \mid c) \\ \vdots \\ P(V \mid c) \end{bmatrix}
= \frac{1}{\sum_{k=1}^{V} e^{u_k^\top v_c}}
\begin{bmatrix} e^{u_1^\top v_c} \\ e^{u_2^\top v_c} \\ \vdots \\ e^{u_V^\top v_c} \end{bmatrix}
= \mathrm{softmax}(U v_c)
$$

where the rows of U are the context embeddings u_k.
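A small NumPy sketch of this model structure, with toy random parameters just to show the shapes (the variable names are my own):

```python
import numpy as np

V, D = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
Vmat = rng.normal(size=(V, D))    # centre embeddings, one row v_k per word
U = rng.normal(size=(V, D))       # context embeddings, one row u_k per word

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

c = 3                             # index of the centre word
scores = U @ Vmat[c]              # u_k^T v_c for every word k
probs = softmax(scores)           # f_theta(c) = P(o | c) for all o
print(probs.shape, probs.sum())   # (10,) 1.0
```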
Skip-gram optimisation
Parameters: θ = {V, U}, i.e. the matrix of centre embeddings v_1, ..., v_V and the matrix of context embeddings u_1, ..., u_V.

We minimise the negative log likelihood using gradient descent:

$$J(\theta) = -\sum_{t=1}^{T} \sum_{\substack{-M \le j \le M \\ j \ne 0}} \log P_\theta(w_{t+j} \mid w_t)$$

Consider the inside term for a single training pair (w_t = c, w_{t+j} = o):

$$
\begin{aligned}
J_{c,o}(\theta) = -\log P(o \mid c)
&= -\log \frac{e^{u_o^\top v_c}}{\sum_{k=1}^{V} e^{u_k^\top v_c}} \\
&= -\left[ \log e^{u_o^\top v_c} - \log \sum_{k=1}^{V} e^{u_k^\top v_c} \right] \\
&= -\left( u_o^\top v_c - \log \sum_{k=1}^{V} e^{u_k^\top v_c} \right)
\end{aligned}
$$

Taking the gradient with respect to the centre embedding v_c:

$$\frac{\partial J_{c,o}(\theta)}{\partial v_c} = -u_o + \sum_{j=1}^{V} P(j \mid c)\, u_j$$

Note that the second term is the expected context embedding under the model distribution P(· | c).
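As a sanity check, the following sketch (toy values, my own illustration) computes this gradient from the analytical expression and compares it with a numerical finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 8, 3
U = rng.normal(size=(V, D))       # context embeddings u_k (rows)
v_c = rng.normal(size=D)          # centre embedding of word c
o = 2                             # index of the observed context word

def loss(v):
    scores = U @ v
    # -u_o^T v + log sum_k exp(u_k^T v)
    return -scores[o] + np.log(np.exp(scores).sum())

# Analytical gradient: -u_o + sum_j P(j|c) u_j
p = np.exp(U @ v_c); p /= p.sum()
grad_analytic = -U[o] + p @ U

# Numerical gradient by central differences
eps = 1e-6
grad_numeric = np.array([
    (loss(v_c + eps * np.eye(D)[i]) - loss(v_c - eps * np.eye(D)[i])) / (2 * eps)
    for i in range(D)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```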
Example of skip-gram embeddings
For a skip-gram model trained on the AG News dataset, we print the closest word embeddings (by cosine distance) to a query word:
Query: referendum
mandate (0.5513) vote (0.5551) ballots (0.5872)
vowing (0.5916) constitutional (0.6109)
Query: venezuela
chavez (0.5229) venezuelas (0.5840)
venezuelan (0.6057) hugo (0.6171) counties (0.6229)
Query: war
terrorism (0.6230) raging (0.6280) resumed (0.6296)
independence (0.6422) deportation (0.6444)
Query: pope
ii (0.5573) democrat (0.6149) canaveral (0.6323)
edwards (0.6377) sen (0.6379)
Query: schumacher
johan (0.5250) ferrari (0.5493) trulli (0.5651)
poland (0.5885) owen (0.5921)
Query: ferrari
rubens (0.5049) austria (0.5281) barrichello (0.5416)
schumacher (0.5493) seasonopening (0.6042)
Query: soccer
football (0.4817) basketball (0.5429)
mens (0.5510) fc (0.5530) ncaa (0.5547)
Query: cricket
kolkata (0.4818) test (0.4908) oval (0.5444)
bangladesh (0.5585) tendulkar (0.5756)
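Neighbour lists like these can be produced with a simple cosine-distance search. A minimal sketch, assuming a (V, D) array `embeddings` and hypothetical `word_to_id` / `id_to_word` vocabulary lookups:

```python
import numpy as np

def nearest(query, embeddings, word_to_id, id_to_word, k=5):
    """Return the k words whose embeddings are closest to the query
    word's embedding in cosine distance."""
    q = embeddings[word_to_id[query]]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q)
    cos_dist = 1.0 - (embeddings @ q) / norms
    order = [i for i in np.argsort(cos_dist) if i != word_to_id[query]]
    return [(id_to_word[i], float(cos_dist[i])) for i in order[:k]]

# e.g. nearest("referendum", embeddings, word_to_id, id_to_word)
```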
Skip-gram as a neural network
In the paper:
[Figure: the skip-gram architecture as drawn in the paper, with the centre word w_t ("loves") at the input and the context words w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2} (e.g. "his") at the output.]
But this makes it look like we are jointly predicting all the context words given the centre word, which is a bit misleading. For a single (centre, context) pair, the model really looks as follows:
[Figure: the skip-gram model as a network: the centre embedding v_c is combined with every context embedding u_1, ..., u_V (one per vocabulary word: aardvark, ..., are, ..., his, ..., zoo) to give the scores u_k^⊤ v_c, and a softmax over these scores gives f_{θ,o}(c) = P_θ(w_{t+j} = o | w_t = c).]
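This network view translates directly into code. A sketch in PyTorch (my own illustration, not the notes' implementation): an embedding lookup gives v_c, a bias-free linear layer holds the u_k as its weight rows, and the softmax is folded into the cross-entropy loss:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.centre = nn.Embedding(vocab_size, embed_dim)             # rows are v_k
        self.context = nn.Linear(embed_dim, vocab_size, bias=False)   # weight rows are u_k

    def forward(self, centre_ids):
        v_c = self.centre(centre_ids)      # (batch, D)
        return self.context(v_c)           # scores u_k^T v_c, shape (batch, V)

model = SkipGram(vocab_size=10000, embed_dim=100)
centre_ids = torch.tensor([3, 3, 7])       # centre word of each training pair
context_ids = torch.tensor([5, 9, 2])      # corresponding context words
loss = nn.functional.cross_entropy(model(centre_ids), context_ids)
loss.backward()                            # gradients for both embedding matrices
```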
Word2vec: Continuous bag-of-words (CBOW)
The continuous bag-of-words (CBOW) model predicts a centre word
given the context words:
[Figure: the context words predicting the centre word w_t = "loves".]
Loss function
We again minimise the NLL of the parameters:
$$
\begin{aligned}
J(\theta) &= -\log \prod_{t=1}^{T} P_\theta(w_t \mid w_{t-M}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+M}) \\
&= -\sum_{t=1}^{T} \log P_\theta(w_t \mid w_{t-M}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+M})
\end{aligned}
$$
Model structure
We now have multiple context words in a single training sample, so we calculate an average context embedding

$$\bar{v}_o = \frac{1}{2M} \sum_{\substack{-M \le j \le M \\ j \ne 0}} v_{w_{t+j}}$$

This gives the model:

$$P_\theta(w_t = c \mid w_{t-M}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+M}) = \frac{e^{u_c^\top \bar{v}_o}}{\sum_{k=1}^{V} e^{u_k^\top \bar{v}_o}}$$
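A small NumPy sketch of the CBOW forward computation with toy shapes (my own illustration): average the context embeddings, then score every candidate centre word:

```python
import numpy as np

V, D = 10, 4
rng = np.random.default_rng(2)
Vmat = rng.normal(size=(V, D))    # context embeddings v_k (input side)
U = rng.normal(size=(V, D))       # centre embeddings u_k (output side)

context_ids = [1, 4, 6, 9]        # w_{t-M}, ..., w_{t-1}, w_{t+1}, ..., w_{t+M}
v_bar = Vmat[context_ids].mean(axis=0)    # average context embedding

scores = U @ v_bar                # u_c^T v_bar for every candidate centre word c
probs = np.exp(scores - scores.max())
probs /= probs.sum()              # P(w_t = c | context) for all c
print(probs.argmax())             # most probable centre word under the toy model
```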
Optimisation
As for skip-gram, we optimise the parameters using gradient descent.
The gradients can be derived as for skip-gram.
Word embeddings
Unlike in skip-gram, for CBOW it is common to use the context
vectors as word embeddings (denoted as v for CBOW). The centre
embeddings u are thrown away.
Skip-gram with negative sampling
Mikolov et al. (2013b) proposed an extension of the original skip-gram to overcome some of its shortcomings.
Negative sampling

The softmax in the original skip-gram requires normalising over the entire vocabulary, which is expensive. Instead, we treat the problem as binary logistic regression: did the pair of words actually occur together in the data (y = 1) or not (y = 0)?
If we had just one positive pair (wt = c, wt+j = o) in our training set,
the NLL would be
$$J_{c,o}(\theta) = -\log P_\theta(y = 1 \mid w_t = c, w_{t+j} = o) = -\log \sigma(u_o^\top v_c)$$

where σ is the logistic (sigmoid) function.
For each positive pair (wt = c, wt+j = o) we sample K words not
occurring in the context window of c. The loss now becomes:
$$
\begin{aligned}
J_{c,o}(\theta) &= -\log P_\theta(y = 1 \mid w_t = c, w_{t+j} = o) - \sum_{k=1}^{K} \log P_\theta(y = 0 \mid w_t = c, w_{t+j} = w_k) \\
&= -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log\left(1 - \sigma(u_{w_k}^\top v_c)\right) \\
&= -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_{w_k}^\top v_c)
\end{aligned}
$$

where w_1, ..., w_K are the sampled negative words and the last step uses 1 − σ(x) = σ(−x).
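A sketch of this loss for a single positive pair with K sampled negatives (NumPy, toy values, my own illustration; how the negatives are sampled is not shown):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """-log sigma(u_o^T v_c) - sum_k log sigma(-u_{w_k}^T v_c),
    where the rows of U_neg are the embeddings of the K negative samples."""
    positive = -np.log(sigmoid(u_o @ v_c))
    negative = -np.log(sigmoid(-(U_neg @ v_c))).sum()
    return positive + negative

rng = np.random.default_rng(3)
D, K = 4, 5
print(neg_sampling_loss(rng.normal(size=D), rng.normal(size=D),
                        rng.normal(size=(K, D))))
```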
Global vector (GloVe) word embeddings
Old-school word embeddings

Before methods like the (neural-like) word2vec framework, word embeddings were based on co-occurrence counts.

• Very fast, since they only require collecting counts over the corpus.

But could we get the best of both worlds: the speed of count-based methods and the quality of word2vec-style embeddings? The global vector (GloVe) word embedding method tries to do this.
The GloVe model
Cc,o is the total number of times that centre word c occurs with context
word o in the same context window. (For old-school approaches, we
would have started by collecting such counts.)
GloVe tries to minimise the squared loss between the model output
fθ (c, o) and the log of these counts:
$$J(\theta) = \sum_{c=1}^{V} \sum_{o=1}^{V} \left( f_\theta(c, o) - \log C_{c,o} \right)^2$$
We might also not care that much about word pairs with very few counts, so we can weight the terms:

$$J(\theta) = \sum_{c=1}^{V} \sum_{o=1}^{V} h(C_{c,o}) \left( f_\theta(c, o) - \log C_{c,o} \right)^2$$
A suggested choice is
$$h(x) = \begin{cases} \left( \dfrac{x}{100} \right)^{0.75} & \text{if } x < 100 \\ 1 & \text{otherwise} \end{cases}$$
[Plot: the weighting function h(C_{c,o}) against C_{c,o}, increasing from 0 to 1 for counts below 100 and constant at 1 thereafter.]
Since h(0) = 0, the squared loss is only calculated over word pairs
where Cc,o > 0. This makes training way faster compared to consider-
ing all possible word pairs.
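A sketch of the weighted GloVe loss over the non-zero counts (NumPy, my own illustration; `f` stands in for the model output f_θ(c, o), whose exact form is not specified above):

```python
import numpy as np

def h(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: (x / x_max)^alpha below x_max, 1 above it."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(C, f):
    """Weighted squared loss over word pairs with C_{c,o} > 0.
    C: (V, V) co-occurrence counts; f: (V, V) model outputs f_theta(c, o)."""
    c_idx, o_idx = np.nonzero(C)                 # only pairs that co-occur
    counts = C[c_idx, o_idx]
    return np.sum(h(counts) * (f[c_idx, o_idx] - np.log(counts)) ** 2)

rng = np.random.default_rng(4)
C = rng.integers(0, 200, size=(6, 6)).astype(float)   # toy count matrix
print(glove_loss(C, rng.normal(size=(6, 6))))
```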
The original paper (Pennington et al., 2014) notes that either the centre embedding vectors v or the context vectors u can be used as word embeddings. In the paper they actually sum the two to get the final embedding.
Further reading
There are some other interpretations of the GloVe loss. E.g. you can
make some (loose, intuitive) connections with the skip-gram loss itself.
Evaluating word embeddings
Qualitative evaluation
• Consider the closest word embeddings to some query words (as we did for the skip-gram model above).
• Reduce the embeddings to two dimensions and visualise them, using e.g.:
  – PCA
  – t-SNE
  – UMAP
Extrinsic evaluation
• Build and evaluate a downstream system for a real task.
• Slower and more indirect than intrinsic evaluation, but on the other hand also the best real-world test setting.
Intrinsic evaluation
• Word analogy tasks (weird)
Intrinsic: Word analogy
A word analogy is specified as
a : b :: c : d
Example:

man : woman :: king : ?
To solve the task using some word embedding approach, we find the word with vector closest to

$$v_c + (v_b - v_a)$$
[Figure: 2D sketch of the embeddings of "man", "woman" and "king", with the offset v_b − v_a added to v_c.]
Cosine distance is often used to determine the closest word embedding.
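A sketch of solving an analogy in this way (assuming the same hypothetical `embeddings`, `word_to_id` and `id_to_word` lookups as in the earlier nearest-neighbour sketch, and excluding the three query words from the candidates):

```python
import numpy as np

def solve_analogy(a, b, c, embeddings, word_to_id, id_to_word):
    """Return the word d whose embedding is closest (in cosine
    similarity) to v_c + (v_b - v_a)."""
    target = embeddings[word_to_id[c]] + (embeddings[word_to_id[b]]
                                          - embeddings[word_to_id[a]])
    sims = (embeddings @ target) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target))
    exclude = {word_to_id[a], word_to_id[b], word_to_id[c]}
    best = max((i for i in range(len(sims)) if i not in exclude),
               key=lambda i: sims[i])
    return id_to_word[best]

# e.g. solve_analogy("man", "woman", "king", embeddings, word_to_id, id_to_word)
# would ideally return "queen"
```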
Visualisation of skip-gram embeddings with negative sampling (Mikolov et al., 2013b):

[Figure reproduced from Mikolov et al. (2013b).]
Intrinsic: Correlation with human judgments
Have humans rate how similar word pairs are, and then measure how well the similarity scores from the word embeddings correlate with these human judgments.
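For example, given human ratings for a list of word pairs and the corresponding cosine similarities from a model, the agreement could be measured with Spearman's rank correlation (a sketch with made-up numbers):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical human ratings (0-10) and model cosine similarities for
# the same list of word pairs, e.g. ("cat", "feline"), ("car", "tree"), ...
human_scores = np.array([9.2, 1.3, 7.5, 4.0, 8.1])
model_scores = np.array([0.81, 0.10, 0.55, 0.42, 0.77])

rho, pvalue = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {rho:.3f}")
```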
Exercises
Exercise 1: Skip-gram optimisation in practice
We derived the following gradient when we looked at skip-gram optimisation:

$$\frac{\partial J_{c,o}(\theta)}{\partial v_c} = -u_o + \sum_{j=1}^{V} P(j \mid c)\, u_j$$

This was for a single training pair (w_t = c, w_{t+j} = o). We can think of this as one item in a training dataset with many pairs:

$$\left\{ \left( c^{(n)}, o^{(n)} \right) \right\}_{n=1}^{N}$$

The gradient descent update for a centre embedding is then

$$v_c \leftarrow v_c - \eta \frac{\partial J(\theta)}{\partial v_c}$$

where η is the learning rate.
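A sketch of how the per-pair gradient could be used in a simple stochastic gradient descent loop over such a dataset (NumPy, toy values, my own illustration; only the centre embeddings are updated here, the updates for the u_k would be analogous):

```python
import numpy as np

rng = np.random.default_rng(5)
vocab, D, eta = 20, 8, 0.1
Vmat = rng.normal(scale=0.1, size=(vocab, D))   # centre embeddings v_k
U = rng.normal(scale=0.1, size=(vocab, D))      # context embeddings u_k

pairs = [(3, 5), (3, 7), (11, 2), (5, 3)]       # toy dataset {(c^(n), o^(n))}

for epoch in range(10):
    for c, o in pairs:                          # one SGD step per pair
        p = np.exp(U @ Vmat[c])
        p /= p.sum()                            # P(j | c) for all j
        grad_v = -U[o] + p @ U                  # dJ_{c,o}/dv_c
        Vmat[c] -= eta * grad_v                 # v_c <- v_c - eta * gradient
```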
Videos covered in this note
• Why word embeddings? (9 min)
• One-hot word embeddings (6 min)
• Skip-gram introduction (7 min)
• Skip-gram loss function (8 min)
• Skip-gram model structure (8 min)
• Skip-gram optimisation (10 min)
• Skip-gram as a neural network (10 min)
• Skip-gram example (2 min)
• Continuous bag-of-words (CBOW) (6 min)
• Skip-gram with negative sampling (16 min)
• GloVe word embeddings (12 min)
• Evaluating word embeddings (21 min)
References
Y. Goldberg and O. Levy, "word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method," arXiv, 2014.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv, 2013.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," NeurIPS, 2013.

J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," EMNLP, 2014.