
Word embeddings

Herman Kamper

2024-04, CC BY-SA 4.0

One-hot vectors

Word2vec

Skip-gram

Continuous bag-of-words (CBOW)

Skip-gram with negative sampling

Global vector (GloVe) word embeddings

Evaluating word embeddings

Motivation
Word embeddings are continuous vector representations of words.

If we could represent words as vectors that capture “meaning”, then we could feed them as input features to standard machine learning models (SVMs, logistic regression, neural networks).

A first approach: One-hot vectors


Each word is represented as a length-V vector. The vector is all zeros except for a one at the index representing that word type.¹

Example:

cat    = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]^⊤
feline = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]^⊤

But one-hot vectors on their own still cannot capture similarity (see the sketch after this list):

• Any two one-hot vectors are orthogonal
• I.e. cosine similarity between two one-hot vectors is always 0
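A minimal NumPy sketch of this, assuming a toy vocabulary (the helper name one_hot and the word list are only illustrative):

```python
import numpy as np

def one_hot(word, vocab):
    """Return a length-V one-hot vector for `word` in a toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

vocab = ["cat", "dog", "feline", "runs", "the"]  # toy vocabulary
cat = one_hot("cat", vocab)
feline = one_hot("feline", vocab)

# Any two distinct one-hot vectors are orthogonal: their dot product is 0.
print(cat @ feline)  # 0.0
```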

Cosine similarity and distance


We often use cosine similarity (or distance) to compare word embeddings. (Why not Euclidean?) The cosine similarity between two vectors is the cosine of the angle between them:

\cos\theta = \frac{(w^{(a)})^\top w^{(b)}}{\|w^{(a)}\| \, \|w^{(b)}\|} \in [-1, 1]

Cosine distance is defined as

d_{\cos}(w^{(a)}, w^{(b)}) \triangleq 1 - \frac{(w^{(a)})^\top w^{(b)}}{\|w^{(a)}\| \, \|w^{(b)}\|} \in [0, 2]
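A small NumPy sketch of these two quantities (the function names and example vectors are only illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between vectors a and b, in [-1, 1]."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    """1 - cos(theta), in [0, 2]."""
    return 1.0 - cosine_similarity(a, b)

# Example with two arbitrary embedding vectors:
w_a = np.array([0.2, -1.3, 0.7])
w_b = np.array([0.1, -0.9, 1.1])
print(cosine_similarity(w_a, w_b), cosine_distance(w_a, w_b))
```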
¹ In this note I use V = |V| to denote the vocabulary size.

Word2vec
Word2vec (Mikolov et al., 2013a) is a framework for learning word
embeddings.

It relies on an idea that is the basis of many modern NLP approaches:

A word’s meaning is given by the words that frequently appear close-by.

Two model variants:

1. Skip-gram: Predict context words given centre word
2. Continuous bag-of-words (CBOW): Predict centre word from context words

Word2vec: Skip-gram
Example windows in skip-gram:

[Figure: two example windows over "... since the man loves his son so much ...", with the centre word w_t predicting each of its context words through P(w_{t-2}|w_t), P(w_{t-1}|w_t), P(w_{t+1}|w_t) and P(w_{t+2}|w_t).]

Skip-gram predicts context words given a centre word:

[Figure: the centre word w_t = "loves" predicting the context words w_{t-2} = "the", w_{t-1} = "man", w_{t+1} = "his", w_{t+2} = "son".]

Skip-gram loss function
Assumptions:

1. Each window is an i.i.d. sample
2. Within each window, each context word is conditionally independent given the centre word, e.g.

P(the, man, his, son | loves) = P(the | loves) P(man | loves) P(his | loves) P(son | loves)

A dataset gives a large number of (w_t, w_{t+j}) input-output word pairs.

For now, let's pretend that our training set consists of a single long sequence w_{1:T}.

Skip-gram loss function


We minimise the negative log likelihood (NLL) of the parameters:²

J(\theta) = -\log \prod_{t=1}^{T} P_\theta(w_{t-M}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+M} \,|\, w_t)

         = -\log \prod_{t=1}^{T} \; \prod_{\substack{-M \le j \le M \\ j \ne 0}} P_\theta(w_{t+j} \,|\, w_t)

         = -\sum_{t=1}^{T} \; \sum_{\substack{-M \le j \le M \\ j \ne 0}} \log P_\theta(w_{t+j} \,|\, w_t)

In practice we optimise the average NLL, i.e. we divide by the number of pairs. (This number of terms is not equal to T. Why not?)

² This is not the probability of the training sequence w_{1:T} like in a language model, but the probability of all the windows. With the two assumptions, this then becomes the product of the individual word-pair probabilities.
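A minimal sketch of how the (w_t, w_{t+j}) training pairs could be collected from a tokenised sequence, assuming a window size of M = 2 (the function name and toy sentence are only illustrative):

```python
def skipgram_pairs(tokens, M=2):
    """Collect all (centre, context) pairs from a token sequence w_{1:T}.

    Each position t contributes up to 2M pairs (fewer at the edges of the
    sequence), so the total number of pairs is not equal to T.
    """
    pairs = []
    for t, centre in enumerate(tokens):
        for j in range(-M, M + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((centre, tokens[t + j]))
    return pairs

tokens = "since the man loves his son so much".split()
print(skipgram_pairs(tokens, M=2)[:6])
```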

Skip-gram model structure
How do we model Pθ (wt+j |wt )? What structure do we use?

Assumption 3: Each context word can be predicted from the centre word in the same way, irrespective of its position. E.g. the prediction of w_{t-2} and w_{t-1} is done in the same way from w_t.

For each word type we have two vectors:

• v_w when w is a centre word
• u_w when w is a context word

For centre word c and context word o we use the following model:

P_\theta(w_{t+j} = o \,|\, w_t = c) = P(o \,|\, c) = \frac{e^{u_o^\top v_c}}{\sum_{k=1}^{V} e^{u_k^\top v_c}}

where V is the vocabulary size.

This can be written as a softmax function. For input w_t = c, the model outputs a probability distribution over the vocabulary:

f_\theta(w_t = c) = \begin{bmatrix} P(1|c) \\ P(2|c) \\ \vdots \\ P(V|c) \end{bmatrix} = \frac{1}{\sum_{k=1}^{V} e^{u_k^\top v_c}} \begin{bmatrix} e^{u_1^\top v_c} \\ e^{u_2^\top v_c} \\ \vdots \\ e^{u_V^\top v_c} \end{bmatrix} = \mathrm{softmax}(U v_c)
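A minimal NumPy sketch of this forward pass, assuming randomly initialised embedding matrices (all names and sizes here are only illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

V, D = 10_000, 100            # vocabulary size and embedding dimension (assumed)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(V, D))      # context (output) vectors, one row per u_k
Vmat = rng.normal(scale=0.1, size=(V, D))   # centre (input) vectors, one row per v_w

c = 42                         # index of the centre word
probs = softmax(U @ Vmat[c])   # f_theta(w_t = c): distribution over all context words
print(probs.shape, probs.sum())  # (10000,) and a sum of (approximately) 1.0
```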

Skip-gram word embeddings


• We get two vectors for a single word type: v and u

• Normally in skip-gram we use the v vectors as word embeddings

• These are represented together in matrix V

• For skip-gram we don’t use the context vectors U at test time

Skip-gram optimisation
Parameters: θ = {V, U}

Perform gradient descent on each parameter vector:

v_c \leftarrow v_c - \eta \frac{\partial J(\theta)}{\partial v_c}
\qquad
u_o \leftarrow u_o - \eta \frac{\partial J(\theta)}{\partial u_o}

We need the gradient of the loss function with respect to these vectors:

J(\theta) = -\sum_{t=1}^{T} \; \sum_{\substack{-M \le j \le M \\ j \ne 0}} \log P_\theta(w_{t+j} \,|\, w_t)

Consider the inside term for a single training pair (w_t = c, w_{t+j} = o):

J_{c,o}(\theta) = -\log P(o \,|\, c) = -\log \frac{e^{u_o^\top v_c}}{\sum_{k=1}^{V} e^{u_k^\top v_c}}

             = -\left[ \log e^{u_o^\top v_c} - \log \sum_{k=1}^{V} e^{u_k^\top v_c} \right]

             = -\left( u_o^\top v_c - \log \sum_{k=1}^{V} e^{u_k^\top v_c} \right)

You can show that

\frac{\partial J_{c,o}(\theta)}{\partial v_c} = -u_o + \sum_{j=1}^{V} \frac{e^{u_j^\top v_c}}{\sum_{k=1}^{V} e^{u_k^\top v_c}} \, u_j = -u_o + \sum_{j=1}^{V} P(j \,|\, c) \, u_j

and similarly for the derivatives w.r.t. the u vectors.
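A minimal NumPy sketch of this gradient and the corresponding update of v_c for a single (c, o) pair (function names, sizes and the learning rate are only illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sgd_step_centre(Vmat, U, c, o, eta=0.1):
    """One gradient descent step on v_c for a single training pair (c, o)."""
    probs = softmax(U @ Vmat[c])          # P(j | c) for all j
    grad_vc = -U[o] + probs @ U           # -u_o + sum_j P(j|c) u_j
    Vmat[c] -= eta * grad_vc              # v_c <- v_c - eta * dJ/dv_c
    return -np.log(probs[o])              # the loss J_{c,o} before the step

rng = np.random.default_rng(0)
V, D = 50, 8
U = rng.normal(scale=0.1, size=(V, D))
Vmat = rng.normal(scale=0.1, size=(V, D))
print(sgd_step_centre(Vmat, U, c=3, o=7))
```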

Example of skip-gram embeddings
For a skip-gram model trained on the AG News dataset, we print the closest word embeddings (cosine distance) to a query word:

Query: referendum
mandate (0.5513) vote (0.5551) ballots (0.5872)
vowing (0.5916) constitutional (0.6109)

Query: venezuela
chavez (0.5229) venezuelas (0.5840)
venezuelan (0.6057) hugo (0.6171) counties (0.6229)

Query: war
terrorism (0.6230) raging (0.6280) resumed (0.6296)
independence (0.6422) deportation (0.6444)

Query: pope
ii (0.5573) democrat (0.6149) canaveral (0.6323)
edwards (0.6377) sen (0.6379)

Query: schumacher
johan (0.5250) ferrari (0.5493) trulli (0.5651)
poland (0.5885) owen (0.5921)

Query: ferrari
rubens (0.5049) austria (0.5281) barrichello (0.5416)
schumacher (0.5493) seasonopening (0.6042)

Query: soccer
football (0.4817) basketball (0.5429)
mens (0.5510) fc (0.5530) ncaa (0.5547)

Query: cricket
kolkata (0.4818) test (0.4908) oval (0.5444)
bangladesh (0.5585) tendulkar (0.5756)
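A sketch of how such nearest-neighbour lists could be produced with cosine distance, assuming an embedding matrix with one row per word (the toy vocabulary and random embeddings here only illustrate the mechanics; a trained matrix V would be used in practice):

```python
import numpy as np

def nearest_neighbours(query, vocab, E, k=5):
    """Return the k words whose embeddings are closest (cosine distance) to `query`."""
    word2idx = {w: i for i, w in enumerate(vocab)}
    q = E[word2idx[query]]
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    dists = 1.0 - E_norm @ (q / np.linalg.norm(q))   # cosine distance to every word
    order = np.argsort(dists)
    return [(vocab[i], float(dists[i])) for i in order if vocab[i] != query][:k]

rng = np.random.default_rng(0)
vocab = ["referendum", "mandate", "vote", "ballots", "soccer", "football"]
E = rng.normal(size=(len(vocab), 16))
print(nearest_neighbours("referendum", vocab, E))
```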

Skip-gram as a neural network
In the paper:

[Figure: the skip-gram architecture as drawn in the paper, with the centre word w_t ("loves") as input and the context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} as outputs.]

But this makes it look like we are predicting the context words given
the window word, which is a bit misleading.

More accurately, as a neural network vector diagram:


[Figure: a vector diagram of skip-gram as a neural network. The centre embedding v_c (here for "loves") is multiplied with every context vector u_1, ..., u_V (for "aardvark" through "zoo") to give the scores u_k^⊤ v_c, and a softmax over these scores gives f_{θ,o}(c) = P_θ(w_{t+j} = o | w_t = c).]

Word2vec: Continuous bag-of-words
(CBOW)
The continuous bag-of-words (CBOW) model predicts a centre word
given the context words:

[Figure: the context words w_{t-2} = "the", w_{t-1} = "man", w_{t+1} = "his", w_{t+2} = "son" together predicting the centre word w_t = "loves", i.e. P(loves | the, man, his, son).]

Assumption: Each window is an i.i.d. sample

Loss function
We again minimise the NLL of the parameters:

J(\theta) = -\log \prod_{t=1}^{T} P_\theta(w_t \,|\, w_{t-M}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+M})

         = -\sum_{t=1}^{T} \log P_\theta(w_t \,|\, w_{t-M}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+M})

Model structure
We now have multiple context words in a single training sample, so we calculate an average context embedding \bar{v}_o = \frac{1}{2M}(v_{o_1} + v_{o_2} + \ldots + v_{o_{2M}}). This gives the model:

P_\theta(w_t = c \,|\, w_{t-M} = o_1, w_{t-M+1} = o_2, \ldots, w_{t+M} = o_{2M})
= \frac{\exp\left\{ \frac{1}{2M} u_c^\top (v_{o_1} + v_{o_2} + \ldots + v_{o_{2M}}) \right\}}{\sum_{k=1}^{V} \exp\left\{ \frac{1}{2M} u_k^\top (v_{o_1} + v_{o_2} + \ldots + v_{o_{2M}}) \right\}}
= \frac{e^{u_c^\top \bar{v}_o}}{\sum_{k=1}^{V} e^{u_k^\top \bar{v}_o}}
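A minimal NumPy sketch of this CBOW forward pass, assuming randomly initialised embeddings (names and sizes are only illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cbow_probs(context_idx, Vmat, U):
    """P(w_t = . | context): softmax over centre words given averaged context vectors."""
    v_bar = Vmat[context_idx].mean(axis=0)   # average context embedding v-bar
    return softmax(U @ v_bar)

rng = np.random.default_rng(0)
V, D = 50, 8
Vmat = rng.normal(scale=0.1, size=(V, D))    # context embeddings v
U = rng.normal(scale=0.1, size=(V, D))       # centre embeddings u
context_idx = [2, 7, 11, 30]                 # indices of the 2M context words
print(cbow_probs(context_idx, Vmat, U).shape)
```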

Optimisation
As for skip-gram, we optimise the parameters using gradient descent.
The gradients can be derived as for skip-gram.

Word embeddings
Unlike in skip-gram, for CBOW it is common to use the context
vectors as word embeddings (denoted as v for CBOW). The centre
embeddings u are thrown away.

Skip-gram with negative sampling
Mikolov et al. (2013b) proposed an extension of the original skip-gram to overcome some of its shortcomings.

In standard skip-gram we have:

P_\theta(w_{t+j} = o \,|\, w_t = c) = P(o \,|\, c) = \frac{e^{u_o^\top v_c}}{\sum_{k=1}^{V} e^{u_k^\top v_c}}

The normalisation is over V terms, which could be (and probably is) huge!

Negative sampling
Instead, treat it as a binary logistic regression problem:

• y = 1 when a centre word w_t = c is paired with a word w_{t+j} = o occurring in its context window
• y = 0 when a centre word w_t = c is paired with a randomly sampled word w_{t+j} = k not occurring in its context

The model now becomes:

P_\theta(y = 1 \,|\, w_t = c, w_{t+j} = o) = \sigma(u_o^\top v_c) = \frac{1}{1 + e^{-u_o^\top v_c}}

If we had just one positive pair (w_t = c, w_{t+j} = o) in our training set, the NLL would be

J_{c,o}(\theta) = -\log P_\theta(y = 1 \,|\, w_t = c, w_{t+j} = o) = -\log \sigma(u_o^\top v_c)

If we only had this single positive example (y = 1), we could easily hack the loss by just making u_o and v_c really big. So we need some negative examples (y = 0).

For each positive pair (w_t = c, w_{t+j} = o) we sample K words not occurring in the context window of c. The loss now becomes:

J_{c,o}(\theta) = -\log \left[ P_\theta(y = 1 \,|\, w_t = c, w_{t+j} = o) \prod_{k=1}^{K} P_\theta(y = 0 \,|\, w_t = c, w_{t+j} = w_k) \right]

             = -\log P_\theta(y = 1 \,|\, w_t = c, w_{t+j} = o) - \sum_{k=1}^{K} \log P_\theta(y = 0 \,|\, w_t = c, w_{t+j} = w_k)

             = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \left( 1 - \sigma(u_{w_k}^\top v_c) \right)

             = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_{w_k}^\top v_c)
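A minimal NumPy sketch of this loss for one positive pair and K negative samples (the names and random vectors are only illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """J_{c,o} for one positive pair plus K negative samples (rows of U_neg)."""
    positive = -np.log(sigmoid(u_o @ v_c))             # -log sigma(u_o^T v_c)
    negative = -np.sum(np.log(sigmoid(-U_neg @ v_c)))  # -sum_k log sigma(-u_{w_k}^T v_c)
    return positive + negative

rng = np.random.default_rng(0)
D, K = 8, 5
v_c = rng.normal(size=D)                 # centre word embedding
u_o = rng.normal(size=D)                 # true context word embedding
U_neg = rng.normal(size=(K, D))          # K randomly sampled "negative" context embeddings
print(neg_sampling_loss(v_c, u_o, U_neg))
```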

General ML idea: Contrastive learning


This idea of contrasting observations in the vicinity of a sample with
negative samples from elsewhere forms the basis for the more general
idea of contrastive learning in machine learning.

Global vector (GloVe) word embeddings
Old-school word embeddings
Before methods like the (neural-like) word2vec framework, word embeddings were based on co-occurrence counts.

A co-occurrence count matrix was typically decomposed using some matrix factorization approach to get lower-dimensional word embeddings.

Advantages over word2vec:

• Very fast
• Capture global statistics that word2vec misses: word2vec considers windows within a batch; just counting can easily tell you how words co-occur over an entire corpus

Despite these advantages (especially the speed one), word2vec just works better (normally).

But could we get the best of both worlds? The global vector (GloVe)
word embedding method tries to do this.

The GloVe model
C_{c,o} is the total number of times that centre word c occurs with context word o in the same context window. (For old-school approaches, we would have started by collecting such counts.)

GloVe tries to minimise the squared loss between the model output f_θ(c, o) and the log of these counts:

J(\theta) = \sum_{c=1}^{V} \sum_{o=1}^{V} \left( f_\theta(c, o) - \log C_{c,o} \right)^2

We could use something fancy for f_θ, but let's go with something simple:

f_\theta(c, o) = u_c^\top v_o + b_c + c_o

where b_c and c_o are bias terms for the centre and context words, respectively.

We might also not care that much about word pairs with very few counts, so we can weight these down:

J(\theta) = \sum_{c=1}^{V} \sum_{o=1}^{V} h(C_{c,o}) \left( f_\theta(c, o) - \log C_{c,o} \right)^2

where h(x) is a weighting function.

A suggested choice is

h(x) = \begin{cases} \left( \frac{x}{100} \right)^{0.75} & \text{if } x < 100 \\ 1 & \text{otherwise} \end{cases}

[Plot: the weighting function h(C_{c,o}) against C_{c,o}, increasing from 0 and capped at 1 for C_{c,o} ≥ 100.]

Since h(0) = 0, the squared loss is only calculated over word pairs where C_{c,o} > 0. This makes training way faster compared to considering all possible word pairs.
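A minimal NumPy sketch of this weighted loss over the nonzero counts, assuming toy co-occurrence counts and randomly initialised parameters (all names are only illustrative):

```python
import numpy as np

def h(x):
    """GloVe weighting function: (x/100)^0.75 below the cap, 1 otherwise."""
    return np.where(x < 100, (x / 100) ** 0.75, 1.0)

def glove_loss(C, U, Vmat, b, c_bias):
    """Weighted squared loss, summed only over pairs with C_{c,o} > 0."""
    rows, cols = np.nonzero(C)
    pred = np.sum(U[rows] * Vmat[cols], axis=1) + b[rows] + c_bias[cols]
    return np.sum(h(C[rows, cols]) * (pred - np.log(C[rows, cols])) ** 2)

rng = np.random.default_rng(0)
V, D = 20, 8
C = rng.poisson(1.0, size=(V, V)).astype(float)      # toy co-occurrence counts
U = rng.normal(scale=0.1, size=(V, D))               # centre embeddings u
Vmat = rng.normal(scale=0.1, size=(V, D))            # context embeddings v
b, c_bias = np.zeros(V), np.zeros(V)                 # bias terms b_c and c_o
print(glove_loss(C, U, Vmat, b, c_bias))
```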

The original paper (Pennington et al., 2014) notes that either the centre word embeddings or the context embeddings can be used as word embeddings. In the paper they actually sum the two to get the final embedding.

Further reading
There are some other interpretations of the GloVe loss. E.g. you can
make some (loose, intuitive) connections with the skip-gram loss itself.

Evaluating word embeddings
Qualitative evaluation
• Consider the closest word embeddings to some query words (as we did for the skip-gram model above)

• Visualise the embeddings in two or three dimensions using dimensionality reduction:

– PCA
– t-SNE
– UMAP

Skip-gram embeddings visualised with UMAP using https://projector.tensorflow.org/:

[Figure: UMAP visualisation of the skip-gram embeddings in the TensorFlow Embedding Projector.]
Extrinsic evaluation
• Build and evaluate a downstream system for a real task

• E.g. text classification or named entity recognition

• Slow: Need to build and evaluate system

• Sometimes unclear whether there are interactions between subsystems, making a single subsystem difficult to evaluate
• But on the other hand, also the best real-world test setting

Intrinsic evaluation
• Word analogy tasks (weird)

• Correlation with human word similarity judgments

Intrinsic: Word analogy
A word analogy is specified as

a : b :: c : d

Task: Given a, b, c find d

Example:

man : woman :: king : ?

To solve the task using some word embedding approach, we find the word with vector closest to

v_c + (v_b - v_a)

and check if this matches a ground-truth labelling.


v2

woman
king

man

v1

Cosine distance is often used to determine the closest word embedding.

One problem is that information might not be encoded linearly in the word embeddings.
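A sketch of this analogy procedure, assuming an embedding matrix with one row per word (the toy vocabulary and random embeddings are only for illustration; trained embeddings would be used in practice):

```python
import numpy as np

def solve_analogy(a, b, c, vocab, E):
    """Return the word whose embedding is closest (cosine) to v_c + (v_b - v_a)."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = E[idx[c]] + (E[idx[b]] - E[idx[a]])
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ (target / np.linalg.norm(target))
    # Exclude the three query words themselves from the candidates.
    for w in (a, b, c):
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "apple"]
E = rng.normal(size=(len(vocab), 16))
print(solve_analogy("man", "woman", "king", vocab, E))
```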

Visualisation of skip-gram embeddings with negative sampling (Mikolov
et al., 2013b):

Intrinsic: Correlation with human judgments
Have humans rate how similar word pairs are.

Examples from WordSim353:

Word 1 Word 2 Human (mean)


tiger cat 7.35
tiger tiger 10.00
book paper 7.46
computer internet 7.58
plane car 5.77
professor doctor 6.62
stock phone 1.62
stock CD 1.31
stock jaguar 0.92

Compare this to the similarity assigned by a word embedding approach. Normally Spearman’s rank correlation coefficient is used as the metric.
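A minimal sketch of this comparison using scipy.stats.spearmanr, with the human scores taken from the table above and made-up model similarities standing in for real cosine similarities:

```python
import numpy as np
from scipy.stats import spearmanr

# Human similarity ratings for some WordSim353-style pairs (from the table above)
# and hypothetical cosine similarities from a word embedding model (assumed values).
human = np.array([7.35, 7.46, 7.58, 5.77, 6.62, 1.62, 1.31, 0.92])
model = np.array([0.61, 0.55, 0.70, 0.48, 0.52, 0.10, 0.15, 0.05])

rho, _ = spearmanr(human, model)
print(f"Spearman rank correlation: {rho:.3f}")
```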

A comparison of different word embedding approaches on WordSim353 (Pennington et al., 2014):

Model Size Correlation (%)


SVD 6B 35.3
SVD-S 6B 56.5
SVD-L 6B 65.7
CBOW 6B 57.2
Skip-gram 6B 62.8
GloVe 6B 65.8
SVD-L 42B 74.0
GloVe 42B 75.9
CBOW 100B 68.4

Exercises
Exercise 1: Skip-gram optimisation in practice
We derived the following gradient when we looked at skip-gram optimisation:

\frac{\partial J_{c,o}(\theta)}{\partial v_c}

This was for a single training pair (w_t = c, w_{t+j} = o). We can think of this as one item in a training dataset with many pairs:

\left\{ (c^{(n)}, o^{(n)}) \right\}_{n=1}^{N}

Given this dataset of pairs, how will we in practice actually do the gradient descent steps?

v_c \leftarrow v_c - \eta \frac{\partial J(\theta)}{\partial v_c}

Exercise 2: The skip-gram loss function and cross-entropy


For a centre word c, the skip-gram model outputs a vector f_θ(w_t = c) ∈ [0, 1]^V. Let's represent the target context word o as a one-hot vector y. Show that the loss for the skip-gram model (without negative sampling) can be written as the cross-entropy between y and f_θ(w_t = c), if you treat these as discrete distributions over the V words in the vocabulary.

Videos covered in this note
• Why word embeddings? (9 min)
• One-hot word embeddings (6 min)
• Skip-gram introduction (7 min)
• Skip-gram loss function (8 min)
• Skip-gram model structure (8 min)
• Skip-gram optimisation (10 min)
• Skip-gram as a neural network (10 min)
• Skip-gram example (2 min)
• Continuous bag-of-words (CBOW) (6 min)
• Skip-gram with negative sampling (16 min)
• GloVe word embeddings (12 min)
• Evaluating word embeddings (21 min)

References
Y. Goldberg and O. Levy, “word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method,” arXiv, 2014.

C. Manning, “CS224N: Introduction and word vectors,” Stanford University, 2022.

C. Manning, “CS224N: Word vectors, word senses, and neural classifiers,” Stanford University, 2022.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in ICLR, 2013a.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NeurIPS, 2013b.

J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in EMNLP, 2014.

A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2021.
