
Word Vectors

Marta R. Costa-jussà
Universitat Politecnica de Catalunya, Barcelona

Based on slides by
Christopher Manning, Stanford University, adapted from CS224n slides: Lecture 1
and illustrations from Jay Alammar, The Illustrated Word2Vec
Word Vectors 1
Towards an efficient representation of words
(slides 2–5: introductory figures)

Word Vectors 5
What to read

• Distributed Representations of Words and Phrases and their Compositionality [pdf]


• Efficient Estimation of Word Representations in Vector Space [pdf]
• A Neural Probabilistic Language Model [pdf]
• Speech and Language Processing by Dan Jurafsky and James H. Martin is a leading
resource for NLP. Word2vec is tackled in Chapter 6.
• Neural Network Methods in Natural Language Processing by Yoav Goldberg is a great read
for neural NLP topics.
• Chris McCormick has written some great blog posts about Word2vec. He also just released
The Inner Workings of word2vec, an E-book focused on the internals of word2vec.
• Want to read the code? Here are two options:
• Gensim’s python implementation of word2vec
• Mikolov’s original implementation in C – better yet, this version with detailed comments
from Chris McCormick.
• Evaluating distributional models of compositional semantics
• On word embeddings, part 2
• Dune

• WE and NLP: (Levy and Goldberg, 2014, NIPS)

Word Vectors 6
Outline

● Word Embeddings: word2vec

● Beyond Word2vec: Glove and Word Senses

● Gender Bias in Word Embeddings

Word Vectors 7
A Word embedding is a numerical representation of a word

● Word embeddings allow for arithmetic operations on a text

○ Example: time + flies

● Word embeddings have also been referred to as:


○ Semantic Representation of Words
○ Word Vector Representation

Word Vectors 8
Vector representation of flies and time

Word Vectors 9
Questions that may arise

● How can we obtain those numbers?


● What’s word2vec?
● Is it the only way to obtain those numbers?
● Do the vectors (and components!) have any semantic meaning?
● Are we crazy to sum or multiply words?

Word Vectors 10
Distributional Hypothesis: Contextuality

● Never ask for the meaning of a word in isolation, but only in the
context of a sentence
(Frege, 1884)

● For a large class of cases... the meaning of a word is its use in
the language
(Wittgenstein, 1953)

● You shall know a word by the company it keeps (Firth, 1957)

● Words that occur in similar contexts tend to have similar
meaning (Harris, 1954)

Word Vectors 11
Word Vector Space

Word Vectors 12
Similar Meanings…

Word Vectors 13
Background: one-hot, frequency-based, word embeddings

● One-hot representation

● Term frequency or TF-IDF methods

● Word embeddings

Count-based 14
One-hot vectors

Two for tea and tea for two
Tea for me and tea for you
You for me and you for me

two = [1,0,0,0]
tea = [0,1,0,0]
me  = [0,0,1,0]
you = [0,0,0,1]

Word Vectors 15
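As a quick illustration (not part of the original slides), here is a minimal Python sketch that builds these one-hot vectors; the four-word vocabulary simply follows the toy example above.

# Minimal sketch: one-hot vectors for the toy corpus above.
# Assumption: we keep only the four content words shown on the slide.
import numpy as np

vocab = ["two", "tea", "me", "you"]          # slide's vocabulary
index = {w: i for i, w in enumerate(vocab)}  # word -> position

def one_hot(word):
    v = np.zeros(len(vocab), dtype=int)
    v[index[word]] = 1
    return v

print(one_hot("tea"))   # [0 1 0 0]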
Vector Space Model: Term-document matrix

Count-based 16
Term Frequency – Inverse Document Frequency

TF-IDF(t, d, D) = TF(t, d)×IDF(t, D)

Count-based 17
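A small illustrative sketch of computing TF-IDF vectors for the toy sentences above, assuming scikit-learn's TfidfVectorizer (the slides do not prescribe any library, and its IDF smoothing differs slightly from the bare formula):

# Sketch: TF-IDF(t, d, D) = TF(t, d) x IDF(t, D), using scikit-learn.
# The three toy documents reuse the sentences from the one-hot slide.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["two for tea and tea for two",
        "tea for me and tea for you",
        "you for me and you for me"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # documents x terms, sparse matrix
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))                  # each row is a document vector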
Problems with simple co-occurrence vectors

Increase in size with vocabulary

Very high dimensional: requires a lot of storage

Count-based 18
Solution: Low dimensional vectors

• Idea: store “most” of the important information in a fixed,


small number of dimensions: a dense vector

• Usually 25–1000 dimensions

• How to reduce the dimensionality?

Count-based 19
Method: Dimensionality Reduction on X (HW1)

Singular Value Decomposition of the co-occurrence matrix X

Factorizes X into UΣVᵀ, where U and V are orthonormal.

Retain only the k largest singular values, in order to generalize:
X_k is the best rank-k approximation to X in terms of least squares.
Classic linear algebra result. Expensive to compute for large matrices.

Count-based 20
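A minimal numpy sketch of this truncation, using a small random stand-in for the co-occurrence matrix X (an assumption; a real X would come from corpus counts):

# Sketch: rank-k approximation of a co-occurrence matrix X via SVD.
import numpy as np

X = np.random.rand(100, 100)                 # toy stand-in for word-word counts
k = 10                                       # target dimensionality

U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]              # k-dimensional dense word vectors
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # best rank-k approximation of X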
word2vec

1. king - man + woman = queen


2. Huge splash in NLP world
3. Learns from raw text
4. Pretty simple algorithm

Direct-based 21
Word Embeddings use simple feed-forward network

● No deep learning at all!

● A hidden layer in a NN interprets the input in its own way to optimise its
performance on the task at hand

● The size of the hidden layer gives you the dimension of the word
embeddings

Word Vectors 22
word2vec

1. Set up an objective function


2. Randomly initialize vectors
3. Do gradient descent

Direct-based 23
Word embeddings are learned by a neural network with one of two tasks/objectives:

1. predict the probability of a word given a context (CBoW)


2. predict the context given a word (skip-gram)

Word Vectors 24
Continuous Bag of Words, CBoW

Word Vectors 25
Skip-Gram Model

Word Vectors 26
Skip-gram vs. CBOW

Skip-gram: guess the context given the word.
“The fox jumped over the lazy dog” – the center word is vIN; each word in its window is vOUT.
(this is the one we went over)

CBOW: guess the word given the context.
“The fox jumped over the lazy dog” – the context words are vIN; the center word is vOUT.
Better at syntax. ~20x faster.
(this is the alternative.)

Word Vectors 27
Observations (Tensorflow Tutorial)

● CBoW
○ Smoothes over a lot of the distributional information by treating an
entire context as one observation. This turns out to be a useful thing
for smaller datasets.
● Skip-gram
○ Treats each context–target pair as a new observation, and this tends
to do better when we have larger datasets.

Word Vectors 28
“The fox jumped over the lazy dog”

word2vec: learn a word's vector from its surrounding context.

Maximize the likelihood of seeing the surrounding words given the word over:

P(the|over)
P(fox|over)
P(jumped|over)
P(the|over)
P(lazy|over)
P(dog|over)

…instead of maximizing the likelihood of co-occurrence counts.

Word Vectors 29
Word2vec: objective function

For each position t = 1, …, T, predict context words within a window of
fixed size m, given center word w_t: P(vOUT|vIN)

Likelihood = L(θ) = ∏_{t=1}^{T}  ∏_{−m ≤ j ≤ m, j ≠ 0}  P(w_{t+j} | w_t ; θ)

(Loop 1 runs over positions t; Loop 2 runs over window offsets j.)

30–31
Twist: we have two vectors for every word.
Which one is used depends on whether the word is the input or the output.

A context window around every input word.

P(vOUT|vIN)

“The fox jumped over the lazy dog”
(“over” is vIN here)

32
Loop 1: for the word ‘over’; Loop 2: iterate over the window around ‘over’

A context window around every input word.

P(vOUT|vIN)

“The fox jumped over the lazy dog”
(vIN = over; at each step of loop 2, vOUT is the next word in the window around ‘over’)

33–38
Once loop 2 is finished for the word ‘over’, loop 1 moves on to the following word.

Loop 1: for the word ‘the’; Loop 2: iterate over the window around ‘the’

A context window around every input word.

P(vOUT|vIN)

“The fox jumped over the lazy dog”
(vIN = the; at each step of loop 2, vOUT is the next word in the window around ‘the’)

39–46
Word2vec: objective function

For each position t = 1, …, T, predict context words within a window of
fixed size m, given center word w_t: P(vOUT|vIN)

Likelihood = L(θ) = ∏_{t=1}^{T}  ∏_{−m ≤ j ≤ m, j ≠ 0}  P(w_{t+j} | w_t ; θ)

(Loop 1 runs over positions t; Loop 2 runs over window offsets j.)

47
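A short sketch of the two loops above, enumerating (vIN, vOUT) pairs over the example sentence; the window size m = 2 is an assumption, since the slides do not fix one:

# Sketch of Loop 1 / Loop 2: enumerate (center, context) skip-gram pairs.
sentence = "the fox jumped over the lazy dog".split()
m = 2                                                  # assumed window size

pairs = []
for t, center in enumerate(sentence):                  # Loop 1: each position t
    for j in range(-m, m + 1):                         # Loop 2: offsets in the window
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))    # (v_IN, v_OUT)

print(pairs[:4])   # [('the', 'fox'), ('the', 'jumped'), ('fox', 'the'), ...]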
How should we define P(vOUT|vIN)?

Measure loss between vIN and vOUT?

P(vOUT | vIN ; θ)

vin . vout

48
vin . vout ~ 1    (vin and vout point in roughly the same direction)

Word Vectors 49

vin . vout ~ 0    (vin and vout are roughly orthogonal)

Word Vectors 50

vin . vout ~ -1   (vin and vout point in roughly opposite directions)

Word Vectors 51

But we’d like to measure a probability.

vin . vout ∈ [-1,1]

Word Vectors 52
But we’d like to measure a probability.

Dot product compares the similarity of vout and vin:
larger dot product = larger probability.
Exponentiation makes anything positive.

P(vout|vin) = exp(vin . vout) / Σ_{k∈V} exp(vin . vk)

Normalize over the entire vocabulary to give a probability distribution.

Word Vectors 53
But we’d like to measure a probability.

P(vout|vin) = exp(vin . vout) / Σ_{k∈V} exp(vin . vk)

Word Vectors 54
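A tiny numpy sketch of this softmax over the vocabulary, with a toy random vocabulary as a stand-in for real embeddings:

# Sketch: P(v_out | v_in) = exp(v_in . v_out) / sum_k exp(v_in . v_k)
import numpy as np

rng = np.random.default_rng(0)
V, d = 4, 8                                   # toy vocab size and embedding size
out_vectors = rng.normal(size=(V, d))         # one "output" vector per word
v_in = rng.normal(size=d)                     # input vector of the center word

scores = out_vectors @ v_in                   # v_in . v_k for every k in V
probs = np.exp(scores) / np.exp(scores).sum() # softmax over the vocabulary
print(probs, probs.sum())                     # a distribution summing to 1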
Summary of the process — training text: “Thou shalt not make a machine in the likeness of a human mind”

Untrained model
Task: are the two words
neighbours?

not

thou

vIN
VOUT

P(vOUT|vIN) softmax(vin . vout )


Word Vectors 55
Step-by-step

Let’s glance at how we use it to train a basic model that predicts if


two words appear together in the same context.

Word Vectors 56
Preliminary steps — training text: “Thou shalt not make a machine in the likeness of a human mind”

We start with the first sample in our dataset. We grab the features and feed them to the
untrained model, asking it to predict whether the two words are in the same context or not (1
or 0).

Word Vectors 57
Preliminary steps: Negative examples

This can now be computed at blazing
speed – processing millions of examples
in minutes. But there's one loophole we
need to close. If all of our examples are
positive (target: 1), we open ourselves to
the possibility of a smartass model that
always returns 1 – achieving 100%
accuracy, but learning nothing and
generating garbage embeddings.

Word Vectors 58
Preliminary steps: Negative examples

For each sample in our dataset, we add negative examples. Those have the same
input word, and a 0 label.

We are contrasting the actual signal (positive examples of neighboring words) with
noise (randomly selected words that are not neighbors). This leads to a great tradeoff
of computational and statistical efficiency.

Word Vectors 59
Preliminary steps: pre-process the text

Now that we’ve established the two central ideas of skipgram and negative sampling,
one last preliminary step: we pre-process the text we're training the model
on. In this step, we determine the size of our vocabulary (we'll call
this vocab_size; think of it as, say, 10,000) and which words belong to it.

Word Vectors 60
Training process: embedding and context matrices

Now that we've established the two central ideas of skip-gram and negative
sampling and pre-processed the text, we can look more closely at the actual word2vec
training process.

At the start of the training phase, we create two matrices – an Embedding
matrix and a Context matrix. These two matrices have an embedding for
each word in our vocabulary (so vocab_size is one of their dimensions).
The second dimension is how long we want each embedding to be
(embedding_size – 300 is a common value).

Word Vectors 61
Training process: matrix initialization

1. At the start of the training process, we initialize these matrices with random
values. Then we start the training process. In each training step, we take one
positive example and its associated negative examples. Let’s take our first
group:

Word Vectors 62
Training process

2. Now we have four words:

○ the input word not
○ the output/context words (1-word window):
thou (the actual neighbor), aaron,
and taco (the negative examples).

We proceed to look up their embeddings –
for the input word, we look in the Embedding
matrix. For the context words, we look in the
Context matrix (even though both matrices
have an embedding for every word in our
vocabulary).
Word Vectors 63
Training process

3. Then, we take the dot product of the input embedding with each of the
context embeddings. In each case, that results in a number that
indicates the similarity of the input and context embeddings.

4. Now we need a way to turn these scores into something that looks like
probabilities – we need them to all be positive and have values between zero
and one. This is a great task for sigmoid, the logistic operation. And we can now
treat the output of the sigmoid operations as the model’s output for these
examples.

Word Vectors 64
Training process

5. Now that the untrained model has made a prediction, and seeing as though we
have an actual target label to compare against, let’s calculate how much error is
in the model’s prediction. To do that, we just subtract the sigmoid scores from the
target labels.

Word Vectors 65
Training process

6. Here comes the “learning” part of “machine learning”. We can now use this
error score to adjust the embeddings of not, thou, aaron, and taco so that the
next time we make this calculation, the result would be closer to the target scores

Word Vectors 66
Training process

7. This concludes the training step.


We emerge from it with slightly
better embeddings for the words
involved in this step (not, thou,
aaron, and taco). We now proceed
to our next step (the next positive
sample and its associated
negative samples) and do the
same process again.

Word Vectors 67
Training process

8. The embeddings continue to improve as we cycle through the
entire dataset a number of times. We can then stop the training process,
discard the Context matrix, and use the Embedding matrix as our pre-trained
embeddings for the next task.

Word Vectors 68
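A compact numpy sketch of one such training step (skip-gram with negative sampling) for the positive pair (not, thou) and the negatives aaron and taco. The matrix shapes, learning rate and initialization are illustrative assumptions, not the reference word2vec implementation:

# Sketch of one SGNS training step: positive pair (not, thou) plus negatives.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"not": 0, "thou": 1, "aaron": 2, "taco": 3}
vocab_size, embedding_size, lr = len(vocab), 50, 0.025   # lr is an assumed learning rate

embedding = rng.normal(scale=0.1, size=(vocab_size, embedding_size))  # Embedding matrix
context   = rng.normal(scale=0.1, size=(vocab_size, embedding_size))  # Context matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center = vocab["not"]
targets = [("thou", 1.0), ("aaron", 0.0), ("taco", 0.0)]  # 1 = neighbour, 0 = negative

v_in = embedding[center]
grad_in = np.zeros_like(v_in)
for word, label in targets:
    v_out = context[vocab[word]]
    score = sigmoid(v_in @ v_out)                 # model's "are these neighbours?" output
    error = score - label                         # difference between prediction and target
    grad_in += error * v_out
    context[vocab[word]] -= lr * error * v_in     # nudge the context embedding
embedding[center] -= lr * grad_in                 # nudge the input embedding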
Optimization Process

Gradient Descent

We compute gradients for each center vector vIN in a window. We also need gradients for
the outside vectors vOUT.

But a corpus may have 40B tokens and windows, so you would wait a very long time before
making a single update!

Stochastic Gradient Descent

We update the parameters after each small sample of corpus sentences (called a batch) →
stochastic gradient descent (SGD): update the weights after each batch.

Word Vectors 69
Let’s Play!

● Word Embedding Visual Inspector, wevi
https://ronxin.github.io/wevi/

● Gensim
http://web.stanford.edu/class/cs224n/materials/Gensim%20word%20vector%20visualization.html

● Embedding Projector
http://projector.tensorflow.org/

Word Vectors 70
Embedded space geometry

● King-Man + Woman = Queen

Word Vectors 71
Word2vec trained on the Catalan Wikipedia (Viquipèdia)

‘dimecres’ + (‘dimarts’ – ‘dilluns’) = ‘dijous’        (Wednesday + (Tuesday – Monday) = Thursday)
‘tres’ + (‘dos’ – ‘un’) = ‘quatre’                     (three + (two – one) = four)
‘tres’ + (‘2’ – ‘dos’) = ‘3’
‘viu’ + (‘coneixia’ – ‘coneix’) = ‘vivia’              (lives + (knew – knows) = lived)
‘la’ + (‘els’ – ‘el’) = ‘les’                          (singular + (plural – singular) = plural article)
‘Polònia’ + (‘francès’ – ‘França’) = ‘polonès’         (Poland + (French – France) = Polish)

Word Vectors 72
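A hedged sketch of reproducing such analogies with gensim's KeyedVectors.most_similar; the particular pre-trained vectors ("glove-wiki-gigaword-100") are an assumption, and any word2vec/GloVe model works:

# Sketch: king - man + woman ~ queen with gensim's most_similar API.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")       # downloads pre-trained vectors (assumption)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Expected to rank 'queen' at or near the top.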
GloVe and Word Senses

Word Vectors 73
Frequency based vs. direct prediction

Count-based: LSA, HAL (Lund & Burgess); COALS, Hellinger-PCA (Rohde et al., Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• Disproportionate importance given to large counts

Direct prediction: Skip-gram/CBOW (Mikolov et al.); NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)
• Scales with corpus size
• Inefficient usage of statistics
• Generate improved performance on other tasks
• Can capture complex patterns beyond word similarity

Word Vectors 74
GloVe

Combines the advantages of the two major model families in the
literature: global matrix factorization and local context window
methods.

The model efficiently leverages statistical information by training only
on the nonzero elements in a word–word co-occurrence matrix, rather
than on the entire sparse matrix or on individual context windows in
a large corpus.

Word Vectors 75
Ratios of co-occurrence probabilities can encode meaning components

Word Vectors 76
How?

Word Vectors 77
GloVe

GloVe does this by fitting a function that represents ratios of co-
occurrence probabilities rather than the probabilities themselves.

• Fast training
• Scalable to huge corpora
• Good performance even with
small corpus and small vectors

Word Vectors 78
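For reference (not spelled out on the slide), the GloVe objective from Pennington et al. (2014) is a weighted least-squares fit of word-vector dot products to log co-occurrence counts:

J = Σ_{i,j=1}^{V} f(X_ij) · (w_i · w̃_j + b_i + b̃_j − log X_ij)²

where X_ij is the number of times word j occurs in the context of word i, and the weighting function f caps the influence of very frequent pairs.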
Word Senses

• Most words have lots of meanings!


• Especially common words
• Especially words that have existed for a long time

Word Vectors 79
Improving Word Representations Via Global Context And Multiple Word Prototypes
(Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each word
assigned to multiple different clusters: bank1, bank2, etc.

80
Linear Algebraic Structure of Word Senses, with application to polysemy
(Arora, …,Ma, …,TACL2018)

• Different senses of a word reside in a linear superposition
(weighted sum) in standard word embeddings like word2vec

• v_pike = α1·v_pike1 + α2·v_pike2 + α3·v_pike3

• where α1 = f1 / (f1 + f2 + f3), etc., for sense frequencies f

• Surprising result:
• Because of ideas from sparse coding you can actually separate out the
senses (provided they are relatively common)
Not so nice…

Man is to computer programmer as woman is to ….

Word Vectors 82
Gender bias in words embeddings

Word Vectors 83
Logic Riddle

A man and his son are in a terrible accident and are rushed to the hospital
in critical care.

The doctor looks at the boy and exclaims "I can't operate on this boy, he's my
son!”

How could this be?

Word Vectors 84
“Doctor” vs “Female doctor”
Word Vectors 85
Related Work: Word Embeddings encode bias
[credits to Hila Gonen]

[Caliskan et al. 2017] replicate a spectrum of known human biases using word
embeddings, showing that text corpora contain several types of bias:
○ morally neutral biases, as toward insects or flowers
○ problematic biases, as toward race or gender
○ biases reflecting the distribution of gender with respect to careers or first names

Word Vectors 86
Techniques to Debias Word Embeddings

(1) Debias After Training [Bolukbasi et al. 2016] ---> Debias WE


Define a gender direction
Define inherently neutral words (nurse as opposed to mother)
Zero the projection of all neutral words on the gender direction
Remove that direction from words

(2) Debias During Training [Zhao et al. 2018] ---> GN-Glove


Train word embeddings using GloVe (Pennington et al., 2014)
Alter the loss to encourage the gender information to concentrate in the last
coordinate (use two groups of male/female seed words, and encourage words from
different groups to differ in their last coordinate)
To ignore gender information –simply remove the last coordinate

Word Vectors 87
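A tiny sketch of the "zero the projection" step in (1); the vectors v and g below are toy stand-ins for a real word embedding and a gender direction:

# Sketch of hard-debiasing's projection step: remove the gender component
# from a neutral word's vector. `v` and `g` are illustrative stand-ins.
import numpy as np

def remove_gender_component(v, g):
    """Zero the projection of vector v on gender direction g."""
    g = g / np.linalg.norm(g)
    return v - (v @ g) * g

v = np.array([0.4, -0.2, 0.7])   # toy "neutral word" embedding
g = np.array([1.0, 0.0, 0.0])    # toy gender direction
print(remove_gender_component(v, g))   # the component along g is zeroed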
Experiments For Evaluation Bias

Three experiments were carried out in our evaluation:


1. Detecting the gender space and the Direct bias
2. Male and female biased words clustering
3. Classification approach of biased words

Our comparison is based on pre-trained sets of all


these options. For experiments, we use the English-
German news corpus from WMT18
Word Vectors 88
Lists for Definitional, Biased and Professional Terms

● Definitional List 10 pairs (e.g. he-she, man-


woman, boy-girl)
● Biased List, which contains 1000 words: 500
female-biased and 500 male-biased (e.g. diet for
female and hero for male)
● Extended Biased List, an extended version of the Biased
List (5000 words: 2500 female-biased and 2500
male-biased)
● Professional List, 319 tokens (e.g. accountant,
surgeon)

Word Vectors 89
1.Gender Space and Direct Bias

1. Randomly sample sentences that contain words from
the Definitional List, and swap the definitional word with its
pair-wise equivalent from the opposite gender.
2. Get word embeddings for the word and its swapped
equivalent, and compute their difference.
3. On the set of difference vectors, compute their
principal components to verify the presence of bias.
4. Repeat for an equivalent list of random words (skipping
the swapping).
Word Vectors 90
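A rough sketch of steps 2–3 using word-level definitional pairs (a simplification of the sentence-swapping procedure above); the pre-trained vectors and the pair list are assumptions:

# Sketch: estimate a gender direction from definitional pairs via PCA.
import numpy as np
import gensim.downloader as api
from sklearn.decomposition import PCA

wv = api.load("glove-wiki-gigaword-100")   # assumed pre-trained vectors
pairs = [("he", "she"), ("man", "woman"), ("boy", "girl"),
         ("father", "mother"), ("son", "daughter")]

diffs = np.array([wv[a] - wv[b] for a, b in pairs])   # one difference vector per pair
pca = PCA(n_components=len(diffs)).fit(diffs)
print(pca.explained_variance_ratio_)       # a dominant first component signals a gender direction
gender_direction = pca.components_[0]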
1. Gender Space and Direct Bias

Percentage of variance in PCA: definitional vs random

(Left) Percentage of variance explained in the PCA of definitional vector differences.


(Right) The corresponding percentages for random vectors

Word Vectors 91
1. Gender Space and Direct Bias

Direct Bias is a measure of how close a certain set of
words is to the gender vector.
Computed on the list of professions.

Direct Bias

WE 0.08

Word Vectors 92
2. Male and female-biased words clustering

k-means

Generate 2 clusters of the embeddings of tokens from


the Biased list (e.g. diet for female and hero for male)
Accuracy

WE 99.9%

Debias WE 92.5%

GN-WE 85.6%
Word Vectors 93
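An illustrative sketch of this clustering experiment with scikit-learn's KMeans; the short word lists and pre-trained vectors are placeholders for the 1000-word Biased List:

# Sketch: cluster male- and female-biased word embeddings with k-means (k=2),
# then check how well cluster assignments align with the bias labels.
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("glove-wiki-gigaword-100")        # assumed pre-trained vectors
female_biased = ["diet", "nurse", "dance"]      # tiny stand-ins for the 500 female-biased words
male_biased = ["hero", "engineer", "football"]  # tiny stand-ins for the 500 male-biased words
words = female_biased + male_biased
labels = np.array([0] * len(female_biased) + [1] * len(male_biased))

X = np.array([wv[w] for w in words])
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
accuracy = max((pred == labels).mean(), (pred != labels).mean())  # clusters are unordered
print(f"cluster/label alignment: {accuracy:.2%}")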
3. Classification Approach

SVM
Classify the Extended Biased List into male-associated
and female-associated words
1000 for training, 4000 for testing
Accuracy

WE 98.25%

Debias WE 88.88%

GN-WE 98.65%
Word Vectors 94
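A sketch of the classification setup with scikit-learn's LinearSVC; the random arrays below stand in for real biased-word embeddings and labels (1000 train / 4000 test as on the slide):

# Sketch: train an SVM to predict the gender-bias label of word embeddings.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 300))          # stand-ins for 1000 training-word embeddings
y_train = rng.integers(0, 2, size=1000)         # 0 = female-biased, 1 = male-biased
X_test = rng.normal(size=(4000, 300))           # stand-ins for 4000 held-out biased words
y_test = rng.integers(0, 2, size=4000)

clf = LinearSVC().fit(X_train, y_train)
# With real biased-word embeddings, high accuracy indicates the bias is linearly recoverable.
print(f"accuracy: {clf.score(X_test, y_test):.2%}")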
Conclusions. Is Debiasing What We Want?

Word Embeddings exhibit Gender Biases

Difficult to scale to different forms of bias

Is debiasing even (always) desirable?


○ ML is about learning biases. Removing attributes removes
information.

○ Gender information in NLP systems becomes harmful when the


use of the system has a negative impact on people’s lives.

Gender bias is a social phenomenon that can’t be solved with
mathematical methods alone. Collaborate with social
sciences/sociolinguistics.

Word Vectors 95
Arguments for Doing Research in Gender Bias

Unconscious bias can be harmful

Debiasing computer systems may help in debiasing society

Gender bias causes NLP systems to make errors. You should care about
this even if accuracy is all you care about.

96
