06 Word Vectors
Marta R. Costa-jussà
Universitat Politècnica de Catalunya, Barcelona
Based on slides by
Christopher Manning, Stanford University, adapted from CS224n slides: Lecture 1
and illustrations from Jay Alammar, The Illustrated Word2Vec
Towards an efficient representation of words
What to read
• WE and NLP: (Levy and Goldberg, 2014, NIPS)
Outline
A word embedding is a numerical representation of a word
Vector representation of “flies” and “time”
Questions that may arise
Distributional Hypothesis and Contextuality
● Never ask for the meaning of a word in isolation, but only in the context of a sentence (Frege, 1884)
Similar Meanings…
Background: one-hot, frequency-based, and word-embedding representations
● One-hot representation
● Frequency-based (count-based) representations
● Word embeddings
One-hot vectors
Two = [1,0,0,0]
tea = [0,1,0,0]
me  = [0,0,1,0]
you = [0,0,0,1]
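A minimal NumPy sketch (not from the slides) of building these one-hot vectors for the toy vocabulary:

```python
import numpy as np

# Toy vocabulary from the slide
vocab = ["two", "tea", "me", "you"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("tea"))   # [0. 1. 0. 0.]
# Note: every pair of distinct one-hot vectors is orthogonal,
# so this representation encodes no notion of word similarity.
```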
Vector Space Model: Term-document matrix
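A small scikit-learn sketch of building such a matrix for a hypothetical three-document corpus (CountVectorizer returns the document-term matrix, i.e. the transpose of the term-document matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: each string plays the role of one "document".
docs = [
    "time flies like an arrow",
    "fruit flies like a banana",
    "time and tide wait for no one",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # |documents| x |vocabulary| count matrix
print(vectorizer.get_feature_names_out())   # the column (term) labels
print(X.toarray())                          # each row is a document, each column a term count
```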
Term Frequency-Inverse Document Frequency (TF-IDF)
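One common formulation of the TF-IDF weight (there are several variants; this particular one is an assumption, not necessarily the exact variant used on the slide), where tf(t, d) is the count of term t in document d, df(t) the number of documents containing t, and N the total number of documents:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}$$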
Problems with simple co-occurrence vectors
Solution: Low dimensional vectors
Method: Dimensionality Reduction on X (HW1)
Keep only the rank-k approximation X_k of the co-occurrence matrix X.
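A minimal NumPy sketch of the rank-k reduction meant here, on a hypothetical random co-occurrence matrix (the homework may prescribe a different recipe):

```python
import numpy as np

# Hypothetical word-word co-occurrence matrix X (vocab_size x vocab_size).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(1000, 1000)).astype(float)

k = 50                                        # target dimensionality
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation of X
word_vectors = U[:, :k] * S[:k]               # one common choice of k-dim word vectors
```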
Word embeddings use a simple feed-forward network
● A hidden layer in a neural network interprets the input in its own way in order to optimise its performance on the given task
● The size of the hidden layer gives you the dimension of the word embeddings
word2vec
Word embeddings are learned by a neural network with one of two tasks/objectives:
Continuous Bag of Words, CBoW
Skip-Gram Model
Skip-gram vs. CBoW
● Skip-gram: guess the context given the word, i.e. one vIN (the centre word) predicts several vOUT (the context words). Better at syntax. (This is the one we went over.)
● CBoW: guess the word given the context, i.e. several vIN (the context words) predict one vOUT (the centre word). ~20x faster. (This is the alternative.)
Example sentence: “The fox jumped over the lazy dog”
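A small sketch of how skip-gram turns the example sentence into (centre word, context word) training pairs, assuming a window size of 2:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (centre_word, context_word) pairs within a fixed-size window."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the fox jumped over the lazy dog".split()
for pair in skipgram_pairs(sentence):
    print(pair)   # e.g. ('over', 'jumped'), ('over', 'the'), ('over', 'lazy'), ...
```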
Observations (TensorFlow Tutorial)
● CBoW
○ Smoothes over a lot of the distributional information by treating an entire context as one observation. This turns out to be a useful thing for smaller datasets.
● Skip-gram
○ Treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.
word2vec
“The fox jumped over the lazy dog”
With “over” as the centre word, the model maximizes
P(the|over), P(fox|over), P(jumped|over), P(the|over), P(lazy|over), P(dog|over)
…instead of maximizing the likelihood of co-occurrence counts.
Word2vec: objective function
For each position t = 1, …, T, predict the context words within a window of fixed size m, given the centre word w_t. The likelihood is
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)$$
Word2vec: objective function
The objective (cost) function J(θ) is the average negative log-likelihood; minimizing it maximizes predictive accuracy:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$
Twist: we have two vectors for every word
Which vector we use should depend on whether the word is the input (centre) word, vIN, or the output (context) word, vOUT.
Loop 1: iterate over every word of the corpus as the centre word (e.g. ‘the’).
Loop 2: for each centre word, slide over the window around it and compute P(vOUT|vIN) for every context word.
How should we define P(vOUT|vIN; θ)?
Start from the dot product vIN · vOUT, which measures how similar the two vectors are.
word2vec objective: geometric intuition (figures)
● vIN and vOUT roughly orthogonal: vIN · vOUT ≈ 0
● vIN and vOUT pointing in opposite directions: vIN · vOUT ≈ -1
But we’d like to measure a probability.
But we’d like to measure a probability.
Exponentiate the dot products and normalize over the vocabulary V (a softmax):
$$P(v_{out} \mid v_{in}) = \frac{\exp(v_{in} \cdot v_{out})}{\sum_{k \in V} \exp(v_{in} \cdot v_k)}$$
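A minimal NumPy sketch of this softmax over dot products, with hypothetical random vectors (the max is subtracted only for numerical stability):

```python
import numpy as np

def softmax_prob(v_in, V_out):
    """P(v_out | v_in) for every row of V_out, given input vector v_in."""
    scores = V_out @ v_in                    # dot products v_in . v_k for all k in the vocabulary
    scores -= scores.max()                   # numerical stability, does not change the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()     # normalise so the probabilities sum to 1

rng = np.random.default_rng(0)
V_out = rng.normal(size=(10_000, 100))       # output (context) vectors for a 10k vocabulary
v_in = rng.normal(size=100)                  # input (centre word) vector
p = softmax_prob(v_in, V_out)
print(p.sum())                               # 1.0 -- but note the full sum over V is expensive
```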
Summary of the process
Running example text: “Thou shalt not make a machine in the likeness of a human mind”
Untrained model. Task: are the two words neighbours? The input word (e.g. “not”) is looked up as vIN and the candidate context word (e.g. “thou”) as vOUT.
Preliminary steps
We start with the first sample in our dataset. We grab the feature and feed it to the untrained model, asking it to predict whether the two words are in the same context or not (1 or 0).
Preliminary steps: Negative examples
For each sample in our dataset, we add negative examples. Those have the same input word and a 0 label.
We are contrasting the actual signal (positive examples of neighbouring words) with noise (randomly selected words that are not neighbours). This leads to a great tradeoff of computational and statistical efficiency.
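A sketch of how such negative examples can be generated; the tiny word list and k=2 negatives per positive pair are illustrative assumptions (word2vec actually samples negatives from a smoothed unigram distribution):

```python
import random

vocab = ["thou", "shalt", "not", "make", "a", "machine", "aaron", "taco"]

def with_negatives(input_word, true_context, k=2):
    """Return one positive (label 1) and k negative (label 0) training rows."""
    rows = [(input_word, true_context, 1)]
    while len(rows) < k + 1:
        noise = random.choice(vocab)
        if noise not in (input_word, true_context):   # keep only genuine non-neighbours
            rows.append((input_word, noise, 0))
    return rows

print(with_negatives("not", "thou"))
# e.g. [('not', 'thou', 1), ('not', 'aaron', 0), ('not', 'taco', 0)]
```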
Preliminary steps: pre-process the text
Now that we’ve established the two central ideas of skip-gram and negative sampling, one last preliminary step is to pre-process the text we’re training the model against. In this step, we determine the size of our vocabulary (we’ll call this vocab_size, think of it as, say, 10,000) and which words belong to it.
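A sketch of this pre-processing step, assuming whitespace tokenisation and keeping only the vocab_size most frequent words:

```python
from collections import Counter

def build_vocab(corpus_tokens, vocab_size=10_000):
    """Keep the vocab_size most frequent words; everything else maps to <unk>."""
    counts = Counter(corpus_tokens)
    most_common = [w for w, _ in counts.most_common(vocab_size - 1)]
    vocab = ["<unk>"] + most_common
    word_to_id = {w: i for i, w in enumerate(vocab)}
    return vocab, word_to_id

tokens = "thou shalt not make a machine in the likeness of a human mind".split()
vocab, word_to_id = build_vocab(tokens, vocab_size=10_000)
print(len(vocab), word_to_id["machine"])
```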
Training process: embedding and context matrices
Now that we’ve established the two central ideas of skip-gram and negative sampling and pre-processed the text, we can look closer at the actual word2vec training process. Word2vec uses two matrices: an Embedding matrix holding the input vectors vIN and a Context matrix holding the output vectors vOUT, each with one row per vocabulary word.
Training process: matrix initialization
1. At the start of training, we initialize these two matrices with random values. In each training step, we then take one positive example and its associated negative examples. Let’s take our first group:
Training process
2. We look up the embedding of the input word in the Embedding matrix and the embeddings of the output/context words in the Context matrix.
3. Then, we take the dot product of the input embedding with each of the context embeddings. Each dot product results in a number that indicates the similarity of the input and context embeddings.
4. Now we need a way to turn these scores into something that looks like probabilities: they should all be positive, with values between zero and one. This is a great task for the sigmoid (logistic) operation. We can now treat the output of the sigmoid operations as the model’s output for these examples.
Training process
5. Now that the untrained model has made a prediction, and since we have an actual target label to compare against, let’s calculate how much error there is in the model’s prediction. To do that, we just subtract the sigmoid scores from the target labels.
Training process
6. Here comes the “learning” part of “machine learning”. We can now use this error score to adjust the embeddings of not, thou, aaron, and taco, so that the next time we make this calculation the result will be closer to the target scores.
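Putting steps 1-6 together, a minimal NumPy sketch of one skip-gram-with-negative-sampling training step (matrix sizes, word indices and the learning rate are illustrative assumptions, not the actual word2vec implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 10_000, 100, 0.025
embedding = rng.normal(scale=0.1, size=(vocab_size, dim))   # vIN, one row per word
context   = rng.normal(scale=0.1, size=(vocab_size, dim))   # vOUT, one row per word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(in_id, out_ids, labels):
    """One positive + k negative examples sharing the same input word."""
    v_in = embedding[in_id]                       # step 2: look up the input embedding
    v_out = context[out_ids]                      #         and the context embeddings
    scores = v_out @ v_in                         # step 3: dot products (similarities)
    preds = sigmoid(scores)                       # step 4: squash into (0, 1)
    errs = np.asarray(labels, float) - preds      # step 5: target minus sigmoid score
    # step 6: nudge both matrices so the next prediction is closer to the target
    context[out_ids] += lr * errs[:, None] * v_in
    embedding[in_id] += lr * (errs[:, None] * v_out).sum(axis=0)

# one training group: a positive pair plus two sampled negatives
train_step(in_id=42, out_ids=[7, 1938, 4055], labels=[1, 0, 0])
```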
Optimization Process
Gradient Descent
We compute gradients for each centre vector vIN in a window; we also need gradients for the outside vectors vOUT.
But the corpus may have 40B tokens and windows, so you would wait a very long time before making a single update!
Instead, we update the parameters after each small sample of corpus sentences (so-called batches) → Stochastic Gradient Descent (SGD), updating the weights after each one.
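In symbols (standard SGD, with learning rate α and the objective J estimated on one batch):

$$\theta^{\,new} = \theta^{\,old} - \alpha \, \nabla_{\theta} J_{batch}(\theta)$$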
Let’s Play!
● Gensim: http://web.stanford.edu/class/cs224n/materials/Gensim%20word%20vector%20visualization.html
● Embedding Projector: http://projector.tensorflow.org/
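A minimal sketch of training and querying word vectors with Gensim (assumes gensim ≥ 4.0, where the parameter is vector_size; the toy corpus is far too small to give meaningful neighbours):

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; a real corpus would have millions of them.
sentences = [
    "the fox jumped over the lazy dog".split(),
    "thou shalt not make a machine in the likeness of a human mind".split(),
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension of the word embeddings
    window=5,          # context window size
    min_count=1,       # keep every word (only sensible for toy data)
    sg=1,              # 1 = skip-gram, 0 = CBoW
    negative=5,        # number of negative samples
)

print(model.wv["fox"].shape)          # (100,)
print(model.wv.most_similar("fox"))   # nearest neighbours by cosine similarity
```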
Embedded space geometry
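The classic example of this geometry is king - man + woman ≈ queen; a sketch with Gensim's KeyedVectors API, assuming `model` is a word2vec model trained on a large corpus (the toy model above would not give meaningful results):

```python
# Assumes `model` is a Word2Vec model trained on a large corpus (e.g. Wikipedia text).
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # ideally something like [('queen', 0.7...)]

# Cosine similarity between two words, i.e. how close they lie in the embedded space
print(model.wv.similarity("king", "queen"))
```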
Word2vec in Wikipedia
GloVe and Word Senses
Frequency-based vs. direct prediction
GloVe
Ratios of co-occurrence probabilities can encode meaning components
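The canonical illustration, taken from the GloVe paper rather than the extracted slide content, uses ice and steam as target words and probe words x = solid, gas, water:

$$\frac{P(\text{solid} \mid \text{ice})}{P(\text{solid} \mid \text{steam})} \gg 1, \qquad \frac{P(\text{gas} \mid \text{ice})}{P(\text{gas} \mid \text{steam})} \ll 1, \qquad \frac{P(\text{water} \mid \text{ice})}{P(\text{water} \mid \text{steam})} \approx 1$$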
How?
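For reference, the GloVe objective from Pennington et al. (2014), not reproduced in the extracted slides: word vectors are fitted so that their dot products match log co-occurrence counts, with a weighting function f that down-weights rare and very frequent pairs:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$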
GloVe
• Fast training
• Scalable to huge corpora
• Good performance even with small corpus and small vectors
Word Senses
Improving Word Representations via Global Context and Multiple Word Prototypes (Huang et al., 2012)
• Idea: cluster word windows around words, then retrain with each word assigned to multiple different clusters: bank1, bank2, etc.
Linear Algebraic Structure of Word Senses, with Application to Polysemy (Arora, …, Ma, …, TACL 2018)
Gender bias in word embeddings
Logic Riddle
A man and his son are in a terrible accident and are rushed to the hospital in critical care.
The doctor looks at the boy and exclaims: “I can’t operate on this boy, he’s my son!”
“Doctor” vs “Female doctor”
Related Work: Word Embeddings encode bias
[credits to Hila Gonen]
Techniques to Debias Word Embeddings
Experiments for Evaluating Bias
1. Gender Space and Direct Bias
Direct Bias:
• WE: 0.08
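For reference, the Direct Bias measure of Bolukbasi et al. (2016), which is presumably the quantity reported above (an assumption, since the slide does not define it): with gender direction g, a set N of gender-neutral words, and strictness parameter c,

$$\mathrm{DirectBias}_c = \frac{1}{|N|} \sum_{w \in N} \left| \cos(\vec{w}, g) \right|^{c}$$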
2. Male- and female-biased words clustering (k-means)
• WE: 99.9%
• Debias WE: 92.5%
• GN-WE: 85.6%
3. Classification Approach (SVM)
Classify the Extended Biased List into male- and female-associated words (1000 words for training, 4000 for testing).
Accuracy:
• WE: 98.25%
• Debias WE: 88.88%
• GN-WE: 98.65%
Conclusions. Is Debiasing What We Want?
Gender bias causes NLP systems to make errors. You should care about
this even if accuracy is all you care about.