XCS224N Module1 Slides
Christopher Manning
Lecture 1: Introduction and Word Vectors
Lecture Plan
Lecture 1: Introduction and Word Vectors
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)
Key learning today: The (really surprising!) result that word meaning can be represented
rather well by a large vector of real numbers
What do we hope to teach?
1. The foundations of the effective modern methods for deep learning applied to NLP
• Basics first, then key methods used in NLP: Recurrent networks, attention,
transformers, etc.
2. A big picture understanding of human languages and the difficulties in understanding
and producing them
3. An understanding of and ability to build systems (in PyTorch) for some of the major
problems in NLP:
• Word meaning, dependency parsing, machine translation, question answering
https://xkcd.com/1576/ Randall Munroe CC BY NC 2.5
Trained on text data, neural machine translation is quite good!
https://kiswahili.tuko.co.ke/
GPT-3: A first step on the path to universal models
The SEC said, “Musk, your tweets are a blight.
They really could cost you your job,
if you don't stop all this tweeting at night.”
Then Musk cried, “Why?
The tweets I wrote are not mean,
I don't use all-caps
and I'm sure that my tweets are clean.”
“But your tweets can move markets
and that's why we're sore.
You may be a genius and a billionaire,
but it doesn't give you the right to be a bore!”

S: I broke the window.
Q: What did I break?
S: I gracefully saved the day.
Q: What did I gracefully save?
S: I gave John flowers.
Q: Who did I give flowers to?
S: I gave her a rose and a guitar.
Q: Who did I give a rose and a guitar to?

How many users have signed up since the start of 2020?
SELECT count(id) FROM users WHERE created_at > '2020-01-01'

What is the average number of influencers each user is subscribed to?
SELECT avg(count) FROM ( SELECT user_id, count(*) FROM subscribers GROUP BY user_id ) AS avg_subscriptions_per_user
How do we represent the meaning of a word?
How do we have usable meaning in a computer?
Common NLP solution: Use, e.g., WordNet, a thesaurus containing
lists of synonym sets and hypernyms (“is a” relationships).
e.g., synonym sets containing “good”:

from nltk.corpus import wordnet as wn
poses = {'n': 'noun', 'v': 'verb', 's': 'adj (sat)', 'a': 'adj', 'r': 'adverb'}
for synset in wn.synsets("good"):
    print("{}: {}".format(poses[synset.pos()],
                          ", ".join([l.name() for l in synset.lemmas()])))

noun: good
noun: good, goodness
noun: good, goodness
noun: commodity, trade_good, good
adj: good
adj (sat): full, good
adj: good
adj (sat): estimable, good, honorable, respectable
adj (sat): beneficial, good
adj (sat): good
adj (sat): good, just, upright
…
adverb: well, good
adverb: thoroughly, soundly, good

e.g., hypernyms of “panda”:

from nltk.corpus import wordnet as wn
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
list(panda.closure(hyper))

[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]
Problems with resources like WordNet
• Great as a resource but missing nuance
• e.g., “proficient” is listed as a synonym for “good”
This is only correct in some contexts
• Missing new meanings of words
• e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
• Impossible to keep up-to-date!
• Subjective
• Requires human labor to create and adapt
• Can’t compute accurate word similarity
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols:
hotel, conference, motel – a localist representation
Such symbols for words can each be represented by a one-hot vector (a single 1, the rest 0s), with vector dimension = number of words in the vocabulary.
But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal
There is no natural notion of similarity for one-hot vectors!
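A minimal NumPy sketch (not from the course materials) that makes this point concrete, using the toy indices of the 1s shown above:

import numpy as np

vocab_size = 15
motel_id, hotel_id = 10, 7   # positions of the 1s in the vectors above

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

motel = one_hot(motel_id, vocab_size)
hotel = one_hot(hotel_id, vocab_size)

print(motel @ hotel)   # 0.0 -> orthogonal, so the "similarity" of any two words is zero
print(motel @ motel)   # 1.0 -> every word is only similar to itself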
Solution:
• Could try to rely on WordNet’s list of synonyms to get similarity?
• But it is well-known to fail badly: incompleteness, etc.
• Instead: learn to encode similarity in the vectors themselves
Representing words by their context
• Distributional semantics: a word’s meaning is given by the words that frequently appear close by
• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
We build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts, e.g.:
banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
Note: word vectors are also called word embeddings or (neural) word representations
They are a distributed representation
Word meaning as a neural word vector – visualization
expect = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271, 0.487]
3. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
Idea:
• We have a large corpus (“body”) of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given
c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
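A schematic Python sketch of the windowing idea just described (illustrative only; no learning happens here, and the sentence is made up):

corpus = "we go through each position t in the text".split()
m = 2  # window size

for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        outside = corpus[t + j]
        # word2vec uses the similarity of the vectors for `center` and `outside`
        # to compute P(outside | center), and adjusts the vectors so that this
        # probability becomes larger.
        print(f"P({outside} | {center})")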
Word2Vec Overview
Example windows and process for computing P(w_{t+j} \mid w_t):
[Figure: a window of size 2 around the center word w_t, computing P(w_{t-2} \mid w_t), P(w_{t-1} \mid w_t), P(w_{t+1} \mid w_t), and P(w_{t+2} \mid w_t) for the outside words; the window then slides along and the same computation is repeated at each position t.]
Word2vec: objective function
For each position t = 1, …, T, predict context words within a window of fixed size m, given center word w_t. The data likelihood is:

L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t ; \theta)

where \theta is all the variables to be optimized. The objective function J(\theta) is the average negative log likelihood:

J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t ; \theta)

Minimizing the objective function is equivalent to maximizing predictive accuracy.

To compute P(w_{t+j} \mid w_t), we use two vectors per word w: v_w when w is a center word and u_w when w is a context (“outside”) word. Then, for a center word c and a context word o:

P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}
Word2Vec Overview with Vectors
• Example windows and process for computing P(w_{t+j} \mid w_t)
• P(u_{problems} \mid v_{into}) is short for P(problems \mid into ; u_{problems}, v_{into}, \theta)
[Figure: example window with center word “into” and outside words including “problems”.]
Word2vec: prediction function
P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}

① The dot product compares the similarity of o and c: u^\top v = u \cdot v = \sum_{i=1}^{n} u_i v_i. A larger dot product means a larger probability.
② Exponentiation makes anything positive.
③ The denominator normalizes over the entire vocabulary to give a probability distribution.
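A minimal NumPy sketch of this prediction function with randomly initialized vectors (the matrix U holds one outside vector u_w per row and v_c is a center vector; the names are illustrative, not the assignment's starter code):

import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                               # toy vocabulary size and vector dimension
U = rng.normal(scale=0.1, size=(V, d))     # outside vectors u_w, one per row
v_c = rng.normal(scale=0.1, size=d)        # center vector v_c

scores = U @ v_c                 # (1) dot products u_w^T v_c for every word w
probs = np.exp(scores)           # (2) exponentiate: everything becomes positive
probs /= probs.sum()             # (3) normalize over the vocabulary

o = 3                            # index of some outside word
print(probs[o])                  # P(o | c)
print(-np.log(probs[o]))         # its contribution to the loss J(theta)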
To train the model: Optimize value of parameters to minimize loss
To train a model, we gradually adjust parameters to minimize a loss
• We optimize these parameters by walking down the gradient
• We compute all vector gradients!
4. Word2vec derivations of gradient
Chain Rule
• The chain rule: if y = f(u) and u = g(x), i.e., y = f(g(x)), then dy/dx = (dy/du)(du/dx)
• Simple example: y = 5(x^3 + 7)^4
  dy/dx = 5 \cdot 4 (x^3 + 7)^3 \cdot 3x^2 = 20(x^3 + 7)^3 \cdot 3x^2
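A quick SymPy check of this example (assuming y = 5(x^3 + 7)^4, the function whose derivative is shown above):

import sympy as sp

x = sp.symbols('x')
y = 5 * (x**3 + 7)**4
print(sp.diff(y, x))   # 60*x**2*(x**3 + 7)**3  ==  20*(x**3 + 7)**3 * 3*x**2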
Interactive Whiteboard Session!
In the whiteboard session we derive the gradient of the loss with respect to the center vector v_c, starting from

\log P(o \mid c) = \log \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}

You then also need the gradient for the context-word vectors u (it's similar; left for homework). That's all of the parameters \theta here.
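A sketch of the whiteboard derivation (this is the standard result for the naive-softmax loss; intermediate algebra is compressed):

\frac{\partial}{\partial v_c} \log P(o \mid c)
    = \frac{\partial}{\partial v_c} \Big[ u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c) \Big]
    = u_o - \sum_{x \in V} \frac{\exp(u_x^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} \, u_x
    = u_o - \sum_{x \in V} P(x \mid c) \, u_x

i.e., the observed outside vector minus the expected outside vector under the model.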
Calculating all gradients!
• We went through the gradient for each center vector v in a window
• We also need gradients for outside vectors u
• Derive at home!
• Generally in each window we will compute updates for all parameters that are being
used in that window. For example:
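A minimal NumPy sketch (illustrative, assuming the naive-softmax loss J = −log P(o | c)) of the gradients for one (center, outside) pair; in a window these would be summed over all outside words and then used to update v_c and the touched rows of U:

import numpy as np

def naive_softmax_grads(U, v_c, o):
    """Gradients of J = -log P(o | c) w.r.t. v_c and all rows of U."""
    probs = np.exp(U @ v_c)
    probs /= probs.sum()              # P(x | c) for every word x
    grad_v_c = -U[o] + probs @ U      # -(u_o - sum_x P(x|c) u_x)
    grad_U = np.outer(probs, v_c)     # row x gets P(x|c) * v_c
    grad_U[o] -= v_c                  # extra -v_c for the true outside word o
    return grad_v_c, grad_U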
5. Optimization: Gradient Descent
• We have a cost function 𝐽 𝜃 we want to minimize
• Gradient Descent is an algorithm to minimize 𝐽 𝜃
• Idea: for current value of 𝜃, calculate gradient of 𝐽 𝜃 , then take small step in direction
of negative gradient. Repeat.
[Figure: gradient descent stepping down a simple convex function of two parameters]
Note: our objectives may not be convex like this ☹
Gradient Descent
• Update equation (in matrix notation): \theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta), where \alpha is the step size (learning rate)
• Algorithm: repeatedly evaluate the gradient at the current \theta and take a small step in the negative-gradient direction (see the sketch below)
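A runnable sketch of this loop with a toy quadratic objective standing in for J(θ) (the names J, evaluate_gradient, and alpha are illustrative):

import numpy as np

def J(theta):                     # toy convex objective standing in for the corpus loss
    return np.sum(theta ** 2)

def evaluate_gradient(theta):     # gradient of the toy objective
    return 2 * theta

theta = np.array([3.0, -2.0])     # initial parameters
alpha = 0.1                       # step size (learning rate)

for _ in range(100):
    theta_grad = evaluate_gradient(theta)
    theta = theta - alpha * theta_grad   # theta_new = theta_old - alpha * grad

print(theta, J(theta))            # approaches the minimizer at 0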
Stochastic Gradient Descent
• Problem: J(\theta) is a function of all windows in the corpus (potentially billions!)
  • So \nabla_\theta J(\theta) is very expensive to compute
  • You would wait a very long time before making a single update!
• Solution: stochastic gradient descent (SGD) — repeatedly sample windows (or small batches) and update after each one
Lecture Plan
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
1. Course organization (2 mins)
2. Finish looking at word vectors and word2vec (10 mins)
3. Optimization basics (8 mins)
4. Can we capture the essence of word meaning more effectively by counting? (8m)
5. The GloVe model of word vectors (8 min)
6. Evaluating word vectors (12 mins)
7. Word senses (6 mins)
8. Review of classification and how neural nets differ (8 mins)
9. Introducing neural networks (14 mins)
Key Goal: To be able to read word embeddings papers by the end of class
2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word in the whole corpus
• Try to predict surrounding words using word vectors: P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}
[Figure: sliding window showing P(w_{t-2} \mid w_t), P(w_{t-1} \mid w_t), P(w_{t+1} \mid w_t), P(w_{t+2} \mid w_t) around the center word w_t]
• Learning: Update vectors so they can predict actual surrounding words better
• Doing no more than this, this algorithm learns word vectors that capture
well word similarity and meaningful directions in a wordspace!
Word2vec parameters and computations
[Figure: the word-vector matrices U (outside vectors) and V (center vectors), one row per word; for a center word c, the dot products U \cdot v_c are pushed through a softmax, softmax(U \cdot v_c), to give probabilities over the vocabulary.]
3. Optimization: Gradient Descent
• To learn good word vectors: We have a cost function 𝐽 𝜃 we want to minimize
• Gradient Descent is an algorithm to minimize 𝐽 𝜃 by changing 𝜃
• Idea: from current value of 𝜃, calculate gradient of 𝐽 𝜃 , then take small step in the
direction of negative gradient. Repeat.
[Figure: gradient descent stepping down a simple convex function of two parameters]
Note: our objectives may not be convex like this ☹
Gradient Descent
• Update equation (in matrix notation): \theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)
• Algorithm: repeatedly evaluate the gradient at the current \theta and take a small step in the negative-gradient direction
Stochastic Gradient Descent
• Problem: J(\theta) is a function of all windows in the corpus (often, billions!)
  • So \nabla_\theta J(\theta) is very expensive to compute
  • You would wait a very long time before making a single update!
Stochastic gradients with word vectors! [Aside]
• Iteratively take gradients at each such window for SGD
• But in each window, we only have at most 2m + 1 words, so \nabla_\theta J_t(\theta) is very sparse!
Stochastic gradients with word vectors!
• We might only update the word vectors that actually appear!
• Solution: either you need sparse matrix update operations to only update certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors
  (Note: rows, not columns, in actual DL packages!)
• If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!
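A minimal NumPy sketch of a sparse row update (illustrative shapes and names; real DL packages expose sparse gradient updates for exactly this case):

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 100_000, 50
U = rng.normal(scale=0.1, size=(vocab_size, d))   # outside-vector matrix
V = rng.normal(scale=0.1, size=(vocab_size, d))   # center-vector matrix

window_ids = [17, 42, 314, 2718, 31415]           # at most 2m + 1 words in a window
grad_rows = rng.normal(scale=0.01, size=(len(window_ids), d))   # stand-in gradients
alpha = 0.025

U[window_ids] -= alpha * grad_rows   # update only these rows, not the full matrix
V[window_ids] -= alpha * grad_rows   # same idea for the center-vector matrix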
2b. Word2vec algorithm family: More details
Why two vectors? → Easier optimization. Average both at the end
• But you can implement the algorithm with just one vector per word … and it helps
Two model variants:
1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
We presented: Skip-gram model
The skip-gram model with negative sampling (HW2)
• The normalization term is computationally expensive:
• P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}
• Hence, in standard word2vec and HW2 you implement the skip-gram model with
negative sampling
• Main idea: train binary logistic regressions for a true pair (center word and a word in its
context window) versus several noise pairs (the center word paired with a random
word)
The skip-gram model with negative sampling (HW2)
• From paper: “Distributed Representations of Words and Phrases and their
Compositionality” (Mikolov et al. 2013)
• Overall objective function (they maximize): J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J_t(\theta), where for each (center word c, outside word o) pair
  J_t(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} \big[ \log \sigma(-u_j^\top v_c) \big]
  and \sigma(x) = 1 / (1 + e^{-x}) is the logistic (sigmoid) function
• Sample with P(w) = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power
  (We provide this function in the starter code.)
• The power makes less frequent words be sampled more often
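A minimal NumPy sketch (illustrative, not the HW2 starter code) of the negative-sampling loss for one (center, outside) pair, with k noise words drawn from the unigram distribution raised to the 3/4 power:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, d, k = 1000, 50, 5
U = rng.normal(scale=0.1, size=(vocab_size, d))   # outside vectors
V = rng.normal(scale=0.1, size=(vocab_size, d))   # center vectors

counts = rng.integers(1, 1000, size=vocab_size)   # stand-in unigram counts U(w)
p_neg = counts ** 0.75
p_neg = p_neg / p_neg.sum()                       # P(w) = U(w)^(3/4) / Z

c, o = 3, 7                                       # center word and true outside word
negs = rng.choice(vocab_size, size=k, p=p_neg)    # k sampled noise words

# Maximize log sigma(u_o . v_c) + sum over noise words of log sigma(-u_k . v_c),
# i.e. minimize its negation:
loss = -np.log(sigmoid(U[o] @ V[c])) - np.sum(np.log(sigmoid(-(U[negs] @ V[c]))))
print(loss)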
4. Why not capture co-occurrence counts directly?
Example: Window based co-occurrence matrix
• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)
• Example corpus:
• I like deep learning
• I like NLP
• I enjoy flying

counts     I  like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0      0        0     0     0
like       2   0     0      1      0        1     0     0
enjoy      1   0     0      0      0        0     1     0
deep       0   1     0      0      1        0     0     0
learning   0   0     0      1      0        0     0     1
NLP        0   1     0      0      0        0     0     1
flying     0   0     1      0      0        0     0     1
.          0   0     0      0      1        1     1     0
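A short Python sketch (illustrative, not assignment code) that builds this window-1 co-occurrence matrix; periods are included as tokens so the counts match the table:

import numpy as np

corpus = ["I like deep learning .".split(),
          "I like NLP .".split(),
          "I enjoy flying .".split()]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)), dtype=int)

window = 1
for sent in corpus:
    for t, w in enumerate(sent):
        for j in range(max(0, t - window), min(len(sent), t + window + 1)):
            if j != t:
                X[idx[w], idx[sent[j]]] += 1

print(vocab)
print(X)    # symmetric counts matching the table above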
Co-occurrence vectors
• Simple count co-occurrence vectors
• Vectors increase in size with vocabulary
• Very high dimensional: require a lot of storage (though sparse)
• Subsequent classification models have sparsity issues → models are less robust
• Low-dimensional vectors
• Idea: store “most” of the important information in a fixed, small number of
dimensions: a dense vector
• Usually 25–1000 dimensions, similar to word2vec
• How to reduce the dimensionality?
Classic Method: Dimensionality Reduction on X (HW1)
Singular Value Decomposition of co-occurrence matrix X
Factorizes X into U \Sigma V^\top, where U and V are orthonormal and \Sigma is diagonal with the singular values in decreasing order
Retain only the k largest singular values (and the corresponding columns of U) to get a rank-k approximation; the rows of the reduced U serve as k-dimensional word vectors
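A minimal NumPy sketch of the recipe: take the SVD of the co-occurrence matrix X built above and keep only the top k singular values/vectors as dense word vectors (illustrative; the homework's exact preprocessing may differ):

import numpy as np

# X: the window-1 co-occurrence matrix from the table above
# (rows/columns: I, like, enjoy, deep, learning, NLP, flying, .)
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],
              [2, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]], dtype=float)

U, S, Vt = np.linalg.svd(X)          # X = U diag(S) V^T, with U and V orthonormal
k = 2                                # retain only the k largest singular values
word_vectors = U[:, :k] * S[:k]      # k-dimensional dense vectors, one row per word
print(word_vectors)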
Rohde, Gonnerman, Plaut: Modeling Word Meaning Using Lexical Co-Occurrence
Interesting semantic patterns emerge in the scaled vectors
[Figure 13: multidimensional scaling for nouns and their associated verbs, e.g., DRIVER–DRIVE, SWIMMER–SWIM, TEACHER–TEACH, STUDENT–LEARN, DOCTOR–TREAT, PRIEST–PRAY, BRIDE–MARRY, JANITOR–CLEAN.]
COALS model from Rohde et al. ms., 2005, An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
[Table 10: the 10 nearest neighbors and their percent correlation similarities for a set of nouns (gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet) under the COALS-14K model.]
5. Towards GloVe: Count based vs. direct prediction
Encoding meaning components in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]
[Table: ratios of co-occurrence probabilities — P(x \mid ice) / P(x \mid steam) is large for x = solid, small for x = gas, and ≈ 1 for x = water and x = fashion]
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]
Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
A: A log-bilinear model: w_i \cdot w_j = \log P(i \mid j), so that vector differences capture the ratios: w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}
This gives the GloVe model:
• Fast training
• Scalable to huge corpora
• Good performance even with small corpus and small vectors
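For reference, the GloVe objective as given in Pennington et al. (2014) (quoted from the paper, not derived on these slides) is a weighted least-squares log-bilinear loss:

J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the co-occurrence count of words i and j, f is a weighting function that caps the influence of very frequent pairs, and w, \tilde{w}, b, \tilde{b} are the word vectors and biases.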
GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
[Images: litoria, leptodactylidae, rana, eleutherodactylus]
6. How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
• Evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to real task is established
• Extrinsic:
• Evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem, or its interaction with other subsystems
• If replacing exactly one subsystem with another improves accuracy → winning!
Intrinsic word vector evaluation
• Word Vector Analogies
a:b :: c:?
man:woman :: king:?
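A minimal sketch of how such analogies are usually evaluated with cosine similarity (word_vecs is a hypothetical dictionary mapping words to NumPy vectors, e.g., loaded GloVe vectors):

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, word_vecs):
    """Return the word d maximizing cos(x_d, x_b - x_a + x_c), excluding a, b, c."""
    target = word_vecs[b] - word_vecs[a] + word_vecs[c]
    candidates = (w for w in word_vecs if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(word_vecs[w], target))

# analogy("man", "woman", "king", word_vecs)   # ideally returns "queen"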
GloVe Visualizations
[Figure: 2-D projection of GloVe word vectors]
GloVe Visualizations: Company – CEO
[Figure: company–CEO word pairs]
GloVe Visualizations: Comparatives and Superlatives
[Figure: comparative and superlative word forms]
GloVe word vectors evaluation: analogy evaluation and hyperparameters
[Table: word-analogy accuracy (Semantic / Syntactic / Total, %) for SVD, SVD-S, SVD-L, HPCA, CBOW, SG, vLBL, ivLBL, and GloVe vectors of various dimensions trained on corpora of 1B–42B tokens; skip-gram (SG) and CBOW results are from Mikolov et al. (2013a,b) or trained with the word2vec tool; see the paper for details and a description of the SVD models. GloVe gives the best overall accuracy among the models shown.]
Analogy evaluation and hyperparameters
[Figure: analogy-task accuracy (semantic, syntactic, overall) as a function of vector dimension (symmetric context) and of training corpus — Wiki2010 (1B tokens), Wiki2014 (1.6B), Gigaword5 (4.3B), Wiki2014 + Gigaword5 (6B), Common Crawl (42B). Figure 3: accuracy on the analogy task for 300-dimensional vectors.]
Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
[Table and figure residue: evaluation numbers for Discrete, SVD, SVD-S, and SVD-L vectors, and accuracy across training corpora (Wiki2010 1B, Wiki2014 1.6B, Gigaword 4.3B tokens)]
Word senses and word sense ambiguity
Most words have many meanings!
• Example: pike
pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one’s way (pike along)
• In Australian English, pike means to pull out from doing something: I reckon he could
have climbed that cliff, but he piked!
Improving Word Representations Via Global Context And
Multiple Word Prototypes (Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each word assigned to multiple
different clusters bank1, bank2, etc.
Linear Algebraic Structure of Word Senses, with
Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted
sum) in standard word embeddings like word2vec
• v_{pike} = \alpha_1 v_{pike_1} + \alpha_2 v_{pike_2} + \alpha_3 v_{pike_3}
• where \alpha_1 = \frac{f_1}{f_1 + f_2 + f_3}, etc., for frequency f
• Surprising result:
• Because of ideas from sparse coding you can actually separate out
the senses (providing they are relatively common)!