0% found this document useful (0 votes)

15 views72 pages

XCS224N Module1 Slides

The document outlines the first lecture of a course on Natural Language Processing (NLP) with Deep Learning, focusing on the introduction to word vectors and the Word2Vec framework. Key topics include the representation of word meanings through vectors, the challenges of traditional NLP methods, and the use of context to derive word meanings. The lecture aims to provide foundational knowledge for building NLP systems using modern deep learning techniques.

Uploaded by

nasoheel

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views72 pages

XCS224N Module1 Slides

Uploaded by

nasoheel

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 72

Natural Language Processing

with Deep Learning

CS224N/Ling284

Christopher Manning
Lecture 1: Introduction and Word Vectors
Lecture Plan
Lecture 1: Introduction and Word Vectors
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)

Key learning today: The (really surprising!) result that word meaning can be represented
rather well by a large vector of real numbers

2
What do we hope to teach?
1. The foundations of the effective modern methods for deep learning applied to NLP
• Basics first, then key methods used in NLP: Recurrent networks, attention,
transformers, etc.
2. A big picture understanding of human languages and the difficulties in understanding
and producing them
3. An understanding of and ability to build systems (in PyTorch) for some of the major
problems in NLP:
• Word meaning, dependency parsing, machine translation, question answering

4
Lecture Plan
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)

7
https://xkcd.com/1576/ Randall Munroe CC BY NC 2.5
Trained on text data, neural machine translation is quite good!

https://kiswahili.tuko.co.ke/
GPT-3: A first step on the path to universal models
The SEC said, “Musk, your tweets are a S: I broke the window.
blight. Q: What did I break?
S: I gracefully saved the day.
They really could cost you your job,
Q: What did I gracefully save?
if you don't stop all this tweeting at night.” S: I gave John flowers.
Then Musk cried, “Why? Q: Who did I give flowers to?
The tweets I wrote are not mean, S: I gave her a rose and a guitar.
I don't use all-caps Q: Who did I give a rose and a guitar to?
and I'm sure that my tweets are clean.” How many users have signed up since the start of 2020?
“But your tweets can move markets SELECT count(id) FROM users
WHERE created_at > ‘2020-01-01’
and that's why we're sore.
What is the average number of influencers each user is
You may be a genius and a billionaire, subscribed to?
but it doesn't give you the right to SELECT avg(count) FROM ( SELECT user_id, count(*)
be a bore!” FROM subscribers GROUP BY user_id )
AS avg_subscriptions_per_user
How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)

• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:

signifier (symbol) ⟺ signified (idea or thing)

= denotational semantics

12
How do we have usable meaning in a computer?
Common NLP solution: Use, e.g., WordNet, a thesaurus containing
lists of synonym sets and hypernyms (“is a” relationships).
e.g., synonym sets containing “good”: e.g., hypernyms of “panda”:
from nltk.corpus import wordnet as wn from nltk.corpus import wordnet as wn
poses = { 'n':'noun', 'v':'verb', 's':'adj (s)', 'a':'adj', 'r':'adv'}
for synset in wn.synsets("good"):
panda = wn.synset("panda.n.01")
print("{}: {}".format(poses[synset.pos()], hyper = lambda s: s.hypernyms()
", ".join([l.name() for l in synset.lemmas()]))) list(panda.closure(hyper))

noun: good
[Synset('procyonid.n.01'),
noun: good, goodness
Synset('carnivore.n.01'),
noun: good, goodness
Synset('placental.n.01'),
noun: commodity, trade_good, good
adj: good Synset('mammal.n.01'),
Synset('vertebrate.n.01'),
adj (sat): full, good
Synset('chordate.n.01'),
adj: good
Synset('animal.n.01'),
adj (sat): estimable, good, honorable, respectable
Synset('organism.n.01'),
adj (sat): beneficial, good
Synset('living_thing.n.01'),
adj (sat): good
Synset('whole.n.02'),
adj (sat): good, just, upright
Synset('object.n.01'),
…
Synset('physical_entity.n.01'),
adverb: well, good
adverb: thoroughly, soundly, good Synset('entity.n.01')]

13
Problems with resources like WordNet
• Great as a resource but missing nuance
• e.g., “proficient” is listed as a synonym for “good”
This is only correct in some contexts
• Missing new meanings of words
• e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
• Impossible to keep up-to-date!
• Subjective
• Requires human labor to create and adapt
• Can’t compute accurate word similarity à

14
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols:
hotel, conference, motel – a localist representation

Means one 1, the rest 0s

Such symbols for words can be represented by one-hot vectors:

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000)

15
Sec. 9.2.2

Problem with words as discrete symbols

Example: in web search, if user searches for “Seattle motel”, we would like to match
documents containing “Seattle hotel”

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal
There is no natural notion of similarity for one-hot vectors!

Solution:
• Could try to rely on WordNet’s list of synonyms to get similarity?
• But it is well-known to fail badly: incompleteness, etc.
• Instead: learn to encode similarity in the vectors themselves
16
Representing words by their context

• Distributional semantics: A word’s meaning is given

by the words that frequently appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words

that appear nearby (within a fixed-size window).
• Use the many contexts of w to build up a representation of w

…government debt problems turning into banking crises as happened in 2009…

…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking

17
Word vectors
We will build a dense vector for each word, chosen so that it is similar to vectors of words
that appear in similar contexts

0.286
0.792
−0.177
banking = −0.107
0.109
−0.542
0.349
0.271

Note: word vectors are also called word embeddings or (neural) word representations
They are a distributed representation
18
Word meaning as a neural word vector – visualization

0.286
0.792
−0.177
−0.107
expect = 0.109
−0.542
0.349
0.271
0.487

19
3. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors

Idea:
• We have a large corpus (“body”) of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given
c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

20
Word2Vec Overview
Example windows and process for computing 𝑃 𝑤𝑡+𝑗 | 𝑤𝑡

𝑃 𝑤!%$ | 𝑤! 𝑃 𝑤!"$ | 𝑤!
𝑃 𝑤!%# | 𝑤! 𝑃 𝑤!"# | 𝑤!

… problems turning into banking crises as …

outside context words center word outside context words

in window of size 2 at position t in window of size 2

21
Word2Vec Overview
Example windows and process for computing 𝑃 𝑤𝑡+𝑗 | 𝑤𝑡

𝑃 𝑤!%$ | 𝑤! 𝑃 𝑤!"$ | 𝑤!
𝑃 𝑤!%# | 𝑤! 𝑃 𝑤!"# | 𝑤!

… problems turning into banking crises as …

outside context words center word outside context words

in window of size 2 at position t in window of size 2

22
Word2vec: objective function
For each position 𝑡 = 1, … , 𝑇, predict context words within a
window of fixed size m, given center word 𝑤$ . Data likelihood:
(

Likelihood = 𝐿 𝜃 = , , 𝑃 𝑤%.$ | 𝑤% ; 𝜃
%&' )*+$+*
𝜃 is all variables $,-
to be optimized

sometimes called a cost or loss function

The objective function 𝐽 𝜃 is the (average) negative log likelihood:

(
1 1
𝐽 𝜃 = − log 𝐿(𝜃) = − 5 5 log 𝑃 𝑤%.$ | 𝑤% ; 𝜃
𝑇 𝑇
%&' )*+$+*
$,-

Minimizing objective function ⟺ Maximizing predictive accuracy

23
Word2vec: objective function

• We want to minimize the objective function:

'
1
𝐽 𝜃 =− * * log 𝑃 𝑤!"* | 𝑤! ; 𝜃
𝑇
!&# %()*)(
*+,

• Question: How to calculate 𝑃 𝑤%.$ | 𝑤% ; 𝜃 ?

• Answer: We will use two vectors per word w:

• 𝑣- when w is a center word
• 𝑢- when w is a context word

• Then for a center word c and a context word o:

exp(𝑢0( 𝑣1 )
𝑃 𝑜𝑐 = (𝑣 )
∑2∈4 exp(𝑢2 1
24
Word2Vec Overview with Vectors
• Example windows and process for computing 𝑃 𝑤𝑡+𝑗 | 𝑤𝑡
• 𝑃 𝑢𝑝𝑟𝑜𝑏𝑙𝑒𝑚𝑠 | 𝑣𝑖𝑛𝑡𝑜 short for P 𝑝𝑟𝑜𝑏𝑙𝑒𝑚𝑠 | 𝑖𝑛𝑡𝑜 ; 𝑢𝑝𝑟𝑜𝑏𝑙𝑒𝑚𝑠 , 𝑣𝑖𝑛𝑡𝑜 , 𝜃

𝑃 𝑢-(!./0+ | 𝑣%#'( 𝑃 𝑢)%+%+ |𝑣%#'(

𝑃 𝑢',#%#& | 𝑣%#'( 𝑃 𝑢!"#$%#& |𝑣%#'(

… problems turning into banking crises as …

outside context words center word outside context words

in window of size 2 at position t in window of size 2

25
Word2vec: prediction function
② Exponentiation makes anything positive
① Dot product compares similarity of o and c.
𝑢 1 𝑣 = 𝑢. 𝑣 = ∑#%23 𝑢% 𝑣%
exp(𝑢0( 𝑣1 ) Larger dot product = larger probability
𝑃 𝑜𝑐 = (𝑣 )
∑2∈4 exp(𝑢2 1
③ Normalize over entire vocabulary
to give probability distribution

• This is an example of the softmax function ℝ9 → (0,1)9 Open

region
exp(𝑥: )
softmax 𝑥: = 9 = 𝑝:
∑$&' exp(𝑥$ )
• The softmax function maps arbitrary values 𝑥: to a probability distribution 𝑝:
• “max” because amplifies probability of largest 𝑥;
But sort of a weird name
• “soft” because still assigns some probability to smaller 𝑥; because it returns a distribution!
• Frequently used in Deep Learning

26
To train the model: Optimize value of parameters to minimize loss
To train a model, we gradually adjust parameters to minimize a loss

• Recall: 𝜃 represents all the

model parameters, in one
long vector
• In our case, with
d-dimensional vectors and
V-many words, we have:
• Remember: every word has
two vectors

• We optimize these parameters by walking down the gradient (see right figure)
• We compute all vector gradients!
27
4. Word2vec derivations of gradient

• Zoom U. Whiteboard – see video or revised slides

• The basic Lego piece: The chain rule
• Useful basic fact:
• If in doubt: write it out with indices

28
Chain Rule

• Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

• Simple example:

𝑑𝑦
= 20(𝑥 > + 7)> . 3𝑥 ?
𝑑𝑥
29
Interactive Whiteboard Session!

Let’s derive gradient for center word together

For one example window and one example outside word:

exp(𝑢! " 𝑣# )
log 𝑝 𝑜 𝑐 = log
∑'$%& exp(𝑢$ " 𝑣# )

You then also need the gradient for context words (it’s similar;
left for homework). That’s all of the parameters 𝜃 here.
30
Calculating all gradients!
• We went through the gradient for each center vector v in a window
• We also need gradients for outside vectors u
• Derive at home!
• Generally in each window we will compute updates for all parameters that are being
used in that window. For example:

𝑃 𝑢',*#%#& |𝑣!"#$%#& 𝑃 𝑢"+ |𝑣!"#$%#&

𝑃 𝑢#(+, | 𝑣&'()#(* 𝑃 𝑢!"#$%$ |𝑣&'()#(*

… problems turning into banking crises as …

outside context words center word outside context words

in window of size 2 at position t in window of size 2
31
Word2vec: More details
Why two vectors? à Easier optimization. Average both at the end.

Two model variants:

1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
This lecture so far: Skip-gram model

Additional efficiency in training:

1. Negative sampling
So far: Focus on naïve softmax (simpler but more expensive training method)

32
5. Optimization: Gradient Descent
• We have a cost function 𝐽 𝜃 we want to minimize
• Gradient Descent is an algorithm to minimize 𝐽 𝜃
• Idea: for current value of 𝜃, calculate gradient of 𝐽 𝜃 , then take small step in direction
of negative gradient. Repeat.

Note: Our
objectives
may not
be convex
like this L

But life turns

out to be
okay J

33
Gradient Descent
• Update equation (in matrix notation):

𝛼 = step size or learning rate

• Update equation (for single parameter):

• Algorithm:

34
Stochastic Gradient Descent
• Problem: 𝐽 𝜃 is a function of all windows in the corpus (potentially billions!)
• So is very expensive to compute
• You would wait a very long time before making a single update!

• Very bad idea for pretty much all neural nets!

• Solution: Stochastic gradient descent (SGD)
• Repeatedly sample windows, and update after each one
• Algorithm:

35
Lecture Plan
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)

36
Lecture Plan
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
1. Course organization (2 mins)
2. Finish looking at word vectors and word2vec (10 mins)
3. Optimization basics (8 mins)
4. Can we capture the essence of word meaning more effectively by counting? (8m)
5. The GloVe model of word vectors (8 min)
6. Evaluating word vectors (12 mins)
7. Word senses (6 mins)
8. Review of classification and how neural nets differ (8 mins)
9. Introducing neural networks (14 mins)

Key Goal: To be able to read word embeddings papers by the end of class
2
2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word in the whole corpus
!"#(%!" &# )
• Try to predict surrounding words using word vectors: 𝑃 𝑜 𝑐 = ∑$∈& !"#(%$"& )
#

𝑃 𝑤!%$ | 𝑤! 𝑃 𝑤!"$ | 𝑤!

𝑃 𝑤!%# | 𝑤! 𝑃 𝑤!"# | 𝑤!

… problems turning into banking crises as …

• Learning: Update vectors so they can predict actual surrounding words better
• Doing no more than this, this algorithm learns word vectors that capture
well word similarity and meaningful directions in a wordspace!

4
Word2vec parameters and computations
••••• ••••• • •
• • • •• • • • •• • •
• • • •• • • • •• • •
• • • •• • • • •• • •
• • • •• • • • •• • •
• • • •• • • • •• • •
U V 𝑈. 𝑣,- softmax(𝑈. 𝑣,- )
outside center dot product probabilities

“Bag of words” model!

The model makes the same predictions at each position
We want a model that gives a reasonably high
probability estimate to all words that occur in the
context (at all often)
5
Word2vec maximizes objective function by
putting similar words nearby in space

6
3. Optimization: Gradient Descent
• To learn good word vectors: We have a cost function 𝐽 𝜃 we want to minimize
• Gradient Descent is an algorithm to minimize 𝐽 𝜃 by changing 𝜃
• Idea: from current value of 𝜃, calculate gradient of 𝐽 𝜃 , then take small step in the
direction of negative gradient. Repeat.

Note: Our
objectives
may not
be convex
like this L

But life turns

out to be
okay J

7
Gradient Descent
• Update equation (in matrix notation):

𝛼 = step size or learning rate

• Update equation (for a single parameter):

• Algorithm:

8
Stochastic Gradient Descent
• Problem: 𝐽 𝜃 is a function of all windows in the corpus (often, billions!)
• So is very expensive to compute
• You would wait a very long time before making a single update!

• Very bad idea for pretty much all neural nets!

• Solution: Stochastic gradient descent (SGD)
• Repeatedly sample windows, and update after each one, or each small batch
• Algorithm:

9
Stochastic gradients with word vectors! [Aside]
• Iteratively take gradients at each such window for SGD
• But in each window, we only have at most 2m + 1 words,
so is very sparse!

10
Stochastic gradients with word vectors!
• We might only update the word vectors that actually appear!

• Solution: either you need sparse matrix update operations to Rows not columns
only update certain rows of full embedding matrices U and V, in actual DL
or you need to keep around a hash for word vectors packages!

|V| [ ]
• If you have millions of word vectors and do distributed
computing, it is important to not have to send gigantic
updates around!

11
2b. Word2vec algorithm family: More details
Why two vectors? à Easier optimization. Average both at the end
• But can implement the algorithm with just one vector per word … and it helps
Two model variants:
1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
We presented: Skip-gram model

Additional efficiency in training:

1. Negative sampling
So far: Focus on naïve softmax (simpler, but expensive, training method)

12
The skip-gram model with negative sampling (HW2)
• The normalization term is computationally expensive

"& )
!"#(%! #
• 𝑃 𝑜𝑐 = "& )
∑$∈& !"#(%$ #

• Hence, in standard word2vec and HW2 you implement the skip-gram model with
negative sampling

• Main idea: train binary logistic regressions for a true pair (center word and a word in its
context window) versus several noise pairs (the center word paired with a random
word)

13
The skip-gram model with negative sampling (HW2)
• From paper: “Distributed Representations of Words and Phrases and their
Compositionality” (Mikolov et al. 2013)
• Overall objective function (they maximize):

• The logistic/sigmoid function:

(we’ll become good friends soon)
• We maximize the probability
of two words co-occurring in first log
and minimize probability of noise words
14
The skip-gram model with negative sampling (HW2)
• Notation more similar to class and HW2:

𝐽./0123456/ 𝒖7 , 𝒗8 , 𝑈 = − log 𝜎 𝒖-7 𝒗8 − , log 𝜎(−𝒖-9 𝒗8 )

9∈{< 23456/= >.=>8/2}

• We take k negative samples (using word probabilities)

• Maximize probability that real outside word appears,
minimize probability that random words appear around center word

• Sample with P(w)=U(w)3/4/Z, the unigram distribution U(w) raised to the 3/4 power
(We provide this function in the starter code).
• The power makes less frequent words be sampled more often
15
4. Why not capture co-occurrence counts directly?

Building a co-occurrence matrix X

• 2 options: windows vs. full document
• Window: Similar to word2vec, use window around
each word à captures some syntactic and semantic
information
• Word-document co-occurrence matrix will give
general topics (all sports terms will have similar
entries) leading to “Latent Semantic Analysis”

16
Example: Window based co-occurrence matrix
• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)
• Example corpus:
• I like deep learning
• I like NLP counts I like enjoy deep learning NLP flying .
I 0 2 1 0 0 0 0 0
• I enjoy flying
like 2 0 0 1 0 1 0 0
enjoy 1 0 0 0 0 0 1 0
deep 0 1 0 0 1 0 0 0
learning 0 0 0 1 0 0 0 1
NLP 0 1 0 0 0 0 0 1
flying 0 0 1 0 0 0 0 1
. 0 0 0 0 1 1 1 0

17
Co-occurrence vectors
• Simple count co-occurrence vectors
• Vectors increase in size with vocabulary
• Very high dimensional: require a lot of storage (though sparse)
• Subsequent classification models have sparsity issues à Models are less robust

• Low-dimensional vectors
• Idea: store “most” of the important information in a fixed, small number of
dimensions: a dense vector
• Usually 25–1000 dimensions, similar to word2vec
• How to reduce the dimensionality?

18
Classic Method: Dimensionality Reduction on X (HW1)
Singular Value Decomposition of co-occurrence matrix X
Factorizes X into UΣVT, where U and V are orthonormal

k
X

Retain only k singular values, in order to generalize.

𝑋" is the best rank k approximation to X , in terms of least squares.
Classic linear algebra result. Expensive to compute for large matrices.
19
Hacks to X (several used in Rohde et al. 2005 in COALS)
• Running an SVD on raw counts doesn’t work well

• Scaling the counts in the cells can help a lot

• Problem: function words (the, he, has) are too frequent à syntax has too much
impact. Some fixes:
• log the frequencies
• min(X,t), with t ≈ 100
• Ignore the function words
• Ramped windows that count closer words more than further away words
• Use Pearson correlations instead of counts, then set negative values to 0
• Etc.

20
Rohde, Gonnerman, Plaut Modeling Word Meaning Using Lexical Co-Occurrence
Interesting semantic patterns emerge in the scaled vectors
DRIVER

JANITOR
DRIVE SWIMMER
STUDENT

CLEAN TEACHER

DOCTOR
BRIDE
SWIM
PRIEST

LEARN TEACH
MARRY

TREAT PRAY

Figure 13: Multidimensional scaling for nouns and their associated verbs.
COALS model from
Rohde et al. ms., 2005. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Table 10
The 10 nearest neighbors and their percent correlation similarities for a set of nouns, under the COALS-14K model.
21 gun point mind monopoly cardboard lipstick leningrad feet
5. Towards GloVe: Count based vs. direct prediction

• LSA, HAL (Lund & Burgess), • Skip-gram/CBOW (Mikolov et al)

• COALS, Hellinger-PCA (Rohde • NNLM, HLBL, RNN (Bengio et
et al, Lebret & Collobert) al; Collobert & Weston; Huang et al; Mnih
& Hinton)

• Fast training • Scales with corpus size

• Efficient usage of statistics
• Inefficient usage of statistics
• Primarily used to capture word • Generate improved performance
similarity on other tasks
• Disproportionate importance
given to large counts • Can capture complex patterns
beyond word similarity

22
Encoding meaning components in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]

Crucial insight: Ratios of co-occurrence probabilities can encode

meaning components

x = solid x = gas x = water x = random

large small large small

small large large small

large small ~1 ~1
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]

Crucial insight: Ratios of co-occurrence probabilities can encode

meaning components

x = solid x = gas x = water x = fashion

1.9 x 10-4 6.6 x 10-5 3.0 x 10-3 1.7 x 10-5

2.2 x 10-5 7.8 x 10-4 2.2 x 10-3 1.8 x 10-5

8.9 8.5 x 10-2 1.36 0.96

Encoding meaning in vector differences

Q: How can we capture ratios of co-occurrence probabilities as

linear meaning components in a word vector space?

A: Log-bilinear model:

with vector differences

Combining the best of both worlds
GloVe [Pennington, Socher, and Manning, EMNLP 2014]

• Fast training
• Scalable to huge corpora
• Good performance even with
small corpus and small vectors
GloVe results

Nearest words to
frog:

1. frogs
2. toad
3. litoria litoria leptodactylidae
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus

rana eleutherodactylus

27
6. How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
• Evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to real task is established
• Extrinsic:
• Evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem or its interaction or other subsystems
• If replacing exactly one subsystem with another improves accuracy à Winning!

28
Intrinsic word vector evaluation
• Word Vector Analogies

a:b :: c:?

man:woman :: king:?

• Evaluate word vectors by how well

their cosine distance after addition
captures intuitive semantic and king
syntactic analogy questions
• Discarding the input words from the
search!
woman
• Problem: What if the information is
man
there but not linear?

29
Glove Visualizations

30
Glove Visualizations: Company - CEO

31
Glove Visualizations: Comparatives and Superlatives

32
2013); skip-gram (SG) and CBOW results are
SG 300 1B 61 61 61 †
Analogy from (Mikolov
evaluation and et al., 2013a,b); we trained SG
hyperparameters
um in terms of CBOW † 300 1.6B 16.1 52.63 36.1
(19) and CBOW using the word2vec tool . See text
Hn, m . The up- vLBL 300 1.5B 54.2 64.8 60.0
Glove word vectors evaluation
for details and a description of the SVD models.
maximum fre- ivLBL 300 1.5B 65.2 63.0 64.0
ted to |C| when
the number of Model
GloVe Dim. Size Sem.
300 1.6B 80.8 Syn. Tot.
61.5 70.3
we are free to
This number is ivLBL
SVD 100
300 1.5B6B 55.9 6.3 50.18.1 53.2
7.3
uation for large
f r in Eqn. (17) HPCA
SVD-S 100
300 1.6B6B 36.7 4.2 16.4
46.6 10.8
42.1
pansion of gen-
. Therefore we GloVe
SVD-L 100 1.6B
300 6B 67.556.6 54.3
63.0 60.3
60.1
ol, 1976),
SG † 300
CBOW 1B
6B 63.6 61 67.4 61 65.7
61
SG†
CBOW 300 1.6B6B 73.016.1 66.0
52.6 69.1
36.1
s > 0, s ,(19)
1, vLBL
GloVe 300 1.5B6B 77.454.2 67.0
64.8 60.0
71.7
(20) ivLBL 1000
CBOW 300 1.5B6B 57.365.2 68.9
63.0 63.7
64.0
ted to |C| when
GloVe
SG 300 1.6B
1000 6B 80.866.1 61.5
65.1 70.3
65.6
we are free to
33 SVD
SVD-L 300 42B 6B 38.4 6.3 58.28.1 49.2
7.3
Analogy evaluation and hyperparameters

• More data helps • Dimensionality

• Wikipedia is better than news text! • Good dimension is ~300

80 70
Semantic Syntactic Overall
ctors. 85

. We 70 65
80

MN, 75 60 60
Accuracy [%]

Accuracy [%]
Accuracy [%]
70
50 55
65

60 40 50
Semantic
55
Syntactic
30 45
50 Overall
Gigaword5 +
Wiki2010 Wiki2014 Gigaword5 Common Crawl 20 40
Wiki2014 0 100 200 300 400 500 600 2
1B tokens 1.6B tokens 4.3B tokens 6B tokens 42B tokens
Vector Dimension

34
Figure 3: Accuracy on the analogy task for 300- (a) Symmetric context
Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Word 1 Word 2 Human (mean)

tiger cat 7.35
tiger tiger 10
book paper 7.46
computer internet 7.58
plane car 5.77
professor doctor 6.62
stock phone 1.62
stock CD 1.31
stock jaguar 0.92
35
Table 3: Spearman rank correlation on word simi-
rmance, with the
larity tasks. All vectors are 300-dimensional. The
nalogyCorrelation
task. evaluation
CBOW⇤ vectors are from the word2vec website
d results of a va-
Word vector
and differ
distances and
in that
their
they contain
correlation with
phrase
human
vectors.
judgments
as well as with
•
the word2vec Model Size WS353 MC RG SCWS RW
ing SVDs. With SVD 6B 35.3 35.1 42.5 38.3 25.6
gram (SG† ) and SVD-S 6B 56.5 71.5 71.0 53.6 34.7
† ) models on the SVD-L 6B 65.7 72.7 75.1 56.5 37.0
CBOW † 6B 57.2 65.6 68.2 57.0 32.5
a 2014 + Giga-
SG† 6B 62.8 65.2 69.7 58.1 37.2
op 400,000 most
GloVe 6B 65.8 72.7 77.8 53.9 38.1
ndow size of 10.
SVD-L 42B 74.0 76.4 74.1 58.3 39.9
hich we show in
GloVe 42B 75.9 83.6 82.9 59.6 47.8
r this corpus. CBOW⇤ 100B 68.4 79.6 75.4 59.4 45.5
erate a truncated
ormation
• Someof ideas
how fromLGlove paper have been shown to improve skip-gram (SG) model also
(e.g., average both model
vectors) on this larger corpus. The fact that this
th only the top basic SVD model does not scale well to large cor-
his step
36 is typi- pora lends further evidence to the necessity of the
Extrinsic word vector evaluation
• Extrinsic evaluation of word vectors: All subsequent NLP tasks in this class. More examples soon.
Semantic Syntac
Table 4: F1 score on NER task with 50d vectors. 85

• One example where good Discrete is the baseline

word vectors without
should help word
directly: vectors.
named Werecognition:
entity 80 identifying
use publicly-available
references to a person, organization or locationvectors for HPCA, HSMN, 75

Accuracy [%]
and CW. See text for details. 70

Model Dev Test ACE MUC7 65

60
Discrete 91.0 85.4 77.4 73.4
55
SVD 90.8 85.7 77.3 73.7 50
SVD-S 91.0 85.5 77.6 74.3
Wiki2010 Wiki2014 Gigaword
SVD-L 90.5 84.8 73.6 71.5 1B tokens 1.6B tokens 4.3B token

HPCA 92.6 88.7 81.7 80.7

Figure 3: Accuracy on the a
HSMN 90.5 85.7 78.7 74.7
dimensional vectors trained o
CW 92.2 87.4 81.7 80.2
CBOW 93.1 88.2 82.2 81.1 entries are updated to assimi
GloVe 93.2 88.3 82.9 82.2 whereas Gigaword is a fixed
outdated and possibly incorre
37 shown for neural vectors in (Turian et al., 2010).
7. Word senses and word sense ambiguity
• Most words have lots of meanings!
• Especially common words
• Especially words that have existed for a long time

• Example: pike

• Does one vector capture all these meanings or do we have a mess?

38
pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one’s way (pike along)
• In Australian English, pike means to pull out from doing something: I reckon he could
have climbed that cliff, but he piked!

39
Improving Word Representations Via Global Context And
Multiple Word Prototypes (Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each word assigned to multiple
different clusters bank1, bank2, etc.

40
Linear Algebraic Structure of Word Senses, with
Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted
sum) in standard word embeddings like word2vec
• 𝑣pike = 𝛼! 𝑣pike! + 𝛼" 𝑣pike" + 𝛼# 𝑣pike#
$!
• Where 𝛼! = , etc., for frequency f
$! %$" %$#
• Surprising result:
• Because of ideas from sparse coding you can actually separate out
the senses (providing they are relatively common)!

Lesson 3 Ppt-History of Communication
100% (2)
Lesson 3 Ppt-History of Communication
33 pages
Natural Language Processing With Deep Learning CS224N/Ling284
No ratings yet
Natural Language Processing With Deep Learning CS224N/Ling284
57 pages
cs224n Lecture Notes
No ratings yet
cs224n Lecture Notes
35 pages
Analyzing The Gift of The Magi
No ratings yet
Analyzing The Gift of The Magi
8 pages
Word Embeddings
No ratings yet
Word Embeddings
55 pages
Bsed-4a - Florida, Rina Rose A. - Detailed-Lesson-Plan
No ratings yet
Bsed-4a - Florida, Rina Rose A. - Detailed-Lesson-Plan
8 pages
NLP Prez Word - Sentence Embedding - MAQUET - MARTIN - LEEFEBURE - MOGAVERO
No ratings yet
NLP Prez Word - Sentence Embedding - MAQUET - MARTIN - LEEFEBURE - MOGAVERO
18 pages
Word2Vec - A Baby Step in Deep Learning But A Giant Leap Towards Natural Language Processing
100% (1)
Word2Vec - A Baby Step in Deep Learning But A Giant Leap Towards Natural Language Processing
12 pages
Jacques Derrida Religion and Postmodernism Series
100% (3)
Jacques Derrida Religion and Postmodernism Series
430 pages
Lecture1 Word Embeddings
No ratings yet
Lecture1 Word Embeddings
99 pages
Sense VEC A Fast and Accurate Method For Word Sense Disambiguation in Neural Word Embeddings
No ratings yet
Sense VEC A Fast and Accurate Method For Word Sense Disambiguation in Neural Word Embeddings
9 pages
Unit 8 G6 - Global Success
No ratings yet
Unit 8 G6 - Global Success
6 pages
Name Asmat Ullah
0% (1)
Name Asmat Ullah
5 pages
The Theory and Practice of Translation
No ratings yet
The Theory and Practice of Translation
236 pages
Word2Vec Tutorial - The Skip-Gram Model Chris McCormick PDF
No ratings yet
Word2Vec Tutorial - The Skip-Gram Model Chris McCormick PDF
39 pages
06 Wordvectors
No ratings yet
06 Wordvectors
96 pages
Wordembed v2.0
No ratings yet
Wordembed v2.0
46 pages
Christopher Manning Lecture 2: Word Vectors, Word Senses, and Neural Classifiers
No ratings yet
Christopher Manning Lecture 2: Word Vectors, Word Senses, and Neural Classifiers
57 pages
Word and Document Embeddings
No ratings yet
Word and Document Embeddings
94 pages
Madinah Arabic Pack - Lesson 4
No ratings yet
Madinah Arabic Pack - Lesson 4
42 pages
Natural Language Processing With Deep Learning CS224N/Ling284
No ratings yet
Natural Language Processing With Deep Learning CS224N/Ling284
33 pages
Distributed Word Representations For Information Retrieval
No ratings yet
Distributed Word Representations For Information Retrieval
46 pages
08 Word Embeddings (2021)
No ratings yet
08 Word Embeddings (2021)
58 pages
Christopher Manning Lecture 1: Introduction and Word Vectors
No ratings yet
Christopher Manning Lecture 1: Introduction and Word Vectors
42 pages
cs224n spr2024 Lecture01 Wordvecs1
No ratings yet
cs224n spr2024 Lecture01 Wordvecs1
40 pages
Cs224n 2025 Lecture03 Neuralnets
No ratings yet
Cs224n 2025 Lecture03 Neuralnets
96 pages
The Evolution of Linguistic Thought From The 18th To The 20th Century
No ratings yet
The Evolution of Linguistic Thought From The 18th To The 20th Century
3 pages
Natural Language Processing With Deep Learning CS224N/Ling284
No ratings yet
Natural Language Processing With Deep Learning CS224N/Ling284
36 pages
Lebijp 59 SZ 31 Py
No ratings yet
Lebijp 59 SZ 31 Py
69 pages
3 WordMeaning
No ratings yet
3 WordMeaning
78 pages
ML For NLP-LO4
No ratings yet
ML For NLP-LO4
42 pages
C1-U5-Adverbs CLC With GAMES
No ratings yet
C1-U5-Adverbs CLC With GAMES
24 pages
CCS369 - TSS-Unit 2
No ratings yet
CCS369 - TSS-Unit 2
56 pages
Unit IV
No ratings yet
Unit IV
57 pages
21 Word2Vec 24 09 2024
No ratings yet
21 Word2Vec 24 09 2024
63 pages
Week 2 and 3
No ratings yet
Week 2 and 3
76 pages
2020GRSsectionSurvivalManual Edit PDF
No ratings yet
2020GRSsectionSurvivalManual Edit PDF
29 pages
Neural Models For NLP
No ratings yet
Neural Models For NLP
67 pages
L4 Cse256 Fa24 We
No ratings yet
L4 Cse256 Fa24 We
68 pages
Constructing and Evaluating Word Embeddings
No ratings yet
Constructing and Evaluating Word Embeddings
33 pages
Cs224n 2024 Lecture02 Wordvecs2
No ratings yet
Cs224n 2024 Lecture02 Wordvecs2
45 pages
Lecture 4 Word Representation
No ratings yet
Lecture 4 Word Representation
48 pages
Neural IR
No ratings yet
Neural IR
45 pages
Lecture 10
No ratings yet
Lecture 10
86 pages
Lecture14 Distributed Representations
No ratings yet
Lecture14 Distributed Representations
63 pages
Unit 2 Updated New
No ratings yet
Unit 2 Updated New
77 pages
7a. Word Embeddings Word2Vec and GloVe
No ratings yet
7a. Word Embeddings Word2Vec and GloVe
39 pages
Vector Semantics and Embedding (Part 2)
No ratings yet
Vector Semantics and Embedding (Part 2)
47 pages
Word 2 Vec
No ratings yet
Word 2 Vec
29 pages
Lecture Word Embeddings WordTo Vec IR
No ratings yet
Lecture Word Embeddings WordTo Vec IR
60 pages
cs224n 2025 Lecture02 Wordvecs2
No ratings yet
cs224n 2025 Lecture02 Wordvecs2
46 pages
NLP Lec 03
No ratings yet
NLP Lec 03
26 pages
Word Embeddings
No ratings yet
Word Embeddings
59 pages
11.chapter8 WordEmbedding
No ratings yet
11.chapter8 WordEmbedding
17 pages
Wordembed
No ratings yet
Wordembed
31 pages
CS490 Advanced Topics in Computing - Deep Learning
No ratings yet
CS490 Advanced Topics in Computing - Deep Learning
20 pages
Word 2 Vec
No ratings yet
Word 2 Vec
33 pages
ML4D-L6 nlp2
No ratings yet
ML4D-L6 nlp2
58 pages
Let's Read Together
No ratings yet
Let's Read Together
14 pages
Level 3
No ratings yet
Level 3
6 pages
Ba LLMS W2 S2 2024 2025
No ratings yet
Ba LLMS W2 S2 2024 2025
47 pages
Word Embeddings 1
No ratings yet
Word Embeddings 1
42 pages
cs224n 2025 Lecture01 Wordvecs1
No ratings yet
cs224n 2025 Lecture01 Wordvecs1
36 pages
Vector Semantics and Embeddings
No ratings yet
Vector Semantics and Embeddings
29 pages
Word Vectors I
No ratings yet
Word Vectors I
23 pages
Class IX
No ratings yet
Class IX
13 pages
Chapter Transformers
No ratings yet
Chapter Transformers
8 pages
Chapter 8
No ratings yet
Chapter 8
12 pages
Active and Passive Voice
No ratings yet
Active and Passive Voice
6 pages
Apa Phonetics and Phonology
No ratings yet
Apa Phonetics and Phonology
7 pages
Worksheet Run-On Sentences
No ratings yet
Worksheet Run-On Sentences
5 pages
Eng Lang 6 Half Yearly
No ratings yet
Eng Lang 6 Half Yearly
5 pages
Discourse Term Paper
No ratings yet
Discourse Term Paper
6 pages
Match The Words With Their Indonesian Equivalents. Compare Your Work To Your Classmate's
No ratings yet
Match The Words With Their Indonesian Equivalents. Compare Your Work To Your Classmate's
4 pages
Thpdev Midterms Reviewer
No ratings yet
Thpdev Midterms Reviewer
4 pages
Spanish Word Vectors From Wikipedia: Mathias Etcheverry, Dina Wonsever
No ratings yet
Spanish Word Vectors From Wikipedia: Mathias Etcheverry, Dina Wonsever
5 pages
2.2 Wind (Poem) 1
No ratings yet
2.2 Wind (Poem) 1
8 pages
Ekta Test
No ratings yet
Ekta Test
18 pages
Past Simple or Continuous
No ratings yet
Past Simple or Continuous
2 pages
8300211c-6843-4b37-9a0b-cc7ec0bace42
No ratings yet
8300211c-6843-4b37-9a0b-cc7ec0bace42
3 pages
Vector Semantics 5: (Count (C) )
No ratings yet
Vector Semantics 5: (Count (C) )
3 pages
KO .. - Knowl. Ass. Unit 5
No ratings yet
KO .. - Knowl. Ass. Unit 5
3 pages
Formal Invitation
No ratings yet
Formal Invitation
3 pages
C1 Essay Marked Examples
No ratings yet
C1 Essay Marked Examples
3 pages
Vector Semantics 4
No ratings yet
Vector Semantics 4
3 pages
4b Immigration Political Cartoons-3-1
No ratings yet
4b Immigration Political Cartoons-3-1
2 pages

XCS224N Module1 Slides

Uploaded by

XCS224N Module1 Slides

Uploaded by

Natural Language Processing

with Deep Learning

Definition: meaning (Webster dictionary)

Commonest linguistic way of thinking of meaning:

signifier (symbol) ⟺ signified (idea or thing)

Means one 1, the rest 0s

Such symbols for words can be represented by one-hot vectors:

Vector dimension = number of words in vocabulary (e.g., 500,000)

Problem with words as discrete symbols

• Distributional semantics: A word’s meaning is given

• When a word w appears in a text, its context is the set of words

…government debt problems turning into banking crises as happened in 2009…

These context words will represent banking

… problems turning into banking crises as …

outside context words center word outside context words

… problems turning into banking crises as …

outside context words center word outside context words

sometimes called a cost or loss function

The objective function 𝐽 𝜃 is the (average) negative log likelihood:

Minimizing objective function ⟺ Maximizing predictive accuracy

• We want to minimize the objective function:

• Question: How to calculate 𝑃 𝑤%.$ | 𝑤% ; 𝜃 ?

• Answer: We will use two vectors per word w:

• Then for a center word c and a context word o:

𝑃 𝑢-*(!./0+ | 𝑣%#'( 𝑃 𝑢)*%+%+ |𝑣%#'(

𝑃 𝑢',#%#& | 𝑣%#'( 𝑃 𝑢!"#$%#& |𝑣%#'(

… problems turning into banking crises as …

outside context words center word outside context words

• This is an example of the softmax function ℝ9 → (0,1)9 Open

• Recall: 𝜃 represents all the

• Zoom U. Whiteboard – see video or revised slides

• Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

Let’s derive gradient for center word together

𝑃 𝑢',*#%#& |𝑣!"#$%#& 𝑃 𝑢"+ |𝑣!"#$%#&

𝑃 𝑢#(+, | 𝑣&'()#(* 𝑃 𝑢!"#$%$ |𝑣&'()#(*

… problems turning into banking crises as …

outside context words center word outside context words

Two model variants:

Additional efficiency in training:

But life turns

𝛼 = step size or learning rate

• Update equation (for single parameter):

• Very bad idea for pretty much all neural nets!

… problems turning into banking crises as …

“Bag of words” model!

But life turns

𝛼 = step size or learning rate

• Update equation (for a single parameter):

• Very bad idea for pretty much all neural nets!

Additional efficiency in training:

• The logistic/sigmoid function:

𝐽./0123456/ 𝒖7 , 𝒗8 , 𝑈 = − log 𝜎 𝒖-7 𝒗8 − , log 𝜎(−𝒖-9 𝒗8 )

• We take k negative samples (using word probabilities)

Building a co-occurrence matrix X

Retain only k singular values, in order to generalize.

• Scaling the counts in the cells can help a lot

• LSA, HAL (Lund & Burgess), • Skip-gram/CBOW (Mikolov et al)

• Fast training • Scales with corpus size

Crucial insight: Ratios of co-occurrence probabilities can encode

x = solid x = gas x = water x = random

large small large small

small large large small

Crucial insight: Ratios of co-occurrence probabilities can encode

x = solid x = gas x = water x = fashion

1.9 x 10-4 6.6 x 10-5 3.0 x 10-3 1.7 x 10-5

2.2 x 10-5 7.8 x 10-4 2.2 x 10-3 1.8 x 10-5

8.9 8.5 x 10-2 1.36 0.96

Q: How can we capture ratios of co-occurrence probabilities as

with vector differences

• Evaluate word vectors by how well

• More data helps • Dimensionality

Word 1 Word 2 Human (mean)

• One example where good Discrete is the baseline

Model Dev Test ACE MUC7 65

HPCA 92.6 88.7 81.7 80.7

• Does one vector capture all these meanings or do we have a mess?

𝑃 𝑢-(!./0+ | 𝑣%#'( 𝑃 𝑢)%+%+ |𝑣%#'(