XCS224N Module1 Slides

The document outlines the first lecture of a course on Natural Language Processing (NLP) with Deep Learning, focusing on the introduction to word vectors and the Word2Vec framework. Key topics include the representation of word meanings through vectors, the challenges of traditional NLP methods, and the use of context to derive word meanings. The lecture aims to provide foundational knowledge for building NLP systems using modern deep learning techniques.


Natural Language Processing

with Deep Learning


CS224N/Ling284

Christopher Manning
Lecture 1: Introduction and Word Vectors
Lecture Plan
Lecture 1: Introduction and Word Vectors
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)

Key learning today: The (really surprising!) result that word meaning can be represented
rather well by a large vector of real numbers

2
What do we hope to teach?
1. The foundations of the effective modern methods for deep learning applied to NLP
• Basics first, then key methods used in NLP: Recurrent networks, attention,
transformers, etc.
2. A big picture understanding of human languages and the difficulties in understanding
and producing them
3. An understanding of and ability to build systems (in PyTorch) for some of the major
problems in NLP:
• Word meaning, dependency parsing, machine translation, question answering

4
Lecture Plan
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)

7
https://xkcd.com/1576/ Randall Munroe CC BY NC 2.5
Trained on text data, neural machine translation is quite good!

https://kiswahili.tuko.co.ke/
GPT-3: A first step on the path to universal models
A generated poem:

  The SEC said, “Musk, your tweets are a blight.
  They really could cost you your job,
  if you don't stop all this tweeting at night.”
  Then Musk cried, “Why? The tweets I wrote are not mean,
  I don't use all-caps and I'm sure that my tweets are clean.”
  “But your tweets can move markets and that's why we're sore.
  You may be a genius and a billionaire,
  but it doesn't give you the right to be a bore!”

Completing a simple pattern from a couple of examples:

  S: I broke the window. Q: What did I break?
  S: I gracefully saved the day. Q: What did I gracefully save?
  S: I gave John flowers. Q: Who did I give flowers to?
  S: I gave her a rose and a guitar. Q: Who did I give a rose and a guitar to?

Translating natural-language questions into SQL:

  How many users have signed up since the start of 2020?
  SELECT count(id) FROM users WHERE created_at > ‘2020-01-01’

  What is the average number of influencers each user is subscribed to?
  SELECT avg(count) FROM ( SELECT user_id, count(*) FROM subscribers GROUP BY user_id )
  AS avg_subscriptions_per_user
How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)


• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:

signifier (symbol) ⟺ signified (idea or thing)


= denotational semantics

12
How do we have usable meaning in a computer?
Common NLP solution: Use, e.g., WordNet, a thesaurus containing
lists of synonym sets and hypernyms (“is a” relationships).
e.g., synonym sets containing “good”:

from nltk.corpus import wordnet as wn
poses = { 'n':'noun', 'v':'verb', 's':'adj (s)', 'a':'adj', 'r':'adv'}
for synset in wn.synsets("good"):
    print("{}: {}".format(poses[synset.pos()],
          ", ".join([l.name() for l in synset.lemmas()])))

noun: good
noun: good, goodness
noun: good, goodness
noun: commodity, trade_good, good
adj: good
adj (sat): full, good
adj: good
adj (sat): estimable, good, honorable, respectable
adj (sat): beneficial, good
adj (sat): good
adj (sat): good, just, upright
adverb: well, good
adverb: thoroughly, soundly, good

e.g., hypernyms of “panda”:

from nltk.corpus import wordnet as wn
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
list(panda.closure(hyper))

[Synset('procyonid.n.01'), Synset('carnivore.n.01'), Synset('placental.n.01'),
 Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'),
 Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'),
 Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'),
 Synset('entity.n.01')]

13
Problems with resources like WordNet
• Great as a resource but missing nuance
• e.g., “proficient” is listed as a synonym for “good”
This is only correct in some contexts
• Missing new meanings of words
• e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
• Impossible to keep up-to-date!
• Subjective
• Requires human labor to create and adapt
• Can’t compute accurate word similarity

14
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols:
hotel, conference, motel – a localist representation

Such symbols for words can be represented by one-hot vectors
(meaning one 1, the rest 0s):


motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000)

15
Sec. 9.2.2

Problem with words as discrete symbols


Example: in web search, if user searches for “Seattle motel”, we would like to match
documents containing “Seattle hotel”

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal
There is no natural notion of similarity for one-hot vectors!

Solution:
• Could try to rely on WordNet’s list of synonyms to get similarity?
• But it is well-known to fail badly: incompleteness, etc.
• Instead: learn to encode similarity in the vectors themselves
16
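As a concrete illustration (a minimal sketch, not from the slides; the toy vocabulary is made up), one-hot vectors of different words always have dot product 0, so no similarity can be read off them, whereas dense vectors allow a graded cosine similarity:

import numpy as np

vocab = ["seattle", "hotel", "motel", "conference"]   # toy vocabulary for illustration
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("hotel") @ one_hot("motel"))   # 0.0 -- one-hot vectors of different words are orthogonal

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
dense = {w: rng.normal(size=8) for w in vocab}   # random stand-ins for learned embeddings
print(cosine(dense["hotel"], dense["motel"]))    # a graded similarity in (-1, 1)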
Representing words by their context

• Distributional semantics: A word’s meaning is given


by the words that frequently appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words


that appear nearby (within a fixed-size window).
• Use the many contexts of w to build up a representation of w

…government debt problems turning into banking crises as happened in 2009…


…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking


17
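As a small illustration of collecting the contexts of a word w (a sketch, not part of the slides; the example sentence is taken from the slide above):

def contexts(tokens, center, window=2):
    """Collect the words within `window` positions of each occurrence of `center`."""
    found = []
    for t, w in enumerate(tokens):
        if w == center:
            found.append(tokens[max(0, t - window):t] + tokens[t + 1:t + 1 + window])
    return found

tokens = "government debt problems turning into banking crises as happened in 2009".split()
print(contexts(tokens, "banking"))   # [['turning', 'into', 'crises', 'as']]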
Word vectors
We will build a dense vector for each word, chosen so that it is similar to vectors of words
that appear in similar contexts

0.286
0.792
−0.177
banking = −0.107
0.109
−0.542
0.349
0.271

Note: word vectors are also called word embeddings or (neural) word representations
They are a distributed representation
18
Word meaning as a neural word vector – visualization

0.286
0.792
−0.177
−0.107
expect = 0.109
−0.542
0.349
0.271
0.487

19
3. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors

Idea:
• We have a large corpus (“body”) of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given
c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

20
Word2Vec Overview
Example windows and process for computing P(w_{t+j} | w_t):

P(w_{t-2} | w_t)   P(w_{t-1} | w_t)   P(w_{t+1} | w_t)   P(w_{t+2} | w_t)

… problems turning into banking crises as …

outside context words in window of size 2 | center word at position t | outside context words in window of size 2

21
Word2vec: objective function
For each position t = 1, …, T, predict context words within a window of fixed size m,
given center word w_t. Data likelihood:

  Likelihood = L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t ; θ)

θ is all the variables to be optimized.

The objective function J(θ) is the (average) negative log likelihood
(sometimes called a cost or loss function):

  J(θ) = −(1/T) log L(θ) = −(1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t ; θ)

Minimizing the objective function ⟺ maximizing predictive accuracy
23
Word2vec: objective function

• We want to minimize the objective function:

  J(θ) = −(1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t ; θ)

• Question: How to calculate P(w_{t+j} | w_t ; θ)?

• Answer: We will use two vectors per word w:
  • v_w when w is a center word
  • u_w when w is a context word

• Then for a center word c and a context word o:

  P(o | c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)
24
Word2Vec Overview with Vectors
• Example windows and process for computing P(w_{t+j} | w_t)
• P(u_problems | v_into) is short for P(problems | into ; u_problems, v_into, θ)

P(u_problems | v_into)   P(u_turning | v_into)   P(u_banking | v_into)   P(u_crises | v_into)

… problems turning into banking crises as …

outside context words in window of size 2 | center word at position t | outside context words in window of size 2
25
Word2vec: prediction function
  P(o | c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)

① The dot product compares the similarity of o and c: u^T v = u · v = Σ_i u_i v_i.
   Larger dot product = larger probability.
② Exponentiation makes anything positive.
③ The denominator normalizes over the entire vocabulary to give a probability distribution.

• This is an example of the softmax function ℝ^n → (0,1)^n (an open region):

  softmax(x_i) = exp(x_i) / Σ_{j=1}^{n} exp(x_j) = p_i

• The softmax function maps arbitrary values x_i to a probability distribution p_i
  • “max” because it amplifies the probability of the largest x_i
  • “soft” because it still assigns some probability to smaller x_i
  • But it is sort of a weird name, because it returns a distribution!
• Frequently used in Deep Learning
26
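A minimal numpy sketch of this prediction step (illustrative only; the vectors below are random stand-ins for learned parameters U and V):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

V_size, d = 10, 8                      # toy vocabulary size and vector dimension
rng = np.random.default_rng(0)
U = rng.normal(size=(V_size, d))       # u_w: outside (context) vectors, one row per word
V = rng.normal(size=(V_size, d))       # v_w: center vectors, one row per word

c = 3                                  # index of the center word
probs = softmax(U @ V[c])              # P(o | c) for every candidate outside word o
print(probs.sum())                     # 1.0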
To train the model: Optimize value of parameters to minimize loss
To train a model, we gradually adjust parameters to minimize a loss

• Recall: θ represents all the model parameters, in one long vector
• In our case, with d-dimensional vectors and V-many words, we have θ ∈ ℝ^{2dV}
• Remember: every word has two vectors

• We optimize these parameters by walking down the gradient
• We compute all vector gradients!
27
4. Word2vec derivations of gradient

• Zoom U. Whiteboard – see video or revised slides


• The basic Lego piece: The chain rule
• Useful basic fact:
• If in doubt: write it out with indices

28
Chain Rule

• Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

  dy/dx = (dy/du) (du/dx)

• Simple example: y = 5(x³ + 7)⁴, with u = x³ + 7:

  dy/dx = (dy/du) (du/dx) = 20(x³ + 7)³ · 3x²
29
Interactive Whiteboard Session!

Let’s derive gradient for center word together


For one example window and one example outside word:

  log p(o | c) = log [ exp(u_o^T v_c) / Σ_{w=1}^{V} exp(u_w^T v_c) ]

You then also need the gradient for context words (it’s similar;
left for homework). That’s all of the parameters 𝜃 here.
30
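For reference, the derivation done on the whiteboard goes roughly as follows (a sketch of the standard calculation, written here in LaTeX; it is not a transcript of the session):

\frac{\partial}{\partial v_c} \log p(o \mid c)
  = \frac{\partial}{\partial v_c} \Big( u_o^\top v_c - \log \sum_{w=1}^{V} \exp(u_w^\top v_c) \Big)
  = u_o - \sum_{x=1}^{V} \frac{\exp(u_x^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \, u_x
  = u_o - \sum_{x=1}^{V} P(x \mid c) \, u_x

That is, the gradient is the observed context vector minus the model's expected context vector.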
Calculating all gradients!
• We went through the gradient for each center vector v in a window
• We also need gradients for outside vectors u
• Derive at home!
• Generally in each window we will compute updates for all parameters that are being
used in that window. For example:

P(u_turning | v_banking)   P(u_into | v_banking)   P(u_crises | v_banking)   P(u_as | v_banking)

… problems turning into banking crises as …

outside context words in window of size 2 | center word at position t | outside context words in window of size 2
31
Word2vec: More details
Why two vectors? → Easier optimization. Average both at the end.

Two model variants:


1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
This lecture so far: Skip-gram model

Additional efficiency in training:


1. Negative sampling
So far: Focus on naïve softmax (simpler but more expensive training method)

32
5. Optimization: Gradient Descent
• We have a cost function 𝐽 𝜃 we want to minimize
• Gradient Descent is an algorithm to minimize 𝐽 𝜃
• Idea: for current value of 𝜃, calculate gradient of 𝐽 𝜃 , then take small step in direction
of negative gradient. Repeat.

Note: Our objectives may not be convex like this ☹

But life turns out to be okay ☺
33
Gradient Descent
• Update equation (in matrix notation):

  θ^new = θ^old − α ∇_θ J(θ)

  α = step size or learning rate

• Update equation (for a single parameter):

  θ_j^new = θ_j^old − α · ∂J(θ) / ∂θ_j

• Algorithm:

34
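The "Algorithm" bullet corresponds to a loop like the following (a schematic sketch; evaluate_gradient is a hypothetical helper that computes ∇_θ J(θ) over the whole corpus):

def gradient_descent(theta, corpus, evaluate_gradient, alpha=0.01, n_steps=1000):
    """Batch gradient descent: step against the full-corpus gradient (theta is a numpy vector)."""
    for _ in range(n_steps):
        theta_grad = evaluate_gradient(corpus, theta)   # gradient of J(theta) over all windows
        theta = theta - alpha * theta_grad              # small step in the negative gradient direction
    return theta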
Stochastic Gradient Descent
• Problem: 𝐽 𝜃 is a function of all windows in the corpus (potentially billions!)
• So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!

• Very bad idea for pretty much all neural nets!


• Solution: Stochastic gradient descent (SGD)
• Repeatedly sample windows, and update after each one
• Algorithm:

35
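The SGD variant of that loop samples one window at a time (again a schematic sketch, with a hypothetical evaluate_gradient_on_window helper):

import random

def sgd(theta, windows, evaluate_gradient_on_window, alpha=0.01, n_steps=100000):
    """Stochastic gradient descent: update after each sampled window (theta is a numpy vector)."""
    for _ in range(n_steps):
        window = random.choice(windows)                                       # sample one window
        theta = theta - alpha * evaluate_gradient_on_window(window, theta)    # noisy but cheap update
    return theta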
Lecture Plan
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)

36
Lecture Plan
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
1. Course organization (2 mins)
2. Finish looking at word vectors and word2vec (10 mins)
3. Optimization basics (8 mins)
4. Can we capture the essence of word meaning more effectively by counting? (8m)
5. The GloVe model of word vectors (8 min)
6. Evaluating word vectors (12 mins)
7. Word senses (6 mins)
8. Review of classification and how neural nets differ (8 mins)
9. Introducing neural networks (14 mins)

Key Goal: To be able to read word embeddings papers by the end of class
2
2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word in the whole corpus
• Try to predict surrounding words using word vectors:

  P(o | c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)

P(w_{t-2} | w_t)   P(w_{t-1} | w_t)   P(w_{t+1} | w_t)   P(w_{t+2} | w_t)

… problems turning into banking crises as …

• Learning: Update vectors so they can predict actual surrounding words better
• Doing no more than this, this algorithm learns word vectors that capture
well word similarity and meaningful directions in a wordspace!

4
Word2vec parameters and computations
U = matrix of outside word vectors; V = matrix of center word vectors.
For a center word c: compute the dot products U · v_c, then the probabilities softmax(U · v_c).
“Bag of words” model!


The model makes the same predictions at each position
We want a model that gives a reasonably high
probability estimate to all words that occur in the
context (at all often)
5
Word2vec maximizes objective function by
putting similar words nearby in space

6
3. Optimization: Gradient Descent
• To learn good word vectors: We have a cost function 𝐽 𝜃 we want to minimize
• Gradient Descent is an algorithm to minimize 𝐽 𝜃 by changing 𝜃
• Idea: from current value of 𝜃, calculate gradient of 𝐽 𝜃 , then take small step in the
direction of negative gradient. Repeat.

Note: Our objectives may not be convex like this ☹

But life turns out to be okay ☺
7
Gradient Descent
• Update equation (in matrix notation):

  θ^new = θ^old − α ∇_θ J(θ)

  α = step size or learning rate

• Update equation (for a single parameter):

  θ_j^new = θ_j^old − α · ∂J(θ) / ∂θ_j

• Algorithm:

8
Stochastic Gradient Descent
• Problem: 𝐽 𝜃 is a function of all windows in the corpus (often, billions!)
• So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!

• Very bad idea for pretty much all neural nets!


• Solution: Stochastic gradient descent (SGD)
• Repeatedly sample windows, and update after each one, or each small batch
• Algorithm:

9
Stochastic gradients with word vectors! [Aside]
• Iteratively take gradients at each such window for SGD
• But in each window, we only have at most 2m + 1 words,
  so the gradient ∇_θ J_t(θ) is very sparse!

10
Stochastic gradients with word vectors!
• We might only update the word vectors that actually appear!

• Solution: either you need sparse matrix update operations to only update certain rows of the
  full embedding matrices U and V, or you need to keep around a hash for word vectors
  (Note: rows, not columns, in actual DL packages!)

• If you have millions of word vectors and do distributed computing, it is important to not
  have to send gigantic updates around!
11
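A sketch of what "only update the rows that appear" looks like in practice (illustrative numpy code; this is not the HW2 starter code, and the argument names are made up):

def sgd_window_update(U, V, center_idx, outside_idxs, grad_center, grad_outside, alpha=0.01):
    """Update only the embedding rows touched by one window.

    U, V         : (|V|, d) outside and center embedding matrices (numpy arrays)
    center_idx   : index of the center word in this window
    outside_idxs : indices of the outside words in this window
    grad_center  : (d,) gradient w.r.t. the center vector
    grad_outside : (len(outside_idxs), d) gradients w.r.t. the outside vectors
    """
    V[center_idx] -= alpha * grad_center       # one row of V
    U[outside_idxs] -= alpha * grad_outside    # at most 2m rows of U
    # all other rows of U and V are left untouched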
2b. Word2vec algorithm family: More details
Why two vectors? → Easier optimization. Average both at the end
• But can implement the algorithm with just one vector per word … and it helps
Two model variants:
1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
We presented: Skip-gram model

Additional efficiency in training:


1. Negative sampling
So far: Focus on naïve softmax (simpler, but expensive, training method)

12
The skip-gram model with negative sampling (HW2)
• The normalization term in the denominator is computationally expensive:

  P(o | c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)

• Hence, in standard word2vec and HW2 you implement the skip-gram model with
negative sampling

• Main idea: train binary logistic regressions for a true pair (center word and a word in its
context window) versus several noise pairs (the center word paired with a random
word)

13
The skip-gram model with negative sampling (HW2)
• From paper: “Distributed Representations of Words and Phrases and their
Compositionality” (Mikolov et al. 2013)
• Overall objective function (they maximize):

  J(θ) = (1/T) Σ_{t=1}^{T} J_t(θ),  where
  J_t(θ) = log σ(u_o^T v_c) + Σ_{i=1}^{k} E_{j ∼ P(w)} [ log σ(−u_j^T v_c) ]

• The logistic/sigmoid function (we’ll become good friends soon):

  σ(x) = 1 / (1 + e^{−x})
• We maximize the probability
of two words co-occurring in first log
and minimize probability of noise words
14
The skip-gram model with negative sampling (HW2)
• Notation more similar to class and HW2:

  J_neg-sample(u_o, v_c, U) = − log σ(u_o^T v_c) − Σ_{k ∈ {K sampled indices}} log σ(−u_k^T v_c)

• We take k negative samples (using word probabilities)


• Maximize probability that real outside word appears,
minimize probability that random words appear around center word

• Sample with P(w) = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power
  (We provide this function in the starter code.)
• The 3/4 power makes less frequent words be sampled more often
15
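A compact numpy sketch of this loss for one (center word, outside word) pair with k negative samples (illustrative only; in HW2 the sampler and the parameter shapes come from the starter code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_o, v_c, U_neg):
    """J = -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)

    u_o   : (d,)   outside vector of the true context word
    v_c   : (d,)   center vector
    U_neg : (k, d) outside vectors of the k sampled negative words
    """
    return -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-U_neg @ v_c)))

def sample_negatives(unigram_counts, k, rng=None):
    """Sample k word indices from the unigram distribution raised to the 3/4 power."""
    if rng is None:
        rng = np.random.default_rng(0)
    p = unigram_counts ** 0.75
    return rng.choice(len(unigram_counts), size=k, p=p / p.sum())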
4. Why not capture co-occurrence counts directly?

Building a co-occurrence matrix X


• 2 options: windows vs. full document
  • Window: similar to word2vec, use a window around each word → captures some syntactic
    and semantic information
  • Word-document co-occurrence matrix will give general topics (all sports terms will have
    similar entries), leading to “Latent Semantic Analysis”

16
Example: Window based co-occurrence matrix
• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)
• Example corpus:
  • I like deep learning
  • I like NLP
  • I enjoy flying

counts      I   like  enjoy  deep  learning  NLP  flying  .
I           0    2     1      0      0        0     0     0
like        2    0     0      1      0        1     0     0
enjoy       1    0     0      0      0        0     1     0
deep        0    1     0      0      1        0     0     0
learning    0    0     0      1      0        0     0     1
NLP         0    1     0      0      0        0     0     1
flying      0    0     1      0      0        0     0     1
.           0    0     0      0      1        1     1     0

17
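A small sketch that builds this window-1 co-occurrence matrix from the example corpus (illustrative code, not part of the slides; the row/column order comes out alphabetical rather than in the slide's order):

import numpy as np

docs = [s.split() for s in ["I like deep learning .", "I like NLP .", "I enjoy flying ."]]
vocab = sorted({w for doc in docs for w in doc})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in docs:
    for t, w in enumerate(doc):
        for j in range(max(0, t - window), min(len(doc), t + window + 1)):
            if j != t:
                X[idx[w], idx[doc[j]]] += 1   # symmetric: left and right contexts both count

print(vocab)
print(X)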
Co-occurrence vectors
• Simple count co-occurrence vectors
• Vectors increase in size with vocabulary
• Very high dimensional: require a lot of storage (though sparse)
• Subsequent classification models have sparsity issues → models are less robust

• Low-dimensional vectors
• Idea: store “most” of the important information in a fixed, small number of
dimensions: a dense vector
• Usually 25–1000 dimensions, similar to word2vec
• How to reduce the dimensionality?

18
Classic Method: Dimensionality Reduction on X (HW1)
Singular Value Decomposition of co-occurrence matrix X
Factorizes X into UΣVᵀ, where U and V are orthonormal and Σ is diagonal.

Retain only the k largest singular values, in order to generalize.
X̂ is the best rank-k approximation to X, in terms of least squares.
A classic linear algebra result. Expensive to compute for large matrices.
19
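A sketch of the truncated SVD (using numpy on a tiny co-occurrence matrix; X and k are toy values for illustration):

import numpy as np

# toy co-occurrence counts, e.g., for the words I / like / enjoy
X = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T

k = 2
word_vectors = U[:, :k] * s[:k]                    # one k-dimensional vector per word (row)
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-k least-squares approximation to X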
Hacks to X (several used in Rohde et al. 2005 in COALS)
• Running an SVD on raw counts doesn’t work well

• Scaling the counts in the cells can help a lot


• Problem: function words (the, he, has) are too frequent → syntax has too much
  impact. Some fixes:
• log the frequencies
• min(X,t), with t ≈ 100
• Ignore the function words
• Ramped windows that count closer words more than further away words
• Use Pearson correlations instead of counts, then set negative values to 0
• Etc.

20
Rohde, Gonnerman, Plaut Modeling Word Meaning Using Lexical Co-Occurrence
Interesting semantic patterns emerge in the scaled vectors
[Figure 13 (Rohde et al.): multidimensional scaling for nouns and their associated verbs,
e.g., DRIVER–DRIVE, SWIMMER–SWIM, TEACHER–TEACH, STUDENT–LEARN, DOCTOR–TREAT, PRIEST–PRAY,
BRIDE–MARRY, JANITOR–CLEAN.]

COALS model from Rohde et al. ms., 2005. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence

[Table 10 (Rohde et al.): the 10 nearest neighbors and their percent correlation similarities
for a set of nouns (gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet), under
the COALS-14K model.]

21
5. Towards GloVe: Count based vs. direct prediction

• LSA, HAL (Lund & Burgess), • Skip-gram/CBOW (Mikolov et al)


• COALS, Hellinger-PCA (Rohde • NNLM, HLBL, RNN (Bengio et
et al, Lebret & Collobert) al; Collobert & Weston; Huang et al; Mnih
& Hinton)

• Fast training • Scales with corpus size


• Efficient usage of statistics
• Inefficient usage of statistics
• Primarily used to capture word • Generate improved performance
similarity on other tasks
• Disproportionate importance
given to large counts • Can capture complex patterns
beyond word similarity

22
Encoding meaning components in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]

Crucial insight: Ratios of co-occurrence probabilities can encode


meaning components

                                 x = solid   x = gas   x = water   x = random
  P(x | ice)                     large       small     large       small
  P(x | steam)                   small       large     large       small
  P(x | ice) / P(x | steam)      large       small     ~1          ~1
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]

Crucial insight: Ratios of co-occurrence probabilities can encode


meaning components

                                 x = solid    x = gas      x = water    x = fashion
  P(x | ice)                     1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
  P(x | steam)                   2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
  P(x | ice) / P(x | steam)      8.9          8.5 × 10⁻²   1.36         0.96


Encoding meaning in vector differences

Q: How can we capture ratios of co-occurrence probabilities as


linear meaning components in a word vector space?

A: Log-bilinear model:            w_i · w̃_j = log P(i | j)

   with vector differences:       w_x · (w_a − w_b) = log [ P(x | a) / P(x | b) ]


Combining the best of both worlds
GloVe [Pennington, Socher, and Manning, EMNLP 2014]

• Fast training
• Scalable to huge corpora
• Good performance even with
small corpus and small vectors
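
For reference (the formula is not visible in the extracted slides), the objective GloVe minimizes, from Pennington et al. (2014), is a weighted least-squares log-bilinear loss:

J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the co-occurrence count and f is a weighting function that caps the influence of very frequent pairs.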
GloVe results

Nearest words to
frog:

1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus

(The slide shows photos labeled litoria, leptodactylidae, rana, and eleutherodactylus.)

27
6. How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
• Evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to real task is established
• Extrinsic:
• Evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem or its interaction or other subsystems
• If replacing exactly one subsystem with another improves accuracy → Winning!

28
Intrinsic word vector evaluation
• Word Vector Analogies

a:b :: c:?

man:woman :: king:?

• Evaluate word vectors by how well their cosine distance after addition captures intuitive
  semantic and syntactic analogy questions
• Discarding the input words from the search!
• Problem: What if the information is there but not linear?

[Figure: man : woman :: king : ? illustrated as vector arithmetic in the embedding space.]

29
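A sketch of the analogy evaluation by vector arithmetic (illustrative; emb is a hypothetical dict mapping each word to a numpy vector, e.g., loaded GloVe vectors):

import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, emb, topn=1):
    """Solve a : b :: c : ? by maximizing cosine similarity to (b - a + c)."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]   # discard the input words from the search
    return sorted(candidates, key=lambda w: cosine(emb[w], target), reverse=True)[:topn]

# e.g., analogy("man", "woman", "king", emb)  ->  hopefully ["queen"]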
Glove Visualizations

30
Glove Visualizations: Company - CEO

31
Glove Visualizations: Comparatives and Superlatives

32
Glove word vectors evaluation: analogy evaluation and hyperparameters

[Table (from the GloVe paper): accuracy on the word analogy task (Sem., Syn., Tot.) for SVD,
SVD-S, SVD-L, HPCA, CBOW, SG, vLBL, ivLBL, and GloVe models of various dimensionalities,
trained on corpora of 1B–42B tokens. SG and CBOW results are from (Mikolov et al., 2013a,b)
or trained with the word2vec tool; GloVe obtains the best overall accuracy.]

33
Analogy evaluation and hyperparameters

• More data helps: Wikipedia is better than news text!
• Dimensionality: a good dimension is ~300

[Figure: accuracy on the analogy task (semantic, syntactic, overall) as a function of training
corpus (Wiki2010 1B tokens, Wiki2014 1.6B, Gigaword5 4.3B, Wiki2014 + Gigaword5 6B, Common
Crawl 42B) and as a function of vector dimension (up to 600), with symmetric context.]

34
Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Word 1 Word 2 Human (mean)


tiger cat 7.35
tiger tiger 10
book paper 7.46
computer internet 7.58
plane car 5.77
professor doctor 6.62
stock phone 1.62
stock CD 1.31
stock jaguar 0.92
35
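A sketch of this evaluation (illustrative; emb is again a hypothetical word-to-vector dict, and pairs holds human-rated word pairs such as those above):

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def word_similarity_eval(pairs, emb):
    """pairs: list of (word1, word2, human_score); returns the Spearman rank correlation."""
    model_scores = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    human_scores = [score for _, _, score in pairs]
    return spearmanr(model_scores, human_scores).correlation

# e.g., pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46), ("stock", "jaguar", 0.92)]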
Correlation evaluation
• Word vector distances and their correlation with human judgments

[Table 3 from the GloVe paper: Spearman rank correlation on word similarity tasks.
All vectors are 300-dimensional; the CBOW* vectors are from the word2vec website and differ
in that they contain phrase vectors as well.]

Model    Size   WS353  MC    RG    SCWS  RW
SVD      6B     35.3   35.1  42.5  38.3  25.6
SVD-S    6B     56.5   71.5  71.0  53.6  34.7
SVD-L    6B     65.7   72.7  75.1  56.5  37.0
CBOW†    6B     57.2   65.6  68.2  57.0  32.5
SG†      6B     62.8   65.2  69.7  58.1  37.2
GloVe    6B     65.8   72.7  77.8  53.9  38.1
SVD-L    42B    74.0   76.4  74.1  58.3  39.9
GloVe    42B    75.9   83.6  82.9  59.6  47.8
CBOW*    100B   68.4   79.6  75.4  59.4  45.5

• Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model also
  (e.g., averaging both vectors)

36
Extrinsic word vector evaluation
• Extrinsic evaluation of word vectors: All subsequent NLP tasks in this class. More examples soon.
• One example where good word vectors should help directly: named entity recognition,
  i.e., identifying references to a person, organization, or location

[Table 4 from the GloVe paper: F1 score on the NER task with 50d vectors. Discrete is the
baseline without word vectors; publicly available vectors are used for HPCA, HSMN, and CW.]

Model     Dev   Test  ACE   MUC7
Discrete  91.0  85.4  77.4  73.4
SVD       90.8  85.7  77.3  73.7
SVD-S     91.0  85.5  77.6  74.3
SVD-L     90.5  84.8  73.6  71.5
HPCA      92.6  88.7  81.7  80.7
HSMN      90.5  85.7  78.7  74.7
CW        92.2  87.4  81.7  80.2
CBOW      93.1  88.2  82.2  81.1
GloVe     93.2  88.3  82.9  82.2

37
7. Word senses and word sense ambiguity
• Most words have lots of meanings!
• Especially common words
• Especially words that have existed for a long time

• Example: pike

• Does one vector capture all these meanings or do we have a mess?

38
pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one’s way (pike along)
• In Australian English, pike means to pull out from doing something: I reckon he could
have climbed that cliff, but he piked!

39
Improving Word Representations Via Global Context And
Multiple Word Prototypes (Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each word assigned to multiple
different clusters bank1, bank2, etc.

40
Linear Algebraic Structure of Word Senses, with
Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted
sum) in standard word embeddings like word2vec
• v_pike = α₁ v_pike1 + α₂ v_pike2 + α₃ v_pike3
• where α₁ = f₁ / (f₁ + f₂ + f₃), etc., for frequency f
• Surprising result:
• Because of ideas from sparse coding you can actually separate out
the senses (providing they are relatively common)!

41
