
Natural Language Processing

with Deep Learning


CS224N/Ling284

Diyi Yang
Lecture 2: Word Vectors, Word Senses, and Neural Classifiers
Lecture Plan
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
1. Course organization (3 mins)
2. Finish looking at word vectors and word2vec (15 mins)
3. Can we capture the essence of word meaning more effectively by counting? (10 mins)
4. Evaluating word vectors (10 mins)
5. Word senses (8 mins)
6. Review of classification and how neural nets differ (14 mins)
7. Introducing neural networks (10 mins)

Key Goal: To be able to read word embeddings papers by the end of class

2
1. Course Organization
• Come to office hours/help sessions!
• They started yesterday
• Come to discuss final project ideas as well as the assignments
• Try to come early, often and off-cycle!
• TA office hours: 3-hour blocks Mon–Sat, with multiple TAs
• Just show up! Our friendly course staff will be on hand to assist you!
• https://web.stanford.edu/class/cs224n/office_hours.html
• Instructors’ office hours (in person by default):
• Diyi: Tuesdays 3-4pm
• Tatsu: Fridays 3-4pm

3
2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word position in the whole corpus
!"#(%!" &# )
• Try to predict surrounding words using word vectors: 𝑃 𝑜 𝑐 = ∑$∈& !"#(%$"& )
#

𝑃 𝑤!%$ | 𝑤! 𝑃 𝑤!"$ | 𝑤!
𝑃 𝑤!%# | 𝑤! 𝑃 𝑤!"# | 𝑤!

… problems turning into banking crises as …

• Learning: Update vectors so they can predict actual surrounding words better
• Doing no more than this, this algorithm learns word vectors that capture
well word similarity and meaningful directions in a word space!
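As a concrete reference, here is a minimal NumPy sketch of the prediction step above (the names `U`, `V`, `center_idx`, and `outside_idx` are illustrative, not the assignment's starter-code API):

```python
import numpy as np

def skipgram_prob(U, V, center_idx, outside_idx):
    """Naive-softmax skip-gram probability P(o | c).

    U: (|V|, d) outside ("context") vectors u_w; V: (|V|, d) center vectors v_w.
    """
    v_c = V[center_idx]            # center word vector
    scores = U @ v_c               # u_w . v_c for every word w in the vocabulary
    scores -= scores.max()         # shift for numerical stability before exponentiating
    exp_scores = np.exp(scores)
    return exp_scores[outside_idx] / exp_scores.sum()
```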

5
2. Optimization: Gradient Descent
• We have a cost function $J(\theta)$ we want to minimize
• Gradient Descent is an algorithm to minimize $J(\theta)$
• Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.

Note: Our objectives may not be convex like this :(

But life turns out to be okay :)
6
Gradient Descent
• Update equation (in matrix notation):

  $\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$

  $\alpha$ = step size or learning rate

• Update equation (for a single parameter):

  $\theta_j^{new} = \theta_j^{old} - \alpha \dfrac{\partial}{\partial \theta_j^{old}} J(\theta)$

• Algorithm:
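In code, the loop is roughly the minimal sketch below (assumptions: `evaluate_gradient` is a hypothetical placeholder for whatever computes $\nabla_\theta J$ over the whole corpus, and `theta` is a NumPy array):

```python
import numpy as np

def gradient_descent(evaluate_gradient, theta, corpus, alpha=0.01, n_steps=1000):
    """Plain (batch) gradient descent: every single update touches the whole corpus."""
    for _ in range(n_steps):
        grad = evaluate_gradient(corpus, theta)  # gradient of J(theta) over ALL windows
        theta = theta - alpha * grad             # small step in the negative gradient direction
    return theta
```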

7
Stochastic Gradient Descent
• Problem: $J(\theta)$ is a function of all windows in the corpus (potentially billions!)
• So $\nabla_\theta J(\theta)$ is very expensive to compute
• You would wait a very long time before making a single update!

• Very bad idea for pretty much all neural nets!


• Solution: Stochastic gradient descent (SGD), or mini-batch gradient descent
• Repeatedly sample windows (or small batches of them), and update after each one
• Algorithm:
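A corresponding sketch of the stochastic version (again illustrative names: `evaluate_gradient_on` is a placeholder, and `theta` is assumed to be a NumPy array):

```python
import random

def sgd(evaluate_gradient_on, theta, corpus_windows, alpha=0.01, n_steps=100_000, batch_size=1):
    """Stochastic / mini-batch gradient descent: update after each sampled window (or batch)."""
    for _ in range(n_steps):
        batch = random.sample(corpus_windows, batch_size)   # one window, or a small mini-batch
        grad = evaluate_gradient_on(batch, theta)           # far cheaper than the full corpus
        theta = theta - alpha * grad
    return theta
```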

8
Word2vec parameters … and computations
[Figure: the parameter matrices U (outside word vectors) and V (center word vectors); the dot products $U v_c$ are turned into probabilities by softmax$(U v_c)$]

• It is a "bag of words" model: the model makes the same predictions at every position
• We want a model that gives a reasonably high probability estimate to all words that occur in the context (at all often)
9
Word2vec maximizes objective function by
putting similar words nearby in space

10
Word2vec algorithm family (Mikolov et al. 2013): More details
Why two vectors? → Easier optimization. Average both at the end
• But can implement the algorithm with just one vector per word … and it helps a bit
Two model variants:
1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
We presented: Skip-gram model

Loss functions for training:


1. Naïve softmax (simple but expensive loss function, when many output classes)
2. More optimized variants like hierarchical softmax
3. Negative sampling
So far, we explained naïve softmax
11
The skip-gram model with negative sampling (HW2)
• The normalization term is computationally expensive (when many output classes):

"& )
!"#(%! #
• 𝑃 𝑜𝑐 = "& )
∑$∈& !"#(%$ A big sum over words
#

• Hence, in standard word2vec and HW2 you implement the skip-gram model with
negative sampling

• Main idea: train binary logistic regressions to differentiate a true pair (center word and
a word in its context window) versus several “noise” pairs (the center word paired with
a random word)

12
The skip-gram model with negative sampling (HW2)
• Introduced in: “Distributed Representations of Words and Phrases and their
Compositionality” (Mikolov et al. 2013)
• Overall objective function (they maximize), summed over all center–outside pairs in the corpus:

  $\log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\big[\log \sigma(-u_j^\top v_c)\big]$

  (sigmoid rather than softmax)

• The logistic/sigmoid function (we'll become good friends soon):

  $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

• We maximize the probability of two words co-occurring in the first log term and minimize the probability of noise words in the second part
13
The skip-gram model with negative sampling (HW2)
• Using notation consistent with this class and HW2:

  $J_{\text{neg-sample}}(\boldsymbol{u}_o, \boldsymbol{v}_c, U) = -\log \sigma(\boldsymbol{u}_o^\top \boldsymbol{v}_c) \;-\; \sum_{k \in \{K \text{ sampled indices}\}} \log \sigma(-\boldsymbol{u}_k^\top \boldsymbol{v}_c)$

  $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
• We take k negative samples (using word probabilities)
• Maximize probability that real outside word appears;
minimize probability that random words appear around center word

• Sample with $P(w) = U(w)^{3/4} / Z$: the unigram distribution $U(w)$ raised to the 3/4 power
  (We provide this function in the starter code.)
• The power makes less frequent words be sampled more often
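A minimal NumPy sketch of this loss and of the 3/4-power sampling distribution (illustrative only: the gradients HW2 asks for are omitted, and `unigram_counts` is an assumed input, not starter-code API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_o, v_c, U_neg):
    """J = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c); U_neg is (k, d)."""
    loss = -np.log(sigmoid(u_o @ v_c))
    loss -= np.log(sigmoid(-U_neg @ v_c)).sum()
    return loss

def neg_sampling_dist(unigram_counts):
    """P(w) proportional to U(w)^(3/4): rarer words get sampled relatively more often."""
    weights = unigram_counts ** 0.75
    return weights / weights.sum()
```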
14
Stochastic gradients with negative sampling [aside]
• We iteratively take gradients at each window for SGD
• In each window, we only have at most 2m + 1 words plus 2km negative words with negative sampling, so $\nabla_\theta J_t(\theta)$ is very sparse!

15
Stochastic gradients with negative sampling [aside]
• We might only update the word vectors that actually appear!

• Solution: either you need sparse matrix update operations to only update certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors
  (Rows, not columns, in actual deep learning packages!)
• If you have millions of word vectors and do distributed computing, it is important not to have to send gigantic updates around!

16
3. Why not capture co-occurrence counts directly?
There’s something weird about iterating through the whole corpus (perhaps many times);
why don’t we just accumulate all the statistics of what words appear near each other?!?

Building a co-occurrence matrix X


• 2 options: windows vs. full document
• Window: Similar to word2vec, use a window around each word → captures some syntactic and semantic information ("word space")
• Word-document co-occurrence matrix will give general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis" ("document space")

17
Example: Window based co-occurrence matrix
• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)

• Example corpus:
  • I like deep learning
  • I like NLP
  • I enjoy flying

counts     I   like  enjoy  deep  learning  NLP  flying  .
I          0    2      1     0       0       0     0     0
like       2    0      0     1       0       1     0     0
enjoy      1    0      0     0       0       0     1     0
deep       0    1      0     0       1       0     0     0
learning   0    0      0     1       0       0     0     1
NLP        0    1      0     0       0       0     0     1
flying     0    0      1     0       0       0     0     1
.          0    0      0     0       1       1     1     0
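A short sketch that reproduces this window-1 matrix from the example corpus (whitespace tokenization, with the final period treated as a token, as in the table above):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = sorted({w for sent in corpus for w in sent.split()})
word2idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    tokens = sent.split()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                          # count symmetric co-occurrences
                X[word2idx[w], word2idx[tokens[j]]] += 1
```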

18
Co-occurrence vectors
• Simple count co-occurrence vectors
• Vectors increase in size with vocabulary
• Very high dimensional: require a lot of storage (though sparse)
• Subsequent classification models have sparsity issues → Models are less robust

• Low-dimensional vectors
• Idea: store “most” of the important information in a fixed, small number of
dimensions: a dense vector
• Usually 25–1000 dimensions, similar to word2vec
• How to reduce the dimensionality?

19
Classic Method: Dimensionality Reduction on X (HW1)
Singular Value Decomposition of co-occurrence matrix X
Factorizes X into $U \Sigma V^\top$, where U and V are orthonormal (unit vectors and orthogonal)

[Figure: the SVD $X = U \Sigma V^\top$, truncated to the first k singular values]

Retain only k singular values, in order to generalize.

$\hat{X}_k$ is the best rank-k approximation to X, in terms of least squares.
Classic linear algebra result. Expensive to compute for large matrices.
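A sketch of the rank-k reduction with NumPy's dense SVD (for a realistically large, sparse X you would likely reach for a truncated solver such as scipy.sparse.linalg.svds instead):

```python
import numpy as np

def reduce_to_k_dim(X, k=2):
    """Keep the top-k singular values/vectors of the co-occurrence matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt
    word_vectors = U[:, :k] * s[:k]                    # rows: k-dimensional word vectors
    X_k = word_vectors @ Vt[:k, :]                     # best rank-k approximation (least squares)
    return word_vectors, X_k
```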
20
Hacks to X (several used in Rohde et al. 2005 in COALS)
• Running an SVD on raw counts doesn’t work well!!!

• Scaling the counts in the cells can help a lot


• Problem: function words (the, he, has) are too frequent → syntax has too much impact. Some fixes:
• log the frequencies
• min(X, t), with t ≈ 100
• Ignore the function words
• Ramped windows that count closer words more than further away words
• Use Pearson correlations instead of counts, then set negative values to 0
• Etc.

21
Rohde, Gonnerman, Plaut Modeling Word Meaning Using Lexical Co-Occurrence
Interesting semantic patterns emerge in the scaled vectors
[Figure 13: Multidimensional scaling for nouns and their associated verbs. Labels include the nouns DRIVER, JANITOR, SWIMMER, STUDENT, TEACHER, DOCTOR, BRIDE, PRIEST and the verbs DRIVE, SWIM, CLEAN, LEARN, TEACH, MARRY, TREAT, PRAY.]

COALS model from Rohde et al. ms., 2005. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence

[Table 10: The 10 nearest neighbors and their percent correlation similarities for a set of nouns (gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet), under the COALS-14K model.]

22
GloVe [Pennington, Socher, and Manning, EMNLP 2014]:
Encoding meaning components in vector differences

Q: How can we capture ratios of co-occurrence probabilities as


linear meaning components in a word vector space?

A: Log-bilinear model:   $w_i \cdot w_j = \log P(i \mid j)$

with vector differences:   $w_x \cdot (w_a - w_b) = \log \dfrac{P(x \mid a)}{P(x \mid b)}$

Loss:   $J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

• Fast training
• Scalable to huge corpora
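For reference, a sketch of this objective with the weighting function f that caps the influence of very frequent pairs (x_max = 100 and alpha = 3/4 are the defaults reported in the GloVe paper; all parameter names here are illustrative):

```python
import numpy as np

def glove_weight(X, x_max=100.0, alpha=0.75):
    """f(X_ij) = (X_ij / x_max)^alpha, capped at 1 for very frequent pairs."""
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    nonzero = X > 0
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    diff = pred - np.log(np.where(nonzero, X, 1.0))    # take the log only where X_ij > 0
    return np.sum(glove_weight(X) * nonzero * diff ** 2)
```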
4. How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
• Evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to real task is established
• Extrinsic:
• Evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem, or its interaction with other subsystems
• If replacing exactly one subsystem with another improves accuracy → Winning!

25
Intrinsic word vector evaluation
• Word Vector Analogies

a:b :: c:?

man:woman :: king:?

• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search (!)
• Problem: What if the information is there but not linear?

[Figure: man : woman :: king : ? illustrated as vector offsets between man, woman, and king in the embedding space]
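A sketch of the analogy evaluation by vector arithmetic and cosine similarity, with the three input words discarded from the search (the vector and vocabulary formats are assumptions):

```python
import numpy as np

def analogy(word_vecs, vocab, a, b, c):
    """Return the word d maximizing cosine(x_b - x_a + x_c, x_d), excluding a, b, c."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = word_vecs[idx[b]] - word_vecs[idx[a]] + word_vecs[idx[c]]
    normed = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    sims = normed @ (target / np.linalg.norm(target))
    for w in (a, b, c):                 # discard the input words from the search (!)
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# e.g. analogy(vecs, vocab, "man", "woman", "king") should ideally return "queen"
```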

26
GloVe Visualization

27
Meaning similarity: Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353  http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Word 1      Word 2     Human (mean)
tiger       cat        7.35
tiger       tiger      10
book        paper      7.46
computer    internet   7.58
plane       car        5.77
professor   doctor     6.62
stock       phone      1.62
stock       CD         1.31
stock       jaguar     0.92
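A sketch of this evaluation: cosine similarity per word pair, then Spearman's rank correlation against the human scores (using scipy.stats.spearmanr; the input format is an assumption):

```python
import numpy as np
from scipy.stats import spearmanr

def wordsim_eval(word_vecs, word2idx, pairs):
    """pairs: iterable of (word1, word2, human_score), e.g. rows of WordSim353."""
    cosines, humans = [], []
    for w1, w2, score in pairs:
        v1, v2 = word_vecs[word2idx[w1]], word_vecs[word2idx[w2]]
        cosines.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        humans.append(score)
    rho, _ = spearmanr(cosines, humans)
    return rho
```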
28
Correlation evaluation
• Word vector distances and their correlation with human judgments

Table 3: Spearman rank correlation on word similarity tasks. All vectors are 300-dimensional. The CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors.

Model    Size   WS353  MC    RG    SCWS  RW
SVD      6B     35.3   35.1  42.5  38.3  25.6
SVD-S    6B     56.5   71.5  71.0  53.6  34.7
SVD-L    6B     65.7   72.7  75.1  56.5  37.0
CBOW†    6B     57.2   65.6  68.2  57.0  32.5
SG†      6B     62.8   65.2  69.7  58.1  37.2
GloVe    6B     65.8   72.7  77.8  53.9  38.1
SVD-L    42B    74.0   76.4  74.1  58.3  39.9
GloVe    42B    75.9   83.6  82.9  59.6  47.8
CBOW*    100B   68.4   79.6  75.4  59.4  45.5

29
Extrinsic word vector evaluation
• One example where good word vectors should help directly: named entity recognition (identifying references to a person, organization or location): Chris Manning lives in Palo Alto.

Table 4: F1 score on NER task with 50d vectors. Discrete is the baseline without word vectors. We use publicly-available vectors for HPCA, HSMN, and CW. See text for details.

Model     Dev   Test  ACE   MUC7
Discrete  91.0  85.4  77.4  73.4
SVD       90.8  85.7  77.3  73.7
SVD-S     91.0  85.5  77.6  74.3
SVD-L     90.5  84.8  73.6  71.5
HPCA      92.6  88.7  81.7  80.7
HSMN      90.5  85.7  78.7  74.7
CW        92.2  87.4  81.7  80.2
CBOW      93.1  88.2  82.2  81.1
GloVe     93.2  88.3  82.9  82.2

[Figure 3: Semantic, syntactic, and overall accuracy (%) on the analogy task for 300-dimensional vectors trained on different corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Wiki2014 + Gigaword5 (6B tokens), Common Crawl (42B tokens). Wikipedia entries are updated to assimilate new knowledge, whereas Gigaword is a fixed news repository with outdated and possibly incorrect information.]

30
5. Word senses and word sense ambiguity
• Most words have lots of meanings!
• Especially common words
• Especially words that have existed for a long time

• Example: pike

• Does one vector capture all these meanings or do we have a mess?

31
pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one’s way (pike along)
• In Australian English, pike means to pull out from doing something: I reckon he could
have climbed that cliff, but he piked!

32
Improving Word Representations Via Global Context And
Multiple Word Prototypes (Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each word assigned to multiple
different clusters bank1, bank2, etc.

33
Linear Algebraic Structure of Word Senses, with
Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted
sum) in standard word embeddings like word2vec
• $v_{\text{pike}} = \alpha_1 v_{\text{pike}_1} + \alpha_2 v_{\text{pike}_2} + \alpha_3 v_{\text{pike}_3}$
• Where $\alpha_1 = \dfrac{f_1}{f_1 + f_2 + f_3}$, etc., for frequency $f$
• Surprising result:
• Because of ideas from sparse coding you can actually separate out
the senses (providing they are relatively common)!

34
6. Deep Learning Classification: Named Entity Recognition (NER)
• The task: find and classify names in text, by labeling word tokens, for example:

Last night , Paris Hilton wowed in a sequin gown .
             PER   PER

Samuel Quinn was arrested in the Hilton Hotel in Paris in April 1989 .
PER    PER                       LOC    LOC      LOC      DATE  DATE

• Possible uses:
• Tracking mentions of particular entities in documents
• For question answering, answers are usually named entities
• Relating sentiment analysis to the entity under discussion
• Often followed by Entity Linking/Canonicalization into a Knowledge Base such as Wikidata

35
Simple NER: Window classification using binary logistic classifier
• Idea: classify each word in its context window of neighboring words
• Train logistic classifier on hand-labeled data to classify center word {yes/no} for each
class based on a concatenation of word vectors in a window
• Really, we usually use multi-class softmax, but we’re trying to keep it simple :)
• Example: Classify “Paris” as +/– location in context of sentence with window length 2:

the museums in Paris are amazing to see .

$x_{\text{window}} = [\, x_{\text{museums}} \; x_{\text{in}} \; x_{\text{Paris}} \; x_{\text{are}} \; x_{\text{amazing}} \,]^\top$

• Resulting vector $x_{\text{window}} = x \in \mathbb{R}^{5d}$


• To classify all words: run classifier for each class on the vector centered on each word
in the sentence
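A minimal sketch of building the window feature vector and scoring it with a binary logistic classifier (the padding token, window size m = 2, and parameter names are illustrative assumptions):

```python
import numpy as np

def window_features(sentence, center, word_vecs, word2idx, m=2, pad="<pad>"):
    """Concatenate the 2m+1 word vectors around position `center` into x in R^{(2m+1)d}."""
    padded = [pad] * m + sentence + [pad] * m
    window = padded[center : center + 2 * m + 1]    # center word plus m neighbors on each side
    return np.concatenate([word_vecs[word2idx[w]] for w in window])

def p_location(x, w, b):
    """Binary logistic classifier: P(center word is a LOCATION | window)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))
```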
36
NER: Binary classification for center word being location
• We do supervised training and want high score if it’s a location

$J_t(\theta) = \sigma(s) = \dfrac{1}{1 + e^{-s}}$   (the predicted model probability of the class)

f = some element-wise non-linear function, e.g., logistic (sigmoid), tanh, ReLU

$x = [\, x_{\text{museums}} \; x_{\text{in}} \; x_{\text{Paris}} \; x_{\text{are}} \; x_{\text{amazing}} \,]$
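Putting the pieces of this slide together, a sketch of the forward computation (the shapes and the choice of tanh are illustrative): a hidden layer h = f(Wx + b), a scalar score s = uᵀh, and the predicted probability σ(s).

```python
import numpy as np

def score_window(x, W, b, u, f=np.tanh):
    """h = f(W x + b);  s = u^T h;  J_t(theta) = sigma(s)."""
    h = f(W @ x + b)                   # element-wise non-linearity (logistic, tanh, ReLU, ...)
    s = u @ h                          # scalar score for "center word is a location"
    return 1.0 / (1.0 + np.exp(-s))    # predicted model probability of the class
```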


40
Remember: Stochastic Gradient Descent
Update equation:

  $\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$

  $\alpha$ = step size or learning rate

i.e., for each parameter:   $\theta_j^{new} = \theta_j^{old} - \alpha \dfrac{\partial J(\theta)}{\partial \theta_j^{old}}$

In deep learning, 𝜃 includes the data representation (e.g., word vectors) too!

How can we compute $\nabla_\theta J(\theta)$?


1. By hand
2. Algorithmically: the backpropagation algorithm (next lecture!)
43
7. Neural computation

44
A binary logistic regression unit is a bit similar to a neuron
f = nonlinear activation function (e.g. sigmoid), w = weights, b = bias, h = hidden, x = inputs

$h_{w,b}(x) = f(w^\top x + b)$

$f(z) = \dfrac{1}{1 + e^{-z}}$

• b: We can have an “always on” bias feature, which gives a class prior, or separate it out, as a bias term
• w, b are the parameters of this neuron, i.e., of this logistic regression model
45
A neural network
= running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get
a vector of outputs …

But we don’t have to decide


ahead of time what variables
these logistic regressions are
trying to predict!

46
A neural network
= running several logistic regressions at the same time
… which we can feed into another logistic regression function, giving composed functions

It is the loss function


that will direct what
the intermediate
hidden variables should
be, so as to do a good
job at predicting the
targets for the next
layer, etc.

47
A neural network
= running several logistic regressions at the same time
Before we know it, we have a multilayer neural network….

This allows us to
re-represent and
compose our data
multiple times and to
learn a classifier that is
highly non-linear in
terms of the original
inputs
(but typically is linear in terms of
the pre-final layer representations)

48
Matrix notation for a layer

We have:
  $a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$
  $a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)$
  etc.

In matrix notation:
  $z = Wx + b$
  $a = f(z)$

Activation f is applied element-wise:
  $f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$
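The same layer in code, as a small sketch: one matrix multiply plus a bias, with the activation applied element-wise (here the logistic function, but any element-wise f works):

```python
import numpy as np

def layer_forward(W, b, x, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """a = f(Wx + b), with f applied element-wise."""
    z = W @ x + b
    return f(z)

# Example shapes: W (3, 3), x (3,), b (3,)  ->  a (3,)
```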
49
Non-linearities (like f or sigmoid): Why they’re needed
• Neural networks do function approximation,
e.g., regression or classification
• Without non-linearities, deep neural networks
can’t do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: $W_1 W_2 x = Wx$
• But, with more layers that include non-linearities,
they can approximate more complex functions!

50
