CS224N 2024 Lecture 02: Word Vectors 2
Diyi Yang
Lecture 2: Word Vectors, Word Senses, and Neural Classifiers
Lecture Plan
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
1. Course organization (3 mins)
2. Finish looking at word vectors and word2vec (15 mins)
3. Can we capture the essence of word meaning more effectively by counting? (10m)
4. Evaluating word vectors (10 mins)
5. Word senses (8 mins)
6. Review of classification and how neural nets differ (14 mins)
7. Introducing neural networks (10 mins)
Key Goal: To be able to read word embeddings papers by the end of class
2
1. Course Organization
• Come to office hours/help sessions!
• They started yesterday
• Come to discuss final project ideas as well as the assignments
• Try to come early, often and off-cycle!
• TA office hours: 3-hour blocks Mon–Sat, with multiple TAs
• Just show up! Our friendly course staff will be on hand to assist you!
• https://web.stanford.edu/class/cs224n/office_hours.html
• Instructors’ office hours (in person by default):
• Diyi: Tuesdays 3-4pm
• Tatsu: Fridays 3-4pm
3
2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word position in the whole corpus
!"#(%!" &# )
• Try to predict surrounding words using word vectors: 𝑃 𝑜 𝑐 = ∑$∈& !"#(%$"& )
#
𝑃 𝑤!%$ | 𝑤! 𝑃 𝑤!"$ | 𝑤!
𝑃 𝑤!%# | 𝑤! 𝑃 𝑤!"# | 𝑤!
• Learning: Update vectors so they can predict actual surrounding words better
• Doing no more than this, the algorithm learns word vectors that capture word similarity and meaningful directions in the word space well!
5
2. Optimization: Gradient Descent
• We have a cost function 𝐽 𝜃 we want to minimize
• Gradient Descent is an algorithm to minimize 𝐽 𝜃
• Idea: for current value of 𝜃, calculate gradient of 𝐽 𝜃 , then take small step in direction
of negative gradient. Repeat.
Note: Our objectives may not be convex like this ☹
6
Gradient Descent
• Update equation (in matrix notation): θ_new = θ_old − α ∇_θ J(θ), where α is the step size (learning rate)
• Algorithm: repeatedly evaluate the gradient on the full corpus and apply the update (see the sketch below)
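A minimal sketch of this update rule in Python/NumPy, assuming a hypothetical grad_J function that returns ∇_θ J(θ) for the full objective (illustrative names, not assignment code):

```python
import numpy as np

def gradient_descent(grad_J, theta, alpha=0.01, n_steps=1000):
    """Batch gradient descent: theta_new = theta_old - alpha * grad_J(theta)."""
    for _ in range(n_steps):
        theta = theta - alpha * grad_J(theta)   # step in the direction of the negative gradient
    return theta

# Toy usage: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta0 = np.array([3.0, -2.0])
print(gradient_descent(lambda th: 2 * th, theta0))   # approaches [0, 0]
```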
7
Stochastic Gradient Descent
• Problem: J(θ) is a function of all windows in the corpus (potentially billions!)
• So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!
• Solution: Stochastic gradient descent (SGD): repeatedly sample windows, and update after each one (or each small batch), as sketched below
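A minimal SGD sketch, assuming a hypothetical grad_J_window function that returns the gradient contributed by a single sampled window (names illustrative, not the HW2 interface):

```python
import random
import numpy as np

def sgd(grad_J_window, windows, theta, alpha=0.05, n_steps=10_000):
    """Stochastic gradient descent: update after each sampled window, not the whole corpus."""
    for _ in range(n_steps):
        window = random.choice(windows)                        # sample one window (or a small batch)
        theta = theta - alpha * grad_J_window(theta, window)   # cheap, noisy gradient step
    return theta

# Toy usage: each "window" contributes gradient 2 * (theta - target).
windows = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
theta = sgd(lambda th, w: 2 * (th - w), windows, np.zeros(2))
print(theta)   # hovers around the mean of the targets
```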
8
Word2vec parameters … and computations
[Figure: the two parameter matrices, U (outside word vectors) and V (center word vectors), each with one d-dimensional row per vocabulary word]
Computation: the dot products U v_c (one score per outside word), then softmax(U v_c) gives the probabilities
“Bag of words” model! The model makes the same predictions at each position
We want a model that gives a reasonably high
probability estimate to all words that occur in the
context (at all often)
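A toy illustration (not the HW2 code) of how these probabilities are computed from random U and V matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d = 8, 5                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(V_size, d))      # outside ("context") vectors, one row per word
V = rng.normal(size=(V_size, d))      # center vectors, one row per word

c = 3                                 # index of the center word
scores = U @ V[c]                     # dot products u_o^T v_c for every word o
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: P(o | c) for all o

# The same distribution is predicted at every position in the window ("bag of words").
print(probs.round(3), probs.sum())
```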
9
Word2vec maximizes objective function by
putting similar words nearby in space
10
Word2vec algorithm family (Mikolov et al. 2013): More details
Why two vectors? → Easier optimization. Average both at the end
• But can implement the algorithm with just one vector per word … and it helps a bit
Two model variants:
1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
We presented: Skip-gram model
"& )
!"#(%! #
• 𝑃 𝑜𝑐 = "& )
∑$∈& !"#(%$ A big sum over words
#
• Hence, in standard word2vec and HW2 you implement the skip-gram model with
negative sampling
• Main idea: train binary logistic regressions to differentiate a true pair (center word and
a word in its context window) versus several “noise” pairs (the center word paired with
a random word)
12
The skip-gram model with negative sampling (HW2)
• Introduced in: “Distributed Representations of Words and Phrases and their
Compositionality” (Mikolov et al. 2013)
• Overall objective function (they maximize, per window):
  J_t(θ) = log σ(u_o^T v_c) + Σ_{i=1}^{k} E_{j ~ P(w)} [log σ(−u_j^T v_c)]
• Uses the sigmoid σ(x) = 1 / (1 + e^{−x}) rather than softmax
• We take k negative samples (using word probabilities)
• Maximize probability that real outside word appears;
minimize probability that random words appear around center word
• Sample with P(w) = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power
  (We provide this function in the starter code.)
• The 3/4 power makes less frequent words be sampled more often than their raw unigram frequency would suggest
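A toy sketch of the sampling distribution and the per-window negative-sampling loss, with made-up counts and random vectors (not the starter-code implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy word frequencies -> sampling distribution P(w) = U(w)^(3/4) / Z
counts = np.array([120.0, 40.0, 10.0, 5.0, 1.0])
P = counts ** 0.75
P = P / P.sum()                       # rarer words get a relative boost vs. raw unigram frequency

d, k = 5, 3                           # embedding size, number of negative samples
U_out = rng.normal(size=(len(counts), d))    # outside-word vectors, one per vocabulary word
v_c = rng.normal(size=d)              # center word vector
u_o = rng.normal(size=d)              # vector of the true outside word

neg_idx = rng.choice(len(counts), size=k, p=P)   # sample k "noise" words from P(w)
U_neg = U_out[neg_idx]

# Per-window loss (minimized form): -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
loss = -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-U_neg @ v_c)))
print(loss)
```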
14
Stochastic gradients with negative sampling [aside]
• We iteratively take gradients at each window for SGD
• In each window, we only have at most 2m + 1 words plus 2km negative words with negative sampling, so ∇_θ J_t(θ) is very sparse!
15
Stochastic gradients with negative sampling [aside]
• We might only update the word vectors that actually appear!
• Solution: either you need sparse matrix update operations to only update certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors
  (Rows, not columns, in actual deep learning packages!)
• If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around! (See the sketch below.)
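A minimal sketch of row-wise updates with hypothetical gradients, just to show that only the touched rows of U and V change:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 10_000, 50
U = rng.normal(scale=0.1, size=(vocab, d))   # outside vectors
V = rng.normal(scale=0.1, size=(vocab, d))   # center vectors

center = 42
appearing = np.array([7, 99, 1234, 8765])    # context + negative words touched in this window
alpha = 0.025

# Hypothetical gradients for just these rows (in practice, computed from the SGNS loss):
grad_v = rng.normal(size=d)
grad_U_rows = rng.normal(size=(len(appearing), d))

# Update only the touched rows, never the full |V| x d matrices.
V[center] -= alpha * grad_v
U[appearing] -= alpha * grad_U_rows
```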
16
3. Why not capture co-occurrence counts directly?
There’s something weird about iterating through the whole corpus (perhaps many times);
why don’t we just accumulate all the statistics of what words appear near each other?!?
17
Example: Window based co-occurrence matrix
• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)
• Example corpus:
  • I like deep learning
  • I like NLP
  • I enjoy flying

counts      I   like  enjoy  deep  learning  NLP  flying  .
I           0    2     1      0      0        0     0     0
like        2    0     0      1      0        1     0     0
enjoy       1    0     0      0      0        0     1     0
deep        0    1     0      0      1        0     0     0
learning    0    0     0      1      0        0     0     1
NLP         0    1     0      0      0        0     0     1
flying      0    0     1      0      0        0     0     1
.           0    0     0      0      1        1     1     0
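A short script that reproduces the table above from the three example sentences (assuming a "." token ends each sentence, as in the table):

```python
corpus = ["I like deep learning .",
          "I like NLP .",
          "I enjoy flying ."]
window = 1                                   # symmetric window of length 1, as on the slide

vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = [[0] * len(vocab) for _ in range(len(vocab))]

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                X[idx[w]][idx[words[j]]] += 1   # count co-occurrence in either direction

for w in vocab:
    print(w, X[idx[w]])
```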
18
Co-occurrence vectors
• Simple count co-occurrence vectors
• Vectors increase in size with vocabulary
• Very high dimensional: require a lot of storage (though sparse)
• Subsequent classification models have sparsity issues → models are less robust
• Low-dimensional vectors
• Idea: store “most” of the important information in a fixed, small number of
dimensions: a dense vector
• Usually 25–1000 dimensions, similar to word2vec
• How to reduce the dimensionality?
19
Classic Method: Dimensionality Reduction on X (HW1)
Singular Value Decomposition of co-occurrence matrix X
Factorizes X into UΣVᵀ, where U and V are orthonormal (unit-length, orthogonal columns)
Retain only the top k singular values to get a low-dimensional approximation of X
[Figure: reduced SVD keeping the first k columns of U and the top k singular values]
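A sketch of a rank-k SVD of a toy (random) co-occurrence matrix using NumPy; real pipelines typically scale or weight X first:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(8, 8)).astype(float)   # stand-in for a co-occurrence matrix

U, s, Vt = np.linalg.svd(X)            # X = U Sigma V^T, with U and V orthonormal
k = 2                                  # keep only the top-k singular values
word_vectors = U[:, :k] * s[:k]        # low-dimensional word representations
X_hat = word_vectors @ Vt[:k, :]       # best rank-k approximation of X (least-squares sense)

print(np.linalg.norm(X - X_hat))       # reconstruction error left after truncation
```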
21
Rohde, Gonnerman, Plaut Modeling Word Meaning Using Lexical Co-Occurrence
Interesting semantic patterns emerge in the scaled vectors
[Figure 13: Multidimensional scaling for nouns and their associated verbs; words shown include DRIVER, JANITOR, SWIMMER, STUDENT, TEACHER, DOCTOR, BRIDE, PRIEST and the verbs DRIVE, CLEAN, SWIM, LEARN, TEACH, MARRY, TREAT, PRAY]
COALS model from
Rohde et al. ms., 2005. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
[Table 10: The 10 nearest neighbors and their percent correlation similarities for a set of nouns (gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet), under the COALS-14K model]
22
GloVe [Pennington, Socher, and Manning, EMNLP 2014]:
Encoding meaning components in vector differences
A: A log-bilinear model: w_i · w_j = log P(i | j), so that vector differences capture ratios of co-occurrence probabilities: w_x · (w_a − w_b) = log [P(x | a) / P(x | b)]
Loss: J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)²
• Fast training
• Scalable to huge corpora
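A toy computation of this loss with random parameters, assuming the standard GloVe weighting function f(x) = min((x/x_max)^α, 1); names and sizes are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d = 6, 4
W = rng.normal(scale=0.1, size=(V_size, d))         # word vectors w_i
W_tilde = rng.normal(scale=0.1, size=(V_size, d))   # context vectors w~_j
b = np.zeros(V_size)                                 # biases b_i
b_tilde = np.zeros(V_size)                           # biases b~_j
X = rng.poisson(2.0, size=(V_size, V_size)) + 1.0    # toy co-occurrence counts (kept positive)

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights very frequent co-occurrences."""
    return np.minimum((x / x_max) ** alpha, 1.0)

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
diff = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - np.log(X)
J = np.sum(f(X) * diff ** 2)
print(J)
```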
4. How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
• Evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to real task is established
• Extrinsic:
• Evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem, or its interaction with other subsystems
• If replacing exactly one subsystem with another improves accuracy → Winning!
25
Intrinsic word vector evaluation
• Word Vector Analogies: a : b :: c : ?
  e.g., man : woman :: king : ?
• Evaluate by finding the word d whose vector is most cosine-similar to x_b − x_a + x_c (excluding the query words), as in the sketch below
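A sketch of the analogy evaluation via cosine similarity; the vectors here are random, so the answer is meaningless, but the procedure is the one used with trained embeddings:

```python
import numpy as np

def analogy(a, b, c, vectors, words):
    """Return the word d maximizing cosine similarity to (x_b - x_a + x_c), excluding a, b, c."""
    x = vectors[words.index(b)] - vectors[words.index(a)] + vectors[words.index(c)]
    sims = vectors @ x / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(x) + 1e-8)
    for i in np.argsort(-sims):          # best candidates first
        if words[i] not in (a, b, c):
            return words[i]

# Toy usage with random vectors (real evaluation uses trained embeddings):
rng = np.random.default_rng(0)
words = ["man", "woman", "king", "queen", "apple"]
vectors = rng.normal(size=(len(words), 8))
print(analogy("man", "woman", "king", vectors, words))
```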
26
GloVe Visualization
27
Meaning similarity: Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
[Table: results of count-based models (column headers lost in extraction):
  Discrete  91.0  85.4  77.4  73.4
  SVD       90.8  85.7  77.3  73.7
  SVD-S     91.0  85.5  77.6  74.3
  SVD-L     90.5  84.8  73.6  71.5]
[Plot: accuracy (%) vs. training corpus: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Wiki2014 + Gigaword5 (6B tokens), Common Crawl (42B tokens)]
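A sketch of the similarity-correlation evaluation, assuming SciPy is available and using made-up (word1, word2, human score) triples in the style of WordSim353, with random stand-in embeddings:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

# Hypothetical (word1, word2, human similarity score) triples, as in WordSim353.
pairs = [("tiger", "cat", 7.35), ("computer", "keyboard", 7.62), ("stock", "jaguar", 0.92)]
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for p in pairs for w in p[:2]}   # stand-in embeddings

model_scores = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
human_scores = [s for _, _, s in pairs]
rho, _ = spearmanr(model_scores, human_scores)   # rank correlation with human judgments
print(rho)
```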
5. Word senses and word sense ambiguity
• Example: pike
31
pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one’s way (pike along)
• In Australian English, pike means to pull out from doing something: I reckon he could
have climbed that cliff, but he piked!
32
Improving Word Representations Via Global Context And
Multiple Word Prototypes (Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each word assigned to multiple
different clusters bank1, bank2, etc.
33
Linear Algebraic Structure of Word Senses, with
Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted
sum) in standard word embeddings like word2vec
• v_pike = α₁ v_pike1 + α₂ v_pike2 + α₃ v_pike3
• Where α₁ = f₁ / (f₁ + f₂ + f₃), etc., for the frequency f of each sense
• Surprising result:
• Because of ideas from sparse coding, you can actually separate out the senses (provided they are relatively common)!
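A toy illustration of the superposition with hypothetical sense vectors and frequencies:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
v_pike1 = rng.normal(size=d)    # "fish" sense (hypothetical sense vectors)
v_pike2 = rng.normal(size=d)    # "weapon" sense
v_pike3 = rng.normal(size=d)    # "road" sense

f = np.array([60.0, 30.0, 10.0])           # relative frequencies of the senses
alpha = f / f.sum()                         # alpha_i = f_i / (f_1 + f_2 + f_3)
v_pike = alpha[0]*v_pike1 + alpha[1]*v_pike2 + alpha[2]*v_pike3   # weighted superposition

# The blended vector stays closest to the most frequent sense:
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print([round(cos(v_pike, s), 2) for s in (v_pike1, v_pike2, v_pike3)])
```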
34
6. Deep Learning Classification: Named Entity Recognition (NER)
• The task: find and classify names in text, by labeling word tokens, for example:
• Possible uses:
• Tracking mentions of particular entities in documents
• For question answering, answers are usually named entities
• Relating sentiment analysis to the entity under discussion
• Often followed by Entity Linking/Canonicalization into a Knowledge Base such as Wikidata
35
Simple NER: Window classification using binary logistic classifier
• Idea: classify each word in its context window of neighboring words
• Train logistic classifier on hand-labeled data to classify center word {yes/no} for each
class based on a concatenation of word vectors in a window
• Really, we usually use multi-class softmax, but we’re trying to keep it simple ☺
• Example: Classify “Paris” as +/– location in context of sentence with window length 2:
J_t(θ) = σ(s) = 1 / (1 + e^{−s})   ← the predicted model probability of the class
f = some element-wise non-linear function, e.g., logistic (sigmoid), tanh, ReLU
i.e., for each parameter: θ_j^{new} = θ_j^{old} − α ∂J(θ)/∂θ_j^{old}
In deep learning, 𝜃 includes the data representation (e.g., word vectors) too!
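A toy forward pass for this window classifier with random parameters (dimensions and names are illustrative, not the assignment's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                                # toy embedding size, window of 2 each side

# Hypothetical embeddings for "museums in Paris are amazing"
words = ["museums", "in", "Paris", "are", "amazing"]
emb = {w: rng.normal(size=d) for w in words}

x = np.concatenate([emb[w] for w in words])          # x in R^{5d}: concatenated window
W = rng.normal(scale=0.1, size=(8, x.size))          # hidden-layer weights
b = np.zeros(8)
u = rng.normal(scale=0.1, size=8)                    # scoring vector

h = np.maximum(0, W @ x + b)                         # f = element-wise non-linearity (ReLU here)
s = u @ h                                            # score for "center word is a location"
prob = 1.0 / (1.0 + np.exp(-s))                      # sigma(s): predicted probability of the class
print(prob)
```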
44
A binary logistic regression unit is a bit similar to a neuron
f = nonlinear activation function (e.g. sigmoid), w = weights, b = bias, h = hidden, x = inputs
46
A neural network
= running several logistic regressions at the same time
… which we can feed into another logistic regression function, giving composed functions
47
A neural network
= running several logistic regressions at the same time
Before we know it, we have a multilayer neural network….
This allows us to
re-represent and
compose our data
multiple times and to
learn a classifier that is
highly non-linear in
terms of the original
inputs
(but typically is linear in terms of
the pre-final layer representations)
48
Matrix notation for a layer
We have
We have
  a₁ = f(W₁₁x₁ + W₁₂x₂ + W₁₃x₃ + b₁)
  a₂ = f(W₂₁x₁ + W₂₂x₂ + W₂₃x₃ + b₂)
  etc.
In matrix notation:
  z = Wx + b
  a = f(z)
Activation f is applied element-wise:
  f([z₁, z₂, z₃]) = [f(z₁), f(z₂), f(z₃)]
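A quick NumPy check that the matrix form gives the same result as the neuron-by-neuron form:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # inputs x1..x3
W = rng.normal(size=(3, 3))            # weights W_ij
b = rng.normal(size=3)                 # biases b1..b3

f = np.tanh                            # any element-wise non-linearity

z = W @ x + b                          # z = Wx + b
a = f(z)                               # a = f(z), applied element-wise

# Same result computed neuron by neuron:
a1 = f(W[0, 0]*x[0] + W[0, 1]*x[1] + W[0, 2]*x[2] + b[0])
print(np.allclose(a[0], a1))           # True
```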
49
Non-linearities (like f or sigmoid): Why they’re needed
• Neural networks do function approximation,
e.g., regression or classification
• Without non-linearities, deep neural networks
can’t do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: W₁W₂x = Wx
• But, with more layers that include non-linearities,
they can approximate more complex functions!
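A quick numerical check of this collapse:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(3, 4))

# Two linear layers with no non-linearity collapse into one: W2 (W1 x) = (W2 W1) x
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # True

# With a non-linearity in between, the map can no longer be written as a single matrix:
nonlinear = W2 @ np.tanh(W1 @ x)
print(nonlinear)
```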
50