3 Word Meaning Representation
Word Meaning
Words as Vectors
Word2Vec
Skip-Gram
GloVe
FastText
Word Vector Evaluation
Words as Vectors
WordNet
Synonyms and hypernyms ("is a" relationships)
4
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/cs224n-2020-lecture01-wordvecs1.pdf
Representing words as discrete symbols
• When a word w appears in a text, its context is the set of words that appear
nearby (within a fixed-size window).
• We use the many contexts of w to build up a representation of w
6
Distributional hypothesis
7
Distributional hypothesis
8
Words as vectors
• We’ll build a new model of meaning focusing on similarity
Each word is a vector
Similar words are “nearby in space”
• A first solution: we can just use context vectors to represent the meaning of
words!
word-word co-occurrence matrix:
9
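The co-occurrence matrix itself is shown as a figure on the slide. As a rough illustration, here is a minimal Python sketch (the toy corpus and window size are made up for illustration) of how such a word-word co-occurrence matrix can be counted:

# Count words that appear within a fixed-size window of each other.
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[w][tokens[j]] += 1
    return counts

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
counts = cooccurrence_counts(corpus)
print(dict(counts["like"]))   # {'i': 2, 'deep': 1, 'learning': 1, 'nlp': 1}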
Words as vectors
10
Sparse vs dense vectors
• Still, the vectors we get from the word-word co-occurrence matrix are sparse (mostly 0's) & long (vocabulary size)
• Alternative: we want to represent words as short (50-300 dimensional) &
dense (real-valued) vectors
Our focus in this lecture
The basis of all modern NLP systems
11
Dense Vectors
12
Word meaning as a neural word vector – visualization
13
Why dense vectors?
• Short vectors are easier to use as features in ML systems
• Dense vectors may generalize better than storing explicit counts
• They do better at capturing synonymy
w1 co-occurs with “car”, w2 co-occurs with “automobile”
14
SVD
Singular Value Decomposition
• Problem:
• #1: Find concepts in data
• #2: Reduce dimensionality
16
Recommender Systems, Lior Rokach
SVD - Definition
17
Manning and Raghavan, 2004
SVD - Properties
18
Manning and Raghavan, 2004
SVD - Properties
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0                          l1  0           v1
5 5 5 0 0    =    [u1  u2]    x    0   l2     x    v2
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
19
Manning and Raghavan, 2004
SVD - Interpretation
Projection:
• best axis to project on:
‘best’ = minimum sum of squares of projection errors
20
SVD - Example
• A = U L V^T - example:
Terms: data, inf., retrieval, brain, lung; rows 1-4 are CS documents, rows 5-7 are MD documents.

     data inf. retrieval brain lung
CS    1    1    1         0     0        0.18  0
CS    2    2    2         0     0        0.36  0
CS    1    1    1         0     0        0.18  0          9.64  0            0.58  0.58  0.58  0     0
CS    5    5    5         0     0    =   0.90  0      x   0     5.29     x   0     0     0     0.71  0.71
MD    0    0    0         2     2        0     0.53
MD    0    0    0         3     3        0     0.80
MD    0    0    0         1     1        0     0.27
21
SVD - Example
• A = U L V^T - example (same decomposition as above): U is the document-to-concept similarity matrix; its first column corresponds to the CS-concept and its second column to the MD-concept.
22
SVD - Example
• A = U L V^T - example (same decomposition as above): the diagonal entries of L give the 'strength' of each concept; 9.64 is the strength of the CS-concept and 5.29 of the MD-concept.
23
SVD - Example
• A = U L V^T - example (same decomposition as above): V^T is the term-to-concept similarity matrix; its first row corresponds to the CS-concept (data, inf., retrieval) and its second row to the MD-concept (brain, lung).
24
SVD – Dimensionality reduction
Full rank-2 decomposition:

1 1 1 0 0        0.18  0
2 2 2 0 0        0.36  0
1 1 1 0 0        0.18  0          9.64  0            0.58  0.58  0.58  0     0
5 5 5 0 0   =    0.90  0      x   0     5.29     x   0     0     0     0.71  0.71
0 0 0 2 2        0     0.53
0 0 0 3 3        0     0.80
0 0 0 1 1        0     0.27
25
SVD - Dimensionality reduction
Keep only the strongest concept (rank-1 approximation):

1 1 1 0 0        0.18
2 2 2 0 0        0.36
1 1 1 0 0        0.18
5 5 5 0 0   ~    0.90     x   9.64   x   0.58  0.58  0.58  0  0
0 0 0 2 2        0
0 0 0 3 3        0
0 0 0 1 1        0
26
SVD - Dimensionality reduction
A (original)          A~ (rank-1 reconstruction)
1 1 1 0 0             1 1 1 0 0
2 2 2 0 0             2 2 2 0 0
1 1 1 0 0             1 1 1 0 0
5 5 5 0 0      ~      5 5 5 0 0
0 0 0 2 2             0 0 0 0 0
0 0 0 3 3             0 0 0 0 0
0 0 0 1 1             0 0 0 0 0
27
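The example above can be reproduced numerically. This is a minimal numpy sketch (not part of the original slides) that computes the SVD of the 7x5 matrix and forms the rank-1 approximation shown on the last slide:

import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))          # [9.64 5.29 0. 0. 0.]

k = 1                          # keep only the top singular value/vectors
A_rank1 = U[:, :k] * s[:k] @ Vt[:k, :]
print(np.round(A_rank1, 2))    # CS rows reconstructed, MD rows become ~0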
Word2Vec
Word2Vec: Overview
• Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
• Idea:
We have a large corpus (“body”) of text: a long list of words
Every word in a fixed vocabulary is represented by a vector
Go through each position t in the text, which has a center word c and context (“outside”)
words o
Use the similarity of the word vectors for c and o to calculate the probability of o given c
(or vice versa)
Keep adjusting the word vectors to maximize this probability
29
Word2Vec
- Skip-Gram: predict the neighboring (context) words based on the center word
- CBOW (Continuous Bag of Words): predict the center word based on its context words
30
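To make the Skip-Gram / CBOW distinction concrete, here is a small illustrative sketch (the sentence and window size are made up) of how Skip-Gram training pairs are generated; CBOW would instead take the context words as input and predict the center word:

# Generate (center word, context word) training pairs from a sentence.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "fat", "cat", "sat", "on"], window=2))
# [('the', 'fat'), ('the', 'cat'), ('fat', 'the'), ('fat', 'cat'), ('fat', 'sat'), ...]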
Word2Vec (skip-gram): Overview
31
Word2Vec: Skip-Gram Overview
32
Word2Vec: Skip-Gram Overview
(figure: a center word and its context words within the window)
33
Word2Vec: Skip-Gram Overview
34
Word2Vec: Skip-Gram Overview
|V| : size of vocabulary; N is around 300 while |V| is around 500,000
(figure: one-hot center word "cat" predicting context words such as "fat" and "sat")
35
Word2Vec: Objective Function
36
Word2Vec: Objective Function
37
Word2Vec with Vectors
38
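The probability formula on this slide is shown as an image. For reference, a minimal numpy sketch of the standard skip-gram parameterization, p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c), where U holds the "outside" vectors and W the center vectors (toy sizes chosen for illustration):

import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, N))       # outside/context vectors, one row per word
W = rng.normal(size=(V, N))       # center vectors, one row per word

def p_outside_given_center(o, c):
    scores = U @ W[c]                                   # dot product with every u_w
    scores -= scores.max()                              # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()       # softmax
    return probs[o]

print(p_outside_given_center(o=3, c=7))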
Training the Model: Optimize value of parameters
39
Exercise (not evaluated):
• Derive this
p(o|c)
Optimization: Gradient Descent
41
Gradient Descent
42
Stochastic Gradient Descent
43
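As a rough illustration of the update rule behind (stochastic) gradient descent, a minimal generic sketch (not specific to word2vec): shuffle the training samples, and for each one move the parameters a small step against the gradient of its loss.

import numpy as np

def sgd(theta, grad_fn, samples, lr=0.05, epochs=3):
    # grad_fn(theta, sample) -> gradient of the loss on that one sample
    for _ in range(epochs):
        np.random.shuffle(samples)
        for sample in samples:
            theta = theta - lr * grad_fn(theta, sample)
    return theta

# toy usage: minimize (theta - x)^2 over a few points; theta approaches their mean
points = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
print(sgd(np.array([0.0]), lambda th, x: 2 * (th - x), points, epochs=30))   # ~2.0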
Skip-gram with
Negative Sampling
Skip-gram with Negative Sampling
“Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)
45
Skip-gram with Negative Sampling
Plain skip-gram: the input is the center word and the model predicts a context word.
Skip-gram with negative sampling: both the center word and a candidate context word are inputs, and the model predicts the probability that the two words actually occur as neighbors within the window.
https://wikidocs.net/69141
46
Skip-gram with Negative Sampling
47
Skip-gram with Negative Sampling
48
Skip-gram with Negative Sampling
prediction label
https://wikidocs.net/69141 49
Skip-gram with Negative Sampling
(1)
Maximize the probability of the two words co-occurring (the first log term) and minimize the probability of the sampled noise words (the second term)
50
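The objective in equation (1) is shown as an image on the slide. A minimal numpy sketch of the standard negative-sampling loss for one training example, log sigmoid(u_o . v_c) + sum_k log sigmoid(-u_k . v_c) (vector sizes here are arbitrary toy values):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_pos, U_neg):
    # v_c: center vector, u_pos: true context vector, U_neg: (K, N) noise vectors
    pos = np.log(sigmoid(u_pos @ v_c))           # push the true pair together
    neg = np.sum(np.log(sigmoid(-U_neg @ v_c)))  # push the noise words away
    return -(pos + neg)                          # minimize the negated objective

rng = np.random.default_rng(0)
print(sgns_loss(rng.normal(size=8), rng.normal(size=8), rng.normal(size=(5, 8))))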
Probability of being a word and its context
p(D = 1 | w, c): the probability that (w, c) is a true word-context pair
51
Probability of being a word and its context
52
Probability of being a word and its context
53
Probability of being a word and its context
• α= ¾ works well because it gives rare noise words slightly higher probability
• To show this, imagine two events p(a)=.99 and p(b) = .01:
54
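A quick check of the numbers for this example (the exponent 3/4 and the two probabilities come from the slide):

# Worked check of the alpha = 3/4 heuristic with p(a) = .99 and p(b) = .01.
p = {"a": 0.99, "b": 0.01}
alpha = 0.75
weights = {w: prob ** alpha for w, prob in p.items()}
total = sum(weights.values())
for w in p:
    print(w, p[w], "->", round(weights[w] / total, 3))
# a 0.99 -> 0.969
# b 0.01 -> 0.031   (the rare word's sampling probability roughly triples)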
Probability of being a word and its context
55
Analogy
• Word embedding meaning
56
GloVe
GloVe: Global Vectors
• Combines the advantages of the two major models:
Global matrix factorization
• Pro: Use the statistical information of the overall corpus
• Con: Not good at word-to-word analogy task
Local context window methods
• Pro: Good at word-to-word analogy task
• Con: Does not reflect the statistical information of the overall corpus
• Low-dimensional vectors
Idea: store “most” of the important information in a fixed, small number of dimensions: a
dense vector
Usually 25–1000 dimensions, similar to word2vec
How to reduce the dimensionality?
60
Classic Method: Dimensionality Reduction on X
• Singular Value Decomposition of co-occurrence matrix X
• Factorizes X into UΣV^T, where U and V are orthonormal
61
Hacks to X: Scaled vectors
• Running an SVD on raw counts doesn’t work well
• Scaling the counts in the cells can help a lot
Problem: function words (the, he, has) are too frequent, so syntax has too much impact.
Some fixes:
• log the frequencies
• min(X, t), with t ≈ 100
• Ignore the function words
Ramped windows that count closer words more heavily than words further away
Use Pearson correlations instead of counts, then set negative values to 0
Etc.
62
Interesting semantic patterns in the scaled vectors
COALS model (Rohde et al. ms., 2005). An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
63
GloVe: Cost function
• Objective: make the dot product of the embedded center-word and surrounding-word vectors correspond to the (log) probability of their co-occurrence in the entire corpus
Notations:
X : co-occurrence matrix
Xij : the number of times the surrounding word j appears in the window when the central word i appears
Xi = Σj Xij : the sum of all the values in row i
Pik = P(k | i) = Xik / Xi : the probability that the surrounding word k appears in the window when the central word i appears
• Ex) P(like | NLP) = probability of the word 'like' appearing when the word 'NLP' appears
64
Ratio of Co-occurrence probabilities
• Co-occurrence probabilities for target words ice and steam with selected context
words from a 6 billion token corpus.
In P(k | ice), the probability is highest when k is water, so ice and water are closely related.
The ratio P(k | ice) / P(k | steam) is more informative: when the probe word k is solid, the ratio is much greater than 1, meaning solid is more related to ice than to steam.
When k is gas, the ratio is much smaller than 1, meaning gas is more related to steam than to ice; when k is water, which is related to both ice and steam, the ratio is close to 1.
• While Pij on its own reflects only local statistics, the ratio Pik / Pjk obtained by introducing the probe word k carries global information.
65
GloVe: Cost function
Q: How can we capture ratios of co-occurrence probabilities as linear meaning
components in a word vector space?
Training faster
Scalable to very large corpora
66
GloVe: Global Vectors
where:
• Replace
68
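The cost function itself appears as an image on these slides. For reference, a sketch of the objective as published in Pennington et al. (2014), J = sum_ij f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2, with the weighting function f; the parameter values x_max = 100 and alpha = 3/4 follow the paper:

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X_ij): down-weights rare pairs, caps the influence of very frequent pairs
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    # sum over nonzero co-occurrence counts X_ij
    i, j = np.nonzero(X)
    inner = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return np.sum(glove_weight(X[i, j]) * (inner - np.log(X[i, j])) ** 2)

rng = np.random.default_rng(0)
X = np.array([[0, 3, 1], [3, 0, 2], [1, 2, 0]], dtype=float)   # toy co-occurrence counts
W, W_ctx = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
b, b_ctx = np.zeros(3), np.zeros(3)
print(glove_loss(W, W_ctx, b, b_ctx, X))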
Word Representation
• Word2Vec takes texts as training data for a neural network. The
resulting embedding captures whether words appear in similar
contexts.
• GloVe focuses on words co-occurrences over the whole corpus. Its
embeddings relate to the probabilities that two words appear
together.
• FastText improves on Word2Vec by taking word parts (characters)
into account, too. This trick enables training of embeddings on
smaller datasets and generalization to unknown words.
69
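To illustrate the "word parts" idea, a small sketch assuming the n-gram construction described in the FastText paper (3- to 6-character n-grams plus the whole word, with boundary markers; the hashing of n-grams into buckets is omitted). A word's vector is then the sum of its n-gram vectors, which is what lets unknown words still get a representation.

def char_ngrams(word, n_min=3, n_max=6):
    # '<' and '>' mark the word boundaries
    w = f"<{word}>"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

print(sorted(char_ngrams("where")))
# includes '<wh', 'whe', 'her', 'ere', 're>', ... and '<where>'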
Trained Word Embeddings
• word2vec: https://code.google.com/archive/p/word2vec/
• GloVe: https://nlp.stanford.edu/projects/glove/
• FastText: https://fasttext.cc/
70
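A minimal loading sketch, assuming the third-party gensim package and its downloader API (the model names shown are ones distributed with gensim, not part of the slides):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")        # GloVe vectors repackaged for gensim
print(glove.most_similar("language", topn=5))
print(glove.similarity("car", "automobile"))

# The original word2vec Google News binary can instead be loaded with
# gensim.models.KeyedVectors.load_word2vec_format(path, binary=True).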
How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
Evaluation on a specific/intermediate subtask
Fast to compute
Helps to understand the system
Not clear if really helpful unless correlation to real task is established
• Extrinsic:
Evaluation on a real task
Can take a long time to compute accuracy
Unclear whether the subsystem itself is the problem, or its interaction with other subsystems
If replacing exactly one subsystem with another improves accuracy: winning!
71
Intrinsic word vector evaluation
• Word Vector Analogies
• Evaluate word vectors by how well their cosine distance after addition
captures intuitive semantic and syntactic analogy questions
72
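A minimal sketch of the analogy evaluation ("a is to b as c is to ?"), assuming vecs is a dict mapping words to numpy vectors; following the usual convention, the three query words are excluded from the answer:

import numpy as np

def analogy(vecs, a, b, c):
    # maximize cosine similarity to (x_b - x_a + x_c)
    target = vecs[b] - vecs[a] + vecs[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in vecs.items():
        if w in (a, b, c):
            continue
        sim = v @ target / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# e.g. analogy(vecs, "man", "king", "woman") should return "queen" with good vectors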
GloVe Visualization
73
Meaning similarity: Another intrinsic word vector evaluation
74
Correlation evaluation
75
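A sketch of how such a correlation evaluation is typically computed (assuming scipy and a list of human-rated word pairs, such as WordSim-353-style data; the dataset itself is not included here): compare model cosine similarities with human ratings using Spearman's rank correlation.

import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vecs, pairs_with_ratings):
    # pairs_with_ratings: list of (word1, word2, human_score)
    model_scores, human_scores = [], []
    for w1, w2, human in pairs_with_ratings:
        if w1 in vecs and w2 in vecs:
            v1, v2 = vecs[w1], vecs[w2]
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            model_scores.append(cos)
            human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho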
Extrinsic word vector evaluation
• One example where good word vectors should help directly: named entity
recognition: identifying references to a person, organization or location:
76
Named Entity Recognition (NER)
• The task: find and classify names in text, by labeling word tokens, for
example:
Last night , Paris Hilton wowed in a sequin gown .
              PER   PER
Samuel Quinn was arrested in the Hilton Hotel in Paris in April 1989 .
PER    PER                       LOC    LOC      LOC      DATE  DATE
• Possible uses:
Tracking mentions of particular entities in documents
For question answering, answers are usually named entities
Relating sentiment analysis to the entity under discussion
78
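As a rough illustration of the extrinsic setup, a sketch (the helper names and the logistic-regression choice are illustrative, not from the slides) of using pretrained word vectors as features for a window-based NER classifier:

import numpy as np
from sklearn.linear_model import LogisticRegression

def window_features(tokens, i, vecs, dim, window=2):
    # Concatenate the vectors of the words in a window around position i;
    # out-of-range or unknown words get a zero vector.
    feats = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(tokens) and tokens[j] in vecs:
            feats.append(vecs[tokens[j]])
        else:
            feats.append(np.zeros(dim))
    return np.concatenate(feats)

# X = [window_features(sent, i, vecs, 100) for sent in train_sents for i in range(len(sent))]
# y = the corresponding PER/LOC/DATE/O labels
# clf = LogisticRegression(max_iter=1000).fit(X, y)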