
Word Meaning and Representation

Word Meaning
Words as Vectors
Word2Vec
Skip-Gram
GloVe
FastText
Word Vector Evaluation

Words as Vectors
• WordNet: synonyms and hypernyms ("is a" relationships)

4
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/cs224n-2020-lecture01-wordvecs1.pdf
Representing words as discrete symbols

There is no natural notion of similarity for one-hot vectors!

Solution: learn to encode similarity in the vectors themselves


5
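A minimal sketch of this point, using a made-up three-word vocabulary: with one-hot
vectors, every pair of distinct words has dot product (and hence cosine similarity) of exactly zero.

```python
import numpy as np

vocab = ["motel", "hotel", "house"]              # toy vocabulary (illustrative)
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# "motel" looks no closer to "hotel" than to "house": all cross dot products are 0.
print(one_hot["motel"] @ one_hot["hotel"])       # 0.0
print(one_hot["motel"] @ one_hot["house"])       # 0.0
```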
Representing words by their context
• Distributional semantics: A word’s meaning is given by the words that
frequently appear close-by
  ◦ "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
  ◦ One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear
nearby (within a fixed-size window).
• We use the many contexts of w to build up a representation of w

6
Distributional hypothesis

7
Distributional hypothesis

8
Words as vectors
• We’ll build a new model of meaning focusing on similarity
  ◦ Each word is a vector
  ◦ Similar words are "nearby in space"

• A first solution: we can just use context vectors to represent the meaning of
words!
  ◦ Word-word co-occurrence matrix:

9
Words as vectors

10
Sparse vs dense vectors
• Still, the vectors we get from the word-word co-occurrence matrix are sparse (most
entries are 0) & long (vocabulary size)
• Alternative: we want to represent words as short (50-300 dimensional) &
dense (real-valued) vectors
  ◦ Our focus in this lecture
  ◦ The basis of all modern NLP systems

11
Dense Vectors

12
Word meaning as a neural word vector – visualization

13
Why dense vectors?
• Short vectors are easier to use as features in ML systems
• Dense vectors may generalize better than storing explicit counts
• They do better at capturing synonymy
  ◦ w1 co-occurs with "car", w2 co-occurs with "automobile"

• Different methods for getting dense vectors:
  ◦ Singular value decomposition (SVD)
  ◦ word2vec and friends: "learn" the vectors!

14
SVD
Singular Value Decomposition
• Problem:
  ◦ #1: Find concepts in data
  ◦ #2: Reduce dimensionality

16
Recommender Systems, Lior Rokach
SVD - Definition

A[n × m] = U[n × r] L[r × r] (V[m × r])^T

• A: n x m matrix (e.g., n documents, m terms)


• U: n x r matrix (n documents, r concepts)
• L: r x r diagonal matrix (strength of each ‘concept’) (r: rank
of the matrix)
• V: m x r matrix (m terms, r concepts)

17
Manning and Raghavan, 2004
SVD - Properties

THEOREM [Press+92]: always possible to decompose matrix A into


A = U L V^T, where
• U, L, V: unique (*)
• U, V: column orthonormal (i.e., columns are unit vectors,
orthogonal to each other)
• U^T U = I; V^T V = I (I: identity matrix)
• L: singular values are positive, and sorted in decreasing order

18
Manning and Raghavan, 2004
SVD - Properties

‘spectral decomposition’ of the matrix:

    1 1 1 0 0
    2 2 2 0 0
    1 1 1 0 0                    [ l1  0  ]     [ v1^T ]
    5 5 5 0 0   =   [ u1  u2 ] x [ 0   l2 ]  x  [ v2^T ]
    0 0 0 2 2
    0 0 0 3 3
    0 0 0 1 1

19
Manning and Raghavan, 2004
SVD - Interpretation

‘documents’, ‘terms’ and ‘concepts’:


• U: document-to-concept similarity matrix
• V: term-to-concept similarity matrix
• L: its diagonal elements: ‘strength’ of each concept

Projection:
• best axis to project on:
‘best’ = minimum sum of squares of projection errors

20
SVD - Example

• A = U L V^T - example:

              data inf. retrieval brain lung
        CS  [  1    1      1        0    0  ]     [ 0.18  0    ]
        CS  [  2    2      2        0    0  ]     [ 0.36  0    ]
        CS  [  1    1      1        0    0  ]     [ 0.18  0    ]   [ 9.64  0    ]   [ 0.58 0.58 0.58 0    0    ]
        CS  [  5    5      5        0    0  ]  =  [ 0.90  0    ] x [ 0     5.29 ] x [ 0    0    0    0.71 0.71 ]
        MD  [  0    0      0        2    2  ]     [ 0     0.53 ]
        MD  [  0    0      0        3    3  ]     [ 0     0.80 ]
        MD  [  0    0      0        1    1  ]     [ 0     0.27 ]

21
SVD - Example

• A = U L V^T - the same decomposition as above:
  ◦ U is the doc-to-concept similarity matrix; its first column is the CS-concept and its
second column the MD-concept.

22
SVD - Example

• A = U L V^T - the same decomposition as above:
  ◦ The diagonal of L gives the 'strength' of each concept: 9.64 for the CS-concept and
5.29 for the MD-concept.

23
SVD - Example

• A = U L V^T - the same decomposition as above:
  ◦ V is the term-to-concept similarity matrix; the CS-concept loads on data, inf. and
retrieval (0.58 each), the MD-concept on brain and lung (0.71 each).

24
SVD – Dimensionality reduction

• Q: how exactly is dim. reduction done?


• A: set the smallest singular values to zero:

  ◦ In the example above, zero out the smaller singular value (5.29 → 0) and drop the
corresponding column of U and row of V^T.

25
SVD - Dimensionality reduction

    1 1 1 0 0        [ 0.18 ]
    2 2 2 0 0        [ 0.36 ]
    1 1 1 0 0        [ 0.18 ]
    5 5 5 0 0   ~    [ 0.90 ]  x  [ 9.64 ]  x  [ 0.58 0.58 0.58 0 0 ]
    0 0 0 2 2        [ 0    ]
    0 0 0 3 3        [ 0    ]
    0 0 0 1 1        [ 0    ]

26
SVD - Dimensionality reduction

    1 1 1 0 0         1 1 1 0 0
    2 2 2 0 0         2 2 2 0 0
    1 1 1 0 0         1 1 1 0 0
    5 5 5 0 0   ~     5 5 5 0 0
    0 0 0 2 2         0 0 0 0 0
    0 0 0 3 3         0 0 0 0 0
    0 0 0 1 1         0 0 0 0 0

27
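A short numpy sketch of the example above: decompose A, inspect the singular values
(9.64 and 5.29), and keep only the strongest concept to obtain the rank-1 approximation
shown on this slide.

```python
import numpy as np

# Document-term matrix from the example
# (rows: 4 CS docs + 3 MD docs; columns: data, inf., retrieval, brain, lung)
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))                    # [9.64 5.29 0.   0.   0.  ]

k = 1                                    # keep only the strongest concept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))                  # CS rows survive, MD rows collapse to ~0
```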
Word2Vec
Word2Vec: Overview
• Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
• Idea:
  ◦ We have a large corpus ("body") of text: a long list of words
  ◦ Every word in a fixed vocabulary is represented by a vector
  ◦ Go through each position t in the text, which has a center word c and context ("outside")
words o
  ◦ Use the similarity of the word vectors for c and o to calculate the probability of o given c
(or vice versa)
  ◦ Keep adjusting the word vectors to maximize this probability

29
Word2Vec
- Skip-Gram: predict neighbors (context words) based on the center word
- CBOW (Continuous Bag of Words): predict the center word based on its neighbors

30
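A small sketch (not from the slides) of how skip-gram training pairs are extracted from a
tokenized sentence with a fixed window size; CBOW uses the same windows but groups the
context words together to predict the center word.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center word, context word) pairs for skip-gram training."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the fat cat sat on the mat".split()
print(list(skipgram_pairs(sentence, window=2))[:5])
# [('the', 'fat'), ('the', 'cat'), ('fat', 'the'), ('fat', 'cat'), ('fat', 'sat')]
```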
Word2Vec (skip-gram): Overview

31
Word2Vec: Skip-Gram Overview

32
Word2Vec: Skip-Gram Overview

(figure: a center word and the context words around it within the window)

33
Word2Vec: Skip-Gram Overview

34
Word2Vec: Skip-Gram Overview
• |V| : size of vocabulary (around 500,000)
• N : embedding dimension (around 300)

(figure: for the text "fat cat sat on the", the center word "sat" predicts its
context words "fat", "cat", "on", "the")
35
Word2Vec: Objective Function

36
Word2Vec: Objective Function

37
Word2Vec with Vectors

38
Training the Model: Optimize value of parameters

39
Exercise (not evaluated):

• Derive this: p(o|c)
Optimization: Gradient Descent

41
Gradient Descent

42
Stochastic Gradient Descent

43
Skip-gram with
Negative Sampling
Skip-gram with Negative Sampling

• Use the Skip-gram model with Negative Sampling

“Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)
45
Skip-gram with Negative Sampling
• Plain skip-gram: the input is the center word, and the model predicts the context word.

• With negative sampling: both the center word and a context word are given as input, and the
model predicts the probability that the two words are actually neighbors within the window.

https://wikidocs.net/69141
46
Skip-gram with Negative Sampling

• A randomly sampled set of negative examples is taken for each word

47
Skip-gram with Negative Sampling

1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the learned weights as the embeddings (a minimal training sketch follows below).

48
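A minimal, illustrative sketch of the four steps above (not the actual word2vec code):
negatives are drawn uniformly for brevity, whereas word2vec samples them from the
unigram^0.75 distribution described a few slides later.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sgns(pairs, vocab, dim=50, k=5, lr=0.025, epochs=5):
    """Skip-gram with negative sampling, trained with plain SGD.
    pairs: (center, context) positive examples; vocab: list of all words."""
    idx = {w: i for i, w in enumerate(vocab)}
    W = rng.normal(scale=0.1, size=(len(vocab), dim))   # center-word vectors
    C = rng.normal(scale=0.1, size=(len(vocab), dim))   # context-word vectors

    for _ in range(epochs):
        for center, context in pairs:
            c, o = idx[center], idx[context]
            negs = rng.integers(0, len(vocab), size=k)  # step 2: k negative samples

            # step 3 (positive pair): push sigmoid(W[c] . C[o]) toward 1
            g = sigmoid(W[c] @ C[o]) - 1.0
            grad_c = g * C[o]
            C[o] -= lr * g * W[c]

            # step 3 (negative pairs): push sigmoid(W[c] . C[n]) toward 0
            for n in negs:
                g = sigmoid(W[c] @ C[n])
                grad_c += g * C[n]
                C[n] -= lr * g * W[c]

            W[c] -= lr * grad_c

    return W, idx   # step 4: the rows of W are the word embeddings
```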
Skip-gram with Negative Sampling
(figure: the model's prediction is compared with the 1/0 label, and the error is used to
update the embedding vectors)

https://wikidocs.net/69141 49
Skip-gram with Negative Sampling

Objective Function (they maximize):

(1)  log σ(v'_wO · v_wI) + Σ_{i=1..k} E_{wi ~ Pn(w)} [ log σ(−v'_wi · v_wI) ]

Maximize the probability of the two words co-occurring (the first log term) and minimize
the probability of the noise words (the second term)

50

Skip-gram with Negative Sampling


A pair of words (w, c) that appear near each other, where w is a word and c its context:

p(D = 1 | w, c) : the probability that the pair (w, c) came from the training data
p(D = 0 | w, c) = 1 − p(D = 1 | w, c) : the probability that the pair is not in the training data

• We have to optimize this (D: observed word-context pairs, D′: sampled negative pairs):

(2)  argmax_θ  Π_{(w,c)∈D} p(D = 1 | w, c; θ) · Π_{(w,c)∈D′} p(D = 0 | w, c; θ)

• Converting the max of products to a max of sums of logarithms:

(3)  argmax_θ  Σ_{(w,c)∈D} log p(D = 1 | w, c; θ) + Σ_{(w,c)∈D′} log p(D = 0 | w, c; θ)

51

How to compute p(D=1|w,c)?


Intuition:
• Words are likely to appear near similar words
• Model similarity with dot-product!
• Similarity(t,c) ∝ t ∙ c
Problem:
Dot product is not a probability!
(Neither is cosine)

Turning dot product into a probability:


Use the logistic/sigmoid function: σ(x) = 1 / (1 + e^(−x))

52
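A tiny illustrative snippet (hypothetical 3-dimensional vectors) of this conversion from
dot-product similarity to a probability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_w = np.array([0.2, -0.4, 0.7])   # hypothetical word vector
v_c = np.array([0.1, -0.3, 0.9])   # hypothetical context vector

score = v_w @ v_c                  # dot product: unbounded, not a probability
p_pos = sigmoid(score)             # p(D = 1 | w, c)
p_neg = 1.0 - p_pos                # p(D = 0 | w, c) = sigmoid(-score)
print(score, p_pos, p_neg)
```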

Skip-gram with Negative Sampling


We can compute p(D = 1 | w, c; θ) using the sigmoid function, where v_w and v_c are the
representations of the center and context words under the current θ:

(4)  p(D = 1 | w, c; θ) = σ(v_c · v_w) = 1 / (1 + e^(−v_c · v_w))

Formula (3) becomes:

(5)  argmax_θ  Σ_{(w,c)∈D} log σ(v_c · v_w) + Σ_{(w,c)∈D′} log σ(−v_c · v_w)

This is the same as Formula (1) summed over the entire corpus

53

Choosing noise words

• Could pick w according to its unigram frequency P(w)

• More common to choose w according to P_α(w) = count(w)^α / Σ_w′ count(w′)^α

• α = ¾ works well because it gives rare noise words slightly higher probability
• To show this, imagine two events p(a) = .99 and p(b) = .01:
  ◦ P_α(a) = .99^.75 / (.99^.75 + .01^.75) ≈ .97
  ◦ P_α(b) = .01^.75 / (.99^.75 + .01^.75) ≈ .03

(P_α(w) is the probability that a word w not appearing within the window is chosen as a negative sample.)

54
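A quick numpy check of the α = ¾ re-weighting in the example above:

```python
import numpy as np

p = np.array([0.99, 0.01])          # original probabilities of events a and b
alpha = 0.75

p_alpha = p**alpha / np.sum(p**alpha)
print(np.round(p_alpha, 2))         # [0.97 0.03] -- the rare event gets boosted
```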

Training maximizes the score of positive pairs and minimizes the score of negative pairs

55
Analogy
• Word embedding → meaning

56
GloVe
GloVe: Global Vectors
• Combines the advantages of the two major models:
  ◦ Global matrix factorization
    • Pro: Use the statistical information of the overall corpus
    • Con: Not good at the word-to-word analogy task
  ◦ Local context window methods
    • Pro: Good at the word-to-word analogy task
    • Con: Does not reflect the statistical information of the overall corpus

• The model efficiently leverages statistical information by training only on the
nonzero elements in a word-word co-occurrence matrix, rather than on the entire
sparse matrix or on individual context windows in a large corpus

(Pennington et al, 2014): GloVe: Global Vectors for Word Representation


58
Window based Co-occurrence Matrix
• A matrix whose entry in row i, column k lists the number of times word k appears within
the window of word i.
  ◦ Symmetric (irrelevant whether left or right context)

• E.g., the corpus has these three sentences (window size is 1); the resulting matrix is
built in the sketch below:
  ◦ I like deep learning
  ◦ I like NLP
  ◦ I enjoy flying

(Pennington et al, 2014): GloVe: Global Vectors for Word Representation


59
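A small sketch that builds the window-1 co-occurrence matrix for these three sentences
(plain counts, no scaling):

```python
import numpy as np

sentences = ["I like deep learning".split(),
             "I like NLP".split(),
             "I enjoy flying".split()]

vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)), dtype=int)

window = 1
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1    # symmetric by construction

print(vocab)
print(X)    # e.g. X[idx["I"], idx["like"]] == 2, since "I like" occurs twice
```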
Co-occurrence vectors
• Simple count co-occurrence vectors
  ◦ Vectors increase in size with vocabulary
  ◦ Very high dimensional: require a lot of storage (though sparse)
  ◦ Subsequent classification models have sparsity issues → models are less robust

• Low-dimensional vectors
  ◦ Idea: store "most" of the important information in a fixed, small number of dimensions: a
dense vector
  ◦ Usually 25–1000 dimensions, similar to word2vec
  ◦ How to reduce the dimensionality?

60
Classic Method: Dimensionality Reduction on X
• Singular Value Decomposition of co-occurrence matrix X
• Factorizes X into UΣV^T, where U and V are orthonormal

Retain only k singular values, in order to generalize.


The result is the best rank k approximation to X , in terms of least squares.
Classic linear algebra result. Expensive to compute for large matrices.

61
Hacks to X: Scaled vectors
• Running an SVD on raw counts doesn’t work well
• Scaling the counts in the cells can help a lot
  ◦ Problem: function words (the, he, has) are too frequent → syntax has too much impact.
Some fixes:
    • log the frequencies
    • min(X, t), with t ≈ 100
    • Ignore the function words
  ◦ Ramped windows that count closer words more than words further away
  ◦ Use Pearson correlations instead of counts, then set negative values to 0
  ◦ Etc.

62
Interesting semantic patterns in the scaled vectors

COALS model (Rohde et al. ms., 2005). An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence

63
GloVe: Cost function
• Objective: make the dot product of the embedded center word and surrounding word
vectors reflect the probability of co-occurrence in the entire corpus
Notation:
  ◦ X : co-occurrence matrix
  ◦ X_ij : the number of times the surrounding word j appears in the window when the central
word i appears
  ◦ X_i = Σ_j X_ij : the sum of all the values in row i
  ◦ P_ik = P(k | i) = X_ik / X_i : the probability that surrounding word k appears in the window
when the central word i appears
    • Ex) P(like | NLP) = probability of the word 'like' appearing when the word 'NLP' appears
  ◦ P_ik / P_jk : ratio of co-occurrence probabilities, e.g. P(like | NLP) / P(like | deep) = 2.0
  ◦ w_i : embedding vector of center word i
  ◦ w̃_k : embedding vector of context word k

64
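The slides reference the cost function without reproducing it; for reference, the GloVe
objective (Pennington et al., 2014) is J = Σ_{i,j} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)².
A minimal sketch of evaluating it over the nonzero entries of X (random toy parameters,
illustrative only):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights very frequent co-occurrences."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost(X, W, W_tilde, b, b_tilde):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += glove_weight(X[i, j]) * diff ** 2
    return J

# toy usage with random parameters
rng = np.random.default_rng(0)
V, d = 8, 4
X = rng.integers(0, 5, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(glove_cost(X, W, W_tilde, b, b_tilde))
```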
Ratio of Co-occurrence probabilities
• Co-occurrence probabilities for target words ice and steam with selected context
words from a 6 billion token corpus.

  ◦ In P(k | ice), the probability is highest when k is water, so ice and water are related.
  ◦ If the probe word k is solid, the ratio P(solid | ice) / P(solid | steam) is greater than 1,
meaning that solid is more related to ice than to steam.
  ◦ If k is gas, there is a higher correlation with steam than with ice; when k is water,
it is related to both ice and steam.

• While P_ij reflects only local statistics, by introducing the probe word k the ratio
P_ik / P_jk captures global information.
65
GloVe: Cost function
Q: How can we capture ratios of co-occurrence probabilities as linear meaning
components in a word vector space?

  ◦ Training faster
  ◦ Scalable to very large corpora

66
GloVe: Global Vectors

(Pennington et al, 2014): GloVe: Global Vectors for Word Representation


67
FastText: Sub-Word Embeddings
• Similar to Skip-gram, but break words into character n-grams with n = 3 to 6
• Replace each word vector with the sum of the vectors of its character n-grams
(the word itself, wrapped in boundary markers, is included as well); an n-gram
extraction sketch follows below

68
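A short sketch of the sub-word idea: extract a word's character n-grams (FastText wraps
the word in boundary markers '<' and '>' and also keeps the whole wrapped word); the
word's vector is then the sum of the vectors of these n-grams.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams in the FastText style, including the wrapped word itself."""
    wrapped = f"<{word}>"
    grams = {wrapped}
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return sorted(grams)

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```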
Word Representation
• Word2Vec takes texts as training data for a neural network. The
resulting embedding captures whether words appear in similar
contexts.
• GloVe focuses on word co-occurrences over the whole corpus. Its
embeddings relate to the probabilities that two words appear
together.
• FastText improves on Word2Vec by taking word parts (characters)
into account, too. This trick enables training of embeddings on
smaller datasets and generalization to unknown words.

69
Trained Word Embeddings
• word2vec: https://code.google.com/archive/p/word2vec/
• GloVe: https://nlp.stanford.edu/projects/glove/
• FastText: https://fasttext.cc/

70
How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
  ◦ Evaluation on a specific/intermediate subtask
  ◦ Fast to compute
  ◦ Helps to understand that system
  ◦ Not clear if really helpful unless correlation to a real task is established

• Extrinsic:
  ◦ Evaluation on a real task
  ◦ Can take a long time to compute accuracy
  ◦ Unclear if the subsystem is the problem, or its interaction with other subsystems
  ◦ If replacing exactly one subsystem with another improves accuracy → Winning!
71
Intrinsic word vector evaluation
• Word Vector Analogies

• Evaluate word vectors by how well their cosine distance after addition
captures intuitive semantic and syntactic analogy questions
72
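A minimal sketch of the analogy test, assuming wv is a plain dict mapping words to numpy
vectors (real evaluations use benchmark sets such as the Google analogy dataset):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(wv, a, b, c):
    """Best word d completing 'a is to b as c is to d', scored by
    cosine similarity to the vector (b - a + c)."""
    target = wv[b] - wv[a] + wv[c]
    best, best_sim = None, -1.0
    for word, vec in wv.items():
        if word in (a, b, c):          # standard practice: exclude the query words
            continue
        sim = cosine(target, vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. analogy(wv, "man", "king", "woman") should ideally return "queen"
```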
GloVe Visualization

73
Meaning similarity: Another intrinsic word vector evaluation

• Word vector distances and their correlation with human judgments


• Example dataset: WordSim353
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

74
Correlation evaluation

• Word vector distances and their correlation with human judgments

75
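A sketch of how this correlation is typically computed, assuming pairs holds
WordSim353-style rows (word1, word2, human score) and wv maps words to numpy vectors:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_correlation(wv, pairs):
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in wv and w2 in wv:               # skip out-of-vocabulary pairs
            human.append(score)
            model.append(cosine(wv[w1], wv[w2]))
    rho, _ = spearmanr(human, model)            # rank correlation with human ratings
    return rho
```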
Extrinsic word vector evaluation
• One example where good word vectors should help directly: named entity
recognition: identifying references to a person, organization or location:

76
Named Entity Recognition (NER)
• The task: find and classify names in text, by labeling word tokens, for
example:
Last night, Paris[PER] Hilton[PER] wowed in a sequin gown.
Samuel[PER] Quinn[PER] was arrested in the Hilton[LOC] Hotel[LOC] in Paris[LOC] in April[DATE] 1989[DATE].

• Possible uses:
  ◦ Tracking mentions of particular entities in documents
  ◦ For question answering, answers are usually named entities
  ◦ Relating sentiment analysis to the entity under discussion

• Often followed by Entity Linking/Canonicalization into a Knowledge Base such
as Wikidata
77
Summary
• Words as Vectors
• Word2Vec
• Skip-Gram
• GloVe
• FastText
• Word vector evaluation

78
