
Word Meaning and Representation

Word Meaning
Words as Vectors
Word2Vec
Skip-Gram
GloVe
FastText
Word Vector Evaluation

Words as Vectors
• WordNet: synonyms and hypernyms ("is a" relationships)

4
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/cs224n-2020-lecture01-wordvecs1.pdf
Representing words as discrete symbols

There is no natural notion of similarity for one-hot vectors!

Solution: learn to encode similarity in the vectors themselves


5
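A minimal sketch of this point, using a made-up three-word vocabulary: with one-hot
vectors, every pair of distinct words has dot product (and hence cosine similarity) of exactly zero.

```python
import numpy as np

vocab = ["motel", "hotel", "house"]              # toy vocabulary (illustrative)
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# "motel" looks no closer to "hotel" than to "house": all cross dot products are 0.
print(one_hot["motel"] @ one_hot["hotel"])       # 0.0
print(one_hot["motel"] @ one_hot["house"])       # 0.0
```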
Representing words by their context
• Distributional semantics: A word’s meaning is given by the words that
frequently appear close-by
  ◦ "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
  ◦ One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear
nearby (within a fixed-size window).
• We use the many contexts of w to build up a representation of w

6
Distributional hypothesis

7
Distributional hypothesis

8
Words as vectors
• We’ll build a new model of meaning focusing on similarity
  ◦ Each word is a vector
  ◦ Similar words are "nearby in space"

• A first solution: we can just use context vectors to represent the meaning of
words!
  ◦ Word-word co-occurrence matrix:

9
Words as vectors

10
Sparse vs dense vectors
• Still, the vectors we get from the word-word co-occurrence matrix are sparse (most
entries are 0) & long (vocabulary size)
• Alternative: we want to represent words as short (50-300 dimensional) &
dense (real-valued) vectors
  ◦ Our focus in this lecture
  ◦ The basis of all modern NLP systems

11
Dense Vectors

12
Word meaning as a neural word vector – visualization

13
Why dense vectors?
• Short vectors are easier to use as features in ML systems
• Dense vectors may generalize better than storing explicit counts
• They do better at capturing synonymy
  ◦ w1 co-occurs with "car", w2 co-occurs with "automobile"

• Different methods for getting dense vectors:
  ◦ Singular value decomposition (SVD)
  ◦ word2vec and friends: "learn" the vectors!

14
SVD
Singular Value Decomposition
• Problem:
  ◦ #1: Find concepts in data
  ◦ #2: Reduce dimensionality

16
Recommender Systems, Lior Rokach
SVD - Definition

A[n × m] = U[n × r] L[r × r] (V[m × r])^T

• A: n x m matrix (e.g., n documents, m terms)


• U: n x r matrix (n documents, r concepts)
• L: r x r diagonal matrix (strength of each ‘concept’) (r: rank
of the matrix)
• V: m x r matrix (m terms, r concepts)

17
Manning and Raghavan, 2004
SVD - Properties

THEOREM [Press+92]: always possible to decompose matrix A into


A = U L V^T, where
• U, L, V: unique (*)
• U, V: column orthonormal (i.e., columns are unit vectors,
orthogonal to each other)
• U^T U = I; V^T V = I (I: identity matrix)
• L: singular values are positive, and sorted in decreasing order

18
Manning and Raghavan, 2004
SVD - Properties

‘spectral decomposition’ of the matrix:

    1 1 1 0 0
    2 2 2 0 0
    1 1 1 0 0                    [ l1  0  ]     [ v1^T ]
    5 5 5 0 0   =   [ u1  u2 ] x [ 0   l2 ]  x  [ v2^T ]
    0 0 0 2 2
    0 0 0 3 3
    0 0 0 1 1

19
Manning and Raghavan, 2004
SVD - Interpretation

‘documents’, ‘terms’ and ‘concepts’:


• U: document-to-concept similarity matrix
• V: term-to-concept similarity matrix
• L: its diagonal elements: ‘strength’ of each concept

Projection:
• best axis to project on:
‘best’ = minimum sum of squares of projection errors

20
SVD - Example

• A = U L V^T - example:

              data inf. retrieval brain lung
        CS  [  1    1      1        0    0  ]     [ 0.18  0    ]
        CS  [  2    2      2        0    0  ]     [ 0.36  0    ]
        CS  [  1    1      1        0    0  ]     [ 0.18  0    ]   [ 9.64  0    ]   [ 0.58 0.58 0.58 0    0    ]
        CS  [  5    5      5        0    0  ]  =  [ 0.90  0    ] x [ 0     5.29 ] x [ 0    0    0    0.71 0.71 ]
        MD  [  0    0      0        2    2  ]     [ 0     0.53 ]
        MD  [  0    0      0        3    3  ]     [ 0     0.80 ]
        MD  [  0    0      0        1    1  ]     [ 0     0.27 ]

21
SVD - Example

• A = U L V^T - the same decomposition as above:
  ◦ U is the doc-to-concept similarity matrix; its first column is the CS-concept and its
second column the MD-concept.

22
SVD - Example

• A = U L V^T - the same decomposition as above:
  ◦ The diagonal of L gives the 'strength' of each concept: 9.64 for the CS-concept and
5.29 for the MD-concept.

23
SVD - Example

• A = U L V^T - the same decomposition as above:
  ◦ V is the term-to-concept similarity matrix; the CS-concept loads on data, inf. and
retrieval (0.58 each), the MD-concept on brain and lung (0.71 each).

24
SVD – Dimensionality reduction

• Q: how exactly is dim. reduction done?


• A: set the smallest singular values to zero:

  ◦ In the example above, zero out the smaller singular value (5.29 → 0) and drop the
corresponding column of U and row of V^T.

25
SVD - Dimensionality reduction

    1 1 1 0 0        [ 0.18 ]
    2 2 2 0 0        [ 0.36 ]
    1 1 1 0 0        [ 0.18 ]
    5 5 5 0 0   ~    [ 0.90 ]  x  [ 9.64 ]  x  [ 0.58 0.58 0.58 0 0 ]
    0 0 0 2 2        [ 0    ]
    0 0 0 3 3        [ 0    ]
    0 0 0 1 1        [ 0    ]

26
SVD - Dimensionality reduction

    1 1 1 0 0         1 1 1 0 0
    2 2 2 0 0         2 2 2 0 0
    1 1 1 0 0         1 1 1 0 0
    5 5 5 0 0   ~     5 5 5 0 0
    0 0 0 2 2         0 0 0 0 0
    0 0 0 3 3         0 0 0 0 0
    0 0 0 1 1         0 0 0 0 0

27
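A short numpy sketch of the example above: decompose A, inspect the singular values
(9.64 and 5.29), and keep only the strongest concept to obtain the rank-1 approximation
shown on this slide.

```python
import numpy as np

# Document-term matrix from the example
# (rows: 4 CS docs + 3 MD docs; columns: data, inf., retrieval, brain, lung)
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))                    # [9.64 5.29 0.   0.   0.  ]

k = 1                                    # keep only the strongest concept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))                  # CS rows survive, MD rows collapse to ~0
```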
Word2Vec
Word2Vec: Overview
• Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
• Idea:
  ◦ We have a large corpus ("body") of text: a long list of words
  ◦ Every word in a fixed vocabulary is represented by a vector
  ◦ Go through each position t in the text, which has a center word c and context ("outside")
words o
  ◦ Use the similarity of the word vectors for c and o to calculate the probability of o given c
(or vice versa)
  ◦ Keep adjusting the word vectors to maximize this probability

29
Word2Vec
- Skip-Gram: predict neighbors (context words) based on the center word
- CBOW (Continuous Bag of Words): predict the center word based on its neighbors

30
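A small sketch (not from the slides) of how skip-gram training pairs are extracted from a
tokenized sentence with a fixed window size; CBOW uses the same windows but groups the
context words together to predict the center word.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center word, context word) pairs for skip-gram training."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the fat cat sat on the mat".split()
print(list(skipgram_pairs(sentence, window=2))[:5])
# [('the', 'fat'), ('the', 'cat'), ('fat', 'the'), ('fat', 'cat'), ('fat', 'sat')]
```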
Word2Vec (skip-gram): Overview

31
Word2Vec: Skip-Gram Overview

32
Word2Vec: Skip-Gram Overview

(figure: a center word and the context words around it within the window)

33
Word2Vec: Skip-Gram Overview

34
Word2Vec: Skip-Gram Overview
• |V| : size of vocabulary (around 500,000)
• N : embedding dimension (around 300)

(figure: for the text "fat cat sat on the", the center word "sat" predicts its
context words "fat", "cat", "on", "the")
35
Word2Vec: Objective Function

36
Word2Vec: Objective Function

37
Word2Vec with Vectors

38
Training the Model: Optimize value of parameters

39
Exercise (not evaluated):

• Derive this: p(o|c)
Optimization: Gradient Descent

41
Gradient Descent

42
Stochastic Gradient Descent

43
Skip-gram with
Negative Sampling
Skip-gram with Negative Sampling

• Use the Skip-gram model with Negative Sampling

“Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)
45
Skip-gram with Negative Sampling
• Plain skip-gram: the input is the center word, and the model predicts the context word.

• With negative sampling: both the center word and a context word are given as input, and the
model predicts the probability that the two words are actually neighbors within the window.

https://wikidocs.net/69141
46
Skip-gram with Negative Sampling

• A randomly sampled set of negative examples is taken for each word

47
Skip-gram with Negative Sampling

1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the learned weights as the embeddings (a minimal training sketch follows below).

48
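A minimal, illustrative sketch of the four steps above (not the actual word2vec code):
negatives are drawn uniformly for brevity, whereas word2vec samples them from the
unigram^0.75 distribution described a few slides later.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sgns(pairs, vocab, dim=50, k=5, lr=0.025, epochs=5):
    """Skip-gram with negative sampling, trained with plain SGD.
    pairs: (center, context) positive examples; vocab: list of all words."""
    idx = {w: i for i, w in enumerate(vocab)}
    W = rng.normal(scale=0.1, size=(len(vocab), dim))   # center-word vectors
    C = rng.normal(scale=0.1, size=(len(vocab), dim))   # context-word vectors

    for _ in range(epochs):
        for center, context in pairs:
            c, o = idx[center], idx[context]
            negs = rng.integers(0, len(vocab), size=k)  # step 2: k negative samples

            # step 3 (positive pair): push sigmoid(W[c] . C[o]) toward 1
            g = sigmoid(W[c] @ C[o]) - 1.0
            grad_c = g * C[o]
            C[o] -= lr * g * W[c]

            # step 3 (negative pairs): push sigmoid(W[c] . C[n]) toward 0
            for n in negs:
                g = sigmoid(W[c] @ C[n])
                grad_c += g * C[n]
                C[n] -= lr * g * W[c]

            W[c] -= lr * grad_c

    return W, idx   # step 4: the rows of W are the word embeddings
```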
Skip-gram with Negative Sampling
(figure: the model's prediction is compared with the 1/0 label, and the error is used to
update the embedding vectors)

https://wikidocs.net/69141 49
Skip-gram with Negative Sampling

Objective Function (they maximize):

(1)  log σ(v'_wO · v_wI) + Σ_{i=1..k} E_{wi ~ Pn(w)} [ log σ(−v'_wi · v_wI) ]

Maximize the probability of the two words co-occurring (the first log term) and minimize
the probability of the noise words (the second term)

50

Skip-gram with Negative Sampling


A pair of words (w, c) that appear near each other, where w is a word and c its context:

p(D = 1 | w, c) : the probability that the pair (w, c) came from the training data
p(D = 0 | w, c) = 1 − p(D = 1 | w, c) : the probability that the pair is not in the training data

• We have to optimize this (D: observed word-context pairs, D′: sampled negative pairs):

(2)  argmax_θ  Π_{(w,c)∈D} p(D = 1 | w, c; θ) · Π_{(w,c)∈D′} p(D = 0 | w, c; θ)

• Converting the max of products to a max of sums of logarithms:

(3)  argmax_θ  Σ_{(w,c)∈D} log p(D = 1 | w, c; θ) + Σ_{(w,c)∈D′} log p(D = 0 | w, c; θ)

51

How to compute p(D=1|w,c)?


Intuition:
• Words are likely to appear near similar words
• Model similarity with dot-product!
• Similarity(t,c) ∝ t ∙ c
Problem:
Dot product is not a probability!
(Neither is cosine)

Turning dot product into a probability:


Use the logistic/sigmoid function: σ(x) = 1 / (1 + e^(−x))

52
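A tiny illustrative snippet (hypothetical 3-dimensional vectors) of this conversion from
dot-product similarity to a probability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_w = np.array([0.2, -0.4, 0.7])   # hypothetical word vector
v_c = np.array([0.1, -0.3, 0.9])   # hypothetical context vector

score = v_w @ v_c                  # dot product: unbounded, not a probability
p_pos = sigmoid(score)             # p(D = 1 | w, c)
p_neg = 1.0 - p_pos                # p(D = 0 | w, c) = sigmoid(-score)
print(score, p_pos, p_neg)
```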

Skip-gram with Negative Sampling


We can compute p(D = 1 | w, c; θ) using the sigmoid function, where v_w and v_c are the
representations of the center and context words under the current θ:

(4)  p(D = 1 | w, c; θ) = σ(v_c · v_w) = 1 / (1 + e^(−v_c · v_w))

Formula (3) becomes:

(5)  argmax_θ  Σ_{(w,c)∈D} log σ(v_c · v_w) + Σ_{(w,c)∈D′} log σ(−v_c · v_w)

This is the same as Formula (1) summed over the entire corpus

53

Choosing noise words

• Could pick w according to its unigram frequency P(w)

• More common to choose w according to P_α(w) = count(w)^α / Σ_w′ count(w′)^α

• α = ¾ works well because it gives rare noise words slightly higher probability
• To show this, imagine two events p(a) = .99 and p(b) = .01:
  ◦ P_α(a) = .99^.75 / (.99^.75 + .01^.75) ≈ .97
  ◦ P_α(b) = .01^.75 / (.99^.75 + .01^.75) ≈ .03

(P_α(w) is the probability that a word w not appearing within the window is chosen as a negative sample.)

54
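A quick numpy check of the α = ¾ re-weighting in the example above:

```python
import numpy as np

p = np.array([0.99, 0.01])          # original probabilities of events a and b
alpha = 0.75

p_alpha = p**alpha / np.sum(p**alpha)
print(np.round(p_alpha, 2))         # [0.97 0.03] -- the rare event gets boosted
```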

Training maximizes the score of positive pairs and minimizes the score of negative pairs

55
Analogy
• Word embedding → meaning

56
GloVe
GloVe: Global Vectors
• Combines the advantages of the two major models:
  ◦ Global matrix factorization
    • Pro: Use the statistical information of the overall corpus
    • Con: Not good at the word-to-word analogy task
  ◦ Local context window methods
    • Pro: Good at the word-to-word analogy task
    • Con: Does not reflect the statistical information of the overall corpus

• The model efficiently leverages statistical information by training only on the
nonzero elements in a word-word co-occurrence matrix, rather than on the entire
sparse matrix or on individual context windows in a large corpus

(Pennington et al, 2014): GloVe: Global Vectors for Word Representation


58
Window based Co-occurrence Matrix
• A matrix whose entry in row i, column k lists the number of times word k appears within
the window of word i.
  ◦ Symmetric (irrelevant whether left or right context)

• E.g., the corpus has these three sentences (window size is 1); the resulting matrix is
built in the sketch below:
  ◦ I like deep learning
  ◦ I like NLP
  ◦ I enjoy flying

(Pennington et al, 2014): GloVe: Global Vectors for Word Representation


59
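A small sketch that builds the window-1 co-occurrence matrix for these three sentences
(plain counts, no scaling):

```python
import numpy as np

sentences = ["I like deep learning".split(),
             "I like NLP".split(),
             "I enjoy flying".split()]

vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)), dtype=int)

window = 1
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1    # symmetric by construction

print(vocab)
print(X)    # e.g. X[idx["I"], idx["like"]] == 2, since "I like" occurs twice
```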
Co-occurrence vectors
• Simple count co-occurrence vectors
  ◦ Vectors increase in size with vocabulary
  ◦ Very high dimensional: require a lot of storage (though sparse)
  ◦ Subsequent classification models have sparsity issues → models are less robust

• Low-dimensional vectors
  ◦ Idea: store "most" of the important information in a fixed, small number of dimensions: a
dense vector
  ◦ Usually 25–1000 dimensions, similar to word2vec
  ◦ How to reduce the dimensionality?

60
Classic Method: Dimensionality Reduction on X
• Singular Value Decomposition of co-occurrence matrix X
• Factorizes X into UΣV^T, where U and V are orthonormal

Retain only k singular values, in order to generalize.


The result is the best rank k approximation to X , in terms of least squares.
Classic linear algebra result. Expensive to compute for large matrices.

61
Hacks to X: Scaled vectors
• Running an SVD on raw counts doesn’t work well
• Scaling the counts in the cells can help a lot
  ◦ Problem: function words (the, he, has) are too frequent → syntax has too much impact.
Some fixes:
    • log the frequencies
    • min(X, t), with t ≈ 100
    • Ignore the function words
  ◦ Ramped windows that count closer words more than words further away
  ◦ Use Pearson correlations instead of counts, then set negative values to 0
  ◦ Etc.

62
Interesting semantic patterns in the scaled vectors

COALS model (Rohde et al. ms., 2005). An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence

63
GloVe: Cost function
• Objective: make the dot product of the embedded center word and surrounding word
vectors reflect the probability of co-occurrence in the entire corpus
Notation:
  ◦ X : co-occurrence matrix
  ◦ X_ij : the number of times the surrounding word j appears in the window when the central
word i appears
  ◦ X_i = Σ_j X_ij : the sum of all the values in row i
  ◦ P_ik = P(k | i) = X_ik / X_i : the probability that surrounding word k appears in the window
when the central word i appears
    • Ex) P(like | NLP) = probability of the word 'like' appearing when the word 'NLP' appears
  ◦ P_ik / P_jk : ratio of co-occurrence probabilities, e.g. P(like | NLP) / P(like | deep) = 2.0
  ◦ w_i : embedding vector of center word i
  ◦ w̃_k : embedding vector of context word k

64
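The slides reference the cost function without reproducing it; for reference, the GloVe
objective (Pennington et al., 2014) is J = Σ_{i,j} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)².
A minimal sketch of evaluating it over the nonzero entries of X (random toy parameters,
illustrative only):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights very frequent co-occurrences."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost(X, W, W_tilde, b, b_tilde):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += glove_weight(X[i, j]) * diff ** 2
    return J

# toy usage with random parameters
rng = np.random.default_rng(0)
V, d = 8, 4
X = rng.integers(0, 5, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(glove_cost(X, W, W_tilde, b, b_tilde))
```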
Ratio of Co-occurrence probabilities
• Co-occurrence probabilities for target words ice and steam with selected context
words from a 6 billion token corpus.

  ◦ In P(k | ice), the probability is highest when k is water, so ice and water are related.
  ◦ If the probe word k is solid, the ratio P(solid | ice) / P(solid | steam) is greater than 1,
meaning that solid is more related to ice than to steam.
  ◦ If k is gas, there is a higher correlation with steam than with ice; when k is water,
it is related to both ice and steam.

• While P_ij reflects only local statistics, by introducing the probe word k the ratio
P_ik / P_jk captures global information.
65
GloVe: Cost function
Q: How can we capture ratios of co-occurrence probabilities as linear meaning
components in a word vector space?

  ◦ Training faster
  ◦ Scalable to very large corpora

66
GloVe: Global Vectors

(Pennington et al, 2014): GloVe: Global Vectors for Word Representation


67
FastText: Sub-Word Embeddings
• Similar to Skip-gram, but break words into character n-grams with n = 3 to 6
• Replace each word vector with the sum of the vectors of its character n-grams
(the word itself, wrapped in boundary markers, is included as well); an n-gram
extraction sketch follows below

68
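A short sketch of the sub-word idea: extract a word's character n-grams (FastText wraps
the word in boundary markers '<' and '>' and also keeps the whole wrapped word); the
word's vector is then the sum of the vectors of these n-grams.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams in the FastText style, including the wrapped word itself."""
    wrapped = f"<{word}>"
    grams = {wrapped}
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return sorted(grams)

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```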
Word Representation
• Word2Vec takes texts as training data for a neural network. The
resulting embedding captures whether words appear in similar
contexts.
• GloVe focuses on word co-occurrences over the whole corpus. Its
embeddings relate to the probabilities that two words appear
together.
• FastText improves on Word2Vec by taking word parts (characters)
into account, too. This trick enables training of embeddings on
smaller datasets and generalization to unknown words.

69
Trained Word Embeddings
• word2vec: https://code.google.com/archive/p/word2vec/
• GloVe: https://nlp.stanford.edu/projects/glove/
• FastText: https://fasttext.cc/

70
How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
  ◦ Evaluation on a specific/intermediate subtask
  ◦ Fast to compute
  ◦ Helps to understand that system
  ◦ Not clear if really helpful unless correlation to a real task is established

• Extrinsic:
  ◦ Evaluation on a real task
  ◦ Can take a long time to compute accuracy
  ◦ Unclear if the subsystem is the problem, or its interaction with other subsystems
  ◦ If replacing exactly one subsystem with another improves accuracy → Winning!
71
Intrinsic word vector evaluation
• Word Vector Analogies

• Evaluate word vectors by how well their cosine distance after addition
captures intuitive semantic and syntactic analogy questions
72
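A minimal sketch of the analogy test, assuming wv is a plain dict mapping words to numpy
vectors (real evaluations use benchmark sets such as the Google analogy dataset):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(wv, a, b, c):
    """Best word d completing 'a is to b as c is to d', scored by
    cosine similarity to the vector (b - a + c)."""
    target = wv[b] - wv[a] + wv[c]
    best, best_sim = None, -1.0
    for word, vec in wv.items():
        if word in (a, b, c):          # standard practice: exclude the query words
            continue
        sim = cosine(target, vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. analogy(wv, "man", "king", "woman") should ideally return "queen"
```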
GloVe Visualization

73
Meaning similarity: Another intrinsic word vector evaluation

• Word vector distances and their correlation with human judgments


• Example dataset: WordSim353
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

74
Correlation evaluation

• Word vector distances and their correlation with human judgments

75
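A sketch of how this correlation is typically computed, assuming pairs holds
WordSim353-style rows (word1, word2, human score) and wv maps words to numpy vectors:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_correlation(wv, pairs):
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in wv and w2 in wv:               # skip out-of-vocabulary pairs
            human.append(score)
            model.append(cosine(wv[w1], wv[w2]))
    rho, _ = spearmanr(human, model)            # rank correlation with human ratings
    return rho
```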
Extrinsic word vector evaluation
• One example where good word vectors should help directly: named entity
recognition: identifying references to a person, organization or location:

76
Named Entity Recognition (NER)
• The task: find and classify names in text, by labeling word tokens, for
example:
Last night, Paris[PER] Hilton[PER] wowed in a sequin gown.
Samuel[PER] Quinn[PER] was arrested in the Hilton[LOC] Hotel[LOC] in Paris[LOC] in April[DATE] 1989[DATE].

• Possible uses:
  ◦ Tracking mentions of particular entities in documents
  ◦ For question answering, answers are usually named entities
  ◦ Relating sentiment analysis to the entity under discussion

• Often followed by Entity Linking/Canonicalization into a Knowledge Base such
as Wikidata
77
Summary
• Words as Vectors
• Word2Vec
• Skip-Gram
• GloVe
• FastText
• Word vector evaluation

78
