Vector Semantics - NLP
NLP Vector Semantics (BCS-level lecture)

Dan Jurafsky and James Martin

Speech and Language Processing

Chapter 6:
Vector Semantics
Representing strings with vectors of words or
characters
• to be or not to be a bee is the question, said the
queen bee
Word        Count   Relative Frequency
a            1      0.07
be           2      0.13
bee          2      0.13
is           1      0.07
not          1      0.07
or           1      0.07
queen        1      0.07
question     1      0.07
said         1      0.07
the          2      0.13
to           2      0.13
Total       15      1.00

• Bag-of-words or unigrams
More context?
• to be or not to be a bee is the question, said the
queen bee
Word pair        Count   Relative Frequency
a bee             1      0.07
be a              1      0.07
be or             1      0.07
bee is            1      0.07
is the            1      0.07
not to            1      0.07
or not            1      0.07
queen bee         1      0.07
question said     1      0.07
said the          1      0.07
the queen         1      0.07
the question      1      0.07
to be             2      0.14
Total            14      1.00

• Bigrams
More context?
• to be or not to be a bee is the question, said the
queen bee
• Unigrams
• Bigrams
• Trigrams
• 4-grams and so on…
Character-level vectors
• to be or not to be a bee is the question, said the
queen bee

Character frequencies (45 characters, excluding spaces and punctuation):

Character   Count   Relative Frequency
a            2      0.04
b            4      0.09
d            1      0.02
e           11      0.24
h            2      0.04
i            3      0.07
n            3      0.07
o            5      0.11
q            2      0.04
r            1      0.02
s            3      0.07
t            6      0.13
u            2      0.04
Total       45      1.00

Character pair frequencies (44 adjacent pairs, treating the text as one string without spaces):

Pair   Count   Relative Frequency
ab      1      0.02
ai      1      0.02
be      4      0.09
dt      1      0.02
ea      1      0.02
ee      3      0.07
ei      1      0.02
en      1      0.02
eo      1      0.02
eq      2      0.05
es      1      0.02
he      2      0.05
id      1      0.02
io      1      0.02
is      1      0.02
nb      1      0.02
no      1      0.02
ns      1      0.02
ob      2      0.05
on      1      0.02
or      1      0.02
ot      1      0.02
qu      2      0.05
rn      1      0.02
sa      1      0.02
st      2      0.05
th      2      0.05
ti      1      0.02
to      2      0.05
tt      1      0.02
ue      2      0.05
Total  44      1.00
Bag of Words
• A simplifying representation
• Text (such as a sentence or a document) is
represented as the bag (multiset) of its words
– Disregarding grammar and even word order but keeping
multiplicity
• Used in document classification where the
(frequency of) occurrence of each word is used as a
feature for training a classifier
• But some words that are common in general do not tell us much about a document (e.g. the, is, a)
• TF-IDF
Measures to normalize term-frequencies
• Raw frequency: The number of times that
term t occurs in document d,
– tf(t,d) = f_{t,d}
– We need to remove bias towards long or short
documents
• Normalize by document length
• Relative term frequency i.e. tf adjusted for
document length:
– tf(t,d) = f_{t,d} ÷ (number of words in d)
But raw frequency is a bad
representation
Frequency is clearly useful; if sugar appears a lot
near apricot, that's useful information.
But overly frequent words like the, it, or they are
not very informative about the context
Need a function that resolves this frequency
paradox!
Other measures to normalize term-frequencies
• Next, we need ways to remove bias towards more
frequently occurring words.
– A word appearing 100 times in a document does not make it 100 times more representative of the document

• Boolean "frequency": tf(t,d) = 1 if t occurs in d, and 0 otherwise

• Logarithmically scaled term frequency: tf(t,d) = 1 + log10(f_{t,d}) if f_{t,d} > 0, and 0 otherwise
  – Thus terms which occur 10 times in a document would have tf = 2, 100 times in a document tf = 3, 1000 times tf = 4, ...
TF-IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

• TF-IDF increases proportionally to the number of times a word appears in the document
• It is offset by the frequency of the word in the whole corpus of documents
• This helps adjust for the fact that some words appear more frequently in general.
IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
• Give a higher weight to words that occur only in a few documents
• Terms that are limited to a few documents are useful for
discriminating those documents from the rest of the collection; terms
that occur frequently across the entire collection aren’t as helpful.
• Because of the large number of documents in many collections, this
measure is usually squashed with a log function.
• Inverse document frequency, idf(t, d) is the logarithmically scaled
inverse fraction of the documents that contain the word
tf-idf: combine two factors
• tf: term frequency, the (usually log-transformed) frequency count of term t in document d:
    tf(t,d) = 1 + log10(f_{t,d}) if f_{t,d} > 0, and 0 otherwise
• idf: inverse document frequency:
    idf(t) = log10( N / df_t )
  where N is the total number of documents in the collection and df_t is the number of documents that contain word t
• Words like "the" or "good" have very low idf
• tf-idf value for word t in document d:
    w(t,d) = tf(t,d) × idf(t)
TF-IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
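A minimal sketch of these two weights in plain Python, using the log10-scaled tf and the idf defined above. The three toy documents and function names are my own, chosen only for illustration.

```python
import math

# three toy documents (my own), each a list of tokens
docs = [
    "the sugar is in the apricot jam".split(),
    "a pinch of the sugar sweetens the result".split(),
    "the digital computer stores the information data".split(),
]
N = len(docs)

def tf(term, doc):
    # log-scaled term frequency: 1 + log10(count), or 0 if the term is absent
    count = doc.count(term)
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(term):
    # log10 of (total number of documents / number of documents containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log10(N / df) if df else 0.0

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("the", docs[0]))    # 0.0 -- "the" occurs in every document, so idf = 0
print(tf_idf("sugar", docs[0]))  # > 0 -- "sugar" occurs in only 2 of the 3 documents
print(tf_idf("data", docs[2]))   # highest idf: occurs in a single document
```

Note how the corpus-wide idf factor, not the raw count, is what pushes "the" down to zero weight.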
Now that we have our vectors!
• How do we compare two strings or documents?
  – Convert each into a vector
  – Calculate the distance between the vectors

Vectors are the basis of information retrieval
• Vectors are similar for the two comedies, but different from the history plays
• Comedies have more fools and wit and fewer battles.
Words can be vectors too

• battle is "the kind of word that occurs in Julius Caesar and Henry V"
• fool is "the kind of word that occurs in comedies, especially Twelfth Night"
Angles between vectors
Cosine Similarity

• The raw dot product, however, has a problem as a similarity metric: it favors long vectors.
• The simplest way to modify the dot product to normalize for vector length is to divide the dot product by the lengths of each of the two vectors.
• This normalized dot product turns out to be the same as the cosine of the angle between the two vectors.
Cosine Similarity
• The cosine value ranges from 1 for vectors pointing in the
same direction, through 0 for vectors that are orthogonal,
to -1 for vectors pointing in opposite directions.
• But raw frequency values are non-negative, so the cosine
for these vectors ranges from 0–1.
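A minimal sketch of the normalized dot product described above, in plain Python (names are my own):

```python
import math

def dot(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine(v1, v2):
    # normalized dot product = cosine of the angle between the two vectors
    return dot(v1, v2) / (norm(v1) * norm(v2))

print(cosine([1, 2, 3], [2, 4, 6]))   # 1.0  (same direction)
print(cosine([1, 0], [0, 1]))         # 0.0  (orthogonal)
```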
Cosine Similarity
All Distances
S1: to be or not to be a bee is the question, said the queen bee
S2: one needs to be strong in order to be a queen bee

Words      #S1  #S2   f1      f2      (f1-f2)^2  |f1-f2|  f1·f2   f1^2    f2^2
a           1    1    0.0667  0.0833   0.0003    0.0167   0.0056  0.0044  0.0069
be          2    2    0.1333  0.1667   0.0011    0.0333   0.0222  0.0178  0.0278
bee         2    1    0.1333  0.0833   0.0025    0.0500   0.0111  0.0178  0.0069
in          0    1    0.0000  0.0833   0.0069    0.0833   0.0000  0.0000  0.0069
is          1    0    0.0667  0.0000   0.0044    0.0667   0.0000  0.0044  0.0000
needs       0    1    0.0000  0.0833   0.0069    0.0833   0.0000  0.0000  0.0069
not         1    0    0.0667  0.0000   0.0044    0.0667   0.0000  0.0044  0.0000
one         0    1    0.0000  0.0833   0.0069    0.0833   0.0000  0.0000  0.0069
or          1    0    0.0667  0.0000   0.0044    0.0667   0.0000  0.0044  0.0000
order       0    1    0.0000  0.0833   0.0069    0.0833   0.0000  0.0000  0.0069
queen       1    1    0.0667  0.0833   0.0003    0.0167   0.0056  0.0044  0.0069
question    1    0    0.0667  0.0000   0.0044    0.0667   0.0000  0.0044  0.0000
said        1    0    0.0667  0.0000   0.0044    0.0667   0.0000  0.0044  0.0000
strong      0    1    0.0000  0.0833   0.0069    0.0833   0.0000  0.0000  0.0069
the         2    0    0.1333  0.0000   0.0178    0.1333   0.0000  0.0178  0.0000
to          2    2    0.1333  0.1667   0.0011    0.0333   0.0222  0.0178  0.0278
Total      15   12    1.0000  1.0000

f1 and f2 are the relative frequencies in S1 and S2. From the columns above:
  Euclidean distance  = sqrt(Σ(f1-f2)^2)  = 0.2828
  Manhattan distance  = Σ|f1-f2|          = 1.0333
  Chebyshev distance  = max|f1-f2|        = 0.1333
  Dot product         = Σ f1·f2           = 0.0667
  |f1| = sqrt(Σ f1^2) = 0.3197,  |f2| = sqrt(Σ f2^2) = 0.3333
  Cosine similarity   = (f1·f2) / (|f1| |f2|)
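The comparison above can be reproduced with a short script. A sketch (variable names are my own) that builds the relative-frequency vectors for S1 and S2 and should roughly match the totals in the table:

```python
import math
from collections import Counter

s1 = "to be or not to be a bee is the question said the queen bee".split()
s2 = "one needs to be strong in order to be a queen bee".split()

vocab = sorted(set(s1) | set(s2))
c1, c2 = Counter(s1), Counter(s2)
# relative frequencies over the shared vocabulary, as in the table above
f1 = [c1[w] / len(s1) for w in vocab]
f2 = [c2[w] / len(s2) for w in vocab]

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
manhattan = sum(abs(a - b) for a, b in zip(f1, f2))
chebyshev = max(abs(a - b) for a, b in zip(f1, f2))
dot       = sum(a * b for a, b in zip(f1, f2))
cosine    = dot / (math.sqrt(sum(a * a for a in f1)) *
                   math.sqrt(sum(b * b for b in f2)))

# expected from the table: ~0.2828, ~1.0333, ~0.1333, ~0.0667,
# and cosine ~0.63 (= 0.0667 / (0.3197 * 0.3333))
print(euclidean, manhattan, chebyshev, dot, cosine)
```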
Automatic Thesaurus
Generation
Attempt to generate a thesaurus automatically by analyzing
a collection of documents
Fundamental notion: similarity between two words
Definition 1: Two words are similar if they co-occur with
similar words.
Definition 2: Two words are similar if they occur in a given
grammatical relation with the same words.
You can harvest, peel, eat, prepare, etc. apples and pears,
so apples and pears must be similar.
Co-occurrence-based similarity is more robust; grammatical-relation-based similarity is more accurate.
Words, Lemmas, Senses, Definitions
lemma sense definition
[Scan of the OED entry for the lemma "pepper, n.": pronunciation, forms, etymology (a borrowing from classical Latin piper), and numbered senses covering the spice derived from the prepared fruits (peppercorns) of Piper nigrum, the pepper plant itself, and the various capsicums (cayenne, chilli, paprika, sweet/bell pepper).]
Lemma pepper
Sense 1: spice from pepper plant
Sense 2: the pepper plant itself
Sense 3: another similar plant (Jamaican
pepper)
Sense 4: another plant with peppercorns
(California pepper)
Sense 5: capsicum (i.e. chili, paprika, bell
pepper, etc)
A sense or “concept” is the
meaning component of a word
Let's define words by their
usages
In particular, words are defined by their
environments (the words around them)

Zellig Harris (1954): If A and B have almost identical environments we say that they are synonyms.
What does ong choi mean?
Suppose you see these sentences:
• Ongchoi is delicious sautéed with garlic.
• Ongchoi is superb over rice
• Ongchoi leaves with salty sauces
And you've also seen these:
• …spinach sautéed with garlic over rice
• Chard stems and leaves are delicious
• Collard greens and other salty leafy greens
Conclusion:
◦ Ongchoi is a leafy green like spinach, chard, or collard
greens
Ong choi: Ipomoea aquatica
"Water Spinach"

Yamaguchi, Wikimedia Commons, public domain


Example of Tf*Idf Vector
Represent the word “apple” as vector using following corpus. Use TF.IDF weights.
Assume the window size for word context is 2
Document 1: I like to ride cycle often.
Document 2: Ali and Hassan ate apple and oranges.
Document 3: Ali ate apple not oranges
Example of Tf*Idf Vector
Represent the word “apple” as vector using following corpus. Use TF.IDF weights.
Assume the window size for word context is 2
Document 1: I like to ride cycle often.
Document 2: Ali and Hassan ate apple and oranges.
Document 3: Ali ate apple not oranges

Context words: Hassan, ate, and, oranges, Ali, not

Step 4: Construct the TF-IDF vector for "apple"
Combine all the TF-IDF values for the context words to create the vector for "apple":
TF-IDF vector = [0.157, 0.063, 0.157, 0.058, 0.183, 0.068, 0.183]
Context word occurrences: Hassan, ate, and, oranges, Ali, ate, not, oranges

Notice 'ate' comes twice, so in the consolidated vector you can:
• average the tf-idf values of 'ate', or
• take a weighted average of the tf-idf values of 'ate'
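A sketch of the general procedure: collect the context words of "apple" within a ±2 window and weight each one. The exact tf convention used in the slide's (unshown) steps 1-3 is not given here, so the resulting numbers will differ from the vector above; treat this as illustrative only.

```python
import math
from collections import Counter

docs = [
    "i like to ride cycle often".split(),
    "ali and hassan ate apple and oranges".split(),
    "ali ate apple not oranges".split(),
]
N = len(docs)
target, window = "apple", 2

# Step 1: collect context words of the target within a +/-2 window
context = Counter()
for doc in docs:
    for i, w in enumerate(doc):
        if w == target:
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    context[doc[j]] += 1

# Step 2: weight each context word by tf (within the context window
# counts) times idf (over the corpus)
def idf(term):
    df = sum(1 for d in docs if term in d)
    return math.log10(N / df)

total = sum(context.values())
vector = {w: (count / total) * idf(w) for w, count in context.items()}
print(vector)   # context words: hassan, ate, and, oranges, ali, not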
Summary: tf-idf
Compare two words using tf-idf cosine to see if they
are similar
Compare two documents
◦ Take the centroid of vectors of all the words in the
document
◦ The centroid document vector can be computed in two ways:
  1. Counting repeated words (Approach 1): gives more weight to frequent words, emphasizing their importance in the document.
  2. Counting unique words (Approach 2): reduces the impact of word frequency, focusing on the diversity of words.
tf-idf cosine
When comparing two documents, we treat the document as a
collection of word vectors, and to get a single vector for a
document, we calculate its centroid.
The centroid is simply the average of the TF-IDF vectors of all
the words in the document.

By comparing centroids, you can measure the overall similarity of two documents rather than comparing individual words.

For document cosine similarity: cosine(f1, f2) = f1·f2 / (|f1| |f2|)
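A minimal numpy sketch of comparing two documents by the centroids of their word vectors. The toy word vectors are placeholders of my own; in practice they would be the tf-idf vectors built earlier.

```python
import numpy as np

def centroid(word_vectors):
    # average of the vectors of all the word occurrences in a document
    return np.mean(word_vectors, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# placeholder word vectors; real ones would come from the tf-idf weighted matrix
doc1 = [np.array([0.2, 0.0, 0.5]), np.array([0.1, 0.3, 0.4])]
doc2 = [np.array([0.0, 0.4, 0.6]), np.array([0.2, 0.2, 0.3])]

print(cosine(centroid(doc1), centroid(doc2)))
```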


Dan Jurafsky and James Martin
Speech and Language Processing

Chapter 6:
Vector Semantics
Let's define words by their
usages
In particular, words are defined by their
environments (the words around them)

Zellig Harris (1954): If A and B have almost identical environments we say that they are synonyms.
What does ong choi mean?
Suppose you see these sentences:
• Ongchoi is delicious sautéed with garlic.
• Ongchoi is superb over rice
• Ongchoi leaves with salty sauces
And you've also seen these:
• …spinach sautéed with garlic over rice
• Chard stems and leaves are delicious
• Collard greens and other salty leafy greens
Conclusion:
◦ Ongchoi is a leafy green like spinach, chard, or collard
greens
Ong choi: Ipomoea aquatica
"Water Spinach"

Yamaguchi, Wikimedia Commons, public domain


Build a new model of meaning
focusing on similarity
Each word = a vector
Similar words are "nearby in space"

[2-D visualization of word embeddings for sentiment: negative words (not good, bad, dislike, worst, incredibly bad, worse) cluster in one region; positive words (very good, incredibly good, amazing, fantastic, terrific, wonderful, nice, good) cluster in another; function words (to, by, 's, that, now, are, a, i, you, than, with, is) sit in between.]
Define a word as a vector
Called an "embedding" because it's embedded
into a space
The standard way to represent meaning in NLP
Fine-grained model of meaning for similarity
◦ NLP tasks like sentiment analysis
◦ With words, requires same word to be in training and test
◦ With embeddings: ok if similar words occurred!!!
◦ Question answering, conversational agents, etc
2 kinds of embeddings
Tf-idf
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by a simple function of the counts
of nearby words
Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to
distinguish nearby and far-away words
An alternative to tf-idf
Ask whether a context word is particularly
informative about the target word.
◦ Positive Pointwise Mutual Information (PPMI)

It compares the observed co-occurrence frequency of two words to their expected co-occurrence if they were statistically independent.
Pointwise Mutual Information

Pointwise mutual information:
Do events x and y co-occur more than if they were independent?

  PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]

PMI between two words (Church & Hanks 1989):
Do words word1 and word2 co-occur more than if they were independent?

  PMI(word1, word2) = log2 [ P(word1, word2) / (P(word1) P(word2)) ]
Positive Pointwise Mutual Information
◦ PMI ranges from −∞ to +∞
◦ But the negative values are problematic
◦ Things are co-occurring less than we expect by chance
◦ So we just replace negative PMI values by 0
◦ Positive PMI (PPMI) between word1 and word2:

  PPMI(word1, word2) = max( log2 [ P(word1, word2) / (P(word1) P(word2)) ], 0 )
Computing PPMI on a term-context
matrix
Matrix F with W rows (words) and C columns (contexts)
f_ij is the number of times word w_i occurs in context c_j

  p_ij = f_ij / Σ_i Σ_j f_ij          (joint probability)
  p_i* = Σ_j f_ij / Σ_i Σ_j f_ij      (probability of word w_i)
  p_*j = Σ_i f_ij / Σ_i Σ_j f_ij      (probability of context c_j)

  pmi_ij  = log2 [ p_ij / (p_i* p_*j) ]
  ppmi_ij = pmi_ij if pmi_ij > 0, and 0 otherwise

Worked example (total count N = Σ_i Σ_j f_ij = 19):
  p(w = information, c = data) = 6/19 = .32
  p(w = information)           = 11/19 = .58
  p(c = data)                  = 7/19 = .37

p(w, context) and p(w):
              computer  data  pinch  result  sugar |  p(w)
  apricot       0.00    0.00  0.05   0.00    0.05  |  0.11
  pineapple     0.00    0.00  0.05   0.00    0.05  |  0.11
  digital       0.11    0.05  0.00   0.05    0.00  |  0.21
  information   0.05    0.32  0.00   0.21    0.00  |  0.58
  p(context)    0.16    0.37  0.11   0.26    0.11

pmi(information, data) = log2( .32 / (.37 × .58) ) = .58   (.57 using full precision)
PPMI(w,context)
computer data pinch result sugar
apricot - - 2.25 - 2.25
pineapple - - 2.25 - 2.25
digital 1.66 0.00 - 0.00 -
information 0.00 0.57 - 0.47 -
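A small numpy sketch that recomputes the PPMI table above. The raw counts below are recovered from the probability table (e.g. p(information, data) = 6/19 = .32), so the script should reproduce the same PPMI values.

```python
import numpy as np

words    = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]

# raw co-occurrence counts consistent with the probability table above
F = np.array([
    [0, 0, 1, 0, 1],   # apricot
    [0, 0, 1, 0, 1],   # pineapple
    [2, 1, 0, 1, 0],   # digital
    [1, 6, 0, 4, 0],   # information
], dtype=float)

P = F / F.sum()                      # joint probabilities p_ij
p_w = P.sum(axis=1, keepdims=True)   # row marginals p_i*
p_c = P.sum(axis=0, keepdims=True)   # column marginals p_*j

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(P / (p_w * p_c))   # cells with zero counts give -inf
ppmi = np.maximum(pmi, 0)            # clamp negative (and -inf) values to 0
ppmi[np.isnan(ppmi)] = 0             # guard for all-zero rows/columns (not the case here)

print(np.round(ppmi, 2))             # e.g. ppmi(information, data) ≈ 0.57
```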
Weighting PMI
PMI is biased toward infrequent events
◦ Very rare words have very high PMI values
Two solutions:
◦ Give rare words slightly higher probabilities
Weighting PMI
PMI is biased toward infrequent events
◦ Very rare words have very high PMI values
Two solutions:
◦ Give rare words slightly higher probabilities

If one of the words (either the target word w or the context word c has a
higher probability while the other is rare, the PMI value will still be affected.
Weighting PMI: Giving rare
context words slightly higher
probability
Raise the context probabilities to the power α = 0.75:

  P_α(c) = P(c)^α / Σ_c' P(c')^α

This helps because P_α(c) > P(c) for rare c.

Consider two events, P(a) = .99 and P(b) = .01:

  P_α(a) = .99^0.75 / (.99^0.75 + .01^0.75) = .97
  P_α(b) = .01^0.75 / (.99^0.75 + .01^0.75) = .03
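A quick check of this α-weighting in Python, using the two probabilities from the example above:

```python
alpha = 0.75
p = {"a": 0.99, "b": 0.01}

z = sum(v ** alpha for v in p.values())
p_alpha = {w: (v ** alpha) / z for w, v in p.items()}
print(p_alpha)   # roughly {'a': 0.97, 'b': 0.03} -- the rare event gets boosted
```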
Summary for Part I

• Idea of embeddings: represent a word as a function of its distribution with other words
• Tf-idf
• Cosines
• PPMI
Calculate the PPMI vectors of the words data and science.
Consider a context window of size 2.

Data science is an interdisciplinary field.
Machine learning is a subset of Data science.
Data science involves statistics and programming.
Programming is essential for Data science and Artificial intelligence.
Statistics is important for Data science and Machine learning.
Dan Jurafsky and James Martin
Speech and Language Processing

Chapter 6:
Vector Semantics, Part II
Distributional similarity based
representations
You can get a lot of value by representing a word by means of its
neighbors
“You shall know a word by the company it keeps”
(J. R. Firth 1957: 11)

One of the most successful ideas of modern statistical NLP

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These words will represent banking.
Solution: Low dimensional
vectors
The number of topics that people talk about is small (in some sense)
◦ Clothes, movies, politics, …
• Idea: store “most” of the important information in a fixed, small
number of dimensions: a dense vector
• Usually 25 – 1000 dimensions

• How to reduce the dimensionality?
• Go from a big, sparse co-occurrence count vector to a low-dimensional "word embedding"
Idea: Directly learn low-
dimensional word vectors based
on ability to predict
• Old idea: Learning representations by back-propagating errors (Rumelhart et al., 1986)
• A neural probabilistic language model (Bengio et al., 2003): non-linear and slow
• NLP (almost) from Scratch (Collobert & Weston, 2008): non-linear and slow
• A recent, even simpler and faster model: word2vec (Mikolov et al. 2013), introduced now: a fast bilinear model
• The GloVe model from Stanford (Pennington, Socher, and Manning 2014), a fast bilinear model that connects back to matrix factorization
• Per-token representations, i.e. deep contextual word representations (ELMo, ULMfit, BERT): the current state of the art
Tf-idf and PPMI are
sparse representations

tf-idf and PPMI vectors are
◦ long (length |V| = 20,000 to 50,000)
◦ sparse (most elements are zero)
Alternative: dense vectors
Vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)
Sparse versus dense vectors
Why dense vectors?
◦ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than storing explicit
counts
◦ They may do better at capturing synonymy:
◦ car and automobile are synonyms; but are distinct dimensions
◦ a word with car as a neighbor and a word with automobile as a
neighbor should be similar, but aren't
◦ In practice, they work better
Word2vec
Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count
Word2vec
◦Instead of counting how often each
word w occurs near "apricot"
◦Train a classifier on a binary
prediction task:
◦Is w likely to show up near "apricot"?

◦ We don't actually care about this task
◦ But we'll take the learned classifier weights as the word embeddings
Brilliant insight: Use running text as
implicitly supervised training data!
• A word s near apricot
• Acts as gold ‘correct answer’ to the
question
• “Is word w likely to show up near apricot?”
• No need for hand-labeled supervision
• The idea comes from neural language
modeling
• Bengio et al. (2003)
• Collobert et al. (2011)
BENGIO, YOSHUA, ET AL. "A NEURAL PROBABILISTIC LANGUAGE MODEL." JOURNAL OF MACHINE LEARNING RESEARCH 3.FEB (2003):
1137-1155.
Word2vec is a family of algorithms
[Mikolov et al. 2013]

Predict between every word and its context words!

Two algorithms:
1. Skip-grams (SG): predict context words given the target (position independent)
2. Continuous Bag of Words (CBOW): predict the target word from the bag-of-words context

Training methods:
1. Hierarchical softmax
2. Negative sampling
3. Naïve softmax (simple but expensive; the first two are the moderately efficient choices)
Word2vec CBOW (Continuous Bag Of Words) and
Skip-gram network architectures
Word2Vec: Skip-Gram Task
Word2vec provides a variety of options. Let's do
◦ "skip-gram with negative sampling" (SGNS)
Word2Vec Skip-gram
Overview
Example windows and process for computing P(w_{t+j} | w_t):

  P(w_{t-2} | w_t)   P(w_{t-1} | w_t)   P(w_{t+1} | w_t)   P(w_{t+2} | w_t)

  … problems turning into banking crises as …

(center word at position t: banking; outside context words in a window of size 2 on each side)
WordToVec algorithm
1. Treat the target word and a neighboring
context word as positive examples.
2. Use other words not in context as
negative samples
3. Use a model to train a classifier to
distinguish those two cases
4. Use the learned weights (parameters) of
classifier as the embeddings
Skip-Gram Training Data
Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
c1 c2 target c3 c4

Assume context words are those in a +/- 2 word window.
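A minimal sketch (variable names are my own) of how the positive (target, context) training pairs can be generated from the sentence above with a ±2 window:

```python
text = "lemon a tablespoon of apricot jam a pinch".split()
window = 2

positive_pairs = []
for i, target in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            positive_pairs.append((target, text[j]))

# pairs centred on "apricot":
# (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)
print([p for p in positive_pairs if p[0] == "apricot"])
```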
WordToVec Goal
Given a tuple (t,c) = target, context
◦ (apricot, jam)
◦ (apricot, aardvark)
Return probability that c is a real context word:

P(+|t,c)
P(−|t,c) = 1−P(+|t,c)
How to compute p(+|t,c)?
Intuition:
◦ Words are likely to appear near similar words
◦ Model similarity with dot-product!
◦ Similarity(t,c) ∝ t · c
Problem:
◦Dot product is not a probability!
◦ (Neither is cosine)
Turning the dot product into a probability
The sigmoid lies between 0 and 1:

  P(+ | t, c) = σ(t · c) = 1 / (1 + e^(−t · c))

For all the context words (assuming all context words are independent):

  P(+ | t, c_1..k) = Π_{i=1..k} σ(t · c_i)
Objective Criteria
We want to maximize…

Maximize the + label for the pairs from the positive training data, and the − label for the pairs sampled from the negative data (i.e. maximize the probability of the correct label in both cases).
Loss Function: Binary Cross-Entropy /
Log Loss
Setup
Let's represent words as vectors of some length (say
300), randomly initialized.
So we start with 300 * V random parameters
Over the entire training set, we’d like to adjust those
word vectors such that we
◦ Maximize the similarity of the target word, context
word pairs (t,c) drawn from the positive data
◦ Minimize the similarity of the (t,c) pairs drawn from
the negative data.
Learning the classifier
Iterative process.
We’ll start with 0 or random weights
Then adjust the word weights to
◦ make the positive pairs more likely
◦ and the negative pairs less likely
over the entire training set:
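To make the iterative process concrete, here is a minimal sketch of an SGNS-style update loop in numpy. The toy corpus, dimensionality, learning rate, and uniform negative sampling are simplifications of my own; it illustrates adjusting target and context embeddings with the sigmoid of the dot product, and is not the implementation from the links below.

```python
import numpy as np

rng = np.random.default_rng(0)
text = "lemon a tablespoon of apricot jam a pinch of sugar".split()
vocab = sorted(set(text))
w2i = {w: i for i, w in enumerate(vocab)}

dim, window, k, lr, epochs = 10, 2, 2, 0.05, 50
W = rng.normal(scale=0.1, size=(len(vocab), dim))   # target embeddings
C = rng.normal(scale=0.1, size=(len(vocab), dim))   # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for i, target in enumerate(text):
        t = w2i[target]
        for j in range(max(0, i - window), min(len(text), i + window + 1)):
            if j == i:
                continue
            # one positive (target, context) pair plus k uniformly sampled
            # negatives (real word2vec samples from the weighted unigram
            # distribution instead)
            samples = [(w2i[text[j]], 1.0)]
            samples += [(int(rng.integers(len(vocab))), 0.0) for _ in range(k)]
            for c, label in samples:
                pred = sigmoid(W[t] @ C[c])
                grad = pred - label          # gradient of the log loss
                g_t = grad * C[c]
                g_c = grad * W[t]
                W[t] -= lr * g_t             # positive pairs become more similar,
                C[c] -= lr * g_c             # negative pairs less similar

# the rows of W (optionally W + C) are the learned embeddings
print(W[w2i["apricot"]])
```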
HTTPS://NATHANROOY.GITHUB.IO/POSTS/2018-03-22/WORD2VEC-FROM-SCRATCH-WITH-PYTHON-AND-NUMPY/?SOURCE=POST_PAGE-----13445EEBD281
HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
first and last element in the first training window
# 1 [Target (natural)], [Context (language, processing)]
[list([1, 0, 0, 0, 0, 0, 0, 0, 0])
list([[0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0]])]

first and last element in the last training window


#10 [Target (exciting)], [Context (fun, and)]
[list([0, 0, 0, 0, 0, 0, 0, 0, 1])
list([[0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0]])]

HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
N = dimensions, V = vocabulary size (number of unique words)

Input layer:  one-hot vector of word i (1×V)
Hidden layer: (1×V) × W1 (V×N) = 1×N
Output layer: (1×N) × W2 (N×V) = 1×V
Word2Vec — skip-gram network
architecture

HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
WordToVec Example
WordToVec Example 2 layer
neural network
HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
Read the following links for a worked Word2Vec example:
https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281
https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/?source=post_page-----13445eebd281
Word2vec is a family of algorithms
[Mikolov et al. 2013]

Predict between every word and its context words!

Two algorithms:
1. Skip-grams (SG): predict context words given the target (position independent)
2. Continuous Bag of Words (CBOW): predict the target word from the bag-of-words context

Training methods:
1. Hierarchical softmax
2. Negative sampling
3. Naïve softmax (simple but expensive; the first two are the moderately efficient choices)
Motivation for Negative Sampling
(without it, the W2 weight matrix is updated for every training example)

This means that instead of updating the entire matrix (which has a size of hidden layer × vocabulary size), the model only updates the weights associated with the context word and the few negative samples.

HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
Skip-gram algorithm
1. Treat the target word and a neighboring
context word as positive examples.
2. Randomly sample other words in the
lexicon to get negative samples
3. Use a model to train a classifier to
distinguish those two cases
4. Use the weights as the embeddings
Skip-Gram Training Data with
Negative Sampling
Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
c1 c2 t c3 c4

• Training data: input/output pairs centering on apricot
• Assume a +/- 2 word window
Skip-Gram Training with
Negative Sampling
Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
c1 c2 t c3 c4

• For each positive example, we'll create k negative examples.
• Any random word that isn't t
Skip-Gram Training with
Negative Sampling
Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
c1 c2 t c3 c4
k=2
Total words = dog, cat, lion, table, chair, dog, dog, cat, cat, cat
Vocabulary = dog, cat, lion, table, chair

Unigram probabilities:
cat = 4/10 = 0.4
dog = 3/10 = 0.3
table = 1/10 = 0.1
chair = 1/10 = 0.1
lion = 1/10 = 0.1
Choosing noise words (negative
samples)
Could pick w according to its unigram frequency P(w)
More common to use P_α(w) = P(w)^α / Σ_w' P(w')^α
• α = ¾ works well because it gives rare noise words slightly higher probability
• To see this, imagine two events P(a) = .99 and P(b) = .01: as computed earlier, P_α(a) = .97 and P_α(b) = .03.
Loss Function
We want to maximize…

Maximize the + label for the pairs from the positive training data, and the − label for the pairs sampled from the negative data.

With negative sampling, the number of negative examples (and hence the number of weights updated per step) is much smaller.
Two sets of embeddings

SGNS learns two sets of embeddings:
• Target embedding matrix W (weight matrix 1)
• Context embedding matrix C (weight matrix 2)

It's common to just add them together, representing word i as the vector w_i + c_i.
Summary: How to learn word2vec
(skip-gram) embeddings
Start with V random 300-dimensional vectors as
initial embeddings
Use logistic regression, the second most basic classifier used in machine learning after naïve Bayes
◦ Take a corpus and take pairs of words that co-occur as
positive examples
◦ Take pairs of words that don't co-occur, as negative
examples
◦ Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings.
Properties of embeddings
Similarity depends on the window size C:
◦ Small window size: the nearest words are syntactically similar words with the same part of speech (syntactic similarity: similar grammatical structures, word order, or linguistic patterns)
◦ Large window size: the nearest words are topically related but need not share a part of speech (semantic similarity: similar meanings, concepts, or contexts)

C = ±2: the nearest words to Hogwarts (name of a school) were names of other fictional schools:
◦ Sunnydale (name of a fictional school)
◦ Evernight (name of a fictional school)

C = ±5: the nearest words to Hogwarts were other words topically related to the Harry Potter series:
◦ Dumbledore
◦ Malfoy
◦ halfblood
Analogy: Embeddings capture
relational meaning!
vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
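As an illustration, these analogy tests can be run on pre-trained vectors. A sketch assuming gensim and its downloader are available (the model name is one of gensim's published pre-trained sets, used here only as an example):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads on first use

# vector('king') - vector('man') + vector('woman') ≈ vector('queen')
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# vector('paris') - vector('france') + vector('italy') ≈ vector('rome')
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```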
The evolution of sentiment words
Negative words change faster than positive words
Embeddings reflect cultural bias
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and
Adam T. Kalai. "Man is to computer programmer as woman is to
homemaker? debiasing word embeddings." In Advances in Neural
Information Processing Systems, pp. 4349-4357. 2016.

Ask “Paris : France :: Tokyo : x”


◦ x = Japan
Ask “father : doctor :: mother : x”
◦ x = nurse
Ask “man : computer programmer :: woman : x”
◦ x = homemaker
Embeddings as a window onto history
Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender
and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

The cosine similarity of embeddings for decade X for occupations (like teacher) to male vs. female names
◦ is correlated with the actual percentage of women teachers in decade X
Change in linguistic framing
1910-1990
Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender
and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)
Changes in framing:
adjectives associated with Chinese
Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender
and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644
Conclusion
Embeddings = vector models of meaning
◦ More fine-grained than just a string or index
◦ Especially good at modeling similarity/analogy
◦ Just download them and use cosines!!
◦ Can use sparse models (tf-idf) or dense models (word2vec, GloVe)
◦ Useful in practice but know they encode cultural stereotypes
◦ W2v gives static embeddings
Which one to choose?
◦ Skip-gram: good for infrequent words
Limitations of w2v
