Vector Semantics - NLP
Chapter 6:
Vector Semantics
Representing strings with vectors of words or
characters
• to be or not to be a bee is the question, said the
queen bee
Multiset: {a, be, be, bee, bee, is, not, or, queen, question, said, the, the, to, to}

Word frequencies (raw count and relative frequency):
Words       Count   Relative Frequency
a           1       0.07
be          2       0.13
bee         2       0.13
is          1       0.07
not         1       0.07
or          1       0.07
queen       1       0.07
question    1       0.07
said        1       0.07
the         2       0.13
to          2       0.13
Total       15      1.00
• Bag-of-words or unigrams
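As a concrete illustration, here is a minimal sketch of building such a bag-of-words count in Python (the tokenization is just a lowercase split with the comma stripped; real tokenizers are more careful):

```python
from collections import Counter

def bag_of_words(text):
    """Return raw counts and relative frequencies of the words in text."""
    tokens = text.lower().replace(",", "").split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: (count, count / total) for word, count in counts.items()}

sentence = "to be or not to be a bee is the question, said the queen bee"
print(bag_of_words(sentence))   # e.g. 'be' -> (2, 0.133...), 'a' -> (1, 0.066...); 15 tokens in total
```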
More context?
• to be or not to be a bee is the question, said the
queen bee
Multiset of word pairs: {to be, be or, or not, not to, to be, be a, a bee, bee is, is the, the question, question said, said the, the queen, queen bee}

Word pair frequencies (raw count and relative frequency):
Word pairs      Count   Relative Frequency
a bee           1       0.07
be a            1       0.07
be or           1       0.07
bee is          1       0.07
is the          1       0.07
not to          1       0.07
or not          1       0.07
queen bee       1       0.07
question said   1       0.07
said the        1       0.07
the queen       1       0.07
the question    1       0.07
to be           2       0.14
Total           14      1.00
• Bigrams
More context?
• to be or not to be a bee is the question, said the
queen bee
• Unigrams
• Bigrams
• Trigrams
• 4-grams, and so on… (a generic n-gram counter is sketched below)
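A generic n-gram counter covering all of these cases might look like the sketch below (simplified tokenization; punctuation is ignored):

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Return raw counts and relative frequencies of the n-grams in a token sequence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: (count, count / total) for gram, count in counts.items()}

tokens = "to be or not to be a bee is the question said the queen bee".split()
print(ngram_frequencies(tokens, 2))   # bigrams: ('to', 'be') -> (2, 0.142...)
print(ngram_frequencies(tokens, 3))   # trigrams, and so on for 4-grams
```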
Character-level vectors
• to be or not to be a bee is the question, said the
queen bee

Character frequencies (spaces and punctuation ignored):
Characters   Raw Count   Relative Frequency
a            2           0.04
b            4           0.09
d            1           0.02
e            11          0.24
h            2           0.04
i            3           0.07
n            3           0.07
o            5           0.11
q            2           0.04
r            1           0.02
s            3           0.07
t            6           0.13
u            2           0.04
Total        45          1.00

Character pair frequencies:
Character pairs   Raw Count   Relative Frequency
ab                1           0.02
ai                1           0.02
be                4           0.09
dt                1           0.02
ea                1           0.02
ee                3           0.07
ei                1           0.02
en                1           0.02
eo                1           0.02
eq                2           0.05
es                1           0.02
he                2           0.05
id                1           0.02
io                1           0.02
is                1           0.02
nb                1           0.02
no                1           0.02
ns                1           0.02
ob                2           0.05
on                1           0.02
or                1           0.02
ot                1           0.02
qu                2           0.05
rn                1           0.02
sa                1           0.02
st                2           0.05
th                2           0.05
ti                1           0.02
to                2           0.05
tt                1           0.02
ue                2           0.05
Total             44          1.00
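The same counting idea works at the character level; a quick sketch (spaces and punctuation are dropped first, which is how the tables above reach 45 characters and 44 character pairs):

```python
from collections import Counter

text = "to be or not to be a bee is the question, said the queen bee"
chars = [c for c in text if c.isalpha()]            # keep letters only
char_counts = Counter(chars)                         # 45 characters, e.g. 'e': 11
pair_counts = Counter(zip(chars, chars[1:]))         # 44 character pairs, e.g. ('b', 'e'): 4
print(char_counts["e"], pair_counts[("b", "e")])     # 11 4
```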
Bag of Words
• A simplifying representation
• Text (such as a sentence or a document) is
represented as the bag (multiset) of its words
– Disregarding grammar and even word order but keeping
multiplicity
• Used in document classification where the
(frequency of) occurrence of each word is used as a
feature for training a classifier
• But some words are common across all documents and tell us little about any particular document (e.g. the, is, a)
• TF-IDF
Measures to normalize term-frequencies
• Raw frequency: the number of times that term t occurs in document d
– tf(t,d) = f_{t,d}
– We need to remove bias towards long or short documents
• Normalize by document length
• Relative term frequency, i.e. tf adjusted for document length:
– tf(t,d) = f_{t,d} / (number of words in d)
But raw frequency is a bad
representation
Frequency is clearly useful; if sugar appears a lot
near apricot, that's useful information.
But overly frequent words like the, it, or they are
not very informative about the context
Need a function that resolves this frequency
paradox!
Other measures to normalize term-frequencies
• Next, we need ways to remove bias towards more
frequently occurring words.
– A word appearing 100 times in a document does not make it 100 times more representative of that document
Words like "the" or "good" have very low idf # of docs that have word i
-
TF-IDF
-
TF-IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
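A small sketch of tf-idf over a toy corpus, using the relative term frequency and the plain log(N / df_t) form of idf described above (conventions such as log base and smoothing vary between implementations):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns a {term: tf-idf weight} dict per document."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))      # document frequency per term
    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        weights.append({t: (c / length) * math.log(N / df[t]) for t, c in counts.items()})
    return weights

docs = [s.split() for s in ["the sugar and the apricot jam", "the apricot tree", "the queen bee"]]
print(tf_idf(docs))   # "the" appears in every document, so its idf is log(3/3) = 0 and its weight is 0
```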
Now that we have our vectors!
• How do we compare two strings or documents?
– Convert each into a vector
• This raw dot product, however, has a problem as a similarity metric: it favors long vectors.
• The simplest way to modify the dot product to normalize for the vector length is to divide
the dot product by the lengths of each of the two vectors.
• This normalized dot product turns out to be the same as the cosine of the angle between
the two
Cosine Similarity
• The cosine value ranges from 1 for vectors pointing in the
same direction, through 0 for vectors that are orthogonal,
to -1 for vectors pointing in opposite directions.
• But raw frequency values are non-negative, so the cosine for these vectors ranges from 0 to 1.
Cosine Similarity
All Distances
S1: to be or not to be a bee is the question, said the queen bee
S2: one needs to be strong in order to be a queen bee
Columns: Words, # S1, # S2, f1, f2, |f1 − f2|, max |f1 − f2|, sqrt((f1 − f2)^2), f1 · f2, |f1|, |f2|, f1 · f2 / (|f1| |f2|)
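A sketch of computing these measures for S1 and S2 from their word-count vectors (plain Python over a shared vocabulary; tokenization is a simple split):

```python
import math
from collections import Counter

s1 = "to be or not to be a bee is the question said the queen bee".split()
s2 = "one needs to be strong in order to be a queen bee".split()

c1, c2 = Counter(s1), Counter(s2)
vocab = sorted(set(s1) | set(s2))
f1 = [c1[w] for w in vocab]          # count vector for S1
f2 = [c2[w] for w in vocab]          # count vector for S2

manhattan = sum(abs(a - b) for a, b in zip(f1, f2))
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
dot = sum(a * b for a, b in zip(f1, f2))
cosine = dot / (math.sqrt(sum(a * a for a in f1)) * math.sqrt(sum(b * b for b in f2)))
print(manhattan, euclidean, dot, cosine)
```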
2.a. The plant Piper nigrum (family Piperaceae), a climbing shrub
indigenous to South Asia and also cultivated elsewhere in the tropics,
which has alternate stalked entire leaves, with pendulous spikes of small
green flowers opposite the leaves, succeeded by small berries turning red
when ripe. Also more widely: any plant of the genus Piper or the family
Piperaceae.
Chapter 6:
Vector Semantics
Let's define words by their
usages
In particular, words are defined by their
environments (the words around them)
[Figure: two-dimensional projection of word vectors; words with similar environments appear close together, e.g. bad, worse, worst, dislike, "not good", "incredibly bad" cluster in one region, while words such as to, by, 's, that, now, are, a, i, you, than, with, is appear elsewhere.]
PMI(word1, word2) = log2 [ P(word1, word2) / ( P(word1) P(word2) ) ]
Positive Pointwise Mutual Information
◦ PMI ranges from −∞ to + ∞
◦ But the negative values are problematic
◦ Things are co-occurring less than we expect by chance
◦ So we just replace negative PMI values by 0
◦ Positive PMI (PPMI) between word1 and word2:
PPMI(word1, word2) = max( log2 [ P(word1, word2) / ( P(word1) P(word2) ) ], 0 )
Computing PPMI on a term-context
matrix
Matrix F with W rows (words) and C columns (contexts)
f_ij is the number of times word w_i occurs in context c_j
If one of the words (either the target word w or the context word c) is frequent while the other is rare, the PMI value is still skewed: PMI is biased towards rare events.
Weighting PMI: Giving rare
context words slightly higher
probability
Raise the context counts to the power α = 0.75 when computing the context probability:
P_α(c) = count(c)^α / Σ_c′ count(c′)^α
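A sketch of PPMI computed from a term-context count matrix F (rows = words, columns = contexts), with the optional α = 0.75 context smoothing just described (numpy; the small matrix is made up for illustration):

```python
import numpy as np

def ppmi(F, alpha=None):
    """Positive PMI from a word-by-context count matrix F."""
    total = F.sum()
    p_wc = F / total                                         # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)                    # P(w)
    if alpha is None:
        p_c = p_wc.sum(axis=0, keepdims=True)                # P(c)
    else:
        counts_c = F.sum(axis=0, keepdims=True)
        p_c = counts_c ** alpha / (counts_c ** alpha).sum()  # smoothed P_alpha(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0)                                # negative (and -inf) PMI becomes 0

F = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])
print(ppmi(F, alpha=0.75))
```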
Chapter 6:
Vector Semantics, Part II
Distributional similarity based
representations
You can get a lot of value by representing a word by means of its
neighbors
“You shall know a word by the company it keeps”
(J. R. Firth 1957: 11)
Two algorithms
1. Skip-grams (SG)
Predict context words given target (position independent)
2. Continuous Bag of Words (CBOW)
Predict target word from bag-of-words context
P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)
P(+|t,c)
P(−|t,c) = 1−P(+|t,c)
How to compute p(+|t,c)?
Intuition:
◦ Words are likely to appear near similar words
◦ Model similarity with dot-product!
◦ Similarity(t,c) ∝ t · c
Problem:
◦Dot product is not a probability!
◦ (Neither is cosine)
Turning dot product into a
probability
The sigmoid lies between 0 and 1: σ(x) = 1 / (1 + e^(−x)), so P(+|t,c) = σ(t · c)
Turning dot product into a
probability
For all the context words, assuming they are independent:
P(+ | t, c_1..k) = ∏_{i=1}^{k} σ(t · c_i)
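A minimal sketch of these probabilities, assuming we already have dense vectors for the target and context words (numpy; the vectors here are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(t, c):
    """P(+ | t, c): probability that c is a real context word of target t."""
    return sigmoid(np.dot(t, c))

def p_positive_window(t, contexts):
    """P(+ | t, c_1..k), treating the context words as independent."""
    return np.prod([p_positive(t, c) for c in contexts])

rng = np.random.default_rng(0)
t = rng.normal(size=50)                              # hypothetical 50-dimensional target embedding
contexts = [rng.normal(size=50) for _ in range(4)]
print(p_positive(t, contexts[0]), p_positive_window(t, contexts))
```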
Objective Criteria
We want to maximize…
HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
N = embedding dimensions, V = vocabulary size (number of unique words)
[Figure: the 1×V one-hot vector of word i, multiplied by W1 (V×N), gives the 1×N hidden vector; that, multiplied by W2 (N×V), gives the 1×V output scores.]
Word2Vec — skip-gram network
architecture
HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
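That forward pass, sketched in numpy with random placeholder weights (a softmax is added to turn the 1×V scores into a probability distribution over context words):

```python
import numpy as np

V, N = 10, 4                                # V = vocabulary size, N = embedding dimensions
rng = np.random.default_rng(1)
W1 = rng.normal(size=(V, N))                # input-to-hidden weights; row i is word i's embedding
W2 = rng.normal(size=(N, V))                # hidden-to-output weights

def forward(word_index):
    x = np.zeros(V)
    x[word_index] = 1.0                     # 1xV one-hot vector for word i
    h = x @ W1                              # 1xN hidden layer (simply row word_index of W1)
    u = h @ W2                              # 1xV output scores
    return np.exp(u) / np.exp(u).sum()      # softmax over the vocabulary

print(forward(3))                           # predicted context-word distribution for word 3
```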
Word2Vec Example
Word2Vec Example: a 2-layer neural network
HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
Read the following links for a worked Word2Vec example:
https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281
https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/?source=post_page-----13445eebd281
Word2vec is a family of algorithms
[Mikolov et al. 2013]
Two algorithms
1. Skip-grams (SG)
Predict context words given target (position independent)
2. Continuous Bag of Words (CBOW)
Predict target word from bag-of-words context
HTTPS://TOWARDSDATASCIENCE.COM/AN-IMPLEMENTATION-GUIDE-TO-WORD2VEC-USING-NUMPY-AND-GOOGLE-SHEETS-13445EEBD281
Skip-gram algorithm
1. Treat the target word and a neighboring
context word as positive examples.
2. Randomly sample other words in the
lexicon to get negative samples
3. Use a model to train a classifier to
distinguish those two cases
4. Use the weights as the embeddings
Skip-Gram Training Data with
Negative Sampling
Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
(target t = apricot; context words c1 = tablespoon, c2 = of, c3 = jam, c4 = a)
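A sketch of generating such positive and negative training pairs (window size 2 and k = 2 negative samples per positive pair; for simplicity negatives are drawn uniformly from the vocabulary rather than from the weighted unigram distribution word2vec actually uses):

```python
import random

def training_pairs(tokens, window=2, k=2, seed=0):
    """Yield (target, context, label) triples: label 1 = real context word, 0 = negative sample."""
    rng = random.Random(seed)
    vocab = list(set(tokens))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield (target, tokens[j], 1)                 # positive example from the context window
            for _ in range(k):
                yield (target, rng.choice(vocab), 0)     # negative example: a random word

tokens = "lemon a tablespoon of apricot jam a pinch".split()
for pair in training_pairs(tokens):
    print(pair)
```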