Lecture - 7 MSDS
Similarity Measures
Representing strings with vectors of words or
characters
• to be or not to be a bee is the question, said the
queen bee
Word frequencies (raw counts and relative frequencies):

Words      Count   Relative Frequency
a          1       0.07
be         2       0.13
bee        2       0.13
is         1       0.07
not        1       0.07
or         1       0.07
queen      1       0.07
question   1       0.07
said       1       0.07
the        2       0.13
to         2       0.13
Total      15      1
• Bag-of-words or unigrams
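The unigram table above can be reproduced with a few lines of Python; this is a minimal sketch using the standard library, not part of the original slides.

```python
from collections import Counter

# Bag-of-words (unigram) counts and relative frequencies for the example
# sentence, with punctuation stripped and everything lowercased.
sentence = "to be or not to be a bee is the question, said the queen bee"
words = sentence.replace(",", "").lower().split()

counts = Counter(words)
total = sum(counts.values())
relative = {w: c / total for w, c in counts.items()}

print(counts["be"], total)       # 2 15
print(round(relative["be"], 2))  # 0.13
```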
More context?
• to be or not to be a bee is the question, said the
queen bee
Word pair frequencies (raw counts and relative frequencies):

Word pairs      Count   Relative Frequency
a bee           1       0.07
be a            1       0.07
be or           1       0.07
bee is          1       0.07
is the          1       0.07
not to          1       0.07
or not          1       0.07
queen bee       1       0.07
question said   1       0.07
said the        1       0.07
the queen       1       0.07
the question    1       0.07
to be           2       0.14
Total           14      1

• Bigrams
More context?
• to be or not to be a bee is the question, said the
queen bee
• Unigrams
• Bigrams
• Trigrams
• 4-grams, and so on…
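The n-gram levels above all come from the same sliding-window idea; here is a minimal sketch of it (not from the original slides).

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be a bee is the question said the queen bee".split()

print(len(ngrams(tokens, 1)))  # 15 unigrams
print(len(ngrams(tokens, 2)))  # 14 bigrams
print(len(ngrams(tokens, 3)))  # 13 trigrams
```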
Character-level vectors
• to be or not to be a bee is the question, said the
queen bee

Character frequencies (spaces and punctuation removed):

Characters   Count   Relative Frequency
a            2       0.04
b            4       0.09
d            1       0.02
e            11      0.24
h            2       0.04
i            3       0.07
n            3       0.07
o            5       0.11
q            2       0.04
r            1       0.02
s            3       0.07
t            6       0.13
u            2       0.04
Total        45      1

Character pair frequencies:

Character pairs   Count   Relative Frequency
ab                1       0.02
ai                1       0.02
be                4       0.09
dt                1       0.02
ea                1       0.02
ee                3       0.07
ei                1       0.02
en                1       0.02
eo                1       0.02
eq                2       0.05
es                1       0.02
he                2       0.05
id                1       0.02
io                1       0.02
is                1       0.02
nb                1       0.02
no                1       0.02
ns                1       0.02
ob                2       0.05
on                1       0.02
or                1       0.02
ot                1       0.02
qu                2       0.05
rn                1       0.02
sa                1       0.02
st                2       0.05
th                2       0.05
ti                1       0.02
to                2       0.05
tt                1       0.02
ue                2       0.05
Total             44      1.00
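The character-level tables can be verified the same way; this sketch (not from the slides) removes spaces and punctuation first, which is how the totals of 45 characters and 44 pairs arise.

```python
from collections import Counter

# Character and character-pair frequencies for the example sentence,
# with the comma and all spaces removed before counting.
sentence = "to be or not to be a bee is the question, said the queen bee"
chars = sentence.replace(",", "").replace(" ", "")

char_counts = Counter(chars)
pair_counts = Counter(chars[i:i + 2] for i in range(len(chars) - 1))

print(len(chars), char_counts["e"])  # 45 11
print(sum(pair_counts.values()))     # 44
```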
Bag of Words
• A simplified representation
• Text (such as a sentence or a document) is
represented as the bag (multiset) of its words
– Disregarding grammar and even word order but keeping
multiplicity
• This representation takes into account only the frequency of
each word in the document, not its position, grammar, or
context.
Bag of Words
• To create a bag of words representation, the text is first
preprocessed by removing stop words (common words like
"the" and "a"), punctuation, and any other irrelevant
information.
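The preprocessing step described above can be sketched as follows; the stop-word list here is a tiny illustrative sample of my own, not a standard list.

```python
import string

# Hypothetical, minimal stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "to", "or", "not", "of"}

def preprocess(text):
    # Lowercase, strip punctuation, then drop stop words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

print(preprocess("to be or not to be a bee is the question, said the queen bee"))
# ['be', 'be', 'bee', 'question', 'said', 'queen', 'bee']
```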
TF-IDF
• TF-IDF = TF * IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
For example, if we have a corpus of 100 documents, and the word "apple" appears in 20 of
those documents, the IDF for "apple" would be: idf = log(100 / 20) = log(5) = 1.609
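The IDF calculation above uses the natural logarithm; a minimal sketch of TF-IDF with those same numbers (the function names are mine, not from the slides):

```python
import math

def idf(n_docs, n_docs_with_term):
    # Natural log, matching the worked example: log(100/20) = log(5).
    return math.log(n_docs / n_docs_with_term)

def tf_idf(term_count, doc_length, n_docs, n_docs_with_term):
    tf = term_count / doc_length  # relative term frequency in one document
    return tf * idf(n_docs, n_docs_with_term)

print(round(idf(100, 20), 3))  # 1.609
```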
TF-IDF
Total plays: 37
IDF(Romeo) = log(37/1) = 1.57 (base-10 logarithm)
Several ways to create vectors
• TF
– Levels: character, word, phrase. We have already seen
calculation variations of this (raw, normalized, Boolean,
smoothed, etc.)
• TF-IDF
• Word Embedding
Now that we have our vectors!
• How do we compare two strings or documents?
– Convert each into a vector
d(p, q) = d(q, p) = √( Σᵢ₌₁ⁿ (pᵢ − qᵢ)² )   (Euclidean distance)

d(p, q) = d(q, p) = Σᵢ₌₁ⁿ |pᵢ − qᵢ|   (Manhattan distance)
Minkowski Distance
• The Minkowski distance is a generalized metric form of
Euclidean distance and Manhattan distance
d(p, q) = ( Σᵢ₌₁ⁿ |pᵢ − qᵢ|ᵃ )^(1/a)
• a = 1 is the Manhattan distance
• a = 2 is the Euclidean distance
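The generalized formula above translates directly to code; a minimal sketch (not from the slides):

```python
def minkowski(p, q, a):
    # (sum_i |p_i - q_i|^a)^(1/a); a=1 is Manhattan, a=2 is Euclidean.
    return sum(abs(pi - qi) ** a for pi, qi in zip(p, q)) ** (1 / a)

p, q = [0, 0], [3, 4]
print(minkowski(p, q, 1))  # 7.0 (Manhattan)
print(minkowski(p, q, 2))  # 5.0 (Euclidean)
```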
Chebyshev Distance
Effect of Different Distance Measures in Result of Cluster Analysis, Sujan Dahal
• A= [70, 40]
• B= [330, 228]
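For the two points A and B above, the Chebyshev distance is the largest coordinate difference (the limit of the Minkowski distance as a → ∞); a minimal sketch:

```python
def chebyshev(p, q):
    # Chebyshev distance: the maximum coordinate difference,
    # i.e. the Minkowski distance in the limit a -> infinity.
    return max(abs(pi - qi) for pi, qi in zip(p, q))

A = [70, 40]
B = [330, 228]
print(chebyshev(A, B))  # 260 = max(|70-330|, |40-228|)
```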
• This raw dot product, however, has a problem as a similarity metric: it favors
long vectors.
• The simplest way to modify the dot product to normalize for the vector length is to divide
the dot product by the lengths of each of the two vectors.
• This normalized dot product turns out to be the same as the cosine of the angle between
the two vectors.
Cosine Similarity
• The cosine value ranges from 1 for vectors pointing in the
same direction, through 0 for vectors that are orthogonal,
to -1 for vectors pointing in opposite directions.
• But raw frequency values are non-negative, so the cosine
for these vectors ranges from 0 to 1.
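Putting the normalization described above into code gives the usual cosine similarity; a minimal sketch (not from the slides):

```python
import math

def cosine_similarity(p, q):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)

print(round(cosine_similarity([1, 2], [2, 4]), 6))  # 1.0 (same direction)
print(round(cosine_similarity([1, 0], [0, 1]), 6))  # 0.0 (orthogonal)
```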