Lecture - 7 MSDS

The document discusses various string similarity measures, focusing on representing strings as vectors using methods like Bag-of-Words, TF-IDF, and character-level vectors. It explains how to calculate term frequencies, normalize them, and evaluate the importance of words in documents. Additionally, it covers different distance measures such as Euclidean, Manhattan, and Cosine similarity for comparing strings or documents.

String Similarity Measures

Representing strings with vectors of words or characters
• to be or not to be a bee is the question, said the queen bee

Multiset: {a, be, be, bee, bee, is, not, or, queen, question, said, the, the, to, to}

Word       Raw count   Relative frequency
a          1           0.07
be         2           0.13
bee        2           0.13
is         1           0.07
not        1           0.07
or         1           0.07
queen      1           0.07
question   1           0.07
said       1           0.07
the        2           0.13
to         2           0.13
Total      15          1.00

• Bag-of-words or unigrams
More context?
• to be or not to be a bee is the question, said the queen bee

Multiset of word pairs: {to be, be or, or not, not to, to be, be a, a bee, bee is, is the, the question, question said, said the, the queen, queen bee}

Word pair       Raw count   Relative frequency
a bee           1           0.07
be a            1           0.07
be or           1           0.07
bee is          1           0.07
is the          1           0.07
not to          1           0.07
or not          1           0.07
queen bee       1           0.07
question said   1           0.07
said the        1           0.07
the queen       1           0.07
the question    1           0.07
to be           2           0.14
Total           14          1.00

• Bigrams
More context?
• to be or not to be a bee is the question, said the
queen bee
• Unigrams
• Bigrams
• Trigrams
• 4-grams, and so on… (see the counting sketch below)
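A minimal sketch of how the unigram and bigram tables above could be produced in Python. The function name and the simple comma-stripping, whitespace tokenization are my own choices, not from the slides:

```python
from collections import Counter

def ngram_frequencies(text, n=1):
    # Tokenize on whitespace, dropping the comma so "question," and "question" match.
    tokens = text.lower().replace(",", "").split()
    # Build the n-grams: consecutive runs of n tokens joined by a space.
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    total = sum(counts.values())
    # Return raw counts and relative frequencies, as in the tables above.
    return {g: (c, round(c / total, 2)) for g, c in counts.items()}

sentence = "to be or not to be a bee is the question, said the queen bee"
print(ngram_frequencies(sentence, n=1))  # unigrams: 'be' -> (2, 0.13), 'a' -> (1, 0.07), ...
print(ngram_frequencies(sentence, n=2))  # bigrams: 'to be' -> (2, 0.14), ...
```

The same idea carries over to character-level vectors on the next slide: iterate over characters (or character pairs) instead of word tokens.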
Character-level vectors
• to be or not to be a bee is the question, said the
queen bee

Character frequencies

Character   Raw count   Relative frequency
a           2           0.04
b           4           0.09
d           1           0.02
e           11          0.24
h           2           0.04
i           3           0.07
n           3           0.07
o           5           0.11
q           2           0.04
r           1           0.02
s           3           0.07
t           6           0.13
u           2           0.04
Total       45          1.00

Character pair frequencies

Character pair   Raw count   Relative frequency
ab               1           0.02
ai               1           0.02
be               4           0.09
dt               1           0.02
ea               1           0.02
ee               3           0.07
ei               1           0.02
en               1           0.02
eo               1           0.02
eq               2           0.05
es               1           0.02
he               2           0.05
id               1           0.02
io               1           0.02
is               1           0.02
nb               1           0.02
no               1           0.02
ns               1           0.02
ob               2           0.05
on               1           0.02
or               1           0.02
ot               1           0.02
qu               2           0.05
rn               1           0.02
sa               1           0.02
st               2           0.05
th               2           0.05
ti               1           0.02
to               2           0.05
tt               1           0.02
ue               2           0.05
Total            44          1.00
Bag of Words
• A simplified representation
• Text (such as a sentence or a document) is
represented as the bag (multiset) of its words
– Disregarding grammar and even word order but keeping
multiplicity
• This representation only takes into account the frequency
of each word in the document, and not its position,
grammar, or context.
Bag of Words
• To create a bag of words representation, the text is first
preprocessed by removing stop words (common words like
"the" and "a"), punctuation, and any other irrelevant
information.

• Then, the remaining words are counted and their
frequencies are stored in a vector.

• the: 2, quick: 1, brown: 1, fox: 1, jumps: 1, over: 1, lazy: 1, dog: 1
• [2, 1, 1, 1, 1, 1, 1, 1]

• Each word may also be represented as a one-hot encoded
vector of size 1 x V, e.g. brown = [0 0 1 0 0 0 0 0]
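A minimal sketch of building the count vector and a one-hot vector for this sentence. Variable names are mine; the text is simply lowercased and split, and stop words are kept so the counts match the example above:

```python
from collections import Counter

sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.lower().split()        # lowercasing folds "The" and "the" together

counts = Counter(tokens)                 # bag of words: word -> frequency
vocab = sorted(counts)                   # fixing a word order defines the vector dimensions

bow_vector = [counts[w] for w in vocab]  # count vector over the vocabulary
one_hot_brown = [1 if w == "brown" else 0 for w in vocab]  # 1 x V one-hot vector for "brown"

print(vocab)          # ['brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the']
print(bow_vector)     # [1, 1, 1, 1, 1, 1, 1, 2] -- same counts as the slide, different word order
print(one_hot_brown)  # [1, 0, 0, 0, 0, 0, 0, 0]
```

The ordering of the dimensions is arbitrary; it only has to be the same for every document being compared.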
Bag of Words
• Used in document classification where the
(frequency of) occurrence of each word is used as a
feature for training a classifier
• But some words are common in general and do not tell us
much about a particular document (e.g. the, is, a)
• TF-IDF
Bag of Words
For example, consider the following sentence:

• "The quick brown fox jumps over the lazy dog."

• The bag of words representation of this sentence
would be a vector with the following entries:
{"The": 1, "quick": 1, "brown": 1, "fox": 1, "jumps": 1, "over": 1, "the": 1, "lazy": 1, "dog": 1}

• Note that the words "the" and "The" are treated
as different words here, as they have different
capitalization.
Measures to normalize term-frequencies
• Raw frequency: the number of times that
term t occurs in document d,
– tf(t,d) = f_{t,d}
– We need to remove bias towards long or short
documents
• Normalize by document length
• Relative term frequency, i.e. tf adjusted for
document length:
– tf(t,d) = f_{t,d} / (number of words in d)
Other measures to normalize term-frequencies
• Next, we need ways to remove bias towards more
frequently occurring words.
– A word appearing 100 times in a document does not make it
100 times more representative of the document

• Boolean "frequency": tf(t,d) = 1 if t occurs in d and 0
otherwise

• Logarithmically scaled frequency:
tf(t,d) = log(1 + f_{t,d})

Thus, terms which occur 10 times in a document have tf ≈ 1, 100 times in a
document tf ≈ 2, 1000 times tf ≈ 3, ... (using base-10 logarithms)
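A small sketch of these term-frequency variants. Function names are mine, and a base-10 logarithm is assumed so that the 10/100/1000 example above works out:

```python
import math

def tf_raw(count):
    # raw frequency: f_{t,d}
    return count

def tf_relative(count, doc_length):
    # relative frequency: adjusted for document length
    return count / doc_length

def tf_boolean(count):
    # Boolean "frequency": 1 if the term occurs at all, else 0
    return 1 if count > 0 else 0

def tf_log(count):
    # logarithmically scaled frequency (base 10 assumed)
    return math.log10(1 + count)

for c in (10, 100, 1000):
    print(c, round(tf_log(c), 2))   # 10 -> 1.04, 100 -> 2.0, 1000 -> 3.0 (approximately 1, 2, 3)
```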
TF-IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

• To evaluate the importance of a word in a document.

• It takes into account not only the frequency of the word in the
document, but also the frequency of the word in the corpus
(i.e., the collection of all documents).

• TF = (number of times the word appears in the document) /
(total number of words in the document)

• Term frequency is often normalized or transformed in some
way to reduce the impact of common terms and increase the
weight of rare terms.

• TF-IDF = TF * IDF
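A minimal sketch of these two formulas. Names are mine, and the IDF value is assumed to come from the corpus, as computed a couple of slides further down:

```python
def tf(term, doc_tokens):
    # relative term frequency: occurrences of the term / total words in the document
    return doc_tokens.count(term) / len(doc_tokens)

def tf_idf(term, doc_tokens, idf_value):
    # TF-IDF = TF * IDF
    return tf(term, doc_tokens) * idf_value

doc = "the quick brown fox jumps over the lazy dog".split()
print(tf("the", doc))             # 2/9 ≈ 0.222
print(tf_idf("the", doc, 0.05))   # a small, made-up IDF for a common word keeps the weight low
```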
TF-IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

• TF-IDF increases proportionally to the number of times a word
appears in the document,
• but it is offset by the frequency of the word in the whole corpus of
documents.
• This helps adjust for the fact that some words appear more frequently in
general.
• A word that appears frequently in a document but infrequently in
the corpus is likely to be more important to that document than a
word that appears frequently in both the document and the corpus
TF-IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

• Give a higher weight to words that occur only in a few documents

• Terms that are limited to a few documents are useful for
discriminating those documents from the rest of the collection; terms
that occur frequently across the entire collection aren't as helpful.
• Because of the large number of documents in many collections, this
measure is usually squashed with a log function.
• Inverse document frequency, idf(t, D), where D is the corpus, is the
logarithmically scaled inverse fraction of the documents that contain the word

For example, if we have a corpus of 100 documents, and the word "apple" appears in 20 of
those documents, the IDF for "apple" would be: idf = log(100 / 20) = log(5) = 1.609
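A small sketch that reproduces the "apple" example and then computes a full TF-IDF weight over a toy corpus. Function names and the corpus are mine; the natural logarithm is used here because log(5) = 1.609 in the example above:

```python
import math

def idf(total_docs, docs_containing_term):
    # inverse document frequency: log(total documents / documents containing the term)
    return math.log(total_docs / docs_containing_term)

print(round(idf(100, 20), 3))   # apple example: log(100/20) = log(5) = 1.609

# TF-IDF over a tiny made-up corpus
corpus = [
    "the apple is red".split(),
    "the sky is blue".split(),
    "the grass is green".split(),
]
doc = corpus[0]
n_docs_with_apple = sum("apple" in d for d in corpus)          # 1 document contains "apple"
tfidf_apple = (doc.count("apple") / len(doc)) * idf(len(corpus), n_docs_with_apple)
print(round(tfidf_apple, 3))    # (1/4) * log(3/1) ≈ 0.275
```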
TF-IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[Figure slides: a worked IDF example over a collection of plays; only these numbers are recoverable from the export]
Total plays: 37
IDF(Romeo) = log(37/1) = 1.57
Several ways to create vectors
• TF
– Levels: character, word, phrase. We have already seen
calculation variations of this (raw, normalized, Boolean,
smoothed, etc.)
• TF-IDF
• Word Embedding
Now that we have our vectors!
• How do we compare two strings or documents?
– Convert each into a vector
– Calculate the distance between the vectors
Similarity and distance
(https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf)

• A distance d(A, B) has the properties:


– it is small if objects A and B are close,
– it is large if they are far,
– it is (usually) 0 if they are the same, and
– it has value in [0, ∞].
• On the other hand, a similarity s(A, B) has the
properties:
– it is large if the objects A and B are close,
– it is small if they are far,
– it is (usually) 1 if they are the same, and
– it is in the range [0, 1].
• Often we can convert between the two as d(A, B) = 1 − s(A, B)
Several distance measures for vectors
• Euclidean distance
• Manhattan distance
• Chebyshev Distance
• Minkowski distance
• Cosine similarity
Euclidean Distance
• Good choice for numeric attributes
• When data is dense or continuous, this is a good proximity
measure
• The Pythagorean theorem gives this distance between two
points, p and q, each with an n-dimensional feature vector:

d(p, q) = d(q, p) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}
        = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

• Downside: sensitive to extreme deviations in a single
attribute (as it squares the differences)
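A minimal sketch of this formula in plain Python (no external libraries; names are mine):

```python
import math

def euclidean_distance(p, q):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
```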
Manhattan Distance
• The distance between two points is the sum of the absolute
differences of their Cartesian coordinates.
– It is the total sum of the differences between the x-coordinates
and y-coordinates.
• Also known as Manhattan length, rectilinear distance, L1
distance or L1 norm, city block distance, snake distance,
or taxi-cab metric

d(p, q) = d(q, p) = |p_1 - q_1| + |p_2 - q_2| + \cdots + |p_n - q_n|
        = \sum_{i=1}^{n} |p_i - q_i|
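The same sketch style for the L1 distance:

```python
def manhattan_distance(p, q):
    # sum of absolute coordinate differences
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(manhattan_distance([0, 0], [3, 4]))   # 7
```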
Minkowski Distance
• The Minkowski distance is a generalized metric form of
Euclidean distance and Manhattan distance
d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^a \right)^{1/a}
• a = 1 is the Manhattan distance
• a = 2 is the Euclidean distance
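A sketch showing that the two special cases above fall out of one function (names mine):

```python
def minkowski_distance(p, q, a):
    # generalized metric: a = 1 gives Manhattan, a = 2 gives Euclidean
    return sum(abs(pi - qi) ** a for pi, qi in zip(p, q)) ** (1 / a)

print(minkowski_distance([0, 0], [3, 4], a=1))   # 7.0  (Manhattan)
print(minkowski_distance([0, 0], [3, 4], a=2))   # 5.0  (Euclidean)
```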
Chebyshev Distance
Effect of Different Distance Measures in Result of Cluster Analysis, Sujan Dahal

• For Chebyshev distance, the distance between two vectors
is the greatest of their differences along any coordinate
dimension
• Used when two objects are to be defined as "different" if they
are different in any one dimension
• Also called chessboard distance, maximum metric, or L∞
metric

d(p, q) = \max_i |p_i - q_i|
Chebyshev Distance
Effect of Different Distance Measures in Result of Cluster Analysis, Sujan Dahal

• A = [70, 40]
• B = [330, 228]

• d(A, B) = max {|70 − 330|, |40 − 228|}
          = max {260, 188}
          = 260
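A sketch reproducing this example:

```python
def chebyshev_distance(p, q):
    # the largest absolute difference along any single dimension
    return max(abs(pi - qi) for pi, qi in zip(p, q))

print(chebyshev_distance([70, 40], [330, 228]))   # 260
```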
Angles between vectors
Cosine Similarity

• The raw dot product, however, has a problem as a similarity metric: it favors
long vectors.
• The simplest way to modify the dot product to normalize for vector length is to divide
the dot product by the lengths of each of the two vectors.
• This normalized dot product turns out to be the same as the cosine of the angle between
the two vectors.
Cosine Similarity
• The cosine value ranges from 1 for vectors pointing in the
same direction, through 0 for vectors that are orthogonal,
to -1 for vectors pointing in opposite directions.
• But raw frequency values are non-negative, so the cosine
for these vectors ranges from 0 to 1.
Cosine Similarity

cos(x, y) = (x · y) / (||x|| ||y||)

x = [3, 2, 0, 5], y = [1, 0, 0, 0]
x · y = 3×1 + 2×0 + 0×0 + 5×0 = 3
||x|| = sqrt(3^2 + 2^2 + 0^2 + 5^2) = sqrt(38) ≈ 6.16
||y|| = sqrt(1^2 + 0^2 + 0^2 + 0^2) = 1
cos(x, y) = 3 / (6.16 × 1) ≈ 0.49
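A sketch of the same computation in plain Python (names mine):

```python
import math

def cosine_similarity(x, y):
    # dot product divided by the product of the vector lengths
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

print(round(cosine_similarity([3, 2, 0, 5], [1, 0, 0, 0]), 2))   # 0.49
```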
All Distances
S1: to be or not to be a bee is the question, said the queen bee
S2: one needs to be strong in order to be a queen bee
Here f1 and f2 are the relative word frequencies of S1 and S2 over the combined vocabulary.

Words         # S1   # S2   f1       f2       (f1-f2)^2   |f1-f2|   f1*f2    f1^2     f2^2
a             1      1      0.0667   0.0833   0.0003      0.0167    0.0056   0.0044   0.0069
be            2      2      0.1333   0.1667   0.0011      0.0333    0.0222   0.0178   0.0278
bee           2      1      0.1333   0.0833   0.0025      0.0500    0.0111   0.0178   0.0069
in            0      1      0.0000   0.0833   0.0069      0.0833    0.0000   0.0000   0.0069
is            1      0      0.0667   0.0000   0.0044      0.0667    0.0000   0.0044   0.0000
needs         0      1      0.0000   0.0833   0.0069      0.0833    0.0000   0.0000   0.0069
not           1      0      0.0667   0.0000   0.0044      0.0667    0.0000   0.0044   0.0000
one           0      1      0.0000   0.0833   0.0069      0.0833    0.0000   0.0000   0.0069
or            1      0      0.0667   0.0000   0.0044      0.0667    0.0000   0.0044   0.0000
order         0      1      0.0000   0.0833   0.0069      0.0833    0.0000   0.0000   0.0069
queen         1      1      0.0667   0.0833   0.0003      0.0167    0.0056   0.0044   0.0069
question      1      0      0.0667   0.0000   0.0044      0.0667    0.0000   0.0044   0.0000
said          1      0      0.0667   0.0000   0.0044      0.0667    0.0000   0.0044   0.0000
strong        0      1      0.0000   0.0833   0.0069      0.0833    0.0000   0.0000   0.0069
the           2      0      0.1333   0.0000   0.0178      0.1333    0.0000   0.0178   0.0000
to            2      2      0.1333   0.1667   0.0011      0.0333    0.0222   0.0178   0.0278
Total         15     12     1        1

From the column totals:
• Euclidean distance  = sqrt(sum (f1-f2)^2) = 0.2828
• Manhattan distance  = sum |f1-f2|         = 1.0333
• Chebyshev distance  = max |f1-f2|         = 0.1333
• Dot product         = sum f1*f2           = 0.0667
• ||f1|| = sqrt(sum f1^2) = 0.3197, ||f2|| = sqrt(sum f2^2) = 0.3333
• Cosine similarity   = 0.0667 / (0.3197 × 0.3333) ≈ 0.6255
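A sketch that reproduces these numbers end to end. The tokenization and names are mine; punctuation is stripped so that "question," counts as "question":

```python
import math
from collections import Counter

s1 = "to be or not to be a bee is the question, said the queen bee"
s2 = "one needs to be strong in order to be a queen bee"

def rel_freq_vector(text, vocab):
    # relative word frequencies over a shared vocabulary
    tokens = text.lower().replace(",", "").split()
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

vocab = sorted(set(s1.replace(",", "").split()) | set(s2.split()))
f1 = rel_freq_vector(s1, vocab)
f2 = rel_freq_vector(s2, vocab)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
manhattan = sum(abs(a - b) for a, b in zip(f1, f2))
chebyshev = max(abs(a - b) for a, b in zip(f1, f2))
dot = sum(a * b for a, b in zip(f1, f2))
cosine = dot / (math.sqrt(sum(a * a for a in f1)) * math.sqrt(sum(b * b for b in f2)))

print(round(euclidean, 4), round(manhattan, 4), round(chebyshev, 4),
      round(dot, 4), round(cosine, 4))   # ≈ 0.2828 1.0333 0.1333 0.0667 0.6255
```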
