CS2209 Similarity Distances
Similarity and Dissimilarity Measures
• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity between two objects, x and y,
with respect to a single, simple attribute.

Attribute Type    Dissimilarity                           Similarity
Nominal           d = 0 if x = y, d = 1 if x ≠ y          s = 1 if x = y, s = 0 if x ≠ y
Ordinal           d = |x − y| / (n − 1)                   s = 1 − d
                  (values mapped to integers 0 to n − 1,
                  where n is the number of values)
Interval/Ratio    d = |x − y|                             s = −d, s = 1/(1 + d), or
                                                          s = 1 − (d − min_d)/(max_d − min_d)
Euclidean Distance
• Euclidean distance:
  $\text{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
  where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are the k-th attributes of data objects p and q.
Distance Matrix (Euclidean):
       p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0
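A minimal Python sketch (not part of the original slides) that reproduces this matrix from the points p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1) listed on the Minkowski slide:

import numpy as np

# The four 2-D points used throughout these slides
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])

# Pairwise Euclidean distances: dist(p, q) = sqrt(sum_k (p_k - q_k)^2)
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(dist_matrix, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]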
Normalization
Min-max normalization: to [new_min_A, new_max_A]
  $v' = \frac{v - \text{min}_A}{\text{max}_A - \text{min}_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
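A short Python sketch of this formula (the helper name min_max_normalize is assumed for illustration), reproducing the income example:

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example: $73,600 in [$12,000, $98,000] mapped to [0.0, 1.0]
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716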
Minkowski Distance: Examples
• The Minkowski distance is a generalization of Euclidean distance:
  $\text{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  – A common example of this for binary vectors is the Hamming distance, which is just the number of bits that are different between two binary vectors.
• r = 2. Euclidean (L2 norm) distance.
• r → ∞. Supremum (L_max or L∞ norm) distance.
  – This is the maximum difference between any single attribute of the two vectors; it appears as the L∞ matrix on the next slide.
• Do not confuse r with n; all these distances are defined for all numbers of dimensions.
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrices
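All three matrices can be reproduced with a small Python sketch (an illustration, not part of the slides; the helper name minkowski is assumed); r = 1, 2, and math.inf give the L1, L2, and L∞ matrices:

import math

points = [(0, 2), (2, 0), (3, 1), (5, 1)]

def minkowski(p, q, r):
    """Minkowski distance: (sum_k |p_k - q_k|^r)^(1/r); r = inf gives the max."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        return max(diffs)  # L-infinity (supremum) norm
    return sum(d ** r for d in diffs) ** (1 / r)

for r in (1, 2, math.inf):
    print(f"r = {r}:")
    for p in points:
        print([round(minkowski(p, q, r), 3) for q in points])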
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
  1. $d(x, y) \ge 0$ for all x and y, and $d(x, y) = 0$ only if $x = y$ (positive definiteness)
  2. $d(x, y) = d(y, x)$ for all x and y (symmetry)
  3. $d(x, z) \le d(x, y) + d(y, z)$ for all points x, y, and z (triangle inequality)
  where d(x, y) is the distance (dissimilarity) between points x and y.
• A distance that satisfies these properties is a metric.
Common Properties of a Similarity
• Similarities also have some well-known properties:
  1. $s(x, y) = 1$ (or maximum similarity) only if $x = y$
  2. $s(x, y) = s(y, x)$ for all x and y (symmetry)
  where s(x, y) is the similarity between points x and y.
Similarity Between Binary Vectors
• A common situation is that objects, x and y, have only binary attributes.
• Similarities are computed from the following four frequencies:
  f01 = the number of attributes where x is 0 and y is 1
  f10 = the number of attributes where x is 1 and y is 0
  f00 = the number of attributes where x is 0 and y is 0
  f11 = the number of attributes where x is 1 and y is 1
• Simple Matching Coefficient: SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
• Jaccard Coefficient: J = f11 / (f01 + f10 + f11)
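A minimal Python sketch of both coefficients (the helper name and the example vectors are assumed for illustration, not taken from the slides):

def binary_similarities(x, y):
    """Simple Matching Coefficient and Jaccard coefficient for 0/1 vectors."""
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

# Example vectors: mostly matching zeros, no matching ones
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(x, y))  # (0.7, 0.0)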
Cosine Similarity
• If d1 and d2 are two document vectors, then
  $\cos(d_1, d_2) = \frac{\langle d_1, d_2 \rangle}{\|d_1\| \, \|d_2\|}$
  where $\langle d_1, d_2 \rangle$ is the inner (dot) product of vectors d1 and d2, and $\|d\|$ is the length of vector d.
• The result of the cosine similarity ranges from -1 to 1.
  – Value 1: the vectors point in exactly the same direction
  – Value 0: the vectors are orthogonal (no similarity)
  – Value -1: the vectors point in exactly opposite directions
• Cosine similarity is often used in text analysis to determine the similarity between documents represented as vectors in a high-dimensional space, where each dimension corresponds to a specific term or word.
Cosine Similarity
• Example:
  d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
  <d1, d2> = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
  ||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
  ||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 = 2.449
  cos(d1, d2) = 5 / (6.481 × 2.449) = 0.315
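A short Python sketch (not from the slides) that reproduces this calculation:

import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))    # <d1, d2> = 5
norm1 = math.sqrt(sum(a * a for a in d1))   # ||d1|| = 6.481
norm2 = math.sqrt(sum(b * b for b in d2))   # ||d2|| = 2.449
print(round(dot / (norm1 * norm2), 4))      # 0.315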
Drawback of Correlation
• Consider two sample sets
  – x = (-3, -2, -1, 0, 1, 2, 3)
  – y = (9, 4, 1, 0, 1, 4, 9), i.e., $y_i = x_i^2$
• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74
• Corr(x, y) = 0, since $\sum_i (x_i - 0)(y_i - 4) = -15 + 0 + 3 + 0 - 3 + 0 + 15 = 0$, even though y is a perfect (deterministic) function of x.
• Correlation only measures linear relationships, so it can completely miss a nonlinear one.
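A minimal Python sketch (not from the slides) demonstrating that the correlation is exactly zero despite the perfect quadratic relationship:

import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2  # perfect nonlinear relationship: y_i = x_i^2

print(x.mean(), y.mean())            # 0.0 4.0
print(x.std(ddof=1), y.std(ddof=1))  # 2.160... 3.741... (sample std)
print(np.corrcoef(x, y)[0, 1])       # 0.0 -- correlation misses the relationship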
Correlation vs Cosine vs Euclidean Distance
• Compare the three proximity measures according to their behavior under variable transformation
  – scaling: multiplication by a value
  – translation: adding a constant
• Consider the example
  – x = (1, 2, 4, 3, 0, 0, 0)
  – y = (1, 2, 3, 4, 0, 0, 0)
  – ys = y * 2 (scaled version of y)
  – yt = y + 5 (translated version)

correlation: $\text{Corr}(A, B) = \frac{\sum_i (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_i (A_i - \bar{A})^2 \sum_i (B_i - \bar{B})^2}}$

euclidean_distance: $\text{ED}(A, B) = \sqrt{\sum_i (A_i - B_i)^2}$

cosine_similarity: $\text{CS}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$
Correlation vs Cosine vs Euclidean Distance
x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
ys = y * 2 (scaled version of y), yt = y + 5 (translated version)
Since the classical Euclidean distance weights each axis equally, it effectively
assumes that the variables constructing the space are independent and carry
unrelated, equally important information.
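A Python sketch (assumed for illustration, not part of the slides) comparing the three measures on x, y, ys, and yt; correlation is invariant to both scaling and translation, cosine only to scaling, and Euclidean distance to neither:

import numpy as np

x  = np.array([1, 2, 4, 3, 0, 0, 0])
y  = np.array([1, 2, 3, 4, 0, 0, 0])
ys = y * 2   # scaled version of y
yt = y + 5   # translated version of y

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def ed(a, b):
    return np.sqrt(((a - b) ** 2).sum())

def cs(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("      corr   cosine  euclidean")
for name, v in [("y ", y), ("ys", ys), ("yt", yt)]:
    print(name, round(corr(x, v), 3), round(cs(x, v), 3), round(ed(x, v), 3))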
Mahalanobis Distance
[Figure: two scatter plots, Q1 and Q2. In both, the Green and Blue points are equally far, in Euclidean terms, from the central Red point; but are they equally close with respect to the distribution?]
Mahalanobis Distance
• Mahalanobis distance is the distance between a point and a distribution
– Not a distance between two distinct points
– Effectively a multivariate equivalent of the Euclidean distance
– It was introduced by Prof. P. C. Mahalanobis in 1936 and has been used in various statistical
applications ever since
– Defined by
  $MD = \sqrt{(x - \mu)^T C^{-1} (x - \mu)}$
  where:
  - MD is the Mahalanobis distance,
  - x is the vector of the observation,
  - μ is the vector of mean values of the independent variables,
  - C⁻¹ is the inverse covariance matrix of the independent variables.
Mahalanobis Distance as Solution
Let the Red, Green and Blue points be given as follows:

R = (0, 5),  G = (-1, 7),  B = (1, 7)

C1 = [ 1     0    ]        C2 = [ 1     0.89 ]
     [ 0     1    ]             [ 0.89  1    ]

The covariance matrix is a square matrix where the diagonal elements represent the variances and the off-diagonal elements represent the covariances.
Mahalanobis Distance as Solution
By the Euclidian distance formula, we can see that the points are equally
distant from each other
However, considering the distribution, the result for the green point:
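A minimal Python sketch of this computation (the helper name mahalanobis is assumed; R plays the role of the distribution mean μ):

import numpy as np

def mahalanobis(x, mu, C):
    """MD = sqrt((x - mu)^T C^{-1} (x - mu))"""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(C) @ d))

R = [0, 5]    # center (red point, taken as the mean)
G = [-1, 7]   # green point
B = [1, 7]    # blue point
C2 = np.array([[1.0, 0.89], [0.89, 1.0]])

print(round(mahalanobis(G, R, C2), 2))  # 6.42
print(round(mahalanobis(B, R, C2), 2))  # 2.63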