
CS2209 Similarity Distances

The document discusses various measures of similarity and dissimilarity, including Euclidean and Minkowski distances, and their applications in data analysis. It also covers normalization techniques, properties of distance and similarity measures, and specific examples like Cosine and Mahalanobis distances. The content is aimed at understanding how to quantify the proximity between data objects in different contexts.

Uploaded by

Yatharth Gupta

Proximity, Similarity, Distances

CS2209

1
Similarity and Dissimilarity Measures
• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity

2
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Attribute Type   Dissimilarity                          Similarity
Nominal          d = 0 if x = y, d = 1 if x ≠ y         s = 1 if x = y, s = 0 if x ≠ y
Ordinal          d = |x − y| / (n − 1)                  s = 1 − d
                 (values mapped to integers 0 to n−1)
Interval/Ratio   d = |x − y|                            s = −d, s = 1/(1 + d), etc.

3
Euclidean Distance
• Euclidean Distance

  dist(p, q) = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )

  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Example
Cost Time Weight Incentive
Object A 0 3 4 5
Object B 7 6 3 -1
The Euclidean distance between points A and B is
d_AB = sqrt( (0 − 7)² + (3 − 6)² + (4 − 3)² + (5 − (−1))² )
     = sqrt( 49 + 9 + 1 + 36 ) = sqrt(95) ≈ 9.747
Standardization/normalization is necessary, if scales differ.
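The worked example above can be checked with a minimal Python sketch (not part of the slides; the function and variable names are illustrative):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Objects A and B from the slide (Cost, Time, Weight, Incentive)
A = [0, 3, 4, 5]
B = [7, 6, 3, -1]
d_AB = euclidean(A, B)  # sqrt(49 + 9 + 1 + 36) = sqrt(95) ≈ 9.747
```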
Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
5
Normalization
• Min-max normalization: to [new_min_A, new_max_A]

  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

  – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ: mean, σ: standard deviation):

  z = (v − μ_A) / σ_A

  – Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
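The three normalization schemes can be sketched in Python as follows (function names are illustrative, not from the slides):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mu) / sigma

def decimal_scale(v, j):
    """Decimal scaling: j is the smallest integer making all |v'| < 1."""
    return v / 10 ** j

min_max(73600, 12000, 98000)   # income example: ≈ 0.716
z_score(73600, 54000, 16000)   # z-score example: 1.225
```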
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance

  dist(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}

  where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

7
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this for binary vectors is the Hamming distance, which is
just the number of bits that are different between two binary vectors

• r = 2. Euclidean distance

• r  . Chebyshev (Lmax norm, L norm) distance.


– This is the maximum difference between any component of the vectors

• Do not confuse r with n, i.e., all these distances are defined for all numbers of
dimensions.
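The three special cases can be sketched in one Python function (a minimal illustration, not from the slides; r = ∞ is handled as the component-wise maximum):

```python
def minkowski(x, y, r):
    """Minkowski distance of order r; r = inf gives the Chebyshev distance."""
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

p1, p2 = (0, 2), (2, 0)          # points from the distance-matrix slide
minkowski(p1, p2, 1)             # L1 (city block): 4
minkowski(p1, p2, 2)             # L2 (Euclidean): ≈ 2.828
minkowski(p1, p2, float("inf"))  # L∞ (Chebyshev): 2
```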

8
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1   p1     p2     p3     p4
p1   0      4      4      6
p2   4      0      2      4
p3   4      2      0      2
p4   6      4      2      0

L2   p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

L∞   p1     p2     p3     p4
p1   0      2      3      5
p2   2      0      1      3
p3   3      1      0      2
p4   5      3      2      0

Distance Matrix
9
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well known properties.

1. d(x, y)  0 for all x and y and d(x, y) = 0 if and only if x = y.


2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z)  d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between points (data objects), x


and y.

• A distance that satisfies these properties is a metric

10
Common Properties of a Similarity
• Similarities, also have some well known properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y.


(does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects), x and y.

11
Similarity Between Binary Vectors
• Common situation is that objects, x and y, have only binary attributes

• Compute similarities using the following quantities


f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)

J = number of 11 matches / number of non-zero attributes


= (f11) / (f01 + f10 + f11)
12
SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
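The example can be reproduced with a short Python sketch (names are illustrative, not from the slides):

```python
def smc_jaccard(x, y):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jac = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 1.0
    return smc, jac

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, jac = smc_jaccard(x, y)  # smc = 0.7, jac = 0.0
```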

13
Cosine Similarity
•If d1 and d2 are two document vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1,d2> indicates inner product or vector dot product of vectors, d1 and d2, and
|| d || is the length of vector d.
• The result of the cosine similarity ranges from -1 to 1.
– Value 1: indicates that the vectors are identical
– Value 0: means that the vectors are orthogonal (not similar at all)
– Value -1: means the vectors point in exactly opposite directions (complete dissimilarity)
• The cosine similarity is often used in text analysis to determine the similarity
between documents represented as vectors in a high-dimensional space, where
each dimension corresponds to a specific term or word.

14
Cosine Similarity
•Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2 ) = 0.3150
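The document-vector example can be checked with a minimal Python sketch (not from the slides):

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
cosine(d1, d2)  # 5 / (6.481 * 2.449) ≈ 0.3150
```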

15
Drawback of Correlation
• Consider two sample sets
• x = (-3, -2, -1, 0, 1, 2, 3)
• y = (9, 4, 1, 0, 1, 4, 9)

• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74
• Note that y_i = x_i², i.e., y is completely determined by x.

• corr = [(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)] / (6 · 2.16 · 3.74) = 0

• Even though the relationship is perfect, the correlation is 0: correlation captures only linear relationships.

16
Correlation vs Cosine vs Euclidean Distance
• Compare the three proximity measures according to their behavior under variable transformation
  – scaling: multiplication by a value
  – translation: adding a constant
• Consider the example
  – x = (1, 2, 4, 3, 0, 0, 0)
  – y = (1, 2, 3, 4, 0, 0, 0)
  – ys = y * 2 (scaled version of y)
  – yt = y + 5 (translated version)

correlation:
  Corr(A, B) = Σ(A_i − Ā)(B_i − B̄) / sqrt( Σ(A_i − Ā)² · Σ(B_i − B̄)² )

euclidean_distance:
  ED(A, B) = sqrt( Σ(A_i − B_i)² )

cosine_similarity:
  CS(A, B) = A·B / (||A|| ||B||)
17
Correlation vs Cosine vs Euclidean Distance
x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
ys = y * 2 (scaled version of y), yt = y + 5 (translated version)

Measure             (x, y)   (x, ys)  (x, yt)
Cosine              0.9667   0.9667   0.7940
Correlation         0.9429   0.9429   0.9429
Euclidean Distance  1.4142   5.8310   14.2127

Property                               Cosine  Correlation  Euclidean Distance
Invariant to scaling (multiplication)  Yes     Yes          No
Invariant to translation (addition)    No      Yes          No
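The invariance properties in the table can be verified with a small Python sketch (an illustration, not part of the slides; the helper names are my own):

```python
import math

def corr(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den

def cosine(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b)))

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

x  = [1, 2, 4, 3, 0, 0, 0]
y  = [1, 2, 3, 4, 0, 0, 0]
ys = [2 * v for v in y]   # scaled
yt = [v + 5 for v in y]   # translated

# Cosine is unchanged by scaling but not by translation;
# correlation is unchanged by both; Euclidean distance by neither.
```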
18
Mahalanobis Distance
When attempting to compare a point with an entire distribution, certain
considerations and precautions become necessary.

Usually, to measure the distance between a distribution and a point, we would first reduce the distribution to a point by finding its mean. After that, we can measure the distance to that point in terms of standard deviations from the mean.

Since the classical Euclidean distance weights each axis equally, it effectively assumes that the variables constructing the space are independent and carry unrelated, equally important information.

19
Mahalanobis Distance

[Figure: two scatter plots, Q1 and Q2 — the Green and Blue points are equally far from the center (the Red point), but are they equally far from the distribution?]
20
Mahalanobis Distance
• Mahalanobis distance is the distance between a point and a distribution
– Not a distance between two distinct points
– Effectively a multivariate equivalent of the Euclidean distance
– It was introduced by Prof. P. C. Mahalanobis in 1936 and has been used in various statistical
applications ever since
– Defined by

  MD = sqrt( (x − μ)^T C⁻¹ (x − μ) )

  where
  - MD is the Mahalanobis distance,
  - x is the vector of the observation,
  - μ is the vector of mean values of the independent variables,
  - C⁻¹ is the inverse covariance matrix of the independent variables.
21
Mahalanobis Distance as Solution
Let Red, Green and Blue points are given as follows

R = (0, 5)    G = (−1, 7)    B = (1, 7)

The example data was generated using the mean vector μ = (0, 5)

Two covariance matrices are

C1 = [ 1     0  ]      C2 = [ 1     0.89 ]
     [ 0     1  ]           [ 0.89  1    ]

A covariance matrix is a square matrix where the diagonal elements represent the variance and the off-diagonal elements represent the covariance.
22
Mahalanobis Distance as Solution
By the Euclidean distance formula, we can see that the Green and Blue points are equally distant from the mean.

However, considering the distribution (covariance C2), the Mahalanobis distances differ: the Green point lies against the direction of correlation and is much farther from the distribution than the Blue point, which lies along it.
23