CS2209 Similarity Distances
Similarity and Dissimilarity Measures
• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity between two objects, x and y,
with respect to a single, simple attribute.

Attribute Type    Dissimilarity                           Similarity
Nominal           d = 0 if x = y, d = 1 if x ≠ y          s = 1 if x = y, s = 0 if x ≠ y
Ordinal           d = |x − y| / (n − 1)                   s = 1 − d
                  (values mapped to integers 0 to n − 1,
                  where n is the number of values)
Interval/Ratio    d = |x − y|                             s = −d, s = 1/(1 + d), or
                                                          s = 1 − (d − min_d)/(max_d − min_d)
Euclidean Distance
• Euclidean distance:
  $\text{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
  where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are the k-th attributes of data objects p and q.
Distance Matrix (Euclidean):
       p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0
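A minimal Python sketch (not part of the original slides) that reproduces this matrix from the points p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1) listed on the Minkowski slide:

import numpy as np

# The four 2-D points used throughout these slides
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])

# Pairwise Euclidean distances: dist(p, q) = sqrt(sum_k (p_k - q_k)^2)
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(dist_matrix, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]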
Normalization
Min-max normalization: to [new_min_A, new_max_A]
  $v' = \frac{v - \text{min}_A}{\text{max}_A - \text{min}_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
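A short Python sketch of this formula (the helper name min_max_normalize is assumed for illustration), reproducing the income example:

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example: $73,600 in [$12,000, $98,000] mapped to [0.0, 1.0]
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716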
Minkowski Distance: Examples
• The Minkowski distance is a generalization of Euclidean distance:
  $\text{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  – A common example of this for binary vectors is the Hamming distance, which is just the number of bits that are different between two binary vectors.
• r = 2. Euclidean (L2 norm) distance.
• r → ∞. Supremum (L_max or L∞ norm) distance.
  – This is the maximum difference between any single attribute of the two vectors; it appears as the L∞ matrix on the next slide.
• Do not confuse r with n; all these distances are defined for all numbers of dimensions.
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrices
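All three matrices can be reproduced with a small Python sketch (an illustration, not part of the slides; the helper name minkowski is assumed); r = 1, 2, and math.inf give the L1, L2, and L∞ matrices:

import math

points = [(0, 2), (2, 0), (3, 1), (5, 1)]

def minkowski(p, q, r):
    """Minkowski distance: (sum_k |p_k - q_k|^r)^(1/r); r = inf gives the max."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        return max(diffs)  # L-infinity (supremum) norm
    return sum(d ** r for d in diffs) ** (1 / r)

for r in (1, 2, math.inf):
    print(f"r = {r}:")
    for p in points:
        print([round(minkowski(p, q, r), 3) for q in points])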
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
  1. $d(x, y) \ge 0$ for all x and y, and $d(x, y) = 0$ only if $x = y$ (positive definiteness)
  2. $d(x, y) = d(y, x)$ for all x and y (symmetry)
  3. $d(x, z) \le d(x, y) + d(y, z)$ for all points x, y, and z (triangle inequality)
  where d(x, y) is the distance (dissimilarity) between points x and y.
• A distance that satisfies these properties is a metric.
Common Properties of a Similarity
• Similarities also have some well-known properties:
  1. $s(x, y) = 1$ (or maximum similarity) only if $x = y$
  2. $s(x, y) = s(y, x)$ for all x and y (symmetry)
  where s(x, y) is the similarity between points x and y.
Similarity Between Binary Vectors
• A common situation is that objects, x and y, have only binary attributes.
• Similarities are computed from the following four frequencies:
  f01 = the number of attributes where x is 0 and y is 1
  f10 = the number of attributes where x is 1 and y is 0
  f00 = the number of attributes where x is 0 and y is 0
  f11 = the number of attributes where x is 1 and y is 1
• Simple Matching Coefficient: SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
• Jaccard Coefficient: J = f11 / (f01 + f10 + f11)
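A minimal Python sketch of both coefficients (the helper name and the example vectors are assumed for illustration, not taken from the slides):

def binary_similarities(x, y):
    """Simple Matching Coefficient and Jaccard coefficient for 0/1 vectors."""
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

# Example vectors: mostly matching zeros, no matching ones
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(x, y))  # (0.7, 0.0)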
Cosine Similarity
• If d1 and d2 are two document vectors, then
  $\cos(d_1, d_2) = \frac{\langle d_1, d_2 \rangle}{\|d_1\| \, \|d_2\|}$
  where $\langle d_1, d_2 \rangle$ is the inner (dot) product of vectors d1 and d2, and $\|d\|$ is the length of vector d.
• The result of the cosine similarity ranges from -1 to 1.
  – Value 1: the vectors point in exactly the same direction
  – Value 0: the vectors are orthogonal (no similarity)
  – Value -1: the vectors point in exactly opposite directions
• Cosine similarity is often used in text analysis to determine the similarity between documents represented as vectors in a high-dimensional space, where each dimension corresponds to a specific term or word.
Cosine Similarity
• Example:
  d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
  <d1, d2> = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
  ||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
  ||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 = 2.449
  cos(d1, d2) = 5 / (6.481 × 2.449) = 0.315
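A short Python sketch (not from the slides) that reproduces this calculation:

import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))    # <d1, d2> = 5
norm1 = math.sqrt(sum(a * a for a in d1))   # ||d1|| = 6.481
norm2 = math.sqrt(sum(b * b for b in d2))   # ||d2|| = 2.449
print(round(dot / (norm1 * norm2), 4))      # 0.315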
Drawback of Correlation
• Consider two sample sets
  – x = (-3, -2, -1, 0, 1, 2, 3)
  – y = (9, 4, 1, 0, 1, 4, 9), i.e., $y_i = x_i^2$
• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74
• Corr(x, y) = 0, since $\sum_i (x_i - 0)(y_i - 4) = -15 + 0 + 3 + 0 - 3 + 0 + 15 = 0$, even though y is a perfect (deterministic) function of x.
• Correlation only measures linear relationships, so it can completely miss a nonlinear one.
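A minimal Python sketch (not from the slides) demonstrating that the correlation is exactly zero despite the perfect quadratic relationship:

import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2  # perfect nonlinear relationship: y_i = x_i^2

print(x.mean(), y.mean())            # 0.0 4.0
print(x.std(ddof=1), y.std(ddof=1))  # 2.160... 3.741... (sample std)
print(np.corrcoef(x, y)[0, 1])       # 0.0 -- correlation misses the relationship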
Correlation vs Cosine vs Euclidean Distance
• Compare the three proximity measures according to their behavior under variable transformation
  – scaling: multiplication by a value
  – translation: adding a constant
• Consider the example
  – x = (1, 2, 4, 3, 0, 0, 0)
  – y = (1, 2, 3, 4, 0, 0, 0)
  – ys = y * 2 (scaled version of y)
  – yt = y + 5 (translated version)

correlation: $\text{Corr}(A, B) = \frac{\sum_i (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_i (A_i - \bar{A})^2 \sum_i (B_i - \bar{B})^2}}$

euclidean_distance: $\text{ED}(A, B) = \sqrt{\sum_i (A_i - B_i)^2}$

cosine_similarity: $\text{CS}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$
Correlation vs Cosine vs Euclidean Distance
x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
ys = y * 2 (scaled version of y), yt = y + 5 (translated version)
Since the classical Euclidean distance weights each axis equally, it effectively
assumes that the variables constructing the space are independent and carry
unrelated, equally important information.
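A Python sketch (assumed for illustration, not part of the slides) comparing the three measures on x, y, ys, and yt; correlation is invariant to both scaling and translation, cosine only to scaling, and Euclidean distance to neither:

import numpy as np

x  = np.array([1, 2, 4, 3, 0, 0, 0])
y  = np.array([1, 2, 3, 4, 0, 0, 0])
ys = y * 2   # scaled version of y
yt = y + 5   # translated version of y

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def ed(a, b):
    return np.sqrt(((a - b) ** 2).sum())

def cs(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("      corr   cosine  euclidean")
for name, v in [("y ", y), ("ys", ys), ("yt", yt)]:
    print(name, round(corr(x, v), 3), round(cs(x, v), 3), round(ed(x, v), 3))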
Mahalanobis Distance
[Figure: two scatter plots, Q1 and Q2. In both, the Green and Blue points are equally far, in Euclidean terms, from the central Red point; but are they equally close with respect to the distribution?]
Mahalanobis Distance
• Mahalanobis distance is the distance between a point and a distribution
– Not a distance between two distinct points
– Effectively a multivariate equivalent of the Euclidean distance
– It was introduced by Prof. P. C. Mahalanobis in 1936 and has been used in various statistical
applications ever since
– Defined by
  $MD = \sqrt{(x - \mu)^T C^{-1} (x - \mu)}$
  where:
  - MD is the Mahalanobis distance,
  - x is the vector of the observation,
  - μ is the vector of mean values of the independent variables,
  - C⁻¹ is the inverse covariance matrix of the independent variables.
Mahalanobis Distance as Solution
Let the Red, Green and Blue points be given as follows:

R = (0, 5),  G = (-1, 7),  B = (1, 7)

C1 = [ 1     0    ]        C2 = [ 1     0.89 ]
     [ 0     1    ]             [ 0.89  1    ]

The covariance matrix is a square matrix where the diagonal elements represent the variances and the off-diagonal elements represent the covariances.
Mahalanobis Distance as Solution
By the Euclidian distance formula, we can see that the points are equally
distant from each other
However, considering the distribution, the result for the green point:
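A minimal Python sketch of this computation (the helper name mahalanobis is assumed; R plays the role of the distribution mean μ):

import numpy as np

def mahalanobis(x, mu, C):
    """MD = sqrt((x - mu)^T C^{-1} (x - mu))"""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(C) @ d))

R = [0, 5]    # center (red point, taken as the mean)
G = [-1, 7]   # green point
B = [1, 7]    # blue point
C2 = np.array([[1.0, 0.89], [0.89, 1.0]])

print(round(mahalanobis(G, R, C2), 2))  # 6.42
print(round(mahalanobis(B, R, C2), 2))  # 2.63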