
Introduction to Data Mining

Distances & Similarities

CPSC/AMTH 445a/545a

Guy Wolf
[email protected]

Yale University
Fall 2016

Outline
1 Distance metrics
Minkowski distances
Euclidean distance
Manhattan distance
Normalization & standardization
Mahalanobis distance
Hamming distance
2 Similarities and dissimilarities
Correlation
Gaussian affinities
Cosine similarities
Jaccard index
3 Dynamic time-warp
Comparing misaligned signals
Computing DTW dissimilarity
4 Combining similarities
Distance metrics
Metric spaces

Consider a dataset X as an arbitrary collection of data points.

Distance metric
A distance metric is a function d : X × X → [0, ∞) that satisfies
three conditions for any x, y, z ∈ X:
1 d(x, y) = 0 ⇔ x = y
2 d(x, y) = d(y, x)
3 d(x, y) ≤ d(x, z) + d(z, y)

The set X of data points together with an appropriate distance
metric d(·, ·) is called a metric space.
Distance metrics
Euclidean distance

When X ⊂ R^n we can consider Euclidean distances:

Euclidean distance
The distance between x, y ∈ X is defined by
‖x − y‖_2 = √(∑_{i=1}^n (x[i] − y[i])²)

One of the classic and most common distance metrics

Often inappropriate in realistic settings without proper
preprocessing & feature extraction
Also used for least mean square error optimizations
Proximity requires all attributes to have equally small differences
Distance metrics
Manhattan distances

Manhattan distance
The Manhattan distance between x, y ∈ X is defined by
‖x − y‖_1 = ∑_{i=1}^n |x[i] − y[i]|.
This distance is also called taxicab or cityblock distance.

[figure taken from Wikipedia]
Distance metrics
Minkowski (ℓ_p) distance

Minkowski distance
The Minkowski distance between x, y ∈ X ⊂ R^n is defined by
‖x − y‖_p^p = ∑_{i=1}^n |x[i] − y[i]|^p
for some p > 0. This is also called the ℓ_p distance.

Three popular Minkowski distances are:
p = 1: Manhattan distance: ‖x − y‖_1 = ∑_{i=1}^n |x[i] − y[i]|
p = 2: Euclidean distance: ‖x − y‖_2 = √(∑_{i=1}^n |x[i] − y[i]|²)
p = ∞: Supremum/ℓ_max distance: ‖x − y‖_∞ = sup_{1≤i≤n} |x[i] − y[i]|
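A minimal sketch of these three cases in Python with NumPy (the function name and example values are illustrative, not from the slides):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski (l_p) distance; p = np.inf gives the supremum distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

x, y = [0.0, 1.0, 2.0], [3.0, 1.0, 0.0]
print(minkowski(x, y, 1))       # Manhattan: 5.0
print(minkowski(x, y, 2))       # Euclidean: sqrt(13), about 3.606
print(minkowski(x, y, np.inf))  # supremum: 3.0
```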
Distance metrics
Normalization & standardization

Minkowski distances require normalization to deal with varying
magnitudes, scaling, distributions, or measurement units.

Min-max normalization
minmax(x)[i] = (x[i] − m_i) / r_i, where m_i and r_i are the min value
and range of attribute i.

Z-score standardization
zscore(x)[i] = (x[i] − μ_i) / σ_i, where μ_i and σ_i are the mean
and STD of attribute i.

Log attenuation
logatt(x)[i] = sgn(x[i]) log(|x[i]| + 1)
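A sketch of the three transforms in NumPy, applied per attribute (column); it assumes no attribute is constant, i.e. r_i ≠ 0 and σ_i ≠ 0:

```python
import numpy as np

def minmax_normalize(X):
    """Rescale each attribute (column) to [0, 1] by its min and range."""
    m = X.min(axis=0)
    r = X.max(axis=0) - m
    return (X - m) / r

def zscore_standardize(X):
    """Shift each attribute to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def log_attenuate(X):
    """Sign-preserving log attenuation of large magnitudes."""
    return np.sign(X) * np.log(np.abs(X) + 1.0)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])
print(minmax_normalize(X))    # both columns now span [0, 1]
print(zscore_standardize(X))  # both columns now have mean 0, std 1
```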
Distance metrics
Mahalanobis distance

Mahalanobis distance
The Mahalanobis distance is defined by
mahal(x, y) = √((x − y) Σ⁻¹ (x − y)^T)
where Σ is the covariance matrix of the data and data points are
represented as row vectors.

When all attributes are independent with unit standard deviation
(e.g., z-scored) then Σ = Id and we get the Euclidean distance.

When all attributes are independent with variances σ_i² then
Σ = diag(σ_1², . . . , σ_n²) and we get
mahal(x, y) = √(∑_{i=1}^n ((x[i] − y[i]) / σ_i)²),
which is the Euclidean distance between z-scored data points.

Example
For Σ = [0.3 0.2; 0.2 0.3] and the points x = (0, 1), y = (0.5, 0.5),
z = (1.5, 1.5), the squared Mahalanobis distances (x − y) Σ⁻¹ (x − y)^T
are d(x, y) = 5 and d(y, z) = 4.
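A short NumPy sketch (illustrative, not from the slides) that reproduces the example's values via the squared form:

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
Sigma_inv = np.linalg.inv(Sigma)

def mahal_sq(x, y):
    """Squared Mahalanobis distance (x - y) Sigma^{-1} (x - y)^T for row vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ Sigma_inv @ d)

x, y, z = (0.0, 1.0), (0.5, 0.5), (1.5, 1.5)
print(mahal_sq(x, y))  # 5.0
print(mahal_sq(y, z))  # 4.0
```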
Distance metrics
Hamming distance

When the data contains nominal values, we can use Hamming distances:

Hamming distance
The Hamming distance is defined as hamm(x, y) = ∑_{i=1}^n 1[x[i] ≠ y[i]]
for data points x, y that contain n nominal attributes.

This distance is equivalent to the ℓ_1 distance with binary flag
representation.

Example
If x = ('big', 'black', 'cat'), y = ('small', 'black', 'rat'), and
z = ('big', 'blue', 'bulldog') then hamm(x, y) = hamm(x, z) = 2 and
hamm(y, z) = 3.
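A one-function sketch checking the example (function name illustrative):

```python
def hamming(x, y):
    """Number of positions where two equal-length tuples disagree."""
    return sum(a != b for a, b in zip(x, y))

x = ('big', 'black', 'cat')
y = ('small', 'black', 'rat')
z = ('big', 'blue', 'bulldog')
print(hamming(x, y), hamming(x, z), hamming(y, z))  # 2 2 3
```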
Similarities and dissimilarities
Similarities / affinities

Similarities or affinities quantify whether, or how much, data points
are similar.

Similarity/affinity measure
We will consider a similarity or affinity measure as a function
a : X × X → [0, 1] such that for every x, y ∈ X:
a(x, x) = a(y, y) = 1
a(x, y) = a(y, x)

Dissimilarities quantify the opposite notion, and typically take values
in [0, ∞), although they are sometimes normalized to finite ranges.
Distances can serve as a way to measure dissimilarities.
Similarities and dissimilarities
Simple similarity measures

Similarities and dissimilarities
Correlation

Similarities and dissimilarities
Gaussian affinities

Given a distance metric d(x, y), we can use it to formulate Gaussian
affinities:

Gaussian affinities
Gaussian affinities are defined as
k(x, y) = exp(−d(x, y)² / 2ε)
given a distance metric d.

Essentially, data points are similar if they are within the same
spherical neighborhoods w.r.t. the distance metric, whose radius is
determined by ε.
For Euclidean distances they are also known as RBF (radial basis
function) affinities.
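A minimal sketch, assuming a Euclidean default for d (names and values illustrative):

```python
import numpy as np

def gaussian_affinity(x, y, eps, dist=None):
    """Gaussian (RBF) affinity k(x, y) = exp(-d(x, y)^2 / (2 * eps))."""
    if dist is None:
        dist = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    return float(np.exp(-dist(x, y) ** 2 / (2.0 * eps)))

print(gaussian_affinity([0, 0], [1, 1], eps=1.0))  # exp(-1), about 0.368
```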
Similarities and dissimilarities
Cosine similarities

Another similarity metric in Euclidean space is based on the inner
product (i.e., dot product) ⟨x, y⟩ = ‖x‖ ‖y‖ cos(∠xy)

Cosine similarities
The cosine similarity between x, y ∈ X ⊂ R^n is defined as
cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖)

[figure: vectors at various angles illustrating cosine similarity]
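A direct NumPy sketch of the definition (example values illustrative):

```python
import numpy as np

def cosine_similarity(x, y):
    """Inner product of x and y normalized by their Euclidean norms."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine_similarity([1, 0], [1, 1]))  # cos(45 degrees), about 0.707
```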
Similarities and dissimilarities
Jaccard index

For data with n binary attributes we consider two similarity metrics:

Simple matching coefficient
SMC(x, y) = (∑_{i=1}^n x[i] ∧ y[i] + ∑_{i=1}^n ¬x[i] ∧ ¬y[i]) / n

Jaccard coefficient
J(x, y) = (∑_{i=1}^n x[i] ∧ y[i]) / (∑_{i=1}^n x[i] ∨ y[i])

The Jaccard coefficient can be extended to continuous attributes:

Tanimoto (extended Jaccard) coefficient
T(x, y) = ⟨x, y⟩ / (‖x‖² + ‖y‖² − ⟨x, y⟩)
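A sketch of all three coefficients (function names and example vectors illustrative):

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient: fraction of attributes that agree (1-1 or 0-0)."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return ((x & y).sum() + (~x & ~y).sum()) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over positions where either is 1."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return (x & y).sum() / (x | y).sum()

def tanimoto(x, y):
    """Extended Jaccard coefficient for continuous attributes."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (x @ x + y @ y - x @ y))

x, y = [1, 0, 1, 1], [1, 0, 0, 1]
print(smc(x, y), jaccard(x, y))          # 0.75, 0.666...
print(tanimoto([1.0, 0.5], [0.5, 1.0]))  # 0.666...
```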
Dynamic time-warp
Comparing misaligned signals

Theoretically: use a time offset to align the signals.

Realistically: which offset should we use?
Dynamic time-warp
Adaptive alignment
Dynamic time-warp
Computing DTW dissimilarity

Pairwise difference matrix: with signal x along one axis and signal y
along the other, each cell holds the difference x[i] − y[j] between two
signal entries.

Alignment path: a path through this matrix that gets from the start to
the end of both signals.

1:1 alignment: trivial - nothing is modified by the alignment; the
aligned distance is ∑ = ‖x − y‖_2².

Time offset: works sometimes, but not always optimal (aligned
distance ∑ = ?).

Extreme offset: complete misalignment - the worst alignment
alternative; the aligned distance is ∑ = ‖x‖² + ‖y‖².

Optimal alignment: optimize the alignment by minimizing the aligned
distance: ∑ = min.
Dynamic time-warp
Dynamic programming algorithm

Dynamic Programming
A method for solving complex problems by breaking them down
into simpler subproblems.
Applicable to problems exhibiting the properties of overlapping
subproblems and optimal substructure.
Better performance than naive methods that do not utilize the
subproblem overlap.
Dynamic time-warp
Dynamic programming algorithm

DTW Algorithm:
For each signal-time i and for each signal-time j:
    Set cost ← (x[i] − y[j])²
    Set the optimal distance at stage [i, j] to:
    DTW[i, j] ← cost + min{ DTW[i−1, j], DTW[i, j−1], DTW[i−1, j−1] }

Optimal distance: DTW[m, n] (where m & n are the lengths of the signals).

Optimal alignment: backtrack the path leading to DTW[m, n] via the
min-cost choices of the algorithm.
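A self-contained Python sketch of this recurrence; the boundary handling and example signals are my own, not from the slides:

```python
import numpy as np

def dtw(x, y):
    """DTW dissimilarity between two 1-D signals via dynamic programming."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)  # D[i, j] = DTW over x[:i], y[:j]
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Optimal substructure: extend the cheapest neighboring alignment.
            D[i, j] = cost + min(D[i - 1, j],       # stretch y[j] over x[i] too
                                 D[i, j - 1],       # stretch x[i] over y[j] too
                                 D[i - 1, j - 1])   # advance both signals
    return D[m, n]

x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, delayed by one step
print(dtw(x, y))  # 0.0: the warp absorbs the misalignment
```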
Dynamic time-warp
Remark about earth-mover distances (EMD)

What is the cost of transforming one distribution to another?

EMD_p^p(x, y) = min_Ω { ∑_{i=1}^n ∑_{j=1}^n |i − j|^p Ω_ij :
                        ∑_{j=1}^n Ω_ij = x[i] ∧ ∑_{i=1}^n Ω_ij = y[j] }

where Ω is a moving strategy (transferring Ω_ij mass from i to j).

Can be solved with the Hungarian algorithm, but more efficient
methods exist that rely on wavelets and mathematical analysis.
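For one-dimensional histograms with p = 1 and equal total mass, the optimal moving strategy has a closed form: the EMD equals the ℓ_1 distance between cumulative sums. A minimal sketch of this special case (not from the slides):

```python
import numpy as np

def emd_1d(x, y):
    """EMD with p = 1 between two 1-D histograms of equal total mass.

    In one dimension the optimal moving strategy gives
    EMD = sum over bins of |CDF_x - CDF_y|.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.abs(np.cumsum(x) - np.cumsum(y)).sum())

print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2.0: one unit of mass moves two bins
```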
Combining similarities
To combine similarities of different attributes we can consider several
alternatives (a sketch of options 2 and 3 follows the list):

1 Transform all the attributes to conform to the same
similarity/distance metric

2 Use a weighted average to combine similarities,
a(x, y) = ∑_{i=1}^n w_i a_i(x, y), or distances,
d²(x, y) = ∑_{i=1}^n w_i d_i²(x, y), with ∑_{i=1}^n w_i = 1.

3 Consider asymmetric attributes by defining binary flags
δ_i(x, y) ∈ {0, 1} that mark whether two data points share
comparable information in affinity i, and then combine only
comparable information by
a(x, y) = ∑_{i=1}^n w_i δ_i(x, y) a_i(x, y) / ∑_{i=1}^n δ_i(x, y).
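A sketch of options 2 and 3 above (function name, arguments, and example values are illustrative):

```python
import numpy as np

def combined_affinity(affinities, weights, flags=None):
    """Weighted combination of per-attribute affinities.

    affinities: the a_i(x, y) values; weights: w_i summing to 1;
    flags: optional delta_i(x, y) in {0, 1} marking comparable attributes.
    """
    a = np.asarray(affinities, dtype=float)
    w = np.asarray(weights, dtype=float)
    if flags is None:
        return float(w @ a)              # option 2: plain weighted average
    d = np.asarray(flags, dtype=float)
    return float((w * d) @ a / d.sum())  # option 3: comparable attributes only

print(combined_affinity([0.9, 0.5, 0.1], [0.5, 0.3, 0.2]))             # 0.62
print(combined_affinity([0.9, 0.5, 0.1], [0.5, 0.3, 0.2], [1, 1, 0]))  # 0.3
```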
Summary

To compare data points we can either
1 quantify how similar they are with a similarity or affinity metric, or
2 quantify how different they are with a dissimilarity or a distance
metric.

There are many possible metrics (e.g., Euclidean, Mahalanobis, Hamming,
Gaussian, cosine, Jaccard), and the choice of which one to use depends
on both the task and the input data.

It is sometimes useful to consider several different metrics and then
combine them together. Alternatively, data preprocessing can be done
to transform all the data to conform with a single metric.
