
Introduction to Data Mining

Distances & Similarities

CPSC/AMTH 445a/545a

Guy Wolf
[email protected]

Yale University
Fall 2016

Outline
1 Distance metrics
Minkowski distances
Euclidean distance
Manhattan distance
Normalization & standardization
Mahalanobis distance
Hamming distance
2 Similarities and dissimilarities
Correlation
Gaussian affinities
Cosine similarities
Jaccard index
3 Dynamic time-warp
Comparing misaligned signals
Computing DTW dissimilarity
4 Combining similarities
Distance metrics
Metric spaces

Consider a dataset X as an arbitrary collection of data points.

Distance metric
A distance metric is a function d : X × X → [0, ∞) that satisfies
three conditions for any x, y, z ∈ X:
1 d(x, y) = 0 ⇔ x = y
2 d(x, y) = d(y, x)
3 d(x, y) ≤ d(x, z) + d(z, y)

The set X of data points together with an appropriate distance
metric d(·, ·) is called a metric space.
Distance metrics
Euclidean distance

When X ⊂ R^n we can consider Euclidean distances:

Euclidean distance
The distance between x, y ∈ X is defined by
‖x − y‖_2 = √(∑_{i=1}^n (x[i] − y[i])²)

One of the classic and most common distance metrics

Often inappropriate in realistic settings without proper
preprocessing & feature extraction
Also used for least mean square error optimizations
Proximity requires all attributes to have equally small differences
Distance metrics
Manhattan distances

Manhattan distance
The Manhattan distance between x, y ∈ X is defined by
‖x − y‖_1 = ∑_{i=1}^n |x[i] − y[i]|.
This distance is also called taxicab or cityblock distance.

[figure taken from Wikipedia]
Distance metrics
Minkowski (ℓ_p) distance

Minkowski distance
The Minkowski distance between x, y ∈ X ⊂ R^n is defined by
‖x − y‖_p^p = ∑_{i=1}^n |x[i] − y[i]|^p
for some p > 0. This is also called the ℓ_p distance.

Three popular Minkowski distances are:
p = 1: Manhattan distance: ‖x − y‖_1 = ∑_{i=1}^n |x[i] − y[i]|
p = 2: Euclidean distance: ‖x − y‖_2 = √(∑_{i=1}^n |x[i] − y[i]|²)
p = ∞: Supremum/ℓ_max distance: ‖x − y‖_∞ = sup_{1≤i≤n} |x[i] − y[i]|
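A minimal sketch of these three cases in Python with NumPy (the function name and example values are illustrative, not from the slides):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski (l_p) distance; p = np.inf gives the supremum distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

x, y = [0.0, 1.0, 2.0], [3.0, 1.0, 0.0]
print(minkowski(x, y, 1))       # Manhattan: 5.0
print(minkowski(x, y, 2))       # Euclidean: sqrt(13), about 3.606
print(minkowski(x, y, np.inf))  # supremum: 3.0
```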
Distance metrics
Normalization & standardization

Minkowski distances require normalization to deal with varying
magnitudes, scaling, distributions, or measurement units.

Min-max normalization
minmax(x)[i] = (x[i] − m_i) / r_i, where m_i and r_i are the min value
and range of attribute i.

Z-score standardization
zscore(x)[i] = (x[i] − μ_i) / σ_i, where μ_i and σ_i are the mean
and STD of attribute i.

Log attenuation
logatt(x)[i] = sgn(x[i]) log(|x[i]| + 1)
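A sketch of the three transforms in NumPy, applied per attribute (column); it assumes no attribute is constant, i.e. r_i ≠ 0 and σ_i ≠ 0:

```python
import numpy as np

def minmax_normalize(X):
    """Rescale each attribute (column) to [0, 1] by its min and range."""
    m = X.min(axis=0)
    r = X.max(axis=0) - m
    return (X - m) / r

def zscore_standardize(X):
    """Shift each attribute to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def log_attenuate(X):
    """Sign-preserving log attenuation of large magnitudes."""
    return np.sign(X) * np.log(np.abs(X) + 1.0)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])
print(minmax_normalize(X))    # both columns now span [0, 1]
print(zscore_standardize(X))  # both columns now have mean 0, std 1
```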
Distance metrics
Mahalanobis distance

Mahalanobis distance
The Mahalanobis distance is defined by
mahal(x, y) = √((x − y) Σ⁻¹ (x − y)^T)
where Σ is the covariance matrix of the data and data points are
represented as row vectors.

When all attributes are independent with unit standard deviation
(e.g., z-scored) then Σ = Id and we get the Euclidean distance.

When all attributes are independent with variances σ_i² then
Σ = diag(σ_1², . . . , σ_n²) and we get
mahal(x, y) = √(∑_{i=1}^n ((x[i] − y[i]) / σ_i)²),
which is the Euclidean distance between z-scored data points.

Example
For Σ = [0.3 0.2; 0.2 0.3] and the points x = (0, 1), y = (0.5, 0.5),
z = (1.5, 1.5), the squared Mahalanobis distances (x − y) Σ⁻¹ (x − y)^T
are d(x, y) = 5 and d(y, z) = 4.
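A short NumPy sketch (illustrative, not from the slides) that reproduces the example's values via the squared form:

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
Sigma_inv = np.linalg.inv(Sigma)

def mahal_sq(x, y):
    """Squared Mahalanobis distance (x - y) Sigma^{-1} (x - y)^T for row vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ Sigma_inv @ d)

x, y, z = (0.0, 1.0), (0.5, 0.5), (1.5, 1.5)
print(mahal_sq(x, y))  # 5.0
print(mahal_sq(y, z))  # 4.0
```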
Distance metrics
Hamming distance

When the data contains nominal values, we can use Hamming distances:

Hamming distance
The Hamming distance is defined as hamm(x, y) = ∑_{i=1}^n 1[x[i] ≠ y[i]]
for data points x, y that contain n nominal attributes.

This distance is equivalent to the ℓ_1 distance with binary flag
representation.

Example
If x = ('big', 'black', 'cat'), y = ('small', 'black', 'rat'), and
z = ('big', 'blue', 'bulldog') then hamm(x, y) = hamm(x, z) = 2 and
hamm(y, z) = 3.
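A one-function sketch checking the example (function name illustrative):

```python
def hamming(x, y):
    """Number of positions where two equal-length tuples disagree."""
    return sum(a != b for a, b in zip(x, y))

x = ('big', 'black', 'cat')
y = ('small', 'black', 'rat')
z = ('big', 'blue', 'bulldog')
print(hamming(x, y), hamming(x, z), hamming(y, z))  # 2 2 3
```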
Similarities and dissimilarities
Similarities / affinities

Similarities or affinities quantify whether, or how much, data points
are similar.

Similarity/affinity measure
We will consider a similarity or affinity measure as a function
a : X × X → [0, 1] such that for every x, y ∈ X:
a(x, x) = a(y, y) = 1
a(x, y) = a(y, x)

Dissimilarities quantify the opposite notion, and typically take values
in [0, ∞), although they are sometimes normalized to finite ranges.
Distances can serve as a way to measure dissimilarities.
Similarities and dissimilarities
Simple similarity measures

Similarities and dissimilarities
Correlation

Similarities and dissimilarities
Gaussian affinities

Given a distance metric d(x, y), we can use it to formulate Gaussian
affinities:

Gaussian affinities
Gaussian affinities are defined as
k(x, y) = exp(−d(x, y)² / 2ε)
given a distance metric d.

Essentially, data points are similar if they are within the same
spherical neighborhoods w.r.t. the distance metric, whose radius is
determined by ε.
For Euclidean distances they are also known as RBF (radial basis
function) affinities.
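A minimal sketch, assuming a Euclidean default for d (names and values illustrative):

```python
import numpy as np

def gaussian_affinity(x, y, eps, dist=None):
    """Gaussian (RBF) affinity k(x, y) = exp(-d(x, y)^2 / (2 * eps))."""
    if dist is None:
        dist = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    return float(np.exp(-dist(x, y) ** 2 / (2.0 * eps)))

print(gaussian_affinity([0, 0], [1, 1], eps=1.0))  # exp(-1), about 0.368
```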
Similarities and dissimilarities
Cosine similarities

Another similarity metric in Euclidean space is based on the inner
product (i.e., dot product) ⟨x, y⟩ = ‖x‖ ‖y‖ cos(∠xy)

Cosine similarities
The cosine similarity between x, y ∈ X ⊂ R^n is defined as
cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖)

[figure: vectors at various angles illustrating cosine similarity]
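A direct NumPy sketch of the definition (example values illustrative):

```python
import numpy as np

def cosine_similarity(x, y):
    """Inner product of x and y normalized by their Euclidean norms."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine_similarity([1, 0], [1, 1]))  # cos(45 degrees), about 0.707
```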
Similarities and dissimilarities
Jaccard index

For data with n binary attributes we consider two similarity metrics:

Simple matching coefficient
SMC(x, y) = (∑_{i=1}^n x[i] ∧ y[i] + ∑_{i=1}^n ¬x[i] ∧ ¬y[i]) / n

Jaccard coefficient
J(x, y) = (∑_{i=1}^n x[i] ∧ y[i]) / (∑_{i=1}^n x[i] ∨ y[i])

The Jaccard coefficient can be extended to continuous attributes:

Tanimoto (extended Jaccard) coefficient
T(x, y) = ⟨x, y⟩ / (‖x‖² + ‖y‖² − ⟨x, y⟩)
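A sketch of all three coefficients (function names and example vectors illustrative):

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient: fraction of attributes that agree (1-1 or 0-0)."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return ((x & y).sum() + (~x & ~y).sum()) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over positions where either is 1."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return (x & y).sum() / (x | y).sum()

def tanimoto(x, y):
    """Extended Jaccard coefficient for continuous attributes."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (x @ x + y @ y - x @ y))

x, y = [1, 0, 1, 1], [1, 0, 0, 1]
print(smc(x, y), jaccard(x, y))          # 0.75, 0.666...
print(tanimoto([1.0, 0.5], [0.5, 1.0]))  # 0.666...
```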
Dynamic time-warp
Comparing misaligned signals

Theoretically: use a time offset to align the signals.

Realistically: which offset should we use?
Dynamic time-warp
Adaptive alignment
Dynamic time-warp
Computing DTW dissimilarity

Pairwise difference matrix: with signal x along one axis and signal y
along the other, each cell holds the difference x[i] − y[j] between two
signal entries.

Alignment path: a path through this matrix that gets from the start to
the end of both signals.

1:1 alignment: trivial - nothing is modified by the alignment; the
aligned distance is ∑ = ‖x − y‖_2².

Time offset: works sometimes, but not always optimal (aligned
distance ∑ = ?).

Extreme offset: complete misalignment - the worst alignment
alternative; the aligned distance is ∑ = ‖x‖² + ‖y‖².

Optimal alignment: optimize the alignment by minimizing the aligned
distance: ∑ = min.
Dynamic time-warp
Dynamic programming algorithm

Dynamic Programming
A method for solving complex problems by breaking them down
into simpler subproblems.
Applicable to problems exhibiting the properties of overlapping
subproblems and optimal substructure.
Better performance than naive methods that do not utilize the
subproblem overlap.
Dynamic time-warp
Dynamic programming algorithm

DTW Algorithm:
For each signal-time i and for each signal-time j:
    Set cost ← (x[i] − y[j])²
    Set the optimal distance at stage [i, j] to:
    DTW[i, j] ← cost + min{ DTW[i−1, j], DTW[i, j−1], DTW[i−1, j−1] }

Optimal distance: DTW[m, n] (where m & n are the lengths of the signals).

Optimal alignment: backtrack the path leading to DTW[m, n] via the
min-cost choices of the algorithm.
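A self-contained Python sketch of this recurrence; the boundary handling and example signals are my own, not from the slides:

```python
import numpy as np

def dtw(x, y):
    """DTW dissimilarity between two 1-D signals via dynamic programming."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)  # D[i, j] = DTW over x[:i], y[:j]
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Optimal substructure: extend the cheapest neighboring alignment.
            D[i, j] = cost + min(D[i - 1, j],       # stretch y[j] over x[i] too
                                 D[i, j - 1],       # stretch x[i] over y[j] too
                                 D[i - 1, j - 1])   # advance both signals
    return D[m, n]

x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, delayed by one step
print(dtw(x, y))  # 0.0: the warp absorbs the misalignment
```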
Dynamic time-warp
Remark about earth-mover distances (EMD)

What is the cost of transforming one distribution to another?

EMD_p^p(x, y) = min_Ω { ∑_{i=1}^n ∑_{j=1}^n |i − j|^p Ω_ij :
                        ∑_{j=1}^n Ω_ij = x[i] ∧ ∑_{i=1}^n Ω_ij = y[j] }

where Ω is a moving strategy (transferring Ω_ij mass from i to j).

Can be solved with the Hungarian algorithm, but more efficient
methods exist that rely on wavelets and mathematical analysis.
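For one-dimensional histograms with p = 1 and equal total mass, the optimal moving strategy has a closed form: the EMD equals the ℓ_1 distance between cumulative sums. A minimal sketch of this special case (not from the slides):

```python
import numpy as np

def emd_1d(x, y):
    """EMD with p = 1 between two 1-D histograms of equal total mass.

    In one dimension the optimal moving strategy gives
    EMD = sum over bins of |CDF_x - CDF_y|.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.abs(np.cumsum(x) - np.cumsum(y)).sum())

print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2.0: one unit of mass moves two bins
```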
Combining similarities
To combine similarities of different attributes we can consider several
alternatives (a sketch of options 2 and 3 follows the list):

1 Transform all the attributes to conform to the same
similarity/distance metric

2 Use a weighted average to combine similarities,
a(x, y) = ∑_{i=1}^n w_i a_i(x, y), or distances,
d²(x, y) = ∑_{i=1}^n w_i d_i²(x, y), with ∑_{i=1}^n w_i = 1.

3 Consider asymmetric attributes by defining binary flags
δ_i(x, y) ∈ {0, 1} that mark whether two data points share
comparable information in affinity i, and then combine only
comparable information by
a(x, y) = ∑_{i=1}^n w_i δ_i(x, y) a_i(x, y) / ∑_{i=1}^n δ_i(x, y).
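A sketch of options 2 and 3 above (function name, arguments, and example values are illustrative):

```python
import numpy as np

def combined_affinity(affinities, weights, flags=None):
    """Weighted combination of per-attribute affinities.

    affinities: the a_i(x, y) values; weights: w_i summing to 1;
    flags: optional delta_i(x, y) in {0, 1} marking comparable attributes.
    """
    a = np.asarray(affinities, dtype=float)
    w = np.asarray(weights, dtype=float)
    if flags is None:
        return float(w @ a)              # option 2: plain weighted average
    d = np.asarray(flags, dtype=float)
    return float((w * d) @ a / d.sum())  # option 3: comparable attributes only

print(combined_affinity([0.9, 0.5, 0.1], [0.5, 0.3, 0.2]))             # 0.62
print(combined_affinity([0.9, 0.5, 0.1], [0.5, 0.3, 0.2], [1, 1, 0]))  # 0.3
```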
Summary

To compare data points we can either
1 quantify how similar they are with a similarity or affinity metric, or
2 quantify how different they are with a dissimilarity or a distance
metric.

There are many possible metrics (e.g., Euclidean, Mahalanobis, Hamming,
Gaussian, cosine, Jaccard), and the choice of which one to use depends
on both the task and the input data.

It is sometimes useful to consider several different metrics and then
combine them together. Alternatively, data preprocessing can be done
to transform all the data to conform with a single metric.
