Distances & Similarities
CPSC/AMTH 445a/545a
Guy Wolf
[email protected]
Yale University
Fall 2016
CPSC 445 (Guy Wolf) Distances & Similarities Yale - Fall 2016 1 / 22
Outline
1 Distance metrics
Minkowski distances
Euclidean distance
Manhattan distance
Normalization & standardization
Mahalanobis distance
Hamming distance
2 Similarities and dissimilarities
Correlation
Gaussian affinities
Cosine similarities
Jaccard index
3 Dynamic time-warp
Comparing misaligned signals
Computing DTW dissimilarity
4 Combining similarities
Distance metrics
Metric spaces
Distance metric
A distance metric is a function d : X × X → [0, ∞) that satisfies
three conditions for any x, y, z ∈ X:
1 d(x, y) = 0 ⇔ x = y
2 d(x, y) = d(y, x)
3 d(x, y) ≤ d(x, z) + d(z, y)
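The three conditions can be spot-checked numerically on a finite sample of points. The sketch below is only illustrative: the helper name `metric_violations`, the tolerance, and the sample points are assumptions, and passing the check on a sample does not prove that a function is a metric.

```python
from itertools import product

def metric_violations(d, points, tol=1e-12):
    """Return the metric-axiom violations of d found on the sample points."""
    violations = []
    for x, y in product(points, repeat=2):
        # Condition 1: d(x, y) = 0 exactly when x = y.
        if (d(x, y) <= tol) != (x == y):
            violations.append(("identity", x, y))
        # Condition 2: symmetry.
        if abs(d(x, y) - d(y, x)) > tol:
            violations.append(("symmetry", x, y))
    for x, y, z in product(points, repeat=3):
        # Condition 3: triangle inequality.
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violations.append(("triangle", x, y, z))
    return violations

def abs_diff(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def sq_diff(x, y):
    """Sum of squared coordinate differences (not a metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
ok = metric_violations(abs_diff, points)
bad = metric_violations(sq_diff, points)
```

On these points `abs_diff` passes all three checks, while `sq_diff` fails the triangle inequality for the collinear points (0, 0), (1, 0), (2, 0).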
Distance metrics
Euclidean distance
Distance metrics
Manhattan distances
Manhattan distance
The Manhattan distance between x, y ∈ X is defined by
\|x - y\|_1 = \sum_{i=1}^{n} |x[i] - y[i]|
This distance is also called taxicab or cityblock distance.
Distance metrics
Minkowski (ℓ_p) distance
Minkowski distance
The Minkowski distance between x, y ∈ X ⊂ R^n is defined by
\|x - y\|_p^p = \sum_{i=1}^{n} |x[i] - y[i]|^p
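As a minimal sketch of the definition, the function below sums the p-th powers of the coordinate differences and takes the p-th root. The function name `minkowski` and the restriction to p ≥ 1 are assumptions made for this illustration. Setting p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.

```python
def minkowski(x, y, p):
    """Minkowski (l_p) distance between two equal-length sequences, for p >= 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    # The sum below is ||x - y||_p^p; the p-th root gives the distance itself.
    total = sum(abs(a - b) ** p for a, b in zip(x, y))
    return total ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x, y, 1)   # p = 1: |3| + |4|
euclidean = minkowski(x, y, 2)   # p = 2: sqrt(3^2 + 4^2)
```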
Z-score standardization
zscore(x)[i] = \frac{x[i] - \mu_i}{\sigma_i}, where μ_i and σ_i are the mean and STD of
attribute i.
Log attenuation
logatt(x)[i] = sgn(x[i]) log(|x[i]| + 1)
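Both transformations can be sketched with NumPy, treating each row of a 2-D array as a data point and each column as an attribute. The use of the population STD and the handling of constant attributes are assumptions of this sketch, not something the slides specify.

```python
import numpy as np

def zscore(data):
    """Standardize each column (attribute) to zero mean and unit STD."""
    data = np.asarray(data, dtype=float)
    mu = data.mean(axis=0)
    sigma = data.std(axis=0)  # population STD; use ddof=1 for the sample STD
    # Assumption: constant attributes (sigma = 0) are mapped to zero.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (data - mu) / sigma

def logatt(data):
    """Sign-preserving log attenuation: sgn(x) * log(|x| + 1)."""
    data = np.asarray(data, dtype=float)
    return np.sign(data) * np.log(np.abs(data) + 1.0)

# Two attributes on very different scales become comparable after z-scoring.
data = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
standardized = zscore(data)
attenuated = logatt(np.array([-99.0, 0.0, 99.0]))
```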
Distance metrics
Mahalanobis distance
Mahalanobis distances
The Mahalanobis distance is defined by
mahal(x, y) = \sqrt{(x - y) \Sigma^{-1} (x - y)^T}
where Σ is the covariance matrix of the data and data points are
represented as row vectors.
Distance metrics
Mahalanobis distance
\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}
x = (0, 1)
y = (0.5, 0.5)
z = (1.5, 1.5)
d(x, y) = \sqrt{5}
d(y, z) = \sqrt{4} = 2
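The example can be checked numerically. The sketch below assumes NumPy and solves a linear system rather than forming the explicit inverse, which is an implementation choice of this illustration. For these points the quadratic form (x − y)Σ⁻¹(x − y)ᵀ evaluates to 5 for the pair (x, y) and to 4 for the pair (y, z).

```python
import numpy as np

def mahal(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ w = diff, so that diff @ w = diff @ inv(cov) @ diff.
    w = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ w))

cov = np.array([[0.3, 0.2], [0.2, 0.3]])
x, y, z = (0.0, 1.0), (0.5, 0.5), (1.5, 1.5)
d_xy = mahal(x, y, cov)
d_yz = mahal(y, z, cov)
```

Although x and y are closer than y and z in the Euclidean sense, y is closer to z under this Mahalanobis distance, because the displacement from y to z follows the direction in which the data varies most.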
Distance metrics
Hamming distance
Example
If x = (‘big’, ‘black’, ‘cat’), y = (‘small’, ‘black’, ‘rat’), and
z = (‘big’, ‘blue’, ‘bulldog’) then hamm(x, y) = hamm(x, z) = 2 and
hamm(y, z) = 3.
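The example can be reproduced with a short sketch, assuming the usual definition of the Hamming distance as the number of positions at which two equal-length tuples differ. The function name `hamm` follows the notation of the example.

```python
def hamm(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(a != b for a, b in zip(x, y))

x = ("big", "black", "cat")
y = ("small", "black", "rat")
z = ("big", "blue", "bulldog")
d_xy, d_xz, d_yz = hamm(x, y), hamm(x, z), hamm(y, z)
```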
Similarities and dissimilarities
Similarities / affinities
Similarities and dissimilarities
Simple similarity measures
Similarities and dissimilarities
Correlation
Similarities and dissimilarities
Gaussian affinities
Essentially, data points are similar if they are within the same
spherical neighborhoods w.r.t. the distance metric, whose radius is
determined by ε.
For Euclidean distances they are also known as RBF (radial basis
function) affinities.
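A minimal sketch of Gaussian affinities with Euclidean distances follows. It assumes the common form k(x, y) = exp(−‖x − y‖² / ε); the exact placement and scaling of ε is an assumption of this sketch. Points inside the same ε-scale neighborhood get an affinity close to 1, and distant points get an affinity close to 0.

```python
import numpy as np

def gaussian_affinities(data, eps):
    """Pairwise affinities exp(-||x - y||^2 / eps) between the rows of data."""
    data = np.asarray(data, dtype=float)
    # Broadcasting yields all pairwise difference vectors, shape (n, n, dim).
    diff = data[:, None, :] - data[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dists / eps)

data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
K = gaussian_affinities(data, eps=1.0)
```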
Similarities and dissimilarities
Cosine similarities
Similarities and dissimilarities
Jaccard index
Jaccard coefficient
J(x, y) = \frac{\sum_{i=1}^{n} x[i] \wedge y[i]}{\sum_{i=1}^{n} x[i] \vee y[i]}
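For binary vectors, the numerator counts the attributes that are set in both vectors and the denominator counts the attributes set in at least one of them. A minimal sketch follows; returning 1.0 for two all-zero vectors is an assumption made here to avoid dividing by zero.

```python
def jaccard(x, y):
    """Jaccard coefficient of two equal-length binary sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)     # sum of x[i] AND y[i]
    either = sum(1 for a, b in zip(x, y) if a or b)    # sum of x[i] OR y[i]
    if either == 0:
        return 1.0  # assumption: two empty sets are treated as identical
    return both / either

x = [1, 1, 0, 0]
y = [1, 0, 1, 0]
j_xy = jaccard(x, y)   # one shared attribute out of three set overall
```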
Dynamic time-warp
Comparing misaligned signals
Theoretically: use a time offset to align the signals.
Realistically: which offset to use?
Dynamic time-warp
Adaptive alignment
Dynamic time-warp
Computing DTW dissimilarity
Pairwise difference matrix (signal x along the horizontal axis, signal y along the vertical axis):
each cell [i, j] holds the difference x[i] − y[j].
Alignment path: get from start to end of both signals.
1:1 alignment: trivial, nothing is modified by the alignment.
Aligned distance: \sum_{\text{path}} (\text{cell})^2 = \|x - y\|_2^2
Time offset: works sometimes, but not always optimal.
Aligned distance: \sum_{\text{path}} (\text{cell})^2 = ?
Extreme offset: complete misalignment, the worst alignment alternative.
Aligned distance: \sum_{\text{path}} (\text{cell})^2 = \|x\|_2^2 + \|y\|_2^2
Optimal alignment: optimize the alignment by minimizing the aligned distance.
Aligned distance: \sum_{\text{path}} (\text{cell})^2 = \min
Dynamic time-warp
Dynamic programming algorithm
Dynamic Programming
A method for solving complex problems by breaking them down
into simpler subproblems.
Applicable to problems exhibiting the properties of overlapping
subproblems and optimal substructure.
Better performance than naive methods that do not exploit the
subproblem overlap.
DTW Algorithm:
For each signal-time i and for each signal-time j:
Set cost ← (x[i] − y[j])^2
Set the optimal distance at stage [i, j] to:
DTW[i, j] ← cost + min{DTW[i, j−1], DTW[i−1, j−1], DTW[i−1, j]}
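The recurrence above can be sketched as a dynamic-programming table. The boundary handling (an extra row and column of infinities with a zero at the origin) is a common convention and an assumption here, since the slides do not state the initialization.

```python
import math

def dtw(x, y):
    """DTW dissimilarity between two signals using squared differences as cost."""
    n, m = len(x), len(y)
    # table[i][j] is the optimal aligned distance between x[:i] and y[:j].
    # Assumption: infinite border with table[0][0] = 0, which forces every
    # alignment path to start at the beginning of both signals.
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            table[i][j] = cost + min(table[i][j - 1],      # repeat x[i]
                                     table[i - 1][j - 1],  # advance both
                                     table[i - 1][j])      # repeat y[j]
    return table[n][m]

# A delayed copy of a signal aligns perfectly, unlike a 1:1 comparison.
x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 1, 0]
d = dtw(x, y)
```

Each table entry is computed once from three already-computed neighbors, so the running time is proportional to the product of the two signal lengths.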
Dynamic time-warp
Remark about earth-mover distances (EMD)
EMD_p^p(x, y) = \min \left\{ \sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^p \, \Omega_{ij} : \sum_{j=1}^{n} \Omega_{ij} = x[i] \wedge \sum_{i=1}^{n} \Omega_{ij} = y[j] \right\}
where Ω is a moving strategy (transferring Ωij mass from i to j).
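In general the minimization is a linear program over Ω. The sketch below does not solve that program. It swaps in a well-known shortcut that applies only to p = 1 with nonnegative x and y of equal total mass: the optimal cost equals the sum of the absolute running differences between the two cumulative sums. The function name `emd1` and the tolerance are assumptions of this illustration.

```python
from itertools import accumulate

def emd1(x, y, tol=1e-9):
    """Earth-mover distance for p = 1 between two equal-mass 1-D histograms."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if abs(sum(x) - sum(y)) > tol:
        raise ValueError("x and y must have the same total mass")
    # After bin k, the running surplus is the mass that must cross from
    # bin k to bin k + 1; each unit crossing one boundary costs 1.
    carried = accumulate(a - b for a, b in zip(x, y))
    return sum(abs(c) for c in carried)

far = emd1([1, 0, 0], [0, 0, 1])               # all mass moves two bins
near = emd1([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])  # half the mass moves two bins
```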
Combining similarities
To combine similarities of different attributes we can consider several
alternatives:
Summary