Measuring Distances: Applied Multivariate Statistics - Spring 2012

The document discusses different methods for measuring distances and dissimilarities between samples and variables. It explains that the appropriate distance measure depends on the data context and type, including whether the data is interval scaled, binary, nominal, ordinal or mixed. Common distance metrics like Euclidean, Manhattan and Maximum distance are special cases of the Minkowski distance. Scaling variables gives them equal weight, while not scaling favors variables with larger ranges. Dissimilarities are a more flexible generalization of distances. Different approaches are needed for different data types, such as using simple matching coefficients for binary symmetric data.

Measuring distances

Applied multivariate statistics – Spring 2012


Overview

• Distances between samples or variables?
• Scaling gives equal weight to all variables
• Dissimilarity is a generalization of distance
• Dissimilarity for different data types:
- interval scaled
- binary (symmetric / asymmetric)
- nominal
- ordinal
- mixed


Different perspectives on one thing
• The data context (e.g. biologist, doctor, …) determines the distance measure, not the statistician
• In practice: the statistician has to offer choices with pros and cons


Between samples or variables?

X1    X2    X3
2.5   3.4   1.6
4.3   5.3   5.3
6.3   9.4   8.9

• Between samples (rows): rest of this lecture
• Between variables (columns): use correlation

$d(X_i, X_j) = \frac{1 - \mathrm{Cor}(X_i, X_j)}{2}$
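The correlation-based distance between variables can be sketched in a few lines of Python. This is only an illustration: the helper names `correlation` and `correlation_distance` are made up for this example, and the three columns are taken from the small data matrix on the slide.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance(x, y):
    """d(X_i, X_j) = (1 - Cor(X_i, X_j)) / 2, which lies between 0 and 1."""
    return (1.0 - correlation(x, y)) / 2.0

# Columns (variables) of the data matrix on the slide.
X1 = [2.5, 4.3, 6.3]
X2 = [3.4, 5.3, 9.4]
X3 = [1.6, 5.3, 8.9]

d12 = correlation_distance(X1, X2)
d13 = correlation_distance(X1, X3)
```

Perfectly correlated variables get distance 0 and perfectly anti-correlated variables get distance 1, which is why the difference is divided by 2.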


Properties of distance measures

• D1: d(i,j) >= 0
• D2: d(i,i) = 0
• D3: d(i,j) = d(j,i)
• D4: d(i,j) <= d(i,h) + d(h,j) (triangle inequality)

[Figure: triangle with corners i, j, h and sides d(i,j), d(i,h), d(j,h)]


Examples

• Euclidean distance:
$d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2}$
• Manhattan distance:
$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
• Maximum distance:
$d(i,j) = \lim_{q \to \infty} \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q \right)^{1/q} = \max_{k=1,\dots,p} |x_{ik} - x_{jk}|$
• Special cases of Minkowski distance:
$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q \right)^{1/q}$
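A minimal Python sketch of the Minkowski family may make the special cases concrete: index q = 1 gives the Manhattan distance, q = 2 the Euclidean distance, and an infinite index the maximum distance. The function name `minkowski` and the two example points are assumptions made for illustration; this is not the R function “dist”.

```python
import math

def minkowski(x, y, q):
    """Minkowski distance of index q between two points x and y.

    q = 1 is Manhattan, q = 2 is Euclidean, q = math.inf is the maximum distance.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(q):
        return max(diffs)
    return sum(d ** q for d in diffs) ** (1.0 / q)

# Hypothetical points in two dimensions.
x = [0.0, 0.0]
y = [3.0, 4.0]

manhattan = minkowski(x, y, 1)          # 3 + 4 = 7
euclidean = minkowski(x, y, 2)          # sqrt(9 + 16) = 5
maximum = minkowski(x, y, math.inf)     # max(3, 4) = 4
```

For large finite q the value approaches the maximum distance, which is the limit stated in the formula.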


Intuition for Minkowski Distance

• p: Index of the Minkowski distance (written q in the Minkowski formula)
• Points on the line have equal Minkowski distance from the center
• R: Function “dist”

[Figure: lines of equal Minkowski distance from the center; panels: Manhattan distance, Euclidean distance, Maximum distance]


Distance metrics in practice

• Euclidean distance: By far the most common; our intuitive notion of distance
• Manhattan distance: Sometimes seen
• Rest: Very rare


To scale or not to scale…



Example 1: cm

• 4 persons

Person  Age [years]  Height [cm]
A       35           190
B       40           190
C       35           160
D       40           160

• Close: A and B, C and D (Euclidean distance 5, versus 30 between A and C)


Example 1: feet

• 4 persons

Person  Age [years]  Height [feet]
A       35           6.232
B       40           6.232
C       35           5.248
D       40           5.248

• Close: A and C, B and D (Euclidean distance 0.984, versus 5 between A and B)


Example 1: scaled

• 4 persons

Person  Age [scaled]  Height [scaled]
A       -0.87         0.87
B       0.87          0.87
C       -0.87         -0.87
D       0.87          -0.87

• No subgroups anymore
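The three versions of Example 1 can be reproduced with a short Python sketch. It assumes that “scaled” means subtracting the mean and dividing by the sample standard deviation, which is consistent with the ±0.87 values in the table; the helper names are made up for illustration.

```python
import math
import statistics

def euclidean(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]

# Persons A, B, C, D from Example 1.
age = [35, 40, 35, 40]
height_cm = [190, 190, 160, 160]
height_ft = [6.232, 6.232, 5.248, 5.248]

def distances_from_a(age_col, height_col):
    """Return the distances A-B and A-C for the given pair of columns."""
    points = list(zip(age_col, height_col))
    return euclidean(points[0], points[1]), euclidean(points[0], points[2])

ab_cm, ac_cm = distances_from_a(age, height_cm)   # height in cm dominates
ab_ft, ac_ft = distances_from_a(age, height_ft)   # age dominates
ab_sc, ac_sc = distances_from_a(standardize(age), standardize(height_cm))
```

In centimetres A is closer to B than to C, in feet it is the other way round, and after scaling both distances are equal, so the grouping no longer depends on the unit of measurement.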


Example 2

• 4 objects

Object  x1      x2
A       13.3    38.0
B       12.4    45.4
C       -122.7  45.6
D       -122.4  37.7

[Figure: two alternative groupings of the four objects, labelled “OR”]


Example 2

• 4 objects
• Need knowledge of context

Object         Long.   Lat.
Palermo        13.3    38.0
Venice         12.4    45.4
Portland       -122.7  45.6
San Francisco  -122.4  37.7

[Figure: two alternative groupings of the four cities, labelled “OR”]


To scale or not to scale…

• If variables are not scaled:
- the variable with the largest range has the most weight
- the distance depends on the scale
• Scaling gives every variable equal weight
• A similar alternative is re-weighting:
$d(i,j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_p (x_{ip} - x_{jp})^2}$
• Scale if
- the variables are measured in different units (kg, meter, sec, …)
- you explicitly want equal weight for each variable
• Don’t scale if the units are the same for all variables
• Most often: better to scale.

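As a rough illustration of re-weighting, the sketch below computes a weighted Euclidean distance and uses weights equal to one over the sample variance of each variable, which gives the same result as standardizing the variables first. The function name `weighted_euclidean` and the choice of weights are assumptions for this example; the data are the four persons of Example 1.

```python
import math
import statistics

def weighted_euclidean(x, y, weights):
    """sqrt(sum over f of w_f * (x_f - y_f)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

# Rows are persons A, B, C, D; columns are age [years] and height [cm].
rows = [(35, 190), (40, 190), (35, 160), (40, 160)]
columns = list(zip(*rows))

# One possible choice of weights: inverse sample variance of each variable.
weights = [1.0 / statistics.variance(col) for col in columns]

d_weighted = weighted_euclidean(rows[0], rows[3], weights)    # A versus D
d_unweighted = weighted_euclidean(rows[0], rows[3], [1.0, 1.0])
```

With all weights set to 1 the function reduces to the ordinary Euclidean distance, so the weights are a direct way to state how much each variable should count.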
Dissimilarities

• More flexible than distances; only D1 to D3 are required, not the triangle inequality D4:
D1: d(i,j) >= 0
D2: d(i,i) = 0
D3: d(i,j) = d(j,i)
• Example: How different do you think the topics Mathematics (M), Physics (P) and History (H) are, on a scale from 0 to 10 (very different)?

     M    P    H
M    10   1    8
P         10   5
H              10

• Could also work with “similarities” (e.g. 1 - dissimilarity)
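A small Python check shows why such ratings are dissimilarities rather than distances. It uses only the three pairwise ratings from the example (M and P: 1, M and H: 8, P and H: 5); the function names are made up for illustration.

```python
from itertools import permutations

# Pairwise ratings from the example: Mathematics, Physics, History.
ratings = {("M", "P"): 1, ("M", "H"): 8, ("P", "H"): 5}

def d(i, j):
    """Symmetric lookup of a rating, so D3 holds by construction."""
    return ratings[(i, j)] if (i, j) in ratings else ratings[(j, i)]

def triangle_violations(objects):
    """Return all triples (i, j, h) with d(i, j) > d(i, h) + d(h, j)."""
    return [
        (i, j, h)
        for i, j, h in permutations(objects, 3)
        if d(i, j) > d(i, h) + d(h, j)
    ]

violations = triangle_violations(["M", "P", "H"])
```

The rating for Mathematics versus History (8) exceeds the detour via Physics (1 + 5 = 6), so the triangle inequality fails while the ratings remain perfectly usable as dissimilarities.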


Dissimilarities for different data types

• Interval-scaled:
- continuous, positive or negative
- examples: height, weight, temperature, age, cost, ...
- difference of values has a fixed interpretation
- use the metrics we just discussed
• Ratio-scaled:
- continuous, positive
- example: concentration
- ratio of values has a fixed interpretation
- use a log-transformation, then the metrics we just discussed
• R:
- function “dist” in the base distribution (includes Minkowski)
- function “daisy” in package “cluster”
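The effect of the log-transformation for ratio-scaled data can be sketched as follows. The concentrations are hypothetical values chosen for illustration, and `log_distance` is a made-up helper name.

```python
import math

def log_distance(a, b):
    """Absolute difference after a log-transformation.

    The result depends only on the ratio a / b, not on the absolute level.
    """
    return abs(math.log(a) - math.log(b))

# Hypothetical concentrations: both pairs differ by a factor of 10.
low_pair = log_distance(1.0, 10.0)
high_pair = log_distance(100.0, 1000.0)

# Without the transformation the second pair would look 100 times further apart.
raw_low = abs(1.0 - 10.0)
raw_high = abs(100.0 - 1000.0)
```

After the transformation both pairs are equally far apart, which matches the idea that for ratio-scaled data the ratio of values carries the interpretation.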


Binary symmetric: Simple matching coefficient

• “Symmetric”: No clear asymmetry between group 0 and group 1
• Example: gender, right-handedness.
Two right-handed people are as similar as two left-handed people
• Counter-example: having AIDS, being a Nobel laureate.
Two Nobel laureates are more similar than two non-Nobel-laureates (e.g. a university professor at Harvard without a Nobel Prize and a baby from Sudan)


Binary symmetric: Simple matching coefficient

               Object j
               X=1   X=0
Object i  X=1  a     b
          X=0  c     d

a+b+c+d = Number of variables

Simple matching coefficient:

$d(i,j) = \frac{b + c}{a + b + c + d}$

Proportion of variables in which the two objects disagree


Binary asymmetric: Jaccard distance

               Object j
               X=1   X=0
Object i  X=1  a     b
          X=0  c     d   (uninformative)

a+b+c+d = Number of variables

Jaccard distance:

$d(i,j) = \frac{b + c}{a + b + c}$

Proportion of variables in which the two objects disagree, ignoring (0,0)
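Both binary dissimilarities can be computed from the counts a, b, c, d of the 2x2 table. The sketch below is an illustration only: the function names, the two example profiles, and the convention of returning 0 when both objects are all zeros are assumptions, not part of the slides.

```python
def contingency(x, y):
    """Counts a, b, c, d of the 2x2 table for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):
    """(b + c) / (a + b + c + d): share of variables that disagree."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)

def jaccard(x, y):
    """(b + c) / (a + b + c): like simple matching, but (0,0) pairs are ignored."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumed convention: two all-zero objects count as identical
    return (b + c) / (a + b + c)

# Hypothetical binary profiles of two objects on six variables.
obj_i = [1, 0, 0, 1, 0, 0]
obj_j = [1, 1, 0, 0, 0, 0]

smc = simple_matching(obj_i, obj_j)   # a=1, b=1, c=1, d=3, so 2/6
jac = jaccard(obj_i, obj_j)           # the three (0,0) pairs are dropped, so 2/3
```

The same pair of objects looks more dissimilar under the Jaccard distance because shared absences no longer count as agreement.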


Nominal: Simple matching coefficient

Simple matching coefficient:
- mm: Number of variables in which objects i and j mismatch
- p: Number of variables

$d(i,j) = \frac{mm}{p}$

Proportion of variables in which the two objects disagree
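For nominal variables the coefficient is just a mismatch count divided by the number of variables. A minimal sketch, with a made-up function name and hypothetical category values:

```python
def nominal_mismatch(x, y):
    """mm / p: share of variables on which the two objects take different categories."""
    mm = sum(1 for u, v in zip(x, y) if u != v)
    return mm / len(x)

# Hypothetical nominal profiles: eye colour, country, blood group, favourite sport.
obj_i = ["blue", "CH", "A", "tennis"]
obj_j = ["brown", "CH", "A", "football"]

d_ij = nominal_mismatch(obj_i, obj_j)   # 2 mismatches out of 4 variables
```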


Ordinal: Normalized ranks

• Rank the outcomes of variable f as 1, 2, …, $M_f$: $r_{if}$
• Normalize: $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• Treat $z_{if}$ as interval-scaled
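The normalization maps the ranks 1, …, M onto the interval from 0 to 1. A short sketch, assuming a hypothetical ordinal variable with five ordered levels (the level names and the helper `normalized_rank` are invented for illustration):

```python
def normalized_rank(rank, n_levels):
    """z = (r - 1) / (M - 1) for a rank r in 1..M; the result lies in [0, 1]."""
    return (rank - 1) / (n_levels - 1)

# Hypothetical ordinal variable with M = 5 ordered levels.
levels = ["very low", "low", "medium", "high", "very high"]
rank_of = {level: r for r, level in enumerate(levels, start=1)}
z = {level: normalized_rank(rank_of[level], len(levels)) for level in levels}

# The normalized ranks are then treated like interval-scaled values.
d_low_high = abs(z["low"] - z["high"])
```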


Mixed: Gower Distance

• Idea: Use a distance measure between 0 and 1 for each variable: $d_{ij}^{(f)}$
• Aggregate: $d(i,j) = \frac{1}{p} \sum_{f=1}^{p} d_{ij}^{(f)}$
• Binary (a/s), nominal: Use the methods discussed before
• Interval-scaled: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{R_f}$
$x_{if}$: Value for object i in variable f
$R_f$: Range of variable f for all objects
• Ordinal: Use normalized ranks; then treat like interval-scaled, based on the range
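The aggregation can be sketched in Python as an average of per-variable dissimilarities. This is a simplified illustration and not the R function “daisy”: the function name `gower`, the parameters `kinds` and `ranges`, and the example objects are assumptions. Ordinal values are expected to be normalized ranks already (so their range is 1), and asymmetric binary variables are left out of the sketch.

```python
def gower(x, y, kinds, ranges):
    """Average of per-variable dissimilarities, each between 0 and 1.

    kinds[f] is "interval" or "nominal"; ranges[f] is R_f for interval
    variables and is ignored for nominal ones.
    """
    total = 0.0
    for f, kind in enumerate(kinds):
        if kind == "interval":
            total += abs(x[f] - y[f]) / ranges[f]
        elif kind == "nominal":
            total += 0.0 if x[f] == y[f] else 1.0
        else:
            raise ValueError(f"unknown variable type: {kind}")
    return total / len(kinds)

# Hypothetical objects: age [years], eye colour, education as a normalized rank.
kinds = ["interval", "nominal", "interval"]
ranges = [40.0, None, 1.0]   # assumed range of age over all objects: 40 years
obj_i = (30, "blue", 0.5)
obj_j = (50, "brown", 1.0)

d_ij = gower(obj_i, obj_j, kinds, ranges)   # (20/40 + 1 + 0.5) / 3
```

Because every per-variable term lies between 0 and 1, no variable type can dominate the average.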


Concepts to know

• Effect of scaling / no scaling
• Distance measures for
- interval scaled
- binary (s/a)
- nominal
- ordinal
- mixed data


R functions to know

• dist, daisy
