Chapter 2: Data
Euclidean Distance

$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Euclidean Distance
(Figure: the four points p1, p2, p3, p4 plotted in the plane.)
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
Distance Matrix
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Minkowski Distance
Minkowski distance is a generalization of Euclidean distance:

$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$

where r is a parameter, n is the number of dimensions (attributes), and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, $L_1$ norm) distance.
A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.

r = 2. Euclidean ($L_2$) distance.

r → ∞. Supremum ($L_{max}$ norm, $L_\infty$ norm) distance.
This is the maximum difference between any component of the vectors.
Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
Minkowski Distance
Distance Matrix
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
L∞ p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
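The matrices above can be checked with a small sketch in plain Python (no dependencies); `minkowski` and `chebyshev` are illustrative helper names, not part of any library:

```python
# Minkowski distance sketch, reproducing entries of the L1, L2, and
# L-infinity matrices above for the points p1..p4.

def minkowski(p, q, r):
    """Minkowski distance: (sum_k |p_k - q_k|^r)^(1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

def chebyshev(p, q):
    """Limit r -> infinity: max_k |p_k - q_k| (the supremum / L_max norm)."""
    return max(abs(a - b) for a, b in zip(p, q))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# City block, Euclidean, and supremum distances between p1 and p2:
print(minkowski(points["p1"], points["p2"], 1))            # 4.0
print(round(minkowski(points["p1"], points["p2"], 2), 3))  # 2.828
print(chebyshev(points["p1"], points["p2"]))               # 2
```

Note that the supremum distance is computed directly as a maximum rather than by taking r to a large value.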
Mahalanobis Distance
$\mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^T$

where $\Sigma$ is the covariance matrix of the input data X:

$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$

For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6.
Mahalanobis Distance
Covariance matrix:

$\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$

(Figure: points A, B, C plotted in the plane.)
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
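The two values above can be verified with a short sketch; the 2×2 inverse is computed by hand here, whereas a real implementation would use `numpy.linalg.inv`:

```python
# Mahalanobis distance sketch for the 2-D example above (plain Python).

def mahalanobis_sq(p, q, cov):
    """(p - q) Sigma^{-1} (p - q)^T for 2-D points, as defined above."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # inverse of the 2x2 Sigma
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

cov = ((0.3, 0.2), (0.2, 0.3))
A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)

print(round(mahalanobis_sq(A, B, cov), 6))  # 5.0
print(round(mahalanobis_sq(A, C, cov), 6))  # 4.0
```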
Common Properties of a Distance
Distances, such as the Euclidean distance,
have some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
where d(p, q) is the distance (dissimilarity) between
points (data objects), p and q.
A distance that satisfies these properties is a metric.
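The three properties can be checked empirically for the Euclidean distance over the p1..p4 points from the earlier slides:

```python
# Empirical check of the metric properties on Euclidean distance,
# using the points p1..p4 from the earlier slides.
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

pts = [(0, 2), (2, 0), (3, 1), (5, 1)]

for p, q, r in product(pts, repeat=3):
    assert dist(p, q) >= 0                                 # positivity
    assert (dist(p, q) == 0) == (p == q)                   # definiteness
    assert dist(p, q) == dist(q, p)                        # symmetry
    assert dist(p, r) <= dist(p, q) + dist(q, r) + 1e-12   # triangle inequality

print("all metric properties hold")
```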
Common Properties of a Similarity
Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data
objects), p and q.
Similarity Between Binary Vectors
A common situation is that objects p and q have only binary attributes.
Compute similarities using the following quantities:

M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard Coefficients

SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
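The example can be reproduced with a small sketch; `binary_counts`, `smc`, and `jaccard` are illustrative names:

```python
# SMC and Jaccard sketch for binary vectors (plain Python).

def binary_counts(p, q):
    """Return (M01, M10, M00, M11) for two equal-length binary vectors."""
    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    return m01, m10, m00, m11

def smc(p, q):
    m01, m10, m00, m11 = binary_counts(p, q)
    return (m11 + m00) / (m01 + m10 + m11 + m00)

def jaccard(p, q):
    m01, m10, m00, m11 = binary_counts(p, q)
    return m11 / (m01 + m10 + m11)

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(p, q))      # 0.7
print(jaccard(p, q))  # 0.0
```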
Cosine Similarity
If d1 and d2 are two document vectors, then

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

where · indicates the vector dot product and ||d|| is the length of vector d.
Example:
Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 0.3150
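The same computation as a sketch in plain Python:

```python
# Cosine similarity sketch, reproducing the document-vector example above.
from math import sqrt

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (sqrt(sum(a * a for a in d1)) * sqrt(sum(b * b for b in d2)))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315
```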
Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count
attributes
Reduces to Jaccard for binary attributes
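The slide does not show the formula itself; the standard extended Jaccard (Tanimoto) definition is T(p, q) = p·q / (||p||² + ||q||² − p·q), sketched here under that assumption:

```python
# Extended Jaccard (Tanimoto) sketch, using the standard definition
# T(p, q) = p.q / (||p||^2 + ||q||^2 - p.q).

def tanimoto(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sum(a * a for a in p) + sum(b * b for b in q) - dot)

# For binary vectors this equals the Jaccard coefficient:
p, q = [1, 1, 0, 1], [0, 1, 1, 1]
print(tanimoto(p, q))  # M11=2, M01=1, M10=1 -> 2 / 4 = 0.5
```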
Correlation
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, p and q, and then take their dot product
$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$

$q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$

$\mathrm{correlation}(p, q) = p' \cdot q'$
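A sketch of this recipe: standardize both objects, then take the dot product, dividing by n − 1 (since the sample standard deviation is used) so the result is the Pearson correlation in [−1, 1]:

```python
# Correlation sketch: standardize p and q, then take their dot product.
from math import sqrt

def standardize(v):
    n = len(v)
    mean = sum(v) / n
    std = sqrt(sum((x - mean) ** 2 for x in v) / (n - 1))  # sample std
    return [(x - mean) / std for x in v]

def correlation(p, q):
    # Divide by n - 1 to normalize the dot product into [-1, 1].
    return sum(a * b for a, b in zip(standardize(p), standardize(q))) / (len(p) - 1)

print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect linear relationship)
print(correlation([1, 2, 3], [6, 4, 2]))  # -1.0
```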
Visually Evaluating Correlation
Scatter plots showing the similarity from −1 to 1.
General Approach for Combining Similarities
Sometimes attributes are of many different
types, but an overall similarity is needed.
Using Weights to Combine Similarities
May not want to treat all attributes the same.
Use weights $w_k$, which are between 0 and 1 and sum to 1.
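One simple way to sketch this (the slide does not give the exact combination formula, so this is just a weighted average of assumed per-attribute similarities):

```python
# Weighted combination sketch: overall similarity as a weighted average
# of per-attribute similarities s_k, with weights summing to 1.

def combined_similarity(sims, weights):
    """sims: per-attribute similarities in [0, 1]; weights: sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * s for w, s in zip(weights, sims))

# Hypothetical per-attribute similarities for two objects:
sims = [1.0, 0.5, 0.0]       # e.g. exact match, partial match, mismatch
weights = [0.5, 0.3, 0.2]    # emphasize the first attribute
print(round(combined_similarity(sims, weights), 2))  # 0.65
```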
Density
Density-based clustering requires a notion of density.
Examples:
Euclidean density
Euclidean density = number of points per unit volume
Probability density
Graph-based density
Euclidean Density Cell-based
The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains.
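This can be sketched with a grid of unit cells (the cell size and points are illustrative):

```python
# Cell-based Euclidean density sketch: divide the plane into square cells
# and count the points each cell contains.
from collections import Counter
from math import floor

def cell_density(points, cell_size=1.0):
    """Map each cell index (i, j) to the number of points falling in it."""
    return Counter((floor(x / cell_size), floor(y / cell_size)) for x, y in points)

pts = [(0.2, 0.3), (0.7, 0.1), (0.5, 0.9), (2.1, 2.2)]
print(cell_density(pts)[(0, 0)])  # 3 points fall in the unit cell at the origin
```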
Euclidean Density Center-based
Euclidean density is the number of points within a
specified radius of the point
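A minimal sketch of the center-based definition (radius and points are illustrative):

```python
# Center-based Euclidean density sketch: the density at a point is the
# number of points within a specified radius of it.
from math import dist  # Python 3.8+

def center_density(point, points, radius):
    return sum(dist(point, p) <= radius for p in points)

pts = [(0, 0), (0.5, 0.5), (1, 0), (5, 5)]
print(center_density((0, 0), pts, radius=1.5))  # 3 (the far point is excluded)
```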