
Lecture 2. Similarity Measures for Cluster Analysis
• Basic Concept: Measuring Similarity between Objects
• Distance on Numeric Data: Minkowski Distance
• Proximity Measure for Symmetric vs. Asymmetric Binary Variables
• Distance between Categorical Attributes, Ordinal Attributes, and Mixed Types
• Proximity Measure between Two Vectors: Cosine Similarity
• Correlation Measures between Two Variables: Covariance and Correlation Coefficient
• Summary
Session 1: Basic Concepts: Measuring Similarity between Objects
What Is Good Clustering?
• A good clustering method will produce high-quality clusters, which should have
  • High intra-class similarity: Cohesive within clusters
  • Low inter-class similarity: Distinctive between clusters
• Quality function
  • There is usually a separate "quality" function that measures the "goodness" of a cluster
  • It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective
• There exist many similarity measures and/or functions for different applications
• The similarity measure is critical for cluster analysis
Similarity, Dissimilarity, and Proximity
• Similarity measure or similarity function
  • A real-valued function that quantifies the similarity between two objects
  • Measures how alike two data objects are: the higher the value, the more alike
  • Often falls in the range [0, 1], where 0 means no similarity and 1 means completely similar
• Dissimilarity (or distance) measure
  • Numerical measure of how different two data objects are
  • In some sense, the inverse of similarity: the lower the value, the more alike
  • Minimum dissimilarity is often 0 (i.e., completely similar)
  • Range [0, 1] or [0, ∞), depending on the definition
• Proximity usually refers to either similarity or dissimilarity
Session 2: Distance on Numeric Data: Minkowski Distance
Data Matrix and Dissimilarity Matrix
• Data matrix
  • A data matrix of n data points with l dimensions:
$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1l} \\ x_{21} & x_{22} & \cdots & x_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nl} \end{pmatrix}$$
• Dissimilarity (distance) matrix
  • n data points, but registers only the distance d(i, j) (typically a metric)
  • Usually symmetric, thus stored as a triangular matrix:
$$\begin{pmatrix} 0 & & & \\ d(2,1) & 0 & & \\ \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & 0 \end{pmatrix}$$
• Distance functions are usually different for real, boolean, categorical, ordinal, ratio, and vector variables
• Weights can be associated with different variables based on applications and data semantics
Example: Data Matrix and Dissimilarity Matrix
• Data matrix

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

• Dissimilarity matrix (by Euclidean distance)

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1     5.39  0
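To make the example concrete, here is a short NumPy sketch (variable names are illustrative, not from the slides) that reproduces the dissimilarity matrix above:

```python
import numpy as np

# Data matrix from the slide: 4 points, 2 attributes
X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)

# Pairwise Euclidean distances via broadcasting
diff = X[:, None, :] - X[None, :, :]       # shape (4, 4, 2)
dist = np.sqrt((diff ** 2).sum(axis=-1))   # symmetric, zero diagonal

print(np.round(dist, 2))
# Lower triangle matches the slide: 3.61, 2.24, 5.10, 4.24, 1.00, 5.39
```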
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: A popular distance measure
$$d(i, j) = \sqrt[p]{|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{il} - x_{jl}|^p}$$
where i = (x_i1, x_i2, ..., x_il) and j = (x_j1, x_j2, ..., x_jl) are two l-dimensional data objects, and p is the order (the distance so defined is also called the L-p norm)
• Properties
  • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positivity)
  • d(i, j) = d(j, i) (symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric
• Note: There are nonmetric dissimilarities, e.g., set differences
Special Cases of Minkowski Distance
• p = 1: (L1 norm) Manhattan (or city block) distance
  • E.g., the Hamming distance: the number of bits that are different between two binary vectors
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{il} - x_{jl}|$$
• p = 2: (L2 norm) Euclidean distance
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{il} - x_{jl}|^2}$$
• p → ∞: (Lmax norm, L∞ norm) "supremum" distance
  • The maximum difference between any component (attribute) of the vectors
$$d(i, j) = \lim_{p \to \infty} \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{il} - x_{jl}|^p \right)^{1/p} = \max_{f=1}^{l} |x_{if} - x_{jf}|$$
Example: Minkowski Distance at Special Cases

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

• Manhattan (L1)

L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

• Euclidean (L2)

L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1     5.39  0

• Supremum (L∞)

L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
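All three tables can be verified with one small helper (a minimal sketch; the function name `minkowski` is our own):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski (L_p) distance; p = np.inf gives the supremum distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(p):
        return np.abs(x - y).max()
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

x1, x2 = [1, 2], [3, 5]
print(minkowski(x1, x2, 1))       # 5.0      (Manhattan)
print(minkowski(x1, x2, 2))       # 3.605... (Euclidean)
print(minkowski(x1, x2, np.inf))  # 3.0      (supremum)
```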
Session 3: Proximity Measure for Symmetric vs. Asymmetric Binary Variables
Proximity Measure for Binary Attributes
• A contingency table for binary data, where q, r, s, and t count the attributes on which objects i and j take the values (1, 1), (1, 0), (0, 1), and (0, 0), respectively:

                Object j
                1        0        sum
Object i   1    q        r        q + r
           0    s        t        s + t
           sum  q + s    r + t    p

• Distance measure for symmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
• Distance measure for asymmetric binary variables (matches on 0 are considered uninformative and dropped):
$$d(i, j) = \frac{r + s}{q + r + s}$$
• Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$$
• Note: The Jaccard coefficient is the same as "coherence" (a concept discussed in Pattern Discovery)
Example: Dissimilarity between Asymmetric Binary Variables

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

• Gender is a symmetric attribute (not counted in)
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0

Contingency tables (rows: first object's values; columns: second object's values):

Jack vs. Mary      1    0    ∑row
              1    2    0    2
              0    1    3    4
           ∑col    3    3    6

Jack vs. Jim       1    0    ∑row
              1    1    1    2
              0    1    3    4
           ∑col    2    4    6

Jim vs. Mary       1    0    ∑row
              1    1    1    2
              0    2    2    4
           ∑col    3    3    6

• Distance:
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
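A minimal sketch of the asymmetric binary dissimilarity d = (r + s)/(q + r + s), reproducing the three distances above (the function name is illustrative):

```python
def asymmetric_binary_dissim(x, y):
    """d(i, j) = (r + s) / (q + r + s): matches on 0 are ignored.
    x, y are 0/1 vectors over the asymmetric binary attributes."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Y/P -> 1, N -> 0 over Fever, Cough, Test-1..Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asymmetric_binary_dissim(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissim(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissim(jim, mary), 2))   # 0.75
```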
Session 4: Distance between Categorical Attributes, Ordinal Attributes, and Mixed Types
Proximity Measure for Categorical Attributes
• Categorical data, also called nominal attributes
  • Example: Color (red, yellow, blue, green), profession, etc.
• Method 1: Simple matching
  • m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
• Method 2: Use a large number of binary attributes
  • Create a new binary attribute for each of the M nominal states
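Simple matching can be written directly from the formula; below is a minimal sketch with made-up attribute values:

```python
def simple_matching_dissim(x, y):
    """d(i, j) = (p - m) / p, where m is the number of matching attributes."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

# Two hypothetical objects described by (color, profession)
print(simple_matching_dissim(["red", "engineer"], ["red", "teacher"]))  # 0.5
```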
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank (e.g., freshman, sophomore, junior, senior)
• Can be treated like interval-scaled
  • Replace an ordinal variable value by its rank: r_if ∈ {1, ..., M_f}
  • Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
  • Example: freshman: 0; sophomore: 1/3; junior: 2/3; senior: 1
  • Then distance: d(freshman, senior) = 1, d(junior, senior) = 1/3
  • Compute the dissimilarity using methods for interval-scaled variables
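A minimal sketch of the rank-to-[0, 1] mapping, reproducing the freshman/senior example (function and variable names are our own):

```python
def ordinal_to_interval(rank, M):
    """Map rank r in {1, ..., M} onto [0, 1]: z = (r - 1) / (M - 1)."""
    return (rank - 1) / (M - 1)

ranks = {"freshman": 1, "sophomore": 2, "junior": 3, "senior": 4}
z = {name: ordinal_to_interval(r, 4) for name, r in ranks.items()}
print(abs(z["freshman"] - z["senior"]))  # 1.0
print(abs(z["junior"] - z["senior"]))    # 0.333...
```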
Attributes of Mixed Type
• A dataset may contain all attribute types
  • Nominal, symmetric binary, asymmetric binary, numeric, and ordinal
• One may use a weighted formula to combine their effects (see the sketch after this list):
$$d(i, j) = \frac{\sum_{f=1}^{p} w_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} w_{ij}^{(f)}}$$
• If f is numeric: Use the normalized distance
• If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf; d_ij^(f) = 1 otherwise
• If f is ordinal
  • Compute ranks r_if and z_if = (r_if − 1)/(M_f − 1)
  • Treat z_if as interval-scaled
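Below is a minimal sketch of the weighted mixed-type formula under simplifying assumptions: all indicator weights w_ij^(f) are 1, numeric attributes are pre-normalized to [0, 1], and the example record is hypothetical:

```python
def mixed_distance(x, y, types, M=None):
    """Weighted mixed-type dissimilarity (a minimal sketch).

    types[f] is one of "numeric", "nominal", "ordinal". Numeric values are
    assumed pre-normalized to [0, 1]; ordinal values are ranks in
    {1, ..., M[f]}. All weights w_ij^(f) are taken to be 1 here."""
    num = den = 0.0
    for f, t in enumerate(types):
        w = 1.0  # set to 0.0 to skip a missing or non-applicable attribute
        if t == "numeric":
            d = abs(x[f] - y[f])                 # normalized distance
        elif t == "nominal":
            d = 0.0 if x[f] == y[f] else 1.0     # simple matching
        else:                                    # ordinal
            zx = (x[f] - 1) / (M[f] - 1)         # map rank onto [0, 1]
            zy = (y[f] - 1) / (M[f] - 1)
            d = abs(zx - zy)                     # then treat as interval-scaled
        num += w * d
        den += w
    return num / den

# Hypothetical record: (normalized price, color, class rank out of 4)
print(mixed_distance((0.2, "red", 1), (0.5, "blue", 4),
                     ("numeric", "nominal", "ordinal"), M={2: 4}))  # 0.766...
```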
Session 5: Proximity Measure between Two Vectors: Cosine Similarity
Cosine Similarity of Two Vectors
• A document can be represented by a bag of terms or a long vector, with each attribute recording the frequency of a particular term (such as a word, keyword, or phrase) in the document
• Other vector objects: Gene features in micro-arrays
• Applications: Information retrieval, biological taxonomy, gene feature mapping, etc.
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$
where · indicates the vector dot product and ||d|| is the length of vector d
Example: Calculating Cosine Similarity
• Cosine similarity:
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$
where · indicates the vector dot product and ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)   d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
• First, calculate the vector dot product
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
• Then, calculate ||d1|| and ||d2||
||d1|| = √(5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0) = √42 ≈ 6.481
||d2|| = √(3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1) = √17 ≈ 4.123
• Finally, calculate the cosine similarity: cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
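The same computation in NumPy (a minimal sketch):

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 2))  # 0.94
```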
Session 6: Correlation Measures between Two Variables: Covariance and Correlation Coefficient
Variance for Single Variable
• The variance of a random variable X provides a measure of how much the value of X deviates from the mean or expected value of X:
$$\sigma^2 = \mathrm{var}(X) = E[(X - \mu)^2] = \begin{cases} \sum_x (x - \mu)^2 f(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$
where σ² is the variance of X, σ is called the standard deviation, and µ = E[X] is the mean (expected value) of X
• That is, variance is the expected value of the squared deviation from the mean
• It can also be written as:
$$\sigma^2 = \mathrm{var}(X) = E[(X - \mu)^2] = E[X^2] - \mu^2 = E[X^2] - (E[X])^2$$
• Sample variance is the average squared deviation of the data values x_i from the sample mean:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$
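A quick numerical check of the sample variance formula, using the stock X1 prices from the covariance example later in this session (a minimal sketch):

```python
import numpy as np

x = np.array([2, 3, 5, 4, 6], dtype=float)
mu = x.mean()
var = ((x - mu) ** 2).mean()   # (1/n) * sum of squared deviations
print(mu, var)                 # 4.0 2.0
print(np.var(x))               # 2.0 (np.var also divides by n by default)
```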
Covariance for Two Variables
• Covariance between two variables X1 and X2:
$$\sigma_{12} = E[(X_1 - \mu_1)(X_2 - \mu_2)] = E[X_1 X_2] - \mu_1 \mu_2 = E[X_1 X_2] - E[X_1]E[X_2]$$
where µ1 = E[X1] is the respective mean or expected value of X1; similarly for µ2
• Sample covariance between X1 and X2:
$$\hat{\sigma}_{12} = \frac{1}{n} \sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i2} - \hat{\mu}_2)$$
• Sample covariance is a generalization of the sample variance:
$$\hat{\sigma}_{11} = \frac{1}{n} \sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i1} - \hat{\mu}_1) = \frac{1}{n} \sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)^2 = \hat{\sigma}_1^2$$
• Positive covariance: σ12 > 0
• Negative covariance: σ12 < 0
• Independence: If X1 and X2 are independent, then σ12 = 0, but the reverse is not true
  • Some pairs of random variables may have covariance 0 yet not be independent
  • Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
Example: Calculation of Covariance
• Suppose two stocks X1 and X2 have the following prices over one week:
  • (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
• The computation can be simplified using:
$$\sigma_{12} = E[X_1 X_2] - E[X_1]E[X_2]$$
• E(X1) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
• E(X2) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• σ12 = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
• Thus, X1 and X2 rise together since σ12 > 0
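The same computation in NumPy (a minimal sketch; `round` is used only to hide floating-point noise):

```python
import numpy as np

x1 = np.array([2, 3, 5, 4, 6], dtype=float)     # stock X1 prices
x2 = np.array([5, 8, 10, 11, 14], dtype=float)  # stock X2 prices

cov = (x1 * x2).mean() - x1.mean() * x2.mean()  # E[X1 X2] - E[X1]E[X2]
print(round(cov, 2))  # 4.0 -> positive, so the two stocks tend to rise together
```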
Correlation between Two Numerical Variables
• Correlation between two variables X1 and X2 is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable:
$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2} = \frac{\sigma_{12}}{\sqrt{\sigma_1^2 \sigma_2^2}}$$
• Sample correlation for two attributes X1 and X2:
$$\hat{\rho}_{12} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_1 \hat{\sigma}_2} = \frac{\sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i2} - \hat{\mu}_2)}{\sqrt{\sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)^2} \sqrt{\sum_{i=1}^{n} (x_{i2} - \hat{\mu}_2)^2}}$$
where n is the number of tuples, µ1 and µ2 are the respective means of X1 and X2, and σ1 and σ2 are the respective standard deviations of X1 and X2
• If ρ12 > 0: X1 and X2 are positively correlated (X1's values increase as X2's do)
  • The higher the value, the stronger the correlation
• If ρ12 = 0: independent (under the same assumption as discussed for covariance)
• If ρ12 < 0: negatively correlated
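Verifying with NumPy's built-in sample correlation, again on the stock example (a minimal sketch):

```python
import numpy as np

x1 = np.array([2, 3, 5, 4, 6], dtype=float)
x2 = np.array([5, 8, 10, 11, 14], dtype=float)

rho = np.corrcoef(x1, x2)[0, 1]  # off-diagonal entry of the correlation matrix
print(round(rho, 3))  # 0.941 -> strong positive correlation
```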
Visualizing Changes of Correlation Coefficient
• Correlation coefficient value range: [−1, 1]
• A set of scatter plots (figure omitted) shows sets of points and their correlation coefficients changing from −1 to 1
Covariance Matrix
• The variance and covariance information for the two variables X1 and X2 can be summarized as a 2 × 2 covariance matrix:
$$\Sigma = E[(X - \mu)(X - \mu)^T] = E\left[ \begin{pmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{pmatrix} \begin{pmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{pmatrix} \right]$$
$$= \begin{pmatrix} E[(X_1 - \mu_1)(X_1 - \mu_1)] & E[(X_1 - \mu_1)(X_2 - \mu_2)] \\ E[(X_2 - \mu_2)(X_1 - \mu_1)] & E[(X_2 - \mu_2)(X_2 - \mu_2)] \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$
• In general, for d numerical attributes X1, X2, ..., Xd with data matrix D (n points, d dimensions), we have
$$\Sigma = E[(X - \mu)(X - \mu)^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$$
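For the stock example, the full covariance matrix can be computed with np.cov (a minimal sketch; bias=True selects the 1/n convention used in these slides):

```python
import numpy as np

# Rows are observations (the stock example), columns are X1 and X2
D = np.array([[2, 5], [3, 8], [5, 10], [4, 11], [6, 14]], dtype=float)

Sigma = np.cov(D, rowvar=False, bias=True)  # bias=True -> divide by n
print(Sigma)
# [[2.   4.  ]
#  [4.   9.04]]  variances on the diagonal, covariance off the diagonal
```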
Session 7: Summary
Summary: Similarity Measures for Cluster Analysis
• Basic Concept: Measuring Similarity between Objects
• Distance on Numeric Data: Minkowski Distance
• Proximity Measure for Symmetric vs. Asymmetric Binary Variables
• Distance between Categorical Attributes, Ordinal Attributes, and Mixed Types
• Proximity Measure between Two Vectors: Cosine Similarity
• Correlation Measures between Two Variables: Covariance and Correlation Coefficient
• Summary
Recommended Readings
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990
• Mohammed J. Zaki and Wagner Meira, Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014
• Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
• Charu Aggarwal and Chandan K. Reddy (eds.). Data Clustering: Algorithms and Applications. CRC Press, 2014
