
Data Mining and Predictive Modeling
Lecture 13: Measuring Data Similarity

Dr. Abinash Pujahari


Data Similarity and Dissimilarity

Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0, 1]

Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies

Proximity refers to a similarity or dissimilarity.
Data Matrix and Dissimilarity Matrix
• Data Matrix
  • n data points with p dimensions
  • Two modes: rows index objects, columns index attributes
• Dissimilarity Matrix
  • n data points, but registers only the distance between each pair
  • A triangular matrix
  • Single mode: rows and columns both index objects (see the sketch below)
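
A minimal sketch of the two structures in Python, using SciPy's pdist/squareform; the data values are made up for illustration:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Data matrix: n = 4 points with p = 2 dimensions (two modes:
    # rows index objects, columns index attributes).
    X = np.array([[1.0, 2.0],
                  [3.0, 5.0],
                  [2.0, 0.0],
                  [4.0, 4.0]])

    # pdist returns the condensed (triangular) form: one entry per pair.
    condensed = pdist(X, metric="euclidean")

    # squareform expands it into the full n x n dissimilarity matrix
    # (single mode: rows and columns both index objects).
    D = squareform(condensed)
    print(np.round(D, 3))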
Proximity Measure for Nominal Attributes
• A nominal attribute can take two or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
• Method 1: Simple matching
  • m: number of matches, p: total number of variables

    d(i, j) = (p − m) / p

• Method 2: Use a large number of binary attributes
  • Create a new binary attribute for each of the M nominal states (see the sketch below)
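
Both methods are easy to sketch in Python; the color/shape/size attributes below are hypothetical:

    import numpy as np

    def nominal_dissimilarity(i, j):
        # Method 1: simple matching, d(i, j) = (p - m) / p
        i, j = np.asarray(i), np.asarray(j)
        p = len(i)                 # total number of variables
        m = int(np.sum(i == j))    # number of matching states
        return (p - m) / p

    obj1 = ["red", "circle", "large"]
    obj2 = ["red", "square", "large"]
    print(nominal_dissimilarity(obj1, obj2))  # (3 - 2) / 3 = 0.333...

    # Method 2: one binary attribute per nominal state (one-hot encoding).
    states = ["red", "yellow", "blue", "green"]
    one_hot = lambda v: [int(v == s) for s in states]
    print(one_hot("red"))   # [1, 0, 0, 0]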
Similarity/Dissimilarity for Simple Attributes
Here p and q are the attribute values for two data objects; the appropriate measure depends on the attribute type (nominal, ordinal, interval).
Euclidean Distance
• Euclidean Distance

    dist = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )

  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
Euclidean Distance

    point  x  y
    p1     0  2
    p2     2  0
    p3     3  1
    p4     5  1

(The original slide plots these four points in the x–y plane.)

           p1     p2     p3     p4
    p1     0      2.828  3.162  5.099
    p2     2.828  0      1.414  3.162
    p3     3.162  1.414  0      2
    p4     5.099  3.162  2      0

Distance Matrix
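
The distance matrix above can be reproduced directly from the definition; a minimal NumPy sketch:

    import numpy as np

    points = np.array([[0, 2],   # p1
                       [2, 0],   # p2
                       [3, 1],   # p3
                       [5, 1]])  # p4

    # diff[i, j] = points[i] - points[j], via broadcasting
    diff = points[:, None, :] - points[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    print(np.round(D, 3))
    # [[0.    2.828 3.162 5.099]
    #  [2.828 0.    1.414 3.162]
    #  [3.162 1.414 0.    2.   ]
    #  [5.099 3.162 2.    0.   ]]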
Minkowski Distance

Minkowski distance is a generalization of Euclidean distance:

    dist = ( Σ_{k=1}^{n} |p_k − q_k|^r )^{1/r}

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance Examples
• r = 1: City block (Manhattan, taxicab, L1 norm) distance.
  • A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• r = 2: Euclidean distance.
• r → ∞: "supremum" (Lmax norm, L∞ norm) distance.
  • This is the maximum difference between any component of the vectors.
• Do not confuse r with n; all these distances are defined for any number of dimensions (see the sketch below).
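
A short sketch of the three special cases on one made-up pair of points:

    import numpy as np

    def minkowski(p, q, r):
        # dist = (sum_k |p_k - q_k|^r)^(1/r); r = np.inf gives the max
        d = np.abs(np.asarray(p, float) - np.asarray(q, float))
        if np.isinf(r):
            return float(d.max())    # supremum (L-infinity) distance
        return float((d ** r).sum() ** (1.0 / r))

    p, q = [0, 2], [5, 1]
    print(minkowski(p, q, 1))       # 6.0    city block (L1)
    print(minkowski(p, q, 2))       # 5.099  Euclidean (L2)
    print(minkowski(p, q, np.inf))  # 5.0    supremum (L-infinity)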
Mahalanobis Distance

    mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ

where Σ is the covariance matrix of the input data X:

    Σ_{j,k} = (1 / (n − 1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)

For the red points in the original slide's figure, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.
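
A minimal sketch, following the slide's form (p − q) Σ⁻¹ (p − q)ᵀ; the 2-D data set here is randomly generated for illustration:

    import numpy as np

    def mahalanobis(p, q, X):
        # Covariance matrix Sigma is estimated from the input data X
        # (rows = points); the slide's form omits the square root.
        cov = np.cov(X, rowvar=False)
        diff = np.asarray(p, float) - np.asarray(q, float)
        return float(diff @ np.linalg.inv(cov) @ diff)

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 3.0]], size=500)
    print(mahalanobis([0.5, 0.5], [-0.5, -0.5], X))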


Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

• A distance that satisfies these properties is a metric.


Common Properties of a Similarity
• Similarities also have some well-known properties:
  1. s(p, q) = 1 (or maximum similarity) only if p = q.
  2. s(p, q) = s(q, p) for all p and q. (Symmetry)
  where s(p, q) is the similarity between points (data objects) p and q.
Similarity Between Binary Vectors
• A common situation is that objects p and q have only binary attributes
• Compute similarities using the following quantities:
  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients

  SMC = number of matches / number of attributes
      = (M11 + M00) / (M01 + M10 + M11 + M00)

  J = number of 11 matches / number of not-both-zero attribute values
    = M11 / (M01 + M10 + M11)
SMC vs Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
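
The same example, reproduced as a short sketch:

    import numpy as np

    p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

    m01 = int(((p == 0) & (q == 1)).sum())   # 2
    m10 = int(((p == 1) & (q == 0)).sum())   # 1
    m00 = int(((p == 0) & (q == 0)).sum())   # 7
    m11 = int(((p == 1) & (q == 1)).sum())   # 0

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    j = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    print(smc, j)   # 0.7 0.0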


Cosine Similarity
• If d1 and d2 are two document vectors, then

    cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

  where • indicates the vector dot product and ||d|| is the length (norm) of vector d.

• Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2

  d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

  cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
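
The worked example maps directly to a few lines of NumPy:

    import numpy as np

    def cosine_similarity(d1, d2):
        # cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
        d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
        return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
    print(round(cosine_similarity(d1, d2), 4))   # 0.315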


Extended Jaccard Coefficient (Tanimoto)
• A variation of Jaccard for continuous or count attributes:

    EJ(p, q) = (p • q) / (||p||² + ||q||² − p • q)

• Reduces to the Jaccard coefficient for binary attributes
Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects p and q and then take their dot product:

    p′_k = (p_k − mean(p)) / std(p)
    q′_k = (q_k − mean(q)) / std(q)

    correlation(p, q) = p′ • q′
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed: compute a similarity for each attribute separately, then combine those per-attribute similarities into one overall value.
Using Weights to Combine Similarities
• We may not want to treat all attributes the same.
• Use weights w_k that lie between 0 and 1 and sum to 1; the overall similarity is then a weighted sum of the per-attribute similarities (see the sketch below).
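
A minimal sketch of the weighted combination; the per-attribute similarities and weights below are hypothetical:

    import numpy as np

    def combined_similarity(sims, weights):
        # Overall similarity as a weighted sum of per-attribute
        # similarities s_k; weights lie in [0, 1] and sum to 1.
        sims = np.asarray(sims, float)
        weights = np.asarray(weights, float)
        assert np.isclose(weights.sum(), 1.0)
        return float(weights @ sims)

    # e.g. a nominal match, a scaled numeric similarity, a binary mismatch
    sims = [1.0, 0.62, 0.0]
    weights = [0.5, 0.3, 0.2]
    print(combined_similarity(sims, weights))   # 0.686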
