Data Mining and Predictive Modeling: Lecture 13: Measuring Data Similarity and Dissimilarity
Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
Proximity refers to either a similarity or a dissimilarity
Data Matrix and Dissimilarity Matrix
• Data Matrix
• N data points with p dimensions
• Two modes
• Dissimilarity Matrix
• N data points, but registers only the distances
• A triangular matrix
• Single mode
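The relationship between the two matrices can be sketched in a few lines: given an n × p data matrix, compute the symmetric n × n dissimilarity matrix of pairwise distances. The function name and sample data are illustrative, not from the lecture.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Return the n x n matrix of pairwise Euclidean distances."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):           # only the lower triangle is computed
            D[i, j] = np.linalg.norm(X[i] - X[j])
            D[j, i] = D[i, j]        # the matrix is symmetric
    return D

X = np.array([[0.0, 2.0],            # a 2 x 2 data matrix (2 points, 2 dims)
              [2.0, 0.0]])
D = dissimilarity_matrix(X)          # distance between the points: sqrt(8)
```

Only the lower triangle is computed because the diagonal is zero and the matrix is symmetric, which is why the slide calls it a triangular, single-mode matrix.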
Proximity Measure for Nominal Attributes
• Method 1: Simple matching
d(i, j) = (p − m) / p
where m is the number of attributes on which objects i and j match, and p is the total number of attributes.
• Method 2: Use a large number of binary attributes
• Create a new binary attribute for each of the M nominal states
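The simple-matching formula d(i, j) = (p − m) / p can be sketched directly; the function name and example attribute values below are illustrative.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p for nominal attributes."""
    p = len(obj_i)                                   # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # matches
    return (p - m) / p

# Two of the three attributes match, so d = (3 - 2) / 3
d = nominal_dissimilarity(["red", "circle", "small"],
                          ["red", "square", "small"])
```

Method 2 (one-hot encoding each of the M states into binary attributes) yields the same result when the binary attributes are compared with simple matching over the original p positions.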
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Euclidean Distance
• Euclidean Distance:
dist = ( Σ_{k=1}^{n} (p_k − q_k)^2 )^(1/2)
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
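A minimal sketch of why standardization matters: z-scoring each attribute (subtracting its mean, dividing by its standard deviation) keeps a large-scale attribute from dominating the distance. The attribute meanings in the comment are hypothetical.

```python
import numpy as np

X = np.array([[170.0, 70000.0],   # e.g. height (cm) and salary: very different scales
              [160.0, 30000.0],
              [180.0, 50000.0]])

# Z-score standardization: each column ends up with mean 0 and std 1,
# so both attributes contribute comparably to Euclidean distance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, the salary column alone would determine nearly all of the distance between rows.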
Euclidean Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

[Scatter plot of points p1–p4 in the x–y plane]

Distance Matrix
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
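The distance matrix on this slide can be recomputed from the four points directly:

```python
import numpy as np

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    """dist = sqrt(sum over k of (p_k - q_k)^2)."""
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q)) ** 0.5

names = list(points)
D = np.array([[euclidean(points[a], points[b]) for b in names]
              for a in names])
# e.g. D[0, 1] = sqrt(4 + 4) ≈ 2.828 and D[0, 3] = sqrt(25 + 1) ≈ 5.099,
# matching the distance matrix on the slide.
```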
Minkowski Distance
dist = ( Σ_{k=1}^{n} |p_k − q_k|^r )^(1/r)
• r = 2: Euclidean distance
• Do not confuse the parameter r with the dimensionality n; all of these distances are defined for any number of dimensions.
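A short sketch of the Minkowski distance for a general r; setting r = 2 recovers the Euclidean distance from the earlier slide, and r = 1 gives the city-block (Manhattan) distance.

```python
def minkowski(p, q, r):
    """dist = (sum over k of |p_k - q_k|^r) ** (1/r)."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

d2 = minkowski((0, 2), (2, 0), r=2)   # Euclidean: sqrt(8)
d1 = minkowski((0, 2), (2, 0), r=1)   # city block: |0-2| + |2-0| = 4
```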
Mahalanobis Distance
mahalanobis(p, q) = (p − q) Σ^(−1) (p − q)^T
where Σ is the covariance matrix of the data, with entries
Σ_{j,k} = (1 / (n − 1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)
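The definition above can be sketched as follows; X is a sample used to estimate the covariance matrix Σ, and the example data reuses the four points from the Euclidean-distance slide.

```python
import numpy as np

def mahalanobis(p, q, X):
    """mahalanobis(p, q) = (p - q) Sigma^{-1} (p - q)^T."""
    diff = np.asarray(p, float) - np.asarray(q, float)
    sigma = np.cov(X, rowvar=False)   # (n-1)-normalized covariance matrix
    return float(diff @ np.linalg.inv(sigma) @ diff)

X = np.array([[0.0, 2.0], [2.0, 0.0], [3.0, 1.0], [5.0, 1.0]])
d = mahalanobis((0, 2), (2, 0), X)
```

Unlike Euclidean distance, this measure rescales by, and decorrelates across, the attributes, so it accounts for differing variances and for correlation between dimensions.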
Cosine Similarity
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
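The arithmetic above can be checked directly; this sketch recomputes the dot product, the two norms, and the cosine for the document vectors d1 and d2.

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))    # 3*1 + 2*1 = 5
norm1 = math.sqrt(sum(a * a for a in d1))   # sqrt(42)
norm2 = math.sqrt(sum(b * b for b in d2))   # sqrt(6)
cos = dot / (norm1 * norm2)                 # cosine similarity
```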
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed.
Using Weights to Combine Similarities
• May not want to treat all attributes the same.
• Use weights w_k that are between 0 and 1 and sum to 1.
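A minimal sketch of the weighted combination: each per-attribute similarity s_k is multiplied by its weight w_k and the products are summed. The function name and the example similarity values are illustrative placeholders.

```python
def combined_similarity(similarities, weights):
    """Weighted sum of per-attribute similarities; weights sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, similarities))

# Three attributes with similarities 0.8, 0.5, 1.0 and weights 0.5, 0.3, 0.2:
# 0.5*0.8 + 0.3*0.5 + 0.2*1.0 = 0.75
s = combined_similarity([0.8, 0.5, 1.0], [0.5, 0.3, 0.2])
```

Because the weights sum to 1, the combined similarity stays in the same [0, 1] range as the individual per-attribute similarities.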