
Data Mining and Predictive Modeling
Lecture 13: Measuring Data Similarity

Dr. Abinash Pujahari


Data Similarity and Dissimilarity

Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0, 1]

Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies

Proximity refers to a similarity or dissimilarity.
Data Matrix and Dissimilarity Matrix
• Data Matrix
  • n data points with p dimensions
  • Two modes: rows index objects, columns index attributes
• Dissimilarity Matrix
  • n data points, but registers only the distance between each pair
  • A triangular matrix
  • Single mode: rows and columns both index objects (see the sketch below)
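
A minimal sketch of the two structures in Python, using SciPy's pdist/squareform; the data values are made up for illustration:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Data matrix: n = 4 points with p = 2 dimensions (two modes:
    # rows index objects, columns index attributes).
    X = np.array([[1.0, 2.0],
                  [3.0, 5.0],
                  [2.0, 0.0],
                  [4.0, 4.0]])

    # pdist returns the condensed (triangular) form: one entry per pair.
    condensed = pdist(X, metric="euclidean")

    # squareform expands it into the full n x n dissimilarity matrix
    # (single mode: rows and columns both index objects).
    D = squareform(condensed)
    print(np.round(D, 3))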
Proximity Measure for Nominal Attributes
• A nominal attribute can take two or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
• Method 1: Simple matching
  • m: number of matches, p: total number of variables

    d(i, j) = (p − m) / p

• Method 2: Use a large number of binary attributes
  • Create a new binary attribute for each of the M nominal states (see the sketch below)
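
Both methods are easy to sketch in Python; the color/shape/size attributes below are hypothetical:

    import numpy as np

    def nominal_dissimilarity(i, j):
        # Method 1: simple matching, d(i, j) = (p - m) / p
        i, j = np.asarray(i), np.asarray(j)
        p = len(i)                 # total number of variables
        m = int(np.sum(i == j))    # number of matching states
        return (p - m) / p

    obj1 = ["red", "circle", "large"]
    obj2 = ["red", "square", "large"]
    print(nominal_dissimilarity(obj1, obj2))  # (3 - 2) / 3 = 0.333...

    # Method 2: one binary attribute per nominal state (one-hot encoding).
    states = ["red", "yellow", "blue", "green"]
    one_hot = lambda v: [int(v == s) for s in states]
    print(one_hot("red"))   # [1, 0, 0, 0]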
Similarity/Dissimilarity for Simple Attributes
Here p and q are the attribute values for two data objects; the appropriate measure depends on the attribute type (nominal, ordinal, interval).
Euclidean Distance
• Euclidean Distance

    dist = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )

  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
Euclidean Distance

    point  x  y
    p1     0  2
    p2     2  0
    p3     3  1
    p4     5  1

(The original slide plots these four points in the x–y plane.)

           p1     p2     p3     p4
    p1     0      2.828  3.162  5.099
    p2     2.828  0      1.414  3.162
    p3     3.162  1.414  0      2
    p4     5.099  3.162  2      0

Distance Matrix
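
The distance matrix above can be reproduced directly from the definition; a minimal NumPy sketch:

    import numpy as np

    points = np.array([[0, 2],   # p1
                       [2, 0],   # p2
                       [3, 1],   # p3
                       [5, 1]])  # p4

    # diff[i, j] = points[i] - points[j], via broadcasting
    diff = points[:, None, :] - points[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    print(np.round(D, 3))
    # [[0.    2.828 3.162 5.099]
    #  [2.828 0.    1.414 3.162]
    #  [3.162 1.414 0.    2.   ]
    #  [5.099 3.162 2.    0.   ]]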
Minkowski Distance

Minkowski distance is a generalization of Euclidean distance:

    dist = ( Σ_{k=1}^{n} |p_k − q_k|^r )^{1/r}

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance Examples
• r = 1: City block (Manhattan, taxicab, L1 norm) distance.
  • A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• r = 2: Euclidean distance.
• r → ∞: "supremum" (Lmax norm, L∞ norm) distance.
  • This is the maximum difference between any component of the vectors.
• Do not confuse r with n; all these distances are defined for any number of dimensions (see the sketch below).
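
A short sketch of the three special cases on one made-up pair of points:

    import numpy as np

    def minkowski(p, q, r):
        # dist = (sum_k |p_k - q_k|^r)^(1/r); r = np.inf gives the max
        d = np.abs(np.asarray(p, float) - np.asarray(q, float))
        if np.isinf(r):
            return float(d.max())    # supremum (L-infinity) distance
        return float((d ** r).sum() ** (1.0 / r))

    p, q = [0, 2], [5, 1]
    print(minkowski(p, q, 1))       # 6.0    city block (L1)
    print(minkowski(p, q, 2))       # 5.099  Euclidean (L2)
    print(minkowski(p, q, np.inf))  # 5.0    supremum (L-infinity)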
Mahalanobis Distance

    mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ

where Σ is the covariance matrix of the input data X:

    Σ_{j,k} = (1 / (n − 1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)

For the red points in the original slide's figure, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.
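
A minimal sketch, following the slide's form (p − q) Σ⁻¹ (p − q)ᵀ; the 2-D data set here is randomly generated for illustration:

    import numpy as np

    def mahalanobis(p, q, X):
        # Covariance matrix Sigma is estimated from the input data X
        # (rows = points); the slide's form omits the square root.
        cov = np.cov(X, rowvar=False)
        diff = np.asarray(p, float) - np.asarray(q, float)
        return float(diff @ np.linalg.inv(cov) @ diff)

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 3.0]], size=500)
    print(mahalanobis([0.5, 0.5], [-0.5, -0.5], X))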


Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

• A distance that satisfies these properties is a metric.


Common Properties of a Similarity
• Similarities also have some well-known properties:
  1. s(p, q) = 1 (or maximum similarity) only if p = q.
  2. s(p, q) = s(q, p) for all p and q. (Symmetry)
  where s(p, q) is the similarity between points (data objects) p and q.
Similarity Between Binary Vectors
• A common situation is that objects p and q have only binary attributes
• Compute similarities using the following quantities:
  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients

  SMC = number of matches / number of attributes
      = (M11 + M00) / (M01 + M10 + M11 + M00)

  J = number of 11 matches / number of not-both-zero attribute values
    = M11 / (M01 + M10 + M11)
SMC vs Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
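
The same example, reproduced as a short sketch:

    import numpy as np

    p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

    m01 = int(((p == 0) & (q == 1)).sum())   # 2
    m10 = int(((p == 1) & (q == 0)).sum())   # 1
    m00 = int(((p == 0) & (q == 0)).sum())   # 7
    m11 = int(((p == 1) & (q == 1)).sum())   # 0

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    j = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    print(smc, j)   # 0.7 0.0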


Cosine Similarity
• If d1 and d2 are two document vectors, then

    cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

  where • indicates the vector dot product and ||d|| is the length (norm) of vector d.

• Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2

  d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

  cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
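
The worked example maps directly to a few lines of NumPy:

    import numpy as np

    def cosine_similarity(d1, d2):
        # cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
        d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
        return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
    print(round(cosine_similarity(d1, d2), 4))   # 0.315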


Extended Jaccard Coefficient (Tanimoto)
• A variation of Jaccard for continuous or count attributes:

    EJ(p, q) = (p • q) / (||p||² + ||q||² − p • q)

• Reduces to the Jaccard coefficient for binary attributes
Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects p and q and then take their dot product:

    p′_k = (p_k − mean(p)) / std(p)
    q′_k = (q_k − mean(q)) / std(q)

    correlation(p, q) = p′ • q′
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed: compute a similarity for each attribute separately, then combine those per-attribute similarities into one overall value.
Using Weights to Combine Similarities
• We may not want to treat all attributes the same.
• Use weights w_k that lie between 0 and 1 and sum to 1; the overall similarity is then a weighted sum of the per-attribute similarities (see the sketch below).
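
A minimal sketch of the weighted combination; the per-attribute similarities and weights below are hypothetical:

    import numpy as np

    def combined_similarity(sims, weights):
        # Overall similarity as a weighted sum of per-attribute
        # similarities s_k; weights lie in [0, 1] and sum to 1.
        sims = np.asarray(sims, float)
        weights = np.asarray(weights, float)
        assert np.isclose(weights.sum(), 1.0)
        return float(weights @ sims)

    # e.g. a nominal match, a scaled numeric similarity, a binary mismatch
    sims = [1.0, 0.62, 0.0]
    weights = [0.5, 0.3, 0.2]
    print(combined_similarity(sims, weights))   # 0.686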
