
Lecture 2. Similarity Measures for Cluster Analysis
• Basic Concept: Measuring Similarity between Objects
• Distance on Numeric Data: Minkowski Distance
• Proximity Measure for Symmetric vs. Asymmetric Binary Variables
• Distance between Categorical Attributes, Ordinal Attributes, and Mixed Types
• Proximity Measure between Two Vectors: Cosine Similarity
• Correlation Measures between Two Variables: Covariance and Correlation Coefficient
• Summary
Session 1: Basic Concepts: Measuring Similarity between Objects
What Is Good Clustering?
• A good clustering method will produce high-quality clusters, which should have
  • High intra-class similarity: Cohesive within clusters
  • Low inter-class similarity: Distinctive between clusters
• Quality function
  • There is usually a separate "quality" function that measures the "goodness" of a cluster
  • It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective
• There exist many similarity measures and/or functions for different applications
• The similarity measure is critical for cluster analysis
Similarity, Dissimilarity, and Proximity
• Similarity measure or similarity function
  • A real-valued function that quantifies the similarity between two objects
  • Measures how alike two data objects are: the higher the value, the more alike
  • Often falls in the range [0, 1], where 0 means no similarity and 1 means completely similar
• Dissimilarity (or distance) measure
  • Numerical measure of how different two data objects are
  • In some sense, the inverse of similarity: the lower the value, the more alike
  • Minimum dissimilarity is often 0 (i.e., completely similar)
  • Range [0, 1] or [0, ∞), depending on the definition
• Proximity usually refers to either similarity or dissimilarity
Session 2: Distance on Numeric Data: Minkowski Distance
Data Matrix and Dissimilarity Matrix
• Data matrix
  • A data matrix of n data points with l dimensions:
$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1l} \\ x_{21} & x_{22} & \cdots & x_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nl} \end{pmatrix}$$
• Dissimilarity (distance) matrix
  • n data points, but registers only the distance d(i, j) (typically a metric)
  • Usually symmetric, thus stored as a triangular matrix:
$$\begin{pmatrix} 0 & & & \\ d(2,1) & 0 & & \\ \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & 0 \end{pmatrix}$$
• Distance functions are usually different for real, boolean, categorical, ordinal, ratio, and vector variables
• Weights can be associated with different variables based on applications and data semantics
Example: Data Matrix and Dissimilarity Matrix
• Data matrix

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

• Dissimilarity matrix (by Euclidean distance)

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1     5.39  0
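To make the example concrete, here is a short NumPy sketch (variable names are illustrative, not from the slides) that reproduces the dissimilarity matrix above:

```python
import numpy as np

# Data matrix from the slide: 4 points, 2 attributes
X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)

# Pairwise Euclidean distances via broadcasting
diff = X[:, None, :] - X[None, :, :]       # shape (4, 4, 2)
dist = np.sqrt((diff ** 2).sum(axis=-1))   # symmetric, zero diagonal

print(np.round(dist, 2))
# Lower triangle matches the slide: 3.61, 2.24, 5.10, 4.24, 1.00, 5.39
```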
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: A popular distance measure
$$d(i, j) = \sqrt[p]{|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{il} - x_{jl}|^p}$$
where i = (x_i1, x_i2, ..., x_il) and j = (x_j1, x_j2, ..., x_jl) are two l-dimensional data objects, and p is the order (the distance so defined is also called the L-p norm)
• Properties
  • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positivity)
  • d(i, j) = d(j, i) (symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric
• Note: There are nonmetric dissimilarities, e.g., set differences
Special Cases of Minkowski Distance
• p = 1: (L1 norm) Manhattan (or city block) distance
  • E.g., the Hamming distance: the number of bits that are different between two binary vectors
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{il} - x_{jl}|$$
• p = 2: (L2 norm) Euclidean distance
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{il} - x_{jl}|^2}$$
• p → ∞: (Lmax norm, L∞ norm) "supremum" distance
  • The maximum difference between any component (attribute) of the vectors
$$d(i, j) = \lim_{p \to \infty} \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{il} - x_{jl}|^p \right)^{1/p} = \max_{f=1}^{l} |x_{if} - x_{jf}|$$
Example: Minkowski Distance at Special Cases

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

• Manhattan (L1)

L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

• Euclidean (L2)

L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1     5.39  0

• Supremum (L∞)

L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
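All three tables can be verified with one small helper (a minimal sketch; the function name `minkowski` is our own):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski (L_p) distance; p = np.inf gives the supremum distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(p):
        return np.abs(x - y).max()
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

x1, x2 = [1, 2], [3, 5]
print(minkowski(x1, x2, 1))       # 5.0      (Manhattan)
print(minkowski(x1, x2, 2))       # 3.605... (Euclidean)
print(minkowski(x1, x2, np.inf))  # 3.0      (supremum)
```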
Session 3: Proximity Measure for Symmetric vs. Asymmetric Binary Variables
Proximity Measure for Binary Attributes
• A contingency table for binary data, where q, r, s, and t count the attributes on which objects i and j take the values (1, 1), (1, 0), (0, 1), and (0, 0), respectively:

                Object j
                1        0        sum
Object i   1    q        r        q + r
           0    s        t        s + t
           sum  q + s    r + t    p

• Distance measure for symmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
• Distance measure for asymmetric binary variables (matches on 0 are considered uninformative and dropped):
$$d(i, j) = \frac{r + s}{q + r + s}$$
• Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$$
• Note: The Jaccard coefficient is the same as "coherence" (a concept discussed in Pattern Discovery)
Example: Dissimilarity between Asymmetric Binary Variables

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

• Gender is a symmetric attribute (not counted in)
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0

Contingency tables (rows: first object's values; columns: second object's values):

Jack vs. Mary      1    0    ∑row
              1    2    0    2
              0    1    3    4
           ∑col    3    3    6

Jack vs. Jim       1    0    ∑row
              1    1    1    2
              0    1    3    4
           ∑col    2    4    6

Jim vs. Mary       1    0    ∑row
              1    1    1    2
              0    2    2    4
           ∑col    3    3    6

• Distance:
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
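A minimal sketch of the asymmetric binary dissimilarity d = (r + s)/(q + r + s), reproducing the three distances above (the function name is illustrative):

```python
def asymmetric_binary_dissim(x, y):
    """d(i, j) = (r + s) / (q + r + s): matches on 0 are ignored.
    x, y are 0/1 vectors over the asymmetric binary attributes."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Y/P -> 1, N -> 0 over Fever, Cough, Test-1..Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asymmetric_binary_dissim(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissim(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissim(jim, mary), 2))   # 0.75
```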
Session 4: Distance between Categorical Attributes, Ordinal Attributes, and Mixed Types
Proximity Measure for Categorical Attributes
• Categorical data, also called nominal attributes
  • Example: Color (red, yellow, blue, green), profession, etc.
• Method 1: Simple matching
  • m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
• Method 2: Use a large number of binary attributes
  • Create a new binary attribute for each of the M nominal states
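Simple matching can be written directly from the formula; below is a minimal sketch with made-up attribute values:

```python
def simple_matching_dissim(x, y):
    """d(i, j) = (p - m) / p, where m is the number of matching attributes."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

# Two hypothetical objects described by (color, profession)
print(simple_matching_dissim(["red", "engineer"], ["red", "teacher"]))  # 0.5
```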
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank (e.g., freshman, sophomore, junior, senior)
• Can be treated like interval-scaled
  • Replace an ordinal variable value by its rank: r_if ∈ {1, ..., M_f}
  • Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
  • Example: freshman: 0; sophomore: 1/3; junior: 2/3; senior: 1
  • Then distance: d(freshman, senior) = 1, d(junior, senior) = 1/3
  • Compute the dissimilarity using methods for interval-scaled variables
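A minimal sketch of the rank-to-[0, 1] mapping, reproducing the freshman/senior example (function and variable names are our own):

```python
def ordinal_to_interval(rank, M):
    """Map rank r in {1, ..., M} onto [0, 1]: z = (r - 1) / (M - 1)."""
    return (rank - 1) / (M - 1)

ranks = {"freshman": 1, "sophomore": 2, "junior": 3, "senior": 4}
z = {name: ordinal_to_interval(r, 4) for name, r in ranks.items()}
print(abs(z["freshman"] - z["senior"]))  # 1.0
print(abs(z["junior"] - z["senior"]))    # 0.333...
```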
Attributes of Mixed Type
• A dataset may contain all attribute types
  • Nominal, symmetric binary, asymmetric binary, numeric, and ordinal
• One may use a weighted formula to combine their effects (see the sketch after this list):
$$d(i, j) = \frac{\sum_{f=1}^{p} w_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} w_{ij}^{(f)}}$$
• If f is numeric: Use the normalized distance
• If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf; d_ij^(f) = 1 otherwise
• If f is ordinal
  • Compute ranks r_if and z_if = (r_if − 1)/(M_f − 1)
  • Treat z_if as interval-scaled
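Below is a minimal sketch of the weighted mixed-type formula under simplifying assumptions: all indicator weights w_ij^(f) are 1, numeric attributes are pre-normalized to [0, 1], and the example record is hypothetical:

```python
def mixed_distance(x, y, types, M=None):
    """Weighted mixed-type dissimilarity (a minimal sketch).

    types[f] is one of "numeric", "nominal", "ordinal". Numeric values are
    assumed pre-normalized to [0, 1]; ordinal values are ranks in
    {1, ..., M[f]}. All weights w_ij^(f) are taken to be 1 here."""
    num = den = 0.0
    for f, t in enumerate(types):
        w = 1.0  # set to 0.0 to skip a missing or non-applicable attribute
        if t == "numeric":
            d = abs(x[f] - y[f])                 # normalized distance
        elif t == "nominal":
            d = 0.0 if x[f] == y[f] else 1.0     # simple matching
        else:                                    # ordinal
            zx = (x[f] - 1) / (M[f] - 1)         # map rank onto [0, 1]
            zy = (y[f] - 1) / (M[f] - 1)
            d = abs(zx - zy)                     # then treat as interval-scaled
        num += w * d
        den += w
    return num / den

# Hypothetical record: (normalized price, color, class rank out of 4)
print(mixed_distance((0.2, "red", 1), (0.5, "blue", 4),
                     ("numeric", "nominal", "ordinal"), M={2: 4}))  # 0.766...
```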
Session 5: Proximity Measure between Two Vectors: Cosine Similarity
Cosine Similarity of Two Vectors
• A document can be represented by a bag of terms or a long vector, with each attribute recording the frequency of a particular term (such as a word, keyword, or phrase) in the document
• Other vector objects: Gene features in micro-arrays
• Applications: Information retrieval, biological taxonomy, gene feature mapping, etc.
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$
where · indicates the vector dot product and ||d|| is the length of vector d
Example: Calculating Cosine Similarity
• Cosine similarity:
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$
where · indicates the vector dot product and ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)   d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
• First, calculate the vector dot product
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
• Then, calculate ||d1|| and ||d2||
||d1|| = √(5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0) = √42 ≈ 6.481
||d2|| = √(3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1) = √17 ≈ 4.123
• Finally, calculate the cosine similarity: cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
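The same computation in NumPy (a minimal sketch):

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 2))  # 0.94
```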
Session 6: Correlation Measures between Two Variables: Covariance and Correlation Coefficient
Variance for Single Variable
• The variance of a random variable X provides a measure of how much the value of X deviates from the mean or expected value of X:
$$\sigma^2 = \mathrm{var}(X) = E[(X - \mu)^2] = \begin{cases} \sum_x (x - \mu)^2 f(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$
where σ² is the variance of X, σ is called the standard deviation, and µ = E[X] is the mean (expected value) of X
• That is, variance is the expected value of the squared deviation from the mean
• It can also be written as:
$$\sigma^2 = \mathrm{var}(X) = E[(X - \mu)^2] = E[X^2] - \mu^2 = E[X^2] - (E[X])^2$$
• Sample variance is the average squared deviation of the data values x_i from the sample mean:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$
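A quick numerical check of the sample variance formula, using the stock X1 prices from the covariance example later in this session (a minimal sketch):

```python
import numpy as np

x = np.array([2, 3, 5, 4, 6], dtype=float)
mu = x.mean()
var = ((x - mu) ** 2).mean()   # (1/n) * sum of squared deviations
print(mu, var)                 # 4.0 2.0
print(np.var(x))               # 2.0 (np.var also divides by n by default)
```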
Covariance for Two Variables
• Covariance between two variables X1 and X2:
$$\sigma_{12} = E[(X_1 - \mu_1)(X_2 - \mu_2)] = E[X_1 X_2] - \mu_1 \mu_2 = E[X_1 X_2] - E[X_1]E[X_2]$$
where µ1 = E[X1] is the respective mean or expected value of X1; similarly for µ2
• Sample covariance between X1 and X2:
$$\hat{\sigma}_{12} = \frac{1}{n} \sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i2} - \hat{\mu}_2)$$
• Sample covariance is a generalization of the sample variance:
$$\hat{\sigma}_{11} = \frac{1}{n} \sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i1} - \hat{\mu}_1) = \frac{1}{n} \sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)^2 = \hat{\sigma}_1^2$$
• Positive covariance: σ12 > 0
• Negative covariance: σ12 < 0
• Independence: If X1 and X2 are independent, then σ12 = 0, but the reverse is not true
  • Some pairs of random variables may have covariance 0 yet not be independent
  • Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
Example: Calculation of Covariance
• Suppose two stocks X1 and X2 have the following prices over one week:
  • (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
• The computation can be simplified using:
$$\sigma_{12} = E[X_1 X_2] - E[X_1]E[X_2]$$
• E(X1) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
• E(X2) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• σ12 = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
• Thus, X1 and X2 rise together since σ12 > 0
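The same computation in NumPy (a minimal sketch; `round` is used only to hide floating-point noise):

```python
import numpy as np

x1 = np.array([2, 3, 5, 4, 6], dtype=float)     # stock X1 prices
x2 = np.array([5, 8, 10, 11, 14], dtype=float)  # stock X2 prices

cov = (x1 * x2).mean() - x1.mean() * x2.mean()  # E[X1 X2] - E[X1]E[X2]
print(round(cov, 2))  # 4.0 -> positive, so the two stocks tend to rise together
```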
Correlation between Two Numerical Variables
• Correlation between two variables X1 and X2 is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable:
$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2} = \frac{\sigma_{12}}{\sqrt{\sigma_1^2 \sigma_2^2}}$$
• Sample correlation for two attributes X1 and X2:
$$\hat{\rho}_{12} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_1 \hat{\sigma}_2} = \frac{\sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i2} - \hat{\mu}_2)}{\sqrt{\sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)^2} \sqrt{\sum_{i=1}^{n} (x_{i2} - \hat{\mu}_2)^2}}$$
where n is the number of tuples, µ1 and µ2 are the respective means of X1 and X2, and σ1 and σ2 are the respective standard deviations of X1 and X2
• If ρ12 > 0: X1 and X2 are positively correlated (X1's values increase as X2's do)
  • The higher the value, the stronger the correlation
• If ρ12 = 0: independent (under the same assumption as discussed for covariance)
• If ρ12 < 0: negatively correlated
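Verifying with NumPy's built-in sample correlation, again on the stock example (a minimal sketch):

```python
import numpy as np

x1 = np.array([2, 3, 5, 4, 6], dtype=float)
x2 = np.array([5, 8, 10, 11, 14], dtype=float)

rho = np.corrcoef(x1, x2)[0, 1]  # off-diagonal entry of the correlation matrix
print(round(rho, 3))  # 0.941 -> strong positive correlation
```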
Visualizing Changes of Correlation Coefficient
• Correlation coefficient value range: [−1, 1]
• A set of scatter plots (figure omitted) shows sets of points and their correlation coefficients changing from −1 to 1
Covariance Matrix
• The variance and covariance information for the two variables X1 and X2 can be summarized as a 2 × 2 covariance matrix:
$$\Sigma = E[(X - \mu)(X - \mu)^T] = E\left[ \begin{pmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{pmatrix} \begin{pmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{pmatrix} \right]$$
$$= \begin{pmatrix} E[(X_1 - \mu_1)(X_1 - \mu_1)] & E[(X_1 - \mu_1)(X_2 - \mu_2)] \\ E[(X_2 - \mu_2)(X_1 - \mu_1)] & E[(X_2 - \mu_2)(X_2 - \mu_2)] \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$
• In general, for d numerical attributes X1, X2, ..., Xd with data matrix D (n points, d dimensions), we have
$$\Sigma = E[(X - \mu)(X - \mu)^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$$
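For the stock example, the full covariance matrix can be computed with np.cov (a minimal sketch; bias=True selects the 1/n convention used in these slides):

```python
import numpy as np

# Rows are observations (the stock example), columns are X1 and X2
D = np.array([[2, 5], [3, 8], [5, 10], [4, 11], [6, 14]], dtype=float)

Sigma = np.cov(D, rowvar=False, bias=True)  # bias=True -> divide by n
print(Sigma)
# [[2.   4.  ]
#  [4.   9.04]]  variances on the diagonal, covariance off the diagonal
```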
Session 7: Summary
Summary: Similarity Measures for Cluster Analysis
• Basic Concept: Measuring Similarity between Objects
• Distance on Numeric Data: Minkowski Distance
• Proximity Measure for Symmetric vs. Asymmetric Binary Variables
• Distance between Categorical Attributes, Ordinal Attributes, and Mixed Types
• Proximity Measure between Two Vectors: Cosine Similarity
• Correlation Measures between Two Variables: Covariance and Correlation Coefficient
• Summary
Recommended Readings
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990
• Mohammed J. Zaki and Wagner Meira, Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014
• Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
• Charu Aggarwal and Chandan K. Reddy (eds.). Data Clustering: Algorithms and Applications. CRC Press, 2014
