
1604C331 Data Mining

Week 3:
Similarity &
Distance Measures

Odd Semester 2024-2025


Informatics Engineering
Faculty of Engineering | Universitas Surabaya
Correlation & Covariance,
Covariance Matrix

Correlation Analysis
(for Categorical Data)

• χ² (chi-square) test:

  \chi^2 = \sum_{\text{cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}

• Null hypothesis: the two distributions are independent.

• The cells that contribute the most to the χ² value are those whose
  actual count is very different from the expected count.
– The larger the 𝛘2 value, the more likely the variables are related.
• Note: Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
Chi-Square Calculation (1)

                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (X1)       200 (X2)          450
Not like science fiction     50 (X3)      1000 (X4)         1050
Sum (col)                       300           1200          1500

• Null hypothesis: the two distributions are independent.

  – What does that mean?
    The ratio of people who play chess vs. do not play chess
    IS THE SAME for both groups:
    those who like science fiction and those who do not.

  – X1:X2 = X3:X4 = 300:1200
  – X1:X3 = X2:X4 = 450:1050
Chi-Square Calculation (2)

                            Play chess    Not play chess   Sum (row)
Like science fiction        250 |  90      200 | 360          450
Not like science fiction     50 | 210     1000 | 840         1050
Sum (col)                        300           1200          1500

How to derive the expected count 90?  450/1500 * 300 = 90

• χ² (chi-square) calculation (the second number in each cell is the
  expected count, calculated from the marginal distributions of the two
  categories):

  \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

• We can reject the null hypothesis of independence at the 0.001
  significance level.
• It shows that like science fiction and play chess are correlated in
  the group.
• Degrees of freedom = ?  (df = (number of rows − 1) × (number of columns − 1))
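As a quick check, the same χ² statistic, degrees of freedom, and expected counts can be obtained in Python. This is a minimal sketch using SciPy; correction=False disables Yates' continuity correction so the statistic matches the hand calculation above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = like / not like science fiction,
# columns = play chess / not play chess
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(chi2)      # ~507.93
print(dof)       # 1
print(expected)  # [[ 90. 360.] [210. 840.]]
print(p_value)   # far below 0.001 -> reject independence
```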
Correlation Coefficient
(for Numeric Data)

• Correlation between two numeric attributes can be computed with the
  correlation coefficient (Pearson's product moment coefficient), which
  measures the strength and direction of a linear relationship between
  two attributes.
• The range is from -1 to +1.

  r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{\left[\,n \sum x^2 - (\sum x)^2\right]\left[\,n \sum y^2 - (\sum y)^2\right]}}

where n is the number of data pairs.


Visually Evaluating Correlation
Scatter plots show correlation coefficient values from -1 to +1.

• r = 1  (perfect positive correlation): when one variable changes, the other
  variable changes in the same direction.
• r = 0  (zero correlation): there is no relationship between the variables.
• r = -1 (perfect negative correlation): when one variable changes, the other
  variable changes in the opposite direction.
Example: Cars

  r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{\left[\,n \sum x^2 - (\sum x)^2\right]\left[\,n \sum y^2 - (\sum y)^2\right]}}

Company   Cars (in ten thousands)   Revenue ($ billions)
A                63.0                      7.0
B                29.1                      3.9
C                20.8                      2.1
D                19.1                      2.8
E                13.4                      1.4
F                 8.5                      1.5

Compute the correlation coefficient.
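A minimal Python sketch of this exercise using the raw-score formula above (the ~0.98 result is our own computation, not given on the slide):

```python
import numpy as np

# Cars sold (in ten thousands) and revenue ($ billions) per company
cars = np.array([63.0, 29.1, 20.8, 19.1, 13.4, 8.5])
revenue = np.array([7.0, 3.9, 2.1, 2.8, 1.4, 1.5])

n = len(cars)
r = (n * np.sum(cars * revenue) - cars.sum() * revenue.sum()) / np.sqrt(
    (n * np.sum(cars**2) - cars.sum()**2) *
    (n * np.sum(revenue**2) - revenue.sum()**2)
)
print(r)                                  # ~0.98: strong positive correlation
print(np.corrcoef(cars, revenue)[0, 1])   # same value from NumPy's built-in
```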


Covariance of Numeric Data
• For assessing how much two attributes change together.
• Covariance between two attributes A and B:

  Cov(A, B) = E[(A - \bar{A})(B - \bar{B})]
            = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})
            = E[A \cdot B] - \bar{A}\,\bar{B}

  where \bar{A} = E[A] and \bar{B} = E[B] are the respective means (expected values) of A and B.

• Positive covariance: Cov(A, B) > 0
• Negative covariance: Cov(A, B) < 0
Example: Stock Prices

  Cov(A, B) = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})

Stock prices ($) for Company A and Company B:

Time Point   Company A   Company B
t1               6           20
t2               5           10
t3               4           14
t4               3            5
t5               2            5

If the stocks are affected by the same industry trends, will their
prices rise or fall together? Compute the covariance between
Company A and Company B.
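A hedged sketch of the computation in Python; the value 7.0 is our own calculation using the 1/n definition above (NumPy's default divides by n−1, so ddof=0 is passed explicitly):

```python
import numpy as np

a = np.array([6, 5, 4, 3, 2])      # Company A prices at t1..t5
b = np.array([20, 10, 14, 5, 5])   # Company B prices at t1..t5

# Covariance with the 1/n definition used on the slide (ddof=0)
cov_ab = np.cov(a, b, ddof=0)[0, 1]
print(cov_ab)   # 7.0 > 0, so the prices tend to rise and fall together
```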
Variance

  \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2

• Variance represents the variation of values in a single variable.
  – Variance explains how the values vary in a variable.
• The mean of the variable is subtracted from each value.
  – After the differences are squared, the sum is divided by the number of values
    (n) in that variable.
• What happens when the variance is low or high?
Covariance

  Cov(A, B) = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})

• Covariance is calculated between two different variables.
  – The purpose is to find a value that indicates how these two variables vary
    together.
  – The differences of both variables from their means are multiplied together.
  – Covariance uses the values and the means of the two variables.
Covariance Matrix
• Covariance can only be calculated between 2 variables.
• A covariance matrix represents the covariance values of each pair of
  variables in multivariate data.
  – Covariance between the same variable equals the variance.
  – The diagonal shows the variance of each variable.
  – The values show the distribution magnitude and direction of multivariate data in
    a multidimensional space and
  – can allow you to gather information about how the data spreads among
    the dimensions.

  For two variables (x, y) and three variables (x, y, z):

  \begin{pmatrix} var(x) & cov(x,y) \\ cov(x,y) & var(y) \end{pmatrix}
  \qquad
  \begin{pmatrix} var(x) & cov(x,y) & cov(x,z) \\ cov(x,y) & var(y) & cov(y,z) \\ cov(x,z) & cov(y,z) & var(z) \end{pmatrix}
Example: Covariance Matrix
• A square matrix with:
– diagonal elements represent the variance and
– non-diagonal components express covariance.
• Example: X = [10, 5], Y = [3, 9]
– Variance of Set X = 6.5
– Variance of Set Y = 9
– Covariance = -15
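A sketch of how such a matrix can be computed with NumPy. Note that the divisor is a convention choice: NumPy divides by n−1 by default (ddof=1) and by n with ddof=0, so the printed values may differ slightly from the figures quoted above.

```python
import numpy as np

x = np.array([10, 5])
y = np.array([3, 9])

# Result layout: [[var(x), cov(x, y)], [cov(x, y), var(y)]]
print(np.cov(x, y, ddof=1))  # sample covariance matrix (divide by n-1)
print(np.cov(x, y, ddof=0))  # population covariance matrix (divide by n)
```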
Covariance vs Variance
Proximity Measures

Similarity & Dissimilarity
• Similarity
– Numerical measure of how alike two objects are:
• The higher the value, the more alike
– Often falls in the range [0,1]

• Dissimilarity
– Numerical measure of how different two data objects are:
• The lower, the more alike
– Minimum dissimilarity is often 0
– Range [0,1] or [0,∞], depending on the definition

• Proximity usually refers to either similarity or dissimilarity.


Similarity & Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.


Data Matrix and Proximity Matrix

• Data matrix
  – A data matrix of n data points with l dimensions:

    D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1l} \\ x_{21} & x_{22} & \cdots & x_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nl} \end{pmatrix}

• Proximity matrix
  – In the form of a dissimilarity matrix or a similarity matrix.
  – Holds the same n data points, but registers only the distance d(i,j) or
    similarity s(i,j), typically a metric.
  – Usually symmetric, thus stored as a triangular matrix:

    \begin{pmatrix} 0 & & & \\ d(2,1) & 0 & & \\ \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & 0 \end{pmatrix}
Data Matrix and Dissimilarity Matrix
Example

Data Matrix

point   attribute1   attribute2
x1          1            2
x2          3            5
x3          2            0
x4          4            5

Dissimilarity Matrix (by Euclidean Distance)

       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0
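A minimal sketch reproducing this dissimilarity matrix with SciPy (pdist/squareform are standard SciPy helpers; the rounding to two decimals is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: 4 points x 2 attributes
X = np.array([[1, 2],
              [3, 5],
              [2, 0],
              [4, 5]])

# Pairwise Euclidean distances, arranged as a full symmetric matrix
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 2))   # matches the dissimilarity matrix above
```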
Proximity Matrix (Dissimilarity)
Example

Create the proximity matrix for only the attribute Test-1 (nominal), using
the Jack/Mary/Jim table shown later in the asymmetric binary example.

d(i, j) evaluates to 0 if objects i and j match, and to 1 if the objects differ.
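A small sketch of this matching rule in Python, using the Test-1 values from the Jack/Mary/Jim table later in this deck (the dictionary and variable names are ours):

```python
import numpy as np

# Test-1 (nominal) values for each object, taken from the patient table below
test1 = {'Jack': 'P', 'Mary': 'P', 'Jim': 'N'}

names = list(test1)
# d(i, j) = 0 if the nominal values match, 1 otherwise
D = np.array([[0 if test1[a] == test1[b] else 1 for b in names] for a in names])
print(names)
print(D)   # Jack-Mary: 0 (both P), Jack-Jim: 1, Mary-Jim: 1
```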
Distance on Numeric Data:
Minkowski Distance
• Minkowski distance

  d(i, j) = \sqrt[p]{\,|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{il} - x_{jl}|^p\,}

  where i = (x_{i1}, x_{i2}, …, x_{il}) and j = (x_{j1}, x_{j2}, …, x_{jl}) are two l-dimensional data objects, and p is the
  order (the distance so defined is also called the L-p norm).

• Properties:
  – POSITIVITY: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
  – SYMMETRY: d(i, j) = d(j, i)
  – TRIANGLE INEQUALITY: d(i, j) ≤ d(i, k) + d(k, j)
• A distance that satisfies these properties is a metric.
Special cases of Minkowski distance
• Manhattan (or City Block) distance (L1 norm), p = 1
  – e.g., Hamming distance: the number of bits that are different between
    two binary vectors.

  d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{il} - x_{jl}|

• Euclidean distance (L2 norm), p = 2

  d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{il} - x_{jl}|^2}

• Supremum distance (Lmax norm, L∞ norm), p → ∞
  – The maximum difference between any component (attribute) of the vectors.
Euclidean vs Manhattan vs Supremum
Example: Minkowski Distance

point   attribute 1   attribute 2
x1          1             2
x2          3             5
x3          2             0
x4          4             5

Manhattan (L1)
L1     x1    x2    x3    x4
x1     0
x2     5     0
x3     3     6     0
x4     6     1     7     0

Euclidean (L2)
L2     x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0

Supremum (L∞)
L∞     x1    x2    x3    x4
x1     0
x2     3     0
x3     2     5     0
x4     3     1     5     0
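The three special cases can be checked with SciPy's cdist; this is a sketch in which 'cityblock' and 'chebyshev' are SciPy's names for the L1 and L∞ metrics:

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])  # x1..x4

for name, metric in [('Manhattan (L1)', 'cityblock'),
                     ('Euclidean (L2)', 'euclidean'),
                     ('Supremum (Linf)', 'chebyshev')]:
    print(name)
    print(np.round(cdist(X, X, metric=metric), 2))
```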
Proximity Measure for Binary Attributes

A contingency table for binary data (counts of value combinations for objects i and j):

                      Object j
                      1      0
Object i     1        q      r
             0        s      t

• Distance measure for symmetric binary variables:
  d(i, j) = (r + s) / (q + r + s + t)

• Distance measure for asymmetric binary variables (0-0 matches are ignored):
  d(i, j) = (r + s) / (q + r + s)

• Jaccard coefficient (similarity measure for asymmetric binary variables):
  sim(i, j) = q / (q + r + s)


Example: Dissimilarity between Asymmetric Binary Variables

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

• Gender is a symmetric attribute (not counted in).
• The remaining attributes are asymmetric binary.
• Let the values Y and P be 1, and the value N be 0.

Contingency tables (rows: first object, columns: second object):

Jack vs Mary:
           Mary=1   Mary=0   ∑row
Jack=1       2        0        2
Jack=0       1        3        4
∑col         3        3        6

Jack vs Jim:
           Jim=1    Jim=0    ∑row
Jack=1       1        1        2
Jack=0       1        3        4
∑col         2        4        6

Jim vs Mary:
           Mary=1   Mary=0   ∑row
Jim=1        1        1        2
Jim=0        2        2        4
∑col         3        3        6

• Distance:
  d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
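A sketch of the same computation in Python (the helper function and the 0/1 encoding are ours):

```python
import numpy as np

# Asymmetric binary encoding: Y/P -> 1, N -> 0 (Gender is excluded)
jack = np.array([1, 0, 1, 0, 0, 0])
mary = np.array([1, 0, 1, 0, 1, 0])
jim  = np.array([1, 1, 0, 0, 0, 0])

def asym_binary_dissim(x, y):
    """d(x, y) = (r + s) / (q + r + s), ignoring 0-0 matches."""
    q = np.sum((x == 1) & (y == 1))
    r = np.sum((x == 1) & (y == 0))
    s = np.sum((x == 0) & (y == 1))
    return (r + s) / (q + r + s)

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```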
Mahalanobis distance
Mahalanobis distance is a metric used to find the distance between a
point and a distribution.

It is most commonly used on multivariate data.

It calculates the distance between a point and a distribution by
considering how many standard deviations the point lies away from the
mean of the distribution, making it useful for detecting outliers.
Mahalanobis distance Calculation

  mahalanobis(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})\, \Sigma^{-1} (\mathbf{x} - \mathbf{y})^{T}}

  where Σ is the covariance matrix of the input data X.

  \Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}

  Points: A = (0.5, 0.5),  B = (0, 1),  C = (1.5, 1.5)

  Mahal(A, B) = \sqrt{5} = 2.236
  Mahal(A, C) = \sqrt{4} = 2
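A sketch verifying these numbers with SciPy; scipy.spatial.distance.mahalanobis expects the inverse covariance matrix as its VI argument:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
VI = np.linalg.inv(sigma)      # inverse covariance matrix

A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

print(mahalanobis(A, B, VI))   # ~2.236  (= sqrt(5))
print(mahalanobis(A, C, VI))   # 2.0     (= sqrt(4))
```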
Common Property of a Similarity
• Similarities also have some well-known properties:

– MAXIMUM SIMILARITY
s(p,q) = 1, only if p = q

– SYMMETRY
s(p,q) = s(q,p) for all p and q

where s(p,q) is the similarity between points (data objects), p and q.


SMC and Jaccard
• A common situation is that objects x and y have only binary attributes
• Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
Example: SMC and Jaccard
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
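The counts and both coefficients can be reproduced with a few lines of NumPy (a sketch; the variable names are ours):

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))   # 0
f10 = np.sum((x == 1) & (y == 0))   # 1
f01 = np.sum((x == 0) & (y == 1))   # 2
f00 = np.sum((x == 0) & (y == 0))   # 7

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)
```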


Cosine Similarity
A document can be represented by a bag of terms or a long vector, with each attribute recording the
frequency of a particular term (such as word, keyword, or phrase) in the document.

Other vector objects: gene features in micro-arrays.

Applications: information retrieval, biological taxonomy, gene feature mapping, …


If d1 and d2 are two vectors (e.g., term-frequency vectors), then:

  \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}

  where · indicates the vector dot product and ||d_i|| is the length (Euclidean norm) of vector d_i.
Example: Cosine Similarity

Find the similarity between documents d1 and d2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = sqrt(5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0) = sqrt(42) = 6.481
||d2|| = sqrt(3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1) = sqrt(17) = 4.12

cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
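A minimal NumPy sketch of the same calculation:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos_sim = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 2))   # 0.94
```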


Capture Hidden Semantics
• Cosine similarity measure cannot capture hidden semantics.
– Which pairs are more similar: geometry, algebra, music, politics?
• The same bags of words may express rather different meanings
– ”The cat bites a mouse” vs “The mouse bites a cat”
– This is beyond what a vector space model can handle.
• Moreover, objects can be composed of rather complex structures
and connections (e.g., graphs and networks).
• New similarity measures are needed to handle complex semantics
  – e.g., distributed representations and representation learning
Exercises

Homework
• Do all the exercises.
• You can write the solution on paper, or you can use tools like Excel
  or Python; explain your work step by step until you reach the
  solution.
• Create a PDF file for your solution and submit it to ULS.
• You can upload one more file that you used to do the computation
  (.xlsx or .ipynb) along with your .pdf file. Upload those files
  separately.
• Note: do not forget to put your Student ID and name on the first page
  of the PDF file.
Exercise 1: Students
Student Number of absences Final Grade
Adi 6 82
Budi 2 86
Kaka 15 43
Denny 9 74
Ethan 12 58
Fanny 5 90
Gary 8 78

Compute the correlation coefficient.


Exercise 2: Covariance matrix

ID A B
1 92 80
2 60 30
3 100 70

Compute the covariance matrix.


Exercise 3: Age and Body Fat

a. Calculate the correlation coefficient (Pearson's product moment coefficient).
   Are these two attributes positively or negatively correlated?
b. Compute their covariance. What can you conclude from it?
c. Create the proximity matrix using L-p norms (p = 1, 2, and ∞) and the
   Mahalanobis distance.
Question?
