Week 3 - Similarity Distance Measures
Informatics Engineering | Universitas Surabaya
Correlation Analysis (for Categorical Data)
• χ² (chi-square) test:
Correlation coefficient:
$$ r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}} $$
Correlation coefficient value   Correlation type               Description
 1                              Perfect positive correlation   When one variable changes, the other variable changes in the same direction.
 0                              Zero correlation               There is no relationship between the variables.
-1                              Perfect negative correlation   When one variable changes, the other variable changes in the opposite direction.
Example: Cars
$$ r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}} $$
Company   Cars (in ten thousands)   Revenue ($ billions)
A         63.0                      7.0
B         29.1                      3.9
C         20.8                      2.1
D         19.1                      2.8
E         13.4                      1.4
F         8.5                       1.5
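The correlation coefficient for the Cars example can be computed directly from the sums in the formula. The following is a minimal sketch in plain Python; the function name `pearson_r` and the variable names are illustrative assumptions, not part of the slides.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient using the computational (sum-based) formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data from the Cars table (companies A to F).
cars = [63.0, 29.1, 20.8, 19.1, 13.4, 8.5]   # cars, in ten thousands
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]     # revenue, in $ billions

r = pearson_r(cars, revenue)
print(round(r, 3))  # 0.982
```

A value this close to 1 indicates a strong positive linear relationship between the number of cars and revenue.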
$$ \mathrm{Cov}(A, B) = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B}) = E(A \cdot B) - \bar{A}\bar{B} $$
Stock Prices ($) for Company A and Company B
Time   Company A   Company B
t1     6           20
t2     5           10
t3     4           14
t4     3           5
t5     2           5
If the stocks are affected by the same industry trends, will their
prices rise or fall together? Compute the covariance between
Company A and Company B.
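One way to answer the question is to compute the covariance both from the definition and from the shortcut E(A·B) − ĀB̄ and check that they agree. This is a sketch only; the function names `covariance` and `covariance_shortcut` are assumptions made for illustration.

```python
def covariance(a, b):
    """Population covariance: mean of the products of deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n

def covariance_shortcut(a, b):
    """Same quantity via E(A.B) minus the product of the means."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    mean_ab = sum(ai * bi for ai, bi in zip(a, b)) / n
    return mean_ab - mean_a * mean_b

# Stock prices at t1..t5 from the table above.
company_a = [6, 5, 4, 3, 2]
company_b = [20, 10, 14, 5, 5]

cov = covariance(company_a, company_b)
cov_short = covariance_shortcut(company_a, company_b)
print(round(cov, 2), round(cov_short, 2))  # 7.0 7.0
```

The covariance is positive, so on this data the two stock prices tend to rise and fall together.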
Variance
$$ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 $$
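As a small illustration of the variance formula, the sketch below applies it to the Company A stock prices from the covariance example and compares the result with `statistics.pvariance` from the Python standard library. The function name `population_variance` is an assumption made for this example.

```python
import statistics

def population_variance(xs):
    """Mean squared deviation from the mean (divides by n, not n - 1)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

company_a = [6, 5, 4, 3, 2]  # Company A prices at t1..t5

var_a = population_variance(company_a)
print(var_a)  # 2.0

# The standard library computes the same population variance.
print(statistics.pvariance(company_a) == var_a)  # True
```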
Similarity & Dissimilarity
• Similarity
– Numerical measure of how alike two objects are:
• The higher the value, the more alike
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data objects are:
• The lower the value, the more alike
– Minimum dissimilarity is often 0
– Range [0, 1] or [0, ∞), depending on the definition
• Properties:
– POSITIVITY: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
– SYMMETRY: d(i, j) = d(j, i)
– TRIANGLE INEQUALITY: d(i, j) ≤ d(i, k) + d(k, j)
• A distance that satisfies these properties is a metric.
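The three properties can be checked numerically on a small sample of points. The sketch below uses hypothetical 2-D points and helper names (`is_metric_on`, `manhattan`, `squared_euclidean`) that do not come from the slides. Passing the check on a sample does not prove that a function is a metric, but a single failure is enough to show that it is not.

```python
from itertools import product

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def is_metric_on(points, d):
    """Check positivity, symmetry and the triangle inequality on a sample."""
    for i, j in product(points, repeat=2):
        if i == j and d(i, j) != 0:
            return False          # d(i, i) must be 0
        if i != j and d(i, j) <= 0:
            return False          # positivity
        if d(i, j) != d(j, i):
            return False          # symmetry
    for i, j, k in product(points, repeat=3):
        if d(i, j) > d(i, k) + d(k, j):
            return False          # triangle inequality
    return True

# Hypothetical sample points (integers keep the arithmetic exact).
points = [(0, 0), (1, 0), (2, 0), (3, 4)]

print(is_metric_on(points, manhattan))          # True
print(is_metric_on(points, squared_euclidean))  # False
```

Squared Euclidean distance fails because going from (0, 0) to (2, 0) costs 4, while going through (1, 0) costs only 1 + 1 = 2.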
Special case of Minkowski distance
• Manhattan (or City Block) distance (L1 norm), p = 1
– e.g., Hamming distance: the number of bits that are different between
two binary vectors.
$$ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{il} - x_{jl}| $$
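The relationship between Minkowski, Manhattan and Hamming distance can be sketched in a few lines. The vectors `u` and `v` are hypothetical binary vectors chosen for illustration, and the function names are assumptions.

```python
def minkowski(x, y, p):
    """General Minkowski distance of order p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def manhattan(x, y):
    """Minkowski distance with p = 1 (L1 norm)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which the two vectors differ."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical binary vectors; they differ at positions 0, 3 and 4.
u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 1]

print(manhattan(u, v))     # 3
print(hamming(u, v))       # 3
print(minkowski(u, v, 1))  # 3.0
```

For binary vectors every differing position contributes exactly 1 to the sum, which is why the Manhattan and Hamming distances coincide.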
Euclidean (L2)
L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Supremum (L∞)
L∞ x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0
x4 3 1 5 0
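The coordinates of x1 to x4 are not listed in this section. The sketch below therefore uses assumed coordinates, chosen only because they reproduce both matrices above; treat them as an illustration rather than the original data. The helper name `lower_triangle` is also an assumption.

```python
import math

# ASSUMED coordinates: not given in the slides, picked to match the matrices.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def euclidean(p, q):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def supremum(p, q):
    """L-infinity distance: the largest absolute difference in any attribute."""
    return max(abs(a - b) for a, b in zip(p, q))

def lower_triangle(points, d):
    """Lower-triangular distance matrix, rounded to two decimals."""
    names = list(points)
    return [
        [round(d(points[a], points[b]), 2) for b in names[: i + 1]]
        for i, a in enumerate(names)
    ]

for name, d in (("Euclidean (L2)", euclidean), ("Supremum (L-inf)", supremum)):
    print(name)
    for row in lower_triangle(points, d):
        print(row)
```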
Proximity Measure for Binary Attributes
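A contingency table for binary data (rows: Object i, columns: Object j)
• Distance:
$$ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$
$$ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$
$$ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
Contingency table: Jack (rows) vs. Jim (columns)
        1    0    ∑row
1       1    1    2
0       1    3    4
∑col    2    4    6
Contingency table: Jim (rows) vs. Mary (columns)
        1    0    ∑row
1       1    1    2
0       2    2    4
∑col    3    3    6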
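The three distances above follow one pattern: the number of mismatching attributes divided by the number of attributes where at least one object has a 1. The sketch below assumes the usual naming for the contingency counts, which is not spelled out in this section: `q` for 1/1 matches, `r` for 1/0 mismatches and `s` for 0/1 mismatches, with 0/0 matches ignored. The function name is also an assumption.

```python
def asymmetric_binary_distance(q, r, s):
    """Mismatches (r + s) over all attributes that are 1 in at least one object.

    q: both objects are 1; r: first is 1, second is 0; s: first is 0, second is 1.
    Attributes where both objects are 0 are ignored.
    """
    return (r + s) / (q + r + s)

# Counts read off the worked fractions above.
d_jack_mary = asymmetric_binary_distance(q=2, r=0, s=1)
d_jack_jim = asymmetric_binary_distance(q=1, r=1, s=1)
d_jim_mary = asymmetric_binary_distance(q=1, r=1, s=2)

print(round(d_jack_mary, 2))  # 0.33
print(round(d_jack_jim, 2))   # 0.67
print(round(d_jim_mary, 2))   # 0.75
```

By this measure Jack and Mary are the most similar pair, and Jim and Mary the least similar.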
Mahalanobis distance
Mahalanobis distance is a metric used to find the distance between a
point and a distribution.
$$ \mathrm{Mahal}(A, B) = \sqrt{5} = 2.236 $$
$$ \mathrm{Mahal}(A, C) = \sqrt{4} = 2 $$
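The coordinates of A, B and C and the covariance matrix of the distribution are not given in this section. The sketch below uses assumed values, chosen only because they reproduce the two results above, to show how the distance is computed from the inverse covariance matrix. The function name `mahalanobis_2d` is hypothetical, and the code handles the 2-D case only.

```python
import math

def mahalanobis_2d(p, q, cov):
    """sqrt((p - q)^T * inverse(cov) * (p - q)) for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    squared = (dx * (inv[0][0] * dx + inv[0][1] * dy)
               + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(squared)

# ASSUMED values: not from the slides, picked to match the results above.
cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

mahal_ab = mahalanobis_2d(A, B, cov)
mahal_ac = mahalanobis_2d(A, C, cov)
print(round(mahal_ab, 3))  # 2.236
print(round(mahal_ac, 3))  # 2.0
```

With these assumed values, C is farther from A than B is in Euclidean terms, yet closer in Mahalanobis terms, because C lies along the direction in which the data varies most.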
Common Properties of a Similarity
• Similarities also have some well-known properties:
– MAXIMUM SIMILARITY
s(p,q) = 1, only if p = q
– SYMMETRY
s(p,q) = s(q,p) for all p and q
$$ \cos(d_1, d_2) = \frac{d_1 \bullet d_2}{\|d_1\| \, \|d_2\|} $$
where • indicates vector dot product and || di || is the length of vector di
Example: Cosine Similarity
$$ \cos(d_1, d_2) = \frac{d_1 \bullet d_2}{\|d_1\| \, \|d_2\|} $$
Find the similarity between documents d1 and d2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
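$$ d_1 \bullet d_2 = 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25 $$
$$ \|d_1\| = \sqrt{5 \times 5 + 0 \times 0 + 3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0} = 6.481 $$
$$ \|d_2\| = \sqrt{3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 1 \times 1 + 1 \times 1 + 0 \times 0 + 1 \times 1 + 0 \times 0 + 1 \times 1} = 4.12 $$
$$ \cos(d_1, d_2) = \frac{25}{6.481 \times 4.12} \approx 0.94 $$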
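The same computation can be reproduced in a few lines of plain Python. This is a minimal sketch; the function name `cosine_similarity` is an assumption made for illustration.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Document vectors from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```

A value close to 1 means the two documents point in nearly the same direction in term space, regardless of their lengths.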
Homework
• Do all the exercises.
• You can write the solution on paper, or use a tool such as Excel or
Python, and explain your work step by step until you reach the
solution.
• Create a PDF file for your solution and submit it to ULS.
• You may also upload one additional file that you used for the
computation (.xlsx or .ipynb) along with your .pdf file. Upload the
files separately.
• Note: do not forget to put your Student ID and name on the first page
of the PDF file.
Exercise 1: Students
Student Number of absences Final Grade
Adi 6 82
Budi 2 86
Kaka 15 43
Denny 9 74
Ethan 12 58
Fanny 5 90
Gary 8 78
ID A B
1 92 80
2 60 30
3 100 70