Lecture 4
3rd Year
Spring 2025
Proximity: Similarity and Dissimilarity
Data Matrix and Dissimilarity Matrix
◼ Data matrix
◼ Composed of n × p entries (n data points or objects × p attributes or dimensions):

    x11 ... x1f ... x1p
    ... ... ... ... ...
    xi1 ... xif ... xip
    ... ... ... ... ...
    xn1 ... xnf ... xnp

◼ Dissimilarity matrix
◼ Composed of n data points, but registers in the matrix only the distance d(i, j) between each pair of objects:

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    ...     ...     ...
    d(n,1)  d(n,2)  ...  0

◼ A triangular matrix — why? Because d(i, j) = d(j, i) and d(i, i) = 0, only the entries below the diagonal need to be stored.
Proximity Measure for Nominal Attributes
◼ d(i, j) = (p − m) / p, where m is the number of matches (attributes on which objects i and j take the same value) and p is the total number of attributes.
◼ Exercise: consider the attribute test-1 and compute the dissimilarity matrix.
◼ A nominal attribute can also be encoded by binary attributes taking values 0 and 1, where 0 means that the state is absent and 1 means that it is present, so objects i and j are each described by a vector of 0s and 1s.
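The nominal dissimilarity d(i, j) = (p − m)/p can be sketched in a few lines of Python (a minimal illustration; the objects and attribute values below are hypothetical, not from the slides):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Ratio of mismatching nominal attributes: (p - m) / p."""
    p = len(obj_i)                                            # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)        # matching attributes
    return (p - m) / p

# Hypothetical objects with two nominal attributes (color, shape):
print(nominal_dissimilarity(("red", "round"), ("red", "square")))  # 0.5
```

Identical objects give dissimilarity 0, and objects differing on every attribute give 1.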
Proximity Measure for Binary Attributes
How can we compute the dissimilarity between two binary attributes? Count, over all p binary attributes, the four combinations of values for objects i and j:
    q = number of attributes equal to 1 for both i and j
    r = number of attributes equal to 1 for i but 0 for j
    s = number of attributes equal to 0 for i but 1 for j
    t = number of attributes equal to 0 for both i and j
◼ Symmetric binary attributes: each state is equally valuable
◼ Symmetric binary dissimilarity:
    d(i, j) = (r + s) / (q + r + s + t)
◼ Asymmetric binary attributes: the two states are not equally important, such as the positive (1) and negative (0) outcomes of a disease test. Given two asymmetric binary attributes, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary attributes are often considered "monary" (having one state).
◼ Asymmetric binary dissimilarity:
    d(i, j) = (r + s) / (q + r + s)
• The asymmetric binary similarity
    sim(i, j) = q / (q + r + s) = 1 − d(i, j)
• is called the Jaccard coefficient (similarity measure for asymmetric binary variables)
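The three binary measures above can be sketched directly from the q, r, s, t counts (a minimal illustration; the example vectors are hypothetical, not from the slides):

```python
def binary_counts(i, j):
    """Count q (both 1), r (i=1, j=0), s (i=0, j=1), t (both 0)."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_binary_dissimilarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_binary_dissimilarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s)        # negative matches t are ignored

def jaccard_similarity(i, j):
    return 1 - asymmetric_binary_dissimilarity(i, j)   # q / (q + r + s)

# Hypothetical binary vectors (e.g., presence/absence of symptoms):
i = (1, 0, 1, 0, 0, 0)
j = (1, 0, 1, 0, 1, 0)
print(symmetric_binary_dissimilarity(i, j))    # (0+1)/6
print(asymmetric_binary_dissimilarity(i, j))   # (0+1)/3
```

Note how dropping t in the asymmetric case makes the many shared 0s irrelevant, as the slide's "monary" discussion suggests.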
Dissimilarity between Binary Variables
Example: patient records for Jack, Mary, and Jim, described by asymmetric binary attributes (fever, cough, and test results). Using the asymmetric binary dissimilarity for each pair:
    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
Jack and Mary are the most likely to have a similar disease, since their dissimilarity is the lowest. Jim and Mary are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs.
Data Matrix and Dissimilarity Matrix
Dissimilarity of numeric data
The most popular distance measure is the Euclidean distance (i.e., straight-line distance):
    d(i, j) = (|xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2)^(1/2)
Applying it to every pair of rows of the data matrix yields the dissimilarity matrix.
Data Matrix and Dissimilarity Matrix
Dissimilarity of numeric data
• Both the Euclidean and the Manhattan distance are special cases of the Minkowski distance:
    d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm).
It represents the Manhattan distance when h = 1 (i.e., L1 norm) and the Euclidean distance when h = 2 (i.e., L2 norm).
• Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
◼ Properties
◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (the distance from an object to itself is zero)
◼ d(i, j) = d(j, i) (Symmetry)
◼ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
◼ A distance that satisfies these properties is a metric
Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
◼ E.g., the Hamming distance: the number of bits that are different between two binary vectors
    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
Example: Minkowski Distance

Data:
    point  attribute 1  attribute 2
    x1     1            2
    x2     3            5
    x3     2            0
    x4     4            5

Dissimilarity Matrices

Manhattan (L1)
    L1   x1   x2   x3   x4
    x1   0
    x2   5    0
    x3   3    6    0
    x4   6    1    7    0

Euclidean (L2)
    L2   x1    x2   x3    x4
    x1   0
    x2   3.61  0
    x3   2.24  5.1  0
    x4   4.24  1    5.39  0

Supremum (L∞)
    L∞   x1   x2   x3   x4
    x1   0
    x2   3    0
    x3   2    5    0
    x4   3    1    5    0
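The three dissimilarity matrices can be reproduced with a short Python sketch using the four points from the example:

```python
def minkowski(x, y, h):
    """Minkowski (L-h) distance between two numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """L-infinity distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

points = [(1, 2), (3, 5), (2, 0), (4, 5)]   # x1, x2, x3, x4 from the slide

manhattan = [[minkowski(p, q, 1) for q in points] for p in points]
euclidean = [[minkowski(p, q, 2) for q in points] for p in points]
sup = [[supremum(p, q) for q in points] for p in points]

print(manhattan[3][2])               # d(x4, x3) = 7.0
print(round(euclidean[1][0], 2))     # d(x2, x1) = 3.61
print(sup[3][0])                     # d(x4, x1) = 3
```

Each full n × n matrix is symmetric with a zero diagonal, so in practice only the lower triangle needs to be stored, as noted earlier.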
Example: Minkowski Distance
To summarize, the three measures correspond to different path lengths between two points on a right triangle: the Manhattan (L1) distance follows the two legs, the Euclidean (L2) distance follows the hypotenuse, and the supremum (L∞) distance equals the longer leg.
Standardizing Numeric Data
❑ In some cases, the data are normalized before applying distance calculations. This involves transforming the data to fall within a smaller or common range, such as [−1.0, 1.0] or [0.0, 1.0].
❑ For example, a height attribute could be measured in either meters or inches.
❑ In general, expressing an attribute in smaller units will lead to a larger range for that attribute and thus tend to give such attributes greater effect or "weight."
❑ Normalizing the data attempts to give all attributes an equal weight.
◼ Z-score: z = (x − μ) / σ, where μ is the mean and σ the standard deviation of the attribute.
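Both min-max normalization and the z-score can be sketched as follows (a minimal illustration; the height values are hypothetical, not from the slides):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Standardize values: z = (x - mean) / std."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

heights_m = [1.55, 1.70, 1.80, 1.95]   # hypothetical heights in meters
print(min_max_normalize(heights_m))    # values now span [0.0, 1.0]
print(z_score(heights_m))              # values now have mean 0, std 1
```

After either transformation the attribute's unit of measurement no longer dominates the distance, which is exactly the "equal weight" goal stated above.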
Ordinal Variables
◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Math grade: A, B, C, D, E
◼ Can be treated like interval-scaled variables:
Step 1 ◼ replace xif by its rank rif ∈ {1, ..., Mf}, e.g., A, B, C, D, E → 1, 2, 3, 4, 5
Step 2 ◼ map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    zif = (rif − 1) / (Mf − 1)
    e.g., numeric value for rank A = (1 − 1)/(5 − 1) = 0
          numeric value for rank B = (2 − 1)/(5 − 1) = 1/4 = 0.25
          numeric value for rank E = (5 − 1)/(5 − 1) = 4/4 = 1
Step 3 ◼ compute the dissimilarity using methods for interval-scaled variables
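The three steps above can be sketched in Python (a minimal illustration; the grade scale is the one from the slide):

```python
def ordinal_to_numeric(value, ordered_levels):
    """Map an ordinal value onto [0, 1] via z = (r - 1) / (M - 1)."""
    r = ordered_levels.index(value) + 1   # Step 1: rank, 1..M
    M = len(ordered_levels)
    return (r - 1) / (M - 1)              # Step 2: normalize onto [0, 1]

grades = ["A", "B", "C", "D", "E"]        # ordered levels, M = 5
print(ordinal_to_numeric("A", grades))    # 0.0
print(ordinal_to_numeric("B", grades))    # 0.25
print(ordinal_to_numeric("E", grades))    # 1.0
```

Step 3 then treats the resulting numbers as interval-scaled values, e.g., by feeding them into a Minkowski distance.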
Ordinal Variables
Dissimilarity between ordinal attributes.
Step 2: normalization with
    zif = (rif − 1) / (Mf − 1)
e.g., for a rank of 2 with Mf = 3 states: (2 − 1)/(3 − 1) = 1/2 = 0.5. The four normalized values are 1, 0, 0.5, 1.
Step 3: compute the dissimilarity matrix from the normalized values.
Attributes of Mixed Type
Cosine Similarity
◼ A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
◼ Cosine similarity measures the cosine of the angle between two such vectors:
    cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||)
where • denotes the vector dot product and ||d|| is the length (Euclidean norm) of vector d.
Example: Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = sim(d1, d2) = 25 / (6.481 × 4.12) = 0.94
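The same calculation can be reproduced with a short Python sketch:

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)   # term-frequency vectors from the slide
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))   # 0.94
```

A value near 1 means the two documents use words in nearly the same proportions, regardless of document length.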
Exercise
Thank You