SIMILARITY AND DISSIMILARITY
A similarity measure between two objects is a numerical measure of the degree to which the two
objects are alike.
A dissimilarity measure between two objects is a numerical measure of the degree to which the two
objects are different.
TYPES OF ATTRIBUTES
Attributes come in several types:
Binary: Examples: True/False
Nominal: Examples: ID numbers, eye color, zip codes
Ordinal: Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
Interval: Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio: Examples: temperature in Kelvin, length, time, counts
PROXIMITY MEASURE FOR BINARY ATTRIBUTES
[Figure: contingency table for binary attributes of objects i and j]
DISSIMILARITY BETWEEN BINARY VARIABLES
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
Gender is a symmetric attribute.
The remaining attributes are asymmetric binary.
Let the values Y and P be coded as 1 and the value N as 0.
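Using only the asymmetric binary attributes, the dissimilarity is $d(i, j) = \frac{r + s}{q + r + s}$, where q counts the attributes that are 1 for both objects, r those that are 1 for i and 0 for j, and s those that are 0 for i and 1 for j:
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$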
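The three values can be checked with a minimal Python sketch. It assumes the asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects, r those that are 1 for i only, and s those that are 1 for j only. The names `patients`, `encode`, and `asymmetric_binary_dissimilarity` are illustrative and do not come from the slides. Gender is left out because it is symmetric.

```python
# Asymmetric binary attributes only (Gender is symmetric and excluded).
patients = {
    "Jack": {"Fever": "Y", "Cough": "N", "Test-1": "P",
             "Test-2": "N", "Test-3": "N", "Test-4": "N"},
    "Mary": {"Fever": "Y", "Cough": "N", "Test-1": "P",
             "Test-2": "N", "Test-3": "P", "Test-4": "N"},
    "Jim":  {"Fever": "Y", "Cough": "P", "Test-1": "N",
             "Test-2": "N", "Test-3": "N", "Test-4": "N"},
}


def encode(value):
    """Map Y and P to 1 and N to 0."""
    return 1 if value in ("Y", "P") else 0


def asymmetric_binary_dissimilarity(a, b):
    """Return (r + s) / (q + r + s); 0/0 matches are ignored."""
    q = r = s = 0
    for attribute in a:
        x, y = encode(a[attribute]), encode(b[attribute])
        if x == 1 and y == 1:
            q += 1  # both objects have the value 1
        elif x == 1 and y == 0:
            r += 1  # only the first object has the value 1
        elif x == 0 and y == 1:
            s += 1  # only the second object has the value 1
    if q + r + s == 0:
        # Assumption: two all-zero objects are treated as identical.
        return 0.0
    return (r + s) / (q + r + s)


for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(patients[first], patients[second])
    print(f"d({first}, {second}) = {d:.2f}")
```

By this measure Jack and Mary are the most similar pair, and Jim and Mary the least similar.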
EXAMPLE: DATA MATRIX AND DISSIMILARITY MATRIX
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
[Figure: the points x1 to x4 plotted against attribute1 and attribute2]
Dissimilarity Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
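A dissimilarity matrix of this kind can be computed directly from the data matrix. The sketch below is one possible way to do it, using `math.dist` from the Python standard library (Python 3.8 or later). The names `points` and `dissimilarity_matrix` are illustrative. Only the lower triangle is built, because the matrix is symmetric with zeros on the diagonal.

```python
import math

# Data matrix: one (attribute1, attribute2) pair per point.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def dissimilarity_matrix(data):
    """Return point names and the lower-triangular Euclidean distance rows."""
    names = list(data)
    rows = []
    for i, a in enumerate(names):
        # Row i holds distances from point i to points 0..i (inclusive).
        rows.append([round(math.dist(data[a], data[b]), 2)
                     for b in names[: i + 1]])
    return names, rows


names, matrix = dissimilarity_matrix(points)
for name, row in zip(names, matrix):
    print(name, *row)
```

For x1 and x3, for instance, the entry is the square root of (1 - 2)^2 + (2 - 0)^2 = 5, which rounds to 2.24.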
DISTANCE ON NUMERIC DATA: MINKOWSKI DISTANCE
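Minkowski distance: A popular distance measure
$$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the $L_h$ norm)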
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
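The three properties can be spot-checked numerically on a small set of points. The sketch below is a sanity check, not a proof. It assumes the usual Minkowski form, the h-th root of the sum of h-th powers of absolute coordinate differences, with h >= 1. The helper names `minkowski` and `is_metric_on` and the tolerance `tol` are illustrative choices.

```python
from itertools import product


def minkowski(x, y, h):
    """L-h distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def is_metric_on(points, h, tol=1e-9):
    """Check the three metric properties for every triple drawn from points."""
    for i, j, k in product(points, repeat=3):
        d_ij = minkowski(i, j, h)
        # Positive definiteness: d(i, i) = 0 and d(i, j) > 0 for i != j.
        if i == j and abs(d_ij) > tol:
            return False
        if i != j and d_ij <= 0:
            return False
        # Symmetry: d(i, j) = d(j, i).
        if abs(d_ij - minkowski(j, i, h)) > tol:
            return False
        # Triangle inequality: d(i, j) <= d(i, k) + d(k, j).
        if d_ij > minkowski(i, k, h) + minkowski(k, j, h) + tol:
            return False
    return True


points = [(1, 2), (3, 5), (2, 0), (4, 5)]
for h in (1, 2, 3):
    print(f"h = {h}: properties hold on the sample points: {is_metric_on(points, h)}")
```

Passing the check on a few sample points only shows that no counterexample was found among them.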
SPECIAL CASES OF MINKOWSKI DISTANCE
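h = 1: Manhattan (city block, $L_1$ norm) distance
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
h = 2: Euclidean ($L_2$ norm) distance
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
h → ∞: supremum ($L_\infty$ norm) distance, the largest difference over all attributes
$$d(i, j) = \max_{f} |x_{if} - x_{jf}|$$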
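As a worked check of the special cases, take the points x1 = (1, 2) and x2 = (3, 5) from the data matrix used in the examples. The orders h = 1, h = 2, and h → ∞ correspond to the Manhattan, Euclidean, and supremum distances respectively:

```latex
\begin{align*}
h = 1:&\quad d(x_1, x_2) = |1 - 3| + |2 - 5| = 2 + 3 = 5 \\
h = 2:&\quad d(x_1, x_2) = \sqrt{|1 - 3|^2 + |2 - 5|^2} = \sqrt{13} \approx 3.61 \\
h \to \infty:&\quad d(x_1, x_2) = \max\bigl(|1 - 3|,\ |2 - 5|\bigr) = 3
\end{align*}
```

The distance can only shrink or stay the same as h grows: 5 ≥ 3.61 ≥ 3.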
EXAMPLE: MINKOWSKI DISTANCE
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
Dissimilarity Matrices
Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0
Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
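The three matrices can be recomputed with a short Python sketch. The function names `manhattan`, `euclidean`, `supremum`, and `lower_triangle` are illustrative, and values are rounded to two decimals to match the tables.

```python
import math

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def manhattan(x, y):
    """L1 distance: sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def supremum(x, y):
    """L-infinity distance: largest absolute attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def lower_triangle(data, dist):
    """Lower-triangular dissimilarity rows, rounded to two decimals."""
    names = list(data)
    return [[round(dist(data[a], data[b]), 2) for b in names[: i + 1]]
            for i, a in enumerate(names)]


for label, dist in [("Manhattan (L1)", manhattan),
                    ("Euclidean (L2)", euclidean),
                    ("Supremum", supremum)]:
    print(label)
    for name, row in zip(points, lower_triangle(points, dist)):
        print(" ", name, *row)
```

The choice of distance changes the neighbor ranking: x1 is closer to x3 than to x2 under all three measures here, but the gap between the two distances differs from one measure to the next.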
ORDINAL VARIABLES