Similarity and Dissimilarity

The document discusses various methods for measuring similarity and dissimilarity between objects. It describes different types of attributes like binary, nominal, ordinal, interval, and ratio. It then focuses on calculating proximity measures for binary attributes using contingency tables and distance measures. Various distance measures are introduced for different data types, including binary, numeric, and ordinal variables. The Minkowski distance is described as a popular distance measure for numeric data, with special cases like Manhattan, Euclidean, and supremum distances.

Uploaded by

Macho Nandini

MEASURES OF SIMILARITY AND DISSIMILARITY
 A similarity measure between two objects is a numerical measure of the degree to which the two objects are alike.
 A dissimilarity measure between two objects is a numerical measure of the degree to which the two objects are different.
TYPES OF ATTRIBUTES
 There are different types of attributes
 Binary : True/False
 Nominal: Examples: ID numbers, eye color, zip codes
 Ordinal: Examples: rankings (e.g., taste of potato chips on a scale from 1–10), grades, height in {tall, medium, short}
 Interval: Examples: calendar dates, temperatures in Celsius or Fahrenheit.
 Ratio: Examples: temperature in Kelvin, length, time, counts
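PROXIMITY MEASURE FOR BINARY ATTRIBUTES
 A contingency table for binary data, where q, r, s, and t count the attributes on which objects i and j take each combination of values, and p = q + r + s + t:

                 Object j
                 1       0       sum
Object i   1     q       r       q + r
           0     s       t       s + t
           sum   q + s   r + t   p

 Distance measure for symmetric binary variables:

$$d(i, j) = \frac{r + s}{q + r + s + t}$$

 Distance measure for asymmetric binary variables:

$$d(i, j) = \frac{r + s}{q + r + s}$$

 Jaccard coefficient (similarity measure for asymmetric binary variables):

$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$$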
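The counts behind these measures can be sketched in Python. This is a minimal sketch, assuming the usual contingency-table convention: q counts attributes that are 1 for both objects, r those that are 1 for i and 0 for j, s those that are 0 for i and 1 for j, and t those that are 0 for both. The function names and the two sample vectors are illustrative and do not come from the slides.

```python
def contingency(i, j):
    """Return the counts (q, r, s, t) for two equal-length 0/1 vectors."""
    if len(i) != len(j):
        raise ValueError("vectors must have the same length")
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_distance(i, j):
    # Both outcomes matter equally, so 0/0 matches (t) stay in the denominator.
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_distance(i, j):
    # 0/0 matches are treated as uninformative and dropped.
    # Raises ZeroDivisionError when neither object has any 1.
    q, r, s, _ = contingency(i, j)
    return (r + s) / (q + r + s)

def jaccard_similarity(i, j):
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)

# Made-up vectors, for illustration only.
a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 1, 0, 0]
print(contingency(a, b))
print(symmetric_distance(a, b), asymmetric_distance(a, b), jaccard_similarity(a, b))
```

With these definitions the asymmetric distance and the Jaccard coefficient always sum to 1, since both share the denominator q + r + s.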
DISSIMILARITY BETWEEN BINARY VARIABLES

 Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
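 Let the values Y and P be 1, and the value N be 0
 Using only the asymmetric attributes, d(i, j) = (r + s) / (q + r + s), where q counts the attributes that are 1 for both objects and r + s counts the attributes on which they differ:

$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$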
EXAMPLE: DATA MATRIX AND DISSIMILARITY MATRIX

Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

[Figure: scatter plot of the points x1, x2, x3, x4; axis ticks at 0, 2, 4]

Dissimilarity Matrix (with Euclidean Distance)
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
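A minimal Python sketch of how a lower-triangular dissimilarity matrix can be derived from a data matrix, using the standard-library `math.dist` (Python 3.8+) for the Euclidean distance. The variable names are illustrative. Rounding to two decimals is an assumption made to match the precision shown on the slide.

```python
import math

# Data matrix from the example: one (attribute1, attribute2) pair per point.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

# Row k holds the distances from point k to points 0..k, so only the
# lower triangle (including the zero diagonal) is stored.
matrix = [
    [round(math.dist(points[a], points[b]), 2) for b in names[: k + 1]]
    for k, a in enumerate(names)
]

for name, row in zip(names, matrix):
    print(name, *row)
```

Only the lower triangle is stored because the distance is symmetric, so the upper triangle would repeat the same values.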
DISTANCE ON NUMERIC DATA: MINKOWSKI DISTANCE
 Minkowski distance: A popular distance measure

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h ≥ 1 is the order (the distance so defined is also called the L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric
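The definition and its properties can be spot-checked numerically. The sketch below assumes h ≥ 1 and uses the four points from the example data matrix. The helper name `is_metric_on` is hypothetical, and passing the check on a handful of points illustrates the properties without proving them.

```python
from itertools import permutations

def minkowski(i, j, h):
    """Minkowski distance of order h (h >= 1) between equal-length numeric vectors."""
    if h < 1:
        raise ValueError("h must be >= 1 for the result to be a metric")
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

pts = [(1, 2), (3, 5), (2, 0), (4, 5)]

def is_metric_on(pts, h, tol=1e-12):
    """Check the three metric properties on every ordered triple of distinct points."""
    for i, j, k in permutations(pts, 3):
        d_ij = minkowski(i, j, h)
        if not d_ij > 0:                                    # positive for i != j
            return False
        if abs(d_ij - minkowski(j, i, h)) > tol:            # symmetry
            return False
        if d_ij > minkowski(i, k, h) + minkowski(k, j, h) + tol:  # triangle inequality
            return False
    return all(minkowski(p, p, h) == 0 for p in pts)        # d(i, i) = 0

print(all(is_metric_on(pts, h) for h in (1, 2, 3)))
```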
SPECIAL CASES OF MINKOWSKI DISTANCE

 h = 1: Manhattan (city block, L1 norm) distance
 E.g., the Hamming distance: the number of bits that are different between two binary vectors

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

 h = 2: Euclidean (L2 norm) distance

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

 h → ∞: "supremum" (Lmax norm, L∞ norm) distance
 This is the maximum difference between any component (attribute) of the vectors

$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$
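Two of these special cases are easy to see numerically: the Hamming distance is the Manhattan distance applied to 0/1 vectors, and the Minkowski distance approaches the supremum distance as h grows. The sketch below is illustrative. The binary vectors `u` and `v` are made up, and the function names are assumptions.

```python
def manhattan(i, j):
    return sum(abs(a - b) for a, b in zip(i, j))

def supremum(i, j):
    return max(abs(a - b) for a, b in zip(i, j))

def minkowski(i, j, h):
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

# Hamming distance: on 0/1 vectors, each differing bit contributes exactly 1.
u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 1]
hamming = manhattan(u, v)
print("hamming", hamming)

# As h grows, the largest component difference dominates the sum.
i, j = (1, 2), (3, 5)
for h in (1, 2, 10, 100):
    print(h, minkowski(i, j, h))
print("supremum", supremum(i, j))
```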
EXAMPLE: MINKOWSKI DISTANCE

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

[Figure: scatter plot of the points x1, x2, x3, x4; axis ticks at 0, 2, 4]

Dissimilarity Matrices

Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

Supremum
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
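To compare the three orders side by side, the off-diagonal entries of all three matrices can be generated from one function. This is a sketch under the assumption that `float("inf")` is used as a flag for the supremum distance. The names `distance` and `lower_triangle` are hypothetical.

```python
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def distance(i, j, h):
    """Minkowski distance of order h; h = float('inf') selects the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(i, j)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

def lower_triangle(h):
    """Map each (row, column) pair below the diagonal to its rounded distance."""
    names = list(points)
    return {
        (a, b): round(distance(points[a], points[b], h), 2)
        for k, a in enumerate(names)
        for b in names[:k]
    }

l1 = lower_triangle(1)
l2 = lower_triangle(2)
lmax = lower_triangle(float("inf"))

# One line per pair: Manhattan, Euclidean, supremum.
for pair in l1:
    print(pair, l1[pair], l2[pair], lmax[pair])
```

For every pair the values are ordered supremum ≤ Euclidean ≤ Manhattan, which holds in general for Minkowski distances as h decreases.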
ORDINAL VARIABLES

 An ordinal variable can be discrete or continuous

 Order is important, e.g., rank

 Can be treated like interval-scaled
 replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
 map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

 compute the dissimilarity using methods for interval-scaled variables
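The rank-and-rescale steps can be sketched as follows. The level ordering reuses the {short, medium, tall} height example from the attribute types. The observations are made up, and the function name `normalize_ordinal` is an assumption.

```python
def normalize_ordinal(values, ordered_levels):
    """Map ordinal values onto [0, 1] via z = (r - 1) / (M - 1), r being the 1-based rank."""
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]

levels = ["short", "medium", "tall"]            # ordered from lowest to highest
heights = ["tall", "short", "medium", "tall"]   # made-up observations
z = normalize_ordinal(heights, levels)
print(z)

# Once on [0, 1], any interval-scaled distance applies, e.g. the absolute difference.
d = abs(z[0] - z[1])
print(d)
```

Rescaling to [0, 1] keeps variables with many levels from outweighing variables with few levels when several ordinal attributes are combined.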
