Lecture 4

The document discusses the concepts of similarity and dissimilarity in data mining, focusing on how to measure the proximity between data objects using various methods such as data matrices and dissimilarity matrices. It covers different types of attributes, including nominal, binary, and numeric, and explains distance measures like Euclidean, Manhattan, and Minkowski distances. Additionally, it addresses the importance of standardizing and normalizing data to ensure accurate distance calculations across mixed attribute types.


SET 393: Data Mining and Business Intelligence

3rd Year

Spring 2025

Lec. 4

Chapter 2. Data, Measurements, and Data Preprocessing


Assistant Professor: Dr. Rasha Saleh
Similarity and Distance

2
Proximity: Similarity and Dissimilarity

Similarity

• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike; often similarity = 1 − dissimilarity
• Often falls in the range [0, 1]

Dissimilarity (e.g., distance)

• Numerical measure of how different two data objects are


• Lower when objects are more alike
• Minimum dissimilarity is often 0 → indicates the objects are identical (maximally similar)

Proximity refers to a similarity or dissimilarity

3
Data Matrix and Dissimilarity Matrix

◼ Data matrix
◼ Composed of n × p entries (n data points or objects × p attributes or dimensions):

    [ x11  ...  x1f  ...  x1p ]
    [ ...  ...  ...  ...  ... ]
    [ xi1  ...  xif  ...  xip ]
    [ ...  ...  ...  ...  ... ]
    [ xn1  ...  xnf  ...  xnp ]

◼ Dissimilarity matrix
◼ Composed of n data points, but registers in the matrix only the distance d(i, j)
◼ d(i, j) = dissimilarity between objects i and j
◼ A triangular matrix. Why? Since d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle needs to be stored:

    [ 0                          ]
    [ d(2,1)  0                  ]
    [ d(3,1)  d(3,2)  0          ]
    [ :       :       :          ]
    [ d(n,1)  d(n,2)  ...     0  ]

◼ One dissimilarity matrix for one column or attribute of the data matrix.
4
Data Matrix and Dissimilarity Matrix
◼ Dissimilarity matrix
◼ d(i, j) is a nonnegative number that is close to 0 when objects i and j are
highly similar or “near” each other, and becomes larger the more they
differ.
◼ d(i, j) ≥ 0 (for many measures, d is scaled to fall within [0, 1])
◼ d(i, i) = 0; that is, the difference between an object and itself is 0.
Furthermore, d(i, j) = d(j, i), so the matrix is triangular:

    [ 0                          ]
    [ d(2,1)  0                  ]
    [ d(3,1)  d(3,2)  0          ]
    [ :       :       :          ]
    [ d(n,1)  d(n,2)  ...     0  ]

◼ Measures of similarity can often be expressed as a function of measures of
dissimilarity. For example, for nominal data, sim(i, j) = 1 − d(i, j).
5
Proximity Measure for Nominal Attributes

◼ How is dissimilarity computed between objects described by nominal attributes ?


◼ Nominal attribute can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
◼ Method 1: Simple matching (ratio of mismatches)

    d(i, j) = (p − m) / p

◼ m: # of matches (# of attributes in which i and j have the same state)
◼ p: total # of attributes
◼ Method 2: Use a large number of binary attributes
◼ creating a new binary attribute for each of the M nominal states

6
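The simple matching measure d(i, j) = (p − m)/p can be sketched in Python as follows; the color/shape attribute values in the example are hypothetical, chosen only to illustrate m = 1 match out of p = 3 attributes:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p, where m is the number of
    attributes whose states match and p is the total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Two objects described by three nominal attributes (hypothetical values):
d = nominal_dissimilarity(["red", "circle", "small"], ["red", "square", "large"])
print(d)  # m = 1 match out of p = 3 attributes -> (3 - 1) / 3 ≈ 0.667
```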
Proximity Measure for Nominal Attributes
Consider the attribute test-1 and compute the dissimilarity matrix.

Because we have one nominal attribute, p = 1, so d(i, j) = 0 when the two objects match and d(i, j) = 1 when they differ.

Similarity can be computed as sim(i, j) = 1 − d(i, j).

All objects are dissimilar except objects 1 and 4 (i.e., d(4,1) = 0).
7
Proximity Measure for Binary Attributes
How can we compute the dissimilarity between two binary attributes?

◼ Binary attribute: 0 and 1, where 0 means that the attribute is absent, and 1 means that it is present.

◼ A contingency table for binary data (object i in rows, object j in columns):

                   Object j
                   1        0        sum
    Object i  1    q        r        q + r
              0    s        t        s + t
            sum    q + s    r + t    p

◼ q is the number of attributes that equal 1 for both objects i and j
◼ r is the number of attributes that equal 1 for object i but equal 0 for object j
◼ s is the number of attributes that equal 0 for object i but equal 1 for object j
◼ t is the number of attributes that equal 0 for both objects i and j
◼ The total number of attributes is p, where p = q + r + s + t.
8
Proximity Measure for Binary Attributes
how can we compute the dissimilarity between two binary attributes?
◼ Symmetric binary attributes: each state is equally valuable.
◼ Symmetric binary dissimilarity:

    d(i, j) = (r + s) / (q + r + s + t)

◼ Asymmetric binary attributes: the two states are not equally important,
such as the positive (1) and negative (0) outcomes of a disease test.
Given two asymmetric binary attributes, the agreement of two 1s (a
positive match) is considered more significant than that of two 0s
(a negative match). Therefore, such binary attributes are often
considered “monary” (having one state).
◼ Asymmetric binary dissimilarity (the negative matches t are dropped):

    d(i, j) = (r + s) / (q + r + s)

• Asymmetric binary similarity:

    sim(i, j) = q / (q + r + s) = 1 − d(i, j)

• This is called the Jaccard coefficient (a similarity measure for asymmetric binary variables).
9
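The binary measures above can be sketched from the contingency counts q, r, s, t; the two example vectors are hypothetical, chosen to give q = 2, r = 1, s = 1, t = 2:

```python
def binary_counts(i, j):
    """Contingency counts for two binary vectors:
    q = both 1, r = i=1/j=0, s = i=0/j=1, t = both 0."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

def symmetric_dissimilarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s)      # negative matches t are ignored

def jaccard_similarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return q / (q + r + s)            # = 1 - asymmetric dissimilarity

i = [1, 0, 1, 1, 0, 0]
j = [1, 1, 0, 1, 0, 0]
# q = 2, r = 1, s = 1, t = 2
print(symmetric_dissimilarity(i, j))   # (1+1)/6 ≈ 0.333
print(asymmetric_dissimilarity(i, j))  # (1+1)/4 = 0.5
print(jaccard_similarity(i, j))        # 2/4 = 0.5
```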
Dissimilarity between Binary Variables

Patient records with 8 attributes.

Only one attribute is symmetric; the distance between objects (patients) is computed based only on the asymmetric binary attributes.

Jack and Mary are the most likely to have a similar disease (they have the lowest dissimilarity value).

Jim and Mary are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs.
10
11
Data Matrix and Dissimilarity Matrix
Dissimilarity of numeric data

The most popular distance measure is Euclidean distance (i.e., straight line):

    d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)² )

Data Matrix:

    point  attribute1  attribute2
    x1     1           2
    x2     3           5
    x3     2           0
    x4     4           5

Dissimilarity Matrix (with Euclidean Distance):

         x1    x2    x3    x4
    x1   0
    x2   3.61  0
    x3   2.24  5.10  0
    x4   4.24  1.00  5.39  0

e.g., d(x1, x2) = sqrt((1 − 3)² + (2 − 5)²) = sqrt(4 + 9) = sqrt(13) ≈ 3.61
12
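The Euclidean dissimilarity matrix above can be reproduced with a short sketch using the slide's four points:

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

# Print the lower-triangular dissimilarity matrix, as on the slide.
names = list(points)
for i, a in enumerate(names):
    row = [round(euclidean(points[a], points[b]), 2) for b in names[: i + 1]]
    print(a, row)
# x1 [0.0]
# x2 [3.61, 0.0]
# x3 [2.24, 5.1, 0.0]
# x4 [4.24, 1.0, 5.39, 0.0]
```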
Data Matrix and Dissimilarity Matrix
Dissimilarity of numeric data
Try it yourself!

The most popular distance measure is Euclidean distance (i.e., straight line)

Calculate the dissimilarity matrix (with Euclidean Distance)

Data Matrix

13
Data Matrix and Dissimilarity Matrix
Dissimilarity of numeric data

Another popular distance measure is Manhattan (or city-block) distance:

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

• Both the Euclidean and the Manhattan distance satisfy the following mathematical
properties:

• Nonnegativity: d(i, j) ≥ 0: distance is a nonnegative number.

• d(i, i) = 0: the distance of an object to itself is 0.

• Symmetry: d(i, j) = d(j, i): distance is a symmetric function.

• Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j): going directly from object i to object j in
space is no more than making a detour over any other object k.

Distance refers to dissimilarity.

• A measure that satisfies these conditions is known as a metric.
14
Distance on Numeric Data: Minkowski Distance
generalization of the Euclidean and Manhattan distances

◼ Minkowski distance: a popular distance measure

    d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h )^(1/h)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called the L-h norm).
It represents the Manhattan distance when h = 1 (i.e., the L1 norm) and the
Euclidean distance when h = 2 (i.e., the L2 norm).
◼ Properties
◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (the distance from an object to itself is 0)
◼ d(i, j) = d(j, i) (Symmetry)
◼ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
◼ A distance that satisfies these properties is a metric.
15
Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

◼ E.g., the Hamming distance: the number of bits that are different between two
binary vectors

◼ h = 2: Euclidean (L2 norm) distance

    d(i, j) = sqrt( |xi1 − xj1|² + |xi2 − xj2|² + ... + |xip − xjp|² )

◼ h → ∞: a generalization of the Minkowski distance, the “supremum” (Lmax norm, L∞ norm,
or uniform norm) distance:

    d(i, j) = max over f of |xif − xjf|

◼ This is the maximum difference between any component (attribute) of the vectors.
16
Example: Minkowski Distance
Dissimilarity Matrices
    point  attribute 1  attribute 2
    x1     1            2
    x2     3            5
    x3     2            0
    x4     4            5

Manhattan (L1)

    L1   x1  x2  x3  x4
    x1   0
    x2   5   0
    x3   3   6   0
    x4   6   1   7   0

Euclidean (L2)

    L2   x1    x2    x3    x4
    x1   0
    x2   3.61  0
    x3   2.24  5.10  0
    x4   4.24  1.00  5.39  0

Supremum (L∞)

    L∞   x1  x2  x3  x4
    x1   0
    x2   3   0
    x3   2   5   0
    x4   3   1   5   0
17
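All three matrices come from one Minkowski formula with different orders h; a minimal sketch, checked against the x1/x2 entries above:

```python
def minkowski(p, q, h):
    """Minkowski (L-h norm) distance: h=1 -> Manhattan, h=2 -> Euclidean,
    h=float('inf') -> supremum (maximum component difference)."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))             # 5.0  (Manhattan)
print(round(minkowski(x1, x2, 2), 2))   # 3.61 (Euclidean)
print(minkowski(x1, x2, float("inf")))  # 3    (supremum)
```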
Example: Minkowski Distance
Dissimilarity Matrices
    point  attribute 1  attribute 2
    x1     1            2
    x2     3            5
    x3     2            0
    x4     4            5

Manhattan (L1)

    L1   x1  x2  x3  x4
    x1   0
    x2   5   0
    x3   3   6   0
    x4   6   1   7   0

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

e.g., d(x1, x2) = |1 − 3| + |2 − 5| = 2 + 3 = 5

18
Example: Minkowski Distance
Dissimilarity Matrices
    point  attribute 1  attribute 2
    x1     1            2
    x2     3            5
    x3     2            0
    x4     4            5

Supremum (L∞)

    L∞   x1  x2  x3  x4
    x1   0
    x2   3   0
    x3   2   5   0
    x4   3   1   5   0

d(x1, x2): first attribute: |3 − 1| = 2; second attribute: |5 − 2| = 3.

Since the result for the second attribute > the result for the first attribute,
the supremum distance for (x1, x2) is 3.

19
Example: Minkowski Distance
To summarize, the three measures represent different distances on a right triangle: Manhattan follows the two legs, Euclidean follows the hypotenuse, and supremum is the longer leg.

20
Standardizing Numeric Data

❑ In some cases, the data are normalized before applying distance calculations. This
involves transforming the data to fall within a smaller or common range, such as
[−1.0,1.0] or [0.0, 1.0].
❑ For example, a height attribute could be measured in either meters or inches.
❑ In general, expressing an attribute in smaller units will lead to a larger range for that
attribute, and thus tend to give such attributes greater effect or “weight.”
❑ Normalizing the data attempts to give all attributes an equal weight.

◼ Z-score:

    z = (x − μ) / σ

◼ x: raw score to be standardized, μ: mean of the population, σ: standard deviation

◼ z measures the distance between the raw score and the population mean in units of the
standard deviation
◼ z is negative when the raw score is below the mean, “+” when above
21
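The z-score transformation z = (x − μ)/σ can be sketched as follows; the height values are hypothetical, used only to show that values below the mean get negative scores:

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z-score standardization: z = (x - mu) / sigma, using the population
    mean and population standard deviation of the attribute."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

heights_cm = [160, 170, 180, 190]   # hypothetical height attribute
print([round(z, 2) for z in z_scores(heights_cm)])
# [-1.34, -0.45, 0.45, 1.34] -- below-mean values negative, above-mean positive
```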
Standardizing Numeric Data

22
Standardizing Numeric Data

23
Ordinal Variables
◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Math grade: A, B, C, D, E
◼ Can be treated like interval-scaled:
◼ Step 1: replace xif by its rank rif ∈ {1, ..., Mf}, e.g., A, B, C, D, E → 1, 2, 3, 4, 5
◼ Step 2: map the range of each variable onto [0, 1] by replacing the rank of the
i-th object in the f-th variable by

    zif = (rif − 1) / (Mf − 1)

e.g., numeric for rank A = (1 − 1)/(5 − 1) = 0
      numeric for rank B = (2 − 1)/(5 − 1) = 1/4 = 0.25
      numeric for rank E = (5 − 1)/(5 − 1) = 4/4 = 1
◼ Step 3: compute the dissimilarity using methods for interval-scaled variables
24
Ordinal Variables
Dissimilarity between ordinal attributes.

Step 1: ranking. Replace each value by its rank rif ∈ {1, ..., Mf} (here Mf = 3).

Step 2: normalization:

    zif = (rif − 1) / (Mf − 1)

e.g., for rank 2: (2 − 1) / (3 − 1) = 1/2 = 0.5

The normalized values for the four objects are 1, 0, 0.5, and 1.

Step 3: compute the dissimilarity matrix from the normalized values using a method
for interval-scaled variables.
25
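The two normalization steps can be sketched as follows; this assumes the attribute takes three states ordered fair < good < excellent, which is consistent with the normalized values 1, 0, 0.5, 1 shown above:

```python
def normalize_ordinal(values, order):
    """Map ordinal values to [0, 1] via z = (r - 1) / (M - 1), where r is the
    1-based rank of the value and M is the number of states."""
    rank = {state: i + 1 for i, state in enumerate(order)}  # Step 1: ranking
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]        # Step 2: normalize

# 'order' lists the states from lowest to highest rank (M = 3 states).
z = normalize_ordinal(["excellent", "fair", "good", "excellent"],
                      order=["fair", "good", "excellent"])
print(z)  # [1.0, 0.0, 0.5, 1.0]
```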
Attributes of Mixed Type

◼ A database may contain all attribute types
◼ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
◼ One may use a weighted formula to combine their effects:

    d(i, j) = Σf=1..p δij(f) dij(f) / Σf=1..p δij(f)

where the indicator δij(f) is 0 if the comparison cannot be made for attribute f
(e.g., a missing value, or a 0-0 match on an asymmetric binary attribute), and 1
otherwise.
◼ f is binary or nominal:
dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
◼ f is numeric: use the normalized distance
◼ f is ordinal:
◼ compute ranks rif and

    zif = (rif − 1) / (Mf − 1)

◼ treat zif as interval-scaled
26
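The weighted combination above can be sketched as follows; the per-attribute dissimilarities in the example are hypothetical (say, one nominal, one numeric, and one ordinal attribute, each already scaled to [0, 1]):

```python
def mixed_dissimilarity(per_attribute_d, indicators):
    """Weighted mixed-type dissimilarity:
    d(i, j) = sum(delta_f * d_f) / sum(delta_f), where d_f is the
    per-attribute dissimilarity (each in [0, 1]) and delta_f is 0 when the
    comparison for attribute f cannot be made, 1 otherwise."""
    num = sum(delta * d for delta, d in zip(indicators, per_attribute_d))
    den = sum(indicators)
    return num / den

# Hypothetical per-attribute dissimilarities; all comparisons valid (delta = 1).
print(mixed_dissimilarity([1.0, 0.25, 0.5], [1, 1, 1]))  # (1 + 0.25 + 0.5) / 3

# If the second comparison cannot be made, it is simply dropped:
print(mixed_dissimilarity([1.0, 0.25, 0.5], [1, 0, 1]))  # (1 + 0.5) / 2
```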
Cosine Similarity
◼ A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

◼ Applications: information retrieval, biologic taxonomy, gene feature mapping, ...


◼ Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

    cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

i.e., the dot product of d1 and d2 divided by the product of their magnitudes,
where • indicates the vector dot product and ||d|| is the length of vector d.

27
Cosine Similarity

◼ Reminder: dot product:

◼ Assume d1 = (1, 2), d2 = (0, 3)

◼ d1 • d2 = (1 × 0) + (2 × 3) = 6

◼ cos(d1, d2) = 6 / (sqrt(1² + 2²) × sqrt(0² + 3²)) = 6 / (sqrt(5) × 3) = 2/sqrt(5) ≈ 0.894

28
Example: Cosine Similarity

◼ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d.

◼ Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = (42)^0.5 = 6.481
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = sim(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
29
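The worked example above can be checked with a short sketch using the slide's two term-frequency vectors:

```python
from math import sqrt

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```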
Exercise

30
Exercise

31
Exercise

32
Thank You
