Lecture 4

The document discusses the concepts of similarity and dissimilarity in data mining, focusing on how to measure the proximity between data objects using various methods such as data matrices and dissimilarity matrices. It covers different types of attributes, including nominal, binary, and numeric, and explains distance measures like Euclidean, Manhattan, and Minkowski distances. Additionally, it addresses the importance of standardizing and normalizing data to ensure accurate distance calculations across mixed attribute types.


SET 393: Data Mining and Business Intelligence

3rd Year

Spring 2025

Lec. 4

Chapter 2. Data, Measurements, and Data Preprocessing


Assistant Professor: Dr. Rasha Saleh
Similarity and Distance

2
Proximity: Similarity and Dissimilarity

Similarity

• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike; often similarity = 1 − dissimilarity
• Often falls in the range [0, 1]

Dissimilarity (e.g., distance)

• Numerical measure of how different two data objects are


• Lower when objects are more alike
• Minimum dissimilarity is often 0 → indicates the objects are identical (maximally similar)

Proximity refers to a similarity or dissimilarity

3
Data Matrix and Dissimilarity Matrix

◼ Data matrix
◼ Composed of n × p entries (n data points or objects × p attributes or dimensions):

    [ x11  ...  x1f  ...  x1p ]
    [ ...  ...  ...  ...  ... ]
    [ xi1  ...  xif  ...  xip ]
    [ ...  ...  ...  ...  ... ]
    [ xn1  ...  xnf  ...  xnp ]

◼ Dissimilarity matrix
◼ Composed of n data points, but registers in the matrix only the distance d(i, j)
◼ d(i, j) = dissimilarity between objects i and j
◼ A triangular matrix. Why? Since d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle needs to be stored:

    [ 0                          ]
    [ d(2,1)  0                  ]
    [ d(3,1)  d(3,2)  0          ]
    [ :       :       :          ]
    [ d(n,1)  d(n,2)  ...     0  ]

◼ One dissimilarity matrix for one column or attribute of the data matrix.
4
Data Matrix and Dissimilarity Matrix
◼ Dissimilarity matrix
◼ d(i, j) is a nonnegative number that is close to 0 when objects i and j are
highly similar or “near” each other, and becomes larger the more they
differ.
◼ d(i, j) ≥ 0 (for many measures, d is scaled to fall within [0, 1])
◼ d(i, i) = 0; that is, the difference between an object and itself is 0.
Furthermore, d(i, j) = d(j, i), so the matrix is triangular:

    [ 0                          ]
    [ d(2,1)  0                  ]
    [ d(3,1)  d(3,2)  0          ]
    [ :       :       :          ]
    [ d(n,1)  d(n,2)  ...     0  ]

◼ Measures of similarity can often be expressed as a function of measures of
dissimilarity. For example, for nominal data, sim(i, j) = 1 − d(i, j).
5
Proximity Measure for Nominal Attributes

◼ How is dissimilarity computed between objects described by nominal attributes ?


◼ Nominal attribute can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
◼ Method 1: Simple matching (ratio of mismatches)

    d(i, j) = (p − m) / p

◼ m: # of matches (# of attributes in which i and j have the same state)
◼ p: total # of attributes
◼ Method 2: Use a large number of binary attributes
◼ creating a new binary attribute for each of the M nominal states

6
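The simple matching measure d(i, j) = (p − m)/p can be sketched in Python as follows; the color/shape attribute values in the example are hypothetical, chosen only to illustrate m = 1 match out of p = 3 attributes:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p, where m is the number of
    attributes whose states match and p is the total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Two objects described by three nominal attributes (hypothetical values):
d = nominal_dissimilarity(["red", "circle", "small"], ["red", "square", "large"])
print(d)  # m = 1 match out of p = 3 attributes -> (3 - 1) / 3 ≈ 0.667
```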
Proximity Measure for Nominal Attributes
Consider the attribute test-1 and compute the dissimilarity matrix.

Because we have one nominal attribute, p = 1, so d(i, j) = 0 when the two objects match and d(i, j) = 1 when they differ.

Similarity can be computed as sim(i, j) = 1 − d(i, j).

All objects are dissimilar except objects 1 and 4 (i.e., d(4,1) = 0).
7
Proximity Measure for Binary Attributes
How can we compute the dissimilarity between two binary attributes?

◼ Binary attribute: 0 and 1, where 0 means that the attribute is absent, and 1 means that it is present.

◼ A contingency table for binary data (object i in rows, object j in columns):

                   Object j
                   1        0        sum
    Object i  1    q        r        q + r
              0    s        t        s + t
            sum    q + s    r + t    p

◼ q is the number of attributes that equal 1 for both objects i and j
◼ r is the number of attributes that equal 1 for object i but equal 0 for object j
◼ s is the number of attributes that equal 0 for object i but equal 1 for object j
◼ t is the number of attributes that equal 0 for both objects i and j
◼ The total number of attributes is p, where p = q + r + s + t.
8
Proximity Measure for Binary Attributes
how can we compute the dissimilarity between two binary attributes?
◼ Symmetric binary attributes: each state is equally valuable.
◼ Symmetric binary dissimilarity:

    d(i, j) = (r + s) / (q + r + s + t)

◼ Asymmetric binary attributes: the two states are not equally important,
such as the positive (1) and negative (0) outcomes of a disease test.
Given two asymmetric binary attributes, the agreement of two 1s (a
positive match) is considered more significant than that of two 0s
(a negative match). Therefore, such binary attributes are often
considered “monary” (having one state).
◼ Asymmetric binary dissimilarity (the negative matches t are dropped):

    d(i, j) = (r + s) / (q + r + s)

• Asymmetric binary similarity:

    sim(i, j) = q / (q + r + s) = 1 − d(i, j)

• This is called the Jaccard coefficient (a similarity measure for asymmetric binary variables).
9
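The binary measures above can be sketched from the contingency counts q, r, s, t; the two example vectors are hypothetical, chosen to give q = 2, r = 1, s = 1, t = 2:

```python
def binary_counts(i, j):
    """Contingency counts for two binary vectors:
    q = both 1, r = i=1/j=0, s = i=0/j=1, t = both 0."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

def symmetric_dissimilarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s)      # negative matches t are ignored

def jaccard_similarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return q / (q + r + s)            # = 1 - asymmetric dissimilarity

i = [1, 0, 1, 1, 0, 0]
j = [1, 1, 0, 1, 0, 0]
# q = 2, r = 1, s = 1, t = 2
print(symmetric_dissimilarity(i, j))   # (1+1)/6 ≈ 0.333
print(asymmetric_dissimilarity(i, j))  # (1+1)/4 = 0.5
print(jaccard_similarity(i, j))        # 2/4 = 0.5
```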
Dissimilarity between Binary Variables

Patient records with 8 attributes.

Only one attribute is symmetric; the distance between objects (patients) is computed based only on the asymmetric binary attributes.

Jack and Mary are the most likely to have a similar disease (they have the lowest dissimilarity value).

Jim and Mary are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs.
10
11
Data Matrix and Dissimilarity Matrix
Dissimilarity of numeric data

The most popular distance measure is Euclidean distance (i.e., straight line):

    d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)² )

Data Matrix:

    point  attribute1  attribute2
    x1     1           2
    x2     3           5
    x3     2           0
    x4     4           5

Dissimilarity Matrix (with Euclidean Distance):

         x1    x2    x3    x4
    x1   0
    x2   3.61  0
    x3   2.24  5.10  0
    x4   4.24  1.00  5.39  0

e.g., d(x1, x2) = sqrt((1 − 3)² + (2 − 5)²) = sqrt(4 + 9) = sqrt(13) ≈ 3.61
12
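The Euclidean dissimilarity matrix above can be reproduced with a short sketch using the slide's four points:

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

# Print the lower-triangular dissimilarity matrix, as on the slide.
names = list(points)
for i, a in enumerate(names):
    row = [round(euclidean(points[a], points[b]), 2) for b in names[: i + 1]]
    print(a, row)
# x1 [0.0]
# x2 [3.61, 0.0]
# x3 [2.24, 5.1, 0.0]
# x4 [4.24, 1.0, 5.39, 0.0]
```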
Data Matrix and Dissimilarity Matrix
Dissimilarity of numeric data
Try it yourself!

The most popular distance measure is Euclidean distance (i.e., straight line)

Calculate the dissimilarity matrix (with Euclidean Distance)

Data Matrix

13
Data Matrix and Dissimilarity Matrix
Dissimilarity of numeric data

Another popular distance measure is Manhattan (or city-block) distance:

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

• Both the Euclidean and the Manhattan distance satisfy the following mathematical
properties:

• Nonnegativity: d(i, j) ≥ 0: distance is a nonnegative number.

• d(i, i) = 0: the distance of an object to itself is 0.

• Symmetry: d(i, j) = d(j, i): distance is a symmetric function.

• Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j): going directly from object i to object j in
space is no more than making a detour over any other object k.

Distance refers to dissimilarity.

• A measure that satisfies these conditions is known as a metric.
14
Distance on Numeric Data: Minkowski Distance
generalization of the Euclidean and Manhattan distances

◼ Minkowski distance: a popular distance measure

    d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h )^(1/h)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called the L-h norm).
It represents the Manhattan distance when h = 1 (i.e., the L1 norm) and the
Euclidean distance when h = 2 (i.e., the L2 norm).
◼ Properties
◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (the distance from an object to itself is 0)
◼ d(i, j) = d(j, i) (Symmetry)
◼ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
◼ A distance that satisfies these properties is a metric.
15
Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

◼ E.g., the Hamming distance: the number of bits that are different between two
binary vectors

◼ h = 2: Euclidean (L2 norm) distance

    d(i, j) = sqrt( |xi1 − xj1|² + |xi2 − xj2|² + ... + |xip − xjp|² )

◼ h → ∞: a generalization of the Minkowski distance, the “supremum” (Lmax norm, L∞ norm,
or uniform norm) distance:

    d(i, j) = max over f of |xif − xjf|

◼ This is the maximum difference between any component (attribute) of the vectors.
16
Example: Minkowski Distance
Dissimilarity Matrices
    point  attribute 1  attribute 2
    x1     1            2
    x2     3            5
    x3     2            0
    x4     4            5

Manhattan (L1)

    L1   x1  x2  x3  x4
    x1   0
    x2   5   0
    x3   3   6   0
    x4   6   1   7   0

Euclidean (L2)

    L2   x1    x2    x3    x4
    x1   0
    x2   3.61  0
    x3   2.24  5.10  0
    x4   4.24  1.00  5.39  0

Supremum (L∞)

    L∞   x1  x2  x3  x4
    x1   0
    x2   3   0
    x3   2   5   0
    x4   3   1   5   0
17
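All three matrices come from one Minkowski formula with different orders h; a minimal sketch, checked against the x1/x2 entries above:

```python
def minkowski(p, q, h):
    """Minkowski (L-h norm) distance: h=1 -> Manhattan, h=2 -> Euclidean,
    h=float('inf') -> supremum (maximum component difference)."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))             # 5.0  (Manhattan)
print(round(minkowski(x1, x2, 2), 2))   # 3.61 (Euclidean)
print(minkowski(x1, x2, float("inf")))  # 3    (supremum)
```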
Example: Minkowski Distance
Dissimilarity Matrices
    point  attribute 1  attribute 2
    x1     1            2
    x2     3            5
    x3     2            0
    x4     4            5

Manhattan (L1)

    L1   x1  x2  x3  x4
    x1   0
    x2   5   0
    x3   3   6   0
    x4   6   1   7   0

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

e.g., d(x1, x2) = |1 − 3| + |2 − 5| = 2 + 3 = 5

18
Example: Minkowski Distance
Dissimilarity Matrices
    point  attribute 1  attribute 2
    x1     1            2
    x2     3            5
    x3     2            0
    x4     4            5

Supremum (L∞)

    L∞   x1  x2  x3  x4
    x1   0
    x2   3   0
    x3   2   5   0
    x4   3   1   5   0

d(x1, x2): first attribute: |3 − 1| = 2; second attribute: |5 − 2| = 3.

Since the result for the second attribute > the result for the first attribute,
the supremum distance for (x1, x2) is 3.

19
Example: Minkowski Distance
To summarize, the three measures represent different distances on a right triangle: Manhattan follows the two legs, Euclidean follows the hypotenuse, and supremum is the longer leg.

20
Standardizing Numeric Data

❑ In some cases, the data are normalized before applying distance calculations. This
involves transforming the data to fall within a smaller or common range, such as
[−1.0,1.0] or [0.0, 1.0].
❑ For example, a height attribute could be measured in either meters or inches.
❑ In general, expressing an attribute in smaller units will lead to a larger range for that
attribute, and thus tend to give such attributes greater effect or “weight.”
❑ Normalizing the data attempts to give all attributes an equal weight.

◼ Z-score:

    z = (x − μ) / σ

◼ x: raw score to be standardized, μ: mean of the population, σ: standard deviation

◼ z measures the distance between the raw score and the population mean in units of the
standard deviation
◼ z is negative when the raw score is below the mean, “+” when above
21
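The z-score transformation z = (x − μ)/σ can be sketched as follows; the height values are hypothetical, used only to show that values below the mean get negative scores:

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z-score standardization: z = (x - mu) / sigma, using the population
    mean and population standard deviation of the attribute."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

heights_cm = [160, 170, 180, 190]   # hypothetical height attribute
print([round(z, 2) for z in z_scores(heights_cm)])
# [-1.34, -0.45, 0.45, 1.34] -- below-mean values negative, above-mean positive
```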
Standardizing Numeric Data

22
Standardizing Numeric Data

23
Ordinal Variables
◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Math grade: A, B, C, D, E
◼ Can be treated like interval-scaled:
◼ Step 1: replace xif by its rank rif ∈ {1, ..., Mf}, e.g., A, B, C, D, E → 1, 2, 3, 4, 5
◼ Step 2: map the range of each variable onto [0, 1] by replacing the rank of the
i-th object in the f-th variable by

    zif = (rif − 1) / (Mf − 1)

e.g., numeric for rank A = (1 − 1)/(5 − 1) = 0
      numeric for rank B = (2 − 1)/(5 − 1) = 1/4 = 0.25
      numeric for rank E = (5 − 1)/(5 − 1) = 4/4 = 1
◼ Step 3: compute the dissimilarity using methods for interval-scaled variables
24
Ordinal Variables
Dissimilarity between ordinal attributes.

Step 1: ranking. Replace each value by its rank rif ∈ {1, ..., Mf} (here Mf = 3).

Step 2: normalization:

    zif = (rif − 1) / (Mf − 1)

e.g., for rank 2: (2 − 1) / (3 − 1) = 1/2 = 0.5

The normalized values for the four objects are 1, 0, 0.5, and 1.

Step 3: compute the dissimilarity matrix from the normalized values using a method
for interval-scaled variables.
25
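The two normalization steps can be sketched as follows; this assumes the attribute takes three states ordered fair < good < excellent, which is consistent with the normalized values 1, 0, 0.5, 1 shown above:

```python
def normalize_ordinal(values, order):
    """Map ordinal values to [0, 1] via z = (r - 1) / (M - 1), where r is the
    1-based rank of the value and M is the number of states."""
    rank = {state: i + 1 for i, state in enumerate(order)}  # Step 1: ranking
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]        # Step 2: normalize

# 'order' lists the states from lowest to highest rank (M = 3 states).
z = normalize_ordinal(["excellent", "fair", "good", "excellent"],
                      order=["fair", "good", "excellent"])
print(z)  # [1.0, 0.0, 0.5, 1.0]
```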
Attributes of Mixed Type

◼ A database may contain all attribute types
◼ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
◼ One may use a weighted formula to combine their effects:

    d(i, j) = Σf=1..p δij(f) dij(f) / Σf=1..p δij(f)

where the indicator δij(f) is 0 if the comparison cannot be made for attribute f
(e.g., a missing value, or a 0-0 match on an asymmetric binary attribute), and 1
otherwise.
◼ f is binary or nominal:
dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
◼ f is numeric: use the normalized distance
◼ f is ordinal:
◼ compute ranks rif and

    zif = (rif − 1) / (Mf − 1)

◼ treat zif as interval-scaled
26
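The weighted combination above can be sketched as follows; the per-attribute dissimilarities in the example are hypothetical (say, one nominal, one numeric, and one ordinal attribute, each already scaled to [0, 1]):

```python
def mixed_dissimilarity(per_attribute_d, indicators):
    """Weighted mixed-type dissimilarity:
    d(i, j) = sum(delta_f * d_f) / sum(delta_f), where d_f is the
    per-attribute dissimilarity (each in [0, 1]) and delta_f is 0 when the
    comparison for attribute f cannot be made, 1 otherwise."""
    num = sum(delta * d for delta, d in zip(indicators, per_attribute_d))
    den = sum(indicators)
    return num / den

# Hypothetical per-attribute dissimilarities; all comparisons valid (delta = 1).
print(mixed_dissimilarity([1.0, 0.25, 0.5], [1, 1, 1]))  # (1 + 0.25 + 0.5) / 3

# If the second comparison cannot be made, it is simply dropped:
print(mixed_dissimilarity([1.0, 0.25, 0.5], [1, 0, 1]))  # (1 + 0.5) / 2
```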
Cosine Similarity
◼ A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

◼ Applications: information retrieval, biologic taxonomy, gene feature mapping, ...


◼ Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

    cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

i.e., the dot product of d1 and d2 divided by the product of their magnitudes,
where • indicates the vector dot product and ||d|| is the length of vector d.

27
Cosine Similarity

◼ Reminder: dot product:

◼ Assume d1 = (1, 2), d2 = (0, 3)

◼ d1 • d2 = (1 × 0) + (2 × 3) = 6

◼ cos(d1, d2) = 6 / (sqrt(1² + 2²) × sqrt(0² + 3²)) = 6 / (sqrt(5) × 3) = 2/sqrt(5) ≈ 0.894

28
Example: Cosine Similarity

◼ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d.

◼ Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = (42)^0.5 = 6.481
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = sim(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
29
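The worked example above can be checked with a short sketch using the slide's two term-frequency vectors:

```python
from math import sqrt

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```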
Exercise

30
Exercise

31
Exercise

32
Thank You
