
18CSE397T - Computational Data Analysis Unit - 3: Session - 8: SLO - 1

The document discusses different proximity measures used to calculate similarity and dissimilarity between data points for clustering algorithms. It describes how dissimilarity is calculated for nominal attributes by taking the ratio of attribute mismatches between two points. For ordinal attributes, values are first converted to numbers representing their order, then normalized to a range before calculating dissimilarity based on differences in normalized values. Proximity measures allow quantifying how alike or different objects are, which is important for clustering algorithms to group similar objects together.

Uploaded by

HoShang PAtel
Copyright
© All Rights Reserved

18CSE397T– Computational Data Analysis

Unit – 3: Session – 8 : SLO – 1


DISSIMILARITIES BASED ON ATTRIBUTES

Introduction

Data mining is the process of finding interesting patterns in large quantities of
data. When implementing clustering algorithms, it is important to be able to
quantify the proximity of objects to one another. Proximity measures are
mathematical techniques that calculate the similarity or dissimilarity of
data points, i.e., how alike or how different objects are.

Real-Life Example Use-case: Predicting COVID-19 patients on the basis of their symptoms

With the rise of COVID-19 cases, many people are unable to seek proper medical
advice because of a shortage of both staff and infrastructure. As engineers, we
can contribute by providing a basic diagnosis that helps identify people
suffering from COVID-19. Machine Learning algorithms can ease this task, and
clustering algorithms are particularly handy.

For this, we form two clusters from the symptoms of patients who are COVID-19
positive and negative, and then predict whether a new incoming patient is
suffering from COVID-19 by measuring the similarity/dissimilarity of the
observed symptoms (features) to those of infected patients.

Proximity measures are different for different types of attributes.

Similarity measure:

– Numerical measure of how alike two data objects are.

– Higher when objects are more alike.

– Often falls in the range [0,1].

Dissimilarity measure:

– Numerical measure of how different two data objects are.

– Lower when objects are more alike.

– Minimum dissimilarity is often 0.

– Upper limit varies.


Dissimilarity Matrix

A dissimilarity matrix is a matrix of pairwise dissimilarities among the data
points. It has two properties:

1. It is square and symmetric (A^T = A, where A^T denotes the transpose of the
square matrix A).

2. Its diagonal entries are zero, because the dissimilarity between an element
and itself is zero.

Because of the symmetry, it is often desirable to keep only the lower triangle
or the upper triangle of a dissimilarity matrix to reduce space and time
complexity.
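
The two properties can be illustrated with a minimal Python sketch that stores only the lower triangle and reads any pair back from it. The function names `lower_triangle` and `lookup`, the sample values, and the absolute-difference measure are assumptions made for illustration only; they do not come from these notes.

```python
def lower_triangle(points, dissim):
    """Return rows [d(i,0), ..., d(i,i)]: only the lower triangle is stored."""
    return [[dissim(points[i], points[j]) for j in range(i + 1)]
            for i in range(len(points))]


def lookup(tri, i, j):
    """Read d(i, j) from lower-triangle storage, relying on d(i, j) = d(j, i)."""
    return tri[i][j] if i >= j else tri[j][i]


# Illustrative data and measure (assumed, not taken from the notes).
values = [2.0, 5.0, 9.0]
tri = lower_triangle(values, lambda a, b: abs(a - b))

for row in tri:
    print(row)
```

For n objects this keeps n(n+1)/2 entries instead of n^2, and `lookup(tri, 0, 2)` returns the same value as `lookup(tri, 2, 0)` because of the symmetry.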

Proximity measures for Nominal Attributes

Nominal attributes can have two or more different states, e.g. an attribute
‘color’ can have values like ‘Red’, ‘Green’, ‘Yellow’, ‘Blue’, etc. Dissimilarity
for nominal attributes is calculated as the ratio of the number of mismatches
between two data points to the total number of attributes.

Nominal means “relating to names.” The values of a nominal attribute are
symbols or names of things. Each value represents some kind of category, code,
or state, and so nominal attributes are also referred to as categorical.

Examples: ID numbers, eye color, zip codes.

Let M be the total number of states of a nominal attribute. Then the states can
be numbered from 1 to M. However, the numbering does not denote any kind
of ordering and cannot be used for any mathematical operations.

Let m be the number of attributes on which two data points i and j match, and
let p be the total number of attributes. Then the dissimilarity can be
calculated as,

    d(i, j) = (p - m) / p

We can calculate similarity as,

    s(i, j) = 1 - d(i, j)

EXAMPLE,

    Roll No    Marks    Grades
    1          96       A
    2          87       B
    3          83       B
    4          96       A

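In this example we have four objects, identified by Roll No 1 to 4. Each object
is described by two attributes (Marks and Grades), both compared as nominal
values here, so p = 2.

Now, we apply the formula described above to find the proximity for nominal
attributes: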

– d(1,1) = (p-m)/p = (2-2)/2 = 0

– d(2,1) = (p-m)/p = (2-0)/2 = 1

– d(2,2) = (p-m)/p = (2-2)/2 = 0

– d(3,1) = (p-m)/p = (2-0)/2 = 1

– d(3,2) = (p-m)/p = (2-1)/2 = 0.5

– d(3,3) = (p-m)/p = (2-2)/2 = 0

– d(4,1) = (p-m)/p = (2-2)/2 = 0

– d(4,2) = (p-m)/p = (2-0)/2 = 1

– d(4,3) = (p-m)/p = (2-0)/2 = 1

– d(4,4) = (p-m)/p = (2-2)/2 = 0

– As seen from the calculation, the dissimilarity between an object and itself
is 0, so its similarity with itself is 1 - 0 = 1, which seems intuitively correct.
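
The calculation above can be reproduced with a short Python sketch. The function name `nominal_dissimilarity` and the dictionary layout are assumptions chosen for illustration; the data is the Roll No table, with Marks and Grades both compared as nominal values, as the calculation above does.

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p, with m matching attributes out of p in total."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


# Objects from the table: roll number -> (Marks, Grades).
objects = {1: (96, "A"), 2: (87, "B"), 3: (83, "B"), 4: (96, "A")}

# Lower triangle of the dissimilarity matrix, keyed by (i, j) with i >= j.
d = {(i, j): nominal_dissimilarity(objects[i], objects[j])
     for i in objects for j in objects if i >= j}

# Similarity follows from s(i, j) = 1 - d(i, j).
s = {pair: 1 - value for pair, value in d.items()}

for i in objects:
    print([d[(i, j)] for j in objects if j <= i])
```

Run as written, this prints the rows `[0.0]`, `[1.0, 0.0]`, `[1.0, 0.5, 0.0]` and `[0.0, 1.0, 1.0, 0.0]`, i.e. the lower triangle of the dissimilarity matrix for the four objects.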

Proximity measures for Ordinal Attributes

An ordinal attribute is an attribute whose possible values have a meaningful
order or ranking among them, but the magnitude between successive values is
not known. To compute dissimilarity, the states are first converted to
numbers: each state of an ordinal attribute is assigned a rank corresponding
to its position in the order of attribute values.

Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height {tall, medium, short}.

Since the number of states can differ from one ordinal attribute to another, the
ranks must be scaled to a common range, e.g. [0,1]. This can be done using
the formula,

    z_if = (r_if - 1) / (M_f - 1)

where M_f is the number of states of attribute f (the largest rank assigned)
and r_if is the rank (numeric value) of object i on attribute f.

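The dissimilarity is the absolute difference of the normalized values,

    d(i, j) = |z_if - z_jf|

and the similarity can be calculated as:

    s(i, j) = 1 - d(i, j)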
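
To see why the scaling step matters, the sketch below applies the normalization formula above to two attributes with different numbers of states. The function name `normalize_rank` and the 5-state attribute are hypothetical and used only for illustration.

```python
def normalize_rank(rank, num_states):
    """z = (r - 1) / (M - 1): map a rank in 1..M onto the range [0, 1]."""
    if num_states < 2:
        raise ValueError("an ordinal attribute needs at least two states")
    if not 1 <= rank <= num_states:
        raise ValueError("rank must lie between 1 and the number of states")
    return (rank - 1) / (num_states - 1)


# A 3-state attribute and a hypothetical 5-state attribute end up on one scale.
three_state = [normalize_rank(r, 3) for r in range(1, 4)]
five_state = [normalize_rank(r, 5) for r in range(1, 6)]

print(three_state)
print(five_state)
```

In both cases the lowest rank maps to 0 and the highest to 1, so differences taken on the normalized values are comparable across attributes.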

EXAMPLE,

    Object ID    Attribute
    1            High
    2            Low
    3            Medium
    4            High

In this example, we have four objects having ID from 1 to 4.

To encode the attribute column, we take High=1, Medium=2, and Low=3, so
M_f = 3 (since there are three states available).

Now, we normalize the ranks to the range 0 to 1 using the above formula.

So,  High=(1-1)/(3-1)=0,  Medium=(2-1)/(3-1)=0.5,  Low=(3-1)/(3-1)=1.

Finally, we calculate the dissimilarity as the absolute difference between the
normalized values of the two objects for that attribute.

– d(1,1) = |0-0| = 0

– d(2,1) = |1-0| = 1

– d(2,2) = |1-1| = 0

– d(3,1) = |0.5-0| = 0.5

– d(3,2) = |0.5-1| = 0.5

– d(3,3) = |0.5-0.5| = 0

– d(4,1) = |0-0| = 0

– d(4,2) = |0-1| = 1

– d(4,3) = |0-0.5| = 0.5

– d(4,4) = |0-0| = 0
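
The ordinal example can be checked with a small Python sketch. The names `RANK`, `M_F`, `z` and `d` are assumptions chosen for illustration; the encoding and the data are those of the example, and the dissimilarity is taken as the absolute difference of the normalized values.

```python
# Rank encoding used in the example: High=1, Medium=2, Low=3, so M_f = 3.
RANK = {"High": 1, "Medium": 2, "Low": 3}
M_F = len(RANK)

# Objects from the table: object ID -> attribute value.
objects = {1: "High", 2: "Low", 3: "Medium", 4: "High"}

# Normalize each object's rank to [0, 1] with z = (r - 1) / (M_f - 1).
z = {i: (RANK[state] - 1) / (M_F - 1) for i, state in objects.items()}

# Lower triangle of the dissimilarity matrix: d(i, j) = |z_i - z_j| for i >= j.
d = {(i, j): abs(z[i] - z[j]) for i in objects for j in objects if i >= j}

for i in objects:
    print([d[(i, j)] for j in objects if j <= i])
```

Run as written, this prints the rows `[0.0]`, `[1.0, 0.0]`, `[0.5, 0.5, 0.0]` and `[0.0, 1.0, 0.5, 0.0]`.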
