18CSE397T - Computational Data Analysis Unit - 3: Session - 8: SLO - 1

The document discusses different proximity measures used to calculate similarity and dissimilarity between data points for clustering algorithms. It describes how dissimilarity is calculated for nominal attributes by taking the ratio of attribute mismatches between two points. For ordinal attributes, values are first converted to numbers representing their order, then normalized to a range before calculating dissimilarity based on differences in normalized values. Proximity measures allow quantifying how alike or different objects are, which is important for clustering algorithms to group similar objects together.

Uploaded by

HoShang PAtel


18CSE397T– Computational Data Analysis

Unit – 3: Session – 8 : SLO – 1

DISSIMILARITIES BASED ON ATTRIBUTES

Introduction

Data mining is the process of finding interesting patterns in large quantities
of data. When implementing clustering algorithms, it is important to be able to
quantify the proximity of objects to one another. Proximity measures are
mathematical techniques that calculate the similarity or dissimilarity of data
points, i.e., how alike or how different objects are from one another.

Real-Life Example Use Case: Predicting COVID-19 patients on the basis of their symptoms

With the rise of COVID-19 cases, many people are unable to seek proper medical
advice because of a shortage of both staff and infrastructure. As engineers, we
can contribute by providing a basic diagnosis that helps identify people
suffering from COVID-19. Machine Learning algorithms can ease this task, and
clustering algorithms are particularly handy here.

For this, we form two clusters from the symptoms of patients who are COVID-19
positive or negative. We then predict whether a new incoming patient is
suffering from COVID-19 by measuring the similarity/dissimilarity of the
observed symptoms (features) with the symptoms of infected persons.

Proximity measures are different for different types of attributes.

Similarity measure:

– Numerical measure of how alike two data objects are.

– Is higher when objects are more alike.

– Often falls in the range [0,1].

Dissimilarity measure:

– Numerical measure of how different two data objects are.

– Is lower when objects are more alike.

– Minimum dissimilarity is often 0.

– Upper limit varies.

Dissimilarity Matrix

A dissimilarity matrix is a matrix of the pairwise dissimilarities among the
data points. It has two properties:

1. It is square and symmetric (A^T = A for a square matrix A, where A^T
represents its transpose).

2. The diagonal members are zero, meaning that the dissimilarity between an
element and itself is zero.

Because of this symmetry, it is often desirable to keep only the lower triangle
or the upper triangle of a dissimilarity matrix to reduce the space and time
complexity.
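As a minimal sketch of the triangle idea, the Python below stores only the lower triangle of a dissimilarity matrix and reads any entry back through symmetry. The function names `dissimilarity_matrix` and `lookup`, the sample points, and the absolute-difference measure are illustrative assumptions, not part of the course material.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def dissimilarity_matrix(
    objects: Sequence[T], dist: Callable[[T, T], float]
) -> List[List[float]]:
    """Return the lower triangle (diagonal included) of the pairwise matrix.

    Row i holds d(i, 0), d(i, 1), ..., d(i, i).
    """
    return [
        [dist(objects[i], objects[j]) for j in range(i + 1)]
        for i in range(len(objects))
    ]


def lookup(tri: List[List[float]], i: int, j: int) -> float:
    """Read d(i, j) from the lower triangle, using symmetry when i < j."""
    return tri[i][j] if i >= j else tri[j][i]


# Hypothetical one-dimensional points with absolute difference as the measure.
points = [1.0, 4.0, 6.0]
tri = dissimilarity_matrix(points, lambda a, b: abs(a - b))

for row in tri:
    print(row)
```

For n objects this keeps n(n+1)/2 values instead of n^2, and every diagonal entry is zero as stated in property 2.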

Proximity measures for Nominal Attributes

Nominal attributes can have two or more different states e.g. an attribute
‘color’ can have values like ‘Red’, ‘Green’, ‘Yellow’, ‘Blue’, etc. Dissimilarity for
nominal attributes is calculated as the ratio of total number of mismatches
between two data points to the total number of attributes.

Nominal means “relating to names.” The values of a nominal attribute are
symbols or names of things. Each value represents some kind of category, code,
or state, and so nominal attributes are also referred to as categorical.

Examples: ID numbers, eye color, zip codes.

Let M be the total number of states of a nominal attribute. Then the states can
be numbered from 1 to M. However, the numbering does not denote any kind of
ordering and cannot be used for any mathematical operations.
Let m be the number of attributes on which two objects i and j match, and let p
be the total number of attributes. Then the dissimilarity can be calculated as,

d(i, j)=(p-m)/p

We can calculate similarity as,

s(i, j)=1-d(i, j)
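The two formulas above can be sketched in a few lines of Python. The function names and the sample attribute values (colour and shape) are hypothetical and only serve to illustrate the mismatch ratio.

```python
from typing import Hashable, Sequence


def nominal_dissimilarity(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """d(i, j) = (p - m) / p for two objects described by p nominal attributes."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("objects must have the same, non-zero number of attributes")
    p = len(x)                                   # total number of attributes
    m = sum(1 for a, b in zip(x, y) if a == b)   # number of matching attributes
    return (p - m) / p


def nominal_similarity(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """s(i, j) = 1 - d(i, j)."""
    return 1 - nominal_dissimilarity(x, y)


# Hypothetical objects with two nominal attributes: colour and shape.
a = ("Red", "Round")
b = ("Red", "Square")
print(nominal_dissimilarity(a, b), nominal_similarity(a, b))
```

Values are compared only for equality, which matches the point that the state numbering carries no order.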

EXAMPLE

Roll No    Marks    Grades
1          96       A
2          87       B
3          83       B
4          96       A

In this example we have four objects, with Roll No from 1 to 4. Marks and
Grades are both treated as nominal attributes, so p = 2.

Now, we apply the formula (described above) for finding the proximity of
nominal attributes:

– d(1,1) = (p-m)/p = (2-2)/2 = 0

– d(2,1) = (p-m)/p = (2-0)/2 = 1

– d(3,1) = (p-m)/p = (2-0)/2 = 1

– d(4,1) = (p-m)/p = (2-2)/2 = 0

– d(2,2) = (p-m)/p = (2-2)/2 = 0

– d(3,2) = (p-m)/p = (2-1)/2 = 0.5

– d(4,2) = (p-m)/p = (2-0)/2 = 1

– d(3,3) = (p-m)/p = (2-2)/2 = 0

– d(4,3) = (p-m)/p = (2-0)/2 = 1

– d(4,4) = (p-m)/p = (2-2)/2 = 0

– As seen from the calculation, the dissimilarity between an object and itself
is 0, so its similarity with itself is 1, which seems intuitively correct.

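The pairwise values for this table can be checked with a short Python sketch. It assumes, as the example does, that both Marks and Grades are compared only for equality; the variable and function names are illustrative.

```python
# Roll No -> (Marks, Grades), copied from the example table.
rows = {1: (96, "A"), 2: (87, "B"), 3: (83, "B"), 4: (96, "A")}


def d(i: int, j: int) -> float:
    """Nominal dissimilarity (p - m) / p between the objects with Roll No i and j."""
    x, y = rows[i], rows[j]
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


# Lower triangle of the dissimilarity matrix, one row per object.
matrix = {i: [d(i, j) for j in rows if j <= i] for i in rows}
for i, values in matrix.items():
    print(i, values)
```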
Proximity measures for Ordinal Attributes

An ordinal attribute is an attribute whose possible values have a meaningful
order or ranking among them, but the magnitude between successive values is
not known. To measure dissimilarity between ordinal values, the states are
first converted to numbers: each state of an ordinal attribute is assigned a
rank corresponding to its position in the order of attribute values.

Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height {tall, medium, short}.

Since the number of states can be different for different ordinal attributes,
the ranks must be scaled to a common range, e.g. [0,1]. This can be done using
the given formula,

z_{if} = (r_{if} - 1) / (M_f - 1)

where M_f is the number of states of attribute f (the maximum rank assigned)
and r_{if} is the rank (numeric value) of object i on attribute f.
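A minimal Python sketch of this scaling step follows. The function name `normalize_rank` is an assumption for illustration, and the guard for attributes with fewer than two states is an added safeguard against division by zero rather than something the notes specify.

```python
def normalize_rank(r: int, m_f: int) -> float:
    """Scale rank r in 1..m_f to [0, 1] using (r - 1) / (m_f - 1)."""
    if m_f < 2:
        raise ValueError("an ordinal attribute needs at least two states")
    if not 1 <= r <= m_f:
        raise ValueError("rank must lie between 1 and m_f")
    return (r - 1) / (m_f - 1)


# An attribute with three states maps ranks 1, 2, 3 to 0.0, 0.5, 1.0.
print([normalize_rank(r, 3) for r in (1, 2, 3)])
```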

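The dissimilarity between objects i and j on attribute f is the absolute
difference of their normalized values,

d(i, j) = |z_{if} - z_{jf}|

The similarity can be calculated as:

s(i, j)=1-d(i, j)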

EXAMPLE

Object ID    Attribute
1            High
2            Low
3            Medium
4            High

In this example, we have four objects with IDs from 1 to 4.

To encode the attribute column, we consider High=1, Medium=2, and Low=3. The
value of M_f is 3, since there are three states available.

Now, we normalize the ranks to the range 0 to 1 using the above formula.

So, High=(1-1)/(3-1)=0, Medium=(2-1)/(3-1)=0.5, Low=(3-1)/(3-1)=1.

Finally, we calculate the dissimilarity as the absolute difference between the
normalized values of the two objects for that attribute. The normalized values
are 0 for objects 1 and 4 (High), 1 for object 2 (Low), and 0.5 for object 3
(Medium).

– d(1,1) = |0-0| = 0

– d(2,1) = |1-0| = 1

– d(3,1) = |0.5-0| = 0.5

– d(4,1) = |0-0| = 0

– d(2,2) = |1-1| = 0

– d(3,2) = |0.5-1| = 0.5

– d(4,2) = |0-1| = 1

– d(3,3) = |0.5-0.5| = 0

– d(4,3) = |0-0.5| = 0.5

– d(4,4) = |0-0| = 0
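The ordinal example can be reproduced with the Python sketch below. It uses the encoding from the example (High=1, Medium=2, Low=3) and takes the dissimilarity as the absolute difference of the normalized values; the names `RANK`, `z` and `ordinal_dissimilarity` are illustrative assumptions.

```python
# Rank encoding from the example: High=1, Medium=2, Low=3.
RANK = {"High": 1, "Medium": 2, "Low": 3}
M_F = len(RANK)  # number of states of the attribute

# Object ID -> attribute value, copied from the example table.
objects = {1: "High", 2: "Low", 3: "Medium", 4: "High"}

# Normalized values z = (r - 1) / (M_f - 1), each in [0, 1].
z = {oid: (RANK[value] - 1) / (M_F - 1) for oid, value in objects.items()}


def ordinal_dissimilarity(i: int, j: int) -> float:
    """Absolute difference of the normalized ranks of objects i and j."""
    return abs(z[i] - z[j])


# Lower triangle of the dissimilarity matrix, one row per object.
for i in objects:
    print(i, [ordinal_dissimilarity(i, j) for j in objects if j <= i])
```

Taking the absolute value keeps the measure symmetric, so d(i, j) and d(j, i) agree regardless of which object has the higher rank.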
