DM&DW Individual Assignment (50%)

This document outlines methods for computing dissimilarity between objects with different types of attributes. For nominal attributes, dissimilarity can be measured using simple matching coefficient, Jaccard coefficient, or Hamming distance. For asymmetric binary attributes, the Jaccard dissimilarity is used. For numeric attributes, common measures are Euclidean, Manhattan, and Minkowski distances. Term frequency vectors can use cosine similarity or Jaccard similarity to measure dissimilarity between text objects. The choice of dissimilarity measure depends on the attribute types and analysis requirements.

Uploaded by

abrham


1. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a. What is the mean of the data?
b. What is the median?
c. What is the mode of the data? Comment on the data's modality (i.e.,
bimodal, trimodal, etc.).
d. What is the midrange of the data?
e. Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
f. Give the five-number summary of the data.
g. Show a boxplot of the data.
h. How is a quantile–quantile plot different from a quantile plot?
2. Suppose that the values for a given set of data are grouped into intervals. The intervals and
corresponding frequencies are as follows:
age frequency
1-5 200
6-15 450
16-20 300
21-50 1500
51-80 700
81-110 44
3. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults
with the following results:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot and a q-q plot based on these two variables.
4. Briefly outline how to compute the dissimilarity between objects described by the following:
a. Nominal attributes
b. Asymmetric binary attributes
c. Numeric attributes
d. Term-frequency vectors
5. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
a. Compute the Euclidean distance between the two objects.
b. Compute the Manhattan distance between the two objects.
c. Compute the Minkowski distance between the two objects, using q = 3.
d. Compute the supremum distance between the two objects.
6. The median is one of the most important holistic measures in data analysis. Propose several
methods for median approximation. Analyze their respective complexity under different
parameter settings and decide to what extent the real value can be approximated. Moreover,
suggest a heuristic strategy to balance between accuracy and complexity and then apply it to
all methods you have given.
7. It is important to define or select similarity measures in data analysis. However, there is no
commonly accepted subjective similarity measure. Results can vary depending on the
similarity measures used. Nonetheless, seemingly different similarity measures may be
equivalent after some transformation. Suppose we have the following 2-D data set:

A1 A2
X1 1.5 1.7
X2 2 1.9
X3 1.6 1.8
X4 1.2 1.5
X5 1.5 1.0

a. Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6) as a query, rank
the database points based on similarity with the query using Euclidean distance, Manhattan
distance, supremum distance, and cosine similarity.
b. Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.

Submission Date: January 30, 2024.

To compute dissimilarity between objects described by different types of attributes,
various methods are used. Here's a brief outline for each type of attribute:

a. Nominal Attributes:

Nominal attributes represent categories without any inherent order. To compute
dissimilarity between objects with nominal attributes:

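1. Simple Matching Coefficient (SMC):
• Count the number of attributes where the values are the same (m).
• Divide this count by the total number of attributes (p). This ratio m/p is a
similarity; the dissimilarity is 1 - m/p = (p - m)/p.
2. Jaccard Coefficient:
• Applies once each nominal value has been encoded as a binary attribute. Count
the number of attributes where both objects have non-zero values.
• Divide this count by the number of attributes where at least one object has a
non-zero value. The dissimilarity is 1 minus this ratio.
3. Hamming Distance:
• Count the number of attributes where the values are different.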
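
The matching-based measures above can be sketched in a few lines of Python. This is only an illustration: the function names and the two example tuples are made up for this sketch and do not come from the assignment.

```python
def hamming_distance(x, y):
    """Number of attributes on which two equal-length tuples disagree."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for a, b in zip(x, y) if a != b)


def simple_matching_dissimilarity(x, y):
    """(p - m) / p, where p is the attribute count and m the number of matches."""
    return hamming_distance(x, y) / len(x)


# Hypothetical objects described by three nominal attributes.
a = ("red", "small", "round")
b = ("red", "large", "round")

mismatches = hamming_distance(a, b)                # 1 attribute differs
dissimilarity = simple_matching_dissimilarity(a, b)  # 1/3
similarity = 1 - dissimilarity                      # 2/3, the simple matching coefficient
```

Dividing the Hamming distance by the number of attributes gives exactly the simple-matching dissimilarity, which is why one function is built on the other.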

b. Asymmetric Binary Attributes:

Asymmetric binary attributes take the values 0 and 1, but the two states are not
equally important: a 1 (for example, a positive test result) carries more information
than a 0, so matches on 0 are ignored. Dissimilarity can be computed using:

1. Jaccard Dissimilarity for Asymmetric Binary Data:

• Count the number of attributes where one object has 1 and the other has
0.
• Divide this count by the number of attributes where at least one of the
objects has 1.
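
A minimal sketch of this computation follows. The counts q, r and s are naming conventions chosen for this example, the two binary tuples are hypothetical, and returning 0.0 when neither object has any 1 is an assumption rather than part of the definition.

```python
def jaccard_dissimilarity(x, y):
    """Asymmetric binary dissimilarity (r + s) / (q + r + s); 0/0 matches are ignored."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both are 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (r + s) / (q + r + s)


# Hypothetical objects with six asymmetric binary attributes each.
u = (1, 0, 1, 0, 0, 0)
v = (1, 0, 1, 0, 1, 0)

d = jaccard_dissimilarity(u, v)  # q = 2, r = 0, s = 1, so d = 1/3
```

The three positions where both objects are 0 do not enter the result at all, which is the practical difference from the simple matching approach.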

c. Numeric Attributes:

Numeric attributes represent quantitative values. For dissimilarity between objects with
numeric attributes:

1. Euclidean Distance:
• Calculate the square root of the sum of squared differences between
corresponding attribute values.
2. Manhattan Distance (L1 norm):
• Sum the absolute differences between corresponding attribute values.
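3. Minkowski Distance:
• Generalization of Euclidean and Manhattan distances: take the p-th root of the
sum of the absolute differences raised to the power p. The parameter "p" (written
q in Question 5c) controls how strongly large differences are emphasized: p = 1
gives the Manhattan distance, p = 2 gives the Euclidean distance, and as p grows
the result approaches the supremum distance (the largest single difference).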
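
As an illustration, the sketch below applies these distances to the two tuples from Question 5. The function names are chosen for this example only, and the supremum distance is included because Question 5d asks for it.

```python
def minkowski_distance(x, y, p):
    """p-th root of the sum of |x_i - y_i|^p; p=1 is Manhattan, p=2 is Euclidean."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def supremum_distance(x, y):
    """Largest absolute difference over all attributes (the limit as p grows)."""
    return max(abs(a - b) for a, b in zip(x, y))


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)
# Attribute-wise absolute differences are 2, 1, 6, 2.

euclidean = minkowski_distance(x, y, 2)   # sqrt(4 + 1 + 36 + 4) = sqrt(45), about 6.708
manhattan = minkowski_distance(x, y, 1)   # 2 + 1 + 6 + 2 = 11
minkowski3 = minkowski_distance(x, y, 3)  # (8 + 1 + 216 + 8) ** (1/3), about 6.153
supremum = supremum_distance(x, y)        # 6
```

As p increases from 1 to 3, the result moves from 11 toward the supremum value of 6, because the largest difference dominates the sum more and more.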

d. Term-Frequency Vectors:
Term-frequency vectors are commonly used in text data analysis. For dissimilarity
between objects represented by term-frequency vectors:

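1. Cosine Similarity:
• Compute the dot product of the vectors.
• Normalize by the product of the magnitudes of the vectors.
• Use 1 minus the cosine similarity as the dissimilarity.
2. Jaccard Similarity:
• Compute the size of the intersection divided by the size of the union of
the sets of terms with non-zero frequency in the two vectors.
• Use 1 minus the Jaccard similarity as the dissimilarity.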
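
Both measures can be sketched as follows. The two term-frequency vectors are hypothetical values picked for illustration, the function names are assumptions of this sketch, and converting a similarity to a dissimilarity as 1 minus the similarity is one common convention.

```python
import math


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def jaccard_similarity(x, y):
    """|intersection| / |union| of the sets of terms with non-zero frequency."""
    terms_x = {i for i, a in enumerate(x) if a != 0}
    terms_y = {i for i, b in enumerate(y) if b != 0}
    union = terms_x | terms_y
    if not union:
        raise ValueError("Jaccard similarity is undefined for two zero vectors")
    return len(terms_x & terms_y) / len(union)


# Hypothetical term-frequency vectors over a ten-term vocabulary.
doc1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

cos_sim = cosine_similarity(doc1, doc2)   # 25 / sqrt(42 * 17), about 0.936
cos_dissim = 1 - cos_sim
jac_sim = jaccard_similarity(doc1, doc2)  # 4 shared terms out of 6, about 0.667
jac_dissim = 1 - jac_sim
```

Cosine similarity uses the actual frequencies, while the Jaccard variant here only looks at which terms occur, so the two can rank documents differently.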

These methods provide ways to measure dissimilarity based on the nature of the
attributes and the data representation. The choice of dissimilarity measure often
depends on the characteristics of the data and the specific requirements of the analysis.
