DM&DW Individual Assignment (50%)

This document outlines methods for computing dissimilarity between objects with different types of attributes. For nominal attributes, dissimilarity can be measured using simple matching coefficient, Jaccard coefficient, or Hamming distance. For asymmetric binary attributes, the Jaccard dissimilarity is used. For numeric attributes, common measures are Euclidean, Manhattan, and Minkowski distances. Term frequency vectors can use cosine similarity or Jaccard similarity to measure dissimilarity between text objects. The choice of dissimilarity measure depends on the attribute types and analysis requirements.

Uploaded by

abrham
1. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a. What is the mean of the data?
b. What is the median?
c. What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.).
d. What is the midrange of the data?
e. Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
f. Give the five-number summary of the data.
g. Show a boxplot of the data.
h. How is a quantile–quantile plot different from a quantile plot?
2. Suppose that the values for a given set of data are grouped into intervals. The intervals and
corresponding frequencies are as follows:
age       frequency
1-5       200
6-15      450
16-20     300
21-50     1500
51-80     700
81-110    44
3. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults
with the following results:
age   23    23    27    27    39    41    47    49    50
%fat  9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2
age   52    54    54    56    57    58    58    60    61
%fat  34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7

a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot and a q-q plot based on these two variables.
4. Briefly outline how to compute the dissimilarity between objects described by the following:
a. Nominal attributes
b. Asymmetric binary attributes
c. Numeric attributes
d. Term-frequency vectors
5. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
a. Compute the Euclidean distance between the two objects.
b. Compute the Manhattan distance between the two objects.
c. Compute the Minkowski distance between the two objects, using q = 3.
d. Compute the supremum distance between the two objects.
6. The median is one of the most important holistic measures in data analysis. Propose several
methods for median approximation. Analyze their respective complexity under different
parameter settings and decide to what extent the real value can be approximated. Moreover,
suggest a heuristic strategy to balance between accuracy and complexity and then apply it to
all methods you have given.
7. It is important to define or select similarity measures in data analysis. However, there is no
commonly accepted subjective similarity measure. Results can vary depending on the
similarity measures used. Nonetheless, seemingly different similarity measures may be
equivalent after some transformation. Suppose we have the following 2-D data set:

A1 A2
X1 1.5 1.7
X2 2 1.9
X3 1.6 1.8
X4 1.2 1.5
X5 1.5 1.0

a. Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6) as a query, rank
the database points based on similarity with the query using Euclidean distance, Manhattan
distance, supremum distance, and cosine similarity.
b. Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.

Submission Date: January 30, 2024.

To compute dissimilarity between objects described by different types of attributes,
various methods are used. Here's a brief outline for each type of attribute:
a. Nominal Attributes:

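Nominal attributes represent categories without any inherent order. To compute
dissimilarity between objects with nominal attributes:

1. Simple Matching Coefficient (SMC):
• Count the number of attributes where the values are the same.
• Divide this count by the total number of attributes. The result is a similarity;
the dissimilarity is 1 minus it, i.e. the number of mismatching attributes
divided by the total number of attributes.
2. Jaccard Coefficient:
• Encode each nominal value as a binary indicator (one 0/1 attribute per
category), then count the attributes where both objects have 1.
• Divide this count by the number of attributes where at least one object has 1.
This is also a similarity; the dissimilarity is 1 minus it.
3. Hamming Distance:
• Count the number of attributes where the values are different.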
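
As a rough illustration of the counting described above, the sketch below computes the Hamming distance and the mismatch ratio (1 minus the simple matching coefficient) for two objects. The function names and the example objects are hypothetical, chosen only for this illustration.

```python
def hamming_distance(a, b):
    """Number of attributes on which two equal-length objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x != y)


def nominal_dissimilarity(a, b):
    """Mismatch ratio: mismatching attributes / total attributes.

    Equal to 1 minus the simple matching coefficient.
    """
    return hamming_distance(a, b) / len(a)


# Hypothetical objects described by three nominal attributes
# (colour, shape, size); the values are made up for the example.
obj1 = ("red", "circle", "small")
obj2 = ("red", "square", "large")

mismatches = hamming_distance(obj1, obj2)          # shape and size differ
dissimilarity = nominal_dissimilarity(obj1, obj2)  # 2 of 3 attributes differ
```

Here two of the three attributes differ, so the Hamming distance is 2 and the dissimilarity is 2/3.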

b. Asymmetric Binary Attributes:

Asymmetric binary attributes have the values 0 and 1, but the two states are not
equally important: a 1 (for example, a positive test result) is usually the rarer and
more informative outcome, so agreement on 0 is ignored. Dissimilarity can be computed using:

1. Jaccard Dissimilarity for Asymmetric Binary Data:
• Count the number of attributes where one object has 1 and the other has
0.
• Divide this count by the number of attributes where at least one of the two
objects has 1.
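
A minimal sketch of this calculation follows. The function name, the example vectors, and the choice to return 0.0 when neither object has any 1 are assumptions made for the illustration, not part of the assignment.

```python
def jaccard_dissimilarity(a, b):
    """Jaccard dissimilarity for two equal-length 0/1 vectors.

    Attributes where both objects are 0 are ignored.
    """
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    both_one = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    differ = sum(1 for x, y in zip(a, b) if x != y)
    at_least_one = both_one + differ
    if at_least_one == 0:
        # Assumption: two all-zero objects are treated as identical.
        return 0.0
    return differ / at_least_one


# Hypothetical test results for two patients (1 = positive, 0 = negative).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 1, 0, 0, 0, 0]

d = jaccard_dissimilarity(patient_a, patient_b)
```

In this example one attribute is 1 for both patients and two attributes differ, so the dissimilarity is 2/3; the three attributes where both are 0 do not enter the calculation.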

c. Numeric Attributes:

Numeric attributes represent quantitative values. For dissimilarity between objects with
numeric attributes:

1. Euclidean Distance:
• Calculate the square root of the sum of squared differences between
corresponding attribute values.
2. Manhattan Distance (L1 norm):
• Sum the absolute differences between corresponding attribute values.
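3. Minkowski Distance:
• Generalization of Euclidean and Manhattan distances. Raise each absolute
difference to the power "p" (written q in question 5c), sum the results, and
take the p-th root of the sum. With p = 1 this is the Manhattan distance and
with p = 2 the Euclidean distance; larger p puts more emphasis on the largest
differences.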
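
The three distances can be sketched with a single function, shown here on the two tuples from question 5. The function names are hypothetical, and the supremum distance (the largest absolute difference, which is what the Minkowski distance approaches as the exponent grows) is included because question 5d asks for it.

```python
def minkowski(a, b, p):
    """Minkowski distance: p-th root of the sum of |x - y| ** p."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def supremum(a, b):
    """Supremum distance: the largest absolute attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# The two objects from question 5; absolute differences are 2, 1, 6, 2.
x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

manhattan = minkowski(x, y, 1)   # 2 + 1 + 6 + 2 = 11
euclidean = minkowski(x, y, 2)   # sqrt(4 + 1 + 36 + 4) = sqrt(45), about 6.71
minkowski3 = minkowski(x, y, 3)  # (8 + 1 + 216 + 8) ** (1/3), about 6.15
sup = supremum(x, y)             # max(2, 1, 6, 2) = 6
```

The values move from 11 toward 6 as the exponent increases, which shows how larger exponents let the largest single difference dominate.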

d. Term-Frequency Vectors:
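Term-frequency vectors are commonly used in text data analysis. For dissimilarity
between objects represented by term-frequency vectors:

1. Cosine Similarity:
• Compute the dot product of the vectors.
• Normalize by the product of the magnitudes of the vectors. Because this is a
similarity, 1 minus the cosine similarity can be used as the dissimilarity.
2. Jaccard Similarity:
• Compute the size of the intersection divided by the size of the union of
non-zero elements in the vectors, i.e. of the sets of terms that occur in each
document. Again, 1 minus this ratio gives the dissimilarity.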
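
Both measures can be sketched as below. The function names and the two term-frequency vectors are hypothetical, and treating two empty documents as fully similar is an assumption made only so the function always returns a value.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """|intersection| / |union| of the positions with non-zero frequency."""
    terms_a = {i for i, x in enumerate(a) if x != 0}
    terms_b = {i for i, y in enumerate(b) if y != 0}
    union = terms_a | terms_b
    if not union:
        # Assumption: two documents with no terms are treated as identical.
        return 1.0
    return len(terms_a & terms_b) / len(union)


# Hypothetical term-frequency vectors over the same ten-term vocabulary.
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

cos_sim = cosine_similarity(doc1, doc2)   # 25 / (sqrt(42) * sqrt(17)), about 0.94
jac_sim = jaccard_similarity(doc1, doc2)  # 4 shared terms out of 6 used, about 0.67
cos_dissim = 1 - cos_sim
jac_dissim = 1 - jac_sim
```

Cosine similarity takes the frequencies into account, while the Jaccard version only looks at which terms occur at all, which is why the two scores differ for the same pair of documents.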

These methods provide ways to measure dissimilarity based on the nature of the
attributes and the data representation. The choice of dissimilarity measure often
depends on the characteristics of the data and the specific requirements of the analysis.
