DM&DW Individual Assignment (50%)
1. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a. What is the mean of the data?
b. What is the median?
c. What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.).
d. What is the midrange of the data?
e. Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
f. Give the five-number summary of the data.
g. Show a boxplot of the data.
h. How is a quantile–quantile plot different from a quantile plot?
2. Suppose that the values for a given set of data are grouped into intervals. The intervals and
corresponding frequencies are as follows:
age frequency
1-5 200
6-15 450
16-20 300
21-50 1500
51-80 700
81-110 44
3. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults
with the following results:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot and a q-q plot based on these two variables.
4. Briefly outline how to compute the dissimilarity between objects described by the following:
a. Nominal attributes
b. Asymmetric binary attributes
c. Numeric attributes
d. Term-frequency vectors
5. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
a. Compute the Euclidean distance between the two objects.
b. Compute the Manhattan distance between the two objects.
c. Compute the Minkowski distance between the two objects, using q = 3.
d. Compute the supremum distance between the two objects.
6. The median is one of the most important holistic measures in data analysis. Propose several
methods for median approximation. Analyze their respective complexity under different
parameter settings and decide to what extent the real value can be approximated. Moreover,
suggest a heuristic strategy to balance between accuracy and complexity and then apply it to
all methods you have given.
7. It is important to define or select similarity measures in data analysis. However, there is no
commonly accepted subjective similarity measure. Results can vary depending on the
similarity measures used. Nonetheless, seemingly different similarity measures may be
equivalent after some transformation. Suppose we have the following 2-D data set:
A1 A2
X1 1.5 1.7
X2 2 1.9
X3 1.6 1.8
X4 1.2 1.5
X5 1.5 1.0
a. Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6) as a query, rank
the database points based on similarity with the query using Euclidean distance, Manhattan
distance, supremum distance, and cosine similarity.
b. Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.
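b. Asymmetric Binary Attributes:
Asymmetric binary attributes take the values 0 and 1, but the two states are not equally
important: a 1 (for example, a positive test result) carries more information than a 0.
Dissimilarity is therefore computed from the mismatches while ignoring the 0-0 matches:
d(i, j) = (r + s) / (q + r + s), where q is the number of attributes that are 1 for both
objects, r the number that are 1 for i and 0 for j, and s the number that are 0 for i and 1 for j.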
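A minimal Python sketch of one standard way to do this is given here, assuming the usual
asymmetric binary dissimilarity (r + s) / (q + r + s), where q counts attributes that are 1 in
both objects and r and s count the two kinds of mismatch. The function name, the sample
vectors, and the handling of two all-zero objects are illustrative assumptions, not part of
the assignment.

```python
def asymmetric_binary_dissimilarity(a, b):
    """Return (r + s) / (q + r + s) for two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    pairs = list(zip(a, b))
    q = sum(1 for x, y in pairs if x == 1 and y == 1)  # 1 in both objects
    r = sum(1 for x, y in pairs if x == 1 and y == 0)  # 1 in a, 0 in b
    s = sum(1 for x, y in pairs if x == 0 and y == 1)  # 0 in a, 1 in b
    # 0-0 matches are deliberately not counted.
    if q + r + s == 0:
        # Assumption: two all-zero objects are treated as identical.
        return 0.0
    return (r + s) / (q + r + s)


# Hypothetical objects with six asymmetric binary attributes each.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
print(asymmetric_binary_dissimilarity(obj_i, obj_j))
```

For the sample vectors, q = 2, r = 0, and s = 1, so the dissimilarity is 1/3.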
c. Numeric Attributes:
Numeric attributes represent quantitative values. For dissimilarity between objects with
numeric attributes:
1. Euclidean Distance:
• Calculate the square root of the sum of squared differences between
corresponding attribute values.
2. Manhattan Distance (L1 norm):
• Sum the absolute differences between corresponding attribute values.
3. Minkowski Distance:
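• Generalization of Euclidean and Manhattan distances. It introduces a
parameter "p" (written q in question 5): p = 1 gives the Manhattan distance and
p = 2 the Euclidean distance. Larger p puts more weight on the largest attribute
difference, and as p grows without bound the result is the supremum distance.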
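The three distances above can be sketched in a few lines of Python. This is an illustrative
sketch rather than a reference implementation: the function names are assumptions, and the
two tuples are the ones given in question 5. The supremum distance is included as the
limiting case of the Minkowski distance.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p = 1: Manhattan, p = 2: Euclidean)."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest single attribute difference."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return max(abs(a - b) for a, b in zip(x, y))


# The two objects from question 5.
x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

print("Manhattan:", minkowski(x, y, 1))
print("Euclidean:", minkowski(x, y, 2))
print("Minkowski (order 3):", minkowski(x, y, 3))
print("Supremum:", supremum(x, y))
```

The absolute attribute differences are 2, 1, 6, and 2, so the Manhattan distance is 11, the
Euclidean distance is the square root of 45, the order-3 Minkowski distance is the cube root
of 233, and the supremum distance is 6.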
d. Term-Frequency Vectors:
Term-frequency vectors are commonly used in text data analysis. Objects represented by
term-frequency vectors are usually compared with a similarity measure, and the
dissimilarity is then taken as 1 minus the similarity:
1. Cosine Similarity:
• Compute the dot product of the vectors.
• Normalize by the product of the magnitudes of the vectors.
2. Jaccard Similarity:
• Compute the size of the intersection divided by the size of the union of
non-zero elements in the vectors.
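Both measures can be sketched in Python as follows. The function names and the two sample
term-frequency vectors are hypothetical, and treating two empty documents as identical in
the Jaccard case is an assumption. A dissimilarity is obtained here as 1 minus the similarity.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def jaccard_similarity(x, y):
    """Size of intersection over size of union of the non-zero positions."""
    terms_x = {i for i, a in enumerate(x) if a != 0}
    terms_y = {i for i, b in enumerate(y) if b != 0}
    union = terms_x | terms_y
    if not union:
        # Assumption: two documents with no terms are treated as identical.
        return 1.0
    return len(terms_x & terms_y) / len(union)


# Hypothetical term-frequency vectors over a ten-term vocabulary.
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

print("cosine dissimilarity:", 1 - cosine_similarity(doc1, doc2))
print("Jaccard dissimilarity:", 1 - jaccard_similarity(doc1, doc2))
```

Cosine similarity uses the term counts themselves, while this Jaccard variant only looks at
which terms are present, so the two can rank documents differently.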
These methods provide ways to measure dissimilarity based on the nature of the
attributes and the data representation. The choice of dissimilarity measure often
depends on the characteristics of the data and the specific requirements of the analysis.