0% found this document useful (0 votes)
397 views2 pages

No 2

This document discusses various techniques for summarizing and analyzing data, including calculating measures of central tendency (mean, median, mode), dispersion (range, interquartile range, standard deviation), and position (quartiles, percentiles). It also covers computing distances between data points using metrics like Euclidean, Manhattan, and Minkowski distances. Methods for approximating the median and ranking data based on similarity are presented.

Uploaded by

Asyraf Gary
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
397 views2 pages

No 2

This document discusses various techniques for summarizing and analyzing data, including calculating measures of central tendency (mean, median, mode), dispersion (range, interquartile range, standard deviation), and position (quartiles, percentiles). It also covers computing distances between data points using metrics like Euclidean, Manhattan, and Minkowski distances. Methods for approximating the median and ranking data based on similarity are presented.

Uploaded by

Asyraf Gary
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 2

80 Chapter 2 Getting to Know Your Data

2.2 Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal,
trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1 ) and the third quartile (Q3 ) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile–quantile plot different from a quantile plot?
2.3 Suppose that the values for a given set of data are grouped into intervals. The intervals
and corresponding frequencies are as follows:
age frequency
1–5 200
6–15 450
16–20 300
21–50 1500
51–80 700
81–110 44
Compute an approximate median value for the data.
2.4 Suppose that a hospital tested the age and body fat data for 18 randomly selected adults
with the following results:

age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot and a q-q plot based on these two variables.
2.5 Briefly outline how to compute the dissimilarity between objects described by the
following:
(a) Nominal attributes
(b) Asymmetric binary attributes
2.7 Bibliographic Notes 81

(c) Numeric attributes


(d) Term-frequency vectors

2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
(d) Compute the supremum distance between the two objects.
2.7 The median is one of the most important holistic measures in data analysis. Pro-
pose several methods for median approximation. Analyze their respective complexity
under different parameter settings and decide to what extent the real value can be
approximated. Moreover, suggest a heuristic strategy to balance between accuracy and
complexity and then apply it to all methods you have given.
2.8 It is important to define or select similarity measures in data analysis. However, there
is no commonly accepted subjective similarity measure. Results can vary depending on
the similarity measures used. Nonetheless, seemingly different similarity measures may
be equivalent after some transformation.
Suppose we have the following 2-D data set:

A1 A2
x1 1.5 1.7
x2 2 1.9
x3 1.6 1.8
x4 1.2 1.5
x5 1.5 1.0

(a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6) as a
query, rank the database points based on similarity with the query using Euclidean
distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.

2.7 Bibliographic Notes

Methods for descriptive data summarization have been studied in the statistics literature
long before the onset of computers. Good summaries of statistical descriptive data min-
ing methods include Freedman, Pisani, and Purves [FPP07] and Devore [Dev95]. For

You might also like