No 2

This document contains exercises from Chapter 2, "Getting to Know Your Data", on summarizing and analyzing data: computing measures of central tendency (mean, median, mode), dispersion (range, interquartile range, standard deviation), and position (quartiles, percentiles). It also covers distances between data points under the Euclidean, Manhattan, Minkowski, and supremum metrics, approximating the median, and ranking data points by similarity to a query.

Uploaded by

Asyraf Gary

80 Chapter 2 Getting to Know Your Data

2.2 Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal,
trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile–quantile plot different from a quantile plot?
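
Parts (a) through (e) can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and the quartile method is an assumption, since the exercise only asks for rough values of Q1 and Q3.

```python
import statistics

# Age values from Exercise 2.2, already sorted in increasing order.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
modes = statistics.multimode(ages)       # every value tied for the highest count
midrange = (min(ages) + max(ages)) / 2   # average of the smallest and largest value

# Assumption: the "exclusive" quartile method (the library default).
# Other conventions can give slightly different Q1 and Q3.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="exclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(ages), q1, median_age, q3, max(ages))

print("mean:", round(mean_age, 2))
print("median:", median_age)
print("modes:", modes)
print("midrange:", midrange)
print("five-number summary:", five_number_summary)
```

Two values share the highest frequency here, so `modes` has two entries and the data can be described as bimodal.
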
2.3 Suppose that the values for a given set of data are grouped into intervals. The intervals
and corresponding frequencies are as follows:
age       frequency
1–5       200
6–15      450
16–20     300
21–50     1500
51–80     700
81–110    44
Compute an approximate median value for the data.
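
One common way to approach this is linear interpolation inside the interval that contains the middle observation: median ≈ L1 + ((N/2 − cumulative frequency below the interval) / frequency of the interval) × width. The sketch below assumes inclusive integer bounds, so the interval 21–50 has lower boundary 21 and width 30; the function name `grouped_median` is hypothetical, and other boundary conventions shift the result slightly.

```python
def grouped_median(intervals, frequencies):
    """Approximate the median of grouped data by linear interpolation.

    intervals   -- list of (low, high) pairs with inclusive integer bounds
    frequencies -- number of observations in each interval
    """
    total = sum(frequencies)
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    cumulative = 0  # observations in all intervals below the current one
    for (low, high), freq in zip(intervals, frequencies):
        if cumulative + freq >= half:
            # Assumption: inclusive bounds, so 21-50 spans 30 values.
            width = high - low + 1
            return low + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median interval not found")


intervals = [(1, 5), (6, 15), (16, 20), (21, 50), (51, 80), (81, 110)]
frequencies = [200, 450, 300, 1500, 700, 44]

approx_median = grouped_median(intervals, frequencies)
print(round(approx_median, 2))
```

With N = 3194 the middle observation falls in the 21–50 interval, which holds 1500 values after 950 values below it.
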
2.4 Suppose that a hospital tested the age and body fat data for 18 randomly selected adults
with the following results:

age    23    23    27    27    39    41    47    49    50
%fat   9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2

age    52    54    54    56    57    58    58    60    61
%fat   34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7

(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot and a q-q plot based on these two variables.
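
Part (a) can be sketched with the standard `statistics` module. The exercise does not say whether the sample or the population standard deviation is wanted, so this sketch reports both; the helper `summarize` is a hypothetical name.

```python
import statistics

# Paired observations from Exercise 2.4 (18 adults).
age = [23, 23, 27, 27, 39, 41, 47, 49, 50,
       52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def summarize(values):
    """Return mean, median, and both standard deviations of a sequence."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sample_std": statistics.stdev(values),       # divides by n - 1
        "population_std": statistics.pstdev(values),  # divides by n
    }


age_summary = summarize(age)
fat_summary = summarize(fat)

for name, summary in (("age", age_summary), ("%fat", fat_summary)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```
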
2.5 Briefly outline how to compute the dissimilarity between objects described by the
following:
(a) Nominal attributes
(b) Asymmetric binary attributes
(c) Numeric attributes
(d) Term-frequency vectors
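
As a rough illustration of the four cases, the sketch below uses commonly taught definitions: the mismatch ratio for nominal attributes, (r + s) / (q + r + s) for asymmetric binary attributes, the Minkowski distance for numeric attributes, and cosine similarity for term-frequency vectors. All function names are hypothetical, and treating two all-zero binary objects as having dissimilarity 0 is an assumption.

```python
import math


def nominal_dissimilarity(a, b):
    """(p - m) / p: the fraction of the p attributes whose states differ."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s): negative matches (0, 0) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero objects count as identical
    return (r + s) / (q + r + s)


def minkowski_distance(a, b, order=2):
    """Numeric attributes: order=1 is Manhattan, order=2 is Euclidean."""
    return sum(abs(x - y) ** order for x, y in zip(a, b)) ** (1 / order)


def cosine_similarity(a, b):
    """Term-frequency vectors: cosine of the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(nominal_dissimilarity(("red", "A", "x"), ("red", "B", "x")))
print(asymmetric_binary_dissimilarity([1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]))
print(minkowski_distance((0, 0), (3, 4)))
print(cosine_similarity([5, 0, 3], [5, 0, 3]))
```

Numeric attributes are often normalized before a distance is computed so that no single attribute dominates; that step is omitted here.
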

2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
(d) Compute the supremum distance between the two objects.
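
The four distances in this exercise can be sketched directly from their definitions. In the code below, `order` plays the role of q in part (c); the function and variable names are illustrative.

```python
def minkowski(a, b, order):
    """Minkowski distance; order=1 is Manhattan, order=2 is Euclidean."""
    return sum(abs(x - y) ** order for x, y in zip(a, b)) ** (1 / order)


def supremum(a, b):
    """Supremum (Chebyshev) distance: the largest coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


first = (22, 1, 42, 10)
second = (20, 0, 36, 8)

# The coordinate differences are 2, 1, 6, and 2.
euclidean = minkowski(first, second, 2)    # square root of 4 + 1 + 36 + 4 = 45
manhattan = minkowski(first, second, 1)    # 2 + 1 + 6 + 2
minkowski_3 = minkowski(first, second, 3)  # cube root of 8 + 1 + 216 + 8 = 233
sup = supremum(first, second)              # largest single difference

print(round(euclidean, 4), manhattan, round(minkowski_3, 4), sup)
```

As the order grows, the Minkowski distance approaches the supremum distance, which is why the values decrease from part (b) to part (d).
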
2.7 The median is one of the most important holistic measures in data analysis. Propose
several methods for median approximation. Analyze their respective complexity
under different parameter settings and decide to what extent the real value can be
approximated. Moreover, suggest a heuristic strategy to balance between accuracy and
complexity and then apply it to all methods you have given.
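
One possible method, given here only as a sketch, is an equal-width histogram: count the values in a fixed number of buckets, locate the bucket holding the middle observation, and interpolate inside it. The function name and the `num_buckets` parameter are hypothetical. The approach takes two passes over the data and memory proportional to the number of buckets, and its error is bounded by one bucket width, so `num_buckets` is the knob that trades accuracy against cost.

```python
def approx_median_histogram(values, num_buckets):
    """Approximate the median using an equal-width histogram."""
    lo, hi = min(values), max(values)  # first pass: value range
    if lo == hi:
        return float(lo)
    width = (hi - lo) / num_buckets

    counts = [0] * num_buckets
    for v in values:                   # second pass: bucket counts
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1

    half = len(values) / 2
    cumulative = 0                     # observations below the current bucket
    for idx, count in enumerate(counts):
        if cumulative + count >= half:
            # Assume values are spread evenly inside the bucket.
            return lo + idx * width + (half - cumulative) / count * width
        cumulative += count
    return float(hi)


data = list(range(1, 1002))            # 1, 2, ..., 1001; exact median is 501
coarse = approx_median_histogram(data, 10)
fine = approx_median_histogram(data, 100)
print(coarse, fine)
```

Other candidates, such as taking the exact median of a random sample, follow the same pattern: a single size parameter controls both the cost and the expected error.
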
2.8 It is important to define or select similarity measures in data analysis. However, there
is no commonly accepted subjective similarity measure. Results can vary depending on
the similarity measures used. Nonetheless, seemingly different similarity measures may
be equivalent after some transformation.
Suppose we have the following 2-D data set:

      A1    A2
x1    1.5   1.7
x2    2     1.9
x3    1.6   1.8
x4    1.2   1.5
x5    1.5   1.0

(a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6) as a
query, rank the database points based on similarity with the query using Euclidean
distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.
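
Both parts can be explored with a short script. The sketch below ranks the five points by each measure and then repeats the Euclidean ranking after scaling every vector to unit length. It assumes the query is normalized as well, which the exercise does not state explicitly; all names are illustrative.

```python
import math

points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    """Scale a vector so that its Euclidean norm equals 1."""
    norm = math.hypot(*v)
    return tuple(x / norm for x in v)


def rank(data, target, score, descending=False):
    """Order point names from most to least similar to the target."""
    return sorted(data, key=lambda name: score(data[name], target),
                  reverse=descending)


# Part (a): distances rank ascending, cosine similarity ranks descending.
by_euclidean = rank(points, query, euclidean)
by_manhattan = rank(points, query, manhattan)
by_supremum = rank(points, query, supremum)
by_cosine = rank(points, query, cosine, descending=True)

# Part (b): Euclidean distance on unit-length vectors.
unit_points = {name: normalize(v) for name, v in points.items()}
by_euclidean_normalized = rank(unit_points, normalize(query), euclidean)

for label, order in [("Euclidean", by_euclidean), ("Manhattan", by_manhattan),
                     ("supremum", by_supremum), ("cosine", by_cosine),
                     ("normalized Euclidean", by_euclidean_normalized)]:
    print(label, order)
```

Under the supremum distance, x3 and x4 are tied, as are x2 and x5, so their relative order in `by_supremum` depends on floating-point rounding. After normalization the squared Euclidean distance equals 2 − 2 cos θ, so the ranking in part (b) matches the cosine ranking.
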

2.7 Bibliographic Notes

Methods for descriptive data summarization have been studied in the statistics literature
long before the onset of computers. Good summaries of statistical descriptive data mining
methods include Freedman, Pisani, and Purves [FPP07] and Devore [Dev95]. For
