Unit 1: Exploratory Data Analysis
(Ch 1.1, 1.3, 1.10-1.13, 2.4.3, 2.5)
Learning Objectives
At the end of this unit, students should be able to:
1. Define a population, sample, sample frame, variable of interest and identify these
concepts in particular examples.
Populations and Samples
Statisticians hope to learn about some characteristic/variable in a population, but we
often can’t observe the whole population, so we investigate a sample instead.
Definition: A population is a collection of units. (Units can be people, widgets, servings
of food, kittens, songs, Tweets, etc.)
Definition: A sample is a subset of the population.
Definition: A characteristic/variable of interest (VoI) is something to be measured for
each unit.
Example: CU might want to study the average GPA of juniors who are engineering
majors at CU. In this case, what is the population? What would be a reasonable
sample? What is the VoI?
Populations and Samples: Examples
• Insurance company surveying damage in a particular town after
hurricane…
Populations and Samples
Statisticians learn about a characteristic in a population
by studying a sample.
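The idea of learning about a population from a sample can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the population is 1,000 simulated GPAs (made-up values, not real CU data), the sample size of 50 is arbitrary, and the VoI is the GPA itself.

```python
import random
import statistics

# Hypothetical population: 1,000 simulated GPAs (made-up values, not real data).
rng = random.Random(0)  # fixed seed so the run is repeatable
population = [round(rng.uniform(2.0, 4.0), 2) for _ in range(1000)]

# A sample is a subset of the population: draw 50 units without replacement.
sample_idx = rng.sample(range(len(population)), 50)
sample = [population[i] for i in sample_idx]

population_mean = statistics.mean(population)  # usually unknown in practice
sample_mean = statistics.mean(sample)          # what we can actually compute

print(f"population mean: {population_mean:.3f}")
print(f"sample mean:     {sample_mean:.3f}")
```

In practice only the sample mean would be available; the population mean is printed here only because the population was simulated.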
Exploratory Data Analysis (EDA)/Descriptive Statistics
Numerical Summaries: Sample Statistics
The calculation and interpretation of certain
summarizing numbers can help us gain an
understanding of the data.
Sample Statistics: Measures of Centrality
The “center” of the sample data is a popular and important
characteristic to summarize. The goal here is to capture
something like the “typical” unit with respect to the VoI.
The Sample Mean
For a given set of numbers $x_1, x_2, \ldots, x_n$, the most familiar
measure of the center is the mean (arithmetic average):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Advantages?
Disadvantages?
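One way to explore these questions is to compute the mean by hand and see how it reacts to a single extreme value. The sketch below is illustrative only: `sample_mean` is a hypothetical helper name, and both samples are made-up numbers.

```python
def sample_mean(xs):
    """Arithmetic average: sum of the values divided by how many there are."""
    if not xs:
        raise ValueError("sample_mean requires at least one value")
    return sum(xs) / len(xs)


# Hypothetical samples, chosen only for illustration.
typical = [2, 4, 6, 8, 10]
with_outlier = [2, 4, 6, 8, 100]  # same sample, largest value replaced by an extreme one

print(sample_mean(typical))       # 6.0
print(sample_mean(with_outlier))  # 24.0
```

Changing one observation moved the mean from 6 to 24, which suggests that the mean is easy to compute but sensitive to outliers.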
The Sample Median
Median: Middle value when observations are ordered smallest to largest.
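The definition above can be sketched as a short function. This assumes the usual convention that, when n is even, the median is the average of the two middle values (the text only says "middle value"). The name `sample_median` and the example numbers are hypothetical.

```python
def sample_median(xs):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    if not xs:
        raise ValueError("sample_median requires at least one value")
    ordered = sorted(xs)  # order the observations smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values


print(sample_median([7, 1, 4]))      # 4
print(sample_median([7, 1, 4, 10]))  # 5.5
```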
The Mean vs. the Median
The population mean and median will not generally
be identical.
[Figure: Three different shapes for a population distribution]
Data: 34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37.
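As a quick check on this data set, the mean and median can be computed with Python's standard `statistics` module. This is only a sketch of one way to do the arithmetic; the variable name `data` is an assumption.

```python
import statistics

data = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]

print(sorted(data))             # [1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57]
print(statistics.mean(data))    # 343 / 12, about 28.58
print(statistics.median(data))  # 26.0, the average of the two middle values 24 and 28
```

With n = 12 (even), the median is the average of the 6th and 7th ordered values. Here the mean (about 28.58) and the median (26) differ, as the slide anticipates.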
Variability
So far, we’ve learned techniques for visualizing our
data and measures of center. What about the spread
of the data?
[Figure: two panels of daily average temperature by day, the first titled “Daily Average Temperature: City 1”. Horizontal axis: Day; vertical axis: Temp (Deg F).]
Variability
Simplest measure of variability: The range.
[Figure: Samples with identical measures of center but different amounts of variability]
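The range is the largest observation minus the smallest. A minimal sketch, using two made-up samples that share the same mean and median but have different ranges (`sample_range`, `tight`, and `spread` are hypothetical names):

```python
import statistics


def sample_range(xs):
    """Range: largest observation minus smallest observation."""
    return max(xs) - min(xs)


# Hypothetical samples with the same center but different spread.
tight = [9, 10, 11]
spread = [0, 10, 20]

print(statistics.mean(tight), statistics.mean(spread))      # 10 10
print(statistics.median(tight), statistics.median(spread))  # 10 10
print(sample_range(tight), sample_range(spread))            # 2 20
```

Because the range uses only the two most extreme observations, it ignores how the remaining values are distributed between them.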
Variability
Can we combine the deviations into a single quantity
by finding the average deviation?
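A short experiment suggests an answer. Taking a deviation to be an observation minus the sample mean, the sketch below averages the deviations for the data listed earlier in this unit (34, 47, 1, ...). Exact fractions are used so that rounding does not hide the result; the variable names are assumptions.

```python
from fractions import Fraction

data = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]

mean = Fraction(sum(data), len(data))  # exact sample mean, 343/12
deviations = [x - mean for x in data]  # each observation minus the mean

average_deviation = sum(deviations) / len(data)
print(average_deviation)  # 0
```

The positive and negative deviations cancel, so the average deviation is always zero and carries no information about spread. This motivates using squared deviations instead.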
Variability
The sample variance, denoted by $s^2$, is given by:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
where $\bar{x}$ is the sample mean.
The sample standard deviation, denoted by $s$, is the (positive) square root of the
variance:
$$s = \sqrt{s^2}$$
Note that $s^2$ and $s$ are both nonnegative. The unit for $s$ is the same as the
unit for each of the $x_i$.
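A minimal sketch of the computation, using the conventional n − 1 divisor and the data listed earlier in this unit (34, 47, 1, ...). The hand computation is cross-checked against `statistics.variance` and `statistics.stdev`, which use the same divisor; the variable names are assumptions.

```python
import math
import statistics

data = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]

n = len(data)
x_bar = sum(data) / n                                # sample mean
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # sample variance
s = math.sqrt(s2)                                    # sample standard deviation

print(f"s^2 = {s2:.4f}")
print(f"s   = {s:.4f}")

# Cross-check against the standard library.
print(math.isclose(s2, statistics.variance(data)))   # True
print(math.isclose(s, statistics.stdev(data)))       # True
```

Note that s is in the same units as the data, while s^2 is in squared units.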
Graphics: Histograms
A histogram is a graphical representation of the distribution of
numerical data.
Construct a histogram:
“Bin” the measured values of the VoI. (The bins are usually
consecutive, non-overlapping, and of equal size.)
Frequency histogram: count how many values fall into each bin
(interval) and draw each bar to that height.
Density histogram: count how many values fall into each bin, then
rescale the heights so that the areas of the bars sum to 1.
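The construction steps above can be sketched without any plotting library. The function `histogram`, its parameters, and the sample values are all hypothetical; bins are assumed to be equal-width, left-closed intervals [low + k·width, low + (k+1)·width).

```python
import math


def histogram(values, low, width, n_bins):
    """Return (counts, densities) for equal-width bins starting at `low`."""
    counts = [0] * n_bins
    for v in values:
        k = math.floor((v - low) / width)  # index of the bin containing v
        if 0 <= k < n_bins:                # values outside the bins are ignored
            counts[k] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bins")
    # Density height = count / (total * width), so that bar areas sum to 1.
    densities = [c / (total * width) for c in counts]
    return counts, densities


# Hypothetical sample, chosen only for illustration.
values = [2.1, 2.7, 3.3, 3.4, 3.8, 4.2, 4.9, 5.5]
counts, densities = histogram(values, low=2.0, width=2.0, n_bins=2)

print(counts)                               # [5, 3]
print(densities)                            # [0.3125, 0.1875]
print(sum(d * 2.0 for d in densities))      # 1.0 (total area of the bars)
```

With a bin width of 2, the density heights sum to 0.5 rather than 1; it is the areas (height times width) that sum to 1.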
Graphics: Histograms
[Figure: “Histogram of x”, a frequency histogram. Horizontal axis: x (ticks from 2 to 8); vertical axis: Frequency (ticks from 0 to 80).]
Histograms: Example
Charity is a big business in the United States. The
Web site charitynavigator.com gives information on
roughly 5500 charitable organizations.
Histograms: Example
Here are the data on fundraising expenses as a percentage of
total expenditures for a random sample of 60 charities:
6.1 12.6 34.7 1.6 18.8 2.2 3.0 2.2 5.6 3.8
2.2 3.1 1.3 1.1 14.1 4.0 21.0 6.1 1.3 20.4
7.5 3.9 10.1 8.1 19.5 5.2 12.0 15.8 10.4 5.2
6.4 10.8 83.1 3.6 6.2 6.3 16.3 12.7 1.3 0.8
8.8 5.1 3.7 26.3 6.0 48.0 8.2 11.7 7.2 3.9
15.3 16.6 8.8 12.0 4.7 14.7 6.4 17.0 2.5 16.2
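As a sketch of how a frequency histogram for these 60 values could be tabulated, the code below counts observations in bins of width 10. The bin width is an assumption made for illustration, as are the variable names; the mean and median are printed to show the effect of the long right tail.

```python
import statistics
from collections import Counter

# Fundraising expenses as a percentage of total expenditures (60 charities).
expenses = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

width = 10  # assumed bin width; bins are [0, 10), [10, 20), ...
bin_counts = Counter(int(x // width) for x in expenses)

for k in range(max(bin_counts) + 1):
    lo = k * width
    print(f"[{lo:2d}, {lo + width:2d}): {bin_counts.get(k, 0)}")

print(f"mean   = {statistics.mean(expenses):.2f}")
print(f"median = {statistics.median(expenses):.2f}")
```

Most of the values fall in the first two bins, with a few large values far to the right. For data shaped like this, the mean is pulled above the median.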
Histograms: Example
[Figure: “Histogram of x”, a frequency histogram of the charity data. Horizontal axis ticks from 0 to 80; vertical axis: Frequency (ticks from 0 to 35).]
Graphics: Histograms
Histograms come in a variety of shapes.
Classwork
Answer the following questions for a sample data set with n
values. What happens to the mean when: