Lesson 1: Summarizing Data
Data
Example: You want to study the average GPA of juniors
who are engineering majors.
Population:
All engineering majors who are juniors.
Populations and Samples
What statisticians need to do:
Graphics: Histograms
A histogram is a graphical representation of the
distribution of numerical data.
Construct a histogram:
1. “Bin” the range of values. (The bins are usually
consecutive, non-overlapping, and of equal width.)
2. Frequency histogram: count how many values fall into
each bin and draw a bar of that height over the bin.
3. Density histogram: count how many values fall into
each bin, then set each bar's height to
count / (n × bin width), where n is the number of values,
so that the areas of the bars sum to 1.
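The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a plotting routine: the sample values, the bin edges, and the bin width of 2 are all made up for the example.

```python
# Toy sample (hypothetical values, not from the lesson).
data = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6, 5.0, 7.4]

# Step 1: bin the range into consecutive, non-overlapping, equal-width bins.
bin_width = 2.0
edges = [0.0, 2.0, 4.0, 6.0, 8.0]  # bin i covers [edges[i], edges[i + 1])

# Step 2: frequency histogram heights are the counts per bin.
counts = [0] * (len(edges) - 1)
for x in data:
    i = int((x - edges[0]) // bin_width)  # index of the bin containing x
    counts[i] += 1

# Step 3: density heights rescale the counts so the total bar area is 1.
n = len(data)
density = [c / (n * bin_width) for c in counts]
total_area = sum(h * bin_width for h in density)

for lo, c, d in zip(edges, counts, density):
    print(f"[{lo:.0f}, {lo + bin_width:.0f})  count={c}  density={d:.3f}")
print("total area:", total_area)
```

Both histograms have the same shape when the bins are equally wide; only the vertical scale changes.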
Graphics: Histograms
Examples:
- Drawing a frequency histogram by hand.
- Drawing a density histogram by hand.
Example
Charity is a big business in the United States. The Web site
charitynavigator.com gives information on roughly 5500
charitable organizations.
Example cont’d
6.1 12.6 34.7 1.6 18.8 2.2 3.0 2.2 5.6 3.8
2.2 3.1 1.3 1.1 14.1 4.0 21.0 6.1 1.3 20.4
7.5 3.9 10.1 8.1 19.5 5.2 12.0 15.8 10.4 5.2
6.4 10.8 83.1 3.6 6.2 6.3 16.3 12.7 1.3 0.8
8.8 5.1 3.7 26.3 6.0 48.0 8.2 11.7 7.2 3.9
15.3 16.6 8.8 12.0 4.7 14.7 6.4 17.0 2.5 16.2
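As a rough sketch, the 60 values listed above can be binned into a frequency table and drawn as a text histogram in Python. The bin width of 10 is an assumption made for illustration; the lesson does not specify the bins.

```python
# The 60 values from the example, copied row by row.
data = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

bin_width = 10   # assumed bin width
n_bins = 9       # bins [0, 10), [10, 20), ..., [80, 90)

# Count how many values fall into each bin.
counts = [0] * n_bins
for x in data:
    counts[int(x // bin_width)] += 1

# Draw one row of '#' marks per bin.
for i, c in enumerate(counts):
    lo = i * bin_width
    print(f"[{lo:2d}, {lo + bin_width:2d})  {'#' * c}  {c}")
```

With these bins, most of the values land in the first bin and a few large values form a long right tail.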
Graphics: Histograms
Histograms come in a variety of shapes.
• Unimodal histogram: single peak
• Bimodal histogram: two different peaks
• Multimodal histogram: many different peaks
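• Symmetric histogram: the left half is a mirror image of the right half
• Positively skewed histogram: the right (upper) tail stretches out farther than the left
• Negatively skewed histogram: the left (lower) tail stretches out farther than the right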
Sample Statistics
• Histograms and other visual summaries of samples are
excellent tools for informal learning about population
characteristics.
Sample Statistics: Measures of Centrality
Locating the center of the sample data is a popular
and important way to summarize a set of numbers.
The Sample Mean
For a given set of numbers $x_1, x_2, \ldots, x_n$, the most
familiar measure of the center is the mean
(arithmetic average):
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
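Disadvantage? The mean is sensitive to outliers: a single unusually
large or small observation can shift it substantially.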
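A small Python sketch of the sample mean, using made-up numbers, shows how one extreme observation moves it. The helper name `sample_mean` and both samples are hypothetical.

```python
def sample_mean(xs):
    """Arithmetic average: sum of the observations divided by their count."""
    return sum(xs) / len(xs)

sample = [2.0, 3.0, 4.0, 5.0, 6.0]     # hypothetical values
with_outlier = sample[:-1] + [60.0]    # same sample, largest value replaced by 60

print(sample_mean(sample))        # 4.0
print(sample_mean(with_outlier))  # 14.8
```

Changing one of five values moves the mean from 4.0 to 14.8, which is larger than four of the five observations.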
The Sample Median
Median: Middle value when observations are ordered
smallest to largest.
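To calculate: Order the n observations from smallest to largest
(repeated values included) and find the middle one. If n is odd,
the median is the single middle value; if n is even, it is the
average of the two middle values.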
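The calculation can be sketched in Python as follows. The function name `sample_median` and the two small samples are made up for illustration.

```python
def sample_median(xs):
    """Middle value of the ordered observations (average of the two middle ones if n is even)."""
    ordered = sorted(xs)   # repeated values are kept
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values

print(sample_median([7, 1, 3, 3, 9]))  # ordered: 1 3 3 7 9 -> 3
print(sample_median([7, 1, 3, 9]))     # ordered: 1 3 7 9   -> (3 + 7) / 2 = 5.0
```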
The Mean vs. the Median
The population mean µ and median will not generally be
identical. If the population distribution is positively or
negatively skewed, then the two differ: positive skew pulls the
mean above the median, and negative skew pulls the mean below
the median.
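The same pattern shows up in samples. The sketch below uses Python's standard `statistics` module on two made-up samples, one with a long right tail and one with a long left tail; the data are hypothetical and chosen only to make the skew obvious.

```python
import statistics

right_skewed = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 20.0]     # long right tail
left_skewed = [1.0, 17.0, 18.0, 18.0, 19.0, 19.0, 20.0]  # long left tail

for name, xs in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.mean(xs)
    median = statistics.median(xs)
    relation = "above" if mean > median else "below"
    print(f"{name}: mean={mean}, median={median} (mean is {relation} the median)")
```

In both cases the median stays with the bulk of the data while the mean is dragged toward the long tail.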
Other Sample Measures
• Quartiles: divide the data set into four equal parts (how
is this calculated?)
• Percentiles: A data set can be even more finely
divided. What does “percentile” mean?
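One way to compute quartiles and percentiles is sketched below with `statistics.quantiles` from Python's standard library. Several conventions for interpolating between observations exist and they can give slightly different answers; the `method="inclusive"` setting used here is just one of them, and the sample is made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample

# Quartiles: three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)  # 3.0 5.0 7.0

# Percentiles: 99 cut points that split the data into 100 parts.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p90 = percentiles[89]  # index 89 holds the 90th percentile
print(p90)             # about 90% of the sample lies at or below this value
```

The second quartile is the median, and the 25th, 50th, and 75th percentiles coincide with the three quartiles.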
Graphics: Boxplots
A boxplot is a convenient way of graphically depicting
groups of numerical data through the five number
summary: minimum, first quartile, median, third quartile,
and maximum.
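The five numbers behind a boxplot can be computed as in the Python sketch below. The helper `five_number_summary` is a hypothetical name, the sample is made up, and the quartiles use the "inclusive" interpolation convention, which is one of several in common use.

```python
import statistics

def five_number_summary(xs):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, med, q3, max(xs)

sample = [2, 4, 4, 5, 7, 8, 9, 11, 30]  # hypothetical sample with one large value
summary = five_number_summary(sample)
print(summary)  # (2, 4.0, 7.0, 9.0, 30)
```

A boxplot of this sample would draw the box from 4 to 9 with a line at 7; the gap between 9 and the maximum of 30 shows up as a long upper whisker.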
Variability
So far, we’ve learned techniques for visualizing our data
and measures of center. What about how spread out the
data are?
Samples with identical measures of center but different amounts of variability
Variability
Simplest measure of variability: the range, the difference
between the largest and smallest sample values.
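A short Python sketch with two made-up samples illustrates the idea: both have the same mean and median, but very different ranges. The name `sample_range` and the data are hypothetical.

```python
import statistics

def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)

a = [28.0, 29.0, 30.0, 31.0, 32.0]  # tightly clustered around 30
b = [10.0, 20.0, 30.0, 40.0, 50.0]  # widely spread around 30

print(statistics.mean(a), statistics.median(a), sample_range(a))  # 30.0 30.0 4.0
print(statistics.mean(b), statistics.median(b), sample_range(b))  # 30.0 30.0 40.0
```

The range uses only the two most extreme observations, so it ignores how the remaining values are arranged between them.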
Variability
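Can we combine the deviations from the mean, $x_i - \bar{x}$, into a
single quantity by finding the average deviation? No: positive and
negative deviations cancel, so they always sum to zero. Squaring
the deviations before averaging avoids this cancellation.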
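The cancellation is easy to check numerically. In the Python sketch below (sample values made up for illustration), the deviations from the mean add up to zero, while the squared deviations do not.

```python
sample = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical sample

mean = sum(sample) / len(sample)          # 25 / 5 = 5.0
deviations = [x - mean for x in sample]   # [-3.0, -1.0, -1.0, 0.0, 5.0]

print(sum(deviations))                    # 0.0: negatives and positives cancel
print(sum(d ** 2 for d in deviations))    # 36.0: squares cannot cancel
```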
Variability
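The sample variance, denoted by $s^2$, is given by
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
where $\bar{x}$ is the sample mean. The sample standard deviation,
denoted by $s$, is the positive square root of the variance:
$s = \sqrt{s^2}$.
Note that $s^2$ and $s$ are both nonnegative. The unit for $s$ is
the same as the unit for each of the $x_i$.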
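As a sketch, the sample variance and standard deviation can be computed directly from their definitions in Python and compared with the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n − 1. The sample values are made up.

```python
import math
import statistics

sample = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical sample

n = len(sample)
xbar = sum(sample) / n                               # sample mean: 5.0
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # 36 / 4 = 9.0
s = math.sqrt(s2)                                    # 3.0, same unit as the data

print(s2, s)

# The library functions use the same n - 1 divisor.
print(math.isclose(s2, statistics.variance(sample)))
print(math.isclose(s, statistics.stdev(sample)))
```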
Summarizing Data in R
- Summary statistics
- Graphics (boxplots, histograms)