Chapter 4 Data Description
in Our Daily Life
The “AVERAGE”
Mean = (0 + 340 + … + 200 + 210)/20 = 167
Practical interpretation:
On average, a breakfast cereal on the market contains 167 mg of sodium per serving.
Parameters and Statistics
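A parameter: a number that describes the population.
A parameter is a fixed number.
In practice, we often don't know the actual value of this number.
A statistic: a number that describes a sample.
The value of a statistic is known once the sample is taken, but it can change from sample to sample. We often use a statistic to estimate an unknown parameter.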
The Mean
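Population: X1, X2, …, XN. The population mean is a parameter:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Sample (taken from the population): x1, x2, …, xn. The sample mean is a statistic:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The sample mean $\bar{x}$ is used to estimate the population mean $\mu$.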
The Mean
The mean is the balance point of a set of data if we were to place identical weights on a line at the positions where the observations occur.
Usually, the mean is not equal to any value that was observed in the sample.
The mean can be highly influenced by an outlier, which is an unusually small or unusually large observation.
Try to imagine that ten guys are sitting
on bar stools in a middle-class drinking
establishment in Seattle; each of these
guys earns $35,000 a year, which makes
the mean annual income for the group
$35,000.
Bill Gates walks into the bar. Let’s assume that Bill Gates has an
annual income of $1 billion. When Bill sits down on the eleventh bar
stool, the mean annual income for the bar patrons rises to about $91
million. Obviously none of the original ten drinkers is any richer.
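The arithmetic in this story can be checked with a short Python sketch. It uses the incomes stated above and the standard library's statistics module; the variable names are illustrative only, and the median is included purely for comparison.

```python
import statistics

# Ten patrons who each earn $35,000 a year (figures from the example above).
incomes = [35_000] * 10
mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

# Bill Gates takes the eleventh stool (assumed annual income: $1 billion).
incomes_with_gates = incomes + [1_000_000_000]
mean_after = statistics.mean(incomes_with_gates)
median_after = statistics.median(incomes_with_gates)

print(f"Mean before:   ${mean_before:,.0f}")
print(f"Mean after:    ${mean_after:,.0f}")
print(f"Median before: ${median_before:,.0f}")
print(f"Median after:  ${median_after:,.0f}")
```

The mean jumps to roughly $91 million, while the median stays at $35,000.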
Example: Sodium in Cereal
Practical interpretation:
Among the 20 breakfast cereals tested, half contain more than 180 mg of sodium (salt) per serving.
The Median
The population or sample median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it.
Median or Mean?
Mean is sensitive to extreme values.
Median is resistant to the effect of extreme observations.
Example: Sodium in Cereal
Practical interpretation:
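Among breakfast cereals on the market, the most frequently occurring sodium content is 180 mg per serving. (The mode is the most common value; it need not account for a majority of the observations.)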
The Mode
The mode Mo of a population or sample of measurements
is the measurement that occurs most frequently
• Modes are the values that are observed “most typically”
• A set of data can have:
a) One mode (unimodal)
b) More than one mode
• If there are two modes, the data is bimodal
• If more than two modes, the data is multimodal
c) No mode (uniform distribution)
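The three cases above can be illustrated with a small Python sketch. The helper name `modes` and the data are hypothetical, and the sketch assumes that "no mode" means every distinct value occurs equally often in a non-empty data set.

```python
from collections import Counter

def modes(data):
    """Return the list of modes, or an empty list when there is no mode."""
    counts = Counter(data)
    highest = max(counts.values())
    # Several distinct values that all occur equally often: no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)

print(modes([140, 180, 180, 200]))       # one mode (unimodal)
print(modes([140, 180, 180, 200, 200]))  # two modes (bimodal)
print(modes([140, 180, 200]))            # no mode
```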
The Well-Chosen Average
The figure shows how people get
paid in a corporation:
Average temperature:
77 °F (25 °C)
Very comfortable… NO!
Measures of Variation
Knowing the measures of central tendency is not enough
Both of the distributions below have identical measures of
central tendency! But they are very different!
The Range
Range: the difference between the largest
and the smallest observations
The Range
The following graph shows the distribution of employees' annual income for two companies.
Variance and Standard Deviation
Variance and Standard deviation summarize
the deviation of observations from their mean.
The Variance
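For a population of size N, the population variance $\sigma^2$ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$$
For a sample of size n, the sample variance $s^2$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$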
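The two definitions differ only in the divisor: N for a population, n - 1 for a sample. A minimal Python sketch makes the difference concrete. The data and the function names are made up for illustration; the standard library functions `statistics.pvariance` and `statistics.variance` are printed alongside as a cross-check.

```python
import math
import statistics

def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations with mean 5

print(population_variance(data), statistics.pvariance(data))  # divisor N = 8
print(sample_variance(data), statistics.variance(data))       # divisor n - 1 = 7
print(math.sqrt(population_variance(data)))  # population standard deviation
```

Taking the square root of either variance gives the corresponding standard deviation, which is in the same units as the data.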
The Variances
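Population: X1, X2, …, XN. The population variance $\sigma^2$ is a parameter; its divisor is N.
Sample (taken from the population): x1, x2, …, xn. The sample variance $s^2$ is a statistic; its divisor is n - 1.
The sample variance $s^2$ is used to estimate the population variance $\sigma^2$.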
The Standard Deviation
In practice, the standard deviation represents a typical distance or a
type of average distance of an observation from the mean.
The larger the standard deviation, the greater the variation among
observations.
• Wider spread
• Less stable process
• Riskier investment
• …
Example: Exam Scores
The first exam in your statistics course is graded on a scale of 0 to 100. Suppose that the mean score in your class is 80.
With s = 10, a typical distance from the mean is 10 points, as occurs with scores of 70 and 90.
Example: Investing 101
One of the first principles of investing is that taking more risk brings
higher returns, at least on the average in the long run.
The table lists the yearly returns on three investments over the 50 years from 1950 to 1999.
Coefficient of Variation
The larger the standard deviation, the greater the variability, the
higher the risk.
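The extracted slide text does not show the formula, but the coefficient of variation is conventionally defined as the standard deviation divided by the mean, expressed as a percentage. The sketch below assumes that definition and uses hypothetical yearly returns (in percent), not the figures from the investing table.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

steady_fund = [4, 5, 6]        # hypothetical returns: mean 5, s = 1
volatile_fund = [10, 20, 30]   # hypothetical returns: mean 20, s = 10

print(f"Steady fund CV:   {coefficient_of_variation(steady_fund):.1f}%")
print(f"Volatile fund CV: {coefficient_of_variation(volatile_fund):.1f}%")
```

Because it is relative to the mean, the coefficient of variation lets us compare the variability of investments whose average returns differ in size.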
Measuring Position: Percentiles & Quartiles
To see how large or small an observation is compared with the rest of the observations, or compared with the mean, we can compute percentiles.
Example: You are the fourth tallest person in a group of 20.
16 of the 20 people, or 80%, are shorter than you, so your height is at the 80th percentile.
Measuring Position: Percentiles & Quartiles
The first quartile Q1 is the 25th percentile
The second quartile Q2 (or median) is the 50th percentile
The third quartile Q3 is the 75th percentile
The interquartile range IQR is Q3 - Q1
Finding a percentile
To find the p-th percentile:
1. Rank the n observations in ascending order (from smallest to largest).
2. Find the index of your p-th percentile: i = p% × n.
3. If the result i is not a whole number, round i up to the next whole number; your p-th percentile is the observation in that position.
4. If the result i is a whole number, then your p-th percentile is the average of your i-th and (i+1)-th observations.
Note: Some software may give you slightly different results.
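The steps above can be sketched in Python as follows. The function name `percentile` and the data are hypothetical, p is assumed to be a whole number from 1 to 99, and the non-whole case is read as "round i up to the next position".

```python
def percentile(data, p):
    """p-th percentile by the index rule i = p% x n (p a whole number, 1 to 99)."""
    values = sorted(data)                      # step 1: rank in ascending order
    n = len(values)
    quotient, remainder = divmod(p * n, 100)   # step 2: i = p% x n, kept exact
    if remainder:
        # Step 3: i is not whole, so take the next position up.
        # Position quotient + 1 (counting from 1) is index quotient (from 0).
        return values[quotient]
    # Step 4: i is whole, so average the i-th and (i+1)-th observations.
    return (values[quotient - 1] + values[quotient]) / 2

scores = list(range(10, 210, 10))   # 20 made-up observations: 10, 20, ..., 200

print(percentile(scores, 60))       # i = 12: average of 12th and 13th values
print(percentile(range(1, 11), 25)) # i = 2.5: the 3rd value
```

Working with `divmod(p * n, 100)` instead of `p / 100 * n` avoids floating-point rounding when deciding whether i is a whole number.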
Finding a percentile
There are 20 cereals, so n = 20.
To find the 60th percentile:
i = 60% × 20 = 12, a whole number.
So the 60th percentile is the average of the 12th and 13th observations, which is 185.
Practical interpretation:
60% of the breakfast cereals contain no more than 185 mg of sodium per serving.
Five-Number Summary
Smallest Value
First Quartile
Median
Third Quartile
Largest Value
Detecting potential outliers
An observation is a potential outlier if it falls more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile.
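As a rough illustration of this rule, the sketch below flags potential outliers in a made-up list of sodium values (not the cereal data from the slides). It relies on `statistics.quantiles` from the Python standard library, whose default interpolation can give quartiles that differ slightly from a hand calculation; the function name `iqr_outliers` is hypothetical.

```python
import statistics

def iqr_outliers(data):
    """Return (lower limit, upper limit, flagged values) using the 1.5 x IQR rule."""
    q1, _median, q3 = statistics.quantiles(data, n=4)  # the three quartiles
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    flagged = [x for x in data if x < lower or x > upper]
    return lower, upper, flagged

# Hypothetical sodium contents in mg per serving.
sodium = [0, 130, 140, 150, 160, 180, 180, 200, 210, 220, 340]

lower, upper, flagged = iqr_outliers(sodium)
print(f"Limits: {lower} to {upper}; potential outliers: {flagged}")
```

With these values the quartiles are 140 and 210, so the IQR is 70 and anything below 35 or above 315 is flagged.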
The Box-and-Whiskers Display
The box plot displays:
the first quartile, Q1; the median, Md; the third quartile, Q3
the limits, located 1.5 × IQR away from the quartiles:
lower limit = Q1 - (1.5 × IQR)
upper limit = Q3 + (1.5 × IQR)
Example: Cereal Sodium Data
Box-plot or Histogram?
A box plot does not portray certain features of a distribution, such as
distinct mounds and possible gaps, as clearly as a histogram.
• The box plot will not show us whether there is a large gap in
the distribution contributing to the skew.
• Box plots are useful for identifying potential outliers and for comparing groups.
Example: Male and Female College Student Heights
a. The box plots show that, in general, males are taller than females.
b. The median (the center line in a box) is approximately 71 inches for the males and 65 inches for the females.
c. The variability of the middle 50% of each distribution is similar, as indicated by the similar widths of the boxes (the width of a box is the IQR).
d. Both samples have heights that are unusually short or tall, flagged as outliers. (The male sample has one outlier, while the female sample has six.)
e. The upper 75% of the male heights are higher than the lower 75% of the female heights. That is, 75% of the female heights fall below their third quartile, about 67 inches, whereas 75% of the male heights fall above their first quartile, about 69 inches.
Weighted Means
Sometimes, some measurements are more important than
others
Assign numerical “weights” $w_i$ to the data. The weighted mean is
$$\frac{\sum_i w_i x_i}{\sum_i w_i}$$
Example: unemployment rate across U.S.
Table 3.3
Calculate it as a weighted mean
So that the bigger the region, the more heavily it counts in
the mean
The data values are the regional unemployment rates
The weights are the sizes of the regional labor forces
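A minimal Python sketch of this idea follows. The three regions, their unemployment rates, and their labor force sizes are invented for illustration and are not the values in Table 3.3; the function name `weighted_mean` is likewise hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight x value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical regional data (not taken from Table 3.3).
unemployment_rates = [4.0, 6.0, 5.0]   # percent, one rate per region
labor_forces = [10, 30, 60]            # millions of workers per region

overall = weighted_mean(unemployment_rates, labor_forces)
simple = sum(unemployment_rates) / len(unemployment_rates)
print(f"Weighted mean: {overall:.2f}%  Unweighted mean: {simple:.2f}%")
```

Here the weighted mean is 5.2%, while the unweighted mean of the three rates is 5.0%, because the larger regions count more heavily.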
Ben's score
$$= 60\% \times 10\% + 80\% \times 20\% + 85\% \times \frac{30\%}{2} + 0\% \times \frac{30\%}{2} + 40\% \times 65\%$$
$$= 60.75\%$$
Percentages can be Confusing.
We use percentages all the time.
Percentages can be Confusing.
Percentage change and actual change:
Let’s review the average scores (out of a full mark of 100) on the same two quizzes given to three classes, A, B and C:
Class A: Quiz 1 average 20, Quiz 2 average 25.
Class B: Quiz 1 average 50, Quiz 2 average 55.
Class C: Quiz 1 average 90, Quiz 2 average 95.
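Each class improved by the same 5 points, yet the percentage changes are very different. The short Python sketch below works this out from the averages listed above; the function name `percent_change` is an illustrative choice.

```python
def percent_change(old, new):
    """Relative change from old to new, as a percentage of the old value."""
    return (new - old) / old * 100

# Quiz 1 and Quiz 2 averages for each class, as listed above.
quiz_averages = {"A": (20, 25), "B": (50, 55), "C": (90, 95)}

for name, (quiz1, quiz2) in quiz_averages.items():
    actual = quiz2 - quiz1
    relative = percent_change(quiz1, quiz2)
    print(f"Class {name}: actual change {actual} points, "
          f"percentage change {relative:.1f}%")
```

The same actual change looks large when the starting value is small (Class A) and small when the starting value is large (Class C).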
Percentages can be Confusing.
In 1970, 12.5 million adult children (18-34 years old) lived
with their parents.
In 2015, research showed that this number had risen to over 18.6 million.