Chapter 4 Data Description

Chapter 4 discusses the concept of averages, specifically mean, median, and mode, as measures of central tendency used to summarize data. It also explains the importance of understanding parameters versus statistics, the influence of outliers on the mean, and the significance of variance and standard deviation in describing data spread. Additionally, it covers percentiles, quartiles, and the use of weighted means to analyze data more accurately.

Uploaded by

yaq050120
Copyright
© © All Rights Reserved

Statistics in Our Daily Life
Chapter 4
Data Description

1
The “AVERAGE”

We often see an “average” mentioned when people use one number to
summarize a large amount of information about a group.

But each of the following could be used as the “average”. We call
them measures of central tendency:
Mean: the ordinary average of the observations. We find the mean
by adding up all the observed values and dividing the total by the
number of observations.
Median: the midpoint of the distribution, or the middle number
within an ordered set of data.
Mode: the most frequent value(s).
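As a minimal sketch, the three averages can be computed with Python's standard `statistics` module. The observations below are hypothetical and chosen only for illustration; they are not the cereal data used in this chapter.

```python
from statistics import mean, median, multimode

# Hypothetical observations (an assumption for illustration only).
values = [2, 3, 3, 5, 7, 10]

average = mean(values)             # add up all values, divide by how many there are
middle = median(values)            # middle of the ordered data (average of 3 and 5 here)
most_frequent = multimode(values)  # list of the most frequent value(s)

print("mean:", average)
print("median:", middle)
print("mode(s):", most_frequent)
```

`multimode` returns a list because a data set can have more than one most frequent value.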
2
Example: Sodium in Cereal

The amount of salt (sodium) in 20 popular cereals is measured and
rounded to the nearest milligram.

Mean = (0 + 340 + … + 200 + 210)/20 = 167

Practical interpretation:
On average, the sodium level for a typical breakfast cereal on the
market is 167 mg per serving.

3
Parameters and Statistics
A parameter: a number that describes the population.
• A parameter is a fixed number.
• In practice, we often don’t know the actual value of this number.

A statistic: a number that describes a sample.
• The value of a statistic is known once we have taken a sample.
• It can change from sample to sample.

We often use a statistic to estimate an unknown parameter.

4
The Mean

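Population: X1, X2, …, XN  (take sample)  Sample: x1, x2, …, xn

Population mean (a parameter): $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
Sample mean (a statistic): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The sample mean $\bar{x}$ is used to estimate the population mean $\mu$.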
5
The Mean
The mean is the balance point of a set of data if we were to place
identical weights on a line at the points where the observations
occur.

Usually, the mean is not equal to any value that was observed in
the sample.
The mean can be highly influenced by an outlier, which is an
unusually small or unusually large observation.

6
Try to imagine that ten guys are sitting
on bar stools in a middle-class drinking
establishment in Seattle; each of these
guys earns $35,000 a year, which makes
the mean annual income for the group
$35,000.
Bill Gates walks into the bar. Let’s assume that Bill Gates has an
annual income of $1 billion. When Bill sits down on the eleventh bar
stool, the mean annual income for the bar patrons rises to about $91
million. Obviously none of the original ten drinkers is any richer.

If I were to describe the patrons of this bar as having an average
annual income of $91 million, the statement would be both
statistically correct and grossly misleading.
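The arithmetic behind the bar example can be checked with a short sketch. It uses only the figures stated above (ten incomes of $35,000 and an assumed $1 billion income for Bill Gates) and also prints the median, defined earlier in the chapter, for comparison.

```python
from statistics import mean, median

regulars = [35_000] * 10                  # ten patrons earning $35,000 each
with_gates = regulars + [1_000_000_000]   # the eleventh stool: an assumed $1 billion income

mean_before = mean(regulars)
mean_after = mean(with_gates)
median_after = median(with_gates)

print(f"mean before:  ${mean_before:,.0f}")
print(f"mean after:   ${mean_after:,.0f}")
print(f"median after: ${median_after:,.0f}")
```

The mean jumps to roughly $91 million, while the median stays at $35,000, because ten of the eleven ordered incomes are unchanged.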

7
Example: Sodium in Cereal

Median: arrange the observations in ascending order.

With 20 observations, the median lies between the 10th and 11th
observations.

Median = (180 + 180)/2 = 180

Practical interpretation:
Among the 20 tested breakfast cereals, about half contain no more
than 180 mg of sodium (salt) per serving, and about half contain at
least 180 mg.

8
The Median
The population or sample median Md is a value such that, after the
measurements have been arranged in numerical order, 50% of them
lie above it and 50% lie below it.

The median Md is found as follows:


1. Order the n observations, usually in ascending order.
2. When the number of observations n is odd, the median is the
middle observation in the ordered sample.
3. When the number of observations n is even, two observations
from the ordered sample fall in the middle, and the median is
their average.
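The three steps can be sketched as a small function. The name `median_by_steps` and the sample inputs are illustrative assumptions, not part of the original slides.

```python
def median_by_steps(observations):
    """Median found by ordering the data and taking the middle value(s)."""
    if not observations:
        raise ValueError("need at least one observation")
    ordered = sorted(observations)   # step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                   # step 2: n is odd, take the middle observation
        return ordered[mid]
    # step 3: n is even, average the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_steps([7, 1, 5]))      # odd n: middle of 1, 5, 7
print(median_by_steps([4, 1, 3, 2]))   # even n: average of 2 and 3
```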

9
Median or Mean?
Mean is sensitive to extreme values.
Median is resistant to the effect of extreme observations.

If we have a unimodal (single peak) distribution:

10
Example: Sodium in Cereal

Mode: the most frequent value = 180

Practical interpretation:
Among the tested breakfast cereals, the most common sodium content
is 180 mg per serving. (The mode is the most frequent value; it
need not describe a majority of the cereals.)

11
The Mode
The mode Mo of a population or sample of measurements
is the measurement that occurs most frequently.
• Modes are the values that are observed “most typically”.
• A set of data can have:
a) One mode (unimodal)
b) More than one mode
   • If there are two modes, the data are bimodal
   • If there are more than two modes, the data are multimodal
c) No mode (uniform distribution)
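A rough sketch of this classification follows. The function name `describe_modes` is hypothetical, and treating "every distinct value occurs equally often" as "no mode" is an assumption made here to match case c).

```python
from collections import Counter


def describe_modes(data):
    """Return (label, modes) for a non-empty list of measurements."""
    if not data:
        raise ValueError("need at least one measurement")
    counts = Counter(data)
    top = max(counts.values())
    modes = sorted(value for value, count in counts.items() if count == top)
    # Assumption: if all distinct values are equally frequent, report no mode.
    if len(counts) > 1 and len(modes) == len(counts):
        return "no mode", []
    if len(modes) == 1:
        return "unimodal", modes
    if len(modes) == 2:
        return "bimodal", modes
    return "multimodal", modes


print(describe_modes([1, 2, 2, 3]))
print(describe_modes([1, 1, 2, 2, 3]))
print(describe_modes([1, 2, 3, 4]))
```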

12
The Well-Chosen Average
The figure shows how people get paid in a corporation:

The boss might like to express the situation as “average wage
$5,700”, using that deceptive mean.

The mode, however, is more revealing: the most common rate of pay
in this business is $2,000 a year.

The median tells more about the situation than any other single
figure does; half the people get more than $3,000 and half get less.
13
When the average doesn’t tell you everything, or,
anything.
Death Valley National Park, in
California, U.S.

Average temperature:
77 °F (25 °C)
Very comfortable… NO!

14
Measures of Variation
• Knowing the measures of central tendency is not enough.
• Both of the distributions below have identical measures of
central tendency, but they are very different!

20 Repair Times for Personal Computers at Two Service Centers

15
The Range
Range: the difference between the largest
and the smallest observations.

The range measures the interval spanned by all the data.

In our breakfast cereal case:
Range = 340 - 0 = 340
Practical interpretation:
The sodium content of the breakfast cereals ranges from 0 mg to
340 mg per serving, a span of 340 mg.

16
The Range
The following graph shows the distribution of employees’ annual
income for two companies.

17
Variance and Standard Deviation
Variance and standard deviation summarize
the deviation of observations from their mean.

Some cereals (Life, Cracklin’ Oat Bran, Wheaties, …) are
close to the mean level.
Some (Raisin Bran, Rice Krispies, Frosted Mini-Wheats, …)
are very different from the mean.

So, how do we measure the “overall spread”?

18
The Variance
For a population of size N, the population variance $\sigma^2$ is defined as

$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N} = \frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_N-\mu)^2}{N}$

For a sample of size n, the sample variance $s^2$ is defined as

$s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1}$

The sample variance and sample standard deviation are point
estimates for the population variance and population standard
deviation, respectively.
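The two definitions differ only in the divisor (N versus n - 1). The sketch below computes both directly from the formulas and takes square roots to obtain the standard deviations. The data values are hypothetical, and the function names are chosen here for illustration.

```python
from math import sqrt


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Hypothetical data (an assumption for illustration only); the mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("population variance:", population_variance(data))
print("population std dev: ", sqrt(population_variance(data)))
print("sample variance:    ", sample_variance(data))
print("sample std dev:     ", sqrt(sample_variance(data)))
```

For the same data, the sample variance is always slightly larger than the population variance because it divides by n - 1 rather than n.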

19
The Variances

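Population: X1, X2, …, XN  (take sample)  Sample: x1, x2, …, xn

Population variance (a parameter): $\sigma^2 = \frac{\sum_{i=1}^{N}(X_i-\mu)^2}{N}$
Sample variance (a statistic): $s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$

The sample variance $s^2$ is used to estimate the population variance $\sigma^2$.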
20
The Standard Deviation
In practice, the standard deviation represents a typical distance or a
type of average distance of an observation from the mean.

The larger the standard deviation, the greater the variation among
observations.
• Wider spread
• Less stable process
• Riskier investment
• …

21
Example: Exam Scores
The first exam in your statistics course is graded on a scale of 0 to
100. Suppose that the mean score in your class is 80.

Which value is most plausible for the standard deviation s:
0, 0.5, 10, or 50?

The standard deviation s is a typical distance of an observation from the mean.

A value of s = 0 seems unlikely. For that to happen, every deviation would
have to be 0. This implies that every student must have scored 80, the mean.

A value of s = 0.5 is implausibly small. s = 50 is implausibly large because 50
would not be a typical distance of a student’s score from the mean of 80. (For
instance, it is impossible to score 130.)

With s = 10, a typical distance is 10, as occurs with scores of 70 and 90.

22
Example: Investing 101
One of the first principles of investing is that taking more risk brings
higher returns, at least on average in the long run.

Risk can be measured by how unpredictable the return on an
investment is.

Sometimes, risk is defined as the variability of returns.

The table lists the yearly returns on three investments over the 50 years from 1950 to 1999.

23
Coefficient of Variation
The larger the standard deviation, the greater the variability, and the
higher the risk.

So are common stocks the riskiest investment among the three?

Sometimes we need to measure the size of the standard deviation of
a population or sample relative to the size of the population or
sample mean.

Coefficient of Variation (CV) = (standard deviation / mean) × 100
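A small sketch of the coefficient of variation, taken here as the sample standard deviation divided by the sample mean, times 100. The two return series are hypothetical (they are not the 1950 to 1999 data from the table), and the function name is an assumption.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m * 100


# Hypothetical series: the second is the first scaled up by a factor of 10.
small_scale = [8, 9, 10, 11, 12]
large_scale = [80, 90, 100, 110, 120]

print("std dev:", stdev(small_scale), stdev(large_scale))
print("CV (%): ", coefficient_of_variation(small_scale),
      coefficient_of_variation(large_scale))
```

The second series has a standard deviation ten times larger, yet both have the same CV, which is why the CV is useful when the means differ greatly.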

24
Measuring Position: Percentiles & Quartiles

25
Measuring Position: Percentiles & Quartiles
To know how large or small one observation is compared with
the rest of the observations, or compared with the mean, we
can compute percentiles.

The p-th percentile is a value such that p percent of the
observations fall at or below that value.

26
Example: You are the fourth tallest person in a group of 20.
16 of the 20 people (80%) are shorter than you.

That means you are at the 80th percentile.

If your height is 1.85 m, then “1.85 m” is the 80th percentile height
in that group.

27
Measuring Position: Percentiles & Quartiles
The first quartile Q1 is the 25th percentile
The second quartile Q2 (or median) is the 50th percentile
The third quartile Q3 is the 75th percentile
The interquartile range IQR is Q3 - Q1

28
Finding a percentile
To find the p-th percentile:
1. Rank the n observations in ascending order (from
smallest to largest).
2. Find the index of your p-th percentile: i = (p/100) × n
3. If the result i is not a whole number, round it up to the next
whole number; your p-th percentile is the observation at that
position.
4. If the result i is a whole number, then your p-th
percentile is the average of your i-th and (i+1)-th
observations.
Note: Some software may give you a slightly different result.
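The procedure can be sketched as follows. The index is taken as i = (p/100) × n, and step 3 is read as "round i up to the next whole position", which is an interpretation. The function name and the test data are hypothetical.

```python
import math


def percentile(observations, p):
    """p-th percentile (0 < p < 100) using the index rule described above."""
    if not observations:
        raise ValueError("need at least one observation")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(observations)         # step 1: rank in ascending order
    n = len(ordered)
    i = p * n / 100                        # step 2: index of the percentile
    if i.is_integer():                     # step 4: whole number
        i = int(i)
        # average of the i-th and (i+1)-th observations (1-based positions)
        return (ordered[i - 1] + ordered[i]) / 2
    position = math.ceil(i)                # step 3: round up to the next position
    return ordered[position - 1]


# Hypothetical data: 10, 20, ..., 200 (20 observations).
data = list(range(10, 210, 10))
print(percentile(data, 60))   # i = 12, so average of the 12th and 13th values
print(percentile(data, 25))   # i = 5, so average of the 5th and 6th values
```

Library routines generally use other interpolation rules, so their answers can differ slightly from this one.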
29
Finding a percentile

There are 20 cereals, so n = 20.
To find the 60th percentile:
i = 60% × 20 = 12, a whole number.
So the 60th percentile is the average of the 12th and 13th
observations, which is 185.

Practical interpretation:
60% of the breakfast cereals contain no more
than 185 mg of sodium per serving.
30
Five-Number Summary
• Smallest Value
• First Quartile
• Median
• Third Quartile
• Largest Value

Try finding the five-number summary for our 20 tested
breakfast cereals:

31
Detecting potential outliers
An observation is a potential outlier if it falls more than
1.5 × IQR below the first quartile or more than 1.5 × IQR
above the third quartile.

Sometimes we call these boundaries the limits:
• lower limit = Q1 - (1.5 × IQR)
• upper limit = Q3 + (1.5 × IQR)

Are there any potential outliers in our cereal case?
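A sketch of the five-number summary and the 1.5 × IQR limits is given below. The quartiles are computed with the percentile index rule from this chapter (an implementation choice; software may use other rules), and the data set is hypothetical because the cereal values are not listed in the text.

```python
import math


def percentile(observations, p):
    """Percentile by the chapter's index rule: i = (p/100) * n."""
    ordered = sorted(observations)
    i = p * len(ordered) / 100
    if i.is_integer():
        i = int(i)
        return (ordered[i - 1] + ordered[i]) / 2
    return ordered[math.ceil(i) - 1]


def five_number_summary(observations):
    """(smallest, Q1, median, Q3, largest)."""
    return (min(observations),
            percentile(observations, 25),
            percentile(observations, 50),
            percentile(observations, 75),
            max(observations))


def potential_outliers(observations):
    """Return (lower limit, upper limit, observations outside the limits)."""
    _, q1, _, q3, _ = five_number_summary(observations)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    flagged = [x for x in observations if x < lower or x > upper]
    return lower, upper, flagged


# Hypothetical data (an assumption for illustration only).
data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 40]
print(five_number_summary(data))
print(potential_outliers(data))
```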

32
The Box-and-Whiskers Display
• The box plots the first quartile, Q1; the median, Md; and the third quartile, Q3.
• The limits are located 1.5 × IQR away from the quartiles:
   • lower limit = Q1 - (1.5 × IQR)
   • upper limit = Q3 + (1.5 × IQR)
• The whiskers are two dashed lines:
   • One dashed line is drawn from the box below Q1 down to the
     smallest measurement between the lower and upper limits.
   • Another dashed line is drawn from the box above Q3 up to the
     largest measurement between the lower and upper limits.
• Outliers lie beyond the limits of the box-and-whiskers plot.
• Outliers are measurements that are very different from most of
the other measurements. Plot each outlier using the symbol *.

33
Example: Cereal Sodium Data

34
Box-plot or Histogram?
A box plot does not portray certain features of a distribution, such as
distinct mounds and possible gaps, as clearly as a histogram.

• The histogram reveals a bimodal distribution.
• Both the histogram and the box plot suggest a mild left skew.
• The box plot will not show us whether there is a large gap in
the distribution contributing to the skew.
• Box plots are useful for identifying outliers and comparing groups.
35
Example: Male and Female College Student Heights

a. The box plots show that males are generally taller than females.
b. The median (the center line in a box) is approximately 71 inches for the males
and 65 inches for the females.
c. The variability of the middle 50% of each distribution is similar, as indicated by
the similar widths of the boxes (the IQR).
d. Both samples have heights that are unusually short or tall, flagged as outliers.
(The male sample has one outlier, while the female sample has six.)
e. The upper 75% of the male heights are higher than the lower 75% of the female
heights. That is, 75% of the female heights fall below their third quartile, about
67 inches, whereas 75% of the male heights fall above their first quartile, about
69 inches.
36
Weighted Means
• Sometimes, some measurements are more important than others.
• Assign numerical “weights” to the data.
• Weights measure the relative importance of each value.
• Calculate the weighted mean as

$\frac{\sum w_i x_i}{\sum w_i}$

where $w_i$ is the weight assigned to the ith measurement $x_i$.

37
Example: unemployment rate across U.S.

June 2001 unemployment rates in the U.S. by region

Census Region   Civilian Labor Force (millions)   Unemployment Rate (%)
Northeast       26.9                              4.1
South           50.6                              4.7
Midwest         34.7                              4.4
West            32.5                              5.0

We want the mean unemployment rate for the U.S.

38
Table 3.3
Example: unemployment rate across U.S.
• Calculate it as a weighted mean
   • So that the bigger the region, the more heavily it counts in
     the mean
• The data values are the regional unemployment rates
• The weights are the sizes of the regional labor forces

(26.9 × 4.1 + 50.6 × 4.7 + 34.7 × 4.4 + 32.5 × 5.0) / (26.9 + 50.6 + 34.7 + 32.5)
= 663.29 / 144.7
= 4.58%

• Note that the unweighted mean is 4.55%, which
underestimates the true rate by 0.03 percentage points.
• That is, 0.0003 × 144.7 million = 43,410 workers
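The same calculation can be sketched in a few lines, using the regional rates and labor-force sizes from the table. The function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Northeast, South, Midwest, West (June 2001 figures from the table).
rates = [4.1, 4.7, 4.4, 5.0]             # unemployment rate, %
labor_force = [26.9, 50.6, 34.7, 32.5]   # civilian labor force, millions

weighted = weighted_mean(rates, labor_force)
unweighted = sum(rates) / len(rates)

print(f"weighted mean:   {weighted:.2f}%")
print(f"unweighted mean: {unweighted:.2f}%")
```

The unweighted mean treats all four regions as equally large, which is why it comes out slightly lower than the weighted figure.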
39
Example: Statistics results

Ben skipped all lectures.
He self-studied and got 80% for assignments.
He only did the first project and got 85%, but he
didn’t do the second project.
Ben studied very hard for the finals, but
the questions were hard, so he only got 65%.
What is Ben’s overall score for this course?

Ben’s score
= 0% × 10% + 80% × 20% + 85% × (30%/2) + 0% × (30%/2) + 65% × 40%
= 0% + 16% + 12.75% + 0% + 26%
= 54.75%
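As a check on the arithmetic, the weighted sum can be evaluated term by term. The component labels below are hypothetical names for the five terms; the scores and weights are the ones in the expression for Ben's score. Evaluated with those weights, the sum comes to 54.75%.

```python
# (component label, Ben's score in %, weight as a fraction of the course total)
# The labels are assumptions; scores and weights come from the expression above.
components = [
    ("lectures",    0,  0.10),
    ("assignments", 80, 0.20),
    ("project 1",   85, 0.30 / 2),
    ("project 2",   0,  0.30 / 2),
    ("final exam",  65, 0.40),
]

total_weight = sum(weight for _, _, weight in components)
overall = sum(score * weight for _, score, weight in components) / total_weight

for label, score, weight in components:
    print(f"{label:12s} {score:3d}% x {weight:.2f} = {score * weight:5.2f}")
print(f"overall score: {overall:.2f}%")
```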

40
Percentages can be Confusing.
We use percentages all the time.

The tax rate was 10% last year. Consider two statements:
1. The tax rate has gone up by 2 percentage points
this year.
2. The tax rate rose by 20% this year.

• What is the tax rate this year under each statement?
• Which scenario suggests a higher tax rate this year?
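The difference between the two statements can be made concrete with a short sketch. The 10% starting rate comes from the slide; the 30% starting rate is a hypothetical second case added to show that the comparison depends on the base.

```python
def add_percentage_points(rate, points):
    """A change in percentage points is added directly to the rate."""
    return rate + points


def apply_percent_change(rate, percent):
    """A percent change is relative to the current rate."""
    return rate * (1 + percent / 100)


for base in (10.0, 30.0):   # 10% is from the slide; 30% is a hypothetical case
    by_points = add_percentage_points(base, 2)
    by_percent = apply_percent_change(base, 20)
    print(f"base {base:.0f}%: +2 points -> {by_points:.1f}%, "
          f"+20% -> {by_percent:.1f}%")
```

With a 10% base the two statements describe the same new rate; with a different base they do not.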

41
Percentages can be Confusing.
Percentage change and actual change:

Let’s review the average scores (out of a full mark of 100) on the same two
quizzes given to three classes, A, B and C:
Class A: Quiz 1 average 20, Quiz 2 average 25.
Class B: Quiz 1 average 50, Quiz 2 average 55.
Class C: Quiz 1 average 90, Quiz 2 average 95.

Which class made the most significant improvement?
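One way to explore the question is to compute both the actual change and the percentage change for each class, using the quiz averages given above. This is only a sketch; which measure counts as "most significant" remains a matter of context.

```python
# (Quiz 1 average, Quiz 2 average) for each class, from the slide.
quiz_averages = {"A": (20, 25), "B": (50, 55), "C": (90, 95)}

changes = {}
for name, (quiz1, quiz2) in quiz_averages.items():
    actual = quiz2 - quiz1              # change in points
    relative = actual / quiz1 * 100     # change as a percentage of Quiz 1
    changes[name] = (actual, relative)
    print(f"Class {name}: +{actual} points, +{relative:.1f}%")
```

Every class gains the same number of points, but the percentage change shrinks as the starting average rises.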

42
Percentages can be Confusing.
In 1970, 12.5 million adult children (18-34 years old) lived
with their parents.
In 2015, research showed that the number had risen to over 18.6
million.

The number of young adults living with their parents has increased by
48%. Is this a cause for social concern?

What if you are given further information:
During the last 45 years, the American population has
increased by 75 million people (a 32% increase).

43
Percentages can be Confusing.

Remember: always look at the background information and
the context of the numbers!

44
