Applied Statistics in Business & Economics: David P. Doane and Lori E. Seward
Economics
Vũ Võ
[email protected]
Chapter 4
Descriptive Statistics
Chapter Contents
4.1 Numerical Description
4.2 Measures of Center
4.3 Measures of Variability
4.4 Standardized Data
4.5 Percentiles, Quartiles, and Box Plots
4.6 Covariance and Correlation
4.7 Grouped Data
4.8 Skewness and Kurtosis
4.1 Numerical Description
LO4-1: Explain the concepts of center, variability, and
shape.
Three key characteristics of numerical data:
Characteristic   Interpretation
Center           Where are the data values concentrated? What seem to be typical or middle data values? Is there central tendency?
Variability      How much dispersion is there in the data? How spread out are the data values? Are there unusual values?
Shape            Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Copyright ©2019 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the
prior written consent of McGraw-Hill Education.
Brand           Defects   Brand       Defects   Brand      Defects
Porsche         83        Audi        111       Chrysler   122
Acura           86        Cadillac    111       Suzuki     122
Mercedes-Benz   87        Chevrolet   111       GMC        126
Lexus           88        Nissan      111       Kia        126
Ford            93        BMW         113       Jeep       129
Honda           95        Mercury     113       Dodge      130
Hyundai         102       Buick       114       Jaguar     130
Lincoln         106       Mazda       114       MINI       133
4.2 Measures of Center
LO4-2: Calculate and interpret common measures of
center.
When we speak of center, we are trying to describe
the middle or typical values of a distribution. You
can assess central tendency in a general way from
a dot plot or histogram, but numerical statistics
allow more precise statements. Following are six
common measures of center. Each has strengths
and weaknesses. We need to look at several of
them to obtain a clear picture of central tendency.
Mean:
The most familiar statistical measure of center is the mean. It is
the sum of the data values divided by the number of data items.
For a population we denote it by μ (mu), while for a sample we
call it x̄ (x-bar). The formulas used to compute them are given
below.
Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median
• The median (M) is the 50th percentile or
midpoint of the ordered sample data.
• M separates the upper and lower halves of the
ordered observations.
• If n is odd, the median is the middle observation
in the ordered data set.
• If n is even, the median is the average of the
middle two observations in the ordered data set.
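The odd/even rule above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's own procedure; the function name and the sample values are hypothetical.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)      # the median is defined on the ordered data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n is odd: the median is the single middle observation
        return ordered[mid]
    # n is even: the median is the average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data, not taken from the chapter
odd_sample = [7, 1, 5]
even_sample = [7, 1, 5, 3]
print(median(odd_sample))   # middle of 1, 5, 7
print(median(even_sample))  # average of 3 and 5
```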
Mode
• The most frequently occurring data value.
• May have multiple modes or no mode.
• The mode is most useful for discrete or
categorical data with only a few distinct data
values. For continuous data or data with a wide
range, the mode is rarely useful.
• Figure 4.6 shows a dot plot of the 44 P/E ratios.
• There are modes at 10 and 16 (each occurs four times),
suggesting that these are somewhat “typical” P/E ratios.
• However, 11 and 13 occur three times, suggesting that the mode
is not a robust measure of center for this data set.
• That is, we would suspect that these modes would be unlikely to
recur if we were to take a different sample of 44 stocks.
Shape
• Compare mean and median or look at the histogram to
determine degree of skewness.
• Figure 4.9 shows prototype population shapes showing
varying degrees of skewness.
Symptoms of Skewness
• The following table summarizes the
symptoms of skewness in a sample.
• Because few data sets are exactly symmetric,
skewness is a matter of degree.
• Due to the nature of random sampling, the mean
and median may differ, even when a symmetric
population is being sampled.
• Small differences between the mean and median
do not indicate significant skewness and may lack
practical importance.
Distribution's Shape               Histogram Appearance                                                             Statistics
Skewed left (negative skewness)    Long tail of histogram points left (a few low values but most data on right)     Mean < Median
Symmetric                          Tails of histogram are balanced (low/high values offset)                         Mean ≈ Median
Skewed right (positive skewness)   Long tail of histogram points right (most data on left but a few high values)    Mean > Median
[Figure: Example of a Right Skewed Distribution]
[Figure: Example of a Left Skewed Distribution]
Geometric Mean
• The geometric mean (G) is a multiplicative average:
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
Growth Rates
• A variation on the geometric mean is used to find the average
growth rate for a time series:
$GR = \sqrt[n-1]{x_n / x_1} - 1$
• For example, from 2011 to 2015, JetBlue Airlines revenues are as shown.
Year   Revenue
2011   4,504
2012   4,982
2013   5,441
2014   5,817
2015   6,416
The average growth rate is given by taking the geometric mean of the ratios
of each year’s revenue to the preceding year. However, because the
intermediate years cancel in the product of ratios, only the first and
last years are relevant:
$GR = \sqrt[4]{6416 / 4504} - 1 \approx 0.0925$, or about 9.25 percent per year.
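As a rough check on the cancellation argument, the sketch below computes the average growth rate both ways from the JetBlue revenues listed above: once as the geometric mean of the four year-over-year ratios, and once from the first and last years only. The variable names are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math

# JetBlue revenues for 2011 through 2015, as listed in the slide
revenue = [4504, 4982, 5441, 5817, 6416]

# Ratio of each year's revenue to the preceding year's revenue
ratios = [later / earlier for earlier, later in zip(revenue, revenue[1:])]

# Geometric mean of the n - 1 ratios
geo_mean_ratio = math.prod(ratios) ** (1 / len(ratios))

# Shortcut: the intermediate years cancel, leaving only last / first
shortcut_ratio = (revenue[-1] / revenue[0]) ** (1 / (len(revenue) - 1))

growth_rate = geo_mean_ratio - 1
print(f"average growth rate: {growth_rate:.4f}")   # about 0.0925
print(math.isclose(geo_mean_ratio, shortcut_ratio))
```

Both routes give the same ratio up to floating-point rounding, which is why the shortcut needs only two data values.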
Midrange
• The midrange is the point halfway between the lowest and
highest values of X. Easy to use but sensitive to extreme data
values.
Trimmed Mean
• To calculate the trimmed mean, first remove the
highest and lowest k percent of the observations.
• For example, for the n = 33 P/E ratios, we want a 5
percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim from each end,
multiply k by n: 0.05 × 33 = 1.65, which rounds to 2
observations.
• So, we would remove the two smallest and two
largest observations before averaging the
remaining values.
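The trimming steps above could be sketched as follows. This is a minimal illustration that follows the slide's rule (round k × n to the nearest whole number and drop that many observations from each end); the function name and the sample data are hypothetical.

```python
import math


def trimmed_mean(values, k):
    """Mean after dropping round(k * n) observations from each end."""
    ordered = sorted(values)
    n = len(ordered)
    trim = math.floor(k * n + 0.5)   # round to nearest, e.g. 1.65 -> 2
    if 2 * trim >= n:
        raise ValueError("k is too large: no observations would remain")
    kept = ordered[trim:n - trim]    # drop the smallest and largest values
    return sum(kept) / len(kept)


# Hypothetical sample of 10 values with one extreme observation
sample = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]
print(sum(sample) / len(sample))    # ordinary mean, pulled up by 95
print(trimmed_mean(sample, 0.10))   # drops 12 and 95 before averaging
```

Note that Excel's TRIMMEAN takes the total fraction removed from both ends combined, which is why a 5 percent trim is entered as 0.1 there, and its rounding rule may differ from the one sketched here.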
• Here is a summary of Excel’s measures of center for the
J.D. Power data:
Mean: =AVERAGE(Data) = 114.70
Median: =MEDIAN(Data) = 113
Mode: =MODE.SNGL(Data) = 111
Geo Mean: =GEOMEAN(Data) = 113.35
Midrange: =(MIN(Data)+MAX(Data))/2 = 126.5
5% Trim Mean: =TRIMMEAN(Data,0.1) = 113.94
4.3 Measures of Variability
LO4-3: Calculate and interpret common measures of
variability.
Range
Coefficient of Variation
• To compare dispersion in data sets with dissimilar units of
measurement (e.g. kilograms and ounces) or dissimilar means
(e.g. home prices in two different cities), we define the
coefficient of variation (CV), which is a unit-free measure of
dispersion:
CV = (s / x̄) × 100%
• The CV is the standard deviation expressed as a percent of the
mean.
• In some data sets, the standard deviation can actually exceed
the mean, so the CV can exceed 100%.
• This can happen in skewed data sets, especially if there are
outliers.
• The CV is useful for comparing variables measured in different
units.
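To illustrate why the CV is unit-free, the sketch below computes it for the same hypothetical measurements expressed in kilograms and in ounces. The function name, the data, and the conversion factor of 35.274 ounces per kilogram are assumptions for illustration; the sample standard deviation is used.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percent of the sample mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * statistics.stdev(values) / mean


# Hypothetical weights of the same five items in two different units
weights_kg = [2.0, 2.5, 3.0, 3.5, 4.0]
weights_oz = [kg * 35.274 for kg in weights_kg]

cv_kg = coefficient_of_variation(weights_kg)
cv_oz = coefficient_of_variation(weights_oz)
print(round(cv_kg, 2), round(cv_oz, 2))   # same value in either unit
```

The standard deviations differ by the conversion factor, but so do the means, so the ratio is unchanged.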
Mean Absolute Deviation
• An additional measure of dispersion is the mean absolute
deviation (MAD). This statistic reveals the average distance
from the center. Absolute values must be used; otherwise
the deviations around the mean would sum to zero.
• The MAD is appealing because of its simple, concrete
interpretation. Using the lever analogy, the MAD tells us
the average distance from an individual data point to the
fulcrum (the mean).
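A small sketch may make the role of the absolute values concrete. It assumes the usual definition, the average of |x − x̄| over all n observations; the function name and the data are hypothetical.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


# Hypothetical data, not taken from the chapter
data = [2, 4, 6, 8, 10]
mean = sum(data) / len(data)

# Without absolute values the deviations cancel out
raw_deviation_sum = sum(x - mean for x in data)

print(raw_deviation_sum)               # zero for this data set
print(mean_absolute_deviation(data))   # average distance from the mean
```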
Central Tendency vs. Dispersion: Manufacturing
• Figure 4.17 (next slide) shows histograms of hole diameters
drilled in a steel plate during a manufacturing process.
• The desired distribution is shown in red.
• The samples from Machine A have the desired mean diameter (5
mm) but too much variation around the mean. It might be an
older machine whose moving parts have become loose through
normal wear, so there is greater variation in the holes drilled.
• Samples from Machine B have acceptable variation in hole
diameter, but the mean is incorrectly adjusted (less than the
desired 5 mm).
• To monitor quality, we would take frequent samples from the
output of each machine so that the process can be stopped and
adjusted if the sample statistics indicate a problem.
[Figure 4.17: Central Tendency vs. Dispersion: Manufacturing]
4.4 Standardized Data
LO4-4: Apply Chebyshev’s theorem.
Chebyshev’s Theorem
• For any population with mean μ and standard deviation σ, the
percentage of observations that lie within k standard deviations of
the mean must be at least 100[1 – 1/k²].
• For k = 2 standard deviations, 100[1 – 1/2²] = 75%.
• So, at least 75.0% will lie within μ ± 2σ.
• For k = 3 standard deviations, 100[1 – 1/3²] = 88.9%.
• So, at least 88.9% will lie within μ ± 3σ.
• Although applicable to any data set, these limits tend to be
rather wide.
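The Chebyshev bound is simple enough to evaluate directly. The sketch below reproduces the two cases worked above; the function name is hypothetical, and the guard for k ≤ 1 is an added assumption, since the bound is not informative there.

```python
def chebyshev_lower_bound(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 1))   # 75.0 and 88.9
```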
LO4-5: Apply the Empirical Rule and recognize outliers.
The Empirical Rule
Note: No upper bound is given. Data values outside μ ± 3σ are rare.
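To see how rare such values are, the sketch below computes the percentage of an exactly normal population that lies within k standard deviations of the mean, using the error function from Python's standard library. It assumes perfect normality, which real samples only approximate; the function name is hypothetical.

```python
import math


def normal_percent_within(k):
    """Percent of a normal population within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_percent_within(k), 2))
```

For k = 3 the result is about 99.73 percent, so under normality fewer than 3 observations in 1,000 would fall outside μ ± 3σ. These percentages are much tighter than Chebyshev's minimums of 75 percent and 88.9 percent for k = 2 and k = 3.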
LO4-6: Transform a data set into standardized values.
Unusual Observations
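A minimal sketch of standardizing a sample and flagging unusual values follows. It assumes the usual sample z-score, z = (x − x̄)/s, and a cutoff of |z| > 2 for "unusual"; the cutoff is a common rule of thumb assumed here, not a value recovered from this slide, and the data are hypothetical.

```python
import statistics


def standardize(values):
    """Convert each observation to a z-score using the sample mean and s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


# Hypothetical sample with one value far from the rest
sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 30]
z_scores = standardize(sample)

# Assumed rule of thumb: flag observations more than 2 standard deviations out
unusual = [x for x, z in zip(sample, z_scores) if abs(z) > 2]
print(unusual)
```

Standardized values have mean 0 and standard deviation 1, so they can be compared across variables measured on different scales.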
Estimating Sigma
• For a normal distribution, the range of values is
almost 6σ (from μ – 3σ to μ + 3σ).
• If you know the range R (high – low), you can
estimate the standard deviation as σ = R/6.
• Useful for approximating the standard deviation
when only R is known.
• This estimate depends on the assumption of
normality.
4.5 Percentiles, Quartiles, and Box Plots
LO4-7: Calculate quartiles and other percentiles.
Percentiles
• Percentiles are data that have been divided into 100
groups.
• For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the test-takers
scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4
groups.
• Percentiles may be used to establish benchmarks
for comparison purposes (e.g. health care,
manufacturing, and banking industries use 5th,
25th, 50th, 75th and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly
used to assess financial performance and stock
portfolios.
• Percentiles can be used in employee merit
evaluation and salary benchmarking.
Quartiles
• Quartiles are scale points that divide the sorted
data into four groups of approximately equal size.
Q2 divides the sorted data:         Lower 50% | Upper 50%
Q1 and Q3 divide the sorted data:   Lower 25% | Middle 50% | Upper 25%
Quartiles – The method of medians
• The first quartile Q1 is the median of the data values below Q2, and
the third quartile Q3 is the median of the data values above Q2.
• For the first half of the data, 50% lie above and 50% lie below Q1.
• For the second half of the data, 50% lie above and 50% lie below Q3.
Method of Medians
• For small data sets, find quartiles using method of
medians:
Step 1: Sort the observations.
Step 2: Find the median Q2.
Step 3: Find the median of the data values that lie
below Q2.
Step 4: Find the median of the data values that lie
above Q2.
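The four steps above can be sketched as follows. One assumption is made explicit: the halves are split by position in the sorted data, so when n is odd the middle observation is left out of both halves. Function names and data are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians; needs at least 2 values."""
    ordered = sorted(values)              # Step 1: sort the observations
    n = len(ordered)
    q2 = median_of_sorted(ordered)        # Step 2: find the median Q2
    half = n // 2
    lower = ordered[:half]                # values below Q2 (by position)
    upper = ordered[half + n % 2:]        # values above Q2 (by position)
    q1 = median_of_sorted(lower)          # Step 3: median of the lower half
    q3 = median_of_sorted(upper)          # Step 4: median of the upper half
    return q1, q2, q3


# Hypothetical data sets with an even and an odd number of observations
print(quartiles_by_medians([8, 3, 1, 6, 2, 7, 5, 4]))
print(quartiles_by_medians([9, 8, 3, 1, 6, 2, 7, 5, 4]))
```

Spreadsheet and statistical packages often interpolate instead, so their quartiles can differ slightly from these.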
LO4-8: Make and interpret box plots.
Box Plots
• A box plot shows variability and shape.
4.6 Covariance and Correlation
LO4-9: Calculate and interpret a correlation coefficient
and covariance.
Covariance
The covariance of two random variables X and Y (denoted σXY )
measures the degree to which the values of X and Y change
together.
Correlation Coefficient
• Conceptually, a correlation coefficient is the covariance
divided by the product of the standard deviations
(denoted σX and σY for a population or sX and sY for a
sample). For a population, the correlation coefficient is
indicated by the lowercase Greek letter ρ (rho), while for a
sample we use the lowercase Roman letter r.
Note: -1 ≤ r ≤ +1.
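The relationship between covariance and correlation can be sketched for a sample as follows. The sketch assumes the usual sample forms, dividing the sum of cross-products by n − 1 and then dividing the covariance by the product of the sample standard deviations; function names and data are hypothetical.

```python
import math


def sample_covariance(x, y):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # variance is covariance with itself
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(sample_covariance(x, y))
print(round(sample_correlation(x, y), 4))
```

Unlike the covariance, r does not change if X or Y is rescaled by a positive constant, which is what makes it comparable across data sets.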
4.7 Grouped Data
LO4-10: Calculate the mean and standard deviation from
grouped data.
Weighted Mean
• The weighted mean is a sum that assigns each
data value a weight that represents a fraction of
the total (i.e. the k weights must sum to 1).
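A minimal sketch of the weighted mean follows, with a check that the weights sum to 1 as the definition requires. The function name, the tolerance, and the grading scheme are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight times value, where the weights must sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for x, w in zip(values, weights))


# Hypothetical course grade: homework 30%, midterm 30%, final exam 40%
scores = [90, 80, 70]
weights = [0.3, 0.3, 0.4]
print(weighted_mean(scores, weights))   # 27 + 24 + 28, up to float rounding
```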
Group Mean
• Each interval j has a midpoint mj and a frequency fj. We
calculate the estimated mean by multiplying the midpoint
of each class by its class frequency, taking the sum over
all k classes, and dividing by sample size n.
Group Standard Deviation
• The estimate for the standard deviation is obtained by
subtracting the estimated mean from each class midpoint,
squaring the difference, multiplying by the class frequency,
taking the sum over all classes to obtain the sum of
squared deviations about the mean, dividing by n − 1, and
taking the square root. Avoid the common mistake of
“rounding off” the mean before subtracting it from each
midpoint.
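The grouped-data estimates described above can be sketched in one function. The mean is kept at full precision when it is subtracted from each midpoint, in line with the warning about rounding. The function name and the frequency table are hypothetical.

```python
import math


def grouped_mean_and_sd(midpoints, frequencies):
    """Estimate the mean and standard deviation from class midpoints and counts."""
    n = sum(frequencies)
    # Estimated mean: sum of f_j * m_j over all classes, divided by n
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    # Sum of squared deviations about the (unrounded) mean, weighted by frequency
    ss = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return mean, math.sqrt(ss / (n - 1))


# Hypothetical frequency table with classes 0-10, 10-20, 20-30, 30-40
midpoints = [5, 15, 25, 35]
frequencies = [2, 5, 8, 5]

mean, sd = grouped_mean_and_sd(midpoints, frequencies)
print(mean, round(sd, 4))
```

Because every observation in a class is treated as if it sat at the class midpoint, both results are estimates rather than exact sample statistics.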
4.8 Skewness and Kurtosis
LO4-11: Assess skewness and kurtosis in a sample.
Skewness
Kurtosis