Applied Statistics in Business & Economics

David P. Doane and Lori E. Seward

Vũ Võ
Chapter 4
Descriptive Statistics
Chapter Contents
4.1 Numerical Description
4.2 Measures of Center
4.3 Measures of Variability
4.4 Standardized Data
4.5 Percentiles, Quartiles, and Box Plots
4.6 Covariance and Correlation
4.7 Grouped Data
4.8 Skewness and Kurtosis

Chapter Learning Objectives

LO4-1: Explain the concepts of center, variability, and shape.
LO4-2: Calculate and interpret common measures of center.
LO4-3: Calculate and interpret common measures of variability.
LO4-4: Apply Chebyshev’s theorem.
LO4-5: Apply the Empirical Rule and recognize outliers.
LO4-6: Transform a data set into standardized values.

Chapter Learning Objectives (continued)

LO4-7: Calculate quartiles and other percentiles.
LO4-8: Make and interpret box plots.
LO4-9: Calculate and interpret a correlation coefficient and covariance.
LO4-10: Calculate the mean and standard deviation from grouped data.
LO4-11: Assess skewness and kurtosis in a sample.

4.1 Numerical Description
LO4-1: Explain the concepts of center, variability, and
shape.
Three key characteristics of numerical data:

Characteristic | Interpretation
Center | Where are the data values concentrated? What seem to be typical or middle data values? Is there central tendency?
Variability | How much dispersion is there in the data? How spread out are the data values? Are there unusual values?
Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Copyright ©2019 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the
prior written consent of McGraw-Hill Education.
LO4-1: Explain the concepts of center, variability, and
shape (continued).

Example 4.1: Every year, J.D. Power and Associates
issues its initial vehicle quality ratings. These ratings are
of interest to consumers, dealers, and manufacturers.
The sorted table (Table 4.3 in the text) on the next slide
shows defect rates for 33 vehicle brands for model year
2010. Reported defect rates are based on a sample of
vehicles within each brand.
• We will demonstrate how numerical statistics can be
used to summarize a data set like this.

LO4-1: Explain the concepts of center, variability, and shape
(continued, 2).
Brand Defects Brand Defects Brand Defects
Porsche 83 Audi 111 Chrysler 122
Acura 86 Cadillac 111 Suzuki 122
Mercedes-Benz 87 Chevrolet 111 GMC 126
Lexus 88 Nissan 111 Kia 126
Ford 93 BMW 113 Jeep 129
Honda 95 Mercury 113 Dodge 130
Hyundai 102 Buick 114 Jaguar 130
Lincoln 106 Mazda 114 MINI 133
Infiniti 107 Scion 114 Volkswagen 135
Volvo 109 Toyota 117 Mitsubishi 146
Ram 110 Subaru 121 Land Rover 170

LO4-1: Explain the concepts of center, variability, and
shape (continued, 3).

Example 4.1 (continued): The sorted data provide insight
into both center and variability. The values range from
83 (Porsche) to 170 (Land Rover), while the middle
values seem mostly to lie between 110 and 120.
• The dot plot reveals additional detail, including one
unusual value.

LO4-1: Explain the concepts of center, variability, and
shape (continued, 4).

Example 4.1 (continued): The next visual step is a
histogram. The modal class (largest frequency)
between 100 and 120 reveals the center. The
shape of the histogram is right-skewed (most data
on the left, longer right tail).
• The histogram is displayed on the next slide.

LO4-1: Explain the concepts of center, variability, and
shape (continued, 5).
Example 4.1 (continued): [Figure: histogram of the J.D. Power defect rates]

4.2 Measures of Center
LO4-2: Calculate and interpret common measures of
center.
When we speak of center, we are trying to describe
the middle or typical values of a distribution. You
can assess central tendency in a general way from
a dot plot or histogram, but numerical statistics
allow more precise statements. Following are six
common measures of center. Each has strengths
and weaknesses. We need to look at several of
them to obtain a clear picture of central tendency.

LO4-2: Calculate and interpret common measures of
center (continued).

Mean:
The most familiar statistical measure of center is the mean. It is
the sum of the data values divided by the number of data items.
For a population we denote it by μ (mu), while for a sample we
call it x̄ (x-bar). The formulas used to compute them are given
below.
Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

LO4-2: Calculate and interpret common measures of
center (continued, 2).

Median
• The median (M) is the 50th percentile or
midpoint of the ordered sample data.
• M separates the upper and lower halves of the
ordered observations.
• If n is odd, the median is the middle observation
in the ordered data set.
• If n is even, the median is the average of the
middle two observations in the ordered data set.

LO4-2: Calculate and interpret common measures of
center (continued, 3).

Mode
• The most frequently occurring data value.
• May have multiple modes or no mode.
• The mode is most useful for discrete or
categorical data with only a few distinct data
values. For continuous data or data with a wide
range, the mode is rarely useful.

LO4-2: Calculate and interpret common measures of
center (continued, 4).
Mode
• Figure 4.6 shows a dot plot of the 44 P/E ratios.
• There are modes at 10 and 16 (each occurs four times),
suggesting that these are somewhat “typical” P/E ratios.
• However, 11 and 13 occur three times, suggesting that the mode
is not a robust measure of center for this data set.
• That is, we would suspect that these modes would be unlikely to
recur if we were to take a different sample of 44 stocks.

LO4-2: Calculate and interpret common measures of
center (continued, 5).
Shape
• Compare mean and median or look at the histogram to
determine degree of skewness.
• Figure 4.9 shows prototype population shapes showing
varying degrees of skewness.

LO4-2: Calculate and interpret common measures of
center (continued, 6).

Symptoms of Skewness
• The table on the next slide summarizes the
symptoms of skewness in a sample.
• Because few data sets are exactly symmetric,
skewness is a matter of degree.
• Due to the nature of random sampling, the mean
and median may differ, even when a symmetric
population is being sampled.
• Small differences between the mean and median
do not indicate significant skewness and may lack
practical importance.

LO4-2: Calculate and interpret common measures of
center (continued, 7).
Symptoms of Skewness
Distribution's Shape | Histogram Appearance | Statistics
Skewed left (negative skewness) | Long tail of histogram points left (a few low values but most data on right) | Mean < Median
Symmetric | Tails of histogram are balanced (low/high values offset) | Mean ≈ Median
Skewed right (positive skewness) | Long tail of histogram points right (most data on left but a few high values) | Mean > Median

LO4-2: Calculate and interpret common measures of
center (continued, 8).
Example of a Right Skewed Distribution

LO4-2: Calculate and interpret common measures of
center (continued, 9).
Example of a Left Skewed Distribution

LO4-2: Calculate and interpret common measures of center
(continued, 10).
Geometric Mean
• The geometric mean (G) is a multiplicative average:
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$

Growth Rates
• A variation on the geometric mean used to find the average
growth rate for a time series:
$GR = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$

LO4-2: Calculate and interpret common measures of
center (continued, 11).
Growth Rates
• For example, from 2011 to 2015, JetBlue Airlines revenues are
as shown.
Year Revenue
2011 4,504
2012 4,982
2013 5,441
2014 5,817
2015 6,416
The average growth rate is given by taking the geometric mean of the ratios
of each year’s revenue to the preceding year. However, due to
cancellations, only the first and last years are relevant:
$GR = \sqrt[4]{\frac{6416}{4504}} - 1 = 1.092 - 1 = 0.092$
or 9.2 percent per year.
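
The cancellation argument can be checked numerically. The following Python sketch (variable names are illustrative and not from the text) computes the geometric mean of the four year-to-year ratios and compares it with the shortcut that uses only the first and last years.

```python
import math

# JetBlue revenues for 2011-2015, as listed in the slide
revenue = [4504, 4982, 5441, 5817, 6416]

# Ratio of each year's revenue to the preceding year's revenue
ratios = [later / earlier for earlier, later in zip(revenue, revenue[1:])]

# Geometric mean of the n - 1 ratios, minus 1, is the average growth rate
gr_from_ratios = math.prod(ratios) ** (1 / len(ratios)) - 1

# The intermediate years cancel in the product, so only the endpoints matter
gr_from_endpoints = (revenue[-1] / revenue[0]) ** (1 / (len(revenue) - 1)) - 1

print(f"{gr_from_ratios:.4f} {gr_from_endpoints:.4f}")
```

Both expressions give about 0.0925, which the slide reports as 9.2 percent per year.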

LO4-2: Calculate and interpret common measures of
center (continued, 12).
Midrange
• The midrange is the point halfway between the lowest and
highest values of X. Easy to use but sensitive to extreme data
values.
$\text{Midrange} = \frac{x_{\min} + x_{\max}}{2}$
• For the J.D. Power data, the midrange is (83 + 170)/2 = 126.5,
which is higher than the mean (114.70) or median (113).

LO4-2: Calculate and interpret common measures of
center (continued, 13).
Trimmed Mean
• To calculate the trimmed mean, first remove the
highest and lowest k percent of the observations.
• For example, for the n = 33 J.D. Power defect rates, we want
a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim from each end,
multiply k by n, which is 0.05 × 33 = 1.65, or 2 observations
after rounding.
• So, we would remove the two smallest and two
largest observations before averaging the
remaining values.
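
As a rough illustration of the trimming rule, the Python sketch below applies it to the J.D. Power defect rates. It assumes the n = 33 observations are the defect rates tabulated earlier in the deck, since that is the only 33-value data set in these slides. Both function names are hypothetical. The second function mimics the documented behavior of Excel's TRIMMEAN, which rounds the total number of trimmed points down to the nearest multiple of 2, so the two rules can give different answers.

```python
import math

def trimmed_mean(data, k):
    """Mean after dropping round(k * n) values from each end (slide's rule)."""
    values = sorted(data)
    n = len(values)
    drop = math.floor(k * n + 0.5)       # 0.05 * 33 = 1.65 -> 2 per end
    kept = values[drop:n - drop]
    return sum(kept) / len(kept)

def excel_style_trimmed_mean(data, percent):
    """Mimics TRIMMEAN: percent is the total fraction trimmed from both ends."""
    values = sorted(data)
    n = len(values)
    drop = int(percent * n) // 2         # 0.1 * 33 = 3.3 -> 2 in total -> 1 per end
    kept = values[drop:n - drop]
    return sum(kept) / len(kept)

defects = [83, 86, 87, 88, 93, 95, 102, 106, 107, 109, 110,
           111, 111, 111, 111, 113, 113, 114, 114, 114, 117, 121,
           122, 122, 126, 126, 129, 130, 130, 133, 135, 146, 170]

print(round(trimmed_mean(defects, 0.05), 2))             # two dropped per end
print(round(excel_style_trimmed_mean(defects, 0.1), 2))  # one dropped per end
```

Dropping two values per end gives about 113.79, while the Excel-style rule drops one per end and gives about 113.94.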

LO4-2: Calculate and interpret common measures of
center (continued, 14).
• Here is a summary of all the measures of central
tendency for the J.D. Power data.
Excel’s measures of center for the J.D. Power data:
Mean: =AVERAGE(Data) = 114.70
Median: =MEDIAN(Data) = 113
Mode: =MODE.SNGL(Data) = 111
Geo Mean: =GEOMEAN(Data) = 113.35
Midrange: =(MIN(Data)+MAX(Data))/2 = 126.5
5% Trim Mean: =TRIMMEAN(Data,0.1) = 113.94

• The trimmed mean mitigates the effects of very high
values but still exceeds the median.
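
For readers working outside Excel, the same measures can be sketched with Python's standard `statistics` module. The list below is the J.D. Power table transcribed from the earlier slide, and the pairing of each Python call with an Excel function is for orientation only.

```python
import statistics

# J.D. Power defect rates for the 33 brands (from the sorted table)
defects = [83, 86, 87, 88, 93, 95, 102, 106, 107, 109, 110,
           111, 111, 111, 111, 113, 113, 114, 114, 114, 117, 121,
           122, 122, 126, 126, 129, 130, 130, 133, 135, 146, 170]

mean = statistics.mean(defects)                  # like =AVERAGE(Data)
median = statistics.median(defects)              # like =MEDIAN(Data)
mode = statistics.mode(defects)                  # like =MODE.SNGL(Data)
geo_mean = statistics.geometric_mean(defects)    # like =GEOMEAN(Data)
midrange = (min(defects) + max(defects)) / 2     # halfway between extremes

print(round(mean, 2), median, mode, round(geo_mean, 2), midrange)
```

The geometric mean comes out below the arithmetic mean, as it does for any set of positive values that are not all equal.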

4.3 Measures of Variability
LO4-3: Calculate and interpret common measures of
variability.

Variation is the “spread” of data points about the center
of the distribution in a sample. Consider the following
measures of variability:

Range
• The range is the difference between the largest
and smallest observations.
Range = Xmax − Xmin

LO4-3: Calculate and interpret common measures of
variability (continued).

Variance and Standard Deviation

• The population variance (denoted σ², where σ is
the lowercase Greek letter “sigma”) is defined as
the sum of squared deviations from the mean
divided by the population size:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

LO4-3: Calculate and interpret common measures of
variability (continued, 2).

Variance and Standard Deviation

• The sample variance (denoted by s²) is defined
as the sum of squared deviations from the
sample mean divided by (n – 1), where n is
the sample size:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

LO4-3: Calculate and interpret common measures of
variability (continued, 3).

Variance and Standard Deviation

• The standard deviation is defined as the square
root of the variance. The units of measurement for
the standard deviation are the same as the units of the
variable.

Population Standard Deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
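
The only difference between the population and sample versions is the divisor. A minimal Python sketch, using a small made-up data set (not from the text), computes both from the same sum of squared deviations and checks the sample versions against the standard `statistics` module.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical data for illustration
n = len(x)
mean = sum(x) / n
ss = sum((xi - mean) ** 2 for xi in x)  # sum of squared deviations

pop_var = ss / n           # data treated as a whole population: divide by N
samp_var = ss / (n - 1)    # data treated as a sample: divide by n - 1
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print(pop_var, pop_sd, round(samp_var, 3), round(samp_sd, 3))
```

Because n − 1 is smaller than n, the sample variance is always a little larger than the population formula applied to the same numbers.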

LO4-3: Calculate and interpret common measures of
variability (continued, 4).
Coefficient of Variation
• To compare dispersion in data sets with dissimilar units of
measurement (e.g. kilograms and ounces) or dissimilar means
(e.g. home prices in two different cities), we define the
coefficient of variation (CV), which is a unit-free measure of
dispersion:
$CV = 100 \times \frac{s}{\bar{x}}\,\%$
• The CV is the standard deviation expressed as a percent of the
mean.
• In some data sets, the standard deviation can actually exceed
the mean, so the CV can exceed 100%.
• This can happen in skewed data sets, especially if there are
outliers.
• The CV is useful for comparing variables measured in different
units.
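
To see why the CV is called unit-free, the Python sketch below computes it for the same hypothetical weights expressed in kilograms and in ounces. The data and the helper name `coefficient_of_variation` are assumptions made for illustration.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percent of the mean."""
    return 100 * statistics.stdev(data) / statistics.mean(data)

weights_kg = [50, 60, 70]                          # hypothetical weights
weights_oz = [kg * 35.274 for kg in weights_kg]    # same weights in ounces

cv_kg = coefficient_of_variation(weights_kg)
cv_oz = coefficient_of_variation(weights_oz)
print(round(cv_kg, 2), round(cv_oz, 2))
```

Rescaling multiplies the standard deviation and the mean by the same factor, so the ratio is unchanged.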
LO4-3: Calculate and interpret common measures of
variability (continued, 5).
Mean Absolute Deviation
• An additional measure of dispersion is the mean absolute
deviation (MAD). This statistic reveals the average distance
from the center. Absolute values must be used; otherwise
the deviations around the mean would sum to zero.
$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
• The MAD is appealing because of its simple, concrete
interpretation. Using the lever analogy, the MAD tells us
what the average distance is from an individual data point
to the fulcrum (the mean).
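
A short Python sketch on a made-up data set (not from the text) shows both points: the signed deviations cancel, while their absolute values average to the MAD.

```python
x = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical data for illustration
mean = sum(x) / len(x)

deviations = [xi - mean for xi in x]     # signed distances from the mean
mad = sum(abs(d) for d in deviations) / len(x)

print(sum(deviations), mad)
```

Here the mean is 5, the signed deviations sum to zero, and the typical distance from the mean is 1.5 units.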

LO4-3: Calculate and interpret common measures of
variability (continued, 6).
Central Tendency vs. Dispersion: Manufacturing
• Figure 4.17 (next slide) shows histograms of hole diameters
drilled in a steel plate during a manufacturing process.
• The desired distribution is shown in red.
• The samples from Machine A have the desired mean diameter (5
mm) but too much variation around the mean. It might be an
older machine whose moving parts have become loose through
normal wear, so there is greater variation in the holes drilled.
• Samples from Machine B have acceptable variation in hole
diameter, but the mean is incorrectly adjusted (less than the
desired 5 mm).
• To monitor quality, we would take frequent samples from the
output of each machine so that the process can be stopped and
adjusted if the sample statistics indicate a problem.
LO4-3: Calculate and interpret common measures of
variability (continued, 7).
Central Tendency vs. Dispersion: Manufacturing

4.4 Standardized Data
LO4-4: Apply Chebyshev’s theorem.

Chebyshev’s Theorem
• For any population with mean μ and standard deviation σ, the
percentage of observations that lie within k standard deviations of
the mean must be at least 100[1 – 1/k²].
• For k = 2 standard deviations, 100[1 – 1/2²] = 75%.
• So, at least 75.0% will lie within μ ± 2σ.
• For k = 3 standard deviations, 100[1 – 1/3²] = 88.9%.
• So, at least 88.9% will lie within μ ± 3σ.
• Although applicable to any data set, these limits tend to be
rather wide.
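
The bound is simple enough to tabulate directly. In this Python sketch the function name `chebyshev_lower_bound` is an illustrative choice, and the guard for k ≤ 1 reflects the fact that the formula gives no useful information there.

```python
def chebyshev_lower_bound(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 100 * (1 - 1 / k ** 2)

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 1))
```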

LO4-5: Apply the Empirical Rule and recognize outliers.

The Empirical Rule

• The normal distribution is symmetric and is also
known as the bell-shaped curve.
• The Empirical Rule states that for data from a
normal distribution, we expect the interval μ ± kσ to
contain a known percentage of data. For:
• k = 1, 68.26% will lie within μ ± 1σ.
• k = 2, 95.44% will lie within μ ± 2σ.
• k = 3, 99.73% will lie within μ ± 3σ.
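
These percentages come from the normal distribution itself: the share within k standard deviations equals erf(k/√2). The Python sketch below evaluates that expression with the standard library. It gives 68.27 and 95.45 for k = 1 and k = 2, so the slide's 68.26 and 95.44 presumably reflect a z table rounded to four decimals (an assumption, since the slides do not say).

```python
import math

def percent_within(k):
    """Percent of a normal population within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(percent_within(k), 2))
```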

LO4-5: Apply the Empirical Rule and recognize outliers
(continued).
The Empirical Rule

Note: No upper bound is given. Data values outside μ ± 3σ
are rare.

LO4-6: Transform a data set into standardized values.

• A standardized variable (Z) redefines each observation
in terms of the number of standard deviations from the
mean.
Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$
• A negative z value means the observation is to the
left of the mean.
• A positive z value means the observation is to the
right of the mean.

LO4-6: Transform a data set into standardized values
(continued).

Unusual Observations

Based on its standardized z-score, a data value is classified as:

Unusual if |zi| > 2 (beyond µ ± 2σ)

Outlier if |zi| > 3 (beyond µ ± 3σ)
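
As an illustration, the Python sketch below standardizes the J.D. Power defect rates from the earlier table and applies both cutoffs. Because these are sample data, it uses the sample mean and sample standard deviation in place of µ and σ, which is an assumption about how the rule would be applied here.

```python
import statistics

defects = [83, 86, 87, 88, 93, 95, 102, 106, 107, 109, 110,
           111, 111, 111, 111, 113, 113, 114, 114, 114, 117, 121,
           122, 122, 126, 126, 129, 130, 130, 133, 135, 146, 170]

x_bar = statistics.mean(defects)
s = statistics.stdev(defects)                 # sample standard deviation

z = [(x - x_bar) / s for x in defects]        # one z-score per observation

unusual = [x for x, zi in zip(defects, z) if abs(zi) > 2]
outliers = [x for x, zi in zip(defects, z) if abs(zi) > 3]
print(unusual, outliers)
```

Only the largest value, 170 (Land Rover), exceeds either cutoff. Its z-score is slightly above 3, so it is flagged by both tests.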

LO4-6: Transform a data set into standardized values
(continued, 2).
Estimating Sigma
• For a normal distribution, the range of values is
almost 6σ (from μ – 3σ to μ + 3σ).
• If you know the range R (high – low), you can
estimate the standard deviation as σ = R/6.
• Useful for approximating the standard deviation
when only R is known.
• This estimate depends on the assumption of
normality.

4.5 Percentiles, Quartiles, and Box Plots
LO4-7: Calculate quartiles and other percentiles.
Percentiles
• Percentiles are data that have been divided into 100
groups.
• For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the test-takers
scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4
groups.

LO4-7: Calculate quartiles and other percentiles
(continued).
Percentiles
• Percentiles may be used to establish benchmarks
for comparison purposes (e.g. health care,
manufacturing, and banking industries use 5th,
25th, 50th, 75th and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly
used to assess financial performance and stock
portfolios.
• Percentiles can be used in employee merit
evaluation and salary benchmarking.

LO4-7: Calculate quartiles and other percentiles
(continued, 2).
Quartiles
• Quartiles are scale points that divide the sorted
data into four groups of approximately equal size.

Q1 Q2 Q3

Lower 25% | Second 25% | Third 25% | Upper 25%

• The three values that separate the four groups are
called Q1, Q2, and Q3, respectively.
LO4-7: Calculate quartiles and other percentiles
(continued, 3).
Quartiles
• The second quartile Q2 is the median, a measure of central
tendency.

Q2
← Lower 50% → | ← Upper 50% →

• Q1 and Q3 measure dispersion since the interquartile range
Q3 – Q1 measures the degree of spread in the middle 50
percent of data values.

Q1 Q3
Lower 25% | ← Middle 50% → | Upper 25%

LO4-7: Calculate quartiles and other percentiles
(continued, 4).
Quartiles – The method of medians
• The first quartile Q1 is the median of the data values below Q2, and
the third quartile Q3 is the median of the data values above Q2.

Q1 Q2 Q3

Lower 25% | Second 25% | Third 25% | Upper 25%

• For the first half of the data, 50% lie above and 50% below Q1.
• For the second half of the data, 50% lie above and 50% below Q3.

LO4-7: Calculate quartiles and other percentiles
(continued, 5).

Method of Medians
• For small data sets, find quartiles using method of
medians:
Step 1: Sort the observations.
Step 2: Find the median Q2.
Step 3: Find the median of the data values that lie
below Q2.
Step 4: Find the median of the data values that lie
above Q2.

Refer to the text for examples.
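
The four steps can be sketched in Python as follows. The function name `quartiles_by_medians` and the sample data are hypothetical. The sketch assumes that, when n is odd, the middle observation is left out of both halves, which is one common reading of "values that lie below Q2" and "above Q2".

```python
import statistics

def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    values = sorted(data)                    # Step 1: sort the observations
    n = len(values)
    q2 = statistics.median(values)           # Step 2: median of all values
    half = n // 2
    lower = values[:half]                    # observations below the median position
    upper = values[half + 1:] if n % 2 else values[half:]
    q1 = statistics.median(lower)            # Step 3: median of the lower half
    q3 = statistics.median(upper)            # Step 4: median of the upper half
    return q1, q2, q3

print(quartiles_by_medians([7, 1, 3, 9, 5, 11, 13, 15]))   # even n
print(quartiles_by_medians([1, 3, 5, 7, 9]))               # odd n
```

Software packages use several different quartile conventions, so results from a spreadsheet function may differ slightly from this method on the same data.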


LO4-8: Make and interpret box plots.

A useful tool of exploratory data analysis (EDA) is
the box plot (also called a box-and-whisker plot)
based on the five-number summary:
xmin, Q1, Q2, Q3, xmax

The box plot is displayed visually, like this:

LO4-8: Make and interpret box plots (continued).

Box Plots
• A box plot shows variability and shape.

LO4-8: Make and interpret box plots (continued, 2).

Box Plots: Fences and Unusual Data Values

• Use quartiles to detect unusual data points by
defining fences using the following formulas:

Fence | Inner fences | Outer fences
Lower fence | Q1 – 1.5 (Q3 – Q1) | Q1 – 3.0 (Q3 – Q1)
Upper fence | Q3 + 1.5 (Q3 – Q1) | Q3 + 3.0 (Q3 – Q1)

LO4-8: Make and interpret box plots (continued, 3).

Box Plots: Fences and Unusual Data Values

• Values outside the inner fences are unusual while
those outside the outer fences are outliers. Here
is a visual illustrating the fences:

LO4-8: Make and interpret box plots (continued, 4).

Box Plots: Fences and Unusual Data Values

• For example, consider the J.D. Power defect data
(Q1 = 107, Q3 = 126):

Fence | Inner fences | Outer fences
Lower fence | 107 – 1.5 (126 – 107) = 78.5 | 107 – 3.0 (126 – 107) = 50
Upper fence | 126 + 1.5 (126 – 107) = 154.5 | 126 + 3.0 (126 – 107) = 183

• There is one outlier (170) that lies above the
inner fence. There are no extreme outliers that
exceed the outer fence.
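
The fence arithmetic can be reproduced with a few lines of Python. This sketch takes the quartiles 107 and 126 quoted in the slide as given (rather than recomputing them) and checks the 33 J.D. Power defect rates from the earlier table against both sets of fences.

```python
q1, q3 = 107, 126                  # quartiles used in the slide
iqr = q3 - q1                      # interquartile range

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # (lower, upper) inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # (lower, upper) outer fences

defects = [83, 86, 87, 88, 93, 95, 102, 106, 107, 109, 110,
           111, 111, 111, 111, 113, 113, 114, 114, 114, 117, 121,
           122, 122, 126, 126, 129, 130, 130, 133, 135, 146, 170]

beyond_inner = [x for x in defects if x < inner[0] or x > inner[1]]
beyond_outer = [x for x in defects if x < outer[0] or x > outer[1]]
print(inner, outer, beyond_inner, beyond_outer)
```

Only 170 falls outside the inner fences, and nothing falls outside the outer fences.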

LO4-8: Make and interpret box plots (continued, 5).

Box Plots: Fences and Unusual Data Values

• Truncate the whisker at the fences and display
unusual values and outliers as dots.

• Based on these fences, there is only one outlier.

LO4-8: Make and interpret box plots (continued, 6).

Box Plots: Midhinge

• Quartiles can be used to define an additional
measure of center that has the advantage of not
being influenced by outliers. The midhinge is the
average of the first and third quartiles:
$\text{Midhinge} = \frac{Q_1 + Q_3}{2}$

LO4-8: Make and interpret box plots (continued, 7).

Box Plots: Midhinge

• The midhinge is always exactly halfway between
Q1 and Q3, while the median Q2 can be
anywhere within the “box,” which suggests a new
way to describe skewness:

Median < Midhinge ⇒ Skewed right (longer right tail)
Median ≅ Midhinge ⇒ Symmetric (tails roughly equal)
Median > Midhinge ⇒ Skewed left (longer left tail)
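
Applied to the J.D. Power quartiles quoted in the earlier slides (Q1 = 107, median = 113, Q3 = 126), the comparison can be sketched in Python as below. The function name and the `tol` parameter, which stands in for "approximately equal", are hypothetical.

```python
def shape_from_midhinge(median, midhinge, tol=0.0):
    """Classify skewness by comparing the median with the midhinge."""
    if median < midhinge - tol:
        return "skewed right"
    if median > midhinge + tol:
        return "skewed left"
    return "symmetric"

q1, q2, q3 = 107, 113, 126          # J.D. Power quartiles from the slides
midhinge = (q1 + q3) / 2            # average of the first and third quartiles

print(midhinge, shape_from_midhinge(q2, midhinge))
```

The midhinge of 116.5 exceeds the median of 113, which agrees with the right-skewed histogram described in Example 4.1.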

4.6 Covariance and Correlation
LO4-9: Calculate and interpret a correlation coefficient
and covariance.
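Covariance
The covariance of two random variables X and Y (denoted σXY)
measures the degree to which the values of X and Y change
together.
For a population: $\sigma_{XY} = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N}$
For a sample: $s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$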

LO4-9: Calculate and interpret a correlation coefficient
and covariance (continued).

Covariance

• The units of measurement for the covariance are
unpredictable because the magnitude and/or units
of measurement of X and Y may differ. For this
reason, analysts generally work with the
correlation coefficient, which is a standardized
value of the covariance that ensures a range
between −1 and +1.

LO4-9: Calculate and interpret a correlation coefficient and
covariance (continued, 2).
Correlation Coefficient
• Conceptually, a correlation coefficient is the covariance
divided by the product of the standard deviations
(denoted σX and σY for a population or sX and sY for a
sample). For a population, the correlation coefficient is
indicated by the lowercase Greek letter ρ (rho), while for a
sample we use the lowercase Roman letter r.
Population: $\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$
Sample: $r = \frac{s_{XY}}{s_X s_Y}$

LO4-9: Calculate and interpret a correlation coefficient
and covariance (continued, 3).

Sample Correlation Coefficient

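• The sample correlation coefficient is a statistic that
describes the degree of linearity between paired
observations on two quantitative variables X and Y.
$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$

Note: -1 ≤ r ≤ +1.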
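
A minimal Python sketch, using made-up paired data (not from the text), computes the sample covariance and the sample correlation from sums of deviations, then confirms that r equals the covariance divided by the product of the sample standard deviations. The function names are illustrative.

```python
import math
import statistics

def sample_covariance(x, y):
    """Sum of cross-products of deviations divided by n - 1."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / (len(x) - 1)

def sample_correlation(x, y):
    """Cross-product sum divided by the root of the two squared-deviation sums."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                 # hypothetical paired observations
y = [2, 4, 5, 4, 5]

cov = sample_covariance(x, y)
r = sample_correlation(x, y)
r_from_cov = cov / (statistics.stdev(x) * statistics.stdev(y))
print(cov, round(r, 4), round(r_from_cov, 4))
```

The covariance (1.5 here) carries the units of X times the units of Y, while r (about 0.77) is unit-free and bounded by −1 and +1.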

LO4-9: Calculate and interpret a correlation coefficient
and covariance (continued, 4).

Correlation Coefficient

4.7 Grouped Data
LO4-10: Calculate the mean and standard deviation from
grouped data.

Weighted Mean
• The weighted mean is a sum that assigns each
data value a weight that represents a fraction of
the total (i.e. the k weights must sum to 1).
$\bar{x} = \sum_{j=1}^{k} w_j x_j$, where $\sum_{j=1}^{k} w_j = 1$

LO4-10: Calculate the mean and standard deviation from
grouped data (continued).

Group Mean
• Each interval j has a midpoint mj and a frequency fj. We
calculate the estimated mean by multiplying the midpoint
of each class by its class frequency, taking the sum over
all k classes, and dividing by sample size n.
$\bar{x} = \frac{\sum_{j=1}^{k} f_j m_j}{n}$

LO4-10: Calculate the mean and standard deviation
from grouped data (continued, 2).
Group Standard Deviation
• The estimate for the standard deviation is obtained by
subtracting the estimated mean from each class midpoint,
squaring the difference, multiplying by the class frequency,
taking the sum over all classes to obtain the sum of
squared deviations about the mean, dividing by n − 1, and
taking the square root. Avoid the common mistake of
“rounding off” the mean before subtracting it from each
midpoint.
$s = \sqrt{\frac{\sum_{j=1}^{k} f_j (m_j - \bar{x})^2}{n - 1}}$
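
The grouped-data steps can be followed in a short Python sketch. The three-class frequency table below is hypothetical and chosen only to keep the arithmetic easy to verify by hand.

```python
import math

midpoints = [5, 15, 25]        # class midpoints m_j (hypothetical)
frequencies = [2, 5, 3]        # class frequencies f_j (hypothetical)

n = sum(frequencies)           # sample size is the total frequency

# Estimated mean: frequency-weighted sum of midpoints divided by n
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Sum of squared deviations of midpoints about the unrounded mean
ss = sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints))

grouped_sd = math.sqrt(ss / (n - 1))
print(grouped_mean, round(grouped_sd, 3))
```

The mean is kept at full precision until the final step, in line with the warning about rounding it too early.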

4.8 Skewness and Kurtosis
LO4-11: Assess skewness and kurtosis in a sample.

Skewness

LO4-11: Assess skewness and kurtosis in a sample
(continued).

Kurtosis

• Kurtosis refers to the relative length of the tails and the
degree of concentration in the center.
• A normal bell-shaped population is called mesokurtic
and serves as a benchmark.
• A population that is flatter than a normal population
(i.e., has lighter tails) is called platykurtic, while one
that is more sharply peaked than a normal population
(i.e., has heavier tails) is leptokurtic.
• Kurtosis is not the same thing as variability, although
the two are easily confused.

LO4-11: Assess skewness and kurtosis in a sample
(continued, 2).

Kurtosis

• A histogram is an unreliable guide to kurtosis
because its scale and axis proportions may vary, so a
numerical statistic is needed:
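
The formula shown on the original slide is not recoverable from this transcript. As a stand-in, the Python sketch below implements the adjusted sample statistics that Excel documents for its SKEW and KURT functions. Treating these as the statistics the slide intended is an assumption, and the function names are illustrative. The kurtosis statistic is an excess measure, so a normal (mesokurtic) population corresponds to a value near 0.

```python
import statistics

def sample_skewness(data):
    """Adjusted sample skewness (the formula Excel documents for SKEW); needs n >= 3."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    total = sum(((x - x_bar) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * total

def sample_kurtosis(data):
    """Adjusted excess kurtosis (the formula Excel documents for KURT); needs n >= 4."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    total = sum(((x - x_bar) / s) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * total - correction

flat = [1, 2, 3, 4, 5]             # hypothetical evenly spaced values
print(round(sample_skewness(flat), 4), round(sample_kurtosis(flat), 4))
```

For the evenly spaced values the skewness is 0 and the kurtosis statistic is −1.2, a negative value of the kind associated with a platykurtic shape.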

LO4-11: Assess skewness and kurtosis in a sample
(continued, 3).
Kurtosis
