2 - Descriptive Statistics
Chapter 6 Contents
Copyright © 2019 John Wiley & Sons, Inc. All Rights Reserved
Learning Objectives for Chapter 6
After careful study of this chapter, you should be able to do the following:
1. Compute and interpret the sample mean, variance, standard deviation, median, and range
2. Explain the concepts of sample mean, variance, population mean, and population variance
3. Construct and interpret visual data displays, including the stem-and-leaf display, the histogram, and the box plot
4. Explain how to use box plots and other data displays to visually compare two or more samples of data
5. Know how to use simple time series plots to visually display the important features of time-oriented data
6. Know how to construct and interpret scatter diagrams of two or more variables
$\bar{x} = \dfrac{\sum_{i=1}^{8} x_i}{8} = \dfrac{12.6 + 12.9 + \cdots + 13.1}{8} = \dfrac{104}{8} = 13.0$ pounds

i         x_i
1         12.6
2         12.9
3         13.4
4         12.3
5         13.6
6         13.5
7         12.6
8         13.1
average   13.00   = AVERAGE($B2:$B9)
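The spreadsheet average can be cross-checked with a minimal Python sketch. The list name `pull_off_force` and the helper `sample_mean` are illustrative names, not part of the slides; the eight values are the observations tabulated in this example.

```python
# Pull-off force observations in pounds (the eight values from the example).
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]


def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    return sum(values) / len(values)


x_bar = sample_mean(pull_off_force)
print(round(x_bar, 2))  # 13.0
```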
Measures of Central Tendency: Sample Median
• The median of a variable is its “center”.
• When the data are sorted in order, the median is the middle value, splitting the data into two halves.
• The calculation of the median depends on whether the number of observations is odd or even:
• If there is an odd number of points, the median is the single middle value.
• If there is an even number of points, the median is the mean of the two middle values.
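The odd/even rule can be sketched in a few lines of Python: for an odd number of points the median is the single middle value of the sorted data, and for an even number it is the mean of the two middle values. The function name `sample_median` is an illustrative choice.

```python
def sample_median(values):
    """Return the median of a non-empty sample."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: exactly one middle observation.
        return ordered[middle]
    # Even n: average the two observations that straddle the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(sample_median([3, 1, 2]))     # 2
print(sample_median([4, 1, 3, 2]))  # 2.5
```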
Sample Median (contd.)
Taken from: Statistics Informed Decisions Using Data, 5th Edition by Michael Sullivan
Sample Median – Example
The following data represent the travel times (in minutes) to work
for all seven employees of a start-up web development company.
23, 36, 23, 18, 5, 26, 43
Determine the median of this data.
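One way to work this example, assuming the usual rule for an odd number of observations (the median is the value at position $(n+1)/2$ of the sorted data):

```latex
\text{Sorted data: } 5,\ 18,\ 23,\ 23,\ 26,\ 36,\ 43 \qquad (n = 7)

\frac{n+1}{2} = \frac{7+1}{2} = 4
\quad\Longrightarrow\quad
\text{median} = x_{(4)} = 23 \text{ minutes}
```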
Example 6.2 | Sample Variance
Calculate the variance and standard
deviation of the pull-off force of the sample
of engine connectors mentioned earlier.
The numerator of $s^2$ is the sum of squared deviations from the sample mean, $\sum_{i=1}^{n} (x_i - \bar{x})^2$.
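A minimal sketch of the definitional calculation, assuming the eight pull-off force observations are 12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, and 13.1 pounds (the values used in the sample mean example). The function name `sample_variance` is illustrative.

```python
import math

# Pull-off force observations in pounds (assumed to be the earlier sample).
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]


def sample_variance(values):
    """Definitional formula: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    numerator = sum((x - x_bar) ** 2 for x in values)
    return numerator / (n - 1)


s_squared = sample_variance(pull_off_force)
s = math.sqrt(s_squared)
print(round(s_squared, 4), round(s, 3))  # 0.2286 0.478
```

The numerator works out to 1.60, so the variance is 1.60/7 pounds² and the standard deviation is its square root in pounds.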
Computation of 𝒔²
The prior calculation is definitional and tedious. A shortcut
is derived here and involves just 2 sums.
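The derivation itself did not survive on this slide. A standard version, which expands the square and uses $\sum x_i = n\bar{x}$, runs as follows; the two sums involved are $\sum x_i$ and $\sum x_i^2$.

```latex
\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
     = \frac{\sum_{i=1}^{n} x_i^2 \;-\; 2\bar{x}\sum_{i=1}^{n} x_i \;+\; n\bar{x}^2}{n-1} \\[4pt]
    &= \frac{\sum_{i=1}^{n} x_i^2 \;-\; n\bar{x}^2}{n-1}
     = \frac{\sum_{i=1}^{n} x_i^2 \;-\; \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}
\end{aligned}
```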
Example 6.3 | Shortcut Calculation for 𝑠²
For Example 6.2, we calculate the sample variance and standard
deviation using the shortcut method.
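The arithmetic can be laid out as below, assuming the eight pull-off force observations 12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, and 13.1 pounds from the sample mean example, so that $\sum x_i = 104$ and $\sum x_i^2 = 1353.60$.

```latex
s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}
    = \frac{1353.60 - \dfrac{(104)^2}{8}}{7}
    = \frac{1353.60 - 1352.00}{7}
    = \frac{1.60}{7}
    = 0.2286 \text{ pounds}^2

s = \sqrt{0.2286} = 0.48 \text{ pounds}
```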
The meaning of 𝒏 − 𝟏 in the denominator
• The population variance is calculated with 𝑁, the population size.
Why isn’t the sample variance calculated with 𝑛, the sample size?
• The true variance is based on data deviations from the true
mean, 𝜇 (parameter).
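• The sample calculation is based on the data deviations from x̄ (statistic), not 𝜇.
• x̄ is an estimator of 𝜇; close but not the same.
• The squared deviations about x̄ never sum to more than the squared deviations about 𝜇, so dividing by 𝑛 would tend to underestimate the variance.
• So, the 𝑛 − 1 divisor is used to compensate for the error in the
mean estimation.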
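The effect of the divisor can be checked exactly on a toy case. The sketch below assumes a hypothetical population {1, 2, 3, 4} and enumerates every ordered sample of size 2 drawn with replacement; averaging over all samples, the 𝑛 − 1 divisor recovers the population variance while the 𝑛 divisor falls short. All names and the population itself are illustrative.

```python
from itertools import product

# Hypothetical toy population, chosen only to make the enumeration small.
population = [1, 2, 3, 4]
N = len(population)
mu = sum(population) / N
sigma_squared = sum((x - mu) ** 2 for x in population) / N  # population variance


def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over a chosen divisor."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample) / divisor


n = 2
# All 16 equally likely ordered samples of size n, drawn with replacement.
samples = list(product(population, repeat=n))
mean_with_n = sum(variance(s, n) for s in samples) / len(samples)
mean_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(sigma_squared, mean_with_n, mean_with_n_minus_1)  # 1.25 0.625 1.25
```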
Sec 6.1 Numerical Summaries of Data
Dispersion: Sample Range
In addition to the sample variance and sample standard
deviation, the sample range is a useful measure of variability.
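The sample range is the difference between the largest and smallest observations. As a worked instance, assuming the eight pull-off force observations from the sample mean example (largest 13.6 pounds, smallest 12.3 pounds):

```latex
r = \max_i(x_i) - \min_i(x_i) = 13.6 - 12.3 = 1.3 \text{ pounds}
```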
Measures of Position – Quartiles
• Quartiles are special percentiles.
• The 𝑘th percentile, denoted 𝑃𝑘, of a set of data is a value such that 𝑘
percent of the observations are less than or equal to the value.
• Quartiles divide data sets into four equal parts. The quartiles are the
25th, 50th, and 75th percentiles:
• 𝑄1 = 25th percentile
• 𝑄2 = 50th percentile = median
• 𝑄3 = 75th percentile
• Quartiles are the most commonly used percentiles.
Taken from: Statistics: Informed Decisions Using Data, 5th Edition by Michael Sullivan
Measures of Position – Quartiles (contd.)
• The 1st quartile, denoted 𝑄1, divides the bottom 25% of the data
from the top 75%.
• The 2nd quartile divides the bottom 50% of the data from the
top 50% of the data.
• The 3rd quartile divides the bottom 75% of the data from the
top 25% of the data.
Taken from: Statistics Informed Decisions Using Data, 5th Edition by Michael Sullivan
Measures of Position – Quartiles (contd.)
Taken from: Statistics Informed Decisions Using Data, 5th Edition by Michael Sullivan
Quartiles – Example
A group of university students collected data on the speed of
vehicles traveling through a construction zone on a state
highway, where the posted speed was 25 mph. The recorded
speeds (in mph) of 14 randomly selected vehicles are given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
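A sketch of one way to find the quartiles for these speeds, assuming the median-of-halves convention: 𝑄2 is the median, 𝑄1 is the median of the lower half, and 𝑄3 is the median of the upper half (with the median itself left out of both halves when 𝑛 is odd). Other conventions, such as interpolated percentiles in statistical software, can give slightly different values. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle observation so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
q1, q2, q3 = quartiles(speeds)
print(q1, q2, q3)  # 28 32.5 38
```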
Example 6.4a | Alloy Strength
• Consider the data in the table. We select as stem values the numbers 7, 8, 9, …, 24.
Compressive strength refers to the ability of a material to withstand loads that reduce the size of that material when applied.
https://civilunlimited.com/compressive-strength-test-of-concrete-cube/
Sec 6.2 Stem-and-Leaf Diagrams
Example 6.4b | Alloy Strength
• The resulting stem-and-leaf diagram is shown.
• Inspection of the diagram reveals that most of the compressive strengths lie between 110 and 200 psi and that a central value is somewhere between 150 and 160 psi.
• The strengths are distributed approximately symmetrically about the central value.
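The construction of a stem-and-leaf diagram can be sketched as follows, assuming non-negative integer data where the stem is every digit except the last and the leaf is the last digit. The alloy data table is not reproduced here, so the `strengths` list below is hypothetical and serves only to show the mechanics; the function name is also illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return dict(groups)


# Hypothetical strengths in psi, for illustration only (not the example's data).
strengths = [105, 97, 154, 153, 174, 158, 163, 121, 150, 199]
for stem, leaves in sorted(stem_and_leaf(strengths).items()):
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

This sketch prints only stems that contain at least one observation; a full diagram would usually list every stem in the chosen range, including empty ones.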
Graphical summaries: Frequency Distributions
• A frequency distribution is a more compact summary of data
than a stem-and-leaf diagram.
• To construct one, we must divide the range of the data into intervals,
which are usually called class intervals, cells, or bins.
• Choosing the number of bins approximately equal to the square
root of the number of observations often works well in practice.
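A minimal sketch of building a frequency distribution with equal-width bins, using the square-root rule as the default bin count. The function name, the half-open bin convention (with the maximum placed in the last bin), and the sample data are assumptions made for illustration.

```python
import math


def frequency_distribution(values, num_bins=None):
    """Return (bin edges, counts) for equal-width bins over the data range."""
    n = len(values)
    if num_bins is None:
        # Rule of thumb: number of bins close to the square root of n.
        num_bins = max(1, round(math.sqrt(n)))
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    if width == 0:
        # All observations are identical: a single degenerate bin.
        return [low, high], [n]
    counts = [0] * num_bins
    for x in values:
        # Bins are [lower, upper); the maximum value goes in the last bin.
        index = min(int((x - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + k * width for k in range(num_bins + 1)]
    return edges, counts


observations = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
edges, counts = frequency_distribution(observations)
print(counts)  # [3, 3, 3]
```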
Frequency Distribution Table
Graphical summaries: Histograms
• A histogram is a visual display of the frequency distribution.
• It provides a visual impression of the shape of the distribution of the
measurements and information about the central tendency and scatter or
dispersion in the data.
• Unequal bin widths may be employed, in which case rectangle height = bin frequency / bin width, so that rectangle area is proportional to bin frequency.
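With unequal bin widths, dividing each bin frequency by its bin width keeps the rectangle areas proportional to the frequencies. A short sketch with hypothetical bins (the edges, frequencies, and function name are illustrative):

```python
def rectangle_heights(edges, frequencies):
    """Height of each histogram rectangle = bin frequency / bin width."""
    widths = [upper - lower for lower, upper in zip(edges, edges[1:])]
    return [freq / width for freq, width in zip(frequencies, widths)]


# Hypothetical bins of unequal width, for illustration only.
edges = [0, 10, 20, 50]
frequencies = [5, 10, 15]
print(rectangle_heights(edges, frequencies))  # [0.5, 1.0, 0.5]
```

The last bin holds the most observations but is three times as wide, so its rectangle is no taller than the first one.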
Graphical summaries: Histograms
Taken from: Statistics Informed Decisions Using Data, 5th Edition by Michael Sullivan
Graphical summaries: Box Plots (contd.)
Step 1: The interquartile range (IQR) is 14.4% − 12% = 2.4%. The lower
and upper fences are:
Lower Fence = Q1 − 1.5(IQR) = 12 − 1.5(2.4) = 8.4%
Upper Fence = Q3 + 1.5(IQR) = 14.4 + 1.5(2.4) = 18.0%
Steps 2, 3, 4 and 5:
[Figure: box plot, with an outlier marked by an asterisk (*) and the fences marked by brackets ([ ])]
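The fence calculation in Step 1 can be sketched as below, using Q1 = 12% and Q3 = 14.4% from the slide. The function names are illustrative, and flagging every value outside the fences as an outlier is the usual box plot convention rather than something spelled out on this slide.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values, q1, q3):
    """Return the values that fall outside the fences."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


lower, upper = fences(12.0, 14.4)
print(round(lower, 1), round(upper, 1))  # 8.4 18.0
```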
Graphical summaries: Time Sequence Plots
• A time series or time sequence is a data set in which the
observations are recorded in the order in which they occur.
• A time series plot is a graph in which the vertical axis denotes
the observed value of the variable and the horizontal axis denotes
the time.
Graphical summaries: Time Sequence Plots
• A digidot plot combines a stem-and-leaf plot with a time series plot.
Graphical summaries: Scatter Diagrams
• Multivariate Data: each observation consists of
measurements of several variables (for the same
individual).
• The scatter diagram is a useful way to graphically
display the potential relationship between two
measurements of the same individual.
• When two or more variables exist, the matrix of
scatter diagrams may be useful in looking at all of
the pairwise relationships between the variables in
the sample.
• The sample correlation coefficient (Pearson
correlation coefficient) is a quantitative measure of
the strength of the linear relationship between two
variables x and y.
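For reference, the sample correlation coefficient can be written in the following standard form. The numerator uses $y_i(x_i - \bar{x})$, which equals the more familiar $(x_i - \bar{x})(y_i - \bar{y})$ sum because $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} y_i\,(x_i - \bar{x})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \;\sum_{i=1}^{n} (y_i - \bar{y})^2}}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \;\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```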
Sample Correlation Coefficient
• If the two variables are perfectly linearly related with a positive
slope, then 𝑟𝑥𝑦 = 1, and if they are perfectly linearly related with
a negative slope, then 𝑟𝑥𝑦 = −1 . If no linear relationship
between the two variables exists, then 𝑟𝑥𝑦 = 0.
• Note that zero correlation only means there is no linear relationship; there
could still be a nonlinear relationship between the variables.
• Correlations with |𝑟𝑥𝑦| below 0.5 are generally considered weak, and
correlations with |𝑟𝑥𝑦| above 0.8 are generally considered strong.
• Note the difference between correlation and causation!
Sample Correlation Coeff. Interpretation
Sample Correlation Coefficient: Example
• An article in Technometrics by S. C. Sale
Taxes (local,
school,
Sale
Taxes (local,
school,
Price/1000 Price/1000
narula and J. F. Wellington country)/1000 country)/1000
["Prediction, Linear Regression, and 25.9 4.9176 30.0 5.0500
29.5 5.0208 36.9 8.2464
a Minimum Sum of Relative Errors"
27.9 4.5429 41.9 6.6969
(1977, Vol. 19)] presents data on the 25.9 4.5573 40.5 7.7841
selling price and annual taxes for 24 29.9 5.0597 43.9 9.0384
houses. The data are shown in the 29.9 3.8910 37.5 5.9894
following table. 30.9 5.8980 37.9 7.5422
28.9 5.6039 44.5 8.7951
• What is the simple correlation 35.9 5.8282 37.9 6.0831
coefficient between price (y) and 31.5 5.3003 38.9 8.3607
taxes (x)? 31.0 6.2712 36.9 8.1400
30.9 5.9592 45.8 9.1416
Sample Correlation Coefficient: Example (contd.)
xi       yi     xi-x_bar   yi-y_bar   (xi-x_bar)^2   (yi-y_bar)^2   yi(xi-x_bar)
4.9176   25.9   -1.49      -8.71      2.212111       75.90766       -38.5215
5.0208   29.5   -1.38      -5.11      1.915779       26.13766       -40.8314
4.5429   27.9   -1.86      -6.71      3.467106       45.05766       -51.9503
4.5573   25.9   -1.85      -8.71      3.413687       75.90766       -47.8533
5.0597   29.9   -1.35      -4.71      1.809608       22.20766       -40.222
3.891    29.9   -2.51      -4.71      6.319777       22.20766       -75.1661
5.898    30.9   -0.51      -3.71      0.256965       13.78266       -15.6637
5.6039   28.9   -0.8       -5.71      0.641628       32.63266       -23.1494
5.8282   35.9   -0.58      1.288      0.332602       1.657656       -20.7041
5.3003   31.5   -1.1       -3.11      1.220178       9.687656       -34.7954
6.2712   31     -0.13      -3.61      0.01788        13.05016       -4.14522
5.9592   30.9   -0.45      -3.71      0.198663       13.78266       -13.7726
5.05     30     -1.35      -4.61      1.835799       21.27516       -40.6475
8.2464   36.9   1.841      2.288      3.391061       5.232656       67.95074
6.6969   41.9   0.292      7.288      0.085254       53.10766       12.2341
7.7841   40.5   1.379      5.888      1.902147       34.66266       55.85693
9.0384   43.9   2.633      9.288      6.935234       86.25766       115.6099
5.9894   37.5   -0.42      2.888      0.172654       8.337656       -15.5819
7.5422   37.9   1.137      3.288      1.293413       10.80766       43.10304
8.7951   44.5   2.39       9.888      5.712976       97.76266       106.3632
6.0831   37.9   -0.32      3.288      0.103566       10.80766       -12.1969
8.3607   38.9   1.956      4.288      3.825088       18.38266       76.07997
8.14     36.9   1.735      2.288      3.010514       5.232656       64.02458
9.1416   45.8   2.737      11.19      7.489436       125.1602       125.3401
Sum                                   57.56313       829.0463       191.3612

x_bar = 6.405
y_bar = 34.61
r = 0.876
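The three column sums are all that is needed for the coefficient: r = 191.3612 / sqrt(57.56313 × 829.0463). A minimal sketch that recomputes them from the 24 (taxes, price) pairs is given below; the names `houses` and `sample_correlation` are illustrative, and the result can be compared against r = 0.876 in the table.

```python
import math

# (taxes x, sale price y) pairs in thousands, copied from the example data.
houses = [
    (4.9176, 25.9), (5.0208, 29.5), (4.5429, 27.9), (4.5573, 25.9),
    (5.0597, 29.9), (3.8910, 29.9), (5.8980, 30.9), (5.6039, 28.9),
    (5.8282, 35.9), (5.3003, 31.5), (6.2712, 31.0), (5.9592, 30.9),
    (5.0500, 30.0), (8.2464, 36.9), (6.6969, 41.9), (7.7841, 40.5),
    (9.0384, 43.9), (5.9894, 37.5), (7.5422, 37.9), (8.7951, 44.5),
    (6.0831, 37.9), (8.3607, 38.9), (8.1400, 36.9), (9.1416, 45.8),
]


def sample_correlation(pairs):
    """Pearson sample correlation, using the same three sums as the table."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)  # sum of (xi - x_bar)^2
    s_yy = sum((y - y_bar) ** 2 for _, y in pairs)  # sum of (yi - y_bar)^2
    s_xy = sum(y * (x - x_bar) for x, y in pairs)   # sum of yi(xi - x_bar)
    return s_xy / math.sqrt(s_xx * s_yy)


r = sample_correlation(houses)
print(round(r, 3))
```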
Important Terms and Concepts
• Box plot
• Digidot plot
• Frequency distribution and histogram
• Histogram
• Interquartile range
• Matrix of scatter plots
• Multivariate data
• Outlier
• Percentile
• Population mean
• Population standard deviation
• Population variance
• Quartiles and percentiles
• Relative frequency distribution
• Sample correlation coefficient
• Sample mean
• Sample median
• Sample range
• Sample standard deviation
• Sample variance
• Scatter diagram
• Stem-and-leaf diagram
• Time series