Lectures 11 12 13 - Engineering Statistics 2017 - Handouts
Engineering Statistics
Motivation
• To learn methods of statistical analysis:
– What is available.
– How to interpret results.
– What are the pitfalls and caveats.
– The necessity of good experimental design,
technique and data-analysis to achieve high
quality results.
Overview of Statistics
• Statistical Analysis is the science of data
collection and data interpretation.
– Uses formal probabilistic methods for drawing
inferences and making decisions from these
data.
• Summarize data
• Maximize information derived from data
• Test alternate hypotheses or models
• Compute probabilities of future occurrences
• Make rational decisions based on data and information.
– Quantifies our Ignorance (uncertainty)
What are Statistics?
• Statistics are about ways to describe our
previous experience
• Statistics are useful as guides and motivators
• Statistics are generally poorly understood and
often abused
• “Oh, people can come up
with statistics to prove
anything, 14% of people
know that”, Homer Simpson.
Definition of a Statistic
A statistic is a specified, determinable function of a
set of observations (data set)
- Given n measures of some "random" quantity $x_i$,
you can calculate various statistics such as these examples:
$n$ is the sample size
$\sum_{i=1}^{n} x_i$ is the sample sum $x_1 + x_2 + x_3 + \dots + x_n$
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean (arithmetic average)
Recall:
Median
• Median is the value of the data point in the
centre of the data set when arranged in
ascending (or descending) order
• E.g., 2.98, 3.18, 3.25, 3.50 and 3.74
– Median is 3.25
• If there is an even number of points, the median
is the mean of the two centre data points
– E.g., 2, 3, 6, 7, 12, and 15
– Median is (6 + 7)/2 = 6.5
Mode
• Mode is the value (or values) that occurs with
the greatest frequency
• e.g., 2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, and 18
– Mode is 9 (is unimodal)
• e.g., 2, 3, 4, 4, 4, 5, 5, 7, 7, 7, 9
– Modes are 4 and 7 (called bimodal)
• A data set can have no mode (e.g., uniform)
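The median and mode examples above can be checked with a short sketch. It assumes Python's standard `statistics` module (3.8 or later for `multimode`); the variable names are only illustrative.

```python
import statistics

# Data sets copied from the median examples above
odd_count = [2.98, 3.18, 3.25, 3.50, 3.74]
even_count = [2, 3, 6, 7, 12, 15]

# Odd number of points: the centre value.
# Even number of points: the mean of the two centre values.
median_odd = statistics.median(odd_count)
median_even = statistics.median(even_count)

# Data sets copied from the mode examples above
unimodal = [2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, 18]
bimodal = [2, 3, 4, 4, 4, 5, 5, 7, 7, 7, 9]

# multimode returns every value tied for the greatest frequency
modes_uni = statistics.multimode(unimodal)
modes_bi = statistics.multimode(bimodal)

print(median_odd, median_even)
print(modes_uni, modes_bi)
```

Note that `multimode` returns every value when all frequencies are tied, which corresponds to the "no mode" case above.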
How to make a Histogram
• Start with a data set of
observations
• Find the big Range
(= max value - min value);
• Make a ‘number’ of bins
with smaller equal-size
ranges that span the big
Range;
• Count how many of your
data fit in the ranges of
your bins
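The recipe above can be sketched in a few lines. The function name `histogram_counts` and the `marks` data are hypothetical and chosen only for illustration; the sketch assumes that a value falling exactly on a boundary goes into the bin above it, except for the maximum, which stays in the last bin.

```python
def histogram_counts(data, n_bins):
    """Count how many observations fall into n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all observations are equal; the big Range is zero")
    width = (hi - lo) / n_bins          # big Range divided into equal bins
    counts = [0] * n_bins
    for x in data:
        k = int((x - lo) / width)       # index of the bin containing x
        if k == n_bins:                 # the maximum belongs to the last bin
            k = n_bins - 1
        counts[k] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Illustrative data only (not taken from the handout)
marks = [1, 2, 2, 3, 5, 7, 8, 9, 9, 10]
edges, counts = histogram_counts(marks, 3)
print(edges)
print(counts)
```

Changing `n_bins`, or the convention for boundary values, changes the picture, which is why bin choices should be stated alongside a histogram.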
Heights of 100 Residents in Anytown, Canada (ft)
Introduction to Engineering Analysis, 3rd Edition, K. D. Hagen, 2008
Classification of Height Data
Histogram for Heights in Anytown
Histograms
• Valuable statistical tool for showing the
frequency distribution of data
– Information about location, spread, and shape
that is portrayed can provide clues about the
underlying process that generated the data
Frequency Distributions (Histograms)
Note that Figure 18.4 in the textbook has left skew defined incorrectly.
The distribution looks ‘smoother’ as the number of
students in the sample gets larger.
[Figure: histogram of Midterm 2017 marks, Count vs. Mark out of 27, N = 1067]
• Mean: 17.5/27 (65%)
• Median: 18/27 (67%)
• Std. Dev.: 3.8
• N = 1067
The red line is the “bell curve”, which is also the Normal distribution,
which is also called a Gaussian distribution.
Distributions
• Many types of distributions
• The distribution that best describes your data
will depend on the ‘physics’
• Many engineering measurements have
normal (or near normal) distributions
– Normal distributions are also called Gaussian
distributions, and sometimes Bell curves
• Other distributions
– Bimodal, multimodal, flat, skewed (either right or
left), etc.
Aside on Histograms
• The appearance of histograms is quite dependent on
the number of bins and how the bin boundaries are
computed.
[Figure: three histograms of the same marks (Count vs. Mark out of 30) drawn with different bin choices; Mean 20.2, Median 21.0, SD 4.3, N 907]
Do you place the bin boundary at the front end or the back end?
I.e., should a bin run from 30 up to the next boundary, or from
above 29 to 30? Or should the value sit in the middle of its bin?
Abuse of Statistics
• There is a famous expression that you might hear
when people talk about statistics
• It was popularized by Mark Twain in the early 1900s, who
attributed it to Disraeli.
• “There are three types of falsehoods, each worse
than the one before – lies, damned lies, and
statistics.”
• This is meant to be funny, but it does reflect a feeling
people have that statistics can be used to prove
anything, hence, people mistrust statistics.
Stretching the truth with statistics
• A politician in power says: “… salary is $71 000.”
• The second says: “… less than $56 000.”
• The third party leader says: “Most of the people make $30 000.”
Everyone is correct: the first quotes the mean, the second the median, and the
third quotes the mode; but the messages are ‘different’.
[Figure: histogram of salaries, Count vs. salary; Mode: $30 000]
Bill Gates moves into …
• A politician in power says: “My town is …”
[Figure: histogram of Salaries in Anytown, Count vs. salary; Mode: $30 000]
[Figure: histogram of Salaries in Anytown, Count vs. Salary (x1000$) per year; Mean + 1 Std.Dev = $127 000]
[Figure: histogram of Salaries in Anytown, Count vs. Salary (x1000$) per year; Mean + 1 Std.Dev = $94 000, Mean - 1 Std.Dev = $48 000]
If the distribution is narrower, then the Std.Dev is smaller.
This is a ‘population’ because we have included the entire town
population. If we only considered a small number of people in the town
we would calculate statistics for a sample, see next slide.
Standard Deviation (Sample)
Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
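To make the difference between the population (divide by N) and sample (divide by n − 1) forms concrete, here is a minimal sketch. The `values` list is illustrative only, and the cross-check assumes Python's standard `statistics` module.

```python
import math
import statistics

# Illustrative data only (not taken from the handout)
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean, shared by both formulas
sum_sq = sum((x - mean) ** 2 for x in values)

sigma = math.sqrt(sum_sq / n)        # population form: divide by N
s = math.sqrt(sum_sq / (n - 1))      # sample form: divide by n - 1

print(f"mean = {mean}, sigma = {sigma:.4f}, s = {s:.4f}")

# The standard library implements the same two conventions
assert math.isclose(sigma, statistics.pstdev(values))
assert math.isclose(s, statistics.stdev(values))
```

The sample form is always slightly larger than the population form for the same numbers, and the gap shrinks as n grows.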
Descriptive Data Analysis
1. Graphics Displays:
– Dot plots,
– Box and Whisker Plots
– Scatter plots,
– Frequency plots (histograms)
2. Summary statistics:
– mean,
– median,
– mode,
– standard deviation,
– range (and interquartile range)
Descriptive Graphics of Data
[Figure: Box & Whisker Plot of BELT1, BELT2, BELT3, BELT4; one panel shows Median, 25%-75%, Min-Max; the other shows Mean, ±SD, ±1.96*SD]
[Figure: Box & Whisker Plot of MeanOxy; one panel: Median = 1177.5, 25%-75% = (1147.5, 1217.5), Min-Max = (1042.5, 1285); the other: Mean = 1175.2128, ±SD = (1122.5375, 1227.888), ±1.96*SD = (1071.9693, 1278.4563)]
[Figure: Frequency Distr. Of Mean Oxy. (No. of obs. vs. MeanOxy) and Scatter Plot of Mean Oxy. vs Seq. Num.]
Summary Statistics (samples)
Measures of Location
• Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Median: middle value when you order the values:
$x_1 \le x_2 \le x_3 \le \dots \le x_n$
• Mode: position or positions of maximum probability
– Not always easy to define in experimental data
• Remember: a statistic is any function of the data.
Measures of Spread
• Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Standard Deviation: $s = \sqrt{\mathrm{Var}} = \sqrt{s^2}$
• Range: R = max(x) – min(x)
• Interquartile Range
• Full Width Half Max (FWHM)
– Half width half max. (HWHM)
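The range and interquartile range can be sketched as follows. This assumes Python's `statistics.quantiles` with its default ("exclusive") method; other software uses slightly different quartile conventions, so quartile values may differ a little between tools. The `values` list is illustrative only.

```python
import statistics

# Illustrative data only (not taken from the handout)
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: R = max(x) - min(x)
data_range = max(values) - min(values)

# Quartiles: the three cut points that split the ordered data into four parts
q1, q2, q3 = statistics.quantiles(values, n=4)

# Interquartile range: width of the 25%-75% box in a box and whisker plot
iqr = q3 - q1

print(f"range = {data_range}, quartiles = {q1}, {q2}, {q3}, IQR = {iqr}")
```

The middle cut point `q2` is the median, so the same call gives one measure of location and one measure of spread.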
The first “Golden Rule” of Data
Analysis
• Study the data
– Use the descriptive methods above to get a ‘feel’ for the
data.
– Complicated data sets deserve several hours, days, or
even weeks of study.
– ‘Outliers’ must be carefully vetted.
– Data are not simply numbers but rather measurements or
counts of real entities.
– Tentative conclusions should be made in the contexts of
their meaning in relation to these entities, the real
background of the data, and how the data were collected.
• Valid conclusions are unlikely to be obtained from
poor data.
Summary
• For the descriptive analysis to be valid
– The data must be independent (exchangeable)
– Arbitrary inclusion of repeat measurements is
not allowed.
– The data cannot be correlated such that
exchangeability is violated.
• If these conditions aren’t met, the standard
interpretation of the results could lead to
very, very wrong conclusions.
What Statistics can’t do
• Can’t get blood from a stone
– Can’t rescue you from bad data.
– Can’t get good results from poorly designed and
badly executed experiments.
– Can’t generate information where none exists.
• Statistics might help you understand and
quantify just how bad your results are from
your bad data.
• In other words, it is not a magic Mr. Fix-it in a
black box.
How well do we know the mean value
(i.e., arithmetic average) we determine
from our data?
The mean mark on the midterm was 18/30
The mean salary in Anytown, before Bill, was
$71 000.
• How many sig figs should we quote for these
mean values?
• What is the error in these mean values?
As an example:
The goal is to determine the average (mean)
weight of students on campus
• You could weigh the entire population of students at Carleton
and then calculate their average weight, but this takes time,
and would cost a lot of money.
• Or, you can choose a sample of students, say 5 students,
weigh them and calculate an average.
• How far would this average be from the real average of the
total population of students?
• If you chose a second group of 5 students, how different
would the second group’s average weight be?
– You would expect that the average weight of each group of 5 students
would be ‘slightly’ different.
Standard Error in the mean value
Start with any ‘arbitrary’
distribution that represents
your population
1. Take, for example, n=5
samples from the starting
distribution
2. Take the average of these n=5
samples
3. Put a ‘blue’ box in the lower
histogram showing the
average value calculated in 2.
4. Go to step 1 and repeat until the histogram
doesn’t change much.
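The four steps above can be simulated directly. This is a minimal sketch under stated assumptions: the parent population is a hypothetical, deliberately skewed (exponential) one, the seed and the repetition count are arbitrary, and a list of means stands in for the lower histogram.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable

# Hypothetical 'arbitrary' parent population: strongly skewed, not normal
population = [random.expovariate(1.0) for _ in range(100_000)]

n = 5               # sample size used in steps 1 and 2
repeats = 20_000    # how many times steps 1 to 3 are repeated

# Steps 1-3: draw n values, average them, record the average
means = [statistics.fmean(random.sample(population, n)) for _ in range(repeats)]

# Width of the distribution of the means (the 'sd' of the lower histogram)
sd_of_means = statistics.pstdev(means)

# Prediction: population standard deviation divided by sqrt(n)
predicted = statistics.pstdev(population) / n ** 0.5

print(f"sd of the sample means: {sd_of_means:.3f}")
print(f"sigma / sqrt(n):        {predicted:.3f}")
```

Increasing `n` narrows the distribution of the means and makes its shape closer to a Gaussian.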
Standard Error in the mean value
• The standard error is related to the width of the sampled distribution. The
standard error is the standard deviation of the distribution of the means, i.e., the
‘sd’ in the lower plots.
• The lower plots approach normal curves (Gaussian curves) as n grows, regardless of the shape of the
parent population (provided its variance is finite): this is the Central Limit Theorem.
• The standard error indicates how well you know the mean.
Standard Error for the mean of x: $s_{\bar{x}} = \frac{s}{\sqrt{n}}$
You can do this at home:
• Go to:
http://onlinestatbook.com/rvls/index.html
• Go to the ‘Simulations/Demonstrations’ link
and then to ‘Central Limit Theorem’ and then
to ‘Sampling Distribution Simulation’.
$s_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$
Standard Error
• Thus, the estimated population mean, with its
associated error, is
$\mu = \bar{x} \pm z_c\, s_{\bar{x}}$
where
$z_c$ = z-value for the chosen confidence level
e.g. $z_{95} = 1.96$ for 95% confidence
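As a worked illustration, the sketch below applies this to the Midterm 2017 numbers quoted earlier (mean 17.5, Std. Dev. 3.8, N = 1067). Treating that class as a large sample from a wider population is an assumption made here for illustration, and the function name `z_interval` is hypothetical.

```python
import math

def z_interval(mean, s, n, z_c=1.96):
    """Return (lower, upper) limits of mean ± z_c * s / sqrt(n)."""
    standard_error = s / math.sqrt(n)
    half_width = z_c * standard_error
    return mean - half_width, mean + half_width

# Midterm 2017 values from the histogram slide; large sample, so z applies
lower, upper = z_interval(17.5, 3.8, 1067)
half_width = (upper - lower) / 2

print(f"standard error = {3.8 / math.sqrt(1067):.3f}")
print(f"95% interval: {lower:.2f} to {upper:.2f} (± {half_width:.2f})")
```

The size of the half-width is what decides how many significant figures of the mean are worth quoting.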
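In terms of the population parameters:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$ becomes $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ with $z = (x-\mu)/\sigma$
In terms of the sample statistics:
$f(x) = \frac{1}{s\sqrt{2\pi}}\, e^{-(x-\bar{x})^2 / 2s^2}$ becomes $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ with $z = (x-\bar{x})/s$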
Standard Normal Distribution
• In the new variable z, the mean is at z = 0
• We also set the standard deviation to one (1)
• We get the standard normal distribution:
$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$
Called z-statistics
Standard Normal (Gaussian) Distribution
The total area under the
standard normal distribution is
unity, which means equal to 1.
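$\text{AREA} = \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$
[Figure: standard normal curve with $z_1$ and $z_2$ marked on the z axis]
z-statistics:
use z scale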
How do we know that $z_c = 1.96$
corresponds to 95% of the area?
• We integrate the equation for the ‘standard’
Gaussian curve from -1.96 to +1.96 and we get
0.95
• Or, integrate from 0 to 1.96, and multiply by 2,
because the Gaussian curve is symmetric
about the mean.
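The integration described above can be checked numerically. This sketch uses a simple trapezoid rule (the step count is an arbitrary choice) and compares it with the closed form based on the error function, `math.erf`, from Python's standard library; the function names are illustrative.

```python
import math

def standard_normal_pdf(z):
    """Standard normal curve: exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def area_trapezoid(z1, z2, steps=2000):
    """Approximate the area under the curve from z1 to z2."""
    h = (z2 - z1) / steps
    total = 0.5 * (standard_normal_pdf(z1) + standard_normal_pdf(z2))
    for i in range(1, steps):
        total += standard_normal_pdf(z1 + i * h)
    return total * h

def area_central(z):
    """Area from -z to +z, using the error function."""
    return math.erf(z / math.sqrt(2))

full_area = area_trapezoid(-1.96, 1.96)   # integrate from -1.96 to +1.96
half_area = area_trapezoid(0.0, 1.96)     # integrate from the mean to 1.96

print(f"full = {full_area:.4f}, half = {half_area:.4f}, erf = {area_central(1.96):.4f}")
```

Doubling the half area gives the full area because the curve is symmetric about z = 0.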
z-Statistics
(half-areas)
0.4750 is the area under the
standard curve from zero,
which is the mean, to z=1.96
$\int_{0}^{1.96} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = 0.475$
[Figure: histogram of marks (axis 5 to 25) with the mean marked; 95% of the students had marks in the …]
Estimating with Small Samples
Introduction to t-Statistics
Small Samples
• Small samples are those with n < 30 elements
• Our z-statistics are no longer accurate
• Must use t-statistics
– The sample variance is weighted
– Result is a different distribution from the normal
distribution, but with a similar shape
– Appropriate for small samples
Student t-Distribution
• The distribution for
small samples is
generally called the
Student's t-distribution
– Based on a weighting
described by W. S.
Gosset (1876-1937)
– He was interested in
beer (yes, beer)
– He worked for a beer
company (Guinness)
Student t-Distribution
• Because of variability in the
ingredients of beer, samples
that come from the same
population are generally
small
• Important for Quality
Control
• Gosset’s company did not
allow him to publish, so he
did so under the pseudonym
“Student”
Student t-Distribution
• Gosset showed that small samples
taken from an essentially normal
population have a wider confidence
interval than we would predict with
z-statistics.
• For small samples, we must use
t-statistics.
• The t-distribution is shorter and
fatter than the normal distribution,
but when DOF=∞ (i.e., for large n)
the t-distribution becomes a normal
(Gaussian) distribution
• DOF = n − 1 = Degrees of Freedom
[Figure: t-distributions for DOF = 1, 2, 3 and ∞]
Student t-Distribution
$\mu = \bar{x} \pm t_{\nu,c}\, s_{\bar{x}}$
where
$t_{\nu,c}$ = t-statistic for $\nu$ degrees of freedom at confidence level c
$\nu = n - 1$ = DOF = Degrees of Freedom
t-Statistics
There are ‘calculators’ on the web. See, for example:
http://www.tutor-homework.com/statistics_tables/statistics_tables.html
Example
• Consider a sample of eighteen batteries from the
entire population of batteries (each battery
should be about 9 V)
• In order to establish the true mean of the
voltages, you measure the voltage across each
battery with a voltmeter
• Using the collected data, calculate:
– Mean and standard deviation
– 95% and 99% confidence interval estimates
– Population mean with 90% confidence
Meas. #   Voltage (V)
1         6.51
2         8.45
3         11.76
4         8.36
5         9.35
6         9.23
7         7.85
8         8.59
9-11      …
12        9.87
13        8.04
14        8.38
15        10.01
16        12.84
17        10.48
18        7.27
[Figure: 9-Volt Battery Histogram, Count vs. Voltage (V), with the mean marked]
$\bar{x} = 9.1498$
$s^2 = \frac{1}{18-1}\left[(6.51-9.1498)^2 + (8.45-9.1498)^2 + \dots + (7.27-9.1498)^2\right] = 2.4888$
$s = 1.5776$ (Do not round yet!)
$s_{\bar{x}} = 0.3718$
$\mu = \bar{x} \pm t_{17,95}\, s_{\bar{x}}$
where
$t_{17,95} = 2.110$ = Student's t for 95% confidence and 17 degrees of freedom
$\mu = \bar{x} \pm t_{17,c}\, s_{\bar{x}}$, with $t_{17,90} = 1.740$ for 90% confidence
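The battery intervals can be worked through from the summary values in this example (n = 18, mean 9.1498, s = 1.5776). In this sketch the t-values 1.740 and 2.110 come from the example; the 99% value 2.898 is not in the handout and is assumed here from a standard t-table for 17 degrees of freedom.

```python
import math

n = 18            # number of batteries measured
x_bar = 9.1498    # sample mean (V), kept unrounded
s = 1.5776        # sample standard deviation (V), kept unrounded

# Two-sided t-values for 17 degrees of freedom.
# 90% and 95% are quoted in the example; 99% is an assumed table value.
t_17 = {90: 1.740, 95: 2.110, 99: 2.898}

# Standard error of the mean: s / sqrt(n)
standard_error = s / math.sqrt(n)

# Half-width of each confidence interval: t * standard error
half_widths = {level: t * standard_error for level, t in t_17.items()}

for level, half in half_widths.items():
    print(f"{level}% confidence: {x_bar:.2f} ± {half:.2f} V")
```

Rounding is applied only when printing, in the spirit of the "Do not round yet!" note.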
Comparison of z-statistics
with t-statistics
• On the next slide, note that as the sample size
nears 30, one can see how the results
obtained using t-statistics get closer to those
from z-statistics
• Using z-statistics overestimates confidence
when the sample size is small:
• the error bars are too small if you use z-statistics for
samples of less than 30.
t-Statistics
$\mu = \bar{x} \pm t_{17,95}\, s_{\bar{x}}$
where
$t_{17,95} = 2.110$ = Student's t for 95% confidence and 17 degrees of freedom
Compare this with $z_{95} = 1.96$, which is for 95% confidence for large samples
Estimating Proportional
Mean Values
Tossing a Coin
• P(H) = P(T) = 0.5 (we know this is true for a
‘fair coin’)
• Try 100 tosses and find:
– 58/100 are heads in one trial
– 45/100 are heads in another trial
• We call these ratios proportions
Population Mean for Proportions
• Estimate the proportion p by sampling
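$p \pm z_c \sqrt{\frac{p(1-p)}{n}}$
$np \ge 5$ and $n(1-p) \ge 5$ must be satisfied to use this method
E.g., 80 heads in 150 tosses, at 90% confidence:
$p \pm z_c \sqrt{\frac{p(1-p)}{n}} = 80/150 \pm 1.645\sqrt{(80/150)(70/150)/150}$
$= 0.53 \pm 0.07$
From this measurement, we cannot say the coin is unfair.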
1.645 corresponds to a half
area of 0.4500. This means
that the total area under the
curve from –z to +z is twice
this number, which is 0.90.
So, z90=1.645
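The coin calculation, including the two validity checks, can be sketched as a small function. The name `proportion_interval` and its signature are hypothetical; the numbers are the 80 heads in 150 tosses used above.

```python
import math

def proportion_interval(successes, n, z_c):
    """Return (p, half_width) for the estimate p ± z_c * sqrt(p(1-p)/n)."""
    p = successes / n
    # The method is only valid when np >= 5 and n(1-p) >= 5
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("np and n(1-p) must both be at least 5")
    half_width = z_c * math.sqrt(p * (1 - p) / n)
    return p, half_width

# Coin example: 80 heads in 150 tosses, 90% confidence (z90 = 1.645)
p, half_width = proportion_interval(80, 150, 1.645)
print(f"p = {p:.2f} ± {half_width:.2f}")

# A fair coin (p = 0.5) lies inside the interval, so the coin
# cannot be called unfair from this measurement
fair_coin_inside = abs(p - 0.5) <= half_width
print(fair_coin_inside)
```

The same function can be called with `z_c = 1.960` for 95% confidence estimates from other counts.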
Polling (e.g., in Politics)
• A poll is conducted with a sample size of
n = 200 to test the support for candidate A
• The sample suggests that A will receive 32% of
the votes
• Estimate the population proportion, with 95%
confidence
$p \pm z_c\sqrt{\frac{p(1-p)}{n}} = 0.32 \pm 1.960\sqrt{0.32 \times 0.68 / 200}$
$= 0.32 \pm 0.06$
Confident to within six percentage points, 19 times out of 20
Check: $np = 200 \times 0.32 = 64 \ge 5$ and $n(1-p) = 200 \times 0.68 = 136 \ge 5$
Manufacturing
• In a manufacturing operation, 200 parts were
examined and 8/200 were found to be defective
• Estimate the population proportion, with 95%
confidence
$p \pm z_c\sqrt{\frac{p(1-p)}{n}} = 0.040 \pm 1.960\sqrt{0.04 \times 0.96 / 200}$
$= 0.04 \pm 0.027 \approx 0.04 \pm 0.03$
Check: $np = 200 \times 0.04 = 8 \ge 5$ and $n(1-p) = 200 \times 0.96 = 192 \ge 5$
• With 95% confidence, the true defect rate is between about 1 and
7 defective parts for every 100 parts made