
Lectures 11 12 13 - Engineering Statistics 2017 - Handouts

- ECOR1010 is an engineering statistics course that teaches statistical analysis methods, how to interpret results, and the importance of experimental design and data analysis
- Statistics are used to summarize data, maximize information

Uploaded by

Stephen Alao
Copyright
© © All Rights Reserved

ECOR1010

Engineering Statistics
Motivation
• To learn methods of statistical analysis:
– What is available.
– How to interpret results.
– What are the pitfalls and caveats.
– The necessity of good experimental design,
technique and data-analysis to achieve high
quality results.
Overview of Statistics
• Statistical Analysis is the science of data
collection and data interpretation.
– Uses formal probabilistic methods for drawing
inferences and making decisions from these
data.
• Summarize data
• Maximize information derived from data
• Test alternate hypotheses or models
• Compute probabilities of future occurrences
• Make rational decisions based on data and information.
– Quantifies our Ignorance (uncertainty)
What are Statistics?
• Statistics are about ways to describe our
previous experience
• Statistics are useful as guides and motivators
• Statistics are generally poorly understood and
often abused
• “Oh, people can come up
with statistics to prove
anything, 14% of people
know that”, Homer Simpson.
Definition of a Statistic
• A statistic is a specified, determinable function of a set of observations (data set)
– Given n measures of some "random" quantity $x_i$, you can calculate various statistics such as these examples:
$n$ is the sample size
$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$ is the sample sum
$\frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean (arithmetic average)
$\sum_{i=1}^{n} x_i^2$ is the sample sum of squares
$R = \max(x) - \min(x)$ is the sample range
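The statistics listed above can be computed directly from a list of measurements. The following is a minimal sketch; the values in `x` are hypothetical and only serve to illustrate the formulas.

```python
# Hypothetical measurements x_i (illustrative values, not from the lecture)
x = [2.0, 3.5, 1.5, 4.0, 3.0]

n = len(x)                                  # sample size
sample_sum = sum(x)                         # x_1 + x_2 + ... + x_n
sample_mean = sample_sum / n                # arithmetic average
sum_of_squares = sum(xi ** 2 for xi in x)   # sum of x_i squared
sample_range = max(x) - min(x)              # R = max(x) - min(x)

print(n, sample_sum, sample_mean, sum_of_squares, sample_range)
```

Each result is a single number computed from the data set, which is all that "statistic" means here.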


Definition of a Statistic
• Statistics are numbers that we use to describe a large number of measurements
• Depending on what is being measured, some statistics are better than others, and some statistics are of no real value.
– Just because you can calculate something does not make it useful !!
• An example of a silly statistic:
– If you have one hand in cold water and the other hand in hot water, does the mean (average) temperature tell you anything about how comfortable you feel?
$T_{avg} = \frac{T_{cold} + T_{hot}}{2}$
Sometimes an arithmetic average
(Mean) is relevant:
• because it gives an indication of where the
‘center’ of the measurement distribution is.
• Median is another statistic used to indicate
the ‘center’
• Mode is a statistic used to indicate the most
probable measurement
Mean
• Mean (or arithmetic average) is the sum of all
the data divided by the number of data points
– Notation depends on whether we are computing
the mean of a population (N) or of a sample (n)

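Recall:
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$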
Median
• Median is the value of the data point in the
centre of the data set when arranged in
ascending (or descending) order
• E.g., 2.98, 3.18, 3.25, 3.50 and 3.74
– Median is 3.25
• If there is an even number of points, the median
is the mean of the two centre data points
– E.g., 2, 3, 6, 7, 12, and 15
– Median is (6 + 7)/2 = 6.5
Mode
• Mode is the value (or values) that occurs with the
greatest frequency
• e.g., 2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, and 18
– Mode is 9 (is unimodal)
• e.g., 2, 3, 4, 4, 4, 5, 5, 7, 7, 7, 9
– Modes are 4 and 7 (called bimodal)
• A data set can have no mode (e.g., uniform)
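As a quick check of the median and mode examples on these slides, here is a small sketch using Python's standard `statistics` module. The variable names are only illustrative.

```python
import statistics

odd = [2.98, 3.18, 3.25, 3.50, 3.74]                   # odd number of points
even = [2, 3, 6, 7, 12, 15]                            # even number of points
unimodal = [2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, 18]
bimodal = [2, 3, 4, 4, 4, 5, 5, 7, 7, 7, 9]

median_odd = statistics.median(odd)          # centre value
median_even = statistics.median(even)        # mean of the two centre values
modes_uni = statistics.multimode(unimodal)   # list of most frequent values
modes_bi = statistics.multimode(bimodal)
mean_even = statistics.mean(even)            # sum divided by number of points

print(median_odd, median_even, modes_uni, modes_bi, mean_even)
```

Note that `multimode` returns every value tied for the highest count, so for data in which every value occurs once it returns all of them; deciding that such a data set has "no mode" is left to the analyst.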
How to make a Histogram
• Start with a data set of
observations
• Find the big Range
(= max value - min value);
• Make a ‘number’ of bins
with smaller equal-size
ranges that span the big
Range;
• Count how many of your
data fit in the ranges of
your bins
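The binning steps above can be sketched in a few lines of Python. This is only one possible convention: the helper name `histogram_counts`, the heights (in inches, hypothetical values) and the rule that the maximum value falls in the last bin are all assumptions made for illustration.

```python
def histogram_counts(data, num_bins):
    """Count how many values fall in each of num_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins          # the big Range split into equal bins
    if width == 0:
        raise ValueError("all values are equal; cannot form bins")
    counts = [0] * num_bins
    for value in data:
        index = int((value - lo) / width)
        if index == num_bins:             # put the maximum value in the last bin
            index -= 1
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical heights in inches (not the Anytown data set)
heights_in = [61, 63, 64, 66, 66, 67, 68, 69, 70, 71, 73, 75]
edges, counts = histogram_counts(heights_in, 7)

for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'*' * count}")
```

Changing `num_bins` changes the look of the histogram, even though the data are the same.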
Heights of 100 Residents in Anytown, Canada (ft)
Introduction to Engineering Analysis, 3rd Edition, K. D. Hagen, 2008
Classification of Height Data
Histogram for Heights in Anytown
Histograms
• Valuable statistical tool for showing the
frequency distribution of data
– Information about location, spread, and shape
that is portrayed can provide clues about the
underlying process that generated the data
Frequency Distributions (Histograms)

Note that Figure 18.4 in the textbook has left skew defined incorrectly.
The distribution looks ‘smoother’ as the number of
students in the sample gets larger.

[Figure: histogram of Midterm 2017 marks (Count vs. Mark out of 27). N = 1067; Mean: 17.5/27 (65%); Median: 18/27 (67%); Std.Dev 3.8.]

The red line is the “bell curve”, which is also the Normal distribution,
which is also called a Gaussian distribution.
Distributions
• Many types of distributions
• The distribution that best describes your data
will depend on the ‘physics’
• Many engineering measurements have
normal (or near normal) distributions
– Normal distributions are also called Gaussian
distributions, and sometimes Bell curves
• Other distributions
– Bimodal, multimodal, flat, skewed (either right or
left), etc.
Aside on Histograms
• The appearance of histograms is quite dependent on
the number of bins and how the bin boundaries are
computed.

See also: onlinestatbook.com/rvls/index.html


Go to the ‘Simulations/Demonstrations’ link and then to ‘Histograms’ and then to
‘Histograms, Bin Widths and Cross Validation’
Aside on Histograms
[Figure: three histograms of the same data (Count vs. Mark out of 30; Mean 20.2, Median 21.0, SD 4.3, N 907), drawn with different bin boundaries.]

Do you place the bin boundary at the front end or the back end?
I.e., does a mark of 30 go into the next bin (30 and above), or into the
bin for above 29 up to 30? Or maybe the bin is centred on the value?
Abuse of Statistics
• There is a famous expression that you might hear
when people talk about statistics
• It is usually credited to Mark Twain, who popularized it in the
early 1900s and himself attributed it to Disraeli.
• “There are three types of falsehoods, each worse
than the one before – lies, damned lies, and
statistics.”
• This is meant to be funny, but it does reflect a feeling
people have that statistics can be used to prove
anything, hence, people mistrust statistics.
Stretching the truth with statistics
• A politician in power says: “My town is wealthy: the average salary is $71 000.”
• The leader of the opposition replies: “But half of our citizens make less than $56 000.”
• The third party leader says: “Most of the people make $30 000”
Everyone is correct: the first quotes the mean, the second the median and the third quotes the mode; but the messages are ‘different’.
[Figure: histogram of Salaries in Anytown (Count vs. Salary (x1000$) per year). Mean: $71 000; Median: $56 000; Mode: $30 000.]
Bill Gates moves into town ‘for an hour’
• A politician in power says: “My town is wealthy: the average salary is $86 000.”
• The leader of the opposition replies: “But half of our citizens make less than $56 000.”
• The third party leader says: “Most of the people make $50 000”
[Figure: histogram of Salaries in Anytown (Count vs. Salary (x1000$) per year). Mean: $71 000; Median: $56 000; Mode: $30 000.]
[Figure: histogram of Salaries in Anytown after Bill moves in (Count vs. Salary (x1000$) per year, axis extended to 1000). Mean: $86 000; Median: $56 000; Mode: $50 000.]


Bill Gates would be called an outlier
• An outlying observation, or outlier, is one that
appears to deviate markedly from other
members of the sample in which it occurs.
• Often people make arguments to ‘ignore’
outliers
• You should only eliminate outliers after very,
very careful consideration:
– Ignoring Bill Gates might hurt his feelings!
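To see how a single outlier moves the mean but not the median, here is a small sketch. The salary list is hypothetical: it was invented so that its mean and median echo the Anytown figures, and it is not the data behind the histograms.

```python
import statistics

# Hypothetical salaries in thousands of dollars (not the Anytown data)
salaries = [30, 30, 30, 45, 56, 56, 70, 90, 110, 193]
with_outlier = salaries + [10_000]        # one extremely high earner arrives

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"before: mean {mean_before}, median {median_before}")
print(f"after:  mean {mean_after:.1f}, median {median_after}")
```

The median is unchanged by the newcomer, while the mean is dominated by the one extreme value. Whether that value should be kept is a question about the data, not about the arithmetic.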
Interlude on Outliers and
Censoring
• Extreme values (outliers) pose a problem
– Are they valid extreme values or gross errors of
measurement?
– Extreme values should never be deleted (censored)
without careful investigation.
• Sometimes the extreme values are the most important data in
the experiment; don’t delete your Nobel Prize mindlessly!
• Including erroneous data can seriously bias the results.
• Censoring of good data always introduces bias and
always makes the data look better than it really is.
– Automatic censoring, as is done by some instrument
software is not generally an acceptable procedure.
Descriptive Statistics
When you wish to use a few numbers to
describe your data set:
• Mean and Median give an idea of ‘central
tendency’
– where is the middle?
• Mode tells you the most probable ‘bin’ range
for your data
• Standard Deviation and Range give
indications of width
– how wide is the distribution?


Measures of Variation (Width)
• A measure of variation is a number that
indicates the extent to which data are spread
out around the mean
– Standard deviation
– Variance
– Range
• How far is a given data point from the mean?
i.e., what is its deviation from the mean?

Three ways of indicating with a number how far $x_i$ is from the (population) mean:
Standard Deviation (Population)
The Standard Deviation (abbreviated Std.Dev below) gives an
indication of the width of the distribution
[Figure: histogram of Salaries in Anytown (Count vs. Salary (x1000$) per year), Population Descriptive Statistics. Mean: $71 000; Mean + 1 Std.Dev = $127 000; Mean - 1 Std.Dev = $15 000.]
This is a ‘population’ because we have included the entire town
population. If we only considered a small number of people in the town
we would calculate statistics for a sample, see later slide.
The Standard Deviation (abbreviated Std.Dev below) gives an
indication of the width of the distribution
[Figure: histogram of Salaries in Anytown (Count vs. Salary (x1000$) per year), Population Descriptive Statistics. Mean: $71 000; Mean + 1 Std.Dev = $94 000; Mean - 1 Std.Dev = $48 000.]
If the width of the distribution is smaller then the Std.Dev is smaller
This is a ‘population’ because we have included the entire town
population. If we only considered a small number of people in the town
we would calculate statistics for a sample, see next slide.
Standard Deviation (Sample)

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Statisticians have discovered that using the sample mean in place of the
population mean underestimates the true standard deviation, so n−1 is
used here rather than n.
Variance
• Variance is simply the square of the standard
deviation (population and sample variance are
shown below, respectively)
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
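The difference between dividing by N and by n−1 is easy to see numerically. Below is a short sketch using Python's standard `statistics` module; the data values are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical measurements

# Treating the data as the whole population: divide by N
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Treating the data as a sample from a larger population: divide by n - 1
samp_var = statistics.variance(data)
samp_std = statistics.stdev(data)

print(f"population: variance {pop_var}, std dev {pop_std}")
print(f"sample:     variance {samp_var:.4f}, std dev {samp_std:.4f}")
```

The sample values are always slightly larger than the population values for the same numbers, and the gap shrinks as n grows.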
Descriptive Data Analysis
1. Graphics Displays:
– Dot plots,
– Box and Whisker Plots
– Scatter plots,
– Frequency plots (histograms)
2. Summary statistics:
– mean,
– median,
– mode,
– standard deviation,
– range (and interquartile range)
Descriptive Graphics of Data
[Figure: Box & Whisker Plots for BELT1, BELT2, BELT3 and BELT4. Left panel: Median, IQR (25%-75%), Range (Min-Max). Right panel: Mean, ±SD, 95% CL (±1.96*SD).]
[Figure: Correlation Plot of Oxy. Measurements, BELT1 to BELT4 (ChMPIngotOxygen.sta 7v*47c).]
Descriptive Graphics of Data
[Figure: Box & Whisker Plot of MeanOxy (Median, IQR, Range). Median = 1177.5; 25%-75% = (1147.5, 1217.5); Min-Max = (1042.5, 1285).]
[Figure: Box & Whisker Plot of MeanOxy (Mean, SD, 95% CL). Mean = 1175.2128; ±SD = (1122.5375, 1227.888); ±1.96*SD = (1071.9693, 1278.4563).]
[Figure: Frequency Distr. of Mean Oxy. (histogram of MeanOxy, No. of obs.). K-S d=.10137, p> .20; Lilliefors p> .20; Shapiro-Wilk W=.98245, p=.69545.]
[Figure: Scatter Plot of Mean Oxy. vs Seq. Num. (ChMPIngotOxygen.sta 7v*47c). MeanOxy = 1163.4459+0.4903*x.]
Summary Statistics (samples)
Measures of Location
• Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Median: middle value when you order the values: $x_1 \le x_2 \le x_3 \le \dots \le x_n$
• Mode: position or positions of maximum probability
– Not always easy to define in experimental data
• Remember: a statistic is any function of the data.
Measures of Spread
• Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Standard Deviation: $s = \sqrt{Var} = \sqrt{s^2}$
• Range: R = max(x) – min(x)
• Interquartile Range
• Full Width Half Max (FWHM)
– Half width half max. (HWHM)
The first “Golden Rule” of Data
Analysis
• Study the data
– Use the descriptive methods above to get a ‘feel’ for the
data.
– Complicated data sets deserve several hours, days, or
even weeks of study.
– ‘Outliers’ must be carefully vetted.
– Data are not simply numbers but rather measurements or
counts of real entities.
– Tentative conclusions should be made in the contexts of
their meaning in relation to these entities, the real
background of the data, and how the data were collected.
• Valid conclusions are unlikely to be obtained from
poor data.
Summary
• For the descriptive analysis to be valid
– The data must be independent (exchangeable)
– Arbitrary inclusion of repeat measurements is
not allowed.
– The data cannot be correlated such that
exchangeability is violated.
• If these conditions aren’t met, the standard
interpretation of the results could lead to
very, very, wrong conclusions.
What Statistics can’t do
• Can’t get blood from a stone
– Can’t rescue you from bad data.
– Can’t get good results from poorly designed and
badly executed experiments.
– Can’t generate information where none exists.
• Statistics might help you understand and
quantify just how bad your results are from
your bad data.
• In other words, it is not a magic Mr. Fix-it in a
black box.
How well do we know the mean value
(i.e., arithmetic average) we determine
from our data?
The mean mark on the midterm was 18/30
The mean salary in Anytown, before Bill, was
$71 000.
• How many sig figs should we quote for these mean values?
• What is the error in these mean values?
As an example:
The goal is to determine the average (mean)
weight of students on campus
• You could weigh the entire population of students at Carleton
and then calculate their average weight, but this takes time,
and would cost a lot of money.
• Or, you can choose a sample of students, say 5 students,
weigh them and calculate an average.
• How far would this average be from the real average of the
total population of students?
• If you chose a second group of 5 students, how different
would the second group’s average weight be?
– You would expect that the average weight of each group of 5 students
would be ‘slightly’ different.

Each group of 5 students must be chosen randomly. Why?


The n5 Student Weighing Machine
[Figure: the n5 student massing machine; labels: 10 kg, 350 kg.]
Standard Error in the mean value
• Start with any ‘arbitrary’
distribution that represents
your population
1. Take, for example, n=5
samples from the starting
distribution
2. Take the average of these n=5
samples
3. Put a ‘blue’ box in the lower
histogram showing the
average value calculated in 2.
4. Go to 1 and repeat until the histogram
doesn’t change much.
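The procedure above can be tried on a computer. The following is a minimal sketch under stated assumptions: the population is a hypothetical set of 10 000 student masses drawn uniformly between 50 and 100 kg, the seed is arbitrary, and samples are drawn with replacement.

```python
import random
import statistics

def sample_means(population, n, repeats, rng):
    """Repeat steps 1 and 2: draw n values at random and average them."""
    return [statistics.fmean(rng.choices(population, k=n))
            for _ in range(repeats)]

rng = random.Random(1010)                                    # arbitrary seed
population = [rng.uniform(50, 100) for _ in range(10_000)]   # hypothetical masses (kg)
sigma = statistics.pstdev(population)                        # population std dev

means_n5 = sample_means(population, 5, 10_000, rng)
means_n20 = sample_means(population, 20, 10_000, rng)

se_n5 = statistics.pstdev(means_n5)       # width of the histogram of n=5 means
se_n20 = statistics.pstdev(means_n20)     # width of the histogram of n=20 means

print(f"n=5:  spread of means {se_n5:.2f}, sigma/sqrt(5)  {sigma / 5 ** 0.5:.2f}")
print(f"n=20: spread of means {se_n20:.2f}, sigma/sqrt(20) {sigma / 20 ** 0.5:.2f}")
```

In this sketch the spread of the averages comes out close to the population standard deviation divided by the square root of n, and the n=20 averages are more tightly grouped than the n=5 averages.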
Standard Error in the mean value

Now we are taking 10 000 samples and averaging each.


What if we choose n=20?
The super-duper n20 Student weighing machine
The n5 Weighing Machine versus the n20 Weighing Machine

[Left panel: averages of 5 samples taken from the student population. Right panel: averages of 20 samples taken from the student population.]

Which machine is better?


Which machine is more precise?
Which machine is more accurate?
What number (statistic) would you use to quantify the difference in these machines?
Ans: Some measure of the width.
Precision and Accuracy
• High accuracy
(low bias)
• High precision
(low standard
deviation)
Precision and Accuracy
• Low accuracy
(large bias)
• High precision
(low standard
deviation)
Precision and Accuracy
• High accuracy
(low bias)
• Low precision
(high standard
deviation)
The n5 Weighing Machine versus the n20 Weighing Machine

• The standard error is related to the width of the sampled distribution. The
standard error is the standard deviation of the distribution of the means, i.e., the
‘sd’ in the lower plots.
• The lower plots approach normal curves (Gaussian curves) as n increases, regardless of the
shape of the parent population (provided its variance is finite): this is the Central Limit theorem.
• The standard error indicates how well you know the mean.

Standard Error for the mean of x: $s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
You can do this at home:
• Go to:
http://onlinestatbook.com/rvls/index.html
• Go to the ‘Simulations/Demonstrations’ link
and then to ‘Central Limit Theorem’ and then
to ‘Sampling Distribution Simulation’.

Sampling Distribution Simulation


Standard Error
• The standard deviation of the means is called
the standard error
– Describes the variation of the means about the
estimated mean
– We usually do not know the true standard
deviation, σ, of the parent population, so
– The standard error is estimated as follows
$s_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$
Standard Error
• Thus, using the standard error, the estimate of the
population mean is
$\mu = \bar{x} \pm z_c s_{\bar{x}}$
where $z_c$ is the z-value for the chosen confidence level,
e.g. $z_{95} = 1.96$ for 95% confidence
At 95% confidence, the true mean, μ, lies within this ± interval.
So, what is the point of all of this?
• Almost all of our measurements are made with instruments
that give us ‘average values’.
• If you take many measurements and plot a histogram of these
values, then you will get a normal (Gaussian) distribution
• If we only do one N5, or one N20, measurement then we will
have an estimate of the true mean, based on a single sample.
• The single N5, or N20, measurement distribution will also have
a width, which we calculate as the standard deviation: ‘s’.
• From the standard deviation, s, we calculate the standard error
in the estimated mean:
$s_{\bar{x}} = s / \sqrt{n}$
• We can never be 100% confident that the mean of the sample
is the ‘true mean’, but …
• We can give the sample mean value, which is our best estimate,
along with a $\pm z_c s_{\bar{x}}$ range, which is a confidence interval
depending on the choice of $z_c$.
Common Confidence Values
'true value' = measured mean value $\pm z_c \sigma_{measured}$
where
$z_c$ = z-value for the chosen confidence level
$z_{90} = 1.645$ for 90% confidence
$z_{95} = 1.96$ for 95% confidence (or 19 times out of 20)
$z_{99} = 2.58$ for 99% confidence
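These z values can be reproduced without a table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_for_confidence` is only illustrative.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """z such that the area from -z to +z under the standard normal is `confidence`."""
    # Two-sided interval: (1 - confidence) / 2 is left in each tail
    return NormalDist().inv_cdf(0.5 + confidence / 2)

z90 = z_for_confidence(0.90)
z95 = z_for_confidence(0.95)
z99 = z_for_confidence(0.99)

print(f"z90 = {z90:.3f}, z95 = {z95:.3f}, z99 = {z99:.3f}")
```

Rounded to two or three figures, the results agree with the table values quoted above.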

So, up to now, you simply put error bars on your


measurements. Now the size of the error bar depends
on how confident you are about the measurement.
Evolution of Reporting
Numbers
What you see on your calculator:
13.923
13.9 to 3 sig figs, because of the least
significant number in your calculation
13.9 ± 0.05, because of an assumed
error based on sig figs
13.9 ± 0.3, (2σ, 95% confidence)
Gaussian (normal) Distributions are
everywhere, so let’s look at them in
some detail
Gaussian Distribution
• Also called Normal
distribution
– Named after Carl
Friedrich Gauss (1777-
1855)
• Common distribution in
many engineering
applications
• Symmetric about a
central value
Continuous Form of the
Gaussian Distribution
Standard Normal Distribution
• To make the calculation easier, we create a
special table that eliminates the need to
perform a new integration every time
• Use a standard form, otherwise you need a
table for every new mean and standard
deviation
– Apply a change of variable: $z = \frac{x - \mu}{\sigma}$
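Population vs. Sample
Population: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \Rightarrow f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$
Sample: $f(x) = \frac{1}{s\sqrt{2\pi}} e^{-\frac{(x-\bar{x})^2}{2s^2}} \Rightarrow f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$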
Standard Normal Distribution
• In the new variable z, the mean is at z = 0
• We also set the standard deviation to one (1)
• We get the standard normal distribution:
$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$
Called z-statistics
Standard Normal (Gaussian) Distribution
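The total area under the
standard normal distribution is
unity, which means equal to 1.
$AREA = \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz$
[Figure: standard normal curve on the z scale (z-statistics), with the area between z1 and z2 shaded.]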
How do we know that $z_c = 1.96$
corresponds to 95% of the area?
• We integrate the equation for the ‘standard’
Gaussian curve from -1.96 to +1.96 and we get
0.95
• Or, integrate from 0 to 1.96, and multiply by 2,
because the Gaussian curve is symmetric
about the mean.
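The integration can be checked numerically. This is a rough sketch using the trapezoidal rule; the function names and the number of steps are arbitrary choices, and the closed form via `math.erf` is included only as a cross-check.

```python
import math

def std_normal_pdf(z):
    """Standard normal density f(z) = exp(-z^2 / 2) / sqrt(2 pi)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def area(z1, z2, steps=10_000):
    """Approximate the area under f(z) from z1 to z2 with the trapezoidal rule."""
    h = (z2 - z1) / steps
    total = 0.5 * (std_normal_pdf(z1) + std_normal_pdf(z2))
    for k in range(1, steps):
        total += std_normal_pdf(z1 + k * h)
    return total * h

full = area(-1.96, 1.96)                   # from -1.96 to +1.96
half = area(0.0, 1.96)                     # from the mean to +1.96
exact = math.erf(1.96 / math.sqrt(2))      # closed-form value for comparison

print(f"full = {full:.4f}, 2 x half = {2 * half:.4f}, erf form = {exact:.4f}")
```

Both routes give an area of about 0.95, and the half-area is about 0.475, which is the number found in the z table.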
z-Statistics
(half-areas)
0.4750 is the area under the
standard curve from zero,
which is the mean, to z=1.96
$\int_{0}^{1.96} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2} \, dz = 0.475$
[Figure: z table of half-areas, with the area from 0 to 1.96 on the z-scale shaded.]
Source: Introduction to Engineering Analysis, 3rd Ed., Hagen, 2008. Similar
tables are in Chapter 10 of your textbook.
Normal (Gaussian) Bell Curve

Compare these numbers with the $z_c$ values:
$z_{95} = 1.96$ for 95% confidence (or 19 times out of 20)
$z_{99} = 2.58$ for 99% confidence
Normal (Gaussian) Distributions
[Figure: histogram of Section C marks (Count vs. Mark out of 30) with a normal curve; the width σ is marked.]
68% of the students had marks in the range: mean ± σ
95% of the students had marks in the range: mean ± 1.96σ
The Usual Values People Know

Most computer programs, like Excel for example, have
functions that give you the areas, so tables are ‘old-fashioned’.
Random Number Generation
• We can use a computer to generate “pseudo-random”
numbers
– Not actually random, but have approximately random
statistical properties
• Random numbers according to a distribution can be
generated:
• Normal (Gaussian)
• Lognormal
• Poisson
• Binomial
• Uniform
• ChiSq , etc.
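As an illustration, Python's standard `random` module can draw pseudo-random numbers from several of these distributions. This is only a sketch: the seed 42 and the distribution parameters are arbitrary. Poisson, binomial and chi-square generators are not used here; third-party libraries such as NumPy provide them.

```python
import random

rng = random.Random(42)                                        # arbitrary seed

normal_draws = [rng.gauss(0.0, 1.0) for _ in range(5)]         # mean 0, std dev 1
lognormal_draws = [rng.lognormvariate(0.0, 1.0) for _ in range(5)]
uniform_draws = [rng.uniform(0.0, 1.0) for _ in range(5)]

# "Pseudo-random": the same seed reproduces exactly the same sequence
rng_again = random.Random(42)
repeat_draws = [rng_again.gauss(0.0, 1.0) for _ in range(5)]

print(normal_draws)
print(normal_draws == repeat_draws)
```

The repeatability shown in the last line is what makes the numbers "not actually random", and it is also what makes simulations reproducible.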

Estimating with Small Samples

Introduction to t-Statistics

Small Samples
• Small samples are those with n< 30 elements
• Our z-statistics are no longer accurate
• Must use t-statistics
– The sample variance is weighted
– Result is a different distribution from the normal
distribution, but with a similar shape
– Appropriate for small samples
Student t-Distribution
• The distribution for
small samples is
generally called the
Student t-distribution
– Based on a weighting
described by W. S.
Gosset (1876-1937)
– He was interested in
beer (yes, beer)
– He worked for a beer
company (Guinness)
Student t-Distribution
• Because of variability in the
ingredients of beer, samples
that come from the same
population are generally
small
• Important for Quality
Control
• Gosset’s company did not
allow him to publish, so he
did so under the pseudonym
“Student”
Student t-Distribution
• Gosset showed that small samples
taken from an essentially normal
population have a wider confidence
interval than we would predict with
z-statistics.
• For small samples, we must use
t-statistics.
• The t-distribution is shorter and
fatter than the normal distribution,
but when DOF=∞ (i.e., for large n)
the t-distribution becomes a normal
(Gaussian) distribution
DOF = ν = n − 1 = Degrees of Freedom
[Figure: t-distribution curves for DOF = 1, DOF = 2, DOF = 3 and DOF = ∞.]
Student t-Distribution
$\mu = \bar{x} \pm t_{\nu,c} \, s_{\bar{x}}$
where
$t_{\nu,c}$ = t-statistic for ν degrees of freedom
ν = n − 1 = DOF = Degrees of Freedom
t-Statistics

See the EXCEL functions: TDIST & TINV
There are ‘calculators’ on the web. See, for example:
http://www.tutor-homework.com/statistics_tables/statistics_tables.html
Example
• Consider a sample of eighteen batteries from the
entire population of batteries (each battery
should be about 9 V)
• In order to establish the true mean of the
voltages, you measure the voltage across each
battery with a voltmeter
• Using the collected data, calculate:
– Mean and standard deviation
– 95% and 99% confidence interval estimates
– Population mean with 90% confidence
Meas. #   Voltage (V)
1         6.51
2         8.45
3         11.76
4         8.36
5         9.35
6         9.23
7         7.85
8         8.59
9         9.05
10        8.08
11        10.59
12        9.87
13        8.04
14        8.38
15        10.01
16        12.84
17        10.48
18        7.27
[Figure: 9-Volt Battery Histogram (Count vs. Battery Voltage (V)), with the mean marked.]
Sample Mean
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + x_3 + \dots + x_n)$
$= \frac{1}{18}(6.51 + 8.45 + 11.76 + \dots + 7.27)$
$= 9.1498$ ← Do not round yet!
Sample Variance
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
$= \frac{1}{18-1}[(6.51 - 9.1498)^2 + (8.45 - 9.1498)^2 + \dots + (7.27 - 9.1498)^2]$
$= 2.4888$ ← Do not round yet!
$\bar{x} = 9.1498$, $s = \sqrt{2.4888} = 1.5776$
Sample Standard Error
$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{1.5776}{\sqrt{18}} = 0.3718$ ← Do not round yet!
$\bar{x} = 9.1498$, $s = 1.5776$, $s_{\bar{x}} = 0.3718$
t-Statistics
$\mu = \bar{x} \pm t_{17,95} \, s_{\bar{x}}$
where
$t_{17,95} = 2.110$ = Student's t value at 95% confidence for 17 degrees of freedom
$\mu = \bar{x} \pm t_{17,c} \, s_{\bar{x}}$
95% Confidence:
$\mu = 9.1498 \pm (2.110 \times 0.3718)$
$= 9.1498 \pm 0.7845$
$= 9.15 \pm 0.8$
99% Confidence:
$\mu = 9.1498 \pm (2.898 \times 0.3718)$
$= 9.1498 \pm 1.076$
$= 9.1 \pm 1$
We were asked for 90% confidence
estimate of the Population Mean
• What will be the difference in the calculations
between what was just done, and what we are
now being asked to do?
• The previous estimates were also for the
population mean.
• The only difference is we use $\mu = \bar{x} \pm t_{17,90} \, s_{\bar{x}}$, with $t_{17,90} = 1.740$
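The whole battery example can be reproduced in a few lines. This sketch makes two assumptions: the t values for 17 degrees of freedom are simply copied from the slides rather than computed, and the voltages are the two-decimal values printed in the table, so the results differ from the slide values (9.1498, 1.5776, 0.3718) in the third decimal place.

```python
import math
import statistics

# Voltages as printed in the table (rounded to two decimals)
voltages = [6.51, 8.45, 11.76, 8.36, 9.35, 9.23, 7.85, 8.59, 9.05,
            8.08, 10.59, 9.87, 8.04, 8.38, 10.01, 12.84, 10.48, 7.27]

# t values for 17 degrees of freedom, copied from the slides
T17 = {90: 1.740, 95: 2.110, 99: 2.898}

n = len(voltages)
mean = statistics.mean(voltages)
s = statistics.stdev(voltages)            # sample standard deviation (n - 1)
std_err = s / math.sqrt(n)                # standard error of the mean

# Half-width of each confidence interval: t * standard error
intervals = {conf: t * std_err for conf, t in T17.items()}

for conf, half_width in intervals.items():
    print(f"{conf}% confidence: {mean:.2f} ± {half_width:.2f} V")
```

The interval widens as the requested confidence increases, because a larger t value multiplies the same standard error.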
Comparison of z-statistics
with t-statistics
• On the next slide, note that as the sample size
nears 30, one can see how the results
obtained using t-statistics get closer to those
from z-statistics
• Using z-statistics overestimates confidence
when the sample size is small:
• the error bars are too small if you use z-statistics for
samples of less than 30.
t-Statistics
$\mu = \bar{x} \pm t_{17,95} \, s_{\bar{x}}$
where
$t_{17,95} = 2.110$ = Student's t value at 95% confidence for 17 degrees of freedom
Compare this with $z_{95} = 1.96$, which is for 95% confidence for large samples
Estimating Proportional
Mean Values

Tossing a Coin
• P(H) = P(T) = 0.5 (we know this is true for a
‘fair coin’)
• Try 100 tosses and find:
– 58/100 are heads in one trial
– 45/100 are heads in another trial
• We call these ratios proportions

Population Mean for Proportions
• Estimate the proportion p by sampling
$p \pm z_c \sqrt{\frac{p(1-p)}{n}}$
Must be satisfied to use this method: $np \ge 5$ and $n(1-p) \ge 5$

• Use z-statistics for n > 30 and t-statistics for smaller n


Proportion Example
• Suppose we take n = 150 coin flips
• 80 of the 150 are ‘heads’
• Then p = 80/150 is the sample proportion of heads
• p = 80/150 = 53%
What error do we put on p?

In other words: given that we are sampling, how


well do we think we know the ‘true’ proportion of
heads?
Example
• For our coin-flipping example, find the 90%
confidence interval for the proportion p
Check: $np = 150 \times 80/150 = 80 \ge 5$ and $n(1-p) = 150 \times (1 - 80/150) = 70 \ge 5$
$z_{90} = 1.645$ (see next slide)
$p \pm z_c \sqrt{\frac{p(1-p)}{n}} = 80/150 \pm 1.645 \sqrt{(80/150) \times (70/150) / 150}$
$= 0.53 \pm 0.07$
From this measurement, we cannot say the coin is unfair.
1.645 corresponds to a half
area of 0.4500. This means
that the total area under the
curve from –z to +z is twice
this number, which is 0.90.
So, z90=1.645
Polling (e.g., in Politics)
• A poll is conducted with a sample size of
n = 200 to test the support for candidate A
• The sample suggests that A will receive 32% of
the votes
• Estimate the population proportion, with 95%
confidence
$p \pm z_c \sqrt{\frac{p(1-p)}{n}} = 0.32 \pm 1.960 \sqrt{0.32 \times 0.68 / 200}$
$= 0.32 \pm 0.06$
Confident to within six percentage points, 19 times out of 20
Check: $np = 200 \times 0.32 = 64 \ge 5$ and $n(1-p) = 200 \times 0.68 = 136 \ge 5$
Manufacturing
• In a manufacturing operation, 200 parts were
examined and 8/200 were found to be defective
• Estimate the population proportion, with 95%
confidence
$p \pm z_c \sqrt{\frac{p(1-p)}{n}} = 0.040 \pm 1.960 \sqrt{0.04 \times 0.96 / 200}$
$= 0.04 \pm 0.027 \approx 0.04 \pm 0.03$
Check: $np = 200 \times 0.04 = 8 \ge 5$ and $n(1-p) = 200 \times 0.96 = 192 \ge 5$
• For every 100 parts made, between about 1 and
7 parts will be defective 95% of the time
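The three proportion examples (coin flips, polling, manufacturing) all use the same calculation, sketched below. The function name `proportion_interval` is hypothetical, the validity check is read here as "at least 5", which is an assumption, and the poll's 32% of 200 is taken to be 64 respondents.

```python
import math

def proportion_interval(successes, n, z_c):
    """Return (p, half_width) for the interval p ± z_c * sqrt(p(1 - p) / n)."""
    p = successes / n
    # Validity check from the slides, read here as "at least 5" (an assumption)
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("np and n(1 - p) must both be at least 5")
    half_width = z_c * math.sqrt(p * (1 - p) / n)
    return p, half_width

coin = proportion_interval(80, 150, 1.645)    # 90% confidence
poll = proportion_interval(64, 200, 1.960)    # 95% confidence, 32% of 200
parts = proportion_interval(8, 200, 1.960)    # 95% confidence

for name, (p, half_width) in [("coin", coin), ("poll", poll), ("parts", parts)]:
    print(f"{name}: {p:.2f} ± {half_width:.2f}")
```

Rounded to two decimals, the half-widths match the ± values quoted in the three examples.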
