
08 GB DMAIC Basic Statistics Part 3

DMAIC Basic Statistics Part 3

Uploaded by

dinesh.munaswamy

g GE Global Research

Green Belt DMAIC Workshop

BASIC STATISTICS
Part 3

Define
Measure
Analyze
Improve
Control

GB DMAIC - Basic Statistics Part 3 Version 2.0 5/2002 7.1

Learning Objectives

• Understand the terms sample and population
• Understand the terms parameter and statistic
• Understand the terms point estimator and interval estimator
• Understand the Central Limit Theorem
• Understand confidence intervals
• Understand what a hypothesis is
• Understand various distributions
• Understand ANOVA


Statistics Fundamentals


Population and Sample


Universe / Population :
The set of all potential observations about which the experimenter
wishes to make some general statement.

Sample :
A small fraction (subset) of the population which the experimenter
chooses for study in order to make some statement about the
population.

Inference :
The conclusion drawn about the population based on the study of a
sample.

• Inference about the population has to be drawn on the basis of a sample
• The inferences must be drawn under uncertainty


Parameter and Statistic

            Population      Sample
            (Parameter)     (Statistic)
Mean        $\mu$           $\bar{x}$
Variance    $\sigma^2$      $s^2$

A statistic is used as the estimate of the corresponding parameter to draw the inference.


Types of Estimators

Point Estimator :
• Estimation of a population parameter in terms of a single number

Interval Estimator :
• Estimation of a population parameter in terms of a range
• Specified as Point Estimator ± Error

[Figure: a point estimate and an interval estimate marked on the population distribution]


Central Limit Theorem


The Central Limit Theorem Also Applies to Non-Normal Parent Populations

[Figure: distributions of individual measurements (top) and the corresponding distributions of averages, with n measurements in each average (bottom), centered on the grand average $\bar{\bar{X}}$]

There are “n” samples in each subgroup.

The central limit theorem states that, for large values of n, the
distribution of the sample mean will be approximately normal, even
though the individual data points may be non-normal.
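
The statement above can be checked with a small simulation. The following is a minimal sketch, assuming an exponential parent population (strongly non-normal, with mean 1 and standard deviation 1) and subgroups of n = 30. The function name `sample_means`, the choice of distribution, the number of trials and the seed are illustrative assumptions and are not part of the workshop material.

```python
import random
import statistics


def sample_means(draw, n, trials, seed=0):
    """Return `trials` subgroup averages, each built from n draws of `draw`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]


# Parent population: exponential with rate 1 (mean 1, standard deviation 1).
means = sample_means(lambda rng: rng.expovariate(1.0), n=30, trials=5000)

mean_of_means = statistics.fmean(means)  # should be close to the parent mean
sd_of_means = statistics.stdev(means)    # should be close to 1 / sqrt(30)

print(f"mean of averages = {mean_of_means:.3f}")
print(f"SD of averages   = {sd_of_means:.3f} (theory: {1 / 30 ** 0.5:.3f})")
```

A histogram of `means` would look roughly bell shaped even though the parent distribution is heavily skewed.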


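Central Limit Theorem

Any distribution of x:
Population mean and variance: $\mu_x$, $\sigma_x^2$

Normal distribution of $\bar{x}$:
Mean and variance of sample means: $\mu_{\bar{x}} = \mu_x$, $\sigma_{\bar{x}}^2 = \sigma_x^2 / n$

For n > 30, the distribution of the sample means tends to be normal.

$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu_x}{\sigma_x / \sqrt{n}}$

[Figure: Z distribution, centered at z = 0 with standard deviation 1]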

Central Limit Theorem – illustrated with a normally distributed parent population

Parent Population
Mean = $\mu$
SDEV = $\sigma$
[Figure: normal curve with the central 95% region marked]

Sampling Distribution of the Means
Mean = $\mu$
SDEV = $\sigma_{SD} = \sigma / \sqrt{n}$

Randomly select “n” samples from our parent population and take the mean.
Do this for all possible combinations of “n”.
The Central Limit Theorem tells us that the distribution of these means
will be normal and have the same mean as the parent, with the $\sigma_{SD}$
value shown above.


Confidence Intervals
• Means

• Standard Deviations

• Process Capability

Confidence Interval : Concept
(Interval Estimator)

It is the estimated range of values which is likely to include an
unknown population parameter, the estimated range being calculated
from a given set of sample data.

[Figure: intervals $\bar{x} \pm \Delta$ computed from repeated samples and plotted against $\mu$, where $\Delta$ is the half-width of the interval; one interval is marked “Does not contain $\mu$”]

How confident are you that $\bar{x} \pm \Delta$ will contain $\mu$?


Confidence Interval

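[Figure: normal curve with the central region between $-z_c$ and $+z_c$ shaded]

The shaded area under the curve is the confidence level, $1 - \alpha$.
The remaining area is $\alpha$.

Use $\alpha$ to arrive at $z_c$; then $\Delta = z_c \hat{\sigma} / \sqrt{n}$

$(\bar{x} - \Delta,\ \bar{x} + \Delta)$ is the confidence interval for $\mu$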
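
The recipe on this slide can be sketched in a few lines of Python. This is an illustration only: the function name `z_confidence_interval` and the numbers in the usage line are hypothetical, and the interval is assumed to be two-sided, so that $\alpha$ is split equally between the two tails.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma_hat, n, confidence=0.95):
    """Two-sided interval xbar +/- z_c * sigma_hat / sqrt(n)."""
    alpha = 1.0 - confidence
    # z_c leaves alpha/2 in each tail of the standard normal curve
    z_c = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z_c * sigma_hat / sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical numbers: sample mean 50, estimated sigma 4, n = 64
low, high = z_confidence_interval(xbar=50.0, sigma_hat=4.0, n=64)
print(f"95% CI for the mean: {low:.2f} to {high:.2f}")
```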


Confidence Intervals for Means

Population Distribution
[Figure: standard normal probability curve with the central 95% between $\mu - 1.96\sigma$ and $\mu + 1.96\sigma$]

In this example we know the true mean and true standard deviation
of the total population.

• 95% of the population lies between the $\mu \pm 1.96\sigma$ limits shown

• Randomly select a sample from this population

• 95% of the time the sample that you select will have a value, X, in the range
  $\mu - 1.96\sigma < X < \mu + 1.96\sigma$
  (or $X = \mu \pm 1.96\sigma$ 95% of the time)

• Now look at it another way

• Suppose you are told only the $\sigma$ value of the population and are asked
  to estimate the value of $\mu$ from a sample with value, X, that is randomly
  selected from the population.

• 95% of the time we are confident that the value of the unknown mean, $\mu$,
  lies “somewhere” in the interval:
  $X - 1.96\sigma < \mu < X + 1.96\sigma$
  or we estimate that $\mu = X \pm 1.96\sigma$ at the 95% level of confidence


Confidence Intervals for Means


Real Data

What about data collected in the real world?

• Typically we select a number, n, of samples and determine the mean.

• We also obtain an estimate, S, of the standard deviation from the n samples.

How do we use this limited data to

• Estimate the true population mean?

• Determine our level of confidence?

We turn to the Student t-distribution


Confidence Intervals for Means


Student’s t-distribution

The t-distribution probability density function takes into
consideration the uncertainties inherent in our estimates of the
mean and standard deviation for a finite sample. The shape of the
curve depends upon the degrees of freedom, $\nu$.
(William Sealy Gosset, “Student”)

$f(t, \nu) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$   for $-\infty < t < \infty$

[Figure: Student's t-distribution, probability density function versus t from -4 to 4, for df = 1, 2, 5, 10 and N(0,1)]


Confidence Intervals for Means


Central Limit Theorem

Parent Population
Mean = $\mu$
SDEV = $\sigma$
[Figure: normal curve with the central 95% region marked]

Sampling Distribution of the Means
Mean = $\mu$
SDEV = $\sigma_{SD} = \sigma / \sqrt{n}$

Randomly select “n” samples from our parent population and take the mean.
Do this for all possible combinations of “n”.
The Central Limit Theorem tells us that the distribution of these means
will be normal and have the same mean as the parent, with the $\sigma_{SD}$
value shown above.

If we randomly select a sample mean of value, $\bar{X}$, from our sampling
distribution then we can estimate the mean from:

$\mu = \bar{X} \pm 1.96\,\sigma / \sqrt{n}$ (at the 95% confidence level)


Confidence Interval and t distribution


Use z distribution if n > 30:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Use t distribution if n <= 30:
$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$, df = n - 1

Note the difference: the t statistic uses the sample estimate s in place of $\sigma$.

[Figure: z distribution (left) and Student's t-distribution for df = 1, 2, 5, 10 with N(0,1) (right); probability density function versus z and t from -4 to 4]

The t-distribution is:
• Symmetric about its mean t = 0
• More tail-heavy than the z distribution

Do not use z distribution for n < 30


Confidence Intervals for Means


Student’s t-distribution

Using a definition analogous to the “Z” statistic we define a “t”
statistic based upon our “real data”:

$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$     Compare to     $Z = \frac{X - \mu}{\sigma}$

[Figure: Student's t-distribution, probability density function versus t from -4 to 4, for df = 1, 2, 5, 10 and N(0,1)]


Confidence Intervals for Means


Student’s t-distribution
[Figure: t-distributions with the central 95% region marked. For n = 6 (df = 5) the limits are t = ±2.57; for n = 3 (df = 2) the limits are t = ±4.30]

$\mu = \bar{X} \pm t\,S / \sqrt{n}$     degrees of freedom = n - 1

As the number of samples decreases our confidence interval gets larger.


Confidence Intervals for Means


Student’s t-distribution

You have just received a report from the Analytical Laboratory
informing you that your sample was found to contain 97 ppb Sodium
with a STDEV of 2 ppb.

You want to estimate the true population mean at a 95% confidence
level. Being a well-educated Green Belt you go back to the analyst to
learn how many samples were used in the analysis. The analyst can’t
remember exactly but is sure that either 3 or 6 samples were used.

For 3 samples: degrees of freedom = 3 - 1 = 2 and t = 4.30

$\mu$ = 97 ± 4.30*2/1.73 = 97 ± 5 ppb

95% Confidence Interval 92 ---- 97 ---- 102

For 6 samples: degrees of freedom = 6 - 1 = 5 and t = 2.57

$\mu$ = 97 ± 2.57*2/2.45 = 97 ± 2 ppb

95% Confidence Interval 95 ---- 97 ---- 99
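
The sodium example can be reproduced with a short script. This is a sketch only: the helper `t_confidence_interval` and the lookup table `T_CRIT_95` are names introduced here for illustration, and the critical t values are simply the two quoted on the slide (4.30 for df = 2 and 2.57 for df = 5) rather than values computed from the t-distribution.

```python
from math import sqrt

# Two-sided 95% critical t values quoted on the slide, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.30, 5: 2.57}


def t_confidence_interval(mean, s, n):
    """Return mean +/- t * s / sqrt(n) using the tabulated t for df = n - 1."""
    t_c = T_CRIT_95[n - 1]
    half_width = t_c * s / sqrt(n)
    return mean - half_width, mean + half_width


ci_3 = t_confidence_interval(97.0, 2.0, 3)  # analyst used 3 samples
ci_6 = t_confidence_interval(97.0, 2.0, 6)  # analyst used 6 samples

print(f"n = 3: {ci_3[0]:.1f} to {ci_3[1]:.1f} ppb")
print(f"n = 6: {ci_6[0]:.1f} to {ci_6[1]:.1f} ppb")
```

The unrounded half-widths are about 4.97 ppb and 2.10 ppb, which the slide rounds to 5 and 2.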


Example on Confidence Interval for Mean


The time of the last boarding call of an airplane to New York, for
various days chosen at random in a year, is as follows:
10:20am, 10:22am, 10:31am, 10:47am, 10:25am
11:39am, 10:13am, 10:18am, 10:38am, 10:26am
Find at what time a tele-checked person should reach the airport
in order that he would not miss the plane with 95%
confidence.

Reference             10:00
Confidence Level      0.95
Degrees of Freedom    9
tc                    2.262159

Time of Arrival    Difference w.r.t. 10:00
10:20              0:20
10:22              0:22
10:31              0:31
10:47              0:47
10:25              0:25
11:39              1:39
10:13              0:13
10:18              0:18
10:38              0:38
10:26              0:26

Average Difference           0:33
Sample Standard Deviation    0:24
Half interval                0:17

                       Mean     Confidence Interval
Arrival Time           10:33    10:15    10:50
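
The figures in this table can be re-derived as follows. This is a minimal sketch that works in minutes after the 10:00 reference and takes the critical value tc = 2.262159 from the slide; the variable names are illustrative. The slide appears to show times truncated to whole minutes, so the last digit may differ slightly from the unrounded values printed here.

```python
from math import sqrt
from statistics import mean, stdev

# Differences w.r.t. 10:00, in minutes, as listed in the table
minutes_after_10 = [20, 22, 31, 47, 25, 99, 13, 18, 38, 26]
t_c = 2.262159  # two-sided 95% critical t for 9 degrees of freedom (from the slide)

avg = mean(minutes_after_10)
s = stdev(minutes_after_10)  # sample standard deviation (n - 1 in the denominator)
half_interval = t_c * s / sqrt(len(minutes_after_10))

print(f"average difference = {avg:.1f} min")
print(f"sample STDEV       = {s:.1f} min")
print(f"half interval      = {half_interval:.1f} min")
print(f"interval           = {avg - half_interval:.1f} to {avg + half_interval:.1f} min after 10:00")
```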


Confidence Intervals
• Means

• Standard Deviations

• Process Capability

Confidence Intervals for Standard Deviation
Helmert’s $\chi^2$-distribution

The probability density function of the variance ($\sigma^2$) for finite
samples also has a dependence upon the number of degrees of freedom, $\nu$.
(Friedrich Robert Helmert)

$f(x, \nu) = \frac{1}{2^{\nu/2}\,\Gamma\left(\frac{\nu}{2}\right)}\, x^{\frac{\nu}{2} - 1} e^{-\frac{x}{2}}$

[Figure: Chi Square Distribution, probability density function versus Chi-Square from 0 to 30, for df = 1, 2, 5 and 10]


Chi-squared distribution
If X is normally distributed, then $(n - 1)s^2 / \sigma^2$ is $\chi^2$ distributed:

$Z = \frac{X - \mu}{\sigma}$          $\chi^2 = \frac{(n - 1)s^2}{\sigma^2}$,  df = n - 1

e.g. $Z_{(1 - \alpha)} = 1.645$ for $\alpha$ = 5%
e.g. $\chi^2_{(1 - \alpha),\,df} = 9.488$ for n = 5, $\alpha$ = 5%

[Figure: z distribution (left), probability density function versus z from -4 to 4; Chi Square Distribution (right), probability density function versus Chi-Square from 0 to 30, for df = 1, 2, 5 and 10]

The $\chi^2$ distribution is used to find the confidence interval for variance.

Confidence Intervals for Standard Deviation
Helmert’s $\chi^2$-distribution

95% Confidence Interval

Because of the shape of the $\chi^2$-distribution the confidence intervals
determined for standard deviation estimates are not symmetric.

Example: Using 16 samples we compute s = 1.66

95% Confidence Interval for S

1.23 -------- 1.66 ------------------ 2.57

Fortunately, as we will see later, programs such as MiniTab will
calculate the confidence intervals for us.
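
For readers who want to see where 1.23 and 2.57 come from, here is a minimal sketch. It assumes the usual interval $\sqrt{(n-1)s^2/\chi^2_{upper}} < \sigma < \sqrt{(n-1)s^2/\chi^2_{lower}}$ and takes the two critical values for 15 degrees of freedom (27.488 and 6.262) from a standard chi-square table; neither the formula layout nor the table values appear on the slide itself, and the function name is illustrative.

```python
from math import sqrt


def sd_confidence_interval(s, n, chi2_upper, chi2_lower):
    """Confidence limits for sigma from a sample standard deviation s.

    chi2_upper / chi2_lower are the upper and lower critical values of the
    chi-square distribution with n - 1 degrees of freedom.
    """
    df = n - 1
    low = sqrt(df * s ** 2 / chi2_upper)
    high = sqrt(df * s ** 2 / chi2_lower)
    return low, high


# Table values for df = 15: 97.5th percentile 27.488, 2.5th percentile 6.262
low, high = sd_confidence_interval(s=1.66, n=16, chi2_upper=27.488, chi2_lower=6.262)
print(f"95% CI for sigma: {low:.2f} to {high:.2f}")
```

The lower limit sits closer to s than the upper limit does, which is the asymmetry the slide describes.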


Confidence Intervals
• Means

• Standard Deviations

• Process Capability


How much confidence do we have in our Process Capability Z-scores?

95% Confidence Interval (assumes normal distribution):

$Z \pm 1.96\left[\frac{1}{n} + \frac{Z^2}{2(n - 1)}\right]^{1/2}$

n = # samples used to estimate Mean and STDEV

[Figure: short-term and long-term process distributions with the defect probabilities P(d)LSL and P(d)USL beyond the specification limits]

Problem: I have calculated Z = 2.56 for a process capability study
in which I used 75 samples.

Question: What are the 95% confidence interval limits for my
estimated value of Z?
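
One way to work the problem is to evaluate the interval formula from this slide directly. The sketch below assumes the formula reads Z ± 1.96 [1/n + Z²/(2(n − 1))]^(1/2); the function name `z_score_interval` is introduced here for illustration.

```python
from math import sqrt


def z_score_interval(z, n, z_c=1.96):
    """Approximate confidence limits for a process capability Z-score."""
    half_width = z_c * sqrt(1.0 / n + z ** 2 / (2.0 * (n - 1)))
    return z - half_width, z + half_width


low, high = z_score_interval(z=2.56, n=75)
print(f"95% CI for Z: {low:.2f} to {high:.2f}")
```

With 75 samples the half-width is a little under 0.5, so the estimate Z = 2.56 is only known to within roughly half a sigma.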


Hypothesis Testing

• t-test for means

• F-test for standard deviations

• ANOVA test for means


Hypothesis Testing

Hypothesis Tests as an alternative method for determining difference

Confidence intervals give a range of plausible values for a
population value (parameter).

Hypothesis tests determine if an apparent difference is real or could
be due to chance. We can quantify our level of confidence that the
difference is real.

[Figure: funnel narrowing All Potential “X”s down to the Vital Few “X”s]

Hypothesis Testing

Defining the Hypotheses

Ho
The starting point for a hypothesis test is the “null”
hypothesis - Ho. Ho is the hypothesis of sameness, or
no difference.
Example: The population mean equals the test mean.

Ha
The second hypothesis is Ha - the “alternative”
hypothesis. It represents the hypothesis of
difference.
Example: The population mean does not equal the
test mean.
• You usually want to show that there is a difference (Ha).

• Start by assuming equality (Ho).
• If the data show they are not equal, then they must be different (Ha).


Evaluation of Decision Error


Four possible outcomes that determine whether a
decision is correct or in error:

Ho: Person is innocent.
Ha: Person is guilty.

                                 Truth
                         Ho (Innocent)         Ha (Guilty)
Verdict  Ho (Set Free)   Innocent, Set Free    Guilty, Set Free
         Ha (Jailed)     Innocent, Jailed      Guilty, Jailed

© 1994 Dr. Mikel J. Harry V3.0


Evaluation of Decision Error

                          Truth
                   Ho                          Ha
Accept  Ho         Correct Decision            Type II Error ($\beta$)
        Ha         Type I Error ($\alpha$)     Correct Decision

$1 - \beta$ = Chance of detecting a specified change in the population
(with the given sample) if the difference is actually there to detect.
Also called “power of the test.” In some respects, it is the likelihood
of detecting a beneficial change.

$1 - \alpha$ = Confidence that an observed outcome in the sample is “real”
(i.e., the outcome is not due to random sampling error and, therefore,
reflects the true state-of-affairs in the population).

Note: It is not possible to simultaneously commit a Type I and
Type II decision error. In short, either an alpha or beta decision
error can be made, but not both.


Hypothesis Testing

• t-test for means

• F-test for standard deviations

• ANOVA test for means


Hypothesis Testing of Means

Example:

We have made changes to a process to shift the mean.
We have taken 6 samples of both the original process and the new
process and have estimates for the means and STDEV’s.

$\hat{\mu}_{old} = 10.0$     $\hat{\sigma}_{old} = 0.85$
$\hat{\mu}_{new} = 11.1$     $\hat{\sigma}_{new} = 1.14$

The means and STDEV’s are observed to differ.

Questions:

Are the differences statistically significant?
We can use a hypothesis test to determine this.

Are the differences practically significant?
You and the team have to make this decision.


Hypothesis Test of Means – p-value


1 Sided t-test Example

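Our Null Hypothesis: $\hat{\mu}_{old} = \hat{\mu}_{new}$

The distribution of $\hat{\mu}_{new} - \hat{\mu}_{old}$ will have a
standard deviation given by:

$\hat{\sigma}_{test} = \sqrt{\hat{\sigma}_{old}^2 + \hat{\sigma}_{new}^2}$

(Here each $\hat{\sigma}$ is the standard deviation of the corresponding
sample mean, i.e. the sample STDEV divided by $\sqrt{n}$.)

If our null hypothesis is true, then this distribution should have a
mean value = 0.

[Figure: distribution of $\hat{\mu}_{new} - \hat{\mu}_{old}$ centered at 0 with standard deviation $\hat{\sigma}_{test}$]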


Hypothesis Test of Means – p-value


1 Sided t-test Example

[Figure: t-distribution for 10 degrees of freedom (based upon 2 samples of 6 each), t from -5 to 5, with the observed difference 1.1 (11.1 – 10.0) marked at t = 1.88]

Area beyond “t = 1.88” is 0.045

Our “p-value” = 0.045

0.045 < 0.05 – Therefore we reject the null hypothesis and accept the
alternate hypothesis at the 95% confidence level
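
The t value and the tail area can be approximated with the standard library alone. The sketch below rests on two assumptions that the slides do not spell out: the standard deviation of the difference is taken as sqrt(s_old²/n + s_new²/n) with n = 6, and the tail area is obtained by numerically integrating the Student t density with Simpson's rule. The function names are illustrative. With the rounded inputs from the slide this gives t of about 1.89 rather than exactly 1.88.

```python
import math


def t_pdf(t, nu):
    """Student's t probability density with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)


def upper_tail_p(t_value, nu, steps=2000):
    """One-sided p-value: area beyond t_value (t_value >= 0, steps even)."""
    h = t_value / steps
    total = t_pdf(0.0, nu) + t_pdf(t_value, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    area_0_to_t = total * h / 3  # Simpson's rule
    return 0.5 - area_0_to_t     # the density is symmetric about 0


n = 6
mean_old, mean_new = 10.0, 11.1
s_old, s_new = 0.85, 1.14

std_err = math.sqrt(s_old ** 2 / n + s_new ** 2 / n)  # assumed form of sigma_test
t_stat = (mean_new - mean_old) / std_err
p_value = upper_tail_p(t_stat, nu=2 * n - 2)

print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3f}")
```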


The p-value


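The p-value is the probability, assuming Ho is true, of observing a
difference at least as large as the one in our sample. Accepting Ha
only when the p-value is below $\alpha$ limits the probability of making
a Type I Error to $\alpha$.

Unless there is an exception based on engineering judgment, we will
set an acceptance level of a Type I error at $\alpha$ = 0.05.

Thus, any p-value less than 0.05 means we accept the alternative
hypothesis.

[Figure: Truth (Ho, Ha) versus Accept (Ho, Ha) decision table; the p-value is placed in the cell “accept Ha when Ho is true” (Type I error)]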


Two-Sided Use of the t Distribution


[Figure: distribution of sampling averages $\bar{x}$, df = 4. The central 95% between LCL and UCL is the confidence interval (rejection region for Ha), labelled “Same Process”. Each tail beyond LCL or UCL carries a risk of $\alpha/2$ = 2.5%, at t = ±2.776, and is labelled “Different Process”.]

$LCL = \bar{x} - t_{\alpha/2}\,\sigma / \sqrt{n}$          $UCL = \bar{x} + t_{\alpha/2}\,\sigma / \sqrt{n}$

$LCL = \bar{x} - 2.776\,\sigma / \sqrt{5}$          $UCL = \bar{x} + 2.776\,\sigma / \sqrt{5}$

There is 95% certainty that the true population mean will be contained within the
given confidence interval. If we observe a sampling average greater than UCL or less
than LCL, we may conclude that such an event could only occur 5 times out of 100 by
random chance (sampling variations).

© 1994 Dr. Mikel J. Harry V3.0


Hypothesis Testing

• t-test for means

• F-test for standard deviations

• ANOVA test for means


Hypothesis Test for Standard Deviations

The ratio between two variances follows a non-symmetric distribution
called the F-distribution that depends upon the degrees of freedom, $\nu$.
(Ronald A. Fisher)

$F = \frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2}$

The experimentally determined F-value is evaluated with respect to the
appropriate distribution and a p-value (area beyond the F-value) is
determined as is done in the t-test.

[Figure: Fisher's F Distribution, probability density function versus F from 0.00 to 1.00, for F(1,1), F(1,2), F(2,1) and F(10,1)]


F distribution
If X is normally distributed, then $s_1^2 / s_2^2$ is F distributed:

$Z = \frac{X - \mu}{\sigma}$

$F_{\nu_1, \nu_2} = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2} = \frac{s_1^2}{s_2^2}$,   $df_1 = \nu_1 = n_1 - 1$,   $df_2 = \nu_2 = n_2 - 1$

e.g. $Z_{(1 - \alpha)} = 1.645$ for $\alpha$ = 5%
e.g. $F_{(1 - \alpha),\,df_1,\,df_2} = 6.388$ for $n_1 = n_2 = 5$, $\alpha$ = 5%

[Figure: z distribution (left), probability density function versus z from -4 to 4; Fisher's F Distribution (right), probability density function versus F from 0.00 to 1.00, for F(1,1), F(1,2), F(2,1) and F(10,1)]

• F-distribution is not symmetric (skewed to the right)

The F distribution gives p-values for two-sample variance comparisons.
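
As an illustration of a two-sample variance comparison, the sketch below applies the F ratio to the STDEV estimates from the earlier process-change example (0.85 for the old process, 1.14 for the new one, 6 samples each). It is not taken from the slides: the one-sided framing (is the new variance larger?), the function name `f_ratio`, and the critical value 5.05 for F(5, 5) at the upper 5% point, which comes from a standard F table, are all assumptions made for this example.

```python
def f_ratio(s_numerator, s_denominator):
    """F statistic as the ratio of two sample variances."""
    return s_numerator ** 2 / s_denominator ** 2


# Upper 5% point of the F distribution with (5, 5) degrees of freedom,
# taken from a standard table (assumed, not given on the slide).
F_CRIT_5_5 = 5.05

f_value = f_ratio(1.14, 0.85)       # new process over old process
significant = f_value > F_CRIT_5_5  # one-sided test at alpha = 5%

print(f"F = {f_value:.2f}, critical value = {F_CRIT_5_5}, significant: {significant}")
```

With only 6 samples per group, the observed ratio of about 1.8 falls well short of the critical value, so the difference in STDEVs would not be judged statistically significant under these assumptions.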

Hypothesis Testing

• t-test for means

• F-test for standard deviations

• ANOVA test for means


ANOVA - Analysis of Variance

The ANOVA test for means uses two tools:
• ANOVA – The analysis of variance
• The F-test for differences between standard deviations

We will first describe the basics of ANOVA as this is a very powerful
tool that is used in many statistical analyses.


Introduction to Analysis of Variance


ANOVA

Basic Assumption: Our data is normal

Example:

We are making precision spacing blocks with a target dimension of
10 mm, and we have gathered data for the output of our factory.

The standard deviation (spread in the data) is higher than we would like.

We want to analyze our data for the purpose of identifying sources of
variation in our factory that might be responsible and which we can correct.

After some work we have discovered that the blocks are made on
3 different machines.

After more work, we are able to identify which parts were made on
each of the three machines.


Introduction to Analysis of Variance


ANOVA
[Figure: total variation of products coming from the factory, split into the variation associated with machine #1 (group #1), machine #2 (group #2) and machine #3 (group #3)]

Original Total Data Set from Factory

9.64 8.18 14.91 11.08 11.99 13.25 9.88 17.75 14.95

8.43 11.67 12.99 10.41 10.61 14.66 11.04 9.09 12.59

Data sorted by Machine (Group)

Machine #1 9.64 11.08 9.88 8.43 10.41 11.04

Machine #2 8.18 11.99 17.75 11.67 10.61 9.09

Machine #3 14.91 13.25 14.95 12.99 12.59 14.66


Introduction to Analysis of Variance


ANOVA

Breaking down total variance

SST = SSB + SSW


SST = Total Variation (Sum of Squares)

SSB = Between-Group Variation (Sum of Squares)

SSW = Within-Group Variation (Sum of Squares)


Introduction to Analysis of Variance


ANOVA

Total Variation (Sum of Squares)

$SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \bar{\bar{X}})^2$

$X_{ij}$ = an individual data point – the i-th observation in the
group (or level) j.

$\bar{\bar{X}} = \frac{\sum_{j=1}^{c} \sum_{i=1}^{n_j} X_{ij}}{n}$ is called the overall or grand mean

$n_j$ = the number of data points (observations) within a given group j.

n = the total number of data points (observations) in all of the
groups combined. ($n = n_1 + n_2 + n_3 + \dots + n_c$)

c = the number of groups


Introduction to Analysis of Variance


ANOVA

Between-Group Variation
(Sum of Squares)

$SSB = \sum_{j=1}^{c} n_j (\bar{X}_j - \bar{\bar{X}})^2$

$\bar{\bar{X}}$ = the overall or grand mean (see SST definition)

$\bar{X}_j$ = the sample mean of group j.

$n_j$ = the number of data points (observations) within a given group j.

c = the number of groups


Introduction to Analysis of Variance


ANOVA

Within-Group Variation
(Sum of Squares)

$SSW = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$

$X_{ij}$ = an individual data point – the i-th observation in the
group (or level) j.

$\bar{X}_j$ = the sample mean of group j.


Introduction to Analysis of Variance


ANOVA

Analyzing Variance in Our Factory Data

Subgroup:

Group   Values                                  Size $n_j$   Average $\bar{X}_j$   Sum of Squares $SSW_j = \sum_i (X_{ij} - \bar{X}_j)^2$
1       9.64 11.08 9.88 8.43 10.41 11.04        6            10.08                 4.99
2       8.18 11.99 17.75 11.67 10.61 9.09       6            11.55                 56.97
3       14.91 13.25 14.95 12.99 12.59 14.66     6            13.89                 5.66

Overall:

Size n = 18
Average $\bar{\bar{X}} = \frac{1}{n} \sum_{j=1}^{c} \sum_{i=1}^{n_j} X_{ij}$ = 11.84
Sum of Squares $SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \bar{\bar{X}})^2$ = 111.95

Standard Deviation:

Overall: $\hat{\sigma} = \sqrt{\frac{SST}{\sum_j n_j - 1}}$ = 2.57
Pooled: $\hat{\sigma}_{Pool} = \sqrt{\frac{SSW}{\sum_j (n_j - 1)}}$ = 2.12
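
The sums of squares for the factory data can be recomputed from the definitions of SST, SSB and SSW given on the preceding slides. The following is a minimal sketch using only the standard library; the dictionary layout and the helper `sum_sq` are illustrative choices. Small differences in the last digit relative to the slide values are to be expected from rounding.

```python
from statistics import fmean

# Factory data sorted by machine (group)
groups = {
    "Machine #1": [9.64, 11.08, 9.88, 8.43, 10.41, 11.04],
    "Machine #2": [8.18, 11.99, 17.75, 11.67, 10.61, 9.09],
    "Machine #3": [14.91, 13.25, 14.95, 12.99, 12.59, 14.66],
}


def sum_sq(values, center):
    """Sum of squared deviations of `values` about `center`."""
    return sum((x - center) ** 2 for x in values)


all_values = [x for g in groups.values() for x in g]
n = len(all_values)
grand_mean = fmean(all_values)

sst = sum_sq(all_values, grand_mean)                    # total variation
ssw = sum(sum_sq(g, fmean(g)) for g in groups.values())  # within-group variation
ssb = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())  # between-group

sigma_overall = (sst / (n - 1)) ** 0.5
sigma_pooled = (ssw / sum(len(g) - 1 for g in groups.values())) ** 0.5

print(f"SST = {sst:.2f}, SSB = {ssb:.2f}, SSW = {ssw:.2f}")
print(f"overall STDEV = {sigma_overall:.2f}, pooled STDEV = {sigma_pooled:.2f}")
```

The identity SST = SSB + SSW holds exactly (up to floating-point error), which makes it a convenient check on any hand calculation.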


Introduction to Analysis of Variance


ANOVA

Variation Source                   Sum of Squares    % Contribution
Machine #1 Variation               5.0               4.5
Machine #2 Variation               57.0              50.9
Machine #3 Variation               5.7               5.1
SSW - Within-Group Variation       67.6              60.4
SSB - Between-Group Variation      44.3              39.6
SST - Total Variation              112.0             100.0

Assuming for now that these differences are statistically significant,
where might be the opportunities to reduce variation in our overall
process (i.e. the factory)?


Introduction to Analysis of Variance


ANOVA

Home Work

Subgroup:

Group   Values    Size $n_j$   Average $\bar{X}_j$   Sum of Squares $SSW_j = \sum_i (X_{ij} - \bar{X}_j)^2$
1       1 2 3
2       2 3 4
3       3 4 5
4       4 5 6
5       5 6 7

Overall:

Size n =
Average $\bar{\bar{X}} = \frac{1}{n} \sum_{j=1}^{c} \sum_{i=1}^{n_j} X_{ij}$ =
Sum of Squares $SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \bar{\bar{X}})^2$ =

Standard Deviation:

Overall: $\hat{\sigma} = \sqrt{\frac{SST}{\sum_j n_j - 1}}$ =
Pooled: $\hat{\sigma}_{Pool} = \sqrt{\frac{SSW}{\sum_j (n_j - 1)}}$ =

Variation Source                   Sum of Squares    % Contribution
Group 1
Group 2
Group 3
Group 4
Group 5
SSW - Within-Group Variation
SSB - Between Group Variation
SST - Total Variation


Hypothesis Testing

• t-test for means

• F-test for standard deviations

• ANOVA test for means


What is Analysis of Variance?

• A technique used to determine the statistical significance of the
  relationship between a dependent variable (“Y”) and single or multiple
  independent variable(s) (“X”s) that have been organized into two or
  more discrete groups or levels.

• A procedure that determines whether or not the means of the responses
  at each level are drawn from the same population. (Are they different?)

• A way to screen for potential Vital Few “X”s

ANOVA is used for continuous “Y” data with discrete / continuous “X” levels

The Concept of ANOVA
A tool to compare several means
(for continuous response data!)

[Figure: distributions for level 1 (Current) and level 2 (New Process). The gap between their averages, the delta ($\delta$), is the between-group variation (signal); the spread within each level is the within-group variation (noise); together they make up the total variation.]

ANOVA determines if the variation between the averages of the levels
is greater than could reasonably be expected from the variation that
occurs within the levels ... that’s how it got its name.

Is the signal between greater than the noise within?


Variation Split

[Figure: total variation split into between-group variation (signal), the delta ($\delta$), and within-group variation (noise)]

ANOVA calculates the ratio of:

Average $SS_{between}$ / Average $SS_{within}$ = signal / noise

SS = Sum of Squares (a measure of variation)


Why not simply use the t-test?

In the Analyze Phase, you learned how to use the “t-test” statistic to
compare two sample averages for difference.
(Remember the “two sample” t-test?)

Example: Insurance Costs Project

How would you compare the average regional insurance costs? Are the
costs different between the five regions?

Regional Operation Insurance Costs ($K)

Region      A        B        C        D        E
            763      1,335    596      3,742    1,632
            4,365    1,262    1,448    1,833    5,078
            2,144    217      1,183    375      3,010
            1,998    4,100    3,200    2,010    671
            5,412    2,948    630      743      2,145
            957      3,210    942      867      4,063
            1,286    867      1,285    1,233    1,232
            311      3,744    128      1,072    1,456
            863      1,635    844      3,105    2,735
            1,499    643      1,683    1,767    767
Average:    1,960    1,996    1,194    1,675    2,279


Problems with Multiple Comparisons


Using t-Tests

Problems with all possible “two sample” t-tests:

• We would have to make 10 separate comparisons, to test each pair
  of averages. (AB, AC, AD, AE, BC, BD, BE, CD, CE, DE)
• Even if all average costs were equal, each comparison carries a 5%
  chance that we would reject Ho and conclude that the pair of averages
  is not equal. When this test procedure is repeated ten times, the risk
  of incorrectly concluding that at least one pair of averages is
  different would be very high (much greater than 5%).

ANOVA gives a single hypothesis test to compare all five averages at
one time.

Analysis of Variance (ANOVA) allows us to make all ten comparisons at
one time, and controls the overall $\alpha$ risk…
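
The size of the problem can be estimated with a quick calculation. The sketch below counts the pairwise comparisons for five regions and computes the chance of at least one false rejection, under the simplifying assumption that the ten tests are independent (in practice they share data, so the figure is only a rough guide). The function name `familywise_error` is introduced here for illustration.

```python
from math import comb


def familywise_error(groups, alpha=0.05):
    """Number of pairwise tests and the chance of at least one Type I error,
    assuming the individual tests are independent."""
    pairs = comb(groups, 2)
    risk = 1.0 - (1.0 - alpha) ** pairs
    return pairs, risk


pairs, risk = familywise_error(5)
print(f"{pairs} comparisons, overall alpha risk of about {risk:.0%}")
```

Under this assumption the overall risk is roughly 40%, which is what "much greater than 5%" amounts to for ten comparisons.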


One-Way ANOVA

One way ANOVA is used to compare the means from three or more sample
sets to determine if there is evidence that at least one of the means
is different.

Example Data
                                                        Group mean
Machine #1   9.64 11.08 9.88 8.43 10.41 11.04           10.08
Machine #2   8.18 11.99 10.56 10.23 10.41 10.21         10.26
Machine #3   14.91 13.25 14.95 12.99 12.59 14.66        13.89
Grand mean                                              11.41

In an ANOVA analysis one:

• Assumes that the samples come from the same “normal” parent population

• Assumes that the “within sample” variation is the same for all groups.
  The standard deviation estimates are “pooled”

• Computes the F-value and determines the p-value.

• If p < 0.05 we accept the alternate hypothesis (there is a difference)


One-Way ANOVA

Assume Same Parent Population
Mean = $\mu$
SDEV = $\sigma$

                    Machine #1        Machine #2        Machine #3
Data                9.64, 11.08, …    8.18, 11.99, …    14.91, 13.25, …
$S^2$ estimates     1.00              1.49              1.13

Pooled (within-group) estimate of $S_P^2$ = 1.21, df = 15

Sampling Distribution of the Means
Mean = $\mu$
SDEV = $\sigma_{SD} = \sigma / \sqrt{n}$

Means: 10.08, 10.26, 13.89

Between-group estimate of $S_P^2$ = 55.45/2 = 27.73, df = 2

$F = 27.73 / 1.21 \approx 22.9$     (critical value $F_{\alpha,\nu_1,\nu_2} = 3.68$)

p < 0.05 → Reject Null Hypothesis
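
The two variance estimates on this slide can be recomputed directly from the example data. The sketch below is one way to do it, assuming equal group sizes so that the pooled within-group estimate is the plain average of the three group variances and the between-group estimate is n times the variance of the group means. The variable names are illustrative, and the critical value 3.68 for F(2, 15) is assumed to be the one intended on the slide.

```python
from statistics import fmean, variance

machines = [
    [9.64, 11.08, 9.88, 8.43, 10.41, 11.04],     # Machine #1
    [8.18, 11.99, 10.56, 10.23, 10.41, 10.21],   # Machine #2
    [14.91, 13.25, 14.95, 12.99, 12.59, 14.66],  # Machine #3
]
n = len(machines[0])  # observations per machine (equal group sizes assumed)

# Within-group estimate: average of the sample variances (valid for equal n)
within = fmean(variance(m) for m in machines)

# Between-group estimate: n times the sample variance of the group means
between = n * variance([fmean(m) for m in machines])

f_value = between / within
df_between = len(machines) - 1
df_within = len(machines) * (n - 1)

F_CRIT = 3.68  # upper 5% point of F(2, 15), assumed from the slide
print(f"within = {within:.2f} (df {df_within}), between = {between:.2f} (df {df_between})")
print(f"F = {f_value:.1f}, reject Ho: {f_value > F_CRIT}")
```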


Example of ANOVA (Homework)


Subgroup:

Group i   Values $x_{ij}$   Size $n_i$   Average $\bar{x}_i = \frac{1}{n_i} \sum_j x_{ij}$   Sum of Squares Within $SSW_i = \sum_j (x_{ij} - \bar{x}_i)^2$
1         1 2 3
2         2 3 4
3         3 4 5
4         4 5 6
5         5 6 7

No. of Groups m =                    SSW =

Overall:

Size $\sum_i n_i$ =
Average $\bar{x} = \frac{1}{\sum_i n_i} \sum_i \sum_j x_{ij}$ =
Sum of Squares Between $SSB = \sum_i n_i (\bar{x}_i - \bar{x})^2$ =
Sum of Squares Total $SST = \sum_i \sum_j (x_{ij} - \bar{x})^2$ =

Standard Deviation:

Overall: $\hat{\sigma}_{LT} = \sqrt{\frac{SST}{\sum_i n_i - 1}}$ =
Pooled: $\hat{\sigma}_{ST} = \sqrt{\frac{\sum_i SSW_i}{\sum_i (n_i - 1)}}$ =

Sum of Squares Variation    DoF        Mean Sum of Squares Variation    F
SSB =                       m-1 =      MSB =
SSW =                       mn-m =     MSW =
SST =                       mn-1 =

At 95% confidence level, F-Critical =        How is it related with F?

What is the conclusion?

• What is the relation between SST, SSW and SSB?


More on Hypothesis

Relation      Symbol    Type of Hypothesis
Equality      =         Simple
Inequality    >         Directional / One sided, Composite
              <         Directional / One sided, Composite
              $\neq$    Non-Directional / Two sided, Composite

One Sample Hypothesis for Mean          One Sample Hypothesis for Variance
$H_0: \mu = \mu_0$                      $H_0: \sigma^2 = \sigma_0^2$
$H_a: \mu \neq \mu_0$                   $H_a: \sigma^2 \neq \sigma_0^2$

Two Sample Hypothesis for Mean          Two Sample Hypothesis for Variance
$H_0: \mu_1 = \mu_2$                    $H_0: \sigma_1^2 = \sigma_2^2$
$H_a: \mu_1 \neq \mu_2$                 $H_a: \sigma_1^2 \neq \sigma_2^2$

Multi-Sample Hypothesis for Mean        Multi-Sample Hypothesis for Variance
$H_0: \mu_1 = \mu_2 = \dots = \mu_n$    $H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_n^2$
$H_a$: At least one not equal           $H_a$: At least one not equal


Hypotheses of Means & tests

Hypothesis                                  Test

One Sample Hypothesis for Mean
$H_0: \mu = \mu_0$                          z-test if sample-size > 30
$H_a: \mu \neq \mu_0$                       1 sample t-test if sample-size < 30

Two Sample Hypothesis for Mean
$H_0: \mu_1 = \mu_2$                        2 sample t-test
$H_a: \mu_1 \neq \mu_2$

Multi-Sample Hypothesis for Mean
$H_0: \mu_1 = \mu_2 = \dots = \mu_n$        ANOVA
$H_a$: At least one not equal

What are these tests?


Hypotheses of Variances & tests

Hypothesis                                               Test

One Sample Hypothesis for Variance
$H_0: \sigma^2 = \sigma_0^2$                             Chi squared
$H_a: \sigma^2 \neq \sigma_0^2$

Two Sample Hypothesis for Variance
$H_0: \sigma_1^2 = \sigma_2^2$                           F test
$H_a: \sigma_1^2 \neq \sigma_2^2$

Multi-Sample Hypothesis for Variance
$H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_n^2$      Homogeneity of Variance
$H_a$: At least one not equal

What are these tests?
