Chapter 12: One-Way ANOVA
ANOVA
(ANalysis Of VAriance)
ANOVA Terminology
One-Way ANOVA
ANOVA Terminology
Two-way ANOVA (ANalysis Of Variance) is a statistical technique used to analyze the effects of two independent factor variables on a continuous dependent variable. The primary objective of two-way ANOVA is to determine whether there is a significant difference in means between the groups defined by the two factors, and to explore the interaction between the two factors.
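A minimal sketch of how such a model could be fit in R with the same aov() function used later in this chapter; the data frame `mydata` and the variable names `response`, `factorA`, and `factorB` are hypothetical placeholders, not from this course's data:

fit2 <- aov(response ~ factorA * factorB, data = mydata)  # two main effects plus their interaction
summary(fit2)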
2) Does the hardwood concentration in pulp (%) influence the tensile strength of bags made from the pulp?
Factor: hardwood concentration. Levels: need to be determined.
3) Does the resulting color density of a fabric depend on the amount of dye used?
Factor: amount of dye. Levels: need to be determined.
Hypotheses
Null hypothesis $H_0: \mu_1 = \mu_2 = \dots = \mu_k$
Alternative hypothesis $H_a$: at least two are different, i.e., $\mu_i \neq \mu_j$ for some $i \neq j$.
A reporter from the student newspaper asks 50 random customers of each of the five coffeehouses to
respond to a questionnaire. One variable of interest is the customer’s age.
Note: There was some non-response.
Variables:
Quantitative response: age of the customer
Factor: coffeehouse (5 levels)
Sample mean for the $i^{\text{th}}$ group: $\bar{x}_{i.} = \dfrac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$ for $i = 1, \dots, k$.
Sample mean for the entire sample: $\bar{x}_{..} = \dfrac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$.
The single dot subscript signifies that we averaged over the second index, i.e., the observations for the $i^{\text{th}}$ group. The double dot subscript signifies that we averaged over both sets of indices.
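As an illustration, the group means and the grand mean can be computed in R; this sketch assumes a hypothetical data frame `coffee` with columns `age` (response) and `coffeehouse` (factor):

tapply(coffee$age, coffee$coffeehouse, mean)  # xbar_i. for each group
mean(coffee$age)                              # xbar_.. for the entire sample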
2. The responses in each group are independent of those in the other groups.
3. Each of the populations is normally distributed, or the CLT holds for the statistics that we measure; therefore, the statistics are approximately Normally distributed for each of the groups, for all $i = 1, \dots, k$.
4. Traditional one-way ANOVA (what we cover) assumes homogeneity of variance (pooled estimator): $\sigma_i^2 = \sigma^2$ for all $i = 1, \dots, k$.
Welch's F-test for One-Way ANOVA: similar to the two independent sample procedure, this is a modified procedure that does not assume equal variances. The test statistic has a complex form with an approximate degrees of freedom and is approximately F-distributed.
Nonparametric Procedures
The Kruskal-Wallis test is a nonparametric test that uses the ranks of the observations across the groups rather than the actual values; the test statistic is based on the sums of the ranks.
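Both procedures are available in base R; a sketch assuming the hypothetical `coffee` data frame from the example:

oneway.test(age ~ coffeehouse, data = coffee, var.equal = FALSE)  # Welch's F-test (unequal variances)
kruskal.test(age ~ coffeehouse, data = coffee)                    # Kruskal-Wallis rank-based test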
F-distribution: $F(\mathrm{df}_A = k - 1,\ \mathrm{df}_E = n - k)$, where $k$ is the number of populations under investigation (factor levels) and $n$ is the total sample size.
Under $H_0: \mu_i = \mu_j$ for all $i, j$, the test statistic follows the $F(\mathrm{df}_A = k - 1,\ \mathrm{df}_E = n - k)$ distribution.
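The F critical value for a given significance level can be looked up with R's built-in qf() function; the numbers below are hypothetical (k = 5 groups, n = 200 observations, alpha = 0.05):

qf(0.95, df1 = 5 - 1, df2 = 200 - 5)  # critical value of F(df_A = 4, df_E = 195)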
ANOVA Model
Data = Group Mean + Error: $x_{ij} = \mu_i + \epsilon_{ij}$ for $i = 1, \dots, k$ and $j = 1, \dots, n_i$, where $x_{ij}$ is the $j^{\text{th}}$ observation in the $i^{\text{th}}$ group.
$$\mathrm{SSE} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 = \sum_{i=1}^{k} (n_i - 1)\, s_i^2, \qquad \mathrm{df}_E = n - k$$
Mean Squared Error:
$$\mathrm{MSE} = \frac{1}{n - k}\sum_{i=1}^{k} (n_i - 1)\, s_i^2$$
The MSE is an estimate of the variation that is unexplained by the differences due to the population means. The mean squared error is always an unbiased estimate of the variance of the error term in the model when the equal variance assumption holds.
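A minimal sketch of these formulas in R, using made-up group summaries (the sample sizes and standard deviations below are hypothetical):

n_i <- c(8, 10, 9)                      # group sample sizes
s_i <- c(2.1, 1.8, 2.4)                 # group sample standard deviations
SSE <- sum((n_i - 1) * s_i^2)           # SSE = sum of (n_i - 1) s_i^2
MSE <- SSE / (sum(n_i) - length(n_i))   # divide by df_E = n - k
c(SSE = SSE, MSE = MSE)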
$$\mathrm{SST} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2, \qquad \mathrm{df}_T = n - 1$$
$$\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2}_{\mathrm{SST}} = \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2}_{\mathrm{SSE}} + \underbrace{\sum_{i=1}^{k} n_i\,(\bar{x}_{i.} - \bar{x}_{..})^2}_{\mathrm{SSA}}$$
$$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \big( (x_{ij} - \bar{x}_{i.}) + (\bar{x}_{i.} - \bar{x}_{..}) \big)^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 + \sum_{i=1}^{k} n_i\,(\bar{x}_{i.} - \bar{x}_{..})^2 + 0,$$
since the cross-product term $2\sum_{i=1}^{k} (\bar{x}_{i.} - \bar{x}_{..}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.}) = 0$ because the deviations within each group sum to zero. Therefore
$$\mathrm{SST} = \mathrm{SSE} + \mathrm{SSA}.$$
One-Way ANOVA (ANalysis Of Variance) assesses whether the observed variation in the
sample means can be attributed to true differences in the population means, or if it is
simply due to chance.
One-Way ANOVA (ANalysis Of Variance) partitions the total variability in the data into different
sources, including the variability within each group and the variability between groups. It then
compares the magnitude of the between-group variability with the within-group variability to
determine if the differences in the means of the groups are statistically significant.
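The partition of the total variability can be verified numerically; this sketch uses simulated data (not the coffeehouse data):

set.seed(42)
dat <- data.frame(
  y = rnorm(60, mean = rep(c(10, 12, 15), each = 20)),
  g = factor(rep(c("A", "B", "C"), each = 20))
)
grand <- mean(dat$y)                                     # xbar_..
group_means <- tapply(dat$y, dat$g, mean)                # xbar_i.
group_n <- tapply(dat$y, dat$g, length)                  # n_i
SST <- sum((dat$y - grand)^2)
SSE <- sum((dat$y - group_means[as.character(dat$g)])^2)
SSA <- sum(group_n * (group_means - grand)^2)
all.equal(SST, SSE + SSA)                                # TRUE: SST = SSE + SSA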
$$\mathrm{MSE} = \frac{1}{n - k}\sum_{i=1}^{k} (n_i - 1)\, s_i^2$$
Step 2: State the Hypotheses.
The null hypothesis is always the same. Pick your favorite way of stating the alternative.
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$
$H_a$: at least one is different, i.e., $\mu_i \neq \mu_j$ for some $i \neq j$.
$$\mathrm{MSE} = \frac{1}{n - k}\sum_{i=1}^{k} (n_i - 1)\, s_i^2$$
If you have the data, we will use a built-in R function for this entire process:
fit <- aov(quantitativeVariable ~ categoricalVariable, data = dataframe)
summary(fit)
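A hypothetical usage sketch for the coffeehouse example, assuming a data frame `coffee` with columns `age` and `coffeehouse` (the actual variable names in the course data may differ):

coffee$coffeehouse <- factor(coffee$coffeehouse)  # make sure the grouping variable is a factor
fit <- aov(age ~ coffeehouse, data = coffee)
summary(fit)                                      # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)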
ANOVA Table
Estimate of $\sigma$: $s = \sqrt{\mathrm{MSE}}$
Coffeehouse Example
Example (Coffee House):
How do five coffeehouses around campus differ in the demographics of their customers? Are certain
coffeehouses more popular among graduate students? Do professors tend to favor one coffeehouse?
A reporter from the student newspaper asks 50 random customers of each of the five coffeehouses to
respond to a questionnaire. One variable of interest is the customer’s age.
Coffeehouse Example
Rule of thumb: is $\dfrac{\max_{1 \le i \le k} s_i}{\min_{1 \le i \le k} s_i} \le 2$?
$$\frac{\sqrt{12.97}}{\sqrt{6.99}} = 1.36217 \le 2,$$
so the equal variance assumption is reasonable.
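This check is easy to script; a minimal sketch, again assuming the hypothetical `coffee` data frame:

group_sd <- tapply(coffee$age, coffee$coffeehouse, sd)
max(group_sd) / min(group_sd)   # rule of thumb: this ratio should be at most 2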
Coffeehouse Example
Checking Normality
Some indication of skewness, but the ANOVA procedure is robust to some skewness.
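Normal quantile plots by group can be produced in base R; a sketch assuming the hypothetical `coffee` data frame with `coffeehouse` stored as a factor:

par(mfrow = c(2, 3))                     # one panel per coffeehouse (5 groups)
for (shop in levels(coffee$coffeehouse)) {
  x <- coffee$age[coffee$coffeehouse == shop]
  qqnorm(x, main = shop)                 # normal quantile plot for this group
  qqline(x)
}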
Coffeehouse Example
Example (Coffee House):
How do five coffeehouses around campus differ in the demographics of their customers? Are certain coffeehouses more popular among graduate students? Do professors tend to favor one coffeehouse?
A reporter from the student newspaper asks 50 random customers of each of the five coffeehouses to respond to a questionnaire. One variable of interest is the customer's age. Conduct a hypothesis test to see whether there is any difference in the mean age at the coffeehouses around campus.
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$ versus $H_a: \mu_i \neq \mu_j$ for some $i \neq j$.
Coffeehouse Example
Step 3
Coffeehouse Example
ANOVA Table Output from R
Df Sum Sq Mean Sq F value Pr(>F)
Coffeehouse 4 8834 2208.4 22.14 4.4e-15 ***
Residuals 195 19451 99.8
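The reported p-value can be reproduced from the F statistic and its degrees of freedom shown in the table:

pf(22.14, df1 = 4, df2 = 195, lower.tail = FALSE)  # upper-tail area of F(4, 195), matching Pr(>F) = 4.4e-15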
Conclusion
For $k = 2$ groups, the one-way ANOVA F statistic reduces to the square of the pooled two-sample t statistic:
$$F_{\mathrm{TS}} = \frac{\mathrm{MSA}}{\mathrm{MSE}} = \frac{n_1(\bar{x}_{1.} - \bar{x}_{..})^2 + n_2(\bar{x}_{2.} - \bar{x}_{..})^2}{\frac{1}{n-2}\left[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right]} = \frac{n_1\left(\bar{x}_{1.} - \frac{n_1\bar{x}_{1.} + n_2\bar{x}_{2.}}{n_1 + n_2}\right)^2 + n_2\left(\bar{x}_{2.} - \frac{n_1\bar{x}_{1.} + n_2\bar{x}_{2.}}{n_1 + n_2}\right)^2}{s_p^2}$$
$$= \frac{\frac{n_1 n_2^2}{(n_1 + n_2)^2}(\bar{x}_{1.} - \bar{x}_{2.})^2 + \frac{n_2 n_1^2}{(n_1 + n_2)^2}(\bar{x}_{1.} - \bar{x}_{2.})^2}{s_p^2} = \frac{\frac{n_1 n_2 (n_1 + n_2)}{(n_1 + n_2)^2}(\bar{x}_{1.} - \bar{x}_{2.})^2}{s_p^2} = \frac{(\bar{x}_{1.} - \bar{x}_{2.})^2}{s_p^2\left(\frac{n_1 + n_2}{n_1 n_2}\right)} = \frac{(\bar{x}_{1.} - \bar{x}_{2.})^2}{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = t_{\mathrm{TS}}^2$$
for the two-sample t statistic with the equal variance assumption.
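The identity $F_{\mathrm{TS}} = t_{\mathrm{TS}}^2$ can be checked numerically in R on simulated two-group data (an illustration, not part of the coffeehouse example):

set.seed(1)
x1 <- rnorm(12, mean = 5)
x2 <- rnorm(15, mean = 6)
dat2 <- data.frame(y = c(x1, x2), g = factor(rep(c("A", "B"), times = c(12, 15))))
F_stat <- summary(aov(y ~ g, data = dat2))[[1]][1, "F value"]   # one-way ANOVA F statistic
t_stat <- t.test(x1, x2, var.equal = TRUE)$statistic            # pooled two-sample t statistic
c(F = F_stat, t_squared = unname(t_stat)^2)                     # the two values agree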
If the null hypothesis of equal population means for the levels of a factor is rejected, then we have evidence that at least one population mean differs from the others, i.e., the alternative hypothesis $H_a: \mu_i \neq \mu_j$ for some $i \neq j$: statistically significant results.
Simultaneous Comparisons
We rejected the null hypothesis, so at least one mean is different ($\mu_i \neq \mu_j$ for some $i \neq j$), but which?
For every pair $i \neq j$ we can run a two-sample independent procedure with the equal variance assumption, testing $H_0: \mu_i = \mu_j$ and using $s^2 = \mathrm{MSE}$ as the pooled variance estimate:
$$(\bar{x}_{i.} - \bar{x}_{j.}) \pm t_{\alpha/2,\, df}\sqrt{\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
(pooled variance confidence interval).
Simultaneous Comparisons
For every pair $i \neq j$ we could run a two-sample independent procedure testing $H_0: \mu_i = \mu_j$.
We do not know which pair differs, so we need to compare all possible pairs simultaneously and want to control the Type I error for all of them.
The probability of a Type I error in this case is the probability that we conclude that for at least one pairwise comparison there is a significant difference between the means when in fact there is not.
$$(\bar{x}_{i.} - \bar{x}_{j.}) \pm t^{*}\sqrt{\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}, \qquad t^{*} > t_{\alpha/2,\, df}$$
The interval needs to be wider to control the overall Type I error for all paired comparisons.
$$\alpha_{\mathrm{Overall}} = 1 - (1 - \alpha_{\mathrm{single}})^{c}$$
Bonferroni Correction
If you have $k$ distinct populations, then there are $c = \binom{k}{2} = \frac{k(k-1)}{2}$ different pairwise comparisons.
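For example, with $k = 5$ groups there are $c = \binom{5}{2} = 10$ pairwise comparisons, and using $\alpha_{\mathrm{single}} = 0.05$ for each gives $\alpha_{\mathrm{Overall}} = 1 - (0.95)^{10} \approx 0.40$, far above the nominal 0.05.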
Bonferroni Correction
If you have $k$ distinct populations, then there are $c = \binom{k}{2}$ different pairwise comparisons. Use
$$\alpha_{\mathrm{single}} = \frac{\alpha_{\mathrm{Overall}}}{c}$$
$$(\bar{x}_{i.} - \bar{x}_{j.}) \pm t_{\frac{\alpha_{\mathrm{single}}}{2},\, n-k}\sqrt{\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
Note that the degrees of freedom come from the estimate of the variance (MSE), i.e., $n - k$.
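Base R's pairwise.t.test() applies this correction automatically (to the p-values rather than to the intervals); a sketch assuming the hypothetical `coffee` data frame:

pairwise.t.test(coffee$age, coffee$coffeehouse,
                p.adjust.method = "bonferroni",   # Bonferroni-adjusted p-values
                pool.sd = TRUE)                   # pooled SD, matching the MSE-based procedure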
$$Q = \frac{\bar{X}_{(\mathrm{Max})} - \bar{X}_{(\mathrm{Min})}}{\sqrt{\frac{1}{2}\,\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}} = \frac{\text{range of the sample means}}{\text{standard error}}$$
The distribution of $Q$ (the studentized range distribution) is positively skewed.
$$Q_{\mathrm{TS}} = \frac{\bar{x}_{i.} - \bar{x}_{j.}}{\sqrt{\frac{1}{2}\,\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}} \quad \text{for all } i, j \text{ such that } i \neq j.$$
$$(\bar{x}_{i.} - \bar{x}_{j.}) \pm \frac{Q_{\alpha,\, k,\, n-k}}{\sqrt{2}}\sqrt{\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
Preferred method.
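Both pieces are built into R: TukeyHSD() computes all pairwise intervals from an aov fit, and qtukey() gives the studentized range critical value. A sketch assuming the aov object `fit` from earlier, with a hypothetical alpha = 0.05, k = 5 groups, and n - k = 195:

TukeyHSD(fit)                        # simultaneous CIs and adjusted p-values for all pairs
qtukey(0.95, nmeans = 5, df = 195)   # critical value Q_{0.05, 5, 195}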
Dunnett’s Method is a multiple comparison test used when comparing multiple treatments to a
control group. It controls the family-wise error rate for the pairwise comparisons between
each treatment and the control group, rather than all possible pairwise comparisons among the
treatments and the control.
This method is not implemented in the standard R packages. You are asked to discover an R package
that implements it and learn to use it on your own in a bonus question for Computer Assignment 8.
2. Order your sample means $\bar{x}_{i.}$ (for $i = 1, \dots, k$) in increasing order and put them under the group symbols.
3. Draw lines under neighboring pairs of means that were identified as not being statistically significantly different.
4. If all pairs are significantly different from each other, then no lines will be drawn.
5. Combine lines if the pairs indicate that three or more of the means are not significantly different from each other.
Pair    Significant?
B – A   No
C – A   Yes
D – A   Yes
C – B   Yes
D – B   Yes
D – C   No
What can we say about the population means $\mu_A, \mu_B, \mu_C, \mu_D$?
The C and D means are larger than the A and B means, but we do not know whether $\mu_C$ is larger than $\mu_D$ or vice versa.
Pair    Significant?
B – A   Yes
C – A   Yes
D – A   Yes
C – B   No
D – B   No
D – C   No
What can we say about the population means $\mu_A, \mu_B, \mu_C, \mu_D$?
The B, C, and D means are larger than the A mean, but we do not know which of B, C, and D is largest.
If A was a control and the others were treatments of different dosages, then this would say that any dose is significantly different from the control.
Pair    Significant?
B – A   Yes
C – A   Yes
D – A   Yes
C – B   No
D – B   Yes
D – C   No
What can we say about the population means $\mu_A, \mu_B, \mu_C, \mu_D$?
• The B, C, and D means are larger than the A mean.
• D is larger than B, and the difference is statistically significant.
• B and C are not statistically significantly different.
• D is significantly larger than B but not significantly different from C, so it may be reasonable to assume that D is the largest.
Coffeehouse Example
Example (Coffee House):
How do five coffeehouses around campus differ in the demographics of their customers? Are certain
coffeehouses more popular among graduate students? Do professors tend to favor one coffeehouse?
A reporter from the student newspaper asks 50 random customers of each of the five coffeehouses to
respond to a questionnaire. One variable of interest is the customer’s age.
Coffeehouse Example
But which coffeehouses differ and how?
Use pairwise comparisons!
Selected method: Tukey's honestly significant difference (Tukey HSD).
We will use a family-wise confidence level for the simultaneous intervals.
Coffeehouse Example
But which coffeehouses differ and how? Use pairwise comparisons: Tukey's honestly significant difference (Tukey HSD).
Create a graphical display.
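A sketch of how the Tukey HSD results and the graphical display could be produced in R, assuming the aov object `fit` from earlier:

tk <- TukeyHSD(fit)
tk         # pairwise differences, simultaneous confidence intervals, adjusted p-values
plot(tk)   # one horizontal interval per pairwise comparison; intervals not covering 0 indicate significant differences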