OneFactorANOVA Introduction
Introduction
Recall from first year that you tested the hypothesis H0 : µ1 = µ2. ANOVA is the extension of these tests to the case of k groups (populations). We are interested in the following hypothesis test:
H0 : µ1 = µ2 = ... = µk.
Let $x_{ij}$ denote observation $i$ in group $j$ (for $j = 1, \ldots, k$), with group mean $\bar{x}_j$ and group size $n_j$; let $n = n_1 + \cdots + n_k$ be the total sample size and $\bar{\bar{x}}$ the grand mean of all $n$ observations.
We will now consider the intuitive view of ANOVA. From this we will also see the steps followed to perform one-factor ANOVA.
dim(datWide)
## [1] 20 3
str(datWide)
Notice this dataset is in “wide” format. Sometimes, data must be stored in “long” (“narrow”) format. We
can use the reshape2 package to do this.
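The reshaping call itself is not shown in the extract. With reshape2 it would be melt(datWide, variable.name = 'Group', value.name = 'Response'); a dependency-free sketch of the same idea with base R's stack() (the toy values below are hypothetical stand-ins for datWide):

```r
# Hypothetical stand-in for datWide; the real data has 20 rows per column
datWide = data.frame(Convnce = c(529, 658, 793),
                     Quality = c(740, 610, 555),
                     Price   = c(480, 625, 700))

# stack() collapses the columns into (value, key) pairs, much like reshape2::melt()
dat = stack(datWide)
names(dat) = c('Response', 'Group')  # rename to match the notes
dat = dat[, c('Group', 'Response')]  # put Group first, as in the output below

head(dat)
```

Either route yields one row per observation, with the original column name stored as the grouping factor.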
## Group Response
## 1 Convnce 529
## 2 Convnce 658
## 3 Convnce 793
## 4 Convnce 514
## 5 Convnce 663
## 6 Convnce 719
dim(dat)
## [1] 60 2
str(dat)
Notice that one of the factor levels ('Convnce') is shortened. Let's change this.
table(dat$Group)
##
## Convnce Quality Price
## 20 20 20
dat$Group = as.factor(
ifelse(dat$Group %in% 'Convnce',
'Convenience', as.character(dat$Group))
)
table(dat$Group)
##
## Convenience Price Quality
## 20 20 20
Sample sizes
convenience = datWide$Convnce
quality = datWide$Quality
price = datWide$Price
n1 = n2 = n3 = nrow(datWide)
c('Convenience' = n1, 'Quality' = n2, 'Price' = n3)
nTotal = n1 + n2 + n3
c('Total sample size' = nTotal)
Group means
meanConvenience = mean(convenience)
meanQuality = mean(quality)
meanPrice = mean(price)
c('Convenience' = meanConvenience,
'Quality' = meanQuality,
'Price' = meanPrice)
allData = c(convenience, quality, price)
grandMean = mean(allData)
c('Grand mean' = round(grandMean, 2))
## Grand mean
## 613.07
s1 = sd(convenience)
s2 = sd(quality)
s3 = sd(price)
c('Convenience' = round(s1, 2),
'Quality' = round(s2, 2),
'Price' = round(s3, 2))
sd(allData)
## [1] 97.81474
library(ggplot2)
library(ggpubr)
dat$Group = factor(dat$Group,
                   levels = c('Convenience',
                              'Price',
                              'Quality'))
# (plt1's definition was lost in extraction; reconstructed here as a grouped
#  boxplot consistent with the figure below)
plt1 = ggplot(data = dat,
              aes(x = Group, y = Response)) +
  geom_boxplot() +
  theme_pubr()
bins = 8
brks = seq(min(dat$Response),
max(dat$Response),
length.out = bins+1)
plt2 = ggplot(data = dat,
aes(x = Response, y = ..density..)) +
geom_histogram(breaks = brks,
color = 'black',
fill = 'white') +
geom_density(data = dat,
aes(fill = Group),
alpha = 0.3) +
theme_pubr()
plt1
(Figure: plt1, showing Response by Group on an axis from 400 to 800.)
plt2
(Figure: plt2, a histogram of Response with overlaid density curves for the groups Convenience, Price and Quality.)
Notice that the total sum of squared deviations can be decomposed into the sum of squared deviations between the group means and the sum of squared deviations within each group. The sum of squared deviations between the group means is
$$ SSB = \sum_{j=1}^{k} n_j \left( \bar{x}_j - \bar{\bar{x}} \right)^2 . $$
If the grouping variable significantly influences the response variable, SSB should be large relative to SSW. Let's calculate these quantities.
Calculating SST
Notice that this is just the sum of the squared deviations around the grand mean. Clearly,
$$ SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{\bar{x}} \right)^2 = (n-1)S^2 . $$
calculateSST = function(Y, grandMean){
SST = sum((Y - grandMean)^2)
return(SST)
}
SST = calculateSST(Y = dat$Response, grandMean = grandMean)
round(SST, 3)
## [1] 564495.7
(nrow(dat) - 1)*var(dat$Response)
## [1] 564495.7
Calculating SSB
This is the sum of the squared differences between the group means.
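The chunk that produced the value below is not shown in the extract; a sketch consistent with the definition of SSB (the helper name calculateSSB is ours, mirroring calculateSST above):

```r
# SSB: for each group, weight the squared deviation of the group mean
# from the grand mean by the group size, then sum over groups
calculateSSB = function(response, group){
  grandMean  = mean(response)
  groupMeans = tapply(response, group, mean)
  groupSizes = tapply(response, group, length)
  sum(groupSizes * (groupMeans - grandMean)^2)
}

# Applied to the long-format data: SSB = calculateSSB(dat$Response, dat$Group)
```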
## [1] 57512.23
Calculating SSW
Notice that we only need the group sample sizes and standard deviations to compute the SSW . Furthermore,
SSW = SSE.
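The omitted chunk can be sketched directly from that observation, since SSW = Σⱼ (nⱼ − 1)sⱼ² (the helper name calculateSSW is ours):

```r
# SSW from group sizes and standard deviations: sum of (n_j - 1) * s_j^2
calculateSSW = function(groupSizes, groupSDs){
  sum((groupSizes - 1) * groupSDs^2)
}

# Applied to the values computed earlier:
# SSW = calculateSSW(c(n1, n2, n3), c(s1, s2, s3))
```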
## [1] 506983.5
Concept of ANOVA
Note that
SST = SSW + SSB.
SST
## [1] 564495.7
SSW + SSB
## [1] 564495.7
The question now becomes: how much larger than SSW must SSB be? Notice that adding more observations always increases a sum of squares. Therefore, we need some average measure of the sum of squares. This is exactly what the sample variance is: the average squared deviation around the grand mean. Hence, we should compute the mean of the SSB (MSB) and the mean of the SSW (MSE).
Once these quantities are computed, MSB can be compared to MSE.
To get a mean sum of squared deviations, we divide by the degrees of freedom, which acts as an effective sample size.
Although we can calculate a mean of the SST, it is not necessary, as we only want to compare the mean of the SSB with the mean of the SSE (or SSW). Note, however, that
$$ MST = \frac{SST}{n-1} = \frac{1}{n-1} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{\bar{x}} \right)^2 . $$
This is the sample variance of the data when ignoring the grouping variable.
This variance is then decomposed into the MSB and MSE, where
$$ MSB = \frac{SSB}{k-1} = \frac{1}{k-1} \sum_{j=1}^{k} n_j \left( \bar{x}_j - \bar{\bar{x}} \right)^2 , $$

and

$$ MSE = \frac{SSE}{n-k} = \frac{1}{n-k} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x}_j \right)^2 . $$
# Total variance
MST = SST/(n1+n2+n3 - 1)
MST
## [1] 9567.724
var(dat$Response)
## [1] 9567.724
# MSB
k = 3 # Number of groups
MSB = SSB/(k-1)
MSB
## [1] 28756.12
# MSE
MSE = SSW/(n1+n2+n3-k)
MSE
## [1] 8894.447
Now, is MSB much larger than MSE? For this we need a hypothesis test and, hence, a test statistic. It can be shown that

$$ \frac{MSB}{MSE} \sim F_{k-1,\,n-k} . $$
Therefore, we have two approaches: The critical value and p-value approach.
1. Critical value: Choose α. The critical value is $F_{\alpha,\,k-1,\,n-k}$ such that $P(F_{k-1,\,n-k} > F_{\alpha,\,k-1,\,n-k}) = \alpha$. If the test statistic $\frac{MSB}{MSE}$ is larger than the critical value, reject H0.
2. p-value: Choose α. The p-value is $P\left(F_{k-1,\,n-k} > \frac{MSB}{MSE}\right)$. If the p-value $< \alpha$, reject H0.
# Test statistic
test_statistic = MSB/MSE
test_statistic
## [1] 3.233041
# Critical value
alpha = 0.05
crit_val = qf(p = 1-alpha, df1 = k-1, df2 = n1+n2+n3-k)
crit_val
## [1] 3.158843
# p-value
p_value = 1 - pf(q = test_statistic, df1 = k-1, df2 = n1+n2+n3-k)
p_value
## [1] 0.04677299
What is our statistical decision?
Using the critical value approach, it is clear that Fcalc = 3.23 > 3.16 = Fcrit. Therefore, reject H0 at a 5% level of significance.
Using the p-value approach, it is clear that the p-value = 0.047 < 0.05 = α. Therefore, reject H0 at a 5% level of significance.
Since H0 is rejected, there is sufficient evidence that at least one of the population means is different from
the rest.
We can finally make a plot of the rejection region and see where the test statistic falls on this graph.
(Figure: density of the F distribution with 2 and 57 degrees of freedom, with the rejection region shaded and the test statistic marked.)
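A sketch of how a plot like this can be drawn with base R graphics, using the degrees of freedom and values computed earlier:

```r
# Density of F(k-1, n-k) = F(2, 57) with the alpha = 0.05 rejection region shaded
k = 3; n = 60; alpha = 0.05
crit_val = qf(1 - alpha, df1 = k - 1, df2 = n - k)
x = seq(0, 4, length.out = 400)
plot(x, df(x, df1 = k - 1, df2 = n - k), type = 'l',
     xlab = 'Data', ylab = 'Density')
xRej = seq(crit_val, 4, length.out = 100)
polygon(c(crit_val, xRej, 4), c(0, df(xRej, k - 1, n - k), 0), col = 'grey80')
abline(v = 3.233041, lty = 2)  # observed test statistic falls inside the region
```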
General steps
We can now consider the steps we just followed. The main idea is that we can decompose the total (squared)
deviation of the data as SST = SSW + SSB. Therefore, we can use the mean of these quantities to see if
the group means differ.
Step 1: State the null and alternative hypotheses.
H0 : µ1 = µ2 = ... = µk
H1 : At least one µi ≠ µj for i ≠ j.
Step 2: Choose α and compute the test statistic Fcalc = MSB/MSE.
Step 3: Compute the critical value Fcrit = Fα,k−1,n−k or the p-value P (Fk−1,n−k > Fcalc ).
Step 4: Make a statistical decision.
Step 5: Interpret the results.
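In practice, these steps are carried out in a single call by R's built-in aov(); for the notes' data this would be aov(Response ~ Group, data = dat). A sketch on hypothetical data:

```r
# One-factor ANOVA via the built-in aov(); toy data for illustration only
set.seed(1)
toy = data.frame(Group    = factor(rep(c('A', 'B', 'C'), each = 20)),
                 Response = rnorm(60,
                                  mean = rep(c(600, 610, 650), each = 20),
                                  sd = 100))
fit = aov(Response ~ Group, data = toy)
summary(fit)  # Df, Sum Sq (SSB/SSW), Mean Sq (MSB/MSE), F value, Pr(>F)
```

The summary table reports exactly the quantities derived above: the between and within sums of squares, their mean squares, the F statistic, and the p-value.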
Conclusion
We saw how we can decompose the variation in the data into variation within each group and the variation
between the groups.
Using probability theory, it can be shown that the ratio of the MSB to the MSE follows an F-distribution. This fact enables us to perform the above-mentioned hypothesis test.
However, there are some assumptions that must be satisfied for this result to hold. Next, we will define these assumptions. We will also consider methods for testing the assumptions, and non-parametric techniques to use when the assumptions cannot be verified.