
One-factor ANOVA by Example

Introduction
Recall from first year that you tested the hypothesis $H_0: \mu_1 = \mu_2$. ANOVA is the extension of these tests
to the case of $k$ groups / populations. We are interested in the following hypothesis test:

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k.$$

Note that the alternative hypothesis is

$$H_1: \text{At least one } \mu_j \text{ is different from the others.}$$

Let

• Categorical (grouping) variable: $X$ = advertisement strategy (Convenience, Quality, Price).

• Response variable: $Y = X_{ij}$ = the $i$th observation of the number of juices sold under the $j$th advertisement strategy.

We will now consider the intuitive view of ANOVA. From this we will also see the steps followed to perform
one-factor ANOVA.

Sample statistics and graphical displays


Firstly, we must always look at our data before doing anything. Let’s import the data from this example
and get some descriptive measures.

# Make sure the data file is located in the current WD.
list.files()

## [1] "ExampleData.txt" "OneFactorANOVA_introduction.pdf"


## [3] "OneFactorANOVA_introduction.Rmd"

# Import and view the data.
datWide = read.table('ExampleData.txt', header = TRUE)
head(datWide)

##   Convnce Quality Price
## 1 529 804 672
## 2 658 630 531
## 3 793 774 443
## 4 514 717 596
## 5 663 679 602
## 6 719 604 502

dim(datWide)

## [1] 20 3

str(datWide)

## 'data.frame': 20 obs. of  3 variables:
## $ Convnce: int 529 658 793 514 663 719 711 606 461 529 ...
## $ Quality: int 804 630 774 717 679 604 620 697 706 615 ...
## $ Price : int 672 531 443 596 602 502 659 689 675 512 ...

Notice that this dataset is in "wide" format. Sometimes data must be stored in "long" ("narrow") format
instead. We can use the reshape2 package to do this.

# If you haven't installed reshape2, first run the commented line below.
# install.packages('reshape2')
library(reshape2)
# See the information for the "melt" function
# ?melt.data.frame

# Convert to long / narrow format.
dat = melt(data = datWide,
           measure.vars = 1:3,
           variable.name = 'Group',
           value.name = 'Response')
head(dat)

## Group Response
## 1 Convnce 529
## 2 Convnce 658
## 3 Convnce 793
## 4 Convnce 514
## 5 Convnce 663
## 6 Convnce 719

dim(dat)

## [1] 60 2

str(dat)

## 'data.frame': 60 obs. of  2 variables:
## $ Group : Factor w/ 3 levels "Convnce","Quality",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Response: int 529 658 793 514 663 719 711 606 461 529 ...
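
As an aside, base R's stack() function performs a similar wide-to-long conversion without extra packages. A minimal sketch (the default column names are 'values' and 'ind', so we rename them to match dat):

# Base-R alternative to melt(): stack() also converts wide to long.
datLong = stack(datWide)
# Rename the default columns ('values', 'ind') to match 'dat'.
names(datLong) = c('Response', 'Group')
str(datLong)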

Notice that the first factor level name ('Convnce') is abbreviated. Let's change this.

# Let's change the factor level names.
table(dat$Group)

##
## Convnce Quality Price
## 20 20 20

dat$Group = as.factor(
ifelse(dat$Group %in% 'Convnce',
'Convenience', as.character(dat$Group))
)
table(dat$Group)

##
## Convenience Price Quality
## 20 20 20
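
As a side note, the same renaming can be done by editing the factor levels directly, which avoids ifelse(). A sketch of what could have been run instead:

# Direct alternative: rename the 'Convnce' level in place.
# levels(dat$Group)[levels(dat$Group) == 'Convnce'] = 'Convenience'

Also note in the table above that as.factor() re-sorted the levels alphabetically, so Price now precedes Quality; editing levels() directly would have preserved the original order.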

Next we can compute sample statistics to describe the data.

Sample sizes

convenience = datWide$Convnce
quality = datWide$Quality
price = datWide$Price

n1 = n2 = n3 = nrow(datWide)
c('Convenience' = n1, 'Quality' = n2, 'Price' = n3)

## Convenience     Quality       Price
##          20          20          20

nTotal = n1 + n2 + n3
c('Total sample size' = nTotal)

## Total sample size
##                60

Group means

meanConvenience = mean(convenience)
meanQuality = mean(quality)
meanPrice = mean(price)
c('Convenience' = meanConvenience,
'Quality' = meanQuality,
'Price' = meanPrice)

## Convenience     Quality       Price
##      577.55      653.00      608.65

Grand mean: mean of all the data

allData = c(convenience, quality, price)
grandMean = mean(allData)
c('Grand mean' = round(grandMean, 2))

## Grand mean
## 613.07

Group standard deviations

s1 = sd(convenience)
s2 = sd(quality)
s3 = sd(price)
c('Convenience' = round(s1, 2),
'Quality' = round(s2, 2),
'Price' = round(s3, 2))

## Convenience     Quality       Price
##      103.80       85.08       93.11

Grand (total) standard deviation

sd(allData)

## [1] 97.81474

Graphical display of data

library(ggplot2)

## Warning: package ’ggplot2’ was built under R version 4.0.5

library(ggpubr)

plt1 = ggplot(data = dat,
              aes(x = '', y = Response)) +
  geom_boxplot(width = 0.5) +
  geom_boxplot(data = dat,
               aes(x = Group, y = Response),
               width = 0.5) +
  theme_pubr() +
  scale_x_discrete(breaks = c('', 'Convenience', 'Price', 'Quality'),
                   labels = c('Total', 'Convenience', 'Price', 'Quality'))

bins = 8
brks = seq(min(dat$Response),
max(dat$Response),
length.out = bins+1)
plt2 = ggplot(data = dat,
aes(x = Response, y = ..density..)) +
geom_histogram(breaks = brks,
color = 'black',
fill = 'white') +
geom_density(data = dat,
aes(fill = Group),
alpha = 0.3) +
theme_pubr()

plt1

[Figure: side-by-side boxplots of Response for the total sample and for each group (Total, Convenience, Price, Quality).]

plt2

[Figure: histogram of Response with overlaid density estimates for each group (Convenience, Price, Quality).]

Idea of Analysis of Variance (ANOVA)


Consider the response variable, the number of juices sold ($Y$). We look at how far each observation is
from the grand mean. This is captured in the total sum of squared deviations of the data (SST), given by

$$SST = \sum_{i=1}^{n}\left(x_i - \bar{\bar{x}}\right)^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij} - \bar{\bar{x}}\right)^2,$$

where $\bar{\bar{x}}$ denotes the grand mean and $n_j$ the sample size of group $j$.

Notice that this can be decomposed into the sum of squared deviations between the group means and the
sum of squared deviations within each group. The sum of squared deviations between the group means is
$$SSB = \sum_{j=1}^{k} n_j\left(\bar{x}_j - \bar{\bar{x}}\right)^2,$$

and the sum of squared deviations within each group is


$$SSW = SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij} - \bar{x}_j\right)^2 = \sum_{j=1}^{k}(n_j - 1)s_j^2.$$

It can be shown that

$$SST = SSB + SSW.$$

If the grouping variable significantly influences the response variable, SSB should be large relative to
SSW. Let's calculate these quantities.

Calculating SST

Notice that this is just the sum of the squared deviations around the grand mean. Clearly,
$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij} - \bar{\bar{x}}\right)^2 = (n-1)S^2,$$

where $S^2$ is the sample variance of all the data.

calculateSST = function(Y, grandMean){
  SST = sum((Y - grandMean)^2)
  return(SST)
}
SST = calculateSST(Y = dat$Response, grandMean = grandMean)
round(SST, 3)

## [1] 564495.7

Note, we obtain the same value with the following.

(nrow(dat) - 1)*var(dat$Response)

## [1] 564495.7

Calculating SSB

This is the sum of the squared differences between the group means.

calculateSSB = function(vec_of_n, vec_of_means, grandMean){
  SSB = sum(vec_of_n*(vec_of_means - grandMean)^2)
  return(SSB)
}

SSB = calculateSSB(vec_of_n = c(n1, n2, n3),
                   vec_of_means = c(meanConvenience,
                                    meanQuality,
                                    meanPrice),
                   grandMean = grandMean)
round(SSB, 3)

## [1] 57512.23

Calculating SSW

Notice that we only need the group sample sizes and standard deviations to compute the SSW. Furthermore,
SSW = SSE.

calculateSSW = function(vec_of_n, vec_of_stdev){
  SSW = sum((vec_of_n - 1)*vec_of_stdev^2)
  return(SSW)
}

SSW = calculateSSW(vec_of_n = c(n1, n2, n3),
                   vec_of_stdev = c(s1, s2, s3))
round(SSW, 3)

## [1] 506983.5

Concept of ANOVA

Note that
SST = SSW + SSB.

SST

## [1] 564495.7

SSW + SSB

## [1] 564495.7
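
Because of floating-point arithmetic, an exact == comparison of such sums can fail; all.equal() compares up to a small numerical tolerance. A minimal check of the identity:

# Check the decomposition up to numerical tolerance.
all.equal(SST, SSW + SSB)

## [1] TRUE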

The question now becomes: how much larger must SSB be relative to SSW? Notice that adding more
observations always increases a sum of squares. Therefore, we need some average measure of the sum
of squares. This is exactly what the sample variance is: the average squared deviation around the mean.
Hence, we should compute the mean of the SSB (MSB) and the mean of the SSW (MSE). Once these
quantities are computed, MSB can be compared to MSE.
To get a mean sum of squared deviations, we divide by the degrees of freedom, which acts as the effective
sample size.

• df of SST: the total degrees of freedom, $df_{SST} = n - 1$.
• df of SSB: the degrees of freedom between groups, $df_{SSB} = k - 1$.
• df of SSE: the degrees of freedom within groups, $df_{SSE} = n - k$.

Although we can calculate a mean of the SST, it is not necessary, since we only want to compare the mean
of the SSB with the mean of the SSE (or SSW). Note, however, that

$$MST = \frac{SST}{n-1} = \frac{1}{n-1}\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij} - \bar{\bar{x}}\right)^2.$$

This is the sample variance of the data when ignoring the grouping variable.
This variance is then decomposed into the MSB and MSE, where

$$MSB = \frac{SSB}{k-1} = \frac{1}{k-1}\sum_{j=1}^{k} n_j\left(\bar{x}_j - \bar{\bar{x}}\right)^2,$$

and

$$MSE = \frac{SSE}{n-k} = \frac{1}{n-k}\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij} - \bar{x}_j\right)^2.$$

In the following, these quantities are computed.

# Total variance
MST = SST/(n1+n2+n3 - 1)
MST

## [1] 9567.724

var(dat$Response)

## [1] 9567.724

# MSB
k = 3 # Number of groups
MSB = SSB/(k-1)
MSB

## [1] 28756.12

# MSE
MSE = SSW/(n1+n2+n3-k)
MSE

## [1] 8894.447

Now, is MSB much larger than MSE? For this we need a hypothesis test and, hence, a test statistic. It can
be shown that

$$\frac{MSB}{MSE} \sim F_{k-1,\,n-k}.$$

Therefore, we have two approaches: the critical value approach and the p-value approach.

1. Critical value: Choose $\alpha$. The critical value is $F_{\alpha,k-1,n-k}$ such that $P(F_{k-1,n-k} > F_{\alpha,k-1,n-k}) = \alpha$. If the test statistic $\frac{MSB}{MSE}$ is larger than the critical value, reject $H_0$.

2. p-value: Choose $\alpha$. The p-value is $P\left(F_{k-1,n-k} > \frac{MSB}{MSE}\right)$. If the p-value is less than $\alpha$, reject $H_0$.

Computing the statistics for the hypothesis test


We can now easily compute the test statistic and the critical value or p-value. Let's assume $\alpha = 0.05$.

# Test statistic
test_statistic = MSB/MSE
test_statistic

## [1] 3.233041

# Critical value
alpha = 0.05
crit_val = qf(p = 1-alpha, df1 = k-1, df2 = n1+n2+n3-k)
crit_val

## [1] 3.158843

# p-value
p_value = 1 - pf(q = test_statistic, df1 = k-1, df2 = n1+n2+n3-k)
p_value

## [1] 0.04677299
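
As a minor numerical note, the same p-value can be obtained with lower.tail = FALSE, which avoids the subtraction from 1 and is more accurate when the tail probability is very small:

# Equivalent upper-tail computation.
pf(q = test_statistic, df1 = k-1, df2 = n1+n2+n3-k, lower.tail = FALSE)

## [1] 0.04677299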

What is our statistical decision?

Using the critical value approach, it is clear that $F_{calc} = 3.23 > 3.16 = F_{crit}$. Therefore, reject $H_0$ at a 5%
level of significance.
Using the p-value approach, it is clear that p-value $= 0.047 < 0.05 = \alpha$. Therefore, reject $H_0$ at a 5% level
of significance.
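
As a small sketch, the decision can also be expressed in code, using the objects computed above:

# Decision via the critical value (equivalently, p_value < alpha).
ifelse(test_statistic > crit_val, 'Reject H0', 'Do not reject H0')

## [1] "Reject H0"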

What is our interpretation?

Since H0 is rejected, there is sufficient evidence that at least one of the population means is different from
the rest.
Finally, we can plot the rejection region and see where the test statistic falls on this graph.

u = seq(from = 0.001, to = 4, length.out = 10000)
fu = df(x = u, df1 = k-1, df2 = n1+n2+n3-k)

df_plot = data.frame('Data' = u, 'Density' = fu)

ggplot(data = df_plot, aes(x = Data, y = Density)) +
  geom_line() +
  theme_pubr() +
  geom_ribbon(data = subset(df_plot,
                            Data > crit_val & Data < 4),
              aes(x = Data, ymax = Density),
              ymin = 0,
              alpha = 0.5,
              fill = 'blue') +
  geom_vline(xintercept = test_statistic,
             linetype = 'dashed', size = 1.3)

[Figure: density of the $F_{2,57}$ distribution with the rejection region beyond $F_{crit} = 3.16$ shaded and the test statistic $F_{calc} = 3.23$ marked by a dashed vertical line.]

General steps
We can now review the steps we just followed. The main idea is that we can decompose the total (squared)
deviation of the data as SST = SSW + SSB. Therefore, we can use the means of these quantities to see if
the group means differ.
Step 1: State the null and alternative hypotheses.

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k$$
$$H_1: \text{At least one } \mu_i \neq \mu_j \text{ for } i \neq j.$$

Step 2: Compute the test statistic

$$F_{calc} = \frac{MSB}{MSE}.$$

Step 3: Compute the critical value $F_{crit} = F_{\alpha,k-1,n-k}$ or the p-value $P(F_{k-1,n-k} > F_{calc})$.
Step 4: Make a statistical decision.
Step 5: Interpret the results.
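
For reference, all of these steps are bundled into R's built-in aov() function. A minimal sketch, assuming the long-format dat from earlier is still in memory; the F statistic (3.23), degrees of freedom (2 and 57), and p-value (0.047) in the resulting ANOVA table should match the hand computations above.

# One-factor ANOVA with R's built-in function.
fit = aov(Response ~ Group, data = dat)
summary(fit)  # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)

# oneway.test() performs the same test when equal variances are assumed.
oneway.test(Response ~ Group, data = dat, var.equal = TRUE)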

Conclusion
We saw how to decompose the variation in the data into the variation within each group and the variation
between the groups.
Using probability theory, it can be shown that the ratio of the MSB and the MSE follows an F-distribution.
This fact enables us to perform the above-mentioned hypothesis test.
However, there are some assumptions that must be satisfied for this result to hold. Next, we will define these
assumptions. We will also consider methods for testing the assumptions, and non-parametric techniques to
use if the assumptions cannot be verified.

