
Testing the assumptions of ANOVA

Introduction
In this section, we will discuss the theory of the assumptions necessary for an ANOVA and look at methods
to test these assumptions in R. We will also consider the general steps in R to perform an ANOVA.

Overview of previous work


ANOVA is a linear model that decomposes each observation into a global mean, an effect on the mean of some
group, and a random error. Formally, each observation is a realisation of the random variable $X_{ij}$, where

$$X_{ij} = \mu + \alpha_j + \epsilon_{ij}.$$

Our goal is to determine whether all $\alpha_j$, $j = 1, 2, \dots, k$, are equal to zero or whether at least one of them differs significantly
from zero. This led to decomposing the total variation (SST) into the sum of the variation within each group
(SSW or SSE) and the variation of the group means around the global mean (SSB). Finally, we computed a test
statistic given by

$$F_{calc} = \frac{MSB}{MSE}.$$

However, which assumptions are required for this test statistic to follow an $F_{k-1,\,n-k}$ distribution?
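
As a brief reminder of the computation, the following sketch rebuilds this F statistic from the sums of squares for a generic response vector y and grouping factor g; the function and variable names are illustrative only, and R's built-in aov (used later) gives the same result.

# Sketch: the ANOVA decomposition computed by hand ('y' and 'g' are illustrative names)
anova_by_hand = function(y, g) {
  g = factor(g)
  k = nlevels(g)
  n = length(y)
  group_means = ave(y, g, FUN = mean)        # each observation's group mean
  SSB = sum((group_means - mean(y))^2)       # variation of the group means around the global mean
  SSW = sum((y - group_means)^2)             # variation within the groups (SSE)
  MSB = SSB / (k - 1)
  MSE = SSW / (n - k)
  Fcalc = MSB / MSE
  c(F = Fcalc, p = pf(Fcalc, k - 1, n - k, lower.tail = FALSE))
}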

Assumptions necessary for ANOVA


The ANOVA test relies on the following assumptions:

1. Normality. The response variable in each group is normally distributed, i.e. $X \mid G$ is normally
   distributed for every group $G$.

2. Homoscedasticity. The population variances of the groups are equal, i.e. $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 = \sigma^2$.

3. Independence. The observations are random and independent: an observation of $X$ should not depend
   on any other observation in the sample. This assumption is difficult to test directly.

Naturally, we would like to test these assumptions, and methods for doing so are shown shortly. However, at this
stage it is worthwhile to note that not all assumptions are equally important. It is well known that the
F-test is robust with respect to violations of the normality assumption. This implies that even if there are
(small) departures from normality, the ANOVA will still give credible results. However, if there are large
deviations from normality, especially when the group sample sizes are small, some remedial measures can
be taken. One may try a transformation (to make the data normally distributed) or an appropriate
non-parametric test. Non-parametric tests are roughly equivalent tests that do not rely on any
distributional assumptions. The non-parametric equivalent of the one-factor ANOVA is the Kruskal-Wallis
test and will be covered in later sections. We will test the normality assumption using any of the
previously developed normality tests (e.g. Lilliefors), except that each group has to be tested for normality
separately and all groups have to be normally distributed for the assumption to hold.
The F-test is, however, much more sensitive to departures from homogeneity of variances. This is usually not a
problem when the group sizes are equal, but where group sizes are unequal, a violation of the equal-variances
assumption may distort the results significantly. This assumption is usually tested using Bartlett's
test for homogeneity and will be illustrated later on. If the assumption is violated, certain remedial measures,
such as a transformation of the variables, can again be applied in order to rectify the situation.
The last assumption, independence, is not easy to verify with a single test. Dependence typically arises in
experiments where repeated measures are taken on the same experimental unit over time. In that case,
one would perform a repeated measures ANOVA, but this is beyond the scope of this course. However, if the
data are collected randomly and with an appropriate sampling strategy, it is generally acceptable to assume
that the observations are independent.

Testing the assumption of normality


The general test for normality is the Lilliefors (Kolmogorov-Smirnov) test. Although we will only use R to
perform this test, a quick description of the test is given below.
Consider a sample $\{x_1, x_2, \dots, x_n\}$ from an unknown distribution. Let $F(x)$ be the cumulative distribution
function (CDF) of the normal distribution and let $F(z)$ be the CDF of the standard normal distribution.
We test the following hypothesis.

$$H_0: X \sim F(x)$$

$$H_1: X \not\sim F(x)$$

We can also write the hypothesis test in words:

H0 : The random variable X is normally distributed.


H1 : The random variable X is not normally distributed.

In this case, the test statistic is the maximum difference between the standard normal distribution and the
empirical distribution function of the (standardised) data.
Firstly, compute the sample mean x̄ and sample standard deviation s. Then, compute all the z-values with
$$z_i = \frac{x_i - \bar{x}}{s}, \quad i = 1, \dots, n.$$

Next, we can compute the empirical distribution function of the $z_i$ values with

$$\hat{F}(z) = \sum_{i=1}^{n} \frac{I(z_i \leq z)}{n}.$$

Sometimes the denominator $n + 1$ is used instead, to avoid probabilities equal to 1.


Finally, the test statistic is computed as the maximum distance from $\hat{F}(z)$ to $F(z)$. Hence, the test statistic
is

$$D = \sup_{z_i} \, | \hat{F}(z) - F(z) |.$$

This test statistic follows a Lilliefors (Kolmogorov-Smirnov) distribution and can only be computed with
software or using an appropriate table. Note that the Lilliefors test is the extension of the Kolmogorov-
Smirnov test when the population mean and variance are unknown. Although the same test statistic is
used in both tests, the sampling distributions differ. The sampling distributions of both tests can only be
computed numerically. Therefore, we rely on statistical software, like R, to compute the critical value or
p-value of the test. This will be demonstrated in later sections.
It is known that the power of this test is quite low compared to similar tests. Notice that, since the maximum
deviation between the expected and observed distributions is used, the test statistic is severely affected
by outliers. A single outlier can therefore cause the test to reject normality.
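
To make this computation concrete, the sketch below calculates the D statistic by hand for a small simulated sample; the data and variable names are illustrative only, and the p-value still requires the Lilliefors sampling distribution, which lillie.test from the nortest package (used later) provides.

# Sketch: the Lilliefors D statistic computed by hand (simulated data for illustration only)
set.seed(1)
x = rnorm(30, mean = 50, sd = 10)              # hypothetical sample
z = sort((x - mean(x)) / sd(x))                # standardised, ordered values
n = length(z)
F0 = pnorm(z)                                  # standard normal CDF at each ordered value
D = max((1:n) / n - F0, F0 - (0:(n - 1)) / n)  # largest gap on either side of each jump of F-hat
D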

Testing the assumption of equal variances


Each group has some unknown population variance. The test statistic of an ANOVA follows an F-distribution
only if all the population group variances are equal. Hence, we want to test the following hypothesis.

$$H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 = \sigma^2$$

$$H_1: \text{At least one } \sigma_j^2 \neq \sigma^2 \text{ for some } j = 1, 2, \dots, k.$$

To test this hypothesis, we use Bartlett’s test for homogeneity. The test statistic of this test is given by
$$K = \frac{(n-k)\ln(MSE) - \sum_{j=1}^{k} (n_j - 1)\ln(s_j^2)}{1 + \frac{1}{3(k-1)}\left( \sum_{j=1}^{k} \frac{1}{n_j - 1} - \frac{1}{n-k} \right)}.$$

It can be shown that this test statistic follows a $\chi^2(k - 1)$ distribution. Furthermore, we reject the null
hypothesis for large values of the test statistic. The computed statistic can be compared to the critical value $\chi^2_{\alpha,\,k-1}$.
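
A minimal sketch of this calculation, assuming a response vector y and grouping factor g (illustrative names only), is shown below; R's built-in bartlett.test, used later, performs the same computation.

# Sketch: Bartlett's K statistic computed by hand ('y' and 'g' are illustrative names)
bartlett_K = function(y, g) {
  g = factor(g)
  nj = tapply(y, g, length)                   # group sample sizes
  s2 = tapply(y, g, var)                      # group sample variances
  k = nlevels(g)
  n = sum(nj)
  MSE = sum((nj - 1) * s2) / (n - k)          # pooled variance
  num = (n - k) * log(MSE) - sum((nj - 1) * log(s2))
  den = 1 + (1 / (3 * (k - 1))) * (sum(1 / (nj - 1)) - 1 / (n - k))
  num / den                                   # compare with qchisq(1 - alpha, df = k - 1)
}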

Testing the assumption of independence


This assumption cannot be tested easily. However, if our samples are random and the groups are homoge-
neous, it is valid to assume that the observations are independent.
The only case where we can really assess this assumption is when we have time-series data: the order in
which the observations were collected is then known, and we can check whether observations depend on
previous ones.
We will not explicitly test for independence. However, if this assumption is clearly violated, alternative tests
that take this dependence into account must be used.
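
As a purely informal illustration (not part of the formal tests above), when the collection order of the observations is known one can plot the model residuals in that order and look for systematic patterns. The simulated values below are only a stand-in for real residuals.

# Informal sketch: residuals plotted in collection order (simulated stand-in values)
# With real data, replace sim_resid with residuals(fit) from the ANOVA fitted below.
set.seed(2)
sim_resid = rnorm(60)
plot(sim_resid, type = 'b', xlab = 'Collection order', ylab = 'Residual')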

ANOVA in R
In this section we use R libraries to perform ANOVA and to test the assumptions. We will use the same
dataset as before considering the number of weekly sales of a juice for different advertising strategies.
Please download the correct dataset from SUNLearn and import the data with the code below.

# Check if data is stored in current WD
any(list.files() == 'ExampleDataNarrow.txt')

## [1] TRUE

# Import data
dat = read.table('ExampleDataNarrow.txt', header = TRUE)
str(dat)

## 'data.frame': 60 obs. of 2 variables:
## $ Population: chr "Convenience" "Convenience" "Convenience" "Convenience" ...
## $ Sales : int 529 658 793 514 663 719 711 606 461 529 ...

Next, we can visualise the three groups’ sales data.

library(ggplot2)

## Warning: package ’ggplot2’ was built under R version 4.0.5

library(ggpubr)

plt1 = ggplot(data = dat, aes(x = Population, y = Sales)) +
  geom_boxplot(width = 0.5) +
  theme_pubr()
plt1

[Figure: boxplots of Sales by Population (Convenience, Price, Quality).]

plt2 = ggplot(data = dat, aes(x = Population, y = Sales)) +
  geom_violin(trim = F) +
  geom_boxplot(width = 0.1) +
  theme_pubr()
plt2

[Figure: violin plots with inset boxplots of Sales by Population (Convenience, Price, Quality).]

# Helper for stat_summary: returns the mean and the mean plus/minus one standard deviation
data_summary = function(x){
  m = mean(x)
  ymin = m - sd(x)
  ymax = m + sd(x)
  return(c(y = m, ymin = ymin, ymax = ymax))
}

plt3 = ggplot(data = dat, aes(x = Population, y = Sales)) +
  geom_violin(trim = F, fill = 'lightgrey') +
  stat_summary(fun.data = data_summary) +
  theme_pubr()
plt3

[Figure: violin plots of Sales by Population (Convenience, Price, Quality) with the group mean and mean ± one standard deviation marked.]

We want to test the following hypothesis.

$$H_0: \mu_1 = \mu_2 = \mu_3$$
$$H_1: \text{At least one } \mu_i \neq \mu_j \text{ for some } i \neq j.$$

This can be achieved in R as follows.

# Method 1
fit = aov(Sales~Population, data = dat)
summary(fit)

## Df Sum Sq Mean Sq F value Pr(>F)


## Population 2 57512 28756 3.233 0.0468 *
## Residuals 57 506984 8894
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

# Method 2
linModel = lm(Sales~Population, data = dat)
anova(linModel)

## Analysis of Variance Table


##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## Population 2 57512 28756.1 3.233 0.04677 *
## Residuals 57 506984 8894.4
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

As before, we obtain a test statistic of $F_{calc} = 3.233$ and a p-value of $P(F_{2,57} > 3.233) = 0.0468$. Therefore,
there is sufficient evidence to reject the null hypothesis. Hence, at a 5% level of significance, the population
average sales of the three different advertising strategies are not all equal.
Next, we want to test our assumptions. The above result is valid only if the assumptions are satisfied.

Testing normality

Notice that each group must be normally distributed. Hence, if we have k groups, we must perform k tests.
Although this would inflate the type I error, we do not apply a multiple-testing correction for these checks.
The following code performs the Lilliefors test for normality.

conv = dat$Sales[dat$Population == 'Convenience']
qual = dat$Sales[dat$Population == 'Quality']
price = dat$Sales[dat$Population == 'Price']

library(nortest)

# Convenience
lillie.test(conv)

##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: conv
## D = 0.12847, p-value = 0.5235

# Quality
lillie.test(qual)

##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: qual
## D = 0.13836, p-value = 0.4027

# Price
lillie.test(price)

##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: price
## D = 0.15565, p-value = 0.2309
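
As a small convenience (assuming dat is the data frame imported earlier), the three separate calls above can be collapsed into a single call with base R's by function, which applies lillie.test to each group in turn.

# Run the Lilliefors test for all three groups in one call
by(dat$Sales, dat$Population, lillie.test)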

Testing homoscedasticity
Here we are testing if all the group variances are equal. Hence, we only perform one test.
The following code performs Bartlett’s test for equal variances.
# Method 1
bartlett.test(x = dat$Sales, g = dat$Population)

##
## Bartlett test of homogeneity of variances
##
## data: dat$Sales and dat$Population
## Bartlett’s K-squared = 0.73887, df = 2, p-value = 0.6911

# Method 2
bartlett.test(formula = Sales~Population, data = dat)

##
## Bartlett test of homogeneity of variances
##
## data: Sales by Population
## Bartlett’s K-squared = 0.73887, df = 2, p-value = 0.6911

Next steps: multiple comparisons


If we conclude that there is a significant difference between some of the population group means, the next
step is to determine which groups have different means. This is done with multiple comparison t-tests, which
we will discuss in the next section.
