Lecture 11
Javed Iqbal
Analysis of Variance:
We have learned how to compare two population means, that is, the means of a single variable for
two different populations. We studied various methods for making such comparisons, one being
the pooled t-procedure. Analysis of variance (ANOVA) provides a method for testing equality of
several population means. One-way analysis of variance deals with comparing the means of different
populations (or treatments) for a single variable.
Consider Weiss, Example 16.2, p. 726, where the aim is to compare the average energy
consumption in the four regions of the US.
ANOVA considers the total variability (SST) in the response variable under study and
partitions this sum of squares into two parts:
i) SSTR: sum of squares due to treatment (in this case region), the factor we think affects
the response or dependent variable (explained variation due to different regions)
ii) SSE: sum of squares due to error, or unexplained variation
(for example, why does household consumption differ within a
region?)
Thus we have the famous one-way ANOVA identity:
\[ SST = SSTR + SSE \]
The logic of ANOVA is to compare the variance due to treatment with the variance due to error. If
the variance due to treatment (i.e. explained variation) is large relative to the variance due to error
(i.e. unexplained variation), then we reject the null hypothesis that the means of all treatments are the same.
Suppose there are k levels of a factor (i.e. k treatments) and a total of n = n1 + n2 + … + nk
observations.
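For reference, the quantities named above are conventionally defined as follows. These are the standard textbook forms; the symbols $x_{ij}$ (observation j in treatment i) and $s_i^2$ (sample variance of treatment i) are notation assumed here rather than taken from the lecture.

```latex
\begin{align*}
SST  &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2, \\
SSTR &= \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2, \\
SSE  &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
      = \sum_{i=1}^{k} (n_i - 1)\, s_i^2, \\
MSTR &= \frac{SSTR}{k-1}, \qquad MSE = \frac{SSE}{n-k}.
\end{align*}
```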
Here $\bar{x}_i$ is the sample mean of treatment i and $\bar{x}$ is the grand mean of all n observations:
\[ \bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k}, \qquad n = n_1 + n_2 + \cdots + n_k \]
Under the null hypothesis, the test statistic has the F distribution with k-1 and n-k degrees of freedom:
\[ F = \frac{MSTR}{MSE} \sim F(k-1,\, n-k) \]
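As a minimal sketch of how the F statistic is assembled from raw observations, the following Python function computes the sums of squares and checks the identity SST = SSTR + SSE. The function name `one_way_anova` and the sample values are hypothetical, chosen only so the arithmetic is easy to follow; they are not data from the lecture.

```python
from statistics import mean


def one_way_anova(groups):
    """Return a dict of one-way ANOVA quantities for a list of samples,
    one list of observations per treatment."""
    k = len(groups)                      # number of treatments
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n

    # Explained variation: treatment means around the grand mean
    sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Unexplained variation: observations around their own treatment mean
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Total variation: observations around the grand mean
    sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

    mstr = sstr / (k - 1)
    mse = sse / (n - k)
    return {"SST": sst, "SSTR": sstr, "SSE": sse,
            "MSTR": mstr, "MSE": mse, "F": mstr / mse,
            "df": (k - 1, n - k)}


# Hypothetical data: three treatments with three observations each
samples = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
result = one_way_anova(samples)
print(f"SSTR = {result['SSTR']:.1f}, SSE = {result['SSE']:.1f}, F = {result['F']:.2f}")
# SSTR = 54.0, SSE = 6.0, F = 27.00
```

The computed F value would then be compared with the critical value of the F distribution with the degrees of freedom stored in `result["df"]`.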
In the energy consumption example above for the four regions of the US, k = 4 and n = 20, so the F statistic has 3 and 16 degrees of freedom.
Conclusion: Since the calculated F value falls in the rejection region, we reject H0 and conclude that the average
household level of energy consumption differs significantly across the 4 regions of the US.
The box plot of the 4 regions indicates that the highest average energy consumption is in the Midwest
region. The average levels of energy consumption for the South and West regions are nearly the same.
Assumptions of the ANOVA model:
1) The variances of the k treatments or populations are equal (that is why this procedure is an extension
of the pooled t-test).
2) Treatments are applied randomly and independently to subjects.
3) The error within each population is normally distributed.
Note that one of the assumptions of ANOVA is that the variances are equal across the subpopulations,
which seems to be violated in this case. In such cases one can transform the data, e.g. by a log
transformation, before applying ANOVA.
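The effect of a log transformation on unequal spreads can be sketched as follows. The two groups are hypothetical values constructed so that the spread grows in proportion to the level, which is the situation where a log transformation tends to help. Comparing the ratio of the largest to the smallest sample standard deviation is used here only as an informal check, not as a formal test.

```python
from math import log
from statistics import stdev

# Hypothetical data: the second group is ten times the first,
# so its standard deviation is ten times larger as well.
groups = {
    "A": [10, 12, 14],
    "B": [100, 120, 140],
}


def sd_ratio(samples):
    """Ratio of the largest to the smallest sample standard deviation."""
    sds = [stdev(values) for values in samples.values()]
    return max(sds) / min(sds)


# Apply the natural log to every observation
logged = {name: [log(x) for x in values] for name, values in groups.items()}

ratio_raw = sd_ratio(groups)   # spreads differ by a factor of 10
ratio_log = sd_ratio(logged)   # spreads are essentially equal after taking logs
print(f"raw ratio = {ratio_raw:.2f}, log ratio = {ratio_log:.2f}")
```

After the transformation, ANOVA would be applied to the logged values, and conclusions then refer to means on the log scale.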
Anderson, Exercises 9, 10, 12 (PDF p. 644)
Solution to Exercise 9:
SUMMARY
Groups   Count   Sum   Average   Variance
50°      5       165   33        32
60°      5       145   29        17.5
70°      5       140   28        9.5

Grand mean = 30
SSTR = 70, MSTR = 35, SSE = 236, MSE = 19.66667, F = 1.7797, F(0.05; 2, 12) = 3.885
Since F = 1.7797 is less than the critical value 3.885, do not reject H0. There is insufficient evidence of a difference in average yield at the three
temperatures.
ANOVA
Source of Variation   SS    df   MS         F          P-value    F crit
Between Groups        70    2    35         1.779661   0.210447   3.885294
Within Groups         236   12   19.66667
Total                 306   14
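The ANOVA table for Exercise 9 can be reproduced from the SUMMARY statistics alone, since SSTR needs only the group counts and means and SSE needs only the counts and variances. The sketch below does this in plain Python. The critical value is copied from the table rather than computed, and the variable names are illustrative.

```python
# Summary statistics copied from the SUMMARY table: (count, average, variance)
summary = {
    "50°": (5, 33.0, 32.0),
    "60°": (5, 29.0, 17.5),
    "70°": (5, 28.0, 9.5),
}

k = len(summary)                                     # number of treatments
n = sum(count for count, _, _ in summary.values())   # total observations
grand_mean = sum(count * avg for count, avg, _ in summary.values()) / n

# Between-groups sum of squares: n_i * (group mean - grand mean)^2
sstr = sum(count * (avg - grand_mean) ** 2 for count, avg, _ in summary.values())
# Within-groups sum of squares: (n_i - 1) * sample variance
sse = sum((count - 1) * var for count, _, var in summary.values())

mstr = sstr / (k - 1)
mse = sse / (n - k)
f_stat = mstr / mse

F_CRIT = 3.885294   # F(0.05; 2, 12), taken from the ANOVA table, not computed here
reject_h0 = f_stat > F_CRIT

print(f"SSTR = {sstr:.0f}, SSE = {sse:.0f}, SST = {sstr + sse:.0f}")
print(f"MSTR = {mstr:.0f}, MSE = {mse:.5f}, F = {f_stat:.6f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The printed values agree with the Between Groups, Within Groups, and Total rows of the table.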
In two-way analysis of variance, we are interested in the effect of two factors that
simultaneously affect a response variable. For example, in agricultural experiments both the variety of seed
and the type of fertilizer may affect the response, crop yield. Analysis of variance is the main computational
method in the statistical field of Design of Experiments.