Lecture 2
Lecture Notes 2
Assa Mulagha-Maganga
Dept of Agricultural and Applied Economics, LUANAR
Department of Mathematical Sciences (Statistics), Chancellor College
2 Analysis of variance
This lecture is the first in a series of lectures devoted to linear models. The topic of
this chapter, analysis of variance, provides a methodology for partitioning the total variance
computed from a data set into components, each of which represents the amount of the
total variance that can be attributed to a specific source of variation. The results of this
partitioning can then be used to estimate and test hypotheses about population variances and
means. In this chapter we focus our attention on hypothesis testing of means. Specifically,
we discuss the testing of differences among means when there is interest in more than two
populations or two or more variables. The techniques discussed in this chapter are widely
used in business or health sciences.
Statistics for Economists 2
The probability distribution used in this unit is the F distribution. It was named to honor Sir
Ronald Fisher, one of the founders of modern-day statistics. This probability distribution is
used as the distribution of the test statistic for several situations. It is used to test whether
two samples are from populations having equal variances, and it is also applied when we
want to compare several population means simultaneously.
What are the characteristics of the F distribution?
1. The F distribution is continuous. This means that it can assume an infinite number
of values between zero and positive infinity.
3. It is positively skewed. The long tail of the distribution is to the right-hand side. As
the number of degrees of freedom increases in both the numerator and denominator
the distribution approaches a normal distribution.
4. It is asymptotic. As the values of X increase, the F curve approaches the X-axis but
never touches it. This is similar to the behavior of the normal distribution.
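The characteristics listed above can be checked informally by simulation. The sketch below is only an illustration: it draws pairs of normal samples with equal variances and forms the ratio of their sample variances, which follows an F distribution. The function name, sample size, number of simulations and seed are all arbitrary choices and do not come from the lecture.

```python
import random
import statistics


def simulate_variance_ratios(n_sims=5000, sample_size=10, seed=1):
    """Return s1^2 / s2^2 for n_sims pairs of equal-variance normal samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    ratios = []
    for _ in range(n_sims):
        sample_1 = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        sample_2 = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        ratios.append(statistics.variance(sample_1) / statistics.variance(sample_2))
    return ratios


ratios = simulate_variance_ratios()
print("smallest ratio:", min(ratios))              # never negative
print("median ratio:", statistics.median(ratios))
print("mean ratio:", statistics.mean(ratios))      # pulled up by the long right tail
```

A mean that sits above the median is what the positive skew described in the list would lead us to expect.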
When these conditions are met, F is used as the distribution of the test statistic
$$MSTR = \frac{SSTR}{k-1}$$
The numerator of the formula, SSTR, is called the treatment sum of squares, and is computed
by using formula:
The within samples variation is measured by the error mean square and is represented by
MSE. The expression for MSE is given by
$$MSE = \frac{SSE}{n-k}$$
The numerator of the formula, SSE, is called the error sum of squares, and is computed by using
the formula below, where $s_1^2, s_2^2, \ldots, s_k^2$ are the sample variances.
The denominator k − 1, is called the degrees of freedom for treatments and the denominator,
n − k, is called the degrees of freedom for error.
The sum of the treatment sum of squares and the error sum of squares is called the total sum
of squares. The total sum of squares is represented by SST and is given by SST = SSTR + SSE.
The total sum of squares may also be computed directly by using the formula
$SST = \sum (x - \bar{x})^2$, where the sum is over all n sample values. The degrees of freedom
for total is equal to n − 1.
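The formulas for SSTR and SSE referred to in this section are not reproduced in these notes. The block below gives the standard textbook definitions as a reference. The subscript notation ($n_j$, $\bar{x}_j$ and $s_j^2$ for the size, mean and variance of sample $j$, and $\bar{x}$ for the overall mean) is an assumption and may differ from the notation used in class.

```latex
\begin{aligned}
SSTR &= \sum_{j=1}^{k} n_j \,(\bar{x}_j - \bar{x})^2, &
MSTR &= \frac{SSTR}{k-1}, \\
SSE  &= \sum_{j=1}^{k} (n_j - 1)\, s_j^2, &
MSE  &= \frac{SSE}{n-k}, \\
SST  &= SSTR + SSE, &
F    &= \frac{MSTR}{MSE}.
\end{aligned}
```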
The results of the computations in the preceding sections are usually conveniently displayed
in a one-way ANOVA table. The general structure of the one-way ANOVA table is given as:
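Table 2.1: ANOVA Table
Source of variation   Sum of squares   Degrees of freedom   Mean square           F
Treatments            SSTR             k − 1                MSTR = SSTR/(k − 1)   MSTR/MSE
Error                 SSE              n − k                MSE = SSE/(n − k)
Total                 SST              n − 1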
What are the Steps for Testing the Equality of Means Using the One-Way ANOVA Procedure?
Step 1: State the null and alternative hypotheses as follows:
H0: µ1 = µ2 = · · · = µk
Ha: Not all k means are equal.
Step 2: Use the F distribution table and the level of significance, α, to determine the
rejection region.
Step 3: Build the ANOVA table, and from the table determine the computed value of the
F ratio.
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the
test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.
Example 7.1
Fifteen students at LUANAR are randomly assigned to three different schools, all of which
are concerned with developing a specified level of skill in agricultural economics. The achievement
test scores at the conclusion of the instructional unit are reported in Table 7.2, along
with the mean performance score associated with each instructional approach. Use the analysis
of variance procedure in Section to test the null hypothesis that the three sample means
were obtained from the same population, using the 5 percent level of significance for the test.
Table 7.2:
          School 1   School 2   School 3
          86         90         82
          79         76         68
          81         88         73
          70         82         71
          84         89         81
Mean      80         85         75
Solution
Step 1
H0 : µ1 = µ2 = µ3
H1: Not all of µ1, µ2, and µ3 are equal
Step 2
Critical F (df = k − 1, n − k; α = 0.05) = F (2, 12; α = 0.05) = 3.89
The critical value is obtained from the F-distribution table at the 0.05 (5%) level of significance.
Figure 1:
Step 3
The overall mean of all 15 test scores is $\bar{x} = \frac{\sum x}{n} = \frac{1200}{15} = 80$
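The mean of sample 1 is $\bar{x}_1 = \frac{\sum x_{1i}}{n_1} = \frac{86+79+81+70+84}{5} = 80$
The mean of sample 2 is $\bar{x}_2 = \frac{\sum x_{2i}}{n_2} = \frac{90+76+88+82+89}{5} = 85$
The mean of sample 3 is $\bar{x}_3 = \frac{\sum x_{3i}}{n_3} = \frac{82+68+73+71+81}{5} = 75$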
Therefore, the sample variances needed for the error sum of squares are
$$s_1^2 = \frac{(86-80)^2 + (79-80)^2 + (81-80)^2 + (70-80)^2 + (84-80)^2}{5-1} = 38.5$$
$$s_2^2 = \frac{(90-85)^2 + (76-85)^2 + (88-85)^2 + (82-85)^2 + (89-85)^2}{5-1} = 35.0$$
$$s_3^2 = \frac{(82-75)^2 + (68-75)^2 + (73-75)^2 + (71-75)^2 + (81-75)^2}{5-1} = 38.5$$
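Then the error sum of squares is $SSE = (5-1)(38.5 + 35.0 + 38.5) = 448$, so $MSE = 448/(15-3) = 37.33$.
The treatment sum of squares is $SSTR = 5[(80-80)^2 + (85-80)^2 + (75-80)^2] = 250$, so
$MSTR = 250/(3-1) = 125$. The computed F ratio is $F = MSTR/MSE = 125/37.33 = 3.35$.
Step 4
Since 3.35 is less than the critical value of 3.89, the null hypothesis is not rejected at the
5 percent level of significance.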
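The arithmetic of this example can be checked with a short script. The sketch below is not part of the original notes. It assumes the standard definitions SSTR = Σ n_j (x̄_j − x̄)² and SSE = Σ (n_j − 1) s_j², and it takes the critical value 3.89 from Step 2 rather than computing it. The variable names are illustrative.

```python
import statistics

# Achievement test scores from the worked example, one list per school.
scores = {
    "school_1": [86, 79, 81, 70, 84],
    "school_2": [90, 76, 88, 82, 89],
    "school_3": [82, 68, 73, 71, 81],
}
F_CRITICAL = 3.89  # F(2, 12) at alpha = 0.05, as read from the table in Step 2

k = len(scores)                                   # number of treatments
n = sum(len(group) for group in scores.values())  # total sample size
grand_mean = sum(sum(group) for group in scores.values()) / n

# Treatment sum of squares: weighted squared distance of group means from the grand mean.
sstr = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in scores.values())
# Error sum of squares: (n_j - 1) times each sample variance, summed over groups.
sse = sum((len(g) - 1) * statistics.variance(g) for g in scores.values())

mstr = sstr / (k - 1)
mse = sse / (n - k)
f_ratio = mstr / mse
reject_h0 = f_ratio > F_CRITICAL

print(f"SSTR = {sstr:.1f}, SSE = {sse:.1f}")
print(f"MSTR = {mstr:.2f}, MSE = {mse:.2f}, F = {f_ratio:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With these data the computed F ratio falls below 3.89, so the script reports that the null hypothesis is not rejected.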
Practical Activity
1. Define the meaning of the terms response variable, factor, treatments, and experimental
units.
Solution
2. Explain the assumptions that must be satisfied in order to validly use the one-way
ANOVA formulas.
Solution
3. Explain the difference between the between-treatment variability and the within-
treatment variability when performing a one-way ANOVA.
Solution
• If the one-way ANOVA F test leads us to conclude that at least two of the treatment
means differ, then we wish to investigate which of the treatment means
differ and we wish to estimate how large the differences are.
5. A consumer preference study compares the effects of three different bottle designs (A,
B, and C) on sales of a popular fabric softener. A completely randomized design is
employed. Specifically, 15 supermarkets of equal sales potential are selected, and 5 of
these supermarkets are randomly assigned to each bottle design. The number of bottles
sold in 24 hours at each supermarket is recorded. The data obtained are displayed in
Table below. Let µA , µB , and µC represent mean daily sales using bottle designs A, B,
and C, respectively. Test the null hypothesis that µA, µB, and µC are equal by setting α = .05.
That is, test for statistically significant differences between these treatment means at
the .05 level of significance. Based on this test, can we conclude that bottle designs A,
B, and C have different effects on mean daily sales?
A B C
16 33 23
18 31 27
19 37 21
17 29 28
13 34 25
Solution
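One way to work through the bottle design question is sketched below. It is not the official solution. The helper `one_way_anova` is a hypothetical name, the sums of squares are computed directly from deviations, and the critical value F(2, 12) = 3.89 at α = .05 is the table value already quoted in Example 7.1 for the same degrees of freedom.

```python
# Bottles sold in 24 hours, one list per bottle design (data from the exercise).
sales = {
    "A": [16, 18, 19, 17, 13],
    "B": [33, 31, 37, 29, 34],
    "C": [23, 27, 21, 28, 25],
}
F_CRITICAL = 3.89  # F(2, 12) at alpha = 0.05, taken from the F table


def one_way_anova(groups):
    """Return (SSTR, SSE, SST, F) for an iterable of samples."""
    samples = [list(s) for s in groups]
    all_values = [x for s in samples for x in s]
    n, k = len(all_values), len(samples)
    grand_mean = sum(all_values) / n
    # Between-treatment variation.
    sstr = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)
    # Within-treatment variation.
    sse = sum((x - sum(s) / len(s)) ** 2 for s in samples for x in s)
    # Total variation, computed directly as a check on SST = SSTR + SSE.
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    f_ratio = (sstr / (k - 1)) / (sse / (n - k))
    return sstr, sse, sst, f_ratio


sstr, sse, sst, f_ratio = one_way_anova(sales.values())
print(f"SSTR = {sstr:.2f}, SSE = {sse:.2f}, SST = {sst:.2f}")
print(f"F = {f_ratio:.2f}; reject H0: {f_ratio > F_CRITICAL}")
```

The computed F ratio is far above 3.89, which points to rejecting the hypothesis that the three bottle designs have the same mean daily sales.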
$$\bar{x}_{j.} = \frac{1}{b} \sum_{k=1}^{b} x_{jk}$$
$$\bar{x}_{.k} = \frac{1}{a} \sum_{j=1}^{a} x_{jk}$$
$$\bar{\bar{x}} = \frac{1}{ab} \sum_{j,k} x_{jk}$$
              Block
              1     2     ···   b
Treatment 1   X11   X12   ···   X1b   x̄1.
Treatment 2   X21   X22   ···   X2b   x̄2.
But in this lesson we will for now ignore the interaction: hence,
Example
Table 2.2 gives the daily earnings (in thousands of MK) of fresh graduates with
bachelor's degrees from 5 colleges and for 3 class rankings at graduation. Test at the 5%
level of significance the hypothesis that the means are identical (a) for college populations
and (b) for class-ranking populations.
Table 2.2
Class rank    Bunda   Chanco   Poly   Medicine   Nursing   Sample mean
Top           20      18       16     14         12        16
Middle        19      16       13     12         10        14
Bottom        18      14       10     10         8         12
Sample mean   19      16       13     12         10        14
Solution
H0: µ1 = µ2 = µ3 = µ4 = µ5
H1: µ1, µ2, µ3, µ4, and µ5 are not all equal
where µ refers to the various means for factor B (school) populations. We define each of the
sums of squares and show how they are calculated for the earnings data as follows (note
that a = 3, b = 5):
$$
\begin{aligned}
SSTO &= \sum_{j=1}^{a} \sum_{k=1}^{b} (x_{jk} - \bar{\bar{x}})^2 \\
&= (20-14)^2 + (18-14)^2 + (16-14)^2 + (14-14)^2 + (12-14)^2 \\
&\quad + (19-14)^2 + (16-14)^2 + (13-14)^2 + (12-14)^2 + (10-14)^2 \\
&\quad + (18-14)^2 + (14-14)^2 + (10-14)^2 + (10-14)^2 + (8-14)^2 \\
&= 36 + 16 + 4 + 0 + 4 + 25 + 4 + 1 + 4 + 16 + 16 + 0 + 16 + 16 + 36 \\
&= 194
\end{aligned}
$$
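Step 3: Calculate SS(a), which measures the amount of variability due to the different
levels of factor a (class rankings):
$$
\begin{aligned}
SS(a) &= b \sum_{j=1}^{a} (\bar{x}_{j.} - \bar{\bar{x}})^2 \\
&= 5[(16-14)^2 + (14-14)^2 + (12-14)^2] \\
&= 5(4 + 0 + 4) \\
&= 40
\end{aligned}
$$
Step 4: Calculate SS(b), which measures the amount of variability due to the different
levels of factor b (colleges):
$$
\begin{aligned}
SS(b) &= a[(\bar{x}_{.1} - \bar{\bar{x}})^2 + (\bar{x}_{.2} - \bar{\bar{x}})^2 + (\bar{x}_{.3} - \bar{\bar{x}})^2 + (\bar{x}_{.4} - \bar{\bar{x}})^2 + (\bar{x}_{.5} - \bar{\bar{x}})^2] \\
&= 3[(19-14)^2 + (16-14)^2 + (13-14)^2 + (12-14)^2 + (10-14)^2] \\
&= 3(25 + 4 + 1 + 4 + 16) \\
&= 150
\end{aligned}
$$
Step 5: Calculate SSE, which measures the amount of variability due to the error:
$$SSE = SSTO - SS(a) - SS(b) = 194 - 40 - 150 = 4$$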
These results are summarized in Table 2.3. From the F distribution table, F = 3.84 for
degrees of freedom 4 and 8 and α = 0.05. Since the calculated F = 37.5/0.5 = 75 exceeds
this value, we reject H0 and accept H1, that the population means of fresh graduates'
earnings for the 5 colleges are different.
Table 2.3 Two-Factor ANOVA Table for First-Year Earnings
Variation                                    Sum of squares   Degrees of freedom   Mean square          F
Explained by schools (B) (between columns)   SSB = 150        b − 1 = 4            MSB = 150/4 = 37.5   MSB/MSE = 75
Explained by ranking (A) (between rows)      SSA = 40         a − 1 = 2            MSA = 40/2 = 20      MSA/MSE = 40
Error or unexplained                         SSE = 4          (a − 1)(b − 1) = 8   MSE = 4/8 = 0.5
Total                                        SST = 194        ab − 1 = 14
H0: µ1 = µ2 = µ3
H1: µ1, µ2, and µ3 are not all equal
where µ refers to the various means for factor A (class-ranking) populations. From Table
2.3, the calculated value of F for class ranking is 20/0.5 = 40. Since this is larger than the
tabular value of F = 4.46 for df 2 and 8 and α = 0.05, we reject H0 and accept H1, that the
population means of first-year earnings for the 3 class rankings are different. Thus, the type
of school and class ranking are both statistically significant at the 5% level in explaining
differences in first-year earnings. The preceding analysis implicitly assumes that the effects
of the two factors are additive (i.e., there is no interaction between them).
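The two-factor calculation for this example can be reproduced with the sketch below, which is offered only as a check on the arithmetic and assumes the additive (no interaction) model stated above. The variable names are illustrative. The Nursing column is entered as 12, 10, 8, the values implied by the SSTO arithmetic and by the row means of 16, 14 and 12.

```python
# Earnings by class rank (rows) and college (columns: Bunda, Chanco, Poly, Medicine, Nursing).
# Assumption: Nursing is 12, 10, 8, as implied by the SSTO terms and the row means.
earnings = [
    [20, 18, 16, 14, 12],  # Top
    [19, 16, 13, 12, 10],  # Middle
    [18, 14, 10, 10, 8],   # Bottom
]

a = len(earnings)      # number of class rankings (rows)
b = len(earnings[0])   # number of colleges (columns)

grand_mean = sum(sum(row) for row in earnings) / (a * b)
row_means = [sum(row) / b for row in earnings]
col_means = [sum(row[k] for row in earnings) / a for k in range(b)]

ssto = sum((x - grand_mean) ** 2 for row in earnings for x in row)
ss_rows = b * sum((m - grand_mean) ** 2 for m in row_means)  # class ranking
ss_cols = a * sum((m - grand_mean) ** 2 for m in col_means)  # colleges
sse = ssto - ss_rows - ss_cols                               # unexplained

mse = sse / ((a - 1) * (b - 1))
f_cols = (ss_cols / (b - 1)) / mse
f_rows = (ss_rows / (a - 1)) / mse

print(f"SSTO = {ssto}, SS(rows) = {ss_rows}, SS(cols) = {ss_cols}, SSE = {sse}")
print(f"F for colleges = {f_cols}, F for class ranking = {f_rows}")
```

Both F ratios are well above the tabular values of 3.84 and 4.46 quoted in the text, in line with the conclusion that both factors are significant at the 5% level.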
Activity:
1. Table 2.4 gives the km per litre of petrol for 4 different filling stations in Lilongwe for
5 days. Assume that the km per litre for each filling station is normally distributed
with equal variance. Should the hypothesis of equal population means be accepted or
rejected at the 5% level of significance?
Table 2.4
Filling station 1 Filling station 2 Filling station 3 Filling station 4
12 12 16 17
11 14 14 15
12 13 15 17
13 15 13 16
11 14 14 18
Answer: Rejected
2. Table 2.5 gives the miles per litre of petrol for each of 4 different filling stations and
3 types of car (heavy, medium, and light) in a completely randomized design. Should
the hypothesis be accepted at the 1% level of significance that the population means
are the same for each (a) filling station? (b) Type of car?
Table 2.5
Type of Filling station 1 Filling station 2 Filling station 3 Filling station 4
Car
Heavy 8 9 9 10
Medium 16 15 18 17
Light 24 26 28 30
Answer: (a) Yes (b) No
3. Table 2.6 gives sales data for groundnuts with each of 3 different packagings and 4 different
varieties in a completely randomized design. Should the hypothesis be
accepted at the 5% level of significance that the population means are the same for
each (a) packaging? (b) variety?
Table 2.6 Groundnut sales for 3 package wrappings and 4 varieties
             Packaging 1   Packaging 2   Packaging 3
Manipinta    87            78            90
Chalimbana   79            79            84
Kalisere     83            81            91
CG7          85            83            89
Answer: (a) No (b) Yes
4. The table below gives the outputs of an experimental farm that used each of four fertilizers
and three pesticides such that each plot of land had an equal probability of receiving
each fertilizer-pesticide combination (completely randomized design).
a. Find the average output for each fertilizer $\bar{X}_{.j}$, for each pesticide $\bar{X}_{i.}$, and for
the sample as a whole $\bar{\bar{X}}$.
b. Find the total sum of squares, SST, the sum of squares for fertilizer or factor
A, SSA, for pesticides or factor B, SSB, and for the error or unexplained residual,
SSE.
c. Find the degrees of freedom for SSA, SSB, SSE, and SST.
d. Find MSA, MSB, MSE, MSA/MSE, and MSB/MSE.