Statistics 502 Lecture Notes
Peter D. Hoff
December 9, 2009
Contents

1 Principles of experimental design
  1.1 Induction
  1.2 Model of a process or system
  1.3 Experiments and observational studies
  1.4 Steps in designing an experiment

2 Test statistics and randomization distributions

3 Tests based on population models

4 Confidence intervals and power

5 Introduction to ANOVA
  5.1 A model for treatment variation
      5.1.1 Model Fitting
      5.1.2 Testing hypotheses with MSE and MST
  5.2 Partitioning sums of squares
      5.2.1 The ANOVA table
      5.2.2 Understanding Degrees of Freedom
      5.2.3 More sums of squares geometry
  5.3 Unbalanced Designs
      5.3.1 Sums of squares and degrees of freedom
      5.3.2 ANOVA table for unbalanced data
  5.4 Normal sampling theory for ANOVA
      5.4.1 Sampling distribution of the F-statistic
      5.4.2 Comparing group means
      5.4.3 Power calculations for the F-test
  5.5 Model diagnostics
      5.5.1 Detecting violations with residuals
      5.5.2 Checking normality assumptions
      5.5.3 Checking variance assumptions
      5.5.4 Variance stabilizing transformations
  5.6 Treatment Comparisons
      5.6.1 Contrasts
      5.6.2 Orthogonal Contrasts
      5.6.3 Multiple Comparisons
      5.6.4 False Discovery Rate procedures
      5.6.5 Nonparametric tests

6 Factorial Designs
  6.1 Data analysis
  6.2 Additive effects model
  6.3 Evaluating additivity
  6.4 Inference for additive treatment effects
  6.5 Randomized complete block designs
  6.6 Unbalanced designs
  6.7 Non-orthogonal sums of squares
  6.8 Analysis of covariance
  6.9 Types of sums of squares

7 Nested Designs
  7.1 Mixed-effects approach
  7.2 Repeated measures analysis
List of Figures

1.1  Model of a variable process
2.3  Histograms and empirical CDFs of the first two hypothetical samples
2.4  Randomization distributions for the t and KS statistics for the first example
2.5  Histograms and empirical CDFs of the second two hypothetical samples
2.6  Randomization distributions for the t and KS statistics for the second example
5.3  Coagulation data
5.4  F-distributions
5.5  Normal-theory and randomization distributions of the F-statistic
5.6  Power as a function of n for m = 4, α = 0.05 and τ²/σ² = 1
5.7  Power as a function of n for m = 4, α = 0.05 and τ²/σ² = 2
5.8  Normal scores plots of normal samples, with n ∈ {20, 50, 100}
5.9  Crab data
5.10 Crab residuals
5.11 Fitted values versus residuals
5.12 Data and log data
5.13 Diagnostics after the log transformation
5.14 Mean-variance relationship of the transformed data
5.15 Yield-density data
6.1  Marginal plots
6.2  Conditional plots
6.3  Cell plots
6.4  Mean-variance relationship
6.5  Mean-variance relationship for transformed data
6.6  Plots of transformed poison data
6.7  Comparison between types I and II, without respect to delivery
6.8  Comparison between types I and II, with delivery in color
6.9  Marginal plots of the data
6.10 Three datasets exhibiting non-additive effects
6.11 Experimental material in need of blocking
6.12 Results of the experiment
6.13 Marginal plots, and residuals without controlling for row
6.14 Marginal plots for pain data
6.15 Interaction plots for pain data
6.16 Oxygen uptake data
6.17 ANOVA and ANCOVA fits to the oxygen uptake data
6.18 Unbalanced design: Controlling eliminates effect
6.19 Unbalanced design: Controlling highlights effect
7.4  Potato data
7.5  Sitka spruce data
7.6  ANCOVA fit and residuals
7.7  Within-tree dependence
7.8  Reduction to tree-specific summary statistics
Chapter 1

Principles of experimental design

1.1 Induction

(Figure 1.1: Model of a variable process, with inputs x1 and x2 feeding into a process.)

1.2 Model of a process or system

1.3 Experiments and observational studies

Example: 16,608 women randomized to either
    x = 1 (estrogen treatment)
    x = 0 (no estrogen treatment)

correlation + randomization ⇒ causation

1.4 Steps in designing an experiment

(Illustration: experimental material with regions of bad soil and good soil.)
Chapter 2
Test statistics and
randomization distributions
Example: Wheat yield
Question: Is one fertilizer better than another, in terms of yield?
Outcome variable: Wheat yield.
Factor of interest: Fertilizer type, A or B. One factor having two levels.
Experimental material: One plot of land to be divided into 2 rows of 6
subplots each.
2.1

Summaries of a sample y1, ..., yn:

The sample mean is

    ȳ = (1/n) Σ_{i=1}^n yi .

To find the median, sort the data in increasing order, and call these values y(1), ..., y(n). If there are no ties, then

    if n is odd, then y((n+1)/2) is the median;
    if n is even, then the average of y(n/2) and y(n/2+1) is the median.

The sample variance is

    s² = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)² ,

and s = √s² is the sample standard deviation. Quantile-based summaries include the interquantile ranges:

    [y.25, y.75] (interquartile range)
    [y.025, y.975] (95% interval)
Example: Wheat yield
All of these sample summaries are easily obtained in R:
> yA<-c(11.4, 23.7, 17.9, 16.5, 21.1, 19.6)
> yB<-c(26.9, 26.6, 25.3, 28.5, 14.2, 24.3)
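The same summaries are easy to reproduce outside of R. The following Python sketch (variable names hypothetical) computes them for the wheat data with only the standard library:

```python
# Sketch: summary statistics for the wheat yield data, Python stdlib only.
import statistics

yA = [11.4, 23.7, 17.9, 16.5, 21.1, 19.6]
yB = [26.9, 26.6, 25.3, 28.5, 14.2, 24.3]

def summarize(y):
    """Return the mean, median, sample variance and sample sd of y."""
    return {
        "mean":   statistics.mean(y),
        "median": statistics.median(y),
        "var":    statistics.variance(y),  # divides by n - 1, like R's var()
        "sd":     statistics.stdev(y),
    }

sA, sB = summarize(yA), summarize(yB)
```

For these data the A-sample has mean about 18.37 and variance about 17.93, matching the values used later in the chapter.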
(Figure: histograms and empirical CDFs of the yield data yA and yB.)
2.2
Questions:
Could the observed differences be due to fertilizer type?
Could the observed differences be due to plot-to-plot variation?
Hypothesis tests:
H0 (null hypothesis): Fertilizer type does not affect yield.
H1 (alternative hypothesis): Fertilizer type does affect yield.
A statistical hypothesis test evaluates the plausibility of H0 in light of the
data.
Suppose we are interested in mean wheat yields. We can evaluate H0 by
answering the following questions:
Is a mean difference of 5.93 plausible/probable if H0 is true?
Is a mean difference of 5.93 large compared to experimental noise?
To answer the above, we need to compare
{|ȳB − ȳA| = 5.93}, the observed difference in the experiment, to

the differences that could arise under H0 across the 924 equally likely treatment assignments.

(Figure: randomization distributions of ȲB − ȲA and |ȲB − ȲA|.)

For any value x, Pr(g ≤ x | H0) = #{gk ≤ x}/924, where g1, ..., g924 are the values of the test statistic under each possible treatment assignment.
This distribution is sometimes called the randomization distribution, because it is obtained by the randomization scheme of the experiment.
Comparing data to the null distribution:
Is there any contradiction between H0 and our data?
Pr(g(YA, YB) ≥ 5.93 | H0) = 0.056
According to this calculation, observing a mean difference of 5.93 or more is unlikely under the null hypothesis. This probability is called a p-value. Generically, a p-value is
The probability, under the null hypothesis, of obtaining a result as or more
extreme than the observed result.
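The randomization p-value described above can be computed exactly by brute force. The following Python sketch enumerates all C(12,6) = 924 ways to assign six of the twelve plots to fertilizer B and compares each resulting |mean difference| to the observed one (names hypothetical):

```python
# Sketch: exact randomization p-value for the wheat experiment.
from itertools import combinations

y = [11.4, 23.7, 17.9, 16.5, 21.1, 19.6,   # observed A plots
     26.9, 26.6, 25.3, 28.5, 14.2, 24.3]   # observed B plots

def mean(v):
    return sum(v) / len(v)

g_obs = abs(mean(y[6:]) - mean(y[:6]))     # observed statistic, about 5.93

null = []
for idx in combinations(range(12), 6):     # plots assigned to B
    B = [y[i] for i in idx]
    A = [y[i] for i in range(12) if i not in idx]
    null.append(abs(mean(B) - mean(A)))

# proportion of assignments as or more extreme than the observed one
p_value = sum(g >= g_obs - 1e-9 for g in null) / len(null)
```

Since the responses are held fixed and only the labels are permuted, this is the exact null distribution, with no simulation error.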
2.3
2.4
In the previous section we said that the test statistic g(y) should be able
to differentiate between H0 and H1 in ways that are scientifically relevant.
What does this mean?
Suppose our data consist of samples yA and yB from two populations A and
B. Previously we used g(yA, yB) = |ȳB − ȳA|. Let's consider two different
test statistics:
t-statistic:

    gt(yA, yB) = (ȳB − ȳA) / (sp √(1/nA + 1/nB)),

where sp² = [(nA − 1)sA² + (nB − 1)sB²]/(nA + nB − 2) is the pooled sample variance. This is a scaled version of our previous test statistic, in which we compare the difference in sample means to a pooled version of the sample standard deviation and the sample size. Note that this statistic is

    increasing in |ȳB − ȳA|;
    increasing in nA and nB;
    decreasing in sp.
A more complete motivation for using this statistic will be given in the
next chapter.
Kolmogorov-Smirnov statistic:
    gKS(yA, yB) = max over y ∈ R of |F̂B(y) − F̂A(y)|
This is just the size of the largest gap between the two sample CDFs.
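Because the empirical CDFs are step functions that only jump at observed values, the maximum gap can be found by checking every data point. A minimal Python sketch (function names hypothetical):

```python
# Sketch: two-sample Kolmogorov-Smirnov statistic, checked at every data point.
def ecdf(sample):
    """Return the empirical CDF of a sample as a function of y."""
    s = sorted(sample)
    n = len(s)
    return lambda y: sum(v <= y for v in s) / n

def g_ks(yA, yB):
    """Largest vertical gap between the two empirical CDFs."""
    FA, FB = ecdf(yA), ecdf(yB)
    return max(abs(FB(y) - FA(y)) for y in yA + yB)
```

For example, two completely separated samples give the maximum possible value 1, while identical samples give 0.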
Comparing the test statistics:
Suppose we perform a CRD and obtain samples yA and yB like those in
Figure 2.3. For these data,
nA = nB = 40
    ȳA = 10.05, ȳB = 9.70
    sA = 0.87, sB = 2.07
The main difference between the two samples seems to be in their variances
and not in their means. Now let's consider evaluating
H0 : treatment does not affect response
using our two new test statistics. We can approximate the null distributions
of gt (YA , YB ) and gKS (YA , YB ) by randomly reassigning the treatments but
leaving the responses fixed:
Figure 2.3: Histograms and empirical CDFs of the first two hypothetical
samples.
Gsim<-NULL
for(s in 1:5000)
{
  xsim<-sample(x)
  yAsim<-y[xsim=="A"] ; yBsim<-y[xsim=="B"]
  g1<-g.tstat(yAsim, yBsim)
  g2<-g.ks(yAsim, yBsim)
  Gsim<-rbind(Gsim, c(g1, g2))
}
Figure 2.4: Randomization distributions for the t and KS statistics for the
first example.
    ȳA = 10.11, ȳB = 10.73
    sA = 1.75, sB = 1.85
The difference in sample means is about twice as large as in the previous example, and the sample standard deviations are pretty similar. The B-samples
are slightly larger than the A-samples on average. Is there evidence that this
is caused by treatment? Again, we evaluate H0 using the randomization
distributions of our two test statistics.
Figure 2.5: Histograms and empirical CDFs of the second two hypothetical
samples.
Figure 2.6: Randomization distributions for the t and KS statistics for the
second example.
In this case H0 and H1 are not complementary, and we are only interested
in evidence against H0 of a certain type, i.e. evidence that is consistent with
H1 . In this situation we may want to use a statistic like gt .
2.5

A decision based on a hypothesis test can be right or wrong in two ways each:

                accept H0            reject H0
    H0 true     correct decision     type I error
    H0 false    type II error        correct decision

As we discussed,

    the p-value can be used as a measure of evidence against H0;
    the smaller the p-value, the stronger the evidence against H0.

Decision procedure:

1. Compute the p-value by comparing the observed test statistic to the null distribution.
2. Reject H0 if the p-value ≤ α; otherwise accept H0.

With this procedure,

    Pr(type I error) = Pr(p-value ≤ α | H0 true) = α,

while the probability of a type II error depends on the true parameter values and is generally unknown.
Chapter 3
Tests based on population
models
3.1
If the experiment is
complicated,
non- or partially randomized, or
includes nuisance factors
then a null distribution based on randomization may be difficult to obtain.
An alternative approach to hypothesis testing is based on formulating a sampling model.
Consider the following model for our wheat yield experiment:
There is a large/infinite population of plots of similar size/shape/composition as the plots in our experiment.
When A is applied to these plots, the distribution of plot yields can be represented by a probability distribution pA with

    expectation: E[YA] = ∫ y pA(y) dy = μA,
    variance:    Var[YA] = E[(YA − μA)²] = σA².
The sample mean is then an unbiased estimator of μA:

    E[ȲA] = E[(1/nA) Σ_{i=1}^{nA} Yi,A]
          = (1/nA) Σ_{i=1}^{nA} E[Yi,A]
          = (1/nA) Σ_{i=1}^{nA} μA = μA.
(Figure: population distributions and experimental samples. Random sampling from pA gave ȳA = 18.37, sA = 4.23; random sampling from pB gave ȳB = 24.30, sB = 5.15.)
3.2
The central limit theorem: if

    X1 ∼ P1(μ1, σ1²), X2 ∼ P2(μ2, σ2²), ..., Xm ∼ Pm(μm, σm²)

are independent, then

    Σ_{j=1}^m Xj is approximately normal(Σ_{j=1}^m μj, Σ_{j=1}^m σj²).
Sums of varying quantities are approximately normally distributed.
Normally distributed data
Consider crop yields from plots of land:
    Yi = a1 × seedi + a2 × soili + a3 × wateri + a4 × suni + · · ·
The empirical distribution of crop yields from a population of fields with varying quantities of seed, soil, water, sun, etc. will be approximately normal(μ, σ²), where μ and σ depend on the effects a1, a2, a3, a4, ... and the variability of seed, soil, water, sun, etc.
Additive effects ⇒ normally distributed data
Normally distributed means

Consider the following scenario:

    Experiment 1: sample y1^(1), ..., yn^(1) and compute ȳ^(1);
    Experiment 2: sample y1^(2), ..., yn^(2) and compute ȳ^(2);
    ...
    Experiment m: sample y1^(m), ..., yn^(m) and compute ȳ^(m).

A histogram of {ȳ^(1), ..., ȳ^(m)} will look approximately normally distributed, with

    sample mean{ȳ^(1), ..., ȳ^(m)} ≈ μ
    sample variance{ȳ^(1), ..., ȳ^(m)} ≈ σ²/n,

i.e. the sampling distribution of the mean is approximately normal(μ, σ²/n), even if the sampling distribution of the data is not normal.
Basic properties of the normal distribution:

    Y ∼ normal(μ, σ²) ⇒ aY + b ∼ normal(aμ + b, a²σ²);
    Y1 ∼ normal(μ1, σ1²), Y2 ∼ normal(μ2, σ2²), Y1, Y2 independent ⇒ Y1 + Y2 ∼ normal(μ1 + μ2, σ1² + σ2²);
    if Y1, ..., Yn i.i.d. normal(μ, σ²), then Ȳ is statistically independent of s².
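The claim that the mean of a sample is approximately normal(μ, σ²/n) even for non-normal data is easy to check by simulation. A Python sketch, using skewed exponential data with μ = σ = 1 (the setup and variable names are illustrative assumptions):

```python
# Sketch: sampling distribution of the mean for non-normal (exponential) data.
# With rate 1, the population has mu = 1 and sigma = 1.
import random
import statistics

random.seed(1)
n, m = 30, 5000   # n observations per experiment, m repeated experiments

means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(m)]

center = statistics.mean(means)   # should be close to mu = 1
spread = statistics.stdev(means)  # should be close to sigma/sqrt(n)
```

Even though each individual sample is strongly skewed, the collection of sample means is centered near μ with standard deviation near σ/√n.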
How does this help with hypothesis testing?
Consider testing H0: μA = μB (treatment doesn't affect the mean). Then regardless of the distribution of the data, under H0:

    ȲA ≈ normal(μ, σA²/nA)
    ȲB ≈ normal(μ, σB²/nB)
    ȲB − ȲA ≈ normal(0, σ²_AB),

where σ²_AB = σA²/nA + σB²/nB. So if we knew the variances, we'd have a null distribution.
3.3
    H0: μ = μ0
    H1: μ ≠ μ0
Examples:
Physical therapy
Yi = muscle strength after treatment - muscle score before.
H0 : E[Yi ] = 0
Physics
Yi = boiling point of a sample of an unknown liquid.
H0: E[Yi] = 100 °C.
To test H0 , we need a test statistic and its distribution under H0 .
|ȳ − μ0| might make a good test statistic:
it is sensitive to deviations from H0 .
its sampling distribution is approximately known:
    E[Ȳ] = μ;
    Var[Ȳ] = σ²/n;
    Ȳ is approximately normal.
Under H0,

    (Ȳ − μ0) ∼ normal(0, σ²/n),  so  f(Y) = (Ȳ − μ0)/(σ/√n)

is approximately standard normal, and we write f(Y) ∼ normal(0, 1). Since this distribution contains no unknown parameters we could potentially use it as a null distribution. However, having observed the data y, is f(y) a statistic?
ȳ is computable from the data and n is known, but σ is unknown, so f(y) is not a statistic. Replacing σ with the sample standard deviation s gives the one-sample t-statistic

    t(Y) = (Ȳ − μ0)/(s/√n).

Since s ≈ σ, we have s/√n ≈ σ/√n, so t(Y) should still be approximately standard normal, at least for large n.
The χ² distribution:

    Z1, ..., Zn i.i.d. normal(0, 1) ⇒ Σ (Zi − Z̄)² ∼ χ²_{n−1};
    Y1, ..., Yn i.i.d. normal(μ, σ²) ⇒ (Y1 − μ)/σ, ..., (Yn − μ)/σ i.i.d. normal(0, 1), so

    (1/σ²) Σ (Yi − μ)² ∼ χ²_n  and  (1/σ²) Σ (Yi − Ȳ)² ∼ χ²_{n−1}.
(Figure: χ² densities p(X) for n = 9, 10, 11.)
(Figure: t densities p(t) for n = 3, 6, 12, ∞.)
Putting these facts together: if Y1, ..., Yn are i.i.d. normal(μ, σ²), then

    √n (Ȳ − μ)/σ ∼ normal(0, 1);
    (n − 1)s²/σ² ∼ χ²_{n−1};
    Ȳ and s² are independent.

Let Z = √n (Ȳ − μ)/σ and X = (n − 1)s²/σ². Then

    Z / √(X/(n − 1)) = [√n (Ȳ − μ)/σ] / √(s²/σ²) = (Ȳ − μ)/(s/√n) ∼ t_{n−1},

and so in particular

    (Ȳ − μ0)/(s/√n) ∼ t_{n−1}  if E[Y] = μ0.
The two-sided p-value is then

    Pr(|t(Y)| ≥ |t(y)| | H0) = Pr(|T_{n−1}| ≥ |t(y)|)
                             = 2 × Pr(T_{n−1} ≥ |t(y)|)
                             = 2 × (1 − pt(tobs, n − 1)),

which can be computed in R with t.test(y, mu=mu0).
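The statistic itself requires nothing beyond the sample mean and standard deviation. A small Python sketch (the p-value would then come from the t distribution with n − 1 d.f., as in R's pt; names here are hypothetical):

```python
# Sketch: the one-sample t-statistic t(y) = (ybar - mu0)/(s/sqrt(n)).
import math
import statistics

def t_stat(y, mu0):
    """One-sample t-statistic for H0: E[Y] = mu0."""
    n = len(y)
    ybar = statistics.mean(y)
    s = statistics.stdev(y)          # sample sd, divides by n - 1
    return (ybar - mu0) / (s / math.sqrt(n))
```

A sample centered exactly at μ0 gives a statistic near zero, while samples far from μ0 (relative to s/√n) give large |t|.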
3.4
    Y1,A, ..., YnA,A i.i.d. normal(μA, σ²)
    Y1,B, ..., YnB,B i.i.d. normal(μB, σ²).
In addition to normality we assume for now that both variances are equal.
Hypotheses: H0: μA = μB;  H1: μA ≠ μB.
Recall that

    ȲB − ȲA ∼ normal( μB − μA, σ²(1/nA + 1/nB) ),

so under H0,

    ȲB − ȲA ∼ normal( 0, σ²(1/nA + 1/nB) ).

Estimating σ² with the pooled sample variance sp² = [(nA − 1)sA² + (nB − 1)sB²]/(nA + nB − 2) gives the two-sample t-statistic and its null distribution:

    (ȲB − ȲA) / (sp √(1/nA + 1/nB)) ∼ t_{nA+nB−2}.
Self-check exercises:

1. Show that (nA + nB − 2)sp²/σ² ∼ χ²_{nA+nB−2} (recall how the χ² distribution was defined).
2. Show that the two-sample t-statistic has a t-distribution with nA + nB − 2 d.f.
Data:
    ȳA = 18.36, sA² = 17.93, nA = 6
    ȳB = 24.30, sB² = 26.54, nB = 6
t-statistic:
    sp² = 22.24, sp = 4.72
    t(yA, yB) = 5.93 / (4.72 √(1/6 + 1/6)) = 2.18
Inference:
Hence the p-value = Pr(|T10| ≥ 2.18) = 0.054.
Hence H0: μA = μB is not rejected at level α = 0.05.
> t.test(y[x=="A"], y[x=="B"], var.equal=TRUE)

        Two Sample t-test

data:  y[x == "A"] and y[x == "B"]
t = -2.1793, df = 10, p-value = 0.05431
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.999621   0.132954
sample estimates:
mean of x mean of y
 18.36667  24.30000
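The pooled t-statistic is straightforward to reproduce from the raw data. A Python sketch of the same computation R performs with var.equal=TRUE (names hypothetical):

```python
# Sketch: pooled two-sample t-statistic for the wheat data.
import math
import statistics

yA = [11.4, 23.7, 17.9, 16.5, 21.1, 19.6]
yB = [26.9, 26.6, 25.3, 28.5, 14.2, 24.3]

nA, nB = len(yA), len(yB)
s2A, s2B = statistics.variance(yA), statistics.variance(yB)

# pooled variance: a weighted average of the group variances
s2p = ((nA - 1) * s2A + (nB - 1) * s2B) / (nA + nB - 2)
sp = math.sqrt(s2p)

t = (statistics.mean(yB) - statistics.mean(yA)) / (sp * math.sqrt(1/nA + 1/nB))
df = nA + nB - 2   # compare |t| to the t distribution with 10 d.f.
```

This recovers sp² ≈ 22.24 and |t| ≈ 2.18, matching the values above (R reports the sign as negative because it computes mean(x) − mean(y)).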
Always keep in mind where the p-value comes from: See Figure 3.4.
Comparison to the randomization test:
Recall that we have already compared the two-sample t-statistic to its randomization distribution. Samples from the randomization distribution were obtained as follows:
1. Sample a treatment assignment according to the randomization scheme.
2. Compute the value of t(YA , YB ) under this treatment assignment and
assuming the null hypothesis.
(Figure 3.4: the null t density p(T); the p-value is the tail probability beyond ±|tobs|. The randomization analogue over S simulated assignments is #{|t^(s)| ≥ |tobs|}/S.)
(Figure: randomization distribution of t(YA, YB).)
3.5
Checking assumptions
In deriving the null distribution of

    t(YA, YB) = (ȲB − ȲA) / (sp √(1/nA + 1/nB)),
we showed that if Y1,A , . . . , YnA ,A and Y1,B , . . . , YnB ,B are independent samples
from pA and pB respectively, and
(a) μA = μB
(b) σA² = σB²
(c) pA and pB are normal distributions,
then
    t(YA, YB) ∼ t_{nA+nB−2}.
So our null distribution really assumes conditions (a), (b) and (c). Thus if
we perform a level-α test and reject H0, we are really just rejecting that (a),
(b), (c) are all true.
For this reason, we will often want to check whether conditions (b) and (c) are plausibly met. If

    (b) is met,
    (c) is met, and
    H0 is rejected,

then this is evidence that H0 is rejected because μA ≠ μB.
3.5.1

Checking normality

A normal scores (quantile-quantile) plot compares the sorted sample y(1), ..., y(nA) to the corresponding quantiles of the standard normal distribution: the k-th normal score is the value zk for which

    Pr(Z ≤ zk) = (k − 1/2)/nA.

The −1/2 is a continuity correction, which keeps the probabilities strictly between 0 and 1 (at k = nA we use (nA − 1/2)/nA rather than 1). If the data are approximately normal, the plot of y(k) versus zk will be approximately linear.
3.5.2
Unequal variances
(Figure: normal quantile-quantile plots, Sample Quantiles versus Theoretical Quantiles, for several small samples.)
This may not sound very convincing. In later sections, we will show how
to perform formal hypothesis tests for equal variances. However, this won't completely solve the problem. If variances do seem unequal we have a variety
of options available:
use the randomization null distribution;
transform the data to stabilize the variances (to be covered later);
use a modified t-test that allows unequal variance.
The modified t-statistic is

    tw(yA, yB) = (ȳB − ȳA) / √(sA²/nA + sB²/nB).
This statistic looks pretty reasonable, and for large nA and nB its null distribution will indeed be a normal(0, 1) distribution. However, the exact null distribution is only approximately a t-distribution, even if the data are actually normally distributed. The t-distribution we compare tw to is a t_{νw} distribution, where the degrees of freedom νw are given by
    νw = (sA²/nA + sB²/nB)² / [ (1/(nA − 1))(sA²/nA)² + (1/(nB − 1))(sB²/nB)² ].

This is known as Welch's approximation; it may not give an integer as the degrees of freedom.
This t-distribution is not, in fact, the exact sampling distribution of tw(yA, yB) under the null hypothesis that μA = μB and σA² ≠ σB². This is because the null distribution depends on the ratio of the unknown variances, σA² and σB². This difficulty is known as the Behrens-Fisher problem.
Which two-sample t-test to use?
If the sample sizes are the same (nA = nB ) then the test statistics
tw (y A , y B ) and t(y A , y B ) are the same; however the degrees of freedom
used in the null distribution will be different unless the sample standard
deviations are the same.
If nA > nB, but σA² < σB², and μA = μB, then the two-sample test based on comparing t(yA, yB) to a t-distribution on nA + nB − 2 d.f. will reject more than 5% of the time.
If the null hypothesis that both the means and variances are equal,
i.e.
    H0: μA = μB and σA² = σB²
is scientifically relevant, then we are computing a valid p-value, and this higher rejection rate is a good thing, since when the variances are unequal the null hypothesis is false.
If however, the hypothesis that is most scientifically relevant is
    H0: μA = μB
without placing any restrictions on the variances, then the higher
rejection rate in the test that assumes the variances are the same
could be very misleading, since p-values may be smaller than they are under the correct null distribution (in which σA² ≠ σB²).
Likewise we will underestimate the probability of type I error.
If nA > nB and σA² > σB², then the p-values obtained from the test using t(yA, yB) will tend to be conservative (i.e., larger) than those obtained with tw(yA, yB).
In short: one should be careful about applying the test based on t(yA , yB ) if
the sample standard deviations appear very different, and it is not reasonable
to assume equal means and variances under the null hypothesis.
Chapter 4
Confidence intervals and power
4.1
Recall that

    H0: E[Y] = μ0 is rejected if √n |(ȳ − μ0)/s| > t_{1−α/2};
    H0: E[Y] = μ0 is not rejected if √n |(ȳ − μ0)/s| ≤ t_{1−α/2}, i.e. if

    |ȳ − μ0| ≤ (s/√n) t_{1−α/2}, i.e. if

    ȳ − t_{1−α/2} s/√n ≤ μ0 ≤ ȳ + t_{1−α/2} s/√n.

If μ0 satisfies this last line, then it is in the acceptance region. Otherwise it is in the rejection region. In other words, plausible values of μ are in the interval

    ȳ ± t_{1−α/2} s/√n.

We say this interval is a 100 × (1 − α)% confidence interval for μ. This interval contains only those values of μ that are not rejected by this level-α test.
The coverage probability of this interval is

    Pr(μ0 in interval | E[Y] = μ0) = Pr( √n |(Ȳ − μ0)/s| ≤ t_{1−α/2} | E[Y] = μ0 )
                                   = Pr( |T_{n−1}| ≤ t_{1−α/2} )
                                   = 1 − α.
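As a worked example, the interval can be computed for the A-plot wheat yields. This Python sketch hardcodes the critical value t_{.975,5} ≈ 2.571 (qt(.975, 5) in R); names are hypothetical:

```python
# Sketch: 95% confidence interval ybar +/- t_{.975, n-1} * s/sqrt(n)
# for the A-plot wheat yields.
import math
import statistics

yA = [11.4, 23.7, 17.9, 16.5, 21.1, 19.6]
n = len(yA)
ybar = statistics.mean(yA)
s = statistics.stdev(yA)

t_crit = 2.571                      # t_{.975, 5}, from t tables
half_width = t_crit * s / math.sqrt(n)
ci = (ybar - half_width, ybar + half_width)
```

Values of μ0 inside this interval are exactly those that a level-0.05 t-test would not reject for these data.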
4.2
    H1: μA ≠ μB
Remember, the critical value t_{1−α/2, nA+nB−2} above which we reject the null hypothesis was computed from the null distribution.
However, now we want to work out the probability of getting a value of
the t-statistic greater than this critical value, when a specific alternative
hypothesis is true. Thus we need to compute the distribution of our t-statistic under the specific alternative hypothesis.
If we suppose Y1,A, ..., YnA,A i.i.d. normal(μA, σ²) and Y1,B, ..., YnB,B i.i.d. normal(μB, σ²), where μB − μA = δ, then to calculate the power we need to know the distribution of

    t(YA, YB) = (ȲB − ȲA) / (sp √(1/nA + 1/nB)).

We know that if μB − μA = δ then

    (ȲB − ȲA − δ) / (sp √(1/nA + 1/nB)) ∼ t_{nA+nB−2},

but unfortunately

    t(YA, YB) = (ȲB − ȲA − δ) / (sp √(1/nA + 1/nB)) + δ / (sp √(1/nA + 1/nB)).    (4.1)

The first part in the above equation has a t-distribution, which is centered around zero. The second part moves the t-statistic away from zero by an amount that depends on the pooled sample variance. For this reason, we call the distribution of the t-statistic under μB − μA = δ the non-central t-distribution. In this case, we write
(Figure 4.1: densities of the non-central t-distribution for γ = 0, 1, 2.)
4.2.1
Since E[sp]/σ ≈ 1 (and → 1 as the sample sizes grow), the expectation of the t-statistic under the alternative is approximately

    E[t(YA, YB) | μB − μA = δ] ≈ δ / ( σ √(1/nA + 1/nB) ).

Also,

    (ȲB − ȲA) / ( σ √(1/nA + 1/nB) ) ∼ normal( δ / ( σ √(1/nA + 1/nB) ), 1 ).

We also know that for large values of nA, nB, we have s ≈ σ, so the non-central t-distribution will (for large enough nA, nB) look approximately normal with

    mean δ / ( σ √(1/nA + 1/nB) );
    standard deviation 1.
Another way to get the same result is to refer back to the expression for the t-statistic given in 4.1:

    t(YA, YB) = (ȲB − ȲA − δ)/(sp √(1/nA + 1/nB)) + δ/(sp √(1/nA + 1/nB))
              =            a_{nA,nB}              +          b_{nA,nB}.

The first term a_{nA,nB} has a t-distribution, and becomes standard normal as nA, nB → ∞. As for b_{nA,nB}, since sp² → σ² as nA or nB → ∞, we have

    b_{nA,nB} / [ δ/(σ √(1/nA + 1/nB)) ] → 1 as nA, nB → ∞.

4.2.2
Under the null (μB − μA = 0), the rejection probability is the level of the test:

    Pr(p-value ≤ α | μB − μA = 0) = Pr( |t(YA, YB)| ≥ t_{1−α/2, nA+nB−2} | H0 )
                                  = Pr( |T_{nA+nB−2}| ≥ t_{1−α/2, nA+nB−2} ) = α.

Under the alternative μB − μA = δ, writing tc = t_{1−α/2, nA+nB−2} and letting T* denote the non-central t random variable, the power is

    Pr( |t(YA, YB)| > tc | μB − μA = δ ) = Pr(|T*| > tc)
                                         = Pr(T* > tc) + Pr(T* < −tc)
                                         = [1 − Pr(T* < tc)] + Pr(T* < −tc).
(Figure 4.2: central (γ = 0) and non-central (γ = 1) t densities.)
Here the noncentrality parameter is

    γ = δ / ( σ √(1/nA + 1/nB) ).
We will want to make this calculation in order to see if our sample size is
sufficient to have a reasonable chance of rejecting the null hypothesis. If we
have a rough idea of and 2 we can evaluate the power using this formula.
t.crit<- qt(1-alpha/2, nA+nB-2)
t.gamma<- delta/(sigma*sqrt(1/nA + 1/nB))
t.power<- 1 - pt(t.crit, nA+nB-2, ncp=t.gamma) +
              pt(-t.crit, nA+nB-2, ncp=t.gamma)
When you do these calculations you should think of Figure 4.2. Letting T* and T be non-central and central t-distributed random variables respectively, make sure you can relate the following probabilities to the figure:

    Pr(T* > tc)
    Pr(T* < −tc)
    Pr(T > tc)
    Pr(T < −tc)

Note that if the power Pr(|T*| > tc) is large, then one of Pr(T* > tc) or Pr(T* < −tc) will be very close to zero.
Approximating the power

Recall that for large nA, nB,

    t(YA, YB) ≈ normal(γ, 1).

The normal approximation to the power is thus given by

    Pr(|X| > tc) = [1 − Pr(X < tc)] + Pr(X < −tc),

where X ∼ normal(γ, 1). This can be computed in R as
t.norm.power<- 1 - pnorm(t.crit, mean=t.gamma) + pnorm(-t.crit, mean=t.gamma)
This will be a reasonable approximation for large nA , nB . It may be an overestimate or under-estimate of the power obtained from the t-distribution.
Finally, keep in mind that in our calculations we have assumed that the
variances of the two populations are equal.
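The normal approximation needs only the standard normal CDF, so it can be sketched with the Python standard library (function name hypothetical):

```python
# Sketch: normal approximation to the power. With X ~ normal(gamma, 1),
# power ~= [1 - Pr(X < tc)] + Pr(X < -tc).
from statistics import NormalDist

def norm_power(gamma, tc):
    """Approximate power of a level-alpha two-sided test with critical value tc."""
    X = NormalDist(mu=gamma, sigma=1.0)
    return (1 - X.cdf(tc)) + X.cdf(-tc)
```

At γ = 0 (the null) the formula returns the level of the test, and the power grows as |γ| grows, consistent with Figure 4.2.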
Example (selecting a sample size): Suppose the wheat researchers wish
to redo the experiment using a larger sample size. How big should their sample size be if they want to have a good chance of rejecting the null hypothesis
B A = 0 at level = 0.05, if the true difference in means is B A = 5
or more?
    μB − μA = 5;
    σ² is unknown: we'll assume the pooled sample variance from the first experiment is a good approximation: σ² = 22.24.
Figure 4.3: γ and power versus sample size, and the normal approximation to the power.
Under these conditions, if nA = nB = n, then

    γ = (μB − μA) / ( σ √(1/nA + 1/nB) ) = 5 / (4.72 √(2/n)) = 0.75 √n.
What is the probability we'll reject H0 at level α = 0.05 for a given sample
size?
delta<-5 ; s2<-(17.93+26.54)/2
So we see that if the true mean difference were B A = 5, then the original
study only had about a 40% chance of rejecting H0 . To have an 80% chance
or greater, the researchers would need a sample size of 15 for each group.
Note that the true power depends on the unknown true mean difference and
true variance (assuming these are equal in the two groups). Even though our
power calculations were done under potentially inaccurate values of μB − μA and σ², they still give us a sense of the power under various parameter values:
How is the power affected if the mean difference is bigger? smaller?
How is the power affected if the variance is bigger? smaller?
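The sample-size search itself can be sketched with the normal approximation from the previous section (which is slightly optimistic relative to the noncentral-t power; names hypothetical):

```python
# Sketch: approximate power as a function of the per-group sample size n,
# assuming delta = 5, sigma^2 = 22.24, alpha = 0.05, nA = nB = n.
import math
from statistics import NormalDist

delta = 5.0
sigma = math.sqrt(22.24)
z_crit = NormalDist().inv_cdf(0.975)   # normal approximation to the critical value

def approx_power(n):
    gamma = delta / (sigma * math.sqrt(2.0 / n))
    X = NormalDist(mu=gamma, sigma=1.0)
    return (1 - X.cdf(z_crit)) + X.cdf(-z_crit)

powers = {n: approx_power(n) for n in range(5, 31)}
```

Scanning `powers` shows the original n = 6 design had well under a 50% chance of rejecting H0, while n = 15 per group pushes the approximate power above 80%, in line with the discussion above.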
Example (power as a function of the effect): Suppose a chemical
company wants to know if a new procedure B will yield more product than
the current procedure A. Running experiments comparing A to B are expensive and they are only budgeted to run an experiment with at most 10
observations in each group.
Is running the experiment worthwhile? To assess this we can calculate the
power under nA = nB = 10 for a variety of values of μB − μA and σ. The first panel plots power as a function of the mean difference for three different values of σ. From this plot, we can see that if the mean difference is 1 and the variance is 1, then we have almost a 60% chance of rejecting the null hypothesis, although we only have about a 23% chance of doing so if the variance is 9 (σ = 3).
Because the power varies as the ratio of effect size to the standard deviation, it is often useful to plot power in terms of this ratio. The scaled effect size δ̃, where

    δ̃ = (μB − μA)/σ,

represents the size of the treatment effect scaled by the experimental variability (the standard deviation). The noncentrality parameter is then

    γ = δ̃ / √(1/nA + 1/nB).

With nA = nB = 10, we have γ = 2.24 δ̃. A plot of power versus δ̃ for a level-0.05 test appears in the first panel of Figure 4.4. From this we see that H0 will be rejected with probability 80% or more only if |δ̃| is bigger than about 1.33. In other words, for a sample size of 10 in each group, the effect must be at least 1.33 times as big as the standard deviation in order to have an 80% chance of rejecting H0.
Figure 4.4: Power as a function of the mean difference μB − μA (for σ = 1, 2, 3), and as a function of the scaled effect (μB − μA)/σ.
Increasing power
As we've seen from the normal approximation to the power, for a fixed type I error rate the power is a function of the noncentrality parameter

    γ = (μB − μA) / ( σ √(1/nA + 1/nB) ),
so clearly power is
    increasing in |μB − μA|;
    increasing in nA and nB;
    decreasing in σ².
The first of these we do not generally control with our experiment (indeed, it
is the unknown quantity we are trying to learn about). The second of these,
sample size, we clearly do control. The last of these, the variance, seems
like something that might be beyond our control. However, the experimental
variance can often be reduced by dividing up the experimental material into
more homogeneous subgroups of experimental units. This design technique,
known as blocking, will be discussed in an upcoming chapter.
Chapter 5
Introduction to ANOVA
Example (Response times):
Background: Psychologists are interested in how learning methods affect
short-term memory.
Hypothesis: Different learning methods may result in different recall time.
Treatments: 5 different learning methods (A, B, C, D, E).
Experimental design: (CRD) 20 male undergraduate students were randomly assigned to one of the 5 treatments, so that there are 4 students
assigned to each treatment. After a learning period, the students were
given cues and asked to recall a set of words. Mean recall time for each
student was recorded in seconds.
Results:
    Treatment: A, B, C, D, E

(Figure: recall time by treatment.)
If α = 0.05, then

    Pr(reject one or more H0,i1i2 | all H0,i1i2 true) ≈ 1 − .95¹⁰ = 0.40.
So, even though the pairwise error rate is 0.05 the experiment-wise error rate
is about 0.40. This issue is called the problem of multiple comparisons and
will be discussed further in Chapter 6. For now, we will discuss a method of
testing the global hypothesis of no variation due to treatment:
    H0: μi1 = μi2 for all i1, i2   versus   H1: μi1 ≠ μi2 for some i1, i2.
62
5.1 A model for treatment variation

Data: yij = measurement on the jth experimental unit in treatment group i, for i = 1, . . . , m and j = 1, . . . , n.

Treatment means model: yij = μi + εij , E[εij] = 0, Var[εij] = σ².

Treatment effects model: yij = μ + τi + εij , E[εij] = 0, Var[εij] = σ².
The treatment means and treatment effects models represent two parameterizations of the same model:

    μi = μ + τi ,   τi = μi − μ.

Null (or reduced) model:

    yij = μ + εij ,   E[εij] = 0 ,   Var[εij] = σ².

This is a special case of the above two models with μ = μ1 = · · · = μm, or equivalently τi = 0 for all i. In this model, there is no variation due to treatment.
5.1.1 Model Fitting

What are good estimates of the parameters? One criterion used to evaluate different values of μ = {μ1, . . . , μm} is the least squares criterion:

    SSE(μ) = Σ_{i=1}^m Σ_{j=1}^n (yij − μi)².
To find the minimizer, differentiate with respect to each μi:

    ∂/∂μi SSE(μ) = ∂/∂μi Σ_{j=1}^n (yij − μi)²
                 = −2 Σ_{j=1}^n (yij − μi)
                 = −2n(ȳi − μi),

so the global minimum occurs at μ̂ = {ȳ1, . . . , ȳm}.
Interestingly, SSE(μ̂) provides a measure of experimental variability:

    SSE(μ̂) = Σ_i Σ_j (yij − μ̂i)² = Σ_i Σ_j (yij − ȳi)².
Recall that s²i = Σ_j (yij − ȳi)²/(n − 1) estimates σ² using data from group i. If we have more than one group, we want to pool our estimates to be more precise:

    s² = [(n − 1)s²1 + · · · + (n − 1)s²m] / [(n − 1) + · · · + (n − 1)]
       = [Σ_j (y1j − ȳ1)² + · · · + Σ_j (ymj − ȳm)²] / [m(n − 1)]
       = Σ_i Σ_j (yij − μ̂i)² / [m(n − 1)]
       = SSE(μ̂) / [m(n − 1)]
       ≡ MSE.
Variation among the treatment means is measured by

    SST = Σ_{i=1}^m Σ_{j=1}^n (ȳi − ȳ··)² = n Σ_{i=1}^m (ȳi − ȳ··)²,

where

    ȳ·· = (1/mn) Σ_i Σ_j yij = (ȳ1 + · · · + ȳm)/m

is the grand mean of the sample. We call SST the treatment sum of squares. We also define MST = SST/(m − 1) as the treatment mean squares or mean squares (due to) treatment. Notice that MST is simply n times the sample variance of the sample means:

    MST = n [ (1/(m − 1)) Σ_{i=1}^m (ȳi − ȳ··)² ].
5.1.2 Testing hypotheses with MSE and MST

If there is variation due to treatment, then

    Σ_{i=1}^m (μi − μ̄)² > 0.
Probabilistically,

    Σ_{i=1}^m (μi − μ̄)² > 0  ⇒  a large Σ_{i=1}^m (ȳi − ȳ··)² will probably be observed.

Inductively,

    a large Σ_{i=1}^m (ȳi − ȳ··)² observed  ⇒  Σ_{i=1}^m (μi − μ̄)² > 0 is plausible.

So a large value of SST or MST gives evidence that there are differences between the true treatment means. But how large is large? We need to know what values of MST to expect under H0.
MST under the null:

Suppose H0 : μ1 = · · · = μm = μ is true. Then E[√n Ȳi] = √n μ for each i, and so

    Σ_{i=1}^m (√n Ȳi − √n Ȳ··)² / (m − 1)

is an unbiased estimate of Var[√n Ȳi] = σ². Notice that

    Σ (√n Ȳi − √n Ȳ··)² / (m − 1) = n Σ (Ȳi − Ȳ··)² / (m − 1) = SST/(m − 1) = MST,

so E[MST | H0] = σ².
MST under an alternative:

We can show that under a given value of μ = (μ1, . . . , μm),

    E[MST | μ] = σ² + n Σ_{i=1}^m (μi − μ̄)² / (m − 1)
               = σ² + n Σ_{i=1}^m τi² / (m − 1)
               ≡ σ² + n v²τ .

So E[MST | μ] ≥ σ², with equality only if there is no variability in the treatment means, i.e. v²τ = 0.
Expected value of MSE:

MSE = (1/m) Σ_{i=1}^m s²i , so

    E[MSE] = (1/m) Σ E[s²i] = (1/m) Σ σ² = σ².
To summarize: E[MSE] = σ² always, E[MST | H0] = σ², and E[MST | H1] = σ² + n v²τ . This should give us an idea for a test statistic:

If H0 is true:  MSE ≈ σ² and MST ≈ σ².
If H0 is false: MSE ≈ σ² and MST ≈ σ² + n v²τ > σ².

So

- under H0, MST/MSE should be around 1;
- under H1, MST/MSE should be bigger than 1.

Thus the test statistic F(Y) = MST/MSE is sensitive to deviations from the null, and can be used to measure evidence against H0. Now all we need is a null distribution.
Example (response times):

ybar.t <- tapply(y, x, mean)
s2.t   <- tapply(y, x, var)
SSE <- sum((n-1)*s2.t)
SST <- n*sum((ybar.t - mean(y))^2)
MSE <- SSE/(m*(n-1))
MST <- SST/(m-1)

> SSE
[1] 12.0379
> SST
[1] 7.55032
> MSE
[1] 0.8025267
> MST
[1] 1.88758
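The same arithmetic can be mirrored outside R. A Python sketch of the computation, with simulated data standing in for the response times (which are not reproduced in the notes):

```python
# Illustrative recomputation of the one-way ANOVA quantities in Python;
# the data are simulated, not the actual response-time data.
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 4                                   # 5 treatments, 4 units each
y = rng.normal(loc=10.0, scale=1.0, size=(m, n))

ybar_t = y.mean(axis=1)                       # treatment means (tapply(y, x, mean))
SSE = ((y - ybar_t[:, None]) ** 2).sum()      # within-group sum of squares
SST = n * ((ybar_t - y.mean()) ** 2).sum()    # between-group sum of squares
MSE = SSE / (m * (n - 1))
MST = SST / (m - 1)
F = MST / MSE
print(round(F, 3))
```

By construction, SSE + SST equals the total sum of squared deviations from the grand mean, which is the decomposition proved in the next section.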
The observed between-group variation is larger than the observed within-group variation, but not larger than the types of F-statistics we'd expect to get if the null hypothesis were true.
5.2 Partitioning sums of squares

The total variation in the data decomposes into treatment and error sums of squares: SSTotal = SST + SSE.

Proof:

    Σ_{i=1}^m Σ_{j=1}^n (yij − ȳ··)² = Σ_i Σ_j [(yij − ȳi) + (ȳi − ȳ··)]²
        = Σ_i Σ_j [(yij − ȳi)² + 2(yij − ȳi)(ȳi − ȳ··) + (ȳi − ȳ··)²]
        = Σ_i Σ_j (yij − ȳi)² + Σ_i Σ_j 2(yij − ȳi)(ȳi − ȳ··) + Σ_i Σ_j (ȳi − ȳ··)²
        =         (1)         +              (2)                +        (3)
(1) = Σ_i Σ_j (yij − ȳi)² = SSE

(3) = Σ_i Σ_j (ȳi − ȳ··)² = n Σ_i (ȳi − ȳ··)² = SST

(2) = 2 Σ_i Σ_j (yij − ȳi)(ȳi − ȳ··) = 2 Σ_i (ȳi − ȳ··) Σ_j (yij − ȳi),

but note that

    Σ_j (yij − ȳi) = Σ_j yij − n ȳi = n ȳi − n ȳi = 0

for each i. Therefore (2) = 0 and we have

    total sum of squared deviations from the grand mean
        = treatment sum of squares + error sum of squares,

or more succinctly, SSTotal = SST + SSE.
Putting it all together:

    H0 : μi1 = μi2 for all i1, i2       Reduced model: yij = μ + εij
    H1 : μi1 ≠ μi2 for some i1 ≠ i2     Full model: yij = μi + εij

A fitted value or predicted value of an observation yij is denoted ŷij and represents the modeled value of yij, without the noise. A residual ε̂ij is the observed value minus the fitted value, ε̂ij = yij − ŷij.

If we believe H1,

- our estimate of μi is μ̂i = ȳi;
- the fitted value of yij is ŷij = μ̂i = ȳi;
- the residual for (i, j) is ε̂ij = yij − ŷij = yij − ȳi;
- the model lack-of-fit is measured by the sum of squared errors:

      Σ_i Σ_j (yij − ȳi)² = SSEF .
If we believe H0,

- our estimate of μ is μ̂ = ȳ·· ;
- the fitted value of yij is ŷij = μ̂ = ȳ·· ;
- the residual for (i, j) is ε̂ij = yij − ŷij = yij − ȳ·· ;
- the model lack-of-fit in this case is

      Σ_i Σ_j (yij − ȳ··)² = SSER = SSTotal.
The main idea: Variance can be partitioned into parts representing different sources. The variance explained by different sources can be compared
and analyzed. This gives rise to the ANOVA table.
5.2.1 The ANOVA table

Source    | Deg. of Freedom | Sum of Squares | Mean Square         | F-ratio
Treatment | m − 1           | SST            | MST = SST/(m−1)     | MST/MSE
Noise     | m(n − 1)        | SSE            | MSE = SSE/[m(n−1)]  |
Total     | mn − 1          | SSTotal        |                     |
5.2.2 Understanding Degrees of Freedom

Data             | Group means
y11, . . . , y1n | ȳ1
y21, . . . , y2n | ȳ2
 ...             | ...
ym1, . . . , ymn | ȳm

    ȳ·· = (n ȳ1 + · · · + n ȳm)/(mn) = (ȳ1 + · · · + ȳm)/m
We can decompose each observation as follows:

    yij = ȳ·· + (ȳi − ȳ··) + (yij − ȳi).

This leads to

    (yij − ȳ··) = (ȳi − ȳ··) + (yij − ȳi)
    total variation = between-group variation + within-group variation.

All data can be decomposed this way, leading to the decomposition of the data vector of length m × n into two parts, as shown in Table 5.1. How do we interpret the degrees of freedom? We've heard of degrees of freedom before, in the definition of a χ² random variable.
Table 5.1: ANOVA decomposition (balanced design):

    Total       =  Treatment    +  Error
    y11 − ȳ··   =  (ȳ1 − ȳ··)   +  (y11 − ȳ1)
    ...
    y1n − ȳ··   =  (ȳ1 − ȳ··)   +  (y1n − ȳ1)
    y21 − ȳ··   =  (ȳ2 − ȳ··)   +  (y21 − ȳ2)
    ...
    ymn − ȳ··   =  (ȳm − ȳ··)   +  (ymn − ȳm)

    SSTotal     =  SSTrt        +  SSE
    mn − 1      =  (m − 1)      +  m(n − 1)
Consider a simple example with three numbers x1, x2, x3:

    x1 − x̄     c1
    x2 − x̄  =  c2
    x3 − x̄     c3

How many degrees of freedom does the vector (c1, c2, c3)ᵀ have? How many components can vary independently, if we know the elements are equal to some numbers minus the average of those numbers?

    c1 + c2 + c3 = (x1 − x̄) + (x2 − x̄) + (x3 − x̄)
                 = (x1 + x2 + x3) − 3x̄
                 = 3x̄ − 3x̄
                 = 0.

So once we know any two of the ci's the third is determined: the vector has 3 − 1 = 2 degrees of freedom. In general, the vector (x1 − x̄, . . . , xm − x̄)ᵀ has m − 1 degrees of freedom.
5.2.3 More sums of squares geometry

Write the data and fitted vectors of length m × n as

    y = (y11, . . . , y1n, . . . , ym1, . . . , ymn)ᵀ
    ŷtrt = (ȳ1, . . . , ȳ1, . . . , ȳm, . . . , ȳm)ᵀ
    ŷμ = (ȳ··, . . . , ȳ··)ᵀ.

Recall that two vectors u and v are orthogonal if

    Σ_i ui vi = 0.

Letting b = (ŷtrt − ŷμ) and c = (y − ŷtrt), we have

    b · c = Σ_{i=1}^m Σ_{j=1}^n (ȳi − ȳ··)(yij − ȳi)
          = Σ_{i=1}^m (ȳi − ȳ··) Σ_{j=1}^n (yij − ȳi)
          = Σ_{i=1}^m (ȳi − ȳ··) × 0
          = 0.
So the vector a = b + c is the vector sum of two orthogonal vectors. We can draw this as a right triangle with hypotenuse

    a = (y − ŷμ)

and legs

    b = (ŷtrt − ŷμ),   c = (y − ŷtrt).

Now recall

    ||a||² = ||y − ŷμ||² = SSTotal
    ||b||² = ||ŷtrt − ŷμ||² = SST
    ||c||² = ||y − ŷtrt||² = SSE.

What do we know about right triangles?

    ||a||² = ||b||² + ||c||²
    SSTotal = SST + SSE.

So the ANOVA decomposition is an application of Pythagoras' Theorem. One final observation: recall that

    dof(ŷtrt − ŷμ) = m − 1
    dof(y − ŷtrt) = m(n − 1)
    (ŷtrt − ŷμ) and (y − ŷtrt) are orthogonal.

The last line means the degrees of freedom must add, so

    dof(y − ŷμ) = (m − 1) + m(n − 1) = mn − 1.
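This geometry is easy to check numerically. A small Python sketch (simulated data, purely illustrative):

```python
# Numeric check: b = yhat_trt - yhat_mu and c = y - yhat_trt are
# orthogonal, so ||a||^2 = ||b||^2 + ||c||^2 (Pythagoras).
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
y = rng.normal(size=(m, n))

yhat_trt = np.repeat(y.mean(axis=1), n)   # each observation's group mean
yhat_mu = np.full(m * n, y.mean())        # grand mean, repeated
a = y.ravel() - yhat_mu
b = yhat_trt - yhat_mu
c = y.ravel() - yhat_trt

print(abs(b @ c))                         # essentially zero
print(a @ a - (b @ b + c @ c))            # essentially zero
```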
5.3 Unbalanced Designs

An unbalanced design is one in which the group sample sizes n1, . . . , nm are not all equal. Write N = n1 + · · · + nm. The models are as before:

Reduced model: yij = μ + εij , Var[εij] = σ²

Full model: yij = μi + εij = μ + τi + εij , Var[εij] = σ²
How should we estimate these parameters? When ni = n for all i, we had

- treatment means parameterization: μ̂i = ȳi ;
- treatment effects parameterization: μ̂ = ȳ·· , τ̂i = ȳi − ȳ·· ,

which meant that ȳ·· = (1/m) Σ ȳi . Similarly, we had

    s² = Σ_{i=1}^m Σ_{j=1}^n (yij − ȳi)² / Σ_{i=1}^m (n − 1).
What should the parameter estimates be? With a bit of calculus you can show that the least squares estimates of μi or (μ, τi) are

    μ̂i = ȳi ,   μ̂ = ȳ·· ,   τ̂i = ȳi − ȳ·· .

We no longer have μ̂ = (1/m) Σ μ̂i or Σ τ̂i = 0, but we do have

    Σ_{i=1}^m ni ȳi / Σ ni = Σ_i Σ_j yij / N = ȳ·· , so
    Σ ni μ̂i / Σ ni = μ̂ , and
    Σ ni τ̂i = 0.

So μ̂ is a weighted average of the μ̂i's, and a weighted average of the τ̂i's is zero. Similarly,

    s² = Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ȳi)² / Σ_{i=1}^m (ni − 1)
       = Σ_{i=1}^m (ni − 1) s²i / Σ_{i=1}^m (ni − 1),

so s² is a weighted average of the s²i's.
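These weighted-average identities can be verified numerically. A Python sketch with unequal group sizes (the ni match the coagulation example later in the chapter; the responses are simulated):

```python
# Check: with unequal ni, the tau-hats do not sum to zero, but their
# ni-weighted sum does. Simulated data, illustrative only.
import numpy as np

rng = np.random.default_rng(2)
ni = np.array([4, 6, 6, 8])
groups = [rng.normal(size=k) for k in ni]

ybar_i = np.array([g.mean() for g in groups])
grand = np.concatenate(groups).mean()     # equals sum(ni * ybar_i) / N
tau_hat = ybar_i - grand

print(tau_hat.sum())                      # generally nonzero
print((ni * tau_hat).sum())               # zero, up to rounding
```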
5.3.1 Sums of squares and degrees of freedom

The vector decomposition is shown in Table 5.2. Let a, b and c be the three vectors in the table. We define the sums of squares as the squared lengths of these vectors:

    SSTotal = ||a||² = Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ȳ··)²
    SSTrt   = ||b||² = Σ_{i=1}^m Σ_{j=1}^{ni} (ȳi − ȳ··)² = Σ_{i=1}^m ni (ȳi − ȳ··)²
    SSE     = ||c||² = Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ȳi)².
Table 5.2: ANOVA decomposition (unbalanced design):

    Total        =  Treatment    +  Error
    y11 − ȳ··    =  (ȳ1 − ȳ··)   +  (y11 − ȳ1)
    ...
    y1n1 − ȳ··   =  (ȳ1 − ȳ··)   +  (y1n1 − ȳ1)
    y21 − ȳ··    =  (ȳ2 − ȳ··)   +  (y21 − ȳ2)
    ...
    ymnm − ȳ··   =  (ȳm − ȳ··)   +  (ymnm − ȳm)

    SSTotal      =  SSTrt        +  SSE
    N − 1        =  (m − 1)      +  Σ_{i=1}^m (ni − 1)
Let's see if things add in a nice way. First, let's check orthogonality:

    b · c = Σ_{i=1}^m Σ_{j=1}^{ni} (ȳi − ȳ··)(yij − ȳi)
          = Σ_{i=1}^m (ȳi − ȳ··) Σ_{j=1}^{ni} (yij − ȳi)
          = Σ_{i=1}^m (ȳi − ȳ··) × 0 = 0.

In the unbalanced case

    (1/m) Σ_{i=1}^m ȳi ≠ ȳ·· ,

but in the vector b we have ni copies of (ȳi − ȳ··), and

    (1/Σ ni) Σ ni ȳi = ȳ·· ,

and so the vector b does sum to zero. Another way of looking at it is that the vector b is made up of m distinct numbers which don't sum to zero, but whose weighted average is zero, and so the degrees of freedom are m − 1.
5.3.2 ANOVA table for unbalanced data

Source    | Deg. of Freedom | Sum of Squares | Mean Square      | F-Ratio
Treatment | m − 1           | SST            | MST = SST/(m−1)  | MST/MSE
Noise     | N − m           | SSE            | MSE = SSE/(N−m)  |
Total     | N − 1           | SSTotal        |                  |
As before, if Var[εij] = σ² then E[MSE] = σ², and under an alternative

    E[MST] = σ² + (N/(m − 1)) v²τ ,  where  v²τ = Σ_{i=1}^m ni τi² / N  and  τi = μi − Σ ni μi / Σ ni .

So yes, MST/MSE will still be sensitive to deviations from the null, but the groups with larger sample sizes have a bigger impact on the power.
5.4 Normal sampling theory for ANOVA

Example (coagulation times): Blood coagulation times were measured on animals assigned to one of four diets (A, B, C, D).

[Figure: boxplots of coagulation time by diet.]
Questions:
Does diet have an effect on coagulation time?
If a given diet were assigned to all the animals in the population, what
would the distribution of coagulation times be?
If there is a diet effect, how do the mean coagulation times differ?
The first question we can address with a randomization test. For the second
and third we need a sampling model:
    yij = μi + εij ,   ε11, . . . , εm,nm i.i.d. normal(0, σ²).

This model implies
independence of errors
constant variance
normally distributed data
Another way to write it is as follows:
    yA1, . . . , yA4 i.i.d. normal(μA, σ²)
    yB1, . . . , yB6 i.i.d. normal(μB, σ²)
    yC1, . . . , yC6 i.i.d. normal(μC, σ²)
    yD1, . . . , yD8 i.i.d. normal(μD, σ²)
So we are viewing the 4 samples under A as a random sample from the
population of coagulation times that would be present if all animals got A
(and similarly for samples under B, C and D).
> anova(lm(ctime ~ diet))
Analysis of Variance Table

Response: ctime
          Df Sum Sq Mean Sq F value
diet       3  228.0    76.0  13.571
Residuals 20  112.0     5.6

5.4.1 Sampling distribution of the F-statistic
Recall the following facts about normal samples and the χ² distribution:

- If Y1, . . . , Yn are i.i.d. normal(μ, σ²), then (1/σ²) Σ (Yi − Ȳ)² ∼ χ²_{n−1}.
- If X1 ∼ χ²_{k1} and X2 ∼ χ²_{k2} are independent, then X1 + X2 ∼ χ²_{k1+k2}.

Distribution of SSE:

    Σ_i Σ_j (Yij − Ȳi)² / σ²
      = (1/σ²) Σ_j (Y1j − Ȳ1)² + · · · + (1/σ²) Σ_j (Ymj − Ȳm)²
      ∼ χ²_{n1−1} + · · · + χ²_{nm−1}
      = χ²_{N−m} .

So SSE/σ² ∼ χ²_{N−m}.
Distribution of SST under the null: Under H0,

    Ȳi ∼ normal(μ, σ²/ni)
    √ni Ȳi ∼ normal(√ni μ, σ²)

    SST/σ² = (1/σ²) Σ_{i=1}^m ni (Ȳi − Ȳ··)²
           = (1/σ²) Σ_{i=1}^m (√ni Ȳi − √ni Ȳ··)² ∼ χ²_{m−1} .

Results so far:

- SSE/σ² ∼ χ²_{N−m}
- SST/σ² ∼ χ²_{m−1} (under H0)
- SSE, SST independent (why?)
Recall the definition of the F-distribution: if

    X1 ∼ χ²_{k1} ,   X2 ∼ χ²_{k2} ,   X1, X2 independent,

then

    (X1/k1) / (X2/k2) ∼ F_{k1,k2} ,

the F-distribution with k1 and k2 degrees of freedom.

Application: Under H0,

    MST/MSE = [ (SST/σ²)/(m − 1) ] / [ (SSE/σ²)/(N − m) ] ∼ F_{m−1, N−m} .
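Quantiles and tail probabilities of the F-distribution are available in any statistics package. A Python sketch using scipy (the analogues of R's qf() and 1 − pf()), with degrees of freedom matching the coagulation ANOVA:

```python
# F quantile (critical value) and tail probability for 3 and 20 df.
from scipy.stats import f

Fcrit = f.ppf(0.95, 3, 20)      # like qf(.95, 3, 20)
pval = f.sf(13.571, 3, 20)      # Pr(F > 13.571) under F(3, 20)
print(round(Fcrit, 3), pval)
```

The tail probability here corresponds to the Pr(>F) column of the ANOVA output below.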
A large value of F is evidence against H0, so reject H0 if F > Fcrit. How to determine Fcrit?

Level-α testing procedure:

1. gather data
2. construct ANOVA table
3. reject H0 : μi = μ for all i if F > Fcrit,

where Fcrit is the 1 − α quantile of an F_{m−1,N−m} distribution, available in R via qf(1-alpha, dof.trt, dof.err). Under this procedure (and a host of assumptions),

    Pr(reject H0 | H0 true) = α.

Plots of several different F-distributions appear in Figure 5.4. Study these plots until you understand the relationship between the shape of the curves and the degrees of freedom. Now let's get back to the data analysis:
> anova(lm(ctime ~ diet))
Analysis of Variance Table

Response: ctime
          Df Sum Sq Mean Sq F value    Pr(>F)
diet       3  228.0    76.0  13.571 4.658e-05
Residuals 20  112.0     5.6
[Figure 5.4: densities and CDFs of the F(3,20), F(3,10), F(3,5) and F(3,2) distributions.]
5.4.2 Comparing group means
Standard error: The usual definition of the standard error of an estimator θ̂ of a parameter θ is an estimate of its sampling standard deviation:

    SE[θ̂] = √(V̂ar[θ̂]) ,

where V̂ar[θ̂] is Var[θ̂] with an estimate σ̂² plugged in for σ². For example, with μ̂i = Ȳi:

    Var[μ̂i] = σ²/ni
    V̂ar[μ̂i] = σ̂²/ni = s²/ni
    SE[μ̂i] = s/√ni .

So SE[μ̂i] = √(s²/ni) is an estimate of SD[μ̂i] = √(Var[Ȳi]) = σ/√ni. The standard error is a very useful quantity.
Confidence intervals for treatment means: Obtaining confidence intervals is very similar to the one-sample case. The only difference is that we use data from all of the groups to estimate the variance. As a result, the degrees of freedom change:

    (Ȳi − μi)/SE[Ȳi] = (Ȳi − μi)/(s/√ni) = (Ȳi − μi)/√(MSE/ni) ∼ t_{N−m} .
Note that the degrees of freedom are those associated with MSE, NOT ni − 1. As a result,

    Ȳi ± SE[Ȳi] t_{1−α/2, N−m}

is a 100 × (1 − α)% confidence interval for μi.

A handy rule of thumb: If θ̂ is an estimator of θ, then in many situations θ̂ ± 2 SE[θ̂] is an approximate 95% CI for θ.
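A Python sketch of the interval computation, using the coagulation numbers (MSE = 5.6 on N − m = 20 degrees of freedom; the group used here has ni = 6 and sample mean 68):

```python
# Confidence interval for one treatment mean with MSE-based df.
from scipy.stats import t

MSE, df = 5.6, 20
ybar, ni = 68.0, 6
se = (MSE / ni) ** 0.5                 # about 0.97
tcrit = t.ppf(0.975, df)               # about 2.1, as from qt(.975, 20)
ci = (ybar - tcrit * se, ybar + tcrit * se)
print(round(ci[0], 1), round(ci[1], 1))
```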
Coagulation Example:

    Ȳi ± SE[Ȳi] t_{1−α/2, N−m}

For 95% confidence intervals,

    t_{1−α/2, N−m} = t.975,20 = qt(.975, 20) ≈ 2.1.

diet | ni | SE[μ̂diet] = √(MSE/ni) | 95% CI
C    | 6  | 0.97                   | (65.9, 70.0)
B    | 6  | 0.97                   | (63.9, 68.0)
A    | 4  | 1.18                   | (58.5, 63.5)
D    | 8  | 0.84                   | (59.2, 62.8)

5.4.3 Power calculations for the F-test
Under an alternative, the F-statistic has a noncentral F-distribution with noncentrality parameter

    λ = n Σ τi² / σ² ,

where τi = μi − μ̄ is the ith treatment effect. In many texts, power is expressed as a function of the quantity Φ:

    Φ = √( (Σ τi²/m) / (σ²/n) ) = √(λ/m).

Let's try to understand what Φ represents:

    Φ² = (Σ τi²/m) / (σ²/n)
       = treatment variation / experimental uncertainty
       = treatment variation × experimental precision.

Note that "treatment variation" means average squared treatment-effect size. We can gain some more intuition by rewriting λ as follows:

    λ = n Σ τi² / σ²
      = nm × (Σ τi²/m) / σ²
      = N × (between-treatment variation / within-treatment variation).
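The power itself is a noncentral-F tail probability. A Python sketch for a balanced design (the function name and inputs are illustrative):

```python
# Power of the level-alpha F-test: under the alternative, MST/MSE has a
# noncentral F distribution with noncentrality lambda = n*sum(tau^2)/sigma^2.
from scipy.stats import f, ncf

def f_power(sum_tau2_over_sigma2, n, m, alpha=0.05):
    lam = n * sum_tau2_over_sigma2          # noncentrality parameter
    dfn, dfd = m - 1, m * (n - 1)
    fcrit = f.ppf(1 - alpha, dfn, dfd)
    return ncf.sf(fcrit, dfn, dfd, lam)     # Pr(F > Fcrit | lambda)

p5 = f_power(1.0, n=5, m=4)
p10 = f_power(1.0, n=10, m=4)
print(round(p5, 3), round(p10, 3))          # power increases with n
```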
[Figure: power and Fcrit of the F-test versus sample size n, for m = 4, α = 0.05, and var.b/var.w ratios of 1 and 2.]
5.5 Model diagnostics

Our model is

    yij = μi + εij .

We have shown that, if the εij are independent with constant variance and normally distributed, then MST/MSE has an F_{m−1,N−m} distribution under the null hypothesis. Violations of these assumptions can invalidate that null distribution, so they should be checked.

5.5.1 Detecting violations with residuals
Parameter estimates:

    yij = ȳ·· + (ȳi − ȳ··) + (yij − ȳi)
        = μ̂ + τ̂i + ε̂ij .

Our fitted value for any observation in group i is ŷij = μ̂ + τ̂i = ȳi. Our estimate of the error is ε̂ij = yij − ȳi; ε̂ij is called the residual for observation i, j. Assumptions about the εij can be checked by examining the values of the ε̂ij's:

5.5.2 Checking normality assumptions
Two standard graphical ways of assessing normality are the following:

Histogram: Make a histogram of the ε̂ij's. This should look approximately bell-shaped if the (super)population is really normal and there are enough observations; if there are, graphically compare the histogram to a N(0, s²) density. In small samples, the histogram need not look particularly bell-shaped.

Normal probability, or qq-plot: If εij ∼ N(0, σ²), then the ordered residuals (ε̂(1), . . . , ε̂(mn)) should correspond linearly with quantiles of a standard normal distribution.

How non-normal can a sample from a normal population look? You can always check yourself by simulating data in R; see Figure 5.8.

Example (Hermit Crab Data): Is there variability in hermit crab population across six different coastline sites? Researchers sampled the population in 25 randomly chosen transects within each of the six sites.

Data: yij = population total in transect j of site i.

Model: Yij = μ + τi + εij.

Note that the data are counts, so they cannot be exactly normally distributed.
95
Sample Quantiles
1
0
1
Density
0.0 0.1 0.2 0.3 0.4 0.5
0
y
0
1
Theoretical Quantiles
Sample Quantiles
1
0
1
Density
0.0 0.1 0.2 0.3 0.4 0.5
2
0
y
1
0
1
Theoretical Quantiles
Density
0.0 0.1 0.2 0.3 0.4 0.5
Sample Quantiles
1
0
1
2
2
0
y
1
0
1
Theoretical Quantiles
Figure 5.8: Normal scores plots of normal samples, with n {20, 50, 100}
96
ANOVA:

> anova(lm(crab[,2] ~ as.factor(crab[,1])))
Analysis of Variance Table

Response: crab[, 2]
                      Df Sum Sq Mean Sq F value  Pr(>F)
as.factor(crab[, 1])   5  76695   15339  2.9669 0.01401
Residuals            144 744493    5170
Residuals:

    ε̂ij = yij − μ̂i = yij − ȳi .

Residual diagnostic plots are in Figure 5.10. The data are clearly not normally distributed.
5.5.3 Checking variance assumptions

The null distribution of the F-test is based on Var[εij] = σ² being the same for all groups i. To check this:

(a) Tabulate the residual variance in each treatment:

    Trt:      1    2   · · ·  m
    variance: s²1  s²2 · · ·  s²m
[Figure: histograms of crab population counts for sites 1 through 6.]

[Figure 5.10: residual diagnostics for the crab data: a normal qq-plot of the residuals, and residuals versus fitted values. The fitted values are the site sample means 9.24, 10.00, 12.64, 33.80, 50.64, 68.72.]

(b) A formal test of equal variances can be based on the absolute residuals dij = |ε̂ij|: compute the one-way ANOVA F-statistic for the dij,

    F0 = MSTd / MSEd ,
which is the ratio of the between group variability of the dij to the
within group variability of the dij .
Reject H0 : Var[εij] = σ² for all i, j if F0 > F_{m−1, Σ(ni−1), 1−α}.
Crab data:

    F0 = 14,229 / 4,860 = 2.93 > F_{5,144,0.95} = 2.28,

hence we reject the null hypothesis of equal variances at the 0.05 level.
5.5.4 Variance stabilizing transformations

Suppose the data arise multiplicatively, Yij = μi × Xij1 × Xij2 × · · ·. Then on the log scale,

    log Yij = log μi + (log Xij1 + log Xij2 + · · ·)
    Var[log Yij] = Var[log μi + log Xij1 + log Xij2 + · · ·]
                 = Var[log Xij1 + log Xij2 + · · ·]
                 ≡ σ²_log ,

so that the variance of the log-data does not depend on the mean μi. Also note that, by the central limit theorem, the errors on the log scale should be approximately normally distributed.
[Figure: crab population by site, on the original scale.]
> anova(lm(log(crab[,2] + 1/6) ~ as.factor(crab[,1])))
Analysis of Variance Table

Response: log(crab[, 2] + 1/6)
                      Df Sum Sq Mean Sq F value  Pr(>F)
as.factor(crab[, 1])   5  54.73   10.95  2.3226 0.04604
Residuals            144 678.60    4.71

[Figure: residual diagnostics for the log-transformed crab data: qq-plot of the residuals and residuals versus fitted values.]
More generally, suppose the mean-variance relationship follows a power law: E[Yij] = μi and

    Var[Yij] = E[(Yij − μi)²] ∝ μi^{2α} ,  i.e.  SD[Yij] ∝ μi^α .

A first-order Taylor expansion gives, for power-transformed data,

    SD[Yij^λ] ≈ λ μi^{λ−1} SD[Yij] ∝ μi^{λ−1+α} .

So if we observe σi ∝ μi^α, then taking λ = 1 − α will have stabilized the variances to some extent. Of course, we typically don't know α, but we could try to estimate it from data.
Estimation of α:

    σi ∝ μi^α  ⟹  σi = c μi^α  ⟹  log σi = log c + α log μi ,
    so  log si ≈ log c + α log ȳi .

Thus we may use the following procedure:

(1) Plot log si vs. log ȳi.
(2) Fit a least squares line: lm(logsi ~ logybar).
(3) The slope α̂ of the line is an estimate of α.
(4) Analyze ỹij = yij^{1−α̂}.
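Step (2) can be sketched with the crab-data summary values tabulated later in this section:

```python
# Regress log(sample sd) on log(sample mean); the slope estimates alpha.
import numpy as np

log_mean = np.array([2.22, 2.30, 2.54, 3.52, 3.92, 4.23])
log_sd   = np.array([2.86, 2.99, 3.14, 3.92, 4.68, 4.83])
slope, intercept = np.polyfit(log_mean, log_sd, 1)
print(round(slope, 2))    # alpha-hat close to 1, suggesting the log transform
```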
Here are some common transformations:

mean-var relation  | α   | λ = 1 − α | transform
Var[y] ∝ const     | 0   | 1         | no transform!
Var[y] ∝ μ         | 1/2 | 1/2       | y^{1/2} (square root)
Var[y] ∝ μ^{3/2}   | 3/4 | 1/4       | y^{1/4} (quarter power)
Var[y] ∝ μ²        | 1   | 0         | log y
Var[y] ∝ μ³        | 3/2 | −1/2      | y^{−1/2}
Var[y] ∝ μ⁴        | 2   | −1        | 1/y (reciprocal)

Note that

- all the mean-variance relationships here are examples of power laws;
- not all mean-variance relations are of this form;
- α = 1 is the multiplicative model discussed previously.
For λ = 0, the power transformation is defined via the limit

    (y^λ − 1)/λ → ln y  as  λ → 0,

which is why λ = 0 corresponds to the log transform. Note that for a given λ ≠ 0 it will not change the results of the ANOVA on the transformed data if we transform using

    ỹ = y^λ   or   y^(λ) = (y^λ − 1)/λ = a y^λ + b.

For data that are proportions, the standard transformation in this case is ỹ = arcsin √y.
Keep in mind that statisticians disagree on the usefulness of transformations: some regard them as a hack more than a cure. It can be argued that if the scientist who collected the data had a good reason for using certain units, then one should not just transform the data in order to bang it into an ANOVA-shaped hole. (Given enough time and thought we could instead build a non-linear model for the original data.) The sad truth: as always, you will need to exercise judgment while performing your analysis. These warnings apply whenever you might reach for a transform, whether in an ANOVA context or a linear regression context.
Example (Crab data): Looking at the plot of means vs. sds suggests α ≈ 1, implying a log-transformation. However, the zeros in our data lead to problems, since log(0) = −∞. Instead we can use ỹij = log(yij + 1/6). For the transformed data, the ratio of the largest to smallest standard deviation is approximately 2, which is acceptable based on the rule of 4. Additionally, the residual diagnostic plots (Figure 5.13) are much improved.

[This table needs to be fixed: the third column needs to be sd(log(y)).]
site | sample sd | sample mean | log(sample sd) | log(sample mean)
4    | 17.39     | 9.24        | 2.86           | 2.22
5    | 19.84     | 10.00       | 2.99           | 2.30
6    | 23.01     | 12.64       | 3.14           | 2.54
1    | 50.39     | 33.80       | 3.92           | 3.52
3    | 107.44    | 50.64       | 4.68           | 3.92
2    | 125.35    | 68.72       | 4.83           | 4.23
lm(formula = logsd ~ logmean)

Coefficients:
(Intercept)     logmean
     0.6652      0.9839

The slope α̂ ≈ 0.98 is close to 1, supporting the log transformation (λ = 1 − α̂ ≈ 0).
5.6
Treatment Comparisons
Recall the coagulation time data from the beginning of the chapter: four different diets were assigned to a population of 24 animals, with n1 = 4, n2 = 6, n3 = 6 and n4 = 8.

> anova(lm(ctime ~ diet))
Analysis of Variance Table
[Figure: log(sample sd) versus log(sample mean) for the crab data, with fitted least squares line.]
We conclude from the F -test that there are substantial differences between
the population treatment means. How do we decide what those differences
are?
5.6.1 Contrasts

A contrast is a linear combination of the treatment means, C = Σ_{i=1}^m ki μi, with coefficients satisfying Σ ki = 0.

Examples:

- diet 1 vs diet 2 : C = μ1 − μ2 (k1 = 1, k2 = −1, other ki = 0).

A contrast is estimated by plugging in the sample means:

    Ĉ = Σ_{i=1}^m ki ȳi .

Then E[Ĉ] = Σ ki μi = C, so Ĉ is an unbiased estimator of C.
Then E[C]
Standard errors:
m
X
=
Var[C]
i=1
m
X
Var[ki yi ]
ki2 2 /ni
i=1
m
X
ki2 /ni
i=1
is
So an estimate of Var[C]
s2C
=s
m
X
ki2 /ni
i=1
109
ki2 /ni
s
SE[C]
If the data are normally distributed, then under H0 : C = Σ ki μi = 0,

    Ĉ / SE[Ĉ] ∼ t_{N−m} .

Level-α test: Reject H0 if |Ĉ/SE[Ĉ]| > t_{1−α/2, N−m}.

For the coagulation data, comparing two diets with sample sizes 6 and 4:

    Ĉ/SE[Ĉ] = (ȳ1 − ȳ2) / (s √(1/6 + 1/4)) = 5/1.53 = 3.27.

A confidence interval for C is given by Ĉ ± t_{1−α/2, N−m} SE[Ĉ].
[Figure 5.15: grain yield versus plant density, for densities 10, 20, 30, 40, 50.]

5.6.2 Orthogonal Contrasts
What use are contrasts beyond just comparing two means? Consider the
data in Figure 5.15, which show the results of a CRD for an experiment on
the effects of planting density on crop yield in which there were three fields
randomly assigned to each of 5 planting densities.
> anova(lm(y ~ as.factor(x)))
             Df Sum Sq Mean Sq F value    Pr(>F)
as.factor(x)  4 87.600  21.900  29.278 1.690e-05 ***
Residuals    10  7.480   0.748
There is strong evidence of an effect of planting density. How should we summarize the effect? In this experiment, the treatment levels have an ordering to them (this is not always the case). Consider the following m − 1 = 4 contrasts:
Contrast | k1 | k2 | k3 | k4 | k5
C1       | −2 | −1 |  0 |  1 |  2
C2       |  2 | −1 | −2 | −1 |  2
C3       | −1 |  2 |  0 | −2 |  1
C4       |  1 | −4 |  6 | −4 |  1

Note: each row sums to zero (so each is a contrast), and the rows are mutually orthogonal: Σ_{i=1}^5 k_{r,i} k_{s,i} = 0 for r ≠ s. C1 through C4 measure the linear, quadratic, cubic and quartic components of the trend in density.
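The orthogonality claims above can be verified directly:

```python
# Each contrast vector sums to zero, and all pairwise dot products vanish.
import numpy as np

K = np.array([[-2, -1,  0,  1,  2],     # C1
              [ 2, -1, -2, -1,  2],     # C2
              [-1,  2,  0, -2,  1],     # C3
              [ 1, -4,  6, -4,  1]])    # C4

assert np.all(K.sum(axis=1) == 0)                 # contrasts: sum to zero
G = K @ K.T                                       # Gram matrix of the rows
assert np.all(G[~np.eye(4, dtype=bool)] == 0)     # mutually orthogonal
print("orthogonal")
```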
> 3*c.hat^2
       .L  .Q  .C  ^4
[1,] 43.2  42 0.3 2.1
> sum(3*c.hat^2)
[1] 87.6

Contrast  | df | SS    | MS    | F
C1 (.L)   |  1 | 43.20 | 43.20 | 57.75
C2 (.Q)   |  1 | 42.00 | 42.00 | 56.15
C3 (.C)   |  1 |  0.30 |  0.30 |  0.40
C4 (^4)   |  1 |  2.10 |  2.10 |  2.81
Treatment |  4 | 87.60 | 21.90 | 29.28
Error     | 10 |  7.48 |  0.75 |
Total     | 14 | 95.08 |       |
The useful idea behind orthogonal contrasts is that the treatment variation can be decomposed into orthogonal parts. As you might expect, under H0r : Cr = 0, the F-statistic corresponding to the rth contrast has an F-distribution with 1 and N − m degrees of freedom (assuming normality, constant variance, etc.). For the planting density data, we find strong evidence of linear and quadratic components to the relationship between density and yield.
5.6.3 Multiple Comparisons

An experiment with m treatment levels has (m choose 2) = m(m − 1)/2 pairwise comparisons, i.e. contrasts of the form C = μi − μj. Should we perform hypothesis tests for all comparisons? The more hypotheses we test, the higher the probability that at least one of them will be rejected, regardless of their validity.

Two levels of error: Define the hypotheses

    H0 : μi = μj for all i, j
    H0ij : μi = μj .

We can associate error rates to both of these types of hypotheses:

- Experiment-wise type I error rate: Pr(reject H0 | H0 is true).
- Comparison-wise type I error rate: Pr(reject H0ij | H0ij is true).

Consider the following procedure:

1. Gather data.
2. Compute all pairwise contrasts and their t-statistics.
3. Reject each H0ij for which |tij| > t_{1−αC/2, N−m}.

Letting tcrit = t_{1−αC/2, N−m}, the comparison-wise type I error rate is of course

    P(|tij| > tcrit | H0ij) = αC .

The experiment-wise type I error rate is the probability that we say differences between treatments exist when no differences exist:

    P(|tij| > tcrit for some i, j | H0) ≥ αC ,

with equality only if there are two treatments total. The fact that the experiment-wise error rate is larger than the comparison-wise rate is called the issue of multiple comparisons. What is the experiment-wise rate in this analysis procedure? If the tests were independent (they are not quite),

    Pr(one or more H0ij rejected | H0) = 1 − Pr(none of H0ij rejected | H0)
                                       = 1 − Π_{i,j} Pr(H0ij not rejected | H0)
                                       = 1 − (1 − αC)^{(m choose 2)} .

We can approximate this with a Taylor series expansion: let f(x) = 1 − (1 − αC)^x. Then f′(0) = −log(1 − αC) and

    f(x) ≈ f(0) + x f′(0)
    1 − (1 − αC)^x ≈ (1 − 1) − x log(1 − αC) ≈ x αC   for small αC .
So, directly from Bonferroni's inequality,

    Pr(one or more H0ij rejected | H0) ≤ Σ_{i,j} Pr(H0ij rejected | H0) = (m choose 2) αC ,

which can be much larger than αC .
Bonferroni error control:

The Bonferroni procedure for controlling the experiment-wise type I error rate is as follows:

1. Compute pairwise t-statistics on all (m choose 2) pairs.
2. Reject H0ij if |tij| > t_{1−αC/2, N−m}, where αC = αE / (m choose 2).

Under this procedure,

- the experiment-wise error rate is less than αE;
- the comparison-wise error rate is αC.

So for example if αE = 0.05 and m = 5, then αC = 0.005.
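The Bonferroni arithmetic for this example, sketched in Python:

```python
# m = 5 treatments: ten pairwise comparisons, each tested at alpha_E / 10.
from math import comb

alpha_E, m = 0.05, 5
npairs = comb(m, 2)
alpha_C = alpha_E / npairs
bound = 1 - (1 - alpha_C) ** npairs     # experiment-wise rate is below alpha_E
print(npairs, alpha_C, round(bound, 4))
```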
Fisher's Protected LSD (Fisher's least-significant-difference method):

Another approach to controlling type I error makes use of the F-test:

1. Perform ANOVA and compute the F-statistic.
2. If F(y) < F_{1−αE, m−1, N−m}, then don't reject H0 and stop.
3. If F(y) > F_{1−αE, m−1, N−m}, then reject H0 and reject all H0ij for which |Ĉij/SE[Ĉij]| > t_{1−αC/2, N−m}.
5.6.4
5.6.5
Nonparametric tests
Chapter 6
Factorial Designs
Example (Insecticide): In developing methods of pest control, researchers
are interested in the efficacy of different types of poison and different delivery methods.
Treatments:
Type {I, II, III}
Delivery {A, B, C, D}
Response: time to death, in minutes.
Possible experimental design: Perform two separate CRD experiments, one testing for the effects of Type and the other for Delivery. But,

- which Delivery to use for the experiment testing for Type effects?
- which Type to use for the experiment testing for Delivery effects?

To compare different Type × Delivery combinations, we need to do experiments under all 12 treatment combinations.

Experimental design: 48 insects randomly assigned to treatments: 4 to each treatment combination, i.e.

- 4 assigned to (I, A),
- 4 assigned to (I, B),
  ...
It might be helpful to visualize the design as follows:

         Delivery
Type |   A       B       C       D
I    |  ȳI,A    ȳI,B    ȳI,C    ȳI,D
II   |  ȳII,A   ȳII,B   ȳII,C   ȳII,D
III  |  ȳIII,A  ȳIII,B  ȳIII,C  ȳIII,D
6.1 Data analysis

[Figure: boxplots of time to death by poison Type (I, II, III).]

A one-way ANOVA on Type alone gives:

           Df  Sum Sq Mean Sq F value    Pr(>F)
dat$type    2 0.34877 0.17439  25.621 3.728e-08
[Figures: boxplots of the response by Type, by Delivery, and by each of the twelve Type × Delivery combinations, on two scales, along with plots of the group standard deviations versus the group means, raw and on the log-log scale.]
In the third ANOVA, can we assess the effects of Type and Delivery separately? Can you think of a situation where the F-stats in the first and second ANOVAs would be small, but the F-stat in the third ANOVA big?

Basically, the first and second ANOVAs may mischaracterize the data and sources of variation. The third ANOVA is valid, but we'd like a more specific result: we'd like to know which factors are sources of variation, and the relative magnitude of their effects. Also, if the effects of one factor are consistent across levels of the other, maybe we don't need to have a separate parameter for each of the 12 treatment combinations, i.e. a simpler model may suffice.
6.2 The additive model

    Yijk = μ + ai + bj + εijk ,   i = 1, . . . , m1 ,  j = 1, . . . , m2 ,  k = 1, . . . , n ,

where

    μ = overall mean;
    a1, . . . , am1 = additive effects of factor 1;
    b1, . . . , bm2 = additive effects of factor 2.
Notes:

1. Side conditions: As with the treatment effects model in the one-factor case, we only need m1 − 1 parameters to differentiate between m1 means, so we usually either
   - restrict a1 = 0, b1 = 0 (set-to-zero side conditions), OR
   - restrict Σ ai = 0, Σ bj = 0 (sum-to-zero side conditions).

2. The additive model is a reduced model: There are m1 × m2 groups or treatment combinations, and a full model fits a different population mean separately to each treatment combination, requiring m1 × m2 parameters. In contrast, the additive model only has

       1            parameter for μ
       m1 − 1       parameters for the ai's
       m2 − 1       parameters for the bj's
       m1 + m2 − 1  parameters total.
To move between the two sets of side conditions, subtract â1 from the âi's, and subtract b̂1 from the b̂j's. Note that this does not change the fitted value in each group:

    fitted(yijk) = μ̂ + âi + b̂j
                 = (μ̂ + â1 + b̂1) + (âi − â1) + (b̂j − b̂1).
As you might have guessed, we can write this decomposition out as vectors of length m1 × m2 × n:

    y − ȳ··· = â + b̂ + ε̂
    vT = v1 + v2 + ve .

The columns represent

    vT : variation of the data around the grand mean;
    v1 : variation of factor 1 means around the grand mean;
    v2 : variation of factor 2 means around the grand mean;
    ve : variation of the data around the fitted values.

You should be able to show that these vectors are orthogonal, and so

    Σ_i Σ_j Σ_k (yijk − ȳ···)² = Σ_i Σ_j Σ_k âi² + Σ_i Σ_j Σ_k b̂j² + Σ_i Σ_j Σ_k ε̂ijk²
    SSTotal = SSA + SSB + SSE.

Degrees of freedom:

    â contains m1 different numbers but sums to zero → m1 − 1 dof;
    b̂ contains m2 different numbers but sums to zero → m2 − 1 dof.
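The orthogonal decomposition can be checked numerically on a toy balanced two-way layout (simulated data, illustrative only):

```python
# Check that the additive two-way sums of squares add exactly.
import numpy as np

rng = np.random.default_rng(3)
m1, m2, n = 3, 4, 2
y = rng.normal(size=(m1, m2, n))

grand = y.mean()
a_hat = y.mean(axis=(1, 2)) - grand     # factor-1 effect estimates
b_hat = y.mean(axis=(0, 2)) - grand     # factor-2 effect estimates
resid = y - (grand + a_hat[:, None, None] + b_hat[None, :, None])

SSTotal = ((y - grand) ** 2).sum()
SSA = m2 * n * (a_hat ** 2).sum()
SSB = m1 * n * (b_hat ** 2).sum()
SSE = (resid ** 2).sum()
print(SSTotal - (SSA + SSB + SSE))      # essentially zero
```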
ANOVA table:

Source |   SS    |             df             |   MS    |    F
A      |  SSA    | m1 − 1                     | SSA/dfA | MSA/MSE
B      |  SSB    | m2 − 1                     | SSB/dfB | MSB/MSE
Error  |  SSE    | (m1−1)(m2−1) + m1 m2 (n−1) | SSE/dfE |
Total  | SSTotal | m1 m2 n − 1                |         |

For the insecticide data, the additive fit gives:

          Mean Sq F value
          0.17439  71.708
          0.06805  27.982
          0.00243
This ANOVA has decomposed the variance in the data into the variance of additive Type effects, additive Delivery effects, and residuals. Does this adequately represent what is going on in the data? What do we mean by "additive"? Assuming the model is correct, we have:

    E[Y | type = I, delivery = A] = μ + a1 + b1
    E[Y | type = II, delivery = A] = μ + a2 + b1 .

This says that the difference between Type I and Type II is a1 − a2 regardless of Delivery. Does this look right based on the plots? Consider the following table:

Effect of Type I vs II, given Delivery:

Delivery | full model    | additive model
A        | μI,A − μII,A  | (μ + a1 + b1) − (μ + a2 + b1) = a1 − a2
B        | μI,B − μII,B  | (μ + a1 + b2) − (μ + a2 + b2) = a1 − a2
C        | μI,C − μII,C  | (μ + a1 + b3) − (μ + a2 + b3) = a1 − a2
D        | μI,D − μII,D  | (μ + a1 + b4) − (μ + a2 + b4) = a1 − a2

- The full model allows differences between Types to vary across levels of Delivery.
- The reduced/additive model says differences are constant across levels of Delivery.
The interaction model extends the additive model with interaction terms:

    yijk = μ̂ + âi + b̂j + (âb)ij + ε̂ijk .

Note that the interaction term is equal to the fitted value under the full model (ȳij·) minus the fitted value under the additive model (ȳi·· + ȳ·j· − ȳ···). Deciding between the additive/reduced model and the interaction/full model is tantamount to deciding whether the variance explained by the (âb)ij's is large or not, i.e. whether or not the full model fit is close to the additive model fit.
6.3 Evaluating additivity

The interaction terms are estimated as

    (âb)ij = ȳij· − (ȳi·· + ȳ·j· − ȳ···).
Interactions and the full model: The interaction terms also can be
derived by taking the additive decomposition above one step further: The
residual in the additive model can be written:
i yj + y
A
ijk = yijk y
= (yijk yij ) + (
yij yi yj + y )
I
= + (ab)
ijk
ij
=
+
a
i
+
bj
+
(ab)
+
ijk
Fitted value:

  ŷ_{ijk} = μ̂ + â_i + b̂_j + \widehat{ab}_{ij} = ȳ_{ij·}

This is a full model for the treatment means: the estimate of the mean in each cell depends only on data from that cell. Contrast this to the additive model.

Residual:

  ε̂_{ijk} = y_{ijk} - ŷ_{ijk} = y_{ijk} - ȳ_{ij·}
Thus the full model ANOVA decomposition partitions the variability among the cell means ȳ_{11·}, ȳ_{12·}, ..., ȳ_{m1 m2 ·} into the overall mean, the additive effects and the interaction effects. As you might expect, these different parts are orthogonal, resulting in the following orthogonal decomposition of the variance:
                        Df  Sum Sq  Mean Sq  F value
  pois$deliv             3 0.20414  0.06805  28.3431
  pois$type              2 0.34877  0.17439  72.6347
  pois$deliv:pois$type   6 0.01571  0.00262   1.0904
  Residuals             36 0.08643  0.00240
So notice:

  0.10214 = 0.01571 + 0.08643, that is, SSE_add = SSAB_int + SSE_int
  42 = 6 + 36, that is, dof(SSE_add) = dof(SSAB_int) + dof(SSE_int)
  SSA, SSB, dof(A), dof(B) are unchanged in the two models
  MSE_add ≈ MSE_int, but the degrees of freedom are larger in the additive model. Which do you think is a better estimate of the within-group variance?
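These bookkeeping relationships between the two tables can be checked directly (a small Python check using the rounded values printed above):

```python
# Rounded values from the interaction-model and additive-model ANOVA tables.
ss_ab, ss_e_int = 0.01571, 0.08643    # interaction model: SSAB and SSE
ss_e_add = 0.10214                    # additive model: SSE
df_ab, df_e_int, df_e_add = 6, 36, 42

# SSE_add = SSAB_int + SSE_int, and the same identity for degrees of freedom.
assert abs(ss_e_add - (ss_ab + ss_e_int)) < 1e-9
assert df_e_add == df_ab + df_e_int

# The two competing estimates of the within-group variance:
mse_int = ss_e_int / df_e_int   # about 0.00240
mse_add = ss_e_add / df_e_add   # about 0.00243
print(mse_int, mse_add)
```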
Expected sums of squares: If H0 : (ab)_{ij} = 0 is true, then

  E[MSE] = σ²
  E[MSAB] = σ²

If H0 : (ab)_{ij} = 0 is not true, then

  E[MSE] = σ²
  E[MSAB] = σ² + r·Δ²_{AB} > σ²,

where Δ²_{AB} > 0 reflects the magnitude of the interaction terms.
This suggests the following:

  The adequacy of the additive model can be assessed by comparing MSAB to MSE. Under H0 : (ab)_{ij} = 0,

    F_AB = MSAB/MSE ~ F_{(m1-1)(m2-1), m1 m2 (n-1)}

  Evidence against H0 can be evaluated by computing the p-value.

  If the additive model is adequate, then MSE_int and MSAB are two independent estimates of roughly the same thing (why independent?). We may then want to combine them to improve our estimate of σ².
                        Df  Sum Sq  Mean Sq  F value     Pr(>F)
  pois$deliv             3 0.20414  0.06805  28.3431  1.376e-09
  pois$type              2 0.34877  0.17439  72.6347  2.310e-13
  pois$deliv:pois$type   6 0.01571  0.00262   1.0904     0.3867
  Residuals             36 0.08643  0.00240
For these data, there is strong evidence of both treatment effects, and little
evidence of non-additivity. We may want to use the additive model.
6.4

Figure 6.7: Comparison between types I and II, without respect to delivery.
For this comparison, s_{12} = 0.081 and n·m2 = 4 × 4 = 16, giving t-statistic = -1.638, df = 30, p-value = 0.112.

Questions:

  What is s²_{12} estimating?
  What should we be comparing the factor level differences to?

If Delivery is a known source of variation, we should compare differences between levels of Poison type to the variability within a treatment combination, i.e. to σ. For the above example, s_{12} = 0.081, whereas s_MSE = sqrt(0.00240) ≈ 0.05, a ratio of about 1.65. Using s_MSE instead gives

  (ȳ_{1··} - ȳ_{2··}) / (s_MSE sqrt(2/(4 × 4))) = -2.7,   p-value ≈ 0.01
Testing additive effects: Let μ_{ij} be the population mean in cell ij. The relationship between the cell means model and the parameters in the interaction model is as follows:

  μ_{ij} = μ + (μ_{i·} - μ) + (μ_{·j} - μ) + (μ_{ij} - μ_{i·} - μ_{·j} + μ)
         = μ + a_i + b_j + (ab)_{ij}
and so

  Σ_i a_i = 0
  Σ_j b_j = 0
  Σ_i (ab)_{ij} = 0 for each j
  Σ_j (ab)_{ij} = 0 for each i

Figure 6.8: Comparison between types I and II, with delivery in color.
(population) means:

           F2=1    F2=2    F2=3    F2=4    row mean
  F1=1     μ_11    μ_12    μ_13    μ_14    μ_{1·}
  F1=2     μ_21    μ_22    μ_23    μ_24    μ_{2·}
  F1=3     μ_31    μ_32    μ_33    μ_34    μ_{3·}
  col mean μ_{·1}  μ_{·2}  μ_{·3}  μ_{·4}  μ_{··}

Each row mean averages 4 cell means, each column mean averages 3, and the grand mean averages all 12.
So

  a_1 - a_2 = μ_{1·} - μ_{2·} = (μ_11 + μ_12 + μ_13 + μ_14)/4 - (μ_21 + μ_22 + μ_23 + μ_24)/4

Like any contrast, we can estimate/make inference for it using contrasts of sample means:

  â_1 - â_2 = ȳ_{1··} - ȳ_{2··} is an unbiased estimate of a_1 - a_2.

Note that this estimate is the corresponding contrast among the m1 × m2 sample means:
           F2=1     F2=2     F2=3     F2=4     row mean
  F1=1     ȳ_{11·}  ȳ_{12·}  ȳ_{13·}  ȳ_{14·}  ȳ_{1··}
  F1=2     ȳ_{21·}  ȳ_{22·}  ȳ_{23·}  ȳ_{24·}  ȳ_{2··}
  F1=3     ȳ_{31·}  ȳ_{32·}  ȳ_{33·}  ȳ_{34·}  ȳ_{3··}
  col mean ȳ_{·1·}  ȳ_{·2·}  ȳ_{·3·}  ȳ_{·4·}  ȳ_{···}

So

  â_1 - â_2 = ȳ_{1··} - ȳ_{2··}
            = (ȳ_{11·} + ȳ_{12·} + ȳ_{13·} + ȳ_{14·})/4 - (ȳ_{21·} + ȳ_{22·} + ȳ_{23·} + ȳ_{24·})/4
Hypothesis tests and confidence intervals can be made using the standard assumptions:

  E[â_1 - â_2] = a_1 - a_2

Under the assumption of constant variance,

  Var[â_1 - â_2] = Var[ȳ_{1··} - ȳ_{2··}]
                 = Var[ȳ_{1··}] + Var[ȳ_{2··}]
                 = σ²/(n m2) + σ²/(n m2)
                 = 2σ²/(n m2)
and so

  [(â_1 - â_2) - (a_1 - a_2)] / SE[â_1 - â_2] = [(â_1 - â_2) - (a_1 - a_2)] / sqrt(2 MSE/(n m2)) ~ t_ν,

where ν is the degrees of freedom associated with the MSE. A level-α_C test of H0 : a_1 = a_2 therefore rejects when

  |â_1 - â_2| > t_{1-α_C/2, ν} sqrt(2 MSE/(n m2))

So the quantity

  LSD_1 = t_{1-α_C/2, ν} SE(â_1 - â_2) = t_{1-α_C/2, ν} sqrt(2 MSE/(n m2))

is a yardstick for comparing levels of factor 1. It is sometimes called the least significant difference for comparing levels of Factor 1. It is analogous to the LSD we used in the 1-factor ANOVA.
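For the poison data this works out roughly as follows (a Python sketch for comparing Type means, each an average of n·m2 = 16 observations; the quantile t_{0.975, 42} ≈ 2.018 is hard-coded, since the Python standard library has no t-distribution):

```python
from math import sqrt

# Additive-model MSE for the poison data, on 42 degrees of freedom.
mse = 0.00243
n, m2 = 4, 4            # n reps per cell, m2 = 4 delivery levels
t_quantile = 2.018      # assumed value of the t_{0.975, 42} quantile

# LSD for comparing two Type means, each averaging n * m2 = 16 observations:
lsd = t_quantile * sqrt(2 * mse / (n * m2))
print(round(lsd, 4))    # roughly 0.035
```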
Important note: The LSD depends on which factor you are looking at. The comparison of levels of Factor 2 depends on

  Var[ȳ_{·1·} - ȳ_{·2·}] = Var[ȳ_{·1·}] + Var[ȳ_{·2·}]
                         = σ²/(n m1) + σ²/(n m1)
                         = 2σ²/(n m1),

so the corresponding yardstick is LSD_2 = t_{1-α_C/2, ν} sqrt(2 MSE/(n m1)).

                        Df  Sum Sq  Mean Sq  F value     Pr(>F)
  pois$deliv             3 0.20414  0.06805  28.3431  1.376e-09
  pois$type              2 0.34877  0.17439  72.6347  2.310e-13
  pois$deliv:pois$type   6 0.01571  0.00262   1.0904     0.3867
  Residuals             36 0.08643  0.00240
There is not very much evidence that the effects are non-additive. Let's assume there is no interaction term. If we are correct, then we will have increased the precision of our variance estimate:
              Df  Sum Sq  Mean Sq  F value     Pr(>F)
  pois$deliv   3 0.20414  0.06805   27.982  4.192e-10
  pois$type    2 0.34877  0.17439   71.708  2.865e-14
  Residuals   42 0.10214  0.00243
So there is strong evidence against the hypothesis that the additive effects
are zero for either factor. Which treatments within a factor are different from
each other?
[LSD comparison plots: observed rates for each level of Delivery and of Type, with LSD groupings indicating which levels are distinguishable from one another.]
Note the differences between these comparisons and those from two-sample
t-tests.
Interpretation of estimated additive effects: If the additive model is clearly wrong, can we still interpret additive effects? The full model is

  y_{ijk} = μ_{ij} + ε_{ijk}

A reparameterization of this model is the interaction model:

  y_{ijk} = μ + a_i + b_j + (ab)_{ij} + ε_{ijk}
where

  μ = (1/(m1 m2)) Σ_i Σ_j μ_{ij}
  a_i = (1/m2) Σ_j (μ_{ij} - μ) = μ_{i·} - μ
  b_j = (1/m1) Σ_i (μ_{ij} - μ) = μ_{·j} - μ
  (ab)_{ij} = μ_{ij} - μ_{i·} - μ_{·j} + μ = μ_{ij} - (μ + a_i + b_j)
The terms {a_1, ..., a_{m1}}, {b_1, ..., b_{m2}} are sometimes called main effects. The additive model is

  y_{ijk} = μ + a_i + b_j + ε_{ijk}

Sometimes this is called the main effects model. If this model is correct, it implies that

  (ab)_{ij} = 0 for all i, j
  μ_{i1 j} - μ_{i2 j} = a_{i1} - a_{i2} for all i1, i2, j
  μ_{i j1} - μ_{i j2} = b_{j1} - b_{j2} for all i, j1, j2
What are the estimated additive effects estimating, in either case?

  â_1 - â_2 = (ȳ_{1··} - ȳ_{···}) - (ȳ_{2··} - ȳ_{···})
            = ȳ_{1··} - ȳ_{2··}
            = (1/m2) Σ_{j=1}^{m2} ȳ_{1j·} - (1/m2) Σ_{j=1}^{m2} ȳ_{2j·}
            = (1/m2) Σ_{j=1}^{m2} (ȳ_{1j·} - ȳ_{2j·})

Now

  E[ (1/m2) Σ_{j=1}^{m2} (ȳ_{1j·} - ȳ_{2j·}) ] = (1/m2) Σ_{j=1}^{m2} (μ_{1j} - μ_{2j}) = a_1 - a_2,

so â_1 - â_2 is estimating a_1 - a_2, regardless of whether or not additivity is correct. Now, how do we interpret this effect?
  Regardless of additivity, â_1 - â_2 can be interpreted as the estimated difference in response between having factor 1 = 1 and factor 1 = 2, averaged over the experimental levels of factor 2.

  If additivity is correct, â_1 - â_2 can further be interpreted as the estimated difference in response between having factor 1 = 1 and factor 1 = 2, for every level of factor 2 in the experiment.

Statistical folklore suggests that if there is significant non-additivity, then you can't interpret main/additive effects. As we can see, this is not true: the additive effects have a very definite interpretation under the full model. In some cases (block designs, coming up), we may be interested in additive effects even if there is significant interaction.
Dissecting the interaction: Sometimes if there is an interaction, we might want to go in and compare individual cell means. Consider the following table of means from a 2 × 4 two-factor ANOVA:

  ȳ_{11·}  ȳ_{12·}  ȳ_{13·}  ȳ_{14·}
  ȳ_{21·}  ȳ_{22·}  ȳ_{23·}  ȳ_{24·}

A large interaction SS in the ANOVA table gives us evidence, for example, that μ_{1j} - μ_{2j} varies across levels j of factor 2. It may be useful to dissect this variability further, and understand how the non-additivity is manifested in the data. For example, consider the three plots in Figure 6.10. These all would give a large interaction SS, but imply very different things about the effects of the factors.
Contrasts for examining interactions: Suppose we want to compare the effect of (factor 1 = 1) to (factor 1 = 2) across levels of factor 2. This involves contrasts of the form

  C = (μ_{1j} - μ_{2j}) - (μ_{1k} - μ_{2k})

This contrast can be estimated with the sample contrast:

  Ĉ = (ȳ_{1j·} - ȳ_{2j·}) - (ȳ_{1k·} - ȳ_{2k·})

As usual, the standard error of this contrast is the estimate of its standard deviation. With n replications per cell,

  Var[Ĉ] = σ²/n + σ²/n + σ²/n + σ²/n = 4σ²/n
  SE[Ĉ] = 2 sqrt(MSE/n)

Confidence intervals and t-tests for C can be made in the usual way.
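For instance, with hypothetical cell means and an assumed MSE and replication count (all numbers made up for illustration, a Python sketch):

```python
from math import sqrt

# Hypothetical cell means for factor-1 levels 1, 2 at factor-2 levels j and k.
ybar_1j, ybar_2j = 0.45, 0.20
ybar_1k, ybar_2k = 0.30, 0.25
mse, n = 0.0024, 4      # assumed MSE and replications per cell

c_hat = (ybar_1j - ybar_2j) - (ybar_1k - ybar_2k)   # sample contrast
se_c = 2 * sqrt(mse / n)                            # SE[C-hat] = sqrt(4 MSE / n)
t_stat = c_hat / se_c
print(c_hat, se_c, t_stat)
```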
[Figure 6.10: Three hypothetical interaction plots of the eight cell means (cells 1.1 through 2.4), each giving a large interaction SS but implying very different factor effects.]
6.5

Recall the F-ratio for factor 1, MS1/MSE. If SS2 is large and factor 2 is left out of the model, that variability ends up in the MSE. In general, if a factor

1. affects the response, and
2. varies across experimental units,

then it will increase the variance in the response, and also the experimental error variance/MSE if unaccounted for. If F2 is a known, potentially large source of variation, we can control for it pre-experimentally with a block design.
Blocking: The stratification of experimental units into groups that are more
homogeneous than the whole.
Objective: To have less variation among units within blocks than between
blocks.
Figure 6.11: Experimental material in need of blocking (an irrigated field with dry and wet regions).
Typical blocking criteria:
location
physical characteristics
time
Example (Nitrogen fertilizer timing): How does the timing of nitrogen additive affect nitrogen uptake?

  Treatment: Six different timing schedules 1, ..., 6 (level 4 is the standard).
  Response: Nitrogen uptake (ppm × 10²).
  Experimental material: One irrigated field.

Soil moisture is thought to be a source of variation in response.
[Figure 6.12: Results of the experiment. Each plot in the 4 × 6 grid (rows × columns) is labeled with its treatment level and observed uptake. Responses by treatment:

  trt 1: 34.98, 41.22, 36.94, 39.97
  trt 2: 46.69, 46.65, 41.90, 40.89
  trt 3: 42.07, 49.42, 52.68, 42.91
  trt 4: 37.18, 45.85, 40.23, 39.20
  trt 5: 37.99, 41.99, 37.61, 40.45
  trt 6: 34.89, 50.15, 44.57, 43.29]
Design:

1. Field is divided into a 4 × 6 grid.
2. Within each row, or block, each of the 6 treatments is randomly allocated.
1. The experimental units are blocked into presumably more homogeneous
groups.
2. The blocks are complete, in that each treatment appears in each block.
3. The blocks are balanced, in that there are
m1 = 6 observations for each level of block
m2 = 4 observations for each level of trt
n = 1 observation for each trt × block combination.
This design is a randomized complete block design.
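Generating such an assignment is simple: independently shuffle the six treatments within each block. A Python sketch (not part of the original experiment):

```python
import random

random.seed(1)
n_blocks = 4
treatments = [1, 2, 3, 4, 5, 6]

# Randomized complete block design: an independent random ordering
# of all six treatments within each block (row of the field).
design = []
for _ in range(n_blocks):
    row = treatments[:]
    random.shuffle(row)
    design.append(row)

for row in design:
    print(row)
```

Each printed row is one block, and each treatment appears exactly once per block, so the design is complete and balanced by construction.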
Figure 6.13: Marginal plots, and residuals without controlling for row.
Analysis of the RCB design with one rep: Analysis proceeds just as in the two-factor ANOVA:

  y_{ij} - ȳ_{··} = (ȳ_{i·} - ȳ_{··}) + (ȳ_{·j} - ȳ_{··}) + (y_{ij} - ȳ_{i·} - ȳ_{·j} + ȳ_{··})

  SSTotal = SSTrt + SSB + SSE
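The decomposition can be verified numerically; here is a Python sketch with a made-up 3 × 4 table (one observation per cell):

```python
# Made-up responses y[i][j] for treatment i in block j, one observation per cell.
y = [[38.0, 42.1, 49.4, 40.2],
     [37.2, 41.2, 45.9, 39.2],
     [35.0, 46.7, 50.2, 41.9]]
m1, m2 = 3, 4

grand = sum(sum(r) for r in y) / (m1 * m2)
trt = [sum(r) / m2 for r in y]                                   # treatment means
blk = [sum(y[i][j] for i in range(m1)) / m1 for j in range(m2)]  # block means

ss_total = sum((y[i][j] - grand) ** 2 for i in range(m1) for j in range(m2))
ss_trt = m2 * sum((t - grand) ** 2 for t in trt)
ss_blk = m1 * sum((b - grand) ** 2 for b in blk)
ss_err = sum((y[i][j] - trt[i] - blk[j] + grand) ** 2
             for i in range(m1) for j in range(m2))

print(ss_total, ss_trt + ss_blk + ss_err)   # the two agree
```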
ANOVA table:

  Source   SS        dof                MS                       F-ratio
  Trt      SST       m1 - 1             SST/(m1 - 1)             MST/MSE
  Block    SSB       m2 - 1             SSB/(m2 - 1)             (MSB/MSE)
  Error    SSE       (m1 - 1)(m2 - 1)   SSE/[(m1 - 1)(m2 - 1)]
  Total    SSTotal   m1 m2 - 1
                                       Df  Sum Sq  Mean Sq
  as.factor(c(trt)):as.factor(c(rw))   23  506.33    22.01
  Residuals                             0    0.00
Can we test for interaction? Do we care about interaction in this case, or just main effects? Suppose it were true that in row 2, timing 6 is significantly better than timing 4, but in row 3, treatment 3 is better. Is this relevant for recommending a timing treatment for other fields?
Did blocking help? Consider a CRD as an alternative:

  block 1:  2  5  6  1  4  5
  block 2:  3  2  3  3  4  6
  block 3:  2  4  2  2  1  1
  block 4:  6  5  4  4  5  6
Advantages:

  There are more possible treatment assignments, so power is increased in a randomization test.
  If we don't estimate block effects, we'll have more dof for error.

Disadvantages:

  It is possible (but unlikely) that some treatment level will get assigned many times to a good row, leading to post-experimental bias.
  If row is a big source of variation, then ignoring it may lead to an overly large MSE.
Consider comparing the F-statistic from a CRD with that from an RCB. According to Cochran and Cox (1957),

  MSE_crd = [SSB + n(m - 1) MSE_rcbd] / (nm - 1)
          = MSB (n - 1)/(nm - 1) + MSE_rcbd n(m - 1)/(nm - 1)
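A quick numerical check of this identity (Python, with made-up values of MSB and MSE_rcbd; recall SSB = (n - 1)·MSB for n blocks):

```python
# Hypothetical values: n = 4 blocks, m = 6 treatments.
n, m = 4, 6
msb, mse_rcbd = 0.52, 0.025

ssb = (n - 1) * msb   # SSB = (n - 1) * MSB
mse_crd = (ssb + n * (m - 1) * mse_rcbd) / (n * m - 1)

# The same quantity written as a weighted average of MSB and MSE_rcbd:
alt = msb * (n - 1) / (n * m - 1) + mse_rcbd * n * (m - 1) / (n * m - 1)
print(mse_crd, alt)
```

With a large block mean square, the hypothetical CRD error estimate is much larger than the RCB one, which is the point of blocking.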
6.6 Unbalanced designs
cell means:

                  interstate       two-lane        sum              marginal mean
  rainy           15 (n11 = 8)     5  (n12 = 2)    130 (n1· = 10)   13
  not rainy       20 (n21 = 2)     10 (n22 = 8)    120 (n2· = 10)   12
  sum             160 (n·1 = 10)   90 (n·2 = 10)   250 (n = 20)
  marginal mean   16               9                                ȳ_{··} = 12.5
  SSF = Σ_{i=1}^{2} Σ_{j=1}^{2} Σ_k (ȳ_{ij} - ȳ_{··})² = Σ_{i=1}^{2} Σ_{j=1}^{2} n_{ij} (ȳ_{ij} - ȳ_{··})²
  μ̂_{ij} = (1/n_{ij}) Σ_k y_{ijk},

  σ̂² = s² = [1/(N - m1 m2)] Σ_i Σ_j Σ_k (y_{ijk} - ȳ_{ij})²
The idea: in an unbalanced design,

  (1/m2) Σ_{j=1}^{m2} μ̂_{ij} ≠ ȳ_{i··}  and
  (1/m1) Σ_{i=1}^{m1} μ̂_{ij} ≠ ȳ_{·j·}  in general.
        1          2          ...   m2
  1     ȳ_{11}     ȳ_{12}     ...   ȳ_{1 m2}
  2     ȳ_{21}     ȳ_{22}     ...   ȳ_{2 m2}
  ...
  m1    ȳ_{m1 1}   ȳ_{m1 2}   ...   ȳ_{m1 m2}
Accident example:

                  interstate   two-lane   marginal mean   LS mean
  rainy           15           5          13              10
  not rainy       20           10         12              15
  marginal mean   16           9
  LS mean         17.5         7.5
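The marginal and LS means in this table can be reproduced directly from the cell means and sample sizes (a Python sketch):

```python
# Cell means and sample sizes: rows = (rainy, not rainy), cols = (interstate, two-lane).
ybar = [[15.0, 5.0],
        [20.0, 10.0]]
n = [[8, 2],
     [2, 8]]

# Marginal (sample-size weighted) row means:
marg = [(n[i][0] * ybar[i][0] + n[i][1] * ybar[i][1]) / (n[i][0] + n[i][1])
        for i in range(2)]

# LS (unweighted) row means: simple averages of the cell means.
ls = [(ybar[i][0] + ybar[i][1]) / 2 for i in range(2)]

print(marg)   # [13.0, 12.0]
print(ls)     # [10.0, 15.0]
```

The weighted and unweighted averages point in opposite directions here, which is exactly why the marginal means are misleading for this unbalanced design.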
The LS mean for level i of factor 1 is μ̂_{i·} = (1/m2) Σ_j ȳ_{ij}, so

  Var[μ̂_{i·}] = (1/m2²) Σ_j σ²/n_{ij}

  SE[μ̂_{i·}] = (1/m2) sqrt( Σ_j MSE/n_{ij} )

and similarly

  SE[μ̂_{·j}] = (1/m1) sqrt( Σ_i MSE/n_{ij} )
1. Compute SSEF = min over (μ, a, b, (ab)) of Σ_i Σ_j Σ_k (y_{ijk} - [μ + a_i + b_j + (ab)_{ij}])²
2. Compute SSEA = min over (μ, a, b) of Σ_i Σ_j Σ_k (y_{ijk} - [μ + a_i + b_j])²
3. Compute SSI = SSEA - SSEF. Note that this is always positive.

Allowing for interaction improves the fit, and reduces error variance. SSI measures the improvement in fit. If SSI is large, i.e. SSEA is much bigger than SSEF, this suggests the additive model does not fit well and the interaction term should be included in the model.
Testing:

  F = MSI/MSE = [SSI/((m1 - 1)(m2 - 1))] / [SSEF/(N - m1 m2)]
Do these marginal plots and means misrepresent the data? To evaluate this possibility,

  compare the marginal plots in Figure 6.14 to the interaction plots in Figure 6.15;
  compute the LS means and compare to marginal means.
  1.50  3.50  2.60
  1.70  3.70  2.80
What are the differences between LS means and marginal means? Not as
extreme as in the accident example, but the differences can be explained by
looking at the interaction plot, and the slight imbalance in the design:
The youngest patients (ageg=50) were imbalanced towards the higher
2
3
0.5826389 1.1179861
70
0.18902778
80
0.74458333
What linear modeling commands in R will get you the same thing?

  > options(contrasts=c("contr.sum","contr.poly"))
  > fit_full <- lm(y ~ as.factor(ageg)*as.factor(trt))
  > fit_full$coef[2:4]
  as.factor(ageg)1 as.factor(ageg)2 as.factor(ageg)3
        0.90902778       0.02458333       0.18902778

Note that the coefficients in the reduced/additive model are not the same:

  > fit_add <- lm(y ~ as.factor(ageg)+as.factor(trt))
  > fit_add$coef[2:4]
  as.factor(ageg)1 as.factor(ageg)2 as.factor(ageg)3
         0.7921041        0.3593577        0.3049595
  > fit_add$coef[5:6]
  as.factor(trt)1 as.factor(trt)2
      1.208328354     0.002085645
6.7
  > anova(lm(y ~ as.factor(ageg)+as.factor(trt)))
                  Df  Sum Sq Mean Sq F value  Pr(>F)
  as.factor(ageg)  3  13.355   4.452  0.9606 0.42737
  as.factor(trt)   2  28.254  14.127  3.0482 0.06613 .
  Residuals       24 111.230   4.635

  > anova(lm(y ~ as.factor(trt)+as.factor(ageg)))
                  Df  Sum Sq Mean Sq F value Pr(>F)
  as.factor(trt)   2  31.588  15.794  3.4079 0.0498
  as.factor(ageg)  3  10.021   3.340  0.7207 0.5494
  Residuals       24 111.230   4.635
Where do these sums of squares come from? What do the F-tests represent? By typing ?anova.lm in R, we see that anova() computes

  a sequential analysis of variance table for that fit. That is, the reductions in the residual sum of squares as each term of the formula is added in turn are given as the rows of a table, plus the residual sum of squares.
154
s s 0 <sum (
s s 1 <sum (
s s 2 <sum (
ss3<
lm ( y 1 ) $ r e s 2 )
lm ( y a s . f a c t o r ( ageg ) ) $ r e s 2 )
lm ( y a s . f a c t o r ( ageg )+ a s . f a c t o r ( t r t ) ) $ r e s 2 )
> s0s s 1
[ 1 ] 13.3554
>
> s s 1 s s 2
[ 1 ] 28.25390
>
> s s 2 s s 3
[ 1 ] 53.75015
> ss3
[ 1 ] 57.47955
  > anova(lm(y ~ as.factor(ageg)*as.factor(trt)))
                                 Df Sum Sq Mean Sq F value  Pr(>F)
  as.factor(ageg)                 3 13.355   4.452  1.3941 0.27688
  as.factor(trt)                  2 28.254  14.127  4.4239 0.02737
  as.factor(ageg):as.factor(trt)  6 53.750   8.958  2.8054 0.04167
  Residuals                      18 57.480   3.193
6.8 Analysis of covariance
Figure 6.17: ANOVA and ANCOVA fits to the oxygen uptake data.
This model gives a linear relationship between age and response for each group:

  if i = A:  Y_{i,j} = (μ + a_A) + b x_{i,j} + ε_{i,j}
  if i = B:  Y_{i,j} = (μ + a_B) + b x_{i,j} + ε_{i,j}

with intercept (μ + a_i), slope b and error ε_{i,j}.
The second one decomposes the variation in the data that is orthogonal to treatment (SSE from the first ANOVA) into a part that can be ascribed to age (SS age in the second ANOVA), and everything else (SSE from the second ANOVA). I will try to draw some triangles that describe this situation.

Now consider two other ANOVAs:

  > anova(lm(o2_change ~ age))
            Df Sum Sq Mean Sq F value    Pr(>F)
  age        1 576.09  576.09  40.519 8.187e-05
  Residuals 10 142.18   14.22
[Figure 6.18: y plotted against the levels of f2, with points labeled by treatment (A/B).]
6.9
Figure 6.18 gives a simple example where an unbalanced design can lead to
results that are difficult to interpret.
Two factors:

  F1 = treatment (A vs B)
  F2 = block (location 1 versus 2)
  10 experimental units
  Balanced CRD in F1, but not in F1 × F2.

Looking at things marginally in F1, we have ȳ_A > ȳ_B, and there seems to be an effect of F1 = A versus B. This is highlighted in the second panel of the figure, which shows the difference between the sample marginal means and the grand mean seems large compared to error variance.
The ANOVA quantifies this: There is variability in the data that can be
explained by either F1 or F2. In this case,
SSA > SSA|B
SSB > SSB|A
Do these inequalities always hold? Consider the data in Figure 6.19. In this case F1 = A is higher than F1 = B for both values of F2. But there are more A observations in the low-mean level of F2 than in the high-mean level. The second and third plots suggest:

  Marginally, the differences between levels of F1 are small.
  Within each group, the differences between levels of F1 are larger.

Thus controlling for F2 highlights the differences between levels of F1. This is confirmed in the corresponding ANOVA tables:
  > anova(lm(y ~ f1+f2))
            Df Sum Sq Mean Sq F value    Pr(>F)
  f1         1 0.1030  0.1030  0.5317 0.4895636
  f2         1 5.7851  5.7851 29.8772 0.0009399
  Residuals  7 1.3554  0.1936
[Figure 6.19: y plotted against the levels of f2, with points labeled by treatment (A/B).]
Which ANOVA table to use? Some software packages combine these anova
tables to form ANOVAs based on alternative types of sums of squares.
Consider a two-factor ANOVA in which we plan on decomposing variance
into additive effects of F1, additive effects of F2, and their interaction.
Type I SS: Sequential, orthogonal decomposition of the variance.
Type II SS: Sum of squares for a factor is the improvement in fit from
adding that factor, given inclusion of all other terms at that level or
below.
Type III SS: Sum of squares for a factor is the improvement in fit from
adding that factor, given inclusion of all other terms.
So, for example:

  SSF1 = RSS(∅) - RSS(F1) if F1 is first in the sequence
  SSF1 = RSS(F2) - RSS(F1 + F2) if F1 is second in the sequence
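Writing RSS(·) for the residual sum of squares of a fitted model, the SS assigned to F1 under each convention can be summarized as follows (this summary is mine, not from the notes):

```latex
\begin{align*}
\text{Type I (F1 entered first):} \quad
  & SS(F1) = RSS(\emptyset) - RSS(F1) \\
\text{Type II:} \quad
  & SS(F1) = RSS(F2) - RSS(F1 + F2) \\
\text{Type III:} \quad
  & SS(F1) = RSS(F2 + F1{:}F2) - RSS(F1 + F2 + F1{:}F2)
\end{align*}
```

In a balanced design all three conventions agree; in an unbalanced design they generally do not.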
Chapter 7
Nested Designs
Example (Potato): Sulfur added to soil kills bacteria, but too much sulfur
can damage crops. Researchers are interested in comparing two levels of
sulfur additive (low, high) on the damage to two types of potatoes.
Factors of interest:
1. Potato type {A, B}
2. Sulfur additive {low,high}
Experimental material: Four plots of land.
Design constraints:

  It is easy to plant different potato types within the same plot.
  It is difficult to have different sulfur treatments in the same plot, due to leaching.
Experimental Design: A Split-plot design
1. Each sulfur additive was randomly assigned to two of the four
plots.
2. Each plot was split into four subplots. Each potato type was
randomly assigned to two subplots per plot.
[Diagram: the four plots, each divided into four subplots, with potato types A and B randomly arranged within each plot.]
Randomization:
Sulfur type was randomized to whole plots;
Potato type was randomized to subplots.
Initial data analysis: Sixteen responses, 4 treatment combinations.
8 responses for each potato type
8 responses for each sulfur type
4 responses for each potato type × sulfur combination
  > fit.full <- lm(y ~ type*sulfur)
  > anova(fit.full)
              Df  Sum Sq Mean Sq F value   Pr(>F)
  type         1 1.48840 1.48840 13.4459 0.003225
  sulfur       1 0.54022 0.54022  4.8803 0.047354
  type:sulfur  1 0.00360 0.00360  0.0325 0.859897
  Residuals   12 1.32835 0.11070

  > fit.add <- lm(y ~ type+sulfur)
[Boxplots of the response by sulfur level (high/low), by type (A/B), and by sulfur.type combination (high.A, low.A, high.B, low.B).]
[Histogram and normal quantile-quantile plot of the fit.add residuals.]
[Diagram: Fields 1 through 4, showing the sulfur level (low/high) applied to each field and the arrangement of potato types A and B in rows within each field.]
[Histograms: null randomization distributions of the F-statistics for type (F.t.null) and for sulfur (F.s.null).]
What happened?

  The randomization distribution of F_type is well approximated by the F_{1,13} distribution, so p^{rand}_{type} ≈ p^{anova}_{type}.
  The randomization distribution of F_sulfur is not well approximated by the F_{1,13} distribution, so p^{rand}_{sulfur} ≠ p^{anova}_{sulfur}.
The F -distribution approximates a null randomization distribution if treatments are randomly assigned to units. But here, the sulfur treatment is
being assigned to groups of four units.
The precision of an effect is related to the number of independent
treatment assignments made
We have 16 assignments of type, but only 4 assignments of sulfur. It
is difficult to tell the difference between sulfur effects and field effects.
Our estimates of sulfur effects are less precise than those of type.
Note:
From the point of view of type alone, the design is a RCB.
Each whole-plot(field) is a block. We have 2 observations of each
type per block.
We compare MSType to the MSE from residuals left from a model
with type effects, block effects and possibly interaction terms.
Degrees of freedom breakdown for the sub-plot analysis:

  Source               dof
  block = whole plot     3
  type                   1
  subplot error         11
  subplot total         15
From the point of view of sulfur alone, the design is a CRD.
Each whole plot is an experimental unit.
We want to compare MSSulfur to the variation in whole plots,
which are fields.
Degrees of freedom breakdown for the whole-plot analysis:

  Source             dof
  sulfur               1
  whole plot error     2
  whole plot total     3
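Numerically, the correct whole-plot F-statistic for sulfur divides MS(sulfur) by the mean square among whole plots (a Python check, treating the mean squares from the field-stratum analysis of these data as given):

```python
# Whole-plot F-statistic for sulfur: MS(sulfur) over the mean square among
# whole plots (fields), using the values from the field-stratum analysis.
ms_sulfur = 0.54022
ms_field = 0.52813

f_sulfur = ms_sulfur / ms_field
print(round(f_sulfur, 4))   # about 1.0229, on (1, 2) degrees of freedom
```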
Thus there is strong evidence for type effects, and little evidence that the
effects of type vary among levels of sulfur.
  > F.sulfur
  [1] 1.022911
  > 1-pf(F.sulfur,1,2)
  [1] 0.4182903
This is more in line with the analysis using the randomization test.
The above calculations are somewhat tedious. In R there are several automagic ways of obtaining the correct F-test for this type of design. One way is with the aov command:

  > fit1 <- aov(y ~ type*sulfur + Error(factor(field)))
  > summary(fit1)

  Error: factor(field)
            Df  Sum Sq Mean Sq F value Pr(>F)
  sulfur     1 0.54022 0.54022  1.0229 0.4183
  Residuals  2 1.05625 0.52813

  Error: Within
              Df  Sum Sq Mean Sq F value    Pr(>F)
  type         1 1.48840 1.48840 54.7005 2.326e-05
  type:sulfur  1 0.00360 0.00360  0.1323    0.7236
  Residuals   10 0.27210 0.02721
  > fit2 <- aov(y ~ type + sulfur + Error(factor(field)))
  > summary(fit2)

  Error: factor(field)
            Df  Sum Sq Mean Sq F value Pr(>F)
  sulfur     1 0.54022 0.54022  1.0229 0.4183
  Residuals  2 1.05625 0.52813

  Error: Within
            Df  Sum Sq Mean Sq F value    Pr(>F)
  type       1 1.48840 1.48840  59.385 9.307e-06
  Residuals 11 0.27570 0.02506
7.1 Mixed-effects approach
What went wrong with the normal sampling model approach? What is wrong with the following model?

  y_{ijk} = μ + a_i + b_j + (ab)_{ij} + ε_{ijk}

where

  i indexes sulfur level, i ∈ {1, 2};
  j indexes type level, j ∈ {1, 2};
  k indexes reps, k ∈ {1, ..., 4};
  the ε_{ijk} are i.i.d. normal.

We checked the normality and constant variance assumptions for this model previously, and they seemed OK. What about independence? Figure 5.4 plots the residuals as a function of field. The figure indicates that residuals are more alike within a field than across fields, and so observations within a field are positively correlated. Statistical dependence of this sort is common to split-plot and other nested designs.
Dependence within whole-plots:

  affects the amount of information we have about factors applied at the whole-plot level: within a given plot, we can't tell the difference between plot effects and whole-plot factor effects;
  doesn't affect the amount of information we have about factors applied at the sub-plot level: we can tell the difference between plot effects and sub-plot factor effects.
If residuals within a whole-plot are positively correlated, the most intuitively straightforward way to analyze such data (in my opinion) is with a hierarchical mixed-effects model:

  y_{ijkl} = μ + a_i + b_j + (ab)_{ij} + γ_{ik} + ε_{ijkl}

where things are as before, except that the γ_{ik} are mean-zero random effects for the whole plots (fields).
[Plot: residuals versus field, with points labeled by sulfur level (l/h); the residuals cluster by field.]
This and more complicated random-effects models can be fit using the lme
command in R. To use this command, you need the nlme package:
  library(nlme)
  fit.me <- lme(fixed = y ~ type+sulfur, random = ~1|as.factor(field))

  > summary(fit.me)
  Fixed effects: y ~ type + sulfur
                 Value Std.Error DF   t-value p-value
  (Intercept)   4.8850 0.2599650 11 18.790991  0.0000
  typeB         0.6100 0.0791575 11  7.706153  0.0000
  sulfurlow     0.3675 0.3633602  2  1.011393  0.4183

  > anova(fit.me)
              numDF denDF  F-value p-value
  (Intercept)     1    11 759.2946  <.0001
  type            1    11  59.3848  <.0001
  sulfur          1     2   1.0229  0.4183
7.2
The size of each tree (roughly the volume) was measured at five time points: 152, 174, 201, 227 and 258 days after the beginning of the experiment.

[Plot: tree size (height) versus time (days 152-258) for each tree.]
  > fit <- lm(Sitka$size ~ Sitka$treat)
  > anova(fit)
               Df  Sum Sq Mean Sq F value  Pr(>F)
  Sitka$treat   1   3.810   3.810  6.0561 0.01429
  Residuals   393 247.222   0.629
[Diagnostic plots: histogram and normal quantile-quantile plot of fit$res, and residuals plotted against as.factor(Sitka$Time) and as.factor(Sitka$tree), for the ozone and control groups.]
Naive approach II: Clearly there is some effect of time. Let's now account for growth over time, using a simple ANCOVA:

  y_{i,j,t} = β_0 + a_i + b t + c_i t + ε_{i,j,t}

  for ozone (i = 1):   E[y_{1,j,t}] = (β_0 + a_1) + (b + c_1) t
  for control (i = 2): E[y_{2,j,t}] = (β_0 + a_2) + (b + c_2) t
  > fit <- lm(size ~ Time+treat+Time*treat, data=Sitka)
  > anova(fit)
              Df  Sum Sq Mean Sq  F value    Pr(>F)
  Time         1  89.564  89.564 222.9020 < 2.2e-16
  treat        1   3.810   3.810   9.4813  0.002222
  Time:treat   1   0.551   0.551   1.3703  0.242480
  Residuals  391 157.107   0.402
[Plots: fitted ANCOVA lines for tree size versus time, with a histogram, normal quantile-quantile plot, and residuals plotted against as.factor(Sitka$tree).]
We can then compare averages, intercepts and slopes across the two treatment groups. Note that, for each such summary, there is only one observation for each tree: we have eliminated the problem of dependent measurements.
[Plots: per-tree averages, intercepts, and slopes of size versus time, compared between the control and ozone treatment groups.]
Linear random effects models: The last approach is extremely conservative, as it basically reduces all the information we have from a tree to one number. Of course, the observations from a single tree are not completely dependent, and so compressing the data in this way throws away potentially valuable information. To make use of all the information from a tree, we can use a random effects model which accounts for correlation of observations common to a given tree:

  y_{i,j,t} = (a_1 + b_{1,i} + c_{1,j}) + (a_2 + b_{2,i} + c_{2,j}) t + ε_{i,j,t}

where

  (b_{1,i}, b_{2,i}), i = 1, 2, are fixed effects, measuring the heterogeneity of the average slope and intercept across the two levels of treatment;
  (c_{1,1}, c_{2,1}), ..., (c_{1,n}, c_{2,n}) ~ i.i.d. multivariate normal(0, Σ) are random effects, inducing a within-tree correlation of observations.
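The within-tree correlation that such a model induces can be seen in a small simulation (a Python sketch with a shared random intercept only; the variance components are made up):

```python
import random

random.seed(0)
sigma_tree, sigma_eps = 1.0, 0.5   # hypothetical variance components
n_trees, n_times = 200, 2

# Two measurements per tree, sharing a tree-level random effect c.
pairs = []
for _ in range(n_trees):
    c = random.gauss(0, sigma_tree)
    pairs.append([c + random.gauss(0, sigma_eps) for _ in range(n_times)])

# Sample correlation between the two measurements on the same tree.
m1 = sum(p[0] for p in pairs) / n_trees
m2 = sum(p[1] for p in pairs) / n_trees
cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / n_trees
v1 = sum((p[0] - m1) ** 2 for p in pairs) / n_trees
v2 = sum((p[1] - m2) ** 2 for p in pairs) / n_trees
corr = cov / (v1 * v2) ** 0.5

# Theoretical intraclass correlation: 1.0 / (1.0 + 0.25) = 0.8.
print(round(corr, 2))
```

The sample correlation is close to the theoretical intraclass correlation, illustrating how a shared random effect makes observations from the same tree positively correlated.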