Statistical Modelling
Vasiliki Koutra1
Department of Mathematics
King’s College London
January 2020
Chapter 1

Introduction to Statistical Modelling
Think back to the Probability & Statistics I and II modules. There were statements
like: “Y1 , Y2 ,. . .,Yn are independent and identically distributed normal random
variables with mean µ and variance σ 2 ”. Another way of writing this is
Yi = µ + εi,   i = 1, 2, . . . , n,   where εi ~iid N(0, σ²).
This statistical model has two components, a part which tells us about the expec-
tation of Y , which is constant, and a random part.
In this course we are interested in models where the mean depends on values
of other variables. In the simplest case, we have a response variable Y and one
explanatory variable X. Then the expectation of Y depends on the value of X,
say xi , and we may write
Yi = µi + εi,   i = 1, 2, . . . , n,   where the mean µi depends on xi.
In practice, we start with a real life problem for which we have some data. We
think of a statistical model as a mathematical representation of the variables we
Table 1.1: Data on batch size and time to make each batch
have measured. This model usually involves some parameters. We may then
try to estimate the values of these parameters or to test hypotheses about them.
We may wish to use the model to predict what would happen in the future in a
similar situation. In order to test hypotheses or to make predictions we usually
have to make some assumptions. Part of the modelling process is to test these
assumptions. Having found an adequate model we must compare its predictions
with reality to check that it gives reasonable answers.
We begin by plotting the data to see what sort of relationship might hold.
From this plot, Figure 1.1, it seems that a straight line relationship is a good repre-
sentation of the data although it is not an exact relationship. We can fit this model
and obtain the fitted line plot, Figure 1.2.
The fitted line is yb = 10 + 2x. One interpretation of this is that on average it takes
10 hours to set up the machinery to make widgets and then, on average, it takes 2
extra man-hours to produce a batch increased in size by one widget.
But before we come to this conclusion we should check that our data satisfy the
assumptions of the statistical model. One way to do this is to look at residual plots,
as in Figure 1.3. We shall discuss these later in the course and in the practicals, but
here we see that there is no apparent reason to doubt our model. In fact, for small
data sets histograms do not represent the distribution well. It is better to examine
the Normal Probability Plot.
Statistical modelling is iterative. We think of a model we believe will fit the data.
We fit it and then check the model. If it is ok we use the model to explain what is
happening or to predict what may happen. Note that we should be very wary of
making predictions far outside of the x values which are used to fit the model.
In Time Series Analysis, we relax the assumption that errors are uncorrelated.
We start with the simplest situation where we have one response variable Y and
one explanatory variable X.
xi [days] yi [cm]
x1 = 3 y1 = 1.4
x2 = 3 y2 = 1.5
x3 = 5 y3 = 2.2
x4 = 6 y4 = 2.4
x5 = 8 y5 = 2.8
x6 = 8 y6 = 3.2
x7 = 10 y7 = 3.2
x8 = 11 y8 = 3.9
x9 = 12 y9 = 4.1
x10 = 13 y10 = 4.7
x11 = 14 y11 = 4.5
x12 = 15 y12 = 5.2
x13 = 16 y13 = 5.0
Readings of wing length may vary for different birds of the same age. Time, X, is known and we are not interested in modelling it. We condition on time and assume that Y is random, so that repeated observations of Y for the same value of X may vary.
A useful initial stage of modelling is to plot the data. Figure 1.4 shows the plot of
the sparrow wing’s length against sparrow’s age.
The plot suggests that the wing length and age might be linearly related, although
we would not expect the wing’s length to keep increasing linearly over a long
Figure 1.4: Plot of the length of sparrow wings against age of sparrows.
period of time. In this example the linear relationship can be considered for some
short growth time only.
Other types of function could also describe the relationship well, for example a
quadratic polynomial with a very small second order coefficient. However, it is
better to use the simplest model which describes the relationship well. This is
called the principle of parsimony.
It means to represent well the expected shape and also the variability of the re-
sponse Y at each value of the explanatory variable X. We will be working on this
problem throughout the course.
We can write
Yi = β0 + β1 xi + εi , where i = 1, 2, . . . , n.
The errors are often called departures from the mean. The error εi is a random
variable, hence Yi is a random variable too and the assumptions can be rewritten
as

E(Yi) = β0 + β1 xi,   var(Yi) = σ²,   cov(Yi, Yj) = 0 for i ≠ j.
This means that the dependence of Y on X is linear and the variance of the re-
sponse Y at each value of X is constant (does not depend on xi ) and Y |X = xi
and Y |X = xj are uncorrelated.
Figure 1.5: Model assumptions about the randomness of observations: the expected responses E(Yi) and E(Yj) at X = xi and X = xj.
• Yi ~ind N(µi, σ²), where µi = β0 + β1 xi, i = 1, 2, . . . , n,

• Yi ~ind N(β0 + β1 xi, σ²), i = 1, 2, . . . , n,

• Yi = β0 + β1 xi + εi, where εi ~iid N(0, σ²), i = 1, 2, . . . , n.
Y1 = β0 + β1 x1 + ε1
Y2 = β0 + β1 x2 + ε2
⋮
Yn = β0 + β1 xn + εn

can be written in matrix notation as

Y = Xβ + ε,    (1.2)

where ε ~ Nn(0, σ²I).
Constant model (p = 1)

Yi = β0 + εi for i = 1, . . . , n

is equivalent to

Y = 1β0 + ε,

where 1 is an (n × 1) vector of 1's.
No-intercept model (p = 1)

Yi = β1 xi + εi for i = 1, . . . , n.
Quadratic regression (p = 3)
Yi = β0 + β1 xi + β2 x2i + εi for i = 1, . . . , n
can be written in matrix notation with

X = [ 1   x1   x1²
      1   x2   x2²
      ⋮    ⋮     ⋮
      1   xn   xn² ],       β = (β0, β1, β2)ᵀ.
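To make the matrix notation concrete, here is a minimal numpy sketch (with made-up x values, not taken from the notes) of the design matrices for the constant, no-intercept, straight-line and quadratic models.

import numpy as np

x = np.array([3., 5., 6., 8.])          # hypothetical x values

X_const     = np.ones((len(x), 1))                          # Y_i = beta0 + eps_i
X_nointcpt  = x.reshape(-1, 1)                              # Y_i = beta1 x_i + eps_i
X_line      = np.column_stack([np.ones_like(x), x])         # beta0 + beta1 x_i
X_quadratic = np.column_stack([np.ones_like(x), x, x**2])   # + beta2 x_i^2

print(X_quadratic)   # rows are (1, x_i, x_i^2) as in the display above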
A fitted linear regression model always leaves some residual variation. There
might be another systematic cause for the variability in the observations yi . If we
have data on other explanatory variables we can ask whether they can be used to
explain some of the residual variation in Y . If this is the case, we should take it
into account in the model, so that the errors are purely random. We could write
Yi = β0 + β1 xi + β2 zi + ε*i,

where the term β2 zi + ε*i is what was previously the error εi.
A Multiple Linear Regression (MLR) model for a response variable Y and ex-
planatory variables X1 , X2 , . . . , Xp−1 is
E(Y | X1 = x1,i, . . . , Xp−1 = xp−1,i) = β0 + β1 x1,i + · · · + βp−1 xp−1,i,
var(Y | X1 = x1,i, . . . , Xp−1 = xp−1,i) = σ²,   i = 1, . . . , n,
cov(Y | X1 = x1,i, . . . , Xp−1 = xp−1,i, Y | X1 = x1,j, . . . , Xp−1 = xp−1,j) = 0,   i ≠ j,
or
εi ~ind N(0, σ²).
To simplify the notation we write the MLR model in a matrix form
Y = Xβ + ε, (1.3)
that is,
[ Y1 ]   [ 1  x1,1  · · ·  xp−1,1 ] [ β0   ]   [ ε1 ]
[ Y2 ] = [ 1  x1,2  · · ·  xp−1,2 ] [ β1   ] + [ ε2 ]
[  ⋮ ]   [ ⋮    ⋮             ⋮   ] [  ⋮   ]   [  ⋮ ]
[ Yn ]   [ 1  x1,n  · · ·  xp−1,n ] [ βp−1 ]   [ εn ]
  =: Y            =: X                 =: β       =: ε
Here Y is the vector of responses, X is often called the design matrix, β is the
vector of unknown, constant parameters and ε is the vector of random errors.
Vectors y and ε in equation (1.3) are random vectors as their elements are random
variables. Below we show some properties of random vectors.
Definition 1.1. The expected value of a random vector is the vector of the respec-
tive expected values. That is, for a random vector z = (z1 , . . . , zn )T we write
E(z) = E[ (z1, z2, . . . , zn)ᵀ ] = ( E(z1), E(z2), . . . , E(zn) )ᵀ.    (1.4)
We have analogous properties of the expectation for random vectors as for single
random variables. Namely, for a random vector z, a constant scalar a, a constant
vector b and for matrices of constants A and B we have
E(az + b) = a E(z) + b,
E(Az) = A E(z),    (1.5)
E(zᵀB) = E(z)ᵀB.
Variances and covariances of the random variables zi are put together to form the
so called variance-covariance (dispersion) matrix,
Var(z) = [ var(z1)       cov(z1, z2)   · · ·   cov(z1, zn)
           cov(z2, z1)   var(z2)       · · ·   cov(z2, zn)
               ⋮              ⋮                      ⋮
           cov(zn, z1)   cov(zn, z2)   · · ·   var(zn)     ].    (1.6)
(b) For mutually uncorrelated random variables the matrix is diagonal, since cov(zi, zj) = 0 for all i ≠ j.
(c) For a random vector u = Az, where A is a matrix of constants, Var(u) = A Var(z)Aᵀ.
Note that property (c) gives the expression for the dispersion matrix of a random vector analogous to the expression for the variance of a single rv, that is, var(az) = a² var(z) for a scalar rv z and constant a.
A random vector z has a multivariate normal distribution if its p.d.f. can be written
as
f(z) = 1 / { (2π)^{n/2} det(V)^{1/2} } · exp{ −(1/2)(z − µ)ᵀV⁻¹(z − µ) },    (1.8)
where µ is the mean and V is the variance-covariance matrix of z.
In the model
Y = Xβ + ε
we assume the following properties of the random errors:
1. E[ε] = ( E(ε1), . . . , E(εn) )ᵀ = (0, . . . , 0)ᵀ = 0n,

2. Var[ε] = diag(σ², . . . , σ²) = σ²In,

and, with the normality assumption,

ε ~ Nn(0n, σ²In),

so that

Y ~ Nn(Xβ, σ²In).
Chapter 2
Estimation
The LS estimators of the model parameters β0 and β1 minimize the sum of squares
of errors denoted by S(β0 , β1 ). That is, the estimators minimize
S(β0, β1) = Σ_{i=1}^n εi² = Σ_{i=1}^n [Yi − (β0 + β1 xi)]².    (2.1)
The “best” here means the smallest value of S(β0 , β1 ). S is a function of the
parameters and so to find its minimum we differentiate it with respect to β0 and
β1 , then equate the derivatives to zero. We have
∂S/∂β0 = −2 Σ_{i=1}^n [Yi − (β0 + β1 xi)],
∂S/∂β1 = −2 Σ_{i=1}^n [Yi − (β0 + β1 xi)] xi.    (2.2)
To check that S(β0 , β1 ) attains a minimum at (βb0 , βb1 ) we calculate second deriva-
tives and evaluate the determinant
| ∂²S/∂β0²     ∂²S/∂β0∂β1 |     | 2n      2Σxi  |
| ∂²S/∂β1∂β0   ∂²S/∂β1²   |  =  | 2Σxi    2Σxi² |  =  4n Σ_{i=1}^n (xi − x̄)² > 0
for all β0 , β1 (it does not depend on the values of the parameters).
Also, ∂²S/∂β0² > 0 (and ∂²S/∂β1² > 0) for all β0, β1. This means that the function S(β0, β1) attains a minimum at (β̂0, β̂1) given by (2.5) and (2.6).
Remark 2.1. Note that the estimators depend on Y . They are functions of Y
which is a random variable and so the estimators of the model parameters are
random variables too. When we calculate the values of the estimators for a given
data set, i.e. for observed values of Y at given values of X, we obtain estimates
of the parameters. We may obtain different estimates of β0 and β1 calculated for
different data sets fitted by the same kind of model.
Example 2.1. (Wing’s length cont.)
For the data in Example 1.2 we obtain

Σ_{i=1}^{13} yi = 44.1,    Σ_{i=1}^{13} xi = 124,
Σ_{i=1}^{13} xi yi = 488.3,    Σ_{i=1}^{13} xi² = 1418.
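The following hedged Python sketch turns these summary sums into the least squares estimates β̂1 = SxY/Sxx and β̂0 = Ȳ − β̂1 x̄; the variable names are illustrative only.

# Least squares estimates from the summary sums quoted above (n = 13).
n, sum_y, sum_x = 13, 44.1, 124.0
sum_xy, sum_x2 = 488.3, 1418.0

xbar, ybar = sum_x / n, sum_y / n
Sxx = sum_x2 - n * xbar**2          # = sum (x_i - xbar)^2
Sxy = sum_xy - n * xbar * ybar      # = sum (x_i - xbar)(y_i - ybar)

beta1_hat = Sxy / Sxx               # slope estimate
beta0_hat = ybar - beta1_hat * xbar # intercept estimate
print(beta0_hat, beta1_hat)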
Two special cases of the simple linear regression model are the following.

• no-intercept model

Yi = β1 xi + εi,

which implies that E(Y | X = 0) = 0, and
• constant model
Yi = β0 + εi ,
which implies that the response variable Y does not depend on the explana-
tory variable X.
Hence,

β̂1 ~ N(β1, σ²/Sxx).
The normal equations obtained in the least squares method are given by
XᵀY = XᵀXβ̂,
and

XᵀX = [ n      Σ xi  ]   [ n     nx̄    ]
      [ Σ xi   Σ xi² ] = [ nx̄    Σ xi² ].

The determinant of XᵀX is given by

|XᵀX| = n Σ xi² − (nx̄)² = n( Σ xi² − nx̄² ) = n Sxx.

Hence

β̂ = (XᵀX)⁻¹XᵀY
  = (1/Sxx) [ (1/n) Σ xi²   −x̄ ] [ nȲ      ]
            [ −x̄             1 ] [ Σ xi Yi ]
  = (1/Sxx) [ Ȳ Σ xi² − x̄ Σ xi Yi ]
            [ Σ xi Yi − nx̄Ȳ       ]
  = (1/Sxx) [ Ȳ( Σ xi² − nx̄² ) + nx̄²Ȳ − x̄ Σ xi Yi ]
            [ SxY                                    ]
  = (1/Sxx) [ Ȳ( Σ xi² − nx̄² ) − x̄( Σ xi Yi − nx̄Ȳ ) ]
            [ SxY                                      ]
  = (1/Sxx) [ Ȳ Sxx − x̄ SxY ]
            [ SxY            ]
  = [ Ȳ − β̂1 x̄ ]
    [ β̂1        ].
Note:
Let A and B be a vector and a matrix of real constants and let Z be a vector of
random variables, all of appropriate dimensions so that the addition and multipli-
cation are possible. Then

E(A + BZ) = A + B E(Z),    Var(A + BZ) = B Var(Z)Bᵀ.

In particular,
E(Y ) = E(Xβ + ε) = Xβ
Var(Y ) = Var(Xβ + ε) = Var(ε) = σ 2 I.
These equalities let us prove the following theorem.
Var(β̂) = Var{(XᵀX)⁻¹XᵀY}
        = (XᵀX)⁻¹Xᵀ Var(Y) X(XᵀX)⁻¹
        = σ²(XᵀX)⁻¹XᵀIX(XᵀX)⁻¹ = σ²(XᵀX)⁻¹.
1. Constant model

Y = Xβ0 + ε,   with X = 1, the (n × 1) vector of 1's, so that XᵀX = n and XᵀY = Σ Yi. Hence

β̂ = (XᵀX)⁻¹XᵀY = (1/n) Σ Yi = Ȳ = β̂0,

var[β̂] = (XᵀX)⁻¹σ² = σ²/n.
2. No-intercept model

We saw that this example fits the General Linear Model with

X = (x1, x2, . . . , xn)ᵀ,    β = β1.

So XᵀX = Σ xi² and XᵀY = Σ xi Yi, and we can calculate

β̂ = (XᵀX)⁻¹XᵀY = Σ xi Yi / Σ xi² = β̂1,

Var[β̂] = σ²(XᵀX)⁻¹ = σ² / Σ xi².
To derive the least squares estimator (LSE) for the parameter vector β we minimise the sum of squares of the errors, that is

S(β) = Σ_{i=1}^n [Yi − (β0 + β1 x1,i + · · · + βp−1 xp−1,i)]²
     = Σ εi²
     = εᵀε
     = (Y − Xβ)ᵀ(Y − Xβ)
     = (Yᵀ − βᵀXᵀ)(Y − Xβ)
     = YᵀY − YᵀXβ − βᵀXᵀY + βᵀXᵀXβ
     = YᵀY − 2βᵀXᵀY + βᵀXᵀXβ.
For β̂ = (XᵀX)⁻¹XᵀY, which satisfies the normal equations XᵀXβ̂ = XᵀY, we have

S(β) − S(β̂)
= YᵀY − 2βᵀXᵀY + βᵀXᵀXβ − YᵀY + 2β̂ᵀXᵀY − β̂ᵀXᵀXβ̂
= −2βᵀXᵀXβ̂ + βᵀXᵀXβ + 2β̂ᵀXᵀXβ̂ − β̂ᵀXᵀXβ̂
= βᵀXᵀXβ − 2βᵀXᵀXβ̂ + β̂ᵀXᵀXβ̂
= βᵀXᵀXβ − βᵀXᵀXβ̂ − βᵀXᵀXβ̂ + β̂ᵀXᵀXβ̂
= βᵀXᵀXβ − βᵀXᵀXβ̂ − β̂ᵀXᵀXβ + β̂ᵀXᵀXβ̂
= βᵀ(XᵀXβ − XᵀXβ̂) − β̂ᵀ(XᵀXβ − XᵀXβ̂)
= (βᵀ − β̂ᵀ)(XᵀXβ − XᵀXβ̂)
= (β − β̂)ᵀXᵀX(β − β̂) ≥ 0,

so S(β) attains its minimum at β = β̂.
Note that, as we did for the SLM in Chapter 2, it is possible to obtain this result
by differentiating S(β) with respect to β and setting it equal to 0.
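As an illustration of the general result, here is a short numpy sketch (with simulated data, used purely as an assumption for the example) that computes β̂ = (XᵀX)⁻¹XᵀY both from the normal equations and with numpy's built-in least squares solver.

import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # solves X'X beta = X'Y
beta_hat_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat, beta_hat_ls)                       # the two should agree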
In this section we show an alternative approach to justify the least squares estima-
tor of β.
[Diagram: the data vector y = OP, the model space C containing vectors of the form Xβ such as OR, and the orthogonal projection OQ of P onto C.]
The data y is represented by the vector OP. The vector OR = Xβ is a typical vector in C. We are trying to minimise S = |RP|². To minimise the distance from R to P, we take R = Q such that the angle OQP is a right angle, i.e. we take the orthogonal projection of P onto C. Thus the vector QP = y − Xβ̂ is orthogonal to every vector in C. Hence for any value of β′
(Xβ′)ᵀ(y − Xβ̂) = 0,
β′ᵀXᵀ(y − Xβ̂) = 0,

so that

Xᵀ(y − Xβ̂) = 0,   i.e.   Xᵀy = XᵀXβ̂.
This is a system of p equations in p unknowns, which are called the normal equa-
tions.
Proof.

E[β̂] = E[(XᵀX)⁻¹XᵀY]
     = (XᵀX)⁻¹Xᵀ E[Y]
     = (XᵀX)⁻¹XᵀXβ
     = β.
Theorem 2.4. Var[β̂] = σ²(XᵀX)⁻¹.
An alternative proof is as follows. First note that Var[Y] = E[YYᵀ] − E[Y] E[Yᵀ] and hence

E[YYᵀ] = Var[Y] + E[Y] E[Yᵀ] = σ²I + XββᵀXᵀ.
Now

Var[β̂] = E[β̂β̂ᵀ] − E[β̂] E[β̂ᵀ]
       = E[(XᵀX)⁻¹XᵀYYᵀX(XᵀX)⁻¹] − ββᵀ
       = (XᵀX)⁻¹Xᵀ E[YYᵀ] X(XᵀX)⁻¹ − ββᵀ
       = (XᵀX)⁻¹Xᵀ(σ²I + XββᵀXᵀ)X(XᵀX)⁻¹ − ββᵀ
       = σ²(XᵀX)⁻¹XᵀX(XᵀX)⁻¹ + (XᵀX)⁻¹XᵀXββᵀXᵀX(XᵀX)⁻¹ − ββᵀ
       = σ²(XᵀX)⁻¹ + ββᵀ − ββᵀ
       = σ²(XᵀX)⁻¹.
Theorem 2.5. If
Y = Xβ + ε, ε ∼ Nn (0, σ 2 I),
then
βb ∼ N p (β, σ 2 (X T X)−1 ).
The expectation and variance-covariance matrix can be shown in the same way as
in Theorem 2.7.
A strong justification for the use of least squares estimation in linear models is
provided by the following famous theorem.
Y = Xβ + ε,
Note: We call such an estimator the Best Linear Unbiased Estimator (BLUE).
It is the estimator that, among all unbiased estimators of the form cᵀY, has the smallest variance.
Proof. lᵀβ̂ is a linear combination of the random sample Y,

lᵀβ̂ = lᵀ(XᵀX)⁻¹XᵀY.

For any other linear unbiased estimator cᵀY of lᵀβ,

E(cᵀY) = cᵀ E(Y) = cᵀXβ = lᵀβ,
var(cᵀY) = cᵀ Var(Y) c = σ²cᵀIc = σ²cᵀc.

Also,

var(lᵀβ̂) = lᵀ Var(β̂) l = σ²lᵀ(XᵀX)⁻¹l
         = σ²cᵀX(XᵀX)⁻¹Xᵀc = σ²cᵀHc.

Then

var(cᵀY) − var(lᵀβ̂) = σ²(cᵀc − cᵀHc)
                     = σ²cᵀ(I − H)c
                     = σ²cᵀ(I − H)ᵀ(I − H)c      (I − H is symmetric and idempotent)
                     = σ²ZᵀZ ≥ 0,   where Z = (I − H)c.
Chapter 3

Inference
Yi = β0 + β1 xi + εi
can be used to estimate the mean response corresponding to each variable Yi, that is,

Ê(Yi) = Ŷi = β̂0 + β̂1 xi,   i = 1, . . . , n.
These, for a given data set (xi , yi ), are called fitted values and are denoted by ybi .
They are points on the fitted regression line corresponding to the values of xi .
The observed values yi usually do not fall exactly on the line and so are not equal
to the fitted values ybi , as shown in Figure 3.1.
ei := Yi − Ŷi,   i = 1, . . . , n.    (3.1)

Thus

ei = Yi − (β̂0 + β̂1 xi)
   = Yi − Ȳ − β̂1(xi − x̄)
Figure 3.1: Observations and fitted line for the sparrow wing length data.
and Σ_{i=1}^n ei = 0.
Also note that the estimators βb0 and βb1 minimize the function S(β0 , β1 ). The
minimum is called the Residual Sum of Squares and is denoted by SSE , that is,
SSE = Σ_{i=1}^n [Yi − (β̂0 + β̂1 xi)]² = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n ei².    (3.2)
For the constant (null) model, Yi = β0 + εi, the fitted values and residuals are

Ŷi = Ȳ,   ei = Yi − Ŷi = Yi − Ȳ,

and

SSE = SST = Σ_{i=1}^n (Yi − Ȳ)².
It is called the Total Sum of Squares and is denoted by SST . For a constant
model SSE = SST . When the model is non constant, i.e. it includes a slope, the
Figure 3.2: Observations, fitted line and the mean for a constant model.
difference Yi − Ȳ can be split into two components: one due to the regression
model fit and one due to the residuals, that is

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi).

The following theorem gives such an identity for the respective sums of squares:

SST = SSR + SSE,

where

SST = Σ_{i=1}^n (Yi − Ȳ)²,
SSR = Σ_{i=1}^n (Ŷi − Ȳ)²,
SSE = Σ_{i=1}^n (Yi − Ŷi)².
Proof.

SST = Σ_{i=1}^n (Yi − Ȳ)² = Σ_{i=1}^n [(Yi − Ŷi) + (Ŷi − Ȳ)]²
    = Σ_{i=1}^n [(Yi − Ŷi)² + (Ŷi − Ȳ)² + 2(Yi − Ŷi)(Ŷi − Ȳ)]
    = SSE + SSR + 2A,

where

A = Σ_{i=1}^n (Yi − Ŷi)(Ŷi − Ȳ)
  = Σ_{i=1}^n (Yi − Ŷi)Ŷi − Ȳ Σ_{i=1}^n (Yi − Ŷi)
  = Σ_{i=1}^n ei Ŷi − Ȳ Σ_{i=1}^n ei        (the second sum is 0)
  = Σ_{i=1}^n ei (β̂0 + β̂1 xi)
  = β̂0 Σ_{i=1}^n ei + β̂1 Σ_{i=1}^n ei xi    (both sums are 0).

Hence A = 0.
For a given data set the model fit (regression) sum of squares, SSR, represents the variability in the observations yi accounted for by the fitted model, while the residual sum of squares, SSE, represents the variability in yi accounted for by the differences between the observations and the fitted values.
The Analysis of Variance (ANOVA) Table shows the sources of variation, the
sums of squares and the statistic, based on the sums of squares, for testing the
significance of regression slope.
ANOVA table

Source of variation   d.f.          SS      MS                 VR
Regression            νR = 1        SSR     MSR = SSR/νR       MSR/MSE
Residual              νE = n − 2    SSE     MSE = SSE/νE
Total                 νT = n − 1    SST
For SST = Σ_{i=1}^n (yi − ȳ)² we have n − 1 freely varying pieces of information from the collected data, that is we have n − 1 degrees of freedom. The one degree of freedom is taken up by ȳ.
Similarly, for

SSE = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)²

we have two degrees of freedom taken up: one by β̂0 and one by β̂1 (both depend on y1, y2, . . . , yn). Hence, there are n − 2 independent pieces of information to calculate SSE.
Finally, as SSR = SST − SSE we can calculate the d.f. for SSR as a difference
between d.f. for SST and for SSE , that is νR = (n − 1) − (n − 2) = 1.
In the ANOVA table there are also included so called Mean Squares (MS), which
can be thought of as measures of average variation.
The last column of the table contains the Variance Ratio (VR),

VR = MSR / MSE.

It measures the variation explained by the model fit relative to the variation due to residuals.
3.1.2 F test
The mean squares are functions of the random variables Yi and so is their ratio.
We denote it by F. We will see later that, if β1 = 0, then

F = MSR / MSE ~ F1,n−2.
Thus, to test the null hypothesis

H0 : β1 = 0

we reject H0 at significance level α if Fcal > Fα;1,n−2, where Fcal denotes the value of the variance ratio F calculated for a given data set and Fα;1,n−2 is such that

P(F > Fα;1,n−2) = α.

Rejecting H0 means that the slope β1 ≠ 0 and the full regression model

Yi = β0 + β1 xi + εi

fits the data significantly better than the constant model

Yi = β0 + εi.
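A minimal Python sketch of the ANOVA decomposition and the F test for a simple linear regression, using invented data rather than the notes' examples; scipy is assumed available for the F distribution.

import numpy as np
from scipy import stats

x = np.array([3., 5., 6., 8., 10., 12., 14.])
y = np.array([1.2, 2.0, 2.3, 3.1, 3.8, 4.4, 5.1])
n = len(y)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ beta_hat

SST = np.sum((y - y.mean())**2)
SSE = np.sum((y - y_fit)**2)
SSR = SST - SSE

MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE
p_value = stats.f.sf(F, 1, n - 2)   # P(F_{1,n-2} > F)
print(F, p_value)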
3.1.3 Estimating σ 2
Note that the sums of squares are functions of the conditional random variables
Yi = (Y |X = xi ). Hence, the sums of squares are random variables as well. This
fact allows us to check some stochastic properties of the sums of squares, such as
their expectation, variance and distribution.
It can be shown that

E(SSE) = (n − 2)σ².

Notice that in the full model S² is not the sample variance. We have

S² = MSE = (1/(n − 2)) Σ_{i=1}^n (Yi − Ê(Yi))²,   where Ê(Yi) = β̂0 + β̂1 xi.

It is the sample variance in the constant (null) model, where Ê(Yi) = β̂0 = Ȳ and νE = n − 1. Then

S² = (1/(n − 1)) Σ_{i=1}^n (Yi − Ȳ)².
3.1.4 Example
Analysis of Variance
Source DF SS MS F P
Regression 1 19.446 19.446 442.60 0.000
Residual Error 11 0.483 0.044
Total 12 19.929
Comments:
We fitted a simple linear model of the form

Yi = β0 + β1 xi + εi,   i = 1, . . . , 13,   εi ~iid N(0, σ²).
The ANOVA table also shows the significance of the regression (slope), that is the
null hypothesis
H0 : β1 = 0
versus the alternative
H1 : β1 6= 0
can be rejected at significance level α < 0.001 (p ≈ 0.000).
The tests require the assumptions of the normality and of constant variance of
random errors. It should be checked whether the assumptions are approximately
met. If not, the tests may not be valid.
The graph shows that the observations lie along the fitted line and there are no
strange points which are far from the line or which could strongly affect the slope.
Final conclusions:
We can conclude that the data indicate that the length of sparrows’ wings depends
linearly on their age (within the range 3 - 18 days). The mean increase in the
wing’s length per day is estimated as βb1 ∼
= 0.26 cm.
However, it might be wrong to predict the length or its increase per day outside
the range of the observed time. We would expect that the growth slows down in
time and so the relationship becomes non-linear.
Analysis of Variance
Source DF SS MS F P
Regression 1 63517077 63517077 23.46 0.000
Residual Error 14 37912232 2708017
Total 15 101429309
Unusual Observations
Obs labour Ovhd Fit SE Fit Residual St Resid
6 1067 24817 28028 413 -3211 -2.02R
Comments:
The model allows us to estimate the total overhead cost as a function of labour
hours, but as we noticed, there is large variability in the data. In such a case,
the point estimates may not be very reliable. Anyway, point estimates should
always be accompanied by their standard errors. Then we can also find confidence
intervals (CI) for the unknown model parameters, or test their significance.
β̂0 = Ȳ − β̂1 x̄,

β̂1 = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^n (xi − x̄)².
We now derive results which allow us to make inference about the regression
parameters and predictions.
Theorem 3.3. In the full simple linear regression model (SLRM) the distribution of the LSE of β1, β̂1, is normal with expectation E(β̂1) = β1 and variance var(β̂1) = σ²/Sxx, that is,

β̂1 ~ N(β1, σ²/Sxx).    (3.5)
Theorem 3.3 allows us to derive a confidence interval (CI) for β1 and a test of significance for β1. After standardisation of β̂1 we obtain

(β̂1 − β1) / (σ/√Sxx) ~ N(0, 1).
However, the error variance is usually not known and it is replaced by its estimator. Then the normal distribution changes to a Student t-distribution. The explanation is as follows.
Lemma 3.1. If Z ~ N(0, 1) and U ~ χ²ν, and Z and U are independent, then

Z / √(U/ν) ~ tν.
Here we have

Z = (β̂1 − β1) / (σ/√Sxx) ~ N(0, 1).

We will see later that

U = (n − 2)S² / σ² ~ χ²n−2

and that S² and β̂1 are independent. It follows that

T = [ (β̂1 − β1)/(σ/√Sxx) ] / √[ (n − 2)S² / {σ²(n − 2)} ] = (β̂1 − β1) / (S/√Sxx) ~ tn−2.    (3.6)
The calculated values of β̂1, S and Sxx for the overhead costs (Example 3.2) are the following:

β̂1 = 10.982,   S = 1645.61,   Sxx = 526656.9.

Also t0.025,14 = 2.14479. Hence, the 95% CI for β1 is

[a, b] = [ 10.982 − 2.14479 × 1645.61/√526656.9,  10.982 + 2.14479 × 1645.61/√526656.9 ]
       = [6.11851, 15.8455].
We would expect (with 95% confidence) that a one-hour increase in labour will increase the cost by between £6.12 and £15.85.
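The interval can be reproduced directly from the quantities quoted above; a hedged scipy/numpy sketch:

import numpy as np
from scipy import stats

beta1_hat, S, Sxx, nu = 10.982, 1645.61, 526656.9, 14
t_crit = stats.t.ppf(0.975, nu)            # ~ 2.14479
se = S / np.sqrt(Sxx)                      # estimated standard error of beta1_hat
ci = (beta1_hat - t_crit * se, beta1_hat + t_crit * se)
print(ci)                                  # roughly (6.12, 15.85)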
Test of H0 : β1 = 0 versus H1 : β1 ≠ 0
The null hypothesis H0 : β1 = 0 means that the slope is zero and a better model
is a constant model
Yi = β0 + εi,   εi ~iid N(0, σ²).
This statistic can be used as a test function for the null hypothesis. We reject H0 at significance level α when the value of the test function calculated for a given data set, Tcal, is in the rejection region, that is, when |Tcal| > tα/2,n−2.
Many statistical software packages give the p-value when testing a hypothesis.
When the p-value is smaller than α then we may reject the null hypothesis at a
significance level ≤ α.
Remark 3.2. The square root of the variance var(β̂1) is called the standard error of β̂1 and is denoted by se(β̂1), that is,

se(β̂1) = √(σ²/Sxx).

Its estimator is

ŝe(β̂1) = √(S²/Sxx).
Often this estimated standard error is called the standard error. You should be
aware of the difference between the two.
Remark 3.3. Note that the (1 − α)100% CI for β1 can be written as

( β̂1 − tα/2,n−2 ŝe(β̂1),  β̂1 + tα/2,n−2 ŝe(β̂1) ).
As we have noted before we can also test the hypothesis H0 : β1 = 0 using the
Analysis of Variance table and the F test. In this case the two tests are equivalent
since if the random variable W ∼ tν then W 2 ∼ F1,ν .
μ̂0 = Ê(Y | X = x0) = β̂0 + β̂1 x0.
As for the LSEs of β0 and β1, we have the following theorem.
Theorem 3.4. In the full SLRM the distribution of the LSE of µ0, μ̂0, is normal with expectation E(μ̂0) = µ0 and variance var(μ̂0) = σ²( 1/n + (x0 − x̄)²/Sxx ), that is,

μ̂0 ~ N( µ0, σ²[ 1/n + (x0 − x̄)²/Sxx ] ).    (3.11)
Corollary 3.1. In the full simple linear regression model, a (1 − α)100% CI for µ0 is

( μ̂0 − tα/2,n−2 ŝe(μ̂0),  μ̂0 + tα/2,n−2 ŝe(μ̂0) ),

where

ŝe(μ̂0) = √( S²[ 1/n + (x0 − x̄)²/Sxx ] ).
Remark 3.4. Care is needed when estimating the mean at x0 . It should only be
done if x0 is within the data range. Extrapolation beyond the range of the given
x-values is not reliable, as there is no evidence that a linear relationship is appro-
priate there.
Note that

Hᵀ = H

and also

HH = X(XᵀX)⁻¹XᵀX(XᵀX)⁻¹Xᵀ = X(XᵀX)⁻¹Xᵀ = H.
A matrix, which satisfies the condition AA = A is called an idempotent matrix.
Note that if A is idempotent, then (I − A) is also idempotent.
Lemma 3.3. Var(e) = σ²(I − H).

Proof.

Var(e) = (I − H) Var(Y) (I − H)ᵀ
       = (I − H) σ²I (I − H)
       = σ²(I − H).
Lemma 3.4. The sum of squares of the residuals is Yᵀ(I − H)Y.

Proof.

Σ_{i=1}^n ei² = eᵀe = Yᵀ(I − H)ᵀ(I − H)Y = Yᵀ(I − H)Y.
Lemma 3.5. The elements of the residual vector e sum to zero, i.e.

Σ_{i=1}^n ei = 0.
Corollary 3.2.

(1/n) Σ_{i=1}^n Ŷi = Ȳ.

Proof. The residuals are ei = Yi − Ŷi, so Σ ei = Σ (Yi − Ŷi), but Σ ei = 0. Hence Σ Ŷi = Σ Yi and the result follows.
SSR = Σ (Ŷi − Ȳ)²
    = Σ Ŷi² − 2Ȳ Σ Ŷi + nȲ²       (and Σ Ŷi = nȲ)
    = Σ Ŷi² − nȲ²
    = ŶᵀŶ − nȲ²
    = β̂ᵀXᵀXβ̂ − nȲ²
    = YᵀX(XᵀX)⁻¹XᵀX(XᵀX)⁻¹XᵀY − nȲ²
    = YᵀHY − nȲ².
In testing H0 we are asking if there is sufficient evidence to reject the null model.
Source               d.f.     SS            MS             VR
Overall regression   p − 1    YᵀHY − nȲ²    SSR/(p − 1)    MSR/MSE
H0 : β1 = β2 = . . . = βp−1 = 0,
versus
H1 : ¬H0 .
We reject H0 at the 100α% level of significance if Fcal > Fα;p−1,n−p.
Recall that β̂ ~ Np(β, σ²(XᵀX)⁻¹). Therefore,
βbj ∼ N (βj , σ 2 cjj ), j = 0, 1, 2, . . . , p − 1,
where cjj is the jth diagonal element of (X T X)−1 (counting from 0 to p − 1).
Hence, it is straightforward to make inferences about βj , in the usual way.
A (1 − α)100% CI for βj is β̂j ± tα/2,n−p √(S²cjj), where S² = MSE. To test H0 : βj = 0 we use

T = β̂j / √(S²cjj) ~ tn−p   if H0 is true.
Care is needed in interpreting the confidence intervals and tests. They refer only to the model we are fitting. Thus not rejecting H0 : βj = 0 does not mean that Xj has no explanatory power; it means that, conditionally on X1, . . . , Xj−1, Xj+1, . . . , Xp−1 being in the model, Xj has no additional explanatory power.
It is often best to think of the test as comparing models without and with Xj, i.e.

H0 : E(Yi) = β0 + β1 x1,i + · · · + βj−1 xj−1,i + βj+1 xj+1,i + · · · + βp−1 xp−1,i

versus

H1 : E(Yi) = β0 + β1 x1,i + · · · + βp−1 xp−1,i.

It does not tell us anything about the comparison between models E(Yi) = β0 and E(Yi) = β0 + βj xj,i.
We have

Ê(Y) = μ̂ = Xβ̂.

As with simple linear regression, we might want to estimate the expected response at a specific x, say x0 = (1, x1,0, . . . , xp−1,0)ᵀ, i.e. µ0 = x0ᵀβ.
Theorem 3.6.

μ̂0 = x0ᵀβ̂ ~ N( µ0, σ²x0ᵀ(XᵀX)⁻¹x0 ).
Proof.

(ii) E(μ̂0) = E(x0ᵀβ̂) = x0ᵀ E(β̂) = x0ᵀβ = µ0.

(iii) var(μ̂0) = var(x0ᵀβ̂) = x0ᵀ Var(β̂) x0 = σ²x0ᵀ(XᵀX)⁻¹x0.
In this section we show that, in the model

Y = Xβ + ε,   ε ~ N(0, σ²I),

we have

E(S²) = σ².
For this we need some results on matrix algebra.
A = CDCᵀ,

where

D = [ Ir  0
      0   0 ].
trace(A) = rank(A), since

trace(A) = trace(CDCᵀ) = trace(CᵀCD) = trace(D) = r.
Lemma 3.8.

rank(I − H) = n − p.

Proof.

rank(I − H) = trace(I − H)
            = trace(I) − trace(H)
            = n − trace{X(XᵀX)⁻¹Xᵀ}
            = n − trace{XᵀX(XᵀX)⁻¹}
            = n − trace(Ip)
            = n − p.
Theorem 3.7. Let Y = Xβ + ε be a linear model such that E(Y) = Xβ and Var(Y) = σ²In. Then the error sum of squares, SSE, has expectation

E(SSE) = (n − p)σ²,

where p is the number of parameters in the model.

Proof. SSE = Yᵀ(I − H)Y, so

E(SSE) = E[Yᵀ(I − H)Y]
       = trace[(I − H) Var(Y)] + E(Y)ᵀ(I − H) E(Y)
       = σ² trace(I − H) + βᵀXᵀ(I − X(XᵀX)⁻¹Xᵀ)Xβ
       = σ²(n − p) + βᵀ(XᵀX − XᵀX(XᵀX)⁻¹XᵀX)β
       = σ²(n − p).
Corollary 3.4.

E(MSE) = σ².

To show that

(n − p)S²/σ² ~ χ²n−p,

the result we have used for deriving F tests, we will need the following lemmas.
e = (I − H)ε.

Proof.

e = Y − Ŷ
  = Y − HY
  = (I − H)Y
  = (I − H)(Xβ + ε)
  = Xβ − HXβ + (I − H)ε
  = (I − H)ε     (since HX = X).

Corollary 3.5.

e ~ Nn(0, σ²(I − H)).
Theorem 3.8.

(n − p)S²/σ² ~ χ²n−p.
Proof.

(n − p)S²/σ² = SSE/σ²
            = (1/σ²) eᵀe
            = (1/σ²) εᵀ(I − H)ᵀ(I − H)ε
            = (1/σ²) εᵀ(I − H)ε
            = (1/σ²) εᵀCDCᵀε
            = ZᵀDZ,

where Z = (1/σ)Cᵀε and C, D come from the decomposition I − H = CDCᵀ above. We assume that ε ~ Nn(0, σ²I). Hence Z is also normal with E(Z) = 0 and

Var(Z) = (1/σ²) Cᵀ Var(ε) C = (σ²/σ²) CᵀC = CᵀC = I,

as C is orthogonal. Hence

Z ~ N(0, I)

and so the Zi are independent and each distributed as N(0, 1). Also,

ZᵀDZ = Σ_{i=1}^{n−p} Zi².

Hence by Lemma 4.5 we have

(n − p)S²/σ² = Σ_{i=1}^{n−p} Zi² ~ χ²n−p.
F = MSR/MSE ~ Fp−1,n−p under H0.
We have

SSR = YᵀHY − nȲ² = β̂ᵀXᵀY − nȲ².

Write β̂ᵀ = (β̂0, β̂*ᵀ) and X = (1, X*), where

X*ᵀ = [ x1,1    · · ·   x1,n
          ⋮               ⋮
        xp−1,1  · · ·   xp−1,n ],    β̂* = (β̂1, . . . , β̂p−1)ᵀ.

This gives

SSR = ( β̂0  β̂*ᵀ ) [ 1ᵀY
                     X*ᵀY ] − nȲ²
    = β̂0 1ᵀY + β̂*ᵀX*ᵀY − nȲ²
    = β̂0 nȲ − nȲ² + β̂*ᵀX*ᵀY.

Now

β̂0 = Ȳ − (β̂1 x̄1 + · · · + β̂p−1 x̄p−1) = Ȳ − β̂*ᵀx̄,
CᵀC = X*cᵀX*c,    C = (Cᵀ)⁻¹X*cᵀX*c.

Hence, under H0,

Z = Cβ̂* ~ Np−1(0, σ²I),
that is,

Zi ~iid N(0, σ²)

and

(1/σ)Zi ~iid N(0, 1).
Now

(1/σ²) SSR = (1/σ²) β̂*ᵀX*cᵀY
           = (1/σ²) β̂*ᵀX*cᵀX*c β̂*      (by the normal equations)
           = (1/σ²) β̂*ᵀCᵀCβ̂*
           = (1/σ²) ZᵀZ
           = (1/σ²) Σ_{i=1}^{p−1} Zi²
           = Σ_{i=1}^{p−1} (Zi/σ)².

Hence

SSR/σ² ~ χ²p−1.
Corollary 3.6.

(p − 1)MSR/σ² ~ χ²p−1.
From this, Theorem 4.3 and the independence of MSE and MSR we obtain

Theorem 3.10.

F = MSR/MSE ~ Fp−1,n−p.
Chapter 4
Model Checking
To standardize a random variable we subtract its mean and divide by its standard
error. Hence, to standardize residuals we calculate
di = (ei − E(ei)) / √var(ei) = ei / √(σ²(1 − hii)).
Then
di ∼ N (0, 1).
They are not independent, though for large samples the correlation should be
small.
Shapes of various residual plots can show whether the model assumptions are
approximately met.
Figure 4.1: (a) No problem apparent. (b) Clear non-linearity.

Figure 4.2: (a) No problem apparent. (b) Variance increases as the mean response increases.
To check whether the distribution of the residuals follows a normal distribution
we can draw a so called Normal Probability Plot. It plots each value of ordered
residuals vs. the percentage of values in the sample that are less than or equal
to it, along a fitted distribution line. The scales are transformed so that the fitted
distribution forms a straight line. A plot that departs substantially from linearity
suggests that the error distribution is not normal as shown in plots 4.3 - 4.6.
Figure 4.3: (a) Histogram of data simulated from a standard normal distribution; (b) Normal Probability Plot, no problem apparent.

Figure 4.4: (a) Histogram of data simulated from a Log-normal distribution; (b) Normal Probability Plot indicates skewness of the distribution.

Figure 4.5: (a) Histogram of data simulated from a Beta distribution; (b) Normal Probability Plot indicates light tails.

Figure 4.6: (a) Histogram of data simulated from a Student t-distribution; (b) Normal Probability Plot indicates heavy tails.
ri = ei / (S√(1 − hii)),    hii = 1/n + (xi − x̄)²/Sxx.
One rule of thumb is that observations with standardized residuals greater (in ab-
solute value) than 2 are possible outliers. However, with a large number of ob-
servations there is more chance that a strange observation will occur in a data set.
So, we need to be cautious when deciding about such values.
Note that

Σ_{i=1}^n hii = Σ_{i=1}^n ( 1/n + (xi − x̄)²/Sxx ) = 2,

which equals the number of parameters in the simple linear regression model.
We can use a statistic called Cook’s distance to measure the influence of an obser-
vation.
For a simple linear regression model consider omitting the ith observation (xi , yi )
and refitting the model. Denote the new fitted values by ŷ (i) . We define Cook’s
statistic for case i to be
Di = (1/(2s²)) Σ_{j=1}^n ( ŷj^(i) − ŷj )².
This shows that Di depends on both the size of the residual ei and the leverage
hii . So a large value of Di can occur due to large ei or large hii .
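A small numpy sketch (with invented x, y values) of the quantities just described for a simple linear regression: leverages hii, standardized residuals ri and Cook's distances Di.

import numpy as np

x = np.array([15., 26., 10., 9., 15., 20., 18., 11., 8., 20.])
y = np.array([95., 71., 83., 91., 102., 87., 93., 100., 104., 94.])
n, p = len(y), 2

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverages h_ii
e = y - H @ y                               # residuals
s2 = e @ e / (n - p)                        # MSE
r = e / np.sqrt(s2 * (1 - h))               # standardized residuals
D = r**2 * h / ((1 - h) * p)                # Cook's distances
print(np.round(D, 3))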
Obs. x y Obs. x y
1 15 95 11 7 113
2 26 71 12 9 96
3 10 83 13 10 83
4 9 91 14 11 84
5 15 102 15 11 102
6 20 87 16 10 100
7 18 93 17 12 105
8 11 100 18 42 57
9 8 104 19 17 121
10 20 94 20 11 86
21 10 100
The data represent the Gesell’s adaptive scores (y) versus age of infants (x, in
months) at first word. The scatter plot indicates two unusual observations: one is
a large value of y compared to other values at a similar x and one is a large value
of x, which is far from all the other x values.
We have seen that the residuals for the plasma data are not likely to be a sample
from a normal distribution with a constant variance. One of the reasons can be
that the straight line is not a good choice of the model. This fact can be easily
seen here, but we can also test lack of fit. The test function is also based on the
model assumptions so we should not see clear evidence against the assumptions
for the test to be valid.
The test is possible when we have replications, that is more than one observa-
tion for some values of the explanatory variable. In Example 2.7 we have five
observations for each age xi .
Notation:
Denote by Yij the j-th response at xi, i = 1, . . . , m, j = 1, . . . , ni, so that the total number of observations is n = Σ_{i=1}^m ni. The average response at xi is

Ȳi = (1/ni) Σ_{j=1}^{ni} Yij.

We denote the fitted response at xi by Ŷi, which is the same for all observations at xi.
These differences arise for two reasons. Firstly the j-th observation of a given xi
is an outcome of a random variable. Observations obtained for the same value of
X may produce different values of Y . Secondly the model we fit may not be a
good one.
How could we distinguish between the random variation and the lack of fit? We
need more than one observation at xi to be able to do it.
The difference

Yij − Ȳi

indicates the random variation at xi; it is called pure error. The difference between the mean and the fitted response, i.e.

Ȳi − Ŷi,

indicates how far the fitted model is from the local mean and so reflects the lack of fit.
Using the double index notation we may write the sum of squares for residuals as

SSE = Σ_{i=1}^m Σ_{j=1}^{ni} (Yij − Ŷi)².
Proof.

SSE = Σ_{i=1}^m Σ_{j=1}^{ni} (Yij − Ŷi)²
    = Σ_{i=1}^m Σ_{j=1}^{ni} { (Yij − Ȳi) + (Ȳi − Ŷi) }²
    = Σ_{i=1}^m Σ_{j=1}^{ni} (Yij − Ȳi)² + Σ_{i=1}^m ni (Ȳi − Ŷi)² + 2 Σ_{i=1}^m Σ_{j=1}^{ni} (Yij − Ȳi)(Ȳi − Ŷi)
    = SSPE + SSLoF + 2 Σ_{i=1}^m (Ȳi − Ŷi) Σ_{j=1}^{ni} (Yij − Ȳi)
    = SSPE + SSLoF,

since Σ_{j=1}^{ni} (Yij − Ȳi) = 0.
This theorem shows how the residual sum of squares is split into two parts, one
due to the pure error and one due to the model lack of fit. To work out the split of
the degrees of freedom, note that to calculate SSP E we must calculate m sample
means Ȳi , i = 1, . . . , m. Each sample mean takes up one degree of freedom. Thus
the degrees of freedom for pure error are n − m. By subtraction, the degrees of
freedom for lack of fit are
νLoF = νE − νP E = (n − 2) − (n − m) = m − 2.
ANOVA table
It can also be shown that if the simple linear regression model is true then
E[SSLoF ] = (m − 2)σ 2 .
Hence, both M SP E and M SLoF give us unbiased estimators of σ 2 , but the latter
one only if the model is true.
Let

H0 : the simple linear regression model is "true",
H1 : ¬H0.

Then, under H0,

(m − 2)MSLoF / σ² ~ χ²m−2.

Also,

(n − m)MSPE / σ² ~ χ²n−m,

whatever the model. Hence, under H0, the ratio of these two independent statistics, each divided by the respective degrees of freedom, is distributed as Fm−2,n−m, namely

F = MSLoF / MSPE ~ Fm−2,n−m under H0.
Note that we can only do this lack of fit test if we have replications. These have to
be true replications, not just repeated measurements on the same sampling unit.
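A hedged Python sketch of the pure error / lack of fit decomposition with replicated x values (made-up data; m = 3 distinct x values with 3 replicates each):

import numpy as np
from scipy import stats

x = np.array([1., 1., 1., 2., 2., 2., 3., 3., 3.])
y = np.array([2.1, 1.9, 2.3, 3.2, 2.9, 3.1, 3.8, 4.1, 4.0])
n = len(y)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
SSE = np.sum((y - fitted)**2)

levels = np.unique(x)
m = len(levels)
SSPE = sum(np.sum((y[x == lv] - y[x == lv].mean())**2) for lv in levels)
SSLoF = SSE - SSPE

F = (SSLoF / (m - 2)) / (SSPE / (n - m))
p_value = stats.f.sf(F, m - 2, n - m)
print(F, p_value)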
Source DF SS MS F P
Regression 1 2.6554 2.6554 60.63 0.000
Residual Error 23 1.0073 0.0438
Lack of Fit 3 0.0885 0.0295 0.64 0.597
Pure Error 20 0.9188 0.0459
Total 24 3.6627
The p-value is 0.597 so the numerical output shows no reason to doubt the fit of
this model.
where xiᵀ = (1, xi) is the ith row of the matrix X.
We can easily obtain other results we have seen for the SLRM written in non-
matrix notation, now using the matrix notation, both for the full model and for a
reduced SLM (no intercept or zero slope).
Now, by Theorem 2.1, Var[β̂] = σ²(XᵀX)⁻¹. Thus

var[β̂0] = σ² Σ xi² / (nSxx),

which, by writing Σ xi² = Σ xi² − nx̄² + nx̄², can be written as σ²( 1/n + x̄²/Sxx ).

Also,

cov(β̂0, β̂1) = σ²(−nx̄)/(nSxx) = −σ²x̄/Sxx,

and

var[β̂1] = σ²/Sxx.
e = Y − Ŷ = (I − H)Y.
We see that the residuals may have different variances which may make detecting
outlying observations more difficult. So we define the standardized residuals as
follows:

ri = ei / √(S²(1 − hii)).

For large samples hij will be small (for i ≠ j) and we have an asymptotic result that the standardized residuals are approximately independent and identically distributed as N(0, 1).
This allows us to carry out model checking. Note however that it will be most
reliable for large samples.
Apart from the usual residual diagnostics as done in the residual plots in R, we may use standardized residuals to check the form of the expected response, for example by plotting them against each explanatory variable.
Also, as with simple linear regression, outliers may be evident from the residual plots.
We noted in section 4.2.1 that an observation with high leverage was potentially
influential. We discuss this in greater detail here. The vector of fitted values is
ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy,

and the ith fitted value can be written as

ŷi = Σ_{j=1}^n hij yj = hii yi + Σ_{j≠i} hij yj.
The weight hii indicates how heavily yi contributes to the fitted value ybi . The
quantity hii is called the leverage of case i. The ith diagonal element, hii , of the
hat matrix H has the following properties:
1. As var(ei ) = σ 2 (1 − hii ), we have hii < 1. This means that hii close to 1
will give var(ei ) ≈ 0 and so ybi ≈ yi , that is, the fitted value will be very
close to the ith observation.
2. hii is usually small when vector (x1,i , . . . , xp−1,i )T is close to the centroid
(x̄1 , . . . x̄p−1 )T and large when the vector is far from the centroid.
3. When p = 2 (SLRM),

hii = 1/n + (xi − x̄)²/Sxx,

and hii = 1/n when xi = x̄.

4. In general, 1/n ≤ hii < 1.
5. Σ_{i=1}^n hii = p, since

Σ_{i=1}^n hii = trace(H)
             = trace(X(XᵀX)⁻¹Xᵀ)
             = trace{(XᵀX)(XᵀX)⁻¹}
             = trace(Ip)
             = p.
There may be various reasons for high leverage. It may be that the data of the
case were collected differently than the rest of the data or simply misrecorded.
It may just be that the case has one or more values which are atypical but cor-
rectly recorded. A low leverage case usually will not influence the fit much; a
high leverage case indicates potential influence, but not all high leverage cases are
influential.
Cook’s distance
Recall from section 4.2.1 that Cook’s distance provides a measure of the influence
of an observation on the fitted model. Let
Y = Xβ + ε, ε ∼ N n (0, σ 2 I).
Denote by βb(i) the estimate of β obtained without the i-th case (x1,i , . . . , xp−1,i , yi ).
Then, βb − βb(i) is a good indicator of the influence of the i-th observation on the
model fit. When pre-multiplied by X, this is the difference between the vectors of
fitted values obtained with all cases included and with the i-th case omitted. The
Cook’s distance, as defined in Section 4.2.1, is
Di = (1/(ps²)) Σ_{j=1}^n ( ŷj^(i) − ŷj )²
   = (1/(ps²)) (ŷ − ŷ^(i))ᵀ(ŷ − ŷ^(i))
   = (1/(ps²)) (Xβ̂ − Xβ̂(i))ᵀ(Xβ̂ − Xβ̂(i))
   = (1/(ps²)) (β̂ − β̂(i))ᵀXᵀX(β̂ − β̂(i)).
Large Cook’s distance indicates that the observation i is influential. Note that, this
depends on both the leverage hii and the standardized residual ri .
The PRESS residual for case i is defined as

e(i) = yi − ŷi^(i),

where ŷi^(i) = xiᵀβ̂(i) and β̂(i) is the vector of least squares estimates of the model parameters obtained without case i. Then,

e(i) = yi − ŷi^(i) = yi − xiᵀβ̂(i)
     = yi − xiᵀ( β̂ − ei(XᵀX)⁻¹xi/(1 − hii) )
     = yi − xiᵀβ̂ + ei xiᵀ(XᵀX)⁻¹xi/(1 − hii)
     = ei + ei hii/(1 − hii) = ei/(1 − hii).

We define PRESS as the sum of squares of the PRESS residuals, that is,

PRESS = Σ_{i=1}^n e(i)² = Σ_{i=1}^n ei²/(1 − hii)².
P RESS assesses the model’s predictive ability. It is used for calculating pre-
dicted R2 .
Predicted R2
This is defined as

R²(pred) = ( 1 − PRESS/SST ) × 100%.
Predicted R2 is used in MLRM to indicate how well the model predicts responses
for new observations. A good model would have R2 and R2 (pred) high and close
to each other. Large discrepancy between these two measures means that the
model may be over-fitted.
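A minimal sketch, assuming a design matrix X and response y are available, of PRESS and the predicted R² computed via the leverages:

import numpy as np

def press_and_pred_r2(X, y):
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y                              # ordinary residuals
    press = np.sum((e / (1 - h))**2)           # sum of squared PRESS residuals
    SST = np.sum((y - y.mean())**2)
    return press, (1 - press / SST) * 100      # predicted R^2 in percent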
Now recall that var(βbj ) = σ 2 cjj so in this case var(βb1 ) = 25.25σ 2 and var(βb2 ) =
25σ 2 which are both large. Also cov(βb1 , βb2 ) = −25σ 2 .
By contrast, if

X = [ −1  −1
      −1   1
       1  −1
       1   1 ]   ⟹   XᵀX = [ 4  0
                              0  4 ],   det(XᵀX) = 16,

(XᵀX)⁻¹ = [ 0.25  0
            0     0.25 ],

so var(β̂j) = 0.25σ² for j = 1, 2 and cov(β̂1, β̂2) = 0.
In these simple cases we can see exactly where the problems are. With more vari-
ables it is not always obvious that some columns of the X matrix are close to being
linear combinations of other columns. This problem is sometimes called multi-
collinearity. These examples illustrate the general problems caused by multi-
collinearity:
(ii) difficulties may arise in variable selection as it will be possible to get very
different models that fit equally well;
(iii) some parameters may have the “wrong” sign; this can be noticed when, for
example, it is obvious that increasing the value of a regressor should result
in an increase in the dependent variable.
A large value of Rj2 (close to 1) will give a large VIFj . In this context a VIF > 10
is taken to indicate that the multi-collinearity may cause problems of the sort noted
above. However, VIF > 4, sometimes even just bigger than 2, can indicate that
an explanatory variable could be excluded from the model.
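A hedged numpy sketch of the VIF computation, obtained by regressing each explanatory variable on the remaining ones; the function name and interface are illustrative only.

import numpy as np

def vif(X):
    """X: n x (p-1) matrix of explanatory variables (no intercept column)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y_j = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = Z @ np.linalg.lstsq(Z, y_j, rcond=None)[0]
        R2_j = 1 - np.sum((y_j - fitted)**2) / np.sum((y_j - y_j.mean())**2)
        vifs.append(1.0 / (1.0 - R2_j))        # VIF_j = 1 / (1 - R_j^2)
    return np.array(vifs)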
Chapter 5

Model Selection
born) to 4 years old (X). The results are given in Table 5.1. We are interested in whether the level of polyamine decreases linearly while the age of children increases up to four years.
If the model checking suggests that the variance is not constant, or that the data
are not from a normal distribution (these often happen together) then it might be
possible to obtain a better model by transforming the observations yi . Commonly
used transformations are
• 1/y.

These are special cases of a large family of transformations, the Box-Cox transformation,

y^(λ) = (y^λ − 1)/λ   when λ ≠ 0;
y^(λ) = ln y          when λ = 0.
The Box-Cox transformation estimates the λ that minimizes the standard deviation of a standardized transformed variable. Trigonometric functions are also used in some cases, in particular the arc-sine or arc-tangent. In practice the log transformation is often the most useful and is generally the first transformation we try, but note that all values of yi need to be positive.
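A minimal sketch of the Box-Cox family of transformations; for a given data set, scipy.stats.boxcox can also be used to estimate λ by maximum likelihood.

import numpy as np

def box_cox(y, lam):
    y = np.asarray(y, dtype=float)          # all y_i must be positive
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam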
We have already mentioned the principle of parsimony; we should use the simplest
model that achieves our purpose.
Figure 5.1: The Matrix Plot indicates that Y is clearly related to X4 and also to X2 . The relation
with other explanatory variables is not that obvious.
Analysis of Variance
Source DF SS MS F P
Regression 1 57064 57064 22.84 0.000
Residual Error 13 32483 2499
Total 14 89547
We can see that the plot of residuals versus fitted values indicates that there may be non-constant variance, and the linearity of the model is also in question. We will add X2 to the model.
82 CHAPTER 5. MODEL SELECTION
Analysis of Variance
Source DF SS MS F P
Regression 2 89012 44506 998.90 0.000
Residual Error 12 535 45
Total 14 89547
Source DF Seq SS
X4 1 57064
X2 1 31948
Still, there is some evidence that the standardized residuals may not have constant
variance. Will this be changed if we add X3 to the model?
Analysis of Variance
Source DF SS MS F P
Regression 3 89013 29671 611.17 0.000
Residual Error 11 534 49
Total 14 89547
Source DF Seq SS
X4 1 57064
X2 1 31948
X3 1 1
Not much better than before. Now we add X1, the explanatory variable least related to Y.
Analysis of Variance
Source DF SS MS F P
Regression 4 89285 22321 851.72 0.000
Residual Error 10 262 26
Total 14 89547
Source DF Seq SS
X4 1 57064
X2 1 31948
X3 1 1
X1 1 272
The residuals now do not contradict the model assumptions. We analyze the nu-
merical output. Here we see that X3 may be a redundant variable as we have no
evidence to reject the hypothesis that β3 = 0 given that all the other variables are
in the model. Hence, we will fit a new model without X3 .
Analysis of Variance
Source DF SS MS F P
Regression 3 89274 29758 1200.14 0.000
Residual Error 11 273 25
Total 14 89547
Source DF Seq SS
X4 1 57064
X2 1 31948
X1 1 262
These residual plots also do not contradict the model assumptions, all the param-
eters are significant and R2 is very large.
Suppose the overall regression model as tested by the Analysis of Variance table
is significant. We know that not all of the β parameters are zero, but we may still
be able to delete several variables.
We can carry out the Subset Test based on the extra sum of squares principle. We
are asking if we can reduce the set of explanatory variables.
X1 , X2 , . . . , Xp−1
to, say,
X1 , X2 , . . . , Xq−1
The difference between the sums of squares is called the extra sum of squares due to Xq, . . . , Xp−1 given X1, . . . , Xq−1 are already in the model and is defined by the equation

SSextra = SSEred − SSE,

where SSEred is the residual SS under the reduced model and SSE is the residual SS under the full model.
Notation:
Let

β1ᵀ = (β0, β1, . . . , βq−1),   β2ᵀ = (βq, βq+1, . . . , βp−1),

so that

β = (β1ᵀ, β2ᵀ)ᵀ.

Similarly divide X into two submatrices X1 and X2, so that X = (X1, X2), where

X1 = [ 1  x1,1  · · ·  xq−1,1           X2 = [ xq,1  · · ·  xp−1,1
       ⋮    ⋮             ⋮                      ⋮              ⋮
       1  x1,n  · · ·  xq−1,n ],               xq,n  · · ·  xp−1,n ].
The reduced model

Y = X1β1 + ε*

has

SSRred = β̂1ᵀX1ᵀY − nȲ²,
SSEred = YᵀY − β̂1ᵀX1ᵀY.
Source                                d.f.     SS         MS                 VR
Overall regression                    p − 1    SSR
X1, . . . , Xq−1                      q − 1    SSRred
Xq, . . . , Xp−1 | X1, . . . , Xq−1   p − q    SSextra    SSextra/(p − q)    SSextra/{(p − q)MSE}
Residual                              n − p    SSE        MSE
Total                                 n − 1    SST
In the ANOVA table we use the notation Xq , . . . , Xp−1 |X1 , . . . , Xq−1 to denote
that this is the effect of the variables Xq , . . . , Xp−1 given that the variables X1 , . . . , Xq−1
are already included in the model.
Note that we can repeatedly test individual parameters. Then we have the following sums of squares and degrees of freedom.
Source of variation           df       SS
Full model                    p − 1    SSR
X1                            1        SS(β1)
X2 | X1                       1        SS(β2 | β1)
X3 | X1, X2                   1        SS(β3 | β1, β2)
⋮                             ⋮        ⋮
Xp−1 | X1, . . . , Xp−2       1        SS(βp−1 | β1, . . . , βp−2)
Residual                      n − p    SSE
Total                         n − 1    SST
The output depends on the order the predictors are entered into the model. The
sequential sum of squares is the unique portion of SSR explained by a predictor,
given any previously entered predictors. If we have a model with three predictors,
X1 , X2 , and X3 , the sequential sum of squares for X3 shows how much of the
remaining variation X3 explains given that X1 and X2 are already in the model.
Instead we usually compare models by calculating a few statistics for each model.
Three statistics that are most useful are M SE , R2 and Cp .
If the full model with all candidate explanatory variables is correct then
E(M SE ) = σ 2 .
(i) the one with the smallest number of explanatory variables (parameters) for
which M SEred is close to M SE of the full model;
(ii) the one with smallest M SEred .
Condition (i) aims for the simplest acceptable model. Condition (ii) is more con-
servative and should be considered carefully as it may just suggest the full model.
Denote by p̃ the number of parameters in the reduced model; for the full model p̃ = p. Then a sketch of the smallest MSEred for a given p̃, denoted further by MSE_p̃, against p̃ can be useful.
Coefficient of determination R2
The superscript p̃ indicates that the sums of squares are calculated for a model with p̃ parameters.

Adding terms to a model always increases R². However, the model with p̃ parameters, for p̃ as small as possible, having R²_p̃ close to R²_p (i.e., the value obtained from the full model) might be regarded as being best. Judgement is required and a plot of R²_p̃ against p̃ can be useful to identify where the plot levels off.
It takes into account the number of parameters in the model and can be useful for
comparing models with different numbers of predictors.
Mallows’ statistic Cp
It can also be shown that C_p̃ is an estimator of the mean square error of prediction, i.e. of

(1/n) Σ_{i=1}^n [ var(Ŷi) + {bias(Ŷi)}² ].
This suggests minimizing C_p̃. Thus, we should choose either
Note that C_p̃ depends on the unknown σ². If we take MSE from the full model as the estimator of σ², then

Ĉ_p̃ = SSE_p̃ / MSE + 2p̃ − n.

It can be shown that

E(Ĉ_p̃) = (n − p̃)²/(n − p̃ − 2) + 2p̃ − n,

so instead we could use an adjusted C_p̃ defined by

C̄_p̃ = [(n − p̃ − 2)/(n − p̃)] · SSE_p̃/MSE + 2p̃ − n,

with expectation p̃. A little algebra shows that

C̄_p̃ = [(n − p̃ − 2)/(n − p̃)] Ĉ_p̃ + 2(2p̃ − n)/(n − p̃).
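A minimal Python sketch of Ĉ_p̃ for a candidate subset, taking MSE from the full model as the estimate of σ²; the function and its arguments are hypothetical helpers, not part of the notes.

import numpy as np

def mallows_cp(X_full, X_sub, y):
    n = len(y)
    def sse(X):
        fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return np.sum((y - fitted)**2)
    p_full = X_full.shape[1]
    p_tilde = X_sub.shape[1]
    mse_full = sse(X_full) / (n - p_full)       # estimate of sigma^2
    return sse(X_sub) / mse_full + 2 * p_tilde - n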
We can see that the model including X1 , X2 and X4 has very good values of the
measures we have just talked about. Let us call this model M1 . Also, the full
model has very good values of these measures. Let us call the full model M2 .
Which one should we choose?
R2 , Mallows’ Cp and S are helpful, but it is not sufficient to base the final decision
on these measures only. We should also do the residual diagnostics for the final
competing models as well as hypothesis testing for the model parameters.
Below, we see that the residuals in neither of the two models contradict the model
assumptions, but if we add X3 given that X1 , X2 , X4 are already there, we do not
gain much (SS(β3 |β1 , β2 , β4 ) is very small). Also, we have no evidence to reject
the null hypothesis that β3 = 0 given the other three variables are in the model.
Hence, it is better to choose the model with fewer parameters, that is M1 rather
than M2 . Fewer parameters give us more degrees of freedom (n−p) for estimating
the error variance, that is, give a more precise estimate of σ 2 .
Figure 5.2: (a) Residual plots for the model fit including X1, X2, X4. (b) Residual plots for the model fit including all explanatory variables.
Analysis of Variance
Source DF SS MS F P
Regression 4 89285 22321 851.72 0.000
Residual Error 10 262 26
Total 14 89547
Source DF Seq SS
X1 1 1074
X2 1 44505
X4 1 43695
X3 1 11
Chapter 6

Interpretation of Fitted Models

6.1 Prediction
Apart from making inference on the mean response we may also try to do it for
a new response itself, that is for an unknown (not observed) response at some x0 .
For example, we might want to predict an overhead cost for another department
of the same structure whose total labor hours are x0 (Example 3.2). In this section
we derive a Prediction Interval (PI) for a response
Y0 = β0 + β1 x0 + ε0 = µ0 + ε0,   ε0 ~ N(0, σ²),
To obtain a prediction interval (PI) for the unknown observation we may use the
point predictor and its distribution as follows. First, we will find the distribution
of μ̂0 − Y0. Note that for

μ̂0 − Y0 = μ̂0 − (µ0 + ε0),

we have E(μ̂0 − Y0) = 0 and

var(μ̂0 − Y0) = var(μ̂0) + var(µ0 + ε0) = aσ² + σ² = σ²(1 + a),

where a = 1/n + (x0 − x̄)²/Sxx. Hence

μ̂0 − Y0 ~ N(0, σ²(1 + a))

and

(μ̂0 − Y0) / √(S²[1 + a]) ~ tn−2.
This interval is wider than the CI for the mean response µ0 . This is because to
predict a new observation rather than a mean, we need to add the variability of the
additional random error ε0 . Again, we should only make predictions for values of
x0 within the range of the data.
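A hedged sketch contrasting the CI for the mean response with the PI for a new observation at x0, assuming the SLRM summaries (β̂0, β̂1, S², n, x̄, Sxx) have already been computed:

import numpy as np
from scipy import stats

def ci_and_pi(x0, beta0_hat, beta1_hat, S2, n, xbar, Sxx, alpha=0.05):
    mu0_hat = beta0_hat + beta1_hat * x0
    a = 1.0 / n + (x0 - xbar)**2 / Sxx
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    ci = (mu0_hat - t_crit * np.sqrt(S2 * a),
          mu0_hat + t_crit * np.sqrt(S2 * a))          # CI for the mean response
    pi = (mu0_hat - t_crit * np.sqrt(S2 * (1 + a)),
          mu0_hat + t_crit * np.sqrt(S2 * (1 + a)))    # PI for a new observation
    return ci, pi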
For the example on the overhead cost (Example 3.2) the confidence and prediction
intervals (here they are for x0 = 1000 hours) are:
We may say, with 95% confidence, that when the total direct labour hours are equal to
1000, then the expected total departmental cost would be between £26374 and £28210,
however if we were to observe the total cost for a 1000 hours of labour it might be anything
between £23645 and £30939.
Figure 6.1: Data, fitted line plot, CI for the mean and PI for a new observation at
any x0 .
To predict a new observation we need to take into account not only its expectation, but
also a possible new random error.
Another useful class of linear models are polynomial regression models, e.g.,
Yi = β0 + β1 xi + β11 xi² + εi,

or, in matrix notation,

Y = Xβ + ε,   ε ~ Nn(0, σ²I),

where the rows of the matrix X are of the form (1, xi, xi²) and β = (β0, β1, β11)ᵀ. The
quadratic model belongs to the class of linear models as it is linear in the parameters.
If we wish to compare the quadratic regression model with the simple linear regression
model we fit Yi = β0 +β1 xi +β11 x2i +εi and test the null hypothesis H0 : β11 = 0 against
an alternative H1 : β11 6= 0. If we reject H0 the quadratic model gives a significantly
better fit than the simple linear model. This can be extended to cubic and higher order
polynomials. As higher powers of x quickly become large it is usually sensible to centre
x by subtracting its mean. Denote

zi = xi − x̄.
We can also have a second (or higher) order polynomial regression model in two (or more)
explanatory variables. For example,

Yi = β0 + β1 x1i + β2 x2i + β11 x1i² + β22 x2i² + β12 x1i x2i + εi.
This model is very commonly used in experiments for exploring response surfaces. Note
that if the second order terms x21i , x22i and x1i x2i are in the model then we should not
consider removing the first order terms x1i and x2i .
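A short numpy sketch (with invented x1, x2 values) of the design matrix for a second-order model in two centred explanatory variables:

import numpy as np

x1 = np.array([6., 8., 10., 12., 14.])
x2 = np.array([20., 21., 22., 23., 24.])
z1, z2 = x1 - x1.mean(), x2 - x2.mean()       # centred variables

X = np.column_stack([np.ones(len(z1)), z1, z2, z1**2, z2**2, z1 * z2])
print(X)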
Yi = β0 + β1 f1(x̃i) + . . . + βp−1 fp−1(x̃i) + εi,

which can be written as

Yi = xiᵀβ + εi,
y x1 x2 y x1 x2 y x1 x2 y x1 x2 y x1 x2
49.2 6 20 51.5 8 20 51.1 10 20 48.6 12 20 43.2 14 20
48.1 6 21 51.7 8 21 51.5 10 21 47.0 12 21 42.6 14 21
48.0 6 22 50.4 8 22 50.3 10 22 48.0 12 22 42.1 14 22
49.6 6 23 51.2 8 23 48.9 10 23 46.4 12 23 43.9 14 23
47.0 6 24 48.4 8 24 48.7 10 24 46.2 12 24 40.5 14 24
where z1 = x1 − x̄1 and z2 = x2 − x̄2.
Analysis of Variance
Source DF SS MS F P
Regression 5 227.587 45.517 62.62 0.000
Residual Error 19 13.810 0.727
Total 24 241.398
Source DF Seq SS
z1 1 116.129
z2 1 14.045
z1ˆ2 1 96.057
z2ˆ2 1 1.344
z1z2 1 0.012
The sequential sums of squares suggest that we can drop z22 and the product z1 z2 .
Analysis of Variance
Source DF SS MS F P
Regression 3 226.231 75.410 104.41 0.000
Residual Error 21 15.167 0.722
Total 24 241.398
Source DF Seq SS
z1 1 116.129
z2 1 14.045
z1ˆ2 1 96.057
Here we see that the residuals are not normal (the p-value is 0.013 for the test
of normality). Hence, some further analysis is needed. Various transformations
of the response variable did not work here. The residuals improve when z2 is
removed. The new model fit is below.
Analysis of Variance
Source DF SS MS F P
Regression 2 212.19 106.09 79.90 0.000
Residual Error 22 29.21 1.33
Lack of Fit 2 0.39 0.19 0.13 0.875
Pure Error 20 28.82 1.44
Total 24 241.40
Source DF Seq SS
z1 1 116.13
z1ˆ2 1 96.06
The residuals are slightly better here and do not clearly contradict the assump-
tion of normality. The Lack of Fit test does not indicate any evidence against the
model. R² is good. The model is parsimonious, so we may stay with this one.
However, we could advise the agronomist that in future experiments of this kind
he might consider a wider range of temperature values, which would help to es-
tablish clearly whether this factor could be significant for yield of the new hybrid
tomatoes.
Chapter 7

Qualitative Explanatory Variables
Operator
1 2 3 4
59.8 59.8 60.7 61.0
60.0 60.2 60.7 60.8
60.8 60.4 60.5 60.6
60.8 59.9 60.9 60.5
59.8 60.0 60.3 60.5
$$Y = X_f\beta_f + \varepsilon, \qquad \varepsilon \sim N(\mathbf{0}, I\sigma^2), \tag{7.1}$$
where Y is 20 × 1, X_f is 20 × 5, β_f is 5 × 1 and ε is 20 × 1.
Equivalently,
$$Y_{ij} = \beta_{0f} + \beta_{1f}x_{1j} + \beta_{2f}x_{2j} + \beta_{3f}x_{3j} + \beta_{4f}x_{4j} + \varepsilon_{ij},$$
where
$$x_{kj} = \begin{cases}1 & \text{if } k = i,\\ 0 & \text{otherwise},\end{cases}$$
and i = 1, . . . , 4 and j = 1, . . . , 5.
In (7.1),
$$X_f = \begin{pmatrix}
\mathbf{1}_5 & \mathbf{1}_5 & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{1}_5 & \mathbf{0} & \mathbf{1}_5 & \mathbf{0} & \mathbf{0}\\
\mathbf{1}_5 & \mathbf{0} & \mathbf{0} & \mathbf{1}_5 & \mathbf{0}\\
\mathbf{1}_5 & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1}_5
\end{pmatrix},$$
where $\mathbf{1}_5$ and $\mathbf{0}$ denote 5 × 1 vectors of ones and zeros, and
$$\beta_f = \begin{pmatrix}\beta_{0f}\\ \beta_{1f}\\ \beta_{2f}\\ \beta_{3f}\\ \beta_{4f}\end{pmatrix}.$$
However, we can only make comparative statements about the treatments, not absolute ones. If we try to estimate β_f from model (7.1) as
$$\hat\beta_f = (X_f^TX_f)^{-1}X_f^TY,$$
we will find that X_f^TX_f is singular, as X_f does not have full column rank: the last four columns of X_f sum to the first, the column of 1's.
One way round this is to drop one of the indicator columns, here the one for the last operator, and fit
$$Y = \begin{pmatrix}
\mathbf{1}_5 & \mathbf{1}_5 & \mathbf{0} & \mathbf{0}\\
\mathbf{1}_5 & \mathbf{0} & \mathbf{1}_5 & \mathbf{0}\\
\mathbf{1}_5 & \mathbf{0} & \mathbf{0} & \mathbf{1}_5\\
\mathbf{1}_5 & \mathbf{0} & \mathbf{0} & \mathbf{0}
\end{pmatrix}
\begin{pmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\end{pmatrix} + \varepsilon.$$
For an observation on operator i, i = 1, 2, 3,
$$E(Y) = \beta_0 + \beta_i,$$
while for operator 4,
$$E(Y) = \beta_0.$$
Thus β0 is the mean response for operator 4, and βi is the difference between the mean responses of operators i and 4.
$$\bar Y_i = \frac{1}{r}\sum_{j=1}^{r} Y_{ij} \sim N\!\left(f(x_i)^T\beta,\ \sigma^2/r\right). \tag{7.2}$$
$$\bar Y = \bar X\beta + \bar\varepsilon,$$
with
$$\bar Y = \begin{pmatrix}\bar Y_1\\ \vdots\\ \bar Y_p\end{pmatrix},\qquad
\bar X = \begin{pmatrix}
1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\
1 & 0 & 0 & 0
\end{pmatrix},$$
and
$$\bar\varepsilon \sim N\!\left(\mathbf{0},\ \frac{\sigma^2}{r}I\right).$$
Note that β contains the same parameters as before (averaging the observations does not change the mean structure). We now have
$$\hat\beta = (\bar X^T\bar X)^{-1}\bar X^T\bar Y,$$
and, since $\bar X$ is square and non-singular,
$$\bar X\hat\beta = \bar X(\bar X^T\bar X)^{-1}\bar X^T\bar Y = \bar X\,\bar X^{-1}(\bar X^T)^{-1}\bar X^T\bar Y = \bar Y.$$
Hence Ŷ = Ȳ: the fitted value for each treatment is the corresponding treatment mean, regardless of the form of f(x_i)^T β.
7.1.1 ANOVA
Source       df        SS                            MS
Treatment    p − 1     β̂^T(X^T X)β̂ − N Ȳ²           SS/(p − 1)
Residual     N − p     (Y − Xβ̂)^T(Y − Xβ̂)            SS/(N − p)
Total        N − 1     Y^T Y − N Ȳ²

In this table,
$$\hat\beta^T(X^TX)\hat\beta - N\bar Y^2 = (X\hat\beta)^TX\hat\beta - N\bar Y^2 = \hat Y^T\hat Y - N\bar Y^2.$$
For the operator data we obtain:
Source df SS MS
Operator 3 1.34 0.447
Residual 16 1.70 0.106
Total 19 3.04
Under the null hypothesis of no differences between the treatments,
$$\frac{\text{Treatment MS}}{\text{Residual MS}} \sim F_{p-1,\,N-p},$$
which here is F_{3,16}.
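As a quick check, the operator analysis of variance can be reproduced with the following sketch (assuming scipy is available); the F statistic returned should equal Treatment MS divided by Residual MS from the table above:

```python
from scipy import stats

# Measurements for each of the four operators (from the table above).
op1 = [59.8, 60.0, 60.8, 60.8, 59.8]
op2 = [59.8, 60.2, 60.4, 59.9, 60.0]
op3 = [60.7, 60.7, 60.5, 60.9, 60.3]
op4 = [61.0, 60.8, 60.6, 60.5, 60.5]

F, p = stats.f_oneway(op1, op2, op3, op4)   # F ~ F_{3,16} under H0
print(F, p)
```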
7.2 Factorial Experiments

In many experiments, interest lies in the study of the effects of two or more factors simultaneously. Each factor is assigned two coded levels, +1 (high) and −1 (low).
For example, the coded settings +1, −1, +1, +1 (high, low, high, high) of four factors might correspond to the actual values 20, 19, 7 and 1.33 in their original units.
Main effects: to measure the average effect of a factor, say A, we can compute
$$\operatorname{ME}(A) = \bar y(A+) - \bar y(A-),$$
the difference between the average response with A at its high level and the average response with A at its low level. For example, ME(temperature) is the effect of changing temperature from low to high, averaged across all other factor levels.
[Figure: plots of expected response against the levels (−1, +1) of Factor A, illustrating the main effect of A.]
Interactions: We can measure the joint effect of changing two or more factors
simultaneously through an interaction.
A two factor interaction can be interpreted as one-half the difference in the main
effect of A when B is set to its high and low levels:
$$\operatorname{Int}(A,B) = \tfrac{1}{2}\left[\operatorname{ME}(A|B+) - \operatorname{ME}(A|B-)\right] = \tfrac{1}{2}\left[\operatorname{ME}(B|A+) - \operatorname{ME}(B|A-)\right],$$
where ME(A|B+) and ME(A|B−) denote the main effect of A computed with B held at its high and its low level, respectively.

[Figure: interaction plots of expected response against Factor A for B = −1 and B = +1.]
We can define higher order interactions similarly, e.g. the ABC interaction mea-
sures how the AB interaction changes with the levels of C:
$$\operatorname{Int}(A,B,C) = \tfrac{1}{2}\left[\operatorname{Int}(A,B|C+) - \operatorname{Int}(A,B|C-)\right] = \tfrac{1}{2}\left[\operatorname{Int}(A,C|B+) - \operatorname{Int}(A,C|B-)\right] = \tfrac{1}{2}\left[\operatorname{Int}(B,C|A+) - \operatorname{Int}(B,C|A-)\right].$$
For m factors, each having two levels, there are 2^m combinations (treatments) of factor values.
[Figure: the 2² factorial design points (−1,−1), (+1,−1), (−1,+1) and (+1,+1), plotted as the corners of a square in the (x1, x2) plane.]
The design points in a two-level factorial design are always the corners of a hy-
percube; for m = 3 factors:
[Figure: the 2³ factorial design points at the eight corners of a cube in (x1, x2, x3), from (−1,−1,−1) to (+1,+1,+1).]
Advantages:
• vary all factors simultaneously, i.e. include points like (+1,+1,+1) which
would not be included in a one factor at a time experiment;
• allows estimation of interactions;
• more efficient for estimation of main effects than one factor at a time
– all observations are used in calculation of each factorial effect;
• better coverage of design space.
Disadvantage:
• the number of runs, 2^m, grows rapidly with the number of factors m; for example, the 2⁴ design listed below already requires 16 runs for a single replicate.
x1 x2 x3 x4
-1 -1 -1 -1
-1 -1 -1 +1
-1 -1 +1 -1
-1 -1 +1 +1
-1 +1 -1 -1
-1 +1 -1 +1
-1 +1 +1 -1
-1 +1 +1 +1
+1 -1 -1 -1
+1 -1 -1 +1
+1 -1 +1 -1
+1 -1 +1 +1
+1 +1 -1 -1
+1 +1 -1 +1
+1 +1 +1 -1
+1 +1 +1 +1
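Such designs are easy to generate programmatically; a minimal Python sketch, which should reproduce the 16 runs listed above:

```python
from itertools import product

def two_level_design(m):
    """All 2^m runs of a full two-level factorial, in coded units -1/+1."""
    return [list(run) for run in product([-1, +1], repeat=m)]

for run in two_level_design(4):
    print(run)
```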
$$Y_{ij} = \beta_0 + \sum_{l=1}^{m}\beta_l x_{il}
 + \sum_{k=1}^{m-1}\sum_{l=k+1}^{m}\beta_{kl}x_{ik}x_{il}
 + \sum_{k=1}^{m-2}\sum_{l=k+1}^{m-1}\sum_{q=l+1}^{m}\beta_{klq}x_{ik}x_{il}x_{iq}
 + \cdots + \varepsilon_{ij}, \tag{7.3}$$
for i = 1, . . . , 2^m, j = 1, . . . , r, with
$$x_{ik} = \begin{cases} -1 & \text{if the $k$th factor is set to its low level in run $i$,}\\ +1 & \text{if the $k$th factor is set to its high level in run $i$.}\end{cases}$$
In matrix form:
Y = Xβ + ε ,
where X is the N × p model matrix. The least squares estimator satisfies the normal equations
$$X^TX\hat\beta = X^TY.$$
For a two-level factorial design with the ±1 coding,
$$X^TX = NI,$$
so that
$$\hat\beta = \frac{1}{N}X^TY.$$
That is, all regression parameters are estimated independently and there is no need
to make adjustments for other terms in the model; fitting submodels of (7.3) does
not change the parameter estimates.
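A sketch of this in Python (numpy only; the model matrix is built here for illustration, and y is a hypothetical vector of responses):

```python
import numpy as np
from itertools import combinations, product

def factorial_model_matrix(m):
    """Model matrix of a full 2^m factorial: an intercept column, the m
    main-effect columns and all interaction columns (element-wise
    products of the -1/+1 factor columns)."""
    runs = np.array(list(product([-1, 1], repeat=m)))
    cols = [np.ones(len(runs))]
    for order in range(1, m + 1):
        for idx in combinations(range(m), order):
            cols.append(np.prod(runs[:, idx], axis=1))
    return np.column_stack(cols)

X = factorial_model_matrix(4)             # 16 x 16; X.T @ X equals 16 * I
N = X.shape[0]

rng = np.random.default_rng(0)
y = rng.normal(loc=90, scale=2, size=N)   # hypothetical responses

beta_hat = X.T @ y / N                    # each estimate obtained independently
```

Because the columns of X are orthogonal, dropping columns (fitting a submodel) leaves the remaining entries of beta_hat unchanged.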
For the example, a 2⁴ experiment with factors temperature (x1), time (x2), concentration (x3) and reagent (x4) and N = 16 runs, X^TX = 16I and
$$\hat\beta = \frac{1}{16}X^TY = \frac{1}{16}\begin{pmatrix} 1 & 1 & \cdots & 1\\ -1 & -1 & \cdots & 1\\ \vdots & \vdots & & \vdots \end{pmatrix}\begin{pmatrix}Y_1\\ \vdots\\ Y_{16}\end{pmatrix},$$
giving the estimates

β̂0                                                  89.94
β̂1     (temp, x1)                                    4.06
β̂2     (time, x2)                                    1.28
β̂3     (conc., x3)                                  −1.11
β̂4     (reagent, x4)                                 1.54
β̂12    (time×temp, x1 x2)                           −1.18
β̂13                                                  1.18
β̂14                                                 −1.39
β̂23                                                  0.22
β̂24                                                 −0.32
β̂34                                                  0.25
β̂123   (temp×time×conc., x1 x2 x3)                   0.123
β̂124                                                 0.10
β̂134                                                −0.02
β̂234                                                −0.12
β̂1234  (temp×time×conc.×reagent, x1 x2 x3 x4)        0.10
The estimated factorial effects are given by 2β̂; e.g. ME(x1) = 8.12 and Int(x1, x2) = −2.36.
Source          df              SS
Regression      2^m − 1         β̂^T X^T X β̂ − N Ȳ²
  x1            1               N β̂1²     (*)
  x2            1               N β̂2²
  ⋮             ⋮               ⋮
  x4            1               N β̂4²
  x1x2          1               N β̂12²
  ⋮             ⋮               ⋮
  x1x2x3x4      1               N β̂1234²
Residual        2^m(r − 1)      (Y − Xβ̂)^T(Y − Xβ̂)
Total           2^m r − 1
where (*) is obtained from the submodel containing only the intercept and x1, with model matrix
$$X_1 = \begin{pmatrix} 1 & -1\\ \vdots & \vdots\\ 1 & -1\\ 1 & +1\\ \vdots & \vdots\\ 1 & +1 \end{pmatrix}
\qquad\text{and}\qquad
\hat\beta_1 = \begin{pmatrix}\hat\beta_0\\ \hat\beta_1\end{pmatrix}.$$
Hence,
$$X_1^TX_1 = \begin{pmatrix} N & 0\\ 0 & N\end{pmatrix} = NI,$$
and the regression sum of squares for this submodel, adjusted for the mean, is
$$\hat\beta_1^TX_1^TX_1\hat\beta_1 - N\bar Y^2 = N(\hat\beta_0^2 + \hat\beta_1^2) - N\bar Y^2 = N\hat\beta_1^2,$$
since β̂0 = Ȳ.
Other sums of squares are similar; as they are all independent, it does not matter
in which order we compare the models.
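To make this concrete, here is a small numerical check (a 2² factorial with a single replicate and hypothetical responses) that the one-degree-of-freedom sums of squares N β̂_j² add up to the regression sum of squares:

```python
import numpy as np

# 2^2 factorial, r = 1: columns are intercept, x1, x2, x1*x2.
X = np.array([[1, -1, -1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1,  1,  1,  1]])
y = np.array([41.2, 43.9, 44.5, 48.1])        # hypothetical responses
N = len(y)

beta_hat = X.T @ y / N                        # since X^T X = N I
ss_effects = N * beta_hat[1:] ** 2            # one-d.f. SS for x1, x2, x1x2
ss_regression = beta_hat @ (X.T @ X) @ beta_hat - N * y.mean() ** 2
print(np.isclose(ss_regression, ss_effects.sum()))   # True: the SS decompose exactly
```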
Chapter 8

Generalised Linear Models

The density (1) can be thought of as a likelihood resulting from a single observation y. Then
$$\log f_Y(y;\theta,\phi) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)$$
$$\Rightarrow\quad u(\theta) = \frac{\partial}{\partial\theta}\log f_Y(y;\theta,\phi) = \frac{y - \frac{\partial}{\partial\theta}b(\theta)}{a(\phi)} = \frac{y - b'(\theta)}{a(\phi)}$$
$$\Rightarrow\quad I(\theta) = E[-H(\theta)] = \frac{b''(\theta)}{a(\phi)},$$
where H(θ) is the Hessian and I(θ) is the Fisher information matrix. From the
properties of the score function we know that E[U (θ)] = 0. Therefore
$$E\!\left[\frac{Y - b'(\theta)}{a(\phi)}\right] = 0 \quad\Rightarrow\quad E[Y] = b'(\theta).$$
Furthermore,
$$\operatorname{Var}[U(\theta)] = \operatorname{Var}\!\left[\frac{Y - b'(\theta)}{a(\phi)}\right] = \frac{\operatorname{Var}[Y]}{a(\phi)^2},$$
as b'(θ) and a(φ) are constants (not random variables). Now, we also know that Var[U(θ)] = I(θ). Therefore
$$\frac{\operatorname{Var}[Y]}{a(\phi)^2} = \frac{b''(\theta)}{a(\phi)} \quad\Rightarrow\quad \operatorname{Var}[Y] = a(\phi)\,b''(\theta),$$
and hence the mean and variance of a random variable with probability density function (or probability function) of the form (1) are b'(θ) and a(φ)b''(θ) respectively.
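These two identities can be checked symbolically; a small sympy sketch (assuming sympy is available), using the normal-distribution choices θ = µ, b(θ) = θ²/2 and a(φ) = σ² that appear in the example below:

```python
import sympy as sp

theta = sp.symbols("theta")                   # canonical parameter (here theta = mu)
sigma2 = sp.symbols("sigma2", positive=True)  # a(phi) = sigma^2

b = theta**2 / 2                              # b(theta) for the normal distribution
a = sigma2

mean = sp.diff(b, theta)                      # E[Y] = b'(theta)
variance = a * sp.diff(b, theta, 2)           # Var[Y] = a(phi) * b''(theta)
print(mean, variance)                         # prints: theta sigma2, i.e. mu and sigma^2
```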
For the normal distribution N(µ, σ²), we have θ = µ, b(θ) = θ²/2, a(φ) = σ² and
$$c(y,\phi) = -\frac{1}{2}\left(\frac{y^2}{a(\phi)} + \log\!\big(2\pi a(\phi)\big)\right).$$
Therefore
$$E(Y) = b'(\theta) = \theta = \mu, \qquad \operatorname{Var}(Y) = a(\phi)b''(\theta) = \sigma^2, \qquad V(\mu) = 1.$$
For the Poisson distribution with mean λ, the probability function is in the form (1), with θ = log λ, b(θ) = exp θ, a(φ) = 1 and c(y, φ) = −log y!. Therefore
$$E(Y) = b'(\theta) = e^\theta = \lambda, \qquad \operatorname{Var}(Y) = a(\phi)b''(\theta) = e^\theta = \lambda, \qquad V(\mu) = \mu.$$
For the Bernoulli distribution with success probability p, the probability function is in the form (1), with θ = log(p/(1 − p)), b(θ) = log(1 + exp θ), a(φ) = 1 and c(y, φ) = 0. Therefore
$$E(Y) = b'(\theta) = \frac{e^\theta}{1+e^\theta} = p, \qquad \operatorname{Var}(Y) = a(\phi)b''(\theta) = \frac{e^\theta}{(1+e^\theta)^2} = p(1-p), \qquad V(\mu) = \mu(1-\mu).$$
For the scaled binomial response Y = S/n, where S ∼ Binomial(n, p), the probability function is in the form (1), with θ = log(p/(1 − p)), b(θ) = log(1 + exp θ), a(φ) = 1/n and c(y, φ) = log\binom{n}{ny}. Therefore
$$E(Y) = b'(\theta) = \frac{e^\theta}{1+e^\theta} = p, \qquad \operatorname{Var}(Y) = a(\phi)b''(\theta) = \frac{1}{n}\,\frac{e^\theta}{(1+e^\theta)^2} = \frac{p(1-p)}{n}, \qquad V(\mu) = \mu(1-\mu).$$
Here, we can write a(φ) ≡ σ²/w, where the scale parameter σ² = 1 and the weight w is n, the binomial denominator.
η = xβ. (4)
Again, we call the n × p matrix x the design matrix. The ith row of x is xTi , the
explanatory data corresponding to the ith observation of the response. The jth
column of x contains the n observations of the jth explanatory variable.
For specifying the pattern of dependence of the response variable on the explana-
tory variables, the canonical parameters θ1 , . . . , θn in (2) are not of direct interest.
Furthermore, we have already specified that the distribution of Yi should depend
on xi through the linear predictor ηi . It is the parameters β1 , . . . , βp of the linear
predictor which are of primary interest.
The link between the distribution of Y and the linear predictor η is provided by
the link function g,
ηi = g(µi ) i = 1, . . . , n,
where µi ≡ E(Yi), i = 1, . . . , n. Hence, the dependence of the distribution of the response on the explanatory variables is established as
$$g(\mu_i) = \eta_i = x_i^T\beta, \qquad i = 1,\dots,n.$$
Recall that for a random variable Y with a distribution from the exponential family, E(Y) = b'(θ). Hence, for a generalised linear model
$$\mu_i = E(Y_i) = b'(\theta_i), \qquad i = 1,\dots,n.$$
Therefore
$$\theta_i = b'^{-1}(\mu_i), \qquad i = 1,\dots,n,$$
and as g(µi) = ηi = x_i^T β, then
$$\theta_i = b'^{-1}\!\left(g^{-1}[x_i^T\beta]\right), \qquad i = 1,\dots,n. \tag{5}$$
Hence, we can express the joint density (2) in terms of the coefficients β, and for
observed data y, this is the likelihood fY (y; β, φ) for β. As β is our parameter
of real interest (describing the dependence of the response on the explanatory
variables) this likelihood will play a crucial role.
Note that considerable simplification is obtained in (5) if the functions g and b'^{-1} are identical. Then
$$\theta_i = x_i^T\beta, \qquad i = 1,\dots,n,$$
and the resulting likelihood is
$$f_Y(y;\beta,\phi) = \exp\left(\sum_{i=1}^{n}\frac{y_i x_i^T\beta - b(x_i^T\beta)}{a(\phi_i)} + \sum_{i=1}^{n}c(y_i,\phi_i)\right).$$
The parameters in a generalised linear model are estimated using maximum like-
lihood. However, in most cases the maximum cannot be obtained algebraically
and we have to resort to numerical optimisation methods - these are beyond the
scope of this course.
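For illustration only, a minimal sketch of such a fit in Python using statsmodels (simulated Bernoulli responses with a logit link; statsmodels' default routine performs the numerical maximisation by iteratively reweighted least squares):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)                     # hypothetical explanatory variable
eta = -0.5 + 1.2 * x                         # linear predictor
p = 1 / (1 + np.exp(-eta))                   # inverse logit link
y = rng.binomial(1, p)                       # Bernoulli responses

X = sm.add_constant(x)                       # design matrix with columns (1, x_i)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)                            # maximum likelihood estimates of beta
```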
Clearly the normal linear model is also a generalised linear model. We assume
Y1 , . . . , Yn are independent normally distributed random variables. The normal
distribution is a member of the exponential family.
Furthermore, the explanatory variables enter a linear model through the linear
predictor
ηi = xTi β i = 1, . . . , n.
Finally, the link between E(Y) = µ and the linear predictor η is through the
(canonical) identity link function
µi = ηi i = 1, . . . , n.
8.3 Example: Binary Regression

In binary regression the data follow either the Bernoulli or, equivalently, the binomial distribution. The objective is to model the success probability p as a function of the covariates. Because p(x) is a probability, we can think of it as the cumulative distribution function (cdf) of a random variable evaluated at the linear predictor. For the logit link the
random variable follows the logistic distribution. But we can use the cdfs of other
random variables such as the standard normal and the log-Weibull distribution to
model the probability p(x). These choices still fall within the GLM framework, but with different link functions.
For the logit link we set
$$\theta = \log\frac{p(x)}{1 - p(x)} = x^T\beta = \eta.$$
This implies
$$p(x) = \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(-\eta)}.$$
This is the cdf of the logistic distribution, evaluated at the linear predictor η, which can take any value on the real line (−∞ < η < ∞). It is easily verified that F(η) = 1/(1 + exp(−η)) is a cdf, since it is non-negative and increases monotonically from 0 to 1. The logistic distribution behaves almost like the t-distribution with 8 degrees of freedom.
If we use the cdf of the standard normal distribution to model p(x), we obtain the probit link. For this link we set
$$p(x) = \Phi(\eta) = \Phi(x^T\beta),$$
where Φ(·) is the cdf of the standard normal distribution. The corresponding link function,
$$g(\mu) = g(p) = \Phi^{-1}(\mu) = \eta,$$
is called the probit link.
For the logit and probit link functions the cdfs are symmetric about 1/2; this is not the case for the complementary log-log link, which should therefore be used when asymmetry in p as a function of the linear predictor is suspected. The logistic distribution has heavier tails than the standard normal distribution, so the logit link is often used when outliers are suspected in the linear predictor.
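As a small numerical illustration of these points (a sketch, assuming numpy and scipy), the three inverse link functions can be compared on a grid of linear-predictor values; the logit and probit curves are symmetric about η = 0 (where p = 1/2), while the complementary log-log curve is not:

```python
import numpy as np
from scipy import stats

eta = np.linspace(-4, 4, 9)                  # grid of linear predictor values

logit = 1 / (1 + np.exp(-eta))               # logistic cdf
probit = stats.norm.cdf(eta)                 # standard normal cdf
cloglog = 1 - np.exp(-np.exp(eta))           # inverse complementary log-log link

for e, l, pr, c in zip(eta, logit, probit, cloglog):
    print(f"{e:5.1f}  logit={l:6.3f}  probit={pr:6.3f}  cloglog={c:6.3f}")
```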