
5CCM242A / 6CCM242B

Statistical Modelling

Vasiliki Koutra1
Department of Mathematics
King’s College London

January 2020

1 Originally devised by Steven Gilmour


Contents

1 Introduction to Statistical Modelling
  1.1 General Concepts
  1.2 Simple Linear Regression Model
  1.3 Matrix Approach to Simple Linear Regression
  1.4 Multiple Linear Regression Model
      1.4.1 Vectors of random variables

2 Estimation
  2.1 Least Squares Estimation in Simple Linear Regression
  2.2 Properties of the Estimators
  2.3 Estimation in Matrix Form
      2.3.1 Some specific examples
  2.4 Least Squares Estimation in General Linear Model
      2.4.1 Estimation of β — the normal equations
      2.4.2 Properties of the least squares estimator
  2.5 The Gauss-Markov Theorem

3 Inference
  3.1 Assessing the Simple Linear Regression Model
      3.1.1 Analysis of Variance Table
      3.1.2 F test
      3.1.3 Estimating σ²
      3.1.4 Example
  3.2 Inference about the regression parameters
      3.2.1 Inference about β1
      3.2.2 Inference about E(Y | X = xi)
  3.3 Inference in the Multiple Linear Regression Model
      3.3.1 Properties of the least squares estimator
  3.4 Analysis of Variance
  3.5 Inferences about the parameters
  3.6 Confidence interval for µ
  3.7 Sampling distribution of MS_E (S²)
  3.8 Sampling distribution of MS_R and F

4 Model Checking
  4.1 Residuals in Simple Linear Regression
      4.1.1 Crude Residuals
      4.1.2 Standardized/Studentized Residuals
      4.1.3 Residual plots
  4.2 Further Model Checking
      4.2.1 Outliers and influential observations
      4.2.2 Lack of Fit Test
      4.2.3 Matrix form of the model
  4.3 Model checking in multiple regression
      4.3.1 Standardised residuals
      4.3.2 Lack of fit and pure error
      4.3.3 Leverage and influence
      4.3.4 Prediction Error Sum of Squares (PRESS)
  4.4 Problems with fitting regression models
      4.4.1 Near-singular and ill-conditioned X^T X
      4.4.2 Variance inflation factor (VIF)

5 Model Selection
  5.1 Transformation of the response
  5.2 Model Building
      5.2.1 F-test for the deletion of a subset of variables
      5.2.2 All subsets regression

6 Interpretation of Fitted Models
  6.1 Prediction
      6.1.1 Prediction Interval for a new observation in simple linear regression
      6.1.2 Predicting a new observation in general regression
  6.2 Polynomial regression

7 Qualitative Explanatory Variables
  7.1 Simple Comparative Experiments
      7.1.1 ANOVA
  7.2 Factorial Experiments
      7.2.1 Main Effects and Interactions
      7.2.2 Factorial Experiments
      7.2.3 Regression Modelling for Factorial Experiments
      7.2.4 Analysis of Variance

8 Generalised Linear Models
  8.1 The Exponential family
  8.2 Components of a generalised linear model
      8.2.1 The random component
      8.2.2 The systematic (or structural) component
      8.2.3 The link function
      8.2.4 The linear model
  8.3 Example: Binary Regression
Chapter 1

Introduction to Statistical Modelling

1.1 General Concepts

What do we mean by statistical modelling and a statistical model?

Think back to the Probability & Statistics I and II modules. There were statements
like: “Y1 , Y2 ,. . .,Yn are independent and identically distributed normal random
variables with mean µ and variance σ 2 ”. Another way of writing this is

Y i = µ + εi , i = 1, 2, . . . , n,

where εi ∼ N (0, σ 2 ) and are independent. We wanted to estimate µ, which can


be done by using Ȳ , or to test a hypothesis such as H0 : µ = µ0 .

This statistical model has two components, a part which tells us about the expec-
tation of Y , which is constant, and a random part.

In this course we are interested in models where the mean depends on values
of other variables. In the simplest case, we have a response variable Y and one
explanatory variable X. Then the expectation of Y depends on the value of X,
say xi , and we may write

Y i = µ i + εi , i = 1, 2, . . . , n,

where µi = β0 + β1 xi and β0 and β1 are some unknown constant parameters.

In practice, we start with a real life problem for which we have some data. We
think of a statistical model as a mathematical representation of the variables we


x (batch size) y (man-hours)


30 73
20 50
60 128
80 170
40 87
50 108
60 135
30 69
70 148
60 132

Table 1.1: Data on batch size and time to make each batch

have measured. This model usually involves some parameters. We may then
try to estimate the values of these parameters or to test hypotheses about them.
We may wish to use the model to predict what would happen in the future in a
similar situation. In order to test hypotheses or to make predictions we usually
have to make some assumptions. Part of the modelling process is to test these
assumptions. Having found an adequate model we must compare its predictions
with reality to check that it gives reasonable answers.

We can illustrate these ideas using a simple example.


Example 1.1. Suppose that we are interested in some items, widgets say, which
are manufactured in batches. The size of the batch and the time to make the batch
in man hours are recorded, see Table 1.1.

We begin by plotting the data to see what sort of relationship might hold.

From this plot, Figure 1.1, it seems that a straight line relationship is a good repre-
sentation of the data although it is not an exact relationship. We can fit this model
and obtain the fitted line plot, Figure 1.2.

The fitted line is yb = 10 + 2x. One interpretation of this is that on average it takes
10 hours to set up the machinery to make widgets and then, on average, it takes 2
extra man-hours to produce a batch increased in size by one widget.
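To make the calculation concrete, here is a minimal sketch (not part of the original notes, which rely on statistical package output) that reproduces the fitted line from the Table 1.1 data using the least squares formulas derived later in Chapter 2; the variable names are ours.

import numpy as np

# Batch sizes and man-hours from Table 1.1
x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)

# Least squares slope and intercept
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)
b1 = Sxy / Sxx                   # slope: 2.0 extra man-hours per widget
b0 = y.mean() - b1 * x.mean()    # intercept: 10.0 set-up hours
print(b0, b1)                    # 10.0 2.0, i.e. the fitted line y-hat = 10 + 2x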

But before we come to this conclusion we should check that our data satisfy the
assumptions of the statistical model. One way to do this is to look at residual plots,
as in Figure 1.3. We shall discuss these later in the course and in the practicals but
here we see that there is no apparent reason to doubt our model. In fact, for small

Figure 1.1: Scatterplot of time versus batch size.

Figure 1.2: Fitted line plot of time versus batch size.



Figure 1.3: Residual plots.

data sets histograms do not represent the distribution well. It is better to examine
the Normal Probability Plot.

Statistical modelling is iterative. We think of a model we believe will fit the data.
We fit it and then check the model. If it is ok we use the model to explain what is
happening or to predict what may happen. Note that we should be very wary of
making predictions far outside of the x values which are used to fit the model.

In general different techniques are needed depending on whether the explanatory


variables are qualitative or quantitative and the response variable is continuous or
discrete.

We will mostly study continuous Y with quantitative X1 , X2 , . . . , Xp and con-


tinuous Y with qualitative X1 , X2 , . . . , Xp . Later in the course we study both
models with a mixture of quantitative and qualitative explanatory variables and
also discrete Y where we no longer assume errors are normally distributed.

Initially, we will use Linear Models.

In Time Series Analysis, we relax the assumption that errors are uncorrelated.

1.2 Simple Linear Regression Model

We start with the simplest situation where we have one response variable Y and
one explanatory variable X.

In many practical situations we deal with an explanatory variable X that can be


controlled and a response variable Y which can be observed. We want to estimate
or to predict the mean value of Y for given values of X working from a sample of
n pairs of observations

{(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}.

Example 1.2. Sparrow’s wings.


An ornithologist is interested in the relationship of the wing length and age of
sparrows. Data were collected from 13 sparrows of known age, as follows.

xi [days] yi [cm]
x1 = 3 y1 = 1.4
x2 = 3 y2 = 1.5
x3 = 5 y3 = 2.2
x4 = 6 y4 = 2.4
x5 = 8 y5 = 2.8
x6 = 8 y6 = 3.2
x7 = 10 y7 = 3.2
x8 = 11 y8 = 3.9
x9 = 12 y9 = 4.1
x10 = 13 y10 = 4.7
x11 = 14 y11 = 4.5
x12 = 15 y12 = 5.2
x13 = 16 y13 = 5.0

Readings of wing’s length may vary for different birds of the same age. Time, X,
is known and we are not interested in modelling it. We condition on time and assume
that Y is random, so that repeated observations of Y for the same values of X
may vary.

A useful initial stage of modelling is to plot the data. Figure 1.4 shows the plot of
the sparrow wing’s length against sparrow’s age.

The plot suggests that the wing length and age might be linearly related, although
we would not expect the wing’s length to keep increasing linearly over a long

Figure 1.4: Plot of the length of sparrow wings against age of sparrows.

period of time. In this example the linear relationship can be considered for some
short growth time only.

Other types of function could also describe the relationship well, for example a
quadratic polynomial with a very small second order coefficient. However, it is
better to use the simplest model which describes the relationship well. This is
called the principle of parsimony.

What does it mean “to describe the relationship well”?

It means to represent well the expected shape and also the variability of the re-
sponse Y at each value of the explanatory variable X. We will be working on this
problem throughout the course.

We can write

Yi = E(Y |X = xi ) + εi , where εi is a random variable, i = 1, 2, . . . , n.

Hence, if the expected relationship is linear, we have

Yi = β0 + β1 xi + εi , where i = 1, 2, . . . , n.

We call εi a random error. Standard assumptions about the error are

1. E(εi ) = 0 for all i = 1, 2, . . . , n,

2. var(εi ) = σ 2 for all i = 1, 2, . . . , n,


3. cov(εi, εj) = 0 for all i, j = 1, 2, . . . , n, i ≠ j.

The errors are often called departures from the mean. The error εi is a random
variable, hence Yi is a random variable too and the assumptions can be rewritten
as

1. E(Y |X = xi ) = µi = β0 + β1 xi for all i = 1, . . . , n,


2. var(Y |X = xi ) = σ 2 for all i = 1, . . . , n,
3. cov(Y |X = xi , Y |X = xj ) = 0 for all i, j = 1, . . . , n, i 6= j.

This means that the dependence of Y on X is linear and the variance of the re-
sponse Y at each value of X is constant (does not depend on xi ) and Y |X = xi
and Y |X = xj are uncorrelated.

Also, it is often assumed that the conditional distribution of Y is normal. Then,


due to the assumption (3) on the covariances, the variables Yi are independent.
This is written as
Y |X = xi ∼ N (µi , σ 2 ).
ind
The graph in Figure 1.5 summarizes all the model assumptions.

Figure 1.5: Model Assumptions about the randomness of observations.

For simplicity of notation we define


Yi := Y |X = xi . (1.1)

Then the simple linear model can be written as


E(Yi ) = β0 + β1 xi ,
var(Yi ) = σ 2 .
If we assume normality, we have the so called Normal Simple Linear Regression
Model denoted in one of the equivalent ways:

• Yi ∼ N(µi, σ²) independently, where µi = β0 + β1 xi, i = 1, 2, . . . , n,

• Yi ∼ N(β0 + β1 xi, σ²) independently, i = 1, 2, . . . , n,

• Yi = β0 + β1 xi + εi, where the εi ∼ N(0, σ²) are i.i.d., i = 1, 2, . . . , n.

In all cases β0 and β1 are unknown constant parameters.

1.3 Matrix Approach to Simple Linear Regression

In this section we will briefly discuss a matrix representation of simple linear


regression models. A random sample of size n gives n equations. For the full
SLRM we have

Y1 = β0 + β1 x1 + ε1
Y2 = β0 + β1 x2 + ε2
.. ..
. .
Yn = β0 + β1 xn + εn

We can write this in matrix form as

Y = Xβ + ε, (1.2)

where Y is an (n×1) vector of response variables (random sample), X is an (n×


2) matrix called the design matrix, β is a (2 × 1) vector of unknown parameters
and ε is an (n × 1) vector of random errors. That is,

Y = (Y1, Y2, . . . , Yn)^T,   β = (β0, β1)^T,   ε = (ε1, ε2, . . . , εn)^T,

and X is the (n × 2) matrix whose ith row is (1, xi).

The assumptions about the random errors can be written

ε ∼ N_n(0, σ² I),

that is, the vector ε has an n-dimensional normal distribution with

E(ε) = (E(ε1), E(ε2), . . . , E(εn))^T = (0, 0, . . . , 0)^T = 0

and variance-covariance matrix

Var(ε) = ( var(ε1)      cov(ε1, ε2)  · · ·  cov(ε1, εn) )
         ( cov(ε2, ε1)  var(ε2)      · · ·  cov(ε2, εn) )
         (    ...          ...        ...      ...      )
         ( cov(εn, ε1)  cov(εn, ε2)  · · ·  var(εn)     )
       = σ² I,

since var(εi) = σ² and cov(εi, εj) = 0 for i ≠ j.

This formulation can be generalised to any number of parameters and is usually


called the Linear Model (in β). Many models can be written in this general form.
The dimensions of matrix X and of vector β depend on the number p of parame-
ters in the model and, respectively, they are n × p and p × 1. In the full SLRM we
have p = 2.

The null model (p = 1)

Yi = β0 + εi for i = 1, . . . , n

is equivalent to
Y = 1β0 + ε
where 1 is an (n × 1) vector of 1’s.

The no-intercept model (p = 1)

Yi = β1 xi + εi for i = 1, . . . , n
can be written in matrix notation with X = (x1, x2, . . . , xn)^T, an (n × 1) column vector, and β = (β1).

Quadratic regression (p = 3)

Yi = β0 + β1 xi + β2 xi² + εi   for i = 1, . . . , n

can be written in matrix notation with β = (β0, β1, β2)^T and X the (n × 3) matrix whose ith row is (1, xi, xi²).
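The following short NumPy sketch (an added illustration, not part of the notes) shows how the design matrices for the null, no-intercept, full simple linear regression and quadratic models would be assembled in practice; the x values are arbitrary.

import numpy as np

x = np.array([3., 3., 5., 6., 8.])                         # any explanatory values

X_null      = np.ones((len(x), 1))                         # null model: column of 1's
X_no_int    = x.reshape(-1, 1)                             # no-intercept model
X_full      = np.column_stack([np.ones_like(x), x])        # full SLRM: rows (1, x_i)
X_quadratic = np.column_stack([np.ones_like(x), x, x**2])  # quadratic: rows (1, x_i, x_i^2)

print(X_quadratic.shape)    # (n, p) = (5, 3)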


1.4 Multiple Linear Regression Model

A fitted linear regression model always leaves some residual variation. There
might be another systematic cause for the variability in the observations yi . If we
have data on other explanatory variables we can ask whether they can be used to
explain some of the residual variation in Y . If this is the case, we should take it
into account in the model, so that the errors are purely random. We could write
Yi = β0 + β1 xi + β2 zi + εi*,   where the terms β2 zi + εi* together play the role of what was previously εi.

Z is another explanatory variable. Usually, we denote all explanatory variables


(there may be more than two of them) using letter X with an index to distinguish
between them, i.e., X1 , X2 , . . . , Xp−1 .

A Multiple Linear Regression (MLR) model for a response variable Y and ex-
planatory variables X1 , X2 , . . . , Xp−1 is
E(Y |X1 = x1,i , . . . , Xp−1 = xp−1,i ) = β0 + β1 x1,i + . . . + βp−1 xp−1,i
var(Y |X1 = x1,i , . . . , Xp−1 = xp−1,i ) = σ 2 , i = 1, . . . , n
cov(Y |X1 = x1,i , . . . , Xp−1 = xp−1,i , Y |X1 = x1,j , . . . , Xp−1 = xp−1,j ) = 0, i ≠ j

As in the SLR model we denote


Yi = (Y | X1 = x1,i , . . . , Xp−1 = xp−1,i)

and we usually omit the condition on the Xs and write

µi = E(Yi) = β0 + β1 x1,i + . . . + βp−1 xp−1,i
var(Yi) = σ², i = 1, . . . , n
cov(Yi, Yj) = 0, i ≠ j

or

Yi = β0 + β1 x1,i + . . . + βp−1 xp−1,i + εi
E(εi) = 0
var(εi) = σ², i = 1, . . . , n
cov(εi, εj) = 0, i ≠ j

For testing we need the assumption of Normality, i.e., we assume that

Yi ∼ N(µi, σ²) independently,

or

εi ∼ N(0, σ²) independently.
To simplify the notation we write the MLR model in a matrix form
Y = Xβ + ε,     (1.3)

that is, Y = (Y1, . . . , Yn)^T, β = (β0, β1, . . . , βp−1)^T, ε = (ε1, . . . , εn)^T and X is the (n × p) matrix whose ith row is (1, x1,i, . . . , xp−1,i).

Here Y is the vector of responses, X is often called the design matrix, β is the
vector of unknown, constant parameters and ε is the vector of random errors.

εi are independent and identically distributed, that is


ε ∼ N_n(0_n, σ² I_n).

Note that the properties of the errors give

Y ∼ N_n(Xβ, σ² I_n).

1.4.1 Vectors of random variables

Vectors y and ε in equation (1.3) are random vectors as their elements are random
variables. Below we show some properties of random vectors.

Definition 1.1. The expected value of a random vector is the vector of the respec-
tive expected values. That is, for a random vector z = (z1, . . . , zn)^T we write

E(z) = (E(z1), E(z2), . . . , E(zn))^T.     (1.4)

We have analogous properties of the expectation for random vectors as for single
random variables. Namely, for a random vector z, a constant scalar a, a constant
vector b and for matrices of constants A and B we have
E(az + b) = a E(z) + b,
E(Az) = A E(z),     (1.5)
E(z^T B) = E(z)^T B.

Variances and covariances of the random variables zi are put together to form the
so called variance-covariance (dispersion) matrix,

Var(z) = ( var(z1)      cov(z1, z2)  · · ·  cov(z1, zn) )
         ( cov(z2, z1)  var(z2)      · · ·  cov(z2, zn) )
         (    ...          ...        ...      ...      )
         ( cov(zn, z1)     · · ·            var(zn)     )     (1.6)

The dispersion matrix has the following properties.

(a) The matrix Var(z) is symmetric since cov(zi , zj ) = cov(zj , zi ).

(b) For mutually uncorrelated random variables the matrix is diagonal, since
cov(zi , zj ) = 0 for all i 6= j.

(c) The var-cov matrix can be expressed as

Var(z) = E[(z − E(z))(z − E(z))T ]



(d) The dispersion matrix of a transformed variable u = Az is

Var(u) = A Var(z)AT

Proof. Denote by µ = (µ1 , . . . , µn )T = (E(z1 ), . . . , E(zn ))T .


To see (c) write
  
 z1 − µ1
 

T
E[(z − µ)(z − µ) ] = E  .
.  z1 − µ1 , . . . , zn − µn
 
 . 
 z −µ 
n n
 
E(z1 − µ1 )2 E[(z1 − µ1 )(z2 − µ2 )] · · · E[(z1 − µ1 )(zn − µn )]
 E(z2 − µ2 )2 · · · E[(z2 − µ2 )(zn − µn )] 
=
 
.. .. 
 . . 
E[(zn − µn )(z1 − µ1 )] ··· E(zn − µn )2
= Var(z).

To show (d) we can use the notation of (c),

Var(u) = E[(u − E(u))(u − E(u))T ]


= E[(Az − Aµ)(Az − Aµ)T ]
= E[A(z − µ)(z − µ)T AT ]
= A E[(z − µ)(z − µ)T ]AT
= A Var(z)AT .


Note that the property (c) gives the expression for the dispersion matrix of a ran-
dom vector analogous to the expression for the variance of a single rv, that is

Var(z) = E(zz T ) − µµT . (1.7)

Multivariate Normal Distribution

A random vector z has a multivariate normal distribution if its p.d.f. can be written as

f(z) = 1 / ( (2π)^{n/2} √det(V) ) · exp{ −(1/2)(z − µ)^T V^{-1}(z − µ) },     (1.8)

where µ is the mean and V is the variance-covariance matrix of z.

Figure 1.6: Bivariate Normal pdf

In the model
Y = Xβ + ε
we assume the following properties of the random errors:

1. E[ε] = (E(ε1), . . . , E(εn))^T = (0, . . . , 0)^T = 0_n

2. Var[ε] = σ² I_n, i.e. the n × n matrix with σ² on the diagonal and 0 elsewhere

3. the εi are independent and identically distributed, that is

   ε ∼ N_n(0_n, σ² I_n).

Note that the properties of the errors give

Y ∼ Nn (Xβ, σ 2 I n ).
Chapter 2

Estimation

2.1 Least Squares Estimation in Simple Linear Re-


gression

Estimation is a method of finding values of the unknown model parameters for a


given data set so that the model fits the data in a “best” way. There are various
estimation methods, depending on how we define “best”. In this section we con-
sider the Method of Least Squares Estimation (LS or LSE).

The LS estimators of the model parameters β0 and β1 minimize the sum of squares
of errors denoted by S(β0, β1). That is, the estimators minimize

S(β0, β1) = ∑_{i=1}^n εi² = ∑_{i=1}^n [Yi − (β0 + β1 xi)]².     (2.1)

The “best” here means the smallest value of S(β0 , β1 ). S is a function of the
parameters and so to find its minimum we differentiate it with respect to β0 and
β1, then equate the derivatives to zero. We have

∂S/∂β0 = −2 ∑_{i=1}^n [Yi − (β0 + β1 xi)]
∂S/∂β1 = −2 ∑_{i=1}^n [Yi − (β0 + β1 xi)] xi     (2.2)

Setting these derivatives equal to zero we obtain the so-called normal equations:

∑_{i=1}^n (β̂0 + β̂1 xi) = ∑_{i=1}^n Yi
∑_{i=1}^n (β̂0 + β̂1 xi) xi = ∑_{i=1}^n xi Yi     (2.3)

This set of equations can be written as

nβ̂0 + β̂1 ∑_{i=1}^n xi = ∑_{i=1}^n Yi
β̂0 ∑_{i=1}^n xi + β̂1 ∑_{i=1}^n xi² = ∑_{i=1}^n xi Yi     (2.4)

The solutions to these equations are

β̂0 = (1/n)∑_{i=1}^n Yi − β̂1 (1/n)∑_{i=1}^n xi = Ȳ − β̂1 x̄     (2.5)

and, from the second normal equation,

β̂1 = [ ∑_{i=1}^n xi Yi − (1/n)(∑_{i=1}^n xi)(∑_{i=1}^n Yi) ] / [ ∑_{i=1}^n xi² − (1/n)(∑_{i=1}^n xi)² ]
    = ∑_{i=1}^n (xi − x̄)(Yi − Ȳ) / ∑_{i=1}^n (xi − x̄)²
    = SxY / Sxx,     (2.6)

where

SxY = ∑_{i=1}^n (xi − x̄)(Yi − Ȳ),   Sxx = ∑_{i=1}^n (xi − x̄)².

To check that S(β0, β1) attains a minimum at (β̂0, β̂1) we calculate second derivatives and evaluate the determinant

| ∂²S/∂β0²      ∂²S/∂β0∂β1 |   | 2n              2∑_{i=1}^n xi  |
| ∂²S/∂β1∂β0    ∂²S/∂β1²   | = | 2∑_{i=1}^n xi   2∑_{i=1}^n xi² | = 4n ∑_{i=1}^n (xi − x̄)² > 0

for all β0, β1 (it does not depend on the values of the parameters).

Also, ∂²S/∂β0² > 0 (and ∂²S/∂β1² > 0) for all β0, β1. This means that the function S(β0, β1) attains a minimum at (β̂0, β̂1) given by (2.5) and (2.6).

Remark 2.1. Note that the estimators depend on Y . They are functions of Y
which is a random variable and so the estimators of the model parameters are
random variables too. When we calculate the values of the estimators for a given
data set, i.e. for observed values of Y at given values of X, we obtain estimates
of the parameters. We may obtain different estimates of β0 and β1 calculated for
different data sets fitted by the same kind of model. 
Example 2.1. (Wing’s length cont.)
For the data in Example 1.2 we obtain

∑_{i=1}^{13} yi = 44.1,   ∑_{i=1}^{13} xi = 124,   ∑_{i=1}^{13} xi yi = 488.3,   ∑_{i=1}^{13} xi² = 1418.

The estimates of the model parameters are

β̂1 = [ ∑ xi yi − (1/n)(∑ xi)(∑ yi) ] / [ ∑ xi² − (1/n)(∑ xi)² ]
    = (488.3 − (1/13) × 124 × 44.1) / (1418 − (1/13) × 124²)
    = 0.28761,

β̂0 = ȳ − β̂1 x̄ = (1/13) × 44.1 − 0.28761 × (1/13) × 124 = 0.649,

and the estimated (fitted) linear model is

ŷi = 0.649 + 0.288 xi.

From this fitted model we may calculate estimates of the wing's length of sparrows for any age within the age range covered by the data. For example, we may estimate the wing's length of sparrows of age 7 days (the missing value). It is

ŷ = 0.649 + 0.288 × 7 = 2.664 cm.
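As a check on the arithmetic, a small NumPy sketch (ours, not from the notes) reproduces these estimates directly from the raw data of Example 1.2.

import numpy as np

x = np.array([3, 3, 5, 6, 8, 8, 10, 11, 12, 13, 14, 15, 16], dtype=float)
y = np.array([1.4, 1.5, 2.2, 2.4, 2.8, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])
n = len(x)

b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = y.mean() - b1 * x.mean()
print(round(b1, 5), round(b0, 3))   # 0.28761 0.649
print(round(b0 + b1 * 7, 3))        # about 2.66 cm; the hand calculation above,
                                    # using rounded coefficients, gives 2.664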
Remark 2.2. Two special cases of the simple linear model are

• no-intercept model

Yi = β1 xi + εi ,
which implies that E(Y |X = 0) = 0, and

• constant model

Yi = β0 + εi ,

which implies that the response variable Y does not depend on the explana-
tory variable X. 

2.2 Properties of the Estimators

Definition 2.1. If θb is an estimator of θ and E[θ]


b = θ, then we say θb is unbiased
for θ.

Note that in this definition θb is a random variable. We must distinguish between


θb when it is an estimate and when it is an estimator. As a function of the random
variables Yi it is an estimator. Its value obtained for a given data set (observed yi )
is an estimate.

The parameter estimator β̂1 can be written as

β̂1 = ∑_{i=1}^n ci Yi,   where ci = (xi − x̄) / ∑_{i=1}^n (xi − x̄)² = (xi − x̄)/Sxx.     (2.7)

We have assumed that Y1 , Y2 , . . . , Yn are normally distributed and hence using


the result that a linear combination of normal random variables is also a normal
random variable, βb1 is also normally distributed. We now derive the mean and
variance of βb1 using the representation (2.7).
n
X
E[βb1 ] = E[ ci Y i ]
i=1
n
X
= ci E[Yi ]
i=1
n
X
= ci (β0 + β1 xi )
i=1
n
X n
X
= β0 ci + β1 ci x i
i=1 i=1
P P P
but ci = 0 and ci xi = 1 as (xi − x̄)xi = Sxx , so E[βb1 ] = β1 . Thus βb1 is
unbiased for β1 . Also
" n #
X
var[βb1 ] = var ci Y i
i=1
n
X
= c2i var[Yi ] since the Y ’s are independent
i=1
n
X
= σ 2 (xi − x̄)2 /[Sxx ]2
i=1
2
= σ /Sxx .

Hence,

β̂1 ∼ N(β1, σ²/Sxx).

Similarly it can be shown that

β̂0 ∼ N( β0, σ²(1/n + x̄²/Sxx) ).

2.3 Estimation in Matrix Form

The normal equations obtained in the least squares method are given by
X T Y = X T X β.
b

It follows that so long as X T X is invertible, i.e., its determinant is non-zero, the


unique solution to the normal equations is given by

β̂ = (X^T X)^{-1} X^T Y.

This is a common formula for all linear models where X^T X is invertible. For the full simple linear regression model we have

X^T Y = ( ∑ Yi, ∑ xi Yi )^T = ( nȲ, ∑ xi Yi )^T

and

X^T X = ( n      ∑ xi  )   ( n     nx̄    )
        ( ∑ xi   ∑ xi² ) = ( nx̄    ∑ xi² ).

The determinant of X^T X is given by

|X^T X| = n ∑ xi² − (nx̄)² = n( ∑ xi² − nx̄² ) = n Sxx.

Hence, the inverse of X^T X is

(X^T X)^{-1} = (1/(n Sxx)) (  ∑ xi²   −nx̄ )   (1/Sxx) ( (1/n)∑ xi²   −x̄ )
                           ( −nx̄       n  ) =         ( −x̄            1  ).

So the solution to the normal equations is given by

β̂ = (X^T X)^{-1} X^T Y
  = (1/Sxx) ( (1/n)∑ xi²   −x̄ ) ( nȲ       )
            ( −x̄            1  ) ( ∑ xi Yi  )
  = (1/Sxx) ( Ȳ ∑ xi² − x̄ ∑ xi Yi )
            ( ∑ xi Yi − nx̄Ȳ       )
  = (1/Sxx) ( Ȳ(∑ xi² − nx̄²) + nx̄²Ȳ − x̄ ∑ xi Yi )
            ( SxY                                 )
  = (1/Sxx) ( Ȳ(∑ xi² − nx̄²) − x̄(∑ xi Yi − nx̄Ȳ) )
            ( SxY                                 )
  = (1/Sxx) ( Ȳ Sxx − x̄ SxY )
            ( SxY            )
  = ( Ȳ − β̂1 x̄ )
    ( β̂1        ),
which is the same result as we obtained before. 
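The same estimates can be obtained numerically from the matrix formulation. The sketch below (an added illustration, assuming the sparrow data of Example 1.2) solves the normal equations X^T X β̂ = X^T Y directly rather than forming the inverse explicitly, which is the numerically preferable route.

import numpy as np

x = np.array([3, 3, 5, 6, 8, 8, 10, 11, 12, 13, 14, 15, 16], dtype=float)
y = np.array([1.4, 1.5, 2.2, 2.4, 2.8, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])

X = np.column_stack([np.ones_like(x), x])        # n x 2 design matrix

# Solve X^T X beta = X^T y (the normal equations)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)    # approximately [0.649, 0.288], matching Example 2.1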

Note:
Let A and B be a vector and a matrix of real constants and let Z be a vector of
random variables, all of appropriate dimensions so that the addition and multipli-
cation are possible. Then

E(A + BZ) = A + B E(Z)


Var(A + BZ) = Var(BZ) = B Var(Z)B T .

In particular,
E(Y ) = E(Xβ + ε) = Xβ
Var(Y ) = Var(Xβ + ε) = Var(ε) = σ 2 I.
These equalities let us prove the following theorem.

Theorem 2.1. The least squares estimator β̂ of β is unbiased and its variance-covariance matrix is

Var(β̂) = σ²(X^T X)^{-1}.

Proof. First we will show that β̂ is unbiased. Here we have

E(β̂) = E{(X^T X)^{-1} X^T Y} = (X^T X)^{-1} X^T E(Y) = (X^T X)^{-1} X^T Xβ = Iβ = β.

Now, we will show the result for the variance-covariance matrix.

Var(β̂) = Var{(X^T X)^{-1} X^T Y} = (X^T X)^{-1} X^T Var(Y) X (X^T X)^{-1}
       = σ²(X^T X)^{-1} X^T I X (X^T X)^{-1} = σ²(X^T X)^{-1}.

2.3.1 Some specific examples

1. The Null model

As we have seen, this can be written as

Y = Xβ0 + ε

where X = 1 is an (n × 1) vector of 1's. So X^T X = n and X^T Y = ∑ Yi, which gives

β̂ = (X^T X)^{-1} X^T Y = (1/n)∑ Yi = Ȳ = β̂0,

var[β̂] = (X^T X)^{-1} σ² = σ²/n.

2. The No-intercept model

We saw that this example fits the General Linear Model with X = (x1, x2, . . . , xn)^T and β = β1. So X^T X = ∑ xi² and X^T Y = ∑ xi Yi, and we can calculate

β̂ = (X^T X)^{-1} X^T Y = ∑ xi Yi / ∑ xi² = β̂1,

Var[β̂] = σ²(X^T X)^{-1} = σ² / ∑ xi².

2.4 Least Squares Estimation in General Linear Model

To derive the least squares estimator (LSE) for the parameter vector β we min-
imise the sum of squares of the errors, that is
n
X
S(β) = [Yi − {β0 + β1 x1,i + · · · + βp−1 xp−1,i }]2
i=1
X
= ε2i
= εT ε
= (Y − Xβ)T (Y − Xβ)
= (Y T − β T X T )(Y − Xβ)
= Y T Y − Y T Xβ − β T X T Y + β T X T Xβ
= Y T Y − 2β T X T Y + β T X T Xβ.

Theorem 2.2. The LSE βb of β is given by

βb = (X T X)−1 X T Y

if X T X is non-singular. If X T X is singular there is no unique LSE of β.



Proof. Let β 0 be any solution of X T Xβ = X T Y . Then, X T Xβ 0 = X T Y and

S(β) − S(β 0 )

= Y T Y − 2β T X T Y + β T X T Xβ − Y T Y + 2β T0 X T Y − β T0 X T Xβ 0

= −2β T X T Xβ 0 + β T X T Xβ + 2β T0 X T Xβ 0 − β T0 X T Xβ 0

= β T X T Xβ − 2β T X T Xβ 0 + β T0 X T Xβ 0

= β T X T Xβ − β T X T Xβ 0 − β T X T Xβ 0 + β T0 X T Xβ 0

= β T X T Xβ − β T X T Xβ 0 − β T0 X T Xβ + β T0 X T Xβ 0

= β T (X T Xβ − X T Xβ 0 ) − β T0 (X T Xβ − X T Xβ 0 )

= (β T − β T0 )(X T Xβ − X T Xβ 0 )

= (β T − β T0 )X T X(β − β 0 )

= {X(β − β 0 )}T {X(β − β 0 )} ≥ 0

since it is a sum of squares of elements of the vector X(β − β 0 ).

We have shown that S(β) − S(β 0 ) ≥ 0.

Hence, β 0 minimises S(β), i.e. any solution of X T Xβ = X T Y minimises


S(β).

If X T X is nonsingular the unique solution is βb = (X T X)−1 X T Y .

If X T X is singular there is no unique solution. 

Note that, as we did for the SLM in Chapter 2, it is possible to obtain this result
by differentiating S(β) with respect to β and setting it equal to 0.

2.4.1 Estimation of β — the normal equations

In this section we show an alternative approach to justify the least squares estima-
tor of β.

We can find an estimate of the vector β by least squares. Let


n
X
S = (yi − xTi β)2
i=1
= (y − Xβ)T (y − Xβ);

then we choose βb to be the value of β which minimises S.

Rather than solve this problem directly we consider a geometric representation of


the problem. We can represent y and Xβ (for any vector β) as vectors in <n . Let
C = {Xβ : β ∈ <p } which is a subspace of <n called the column space of X
(linear combinations of columns of X), with
dim(C) = rank(X) = p.
Consider the case when there are n = 3 observations and p − 1 = 1 explanatory
variables so that we can illustrate it in Figure 2.1.


Figure 2.1: Geometric illustration of least squares

~ . The vector OR
The data y is represented by OP ~ = Xβ is a typical vector in C.
~ 2
We are trying to minimise S = |RP | . To minimise the distance from R to P , we
take R = Q such that the angle OQP is a right angle, i.e. we take the orthogonal
projection of P onto C. Thus the vector QP~ = y − X βb is orthogonal to every
0
vector in C. Hence for any value of β
(Xβ 0 )T (y − X β)
b = 0
T
β 0 X T (y − X β)
b = 0

but since this is true for any value of β 0 it follows that

X T (y − X β)
b = 0.

Rearranging this equation we see that

X T y = X T X β.
b

This is a system of p equations in p unknowns, which are called the normal equa-
tions.

As rank(X) = p, it is possible to show that the rank(X T X) = p, so the (p × p)


matrix X T X is non-singular. It follows that the unique solution to the normal
equations is given by
b = (X T X)−1 X T y.
β

2.4.2 Properties of the least squares estimator

The following three theorems show the properties of the LSE of β, β.


b

Theorem 2.3. The LSE βb is an unbiased estimator of β.

Proof.

E[β]
b = E[(X T X)−1 X T Y ]
= (X T X)−1 X T E[Y ]
= (X T X)−1 X T Xβ
= β


b = σ 2 (X T X)−1 .
Theorem 2.4. Var[β]

Proof. We have that βb = Ay where A = (X T X)−1 X T . Using the result for


var(Ay) we have

b = (X T X)−1 X T var(y)X(X T X)−1


Var[β]
= σ 2 (X T X)−1 X T IX(X T X)−1
= σ 2 (X T X)−1 .

An alternative proof is as follows: First note that Var[Y ] = E[Y Y T ]−E[Y ] E[Y T ]
and hence

E[Y Y T ] = Var[Y ] + E[Y ] E[Y T ]


= σ 2 I + Xββ T X T .

Now

Var[β]
b
T
= b E[βbT ]
E[βbβb ] − E[β]
= E[(X T X)−1 X T Y Y T X(X T X)−1 ] − ββ T
= (X T X)−1 X T E[Y Y T ]X(X T X)−1 − ββ T
= (X T X)−1 X T (σ 2 I + Xββ T X T )X(X T X)−1 − ββ T
= σ 2 (X T X)−1 X T X(X T X)−1
+(X T X)−1 X T Xββ T X T X(X T X)−1 − ββ T
= σ 2 (X T X)−1 + ββ T − ββ T
= σ 2 (X T X)−1

Theorem 2.5. If
Y = Xβ + ε, ε ∼ Nn (0, σ 2 I),
then
βb ∼ N p (β, σ 2 (X T X)−1 ).

Proof. Each element of βb is a linear function of Y1 , . . . , Yn . We assume that


Yi , i = 1, . . . , n are normally distributed. Hence βb is also normally distributed.

The expectation and variance-covariance matrix can be shown in the same way as
in Theorem 2.7. 

2.5 The Gauss-Markov Theorem

A strong justification for the use of least squares estimation in linear models is
provided by the following famous theorem.

Theorem 2.6. Given the linear model

Y = Xβ + ε,

where E(ε) = 0 and Var(ε) = σ 2 I n the least squares estimator βb = (X T X)−1 X T Y


is such that lT βb is the minimum variance linear unbiased estimator of the es-
timable function lT β of the parameters β.

Note: We call such an estimator the Best Linear Unbiased Estimator (BLUE).
It is the estimator, that among all unbiased estimators of the form cT Y , has the
smallest variance.
Proof. lT βb is a linear combination of the random sample Y ,

lT βb = lT (X T X)−1 X T Y .

Let cT Y be another linear unbiased estimator of lT β. That is,

E(cT Y ) = cT E(Y ) = cT Xβ = lT β.

It means that lT = cT X so that cT Y is unbiased. Now,

var(cT Y ) = cT Var(Y )c = σ 2 cT Ic = σ 2 cT c.

Also,

var(lT β) b = σ 2 lT (X T X)−1 l
b = lT Var(β)l
= σ 2 cT X(X T X)−1 X T c = σ 2 cT Hc

Then

var(cT Y ) − var(lT β)
b = σ 2 (cT c − cT HC)
= σ 2 cT (I − H)c
= σ 2 cT (I − H)T (I − H)c
| {z } | {z }
=Z
T
=Z
2 T
= σ Z Z ≥ 0.

Hence var(cT Y ) ≥ var(lT β) and so lT βb is BLUE of lT β. 


Chapter 3

Inference

3.1 Assessing the Simple Linear Regression Model

3.1.1 Analysis of Variance Table

Parameter estimates obtained for the model

Yi = β0 + β1 xi + εi

can be used to estimate the mean response corresponding to each variable Yi , that
is,
[i ) = Ybi = βb0 + βb1 xi , i = 1, . . . , n.
E(Y
These, for a given data set (xi , yi ), are called fitted values and are denoted by ybi .
They are points on the fitted regression line corresponding to the values of xi .
The observed values yi usually do not fall exactly on the line and so are not equal
to the fitted values ybi , as shown in Figure 3.1.

The residuals (also called crude residuals) are defined as

ei := Yi − Ybi , i = 1, . . . , n, (3.1)

These are estimators of the random errors εi .

Thus

ei = Yi − (β̂0 + β̂1 xi) = Yi − Ȳ − β̂1(xi − x̄)

and ∑ ei = 0.

Figure 3.1: Observations and fitted line for the Sparrow wing's length data.

Also note that the estimators β̂0 and β̂1 minimize the function S(β0, β1). The minimum is called the Residual Sum of Squares and is denoted by SS_E, that is,

SS_E = ∑_{i=1}^n [Yi − (β̂0 + β̂1 xi)]² = ∑_{i=1}^n (Yi − Ŷi)² = ∑_{i=1}^n ei².     (3.2)

Consider the constant model

Yi = β0 + εi.

For this model β̂0 = Ȳ and we have

Ŷi = Ȳ,   ei = Yi − Ŷi = Yi − Ȳ

and

SS_E = SS_T = ∑_{i=1}^n (Yi − Ȳ)².

It is called the Total Sum of Squares and is denoted by SST . For a constant
model SSE = SST . When the model is non constant, i.e. it includes a slope, the

Figure 3.2: Observations, fitted line and the mean for a constant model.

difference Yi − Ȳ can be split into two components: one due to the regression
model fit and one due to the residuals, that is

Yi − Ȳ = (Yi − Ybi ) + (Ybi − Ȳ ).

For a given data set it could be represented as in Figure 3.2.

The following theorem gives such an identity for the respective sums of squares.

Theorem 3.1. Analysis of Variance Identity.


In the simple linear regression model the total sum of squares is a sum of the
regression sum of squares and the residual sum of squares, that is

SS_T = SS_R + SS_E,     (3.3)

where

SS_T = ∑_{i=1}^n (Yi − Ȳ)²,   SS_R = ∑_{i=1}^n (Ŷi − Ȳ)²,   SS_E = ∑_{i=1}^n (Yi − Ŷi)².

Proof.

SS_T = ∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n [(Yi − Ŷi) + (Ŷi − Ȳ)]²
     = ∑_{i=1}^n [(Yi − Ŷi)² + (Ŷi − Ȳ)² + 2(Yi − Ŷi)(Ŷi − Ȳ)]
     = SS_E + SS_R + 2A,

where

A = ∑_{i=1}^n (Yi − Ŷi)(Ŷi − Ȳ)
  = ∑_{i=1}^n (Yi − Ŷi)Ŷi − Ȳ ∑_{i=1}^n (Yi − Ŷi)
  = ∑_{i=1}^n ei Ŷi − Ȳ ∑_{i=1}^n ei          (the last sum is 0)
  = ∑_{i=1}^n ei (β̂0 + β̂1 xi)
  = β̂0 ∑_{i=1}^n ei + β̂1 ∑_{i=1}^n ei xi      (both sums are 0).

Hence A = 0.

For a given data set the model fit (regression) sum of squares, SSR , represents the
variability in the observations yi accounted for by the fitted model, the residual
sum of squares, SSE , represents the variability in yi accounted for by the differ-
ences between the observations and the fitted values.

The Analysis of Variance (ANOVA) Table shows the sources of variation, the
sums of squares and the statistic, based on the sums of squares, for testing the
significance of regression slope.

ANOVA table

Source of variation   d.f.          SS      MS                  VR
Regression            νR = 1        SS_R    MS_R = SS_R/νR      MS_R/MS_E
Residual              νE = n − 2    SS_E    MS_E = SS_E/νE
Total                 νT = n − 1    SS_T

The “d.f.” is short for “degrees of freedom”.

What are degrees of freedom?

For an intuitive explanation consider the observations y1 , y2 , . . . , yn and assume


that their sum is fixed, say equal to a, that is
y1 + y2 + . . . + yn = a.
For a fixed value of the sum a there are n − 1 arbitrary y-values but one y-value is
determined by the difference of a and the n − 1 arbitrary y values. This one value
is not free, it depends on the other y-values and on a. We say that there are n − 1
independent (free to vary) pieces of information and one piece is taken up by a.

Estimates of parameters can be based on different amounts of information. The


number of independent pieces of information that go into the estimate of a param-
eter is called the degrees of freedom. This is why in order to calculate
n
X
SST = (yi − ȳ)2
i=1

we have n − 1 free to vary pieces of information from the collected data, that is
we have n − 1 degrees of freedom. The one degree of freedom is taken up by ȳ.
Similarly, for
n
X n
X
2
SSE = (yi − ybi ) = (yi − βb0 − βb1 xi )2
i=1 i=1

we have two degrees of freedom taken up: one by βb0 and one by βb1 (both depend
on y1 , y2 , . . . , yn ). Hence, there are n − 2 independent pieces of information to
calculate SSE .

Finally, as SSR = SST − SSE we can calculate the d.f. for SSR as a difference
between d.f. for SST and for SSE , that is νR = (n − 1) − (n − 2) = 1.

The ANOVA table also includes the so-called Mean Squares (MS), which can be thought of as measures of average variation.

The last column of the table contains the Variance Ratio (VR),

VR = MS_R / MS_E.

It measures the variation explained by the model fit relative to the variation due to residuals.

3.1.2 F test

The mean squares are functions of the random variables Yi and so is their ratio.
We denote it by F. We will see later that, if β1 = 0, then

F = MS_R / MS_E ∼ F_{1,n−2}.

Thus, to test the null hypothesis

H0 : β1 = 0

versus the alternative

H1 : β1 ≠ 0,
we use the variance ratio F as the test statistic. Under H0 the ratio has F distri-
bution with 1 and n − 2 degrees of freedom.

We reject H0 at a significance level α if

Fcal > Fα;1,n−2 ,

where Fcal denotes the value of the variance ratio F calculated for a given data set
and Fα;1,n−2 is such that

P (F > Fα;1,n−2 ) = α.

There is no evidence to reject H0 if Fcal < Fα;1,n−2 .

Rejecting H0 means that the slope β1 ≠ 0 and the full regression model

Yi = β0 + β1 xi + εi

is better than the constant model

Yi = β0 + εi.
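A hedged sketch of how the ANOVA quantities and the F test could be computed for any simple linear regression data set is given below; it is an added illustration (the function name and structure are ours), not the procedure used to produce the package output shown in the examples.

import numpy as np
from scipy import stats

def slr_anova(x, y):
    """ANOVA quantities for a simple linear regression fit (illustrative sketch)."""
    n = len(y)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    fitted = b0 + b1 * x
    SSE = np.sum((y - fitted) ** 2)        # residual sum of squares
    SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
    SSR = SST - SSE                        # regression sum of squares
    MSR, MSE = SSR / 1, SSE / (n - 2)
    F = MSR / MSE
    p_value = stats.f.sf(F, 1, n - 2)      # P(F_{1,n-2} > F_cal)
    return SSR, SSE, SST, F, p_value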

3.1.3 Estimating σ 2

Note that the sums of squares are functions of the conditional random variables
Yi = (Y |X = xi ). Hence, the sums of squares are random variables as well. This
fact allows us to check some stochastic properties of the sums of squares, such as
their expectation, variance and distribution.

Theorem 3.2. In the full simple linear regression model we have

E(SSE ) = (n − 2)σ 2

Proof. Proof will be given later. 

From the theorem we obtain

E(MS_E) = E( SS_E/(n − 2) ) = σ²

and so M SE is an unbiased estimator of σ 2 . It is often denoted by S 2 .

Notice that in the full model S² is not the sample variance. We have

S² = MS_E = (1/(n − 2)) ∑_{i=1}^n (Yi − Ê(Yi))²,   where Ê(Yi) = β̂0 + β̂1 xi.

It is the sample variance in the constant (null) model, where Ê(Yi) = β̂0 = Ȳ and νE = n − 1. Then

S² = (1/(n − 1)) ∑_{i=1}^n (Yi − Ȳ)².

3.1.4 Example

Example 3.1. Sparrow Wings continued

The regression equation is


y = 0.787 + 0.265 x

Predictor Coef SE Coef T P


Constant 0.7868 0.1368 5.75 0.000
x 0.26463 0.01258 21.04 0.000

S = 0.209607 R-Sq = 97.6% R-Sq(adj) = 97.4%



Figure 3.3: Fitted line plot for Sparrow Wings

Analysis of Variance
Source DF SS MS F P
Regression 1 19.446 19.446 442.60 0.000
Residual Error 11 0.483 0.044
Total 12 19.929

Comments:
We fitted a simple linear model of the form

Yi = β0 + β1 xi + εi,   i = 1, . . . , 13,   εi ∼ N(0, σ²) i.i.d.

The estimated values of the parameters are

- intercept: β̂0 ≈ 0.79
- slope: β̂1 ≈ 0.26

Both parameters are highly significant (p < 0.001).

The ANOVA table also shows the significance of the regression (slope), that is the
null hypothesis
H0 : β1 = 0 versus the alternative H1 : β1 ≠ 0 can be rejected at significance level α < 0.001 (p ≈ 0.000).

The tests require the assumptions of the normality and of constant variance of
random errors. It should be checked whether the assumptions are approximately
met. If not, the tests may not be valid.

The graph shows that the observations lie along the fitted line and there are no
strange points which are far from the line or which could strongly affect the slope.

Final conclusions:
We can conclude that the data indicate that the length of sparrows’ wings depends
linearly on their age (within the range 3 - 18 days). The mean increase in the
wing’s length per day is estimated as βb1 ∼
= 0.26 cm.

However, it might be wrong to predict the length or its increase per day outside
the range of the observed time. We would expect that the growth slows down in
time and so the relationship becomes non-linear. 

3.2 Inference about the regression parameters

Example 3.2. Overheads.


A company builds custom electronic instruments and computer components. All
jobs are manufactured to customer specifications. The firm wants to be able to
estimate its overhead cost. As part of a preliminary investigation, the firm decides
to focus on a particular department and investigates the relationship between total
departmental overhead cost (Y) and total direct labour hours (X). The data for the
most recent 16 months are plotted in Figure 3.4.

Two objectives of this investigation are

1. to summarize for management the relationship between total departmental


overhead and total direct labour hours.
2. to estimate the expected and to predict the actual total departmental over-
head from the total direct labour hours.

The regression equation is


Ovhd = 16310 + 11.0 labour

Predictor Coef SE Coef T P


Constant 16310 2421 6.74 0.000
labour 10.982 2.268 4.84 0.000

S = 1645.61 R-Sq = 62.6% R-Sq(adj) = 60.0%



Figure 3.4: Plot of overheads data

Analysis of Variance
Source DF SS MS F P
Regression 1 63517077 63517077 23.46 0.000
Residual Error 14 37912232 2708017
Total 15 101429309

Unusual Observations
Obs labour Ovhd Fit SE Fit Residual St Resid
6 1067 24817 28028 413 -3211 -2.02R

R denotes an observation with a large standardized residual.

Comments:

• The model fit is ybi = 16310 + 11xi . There is a significant relationship


between the overheads and the labour hours (p < 0.001 in ANOVA).
• The increase of labour hours by 1 will increase the mean overheads by about
£11 (βb1 = 11.0).

The model allows us to estimate the total overhead cost as a function of labour
hours, but as we noticed, there is large variability in the data. In such a case,
the point estimates may not be very reliable. Anyway, point estimates should
always be accompanied by their standard errors. Then we can also find confidence
intervals (CI) for the unknown model parameters, or test their significance. 

Note that for the simple linear regression model


Yi = β0 + β1 xi + εi,   where the εi ∼ N(0, σ²) are i.i.d.,     (3.4)

we obtained the following LSE of the parameters β0 and β1 :

β̂0 = Ȳ − β̂1 x̄,   β̂1 = ∑_{i=1}^n (xi − x̄)(Yi − Ȳ) / ∑_{i=1}^n (xi − x̄)².

We now derive results which allow us to make inference about the regression
parameters and predictions.

3.2.1 Inference about β1

We proved the following result in Section 2.3.

Theorem 3.3. In the full simple linear regression model (SLRM) the distribution of the LSE of β1, β̂1, is normal with expectation E(β̂1) = β1 and variance var(β̂1) = σ²/Sxx, that is

β̂1 ∼ N(β1, σ²/Sxx).     (3.5)


Remark 3.1. For large samples, where there is no assumption of normality of Yi ,


the sampling distribution of βb1 is approximately normal. 

Theorem 3.3 allows us to derive a confidence interval (CI) for β1 and a test of significance for β1. After standardisation of β̂1 we obtain

(β̂1 − β1) / (σ/√Sxx) ∼ N(0, 1).

However, the error variance is usually not known and it is replaced by its estimator. Then the normal distribution changes to a Student t-distribution. The explanation is as follows.

Lemma 3.1. If Z ∼ N(0, 1) and U ∼ χ²_ν, and Z and U are independent, then

Z / √(U/ν) ∼ t_ν.

Here we have

Z = (β̂1 − β1) / (σ/√Sxx) ∼ N(0, 1).

We will see later that

U = (n − 2)S² / σ² ∼ χ²_{n−2}

and that S² and β̂1 are independent. It follows that

T = Z / √(U/(n − 2)) = (β̂1 − β1) / (S/√Sxx) ∼ t_{n−2}.     (3.6)

Confidence interval for β1

To find a CI for an unknown parameter θ means to find values of the boundaries A and B which satisfy

P(A < θ < B) = 1 − α

for some small α, that is for a high confidence level (1 − α)100%. From (3.6) we have

P( −t_{α/2,n−2} < (β̂1 − β1)/(S/√Sxx) < t_{α/2,n−2} ) = 1 − α,     (3.7)

where t_{α/2,n−2} is such that P(|T| < t_{α/2,n−2}) = 1 − α.

Rearranging the expression in brackets of (3.7) gives

P( β̂1 − t_{α/2,n−2} S/√Sxx < β1 < β̂1 + t_{α/2,n−2} S/√Sxx ) = 1 − α.     (3.8)

That is, the CI for β1 is

[A, B] = [ β̂1 − t_{α/2,n−2} S/√Sxx ,  β̂1 + t_{α/2,n−2} S/√Sxx ].     (3.9)

The calculated values of β̂1, S and Sxx for the overhead costs (Example 3.2) are the following:

β̂1 = 10.982,   S = 1645.61,   Sxx = 526656.9.

Also t_{0.025,14} = 2.14479. Hence, the 95% CI for β1 is

[a, b] = [ 10.982 − 2.14479 × 1645.61/√526656.9 ,  10.982 + 2.14479 × 1645.61/√526656.9 ]
       = [6.11851, 15.8455].

We would expect (with 95% confidence) that a one hour increase in labour will increase the cost by between £6.12 and £15.85.
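As an added illustration, the interval can be reproduced from the summary quantities above with a few lines of Python (using scipy for the t quantile); the numbers are those reported for Example 3.2.

from math import sqrt
from scipy import stats

b1, S, Sxx, n = 10.982, 1645.61, 526656.9, 16    # summary values from the output above
t_crit = stats.t.ppf(0.975, n - 2)               # 2.14479
half_width = t_crit * S / sqrt(Sxx)
print(b1 - half_width, b1 + half_width)          # approximately (6.12, 15.85)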

Test of H0 : β1 = 0 versus H1 : β1 ≠ 0

The null hypothesis H0 : β1 = 0 means that the slope is zero and a better model is the constant model

Yi = β0 + εi,   εi ∼ N(0, σ²) i.i.d.,

showing no relationship between Y and X. From (3.6) we see that if H0 is true, then

T = β̂1 / (S/√Sxx) ∼ t_{n−2}.     (3.10)

This statistic can be used as a test function for the null hypothesis.

We reject H0 at a significance level α when the calculated, for a given data set,
value of the test function, Tcal , is in the rejection region, that is

|Tcal | > t α2 ,n−2 .

Many statistical software packages give the p-value when testing a hypothesis.
When the p-value is smaller than α then we may reject the null hypothesis at a
significance level ≤ α.
Remark 3.2. The square root of the variance var(β̂1) is called the standard error of β̂1 and is denoted by se(β̂1), that is

se(β̂1) = √(σ²/Sxx).

Its estimator is

ŝe(β̂1) = √(S²/Sxx).

Often this estimated standard error is simply called the standard error. You should be aware of the difference between the two.

Remark 3.3. Note that the (1 − α)100% CI for β1 can be written as

[ β̂1 − t_{α/2,n−2} ŝe(β̂1) ,  β̂1 + t_{α/2,n−2} ŝe(β̂1) ]

and the test statistic for H0 : β1 = 0 as

T = β̂1 / ŝe(β̂1).


As we have noted before we can also test the hypothesis H0 : β1 = 0 using the
Analysis of Variance table and the F test. In this case the two tests are equivalent
since if the random variable W ∼ tν then W 2 ∼ F1,ν .

3.2.2 Inference about E(Y |X = xi )

In the simple linear regression model, we have

µi = E(Y | X = xi) = β0 + β1 xi

and its LSE is

µ̂i = Ê(Y | X = xi) = β̂0 + β̂1 xi.

We may estimate the mean response at any value of X which is within the range of the data, say x0. Then,

µ̂0 = Ê(Y | X = x0) = β̂0 + β̂1 x0.

Similarly as for the LSE of β0 and β1 we have the following theorem.

Theorem 3.4. In the full SLRM the distribution of the LSE of µ0, µ̂0, is normal with expectation E(µ̂0) = µ0 and variance var(µ̂0) = σ²(1/n + (x0 − x̄)²/Sxx), that is

µ̂0 ∼ N( µ0 , σ²(1/n + (x0 − x̄)²/Sxx) ).     (3.11)

Corollary 3.1. In the full simple linear regression model, we have

CI for µ0:

[ µ̂0 − t_{α/2,n−2} ŝe(µ̂0) ,  µ̂0 + t_{α/2,n−2} ŝe(µ̂0) ]

Test of the hypothesis H0 : µ0 = µ*:

T = (µ̂0 − µ*) / ŝe(µ̂0) ∼ t_{n−2} under H0,

where

ŝe(µ̂0) = √( S²(1/n + (x0 − x̄)²/Sxx) ).

Remark 3.4. Care is needed when estimating the mean at x0 . It should only be
done if x0 is within the data range. Extrapolation beyond the range of the given
x-values is not reliable, as there is no evidence that a linear relationship is appro-
priate there. 
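The corollary translates directly into a short computation. The sketch below (added here for illustration; the function name is ours) returns the CI for the mean response at a chosen x0 for any simple linear regression data set, assuming x0 lies within the range of the data as the remark requires.

import numpy as np
from scipy import stats

def mean_response_ci(x, y, x0, alpha=0.05):
    """CI for E(Y | X = x0) in simple linear regression (illustrative sketch)."""
    n = len(y)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    S2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)          # MSE
    mu0_hat = b0 + b1 * x0
    se = np.sqrt(S2 * (1.0 / n + (x0 - x.mean()) ** 2 / Sxx))
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    return mu0_hat - t_crit * se, mu0_hat + t_crit * se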

3.3 Inference in the Multiple Linear Regression Model

3.3.1 Properties of the least squares estimator

Remark 3.5. The vector of fitted values is given by

µ̂ = Ŷ = Xβ̂ = X(X^T X)^{-1} X^T Y = HY.

The matrix H = X(X^T X)^{-1} X^T is called the hat matrix.

Note that H^T = H and also

HH = X(X^T X)^{-1} X^T X(X^T X)^{-1} X^T = X(X^T X)^{-1} X^T = H,

since (X^T X)^{-1} X^T X = I.
A matrix, which satisfies the condition AA = A is called an idempotent matrix.
Note that if A is idempotent, then (I − A) is also idempotent.
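These properties are easy to verify numerically. The following sketch (an added illustration using the sparrow data of Example 1.2) builds H for the simple linear regression design matrix and checks symmetry, idempotency and that the residuals sum to zero.

import numpy as np

x = np.array([3, 3, 5, 6, 8, 8, 10, 11, 12, 13, 14, 15, 16], dtype=float)
y = np.array([1.4, 1.5, 2.2, 2.4, 2.8, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
e = (np.eye(len(y)) - H) @ y               # residuals e = (I - H)y

print(np.allclose(H, H.T), np.allclose(H @ H, H))   # True True: symmetric, idempotent
print(np.isclose(e.sum(), 0.0))                     # True: residuals sum to zero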

We now prove some results about the residual vector


e = Y − Yb
= Y − HY
= (I − H)Y .

As in Theorem 2.8, here we have


Lemma 3.2. E(e) = 0.
Proof.
E(e) = (I − H) E(Y )
= (I − X(X T X)−1 X T )Xβ
= Xβ − Xβ
= 0


Lemma 3.3. Var(e) = σ 2 (I − H).
Proof.
Var(e) = (I − H) var(Y )(I − H)T
= (I − H)σ 2 I(I − H)
= σ 2 (I − H)


Lemma 3.4. The sum of squares of the residuals is Y T (I − H)Y .
Proof.
n
X
e2i = eT e = Y T (I − H)T (I − H)Y
i=1
= Y T (I − H)Y


Lemma 3.5. The elements of the residual vector e sum to zero, i.e
n
X
ei = 0.
i=1

Corollary 3.2.
n
1Xb
Yi = Ȳ .
n i=1

P P P
Proof. The residuals ei = Yi − Ybi , so ei = (Yi − Ybi ) but ei = 0. Hence
P Pb
Yi = Yi and so the result follows. 

3.4 Analysis of Variance

We begin this section by proving the basic Analysis of Variance identity.


Theorem 3.5. The total sum of squares splits into the regression sum of squares
and the residual sum of squares, that is
SST = SSR + SSE .
Proof.
X
SST = (Yi − Ȳ )2
X
= Yi2 − nȲ 2
= Y T Y − nȲ 2 .

X
SSR = (Ybi − Ȳ )2
X X
= Ybi2 − 2Ȳ Ybi +nȲ 2
| {z }
=nȲ
X
= Ybi2 − nȲ 2
T
= Yb Yb − nȲ 2
T
= βb X T X βb − nȲ 2
= Y T X(X T X)−1 X T X(X T X)−1 X T Y − nȲ 2
| {z }
=I
= Y T HY − nȲ 2 .

We have seen (Lemma 3.3) that


SSE = Y T (I − H)Y
and so
SSR + SSE = Y T HY − nȲ 2 + Y T (I − H)Y
= Y T Y − nȲ 2
= SST



F-test for the Overall Significance of Regression

Suppose we wish to test the hypothesis


H0 : β1 = β2 = . . . = βp−1 = 0,
i.e. all coefficients except β0 are zero, versus
H1 : ¬H0 ,
which means that at least one of the coefficients is non-zero. Under H0 , the model
reduces to the null model
Y = 1β0 + ε,
where 1 is a vector of ones.

In testing H0 we are asking if there is sufficient evidence to reject the null model.

The Analysis of variance table is given by

Source               d.f.     SS                MS              VR
Overall regression   p − 1    Y^T HY − nȲ²      SS_R/(p − 1)    MS_R/MS_E
Residual             n − p    Y^T(I − H)Y       SS_E/(n − p)
Total                n − 1    Y^T Y − nȲ²

As in simple linear regression we have n − 1 total degrees of freedom. Fitting a


linear model with p parameters (β0 , β1 , . . . , βp−1 ) leaves n − p residual d.f. Then
the regression d.f. are n − 1 − (n − p) = p − 1.

It can be shown that E(SS_E) = (n − p)σ², that is, MS_E is an unbiased estimator of σ². Also,

SS_E / σ² ∼ χ²_{n−p}

and, if β1 = · · · = βp−1 = 0, then

SS_R / σ² ∼ χ²_{p−1}.

The two statistics are independent, hence, under H0,

MS_R / MS_E ∼ F_{p−1,n−p}.

This is a test function for the null hypothesis

H0 : β1 = β2 = . . . = βp−1 = 0,

versus
H1 : ¬H0 .
We reject H0 at the 100α% level of significance if

Fobs > Fα;p−1,n−p ,

where Fα;p−1,n−p is such that P (F > Fα;p−1,n−p ) = α.

3.5 Inferences about the parameters

In Theorem 2.5 we have seen that

β̂ ∼ N_p(β, σ²(X^T X)^{-1}).

Therefore,

β̂j ∼ N(βj, σ² c_jj),   j = 0, 1, 2, . . . , p − 1,

where c_jj is the jth diagonal element of (X^T X)^{-1} (counting from 0 to p − 1). Hence, it is straightforward to make inferences about βj in the usual way.

A 100(1 − α)% confidence interval for βj is

β̂j ± t_{α/2,n−p} √(S² c_jj),

where S² = MS_E.

The test statistic for H0 : βj = 0 versus H1 : βj ≠ 0 is

T = β̂j / √(S² c_jj) ∼ t_{n−p} if H0 is true.

Care is needed in interpreting the confidence intervals and tests. They refer only to
the model we are fitting. Thus not rejecting H0 : βj = 0 does not mean that Xj has
no explanatory power; it means that, conditionally on X1 , . . . , Xj−1 , Xj+1 , . . . , Xp−1
being in the model Xj has no additional power.

It is often best to think of the test as comparing models without and with Xj , i.e.

H0 : E(Yi ) = β0 + β1 x1,i + · · · + βj−1 xj−1,i + βj+1 xj+1,i + · · · + βp−1 xp−1,i

versus
H1 : E(Yi ) = β0 + β1 x1,i + · · · + βp−1 xp−1,i .
It does not tell us anything about the comparison between models E(Yi ) = β0 and
E(Yi ) = β0 + βj xj,i .

3.6 Confidence interval for µ

We have
\) = µ
E(Y b = X β.
b

As with simple linear regression, we might want to estimate the expected response
at a specific x, say x0 = (1, x1,0 , . . . , xp−1,0 )T , i.e.

µ0 = E(Y |X1 = x1,0 , . . . , Xp−1 = xp−1,0 ).

The point estimate will be


b0 = xT0 β.
µ b

Assuming normality, as usual, we can obtain a confidence interval for µ0 .

Theorem 3.6.
b0 ∼ N (µ0 , σ 2 xT0 (X T X)−1 x0 ).
µ

Proof.

b0 = xT0 βb is a linear combination of βb0 , βb1 , . . . , βbp−1 , each of which is normal.


(i) µ
Hence µ b0 is also normal.

(ii)

µ0 ) = E(xT0 β)
E(b b
= xT E(β)
b
0
= xT0 β
= µ0
3.7. SAMPLING DISTRIBUTION OF M SE (S 2 ) 51

(iii)

µ0 ) = var(xT0 β)
var(b b
= xT Var(β)x
0
b 0
= σ x0 (X T X)−1 x0 .
2 T

The following corollary is a consequence of Theorem 3.6.

Corollary 3.3. A 100(1 − α)% confidence interval for µ0 is


q
b0 ± t 2 ,n−p S 2 xT0 (X T X)−1 x0 .
µ α

3.7 Sampling distribution of M SE (S 2)

First we will show that in the linear model

Y = Xβ + ε, ε ∼ N (0, σ 2 I),

we have
E(S 2 ) = σ 2 .
For this we need some results on matrix algebra.

Lemma 3.6. For a symmetric idempotent matrix A of rank r, there exists an


orthogonal matrix C (C T C = I) such that

A = CDC T ,

where  
Ir 0
D= .
0 0

52 CHAPTER 3. INFERENCE

Lemma 3.7. Properties of trace

1. For any matrices A and B of appropriate dimensions and for a scalar k we


have

(a) trace(AB) = trace(BA)


(b) trace(A + B) = trace(A) + trace(B)
(c) trace(kA) = k trace(A)

2. For an idempotent matrix A

trace(A) = rank(A)

Proof. 1. A simple consequence of the definition of trace.

2. If A is idempotent then by Lemma 4.1 A = CDC T for an orthogonal C,


then

trace(A) = trace(CDC T )
= trace(C T CD)
= trace(D)
= r.

Lemma 3.8.
rank(I − H) = n − p.

Proof.

rank(I − H) = trace(I − H)
= trace(I) − trace(H)
= n − trace{X(X T X)−1 X T }
= n − trace{X T X(X T X)−1 }
= n − trace(I p )
= n−p


3.7. SAMPLING DISTRIBUTION OF M SE (S 2 ) 53

Lemma 3.9. Let Z be a random vector such that


E(Z) = µ Var(Z) = V
then
E(Z T AZ) = trace(AV ) + µT Aµ
for any matrix A.
Proof. We have
V = E(ZZ T ) − E(Z) E(Z T ) = E(ZZ T ) − µµT .
Hence
E(ZZ T ) = V + µµT .
Then
E(Z T AZ) = E[trace(Z T AZ)]
= E[trace(ZZ T A)]
= trace[A E(ZZ T )]
= trace[A(V + µµT )]
= trace(AV ) + trace(µT Aµ)
= trace(AV ) + µT Aµ


Theorem 3.7. Let Y = Xβ + ε be a linear model such that E(Y ) = Xβ and
Var(Y ) = σ 2 I n . Then the error sum of squares, SSE , has expectation equal to
E(SSE ) = (n − p)σ 2 ,
where p is the number of parameters in the model.
Proof.
SSE = Y T (I − H)Y
E(SSE ) = E[Y T (I − H)Y ]
= trace[(I − H) Var(Y )] + E(Y )T (I − H) E(Y )
= σ 2 trace(I − H) + β T X T (I − X(X T X)−1 X T )Xβ
= σ 2 (n − p) + β T (X T X − X T X (X T X)−1 X T X )β
| {z }
=I
= σ 2 (n − p)


54 CHAPTER 3. INFERENCE

Corollary 3.4.
E(M SE ) = σ 2

To show that
(n − p)S 2
∼ χ2n−p ,
σ2
the result we have used for deriving F tests, we will need the following lemmas.

Lemma 3.10. If Zi , i = 1, . . . , r are independent and identically distributed,


each with a standard normal distribution then
r
X
Zi2 ∼ χ2r
i=1

Lemma 3.11. The vector of residuals can be written as

e = (I − H)ε.

Proof.

e = Y − Yb
= Y − HY
= (I − H)Y
= (I − H)(Xβ + ε)
= Xβ − HXβ + (I − H)ε
= (I − H)ε

Corollary 3.5.
e ∼ Nn (0, σ 2 (I − H))

Theorem 3.8.
(n − p)S 2
∼ χ2n−p .
σ2
3.7. SAMPLING DISTRIBUTION OF M SE (S 2 ) 55

Proof.
(n − p)S 2 1
2
= SSE
σ σ2
1 T
= e e
σ2
1 T
= ε (I − H)T (I − H)ε
σ2
1 T
= ε (I − H)ε
σ2
1 T
= ε CDC T ε
σ2
= Z T DZ,

where, by Lemma 4.1 as I − H is idempotent, I − H = CDC T with C orthog-


onal,  
I n−p 0
D=
0 0
and Z = σ1 C T ε.

We assume that ε ∼ Nn (0, σ 2 I). Hence , Z is also normal with E(Z) = 0 and
1 T
Var(Z) = C Var(ε)C
σ2
σ2 T
= C C
σ2
= CT C
= I

as C is orthogonal. Hence
Z ∼ N (0, I)
and so Zi are independent and each distributed as N (0, 1).

Also
n−p
X
T
Z DZ = Zi2 .
i=1
Hence by Lemma 4.5 we have
n−p
(n − p)S 2 X 2
= Zi ∼ χ2n−p .
σ2 i=1


56 CHAPTER 3. INFERENCE

3.8 Sampling distribution of M SR and F

Further, we will show the result given in section 3.3 that

M SR
F = ∼ Fp−1,n−p .
M S E H0

In that section we showed that

SSR = Y T HY − nȲ 2 .

This can be written as

SSR = Y T X(X T X)−1 X T Y − nȲ 2


T
= βb X T Y − nȲ 2
 T 
T 1
= (β0 β ∗ )
b b Y − nȲ 2
X T∗

where
   
x1,1 · · · x1,n βb1
T  .. ..  . 
X∗ =  .  , β ∗ =  ..  .
 b
.
xp−1,1 · · · xp−1,n βbp−1

This gives

1T Y
 
T
SSR = (βb0 βb∗ ) − nȲ 2
X T∗ Y
T
= βb0 1T Y + βb∗ X T∗ Y − nȲ 2
T
= βb0 nȲ − nȲ 2 + βb X T Y∗ ∗

Now
T
βb0 = Ȳ − (βb1 x̄1 + · · · + βbp−1 x̄p−1 ) = Ȳ − βb∗ x̄,
3.8. SAMPLING DISTRIBUTION OF M SR AND F 57

where x̄ = (x̄1 , . . . , x̄p−1 )T . Hence


T T
SSR = nȲ 2 − nβb∗ x̄Ȳ − nȲ 2 + βb∗ X T∗ Y
T n 1
= βb∗ (X T∗ − x̄1T )Y as Ȳ = 1T Y
n n
T T T
= βb∗ (X ∗ − x̄1 )Y
   
x11 · · · x1n x̄1 · · · x̄1
T  . .
= βb∗  .. ..  −  ...
  ..  Y
. 
xp−11 · · · xp−1n x̄p−1 · · · x̄p−1
| {z }
T
=X ∗c
T
= βb∗ X T∗c Y

Now, this can be viewed as SSR in the model Y = X ∗c β ∗ + ε.


Theorem 3.9. Under H0 : β ∗ = 0 we have
SSR
∼ χ2p−1
σ2
Proof. E(βb∗ ) = β ∗ , hence E(βb∗ ) = 0 under H0 . Also,

Var(βb∗ ) = σ 2 (X T∗c X ∗c )−1 and βb∗ ∼ N (0, σ 2 (X T∗c X ∗c )−1 ).

Let C be a (p − 1) × (p − 1) dimensional matrix such that

C T C = X T∗c X ∗c

Now multiply on the left by (C T )−1 , so that

C = (C T )−1 X T∗c X ∗c

Now multiply on the right by (X T∗c X ∗c )−1 so that

C(X T∗c X ∗c )−1 = (C T )−1

Now multiply on the right by (C T ) to give

C(X T∗c X ∗c )−1 (C T ) = I

Hence
Z = C βb∗ ∼ Np−1 (0, σ 2 I)
58 CHAPTER 3. INFERENCE

that is
Zi ∼ N (0, σ 2 )
iid

and
1
Zi ∼ N (0, 1)
σ iid

Now
1 1 bT T
SSR = β X Y
σ 2 σ 2 ∗ ∗c
1 bT T
= β X X ∗c βb∗ (by normal equations)
σ 2 ∗ ∗c
1 bT T b
= β C C β∗
σ2 ∗
1 T
= Z Z
σ2
p−1
1 X 2
= Z
σ 2 i=1 i
p−1  2
X 1
= Zi .
i=1
σ

Hence
SSR
∼ χ2p−1
σ2


Corollary 3.6.
(p − 1)M SR
∼ χ2p−1 .
σ2


From this, Theorem 4.3 and from the independence of M SE and M SR we obtain

Theorem 3.10.
M SR
F = ∼ Fp−1,n−p .
M SE

Chapter 4

Model Checking

4.1 Residuals in Simple Linear Regression

4.1.1 Crude Residuals

In Section 2.4.1 we defined the residuals as


ei = Yi − Ybi .
These are often called crude residuals. We have
ei = Yi − (βb0 + βb1 xi )
= Yi − (Ȳ − βb1 x̄) − βb1 xi
= Yi − Ȳ − βb1 (xi − x̄).
We also have seen that n
X
ei = 0.
i=1
Now the question is what is the expectation and the variance of crude residuals?

The mean of the ith residual is


E[ei ] = E[Yi − βb0 − βb1 xi ]
= E[Yi ] − E[βb0 ] − xi E[βb1 ]
= β0 + β1 xi − β0 − β1 xi
= 0.

59
60 CHAPTER 4. MODEL CHECKING

The variance is given by


1 (xi − x̄)2
  
2
var[ei ] = σ 1 − + = σ 2 (1 − hii ),
n Sxx
which can be shown by writing ei as a linear combination of the Yi ’s. Note that it
depends on i, that is the variance of ei is not constant, unlike that of εi . Similarly
it can be shown that the covariance of two residuals ei and ej is
 
2 1 (xi − x̄)(xj − x̄)
cov[ei , ej ] = −σ + = −σ 2 hij .
n Sxx

We know that var[εi ] = σ 2 and cov[εi , εj ] = 0. So the crude residuals ei do not


quite mimic the properties of εi .

4.1.2 Standardized/Studentized Residuals

To standardize a random variable we subtract its mean and divide by its standard
error. Hence, to standardize residuals we calculate
ei − E(ei ) ei
di = √ =p .
var ei 2
σ (1 − hii )
Then
di ∼ N (0, 1).
They are not independent, though for large samples the correlation should be
small.

However, we do not know σ 2 . If we replace σ 2 by S 2 we get the so called studen-


tized residuals,
ei
ri = p .
S 2 (1 − hii )
For large samples they will approximate the standard di .

4.1.3 Residual plots

Shapes of various residual plots can show whether the model assumptions are
approximately met.

To check linearity, we plot ri against xi , as shown in Figure 4.1.


4.1. RESIDUALS IN SIMPLE LINEAR REGRESSION 61

(a) (b)
Figure 4.1: (a) No problem apparent (b) Clear non-linearity

To check the assumption of constant variance (homoscedasticity), we plot ri against


the fitted values ybi , as shown in Figure 4.2. This plot can also indicate whether the
assumption of model linearity is approximately satisfied.

(a) (b)
Figure 4.2: (a) No problem apparent (b) Variance increases as the mean response
increases
To check whether the distribution of the residuals follows a normal distribution
we can draw a so called Normal Probability Plot. It plots each value of ordered
residuals vs. the percentage of values in the sample that are less than or equal
to it, along a fitted distribution line. The scales are transformed so that the fitted
distribution forms a straight line. A plot that departs substantially from linearity
suggests that the error distribution is not normal as shown in plots 4.3 - 4.6.
62 CHAPTER 4. MODEL CHECKING

(a) (b)
Figure 4.3: (a) Histogram of data simulated from standard normal distribution, (b)
Normal Probability Plot, no problem apparent.

(a) (b)
Figure 4.4: (a) Histogram of data simulated from a Log-normal distribution, (b) Normal
Probability Plot indicates skewness of the distribution.

(a) (b)
Figure 4.5: (a) Histogram of data simulated from a Beta distribution, (b) Normal Proba-
bility Plot indicates light tails.
4.2. FURTHER MODEL CHECKING 63

(a) (b)
Figure 4.6: (a) Histogram of data simulated from a Student t-distribution, (b) Normal
Probability Plot indicates heavy tails.

4.2 Further Model Checking

4.2.1 Outliers and influential observations

An outlier, in the context of regression is an observation whose standardized resid-


ual is large (in absolute value) compared with the rest of the data. Recall the
definition of the standardized residuals:

e 1 (xi − x̄)2
ri = √ i , hii = + .
S 1 − hii n Sxx

An outlier will usually be apparent from any of the residual plots.

One rule of thumb is that observations with standardized residuals greater (in ab-
solute value) than 2 are possible outliers. However, with a large number of ob-
servations there is more chance that a strange observation will occur in a data set.
So, we need to be cautious when deciding about such values.

If we find an outlier we should check whether the observation was misrecorded or


miscopied and if so correct it. If it seems correctly recorded we should rerun the
analysis excluding the outlier. If the conclusions from the second analysis differ
substantially from the first one we should report both.

As well as outliers in the y values, we sometimes have values of x which are


different to the rest. To detect an observation with an unusual x value we use the
leverage. This is defined as the hii value (as in the definition of the standardized
residual).
64 CHAPTER 4. MODEL CHECKING

Note that
n n 
(xi − x̄)2

X X 1
hii = + = 2,
i=1 i=1
n Sxx

so on average an observation will have a leverage of 2/n. We shall regard an


observation with hii > 4/n as having a large leverage and with hii > 6/n as a
very large leverage.

An observation with a large leverage is not a wrong observation (although if the


leverage is very large it is probably worth checking wether the x value has been
recorded correctly). Rather, it is a potentially influential observation, i.e., one
whose omission would cause a big change in the parameter estimates.

We can use a statistic called Cook’s distance to measure the influence of an obser-
vation.

For a simple linear regression model consider omitting the ith observation (xi , yi )
and refitting the model. Denote the new fitted values by ŷ (i) . We define Cook’s
statistic for case i to be
n
1 X (i)
Di = 2 (ŷ − ŷj )2 .
2s j=1 j

It can be shown that


e2i hii
Di = .
2s (1 − hii )2
2

This shows that Di depends on both the size of the residual ei and the leverage
hii . So a large value of Di can occur due to large ei or large hii .

A common technique to determine if Di is unusually large is to determine whether


Di is bigger than the 50th percentile of an Fp,n−p distribution, where p is the num-
ber of parameters in the model. If so it has a major influence on the fitted value.
Even if the largest Di is not bigger than this value the corresponding observation
could still be considered influential if it is a lot larger than the second largest.

It is not recommended that influential observations be removed, but they indi-


cate that some doubt should be expressed about the conclusions since without the
influential observations the conclusions might be rather different.
Example 4.1. Gesell’s Score
The following data give Age at First Word (X) and Gesell Adaptive Score (Y ) for
21 individuals from an investigation into cyanotic heart disease.
4.2. FURTHER MODEL CHECKING 65

Obs. x y Obs. x y
1 15 95 11 7 113
2 26 71 12 9 96
3 10 83 13 10 83
4 9 91 14 11 84
5 15 102 15 11 102
6 20 87 16 10 100
7 18 93 17 12 105
8 11 100 18 42 57
9 8 104 19 17 121
10 20 94 20 11 86
21 10 100

The data represent the Gesell’s adaptive scores (y) versus age of infants (x, in
months) at first word. The scatter plot indicates two unusual observations: one is
a large value of y compared to other values at a similar x and one is a large value
of x, which is far from all the other x values.

4.2.2 Lack of Fit Test

We have seen that the residuals for the plasma data are not likely to be a sample
from a normal distribution with a constant variance. One of the reasons can be
that the straight line is not a good choice of the model. This fact can be easily
seen here, but we can also test lack of fit. The test function is also based on the
model assumptions so we should not see clear evidence against the assumptions
for the test to be valid.

The test is possible when we have replications, that is more than one observa-
tion for some values of the explanatory variable. In Example 2.7 we have five
observations for each age xi .
66 CHAPTER 4. MODEL CHECKING

Notation:
Denote by Yij the j-th response atPxi , i = 1, . . . , m, j = 1, . . . , ni , that is the
number of all observations is n = m i=1 ni . The average response at xi is

ni
1 X
Ȳi = Yij .
ni j=1

We denote the fitted response at xi by Ybi , which is the same for all observations at
xi . 

The residuals eij are


eij = Yij − Ybi .

These differences arise for two reasons. Firstly the j-th observation of a given xi
is an outcome of a random variable. Observations obtained for the same value of
X may produce different values of Y . Secondly the model we fit may not be a
good one.

How could we distinguish between the random variation and the lack of fit? We
need more than one observation at xi to be able to do it.

The difference
Yij − Ȳi

indicates the random variation at xi ; it is called pure error. The difference between
the mean and the fitted response, i.e.,

Ȳi − Ybi ,

indicates lack of fit at xi .

Using the double index notation we may write the sum of squares for residuals as
ni
m X
X
SSE = (Yij − Ybi )2 .
i=1 j=1

We can also define the pure error sum of squares as


ni
m X
X
SSP E = (Yij − Ȳi )2
i=1 j=1
4.2. FURTHER MODEL CHECKING 67

and the lack of fit sum of squares as a measure of lack of fit:


ni
m X
X
SSLoF = (Ȳi − Ybi )2
i=1 j=1
Xm
= ni (Ȳi − Ybi )2 .
i=1

Theorem 4.1. In the simple linear regression model we have

SSE = SSLoF + SSP E .

Proof.
ni
m X
X
SSE = (Yij − Ybi )2
i=1 j=1
Xm X ni
= {(Yij − Ȳi ) + (Ȳi − Ybi )}2
i=1 j=1
Xm X ni m
X ni
m X
X
2 2
= (Yij − Ȳi ) + ni (Ȳi − Ybi ) + 2 (Yij − Ȳi )(Ȳi − Ybi )
i=1 j=1 i=1 i=1 j=1
m
X ni
X
= SSP E + SSLoF + 2 (Ȳi − Ybi ) (Yij − Ȳi )
i=1 j=1
= SSP E + SSLoF

since nj=1
P i
(Yij − Ȳi ) = 0.


This theorem shows how the residual sum of squares is split into two parts, one
due to the pure error and one due to the model lack of fit. To work out the split of
the degrees of freedom, note that to calculate SSP E we must calculate m sample
means Ȳi , i = 1, . . . , m. Each sample mean takes up one degree of freedom. Thus
the degrees of freedom for pure error are n − m. By subtraction, the degrees of
freedom for lack of fit are

νLoF = νE − νP E = (n − 2) − (n − m) = m − 2.

This can be included in the Analysis of variance table as follows:


68 CHAPTER 4. MODEL CHECKING

ANOVA table

Source of variation d.f. SS MS VR


M SR
Regression 1 SSR M SR M SE
SSE
Residual n−2 SSE M SE = n−2

Lack of fit m−2 SSLoF M SLoF = SSLoF


m−2
M SLoF
M SP E

Pure Error n−m SSP E M SP E = SSPE


n−m
Total n−1 SST

We will see later that


E[SSP E ] = (n − m)σ 2
whether the simple linear regression model is true or not.

It can also be shown that if the simple linear regression model is true then

E[SSLoF ] = (m − 2)σ 2 .

Hence, both M SP E and M SLoF give us unbiased estimators of σ 2 , but the latter
one only if the model is true.

Let
H0 : simple linear regression model is “true”
H1 : ¬H0

Then, under H0 ,
(m − 2)M SLoF
∼ χ2m−2 .
σ2 H0

Also
(n − m)M SP E
∼ χ2n−m
σ2
whatever the model.

Hence, under H0 , the ratio of these two independent statistics divided by the re-
spective degrees of freedom is distributed as Fm−2,n−m , namely

M SLoF
F = ∼ Fm−2,n−m .
M S P E H0

Note that we can only do this lack of fit test if we have replications. These have to
be true replications, not just repeated measurements on the same sampling unit.
4.2. FURTHER MODEL CHECKING 69

Example 4.2. Plasma level continued.


To illustrate these ideas we return to the plasma example. We have seen that the
residual plots show some evidence that a transformation is necessary. The analysis
of variance table for the plasma data after the log transformation of the response
variable is following.

Source DF SS MS F P
Regression 1 2.6554 2.6554 60.63 0.000
Residual Error 23 1.0073 0.0438
Lack of Fit 3 0.0885 0.0295 0.64 0.597
Pure Error 20 0.9188 0.0459
Total 24 3.6627

The p-value is 0.597 so the numerical output shows no reason to doubt the fit of
this model. 

4.2.3 Matrix form of the model

We denote the vector of residuals as


e = Y − Yb ,
\) = X β
where Yb = E(Y b is the vector of fitted responses µ
bi . It can be shown that
the following theorem holds.
Theorem 4.2. The n × 1 vector of residuals e has mean
E(e) = 0
and variance-covariance matrix
Var(e) = σ 2 I − X(X T X)−1 X T .


Hence, variance of the residuals ei is


var[ei ] = σ 2 (1 − hii ),
where the leverage hii is the ith diagonal element of the Hat Matrix H = X(X T X)−1 X T ,
i.e.,
T −1
hii = xT
i (X X) xi ,
70 CHAPTER 4. MODEL CHECKING

where xT
i = (1, xi ) is the ith row of matrix X.

The ith mean response can be written as


 
β0
E(Yi ) = µi = xT
i β = (1, xi ) = β0 + β1 xi
β1

and its estimator as


bi = xT
µ i β.
b
Then, the variance of the estimator is
T −1
µi ) = var(xT
var(b 2 T 2
i β) = σ xi (X X) xi = σ hii
b

and the estimator of this variance is


\ µi ) = S 2 hii ,
var(b

where S 2 is a suitable unbiased estimator of σ 2 .

We can easily obtain other results we have seen for the SLRM written in non-
matrix notation, now using the matrix notation, both for the full model and for a
reduced SLM (no intercept or zero slope).

We have seen on page 22 that


 P 
T −1 1 x2i −nx̄
(X X) = .
nSxx −nx̄ n

b = σ 2 (X T X)−1 . Thus
Now, by Theorem 2.1, Var[β]
P 2
2 xi
var[β0 ] = σ
b
nSxx
n o
1 x̄2
x2 = x2 − nx̄2 + nx̄2 , can be written as σ 2
P P
which, by writing n
+ Sxx
.
Also,
 
−nx̄
cov(βb0 , βb1 ) = σ 2
nSxx
2
−σ x̄
= ,
Sxx
and
σ2
var[βb1 ] = .
Sxx
4.3. MODEL CHECKING IN MULTIPLE REGRESSION 71

The quantity hii is given by


T −1
hii = xTi (X X) xi
 P 2  
1 xj −nx̄ 1
= (1 xi ) .
nSxx −nx̄ n xi

We shall leave it as an exercise to show that this simplifies to


1 (xi − x̄)2
hii = + .
n Sxx

4.3 Model checking in multiple regression

4.3.1 Standardised residuals

We defined the vector of residuals

e = Y − Yb = (I − H)Y

and showed that


E(e) = 0, Var(e) = σ 2 (I − H),
where H is the hat matrix with elements hij . Then

var(ei ) = (1 − hii )σ 2 and cov(ei , ej ) = −hij σ 2 .

We see that the residuals may have different variances which may make detecting
outlying observations more difficult. So we define the standardized residuals as
follows
ei
ri = p .
2
S (1 − hii )
For large samples hij will be small (for i 6= j) and we have an asymptotic re-
sult that the standardized residuals are approximately independent and identically
distributed as N (0, 1), that is

ri ∼ N (0, 1) approximately, for large n.


iid

This allows us to carry out model checking. Note however that it will be most
reliable for large samples.

Apart from the usual residual diagnostics as done in the residual plots in R, we
may use standardized residuals to check the form of the expected response by
72 CHAPTER 4. MODEL CHECKING

plotting ri versus xji for each j = 1, . . . , p − 1. Any curvature suggests that a


higher order term in xj is needed.

Also, as with simple linear regression outliers may be evident from the residual
plots.

4.3.2 Lack of fit and pure error

This is similar to the case with simple linear regression.

Assume we have m distinct combinations of levels of the explanatory variables,


X1 , . . . , Xp−1 , and we have ni replicates of the combinations. That is, matrix X
has m distinct rows xT i = (1, x1,i , . . . , xp−1,i ), each repeated ni times.

Then the multiple regression model can be written as


Yij = xTi β + εij , i = 1, . . . , m; j = 1, . . . , ni .
As with simple linear regression we can separate the SSE into components for
lack of fit and for pure error
ni
m X
X
SSE = (Yij − Ybi )2
i=1 j=1
X ni
m X m
X
2
= (Yij − Ȳi ) + ni (Ȳi − Ybi )2
i=1 j=1 i=1
= SSP E + SSLoF

The analysis of variance table can be expanded as before.

4.3.3 Leverage and influence

We noted in section 4.2.1 that an observation with high leverage was potentially
influential. We discuss this in greater detail here. The vector of fitted values is
b = X βb = X(X T X)−1 X T y = Hy
y
and the ith fitted value can be written as
n
X X
ybi = hij yj = hii yi + hij yj .
j=1 i6=j
4.3. MODEL CHECKING IN MULTIPLE REGRESSION 73

The weight hii indicates how heavily yi contributes to the fitted value ybi . The
quantity hii is called the leverage of case i. The ith diagonal element, hii , of the
hat matrix H has the following properties:

1. As var(ei ) = σ 2 (1 − hii ), we have hii < 1. This means that hii close to 1
will give var(ei ) ≈ 0 and so ybi ≈ yi , that is, the fitted value will be very
close to the ith observation.

2. hii is usually small when vector (x1,i , . . . , xp−1,i )T is close to the centroid
(x̄1 , . . . x̄p−1 )T and large when the vector is far from the centroid.

3. When p = 2 (SLRM)
1 (xi − x̄)2
hii = +
n Sxx
1
and hii = n
when xi = x̄.

1
4. In general, n
≤ hii < 1.
Pn
5. i=1 hii = p since

n
X
hii = trace(H)
i=1
= trace(X(X T X)−1 X T )
= trace(X T X)(X T X)−1
= trace I p
= p.

Hence the average leverage is np . A case for which hii > 2p


n
is considered a
3p
high leverage case and one with hii > n is considered a very high leverage
case.

There may be various reasons for high leverage. It may be that the data of the
case were collected differently than the rest of the data or simply misrecorded.
It may just be that the case has one or more values which are atypical but cor-
rectly recorded. A low leverage case usually will not influence the fit much; a
high leverage case indicates potential influence, but not all high leverage cases are
influential.
74 CHAPTER 4. MODEL CHECKING

Cook’s distance

Recall from section 4.2.1 that Cook’s distance provides a measure of the influence
of an observation on the fitted model. Let

Y = Xβ + ε, ε ∼ N n (0, σ 2 I).

Denote by βb(i) the estimate of β obtained without the i-th case (x1,i , . . . , xp−1,i , yi ).
Then, βb − βb(i) is a good indicator of the influence of the i-th observation on the
model fit. When pre-multiplied by X, this is the difference between the vectors of
fitted values obtained with all cases included and with the i-th case omitted. The
Cook’s distance, as defined in Section 4.2.1, is
n
1 X (i)
Di = 2 yj − ybj )2
(b
ps j=1
1
= y − yb(i) )T (b
(b y − yb(i) )
ps2
1
= 2 (X βb − X βb(i) )T (X βb − X βb(i) )
ps
1
= 2 (βb − βb(i) )T X T X(βb − βb(i) )
ps

It can be shown that


ei (X T X)−1 xi
βb(i) = βb − , (4.1)
1 − hii
where ei = yi − ybi . Then, we get
T
ei (X T X)−1 xi ei (X T X)−1 xi

1
Di = 2 X TX
ps 1 − hii 1 − hii
2 T T −1
e x (X X) xi
= i i2
ps (1 − hii )2
e2i hii ri2 hii
= 2 =
ps (1 − hii )2 p(1 − hii )
ei
as ri = √ .
s2 (1−hii )

Large Cook’s distance indicates that the observation i is influential. Note that, this
depends on both the leverage hii and the standardized residual ri .
4.3. MODEL CHECKING IN MULTIPLE REGRESSION 75

4.3.4 Prediction Error Sum of Squares (P RESS)

First we define so called PRESS residuals e(i) as follows,

(i)
e(i) = yi − ybi ,

(i)
where ybi = xTi β (i) and β (i) is the vector of least squares estimates of the model
b b
parameters obtained without case i. Then,

(i)
e(i) = yi − ybi = yi − xT i β (i)
b
 T −1

T ei (X X) x i
= yi − xi βb −
1 − hii
T T −1
= yi − xT b + ei xi (X X) xi
β
i
1 − hii
ei hii ei
= ei + = .
1 − hii 1 − hii

We define P RESS as the sum of squares of the PRESS residuals, that is,

n n
X X e2i
P RESS = e2(i) = .
i=1 i=1
(1 − hii )2

P RESS assesses the model’s predictive ability. It is used for calculating pre-
dicted R2 .

Predicted R2

This is defined as
 
2 P RESS
R (pred) = 1 − 100%.
SST

Predicted R2 is used in MLRM to indicate how well the model predicts responses
for new observations. A good model would have R2 and R2 (pred) high and close
to each other. Large discrepancy between these two measures means that the
model may be over-fitted.
76 CHAPTER 4. MODEL CHECKING

4.4 Problems with fitting regression models

4.4.1 Near-singular and ill-conditioned X T X

We have seen that if X T X is singular, no unique least squares estimators exist.


The singularity is caused by linear dependence among the explanatory variables.

For example, suppose that for E(Yi ) = β1 x1i + β2 x2i we have


 
−1 −1  
 −1 −1  T 4 4
X=   ⇒X X= , det(X T X) = 0
1 1  4 4
1 1

and so X T X is singular. Now, take


 
−1 −0.9  
 −1 −1.1  T 4 4
X=  1
⇒X X= , det(X T X) = 0.16
0.9  4 4.04
1 1.1

and so X T X is nonsingular but det(X T X) is close to zero. If there are “near”


linear dependencies among the explanatory variables, the X T X matrix can be
“nearly” singular. We find
   
T −1 1 4.04 −4 25.25 −25
(X X) = =
0.16 −4 4 −25 25

Now recall that var(βbj ) = σ 2 cjj so in this case var(βb1 ) = 25.25σ 2 and var(βb2 ) =
25σ 2 which are both large. Also cov(βb1 , βb2 ) = −25σ 2 .

By contrast if
 
−1 −1  
 −1 1  T 4 0
X=
  ⇒X X= , det(X T X) = 16
1 −1  0 4
1 1
 
T −1 0.25 0
(X X) ==
0 0.25
so var(βbj ) = 0.25σ 2 for j = 1, 2 and cov(βb1 , βb2 ) = 0.
4.4. PROBLEMS WITH FITTING REGRESSION MODELS 77

In these simple cases we can see exactly where the problems are. With more vari-
ables it is not always obvious that some columns of the X matrix are close to being
linear combinations of other columns. This problem is sometimes called multi-
collinearity. These examples illustrate the general problems caused by multi-
collinearity:

(i) some or all parameter estimators will have large variances;

(ii) difficulties may arise in variable selection as it will be possible to get very
different models that fit equally well;

(iii) some parameters may have the “wrong” sign; this can be noticed when, for
example, it is obvious that increasing the value of a regressor should result
in an increase in the dependent variable.

4.4.2 Variance inflation factor (VIF)


The variance inflation factor can be used to indicate when multi-collinearity may
be a problem. Consider a regression with p − 1 predictors. Suppose we fitted a
regression model with Xj as a function of the remaining p − 2 explanatory vari-
ables. Let Rj2 be the coefficient of determination (not expressed as a percentage)
for this model. Then we define the jth variance inflation factor as
1
VIFj = .
1 − Rj2

A large value of Rj2 (close to 1) will give a large VIFj . In this context a VIF > 10
is taken to indicate that the multi-collinearity may cause problems of the sort noted
above. However, VIF > 4, sometimes even just bigger than 2, can indicate that
an explanatory variable could be excluded from the model.

We may be able to check for simple relationships between explanatory variables


Xi and Xj by plotting each Xi against Xj . Note that this may not reveal more
complex linear dependency between the variables.
78 CHAPTER 4. MODEL CHECKING
Chapter 5

Model Selection

5.1 Transformation of the response

Example 5.1. Plasma level of polyamine.


The plasma level of polyamine (Y ) was observed in 25 children of age 0 (new-

x = 0 20.12 16.10 10.21 11.24 13.35


x = 1 8.75 9.45 13.22 12.11 10.38
x = 2 9.25 6.87 7.21 8.44 7.55
x = 3 6.45 4.35 5.58 7.12 8.10
x = 4 5.15 6.12 5.70 4.25 7.98
Table 5.1: Plasma levels data

born) to 4 years old (X). The results are given in Table 5.1. We are interested
whether the level of polyamine decreases linearly while the age of children in-
creases up to four years. 

If the model checking suggests that the variance is not constant, or that the data
are not from a normal distribution (these often happen together) then it might be
possible to obtain a better model by transforming the observations yi . Commonly
used transformations are

• ln y; this is particularly good if Var(Yi ) ∝ [E(Yi )]2 .



• y; this is particularly good if Var(Yi ) ∝ E(Yi ).

79
80 CHAPTER 5. MODEL SELECTION

• 1/y.

These are special cases of a large family of transformations, the Box-Cox trans-
formation,  yλ −1
(λ) , when λ 6= 0;
y = λ
ln y, when λ = 0.
The Box-Cox transformation estimates the λ that minimizes the standard devia-
tion of a standardized transformed variable. Trigonometric functions are also used
in some cases, in particular the arc-sine or arc-tangent. In practice the log trans-
formation is often the most useful and is generally the first transformation we try,
but note all values of yi need to be positive.

5.2 Model Building

We have already mentioned the principle of parsimony; we should use the simplest
model that achieves our purpose.

It is easy to get a simple model (Yi = β0 + εi ) and it is easy to represent the


response by the data themselves. However, the first is generally too simple and
the second is not a useful model. Achieving a simple model that describes the
data well is something of an art. Often, there is more than one model which does
a reasonable job.
Example 5.2. Sales
A company is interested in the dependence of sales on promotional expenditure
(X1 in £1000), the number of active accounts (X2 ), the district potential (X3
coded), and the number of competing brands (X4 ). We will try to find a good
multiple regression model for the response variable Y (sales).
5.2. MODEL BUILDING 81

Figure 5.1: The Matrix Plot indicates that Y is clearly related to X4 and also to X2 . The relation
with other explanatory variables is not that obvious.

Let us start with fitting a simple regression model of Y as a function of X4 only.

The regression equation is


Y = 396 - 25.1 X4

Predictor Coef SE Coef T P


Constant 396.07 49.25 8.04 0.000
X4 -25.051 5.242 -4.78 0.000

S = 49.9868 R-Sq = 63.7% R-Sq(adj) = 60.9%

Analysis of Variance
Source DF SS MS F P
Regression 1 57064 57064 22.84 0.000
Residual Error 13 32483 2499
Total 14 89547

We can see that the residuals versus fitted values indicate that there may be non-
constant variance and also the linearity of the model is questioned. We will add
X2 to the model.
82 CHAPTER 5. MODEL SELECTION

The regression equation is


Y = 190 - 22.3 X4 + 3.57 X2

Predictor Coef SE Coef T P


Constant 189.83 10.13 18.74 0.000
X4 -22.2744 0.7076 -31.48 0.000
X2 3.5692 0.1333 26.78 0.000

S = 6.67497 R-Sq = 99.4% R-Sq(adj) = 99.3%

Analysis of Variance
Source DF SS MS F P
Regression 2 89012 44506 998.90 0.000
Residual Error 12 535 45
Total 14 89547

Source DF Seq SS
X4 1 57064
X2 1 31948

Still, there is some evidence that the standardized residuals may not have constant
variance. Will this be changed if we add X3 to the model?
5.2. MODEL BUILDING 83

The regression equation is


Y = 190 - 22.3 X4 + 3.56 X2 + 0.049 X3

Predictor Coef SE Coef T P


Constant 189.60 10.76 17.62 0.000
X4 -22.2679 0.7408 -30.06 0.000
X2 3.5633 0.1482 24.05 0.000
X3 0.0491 0.4290 0.11 0.911

S = 6.96763 R-Sq = 99.4% R-Sq(adj) = 99.2%

Analysis of Variance
Source DF SS MS F P
Regression 3 89013 29671 611.17 0.000
Residual Error 11 534 49
Total 14 89547

Source DF Seq SS
X4 1 57064
X2 1 31948
X3 1 1

Not much better than before. Now, we add X1 , the least related explanatory vari-
able to Y .
84 CHAPTER 5. MODEL SELECTION

The regression equation is


Y = 177 - 22.2 X4 + 3.54 X2 + 0.204 X3 + 2.17 X1

Predictor Coef SE Coef T P


Constant 177.229 8.787 20.17 0.000
X4 -22.1583 0.5454 -40.63 0.000
X2 3.5380 0.1092 32.41 0.000
X3 0.2035 0.3189 0.64 0.538
X1 2.1702 0.6737 3.22 0.009

S = 5.11930 R-Sq = 99.7% R-Sq(adj) = 99.6%

Analysis of Variance
Source DF SS MS F P
Regression 4 89285 22321 851.72 0.000
Residual Error 10 262 26
Total 14 89547

Source DF Seq SS
X4 1 57064
X2 1 31948
X3 1 1
X1 1 272

The residuals now do not contradict the model assumptions. We analyze the nu-
merical output. Here we see that X3 may be a redundant variable as we have no
evidence to reject the hypothesis that β3 = 0 given that all the other variables are
in the model. Hence, we will fit a new model without X3 .
5.2. MODEL BUILDING 85

The regression equation is


Y = 179 - 22.2 X4 + 3.56 X2 + 2.11 X1

Predictor Coef SE Coef T P


Constant 178.521 8.318 21.46 0.000
X4 -22.1880 0.5286 -41.98 0.000
X2 3.56240 0.09945 35.82 0.000
X1 2.1055 0.6479 3.25 0.008

S = 4.97952 R-Sq = 99.7% R-Sq(adj) = 99.6%

Analysis of Variance
Source DF SS MS F P
Regression 3 89274 29758 1200.14 0.000
Residual Error 11 273 25
Total 14 89547

Source DF Seq SS
X4 1 57064
X2 1 31948
X1 1 262

These residual plots also do not contradict the model assumptions, all the param-
eters are significant and R2 is very large.
86 CHAPTER 5. MODEL SELECTION

5.2.1 F-test for the deletion of a subset of variables

Suppose the overall regression model as tested by the Analysis of Variance table
is significant. We know that not all of the β parameters are zero, but we may still
be able to delete several variables.

We can carry out the Subset Test based on the extra sum of squares principle. We
are asking if we can reduce the set of explanatory variables.

X1 , X2 , . . . , Xp−1

to, say,
X1 , X2 , . . . , Xq−1

(renumbering if necessary) where q < p, by omitting Xq , Xq+1 , . . . , Xp−1 .

We are interested in whether the inclusion of Xq , Xq+1 , . . . , Xp−1 in the model


provides a significant increase in the overall regression sum of squares or equiva-
lently a significant decrease in residual sum of squares.

The difference between the sums of squares is called the extra sum of squares due
to Xq , . . . , Xp−1 given X1 , . . . , Xq−1 are already in the model and is defined by
the equation
5.2. MODEL BUILDING 87

SS(βq , . . . , βp−1 |β1 , . . . , βq−1 ) = SS(β1 , β2 , . . . , βp−1 ) − SS(β1 , β2 , . . . , βq−1 )


| {z } | {z }
regression SS for regression SS for
full model reduced model

= SSEred − SSE
| {z } |{z}
residual SS under residual SS under
reduced model full model.

Notation:
Let
β T1 = (β0 , β1 , . . . , βq−1 ) β T2 = (βq , βq+1 , . . . , βp−1 )
so that  
β1
β= .
β2
Similarly divide X into two submatrices X 1 and X 2 so that X = (X 1 , X 2 ),
where
   
1 x1,1 · · · xq−1,1 xq,1 · · · xp−1,1
X 1 =  ... .. ..  . ..
 X 2 =  .. .
  
. . .
1 x1,n · · · xq−1,n xq,n · · · xp−1,n

The full model


Y = Xβ + ε = X 1 β 1 + X 2 β 2 + ε
has
2 T
SSR = Y T HY − nY = βb X T Y − nȲ 2
T
SSE = Y T (I − H)Y = Y T Y − βb X T Y .

Similarly the reduced model

Y = X 1 β 1 + ε?

has
T
SSRred = βb1 X T1 Y − nȲ 2
T
SS red = Y T Y − βb X T Y .
E 1 1

Hence the extra sum of squares is


T T
SSextra = βb X T Y − βb1 X T1 Y .
88 CHAPTER 5. MODEL SELECTION

To determine whether the change in sum of squares is significant, we test the


hypothesis
H0 : βq = βq+1 = . . . = βp−1 = 0
versus
H1 : ¬H0
It can be shown that, if H0 is true,
SSextra /(p − q)
F = ∼ Fp−q,n−p ,
S2
where S 2 = M SE from the full model. So, we reject H0 at the α level if
F > Fα;p−q,n−p
and conclude that there is sufficient evidence that some (but not necessarily all) of
the ‘extra’ variables Xq , . . . , Xp−1 should be included in the model.

The ANOVA table is given by

Source d.f. SS MS VR
Overall regression p−1 SSR
X1 , .., Xq−1 q−1 SSRred
SSextra SSextra
Xq , .., Xp−1 |X1 , .., Xq−1 p−q SSextra p−q (p−q)M SE
Residual n−p SSE M SE
Total n−1 SST

In the ANOVA table we use the notation Xq , . . . , Xp−1 |X1 , . . . , Xq−1 to denote
that this is the effect of the variables Xq , . . . , Xp−1 given that the variables X1 , . . . , Xq−1
are already included in the model.

Note that we can repeatedly test individual parameters. The we have the following
Sums of Squares and degrees of freedom.

Source of variation df SS
Full model p−1 SSR
X1 1 SS(β1 )
X2 |X1 1 SS(β2 |β1 )
X3 |X1 , X2 1 SS(β3 |β1 , β2 )
.. ..
. .
Xp−1 |X1 , . . . , Xp−2 1 SS(βp−1 |β1 , . . . , βp−2 )
Residual n−p SSE
Total n−1 SST
5.2. MODEL BUILDING 89

The output depends on the order the predictors are entered into the model. The
sequential sum of squares is the unique portion of SSR explained by a predictor,
given any previously entered predictors. If we have a model with three predictors,
X1 , X2 , and X3 , the sequential sum of squares for X3 shows how much of the
remaining variation X3 explains given that X1 and X2 are already in the model.

5.2.2 All subsets regression

If there is no natural ordering to the explanatory variables, then it is desirable to


examine all possible subsets. For example, if we have three candidate explanatory
variables X1 , X2 and X3 , the possible models are
(j)
Yi ∼ N (µi , σ 2 ), j = 1, . . . , 8,
with
(1)
µi = β0
(2)
µi = β0 + β1 x1i
(3)
µi = β0 + β2 x2i
(4)
µi = β0 + β3 x3i
(5)
µi = β0 + β1 x1i + β2 x2i
(6)
µi = β0 + β1 x1i + β3 x3i
(7)
µi = β0 + β2 x2i + β3 x3i
(8)
µi = β0 + β1 x1i + β2 x2i + β3 x3i

There are 8 = 23 models. In general with p − 1 explanatory variables there are


2p−1 possible models, so even with p = 5 or 6 it is difficult to do a full comparison
of all models.

Instead we usually compare models by calculating a few statistics for each model.
Three statistics that are most useful are M SE , R2 and Cp .

Residual mean square M SE

If the full model with all candidate explanatory variables is correct then
E(M SE ) = σ 2 .
90 CHAPTER 5. MODEL SELECTION

If we have excluded one or more important variables then


E(M SEred ) > σ 2 .
Hence we may be able to identify the most appropriate model as being

(i) the one with the smallest number of explanatory variables (parameters) for
which M SEred is close to M SE of the full model;
(ii) the one with smallest M SEred .

Condition (i) aims for the simplest acceptable model. Condition (ii) is more con-
servative and should be considered carefully as it may just suggest the full model.

Denote by pe the number of parameters in the reduced model. For the full model
pe = p. Then, a sketch of the smallest M SEred for a given pe, denoted further by
M SEpe , against pe can be useful.

Coefficient of determination R2

The coefficient of determination obtained for a model with pe parameters will be


denoted by Rp2e, that is,
!
pe pe
SS SS
Rp2e = R
× 100 = 1 − E
× 100.
SST SST

The superscript pe indicates that the sums of squares are calculated for a model
with pe parameters.

Adding terms to a model always increases R2 . However, the model with pe param-
eters, for pe as small as possible, having Rp2e close to Rp2 (i.e., obtained from the full
model) might be regarded as being best. Judgement is required and a plot of Rp2e
against pe can be useful to identify where the plot levels off.

The adjusted R2 , obtained for a model with pe parameters, will be denoted by


Rp2e(adj), that is,
!
pe
SS /(n − p )
Rp2e(adj) = 1 − E
× 100.
e
SST /(n − 1)

It takes into account the number of parameters in the model and can be useful for
comparing models with different numbers of predictors.
5.2. MODEL BUILDING 91

Mallows’ statistic Cp

For a model with pe parameters we define


SSEpe
Cpe = p − n.
+ 2e
σ2
We have
E(SSEpe ) = (n − pe)σ 2
and so
(n − pe)σ 2
E(Cpe) = + 2ep − n = pe.
σ2
Hence we should choose a model with Cpe close to pe.

It can also be shown that Cpe is an estimator of the mean square error of prediction,
i.e.,
n
1X
[var(Ŷi ) + {bias(Ŷi )}2 ].
n i=1
This suggests minimizing Cpe. Thus, we should choose either

(i) the model which minimizes Cpe; or


(ii) a model with Cpe close to pe, with pe small.

Again a plot of Cpe versus pe is useful.

Note that Cpe depends on the unknown σ 2 . If we take M SE from the full model as
the estimator of σ 2 , then
pe
bpe = SSE + 2e
C p − n.
M SE
It can be shown that
(n − pe)2
E(Cpe) =
b + 2ep − n,
n − pe − 2
so instead we could use an adjusted Cpe defined by
n − pe − 2 SSEpe
C̄pe = p−n
+ 2e
n − pe M SE
with expectation pe. A little algebra shows that
n − pe − 2 b p − n)
2(2e
C̄pe = Cpe + .
n − pe n − pe
92 CHAPTER 5. MODEL SELECTION

Example 5.3. Sales continued.

Output for the best subset regression follows.

Best Subsets Regression: Y versus X1, X2, X3, X4


Response is Y
X X X X
Vars R-Sq R-Sq(adj) Mallows Cp S 1 2 3 4
1 63.7 60.9 1228.5 49.987 X
1 50.1 46.3 1694.0 58.627 X
1 9.3 2.3 3088.6 79.049 X
1 1.2 0.0 3364.9 82.496 X
2 99.4 99.3 11.4 6.6750 X X
2 68.1 62.7 1082.7 48.828 X X
2 64.2 58.2 1215.3 51.709 X X
2 50.9 42.7 1668.7 60.531 X X
3 99.7 99.6 3.4 4.9795 X X X
3 99.4 99.2 13.4 6.9676 X X X
3 69.0 60.5 1053.6 50.269 X X X
3 51.4 38.1 1653.8 62.903 X X X
4 99.7 99.6 5.0 5.1193 X X X X

We can see that the model including X1 , X2 and X4 has very good values of the
measures we have just talked about. Let us call this model M1 . Also, the full
model has very good values of these measures. Let us call the full model M2 .
Which one should we choose?

R2 , Mallows’ Cp and S are helpful, but it is not sufficient to base the final decision
on these measures only. We should also do the residual diagnostics for the final
competing models as well as hypothesis testing for the model parameters.

Below, we see that the residuals in neither of the two models contradict the model
assumptions, but if we add X3 given that X1 , X2 , X4 are already there, we do not
gain much (SS(β3 |β1 , β2 , β4 ) is very small). Also, we have no evidence to reject
the null hypothesis that β3 = 0 given the other three variables are in the model.

Hence, it is better to choose the model with fewer parameters, that is M1 rather
than M2 . Fewer parameters give us more degrees of freedom (n−p) for estimating
the error variance, that is, give a more precise estimate of σ 2 .
5.2. MODEL BUILDING 93

(a) (b)
Figure 5.2: (a) Residual plots for the model fit including X1 , X2 , X4 (b) Residual plots for the
model fit including all explanatory variables.

The regression equation is


Y = 177 + 2.17 X1 + 3.54 X2 - 22.2 X4 + 0.204 X3

Predictor Coef SE Coef T P


Constant 177.229 8.787 20.17 0.000
X1 2.1702 0.6737 3.22 0.009
X2 3.5380 0.1092 32.41 0.000
X4 -22.1583 0.5454 -40.63 0.000
X3 0.2035 0.3189 0.64 0.538

S = 5.11930 R-Sq = 99.7% R-Sq(adj) = 99.6%

Analysis of Variance
Source DF SS MS F P
Regression 4 89285 22321 851.72 0.000
Residual Error 10 262 26
Total 14 89547

Source DF Seq SS
X1 1 1074
X2 1 44505
X4 1 43695
X3 1 11
94 CHAPTER 5. MODEL SELECTION
Chapter 6

Interpretation of Fitted Models

6.1 Prediction

6.1.1 Prediction Interval for a new observation in simple linear


regression

Apart from making inference on the mean response we may also try to do it for
a new response itself, that is for an unknown (not observed) response at some x0 .
For example, we might want to predict an overhead cost for another department
of the same structure whose total labor hours are x0 (Example 3.2). In this section
we derive a Prediction Interval (PI) for a response

Y 0 = β 0 + β 1 x0 + ε0 = µ 0 + ε0 , ε0 ∼ N (0, σ 2 )

for which the point prediction is µ


b0 = βb0 + βb1 x0 .

By Theorem 3.4 we have


b0 ∼ N (µ0 , aσ 2 ),
µ
 
1 (x0 −x̄)2
where a = n
+ Sxx
.

To obtain a prediction interval (PI) for the unknown observation we may use the
point predictor and its distribution as follows. First, we will find the distribution
b0 − Y0 . Note that for
of µ

b0 − Y0 = µ
µ b0 − (µ0 + ε0 ),

95
96 CHAPTER 6. INTERPRETATION OF FITTED MODELS

µ0 − Y0 ) = 0 and
we have E(b

µ0 ) + var(µ0 + ε0 ) = aσ 2 + σ 2 = σ 2 (1 + a).
µ0 − Y0 ) = var(b
var(b

This is because µ b0 is the estimator based on the random sample Y1 , . . . , Yn and


not on Y0 , i.e., it is independent of Y0 . We get,

b0 − Y0 ∼ N (0, σ 2 (1 + a)).
µ

b0 − Y0 and replacing σ 2 by its estimator S 2 gives


Standardizing µ

b − Y0
µ
p 0 ∼ tn−2 .
S 2 [1 + a]

Hence, a (1 − α)100% PI for Y0 is


s  
1 (x0 − x̄)2
b0 ± t α2 ,n−2
µ S2 1+ + .
n Sxx

This interval is wider than the CI for the mean response µ0 . This is because to
predict a new observation rather than a mean, we need to add the variability of the
additional random error ε0 . Again, we should only make predictions for values of
x0 within the range of the data.
For the example on the overhead cost (Example 3.2) the confidence and prediction
intervals (here they are for x0 = 1000 hours) are:

Predicted Values for New Observations


New
Obs Fit SE Fit 95% CI 95% PI
1 27292 428 (26374, 28210) (23645, 30939)

Values of Predictors for New Observations


New Obs x
1 1000

We may say, with 95% confidence, that when the total direct labour hours are equal to
1000, then the expected total departmental cost would be between £26374 and £28210,
however if we were to observe the total cost for a 1000 hours of labour it might be anything
between £23645 and £30939.
6.1. PREDICTION 97

Figure 6.1: Data, fitted line plot, CI for the mean and PI for a new observation at
any x0 .

6.1.2 Predicting a new observation in general regression

To predict a new observation we need to take into account not only its expectation, but
also a possible new random error.

The point estimator of a new observation



Y0 = Y |X1 = x1,0 , . . . , Xp−1 = xp−1,0 = µ0 + ε0
is
Yb0 = xT0 βb (= µ
b0 ),
which, assuming normality, is such that
Yb0 ∼ N (µ0 , σ 2 xT0 (X T X)−1 x0 ).
Then,
Yb − Y0 = Yb0 − (µ0 + ε0 ) ∼ N (0, σ 2 xT0 (X T X)−1 x0 + σ 2 ).
Hence,
Yb0 − Y0
q ∼ N (0, 1).
σ 2 {1 + xT0 (X T X)−1 x0 }
As usual we estimate σ 2 by S 2 and get
Yb0 − Y0
q ∼ tn−p .
2 T T −1
S {1 + x0 (X X) x0 }

Hence a 100(1 − α)% prediction interval for Y0 is given by


q
Y0 ± t 2 ,n−p S 2 {1 + xT0 (X T X)−1 x0 }.
b α
98 CHAPTER 6. INTERPRETATION OF FITTED MODELS

6.2 Polynomial regression

Another useful class of linear models are polynomial regression models, e.g.,

Yi = β0 + β1 xi + β11 x2i + εi ,

the quadratic regression model. This can be written as

Y = Xβ + ε, ε ∼ N n (0, σ 2 I),

where rows of matrix X are of the form (1, xi , x2i ) and β = (β0 , β1 , β11 )T . The
quadratic model belongs to the class of linear models as it is linear in the parameters.

If we wish to compare the quadratic regression model with the simple linear regression
model we fit Yi = β0 +β1 xi +β11 x2i +εi and test the null hypothesis H0 : β11 = 0 against
an alternative H1 : β11 6= 0. If we reject H0 the quadratic model gives a significantly
better fit than the simple linear model. This can be extended to cubic and higher order
polynomials. As higher powers of x quickly become large it is usually sensible to centre
x by subtracting its mean. Denote

zi = xi − x.

Then, for some parameters γ we can write,

E(Yi ) = γ0 + γ1 zi + γ11 zi2


= γ0 + γ1 (xi − x) + γ11 (xi − x)2
= (γ0 − γ1 x + γ11 x2 ) + (γ1 − 2γ11 x)xi + γ11 x2i
= β0 + β1 xi + β11 x2i .

We can also have a second (or higher) order polynomial regression model in two (or more)
explanatory variables. For example,

Yi = β0 + β1 x1i + β2 x2i + β11 x21i + β22 x22i + β12 x1i x2i + εi .

This model is very commonly used in experiments for exploring response surfaces. Note
that if the second order terms x21i , x22i and x1i x2i are in the model then we should not
consider removing the first order terms x1i and x2i .

In fact, all models of the form

Yi = β0 + β1 f1 (e
xi ) + . . . + βp−1 fp−1 (e
xi ) + εi ,

where xe i = (x1i , . . . , xp−1,i )T and f is a linear or non-linear function in any of the


explanatory variables, are linear models in the parameters and can be written in the matrix
notation as
Y = Xβ + ε, ε ∼ N n (0, σ 2 I),
6.2. POLYNOMIAL REGRESSION 99

where the rows of X are (1, f1 (e


xi ), . . . , fp−1 (e
xi )). Yi is often written in the vector
notation as
Yi = f T (e xi )β + εi ,
where f T (e
xi ) is the i-th row of matrix X. The special case of f T (e
xi ) = xT
i gives

Yi = x T
i β + εi ,

the MLR model.


Example 6.1. Crop Yield
An agronomist studied the effects of moisture (X1 in inches) and temperature (X2
in ◦ C) on the yield (Y ) of a new hybrid tomato. He recorded values of 25 random
samples of yield obtained for various levels of moisture and temperature. The data
are below.

y x1 x2 y x1 x2 y x1 x2 y x1 x2 y x1 x2
49.2 6 20 51.5 8 20 51.1 10 20 48.6 12 20 43.2 14 20
48.1 6 21 51.7 8 21 51.5 10 21 47.0 12 21 42.6 14 21
48.0 6 22 50.4 8 22 50.3 10 22 48.0 12 22 42.1 14 22
49.6 6 23 51.2 8 23 48.9 10 23 46.4 12 23 43.9 14 23
47.0 6 24 48.4 8 24 48.7 10 24 46.2 12 24 40.5 14 24

Below, there is the output for the model


2 2
Yi = γ0 + γ1 z1i + γ2 z2i + γ11 z1i + γ22 z2i + γ12 z1i z2i + εi ,

where z1 = x1 − x1 and z2 = x2 − x2 .
100 CHAPTER 6. INTERPRETATION OF FITTED MODELS

The regression equation is


y = 50.4 - 0.762 z1 - 0.530 z2 - 0.293 z1ˆ2 - 0.139 z2ˆ2 - 0.0055 z1z2

Predictor Coef SE Coef T P VIF


Constant 50.3840 0.3349 150.45 0.000
z1 -0.76200 0.06029 -12.64 0.000 1.000
z2 -0.5300 0.1206 -4.40 0.000 1.000
z1ˆ2 -0.29286 0.02548 -11.50 0.000 1.000
z2ˆ2 -0.1386 0.1019 -1.36 0.190 1.000
z1z2 -0.00550 0.04263 -0.13 0.899 1.000

S = 0.852563 R-Sq = 94.3% R-Sq(adj) = 92.8%


PRESS = 22.8584 R-Sq(pred) = 90.53%

Analysis of Variance
Source DF SS MS F P
Regression 5 227.587 45.517 62.62 0.000
Residual Error 19 13.810 0.727
Total 24 241.398

Source DF Seq SS
z1 1 116.129
z2 1 14.045
z1ˆ2 1 96.057
z2ˆ2 1 1.344
z1z2 1 0.012

The sequential sums of squares suggest that we can drop z22 and the product z1 z2 .
6.2. POLYNOMIAL REGRESSION 101

The regression equation is


y = 50.1 - 0.762 z1 - 0.530 z2 - 0.293 z1ˆ2

Predictor Coef SE Coef T P VIF


Constant 50.1069 0.2649 189.17 0.000
z1 -0.76200 0.06009 -12.68 0.000 1.000
z2 -0.5300 0.1202 -4.41 0.000 1.000
z1ˆ2 -0.29286 0.02539 -11.53 0.000 1.000

S = 0.849836 R-Sq = 93.7% R-Sq(adj) = 92.8%


PRESS = 22.0520 R-Sq(pred) = 90.86%

Analysis of Variance
Source DF SS MS F P
Regression 3 226.231 75.410 104.41 0.000
Residual Error 21 15.167 0.722
Total 24 241.398

Source DF Seq SS
z1 1 116.129
z2 1 14.045
z1ˆ2 1 96.057
102 CHAPTER 6. INTERPRETATION OF FITTED MODELS

Here we see that the residuals are not normal (the p-value is 0.013 for the test
of normality). Hence, some further analysis is needed. Various transformations
of the response variable did not work here. The residuals improve when z2 is
removed. The new model fit is below.

The regression equation is


y = 50.1 - 0.762 z1 - 0.293 z1ˆ2

Predictor Coef SE Coef T P VIF


Constant 50.1069 0.3591 139.52 0.000
z1 -0.76200 0.08148 -9.35 0.000 1.000
z1ˆ2 -0.29286 0.03443 -8.51 0.000 1.000

S = 1.15230 R-Sq = 87.9% R-Sq(adj) = 86.8%


PRESS = 37.8736 R-Sq(pred) = 84.31%

Analysis of Variance
Source DF SS MS F P
Regression 2 212.19 106.09 79.90 0.000
Residual Error 22 29.21 1.33
Lack of Fit 2 0.39 0.19 0.13 0.875
Pure Error 20 28.82 1.44
Total 24 241.40

Source DF Seq SS
z1 1 116.13
z1ˆ2 1 96.06

The residuals are slightly better here and do not clearly contradict the assump-
tion of normality. The Lack of Fit test does not indicate any evidence against the
model. R2 is good. The model is parsimonious, so we may stay with this one.
However, we could advise the agronomist that in future experiments of this kind
6.2. POLYNOMIAL REGRESSION 103

he might consider a wider range of temperature values, which would help to es-
tablish clearly whether this factor could be significant for yield of the new hybrid
tomatoes.
104 CHAPTER 6. INTERPRETATION OF FITTED MODELS
Chapter 7

Qualitative Explanatory Variables

7.1 Simple Comparative Experiments

It is common to have at least one qualitative explanatory variable. Many experi-


ments involve only comparing discrete treatments.

Example: Pulp experiment

In a paper pulping mill, an experiment was run to examine differences between


the reflectance (brightness) of sheets of pulp made by 4 operators.

Operator
1 2 3 4
59.8 59.8 60.7 61.0
60.0 60.2 60.7 60.8
60.8 60.4 60.5 60.6
60.8 59.9 60.9 60.5
59.8 60.0 60.3 60.5

- one factor (operator) with four levels (one-way layout).

Model: We could write down the model

Y = X f βf + ε , ε ∼ N (0, Iσ 2 ) , (7.1)

105
106 CHAPTER 7. QUALITATIVE EXPLANATORY VARIABLES

where:

Y - 20 × 1

Xf - 20 × 5

βf - 5 × 1

ε - 20 × 1.

Equivalently,

Yij = β0f + β1f x1j + β2f x2j + β3f x3j + β4f x4j + εij ,

where


1 if k = i
xkj = ,
0 otherwise

and i = 1, . . . , 4 and j = 1, . . . , 5.

In (7.1)

 
1 1 0 ... 0
. .. ..
 1 ..
 
. . 

 1 1 .. 
 0 . 

 . ..
 .. 0

 1 . 

Xf =  ... ... .. ..
,
 
. .
 .. ..
 
 . . 1 0


 . . 
 .. .. 0 1 
 
 . . .. ..
 .. ..

. . 
1 0 0 ... 1

and
7.1. SIMPLE COMPARATIVE EXPERIMENTS 107

 
β0f

 β1f 

βf = 
 β2f 

 β3f 
β4f

Hence, for treatment i

E(Y ) = β0f + βif .

However, we can only make comparative statements about the treatments, not
absolute. If we try to estimate β from model (7.1) as

β̂f = (X T X)−1 X T Y ,

we will find that X T X is singular, as it does not have full column rank. The sum
of the columns of X equals a column of 1’s; the last 4 columns sum to form the
first.

We can estimate 3 comparisons among 4 treatments, and must formulate our


model accordingly. For example, set treatment 4 as a baseline, and estimate dif-
ferences from this treatment:

 
1 1 0 0
.. .. .. ..
. . . .
 
 
 .. .. 

 . 1 0 . 

 .. .. 

 . 0 1 . 

 .. .. .. .. 
 . . . . 
β0

 .. .. 
. . 1 0 β1 
 
Y =  + ε.
 
.. .. 
β2 
 . 0. 1 
β3
 
 .. ..
.. .. 
 . .
. . 
.. ..
..
 
. .
. 1
 
 
 .. ..
.. 

 . .
. 0 

 .. .. .. .. 
 . . . . 
1 0 0 0
108 CHAPTER 7. QUALITATIVE EXPLANATORY VARIABLES

Hence, expected responses for treatments i = 1, 2, 3 are

E(Y ) = β0 + βi ,

and for treatment 4,

E(Y ) = β0 .

Therefore, β0 now measures the expected response from treatment 4, and βi (i =


1, 2, 3) measures the expected difference in response between treatment i and
treatment 4.

Result: Regardless of the comparisons we choose to examine, Ŷ = X β̂ is always


the same; i.e. a reparameterisation of our model does not change the predictions
or fitted values.

Proof: (for equally replicated case)

If Yij ∼ N (f (xi )T β, σ 2 ), then

r
1X
Ȳi = Yij ∼ N (f (xi )T β, σ 2 /r) . (7.2)
r j=1

Hence, we can write our linear model in terms of (7.2) as

Ȳ = X̄β + ε̄ ,

with



Ȳ1
Ȳ =  ...  ,
 
Ȳp

 
1 1 0 0
..
. 0 1 0 
 
X̄ =  ,

.. ..
 . . 0 1 
1 0 0 0
7.1. SIMPLE COMPARATIVE EXPERIMENTS 109

and

σ2
 
ε̄ ∼ N 0, I .
r

Note that β holds the same parameters (the mean does not change).

We now have
β̂ = (X̄ T X̄)−1 X̄ T Ȳ ,

and

X̄ β̂ = X̄(X̄ T X̄)−1 X̄ T Ȳ
= X̄ X̄ −1 (X̄ T )−1 X̄ T Ȳ
= Ȳ .

[as (AB)−1 = B −1 A−1 for A, B non-singular square matrices.]

Hence Ŷ = Ȳ , the mean response for each treatment regardless of the form of
f (xi )T β.

7.1.1 ANOVA

Source df SS MS
T T 2
Treatment p − 1 β̂ (X X)β̂ − N Ȳ SS/(p − 1)
T
Residual N − p (Y − X β̂) (Y − X β̂) SS/(N − p)
Total N −1 Y T Y − N Ȳ 2

In this table,

β̂ T (X T X)β̂ − N Ȳ 2 = (X β̂)T X β̂ − N Ȳ 2
= Ŷ T Ŷ − N Ȳ 2 ,

which is invariant to the choice of comparisons.

For the pulp experiment,


110 CHAPTER 7. QUALITATIVE EXPLANATORY VARIABLES

Source df SS MS
Operator 3 1.34 0.447
Residual 16 1.70 0.106
Total 19 3.04

Comparison of mean squares: under H0 : β1 = β2 = β3 = 0,

Treatment MS
∼ Fp−1,N −p
Residual MS
∼ F3,16 .

For the pulp experiment,

P (F3,16 > 4.20) = 0.02 < 0.05

Therefore, there is evidence to reject H0 .

7.2 Factorial Experiments

In many experiments, interest lies in the study of the effects of two or more factors
simultaneously.

Example: Desilylation example from GlaxoSmithKline. In this experiment, the


aim was to optimise the desilylation of an ether into an alcohol; a key step in the
synthesis of a particular antibiotic. The response is the yield of alcohol, and there
are four factors which can be controlled:

Units -1 (low) +1 (high)



Temp C 10 20
Time Hours 19 25
Concentration vol 5 7
of solvent
Equivalents equiv. 1 1.33
of reagent
7.2. FACTORIAL EXPERIMENTS 111

We use coded units: −1 for the low level and +1 for the high level.

A treatment (which can be applied to an experimental unit) is now given by a combination of factor values, e.g.

-1, -1, -1, -1 (low, low, low, low; 10, 19, 5, 1)

+1, -1, +1, +1 (high, low, high, high; 20, 19, 7, 1.33).

7.2.1 Main Effects and Interactions

What comparisons among the treatments might be of interest here?

Main effects: to measure the average effect of a factor, say A, we can compute

ME(A) = average response when A = +1 − average response when A = −1


= Ȳ (A+) − Ȳ (A−) .

For example, ME(temp) = [average response when temp = +1 (20°C), i.e. over all treatments of the form (+1, ∗, ∗, ∗)] − [average response when temp = −1 (10°C), i.e. over all treatments of the form (−1, ∗, ∗, ∗)].

This is the effect of changing temperature from low to high, averaged across all other factor levels.

The main effect is often displayed as a main effects plot, e.g.



[Main effects plot: expected response plotted against the two levels (−1, +1) of factor A.]

Of course, the response need not increase with A.


[Two further main effects plots of expected response against factor A, illustrating other possible patterns.]

Interactions: We can measure the joint effect of changing two or more factors
simultaneously through an interaction.

A two factor interaction can be interpreted as one-half the difference in the main
effect of A when B is set to its high and low levels:

$$\mathrm{Int}(A, B) = \tfrac{1}{2}\left[\mathrm{ME}(A|B+) - \mathrm{ME}(A|B-)\right] = \tfrac{1}{2}\left[\mathrm{ME}(B|A+) - \mathrm{ME}(B|A-)\right],$$
where

ME(A|B+) = [average response when A = +1 and B = +1] − [average response when A = −1 and B = +1], etc.

Two factor interactions are often displayed in interaction plots



[Interaction plot: expected response against factor A (levels −1 and +1), with one line for B = +1 and one for B = −1.]

The above interaction is positive: ME(A|B+)>ME(A|B−).



[A second interaction plot of expected response against factor A, again with lines for B = −1 and B = +1.]

The above interaction is negative: ME(A|B+)<ME(A|B−).



[A third interaction plot, in which the B = −1 and B = +1 lines are roughly parallel.]

Parallel lines imply there is no interaction; ME(A|B+) ≈ ME(A|B−).

We can define higher order interactions similarly, e.g. the ABC interaction mea-
sures how the AB interaction changes with the levels of C:

$$\mathrm{Int}(A, B, C) = \tfrac{1}{2}\left[\mathrm{Int}(A, B|C+) - \mathrm{Int}(A, B|C-)\right]
= \tfrac{1}{2}\left[\mathrm{Int}(A, C|B+) - \mathrm{Int}(A, C|B-)\right]
= \tfrac{1}{2}\left[\mathrm{Int}(B, C|A+) - \mathrm{Int}(B, C|A-)\right].$$
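These factorial effects are simply contrasts of treatment means and can be computed directly from the coded design columns. A minimal sketch for a 2^3 experiment with made-up responses (the numbers are hypothetical, not from the desilylation study):

    import numpy as np
    from itertools import product

    # Full 2^3 design in standard order; each row is (x1, x2, x3) in coded units
    design = np.array(list(product([-1, 1], repeat=3)))
    y = np.array([60., 72., 54., 68., 52., 83., 45., 90.])   # hypothetical responses
    x1, x2, x3 = design.T

    me_A = y[x1 == 1].mean() - y[x1 == -1].mean()            # ME(A)
    int_AB = 0.5 * ((y[(x1 == 1) & (x2 == 1)].mean() - y[(x1 == -1) & (x2 == 1)].mean())
                  - (y[(x1 == 1) & (x2 == -1)].mean() - y[(x1 == -1) & (x2 == -1)].mean()))

    # Each effect is also twice the average of y weighted by the corresponding +/-1 column:
    print(me_A, 2 * np.mean(x1 * y))          # both 4.0
    print(int_AB, 2 * np.mean(x1 * x2 * y))   # both 2.5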

7.2.2 Factorial Experiments

For m factors, each having two levels, there are 2^m combinations (treatments) of factor values.

If we have sufficient resource, we could run each of these 2^m treatments in our experiment; this is called a 2^m full factorial design. For example, with m = 2 factors:

[Design points of the 2^2 factorial: the four corners (−1,−1), (+1,−1), (−1,+1), (+1,+1) of a square in the (x1, x2) plane.]

The design points in a two-level factorial design are always the corners of a hy-
percube; for m = 3 factors:

[Design points of the 2^3 factorial: the eight corners of a cube in (x1, x2, x3), from (−1,−1,−1) to (+1,+1,+1).]

Advantages:

• vary all factors simultaneously, i.e. include points like (+1,+1,+1) which
would not be included in a one factor at a time experiment;
• allows estimation of interactions;
• more efficient for estimation of main effects than one factor at a time
– all observations are used in calculation of each factorial effect;
• better coverage of design space.

Disadvantage:

• can get very big designs for even moderate m.

We call these designs 2^m (full) factorial designs. A design may be unreplicated (one run of each treatment combination) or replicated, with each treatment combination included r times in the experiment.

Example 5 cont.: Desilylation Experiment; 2^4 unreplicated factorial design (16 runs, one for each treatment). The design is given by:

x1 x2 x3 x4
-1 -1 -1 -1
-1 -1 -1 +1
-1 -1 +1 -1
-1 -1 +1 +1
-1 +1 -1 -1
-1 +1 -1 +1
-1 +1 +1 -1
-1 +1 +1 +1
+1 -1 -1 -1
+1 -1 -1 +1
+1 -1 +1 -1
+1 -1 +1 +1
+1 +1 -1 -1
+1 +1 -1 +1
+1 +1 +1 -1
+1 +1 +1 +1

Each row is a treatment combination in our experiment.
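The same table can be generated in code (a sketch; itertools.product in this order reproduces the standard ordering above, with x4 changing fastest):

    from itertools import product
    import numpy as np

    m = 4
    design = np.array(list(product([-1, +1], repeat=m)))   # 2^4 = 16 treatment combinations
    print(design.shape)    # (16, 4)
    print(design[:2])      # first two rows: (-1,-1,-1,-1) and (-1,-1,-1,+1), as in the table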



7.2.3 Regression Modelling for Factorial Experiments

We again use a linear model:

$$Y_{ij} = \beta_0 + \sum_{l=1}^{m}\beta_l x_{il}
+ \sum_{k=1}^{m-1}\sum_{l=k+1}^{m}\beta_{kl}\,x_{ik}x_{il}
+ \sum_{k=1}^{m-2}\sum_{l=k+1}^{m-1}\sum_{q=l+1}^{m}\beta_{klq}\,x_{ik}x_{il}x_{iq}
+ \cdots + \varepsilon_{ij}, \qquad (7.3)$$

for i = 1, . . . , 2^m, j = 1, . . . , r, with

$$x_{ik} = \begin{cases} -1 & \text{if the $k$th factor is set to its low level in run $i$,} \\ +1 & \text{if the $k$th factor is set to its high level in run $i$.} \end{cases}$$

In matrix form:

Y = Xβ + ε ,

where:

Y - N × 1 response vector, N = r 2^m;

X - N × p model matrix;

β - p × 1 vector of model parameters;

ε - iid error vector.

The least squares normal equations are the same as before:

X T X β̂ = X T Y .

For a factorial design,



$$X^T X = N I$$

where I is a p × p identity matrix. This is because factorial designs are orthogonal: for every pair of factors, every combination of levels appears the same number of times. Factorial designs are also level-balanced: for each factor column, each level (−1, +1) appears the same number of times. A consequence of orthogonality and balance is that

$$\hat{\beta} = \frac{1}{N} X^T Y.$$

That is, all regression parameters are estimated independently and there is no need
to make adjustments for other terms in the model; fitting submodels of (7.3) does
not change the parameter estimates.
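A quick numerical check of this orthogonality (a sketch in Python): build the full model matrix for a 2^3 design, with intercept, main effect and interaction columns, and verify X^T X = N I; the responses are hypothetical.

    import numpy as np
    from itertools import product, combinations

    design = np.array(list(product([-1, 1], repeat=3)))
    N = design.shape[0]

    cols = [np.ones(N)]                                  # intercept column
    for order in range(1, 4):                            # main effects, 2- and 3-factor interactions
        for idx in combinations(range(3), order):
            cols.append(np.prod(design[:, list(idx)], axis=1))
    X = np.column_stack(cols)                            # 8 x 8 model matrix

    print(np.allclose(X.T @ X, N * np.eye(X.shape[1])))  # True: X^T X = N I
    y = np.array([60., 72., 54., 68., 52., 83., 45., 90.])   # hypothetical responses
    beta_hat = X.T @ y / N                               # least squares estimates, no inversion needed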

Relationship between regression parameters and factorial effects: fixing x2, . . . , xm, the change in expected response from x1 = −1 to x1 = +1 is given by

$$E(Y \mid x_1 = +1) - E(Y \mid x_1 = -1) = (\beta_0 + \beta_1 + \cdots) - (\beta_0 - \beta_1 + \cdots) = 2\beta_1 = \mathrm{ME}(x_1),$$

since all the other terms cancel. Hence ME(x_i) = 2β_i.

Similarly for interactions, e.g. Ȳ (x1 x2 = +1) − Ȳ (x1 x2 = −1),

$$E(Y \mid x_1x_2 = +1) - E(Y \mid x_1x_2 = -1) = 2\beta_{12} = \mathrm{Int}(x_1, x_2),$$

so in general Int(x_i, x_j) = 2β_{ij}.

For the desilylation example, r = 1, N = 2^m = 16, p = 16, and
$$X^T X = 16 I,$$

  
$$\hat{\beta} = \frac{1}{N}X^TY = \frac{1}{16}
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
-1 & -1 & \cdots & 1 \\
\vdots & \vdots & & \vdots
\end{pmatrix}
\begin{pmatrix} Y_1 \\ \vdots \\ Y_{16} \end{pmatrix},$$

giving the estimates

Estimate    Parameter                               Term
89.94       β0                                      (intercept)
4.06        β1 (temp)                               x1
1.28        β2 (time)                               x2
−1.11       β3 (conc.)                              x3
1.54        β4 (reagent)                            x4
−1.18       β12 (time×temp)                         x1 x2
1.18        β13                                     x1 x3
−1.39       β14                                     x1 x4
0.22        β23                                     x2 x3
−0.32       β24                                     x2 x4
0.25        β34                                     x3 x4
0.123       β123 (temp×time×conc.)                  x1 x2 x3
0.10        β124                                    x1 x2 x4
−0.02       β134                                    x1 x3 x4
−0.12       β234                                    x2 x3 x4
0.10        β1234 (temp×time×conc.×reagent)         x1 x2 x3 x4

The factorial effects are given by 2β; e.g. ME(x1 ) = 8.12, Int(x1 x2 ) = −2.36.

7.2.4 Analysis of Variance

Source          df              SS
Regression      2^m − 1         β̂^T X^T X β̂ − N Ȳ²
  x1            1               N β̂1²   (*)
  x2            1               N β̂2²
  ...           ...             ...
  x4            1               N β̂4²
  x1 x2         1               N β̂12²
  ...           ...             ...
  x1 x2 x3 x4   1               N β̂1234²
Residual        2^m (r − 1)     (Y − X β̂)^T (Y − X β̂)
Total           2^m r − 1

As before, the regression sum of squares is given by
$$\begin{aligned}
\text{Regression SS} &= \text{RSS(mean)} - \text{RSS(model)} = \text{total SS} - \text{RSS} \\
&= Y^TY - N\bar{Y}^2 - (Y - X\hat{\beta})^T(Y - X\hat{\beta}) \\
&= \hat{\beta}^TX^TX\hat{\beta} - N\bar{Y}^2.
\end{aligned}$$

Expression (*) in the ANOVA table is formed as

$$SS(x_1) = \text{RSS(mean)} - \text{RSS(mean} + x_1) = Y^TY - N\bar{Y}^2 - (Y - X_1\hat{\beta}_1)^T(Y - X_1\hat{\beta}_1),$$

where
$$X_1 = \begin{pmatrix}
1 & -1 \\
\vdots & \vdots \\
1 & -1 \\
1 & +1 \\
\vdots & \vdots \\
1 & +1
\end{pmatrix}, \qquad
\hat{\beta}_1 = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}.$$

Hence,

$$SS(x_1) = \hat{\beta}_1^T X_1^T X_1 \hat{\beta}_1 - N\bar{Y}^2.$$

However, the design is orthogonal and so

 
$$X_1^T X_1 = \begin{pmatrix} N & 0 \\ 0 & N \end{pmatrix} = N I,$$

and

$$SS(x_1) = \underbrace{N\hat{\beta}_0^2}_{N\bar{Y}^2} + N\hat{\beta}_1^2 - N\bar{Y}^2 = N\hat{\beta}_1^2.$$

Other sums of squares are similar; as they are all independent, it does not matter
in which order we compare the models.

Note: if r = 1 (single replicate design), the Residual df = 2^m(r − 1) = 0 and RSS = (Y − X β̂)^T (Y − X β̂) = 0. This means we cannot conduct hypothesis testing and we have no estimate of σ².
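Continuing the hypothetical 2^3 sketch from Section 7.2.3 above: with r = 1 and the saturated model, X is square, so the fitted values reproduce the data exactly and the RSS is zero.

    # X, beta_hat and y as in the 2^3 sketch above (hypothetical data)
    import numpy as np

    y_hat = X @ beta_hat
    print(np.allclose(y_hat, y))              # True: fitted values equal the observations
    print(float((y - y_hat) @ (y - y_hat)))   # RSS = 0.0 (up to rounding error)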
Chapter 8

Generalised Linear Models

8.1 The Exponential family

A probability distribution is said to be a member of the exponential family if its probability density function (or probability function, if discrete) can be written in the form
$$f_Y(y;\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right\}. \qquad (1)$$
The parameter θ is called the natural or canonical parameter. The parameter φ
is usually assumed known. If it is unknown then it is often called the nuisance
parameter.

The density (1) can be thought of as a likelihood resulting from a single observa-
tion y. Then
$$\log f_Y(y;\theta,\phi) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)$$
$$\Rightarrow\quad u(\theta) = \frac{\partial}{\partial\theta}\log f_Y(y;\theta,\phi) = \frac{y - b'(\theta)}{a(\phi)},$$

where u(θ) is the score.


$$\Rightarrow\quad H(\theta) = \frac{\partial^2}{\partial\theta^2}\log f_Y(y;\theta,\phi) = -\frac{b''(\theta)}{a(\phi)}$$
$$\Rightarrow\quad I(\theta) = E[-H(\theta)] = \frac{b''(\theta)}{a(\phi)},$$

where H(θ) is the Hessian and I(θ) is the Fisher information matrix. From the
properties of the score function we know that E[U (θ)] = 0. Therefore


$$E\left[\frac{Y - b'(\theta)}{a(\phi)}\right] = 0 \quad\Rightarrow\quad E[Y] = b'(\theta).$$

Furthermore,
$$\operatorname{Var}[U(\theta)] = \operatorname{Var}\left[\frac{Y - b'(\theta)}{a(\phi)}\right] = \frac{\operatorname{Var}[Y]}{a(\phi)^2},$$
as b'(θ) and a(φ) are constants (not random variables). Now, we also know that Var[U(θ)] = I(θ). Therefore,
$$\operatorname{Var}[Y] = a(\phi)^2\operatorname{Var}[U(\theta)] = a(\phi)^2 I(\theta) = a(\phi)b''(\theta),$$

and hence the mean and variance of a random variable with probability density function (or probability function) of the form (1) are b'(θ) and a(φ)b''(θ) respectively.
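These two identities can be checked symbolically (a minimal sketch using sympy, taking the Poisson case b(θ) = exp θ, a(φ) = 1 that appears in Example 8.2 below):

    import sympy as sp

    theta = sp.symbols('theta')
    b = sp.exp(theta)                    # b(theta) for the Poisson distribution
    a = 1                                # a(phi) for the Poisson distribution

    mean = sp.diff(b, theta)             # E(Y) = b'(theta)
    variance = a * sp.diff(b, theta, 2)  # Var(Y) = a(phi) * b''(theta)
    print(mean, variance)                # exp(theta) and exp(theta), i.e. both equal lambda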

We often denote the mean by µ, so µ = b'(θ). The variance is the product of two functions: b''(θ) depends on the canonical parameter θ (and hence µ) only and is called the variance function (V(µ) ≡ b''(θ)); a(φ) is sometimes of the form a(φ) = σ²/w, where w is a known weight and σ² is called the dispersion parameter or scale parameter.

♥ Example 8.1. Normal distribution, Y ∼ N(µ, σ²)

$$f_Y(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y-\mu)^2\right\}, \qquad y \in \mathbb{R};\ \mu \in \mathbb{R}$$
$$= \exp\left\{\frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{1}{2}\left[\frac{y^2}{\sigma^2} + \log(2\pi\sigma^2)\right]\right\}.$$

This is in the form (1), with θ = µ, b(θ) = ½θ², a(φ) = σ² and
$$c(y,\phi) = -\frac{1}{2}\left[\frac{y^2}{a(\phi)} + \log(2\pi a(\phi))\right].$$
Therefore
$$E(Y) = b'(\theta) = \theta = \mu, \qquad \operatorname{Var}(Y) = a(\phi)b''(\theta) = \sigma^2, \qquad V(\mu) = 1.$$

♥ Example 8.2. Poisson distribution, Y ∼ Poisson(λ)

$$f_Y(y;\lambda) = \frac{\exp(-\lambda)\lambda^y}{y!}, \qquad y \in \{0,1,\ldots\};\ \lambda \in \mathbb{R}^+$$
$$= \exp\left(y\log\lambda - \lambda - \log y!\right)$$

This is in the form (1), with θ = log λ, b(θ) = exp θ, a(φ) = 1 and c(y, φ) = − log y!. Therefore
$$E(Y) = b'(\theta) = \exp\theta = \lambda, \qquad \operatorname{Var}(Y) = a(\phi)b''(\theta) = \exp\theta = \lambda, \qquad V(\mu) = \mu.$$

♥ Example 8.3. Bernoulli distribution, Y ∼ Bernoulli(p)

$$f_Y(y;p) = p^y(1-p)^{1-y}, \qquad y \in \{0,1\};\ p \in (0,1)$$
$$= \exp\left\{y\log\frac{p}{1-p} + \log(1-p)\right\}$$

This is in the form (1), with θ = log{p/(1 − p)}, b(θ) = log(1 + exp θ), a(φ) = 1 and c(y, φ) = 0. Therefore
$$E(Y) = b'(\theta) = \frac{\exp\theta}{1+\exp\theta} = p, \qquad \operatorname{Var}(Y) = a(\phi)b''(\theta) = \frac{\exp\theta}{(1+\exp\theta)^2} = p(1-p), \qquad V(\mu) = \mu(1-\mu).$$

♥ Example 8.4. Binomial distribution, Y* ∼ Binomial(n, p). Here, n is assumed known (as usual) and the random variable Y = Y*/n is taken as the proportion of successes, so

$$f_Y(y;p) = \binom{n}{ny}p^{ny}(1-p)^{n(1-y)}, \qquad y \in \left\{0,\tfrac{1}{n},\tfrac{2}{n},\ldots,1\right\};\ p \in (0,1)$$
$$= \exp\left\{\frac{y\log\frac{p}{1-p} + \log(1-p)}{1/n} + \log\binom{n}{ny}\right\}.$$

This is in the form (1), with θ = log{p/(1 − p)}, b(θ) = log(1 + exp θ), a(φ) = 1/n and c(y, φ) = log C(n, ny). Therefore
$$E(Y) = b'(\theta) = \frac{\exp\theta}{1+\exp\theta} = p, \qquad \operatorname{Var}(Y) = a(\phi)b''(\theta) = \frac{1}{n}\,\frac{\exp\theta}{(1+\exp\theta)^2} = \frac{p(1-p)}{n}, \qquad V(\mu) = \mu(1-\mu).$$

Here, we can write a(φ) ≡ σ²/w where the scale parameter σ² = 1 and the weight w is n, the binomial denominator.

8.2 Components of a generalised linear model

8.2.1 The random component

In practical applications, we often distinguish between a response variable and a


group of explanatory variables. The aim is to determine the pattern of dependence
of the response variable on the explanatory variables. We denote the n observa-
tions of the response by y = (y1 , y2 , . . . , yn )T . In a generalised linear model
(g.l.m.), these are assumed to be observations of independent random variables
Y = (Y1 , Y2 , . . . , Yn )T , which take the same distribution from the exponential
family. In other words, the functions a, b and c and usually the scale parame-
ter φ are the same for all observations, but the canonical parameter θ may differ.
Therefore, we write
 
$$f_{Y_i}(y_i;\theta_i,\phi_i) = \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)} + c(y_i,\phi_i)\right\}$$

and the joint density for Y = (Y1 , Y2 , . . . , Yn )T is


$$f_Y(y;\theta,\phi) = \prod_{i=1}^{n} f_{Y_i}(y_i;\theta_i,\phi_i) = \exp\left\{\sum_{i=1}^{n}\frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)} + \sum_{i=1}^{n} c(y_i,\phi_i)\right\}, \qquad (2)$$

where θ = (θ1 , . . . , θn )T is the collection of canonical parameters and φ =


(φ1 , . . . , φn )T is the collection of nuisance parameters (where they exist).

Note that for a particular sample of observed responses, y = (y1 , y2 , . . . , yn )T , (2)


is the likelihood function for θ and φ.

8.2.2 The systematic (or structural) component

Associated with each yi is a vector xi = (xi1 , xi2 , . . . , xip )T of values of p ex-


planatory variables. In a generalised linear model, the distribution of the response
variable Yi depends on xi through the linear predictor ηi where

$$\eta_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} = \sum_{j=1}^{p} x_{ij}\beta_j = x_i^T\beta = [x\beta]_i, \qquad i = 1,\ldots,n, \qquad (3)$$

where, as with a linear model,
$$x = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$
xTn xn1 · · · xnp

and β = (β1 , . . . , βp )T is a vector of fixed but unknown parameters describing the


dependence of Yi on xi . The four ways of describing the linear predictor in (3) are
equivalent, but the most economical is the matrix form

η = xβ. (4)

Again, we call the n × p matrix x the design matrix. The ith row of x is xTi , the
explanatory data corresponding to the ith observation of the response. The jth
column of x contains the n observations of the jth explanatory variable.

8.2.3 The link function

For specifying the pattern of dependence of the response variable on the explana-
tory variables, the canonical parameters θ1 , . . . , θn in (2) are not of direct interest.
Furthermore, we have already specified that the distribution of Yi should depend
on xi through the linear predictor ηi . It is the parameters β1 , . . . , βp of the linear
predictor which are of primary interest.

The link between the distribution of Y and the linear predictor η is provided by
the link function g,
ηi = g(µi ) i = 1, . . . , n,
where µi ≡ E(Yi ), i = 1, . . . , n. Hence, the dependence of the distribution of the
response on the explanatory variables is established as

g(E[Yi ]) = g(µi ) = ηi = xTi β i = 1, . . . , n,

In principle, the link function g can be any one-to-one differentiable function. However, note that ηi can in principle take any value in R (as we make no restriction on the possible values taken by explanatory variables or model parameters), whereas for some exponential family distributions µi is restricted: for example, for the Poisson distribution µi ∈ R+; for the Bernoulli distribution µi ∈ (0, 1). If g is not chosen carefully, there may exist a possible xi and β such that ηi ≠ g(µi) for any possible value of µi. Therefore, ‘sensible’ choices of link function map the set of allowed values for µi onto R.

Recall that for a random variable Y with a distribution from the exponential family, E(Y) = b'(θ). Hence, for a generalised linear model
$$\mu_i = E(Y_i) = b'(\theta_i), \qquad i = 1,\ldots,n.$$
Therefore
$$\theta_i = b'^{-1}(\mu_i), \qquad i = 1,\ldots,n,$$
and as g(µi) = ηi = x_i^T β, then
$$\theta_i = b'^{-1}\!\left(g^{-1}[x_i^T\beta]\right), \qquad i = 1,\ldots,n. \qquad (5)$$

Hence, we can express the joint density (2) in terms of the coefficients β, and for
observed data y, this is the likelihood fY (y; β, φ) for β. As β is our parameter
of real interest (describing the dependence of the response on the explanatory
variables) this likelihood will play a crucial role.
Note that considerable simplification is obtained in (5) if the functions g and b'^{-1} are identical. Then
$$\theta_i = x_i^T\beta, \qquad i = 1,\ldots,n,$$
and the resulting likelihood is
$$f_Y(y;\beta,\phi) = \exp\left\{\sum_{i=1}^{n}\frac{y_i x_i^T\beta - b(x_i^T\beta)}{a(\phi_i)} + \sum_{i=1}^{n} c(y_i,\phi_i)\right\}.$$

The link function
$$g(\mu) \equiv b'^{-1}(\mu)$$
is called the canonical link function. Under the canonical link, the canonical parameter is equal to the linear predictor.

The parameters in a generalised linear model are estimated using maximum like-
lihood. However, in most cases the maximum cannot be obtained algebraically
and we have to resort to numerical optimisation methods - these are beyond the
scope of this course.

Canonical link functions

                     Normal           Poisson           Bernoulli / Binomial
b(θ)                 θ²/2             exp θ             log(1 + exp θ)
b'(θ) ≡ µ            θ                exp θ             exp θ / (1 + exp θ)
b'^{-1}(µ) ≡ θ       µ                log µ             log{µ/(1 − µ)}
Link                 g(µ) = µ         g(µ) = log µ      g(µ) = log{µ/(1 − µ)}
                     Identity link    Log link          Logistic (logit) link
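As code, these canonical links and their inverses might look like the following sketch (the function names are my own, not from any particular library):

    import numpy as np

    # Canonical link functions g(mu) ...
    identity_link = lambda mu: mu                      # Normal
    log_link      = lambda mu: np.log(mu)              # Poisson
    logit_link    = lambda mu: np.log(mu / (1 - mu))   # Bernoulli / Binomial

    # ... and their inverses g^{-1}(eta)
    inv_identity = lambda eta: eta
    inv_log      = lambda eta: np.exp(eta)
    inv_logit    = lambda eta: 1 / (1 + np.exp(-eta))  # the logistic cdf

    print(inv_logit(logit_link(0.3)))   # 0.3 (up to rounding): the round trip recovers mu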

8.2.4 The linear model

Clearly the normal linear model is also a generalised linear model. We assume
Y1 , . . . , Yn are independent normally distributed random variables. The normal
distribution is a member of the exponential family.

Furthermore, the explanatory variables enter a linear model through the linear
predictor
ηi = xTi β i = 1, . . . , n.

Finally, the link between E(Y) = µ and the linear predictor η is through the
(canonical) identity link function

µi = ηi i = 1, . . . , n.

8.3 Example: Binary Regression

In binary regression the data either follow the binomial or the Bernoulli distri-
bution (equivalently). The objective is to model the success probability p as a
function of the covariates. Because p(x) is a probability we think of it as the cu-
mulative distribution function (cdf) of a random variable. For the logit link the

random variable follows the logistic distribution. But we can use the cdfs of other
random variables such as the standard normal and the log-Weibull distribution to
model the probability p(x). These will still fall under the glm but with different
link functions.

When the canonical link, i.e. the logit, is used we have
$$\theta = \log\frac{p(x)}{1 - p(x)} = x^T\beta = \eta.$$

This implies
$$p(x) = \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(-\eta)}.$$
This is the cdf of the logistic distribution, which takes values on the whole real line (−∞ < η < ∞). It is easily verified that F(η) = 1/(1 + exp(−η)) is the cdf of a random variable, since it is non-negative and increases monotonically from zero to 1. The logistic distribution behaves almost like the t-distribution with 8 degrees of freedom.

If we use the cdf of the standard normal distribution to model p(x) we get what is called the probit link. For this link we set
$$p(x) = \Phi(x^T\beta) = \Phi(\eta),$$
where Φ(·) is the cdf of the standard normal distribution. The corresponding link function,
$$g(\mu) = g(p) = \Phi^{-1}(\mu) = \eta,$$
is called the probit link.

The cdf of the log-Weibull distribution is given by
$$p = 1 - e^{-e^{\eta}}, \qquad -\infty < \eta < \infty.$$
It is easy to verify that this defines a cdf. Solving for η gives
$$\eta = \log\{-\log(1 - p)\}.$$
This is called the complementary log-log link function.

For the logit and probit link functions the cdf’s are symmetric about 1/2. However,
this is not the case for the complementary log-log link. Hence this should be used
when asymmetry as a function of the linear predictor is suspected. The logistic
distribution is heavier tailed than the standard normal distribution, hence the logit
link is often used when outliers are suspected in the linear predictor.
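The three inverse link functions can be compared numerically (a sketch; scipy's standard normal cdf provides the probit inverse link):

    import numpy as np
    from scipy.stats import norm

    eta = np.linspace(-3, 3, 7)
    p_logit   = 1 / (1 + np.exp(-eta))      # logistic cdf (logit link)
    p_probit  = norm.cdf(eta)               # standard normal cdf (probit link)
    p_cloglog = 1 - np.exp(-np.exp(eta))    # log-Weibull cdf (complementary log-log link)

    for e, a, b, c in zip(eta, p_logit, p_probit, p_cloglog):
        print(f"{e:5.1f}  {a:6.3f}  {b:6.3f}  {c:6.3f}")
    # At eta = 0 the logit and probit curves both give p = 0.5 (symmetric links),
    # whereas the complementary log-log curve gives p = 1 - exp(-1), approximately 0.63.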
