
MATH2831/2931

LINEAR MODELS

School of Mathematics and Statistics


University of New South Wales

Term 3, 2019

Contents
1 Introduction
  1.1 Some Basic Terminology
  1.2 The Model Building Process

2 The Simple Linear Regression Model
  2.1 Estimation of parameters by least squares
  2.2 Properties of Least Squares Estimators of slope and intercept
  2.3 Estimation of the Error Variance
  2.4 Maximum Likelihood Estimators
  2.5 Attributing variation to different sources
  2.6 Distribution theory of estimators in simple linear regression
  2.7 Confidence intervals for β0 and β1
  2.8 Hypothesis testing for β0 and β1
  2.9 The Analysis of Variance (ANOVA) table
  2.10 Prediction in the simple linear regression model
  2.11 Simultaneous inference and Bonferroni adjustment
  2.12 Bonferroni adjustment
  2.13 Criticism of the model: residuals

3 The general linear model
  3.1 Formulation of the general linear model
  3.2 Least squares estimation of mean parameters
  3.3 Properties of least squares estimator of mean parameters
  3.4 Maximum likelihood estimation of mean parameters
  3.5 Estimation of the error variance
  3.6 Interval estimation in the general linear model
  3.7 Hypothesis testing for coefficients
  3.8 Confidence intervals for the mean and prediction intervals
  3.9 Joint confidence regions for the coefficients
  3.10 Decomposing variation in the full rank linear model
  3.11 Hypothesis testing in the general linear model
  3.12 Multicollinearity

4 Model Selection
  4.1 Out of sample prediction, PRESS residuals and the PRESS statistic
  4.2 The hat matrix and computation of the PRESS residuals
  4.3 Cross validation
  4.4 The conceptual predictive criterion (Mallow’s Cp)
  4.5 Sequential variable selection procedures
  4.6 Problems with forward selection, backward selection and stepwise algorithms
  4.7 The MAXR procedure

5 Residuals and diagnostics
  5.1 Residual plots
  5.2 Use of residuals for detecting outlying points
  5.3 Testing for outliers
  5.4 Use of indicator variables
  5.5 Partial regression plots
  5.6 Partial residual plots
  5.7 Testing for normality
  5.8 Measures of influence
  5.9 DFFITS, DFBETAS and other influence measures
  5.10 Dealing with violations of assumptions: transformations
  5.11 Weighted Least Squares
  5.12 Nonlinear transformations of a single predictor
  5.13 The Box-Tidwell Procedure
  5.14 Polynomial regression
  5.15 Box-Cox transformation of the response

6 Categorical predictors
  6.1 Categorical predictors with two levels
  6.2 Categorical variables with more than two levels
  6.3 Hypothesis testing for categorical variables with more than two levels

7 Less than full rank linear models
  7.1 One way classification model with fixed effects
  7.2 Randomized complete block designs
  7.3 Factorial Experiments

8 Logistic Regression
  8.1 Hypothesis testing in logistic regression
1 Introduction
1.1 Some Basic Terminology
In this course we will be concerned with building statistical models which describe
the dependence of some variable of interest (called the response) on a number of
other variables (called predictors).
An example will help to illustrate the idea. The Bureau of Meteorology in
Sydney issues daily maximum temperature forecasts at many locations around
the city. One of the locations at which a temperature forecast is issued is Sydney
Airport, and forecasters in the Bureau make use of a statistical model to help
them in deciding on the forecast.
Forecasters have available to them the predictions of a deterministic physical
model (based on the equations of motion for fluids) which gives predicted values
of certain meteorological variables such as temperature, wind speed and direction
and pressure. However, the physical model may contain biases, and for this reason
the Bureau uses a statistical model to relate the observed maximum temperature
at Sydney airport (the response) to the set of variables predicted by the physical
model (the predictors). The use of the statistical model to modify the output of
the physical model gives a more accurate forecast, and allows quantification of
uncertainty about predictions of the response.
In situations where the predictors are under the control of an experimenter the
response is sometimes called the dependent variable and the predictors are called
independent variables. In this course we will not be concerned with modelling the
independent variables as random quantities, but assume that they are controlled
or at least accurately measured.

1.2 The Model Building Process


Before we proceed to a discussion of specific models, it is helpful to consider some
of the reasons we might have for building a statistical model. We build statistical
models in order to make decisions. For instance, in a medical study we might
administer several different kinds of treatment to a collection of patients with a
certain condition, and we want to decide which treatment is most effective. Or we
may wish to predict the future level of a share price index based on information
available to us now. Or we may wish to compare different descriptions of a set of
data based on competing scientific theories: we might ask ourselves which model
(and hence which theory) do the data support?
Qualitatively, we can write a statistical model for the response variable in the
form

Response = Signal + Noise



“Signal” here denotes systematic variation in the response, which for us will be
variation which can be explained in terms of the predictors and a small number
of unknown parameters to be estimated from the data. The other component
of the model, the “Noise” term, captures residual variation unexplained in the
signal: typically there will be some unknown parameters to be estimated in this
component of the model as well.
A good statistical model provides a reduction of a possibly large and complex
data set to a description in terms of a small number of parameters. When building
a model a good statistician will keep in mind that the reduction of the data
obtained must be a useful one for making the decisions which motivated the
building of the model. Finding or selecting a good statistical model is a complex
iterative process which is difficult to formally describe. Usually statisticians cycle
between stages of tentative model formulation, estimation and model criticism.
It is often an objective of a statistical analysis to describe as much of the
variation in the response as possible systematically (that is, in the “Signal”).
When we achieve this objective we say that the model fits the data well. However,
as we have mentioned, parsimony of the model (that is, having a model with a
small number of parameters) is also desirable in order to provide a reduction of
the data which is useful for human decision making and for making predictions
of future response values. So we must manage a trade off between simplicity of a
statistical model and goodness of fit. How we should manage this trade off often
depends on the use to which the model will be put.
We have talked about the process of model building in general terms. We will
now look in detail at the general linear model which is the subject of this course.
We discuss perhaps the simplest example of the general linear model first (the
simple linear regression model) and illustrate the processes of estimation, model
selection and model criticism in this situation. Once we have done this, the rest
of the course will be concerned with describing the same processes in the general
case.

2 The Simple Linear Regression Model


As we have mentioned, perhaps the simplest example of a linear model is the
simple linear regression model, which we now describe. In the simple linear
regression model, the conditional distribution of the response is allowed to depend
on a single predictor. Write y1, ..., yn for the response variables, and write x1, ..., xn
for the corresponding values of the predictor.
The fundamental assumption of the simple linear regression model is that, for
(unknown) parameters β0 and β1 taking values in R, the response variables are
of the form

yi = β0 + β1 xi + εi (1)

where εi , i = 1, ..., n denotes a collection of uncorrelated error terms with mean


zero and variance σ 2 . The formula (1) illustrates the decomposition mentioned
in the introduction of the response into signal and noise: here β0 + β1 xi is the
signal or systematic component of the model, and the terms εi describe the noise
or residual variation unexplained by the signal.
With the assumptions we have made so far, we will be able to derive estimators
for the unknown parameters β0 , β1 and σ 2 and examine some of their properties.
However, for deriving confidence intervals and performing hypothesis tests we will
also need to assume that the errors εi are normally distributed. The assumption
of the normality of errors also allows us to make stronger conclusions about the
optimality of estimators for the unknown parameters.

2.1 Estimation of parameters by least squares


Good estimators b0 and b1 of the parameters β0 and β1 will be estimators which
minimize the variation unexplained in the response by the systematic component
of the fitted model.
One would like to fit the data to a linear model, given by β0 + β1 xi . Then one
goodness of fit criterion which might be minimized to obtain good estimators of
β0 and β1 is

$$\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2. \qquad (2)$$

where β0 and β1 are parameters which we minimize over. If b0 and b1 are the
values which minimize (2) with respect to β0 and β1 , then b0 and b1 minimize in
some overall sense the deviations of the responses yi from the fitted values ŷi.
Of course it is possible to define global measures of the discrepancy of the
fitted values from the responses which are different to (2). For instance, we could
consider estimators obtained by minimizing

$$\sum_{i=1}^{n}\left|y_i - (\beta_0 + \beta_1 x_i)\right| \qquad (3)$$

with respect to β0 and β1 . However, in the case of normally distributed errors,


there are good reasons for considering the criterion (2), which is also useful due
to its computational tractability.
The estimators b0 and b1 minimizing (2) are called the least squares estimators
of β0 and β1 . We derive expressions for these estimators now. Differentiating (2)
with respect to β0 and β1 gives

$$\frac{\partial}{\partial\beta_0}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)^2 = -2\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)$$

$$\frac{\partial}{\partial\beta_1}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)^2 = -2\sum_{i=1}^{n}x_i(y_i-\beta_0-\beta_1 x_i).$$

Setting these to zero, we obtain for b0, b1 that

$$-2\left(\sum_{i=1}^{n} y_i - nb_0 - \left(\sum_{i=1}^{n} x_i\right) b_1\right) = 0$$

$$-2\left(\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right) b_0 - \left(\sum_{i=1}^{n} x_i^2\right) b_1\right) = 0.$$

Hence upon rearranging we have that the least squares estimators b0 and b1 satisfy

$$nb_0 + \left(\sum_{i=1}^{n} x_i\right) b_1 = \sum_{i=1}^{n} y_i \qquad (4)$$

$$\left(\sum_{i=1}^{n} x_i\right) b_0 + \left(\sum_{i=1}^{n} x_i^2\right) b_1 = \sum_{i=1}^{n} x_i y_i. \qquad (5)$$

These are linear equations in b0 and b1 (called the normal equations) which are
easily solved. Dividing (4) by n and solving for b0 gives

b0 = ȳ − b1 x̄. (6)

Substituting (6) into (5) gives

$$\left(\sum_{i=1}^{n} x_i\right)(\bar{y} - b_1\bar{x}) + \left(\sum_{i=1}^{n} x_i^2\right) b_1 = \sum_{i=1}^{n} x_i y_i$$

or

$$b_1\left(\sum_{i=1}^{n} x_i^2 - \bar{x}\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i.$$

Solving for b1,

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}.$$

Table 2.1: Zinc levels in plants and lake sediment samples


Concentration of Zinc in sediment Concentration of Zinc in plant

37.5 15.9
72.5 42.7
85.2 85.7
76.5 52.6
64.5 49.1
86.8 59.0
90.8 53.6
105.8 77.8
85.8 63.2
87.9 62.3
53.9 22.7
102.3 66.1
90.7 47.4
86.0 59.4
79.0 50.9

We write Sxy for the numerator in (7) and Sxx for the denominator so that
b1 = Sxy /Sxx . We can then compute b0 from (6). An example will help to rein-
force some of the concepts we have discussed so far.

Example: zinc concentrations in plants

The following example is described in Understanding Data: Principles and Prac-


tice of Statistics, David Griffiths, W. Douglas Stirling and K. Laurence Weldon,
Wiley, 1998.
The data in the table below were collected from fifteen lakes in central Ontario.
The purpose of collecting the data was to assess how zinc levels in a certain plant
in the lake (the response) were related to zinc levels in the lake sediment (the
predictor). By describing and modelling this relationship we can predict zinc
levels in the plant from an analysis of lake sediment. A scatterplot of these data is shown in Figure 2.1, and superimposed on the plot is the fitted least squares regression line.

[Figure 2.1: Scatter plot showing zinc concentrations in plants and lake sediment samples, with the fitted least squares regression line.]

To compute the fitted regression line, write y1, ..., y15 for the fifteen response values and x1, ..., x15 for the corresponding fifteen values of the predictor, and observe that

$$\sum_{i=1}^{15} y_i = 808.4, \qquad \sum_{i=1}^{15} x_i = 1205.2, \qquad \sum_{i=1}^{15} x_i^2 = 101228.6, \qquad \sum_{i=1}^{15} x_i y_i = 68716.78,$$

so that

$$S_{xy} = 68716.78 - \frac{(808.4)(1205.2)}{15} = 3764.5$$

and

$$S_{xx} = 101228.6 - \frac{1205.2^2}{15} = 4394.8.$$

Hence the least squares estimate of the slope coefficient is 3764.5/4394.8 = 0.857
and the least squares estimate of the intercept is 808.4/15 − 0.857 × 1205.2/15 =
−14.9. Hence the fitted least squares regression line (the systematic component
of the fitted model) is

−14.9 + 0.857x.
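As a check on the hand calculation, the fit can be reproduced in R (the statistical package used elsewhere in these notes). The sketch below types in the data from Table 2.1; the variable names sediment and plant are our own.

# Sketch: reproducing the zinc example by hand and with lm().
sediment <- c(37.5, 72.5, 85.2, 76.5, 64.5, 86.8, 90.8, 105.8,
              85.8, 87.9, 53.9, 102.3, 90.7, 86.0, 79.0)
plant    <- c(15.9, 42.7, 85.7, 52.6, 49.1, 59.0, 53.6, 77.8,
              63.2, 62.3, 22.7, 66.1, 47.4, 59.4, 50.9)

Sxy <- sum(sediment * plant) - sum(sediment) * sum(plant) / length(sediment)
Sxx <- sum(sediment^2) - sum(sediment)^2 / length(sediment)
b1  <- Sxy / Sxx                          # least squares slope, about 0.857
b0  <- mean(plant) - b1 * mean(sediment)  # least squares intercept, about -14.9
c(b0 = b0, b1 = b1)

# The same fit via R's built-in least squares routine
coef(lm(plant ~ sediment))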

2.2 Properties of Least Squares Estimators of slope and intercept
We now discuss some properties of the least squares estimators of β0 and β1 . The
properties of the estimators derived in this section hold regardless of whether or
not the errors in the simple linear regression model are normally distributed.
To begin, we ask the question: how do we assess the quality of estimators of
parameters in a statistical model? For us, a good estimator is one that performs
well on average with repeated use. Suppose our data arise from an experiment
which can be repeated. We can imagine computing estimates from our data for
the same experiment over many repetitions, and examining the variability of
the estimates obtained. We would want the average of the estimates to be equal
to the true parameter value (no systematic bias), and we want the variation of
the estimates about the true parameter value to be as small as possible (small
variance).
We would like to compute the mean and variance of the least squares estima-
tors of the slope and intercept in the simple linear regression model as a way of
evaluating the performance of these estimators.

Some Notation
In the rest of the note, for convenience, we write yi to represent both the i-th
response variable (see model assumption (1)) and its realization (a data point).
Recall that the least squares estimator b1 for β1 is given by

$$b_1 = \frac{S_{xy}}{S_{xx}} \qquad (7)$$

where

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} \qquad \text{and} \qquad S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.$$

It is a very easy exercise to derive the following alternative expressions for Sxy and Sxx:

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})\, y_i \qquad \text{and} \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})\, x_i.$$

Also,

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad \text{and} \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

In the following, we compute the expectation and variance of b0 and b1 , which


are random under the model assumption (1).

Expected value of b1

Consider the expected value of b1. We have

$$E(b_1) = E\left(\frac{\sum_{i=1}^{n}(x_i-\bar{x})y_i}{S_{xx}}\right) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})E(y_i)}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(\beta_0+\beta_1 x_i)}{S_{xx}}.$$
Using the fact that

$$\sum_{i=1}^{n}(x_i-\bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0,$$

we have

$$E(b_1) = \beta_1 \frac{\sum_{i=1}^{n}(x_i-\bar{x})x_i}{S_{xx}} = \beta_1.$$
So if we were to use the estimator b1 repeatedly for estimating β1 in a certain
experiment, on average the estimator gives the true parameter (no systematic
bias, we say that b1 is unbiased for β1 ).

Expected value of b0

Recall that b0 = ȳ − b1 x̄. Now,

$$E(\bar{y}) = \frac{\sum_{i=1}^{n} E(y_i)}{n} = \frac{\sum_{i=1}^{n}(\beta_0+\beta_1 x_i)}{n} = \frac{n\beta_0 + \beta_1\sum_{i=1}^{n} x_i}{n} = \beta_0 + \beta_1\bar{x}.$$
Hence, since b1 is unbiased for β1,

$$E(b_0) = E(\bar{y}) - E(b_1)\bar{x} = \beta_0 + \beta_1\bar{x} - \beta_1\bar{x} = \beta_0.$$

So b0 is unbiased for β0.

Variance of b0 and b1

For an estimator (such as b0 or b1) to be useful, we must be able to assess how reliable it is. Following our previous discussion, it would be nice to know how variable the estimators b0 and b1 are in repetitions of an experiment. We give expressions for the variances of b0 and b1. These expressions are a special case of a later more general result and are given without proof.

$$\mathrm{Var}(b_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right), \qquad \mathrm{Var}(b_1) = \frac{\sigma^2}{S_{xx}}, \qquad \mathrm{Cov}(b_0, b_1) = -\frac{\sigma^2\bar{x}}{S_{xx}}.$$
Example: market model of stock returns

The following example is from Keller, Warrack and Bartel, “Statistics for Man-
agement and Economics: a Systematic Approach (Second Edition)”, Wadsworth,
Belmont, 1990, p. 725.
A well known model in finance, called the market model, assumes that the
monthly rate of return on a stock (R) is linearly related to the monthly rate of
return on the overall stock market (Rm ). The mathematical description of the
model is
R = β0 + β1 Rm + ε
where the error term ε is assumed to satisfy the requirements of the linear re-
gression model. For practical purposes, Rm is taken to be the monthly rate of
return on some major stock market index, such as the New York Stock Exchange
(NYSE) Composite Index.
The coefficient β1 , called the stock’s beta coefficient, measures how sensitive
the stock’s rate of return is to changes in the level of the overall market. For
example, if β1 > 1 (β1 < 1), the stock’s rate of return is more (less) sensitive to
changes in the level of the overall market than is the average stock. The monthly
rates of return to Host International Inc. stock and to the overall market (as
approximated by the NYSE composite index) over a 5-year period are shown in
the table below.

Month      Host International   Market      Month      Host International   Market
Jan 1975 26.7 13.5 Jul 1977 −4.2 −1.5
Feb 1975 7.0 6.1 Aug 1977 2.2 −1.4
Mar 1975 15.9 2.9 Sep 1977 3.1 0.1
Apr 1975 18.6 4.7 Oct 1977 6.4 −3.9
May 1975 −6.0 5.5 Nov 1977 11.0 4.2
Jun 1975 −4.2 5.2 Dec 1977 −1.0 0.5
Jul 1975 4.0 −6.4 Jan 1978 −2.7 −5.7
Aug 1975 −5.2 −2.0 Feb 1978 10.4 −1.2
Sep 1975 −1.7 −3.6 Mar 1978 10.2 3.2
Oct 1975 28.2 6.1 Apr 1978 15.6 8.3
Nov 1975 17.6 3.1 May 1978 10.2 3.2
Dec 1975 1.6 −1.0 Jun 1978 1.2 −1.3
Jan 1976 17.6 12.5 Jul 1978 22.3 5.7
Feb 1976 −13.4 0.1 Aug 1978 3.1 3.8
Mar 1976 −12.1 3.0 Sep 1978 −13.6 −0.6
Apr 1976 −6.2 −1.1 Oct 1978 −28.8 −10.2
May 1976 −12.1 3.0 Nov 1978 19.0 3.1
Jun 1976 −2.5 4.7 Dec 1978 −2.1 1.6
Jul 1976 −9.3 0.7 Jan 1979 −7.8 4.7
Aug 1976 0.0 0.0 Feb 1979 −10.1 −2.9
Sep 1976 −1.5 2.6 Mar 1979 11.4 6.2
Oct 1976 −5.3 −2.1 Apr 1979 −5.5 0.7
Nov 1976 9.7 0.5 May 1979 −6.6 −1.5
Dec 1976 18.7 5.8 Jun 1979 19.6 4.5
Jan 1977 −10.7 −4.0 Jul 1979 3.7 1.5
Feb 1977 −8.4 −1.6 Aug 1979 7.9 6.3
Mar 1977 6.3 −1.1 Sep 1979 −3.2 0.0
Apr 1977 1.2 0.4 Oct 1979 −10.4 −6.9
May 1977 2.5 −1.2 Nov 1979 −12.4 6.0
Jun 1977 14.3 5.3 Dec 1979 −4.2 2.3

Figure 2.2 shows a scatterplot of these data, together with a fitted least squares regression line.

[Figure 2.2: Scatter plot of Host International returns versus overall market returns with fitted least squares regression line.]

As usual, writing y for the response (Host International return) and x for the predictor (overall market return), we have in this case that

$$\sum_{i=1}^{n} x_i = 90.4, \qquad \sum_{i=1}^{n} x_i^2 = 1254.1, \qquad \sum_{i=1}^{n} y_i = 136.0, \qquad \sum_{i=1}^{n} y_i^2 = 8158.9, \qquad \sum_{i=1}^{n} x_i y_i = 1992.8.$$

Hence the estimated beta coefficient for the stock (the least squares estimate of the slope in the linear regression) is b1 = Sxy/Sxx, and we calculate Sxy and Sxx as

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} = 1992.8 - \frac{(90.4)(136.0)}{60} = 1787.89$$

and

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 1254.1 - \frac{90.4^2}{60} = 1117.90,$$

giving b1 = 1787.89/1117.90 = 1.60.
So it seems as though the rate of return on Host International stock is more
sensitive to the level of the overall market than is the average stock. The least
squares estimate b0 of β0 here is ȳ − b1 x̄ = 2.2667 − (1.60)(1.5067) = −0.14.
We have previously shown that b0 and b1 are unbiased, and derived expressions
for the variance of b0 and b1 . How reliable is the estimate of the beta coefficient in
this example? Our expressions for the variance of b0 and b1 contain the parameter
σ 2 , which is unknown. We need to estimate this if we are to obtain estimated
values for the variances of b0 and b1 .
2.3 Estimation of the Error Variance


So far we have described least squares estimation of the parameters β0 and β1
in the simple linear regression model, but have not considered estimation of the
error variance σ 2. If β0 and β1 were known, a reasonable estimator might be

$$\frac{\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2}{n}$$
(since εi = yi − β0 − β1 xi and so the above expression is a sample average of the
squared εi which is an unbiased estimator of σ 2 = E(ε2 )).
Since β0 and β1 are not known, we might suggest plugging the least squares
estimators of β0 and β1 into the above expression.
In particular, consider the estimator

$$\sigma^{*2} = \frac{\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2}{n}.$$
What are the properties of this estimator? In particular, is it unbiased for σ 2
(does it give the right value for σ 2 on average with repeated use)? It can be
shown (the proof is a special case of later results) that

$$E(\sigma^{*2}) = \frac{n-2}{n}\,\sigma^2,$$

which suggests using the modified estimator

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2}{n-2},$$

which is unbiased for σ 2.
A useful alternative computational formula for σ̂ 2 is

$$\hat{\sigma}^2 = \frac{S_{yy} - b_1^2 S_{xx}}{n-2}$$

where

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}.$$

We can “plug in” the estimate σ̂ 2 for σ 2 in our expressions for the variances of b0 and b1 to obtain estimated variances for these quantities.

Example: market model of stock returns

We can estimate the error variance as

$$\hat{\sigma}^2 = \frac{S_{yy} - b_1^2 S_{xx}}{n-2}$$

where

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 8158.9 - \frac{136.0^2}{60} = 7850.63,$$

so that

$$\hat{\sigma}^2 = \frac{7850.63 - 1.60^2 \times 1117.90}{58} = 86.01.$$

Hence σ̂ = 9.27.
By “plugging in” this σ̂ value in the expressions for Var(b0) and Var(b1) that we obtained before, we get that the estimated variance of b0 is 1.61 (estimated standard deviation of 1.27) and the estimated variance of b1 is 0.077 (estimated standard deviation of 0.28).
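The same calculation can be sketched in R using only the summary statistics quoted for this example; the variable names below are our own.

# Sketch: error variance and estimated variances of b0, b1 for the market model,
# from the summary statistics given above (n = 60).
n   <- 60
Sxx <- 1254.1 - 90.4^2 / n            # sum(x^2) - (sum x)^2 / n
Syy <- 8158.9 - 136.0^2 / n           # sum(y^2) - (sum y)^2 / n
Sxy <- 1992.8 - 90.4 * 136.0 / n      # sum(xy) - (sum x)(sum y) / n
b1  <- Sxy / Sxx
xbar <- 90.4 / n
sigma2.hat <- (Syy - b1^2 * Sxx) / (n - 2)   # about 86
sqrt(sigma2.hat)                              # about 9.27
sigma2.hat * (1/n + xbar^2 / Sxx)             # estimated Var(b0), about 1.61
sigma2.hat / Sxx                              # estimated Var(b1), about 0.077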

2.4 Maximum Likelihood Estimators


In MATH2801 you have learnt about maximum likelihood estimators, which often
have certain optimality properties in large samples. In later discussion we will
assume that the error terms in the simple linear regression model are normally
distributed, so it is of interest to ask what the maximum likelihood estimators of
the slope, intercept, and error variance are under normal assumptions.
First we review the concept of maximum likelihood estimation. Suppose that
we have a discrete n-dimensional random vector Y with a distribution depending
on a p-dimensional parameter vector θ. Now suppose that we observe a realization
y of Y and that we wish to estimate θ. Since Y is discrete, we can consider the
probability that Y = y for different values of θ, and this probability P(Y = y|θ)
considered as a function of θ for the fixed observed y is called the likelihood
function.
The idea of maximum likelihood estimation is the following: we should choose
the parameter θ so that the observed value y of Y has the highest probability
of occurrence. That is, the maximum likelihood estimator is chosen to be the
maximizer of the likelihood function. In the case where Y is a continuous random
vector, the likelihood is the density function of Y considered as a function of θ
for the fixed observed y, and again the maximum likelihood estimator maximizes
the likelihood function.
What are the maximum likelihood estimators of β0 , β1 and σ 2 in the simple
linear regression model if we assume that the errors are normal? The density
function of the ith response is

$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right)$$

and since the responses are now assumed to be independent we can write down the likelihood function as

$$L(y; \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right).$$

We must maximize this function with respect to β0 , β1 and σ 2 to obtain the


maximum likelihood estimators. Note that maximizing the likelihood function is
equivalent to maximizing the logarithm of the likelihood function (since log is an
increasing function and hence if L1 and L2 are values of the likelihood function
for two different values of the parameters and L1 < L2 then log(L1 ) < log(L2 )).
If we write l(y; β0, β1, σ 2) for log L(y; β0, β1, σ 2) we have

$$l(y; \beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2.$$

Clearly, regardless of the value of σ 2, l(y; β0, β1, σ 2) is maximized with respect to β0 and β1 by choosing β0 and β1 to minimize

$$\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2.$$

That is, the maximum likelihood estimators of β0 and β1 are just the least squares estimators. It remains to find the maximum likelihood estimator of σ 2. Differentiating the log-likelihood with respect to σ 2 gives

$$\frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2.$$

Setting this to zero, writing σ̂ 2 for the maximum likelihood estimator of σ 2 and substituting in the maximum likelihood (least squares) estimators for β0 and β1, we get

$$\frac{n}{2\hat{\sigma}^2} = \frac{1}{2\hat{\sigma}^4}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2,$$

which gives

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2.$$

So the maximum likelihood estimator of σ 2 is the biased estimator σ*2 that we considered in Section 2.3.
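As an illustration (not part of the original example), the equivalence of maximum likelihood and least squares estimation of β0 and β1 can be checked numerically in R by minimising the negative log-likelihood; the data below are simulated purely for this purpose.

# Sketch: numerical maximum likelihood versus least squares, simulated data.
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 1.5)

negloglik <- function(par) {
  beta0 <- par[1]; beta1 <- par[2]; sigma2 <- exp(par[3])  # log-parametrise sigma^2
  0.5 * length(y) * log(2 * pi * sigma2) +
    sum((y - beta0 - beta1 * x)^2) / (2 * sigma2)
}
fit.ml <- optim(c(0, 0, 0), negloglik)
fit.ls <- lm(y ~ x)

fit.ml$par[1:2]                    # ML estimates of beta0, beta1 ...
coef(fit.ls)                       # ... agree (approximately) with least squares
exp(fit.ml$par[3])                 # ML estimate of sigma^2 (divisor n)
sum(resid(fit.ls)^2) / length(y)   # the biased estimator sigma*^2, same quantity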

2.5 Attributing variation to different sources


In this section we will prove a fundamental identity for the simple linear regression
model which we will later generalize when considering the general linear model.
The identity we derive describes a partition of the total variation into terms
representing variation explained in different parts of the model (signal and noise).
Writing ŷi = b0 + b1 xi for the ith fitted value, the identity we derive is

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \qquad (8)$$

On the left hand side of (8) we have a term which we can think of as being the
total variation in the data – note that this term is the numerator of the expression
we would use for computing the sample variance of the y’s. On the right hand
side we have a sum of two terms: the first term represents the variation of the
predictions about the sample mean of the y’s (we can think of this as the variation
explained by the fit) and the second term represents variation of the responses
from the predictions (we can think of this as variation unexplained by the fit).
We now derive the identity above. We have

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}),$$

so we only need to show that

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0$$

or

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)\hat{y}_i - \bar{y}\sum_{i=1}^{n}(y_i - \hat{y}_i) = 0. \qquad (9)$$
Now, we have that

$$\sum_{i=1}^{n}(y_i - \hat{y}_i) = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) = \sum_{i=1}^{n} y_i - nb_0 - b_1\sum_{i=1}^{n} x_i. \qquad (10)$$

Recalling that the normal equations solved to find b0 and b1 are

$$nb_0 + \left(\sum_{i=1}^{n} x_i\right) b_1 = \sum_{i=1}^{n} y_i \qquad (11)$$

and

$$\left(\sum_{i=1}^{n} x_i\right) b_0 + \left(\sum_{i=1}^{n} x_i^2\right) b_1 = \sum_{i=1}^{n} x_i y_i, \qquad (12)$$

it follows from (11) and (10) that

$$\sum_{i=1}^{n}(y_i - \hat{y}_i) = 0$$

and hence to show (9) we just have to prove that

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)\hat{y}_i = 0.$$

Now,

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)\hat{y}_i = \sum_{i=1}^{n}(y_i - \hat{y}_i)(b_0 + b_1 x_i) = b_1\sum_{i=1}^{n}(y_i - \hat{y}_i)x_i = b_1\left(\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n}(b_0 + b_1 x_i)x_i\right) = b_1\left(\sum_{i=1}^{n} x_i y_i - b_0\sum_{i=1}^{n} x_i - b_1\sum_{i=1}^{n} x_i^2\right) = 0$$

from (12), as required. (The second equality above uses the fact, just established, that the residuals sum to zero.)


Coefficient of determination

The identity (8) motivates one way of measuring the goodness of fit of a statistical model. We introduce the notation

$$SS_{total} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad SS_{reg} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \qquad \text{and} \qquad SS_{res} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2,$$

so that

$$SS_{total} = SS_{reg} + SS_{res}. \qquad (13)$$

We call SStotal the total sum of squares, SSreg the regression sum of squares and
SSres the residual sum of squares: as we have mentioned, SStotal , SSreg and SSres
represent the total variation, variation explained by the fit, and residual variation
respectively. One measure for describing the fit of a linear regression model is
the coefficient of determination,

$$R^2 = \frac{SS_{reg}}{SS_{total}},$$

which from (13) can also be written as

$$R^2 = 1 - \frac{SS_{res}}{SS_{total}}.$$

Clearly R2 is always non-negative and less than or equal to one: it will be one
when SSreg = SStotal , which occurs when SSres = 0 (that is, when the fitted
values are all equal to the responses). So R2 is a measure of the quality of the
fit, with values close to one indicating a good fit.
R2 needs to be used with some care for assessing the quality of linear statistical
models however. When we discuss the general linear model where we consider
multiple predictors, we will see that adding a new predictor cannot decrease R2 ,
even if the predictor is actually unrelated to the response. So according to this
criterion the best model is always the most complex one.
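A short R sketch (with simulated data of our own) confirming the decomposition (13) numerically and computing R2 both ways:

# Sketch: verifying SStotal = SSreg + SSres and computing R^2 in R.
set.seed(2)
x <- 1:20
y <- 3 + 0.8 * x + rnorm(20, sd = 2)
fit <- lm(y ~ x)

SStotal <- sum((y - mean(y))^2)
SSreg   <- sum((fitted(fit) - mean(y))^2)
SSres   <- sum(resid(fit)^2)

all.equal(SStotal, SSreg + SSres)   # the identity (13) holds numerically
SSreg / SStotal                     # R^2 from the decomposition
summary(fit)$r.squared              # agrees with R's reported R^2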
2.6 Distribution theory of estimators in simple linear regression

In this section we describe some basic distribution theory for b0, b1 and σ̂ 2 which enables us to construct confidence intervals and hypothesis tests for β0 and β1 in the simple linear regression model.

The χ2 , t and F distributions

We start by reviewing some results from MATH2801. Recall that if Z1, ..., Zn are independent standard normal random variables, Z1, ..., Zn ∼ N(0, 1), and if $X = \sum_{i=1}^{n} Z_i^2$, then X has a chi-squared distribution with n degrees of freedom, and we write X ∼ χ2n.
Recall also that if X ∼ χ2m and Y ∼ χ2n and if X and Y are independent, then we say that the ratio

$$F = \frac{X/m}{Y/n}$$

has an F distribution with m and n degrees of freedom, and we write F ∼ Fm,n.
Finally, recall that if Z is a standard normal random variable, Z ∼ N(0, 1), and if X is a chi-squared random variable with n degrees of freedom independent of Z, and if

$$T = \frac{Z}{\sqrt{X/n}},$$

then we say that T has a t distribution with n degrees of freedom, and we write T ∼ tn. Observe also that T 2 has an F distribution with 1 and n degrees of freedom.

Distribution theory for b0, b1 and σ̂ 2

We now state some results about the distributions of b0 and b1. It is easy to show that b0 and b1 are normally distributed (since linear combinations of independent normal random variables are normal, and b0 and b1 are linear combinations of the responses). Since b0 and b1 are normal, their distributions are determined by their means and variances. From our previous results, we have

$$b_0 \sim N\left(\beta_0,\; \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\right) \qquad \text{and} \qquad b_1 \sim N\left(\beta_1,\; \frac{\sigma^2}{S_{xx}}\right).$$
For making inferences about β0 and β1, we will also need to know something about the distribution of σ̂ 2. It can be shown that

$$\frac{(n-2)\hat{\sigma}^2}{\sigma^2}$$

has a chi-squared distribution with n − 2 degrees of freedom, and that this variate is independent of b0 and b1.

Some important results for inference

These results allow us to state the distributions of some important statistics. Consider

$$\frac{b_0 - \beta_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}.$$

We can write this as

$$\left(\frac{b_0 - \beta_0}{\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}\right) \Bigg/ \sqrt{\frac{(n-2)\hat{\sigma}^2}{\sigma^2} \Big/ (n-2)},$$

which takes the form of a standard normal random variable (b0 minus its mean, divided by its standard deviation) divided by the square root of an independent χ2n−2 random variable ((n − 2)σ̂ 2/σ 2) divided by its degrees of freedom. Hence the above statistic has a t distribution with n − 2 degrees of freedom,

$$\frac{b_0 - \beta_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} \sim t_{n-2}.$$

Similar reasoning shows that

$$\frac{b_1 - \beta_1}{\hat{\sigma}/\sqrt{S_{xx}}}$$

has a t distribution with n − 2 degrees of freedom. These statistics are useful for creating confidence intervals for β0 and β1, as we now describe.

2.7 Confidence intervals for β0 and β1


We review the basic idea of a confidence interval before describing how to con-
struct confidence intervals for the coefficients in the simple linear regression
model.

Confidence intervals
Suppose we have a random vector Y = (Y1 , ..., Yn ) with a distribution de-


pending on an unknown parameter θ. A confidence interval for θ is a rule for
computing an interval of plausible values for θ based on the data. Our rule in-
volves functions of the data L(Y ) and U (Y ) which specify the lower and upper
limits of the interval (L(Y ), U (Y )) of plausible values.
We can talk about the properties of confidence intervals in terms of their
behaviour in repetitions of an experiment. Suppose we repeat some experiment
many times, and that we apply the rule (L(Y ), U (Y )) to the data for each ex-
periment in order to specify a range of plausible values for θ. In general, in some
experiments the interval (L(Y ), U (Y )) will contain θ, and in some experiments
it will not. If with repeated use our interval covers θ a proportion 1 − α of the
time, we say that (L(Y ), U (Y )) is a 100(1 − α)% confidence interval for θ.

Confidence intervals for β0 and β1

Suppose we wish to compute a confidence interval for β1, the slope term in the simple linear regression model. If tα/2;n−2 denotes the upper α/2 percentage point of the t distribution with n − 2 degrees of freedom (that is, tα/2;n−2 is the value that bounds an area of α/2 under the curve in the upper tail of a tn−2 density), then we can write

$$P\left(-t_{\alpha/2;n-2} \le \frac{b_1 - \beta_1}{\hat{\sigma}/\sqrt{S_{xx}}} \le t_{\alpha/2;n-2}\right) = 1 - \alpha.$$

Rearranging the inequality, we get

$$P\left(b_1 - t_{\alpha/2;n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}} \le \beta_1 \le b_1 + t_{\alpha/2;n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}}\right) = 1 - \alpha.$$

Hence a 100(1 − α)% confidence interval for β1 is

$$\left(b_1 - t_{\alpha/2;n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}},\; b_1 + t_{\alpha/2;n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}}\right).$$

Similar reasoning to the above shows that a 100(1 − α)% confidence interval for β0 is

$$\left(b_0 - t_{\alpha/2;n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}},\; b_0 + t_{\alpha/2;n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}\right).$$

Example: zinc concentrations in plants


Consider once more the data on zinc concentrations described in the first lecture.
The data consist of fifteen measurements y1 , ..., y15 of zinc concentrations in a cer-
tain plant taken from fifteen different lakes, and fifteen measurements x1 , ..., x15
of zinc concentrations from sediment samples in the lakes. Using the fact that

$$\sum_{i=1}^{15} y_i = 808.4, \qquad \sum_{i=1}^{15} x_i = 1205.2, \qquad \sum_{i=1}^{15} x_i^2 = 101228.6, \qquad \sum_{i=1}^{15} y_i^2 = 48130.92, \qquad \sum_{i=1}^{15} x_i y_i = 68716.78,$$

we compute the least squares estimates b0 and b1 of β0 and β1 as −14.9 and 0.857 respectively. We now derive a 95% confidence interval for β1.
The expression for a 95% confidence interval for β1 is

$$\left(b_1 - t_{0.025;n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}},\; b_1 + t_{0.025;n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}}\right).$$

Now, we have that

$$\hat{\sigma}^2 = \frac{S_{yy} - b_1^2 S_{xx}}{n-2}$$

where as usual

$$S_{yy} = \sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n} \qquad \text{and} \qquad S_{xx} = \sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}.$$

Substituting, we have Syy = 4563.54 and Sxx = 4394.80 and hence

$$\hat{\sigma}^2 = \frac{4563.54 - 0.857^2 \times 4394.80}{13} = 102.99,$$

or σ̂ = 10.14. Furthermore, the upper 2.5 percentage point of a t-distribution with n − 2 = 13 degrees of freedom is t0.025,13 = 2.16 (this can be computed from tables or a computer package like R). Hence the confidence interval is

$$\left(0.857 - 2.16\,\frac{10.14}{\sqrt{4394.80}},\; 0.857 + 2.16\,\frac{10.14}{\sqrt{4394.80}}\right) = (0.526,\; 1.188).$$
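The same interval can be obtained in R; the sketch below uses the quantities computed above, and the commented line shows the built-in confint() alternative when the raw data are available.

# Sketch: the 95% confidence interval for beta1 in the zinc example.
b1 <- 0.857; sigma.hat <- 10.14; Sxx <- 4394.80; n <- 15
tcrit <- qt(0.975, df = n - 2)                 # upper 2.5% point of t with 13 df
b1 + c(-1, 1) * tcrit * sigma.hat / sqrt(Sxx)  # approximately (0.526, 1.188)

# With the raw data in vectors plant and sediment (as in the earlier sketch):
# confint(lm(plant ~ sediment), "sediment", level = 0.95)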
2.8 Hypothesis testing for β0 and β1


We review the basic ideas of hypothesis testing before discussing hypothesis tests
for β0 and β1 in the simple linear regression model.

Hypothesis testing

Suppose we have a random vector Y = (Y1 , ..., Yn ) with a distribution de-


pending on a parameter θ. In hypothesis testing there are two competing claims
about θ called the null hypothesis (written H0 ) and the alternative hypothesis
(written H1 ) and we want to decide between them based on the data.
The null hypothesis H0 is distinguished by the fact that we only want to
reject H0 if there is very strong evidence against it. An analogy is often made
between hypothesis testing and a criminal trial. In a criminal trial, the state
of the defendant is either “innocent” or “guilty” (think of the defendant’s state
as the parameter) and we want to decide which of these competing claims is
true based on the evidence. The null hypothesis in this analogy would be “The
defendant is innocent,” since (in our justice system) we only want to convict
someone if there is very strong evidence that they are guilty (beyond reasonable
doubt). The probability of wrongfully convicting an innocent person must be
controlled at some small level.
The way that hypothesis testing proceeds is by calculating a function of the
data t(Y ) (called the test statistic) and then defining a set of values for the test
statistic where H0 will be rejected (the critical region). If t(Y ) does not lie in the
critical region then H0 is accepted. So we have some rule for deciding between
H0 and H1 based on the data.
In hypothesis testing there are two kinds of errors which can be made. We
can reject H0 when it is true (this is called a type I error) or we can accept H0
when it is false (this is called a type II error). Since we only want to reject H0 if
there is very strong evidence, we fix the probability of making a type I error at
some small level, called the significance level of the test (usually denoted α). In
the example of the criminal trial, we can think of the significance level as defining
our standard of “reasonable doubt”.
Subject to fixing the probability of a type I error at the level α, a good test
is one that minimizes the probability of making a type II error.

Hypothesis testing for β1

We can use our results about the distributions of b0, b1 and σ̂ 2 to test hypotheses about β0 and β1.
To test the hypothesis H0: β1 = β versus the alternative H1: β1 ≠ β we use the test statistic

$$T = \frac{b_1 - \beta}{\hat{\sigma}/\sqrt{S_{xx}}},$$

which has a tn−2 distribution under H0.
The critical region for a test at significance level α is T < −tα/2;n−2 or T > tα/2;n−2. Clearly the probability of lying in this region (rejecting H0) when H0 is true is α (since under H0 the test statistic T has a tn−2 distribution).

p-values

There is an alternative way of reporting the results of a hypothesis test that


you may remember from MATH2801. Rather than adopting a certain significance
level for a test, finding the critical region for that test and then seeing whether
or not the test statistic lies in the critical region, we can find the value for the
significance level such that the observed value of the test statistic lies “just on
the boundary” of the critical region. This critical value for the significance level is called the p-value for the test. If the significance level is bigger than the p-value, then the test statistic lies in the critical region and H0 is rejected, whereas if the significance level is less than the p-value then H0 is accepted.
In the case of the test we have just described for β1 , if we observe T = t, we
can compute the p-value as
p = P(|T | > |t||β1 = β).
and then we compare this probability with the significance level α for the test.
If p < α, then we reject H0 , whereas if p > α we fail to reject H0 . In computing
the p-value, we are asking ourselves: what is the probability of getting a value
as extreme as the one observed for the test statistic or more extreme given H0 is
true (where what is “extreme” is defined by the alternative)? If this probability
is small, then we look to the alternative hypothesis as a possible explanation.

One-sided test for β1

In the discussion above we have considered only the alternative hypothesis β1 ≠ β (called a two-sided alternative). We can also construct tests for H0: β1 = β against the one-sided alternatives H1: β1 > β or H1: β1 < β. For H1: β1 > β the critical region for a size α test is T > tα;n−2 and the p-value of the test is computed as p = P(T > t | β1 = β).
For H1: β1 < β the critical region is T < −tα;n−2 and the p-value of the test is p = P(T ≤ t | β1 = β).

Hypothesis testing for β0

So far we have only dealt with the construction of tests for the coefficient β1 .
We can also construct a test of the null hypothesis H0 : β0 = β against one and
two sided alternatives.
To test the hypothesis H0: β0 = β versus the alternative H1: β0 ≠ β we use the test statistic

$$T = \frac{b_0 - \beta}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}},$$

which has a tn−2 distribution under H0. The critical region for a test at significance level α is T < −tα/2,n−2 or T > tα/2,n−2, and the p-value of the test is p = P(|T| > |t| | β0 = β), where t is the observed value of the test statistic.
For the one-sided alternatives H1: β0 > β or H1: β0 < β we modify the critical region to T > tα,n−2 or to T < −tα,n−2 respectively. The p-value for the test with H1: β0 > β is p = P(T > t | β0 = β). The p-value for the test with H1: β0 < β is p = P(T ≤ t | β0 = β).

Example: zinc concentrations in plants

Continuing our example on zinc concentrations in plants, suppose we want to test the null hypothesis H0: β1 = 0 versus H1: β1 ≠ 0 at the 5% level. We illustrate the computation of the p-value for the test. The test statistic here is

$$\frac{b_1}{\hat{\sigma}/\sqrt{S_{xx}}},$$

which we compute as

$$\frac{0.857}{10.14/\sqrt{4394.8}} = 5.59.$$

Under H0, this is a realization of a t random variable with 13 degrees of freedom, so that if T ∼ t13 we can compute the p-value as

$$p = P(|T| > 5.59) = 2P(T > 5.59) = 8.69 \times 10^{-5}$$

(from tables or R). So at the 5% level we reject H0 since p < 0.05. That is, we
believe that the zinc concentration in the sediment samples is a useful predictor
of zinc concentration in the plants.
Instead of computing the p-value we could also have computed the critical
region for the test. The critical region is

T < −t0.025;13 or T > t0.025;13

and from R or tables we have t0.025;13 = 2.16, so that the value of the test statistic
(5.59) lies in the critical region, and H0 is rejected.
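A sketch of the same test in R, using the summary quantities above:

# Sketch: test of H0: beta1 = 0 for the zinc example.
b1 <- 0.857; sigma.hat <- 10.14; Sxx <- 4394.8; n <- 15
tstat <- b1 / (sigma.hat / sqrt(Sxx))                 # about 5.59
2 * pt(abs(tstat), df = n - 2, lower.tail = FALSE)    # two-sided p-value, about 8.7e-05
qt(0.975, df = n - 2)                                 # critical value 2.16 for a 5% level test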

2.9 The Analysis of Variance (ANOVA) table


It is often of particular interest to test the null hypothesis

H0 : β1 = 0

versus the alternative


H1 : β1 ̸= 0.
When we test this hypothesis we are asking ourselves: is the predictor in the
linear regression model at all useful for explaining variation in the response? As
we have just seen, we can do this test using the test statistic
$$\frac{b_1}{\hat{\sigma}/\sqrt{S_{xx}}}, \qquad (14)$$

which has a tn−2 distribution under the null hypothesis.


We now derive an equivalent F statistic for the above test, and describe a way
of showing the calculations involved using the analysis of variance table. Observe
that

$$SS_{reg} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n}(b_0 + b_1 x_i - \bar{y})^2 = \sum_{i=1}^{n}(\bar{y} - b_1\bar{x} + b_1 x_i - \bar{y})^2 = b_1^2\sum_{i=1}^{n}(x_i - \bar{x})^2 = b_1^2 S_{xx}. \qquad (15)$$

Now, squaring the statistic (14) gives (under H0) a random variable with an F distribution with 1 and n − 2 degrees of freedom:

$$\frac{b_1^2 S_{xx}}{\hat{\sigma}^2}.$$

Using (15) we can write this last statistic as

$$F = \frac{SS_{reg}/1}{SS_{res}/(n-2)}. \qquad (16)$$

So we can write down an F statistic for testing for the usefulness of a predictor in
a simple linear regression model, and this F statistic is defined from the terms in
the fundamental partition of variation we have described previously. The critical
region for the test based on (16) is F > Fα;1,n−2 , where Fα;1,n−2 is the upper
100α percentage point of the F1,n−2 distribution (so we reject H0 : β1 = 0 if the
variation explained by the model is large compared to the residual variation).
The analysis of variance table presents for the simple linear regression model
the decomposition of total variation into SSreg and SSres and shows the compu-
tation of the F statistic above. The analysis of variance table has the form shown
below. The sum of squares column shows the partitioning of the total variation.
The second column, the degrees of freedom column, shows the degrees of freedom
parameters describing the null distribution of the statistic (16) (we will come to
a deeper understanding of the degrees of freedom parameters when we talk about
the general linear model). Entries in the mean square column are computed by
dividing the corresponding entry in the sum of squares column by its degrees of
freedom, and in the final column we have the F statistic which is used for testing
for the usefulness of the predictor.

Source       Sum of Squares   Degrees of freedom   Mean Square                    F
Regression   SSreg            1                    MSreg = SSreg/1                MSreg/σ̂2
Residual     SSres            n − 2                MSres = SSres/(n − 2) = σ̂2
Total        SStotal          n − 1

We will generalize this analysis of variance table later when we talk about the general linear model. The analysis of variance table is a useful tool for displaying the sources of variation in data, and for organizing the calculations involved in certain tests of hypotheses. We conclude this section with two examples.
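Before the examples, note that R produces this table directly with anova(); a minimal sketch with simulated data of our own:

# Sketch: the analysis of variance table for a fitted simple linear regression.
set.seed(3)
x <- 1:20
y <- 1 + 0.4 * x + rnorm(20)
fit <- lm(y ~ x)
anova(fit)     # rows give SSreg and SSres with their degrees of freedom,
               # mean squares, the F statistic (16) and its p-value
summary(fit)   # the t statistic for x is the square root of that F statistic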

Example: market model of stock returns

We return to our example on the market model of stock returns (from Keller,
Warrack and Bartel, “Statistics for Management and Economics: a Systematic
Approach (Second Edition)”, Wadsworth, Belmont, 1990, p. 725). Recall that in
the market model the monthly rate of return on a stock (R) is linearly related to
the monthly rate of return on the overall stock market (Rm ). The mathematical
description of the model is

R = β0 + β1 Rm + ε

where the error term ε is assumed to satisfy the requirements of the simple
linear regression model. We are interested in the coefficient β1 , called the stock’s
beta coefficient, which measures how sensitive the stock’s rate of return is to
changes in the level of the overall market. If β1 > 1 (β1 < 1), the stock’s rate
of return is more (less) sensitive to changes in the level of the overall market
than is the average stock. Our data consist of 5 years of monthly rates of return
on Host International stock and rates of return on the NYSE composite index
(measuring the rate of return on the overall stock market). When we considered
this data set previously, we computed b1 as 1.60, b0 as −0.14 and σ̂ as 9.27. Also,
Sxx = 1117.90.
Since there is uncertainty in our estimate of the beta coefficient it is of interest
to compute a confidence interval for this coefficient, and to test the hypothesis
that the beta coefficient is 1 (is there any real evidence that the stock is more or
less sensitive than average to the overall market level given the inherent variation
in the data?) A 100(1 − α)% confidence interval for β1 is

$$\left(b_1 - t_{\alpha/2,n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}},\; b_1 + t_{\alpha/2,n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}}\right),$$

which gives a 95% confidence interval in this case of

$$\left(1.60 - 2.002\,\frac{9.27}{\sqrt{1117.90}},\; 1.60 + 2.002\,\frac{9.27}{\sqrt{1117.90}}\right) = (1.04,\; 2.16).$$
We see that the 95% confidence interval does not contain 1, so there seems to be
some support for the assertion that the stock is more sensitive than average to
the overall market level. Looking at whether the confidence interval contains 1
is equivalent to performing a certain hypothesis test on β1. We test H0: β1 = 1 versus H1: β1 ≠ 1 using the test statistic

$$\frac{b_1 - 1}{\hat{\sigma}/\sqrt{S_{xx}}},$$

which evaluates here to

$$\frac{1.60 - 1}{9.27/\sqrt{1117.90}} = 2.16.$$

So if T ∼ t58, the p-value for the test is

$$p = P(|T| > 2.16) = 2P(T > 2.16) = 0.0349$$
(from tables or R), so that we reject H0: β1 = 1 at the 5% level. If we were to conduct a one-tailed test, H1: β1 > 1 say, computation of the p-value would be altered to p = P(T > 2.16) = 0.0175, and again we would reject H0: β1 = 1 at the 5% level.
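The same calculations in R, from the quantities quoted above:

# Sketch: test of H0: beta1 = 1 for the market model.
b1 <- 1.60; sigma.hat <- 9.27; Sxx <- 1117.90; n <- 60
tstat <- (b1 - 1) / (sigma.hat / sqrt(Sxx))               # about 2.16
2 * pt(tstat, df = n - 2, lower.tail = FALSE)             # two-sided p-value, about 0.035
pt(tstat, df = n - 2, lower.tail = FALSE)                 # one-sided p-value, about 0.017
b1 + c(-1, 1) * qt(0.975, n - 2) * sigma.hat / sqrt(Sxx)  # the 95% CI (1.04, 2.16)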

Example: Risk Assessment from Financial Reports

We describe a data set now that we will return to later when considering linear
regression models with multiple predictors. The following example is described in
Gail E. Farrelly, Kenneth R. Ferris and William R. Reichenstein, “Perceived Risk,
Market Risk and Accounting-Determined Risk Measures”, Accounting Review 60
(1985), pp. 287–88.
Investors are interested in assessing the riskiness of a company’s common
stock, as well as its expected rate of return. It is therefore desirable to potential
investors that a company’s financial reports provide information to help them


assess the company’s risk.
Farrelly, Ferris and Reichenstein conducted an investigation into the relation-
ship between seven accounting-determined measures of risk and the average risk
assessment of financial analysts. The seven accounting-determined measures of
risk (all of which could be computed from a company’s financial reports) and
their definitions are as follows:

Dividend payout, x1: (Cash dividends)/(Earnings)

Current ratio, x2: (Current assets)/(Current liabilities)

Asset size, x3: log(Total assets)

Asset growth, x4: Average growth rate in asset size for the years 1977-1981

Leverage, x5: (Total senior debt)/(Total assets)

Variability in earnings, x6: Standard deviation of the price-earnings ratio for the years 1977-1981

Covariability in earnings, x7: Strength of the relationship between a firm’s price-earnings ratio and the average price-earnings ratio of the market overall

These seven measures were computed for 25 well-known stocks, based on data
from the companies’ annual reports from 1977-1981. These data were then sent
to a random sample of 500 financial analysts, who “were requested to assess the
risk of each of the 25 companies on a scale of 1 (low) to 9 (high), assuming
that the stock was to be added to a diversified portfolio.” The mean rating, y,
assigned by the 209 financial analysts who responded is recorded for each of the
25 stocks. This measure of the financial analysts’ risk perception was taken to
be a reasonable surrogate for the (market) risk of each stock.
Prediction of market risk based on a linear regression model with multiple
predictors is a problem we will consider later. For the moment, we consider a
simple linear regression model for predicting market risk based on asset size. A
scatterplot of market risk versus asset size is shown in Figure 2.3. We do not
describe in detail the computation of parameters or computation of test statistics
here, but the fitted line is

ybi = 8.1433 − 0.4123xi ,

the estimated error standard deviation is 1.475, and the p-value for testing H0 :
β1 = 0 versus H1 : β1 ̸= 0 using the usual t-statistic is p = 0.0211. So it
seems that asset size is a useful predictor of market risk. In later work on linear
regression models with multiple predictors we consider the problem of which predictors should be chosen for predicting market risk from amongst the seven predictors described above.

[Figure 2.3: Scatter plot of assessment of market risk versus asset size with fitted least squares regression line.]

Choosing subsets of predictors when there are many
to choose from in explaining variation in the response is a much trickier problem
than the one we have considered here, and further discussion is delayed until we
have given a thorough treatment of the general linear model.

2.10 Prediction in the simple linear regression model


The purpose of building a simple linear regression model is often to predict a
future response value when the value of the predictor (x0 say) for that observation
is known. Consider, for instance, the risk assessment data in the last subsection.
For the companies in the study we know their asset size and an average rating of
risk from 209 financial analysts. For a new company which has not been assessed
by the analysts we can obtain the predictor (asset size) from company reports,
and we can forecast risk using the fitted regression line.
Obviously we may also wish to quantify our uncertainty about predictions
from a simple linear regression model. In this subsection we consider construction
of confidence intervals on the mean response for a new observation, as well as
construction of prediction intervals which cover the actual response for a new
observation with specified probability.
Suppose that we have fitted a simple linear regression model to data consisting
of responses y1 , ..., yn and predictors x1 , ..., xn and that we have obtained least
squares estimates b0 and b1 of β0 and β1, and an estimate σ̂ 2 of σ 2. How do
we estimate the mean response for a future observation given a value x0 for the
predictor? Also, how do we estimate confidence in the estimated mean and in a
future response when we know the value of the predictor?
An intuitively reasonable estimator of the mean response when the predictor
is x0 is of course simply
b0 + b1 x0,
which, following standard notation, we write as ŷ(x0). The actual mean response
when the predictor is x0 is
β0 + β1 x0
and a new observation y0 at this predictor value can be written as

β0 + β1 x0 + ε0

where ε0 is a normally distributed error term with mean zero and variance σ 2 , in-
dependent of y1 , ..., yn . A confidence interval for the mean response at x0 will take
into account the uncertainty in estimating β0 and β1 , and a prediction interval for
y0 (which is a random interval containing y0 with specified probability) will take
into account both the uncertainty in estimating β0 and β1 and the variability of ε0 .
Confidence intervals for the mean response

First we give an expression for the variance of ŷ(x0): it can be shown that

    Var(ŷ(x0)) = σ²( 1/n + (x0 − x̄)²/Sxx ).

In discussing the general linear model later we will show that b0 and b1 are jointly
normally distributed, so that ŷ(x0) (which is a linear combination of b0 and b1) is
normally distributed. Hence since E(ŷ(x0)) = E(b0 + b1 x0) = β0 + β1 x0 we have

    ŷ(x0) ∼ N( β0 + β1 x0 , σ²( 1/n + (x0 − x̄)²/Sxx ) )

or

    (ŷ(x0) − β0 − β1 x0) / ( σ √(1/n + (x0 − x̄)²/Sxx) ) ∼ N(0, 1).        (17)

Also, we have as before that

    (n − 2)σ̂²/σ² ∼ χ²n−2        (18)

and that this variate is independent of b0 and b1 (and hence of ŷ(x0)). So dividing
(17) by the square root of (18) divided by its degrees of freedom gives

    (ŷ(x0) − β0 − β1 x0) / ( σ̂ √(1/n + (x0 − x̄)²/Sxx) ) ∼ tn−2 .

Hence if tα/2,n−2 is the upper 100 × α/2 percentage point of a t distribution with
n − 2 degrees of freedom, we have

    P( −tα/2,n−2 ≤ (ŷ(x0) − β0 − β1 x0) / ( σ̂ √(1/n + (x0 − x̄)²/Sxx) ) ≤ tα/2,n−2 ) = 1 − α

or

    P( ŷ(x0) − tα/2,n−2 σ̂ √(1/n + (x0 − x̄)²/Sxx) ≤ β0 + β1 x0
                         ≤ ŷ(x0) + tα/2,n−2 σ̂ √(1/n + (x0 − x̄)²/Sxx) ) = 1 − α.

Hence a 100(1 − α) percentage confidence interval for the mean response β0 + β1 x0
when the predictor is x0 is

    ( ŷ(x0) − tα/2,n−2 σ̂ √(1/n + (x0 − x̄)²/Sxx) ,  ŷ(x0) + tα/2,n−2 σ̂ √(1/n + (x0 − x̄)²/Sxx) ).

Prediction intervals

It remains to find a prediction interval for a new observation y0 when the
predictor value is x0. Since y0 and ŷ(x0) are independent, we have that

    Var(y0 − ŷ(x0)) = Var(y0) + Var(ŷ(x0))
                    = σ² + σ²( 1/n + (x0 − x̄)²/Sxx )
                    = σ²( 1 + 1/n + (x0 − x̄)²/Sxx ).

Also, E(y0 − ŷ(x0)) = β0 + β1 x0 − β0 − β1 x0 = 0. Now, y0 is normally distributed,
and so is ŷ(x0), and y0 and ŷ(x0) are independent, so we can write

    (y0 − ŷ(x0)) / ( σ √(1 + 1/n + (x0 − x̄)²/Sxx) ) ∼ N(0, 1)

and of course as before

    (n − 2)σ̂²/σ² ∼ χ²n−2

independently of y0. So similar reasoning to before gives

    (y0 − ŷ(x0)) / ( σ̂ √(1 + 1/n + (x0 − x̄)²/Sxx) ) ∼ tn−2

and a 100(1 − α)% prediction interval for y0 is

    ( ŷ(x0) − tα/2,n−2 σ̂ √(1 + 1/n + (x0 − x̄)²/Sxx) ,  ŷ(x0) + tα/2,n−2 σ̂ √(1 + 1/n + (x0 − x̄)²/Sxx) ).

The expression given above for the prediction interval is a very natural one. It
tells us that our uncertainty of prediction is related to the residual error variance
(estimated by σ̂²), the sample size, and how far x0 is away from the mean of the
predictors x̄. If x0 is far away from where most of the observed predictors lie, then
our uncertainty of prediction is increased. We must in any case be very careful
about prediction for values of the predictor far away from any of the observed xi :
while linearity and the other assumptions of the simple linear regression model
may seem to hold locally, these assumptions can break down when we are far
from observed values of the predictors. Extrapolation is dangerous!
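
To make these interval formulas concrete, here is a minimal computational sketch (in Python, on synthetic data invented purely for illustration) of a 95 percent confidence interval for the mean response and a 95 percent prediction interval at a new predictor value x0.

import numpy as np
from scipy import stats

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(6, 12, size=20)
y = 8.0 - 0.4 * x + rng.normal(scale=1.5, size=20)
n = len(x)

# Least squares fit
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

# Unbiased estimate of the error variance (n - 2 degrees of freedom)
resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)

# 95% confidence interval for the mean and prediction interval at x0
x0 = 9.0
y0_hat = b0 + b1 * x0
tcrit = stats.t.ppf(0.975, df=n - 2)
se_mean = np.sqrt(sigma2_hat * (1 / n + (x0 - x.mean()) ** 2 / Sxx))
se_pred = np.sqrt(sigma2_hat * (1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx))
print("CI for mean:", (y0_hat - tcrit * se_mean, y0_hat + tcrit * se_mean))
print("PI for new obs:", (y0_hat - tcrit * se_pred, y0_hat + tcrit * se_pred))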

Example: risk assessment data


Figure 2.4: Scatter plot of assessment of market risk versus asset size with fitted
least squares regression line and 95 percent confidence intervals for the mean
(inner bands) and 95 percent prediction intervals (outer bands).

We conclude this section with an application to the risk assessment data.


Recall that in the previous subsection we built a simple linear regression model
to predict risk based on asset size. Figure 2.4 shows a scatterplot of these data,
together with 95 percent confidence intervals for the mean over a range of values
for the predictor (the inner bands) and prediction intervals over the same range of
predictor values (the outer bands). Of course, the prediction intervals are wider
than the confidence intervals for the mean, since they reflect both our uncertainty
in estimating the conditional mean and the inherent variability of the errors.
It should be pointed out that the bands on the graph in Figure 2.4 show
pointwise confidence and prediction intervals and this must be kept in mind
when interpreting the graph. Consider, for instance, the inner bands, which give
the upper and lower confidence limits for the mean response over the range of
predictors in the graph. At each point we have 95 percent confidence that the
mean response lies between the bounds at that point: however, are we 95 percent
confident that the true mean response lies between the upper and lower limits
over the whole range of the predictors in the graph? That is, if we repeated this
experiment a large number of times, would the true mean response lie between
all the upper and lower confidence limits (the confidence bands) simultaneously
in 95 percent of the repetitions of the experiment? In general, the answer is no.

2.11 Simultaneous inference and Bonferroni adjustment


We now turn to a discussion of simultaneous inference, which we touched on at
the end of the last subsection. The confidence intervals for the mean that we drew
in Figure 2.4 were pointwise intervals, and it is not true that the true regression
line lies between the bands with 95% confidence, even though for any point x0
β0 + β1 x0 lies between the bands at x0 with 95% confidence.
Let’s be more precise. Write E(x0; α) for the random interval

    ( ŷ(x0) − tα/2,n−2 σ̂ √(1/n + (x0 − x̄)²/Sxx) ,  ŷ(x0) + tα/2,n−2 σ̂ √(1/n + (x0 − x̄)²/Sxx) ).
We know that
P(β0 + β1 x0 ∈ E(x0 ; α)) = 1 − α
so that E(x0 ; α) defines a 100(1 − α)% confidence interval.
Now suppose we consider two values for the predictor, x0,1 and x0,2 say. We
want to specify a pair of intervals which will contain β0 + β1 x0,1 and β0 + β1 x0,2
simultaneously with 100(1 − α)% confidence.
Pointwise 100(1 − α)% confidence intervals do not guarantee joint coverage
with confidence 100(1 − α)%. That is, it is not true in general that
P(β0 + β1 x0,1 ∈ E(x0,1 ; α) and β0 + β1 x0,2 ∈ E(x0,2 ; α)) = 1 − α.
To achieve the desired joint confidence level we must adjust the confidence level
for the pointwise intervals.

2.12 Bonferroni adjustment


Suppose that we have parameters θ1 , ..., θk and that we have a method for con-
structing pointwise confidence intervals for each of these parameters. Then it
can be shown that if we construct pointwise 100(1 − α/k)% confidence intervals
for θ1 , ..., θk , then these intervals will have joint coverage of at least 100(1 − α)%.
This adjustment to the pointwise confidence levels is referred to as Bonferroni
adjustment. (See your tutorial sheet for justification).
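
As a small numerical sketch (Python; the numbers k = 5 and 17 residual degrees of freedom are hypothetical), the Bonferroni adjustment simply replaces the pointwise t multiplier by one computed at level α/k:

from scipy import stats

alpha, k, df = 0.05, 5, 17                           # hypothetical: 5 intervals, 17 residual df
t_pointwise = stats.t.ppf(1 - alpha / 2, df)         # multiplier for a single 95% interval
t_bonferroni = stats.t.ppf(1 - alpha / (2 * k), df)  # adjusted multiplier for joint coverage of at least 95%
print(t_pointwise, t_bonferroni)                     # the adjusted multiplier is larger: wider intervals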
The ideas of simultaneous inference and Bonferroni adjustment can be applied
in hypothesis testing as well. An analogy here is with the notion of double
jeopardy in a criminal trial. This is the idea that a person should not be tried
for the same crime twice. Why would this not be fair?
Suppose that someone is tried for the same crime twice. They are convicted
either if they are convicted in the first trial or in the second trial. If the probability
is α that the person is wrongfully convicted in one of the two trials, then in general
the probability of wrongful conviction on either of the two trials will be bigger
than α. (Running the same small risk twice will result in a bigger risk).
In statistical language, suppose that we have parameters θ1 , ..., θk and that
we have for each i = 1, ..., k a way of testing the null hypothesis

H0 : θi ∈ Θi

against the alternative


H1 : θi ∈ Θci
where Θi is some appropriate subset of the parameter space and Θci is its comple-
ment (think of these tests as being the individual trials in our double jeopardy
analogy with H0 as “The defendant is innocent”).
Suppose we want to test

H0 : θ1 ∈ Θ1 , ..., θk ∈ Θk

against
H1 : Not all θi ∈ Θi , i = 1, ..., k
at significance level α. One approach to this test is to conduct individual tests on
the coefficients θi and to reject the null hypothesis if any of the individual tests
are rejected. But what significance level should be used for the individual tests
to ensure significance level α for the joint test? It turns out that if we adjust
the significance level for the individual tests to be α/k, then the joint test has
significance level less than or equal to α. This is the Bonferroni adjustment to
the significance level.
We should point out that the Bonferroni adjustment is conservative in the
sense that the Bonferroni confidence intervals discussed above give a joint cov-
erage which is usually greater than 100(1 − α)%, and the Bonferroni adjustment

to significance levels in hypothesis testing results in a test with significance level


less than α. Other methods for constructing joint confidence regions and tests
are preferable when they are available: for linear models, alternative methods
often are available (see later discussion for the general linear model).

2.13 Criticism of the model: residuals


At the beginning of the course we described the process of building a statistical
model as an iterative procedure where the analyst cycles through stages of model
formulation, estimation and criticism. We have described in detail for the simple
linear regression model some of the tools which are used in model formulation
and selection and estimation. In this section we take a brief look at one method
for criticism of a model.
Suppose we have fitted a simple linear regression model to data consisting of
responses y1 , ..., yn and predictors x1 , ..., xn . Under the assumptions of the model,

y i = β 0 + β 1 xi + εi

where the εi , i = 1, ..., n, are uncorrelated zero mean errors with common variance
σ 2 . We can estimate β0 and β1 via the least squares estimators b0 and b1 . The
fitted values are
ŷi = b0 + b1 xi
for i = 1, ..., n. We define the residuals of the fitted model to be the differences
between the responses and the fitted values,

ei = yi − ŷi

i = 1, ..., n. If we can estimate β0 and β1 precisely using b0 and b1 , then the resid-
uals should be approximately the error terms εi and so we can use the residuals
to check some of the assumptions made about the error terms in the simple linear
regression model. As well as the residuals ei (sometimes called the raw residuals)
there are other kinds of residuals which may be more useful in some situations,
but we will not discuss these at the moment.
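
The following sketch (Python, on a synthetic data set invented for illustration) computes the raw residuals from a straight-line fit and plots them against the fitted values, which is the kind of diagnostic plot discussed next.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a mild nonlinearity, for illustration only
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, size=40))
y = 1 + 0.5 * x + 0.4 * x ** 2 + rng.normal(scale=0.05, size=40)

# Fit a straight line by least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
residuals = y - fitted

# Residuals versus fitted values: a trend suggests a misspecified mean structure
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()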
What happens when the assumptions of the simple linear regression model
are violated? It is common for statisticians to use plots of the residuals against
the fitted values or predictor values to try to detect an incorrect specification of
the systematic part of the model or to detect departures from the constancy of
variance assumption for the errors. Since the error terms in the simple linear
regression model are supposed to have zero mean, any systematic trend in the
residuals as the fitted values or predictor values increase provides some evidence
that the errors do not have zero mean and that the mean structure is not correctly
specified. Similarly, increasing variability as the fitted values or predictor values
increase can indicate that the residuals do not have a common variance σ 2 . We

can also use the residuals to check the assumption of normality of errors, although
we do not discuss this further at the present time.
The usefulness of residual plots for model criticism is perhaps best shown via
some examples. Figure 2.5 shows a scatterplot of some synthetic data together
with a fitted linear regression model. Below is a plot of the residuals versus
the fitted values: there is a clear trend in the mean level of the residuals as
the fitted values increase, suggesting that perhaps a model which is nonlinear in
the predictor would be more appropriate than the model considered here. This
example shows how beneficial residual plots can be for detecting departures from
the assumed mean structure for the model.
We give another synthetic example which shows the usefulness of the residuals
for detection of violations of the assumption of constancy of variance of the errors.
It often happens in data sets from many areas of science, social science and
economics that the variation in the response increases as the mean response
increases. Figure 2.6 shows a scatterplot for a synthetic data set which exhibits
this kind of behaviour, together with a fitted linear regression model. Below it
is the corresponding residual plot: the increasing trend in the variance of the
residuals is obvious.
We will say much more about residuals later in this course.
Figure 2.5: Scatter plot of synthetic data set with fitted linear regression (top)
and plot of residuals versus fitted values.

Figure 2.6: Scatter plot of synthetic data set with fitted linear regression (top)
and plot of residuals versus fitted values.

3 The general linear model


In previous sections we considered the process of model building for the simple
linear regression model. We attempted to explain variation in the response in
terms of variation of a single predictor. In many practical problems, however,
there are multiple predictors which may be useful for explaining variation in the
response, and we can extend the simple linear regression model to the so-called
general linear model to deal with this more complicated situation.
The rest of this course is concerned with developing tools for model selection,
estimation and model criticism for the general linear model, just as we did for
the simple linear regression model.

3.1 Formulation of the general linear model


Extending our previous notation, we write yi , i = 1, ..., n, for the response vari-
able, but now corresponding to each yi we have values of k predictor variables
xi1 , ..., xik . In the general linear model we assume that

yi = β0 + β1 xi1 + ... + βk xik + εi

where β0 , ..., βk are unknown parameters and εi , i = 1, ..., n are a collection of


uncorrelated errors with zero mean and common variance σ 2 . As for the simple
linear regression model, we will also sometimes make the assumption that the
errors are normally distributed. This assumption is needed for constructing con-
fidence intervals and prediction intervals and for hypothesis testing.

Example: Risk Assessment Data

We discussed earlier the problem of forecasting the risk of a company’s common


stock based on accounting-determined measures of risk which can be obtained
from a company’s financial reports. The measures of risk considered were as
follows:

Dividend payout, x1 : (Cash dividends)/(Earnings)

Current ratio, x2 : (Current assets)/(Current liabilities)

Asset size, x3 : log(Total assets)

Asset growth, x4 : Average growth rate in asset size for the years 1977–1981

Leverage, x5 : (Total senior debt)/(Total assets)

Variability in earnings, x6 : Standard deviation of the price-earnings ratio for


the years 1977-1981

Covariability in earnings, x7 : Strength of the relationship between a firm’s


price-earnings ratio and the average price-earnings ratio of the market over-
all

These measures of risk were determined for 25 companies, and sent to 500 ran-
domly chosen financial analysts, 209 of whom responded to a request to rank the
risk of each of the 25 companies on a scale of 1 (low) to 9 (high), assuming that
the stock was to be added to a diversified portfolio. The average of the ratings
(y) for the 209 analysts was then computed for each of the 25 stocks.
We are interested in predicting risk (the response y) in terms of the accounting-
determined measures of risk x1 , ..., x7 . One way of approaching this prediction
problem is to build a linear model for the response in terms of the multiple
predictors x1 , ..., x7 . Specifically, we write

yi = β0 + β1 xi1 + ... + β7 xi7 + εi

where yi is the risk for the ith company, xi1 , ..., xi7 are the accounting-determined
measures of risk for the ith company, and εi , i = 1, ..., 25 are error terms which
are assumed uncorrelated with a common variance.
Some of the problems we face in using this model to predict risk are the same
as problems we have faced for the simple linear regression model, but some are
more complicated. We need to estimate the parameters β0 , ..., β7 , and the error
variance σ 2 . We also need to decide what subset of the predictors is most useful
for predicting the risk (and deciding this question can be much more complicated
than deciding whether a single predictor is useful or not in a simple linear regres-
sion, particularly if the various predictors considered contain much the same kind
of information). We must also develop tools for checking model assumptions, and
once we have a model we are happy with we must be able to use it for making
predictions and for describing our uncertainty about those predictions. These
problems and others are considered in the remainder of the course.

It is important to note that when we talk about a linear model we mean a


model which is linear in the parameters. It is perfectly valid to consider new
predictors constructed from non-linear transformations of the original predictors
in the linear model, provided that the new predictors enter into our model linearly
in the parameters. For instance, consider the scatterplot of Figure 3.7. It is
evident from this scatterplot that a simple linear regression model for the response
y with the predictor x would not be appropriate. The mean response seems to
change non-linearly with x. However, one possible model for these data might be

yi = β0 + β1 xi + β2 xi² + β3 xi³ + εi .

That is, we can use the general linear model with three predictors (the original
predictor x together with x² and x³) as a model for the variation in the response.
Figure 3.7: Scatter plot for synthetic data set showing non-linear relationship
between mean response and predictor.

This model is still linear in the parameters, and hence it is a linear model by our
definition. We can describe quite complicated relationships between the mean re-
sponse and a predictor using polynomials, and we may wish to develop inferential
techniques for deciding on an appropriate order for a polynomial approximation
(we have used a cubic polynomial here). We will develop hypothesis tests for
the linear model that can provide one approach to problems of model selection
such as this. There are other flexible approximation techniques for non-linear
relationships between the mean response and a predictor which perform better
than polynomial regression for many purposes.
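
As a sketch of how polynomial regression is fitted as a linear model (Python, with synthetic data made up for illustration), the cubic model above corresponds to a design matrix with columns 1, x, x² and x³:

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)
y = 0.2 + 0.8 * x - 1.1 * x ** 3 + rng.normal(scale=0.1, size=50)  # synthetic responses

# Design matrix with columns 1, x, x^2, x^3: still linear in the parameters
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares estimates of beta_0, ..., beta_3
print(b)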
As another example we consider a synthetic data set in which there are two
predictors of the response. Below is a meshplot showing the responses, the values
of the predictors, and the true mean response as a function of the predictors.

Figure 3.8: Scatter plot for synthetic data set showing non-linear relationship
between mean response and two predictors. The circles show the response values,
and the surface shows the true mean response used in the simulation.

I simulated this data set from the model

yi = 1 + 2xi1 + 2.5xi2 − 0.02 xi1 xi2 − 0.05 xi1² − 0.06 xi2² + εi

i = 1, ..., 50 where Var(εi ) = 1.0 here and the εi were taken to be normally
distributed. This data set also comes from a linear model, as the relationship of
the mean response to the predictors in the above model is linear in the parameters
(the coefficients in the above bivariate polynomial would be unknown for a real
data set and would need to be estimated from the data). By now you should

realize that the linear model is a very flexible tool for modelling complicated
relationships between a response variable and a set of predictors.
Before we go on we formulate the linear model in matrix notation: this matrix
notation will be used throughout the rest of the course, and the use of matrix
language is indispensible in describing the theory of linear models. Write y for
the vector of responses, y = (y1 , ..., yn )⊤ where a⊤ denotes the transpose of the
vector a here (so y is a column vector). Also, write β for the vector of the mean
parameters in the linear model, β = (β0 , ..., βk )⊤ , and, letting p = k + 1 (so that
p is the length of the vector β), write X for the n × p matrix whose 1st column
is a column of 1’s and whose (i + 1)th column is the vector (x1i , ..., xni )⊤ for
i = 1, ..., k. Finally write ε = (ε1 , ..., εn ) for the vector of errors in the linear
model. Then we can write the general linear model in matrix notation as

    [ y1 ]   [ 1  x11 ... x1k ] [ β0 ]   [ ε1 ]
    [ .. ] = [ ..  ..      .. ] [ .. ] + [ .. ]
    [ yn ]   [ 1  xn1 ... xnk ] [ βk ]   [ εn ]

or

    y = Xβ + ε.
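
A minimal sketch (Python, with small made-up predictor vectors) of how the design matrix X is assembled from a column of 1's and the k predictor columns:

import numpy as np

# Hypothetical predictors: n = 6 observations on k = 2 predictors
x1 = np.array([3.5, 5.5, 6.2, 2.2, 5.9, 5.6])
x2 = np.array([10.9, 8.7, 12.1, 8.4, 9.9, 11.3])
n, k = len(x1), 2

# Design matrix: first column all 1's, then one column per predictor
X = np.column_stack([np.ones(n), x1, x2])
print(X.shape)   # (n, p) with p = k + 1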

3.2 Least squares estimation of mean parameters


In the simple linear regression model we obtained least squares estimators of the
parameters β0 and β1 in the systematic part of the model by minimizing the sum
of the squared residuals. In this subsection we develop a similar least squares
method of estimation for the parameters β in the general linear model.

The least squares criterion

In the general linear model, we have

y = Xβ + ε

where y is an n×1 vector of responses, X is an n×p matrix and ε is an n×1 vector


of zero mean errors, uncorrelated with a common variance σ 2 . The systematic
component of the model here is given by the term Xβ. If β is known, then
Xβ is the vector of predicted values for the responses. As for the simple linear
regression model, we wish to define a goodness of fit criterion which reflects the
discrepancy between the responses y and predictions Xβ: when β is unknown
this criterion can then be minimized to obtain an estimator of β.
A natural extension of the criterion considered for the simple linear regression
model is to minimize the sum of squares of the differences between the responses

and predictions (that is, the sum of the squared elements of the vector y − Xβ).
Note that for any n × 1 vector x, x⊤x is the scalar value ∑_{i=1}^n xi², so that we can
write this sum of squared differences as

    (y − Xβ)⊤(y − Xβ).        (19)

We will refer to (19) as the least squares criterion. Of course, it reduces to


the criterion we considered in the previous topic for the case of simple linear
regression.
We will call the vector b which minimizes (19) with respect to β the least
squares estimator of β. We derive this estimator now. First we express (19) in
a more convenient form, before giving some results about differentiation with
respect to vectors. We will then differentiate (19) to get a set of equations which
can be solved to obtain the least squares estimator of β.

An alternative expression for the least squares criterion

First we recall a result about matrix transposes: for two matrices A and B which
have dimensions so that A and B can be multiplied together, the transpose of
AB, (AB)⊤ is equal to B ⊤ A⊤ . If you don’t know this result, or have forgotten it,
then you should prove it as an (easy) exercise. Of course, using the above result,
if A, B and C are three matrices such that the product ABC is well defined,
then it is easy to show that (ABC)⊤ = C ⊤ B ⊤ A⊤ , and a similar result holds for
a product of four or more matrices.
Now, observe that (19) can be written

(y − Xβ)⊤ (y − Xβ) = y ⊤ (y − Xβ) − (Xβ)⊤ (y − Xβ)


= y ⊤ y − y ⊤ Xβ − β ⊤ X ⊤ y + β ⊤ X ⊤ Xβ
= y ⊤ y − (X ⊤ y)⊤ β − β ⊤ X ⊤ y + β ⊤ X ⊤ Xβ.

Since β ⊤ X ⊤ y is a scalar, we know that the transpose of a scalar is itself, and so

β ⊤ X ⊤ y = (X ⊤ y)⊤ β.

Hence the least squares criterion is

(y − Xβ)⊤ (y − Xβ) = y ⊤ y − 2(X ⊤ y)⊤ β + β ⊤ X ⊤ Xβ. (20)

We must take the partial derivatives of the above expression with respect to the
elements of the vector β, and set these to zero to obtain the least squares esti-
mator of β. We do this now, but first we state some results which will help us
about differentiating vector expressions.

Differentiating Vector Expressions



For a real valued function f (x) defined for an argument x = (x1 , ..., xp ) which is
a p-dimensional vector of real numbers, we define the derivative of f with respect
to x as the vector

    ∂f(x)/∂x = ( ∂f(x)/∂x1 , ..., ∂f(x)/∂xp )⊤.

With this notation, in order to find the least squares estimator we need to find
the value b such that the function

    (∂/∂β) (y − Xβ)⊤(y − Xβ)
is zero at β = b. To differentiate the least squares criterion with respect to β,
the following result is useful.

Lemma: Let a be a fixed p-dimensional vector of real numbers, and let A be


a fixed p × p matrix of real numbers. If x is also a p-dimensional vector of real
numbers, and if f (x) is the function

f (x) = a⊤ x

then

    ∂f(x)/∂x = a.

Also, if

    f(x) = x⊤Ax

then

    ∂f(x)/∂x = Ax + A⊤x.

Proof: An easy exercise.

The least squares estimator of β

Using the above result to differentiate the least squares criterion, we have

    (∂/∂β) (y − Xβ)⊤(y − Xβ) = (∂/∂β) ( y⊤y − 2(X⊤y)⊤β + β⊤X⊤Xβ )
                             = −2X⊤y + X⊤Xβ + (X⊤X)⊤β
                             = −2X⊤y + 2X⊤Xβ.

Hence the least squares estimator b of β satisfies

    −2X⊤y + 2X⊤Xb = 0

from which we have


(X ⊤ X)b = X ⊤ y. (21)
The set of equations (21) is referred to as the normal equations. Now, X ⊤ X is
a p × p matrix, and if this matrix is invertible we can write the least squares
estimator as
b = (X ⊤ X)−1 X ⊤ y.
In this course we will deal mainly with the so-called full rank model, for which
X ⊤ X is invertible so that the normal equations have a unique solution. We ex-
plain precisely what we mean by the full rank linear model now, but first we need
to revise some more concepts from linear algebra.
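
The sketch below (Python) solves the normal equations for a full rank model; numerically it is preferable to solve the linear system (or use a least squares routine) rather than form (X⊤X)⁻¹ explicitly. The small data set is made up for illustration.

import numpy as np

def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y for the full rank model."""
    XtX = X.T @ X
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)   # avoids forming the inverse explicitly

# Example with a small hypothetical data set
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
print(least_squares(X, y))            # approximately (intercept, slope)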

Linear Independence and Rank

First we describe the notion of linear independence. If u1 , ..., up are a set of


vectors, we say that they are linearly dependent if there is some collection of
constants a1 , ..., ap not all zero such that
a1 u1 + ... + ap up = 0.
That is, the vectors are linearly dependent if we can express one of them as a linear
combination of the others. If the vectors u1 , ..., up are not linearly dependent, then
they are said to be linearly independent.
The idea of linear independence is needed in defining the rank of a matrix.
Suppose U is an n × p matrix with columns u1 , ..., up so that
U = [u1 · · · up ].
Then we define the rank of U to be the maximal number of linearly independent
vectors in the set u1 , ..., up . Hence the rank of U is at most p. Now consider the
design matrix X in the general linear model. This is a matrix with p columns
and n rows, and we assume in what follows that the number of observations n
is greater than or equal to the number of mean parameters in the model p. A
linear model is said to be full rank if the rank of X is equal to p.
We now state some of the properties of the rank of a matrix. The following
result is from Raymond H. Myers and Janet S. Milton, “A First Course in the
Theory of Linear Statistical Models,” PWS-KENT, Boston, 1991, p. 26. In the
statement of the Lemma, U is an n × p matrix with n > p. One implication of
the results below is that if X is the design matrix in a full rank linear model,
then X ⊤ X is invertible and hence the normal equations have a unique solution.
We have used the notation rank(U ) here for the rank of a matrix U .

Lemma:

1. If rank(U ) = p, then rank(U ) = rank(U ⊤ ) = rank(U ⊤ U ) = p.

2. Suppose n = p so that U is a p × p matrix. Then U is nonsingular (that is,


U has an inverse) if and only if rank(U ) = p.

3. If P is a nonsingular n × n matrix, and Q is a nonsingular p × p matrix,


then rank(U ) = rank(P U ) = rank(U Q).

4. The rank of a diagonal matrix is equal to the number of nonzero columns


in the matrix.

5. If V is a matrix with dimensions so that U V is well defined, then the rank


of U V is less than or equal to the rank of U and less than or equal to the
rank of V .

We end this subsection with two examples. In the first, we give the least
squares estimates of parameters in a multiple linear regression model for the risk
assessment data. In the second example, we show for the simple linear regression
model that the expression we have just derived in matrix notation for the least
squares estimator is equivalent to the expressions we gave in our previous work
for b0 and b1 , the least squares estimators of β0 and β1 .

Example: risk assessment data

For the risk assessment data, the response yi was risk assessment, and there were
seven accounting determined measures of risk (predictors) xi1 , ..., xi7 . The model
is
yi = β0 + β1 xi1 + ... + β7 xi7 + εi .
The result of fitting the parameters β0 , ..., β7 using the least squares criterion is
given next.

Regression Analysis

The regression equation is


Mean Risk Assessment = 2.19 + 0.443 Dividend Payout + 0.865 Current Ratio
- 0.247 Asset Size + 1.96 Asset Growth + 3.59 Leverage
+ 0.135 Variability Earnings + 1.05 Covariability Earnings

Predictor Coef StDev T P


Constant 2.191 1.870 1.17 0.258
Dividend 0.4426 0.1417 3.12 0.006
Current 0.8645 0.3733 2.32 0.033
Asset Si -0.2472 0.1316 -1.88 0.078
Asset Gr 1.963 6.253 0.31 0.757

Leverage 3.592 1.436 2.50 0.023


Variabil 0.13459 0.05978 2.25 0.038
Covariab 1.0450 0.9380 1.11 0.281

S = 0.9816 R-Sq = 74.2% R-Sq(adj) = 63.5%

Example: the simple linear regression model

In the simple linear regression model, y = (y1 , ..., yn )⊤ and X is an n × 2 matrix,

    X = [ 1  x1 ]
        [ 1  x2 ]
        [ .. .. ]
        [ 1  xn ].

Hence we have that (with all sums running over i = 1, ..., n)

    X⊤X = [   n      ∑ xi  ]
          [ ∑ xi    ∑ xi²  ]

and

    (X⊤X)⁻¹ = 1/(n Sxx) [  ∑ xi²   −∑ xi ]
                        [ −∑ xi       n  ].

Also,

    X⊤y = [  ∑ yi   ]
          [ ∑ xi yi ].

Hence

    b = (X⊤X)⁻¹(X⊤y)
      = 1/(n Sxx) [ ∑ xi² ∑ yi − ∑ xi ∑ xi yi ]
                  [ n ∑ xi yi − ∑ xi ∑ yi     ]        (22)

Using our previous notation, we recognize the second element of (22) as (nSxy )/(nSxx ) =
Sxy /Sxx , which was the expression we derived for the least squares estimator of
the slope when studying the simple linear regression model. Our previous expres-
sion for the least squares estimator of the intercept was

b0 = ȳ − b1 x̄

which we must show is equal to the first element of (22). Now, with all sums again
running over i = 1, ..., n,

    b0 = ȳ − b1 x̄
       = ȳ − (Sxy/Sxx) x̄
       = (ȳ Sxx − x̄ Sxy) / Sxx
       = [ (∑ yi)(∑ xi² − (∑ xi)²/n) − (∑ xi)(∑ xi yi − (∑ xi)(∑ yi)/n) ] / (n Sxx)
       = [ ∑ xi² ∑ yi − (∑ xi)²(∑ yi)/n − ∑ xi ∑ xi yi + (∑ xi)²(∑ yi)/n ] / (n Sxx)
       = [ (∑ xi²)(∑ yi) − (∑ xi)(∑ xi yi) ] / (n Sxx)

as required.
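
A quick numerical check of this equivalence (Python, with a small made-up data set): the matrix formula (X⊤X)⁻¹X⊤y reproduces Sxy/Sxx for the slope and ȳ − b1 x̄ for the intercept.

import numpy as np

x = np.array([6.0, 7.5, 8.0, 9.5, 11.0, 12.0])   # made-up predictor values
y = np.array([5.8, 5.1, 4.9, 4.2, 3.6, 3.1])     # made-up responses

# Matrix form
X = np.column_stack([np.ones_like(x), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar formulas from the simple linear regression chapter
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

print(b_matrix, (b0, b1))   # the two sets of estimates agree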

3.3 Properties of least squares estimator of mean parameters
We now describe some of the properties of the least squares estimator b of β.
In our previous development of the simple linear regression model, we showed
that b0 and b1 were unbiased estimators of β0 and β1 . We also gave expressions
for the variance of b0 and b1 and the covariance Cov(b0 , b1 ) of b0 and b1 . In this
subsection we develop similar results for the least squares estimator b of β in the
general linear model. We will then discuss various optimality properties of the
estimator b.

Expectations of random vectors

Before we discuss properties of b, however, we need to give some results about


expectations for random vectors. If Y = (Y1 , ..., Yk )⊤ is a random vector, and if
E(Yi ) = µi , then
E(Y ) = µ = (µ1 , ..., µk )⊤ .
We state the following rules for expectations from Raymond H. Myers and Janet
S. Milton, “A First Course in the Theory of Linear Statistical Models,” PWS-
KENT, Boston, 1991, p. 52.

Lemma:

i) If a is a k × 1 vector of constants then E(a) = a.



ii) If a is a k × 1 vector of constants, and Y is a k × 1 random vector with


E(Y ) = µ, then E(a⊤ Y ) = a⊤ µ.

iii) If A is an n × k matrix, and Y is a k × 1 random vector with E(Y ) = µ, then


E(AY ) = Aµ.

Proof:

i) is obvious from the corresponding property for expectations of random vari-


ables.
ii) a⊤Y = ∑_{i=1}^k ai Yi and hence

    E(a⊤Y) = E( ∑_{i=1}^k ai Yi )
           = ∑_{i=1}^k ai E(Yi)
           = ∑_{i=1}^k ai µi
           = a⊤µ.

iii) (AY)i = ∑_{m=1}^k Aim Ym and so

    E((AY)i) = E( ∑_{m=1}^k Aim Ym )
             = ∑_{m=1}^k Aim E(Ym)
             = ∑_{m=1}^k Aim µm
             = (Aµ)i .

Therefore

    E(AY) = Aµ.

Covariance matrices

We also need to discuss the idea of a covariance matrix (sometimes called a


variance-covariance matrix) of a random vector. If Y = (Y1 , ..., Yk )⊤ is a random

vector with E(Y ) = µ, then the covariance matrix of Y is the k × k matrix whose
(i, j)th entry is Cov(Yi , Yj ) = E((Yi −µi )(Yj −µj )). If we define the expected value
of a random matrix to be the matrix whose (i, j)th entry is just the expectation
of the (i, j)th entry of the random matrix, and if we use the notation Var(Y ) for
the covariance matrix of Y , we can write

Var(Y ) = E((Y − µ)(Y − µ)⊤ )

since
(Y − µ)(Y − µ)⊤
is the k × k matrix whose (i, j)th element is (Yi − µi )(Yj − µj ). Note that the ith
diagonal element of Var(Y ) is simply the variance of Yi , i = 1, ..., k. Also, Var(Y )
is symmetric, since Cov(Yi , Yj ) = Cov(Yj , Yi ). We can now state the following
result (Raymond H. Myers and Janet S. Milton, “A First Course in the Theory
of Linear Statistical Models,” PWS-KENT, Boston, 1991, p. 54).

Lemma:

i) Let Y be a k × 1 random vector with Var(Y ) = V . If a is a k × 1 vector of real


numbers, then
Var(a⊤ Y ) = a⊤ V a.
ii) Let Y be a k × 1 random vector with Var(Y ) = V . Let A be a k × k matrix.
If Z = AY , then
Var(Z) = AV A⊤ .
Proof:

i) With all sums running over 1, ..., k,

    Var(a⊤Y) = Var( ∑i ai Yi )
             = E[ ( ∑i ai Yi − ∑i ai µi )² ]
             = E[ ( ∑i ai (Yi − µi) )² ]
             = E[ ∑i ∑j ai aj (Yi − µi)(Yj − µj) ]
             = ∑i ∑j ai aj E( (Yi − µi)(Yj − µj) )
             = ∑i ∑j ai aj Vij
             = a⊤V a.

ii) With all sums running over 1, ..., k,

    Cov(Zi , Zj) = Cov( ∑q Aiq Yq , ∑r Ajr Yr )
                 = E[ ( ∑q Aiq Yq − ∑q Aiq µq )( ∑r Ajr Yr − ∑r Ajr µr ) ]
                 = E[ ( ∑q Aiq (Yq − µq) )( ∑r Ajr (Yr − µr) ) ]
                 = E[ ∑q ∑r Aiq Ajr (Yq − µq)(Yr − µr) ]
                 = ∑q ∑r Aiq Ajr Vqr
                 = ∑q Aiq (V A⊤)qj
                 = (A V A⊤)ij .

So

    Var(Z) = A V A⊤.

Mean and covariance of b

Using the above result, we can prove the following.

Theorem:

In the full rank linear model, the least squares estimator b = (X ⊤ X)−1 X ⊤ y is
unbiased,
E(b) = β
with covariance matrix
Var(b) = σ 2 (X ⊤ X)−1 .

Proof:

From the above lemma, we have

E(b) = E((X ⊤ X)−1 X ⊤ y)


= (X ⊤ X)−1 X ⊤ Xβ
= β

so that b is unbiased. Next we consider Var(b). Also from the above lemma, and
noting that Var(y) = σ 2 I, where I is the identity matrix, we have

Var(b) = σ 2 (X ⊤ X)−1 X ⊤ ((X ⊤ X)−1 X ⊤ )⊤


= σ 2 (X ⊤ X)−1 X ⊤ X(X ⊤ X)−1

where we have used the fact that for a nonsingular matrix A, (A⊤ )−1 = (A−1 )⊤ .
Hence

Var(b) = σ 2 (X ⊤ X)−1

The Gauss-Markov theorem

The ability to compute the covariance matrix of the estimator b provides us with
a way of comparing the properties of this estimator with alternative estimators.
The least squares estimator of β is an example of a linear estimator which is an
estimator of the form Ay for some p × n matrix A (setting A = (X ⊤ X)−1 X ⊤
gives the least squares estimator). The Gauss-Markov Theorem states that if b∗
is any unbiased linear estimator of β, and if b is the least squares estimator, then
Var(bi ) ≤ Var(b∗i ), i = 1, ..., p. We say that b is the best linear unbiased estimator
(BLUE) of β. We prove this result now.

Theorem:

The best linear unbiased estimator of β in the full rank linear model is the least
squares estimator b.

Proof:

Suppose that b∗ is an arbitrary linear unbiased estimator of β. We can write this


estimator in the form
b∗ = ((X ⊤ X)−1 X ⊤ + B)y
for some p × n matrix B (note that B = 0 just gives the least squares estimator).

Using the rules we have stated about means of random vectors,


E(b∗ ) = ((X ⊤ X)−1 X ⊤ + B)Xβ
= β + BXβ

Now, we know that b∗ is unbiased, so for the above to hold for arbitrary β we
must have BX = 0. Now consider Var(b∗). We have that

    Var(b∗) = Var( ((X⊤X)⁻¹X⊤ + B) y )
            = σ²((X⊤X)⁻¹X⊤ + B)((X⊤X)⁻¹X⊤ + B)⊤
            = σ²((X⊤X)⁻¹X⊤ + B)(X(X⊤X)⁻¹ + B⊤)
            = σ²( (X⊤X)⁻¹X⊤X(X⊤X)⁻¹ + (X⊤X)⁻¹X⊤B⊤ + BX(X⊤X)⁻¹ + BB⊤ ).

But BX = 0, and so the third term on the right in the above expression is zero;
also, the second term is zero, as X⊤B⊤ = (BX)⊤ = 0. Hence

    Var(b∗) = σ²( (X⊤X)⁻¹ + BB⊤ ).

Note that the diagonal elements of Var(b∗) (which are the variances of the elements
of b∗) are just the variances of the least squares estimators plus the diagonal
elements of BB⊤. But the ith diagonal element of BB⊤ is

    (BB⊤)ii = ∑_{j=1}^n Bij² ,

a nonnegative quantity. So the least squares estimator is BLUE.

3.4 Maximum likelihood estimation of mean parameters


In the case of the simple linear regression model, we were able to show that the
least squares estimates of β0 and β1 were also the maximum likelihood estimates
if normality of the errors in the model was assumed. A similar result holds for
the least squares estimator of β in the general linear model: least squares corre-
sponds to maximum likelihood under an assumption of normality.

Maximum likelihood estimation of β

Under normal assumptions, we can write down the likelihood in the case of the
general linear model as
    L(β, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp( −(yi − (Xβ)i)²/(2σ²) )
             = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{i=1}^n (yi − (Xβ)i)² ).

But

    ∑_{i=1}^n (yi − (Xβ)i)² = (y − Xβ)⊤(y − Xβ)

So the likelihood becomes

    L(β, σ²) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) (y − Xβ)⊤(y − Xβ) ).

Hence the log-likelihood l(β, σ²) is

    l(β, σ²) = log L(β, σ²)
             = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)⊤(y − Xβ).        (23)
Maximizing the likelihood is equivalent to maximizing the log-likelihood, and
clearly the log-likelihood is maximized with respect to β here regardless of the
value of σ 2 by minimizing (y −Xβ)⊤ (y −Xβ) with respect to β. So the maximum
likelihood estimator of β is also the least squares estimator.

Maximum likelihood estimation of σ 2

We also find the maximum likelihood estimator for σ². Differentiating (23) with
respect to σ², we get

    ∂l(β, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) (y − Xβ)⊤(y − Xβ).

Writing b for the least squares estimator of β, and σ*² for the maximum likelihood
estimator of σ², we have

    n/(2σ*²) = (1/(2σ*⁴)) (y − Xb)⊤(y − Xb)

so that

    σ*² = (1/n) (y − Xb)⊤(y − Xb).
Note that this expression agrees with the expression we obtained before for the
simple linear regression model.

3.5 Estimation of the error variance


We have just shown that the maximum likelihood estimator of the error variance
σ² is

    (1/n) (y − Xb)⊤(y − Xb).

We showed for the case of the simple linear regression model that this estimator
was biased. What is E(σ*²) in the general linear model?

Bias of the maximum likelihood estimator

Observing that b = (X⊤X)⁻¹X⊤y, we have that

    E(σ*²) = (1/n) E( (y − X(X⊤X)⁻¹X⊤y)⊤ (y − X(X⊤X)⁻¹X⊤y) )
           = (1/n) E( y⊤(I − X(X⊤X)⁻¹X⊤)(I − X(X⊤X)⁻¹X⊤)y ).

I claim that

    (I − X(X⊤X)⁻¹X⊤)(I − X(X⊤X)⁻¹X⊤) = I − X(X⊤X)⁻¹X⊤.        (24)

Matrices like I − X(X⊤X)⁻¹X⊤ which are unchanged when multiplied by them-
selves have a special name: they are called idempotent matrices, and they play
an important role in the theory of linear models.
To see (24), observe that

    (I − X(X⊤X)⁻¹X⊤)(I − X(X⊤X)⁻¹X⊤)
        = I − X(X⊤X)⁻¹X⊤ − X(X⊤X)⁻¹X⊤ + X(X⊤X)⁻¹X⊤X(X⊤X)⁻¹X⊤
        = I − 2X(X⊤X)⁻¹X⊤ + X(X⊤X)⁻¹X⊤
        = I − X(X⊤X)⁻¹X⊤.

Hence we have that

    E(σ*²) = (1/n) E( y⊤(I − X(X⊤X)⁻¹X⊤)y ).        (25)
To compute (25), we first give a general result about computing expectations
which have this general form. Before we can state this result, however, we need
to define the trace of a matrix, and to give some results about matrix traces. For
a k × k matrix X, the trace of X, written tr(X), is simply the sum of the diagonal
elements of X,

    tr(X) = ∑_{i=1}^k Xii .

The following properties of the trace are easy to prove (see Raymond H. Myers
and Janet S. Milton, “A First Course in the Theory of Linear Statistical Models,”
PWS-KENT, Boston, 1991, p. 26.)

Theorem:

i) Let c be a real number. Then tr(cX) = c · tr(X).

ii) tr(X + Y ) = tr(X) + tr(Y )

iii) If X is an n × p matrix, and Y is a p × n matrix, then tr(XY ) = tr(Y X).

Means of quadratic forms of a random vector

We are now in a position to prove the result we need in order to find the mean
of σ*². We are interested in quadratic forms of a random vector, which are
expressions of the form y⊤Ay for a random vector y and matrix A.

Lemma:

Let y be a k × 1 random vector with E(y) = µ and Var(y) = V . Let A be a k × k


matrix of real numbers. Then

E(y ⊤ Ay) = tr(AV ) + µ⊤ Aµ.

Proof:

See your tutorial sheet.

Unbiased estimation of σ 2

Applying the above lemma to compute E(σ*²) and observing that Cov(y) = σ²I,
we have

    E(σ*²) = (1/n) E( y⊤(I − X(X⊤X)⁻¹X⊤)y )
           = (1/n) ( σ² tr(I − X(X⊤X)⁻¹X⊤) + (Xβ)⊤(I − X(X⊤X)⁻¹X⊤)(Xβ) )
           = (1/n) ( σ² tr(I) − σ² tr(X(X⊤X)⁻¹X⊤) + β⊤X⊤(I − X(X⊤X)⁻¹X⊤)Xβ )
           = (1/n) ( σ² n − σ² tr(X⊤X(X⊤X)⁻¹) + β⊤X⊤Xβ − β⊤X⊤X(X⊤X)⁻¹X⊤Xβ )

where we have used the fact that tr(AB) = tr(BA) for matrices A and B for
which these products are defined. Observe that (X⊤X)⁻¹(X⊤X) is the p × p
identity matrix, so that

    E(σ*²) = ((n − p)/n) σ².

So the maximum likelihood estimator is biased, and the bias depends on the
number of mean parameters p. The above expressions suggest estimation of σ²
by

    σ̂² = (1/(n − p)) (y − Xb)⊤(y − Xb)

which will be unbiased.
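
A sketch of these computations for any full rank design matrix X and response vector y (Python): the residual sum of squares divided by n − p gives the unbiased variance estimate, and the square roots of the diagonal of σ̂²(X⊤X)⁻¹ give the estimated standard errors of the coefficients, which is the calculation carried out for the example below.

import numpy as np

def fit_summary(X, y):
    """Least squares fit with unbiased error variance and coefficient standard errors."""
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    sigma2_hat = resid @ resid / (n - p)          # unbiased estimate of sigma^2
    cov_b = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated covariance matrix of b
    se_b = np.sqrt(np.diag(cov_b))                # estimated standard errors
    return b, sigma2_hat, se_b

# Tiny made-up example
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
print(fit_summary(X, y))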
We conclude this section with an example where we illustrate the computation
of the least squares estimates of β, the estimation of σ 2 , and the computation of
estimated standard errors for the least squares estimators.

Example: Business failures and key economic indicators

This data set is from Gerald Keller, Brian Warrack and Henry Bartel, “Statistics
for Management and Economics: a Systematic Approach,” Wadsworth, Belmont,
California, 1990, p. 804.
Many people have attempted to get an overall picture of business failures and
bankruptcies – both by studying the incidence of business failure on a national or
industry-wide scale and by analyzing individual firms one by one – with varying
degrees of success. In the latter instance, promising results have been achieved
by using various accounting and financial ratios to predict bankruptcies.
In looking at the overall picture, some analysts have been tempted by dramatic
swings in the number of business failures from year to year to try to relate them
to overall economic conditions. The annual percentage change in certain key
economic indicators is reproduced in the table below.
In this example we fit a linear model (multiple regression model) to these
data, using percentage change in business failures as the response and percentage
change in the economic indicators shown in the table as predictors. For the
moment we do not discuss model selection or criticism of a model, but simply fit
this linear model. The coefficient estimates obtained from a common statistical
package are shown below.
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -59.4138 47.4629 -1.2518 0.2572
Real.domestic.product 2.9335 3.1052 0.9447 0.3813
Wages.and.Salaries 3.7996 2.8988 1.3108 0.2379
Unemployment.Rate 8.6019 5.7039 1.5081 0.1823
Retail.sales -5.5136 3.2695 -1.6864 0.1427
Housing.Starts -0.2079 0.5414 -0.3840 0.7142
C.P.I. 1.7576 5.0222 0.3500 0.7383

Residual standard error: 15.47 on 6 degrees of freedom


Multiple R-Squared: 0.5537

Year  Business failures  Real domestic product  Wages and salaries  Unemployment rate  Retail sales  Housing starts  C.P.I.

1968 0.3 3.5 10.9 3.8 6.8 22.0 3.6


1969 -5.1 5.5 8.7 4.5 6.0 20.0 4.0
1970 24.3 6.2 12.1 4.4 6.9 6.9 4.6
1971 4.0 2.2 8.4 5.7 2.3 -9.5 3.3
1972 1.2 5.9 9.9 6.2 8.9 22.6 2.9
1973 -4.8 5.6 11.3 6.2 11.3 7.0 4.8
1974 -4.9 7.8 15.8 5.5 12.6 7.4 7.6
1975 5.1 4.7 19.4 5.3 16.9 17.3 10.8
1976 -4.9 0.7 16.4 6.9 14.5 4.2 10.8
1977 40.0 5.3 15.4 7.1 10.8 -7.3 7.5
1978 40.4 2.8 10.0 8.1 8.4 -10.1 8.0
1979 2.7 3.5 8.8 8.4 11.7 -7.3 9.0
1980 15.8 3.2 11.0 7.5 11.9 -13.4 8.8

It is instructive to compute the matrix (X⊤X)⁻¹ for this example, to multiply by
the estimated residual variance 15.47², and then to take the square roots of the
diagonal elements of this matrix to see whether they correspond to the standard
errors quoted in the above table. We have

    X⊤X =
    [  13.0     56.90    158.10     79.60    129.00     25.20     85.70 ]
    [  56.9    293.79    702.71    335.52    568.08    225.37    353.78 ]
    [ 158.1    702.71   2067.73    962.77   1693.85    145.45   1118.59 ]
    [  79.6    335.52    962.77    512.20    812.72     15.86    551.68 ]
    [ 129.0    568.08   1693.85    812.72   1462.36     28.82    960.96 ]
    [  25.2    225.37    145.45     15.86     28.82   2341.46   −145.68 ]
    [  85.7    353.78   1118.59    551.68    960.96   −145.68    662.19 ]

From this, we have that σ̂²(X⊤X)⁻¹ is

    [ 2251.9    −46.619   −101.23   −237.93     81.109   −13.972   −18.089  ]
    [  −46.619    9.6388    −1.0828    0.89594   −4.3519    0.20687    8.3275 ]
    [ −101.23    −1.0828     8.4000   11.622     −4.5641    0.43326   −3.4736 ]
    [ −237.93     0.89594   11.622    32.522     −6.6688    1.2118    −6.4694 ]
    [   81.109   −4.3519    −4.5641   −6.6688    10.686    −0.91736  −10.615  ]
    [  −13.972    0.20687    0.43326   1.2118    −0.91737   0.29297    1.3521 ]
    [  −18.089    8.3275    −3.4736   −6.4694   −10.615     1.3521    25.213  ]

Taking the square root of the diagonal elements of this matrix gives the values

(47.454, 3.1046, 2.8982, 5.7028, 3.2689, 0.54127, 5.0213)

which after allowing for rounding error are the entries given in the table.

3.6 Interval estimation in the general linear model


For the simple linear regression model, we were able to construct confidence in-
tervals for the parameters of the model. In this subsection we do the same thing
for the general linear model.

The multivariate normal distribution

To construct our confidence intervals we will first need to give some results about
the distribution of the least squares estimator b of β and the distribution of our
estimator σ̂² of the residual error variance.
To state the distribution of b we need to introduce the multivariate normal
distribution. You are of course already familiar with the univariate normal dis-
tribution. An n × 1 random vector Y is said to have a multivariate normal
distribution with mean vector µ and covariance matrix Σ (and we write this as
Y ∼ N(µ, Σ)) if it has the density

    fY(y; µ, Σ) = (1/((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2)(y − µ)⊤Σ⁻¹(y − µ) ),

where |Σ| denotes the determinant of the matrix Σ. You will study some of the
properties of the multivariate normal distribution in later statistics courses. It
can be shown that the n × 1 vector µ actually is the mean of a random vec-
tor with the above density. Similarly, if Y is a random vector with the above
density, Var(Y ) = Σ. The mean vector µ and covariance matrix Σ completely
characterize a multivariate normal distribution. Note that if Y = (Y1 , ..., Yn ) is a
vector of univariate independent normal random variables with E(Yi ) = µi and
Var(Yi ) = σ 2 , then Y is a multivariate normal random vector according to the
above definition with mean µ and covariance matrix σ 2 I, where I is the n × n
identity matrix here. In order to state the distribution of b, we need the following
result.

Lemma:

Let Y be an n×1 multivariate normal random vector with mean µ and covariance
matrix Σ,
Y ∼ N (µ, Σ).

Let A be a p × n matrix. Then Z = AY is a multivariate normal random vector


with mean Aµ and covariance matrix AΣA⊤ ,

Z ∼ N (Aµ, AΣA⊤ ).

We can now state the distribution of the least squares estimator b of β.

Distribution of b

Theorem:

In the full rank linear model, the least squares estimator b of β has a multivariate
normal distribution with mean β and covariance matrix σ 2 (X ⊤ X)−1 ,

b ∼ N (β, σ 2 (X ⊤ X)−1 ).

Proof:

We have already proven that the mean and covariance matrix of b are given
by E(b) = β and Var(b) = σ²(X⊤X)⁻¹. The fact that b is multivariate normal
follows from multivariate normality of y, the fact that b is a linear transformation
of y,
b = (X ⊤ X)−1 X ⊤ y,
and the above lemma.

In an analogous way to the case of the simple linear regression model, it is
true that

    (n − p)σ̂²/σ² ∼ χ²n−p

and that b and σ̂² are independent.

Interval estimation

We can now write down test statistics which can be used for constructing confi-
dence intervals for the parameters βj .
Write cjj , j = 0, ..., k, for the diagonal elements of (X⊤X)⁻¹. Then we know
that

    (bj − βj) / (σ √cjj) ∼ N(0, 1).

Also

    (n − p)σ̂²/σ² ∼ χ²n−p

and (n − p)σ̂²/σ² is independent of bj . So

    [ (bj − βj) / (σ √cjj) ] / √( ((n − p)σ̂²/σ²) / (n − p) ) ∼ tn−p .

But the above statistic simplifies to

    (bj − βj) / (σ̂ √cjj).

We can use the above statistic to derive a confidence interval for βj . Writing
tα/2,n−p for the upper 100α/2 percentage point of a t distribution with n − p
degrees of freedom, we have that

    P( −tα/2,n−p ≤ (bj − βj)/(σ̂ √cjj) ≤ tα/2,n−p ) = 1 − α.

Rearranging the above inequality, we get

    P( −tα/2,n−p σ̂ √cjj ≤ bj − βj ≤ tα/2,n−p σ̂ √cjj ) = 1 − α

or

    P( −bj − tα/2,n−p σ̂ √cjj ≤ −βj ≤ −bj + tα/2,n−p σ̂ √cjj ) = 1 − α

which gives

    P( bj − tα/2,n−p σ̂ √cjj ≤ βj ≤ bj + tα/2,n−p σ̂ √cjj ) = 1 − α.

Hence a 100(1 − α) percentage confidence interval for βj is

    ( bj − tα/2,n−p σ̂ √cjj ,  bj + tα/2,n−p σ̂ √cjj ).

Example: risk assessment data

For the risk assessment data, fitting a multiple linear regression model with mean
risk assessment as the response and the accounting determined measures of risk
as predictors gives the following.
Regression Analysis

The regression equation is


Mean Risk Assessment = 2.19 + 0.443 Dividend Payout + 0.865 Current Ratio
- 0.247 Asset Size + 1.96 Asset Growth + 3.59 Leverage
+ 0.135 Variability Earnings + 1.05 Covariability Earnings

Predictor Coef StDev T P


Constant 2.191 1.870 1.17 0.258
Dividend 0.4426 0.1417 3.12 0.006
Current 0.8645 0.3733 2.32 0.033
Asset Si -0.2472 0.1316 -1.88 0.078
Asset Gr 1.963 6.253 0.31 0.757
Leverage 3.592 1.436 2.50 0.023
Variabil 0.13459 0.05978 2.25 0.038
Covariab 1.0450 0.9380 1.11 0.281

S = 0.9816 R-Sq = 74.2% R-Sq(adj) = 63.5%

We can use the above table of estimated coefficients and standard errors to con-
struct confidence intervals for coefficients. The estimated standard errors of the
estimators bj , σ̂ √cjj , are listed in the table in the ‘StDev’ column. Noting that
the upper 2.5 percentage point of a t distribution with n − p = 17 degrees of
freedom is approximately 2.1098, a 95 percent confidence interval for β1 , the
coefficient for Dividend, can be computed as

(0.4426 − (2.1098)(0.1417), 0.4426 + (2.1098)(0.1417)) = (0.1436, 0.7416)

and similarly for the other coefficients.
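
The same interval can be computed in a couple of lines (Python), using the estimate and standard error quoted in the table above:

from scipy import stats

b1, se_b1, df = 0.4426, 0.1417, 17               # estimate, standard error, n - p
tcrit = stats.t.ppf(0.975, df)                   # approximately 2.1098
print((b1 - tcrit * se_b1, b1 + tcrit * se_b1))  # approximately (0.1436, 0.7416)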

3.7 Hypothesis testing for coefficients



We can also use the above result about (bj − βj )/(b
σ cjj ) to develop a test for
the null hypothesis βj = γ against the alternatives βj ̸= γ, βj < γ or βj > γ
in the presence of the other terms in the model. This last qualification “in the
presence of the other terms in the model” is an important one: the conclusion
may depend on what predictors are already included.

Hypothesis testing for βj

To test
H0 : βj = γ
against the alternative
H1 : βj ̸= γ
we use the test statistic

    T = (bj − γ) / (σ̂ √cjj)

which has a tn−p distribution under H0 .

The critical region for the test (with significance level α) is

T < −tα/2;n−p or T > tα/2;n−p .

We can also conduct the test by computing a p-value. If we write t for the
observed value of T , then the p-value for the test is computed as

p = P(|T | > |t||βj = γ)

(where of course under the null hypothesis βj = γ we have T ∼ tn−p ). So we


just compute the probability that a tn−p random variable is as extreme as our
observed statistic or more extreme: if p is smaller than the significance level for
the test, we reject H0 in favour of the alternative βj ̸= γ.

One-sided alternatives

For a one-sided test, we must modify the critical region and the definition of the
p-value.
For
H1 : βj > γ
the critical region is
T > tα;n−p
and we compute the p-value as

p = P(T > t|βj = γ).

For the alternative


H1 : βj < γ
the critical region is
T < −tα;n−p
and we compute the p-value as

p = P(T ≤ t).
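
A short sketch of the corresponding p-value calculations (Python), for an observed value t of the test statistic with n − p residual degrees of freedom; the numbers here are the Current Ratio entry from the table above:

from scipy import stats

t_obs, df = 2.32, 17                            # partial t-statistic and residual df
p_two_sided = 2 * stats.t.sf(abs(t_obs), df)    # H1: beta_j != gamma  (about 0.033)
p_greater = stats.t.sf(t_obs, df)               # H1: beta_j > gamma
p_less = stats.t.cdf(t_obs, df)                 # H1: beta_j < gamma
print(p_two_sided, p_greater, p_less)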

We conclude this section by looking once again at the risk assessment data.

Example: risk assessment data

In the column labelled ‘T’ here we have the values of the t-statistics described
above for testing whether each element of β is zero in the presence of the other
terms in the model: that is, for each parameter βj , j = 0, ..., k, the statistics

    bj / (σ̂ √cjj)

are listed (and they are realizations of tn−p random variables under H0 : βj = 0).
Also listed are the p-values for testing H0 : βj = 0 against the two-sided
alternative H1 : βj ̸= 0. A small p-value indicates rejection of the null hypothesis.
A large p-value indicates that the given predictor seems to make no contribution
to explaining the variation in the response in the presence of the other terms.
The t-tests we have just described are sometimes called partial t-tests. We
note from the above p-values that asset growth and covariability earnings do not
seem to make a contribution to explaining the variation in the presence of the
other terms in the model. A word of caution here: partial t-tests can be hard
to interpret. We certainly can’t conclude from a partial t-test that a predictor is
unrelated to the response, or that it should not be included in a good model for the
response. For instance, it may happen that if two predictors carry much the same
information about the response then a good model for the response should contain
one of the two predictors, but that any model which contains both predictors
results in a non-significant p-value for the partial t-tests for both predictors.
Upon deleting one of the predictors from a model containing both, however, we
might then find that the p-value for the partial t-test for the remaining variable is
significant! Note that there is also a possible problem of simultaneous inference
here: it could be argued that we should make an adjustment to the significance
level used in the partial t-tests when we examine a table of p-values like the one
given above.

3.8 Confidence intervals for the mean and prediction intervals
In the case of simple linear regression in the previous topic we developed confi-
dence intervals for the conditional mean and prediction intervals for a given value
of the predictor. In this subsection we do this for the general linear model.
Suppose that for each value yi of the response, i = 1, ..., n, we have values of k
predictor variables xi1 , ..., xik which are used in a linear model for the response.
If we write xi for the vector xi = (1, xi1 , ..., xik )⊤ then under the assumptions of
the general linear model

    yi = xi⊤β + εi

where β = (β0 , ..., βk )⊤ is a vector of unknown parameters and the εi are inde-
pendent normal errors with zero mean and common variance σ².
Now consider a new response value y∗ for which the values of the predictor
variables are x∗1 , ..., x∗k . If we write x∗ for the vector x∗ = (1, x∗1 , ..., x∗k )⊤ , then
the conditional mean of y∗ given the predictor values is x∗⊤β. Also, we can write

    y∗ = x∗⊤β + ε∗

where ε∗ is a zero mean normal random variable with variance σ², independent of
εi , i = 1, ..., n. We wish to develop a confidence interval for the conditional mean

x∗⊤β and a prediction interval for y∗ .

Confidence interval for the mean

First we develop a test statistic which can be used for constructing a confidence
interval for the conditional mean. Writing b for the least squares estimator of β,
we consider x⊤ ∗ b. We have that

E(x∗⊤ b) = x∗⊤ β

and

Var(x∗⊤ b) = x∗⊤ Var(b) x∗ = σ² x∗⊤ (X⊤X)⁻¹ x∗

where X is the design matrix for the fitted linear model. Furthermore, x∗⊤ b is
normal (as a linear combination of a multivariate normal random vector). So we
have
x∗⊤ b ∼ N(x∗⊤ β, σ² x∗⊤ (X⊤X)⁻¹ x∗)

or

(x∗⊤ b − x∗⊤ β) / (σ √(x∗⊤ (X⊤X)⁻¹ x∗)) ∼ N(0, 1).

It can also be shown that x∗⊤ b and (n − p)σ̂²/σ² are independent (try to prove
this as an exercise). Hence

[(x∗⊤ b − x∗⊤ β) / (σ √(x∗⊤ (X⊤X)⁻¹ x∗))] / √[((n − p)σ̂²/σ²)/(n − p)]

has a t distribution with n − p degrees of freedom (since as we showed in the last
lecture (n − p)σ̂²/σ² has a χ²n−p distribution). But the above statistic simplifies
to
(x∗⊤ b − x∗⊤ β) / (σ̂ √(x∗⊤ (X⊤X)⁻¹ x∗)).                    (26)

From the result that (26) has a t distribution with n − p degrees of freedom we
can derive a confidence interval in the usual way: we can show that

P(x∗⊤ b − tα/2,n−p σ̂ √(x∗⊤ (X⊤X)⁻¹ x∗) ≤ x∗⊤ β ≤ x∗⊤ b + tα/2,n−p σ̂ √(x∗⊤ (X⊤X)⁻¹ x∗)) = 1 − α

which gives us our confidence interval for x∗⊤ β.

Prediction intervals

It remains to derive a prediction interval for y∗. Write ŷ∗ for the value

ŷ∗ = x∗⊤ b

and consider y∗ − ŷ∗. We have that

y∗ − ŷ∗ = x∗⊤ β + ε∗ − x∗⊤ b

Now, clearly E(y∗ − ŷ∗) = 0. Furthermore, ε∗ and b are independent, and we have
just shown that

Var(x∗⊤ b) = σ² x∗⊤ (X⊤X)⁻¹ x∗

so that

Var(y∗ − ŷ∗) = Var(x∗⊤ b) + Var(ε∗)
             = σ² x∗⊤ (X⊤X)⁻¹ x∗ + σ²
             = σ² (1 + x∗⊤ (X⊤X)⁻¹ x∗).

It can also be shown that y∗ − ŷ∗ and (n − p)σ̂²/σ² are independent, and so

[(y∗ − ŷ∗) / (σ √(1 + x∗⊤ (X⊤X)⁻¹ x∗))] / √[((n − p)σ̂²/σ²)/(n − p)] ∼ tn−p.

Simplifying, we have

(y∗ − ŷ∗) / (σ̂ √(1 + x∗⊤ (X⊤X)⁻¹ x∗)) ∼ tn−p.

So the usual manipulations lead to



P(ŷ∗ − tα/2,n−p σ̂ √(1 + x∗⊤ (X⊤X)⁻¹ x∗) ≤ y∗ ≤ ŷ∗ + tα/2,n−p σ̂ √(1 + x∗⊤ (X⊤X)⁻¹ x∗)) = 1 − α,

which gives us the desired 100(1 − α) percentage prediction interval for y∗ .

Example: risk assessment data

To illustrate the computation of confidence intervals for the mean and prediction
intervals we consider again the risk assessment data. Suppose we are interested
in predicting the market risk of the stock of a company with a dividend payout
of 0.5, current ratio of 1.0, asset size of 10.0, asset growth of 0.1, leverage of
0.3, variability earnings of 2.0 and covariability earnings of 0.6. We can compute
a 95 percent confidence interval for the conditional mean of a company with

these accounting determined risk measures, as well as a prediction interval for


our company of interest using software.
The fitted value (estimated market risk) here is 2.975, the 95 percent confi-
dence interval for the mean is

(2.138, 3.811)

and the 95 percent prediction interval is

(0.741, 5.208).
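A sketch of how intervals of this kind can be obtained in R, assuming a fitted model
object fit as in the earlier sketch and assumed column names for the new company's
accounting determined risk measures:

newdata <- data.frame(Dividend = 0.5, Current = 1.0, AssetSize = 10.0,
                      AssetGrowth = 0.1, Leverage = 0.3,
                      Variability = 2.0, Covariability = 0.6)

predict(fit, newdata, interval = "confidence", level = 0.95)  # CI for the conditional mean
predict(fit, newdata, interval = "prediction", level = 0.95)  # prediction interval for y*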

3.9 Joint confidence regions for the coefficients


Earlier we discussed the use of Bonferroni adjustment to obtain confidence re-
gions with a guaranteed level of coverage by adjusting the coverage of pointwise
confidence intervals for parameters. In the general linear model, adjusting the
pointwise coverage of confidence intervals for β0 , ..., βk to be 1 − α/(k + 1) guar-
antees joint coverage for these intervals of at least 1 − α. However, in the case of
the general linear model there is an alternative way of deriving confidence regions
which have exact coverage of 1 − α for the parameter vector β. We discuss this
method now.
It can be shown that the statistic
(b − β)⊤ X⊤X (b − β) / (p σ̂²)
has an F distribution with p and n−p degrees of freedom. Knowing this allows us
to develop a joint confidence region for the coefficients. In particular, if Fα;p,n−p
denotes the upper 100α percentage point of an F distribution with p and n − p
degrees of freedom, we have that

P((b − β)⊤ X⊤X (b − β)/(p σ̂²) ≤ Fα;p,n−p) = 1 − α.

Hence a 100(1 − α) percentage confidence region for β consists of all points β
such that

(b − β)⊤ X⊤X (b − β)/(p σ̂²) ≤ Fα;p,n−p.
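As a sketch, this region can be used in R to check whether a candidate parameter
vector beta0 lies inside the 100(1 − α) percent joint confidence region for a fitted
lm object fit (the function name in_region is ours, purely for illustration):

in_region <- function(fit, beta0, alpha = 0.05) {
  X  <- model.matrix(fit)
  b  <- coef(fit)
  p  <- length(b)
  n  <- nrow(X)
  s2 <- summary(fit)$sigma^2                       # sigma-hat squared
  Q  <- t(b - beta0) %*% crossprod(X) %*% (b - beta0) / (p * s2)
  as.numeric(Q) <= qf(1 - alpha, df1 = p, df2 = n - p)
}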

3.10 Decomposing variation in the full rank linear model


For the simple linear regression model we proved a fundamental identity which
partitioned total variation into variation explained by the model and residual
variation. More precisely, we showed that

∑_{i=1}^{n} (yi − ȳ)² = ∑_{i=1}^{n} (ŷi − ȳ)² + ∑_{i=1}^{n} (yi − ŷi)².                    (27)

where yi, i = 1, ..., n, are the responses, ȳ is the mean of the responses, and ŷi is
the ith fitted value. We expressed this identity in the notation

SStotal = SSreg + SSres

where SStotal is called the total sum of squares (the sum of squared deviations
of the responses about their mean), SSreg is called the regression sum of squares
(the sum of the squared deviations of the fitted values about their mean, which
is ȳ) and SSres is called the residual sum of squares (the sum of the squared
deviations of the fitted values from the responses).
The identity (27) holds in the general linear model, where of course the fit-
ted values are defined as the elements of the vector Xb where X is the design
matrix and b is the least squares estimator of β. I won’t prove this identity, but
MATH2931 students will have to do this in their next assignment.

The ANOVA table

In the case of simple linear regression, the partition (27) of total variation was
presented in an analysis of variance table. This table was also a convenient
way of organizing calculations for hypothesis testing. We will cover some theory
about hypothesis testing later, but for now we will look at the output of fitting
a regression model, and try to understand the analysis of variance table and
some of the other information. Below is the result of fitting a multiple regression
model for the risk assessment data discussed in previous lectures (where mean
risk assessment is the response, and the seven accounting determined measures
of risk are predictors). See previous notes for a description of the data.
Regression Analysis

The regression equation is


Mean Risk Assessment = 2.19 + 0.443 Dividend Payout + 0.865 Current Ratio
- 0.247 Asset Size + 1.96 Asset Growth + 3.59 Leverage
+ 0.135 Variability Earnings + 1.05 Covariability Earnings

Predictor Coef StDev T P


Constant 2.191 1.870 1.17 0.258
Dividend 0.4426 0.1417 3.12 0.006
Current 0.8645 0.3733 2.32 0.033
Asset Si -0.2472 0.1316 -1.88 0.078
Asset Gr 1.963 6.253 0.31 0.757
Leverage 3.592 1.436 2.50 0.023
Variabil 0.13459 0.05978 2.25 0.038
Covariab 1.0450 0.9380 1.11 0.281

S = 0.9816 R-Sq = 74.2% R-Sq(adj) = 63.5%

Analysis of Variance

Source DF SS MS F P
Regression 7 47.0282 6.7183 6.97 0.001
Residual Error 17 16.3808 0.9636
Total 24 63.4090

Source DF Seq SS
Dividend 1 18.4253
Current 1 5.6024
Asset Si 1 10.1251
Asset Gr 1 1.6456
Leverage 1 6.0841
Variabil 1 3.9497
Covariab 1 1.1961

By now you should understand how the coefficients in the fitted model are com-
puted, and how their standard errors are estimated. You should also understand
the computation of the partial t statistics and associated p-values listed in the
table of coefficient estimates, and their use for testing for the significance of a
predictor in the multiple linear regression model in the presence of the other
terms.
Below the table of coefficient estimates is listed the estimated standard de-
viation of the errors S = 0.9816 (this was σ̂ in our notation). Also listed is R2 ,
which takes the value 0.742 here (or 74.2 percent). For the simple linear regres-
sion model, we defined R2 in terms of our fundamental partition of variation
as
R² = SSreg / SStotal                    (28)
and the definition of R2 is the same in the general linear model. Beside the
quoted value for R2 is a value for the so-called adjusted R2 , which we write as
R̄2 . We have R̄2 = 0.635 here (or 63.5 percent). To define R̄2 , rewrite R2 as
R² = 1 − SSres / SStotal                    (29)
(the equivalence of the two expressions for R2 comes from replacing SSreg in the
numerator of (28) by the equivalent expression SStotal − SSres , which comes from
(27)). We define R̄2 by replacing SSres in (29) by σ̂² (which is SSres /(n − p))
and replacing SStotal by SStotal /(n − 1). SStotal /(n − 1) is in fact the estimated

error variance for a linear model in which we fit only an intercept term. Hence
we can write
R̄² = 1 − [(n − 1) SSres] / [(n − p) SStotal]                    (30)

or

R̄² = 1 − σ̂² (n − 1) / SStotal.                    (31)

In terms of R², we can write

R̄² = 1 − [(n − 1)/(n − p)] (1 − R²).

What is the reason for the introduction of R̄2 ? R2 is an easily interpreted


measure of fit of a linear model: it is simply the proportion of total variation
explained by the model. However, one might be tempted to use R2 as a basis
for comparing models with different numbers of parameters. R2 is generally not
helpful for this purpose: the reason for this is that if we add a new predictor
to a linear model, the residual sum of squares decreases, and R2 will increase.
Hence if we attempt to select a subset of good predictors from a set of possible
predictors using the value of R2 , the best model according to this criterion will
be the model containing all the predictors, even if many of the predictors are in
fact unrelated to the response.
R̄2 , on the other hand, does not necessarily increase as new predictors are
added to a model. If we look at (31), we see that R̄2 increases as σ̂² decreases.
So if we were to rank different models using R̄2 as a crude model selection device,
it would be equivalent to ranking the models based on σ̂². Does σ̂² necessarily
decrease as new predictors are added to the model, and hence must R̄2 increase?
The answer is no.
Recall that
σ̂² = (y − Xb)⊤ (y − Xb) / (n − p).
Now, consider two models in which one model contains a subset of the predictors
included in the other. Certainly for the larger model, the numerator in the above
expression is smaller, but the denominator would also be smaller, as the number
of parameters p is larger for the more complicated model. Hence any reduction
in the residual sum of squares must be large enough to overcome the reduction in
the denominator, a reduction which increases as the number of parameters in the
model increases. R̄2 is consequently sometimes useful as a crude device for com-
parison of different models. In future lectures we will say much more about the
process of model selection, and will develop other criteria for comparing models.
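Both measures are easy to compute from the pieces in (28)–(31); a minimal sketch in
R for any fitted lm object fit (summary(fit) reports both values directly):

y       <- model.response(model.frame(fit))
SSres   <- sum(residuals(fit)^2)
SStotal <- sum((y - mean(y))^2)
n <- nobs(fit); p <- length(coef(fit))

R2    <- 1 - SSres / SStotal                   # equation (29)
R2adj <- 1 - (n - 1) / (n - p) * (1 - R2)      # adjusted R-squared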

Sequential sums of squares

We note that the column labelled ‘SS’ in the analysis of variance table lists
the components of the decomposition (27). SS stands for sum of squares here,
and this column shows the partition of variation for a multiple linear regression
model in exactly the same way as it did for the simple linear regression model.
It will be helpful to introduce some notation. Suppose we have k predictors
x1 , ..., xk in a linear model for a response y. There are p = k + 1 parameters
in the vector β = (β0 , ...βk )⊤ in the linear model with the predictors x1 , ..., xk .
Partition the vector β into two parts, β = (β (1)⊤ , β (2)⊤ )⊤ , where β (1) is an r × 1
subvector say, and β (2) is a (p − r) × 1 subvector. We write R(β (2) |β (1) ) for the
increase in SSreg in (27) when the predictors corresponding to the parameters
β (2) are added to a model involving the parameters β (1) . Think of R(β (2) |β (1) )
as the variation explained by the term involving β (2) in the presence of the term
involving β (1) . We define R(β1 , ..., βk |β0 ) to be SSreg .
The sequential sums of squares shown below the analysis of variance table are
simply the values

R(β1 |β0 )
R(β2 |β0 , β1 )
R(β3 |β0 , β1 , β2 )
..
.
R(βk |β0 , ..., βk−1 )

Note that these contributions add up to R(β1 , ...βk |β0 ). For instance, consider
the case where k = 2. R(β2 |β0 , β1 ) is the increase in the regression sum of squares
when β2 is added to the model involving β1 , so that by definition

R(β1 , β2 |β0 ) = R(β1 |β0 ) + R(β2 |β0 , β1 ).

The sequential sums of squares are useful when we have first ordered the variables
in our model in a meaningful way (based on the underlying science or context).
They tell us about how much a term contributes to explaining variation given all
the previous terms in the table (but ignoring the terms which come after it).
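In R, these sequential (Type I) sums of squares are what anova() reports for a fitted
model, in the order the terms appear in the formula, so the ordering of the formula
matters. A sketch with hypothetical data frame and variable names:

anova(lm(y ~ x1 + x2 + x3, data = dat))   # R(beta1|beta0), R(beta2|beta0,beta1), R(beta3|beta0,beta1,beta2)
anova(lm(y ~ x3 + x1 + x2, data = dat))   # a different ordering gives different sequential SS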

3.11 Hypothesis testing in the general linear model


In the case of the simple linear regression model, we discussed the use of a t
test and an equivalent F test for examining the usefulness of the predictor. For
the general linear model we have already discussed the use of partial t tests for
deciding whether a predictor is needed in the presence of the other terms. For a
two-sided alternative there is an F test corresponding to this partial t test (where
the test statistic is just the square of the partial t statistic). These F tests are

usually called partial F tests.

While partial t or F tests can be useful, in the case of the general linear
model there are more complicated hypotheses that may be considered. For ex-
ample, we may ask: are any of the predictors helpful for explaining variation in
the response? Or is the model including all the predictors no better than the one
containing just the intercept? There is an F statistic for testing this hypothesis,
and this F statistic is usually displayed in the analysis of variance table when we
fit a multiple linear regression model.

The ANOVA table

The ANOVA table for a typical multiple regression model with k predictors and
where p = k + 1 is shown below. Here as usual we write SStotal for the total
sum of squares, SSreg for the regression sum of squares and SSres for the residual
sum of squares.

Source        Degrees of   Sum of    Mean square               F                  P
              freedom      squares
Regression    p − 1        SSreg     MSreg = SSreg /(p − 1)    F = MSreg /MSres   P(Fp−1,n−p > F )
Residual      n − p        SSres     MSres = SSres /(n − p)
Total         n − 1        SStotal

Writing β = (β0 , β1 , ..., βk )⊤ for the parameters in the general

linear model, where β0 is an intercept term and β1 , ..., βk are the coefficients
corresponding to the k predictors, it can be shown that the quantity M Sreg /M Sres
has an F distribution with p − 1 and n − p degrees of freedom when β1 = ... =
βk = 0. Hence we can use the statistic M Sreg /M Sres to test the null hypothesis

H0 : β1 = ... = βk = 0

versus the alternative

H1 : Not all βj = 0, j = 1, ..., k.

Writing F = MSreg /MSres , and writing F ∗ for a random variable with an F distribution
with p − 1 and n − p degrees of freedom, we compute the p-value for the test as

P(F ∗ > F ).

An alternative to the use of p-values is to find the critical region: for a test with
significance level α, the critical region is F > Fα;p−1,n−p .
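A sketch of the overall F test computed by hand in R for a fitted lm object fit (the
same statistic and p-value appear in the ANOVA table produced by standard software):

y     <- model.response(model.frame(fit))
SSres <- sum(residuals(fit)^2)
SSreg <- sum((fitted(fit) - mean(y))^2)
n <- nobs(fit); p <- length(coef(fit))

Fstat <- (SSreg / (p - 1)) / (SSres / (n - p))             # MSreg / MSres
pf(Fstat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)    # p-value P(F* > F)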

Example: Canadian business failures

We illustrate the test for the overall significance of the model using the Canadian
business failures data that we have discussed previously. These data describe
the annual percentage change in business failures and in the level of certain key
economic indicators. If we fit a multiple linear regression model with percentage
change in business failures as the response and percentage change in the economic
indicators (real domestic product, wages and salaries, unemployment rate, retail
sales, housing starts and C.P.I.) as predictors, we get the results shown below.

Regression analysis

The regression equation is


Business failures = - 59.4 + 2.93 Real domestic product
+ 3.80 Wages and Salaries + 8.60 Unemployment rate
- 5.51 Retail sales - 0.208 Housing starts + 1.76 C.P.I.

Predictor Coef StDev T P


Constant -59.41 47.46 -1.25 0.257
Real dom 2.934 3.105 0.94 0.381
Wages an 3.800 2.899 1.31 0.238
Unemploy 8.602 5.704 1.51 0.182
Retail s -5.514 3.270 -1.69 0.143
Housing -0.2079 0.5414 -0.38 0.714
C.P.I. 1.758 5.022 0.35 0.738

S = 15.47 R-Sq = 55.4% R-Sq(adj) = 10.7%

Analysis of Variance

Source DF SS MS F P
Regression 6 1782.3 297.1 1.24 0.400
Residual Error 6 1436.4 239.4
Total 12 3218.7

Source DF Seq SS
Real dom 1 8.3
Wages an 1 2.8
Unemploy 1 478.5
Retail s 1 1164.3
Housing 1 99.1
C.P.I. 1 29.3

We may have anticipated that we would not learn very much here by fitting a
multiple linear regression model with six predictors and only thirteen observa-
tions. We see in the ANOVA table that the F statistic for testing overall model
adequacy is 1.24, and the associated p-value is 0.4. So at the 5 percent level we
do not reject the null hypothesis that the regression coefficients corresponding to
the six predictors are all zero.

Example: risk assessment data

As a further example of testing for overall model adequacy consider the risk
assessment data, which has been described in previous lectures. On pages 88
and 89 of your lecture notes I have given results when fitting a multiple linear
regression model for these data with mean risk assessment as the response and
the seven accounting determined measures of risk as predictors. It can be seen in
the ANOVA table that the F statistic for testing overall model adequacy is 6.97,
and the associated p-value
p = P(F ∗ > 6.97)
where F ∗ ∼ F7,17 is approximately 0.001. So for the risk assessment data we
reject the null hypothesis

H0 : β1 = ... = βk = 0

in favour of the alternative

H1 : Not all βj = 0, j = 1, ..., k.

General hypothesis testing in the linear model

The F -test given in the ANOVA table illustrates a general method for comparison
of linear models via hypothesis testing.
When we conduct a hypothesis test we are always comparing two different models,
and the null hypothesis imposes a restriction upon the parameters in the more
general model which holds under the alternative hypothesis.
For instance, in the F -test in the ANOVA table we compare the model

y i = β0 + εi

(the model which holds under the null hypothesis) to the more general model

yi = β0 + β1 xi1 + ... + βk xik + εi .



In general, we might wish to compare models where only a subset of the


predictors are removed under the null hypothesis. Reordering predictors so that
x1 , ..., xr are the predictors to be included under both the null and alternative
hypotheses, we might wish to compare the model

yi = β0 + β1 xi1 + ... + βr xir + εi

to the model
yi = β0 + β1 xi1 + ... + βr xir + ... + βk xik + εi .
Here H0 is βr+1 = ... = βk = 0 and H1 states that at least one of βr+1 , ..., βk is
nonzero.
Write SSnull for the residual sum of squares for the model where H0 holds, and
write SSf ull for the model which holds under H1 . The extra variation explained
by the more general model is SSnull − SSf ull . Is the amount of extra variation
explained so large that we would favour the more complex model? We can develop
a hypothesis test to address this question.
Writing ∆p for the number of restrictions imposed by the null hypothesis on
the most general model (in H0 given above this is k − r, the number of extra
parameters in the model under H1 ) it can be shown that

F = [(SSnull − SSfull)/∆p] / MSres

has an F distribution with ∆p and n − p degrees of freedom under H0 . By com-
paring this test statistic with percentage points of an F∆p,n−p distribution, we
can determine if the amount of extra variation explained by the more general
model indicates that this model is decisively favoured. An example will help to
illustrate the idea.

Example: risk assessment data

Suppose we feel that the most important predictor of mean risk assessment for
the risk assessment data is dividend, and that we wish to see whether the other
predictors in the model are needed in a multiple linear regression model in which
dividend is included. To compute the test statistic for testing the hypothesis that
the coefficients corresponding to current ratio, asset size, asset growth, leverage,
variability earnings and covariability earnings are all zero, we need to compute
the decrease in the residual sum of squares when we add these terms to a model
involving dividend. This is obtained from the R output from the table of sequen-
tial sums of squares (see previous notes for the R output for the risk assessment
data). Let β (1) = (β0 , β1 )⊤ and β (2) = (β2 , ..., β7 )⊤ . Then we wish to test

H0 : β (2) = 0

against the alternative

H1 : Not all elements of β (2) are zero.

Write R(β (2) |β (1) ) for SSnull − SSf ull (the reduction in residual sum of squares
when the β (2) term is added to the null model). We have

R(β (2) |β (1) ) = 5.6024 + 10.1251 + 1.6456 + 6.0841 + 3.9497 + 1.1961


= 28.6030.

We need to divide this by k − r = 6, the number of extra parameters in the full


model, and then divide by M Sres to get the value of our test statistic:

[R(β (2) |β (1) )/6] / MSres = (28.6030/6)/0.9636 = 4.947.

Hence if F ∗ is an F6,17 random variable, we compute the p-value for the test as

p = P(F ∗ > 4.947).

From tables or using R we obtain p = 0.0042 approximately. Hence we reject the


null hypothesis β (2) = 0 at the 5 percent level of significance.
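The same nested-model comparison can be carried out directly in R with anova()
applied to the two fits; a sketch, again with assumed data frame and column names:

fit_reduced <- lm(MeanRisk ~ Dividend, data = risk)
fit_full    <- lm(MeanRisk ~ Dividend + Current + AssetSize + AssetGrowth +
                    Leverage + Variability + Covariability, data = risk)
anova(fit_reduced, fit_full)   # partial F test on k - r = 6 and n - p = 17 df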

Sequential F tests

In the last lecture we discussed the table of sequential sums of squares which
appears in the computer output, and mentioned that the values which appear in
this table represent the increases in the regression sum of squares as terms are
added sequentially to the model. These sequential sums of squares can be used
as the basis for a formal test of the hypothesis of the usefulness of each predictor
in the model in the presence of the terms which appear before it but ignoring
terms which come after it. As we mentioned last time, the table of sequential
sums of squares gives the values

R(β1 |β0 )
R(β2 |β0 , β1 )
R(β3 |β0 , β1 , β2 )
..
.
R(βk |β0 , ..., βk−1 ).

Consider the jth of these sequential sums of squares,

R(βj |β0 , ..., βj−1 )



and the model


yi = β0 + β1 xi1 + ... + βj xij + εi .
We can test the hypothesis
H0 : βj = 0
in this model against the alternative

H1 : βj ̸= 0

by using the test statistic


F = R(βj |β0 , ..., βj−1 ) / MSres
which has an F1,n−p distribution assuming H0 : βj = 0. As usual if F ∗ is a random
variable with an F distribution with 1 and n − p degrees of freedom, then we
compute the p-value for the test as

p = P(F ∗ > F ).

Alternatively, for a test with significance level α, the critical region is F >
Fα;1,n−p . The test we have just described is called a sequential F test. These
sequential F tests will in general be hard to interpret unless we first carefully
choose the ordering of the predictors in the model: a sequential F test examines
the usefulness of a predictor in the presence of previous predictors but ignoring
predictors which come after it.

Example: Risk assessment data

For the risk assessment data, in fitting the model the predictors are ordered so
that dividend is the first predictor, current ratio is next, then asset size, then
asset growth, then leverage, then variability earnings and finally covariability
earnings. Suppose we wish to test whether asset size is a useful predictor in the
model which includes dividend and current ratio. The increase in the regression
sum of squares when a term for the predictor asset size is added to the model
involving dividend and current ratio is 10.1251. Writing β0 for the intercept term,
β1 for the coefficient for dividend, β2 for the coefficient for current ratio and β3
for the coefficient for asset size, we want to test

H0 : β3 = 0

versus
H1 : β3 ̸= 0
in the model involving the terms β0 , β1 and β2 . The mean square error M Sres can
be computed from the output by (16.3808+1.1961+3.9497+6.0841+1.6456)/21 =

29.2563/21 = 1.3932, and we know from the discussion above that the ratio
10.1251/1.3932 = 7.267514 is a realization of an F1,21 random variable under the
null hypothesis. If F ∗ ∼ F1,21 , then we compute the p-value for the sequential
test as
p = P(F ∗ > 7.267514)
which from tables or a computer package gives p = 0.01353433 approximately.
So adding the term asset size to a model involving dividend and current ratio
does help to explain variation in the response in a multiple linear regression model.
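The p-value quoted above can be reproduced in R from the sums of squares in the
output:

Fstat <- 10.1251 / 1.3932                            # R(beta3 | beta0, beta1, beta2) / MSres
pf(Fstat, df1 = 1, df2 = 21, lower.tail = FALSE)     # approximately 0.0135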

We have seen that in the analysis of variance table the “Degrees of freedom”
column specifies the degrees of freedom which are appropriate for computing a
p-value for an F test of overall model adequacy. We have not discussed the
question, however, of where these degrees of freedom parameters come from, and
of how we derive the distributions of the test statistics we have studied. The
key theorem in proving the results we have stated about the distributions of test
statistics is Cochran’s Theorem. I will state this result for the MATH2931 stu-
dents.

First we need a definition.

Definition:

Let y be an n × 1 normal random vector with mean µ and covariance matrix I. Then
y ⊤ y follows the noncentral χ² distribution with n degrees of freedom and noncentrality
parameter λ = ½ µ⊤ µ. We write y ⊤ y ∼ χ²n,λ .

Theorem:

Suppose that A is an n × n idempotent matrix, and that y is an n × 1 normal


random vector with mean vector µ and covariance matrix σ 2 I, where I is the
n × n identity matrix. Write Q for the quadratic form Q = (1/σ²) y ⊤ Ay, and suppose
that with Qi = (1/σ²) y ⊤ Ai y, i = 1, ..., k, we have

Q = ∑_{i=1}^{k} Qi .

If we write ri for the rank of Ai and r for the rank of A then any one of the
following conditions implies the other two.
1. ∑_{i=1}^{k} ri = r

2. Each Qi has a noncentral χ² distribution with ri degrees of freedom and
   noncentrality parameter λi = µ⊤ Ai µ/(2σ²).

3. Each Qi is independent of every other.

What we have called the total sum of squares SStotal , in the general linear
model takes the form y ⊤ Ay for an idempotent matrix A, where y here is the vector
of responses. Also, the regression sum of squares and residual sum of squares can
be written as quadratic forms in y. When we are deriving the test statistics above
we are considering decompositions of a quadratic form into meaningful parts, and
showing that the parts are independent and χ2 distributed (which is what we need
to do to prove that our test statistics, which involve ratios of quadratic forms,
are F distributed). For a quadratic form y ⊤ Ay, the rank of the quadratic form is
defined to be simply the rank of A. Each of the sums of squares in the analysis
of variance table are quadratic forms, and the degrees of freedom column in the
analysis of variance table specifies the ranks of those quadratic forms.

3.12 Multicollinearity
In a previous lecture where we developed a partial t test for the significance of
a predictor in the general linear model in the presence of other predictors we
mentioned that the interpretation of these tests is complicated when some of the
predictors in the model contribute similar information. In this subsection we
discuss this idea more formally.
In the course so far we have been dealing with the full rank linear model, in
which the design or model matrix X has full rank. Recall that X has full rank
if no column of X can be expressed as a linear combination of the remaining
columns. When the predictors in our linear model contribute similar informa-
tion, then it may happen that although X is of full rank, we can still nearly
express one of the predictors as a linear combination of the remaining predictors:
multicollinearity is the name given to this condition.

Effects of multicollinearity

Why is multicollinearity a potential problem, and what are its effects? If X does
not have full rank, then there is no unique solution to the normal equations: we
can get the same vector of fitted values by forming different linear combinations
of the vectors of predictor values. When multicollinearity occurs in the full rank
linear model, something close to this situation also occurs: quite different linear
combinations of the predictor values can result in almost the same vector of fitted
values. Hence for quite different β values the least squares criterion (which was
just the sum of the squared differences between the observations and fitted values)

may not change very much. Roughly speaking, there may be a large region in
the space of the possible parameter values where we have much the same fit, and
this makes it difficult to estimate β precisely.
Figure 3.9 illustrates the phenomenon of multicollinearity for an artificial data
set involving two predictors. In the Figure, y is the response and x1 and x2 are
the predictors. We can see that x1 and x2 are strongly linearly related, and
the (y, x1 , x2 ) points nearly lie on a line in three-dimensional space. Since there
are an infinite number of planes which can pass through a given line in three-
dimensional space, it’s clear that when attempting to fit a plane through these
points we could have two planes with very different coefficients for x1 and x2
resulting in almost the same quality of fit (residual sum of squares). Because
x1 and x2 are so closely related, it is hard to separate out their effects. The
coefficients will not be well estimated, and attempting to make predictions at
new points which don’t respect the dependencies in the original data could result
in nonsensical answers.
The example above is sometimes referred to as the picket fence of multi-
collinearity. Think of the heights y as representing the tops of pickets at the
(x1 , x2 ) points in a picket fence. Imagine trying to balance a table top on the
pickets: the balance is very precarious in directions perpendicular to the line of
pickets.
More formally, recall that the covariance matrix of the least squares estimator
b of β in the general linear model was

σ 2 (X ⊤ X)−1 .

If multicollinearity occurs, then X ⊤ X is “nearly” singular, and this may cause


the diagonal elements of (X ⊤ X)−1 to be very large. Hence if multicollinearity
occurs, estimation of the parameters β can be very poor, with standard errors
of the components of b extremely large. It is important to note that we will
not detect multicollinearity by looking at goodness of fit measures like R2 or
adjusted R2 , or by examining residuals. Even though estimates of the regression
parameters are poor, the least squares estimator b of β by definition minimizes
the sum of the squared residuals, and the fitted model involving b may produce a
high R2 value. The presence of multicollinearity can also have an adverse effect
on prediction: predictions can be wild for vectors of predictor values which are
not typical of the linear dependencies in the original data.
How do we detect multicollinearity when it occurs? There are many sophisti-
cated diagnostics which can be used. However, we won’t discuss these diagnostics.
One simple measure which may be examined, however, is the correlation matrix
for the predictor variables. If two predictor variables are closely linearly related,
then the correlation coefficient for these two predictors will be close to 1, and
multicollinearity occurs. Note, however, that multicollinearity can be more sub-
tle than this: it is possible for all correlations between pairs of predictors to be

Figure 3.9: Example illustrating multicollinearity: on the plot, y is the response


and x1 and x2 are the predictors. The points nearly lie on a line in three-
dimensional space, and we can rotate the fitted plane and still get almost the
same quality of fit (residual sum of squares).

small, but for one of the predictors to be capable of nearly being expressed as a
linear combination of two or more of the remaining predictors. We now describe
an example which illustrates some of the effects of multicollinearity.

Example: Hospital manpower data

The following data are from Raymond H. Myers, “Classical and Modern Regres-
sion with Applications (Second Edition),” Duxbury, Belmont, California, 1990,
pp. 130–133. Data were collected from seventeen U.S. Naval hospitals at various
sites around the world. The regressors are variables which are thought to predict
workload (monthly man hours) at the hospitals. The variables are as follows:

y = Monthly man-hours
x1 = Average daily patient load
x2 = Monthly X-ray exposures
x3 = Monthly occupied bed days
x4 = Eligible population in the area / 1000
x5 = Average length of patients’ stay in days.

It was desired to predict workload (i.e. the monthly man-hours y) based on the
predictor variables x1 , x2 , x3 , x4 and x5 . Fitting a multiple linear regression
model gives the following output.
Regression Analysis

The regression equation is


y = 1963 - 15.9 x1 + 0.0559 x2 + 1.59 x3 - 4.22 x4 - 394 x5

Predictor Coef StDev T P


Constant 1963 1071 1.83 0.094
x1 -15.85 97.65 -0.16 0.874
x2 0.05593 0.02126 2.63 0.023
x3 1.590 3.092 0.51 0.617
x4 -4.219 7.177 -0.59 0.569
x5 -394.3 209.6 -1.88 0.087

S = 642.1 R-Sq = 99.1% R-Sq(adj) = 98.7%

Analysis of Variance

Source DF SS MS F P
Regression 5 490177488 98035498 237.79 0.000
Residual Error 11 4535052 412277
Total               16  494712540

x1 x2 x3 x4 x5 y

15.57 2463 472.9 18.0 4.45 566.5


44.02 2048 1339.8 9.5 6.92 696.8
20.42 3940 620.3 12.8 4.28 1033.2
18.74 6505 568.3 36.7 3.90 1603.6
49.20 5723 1497.6 35.7 5.50 1611.4
44.92 11520 1365.8 24.0 4.60 1613.3
55.48 5779 1687.0 43.3 5.62 1854.2
59.28 5969 1639.9 46.7 5.15 2160.6
94.39 8461 2872.3 78.7 6.18 2305.6
128.02 20106 3655.1 180.5 6.15 3503.9
96.00 13313 2912.0 60.9 5.88 3571.9
131.42 10771 3921.0 103.7 4.88 3741.4
127.21 15543 3865.7 126.8 5.50 4026.5
252.90 36194 7684.1 157.7 7.00 10343.8
409.20 34703 12446.3 169.4 10.78 11732.2
463.70 39204 14098.4 331.4 7.05 15414.9
510.22 86533 15524.0 371.6 6.35 18854.5

Source DF Seq SS
x1 1 480612694
x2 1 7231656
x3 1 598469
x4 1 276098
x5 1 1458572

At first sight this seems like a satisfactory model: we have a very high R2 value
of 99.1 percent here. However, when we compute the correlation matrix of the
predictors we find that the correlation between x1 and x3 is greater than 0.99.

The opposite extreme to collinearity is orthogonality. Write x̄j for the mean
of the values for the jth predictor,
x̄j = (∑_{i=1}^{n} xij ) / n.
Also, write sj for the quantity
sj = √( ∑_{i=1}^{n} (xij − x̄j )² ).

Now suppose we create a new set of predictor values by


zij = (xij − x̄j ) / sj .
Let Z be the n × k matrix with (i, j)th element equal to zij . If we fit a model
using the predictors in Z instead of the predictors in the last k columns of X we
do not really change the model: note that
βj zij = βj (xij − x̄j )/sj = −βj x̄j /sj + (βj /sj ) xij
which is a linear function of the original predictors (using the transformed pre-
dictors just changes the intercept and slope). Also note that the matrix Z ⊤ Z is
simply the correlation matrix of the predictors. You can easily verify that the
(j, k)th element of this matrix is

∑_i (xij − x̄j )(xik − x̄k ) / √( ∑_i (xij − x̄j )² ∑_i (xik − x̄k )² )

which is the sample correlation between the jth and kth predictors.
We say that the predictors are orthogonal if they are uncorrelated: that is,
Z⊤Z = I where I is the identity matrix. In the case of orthogonality, if we fit the
model involving the transformed predictors, it is easy to see that the estimators of
the predictor coefficients are independent (the covariance matrix of the predictor
coefficients is σ 2 (Z ⊤ Z)−1 = σ 2 I and normality implies independence from this).
This independence makes interpretation of coefficient estimates and of the partial
t statistics much easier.
In general, if orthogonality doesn’t hold and we fit a model using the stan-
dardized predictors Z, then the diagonal elements of

σ 2 (Z ⊤ Z)−1

will be larger than σ² (their value in the case of orthogonality). The diagonal el-
ements of (Z ⊤ Z)−1 give the factor by which the variance of each coefficient is
inflated relative to the ideal case of orthogonality. These diagonal elements are called the variance
inflation factors of the coefficients, and they are very useful for detecting mul-
ticollinearity. For the hospital manpower data the variance inflation factors are
(9597.57, 7.94, 8933.09, 23.29, 4.28). The first and third elements (corresponding
to x1 and x3 , which are highly correlated) are very large here, and the quality of
estimation for the coefficients for these predictors is very poor.
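A sketch of how the variance inflation factors can be computed in R: they are the
diagonal elements of the inverse of the predictor correlation matrix Z⊤Z. Here Xpred
is assumed to be a matrix or data frame holding only the predictor columns (x1, ..., x5
for the hospital manpower data).

vif <- diag(solve(cor(Xpred)))    # diagonal of the inverse correlation matrix
round(vif, 2)

Add-on packages (for example the car package's vif() function) compute the same
quantities directly from a fitted lm object.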
More sophisticated methods for detecting multicollinearity and diagnosing
what variables are involved are based on looking at the eigenvalues and eigenvec-
tors of the correlation matrix Z ⊤ Z. (MATH2831 students can ignore the rest of
this section if they wish). Roughly speaking, the eigenvectors corresponding to
eigenvalues which are close to zero describe a near linear dependence among the
columns of Z. We can write

Q⊤ (Z ⊤ Z)Q = Λ

where Λ is a diagonal matrix of eigenvalues and Q is an orthogonal matrix of


eigenvectors, QQ⊤ = I. The eigenvalues of Z ⊤ Z are always positive, and Z ⊤ Z
will be “nearly” singular (indicating multicollinearity) if some of the eigenvalues
are close to zero. If an eigenvalue λj is small, we have for the eigenvector qj (jth
column of Q) that
qj⊤ (Z ⊤ Z)qj = (Zqj )⊤ (Zqj ) ≈ 0
The vector Zqj (length n) is a linear combination of the columns of Z (the vectors
of predictor values) and (Zqj )⊤ (Zqj ) is the squared norm of this vector. This
squared norm is close to zero only if the vector itself is nearly zero: the elements
of the eigenvector qj describe a near linear dependence among the predictors.

4 Model Selection
At the beginning of this course we discussed some of the reasons why we build
statistical models. We build statistical models in order to make decisions, and in
the process of selecting a model we must manage a trade off between complexity
and goodness of fit in order to provide a reduction of the data that is useful for
decision making.
In the context of the general linear model, the problem of model selection
appears when we are deciding which predictor variables should be included for
explaining variation in the response.
However, often data may have been collected for a large number of predic-
tors, some of which might be unrelated to the response. We may wish to choose
a simple model containing a subset of the predictors (or perhaps to choose a
small collection of simple models containing subsets of the predictors) which of-
fer a simpler explanation of the observed variation than a model incorporating
all the predictors. Our interest may be in summarizing the data succinctly, in
trying to determine which predictors are related to the response and which are
not, or in prediction of future response values. Often there is no single best model.

Model selection for prediction

One of the most common reasons for building a statistical model is prediction,
and we now discuss the issue of model selection when prediction is the goal. We
begin by asking the question: is it harmful for prediction if we fit a statistical
model which is more complicated than we really need? We will illustrate that
fitting an unnecessarily complicated model can be harmful by considering the
most elementary case, the simple linear regression model.
Suppose we have a data set consisting of responses y1 , ..., yn and corresponding
predictor values x1 , ..., xn , and suppose that the simple linear regression model
holds,
yi = β0 + β1 xi + εi
where εi , i = 1, ..., n, is a collection of uncorrelated zero mean errors with common
variance σ 2 say. We write M0 for the model

y i = β0 + εi

in which the predictor xi is excluded and write M1 for the full model

y i = β 0 + β 1 xi + εi .

We consider fitting these two models to the data, and develop expressions for the
expected squared prediction errors for the models for a new observation y∗ when
the predictor value is x∗ . The expressions for expected squared prediction error

will help to illustrate the trade off between complexity and goodness of fit that
is involved in all model selection problems.

Prediction for a model with just an intercept

Consider first fitting the model M0 which involves just an intercept. To estimate
β0 , we must minimize
∑_{i=1}^{n} (yi − β0 )².

Differentiating this with respect to β0 , we get



−2 ∑_{i=1}^{n} (yi − β0 ).

Hence the least squares estimator b0 of β0 satisfies



∑_{i=1}^{n} (yi − b0 ) = 0

from which we have



∑_{i=1}^{n} yi = n b0

or

b0 = ȳ.

For the model M0 then, our prediction of a new observation y∗ from the fitted
model is simply ȳ, the mean of the responses. We write yb0 (x∗ ) for the predicted
value of y∗ for model M0 when the predictor is x∗ .

Prediction for the full model

For the model M1 , write yb1 (x∗ ) for the predicted value of y∗ when the predictor
is x∗ : that is, let
yb1 (x∗ ) = b0 + b1 x∗
where b0 and b1 are the least squares estimators of β0 and β1 in the simple linear
regression model that we developed at the beginning of the course. (At the risk
of some confusion we have used the notation b0 for the least squares estimator
of the intercept in both models M0 and M1 even though the estimator of the
intercept is different for the two models).

Comparing predictive performance

One way of measuring predictive performance of the models is to consider the


expected values of the squared prediction errors. That is, we consider

E((y∗ − yb0 (x∗ ))2 )

and
E((y∗ − yb1 (x∗ ))2 ).
Consider first E((y∗ − yb0 (x∗ ))2 ). Recall that for a random variable Z, Var(Z) =
E(Z 2 ) − E(Z)2 so that
E(Z 2 ) = Var(Z) + E(Z)2 .
Applying this identity to y∗ − ŷ0 (x∗ ), we have

E((y∗ − ŷ0 (x∗ ))²) = Var(y∗ − ŷ0 (x∗ )) + E(y∗ − ŷ0 (x∗ ))²
                    = Var(y∗ ) + Var(ŷ0 (x∗ )) + E(y∗ − ŷ0 (x∗ ))²
                    = σ² + Var(ŷ0 (x∗ )) + E(y∗ − ŷ0 (x∗ ))².                    (32)

By entirely similar reasoning, we can also show that

E((y∗ − ŷ1 (x∗ ))²) = σ² + Var(ŷ1 (x∗ )) + E(y∗ − ŷ1 (x∗ ))².

The terms in the above expressions are easily interpreted. The second term is
just the variance of the prediction. The variance of prediction depends on the
variance of our estimators of the model parameters, and we might expect that
this is larger for the more complex model M1 where there are more parameters
to estimate. In fact, if either model M0 or M1 holds,

Var(ŷ0 (x∗ )) = σ²/n

and

Var(ŷ1 (x∗ )) = σ² (1/n + (x∗ − x̄)²/Sxx ).
So prediction variance is larger for the more complex model.
The third term in both expressions for expected squared prediction error is
the square of the prediction bias. If the data conform to the model M1 , then we
showed in our discussion of prediction for the general linear model that E(y∗ −
yb1 (x∗ )) = 0 and so the squared bias term is zero for model M1 . However, if we
fit the model M0 when model M1 holds with a nonzero β1 , then the bias term
E(y∗ − yb0 (x∗ ))2 may be nonzero. If M0 holds then the bias term will be zero. So
the bias term is always smaller for the more complex model, and the variance
term is always larger.

The key idea we are coming to here is the following: a complicated model
has a higher variance of prediction than a simpler one, but a model which is too
simple and ignores important predictors may make predictions which are system-
atically biased. Good model selection involves managing a trade off between the
contributions of bias and variance to prediction error.

4.1 Out of sample prediction, PRESS residuals and the PRESS statistic

If we want to get some idea of the magnitude of prediction errors for a fitted
linear model for new observations we cannot simply look at the residuals from
the fitted model. The within sample prediction errors (residuals) will typically
be smaller than so-called out of sample prediction errors (prediction errors for
new observations which were not used in fitting the model). We have chosen
the parameters in the fitted model by minimizing a measure of within sample
prediction error, so that the true extent of typical discrepancies between fitted
values and new observations may be underestimated by looking at the residuals
for the fitted model.
We illustrate this phenomenon for the hospital manpower data introduced in
the last lecture before discussing one measure which attempts to capture out of
sample predictive performance and which may be used as a criterion for model
selection.

Example: hospital manpower data

The following example of the difference between within sample and out of sam-
ple predictive performance is given by Myers (1990), “Classical and Modern Re-
gression with Applications,” Duxbury, Belmont, California, pp. 168–169. The
example relates to the hospital manpower data we discussed in the last lecture.
There are 17 observations here. If we fit a linear regression model with y as the
response and x1 , x2 , x3 , x4 and x5 as predictors, then the residual for the last
observation (within sample prediction error) is −466.27. Now, suppose we refit
the model with the 17th observation deleted. Then we can produce a forecast
of the response for the 17th observation based on this fitted model: when we do
this, the prediction error is −3675.121, which is much larger (in absolute value)
than the within sample prediction error of −466.27. Similarly, we can delete
the other observations one at a time, and produce forecasts of each observation
from a model not including that observation. For the fifteenth and sixteenth
observations, for instance, this gives prediction errors of −2510.842 and 2242.496
respectively (and the residuals for the fifteenth and sixteenth observations when
the model is fitted to all the data are −504.857 and 376.549 respectively, see
Myers, p. 169). So we can see that in this example, prediction out of sample

seems to be much poorer than prediction within sample.

In the example above, we tried to measure out of sample predictive per-


formance by omitting one observation at a time, fitting a model without that
observation and then using the fitted model to forecast the omitted observation.
In this way we avoid the problem of using an observation both for fitting a model
and assessment of the model. We write ŷi,−i for the forecast of the ith observation
obtained by fitting a model to the data using all observations except the ith. The
quantity
ei,−i = yi − ŷi,−i
is called the ith PRESS residual. A global measure of goodness of fit based on
the PRESS residuals is the PRESS statistic, which is the sum of the squared
PRESS residuals:
PRESS = ∑_{i=1}^{n} e²i,−i .

Unlike the residual sum of squares, which the PRESS statistic resembles, the
PRESS statistic does not necessarily decrease as we make the model more com-
plex. The PRESS statistic can be computed and compared for different models,
and the model with minimum PRESS chosen as one likely to have good out of
sample predictive performance.

4.2 The hat matrix and computation of the PRESS residuals

It may appear at first sight that computation of the PRESS statistic is difficult.
To compute the PRESS statistic, it seems that we must fit n different models
where we delete one of the observations for each fit. If there are a large number
of models to compare (large number of predictors) and a large number of obser-
vations, the computations quickly become unmanageable. As it turns out, it is
not necessary to fit n separate regressions to compute the PRESS residuals. If
we compute the residuals for the full fitted model, and a measure related to the
influence of each observation on the fit (the so-called leverage of each observation)
then the PRESS residuals and PRESS statistic can be computed.

Leverage and the hat matrix

In order to give the computational formula for the PRESS residuals in terms
of ordinary residuals and leverages, we need to first define the leverages. The
leverages are the diagonal elements of the so-called hat matrix,
H = X(X ⊤ X)−1 X ⊤ .

To interpret the hat matrix, just observe that the vector of fitted values is

yb = Xb = X(X ⊤ X)−1 X ⊤ y = Hy.

So multiplying the vector of responses by H gives the fitted values. The ith
diagonal element hii of H (ith leverage) multiplies yi in determining ybi . So the
leverage is some measure of the influence of yi on the fit at the ith set of predictor
values.
There is another interpretation of the leverages. If we write xi for the vec-
tor of predictors corresponding to the ith observation of the response, xi =
(1, xi1 , xi2 , .., xik )⊤ , then we can write hii as

hii = xi⊤ (X⊤X)⁻¹ xi .

This can help us to interpret the leverages if we observe that

Var(ŷ(xi )) = Var(xi⊤ b)
            = xi⊤ Var(b) xi
            = σ² xi⊤ (X⊤X)⁻¹ xi
            = σ² hii

so that the ith leverage is apart from σ 2 simply the prediction variance for the
fitted model at xi . It can be shown that the leverage always lies between zero and
one (the prediction variance is always nonnegative of course, and the variance of
a prediction is never worse than the residual error variance at one of the observed
xi ). Note that the leverage depends only on the vector of predictors (not on
the response value): you should think of it as measuring the potential influence
of an observation, with high leverage indicating that the vector of predictors is
somehow extreme compared to the other predictor vectors.

PRESS residuals and leverage

Now that we have discussed the idea of leverage, we can describe computation
of the PRESS residuals and the PRESS statistic. If ei is the ordinary residual
for the ith observation, hii is the ith leverage value, and if ei,−i is the ith PRESS
residual, then
ei,−i = ei / (1 − hii ).
So to compute the PRESS residuals (and hence the PRESS statistic) all we need
to do is compute the ordinary residuals and the diagonal of the hat matrix. If
the leverage is large (so that the ith observation is an influential one in the fit)
then the PRESS residual is made much larger in absolute value than the ordinary
residual by the divisor of 1 − hii in the above equation. In effect, the PRESS
residuals are obtained by increasing in absolute value the size of the ordinary

residuals, with the amount of the increase related to how influential the observa-
tion may be in fitting the full model.
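A sketch of this computation in R for a fitted lm object fit:

h       <- hatvalues(fit)          # leverages h_ii (diagonal of the hat matrix)
e       <- residuals(fit)          # ordinary residuals e_i
e_press <- e / (1 - h)             # PRESS residuals e_{i,-i}
PRESS   <- sum(e_press^2)          # PRESS statistic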

Example: capital asset pricing model

An example will help to make the concept of the leverage clear for the simple
linear regression model. As we have mentioned, the ith leverage value is (apart
from σ 2 ) the variance of our predictor of the response at xi . For the simple linear
regression model, we showed that the variance of our predictor of the conditional
mean at x0 was

σ² (1/n + (x0 − x̄)²/Sxx )

so that the ith leverage value is

1/n + (xi − x̄)²/Sxx .
We see that the value of the leverage will be large if xi is far away from the
mean of the predictors x̄: outlying points in the predictor space (predictor values
distant from x̄) have potentially high influence and a high leverage. Below is
the result of fitting a simple linear regression model involving stock price as the
response and TSE300 index as the predictor for the capital asset pricing model
data discussed on Tutorial Sheet 5.
Also shown is a scatterplot of stock price versus TSE300 index.

Regression Analysis

The regression equation is


Stock Price = - 0.982 + 0.0101 TSE 300 Index

Predictor Coef StDev T P


Constant -0.9817 0.8685 -1.13 0.272
TSE 300 0.0101020 0.0008176 12.36 0.000

S = 0.3176 R-Sq = 88.9% R-Sq(adj) = 88.3%

Analysis of Variance

Source DF SS MS F P
Regression 1 15.400 15.400 152.65 0.000
Residual Error 19 1.917 0.101
Total 20 17.317

Figure 4.10: Scatter plot of stock price against TSE300 index.



The leverage values for the 21 observations are 0.073, 0.064, 0.057, 0.075,
0.088, 0.053, 0.052, 0.068, 0.070, 0.099, 0.059, 0.048, 0.072, 0.066, 0.048, 0.051,
0.080, 0.078, 0.168, 0.247, 0.386. We see that the largest leverage value corre-
sponds to the last observation (which happens to be the rightmost predictor value
in the scatter plot).

We now give an example illustrating the use of the PRESS statistic for model
selection.
Example: data on cheddar cheese

In this example we will consider some data from a study on cheddar cheese in
the La Trobe Valley, Victoria. For this data set, we have a subjective measure
of cheese taste as the response (taste) and as predictors we have measures of
concentration of acetic acid (acetic), concentration of hydrogen sulfide (H2S) and
concentration of lactic acid (lactic). The table below shows the values of R2 ,
adjusted R2 and the PRESS statistic for all possible models with 1, 2 and 3
predictors.

Model R2 R̄2 PRESS

H2S 0.571 0.556 3688.08


lactic 0.496 0.478 4375.64
acetic 0.302 0.277 6111.26
H2S, lactic 0.652 0.626 3135.44
H2S, acetic 0.582 0.551 3877.62
lactic, acetic 0.520 0.485 4535.47
H2S, lactic, acetic 0.652 0.612 3402.24

For this example, we see from the table that the best model in terms of the
PRESS statistic is the model which contains Lactic and H2S. This model is also
the best in terms of maximum adjusted R2 .

4.3 Cross validation


In this subsection we discuss the idea of cross validation, which is related to the
ideas we developed in the definition of PRESS residuals and the PRESS statistic.
In developing the PRESS residuals we left out one observation at a time and fitted
a model to the remaining observations, and then predicted the observation which

was omitted. The idea of cross validation is to split the data into two parts, a
training sample (which is used to estimate parameters in candidate models) and
a validation sample (which is used to assess predictive performance). Selection of
a model is based on predictive performance in the validation sample. If we have
m + n observations (responses y1 , .., ym+n and corresponding vectors of predictors
x1 , ..., xm+n ) and if the first m observations constitute the training sample and the
last n constitute the validation sample, then we might compare different models
according to the criterion


∑_{i=m+1}^{m+n} (yi − xi⊤ b)²                    (33)

where b is the least squares estimator of β based on the first m observations. A


common alternative is to compare different models based on


∑_{i=m+1}^{m+n} |yi − xi⊤ b|.

Once a model has been selected, we refit the selected model using the whole data
set.
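A sketch of this split-sample procedure in R, under the assumption of a data frame
dat with response y and two candidate predictors; the choice of split (here simply
the first m rows) is part of the difficulty discussed next:

m     <- 15
train <- dat[1:m, ]
valid <- dat[-(1:m), ]

fit_cv <- lm(y ~ x1 + x2, data = train)        # candidate model fitted to the training sample
pred   <- predict(fit_cv, newdata = valid)
sum((valid$y - pred)^2)                        # criterion (33)
sum(abs(valid$y - pred))                       # absolute error alternative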
There are a number of difficulties with the idea of cross validation. Firstly, we
must have a large enough number of observations so that we can split the data into
two parts which are large enough to enable reliable estimation of parameters and
reliable estimation of predictive performance. This may not always be the case.
Also, there is the problem of how exactly to split the data (which observations
should go in the training and validation samples, and what should m and n be?)
We give an example to illustrate the use of cross validation and some of the
problems involved in its application.

Example: cross validation for data on cheddar cheese tastings

For the cheddar cheese tastings data, there are 30 observations which we split
into two parts of 15 observations each (the first 15 observations in the sample
constitute the training sample, and the last 15 constitute the validation sample).
I have computed the criterion (33) using the validation sample after fitting each
of the possible models involving the three predictors to the training sample. The
results are shown in the table below. Also shown are the values of (33) when
the second half of the data is used as the training sample and the first half is
used as the validation sample. We can see from the table that the best model
according to the cross validation criterion for both cases is the one involving H2S
and Lactic. However, there are quite different predictor values in the first fifteen
and last fifteen observations, and our ranking of the models is highly dependent
on the choice of training set: for instance, the full model is the second best model
when the second half of the data is used for the training set, but it is the second
worst model when the first half of the data is used for the training set. How to
split the data into two parts can be a difficult question in application of the idea
of cross validation.

Model                    Cross validation     Cross validation
                         criterion            criterion
                         (fitting to          (fitting to
                         first half)          second half)

H2S                       2257.94              3312.33
Lactic                    2510.53              3381.18
Acetic                    5820.82              3423.51
H2S, Lactic               2139.56              2924.22
H2S, Acetic               9814.08              3339.40
Lactic, Acetic            5081.51              3473.03
H2S, Lactic, Acetic       8930.57              2959.95

4.4 The conceptual predictive criterion (Mallow’s Cp ).


We describe another criterion for selection of a statistical model when prediction
is the goal of a study. MATH2931 students will derive this criterion in Assignment
3. The idea of the criterion is to minimize an estimate of the quantity

∑_{i=1}^{n} MSE(ŷ(xi ))/σ²                    (34)

where (writing y(xi ) for the mean of the response at xi ) MSE(ŷ(xi )) is the mean
squared error of prediction at xi ,

MSE(ŷ(xi )) = E((ŷ(xi ) − y(xi ))²).

Now, using the formula Var(Z) = E(Z 2 ) − E(Z)2 , we have E(Z 2 ) = Var(Z) +
E(Z)2 and hence

MSE(\hat{y}(x_i)) = Var(\hat{y}(x_i)) + Bias(\hat{y}(x_i))^2

where Bias(\hat{y}(x_i)) = E(\hat{y}(x_i) - y(x_i)). So (34) is equal to

\frac{\sum_{i=1}^{n} Var(\hat{y}(x_i))}{\sigma^2} + \frac{\sum_{i=1}^{n} Bias(\hat{y}(x_i))^2}{\sigma^2}.

It can be shown that (n - p)(\hat{\sigma}^2 - \sigma^2) is an unbiased estimator of

\sum_{i=1}^{n} Bias(\hat{y}(x_i))^2

and that

\frac{\sum_{i=1}^{n} Var(\hat{y}(x_i))}{\sigma^2} = p,

so that

p + \frac{(n - p)(\hat{\sigma}^2 - \sigma^2)}{\sigma^2}

is an estimator of (34). In practice σ^2 is not known, but an unbiased estimate of
σ^2 (provided one of the models in the class of models being considered is correct)
is the estimate \hat{\sigma}_F^2 based on the full model: substituting into the expression above
gives Mallow's Cp statistic,

C_p = p + \frac{(\hat{\sigma}^2 - \hat{\sigma}_F^2)(n - p)}{\hat{\sigma}_F^2}.

Although we don't derive Mallow's Cp here, it is easy to see that by minimizing
Cp we are managing a goodness of fit/complexity trade off. The first term p in the
above expression grows with the complexity of the model, and the last term is a
measure of goodness of fit (since if we have omitted important covariates the estimate
\hat{\sigma}^2 of σ^2 will be biased upwards, and hence \hat{\sigma}^2 - \hat{\sigma}_F^2 will tend to be positive).
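As a small illustration, the following sketch (Python; the helper name and its arguments are illustrative choices, not notation from the notes) computes Cp for a candidate model from its residual sum of squares and that of the full model.

def mallows_cp(rss_model, p, rss_full, n, p_full):
    # sigma^2 estimated from the candidate model and from the full model
    sigma2_hat = rss_model / (n - p)
    sigma2_full_hat = rss_full / (n - p_full)
    # Cp = p + (sigma2_hat - sigma2_full_hat)(n - p) / sigma2_full_hat,
    # which simplifies to rss_model / sigma2_full_hat - n + 2p.
    return p + (sigma2_hat - sigma2_full_hat) * (n - p) / sigma2_full_hat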

Example: cheddar cheese tastings

To illustrate the use of Cp , we consider once more the data on cheddar cheese
tastings.
Best Subsets Regression

Response is taste

Vars    R-Sq    Adj. R-Sq    C-p        s        Variables in model

 1      57.1      55.6        6.0    10.833      H2S
 1      49.6      47.8       11.6    11.745      Lactic
 1      30.2      27.7       26.1    13.821      Acetic
 2      65.2      62.6        2.0     9.9424     H2S, Lactic
 2      58.2      55.1        7.2    10.890      H2S, Acetic
 2      52.0      48.5       11.8    11.668      Lactic, Acetic
 3      65.2      61.2        4.0    10.131      Acetic, H2S, Lactic

The best model according to the Cp criterion is again the one involving H2S and
Lactic. All the various model selection criteria we have considered so far have
led to the model involving H2S and Lactic as the best model. The choice of a
best model (for predictive or other purposes) will often not be so clear, and it is
always worthwhile to consider a number of model selection criteria in choosing a
model.

4.5 Sequential variable selection procedures


The approaches we have developed so far to selection of a model are based on
computing some criterion (Mallow’s Cp or PRESS for instance) for all possible
subset models and then choosing the model which optimizes the criterion. If the
number of predictors under consideration is very large, then there is an enormous
number of possible models, and computing any criterion for all possible models
may be difficult.
Consider the situation where we have k predictors. Write γ = (γ1 , ..., γk ) for
a k × 1 vector where γj = 1 if the jth predictor is in the model and γj = 0 if
the jth predictor is not in the model. Each distinct combination of the binary
variables in γ corresponds to a different model. There are two possible values for
γ1 (zero or one), two possible values for γ2 , and so on. In general the number
of distinct values for the vector of binary variables γ is 2 multiplied by
itself k times, or 2^k. For instance, consider the situation where there are just 2
predictors. γ has length two here, and its possible values are (0, 0), (0, 1), (1, 0)
or (1, 1). So there are four possible models: the model containing no predictors,
two possible models containing one predictor and the full model.
As k increases, 2^k quickly becomes very large indeed. With modern computing
packages it is now possible to compute a model selection criterion like Cp for ev-
ery possible model in most typical problems, but this was not always the case. To
cope with the computational difficulties of model selection for large k so-called
sequential methods of variable selection have been developed, and these methods
are still sometimes used in large problems. The methods we will describe are
also implemented in most computer packages, and are often used for this reason
despite the associated statistical problems (see later).

Forward selection

The idea of sequential variable selection methods is to start with some initial
model, and to then add, delete or change a variable in the initial model to im-
prove the fit. We then add, delete or change a variable in the new model, and
continue searching through the space of possible models in this way until some
stopping criterion is satisfied. By starting with some initial model and making
small changes to it we avoid the need to compute some criterion function for all
possible models. The mechanisms for adding or deleting variables in the stepwise
procedure are based on the hypothesis tests (partial t and F tests) which were
discussed earlier in the course.
There are three basic sequential variable selection techniques. The first is for-
ward selection. The idea of forward selection is to start with our best model in-
volving one variable, and then add the variable at each step which most improves
the fit until some stopping rule is satisfied. The hope with forward selection is
that by taking our best simple model and improving it as much as we can at each
step, we will arrive at a model that is good globally: this hope may not always
be realized. We can summarize the steps in forward selection as follows:

1. Fit the model involving just an intercept

2. Until some stopping rule is satisfied, find the predictor not currently in the
model (if there is one) for which the partial F -statistic is largest and add
it to the model.

We haven’t said anything yet about the stopping criterion in the method
above. One approach is to stop when the model contains a certain fixed number
of variables, p∗ say. Another common stopping rule is to stop when none of the
partial F statistics for the coefficients considered in the second step is significant
at some preselected level. The default significance level used varies between
computer packages. We give an example which will illustrate the idea of forward
selection.
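Before turning to the example, here is a rough sketch of the algorithm in Python (numpy and scipy); the function names, the use of a p-value stopping rule and the default level of 0.25 are illustrative assumptions rather than a prescribed implementation.

import numpy as np
from scipy import stats

def rss(M, y):
    # residual sum of squares from the least squares fit of y on the columns of M
    b, *_ = np.linalg.lstsq(M, y, rcond=None)
    r = y - M @ b
    return r @ r

def forward_selection(X, y, alpha_in=0.25):
    # X: n x k matrix of predictors (no intercept column); y: responses.
    n, k = X.shape
    selected = []
    current = np.ones((n, 1))            # start from the intercept-only model
    while len(selected) < k:
        rss_current = rss(current, y)
        best = None
        for j in range(k):
            if j in selected:
                continue
            cand = np.column_stack([current, X[:, j]])
            rss_cand = rss(cand, y)
            df_resid = n - cand.shape[1]
            F = (rss_current - rss_cand) / (rss_cand / df_resid)
            if best is None or F > best[1]:
                best = (j, F, stats.f.sf(F, 1, df_resid))
        if best is None or best[2] > alpha_in:
            break                        # no predictor left with a significant partial F
        selected.append(best[0])
        current = np.column_stack([current, X[:, best[0]]])
    return selected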
Example: cheddar cheese tastings

Consider the data on cheddar cheese tastings given in assignment three. The
response here was the variable taste (a subjective measure of taste) and the
predictors were measures of the concentrations of various chemicals in the samples
(acetic, lactic and H2S).
Of all the one variable models, the largest partial F -statistic occurs for the
model involving H2S (the F statistic for testing for the significance of the slope
for H2S is 37.29, with a p-value of approximately 0).
Now we consider adding the variables not currently in the model (acetic and
lactic) to the variable H2S. Fitting the model involving H2S and lactic, the partial
F statistic for testing for the significance of lactic in the presence of H2S is
approximately 6.25. If we fit the model involving H2S and acetic, the partial
F statistic for testing for the significance of acetic in the presence of H2S is

approximately 0.71. The largest of these is the F statistic for lactic, with a p-
value of 0.019. Finally we consider adding acetic to the model involving H2S
and lactic. The partial F statistic for acetic is approximately 0.005, with a p-
value exceeding 0.9, clearly not significant at the 0.25 level. Since this is the only
variable not already in the model our stopping rule is satisfied, and the result of
our forward stepwise search is to choose the model involving H2S and lactic.
Stepwise Regression: Taste versus acetic,H2S,lactic

Forward Selection. Alpha-to-enter: 0.25

Response is taste on 3 predictors, with N = 30

Step 1 2
Constant -9.787 -27.592

H2S 5.78 3.95


T-Value 6.11 3.47
P-Value 0.000 0.002

Lactic 19.9
T-Value 2.50
P-Value 0.019

S 10.8 9.94
R-Sq 57.12 65.17
R-Sq(adj) 55.58 62.59
C-p 6.0 2.0
The “Step” columns labelled 1 and 2 here show the variables which are added
at each stage. We see that H2S was added first, and for the simple linear re-
gression model with H2S as predictor the estimated intercept was −9.787 and
the estimated slope was 5.78. Also reported is the partial t statistic for testing
significance of the slope which is 6.11 (we square this to get the partial F statis-
tic) with associated p-value 0.000. The residual standard error is 10.8, the R2 is
0.5712, the adjusted R2 is 0.5558 and Mallow’s Cp is 6.0. At the second stage,
the variable lactic is added to the model: again we have coefficient estimates,
the estimated residual standard error, the R2 , adjusted R2 , Cp and values of the
partial t statistics. The model involving H2S and lactic is the final model here.

Backward elimination

The second main sequential variable selection algorithm is called backward elim-
ination or backward selection. The idea of backward selection is to start with

an initial model containing all predictors, and to then delete terms at each step
which give the least degradation of fit until some stopping criterion is satisfied.
More formally, we can describe the algorithm as follows:

1. Fit the model involving all the predictors.

2. Until some stopping criterion is satisfied, delete the variable currently in


the model with the smallest partial F statistic.

The stopping rule for backward elimination is usually to stop when some fixed
number of parameters is reached, or to stop when all of the partial F statistics
are significant at a certain level (this is what R does, and the default significance
level is 0.1).
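A corresponding sketch of backward elimination (again Python; the names and the p-value stopping rule at level 0.1 are illustrative assumptions) is given below.

import numpy as np
from scipy import stats

def backward_elimination(X, y, alpha_out=0.1):
    # X: n x k matrix of predictors (no intercept column); y: responses.
    n, k = X.shape
    retained = list(range(k))

    def rss_df(cols):
        M = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        b, *_ = np.linalg.lstsq(M, y, rcond=None)
        r = y - M @ b
        return r @ r, n - M.shape[1]

    while retained:
        rss_full, df_full = rss_df(retained)
        worst = None                     # variable with the smallest partial F
        for j in retained:
            rss_red, _ = rss_df([c for c in retained if c != j])
            F = (rss_red - rss_full) / (rss_full / df_full)
            if worst is None or F < worst[1]:
                worst = (j, F)
        if stats.f.sf(worst[1], 1, df_full) <= alpha_out:
            break                        # all partial F statistics significant: stop
        retained.remove(worst[0])
    return retained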

Example: data on cheddar cheese tastings

If we fit the full model for the cheddar cheese tastings data, then we have that the
partial F statistics for the three variables are approximately 9.80 for H2S, 0.005
for acetic and 5.20 for lactic. So at the first stage of backward elimination we
delete the variable acetic. Now we consider the model involving H2S and lactic.
In this model, the partial F statistics are approximately 12.04 for H2S and 6.25
for lactic. Since both of these are significant at the 0.1 level, our stopping criterion
is satisfied and the final model is again the model involving H2S and lactic.

Stepwise Regression: Taste versus acetic,H2S,lactic

Backward elimination. Alpha-to-remove: 0.1

Response is taste on 3 predictors, with N = 30

Step 1 2
Constant -28.88 -27.59

Acetic 0.3
T-Value 0.07
P-Value 0.942

H2S 3.9 3.9


T-Value 3.13 3.47
P-Value 0.004 0.002

Lactic 19.7 19.9


T-Value 2.28 2.50
P-Value 0.031 0.019

S 10.1 9.94
R-Sq 65.18 65.17
R-Sq(adj) 61.16 62.59
C-p 4.0 2.0

Again the “Step” columns labelled 1 and 2 show the variables in the model at
each stage of the backward elimination procedure. We start with the full model,
and in the second step move to the model involving just H2S and lactic. Also
shown are coefficient estimates, the estimate of the error standard deviation, the
R2 value, adjusted R2 and Cp values.

Stepwise methods

The final sequential variable selection algorithm we will discuss is called stepwise
variable selection, and combines elements of the forward and backward elimina-
tion algorithms. There are many variations of the basic stepwise procedure, and
we will just discuss the following algorithm. Repeat:

1. If there is at least one predictor in the current model, consider deleting
   the variable with the smallest partial F statistic, provided this partial F
   statistic is not significant at some preset level α_OUT.

2. If no variable could be deleted in the first step, consider adding a variable
   to the model. Add the variable with the largest partial F statistic, provided
   it is significant at some preset level α_IN.

3. If no variable could be added or deleted in the first two steps, then stop.

Default values for α_IN and α_OUT have to be supplied.


Example: data on cheddar cheese tastings

For the data on cheddar cheese tastings, we describe the steps involved in step-
wise variable selection. The steps involved are actually the same as for forward
selection in this case. The initial model contains just an intercept, and there is no
variable to delete. Then we consider adding a variable: the variable to be added
(with the largest partial F statistic of approximately 37.29) is H2S. This variable
cannot be deleted, and so we consider adding another variable: the largest partial
F is for lactic, with a partial F value of approximately 6.25 (which is significant
at the 0.15 level). Now consider deleting a variable: the partial F statistics in the
model involving Lactic and H2S are both significant at the 0.15 level, so there is
no variable to delete. The partial F statistic for acetic is not significant at the
0.15 level, so there is no variable to add. So we stop, and the final model is the
one involving Lactic and H2S.

Stepwise Regression: Taste versus acetic,H2S,lactic

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is taste on 3 predictors, with N = 30

Step 1 2
Constant -9.787 -27.592

H2S 5.78 3.95


T-Value 6.11 3.47
P-Value 0.000 0.002

Lactic 19.9
T-Value 2.50
P-Value 0.019

S 10.8 9.94
R-Sq 57.12 65.17
R-Sq(adj) 55.58 62.59
C-p 6.0 2.0

If you want to keep variables in the model regardless of their p-values, you can
enter them as “Predictors to include in every model” after choosing Stepwise....
As we have mentioned there are many variations on the basic stepwise procedure:
one common variation is to consider swaps of variables currently in the model
with variables not in the model if this improves R2 , in addition to additions and
deletions of variables.

4.6 Problems with forward selection, backward selection


and stepwise algorithms
The use of sequential algorithms presents a number of statistical problems. Clearly
the motivation for the forward selection, backward elimination and stepwise al-
gorithms comes from hypothesis testing and in particular from the partial F tests
for examining the significance of a predictor in the presence of the other terms
in a linear model.
With hypothesis testing we make a comparison between two models, one of
which is simpler than the other. In sequential variable selection procedures, we
are comparing two models which differ by a single variable, and we might have
good reason to believe that neither model is adequate. Consider the early stages
of forward selection or the stepwise procedure for instance: here we might ex-
pect that there are many important omitted covariates, and the current model

or any one variable addition to it cannot be reasonable. The residual variance is


overestimated, and partial F statistics are “deflated” (since the residual variance
appears in the denominator of the partial F statistic). Clearly there is also a
problem of multiple testing in sequential algorithms, where decisions about
omission or inclusion of covariates are based on looking at a maximum or minimum
of multiple partial F statistics.
For all the model selection procedures we have considered, there is also the
problem of bias of parameter estimates and overstatement of significance levels
because the same data are used both for model selection and estimation of pa-
rameters. We did show that the least squares estimator of β was unbiased if the
assumed model was correct. However, there the model was known beforehand: if
the final model chosen to be fitted is based on the data y, then the development
we gave previously of distributions of estimators and test statistics is not really
appropriate. The effect of selection has not been taken into account.

Example: regression for simulated data where the response and pre-
dictors are unrelated

To illustrate some of the problems with sequential algorithms we consider a simu-


lated data set. Our discussion here is similar to that of Weisberg (1985), “Applied
Linear Regression (Second Edition),” Wiley, New York, pp. 214–215.
Suppose we simulate 51 vectors of independent N (0, 1) random variables of
length 100. We take the first of these vectors as the response, and the last 50
vectors as the predictors. Now, of course, the response here is completely unre-
lated to all the predictors: but what happens when we apply a sequential variable
selection procedure to this data set?
For a simulation I performed, the regression of the response on all fifty pre-
dictors (not doing any variable selection) gave an R2 of 0.597. This is perhaps
surprising: after all, the predictors are completely unrelated to the response. The
high R2 occurs simply because we have a large number of predictors, and some
of them happen to be related to the response purely by chance. However, the
F statistic for significance of the overall model indicates that the model with no
predictors is to be preferred to the full model.
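A simulation of this kind is easy to reproduce; the following sketch (Python; the random seed is arbitrary, and the computed quantities are just the R^2 and overall F test for the regression on all fifty predictors) generates data in the way just described.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)            # arbitrary seed
data = rng.standard_normal((100, 51))     # 51 independent N(0,1) vectors of length 100
y, X = data[:, 0], data[:, 1:]            # response and 50 unrelated predictors

n, k = X.shape
M = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(M, y, rcond=None)
resid = y - M @ b
rss = resid @ resid
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss                 # often surprisingly large

F = ((tss - rss) / k) / (rss / (n - k - 1))
p_overall = stats.f.sf(F, k, n - k - 1)   # typically not significant
print(r_squared, F, p_overall)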

Regression Analysis

The regression equation is


C1 = - 0.126 + 0.277 C37 - 0.343 C45 - 0.219 C14 - 0.181 C4

Predictor Coef StDev T P


Constant -0.12648 0.08919 -1.42 0.159
C37 0.27750 0.08253 3.36 0.001
C45 -0.34327 0.08991 -3.82 0.000

C14 -0.21942 0.08488 -2.58 0.011


C4 -0.18071 0.07892 -2.29 0.024

S = 0.8622 R-Sq = 25.4% R-Sq(adj) = 22.2%

Analysis of Variance

Source DF SS MS F P
Regression 4 24.0298 6.0074 8.08 0.000
Residual Error 95 70.6279 0.7435
Total 99 94.6576

Source DF Seq SS
C37 1 7.1434
C45 1 7.9000
C14 1 5.0881
C4 1 3.8982

We see here that the p value for overall significance of the model is less than
0.0005, and the p-values for the partial t tests are also less than 0.05 for all
four of the predictors. By ignoring the fact that the data was used to select
which model was fitted we compute inappropriate p values which indicate overall
significance of the model and significance of the coefficients.

4.7 The MAXR procedure


A sequential variable selection procedure we have not yet considered is the so-
called MAXR procedure. Unlike forward selection, backward selection and the
stepwise algorithm, the final result of the MAXR procedure is not a single model
but a list of models, one model for each possible subset size of the predictors.
MAXR begins in the same way as forward selection (that is, we find the
best one variable model). The MAXR procedure then proceeds iteratively. We
consider all the possible models obtained from the current model by adding a
single variable. The variable which gives the largest increase in R2 is chosen.
Given this new model, we then consider each of the predictors in the model in
turn, and we do a swap with the predictor not in the model producing the largest
increase in R2 (provided there is a swap that increases R2 ). We repeatedly cycle
through the predictors in this way making one on one swaps until no increase in
R2 can be made. When this happens, we store the current model as being the
best model of the current size, and then repeat the cycle of adding a predictor
and considering swaps. The procedure continues until all variables are in the
model, and the result of the algorithm is a list of models (one for each subset
size) which can be examined further.

More formally, we can write the algorithm as follows.

1. Set the initial model to be the best simple linear regression model (largest
partial F -statistic or equivalently largest R2 ).

2. Until all variables are in the model, do the following.

(a) Add the variable not currently in the model which produces the largest
increase in R2 . Repeatedly cycle through all the predictors in the new
model, doing the following until R2 does not increase in a cycle through
all the predictors (and save the current model when this occurs).
i. Swap the current predictor with the predictor not in the model
producing the largest increase in R2 (provided there is a swap
which increases R2 ).
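A rough sketch of this procedure in Python is given below; the function names and the brute-force recomputation of R^2 for each candidate model are illustrative choices, with no attempt at computational efficiency.

import numpy as np

def r_squared(X, y, cols):
    # R^2 for the regression of y on an intercept and the columns listed in cols
    n = len(y)
    M = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(M, y, rcond=None)
    r = y - M @ b
    return 1.0 - (r @ r) / np.sum((y - y.mean()) ** 2)

def maxr(X, y):
    # Returns a dictionary mapping subset size to the stored model of that size.
    k = X.shape[1]
    current = [max(range(k), key=lambda j: r_squared(X, y, [j]))]
    best = {1: list(current)}
    while len(current) < k:
        outside = [j for j in range(k) if j not in current]
        current.append(max(outside, key=lambda j: r_squared(X, y, current + [j])))
        improved = True
        while improved:                   # cycle through one-for-one swaps
            improved = False
            for var in list(current):
                outside = [j for j in range(k) if j not in current]
                if not outside:
                    break
                without = [c for c in current if c != var]
                swap = max(outside, key=lambda j: r_squared(X, y, without + [j]))
                if r_squared(X, y, without + [swap]) > r_squared(X, y, current):
                    current = without + [swap]
                    improved = True
        best[len(current)] = list(current)
    return best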

Example: cheddar cheese tastings

We now discuss the implementation of the MAXR procedure for the data on
cheddar cheese tastings. The MAXR procedure starts in the same way as for-
ward selection: we simply choose the best one variable model by finding the
predictor which has the highest absolute correlation with the response. As we
saw in the last lecture, the best one variable model for the data on cheddar cheese
tastings involved the predictor H2S. We save this model as being our best one
variable model. Then we consider adding one of the variables acetic or lactic
to the model involving H2S. The maximum increase in R2 occurs when lactic
is added (the R2 for H2S and lactic is 0.652, and the R2 for H2S and acetic is
0.582). Then for the model involving H2S and lactic, we consider a swap between
H2S and acetic: does the model with acetic and lactic have a higher R2 ? The
answer here is no (the model with acetic and lactic has an R2 of 0.520). Then
we consider a swap between lactic and acetic: but we have already seen that the
model with H2S and acetic is inferior to the one involving H2S and lactic. So
there is no one on one swap of acetic with one of the predictors H2S and lactic which
can improve the model. So we store this model as our best two predictor model.
Of course, there is only one three predictor model. This is the end of the MAXR
procedure: we have a list of possible models for each subset size.

5 Residuals and diagnostics


So far we have talked about selection of a model and estimation of parameters
under ideal conditions (that is, given that the assumed model is correct). We
mentioned when we talked about simple linear regression the need for criticism
of model assumptions, and we took a brief look at some methods for model
criticism based on residuals. In this topic we take a more detailed look at various
kinds of residuals in the context of the general linear model.
When we talked about simple linear regression, we mentioned that residuals
could be used to help detect an incorrectly specified mean structure or to detect
a violation of the constancy of error variance assumption. Residual plots can
be helpful for detecting similar violations of assumptions in the general linear
model, as well as for detecting outlying points and other problems with the
model formulation. However, we should not expect the raw residuals to be the
most helpful diagnostic for detecting all possible violations of model assumptions,
and there are modifications of the raw residuals which sharpen their usefulness
for investigating specific problems. A discussion of different kinds of residuals
and related diagnostic plots is the subject of this topic.

5.1 Residual plots


Consider the general linear model

y = Xβ + ε

where y = (y1 , ..., yn )T is an n × 1 vector of responses, β = (β0 , ..., βk )T is a p × 1


vector of parameters, X is an n × p design matrix and ε = (ε1 , ..., εn ) is a vector
of zero mean uncorrelated errors with common variance σ 2 say. As usual we write
b for the least squares estimate

b = (X T X)−1 X T y

of β, and \hat{\sigma}^2 for the usual unbiased estimator of σ^2. The fitted values are

\hat{y} = Xb
        = X(X^T X)^{-1} X^T y
        = Hy

where H is the hat matrix introduced in our discussion of computation of the


PRESS residuals. The vector of residuals is the vector of responses minus the
vector of fitted values,

e = y − Hy
= (I − H)y (35)

where I denotes the n × n identity matrix here. We can also write

e = (I − H)y
= (I − H)(Xβ + ε)
= Xβ − HXβ − Hε + ε.

But HX = X(X T X)−1 X T X = X, so we have

e = (I − H)ε.

Hence in terms of the residuals,

e_i = (1 - h_{ii}) ε_i - \sum_{j ≠ i} h_{ij} ε_j.

In general, as the sample size n increases the elements of the hat matrix tend to
zero and the residuals ei are a good approximation to the errors εi . Note that
E(e) = 0 always holds, so that the residuals have mean zero. We considered plots
of the raw residuals against predictors and fitted values in the context of the sim-
ple linear regression model for detecting an incorrectly specified mean structure
or a violation of the assumption of constant error variance. Similar plots can be
useful in the context of the general linear model, as the following examples show.
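In code, these quantities are straightforward to obtain; the sketch below (Python, with the design matrix X assumed to include the column of ones) computes the hat matrix, fitted values, raw residuals and leverages, which is all that is needed for the residual plots discussed in this topic.

import numpy as np

def residual_quantities(X, y):
    # Hat matrix, fitted values, raw residuals and leverages for y = X beta + eps
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix H = X (X^T X)^{-1} X^T
    y_hat = H @ y                             # fitted values
    e = y - y_hat                             # raw residuals, e = (I - H) y
    leverages = np.diag(H)                    # the h_ii, used later for studentizing
    return H, y_hat, e, leverages

# A residual plot is then a scatter plot of e against y_hat
# (for example with matplotlib: plt.scatter(y_hat, e)).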

Example: Locating fast food restaurants

The following example is from Keller, Warrack and Bartel, “Statistics for Man-
agement and Economics: A Systematic Approach (Second Edition),” Wadsworth,
Belmont, California, 1990, pp. 832–833.
In trying to find new locations for their restaurants, hamburger emporiums
like McDonald’s and Wendy’s usually consider a number of factors. Suppose
that a statistician working for a hamburger restaurant chain wants to construct a
regression model that will help her make recommendations about new locations.
She knows that this type of restaurant has as its primary market middle-income
adults and their children (particularly children between the ages of 5 and 12).
She collected the (fictitious) data in the table below showing annual sales in
twenty five randomly chosen areas containing about five hundred households and
exactly one of her employer’s restaurants and exactly one competitor’s restaurant.
Consider fitting a linear model to these data in order to predict annual gross sales
based on mean annual household income and mean age of children. Fitting the
model results in the following.

The regression equation is


Sales = 668 + 11.4 Income + 16.8 Age

Annual Gross Mean Annual Mean Age


Sales ($1000s) Household of Children
Income ($1000s)

1128 23.5 10.5


1005 17.6 7.2
1212 26.3 7.6
893 16.5 5.9
1073 22.3 6.6
1179 26.1 6.3
1109 24.3 12.1
1019 20.9 14.9
1228 27.1 8.9
812 15.6 3.4
1193 25.7 10.5
983 3.5 6.0
1281 26.5 8.6
1156 25.7 11.6
1032 21.8 13.7
856 33.6 5.8
978 17.9 10.3
1017 18.3 5.3
1091 30.1 6.3
1048 29.8 5.3
1192 28.5 10.4
1256 27.5 8.7
1215 26.8 9.5
1233 24.3 8.3
950 17.8 6.1

Predictor Coef StDev T P


Constant 667.8 132.3 5.05 0.000
Income 11.430 4.677 2.44 0.023
Age 16.819 8.000 2.10 0.047

S = 111.6 R-Sq = 32.5% R-Sq(adj) = 26.4%

Analysis of Variance

Source DF SS MS F P
Regression 2 132071 66035 5.30 0.013
Residual Error 22 274025 12456
Total 24 406096

Source DF Seq SS
Income 1 77008
Age 1 55063

We see that the F test in the analysis of variance table indicates that this model
is to be preferred to the model involving just an intercept term. A plot of resid-
uals against the fitted values indicates some possible problems with the model
formulation, however (Figure 5.11). It seems that for both small and large fitted
values residuals tend to be negative, whereas away from the extreme fitted values
the residuals tend to be positive. This plot does suggest that the mean structure
of the model is not correctly specified: in particular, it may be that we need to
include some additional terms in our linear model such as the square of age or
square of income. We will look further at plots which can be used for assessing
the need for nonlinear terms in the predictors in later lectures.

Example: risk assessment data

As a further example of the use of the raw residuals for detecting violations of
model assumptions we consider the risk assessment data discussed in previous
lectures. We fit the full model with mean risk assessment as the response, and all
seven of the accounting determined measures of risk as predictors. A plot of the
raw residuals against the fitted values is shown in Figure 5.12. In this example,
we can see that the variability of the residuals seems to decrease for the most
extreme fitted values, and so the constancy of variance assumption is doubtful
here.
Figure 5.11: Scatter plot of residuals versus fitted values for fast food restaurants data

Figure 5.12: Scatter plot of residuals versus fitted values for risk assessment data

5.2 Use of residuals for detecting outlying points


There are many uses for residuals in diagnosing the fit of a model. We have
discussed using the raw residuals for detecting if the mean structure of a model
is not correctly specified, and for investigating whether or not the variance of
the errors is constant. Another use for residuals is for detecting outlying points
(single points which do not fit the overall pattern of the data). We will see in
this section that the raw residuals are not necessarily the most useful quantities
for detecting outlying points, or for many other purposes.
One problem with interpreting plots of the raw residuals such as the plot in
the previous example is that the variance of the residuals is not constant, even
if the errors εi do have constant variance. From (35) and since y has covariance
matrix σ 2 I, and since I − H is symmetric, we can write down the covariance
matrix of the residuals as

Var(e) = σ 2 (I − H)2 .

But I − H is idempotent (that is, (I − H)2 = I − H) since

(I − H)2 = (I − X(X T X)−1 X T )2


= I − X(X T X)−1 X T − X(X T X)−1 X T + X(X T X)−1 X T X(X T X)−1 X T
= I − X(X T X)−1 X T
= I −H

Hence we have that

Var(e) = σ 2 (I − H). (36)

We can learn a number of things from the above equation. Firstly, unlike the
errors εi , i = 1, ..., n, the residuals ei are not necessarily uncorrelated. In fact,
for i ̸= j,
Cov(ei , ej ) = −σ 2 hij .
This expression is not necessarily zero, although in general the elements of the
hat matrix will tend to zero as the sample size increases. Secondly, we have
illustrated that the residuals ei tend to be smaller in absolute value than the
errors εi , even when the model assumptions hold: recall that the leverages hii
(the diagonal elements of the hat matrix H) are all positive quantities, so that
the variances

Var(ei ) = σ 2 (1 − hii ),

are smaller than Var(εi ) = σ 2 . Furthermore, the potentially high influence points
with large leverages hii are the ones corresponding to residuals with a small vari-
ance. In effect, an observation with a high leverage will tend to pull the fitted

line towards itself, ensuring that the residual at that point will be small.

Example: capital asset pricing model

We illustrate some of the problems which can occur with plots of ordinary
residuals by considering the capital asset pricing model data discussed in pre-
vious lectures. Here we had 21 measurements of a stock price (the response) as
well as corresponding measurements of two predictors, the TSE 300 index and
Price/Earnings ratio. For the purposes of this example I have deleted two of the
observations and I will consider a simple linear regression model of Stock price
on the TSE300 index. Figure 5.13 shows a scatter plot of stock price against the
TSE300 index for my modified data set. For this modified data set, linearity of

Figure 5.13: Scatter plot of stock price against TSE300 index for modified capital
asset pricing model data.

the expected response in the predictor seems reasonable, apart from the extreme
discrepant point. Of course, in a simple linear regression model we can display
the data very easily with a scatter plot, and the residual plot is perhaps not very
informative. But in multiple linear regression diagnostic residual plots may re-
veal patterns in the data that are not obvious. However, looking at the residual
plot after fitting a simple linear regression model in this example is instructive

for showing what can happen when there is a single outlier (Figure 5.14). We see

Figure 5.14: Residual plot for modified capital asset pricing model data.

that the residual corresponding to the discrepant point is no larger than many of
the other residuals. This observation is a high influence point, and it has pulled
the fitted line towards itself so that a plot of the raw residuals does not reflect
the inconsistency of this observation with the rest of the data. The residual at
this point has a smaller variance than the other residuals, making it difficult to
see from the residual plot that there is an outlier.

The fact that the variance of the raw residuals is not constant can make it tricky
to interpret plots of the raw residuals, as we have just seen. This has motivated
the introduction of standardized and studentized residuals. The ith standardized
residual is simply the ith raw residual ei divided by its standard deviation
\frac{e_i}{\sigma \sqrt{1 - h_{ii}}}.

Since this depends on σ (which is in general unknown) we define the studentized
residuals (or more precisely the internally studentized residuals) in which σ is
replaced by \hat{\sigma}:

r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}.
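The following sketch (Python; again X is assumed to contain the intercept column, and the function name is illustrative) computes the internally studentized residuals directly from this definition.

import numpy as np

def studentized_residuals(X, y):
    # internally studentized residuals r_i = e_i / (sigma_hat * sqrt(1 - h_ii))
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y
    h = np.diag(H)
    sigma2_hat = (e @ e) / (n - p)            # usual unbiased estimate of sigma^2
    return e / np.sqrt(sigma2_hat * (1.0 - h))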

For detection of outlying points you may be thinking that we could use the
PRESS residuals which were defined when we looked at model selection. Recall
that to define the PRESS residuals ei,−i we looked at the difference between the
ith response and a prediction of that response obtained by fitting a model to the
data with the ith response excluded. We gave a computational formula for ei,−i ,
namely
e_{i,-i} = \frac{e_i}{1 - h_{ii}}.
We might think that if we could standardize the PRESS residuals to have constant
variance then these standardized PRESS residuals would be most appropriate for
detecting a single outlying observation which does not follow the pattern of the
rest of the data.
To standardize the PRESS residuals we first have to find their variances. We
have
Var(e_{i,-i}) = \frac{Var(e_i)}{(1 - h_{ii})^2} = \frac{\sigma^2 (1 - h_{ii})}{(1 - h_{ii})^2} = \frac{\sigma^2}{1 - h_{ii}}.
Hence the standardized PRESS residual is
\frac{e_{i,-i}}{\sigma / \sqrt{1 - h_{ii}}} = \frac{e_i / (1 - h_{ii})}{\sigma / \sqrt{1 - h_{ii}}} = \frac{e_i}{\sigma \sqrt{1 - h_{ii}}},

which is simply ri , the ordinary standardized residual! Hence it does seem ap-
propriate to use the studentized residuals to look for single outlying points in the
data.

Example: capital asset pricing model

As an illustration of the use of studentized residuals we return to our example of


the capital asset pricing model in which we deleted two of the original observa-
tions and fitted a simple linear regression model with stock price as the response
and TSE300 index as the predictor. Figure 5.14 shows the ordinary residuals
plotted against the fitted values. The outlier here does not have the largest raw
residual in absolute value. A plot of the studentized residuals for this data set
is shown in Figure 5.15. In the plot of studentized residuals, the outlying point is
more effectively highlighted. The outlier corresponds to a high leverage point (a
leverage of approximately 0.65) so that the variance of the raw residual at this
Figure 5.15: Plot of studentized residuals versus fitted values for modified capital asset pricing model data.

point differs substantially from the error variance σ 2 .

Example: risk assessment data

For the risk assessment data, we previously plotted the raw residuals versus fitted
values (Figure 5.12). For the same fitted model considered before, we can also
consider plotting the studentized residuals against the fitted values. The basic
pattern observed (Figure 5.16) is the same as before, but notice that there seem
to be some differences in the relative magnitudes of the standardized residuals,
with the ones corresponding to extreme fitted values being relatively larger com-
pared to the others than before.

Figure 5.16: Plot of the studentized residuals against fitted values for risk assess-
ment data.

Unfortunately, plots of residuals don’t always help us to unambiguously diagnose


a problem with the model formulation when it exists. The following example is
given in Sanford Weisberg, “Applied Linear Regression (Second Edition),” Wiley,
New York, 1985, pp. 131–132.

Example: technological innovation in jet fighters



Stanley and Miller, in a 1987 RAND corporation technical report, have attempted
to build a descriptive model of the role of various design and performance factors
in modelling technological innovation in jet fighter aircraft. Using data on Amer-
ican jet fighters built since 1940, they use the date of the first flight as a stand-in
for a measure of technology; presumably, the level of technology is increasing
with time. In some of their work, they considered the following variables:

FFD = first flight date, in months after January 1940


SP R = specific power, proportional to power per unit weight
RGF = flight range factor
P LF = payload as a fraction of gross weight of aircraft
SLF = sustained load factor
CAR = 1 if aircraft can land on a carrier; 0 otherwise.

Data on 22 planes flown between 1940 and 1979 were considered (we do not re-
produce the full data set here). If we fit a model with F F D as response and the
remaining variables as predictors, and if we plot the studentized residuals against
fitted values, we obtain the graph in Figure 5.17. We see here that the plot of

Figure 5.17: Plot of studentized residuals versus fitted values for jet fighter data.

studentized residuals is somewhat ambiguous. We might interpret the residual


plot as indicating an outlying point on the extreme right, or as indicating an
incorrectly specified mean structure, or perhaps as indicating that the constancy
of variance assumption is violated. The plot of residuals against fitted values is
not as helpful as we might hope.

We have pointed out that studentized residuals are useful for detection of single
outlying observations. However, we have not yet formally discussed the question
of how large the studentized residual needs to be to cause concern. Obviously
the standardized residual has variance one, but when we estimate the unknown
σ 2 in construction of the studentized residual there is some extra dispersion due
to estimation of this parameter and we might expect that a distribution more
heavy tailed than the normal such as a t distribution might be appropriate for
assessing whether an observation is more extreme than would be expected under
the model assumptions. We now discuss a statistic that can be used to formally
test whether or not an observation seems to conform to the model. The test
statistic used is the following:
t_i = \frac{e_i}{\hat{\sigma}_{-i} \sqrt{1 - h_{ii}}}
where \hat{\sigma}_{-i} is the estimate of the error standard deviation obtained by fitting the model to all
the data except the ith observation. Note that this statistic is exactly the same as
the studentized residual r_i except that we use \hat{\sigma}_{-i} rather than \hat{\sigma}
in the denominator. The quantities t_i are usually called the externally studentized
residuals (externally studentized since we have not used the ith observation in
estimation of σ^2). Use of the quantities t_i for outlier detection does make good
intuitive sense: if the ith observation really is an outlying observation, there is
the danger that \hat{\sigma}^2 overestimates σ^2 due to this extreme point and the residual
r_i will be “deflated” because of this.
It can be shown that if the assumptions of the linear model hold, then ti has
a t-distribution with n − p − 1 degrees of freedom. We may test the hypothesis
that the ith observation is an outlier by expanding the original model in one of
two different ways. In the mean shift outlier model, we assume that the ith error
εi does not have mean zero, but has mean ∆i say. Alternatively, we may assume
that εi has zero mean, but that the constancy of error variance assumption breaks
down at the ith point, so that Var(εi ) = σ 2 + σi2 say. In this second scenario an
unusual observation occurs because of a large random fluctuation at the ith point
consistent with the larger error variance. We can test the hypothesis

H0 : ∆i = 0

in the mean shift outlier model against the alternative

H1 : ∆i ̸= 0

or test
H0 : σi2 = 0
against the alternative
H1 : σi2 ̸= 0
in the model where constancy of error variance is violated using ti as a test statis-
tic with null distribution the t distribution with n − p − 1 degrees of freedom. We
will say more about testing for outliers with the externally studentized residuals,
and about associated problems of multiple testing, in the next lecture.

5.3 Testing for outliers


At the end of the last lecture we mentioned that we could use the externally
studentized residual ti as a test statistic in a formal test of whether the ith
observation conforms to the assumed model. In this subsection we develop this
idea further.
We consider two kinds of departures from the assumed linear model. In the
mean shift outlier model, we assume

E(εi ) = ∆i

where ∆i ̸= 0. That is, in the mean shift outlier model it is assumed that the
assumption of a zero mean for the ith error breaks down, possibly leading to an
outlier for the ith observation.
The second kind of departure we consider for the linear model is that

Var(εi ) = σ 2 + σi2

where σi2 > 0. In this scenario, an outlier can occur for the ith observation due
to a large random fluctuation in line with σi2 > 0.
We can test either
H0 : ∆i = 0
against the alternative
H1 : ∆i ̸= 0
in the mean shift outlier model or test

H0 : σi2 = 0

versus the alternative


H1 : σi2 > 0
in the model with inhomogeneous variance using
t_i = \frac{e_i}{\hat{\sigma}_{-i} \sqrt{1 - h_{ii}}}

as the test statistic. For both tests, under the null hypothesis ti is a realization
of a t random variable with n − p − 1 degrees of freedom in the linear model with
normal errors, so that if T ∼ tn−p−1 we compute the p-value for the tests above
as
p = P(|T | > |ti |).
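A direct (if inefficient) way of computing the externally studentized residuals and these p-values is sketched below in Python; refitting the model without the ith observation is done explicitly rather than via a computational shortcut, and the function name is illustrative.

import numpy as np
from scipy import stats

def outlier_test(X, y):
    # Externally studentized residuals t_i and two-sided p-values P(|T| > |t_i|),
    # where T is t-distributed with n - p - 1 degrees of freedom.
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y
    h = np.diag(H)
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        bi, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
        ri = yi - Xi @ bi
        sigma2_minus_i = (ri @ ri) / (n - 1 - p)   # sigma_hat_{-i}^2
        t[i] = e[i] / np.sqrt(sigma2_minus_i * (1.0 - h[i]))
    p_values = 2.0 * stats.t.sf(np.abs(t), df=n - p - 1)
    return t, p_values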
Example: Capital asset pricing model

Consider the modified capital asset pricing model data which we examined in the
previous lecture. The data consisted of a response variable stock price and two
predictor variables price/earnings ratio and TSE300 index. For illustrating some
properties of raw and studentized residuals we deleted two observations from the
data set. Figure 5.13 shows the scatter plot of stock price versus TSE300 index
for the modified data set. Since the observation corresponding to the predictor
value on the extreme right is so far away from the other predictor values we may
have a prior interest in testing whether this observation follows the pattern of the
rest of the data. We can apply the testing procedure above to this observation.
The externally studentized residual for this point is approximately −3.37 here.
To compute the p-value for testing zero ∆i in the mean shift outlier model versus
the alternative ∆i ̸= 0, or for testing σi2 = 0 versus the alternative σi2 > 0 in the
model with a possibly inhomogeneous variance, we let T be a t random variable
with n − p − 1 = 19 − 2 − 1 = 16 degrees of freedom and calculate

P(|T | > 3.37)

which is approximately 0.0038. So we would reject the null hypothesis for either
test at the 5 percent level, and we would conclude that this observation may need
to be investigated further.

When we have a prior reason for believing that one of the observations may be
unusual then the test we have described above is appropriate. However, if we
were to conduct this test for all the observations in our data set, then we have a
problem of multiple testing or simultaneous inference: although it is rare for an
externally studentized residual to be as large as the upper 2.5 percentage point
of a tn−p−1 distribution for a single observation, it is very likely that in a large
data set at least some of the externally studentized residuals are this large even
if all observations conform to the model. So we need to make an adjustment to
the significance level of our outlier test when we apply it to all the observations
in the data set. We can apply the idea of Bonferroni adjustment discussed in
earlier lectures in this situation.
In particular, if we apply our outlier test for each observation in the data set,
then we adjust the significance level of our test from α to α/n.

Example: capital asset pricing model data

Let us reconsider for the moment the modified capital asset pricing model data
we discussed in the previous example. If we did not have prior knowledge that
the last observation was unusual, then we might apply our outlier test to all
nineteen observations in this data set. Since we were testing before at the 5
percent level, Bonferroni adjustment suggests we should use a significance level
of 0.05/19 which is approximately 0.0026. Comparing this with the p-value we
obtained previously of approximately 0.0038, we see that the observation would
not be considered an outlier in this analysis. We should remember, however, that
the Bonferroni adjustment is a conservative one, and we should not in any case
treat formal tests too seriously in the context of model criticism: the purpose
of the test is simply to bring to our attention individual observations that may
require further investigation.

Example: jet fighter data set

Recall the data set on technological innovation in jet fighters discussed on page
156 of your notes. We showed a plot of the internally studentized residuals when
we fitted a model with FFD as the response and the remaining five variables as
predictors in this data set. A plot of the externally studentized residuals is shown
in Figure 5.18. The value of the externally studentized residual on the extreme
right here is approximately 4.63. If we compare with the plot of the internally
studentized residuals given in your last set of lecture notes we see that this is an
example where it may be desirable to use an external estimate of the error variance
in the standardization: the internally studentized residual here is approximately
3.07, which is quite different to the value 4.63 when the possibly outlying point
is excluded in computation of the estimated standard deviation. There are 22
observations in this data set, so if we apply Bonferroni adjustment with an initial
5 percent significance level for testing for outliers, the Bonferroni significance
level is 0.05/22 or approximately 0.0023. To compute the p-value for the outlier
test, we let T be a t distributed random variable with n − p − 1 = 22 − 6 − 1 = 15
degrees of freedom, and calculate

P(|T | > 4.63)

which is approximately 0.0004. So we reject the null hypothesis, and this obser-
vation with an externally studentized residual of 4.63 might be considered to be
an unusual one.

As we mentioned when discussing the capital asset pricing model data, the formal
tests of hypotheses we have considered in this section are not to be taken too
seriously in the context of model criticism. The purpose of the techniques we
Figure 5.18: Plot of externally studentized residuals versus fitted values for jet fighter data set.

have discussed for criticism of the model is to highlight unusual observations


that should be investigated further. We should ask ourselves why an outlier has
occurred: was it measurement or recording error, or is it something which requires
us to change the model? What we do with the outlier will usually depend on
why we think the outlier occurred. Sometimes the right thing to do is to exclude
the outlier from the analysis. In other situations the outlier may be the most
interesting and informative observation in the data.

5.4 Use of indicator variables


In the previous subsection we described a way of testing a single observation
thought to be suspect from prior knowledge, and a way of testing all of the ob-
servations to highlight possibly unusual cases. In this subsection we discuss a way
of dealing with groups of observations which are thought from prior knowledge
to be suspect.
Suppose that we can partition the observations y into a set of r observations
which from prior knowledge may not conform to the model, and n − r remaining
observations. Without loss of generality we can reorder the observations, so we
let (y1 , ..., yr ) be the possible outliers, and we write (yr+1 , ..., yn ) for the remaining
observations. We will extend the mean shift outlier model to develop various tests
for whether some or all of the first r observations might be considered unusual.
If there is a common cause why the observations y1 , ..., yr might be considered
outliers, then it might be reasonable to assume that E(ε1 ) = ... = E(εr ) = ∆.
That is, it might be assumed that the errors corresponding to the possibly unusual
observations have a common mean which is not necessarily zero. We can develop
a test of
H0 : ∆ = 0
against the alternative
H1 : ∆ ̸= 0
as follows. Construct a new predictor variable which is 1 for the observations
y1 , ..., yr and 0 for the remaining observations. If we refit our linear model with
this new predictor added to the existing predictors, then the estimate of the
coefficient corresponding to this new predictor is an estimate of ∆: furthermore,
the partial t-test for this coefficient provides a test for ∆ = 0 in the context
of this model. It can be shown that when r = 1 looking at this partial t-test
is exactly the same as performing the test based on the externally studentized
residual discussed in the previous subsection.
An alternative situation may occur in which the observations y1 , ..., yr might
be thought to be suspicious, but there is no common cause why they are suspect.
In this case, we can follow a similar approach to the above to examine these
possible outliers: however, in this case, we can define r new predictors, with the
first predictor being one for the first observation and zero otherwise, the second

predictor being one for the second observation and zero otherwise, and so on.
Again we can examine estimates of coefficients for these predictors and examine
partial t-tests to highlight possibly unusual observations.
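The calculation amounts to fitting an augmented model and reading off a partial t statistic, as in the sketch below (Python; the function and argument names are illustrative, and the suspect observations are passed in as a list of indices).

import numpy as np
from scipy import stats

def mean_shift_test(X, y, suspect):
    # Partial t test for a common mean shift Delta in the observations indexed
    # by `suspect`.  X is the n x p design matrix (including the intercept column);
    # the new predictor is 1 for the suspect observations and 0 otherwise.
    n, p = X.shape
    dummy = np.zeros(n)
    dummy[suspect] = 1.0
    Xa = np.column_stack([X, dummy])              # augmented design matrix
    XtX_inv = np.linalg.inv(Xa.T @ Xa)
    b = XtX_inv @ Xa.T @ y
    resid = y - Xa @ b
    sigma2_hat = (resid @ resid) / (n - p - 1)    # residual df is n - (p + 1)
    se_delta = np.sqrt(sigma2_hat * XtX_inv[-1, -1])
    t_stat = b[-1] / se_delta
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - p - 1)
    return b[-1], t_stat, p_value

# e.g. mean_shift_test(X, y, suspect=[18, 19, 20]) for the last three observations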

Example: capital asset pricing model data

We consider again the capital asset pricing model data. For the purposes of
this example, consider the full data set and a simple linear regression of stock
price on TSE300 index. The scatter plot of stock price against TSE300 index
is shown in Figure 5.19. Suppose that the three observations corresponding to

Figure 5.19: Scatter plot of stock price versus TSE300 index.

the predictors on the extreme right were thought to be suspicious on the basis
of prior knowledge: the data were collected over time, and some event occurred
which caused a shift in the stock price for the last three observations collected.
We can set up an indicator variable to model the possible effect. The indicator
variable is zero for the first eighteen observations, and is one for the last three.

The regression equation is


Stock Price = - 3.85 + 0.0129 TSE 300 Index - 0.795 Dummy

Predictor Coef StDev T P


Constant -3.847 1.483 -2.59 0.018
TSE 300 0.012915 0.001439 8.97 0.000
Dummy -0.7947 0.3487 -2.28 0.035

S = 0.2875 R-Sq = 91.4% R-Sq(adj) = 90.5%

Analysis of Variance

Source DF SS MS F P
Regression 2 15.8294 7.9147 95.77 0.000
Residual Error 18 1.4876 0.0826
Total 20 17.3170

Source DF Seq SS
TSE 300 1 15.4001
Dummy 1 0.4293

We see from the above output that the partial t-test for the coefficient of our
dummy variable is significant at the 5 percent level: there does seem to be reason
to further examine the last three observations and to possibly exclude them from
our analysis.

5.5 Partial regression plots


In the simple linear regression model, a plot of the response against the predictor
is a good way of summarizing the data. In the case of multiple linear regression,
however, plots of the response against each of the predictors may not be a helpful
way of summarizing the contribution of a predictor to the model. In particular,
in multiple linear regression the existence of relationships between predictors can
complicate interpretation of the role of a predictor. We also saw this problem
when we talked about partial t tests and multicollinearity.
For a multiple linear regression model partial regression plots (sometimes
called added variable plots) provide one useful way of displaying the relationship
between a given predictor and the response after adjusting for the relationship
between the predictor and the other covariates in the model. In our usual no-
tation we write y for the response and x1 , ..., xk for a collection of k predictors.
Suppose we are interested in the role that xj will play in a multiple linear re-
gression model with y as the response and x1 , ..., xk as covariates. To construct
a partial regression plot, we first fit a multiple linear regression model of y on
the predictors x1 , ..., xj−1 , xj+1 , ..., xk (that is, we consider a model involving all
predictors except xj ). Write ey,−xj for the vector of residuals from this fitted
model. We then fit a multiple linear regression model with xj as the response

and with x1 , .., xj−1 , xj+1 , ..., xk as predictors (that is, we regress xj on the remain-
ing predictors). Write exj ,−xj for the residuals from this second fitted model. We
can think of the vector of residuals exj ,−xj as the independent information in the
predictor xj , after removing linear dependence on the other covariates. In the
partial regression plot we simply plot ey,−xj against exj ,−xj . This plot helps us
to visualize the relationship between y and xj after we have removed the linear
dependence of both y and xj on the remaining predictors.
It can be shown that the least squares slope of the partial regression plot
is actually bj , the least squares estimator of βj in the multiple linear regression
model involving all the predictors (see, for instance, p. 233 of Myers, “Classical
and Modern Regression with Applications (Second Edition)”, Duxbury, Belmont,
California, 1990). The partial regression plot gives an effective visual represen-
tation of the contribution of a predictor in a multiple linear regression model,
and it can be helpful for suggesting the need for an additional nonlinear term
in the model (we might decide to add x_j^2 as a predictor for example). Partial
regression plots may also be helpful for identifying influential observations: the
partial regression plots in which an unusual observation shows up most clearly
correspond to predictors whose coefficients will be most affected by the unusual
point.
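Constructing the two sets of residuals needed for a partial regression plot only requires two auxiliary regressions, as in the sketch below (Python; X here is the n x k matrix of predictors without the intercept column, and the names are illustrative).

import numpy as np

def added_variable_data(X, y, j):
    # Residuals for the partial regression (added variable) plot of predictor j:
    # regress y and x_j on an intercept plus the remaining predictors and
    # return the two residual vectors; plot e_y against e_x.
    n = X.shape[0]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    H = others @ np.linalg.inv(others.T @ others) @ others.T
    e_y = y - H @ y                   # e_{y, -x_j}
    e_x = X[:, j] - H @ X[:, j]       # e_{x_j, -x_j}
    return e_x, e_y

# The least squares slope of e_y on e_x is b_j from the full multiple regression.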

Example: capital asset pricing model

We return to the example of the capital asset pricing model data. Recall that
in this data set the response was stock price, and there were two predictors,
price/earnings ratio and TSE300 index. We wish to consider the role of each
predictor in a multiple linear regression model involving both predictors. The
partial regression plots for price/earnings ratio and TSE300 index are shown
below. The partial regression plot for TSE300 index may suggest the need to
include an additional predictor such as the square of TSE300 index in the model.
If we do include the square of TSE300 index as an additional predictor then
the partial t test for the coefficient of this predictor leads to a p-value of 0.004,
so that this additional term does seem to contribute to the model.

Example: fast food restaurants

As an additional example we return to the data on fast food restaurants which


was discussed in the last lecture. Recall that in this data set the response was
annual sales and the predictors were a measure of income and the mean age of
children. Below are the partial regression plots for the predictors income and age
for this data set. The plots here suggest that the model might benefit from
including a nonlinear term in income, or age, or possibly both.
Figure 5.20: Partial regression plot for price/earnings ratio.

Figure 5.21: Partial regression plot for TSE300 index.

Figure 5.22: Partial regression plot for income.

Figure 5.23: Partial regression plot for age.

5.6 Partial residual plots


Another plot which is useful for describing the role of a predictor in a multiple
linear regression model is the partial residual plot. Suppose as in the previous
subsection that we have a response y and predictors x1 , ..., xk . Write simply e for
the vector of residuals which results from fitting a multiple linear regression model
of y on all the predictors. Also, if as usual we write β for the vector of unknown
parameters in the multiple linear regression model, write bj for the least squares
estimate of βj . The partial residual plot is simply a plot of e + bj xj against xj .
Again this plot shows the role of xj in the multiple linear regression involving all
the predictors. The partial residual plot is probably more useful than the partial
regression plot for indicating the need to include additional nonlinear terms in
the predictors.
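A sketch of the corresponding computation (Python; X is again the n x k predictor matrix without the intercept column, and the function name is illustrative) is given below.

import numpy as np

def partial_residuals(X, y, j):
    # Partial residuals e + b_j x_j for predictor j; plot against X[:, j].
    n = X.shape[0]
    M = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(M, y, rcond=None)
    e = y - M @ b
    return e + b[j + 1] * X[:, j]     # b[0] is the intercept, so b_j is b[j + 1]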

Example: capital asset pricing model.

Figures 5.24 and 5.25 show the partial residual plots for the predictors price/earnings
ratio and TSE300 index for the capital asset pricing model data set. Again the

Figure 5.24: Partial residual plot for price/earnings ratio.

need for a nonlinear term in TSE300 index may be indicated.



Figure 5.25: Partial residual plot for TSE300 index.



Example: fast food restaurants.

Figures 5.26 and 5.27 show the partial residual plots for the predictors income
and age for the fast food restaurants data set. Again the possible need for

Figure 5.26: Partial residual plot for income.

nonlinear transformations of both predictors is indicated.

A further alternative to the partial regression and partial residual plots for
detecting the need for nonlinear transformations of the predictors is the use of an
augmented partial residual plot. In the augmented partial residual plot for the
predictor $x_j$, we fit a linear model involving all the predictors with the additional
predictor $x_j^2$: writing $b_j$ for the least squares estimator of the coefficient for $x_j$
and $b_j^{(2)}$ for the least squares estimator of the coefficient for $x_j^2$, and writing e
for the vector of residuals, then in the augmented partial residual plot we plot
$e + b_j x_j + b_j^{(2)} x_j^2$ against $x_j$. The motivation for the introduction of this plot was
that if the contribution of $x_j$ is really nonlinear, then the nonlinear contribution
of $x_j$ may be masked by its relationship with the other terms, and including $x_j^2$
as an additional predictor can help to alleviate this problem.
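
Continuing the sketch above (same X, y and j), an augmented partial residual plot
only requires appending the column of squares to the design matrix before refitting;
this is a rough illustration rather than a prescribed implementation.

# Refit with the extra column x_j^2 appended
Xa = np.column_stack([X, X[:, j] ** 2])
ba, *_ = np.linalg.lstsq(Xa, y, rcond=None)
ea = y - Xa @ ba

# Augmented partial residuals e + b_j x_j + b_j^(2) x_j^2
aug_partial = ea + ba[j] * X[:, j] + ba[-1] * X[:, j] ** 2

plt.scatter(X[:, j], aug_partial)
plt.xlabel("x_j")
plt.ylabel("augmented partial residual")
plt.show()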

Figure 5.27: Partial residual plot for age.



5.7 Testing for normality


We have discussed in this topic so far some ways of detecting failures of assump-
tions through the analysis of residuals. In our development of the general linear
model we saw that the assumption of normal errors was essential for hypothesis
testing and computation of confidence intervals. However, normality was not
required for derivation of parameter estimates or for computation of standard
errors of estimators or predictions.
If the purpose of a regression analysis concerns inferences which depend cru-
cially on the assumption of normality, it may be helpful to use a residual analysis
to examine the validity of the normality assumption. It is unfortunately difficult
to test for normality based on the residuals in small to moderate sized samples.
Although formal tests for normality exist which can be applied to residuals for a
fitted linear model, we will discuss in this subsection just one graphical method
for assessing normality, and for diagnosing the nature of departures from nor-
mality. In particular, we discuss normal probability plots.

Normal probability plots

When we construct a normal probability plot we do assume that the residuals


are identically distributed (at least approximately) so that before we construct
a normal probability plot we should examine residual plots in order to detect
any incorrect specification of the mean, inhomogeneity of the error variance or
outliers: if one of these model failures occurs, then the residuals (of whatever
kind) may not be close to identically distributed, and the normal probability plot
will not be meaningful.
We now develop the idea of a normal probability plot. For a random sample
U1 , ..., Un we define the ith order statistic of the sample to be the ith smallest
value, and we write this quantity as U(i) . Thus U(1) is the minimum of the sample
and U(n) is the maximum for instance. Now suppose Z1 , ..., Zn is a random sample
from a normal distribution with mean µ and variance σ 2 . Then we can write
Zi = µ + σWi where W1 , ..., Wn is a random sample from a normal distribution
with mean zero and variance one. Now consider the order statistics of Z1 , ..., Zn
and W1 , ..., Wn . Clearly we have

Z(i) = µ + σW(i)

since the ordering of the Wi ’s is not changed by the transformation. Taking


expectations, we have
E(Z(i) ) = µ + σE(W(i) ).
The idea of the normal probability plot is as follows: if Z1 , ..., Zn is a random
sample from a normal distribution with mean µ and variance σ 2 , then if we plot
the order statistics Z(i) against the expected order statistics E(W(i) ), then this plot

should be approximately a straight line with intercept µ and slope σ. Departure


from linearity in this plot indicates that the random sample Z1 , ..., Zn may not
be normally distributed. It should be noted that the idea of a probability plot is
quite a general one: although we only consider normal probability plots, a similar
idea can be used to graphically assess goodness of fit for other distributions.
We note before going on that in some statistical packages the order statistics
of the sample are plotted on the x-axis, and the values E(W(i) ) are plotted on the
y-axis.
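
The following Python sketch shows one way a normal probability plot might be
produced. The sample z is simulated and simply stands in for a set of (approximately
identically distributed) residuals, and the expected standard normal order statistics
E(W(i)) are approximated by standard normal quantiles evaluated at
(i - 0.375)/(n + 0.25), a common approximation, rather than computed exactly.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=3.0, size=40)    # stand-in for residuals

n = len(z)
i = np.arange(1, n + 1)
expected = norm.ppf((i - 0.375) / (n + 0.25))  # approximate E(W_(i))

plt.scatter(expected, np.sort(z))              # ordered data on the y-axis
plt.xlabel("Expected normal order statistic")
plt.ylabel("Ordered sample")
plt.show()

For a normal sample the points should fall close to a line with intercept roughly
equal to the mean and slope roughly equal to the standard deviation (here 2 and 3).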

Example: cheddar cheese tastings

As an example of the use of normal probability plots for assessing normality in


multiple linear regression we consider again the data on cheddar cheese tastings.
Recall that for these data our interest was in predicting a measure of taste for
cheddar cheese samples (taste) based on predictors which were related to the
concentration of various chemicals in the cheese (H2S, lactic and acetic). In this
example we fit a model for taste involving the predictors H2S and acetic, and
use the computer to generate a normal probability plot of the internally studentized
residuals. The plot of internally studentized residuals against fitted values and
the normal probability plot of internally studentized residuals are shown in Fig-
ures 5.28 and 5.29. There is probably no real reason to question the assumption
of normality here, although the largest residual may be an outlier.

Example: relationship between income and education and job experi-


ence

As a further example we consider the data on the relationship between income


and job experience and education discussed on one of your tutorial sheets. We
fit a multiple linear regression model with income as the response and education
and job experience as predictors. A plot of the internally studentized residuals
against fitted values is shown in Figure 5.30 and a normal probability plot of the
internally studentized residuals is shown in Figure 5.31. There seems to be no
reason to question normality here.

We have stated that a departure from linearity in the normal probability


plot indicates possible non-normality. We also stated at the beginning of this
subsection that the nature of the departure from linearity in a normal probability
plot can help us diagnose the way in which residuals differ from what is expected
based on normality. We can learn a great deal about the characteristics of the
errors by looking at the shape of a normal probability plot. For the moment
consider a normal probability plot where the ordered data are plotted on the
y-axis and the expected order statistics are plotted on the x-axis.
To help assess linearity in a normal probability plot, it is common to draw a

Figure 5.28: Plot of internally studentized residuals versus fitted values for ched-
dar cheese tastings data.

Figure 5.29: Normal probability plot of internally studentized residuals for fitted
model to cheddar cheese tastings data.

Figure 5.30: Plot of internally studentized residuals for data on income, education
and job experience.

Figure 5.31: Normal probability plot for internally studentized residuals and data
on income, education and job experience.

reference line on the plot. There are various ways of doing this. One method is to
draw a straight line through the two points corresponding to the first and third
quartiles of the sample (which are roughly the points which bound the middle
half of the data). We can interpret characteristics of the residuals based on the
way the probability plot deviates from the reference line. Points below the line
correspond to points which are smaller than we would expect based on normality,
and points above the line correspond to points which are larger than we expect.
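
Continuing the normal probability plot sketch above (same z and expected), a
reference line through the points corresponding to the first and third quartiles
could be added as follows.

q1, q3 = np.percentile(z, [25, 75])      # sample quartiles
z1, z3 = norm.ppf([0.25, 0.75])          # standard normal quartiles
slope = (q3 - q1) / (z3 - z1)
intercept = q1 - slope * z1

plt.scatter(expected, np.sort(z))
plt.plot(expected, intercept + slope * expected)
plt.show()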
The ‘S’ shape plot shown in Figure 5.32 indicates that the sample plotted
has fewer extreme observations (both large and small) than we would expect
under normality (and we say that we have a distribution which is light tailed
with respect to the normal). The points above the line on the left indicate that
the small values in the sample are larger than we would expect, whereas the
values below the line on the right indicate that the large values in the sample are
smaller than we would expect. If we reflect this shape through the line y = x,


Figure 5.32: Normal probability plot for sample with fewer extreme values than
we would expect based on the normal distribution (these data were simulated
from a uniform distribution on [0, 1]). The data are plotted on the y-axis.

then we get a probability plot like the one shown in Figure 5.33, which indicates
that the sample plotted has more extreme observations than we would expect
under normality (and we say that we have a distribution which is heavy tailed
with respect to the normal). Here the points below the line on the left indicate

that the small values in the sample are smaller than we would expect based
on normality, whereas the values above the line on the right indicate that the
large values in the sample are larger than we expect. The plots of Figure 5.34


Figure 5.33: Normal probability plot for sample with more extreme values than
we would expect based on the normal distribution (these data were simulated
from a t distribution with 10 degrees of freedom). The data are plotted on the
y-axis.

and Figure 5.35 show shapes which are characteristic of positive and negative
skewness respectively.
We conclude this subsection with an example to illustrate that apparent de-
viations from normality in the normal probability plot may in fact just be due to
a failure of other assumptions.

Example: inflation and central bank independence

We discuss the data given in Tutorial five on inflation rates and central bank
independence for twenty three developed and developing countries. The predic-
tors here are two measures of independence (QUES and LEGAL) as well as a
binary variable DEV (which is 1 for developed economies and 0 for developing
economies). In this example we fit a model for annual inflation rate with predic-
tors QUES, LEGAL, DEV, QUES*DEV and LEGAL*DEV. A plot of internally


Figure 5.34: Normal probability plot for sample which is positively skewed (these
data were simulated from a Beta(2,20) distribution). The data are plotted on the
y-axis.


Figure 5.35: Normal probability plot for sample which is negatively skewed (these
data were simulated from a Beta(20,2) distribution). The data are plotted on the
y-axis.

studentized residuals against fitted values and the normal probability plot of these
residuals is shown in Figures 5.36 and 5.37. The normal probability plot here

Figure 5.36: Plot of internally studentized residuals against fitted values for data
on inflation and central bank independence.

indicates that the errors are heavy tailed with respect to the normal distribution:
but the real problem here may be a violation of the constancy of variance as-
sumption. Certainly from the residual plot it does not seem that the studentized
residuals are identically distributed, and the normal probability plot is not very
meaningful here.

Example: fast food restaurants

As a further example consider the fast food restaurants data set we have dis-
cussed in previous lectures. Here we are trying to predict a measure of sales for
restaurants of a fast food chain in terms of the predictors income (a measure of
income levels for households in the area) and age (the mean age of children for
households in the area). For this example I have fitted a simple linear regression
model with sales as the response and income as the predictor. A plot of inter-
nally studentized residuals against fitted values and the normal probability plot
of these residuals are shown in Figures 5.38 and 5.39. The normal probability

Figure 5.37: Normal probability plot for internally studentized residuals and data
on inflation and central bank independence.

Figure 5.38: Plot of internally studentized residuals against fitted values for data
on fast food restaurants.

Figure 5.39: Normal probability plot for internally studentized residuals and data
on fast food restaurants.

plot would indicate some negative skewness. However, if we look at the plot of
studentized residuals against fitted values this indicates that the mean structure
of the model is not correctly specified, so that again the normal probability plot
is perhaps not terribly helpful.

5.8 Measures of influence


So far in this topic we have concentrated on the use of residuals for detecting
an incorrectly specified mean, a failure of the constancy of variance assumption
or the existence of outliers. We also considered assessment of normality in the
general linear model using normal probability plots of residuals.
In this subsection we consider the notion of influence. We are interested here
in whether a particular case (or cases) have a large impact on the inferences of
interest in a study. Some of the measures we have studied in previous lectures
(the externally studentized residuals and the leverages for instance) can help to
identify potentially influential observations. Observations with a large externally
studentized residual, for instance, correspond to unusual response values given
the location in predictor space and these observations are often influential. Sim-
ilarly, observations which correspond to unusual values for the predictors (high
leverage observations) have the potential to exert a large amount of influence
on inferences. However, high leverage observations which follow the pattern of
the rest of the data may not be influential. Similarly, an outlier (a value with a
large externally studentized residual) may not affect the inferences of interest in
a statistical analysis.

Example: fish biomass data

As an example illustrating that a high leverage observation need not be influ-


ential, we consider the fish biomass data discussed on your last tutorial sheet.
The response here is fish biomass, and there are four predictors which measure
stream characteristics. Consider a simple linear regression model involving x4 ,
the fourth predictor (which is the area of the stream greater than 25cm in depth).
The scatter plot of the response y against x4 is shown in Figure 5.40. The two
observations on the right are high leverage observations. One of them (the one
with the smaller response value) clearly does not follow the pattern of the rest
of the data, and this point would have a strong influence on the estimate of the
slope in a simple linear regression model. However, if this point were omitted, the
other high leverage observation would not affect the estimates of the regression
parameters very much: it would not be an influential observation in this sense.
So a large leverage value does not necessarily translate into high influence.

Example: Forbes data



Figure 5.40: Scatter plot of y against x4 for fish biomass data set.

The following example is discussed in some detail in Weisberg, “Applied Linear


Regression (Second Edition),” Wiley, New York, 1985, p. 2. In the 1840s and
1850s a Scottish physicist, James D. Forbes, wanted to be able to estimate alti-
tude above sea level from measurements of the boiling point of water. He knew
that altitude could be determined from atmospheric pressure, measured with a
barometer, with lower pressures corresponding to higher altitudes. He studied the
relationship between pressure and boiling point. His interest in the problem was
motivated by the difficulty of transporting the fragile barometers of the 1840s.
The scatter plot below (Figure 5.41) shows a measure of pressure as the re-
sponse variable on the y-axis (pressure was measured in units of inches of mercury,
and the pressures were then transformed by taking logs and multiplying by 100)
and boiling point on the x-axis (measured in degrees Fahrenheit). There is an

Figure 5.41: Scatterplot of transformed pressure (in units of 100 times log inches
of mercury) against boiling point (in degrees Fahrenheit).

obvious outlier in this data set from the scatter plot: if we fit a simple linear
regression model to these data, the externally studentized residual corresponding
to this observation is 12.4037, which is extremely large. However, if we remove
this outlying point and refit the model, we find that the estimate of the slope
parameter in the simple linear regression model is little changed. So if this data

set were observed in a situation where the inference of interest concerned the
slope parameter in the model, the outlier would not correspond to an influential
observation.

The need for detecting influential observations is obvious. If one observation


has a substantial effect on the inferences of interest in a statistical analysis, then
this observation must be examined carefully. But how do we detect influential
observations? While a large externally studentized residual or leverage does not
imply high influence as we have seen, it is certainly true that a high influence
observation will have either a large externally studentized residual, high lever-
age or both. So a good place to start when attempting to identify potentially
high influence observations is to examine the externally studentized residuals and
leverages.
How do we assess when a residual or leverage value is large? We have al-
ready seen that the externally studentized residuals are t distributed if the model
assumptions hold, and so this gives us a way of determining if an externally stu-
dentized residual is unusually large. But what about the leverages? It can be
shown that the trace of the hat matrix is p, the number of elements in the vector
β in the linear model. In other words,

\[ \sum_{i=1}^{n} h_{ii} = p. \]

This implies that the average leverage value is p/n, and one guideline which is
sometimes used for isolating large leverage values is to examine further observa-
tions with leverages larger than 2p/n.
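
The following numpy sketch (with a simulated design matrix, not one of the course
data sets) illustrates these facts: the leverages are the diagonal entries of the hat
matrix, they sum to p, their average is p/n, and the 2p/n guideline can be used to
flag cases for further examination.

import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # n x p design matrix

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
h = np.diag(H)                            # leverages h_ii

print("trace of H:", H.trace())           # equals p
print("average leverage:", h.mean())      # equals p/n
print("cases with h_ii > 2p/n:", np.where(h > 2 * p / n)[0])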
We give some examples now where we consider the externally studentized
residuals and leverages. These examples also serve to illustrate how a single ob-
servation in a data set can substantially alter inferences of interest.

Example: rat data

The following example is discussed in Weisberg, “Applied Linear Regression (Sec-


ond Edition)”, Wiley, New York, 1985, p. 121 (data supplied by Dennis Cook).
An experiment was conducted to investigate the amount of a particular drug
present in the liver of a rat. Nineteen rats were randomly selected, weighed,
placed under light ether anesthesia and given an oral dose of the drug. Because
it was felt that large livers would absorb more of a given dose than smaller livers,
the actual dose an animal received was approximately determined as 40mg of the
drug per kilogram of body weight. (Liver weight is known to be strongly related
to body weight). After a fixed length of time each rat was sacrificed, the liver
weighed, and the percent of the dose in the liver determined.
The data are shown in the table. It was hypothesized that, for the method of

Body Weight Liver Weight Dose y

176 6.5 0.88 0.42


176 9.5 0.88 0.25
190 9.0 1.00 0.56
176 8.9 0.88 0.23
200 7.2 1.00 0.23
167 8.9 0.83 0.32
188 8.0 0.94 0.37
195 10.0 0.98 0.41
176 8.0 0.88 0.33
165 7.9 0.84 0.38
158 6.9 0.80 0.27
148 7.3 0.74 0.36
149 5.2 0.75 0.21
163 8.4 0.81 0.28
170 7.2 0.85 0.34
186 6.8 0.94 0.28
146 7.3 0.73 0.30
181 9.0 0.90 0.37
149 6.4 0.75 0.46

determining the dose, there is no relationship between the percentage of the dose
in the liver (y) and the body weight (x1 ), liver weight (x2 ) and relative dose (x3 ).
It can be shown that simple linear regressions of y on each of the predictors give
insignificant slope terms, in line with the hypothesis. However, fitting a multiple
linear regression for y involving predictors x1 , x2 and x3 results in the following
output.
Regression Analysis

The regression equation is


y = 0.266 - 0.0212 Body Weight + 0.0143 Liver Weight + 4.18 Dose

Predictor Coef StDev T P


Constant 0.2659 0.1946 1.37 0.192
Body Wei -0.021246 0.007974 -2.66 0.018
Liver We 0.01430 0.01722 0.83 0.419
Dose 4.178 1.523 2.74 0.015

S = 0.07729 R-Sq = 36.4% R-Sq(adj) = 23.7%

Analysis of Variance

Source DF SS MS F P
Regression 3 0.051265 0.017088 2.86 0.072
Residual Error 15 0.089609 0.005974
Total 18 0.140874

Source DF Seq SS
Body Wei 1 0.003216
Liver We 1 0.003067
Dose 1 0.044982
We see that the p-values for the partial t tests for the coefficients for body weight
and dose are significant here. If we remove the term for liver weight, the coeffi-
cients for body weight and dose remain significant and the p value in the ANOVA
table for overall significance of the model is 0.032. Given that the simple linear
regressions indicate no relationship in line with our original hypothesis this seems
like a strange result. The explanation of the problem can be revealed by examin-
ing externally studentized residuals and leverages. In the model involving body
weight and dose, the leverage of the third observation is approximately 0.83,
indicating that this is a potentially influential case because it has an unusual
predictor vector (as a matter of fact, for this rat the usual rule for assigning
dosage based on body weight was not used). If we omit this observation and refit
the model we get the following output.

Regression Analysis

The regression equation is


y = 0.332 - 0.0044 Body Weight + 0.88 Dose

Predictor Coef StDev T P


Constant 0.3320 0.1954 1.70 0.110
Body Wei -0.00444 0.01693 -0.26 0.797
Dose 0.875 3.400 0.26 0.800

S = 0.07622 R-Sq = 0.5% R-Sq(adj) = 0.0%

Analysis of Variance

Source DF SS MS F P
Regression 2 0.000423 0.000211 0.04 0.964
Residual Error 15 0.087138 0.005809
Total 17 0.087561

Source DF Seq SS
Body Wei 1 0.000038
Dose 1 0.000385

As we can see, no relationship between the predictors and the response in this
multiple linear regression model is indicated, in line with our hypothesis. The
apparent relationship was due to a single influential observation.

Example: BOQ data

We give a further example which illustrates how a single observation can be


extremely influential in a regression analysis, and how potential influence can be
detected by looking at the leverages and residuals. The data set we describe here
is given in Myers, “Classical and Modern Regression (Second Edition),” Duxbury,
Belmont, California, 1990, pp. 254–255. This data set describes manpower needs
for bachelor officer’s quarters in the US navy. We won’t give the full data set here,
but there are 25 observations in the data set. The response y is monthly man
hours, and there are seven predictors, average daily occupancy (x1 ), monthly
average number of check-ins (x2 ), weekly hours of service desk operation (x3 ),
square feet of common use area (x4 ), number of building wings (x5 ), operational
berthing capacity (x6 ) and number of rooms (x7 ).
If we fit the model involving all the predictors, we find that one observation
(number 23) has a very high leverage of 0.9885 and an externally studentized
residual of −5.24. There are also a number of other observations which might

need further investigation. Fitting the model without observation 23 drastically


changes inferences about the coefficients: the coefficients for average daily occu-
pancy and square feet of common use area become positive (which is important
if the model is to have a sensible interpretation). Also, changes occur in the
conclusions of the partial t-tests for average daily occupancy, square feet of com-
mon use area and operational berthing capacity. We will demonstrate this in the
lecture.

While the leverages and residuals can help to expose potential influence, it
would be nice to have measures which expose directly the influence of each obser-
vation. We will look at a number of different influence measures in this course,
each of which reflects the way that an observation will affect a certain aspect of
the fitted model. Perhaps the most commonly used measure of influence (and one
that is computed in many statistical packages) is Cook’s distance. The Cook’s
distance attempts to measure the change in the estimated parameters when an
observation is deleted. Write $b_{-i}$ for the estimate of the regression parameters β
when we fit using all the data except the ith observation, and write b for the
estimate of β based on the complete data set. If we write $\hat{\sigma}^2$ for the usual
unbiased estimator of $\sigma^2$, then the Cook's distance for the ith observation is simply
\[ D_i = \frac{(b - b_{-i})^T X^T X (b - b_{-i})}{p\hat{\sigma}^2}. \]

To understand the motivation for this influence measure, recall that a 100(1 − α)
percent confidence region for the vector of parameters β in the general linear
model is given by
\[ \left\{ \beta^* : \frac{(b - \beta^*)^T X^T X (b - \beta^*)}{p\hat{\sigma}^2} \le F_{p,n-p;\alpha} \right\} \]
where $F_{p,n-p;\alpha}$ is the upper 100α percentage point of an F distribution with p
and n − p degrees of freedom. This follows from the fact that
\[ \frac{(b - \beta)^T X^T X (b - \beta)}{p\hat{\sigma}^2} \sim F_{p,n-p}. \]
So if we compare the Cook’s distance to, say, the 50 percentage point of an F
distribution with p and n − p degrees of freedom, we are saying that deletion of
the ith case moves the estimate of the parameters outside a 50 percent confidence
region for β based on the fit to the full data. We point out that it is not necessary
to refit the model with one observation deleted in order to compute the Cook’s
distance. It can be shown that
\[ D_i = \frac{r_i^2}{p}\,\frac{h_{ii}}{1 - h_{ii}} \]

where ri is the internally studentized residual and hii is the leverage. So the
Cook’s distance gives an easily computed and interpreted measure of how much
the parameter estimates are influenced by each observation.
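
The following Python sketch (numpy and scipy, with simulated data rather than any
of the data sets in these notes) shows how the Cook's distances might be computed in
one pass from the residuals and leverages using this formula, together with the 50
percentage point of the F distribution used as a benchmark.

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 8.0                                 # perturb one response to create an unusual case

b = np.linalg.solve(X.T @ X, X.T @ y)       # least squares estimate
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
sigma2_hat = e @ e / (n - p)                # usual unbiased estimate of sigma^2

r = e / np.sqrt(sigma2_hat * (1 - h))       # internally studentized residuals
D = (r ** 2 / p) * (h / (1 - h))            # Cook's distances

print("largest Cook's distance:", D.max(), "at case", D.argmax())
print("upper 50 percentage point of F(p, n - p):", f.ppf(0.5, p, n - p))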

Example: rat data

We return briefly to the rat data. We identified the third case here as being a po-
tentially influential one, and confirmed this by deleting this case and reanalyzing
the data. Examining the Cook’s distance shows directly from the fit to the full
model that this observation is an influential one: the Cook’s distance for this ob-
servation is 1.82 (the next largest value for the Cook’s distance is approximately
0.21).

Example: BOQ data

Consider once again the data on manpower needs for bachelor officer’s quarters.
In this data set, observation 23 was influential: does the Cook’s distance reveal
this as we expect? The Cook’s distance for this observation is 115.041 (the next
largest value is 5.89). We will illustrate in the lecture computation of percentage
points of an F distribution for assessment of when the Cook’s distance is large.

5.9 DFFITS, DFBETAS and other influence measures


Last lecture we discussed the use of studentized residuals and leverages to identify
observations which potentially exert strong influence on a fitted linear regression
model. At the end of the lecture we also discussed the Cook’s distance, which is
a measure of how much the estimate of β changes when we delete an observation
from the model. The Cook’s distance Di was defined to be
\[ D_i = \frac{(b_{-i} - b)^T X^T X (b_{-i} - b)}{p\hat{\sigma}^2} \]
where b is the estimate of β obtained from a fit to the full model, $b_{-i}$ is the
estimate of β from a fit to all the data except the ith observation, X is the
n × p design matrix and $\hat{\sigma}^2$ is the usual unbiased estimator of $\sigma^2$. As a rule of
thumb, we compare the Cook’s distance to the upper 50 percentage point of an
Fp,n−p distribution: recall that if the Cook’s distance is larger than this, then this
means that b−i lies outside a 50 percent confidence ellipsoid for β based on all
the data. While consideration of confidence ellipsoids provides one interpretation
for the Cook’s distance, there is an alternative interpretation which leads to the
introduction of a closely related influence measure.
Rearranging our expression for Di gives
\[ D_i = \frac{(Xb_{-i} - Xb)^T (Xb_{-i} - Xb)}{p\hat{\sigma}^2}. \]

Now, note that Xb is simply $\hat{y}$, the vector of fitted values for the complete data
set. Also, $Xb_{-i}$ is the vector of fitted values based on the fit with the ith case
deleted, and we write $\hat{y}_{-i} = Xb_{-i}$. So $D_i$ becomes
\[ D_i = \frac{(\hat{y} - \hat{y}_{-i})^T (\hat{y} - \hat{y}_{-i})}{p\hat{\sigma}^2}. \]
So we can see that we can interpret Di as representing the ordinary Euclidean
distance between the vectors of fitted values for the full data set and the data set
with the ith case deleted, scaled by a constant. You may have felt that looking at
the change in the vector of fitted values would be the appropriate way of assessing
influence, particularly if prediction is the goal of a regression analysis: we now
see that this is closely related to looking at the Cook’s distance, which measures
the change in the estimate of β upon deletion of the ith case.
A statistic which is sometimes used to look at the change in the fitted value
at the ith point upon deletion of the ith case is $\mathrm{DFFITS}_i$, defined to be
\[ \mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i,-i}}{\hat{\sigma}_{-i}\sqrt{h_{ii}}} \]
where we have written $\hat{y}_{i,-i}$ for the ith element of $\hat{y}_{-i}$ and $\hat{\sigma}_{-i}$ is our estimate of
the residual standard deviation based on the fit with the ith case deleted. Note
that the standard error of $\hat{y}_i$ is just $\sigma\sqrt{h_{ii}}$, so $\mathrm{DFFITS}_i$ is estimating the number
of standard errors by which the fit changes at the ith point when the ith case
is deleted from the fit. One rule of thumb for assessing the magnitude of values
of DFFITS is to say that an absolute value larger than 2 is large. The statistic
$\mathrm{DFFITS}_i$ is closely related to the Cook's distance $D_i$: it can be shown that
\[ \mathrm{DFFITS}_i^2 = \frac{(b_{-i} - b)^T X^T X (b_{-i} - b)}{\hat{\sigma}_{-i}^2}, \]
so that the square of $\mathrm{DFFITS}_i$ is simply p times a statistic that is very similar
to the Cook's distance. Evidently the Cook's distance and DFFITS give much
the same information, although DFFITS does give information about the sign of
the change in the fitted value upon deletion of the ith case. As for the Cook's
distance, we can compute DFFITS without having to fit a new model when we
delete each case: the computational formula for $\mathrm{DFFITS}_i$ is
\[ \mathrm{DFFITS}_i = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}} \]
where $t_i$ is the externally studentized residual and $h_{ii}$ is the leverage. We return
to some of the examples in the last lecture (the rat data and the BOQ data) to
illustrate the use of DFFITS.
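
Continuing the Cook's distance sketch in the previous subsection (same r, h, D, n
and p), DFFITS can be obtained from the computational formula; the externally
studentized residuals are recovered from the internally studentized ones using the
standard identity t_i = r_i ((n - p - 1)/(n - p - r_i^2))^{1/2}.

# Externally studentized residuals from the internally studentized ones
t = r * np.sqrt((n - p - 1) / (n - p - r ** 2))

dffits = t * np.sqrt(h / (1 - h))           # computational formula for DFFITS

worst = np.abs(dffits).argmax()
print("largest |DFFITS|:", abs(dffits[worst]))
print("DFFITS^2 and p * Cook's distance there:", dffits[worst] ** 2, p * D[worst])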

Example: rat data

Recall from last lecture the rat data, where we were interested in modelling the
percentage of a drug in the livers of 19 rats after a period of time in terms of pre-
dictors body weight, liver weight and relative dose. When discussing this example
in the last lecture we determined that the 3rd observation was very influential: it
has a high leverage value, since the method for assigning doses for the other rats
was not followed in this case, and so the vector of predictors is unusual here. The
Cook’s distance for this observation based on a model involving body weight and
relative dose is 1.82, which may be compared to the upper 50 percentage point
of an F distribution with 3 and 16 degrees of freedom (approximately 0.82). The
DFFITS for the 3rd case and the model involving body weight and relative dose
is approximately 2.35 here, which suggests that deletion of this case does sub-
stantially influence the fit at this point. Note that with p = 3 we have that p
times the Cook’s distance is approximately 5.46, which is roughly the square of
DFFITS ($2.35^2$, or approximately 5.53). So the Cook's distance and DFFITS do
give much the same kind of information.

Example: BOQ data

As a further example we consider the BOQ data. Recall that the goal of the
analysis with this data set was to predict manpower needs for bachelor officer’s
quarters in the US Navy. The response is man hours (a measure of manpower
needs), and there are seven predictors measured for 25 different naval installa-
tions. In the last lecture we considered fitting a model for the response including
all seven predictors. We identified the 23rd case as being an unusual one: it has
both a large externally studentized residual and a high leverage value. The Cook’s
distance for this observation is large (approximately 115.04) which is much larger
than the upper 50 percentage point of an F distribution with 8 and 17 degrees of
freedom (approximately 0.96). The DFFITS for the 23rd case is approximately
−48.52, which shows that the fitted value at this point changes by approximately
48.52 times the standard error of the fitted value at this point when this case is
deleted: this is certainly an influential case.

In defining the Cook’s distance our motivation was to identify observations which
have a substantial effect on the estimate of β. Once we have identified an ob-
servation which does appear to have an effect on estimation of β, it may be of
interest to ask which components of the estimate b of β are influenced most. It
may be that an unusual observation has a large influence on some of the coef-
ficient estimates but not on others. An influence statistic which can help us to
determine which of the coefficient estimates are most influenced by an unusual
observation is the DFBETAS statistic. For a given observation i and the jth

predictor, we define $\mathrm{DFBETAS}_{j,i}$ to be
\[ \mathrm{DFBETAS}_{j,i} = \frac{b_j - b_{j,-i}}{\hat{\sigma}_{-i}\sqrt{c_{jj}}} \]
where $b_j$ is the jth element of b, $b_{j,-i}$ is the jth element of $b_{-i}$, $\hat{\sigma}_{-i}$ is the estimate
of σ obtained from the fit with the ith case deleted and $c_{jj}$ is the jth diagonal el-
ement of $(X^T X)^{-1}$. Recall that the covariance matrix of b is simply $\sigma^2 (X^T X)^{-1}$,
so $\mathrm{DFBETAS}_{j,i}$ estimates the number of standard errors by which the jth coef-
ficient estimate changes when the ith case is deleted. As for DFFITS and the
Cook's distance, there is a convenient computational formula for the DFBETAS
statistic, but we do not discuss this here.
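
The following sketch (with simulated data) computes DFBETAS directly from this
definition by refitting with each case deleted in turn; it does not use the
computational formula just mentioned.

import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
cjj = np.diag(np.linalg.inv(X.T @ X))             # diagonal of (X^T X)^{-1}

dfbetas = np.zeros((n, p))
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])  # fit without case i
    e_i = y[keep] - X[keep] @ b_i
    sigma_i = np.sqrt(e_i @ e_i / (n - 1 - p))                       # sigma_hat_{-i}
    dfbetas[i] = (b - b_i) / (sigma_i * np.sqrt(cjj))

print("largest |DFBETAS| for each coefficient:", np.abs(dfbetas).max(axis=0))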

Example: BOQ data

As a further example we consider the BOQ data. From the Cook’s distance
observation 23 exerts substantial influence on the coefficient estimates: which
coefficient estimates are most influenced? Again we can compute the DFBETAS
to answer this question. For the intercept the DFBETAS for observation 23 is
approximately −0.25, for x1 it is −44.38, for x2 it is 1.17, for x3 it is 0.66, for x4
it is −4.57, for x5 it is 1.18, for x6 it is −3.94 and for x7 it is 7.28. Obviously the
DFBETAS for x4 , x6 and x7 are all large here, and the DFBETAS for x1 is very
large at −44.38.

We discuss one more influence measure which is less commonly used than
DFFITS and DFBETAS, and which gives an overall measure of how much the
standard errors of parameter estimates are influenced by deletion of cases. In
particular, we define $\mathrm{COVRATIO}_i$ to be
\[ \mathrm{COVRATIO}_i = \frac{\hat{\sigma}_{-i}^{2p}}{\hat{\sigma}^{2p}} \cdot \frac{1}{1 - h_{ii}}. \]
We won’t discuss the motivation for the definition of this statistic in too much
detail. You should just think of this statistic as measuring in a global sense how
much standard error estimates of the coefficients are influenced by deletion of the
ith case. A rule of thumb for assessing the magnitude of this statistic is to further
examine observations which have a COVRATIO larger than 1 + 3p/n or smaller
than 1 − 3p/n. We illustrate the use of the COVRATIO statistic by looking at
the fish biomass data discussed last lecture.
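
Continuing the DFBETAS sketch above (same X, y, b, n and p), the COVRATIO values
and the rule-of-thumb cutoffs could be computed as follows.

e = y - X @ b
sigma2 = e @ e / (n - p)                          # full-data estimate of sigma^2
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages

covratio = np.zeros(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    e_i = y[keep] - X[keep] @ b_i
    sigma2_i = e_i @ e_i / (n - 1 - p)            # estimate with case i deleted
    covratio[i] = (sigma2_i / sigma2) ** p / (1 - h[i])

flagged = (covratio > 1 + 3 * p / n) | (covratio < 1 - 3 * p / n)
print("cases flagged by the COVRATIO rule of thumb:", np.where(flagged)[0])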

Example: fish biomass data

Consider again the fish biomass data. For this data set the response y was fish
biomass, and there were four predictors consisting of measurements of stream

characteristics which were thought to relate to fish biomass. Recall that when
we considered a simple linear regression model using the fourth predictor (area
of stream with a depth greater than 25 centimetres) we saw that there were two
observations with very high leverages (see Figure 5.40). We might expect that
these observations exert a substantial influence on standard errors of coefficient
estimates: computing the COVRATIO for these observations we get approxi-
mately 0.00589 for the third observation and approximately 0.492 for the fourth
observation. These very small COVRATIO values reflect the fact that precision
of estimation is substantially reduced by inclusion of these observations.

5.10 Dealing with violations of assumptions: transforma-


tions
In previous subsections we have looked at the use of residuals and other diag-
nostics to highlight possible violations of assumptions. However, we have not yet
discussed how to fix the problems these diagnostics may detect. We now discuss
the use of transformations (of either the response or the predictors) in an analy-
sis where there is a possible violation of the model assumptions. We also discuss
some of the implications of the use of transformations and their drawbacks. It
is important to point out that applying a transformation to fix a violation of
the model assumptions may cause another violation which did not occur on the
original scale.

Variance Stabilizing Transformations

We will need to consider different kinds of transformations depending on which


assumptions we believe are violated in the linear model. We consider first trans-
formations which are appropriate when the assumption of a constant error vari-
ance may be violated (so-called variance stabilizing transformations). Weisberg,
“Applied Linear Regression (Second Edition)”, Wiley, New York, 1985, p. 134,
gives a list of variance stabilizing transformations and describes situations in
which these transformations are natural. It is actually quite common in a re-
gression analysis for the variance of a response to depend on the mean level: in
many situations, as the magnitude of the expected response grows, so does the
variance of the response (see below for some examples). A transformation of the
response may be useful in these situations so that the usual linear modelling and
inferential techniques can be applied on the transformed scale.

A square root transformation $\sqrt{y}$ for the response y is appropriate when
the error variance is proportional to the mean (of course the responses must be
positive to take square roots). To see why the square root transformation is
useful in this situation, consider a first order Taylor expansion of $\sqrt{y}$ about the
expected value E(y) of y. We have that $\sqrt{y}$ is approximately
\[ \sqrt{E(y)} + \frac{1}{2\sqrt{E(y)}}\,(y - E(y)). \]
From this we have that the variance of the square root of y is approximately
\[ \frac{\mathrm{Var}(y)}{4E(y)} \]
and hence if the variance of y is proportional to its mean, the variance of the
square root of y should be approximately constant. The square root transforma-
tion is often considered when a linear model is used for an approximate analysis
of count data. If a Poisson distribution is appropriate for the counts, then the
variance is proportional to the mean, and so the square root transformation has
some theoretical rationale in this situation.
A log transformation is appropriate when the error standard deviation is
proportional to the mean (that is, when the standard deviation is a percentage
of the response). We can consider a Taylor expansion as in our discussion of the
square root transformation to see why the log function will stabilize the variance
in this situation. We have that log y is approximately
\[ \log E(y) + \frac{y - E(y)}{E(y)} \]
so that the variance of log y is approximately
\[ \frac{\mathrm{Var}(y)}{E(y)^2}. \]
Again we require the responses to be positive for a log transformation to be appli-
cable. If some of the responses are zero, the transformation log(y+1) is sometimes
used. Another common variance stabilizing transformation is the inverse func-
tion, 1/y. This transformation is applicable when the responses are positive and
when the error standard deviation is proportional to the mean squared. If some
of the responses are zero, the transformation 1/(y + 1) is sometimes used.
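
A small simulation (not tied to any data set in these notes) illustrates the Taylor
expansion arguments above: for Poisson counts, whose variance equals their mean,
the variance of the square root of y stays close to 1/4 as the mean level changes,
while the variance of y itself grows with the mean.

import numpy as np

rng = np.random.default_rng(5)
for mean in [5, 20, 80]:
    y = rng.poisson(mean, size=100000)
    print(f"mean {mean:3d}:  var(y) = {y.var():8.2f}   var(sqrt(y)) = {np.sqrt(y).var():.3f}")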

Evaluating transformations

Before we give some examples we discuss how to evaluate whether a transforma-


tion has helped to improve the fitted model. If we are interested in prediction, we
saw when we discussed model selection that the PRESS statistic was one way of
comparing models. We might think that we can use the PRESS statistic here for
comparing models where we have transformed the response. However, we must
remember that we are interested in predictions of the original response: that is,
we must compare different models on the original untransformed scale.

One method for comparing models is the following. As usual write y for the
response, and suppose that after fitting a multiple linear regression to y an exam-
ination of diagnostic statistics indicates the need for a variance stabilizing trans-
formation. We apply the variance stabilizing transformation z = f (y) (where f
is invertible) and fit a multiple linear regression model to z. Is the model on the
transformed scale to be preferred if our goal is to predict y? We can develop a
statistic for the model on the transformed response which can be compared with
the PRESS statistic for the model for y (or with the corresponding statistic for a
model involving a different transformation of the response). As in our discussion
of the PRESS statistic, write $\hat{z}_{i,-i}$ for the fitted value for z obtained from fitting
a model to all the data except the ith observation. The ith PRESS residual on
the transformed scale is of course simply $z_i - \hat{z}_{i,-i}$. However, as we are interested
in prediction on the original scale we consider transforming $\hat{z}_{i,-i}$ by $f^{-1}(\cdot)$ (where
$f^{-1}(\cdot)$ denotes the inverse of $f(\cdot)$) to get a prediction of $y_i$, and then the analogue
of the PRESS residual on the original scale is $y_i - f^{-1}(\hat{z}_{i,-i})$. We can calculate
\[ \sum_{i=1}^{n} \left( y_i - f^{-1}(\hat{z}_{i,-i}) \right)^2 \]

and compare this to the PRESS statistic for the model for y or to a similar statistic
for a different transformation to get an idea of whether predictive performance is
improved on the original scale by using the variance stabilizing transformation.
Alternatively, we could compare the sum of the absolute PRESS residuals for the
model of the untransformed response with

\[ \sum_{i=1}^{n} \left| y_i - f^{-1}(\hat{z}_{i,-i}) \right|. \]

It is important to stress that models must be compared on the original scale and
that we cannot simply look at the R2 or error standard deviation for models at
different scales. We give some examples to illustrate the use of variance stabiliz-
ing transformations.
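
The following sketch shows how the comparison just described could be carried out
for a log transformation, f(y) = log y; the leave-one-out fits are computed by brute
force rather than via the hat matrix shortcut used earlier for the PRESS residuals,
and the data are simulated rather than taken from the examples below.

import numpy as np

rng = np.random.default_rng(6)
n = 60
x = rng.uniform(1, 10, size=n)
y = np.exp(0.2 + 0.3 * x + rng.normal(scale=0.3, size=n))   # spread grows with the mean
X = np.column_stack([np.ones(n), x])

def press_on_original_scale(X, y, transform, back_transform):
    """Leave-one-out prediction errors for y when fitting a linear model to transform(y)."""
    z = transform(y)
    sq = absolute = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        b = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ z[keep])
        pred = back_transform(X[i] @ b)      # prediction of y_i on the original scale
        sq += (y[i] - pred) ** 2
        absolute += abs(y[i] - pred)
    return sq, absolute

def identity(u):
    return u

print("no transformation: ", press_on_original_scale(X, y, identity, identity))
print("log transformation:", press_on_original_scale(X, y, np.log, np.exp))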

Example: snow geese data

The following example is from Weisberg, S., “Applied Linear Regression (Second
Edition),” Wiley, New York, 1985, p. 102.
Aerial survey methods are regularly used to estimate the number of snow
geese in their summer range areas west of Hudson Bay in Canada. To obtain
estimates, small aircraft fly over the range and, when a flock of geese is spotted,
an experienced person estimates the number of geese in the flock. To investigate
the reliability of this method of counting, an experiment was conducted in which
an airplane carrying two observers flew over 45 flocks, and each observer made

Photo Observer Photo Observer

46 50 119 75
38 25 165 100
25 30 152 150
48 35 205 120
38 25 409 250
22 20 342 500
22 12 200 200
42 34 73 50
34 20 123 75
14 10 150 150
30 25 70 50
9 10 90 60
18 15 110 75
25 20 95 150
62 40 57 40
26 30 43 25
88 75 55 100
56 35 325 200
11 9 114 60
66 55 83 40
42 30 91 35
30 25 56 20
90 40

an independent estimate of the number of birds in each flock. Also, a photograph


of the flock was taken so that an exact count of the number of birds in the flock
could be made.
We do not reproduce the full data set here, but in the table below the photo
counts and counts for one of the observers are shown (see Weisberg (1985) for the
full data set). A scatter plot of the photo count against observer count is shown
in Figure 5.42 It is clear from the scatter plot that the variance increases with
the mean response. For count data of this kind, a square root transformation
is often appropriate. (One of the most commonly used models for count data is
the Poisson distribution, for which the mean is equal to the variance). Figure
5.43 shows the scatter plot of the square root of the photo count against the
square root of the observer count. An assumption of constant error variance in
the simple linear regression model for $\sqrt{y}$ with $\sqrt{x}$ as the predictor would seem to
be more nearly reasonable than in the model with y as the response and x as the

Figure 5.42: Scatter plot of photo count against observer count for data on snow
geese.

Figure 5.43: Scatter plot of square root of photo count against square root of
observer count for data on snow geese.

predictor, although there is still a suggestion that the variance increases with the
mean. Figure 5.44 shows the scatter plot of the log of photo count against the
log of observer count. The log transformation would appear to be better than

Figure 5.44: Scatter plot of log of photo count against log of observer count for
data on snow geese.

the square root transformation for stabilizing the variance, although there are
reasons for preferring the square root for interpretability (since the data consist
of counts).

We can compare the models for y, $\sqrt{y}$ and log(y) in terms of predictive per-
formance on the scale of y by using the analogue of the PRESS statistic that we
have just developed. Computation of these statistics will be illustrated in lectures.
The PRESS statistic for the simple linear regression model with y as response
and x as predictor is approximately 172738. The analogue of this statistic for the
model involving the square root of y is 137603, and for the log of y is 122740. The
sum of the absolute PRESS residuals for the model involving y is approximately
1475.55, and the analogue of this statistic is 1295.89 and 1257.81 for the models
involving the square root of y and log of y respectively. So for predictive perfor-
mance on the untransformed scale it would seem that the model for log(y) is best.

Example: inflation data

We consider the data on inflation rates for 22 different countries which was con-
sidered in previous lectures. The response here was INF (inflation rate), and we
were interested in predicting INF using measures of central bank independence
(QUES, LEGAL) and an indicator variable DEV (one for developed countries,
zero for developing countries). For the purposes of this example we consider just
a simple linear regression model for INF with QUES as the predictor. The scatter
plot of INF against QUES is given in Figure 5.45. Variation in the response seems

Figure 5.45: Scatter plot of INF against QUES for data on inflation rates and
central bank independence.

to increase with the mean here, and a variance stabilizing transformation might
be considered. Figure 5.46 shows a scatter plot of the log of INF against QUES,
which seems to indicate that the assumption of constancy of variance would be
more nearly reasonable if the response were modelled on this scale. There is a
clear outlier in this scatter plot.
Again we can fit models at the different scales and compare predictive perfor-
mance via the analogue of the PRESS statistic that we have developed. The
PRESS statistic for the model for INF in terms of QUES is approximately

Figure 5.46: Scatter plot of log(INF) against QUES for data on inflation rates
and central bank independence.

16070.9, and the analogue of this for the model with the log of INF as response is
21610.8. However, if we look at the sum of the absolute PRESS residuals for the
model of INF we get 433.84, whereas for the model of log INF the corresponding
statistic is 431.193. The conflict between the two measures of predictive perfor-
mance is due to the outlier, and the model for the log of INF would be preferable
on the untransformed scale if the outlier is removed.

5.11 Weighted Least Squares


We have considered the use of variance stabilizing transformations so that the
usual theory of linear models can be applied in modelling the response on a
suitable scale. However, there is an alternative to the use of variance stabilizing
transformations. If we know the variances of the errors εi , Var(εi ) = σi2 say, or if
we know the variance of the errors up to an unknown constant, Var(εi ) = σ 2 wi
where σ 2 is unknown and wi , i = 1, ..., n are known weights, then so-called
weighted least squares can be applied. If we write V for the (diagonal) covariance
matrix of the errors $(\varepsilon_1, ..., \varepsilon_n)$, then the maximum likelihood estimator $\hat{\beta}$ of β
assuming normality of the errors is
\[ \hat{\beta} = (X^\top V^{-1} X)^{-1} (X^\top V^{-1} y) \]
where X is the design matrix and y is the vector of responses. In the situation
where $V = \sigma^2 W$ and W is the diagonal matrix with diagonal elements $w_1, ..., w_n$
we can write
\[ \hat{\beta} = (X^\top W^{-1} X)^{-1} (X^\top W^{-1} y) \]
and we see that $\hat{\beta}$ doesn't depend on the (unknown) $\sigma^2$. The covariance matrix
of $\hat{\beta}$ can be shown to be
\[ (X^\top V^{-1} X)^{-1} \]
which reduces to
\[ \sigma^2 (X^\top W^{-1} X)^{-1} \]
when $V = \sigma^2 W$. It can be shown that $\hat{\beta}$ minimizes
\[ \sum_{i=1}^{n} w_i^{-1} (y_i - x_i^\top \beta)^2 \]

which is a least squares type criterion in which observations with larger variances
receive less weight. Much of the theory of linear models with a constant error
variance can be carried over to the situation we have just described, and if we
can specify the variances σi2 of the errors or weights wi in a natural way then
modelling the inhomogeneity of the variances may be preferable to the application
of a variance stabilizing transformation. Sometimes the variances σi2 can be

estimated: for instance, if we have many observations for each set of predictor
values, then we could estimate $\sigma_i^2$ from the data. Another possibility which is
reasonable in some situations is to take the weights wi to be given by the values
of one of the predictors.
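
The following sketch shows the weighted least squares computation for the case
V = σ²W with known weights; the data are simulated, and the weights are taken to be
the values of the predictor, as suggested above.

import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
w = x                                              # known weights: Var(eps_i) = sigma^2 * w_i
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=np.sqrt(0.5 * w), size=n)

W_inv = np.diag(1.0 / w)
beta_wls = np.linalg.solve(X.T @ W_inv @ X, X.T @ W_inv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Estimated covariance of the WLS estimator: sigma_hat^2 (X^T W^{-1} X)^{-1}
resid = y - X @ beta_wls
sigma2_hat = (resid ** 2 / w).sum() / (n - 2)
cov_wls = sigma2_hat * np.linalg.inv(X.T @ W_inv @ X)

print("WLS estimate:", beta_wls)
print("OLS estimate:", beta_ols)
print("WLS standard errors:", np.sqrt(np.diag(cov_wls)))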
Example: transfer efficiency data

We illustrate the use of weighted regression with an example from Myers (1990),
“Classical and Modern Regression Analysis with Applications (Second Edition)”,
Duxbury, Belmont, California, p. 281–282. The response y here is a measure of
the efficiency of a particular electrostatic type of spray paint equipment. We are
interested in modelling the response as a function of two predictors, air velocity
and voltage. An experiment was conducted in which two levels of air velocity and
two levels of voltage were considered, and ten observations were taken for each of
the four possible air velocity/voltage combinations. The data are shown in the
table below. Since we have ten replicates for each distinct combination of the

                            Voltage
                        50            70
  Air Velocity  60    87.5 88.2    77.4 68.1
                      88.1 87.3    70.7 65.3
                      89.5 89.2    67.0 61.0
                      86.2 85.9    71.7 81.7
                      90.0 87.0    79.2 60.3
               120    82.5 81.3    61.2 50.7
                      81.6 80.7    67.2 52.3
                      77.4 79.3    55.9 68.6
                      81.5 82.0    52.0 69.5
                      79.7 79.2    63.5 70.1

predictor values, we can estimate the variance at each set of predictor values. We
can use these estimated variances in a weighted regression analysis of the data.
Weighted analysis using weights in Weights

The regression equation is


Efficiency = 142 - 0.924 Voltage - 0.124 Air Velocity

Predictor Coef StDev T P


Constant 141.552 4.449 31.82 0.000
Voltage -0.92402 0.08591 -10.76 0.000
Air Velo -0.12389 0.01066 -11.62 0.000

S = 0.9839 R-Sq = 87.3% R-Sq(adj) = 86.6%



It is important to note that while the ordinary least squares estimator will be
consistent even in the case where the error variances are unequal in the linear
model, an appropriate weighted least squares estimator will generally be much
more efficient (have a smaller variance). The ordinary least squares estimator
will be heavily influenced by unreliable high variance observations, whereas the
role of these observations is downplayed in the weighted analysis.

5.12 Nonlinear transformations of a single predictor


In the previous subsection we considered the use of a transformation of the re-
sponse in order to stabilize the error variance. Another violation of model as-
sumptions where transformations can be helpful is where there is evidence of
non-linear relationships between the predictors and the response.
For simplicity we will consider just the case of a single predictor first. We
have already encountered the idea in previous lectures of using transformations
of the predictors in a linear model so that the mean structure is linear in the
transformed predictors. In this subsection, we describe a number of different
nonlinear relationships between a response and a predictor which can be written
as linear relationships between a transformed response and a transformed predic-
tor or predictors (see Myers, 1990, Section 7.3 for description of some additional
nonlinear relationships and corresponding transformations). We write y for the
response variable, and x for a predictor, and we consider various functional rela-
tionships between y and x (we ignore the random component of the model for the
moment and make some remarks at the end of the subsection about the effect of
transformations on the error structure).

Parabolic relationship

Suppose we do a scatter plot of y against x and we see a parabolic shape. If the


relationship between y and x is parabolic (we ignore errors) then we have

y = β0 + β1 x + β2 x2

so that by introducing the transformation x2 of the original predictor x we have


a relationship that is nonlinear in the predictor x but linear in the unknown pa-
rameters β0 , β1 and β2 . So if we see a parabolic relationship between y and x
in a scatter plot, it may be appropriate to fit a multiple linear regression model
involving the predictors x and $x^2$.

Exponential relationship

If the relationship between y and x is described by an exponential function,


y = β0 exp(β1 x), then by taking logarithms of both sides we get

log y = log β0 + β1 x

or
log y = β0∗ + β1∗ x.
So if we see an exponential pattern in a scatter plot of y against x, a transfor-
mation of y to log y might be considered.

Inverse exponential relationship

If the relationship between y and x is described by an inverse exponential function,

y = β0 exp(β1 /x)

then by taking logarithms we get


\[ \log y = \log \beta_0 + \beta_1 \frac{1}{x} \]
so an inverse exponential pattern in a scatter plot might indicate that a linear
regression model with log(y) as the response and 1/x as the predictor may be
appropriate.
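
The following sketch (with simulated data, standing in for a generic y and x) shows
how each of the three linearizable forms above could be fitted by ordinary least
squares after transforming; it is illustrative only and does not use the surgical
services data discussed below.

import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(1, 5, size=40)
y = 2.0 * np.exp(0.6 * x) * np.exp(rng.normal(scale=0.1, size=40))   # exponential-type data

def ols(Z, target):
    return np.linalg.solve(Z.T @ Z, Z.T @ target)

ones = np.ones_like(x)
b_parab  = ols(np.column_stack([ones, x, x ** 2]), y)          # y = b0 + b1 x + b2 x^2
b_exp    = ols(np.column_stack([ones, x]), np.log(y))          # log y = b0* + b1* x
b_invexp = ols(np.column_stack([ones, 1.0 / x]), np.log(y))    # log y = log b0 + b1 (1/x)

print("exponential fit: beta0 =", np.exp(b_exp[0]), " beta1 =", b_exp[1])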
There will be certain modelling situations in which some of these response/predictor
relationships are natural: for instance, in modelling population growth in biology
or ecology an exponential model might be appropriate. We give an example to
illustrate the use of transformations when there is a nonlinear relationship be-
tween a response y and predictor x.

Example: surgical services data

This example is from your text book (Myers (1990), “Classical and Modern Re-
gression with Applications (Second Edition),” Duxbury, Belmont, California, p.
299). These data were collected in an attempt to predict manpower requirements
for surgical services at US naval hospitals. The response y is man hours per
month measured for fifteen hospitals, and the predictor x is the number of sur-
gical cases at these hospitals. The data are shown in the table below. A scatter
plot of y against x is shown in Figure 5.47. The scatter plot certainly seems to
show a nonlinear relationship between y and x. However, there are a number of
different possibilities when it comes to modelling this nonlinear relationship. On
the basis of the above discussion, we consider the simple linear regression model
involving y and x, as well as a parabolic model and inverse exponential model.
For the simple linear regression model with y as response and x as the pre-
dictor, the PRESS statistic is 35,927,143 and the sum of the absolute PRESS

y (Man Hours) x (Surgical Cases)

1275 230
1350 235
1650 250
2000 277
3750 522
4222 545
5018 625
6125 713
6200 735
8150 820
9975 992
12200 1322
12750 1900
13014 2022
13275 2155

Figure 5.47: Scatter plot of y against x for surgical services data.



residuals is 19064.7. For the parabolic model (where y is the response and we
use x and x² as predictors), the PRESS statistic is 7,057,692 and the sum of the
absolute PRESS residuals is 9348.06.
For the inverse exponential model, we fit a model with log y as the response
and 1/x as the predictor. We use the analogue of the PRESS residuals developed
in the last lecture to derive statistics for comparison with the PRESS statistic
and sum of absolute PRESS residuals. The sum of squared prediction errors on
the original scale for this model is 20,684,140, and the sum of absolute prediction
errors is 14,780.7.
From the above, it would appear that for predictive purposes the parabolic
model is best. The inverse exponential model also seems better than the simple
linear regression model with y as response and x as the predictor.

To conclude this subsection we make a cautionary remark about the effect of


transformations of the response on the error structure of a model. Suppose that
we have measured values of a response y, and corresponding values of a predictor
x, and that there is some nonlinear relationship between the mean of y and x
which can be linearized by a transformation. To be concrete, suppose we believe
that the model
yi = β0 exp(β1 xi ) + εi
holds, where β0 and β1 are unknown parameters and where εi is a collection of
zero mean errors with constant variance. Now, we know that by applying a log
transformation to the mean of yi we can obtain a function that is linear in xi
and linear in unknown parameters. But does this justify taking logarithms of the
responses yi and fitting a linear model? Taking logs of the above equation does
not yield a model in which

log yi = log β0 + β1 xi + ηi

where the ηi are zero mean with a constant variance. In other words, if we have
reasons for believing the first model stated above holds, then applying a trans-
formation which linearizes the mean in unknown parameters does not justify the
linear model above: we have to consider the effect on the errors of the transfor-
mation. It may be better to work with the original nonlinear model: we will not
discuss nonlinear regression in this course, but you may see more on this if you
do further statistics courses.

5.13 The Box-Tidwell Procedure


In the next two subsections we consider ways of choosing transformations of
the predictors in multiple linear regression. Of course, simple scatter plots of
the response against each of the predictors are not helpful for diagnosing the
need to transform when we have two or more predictors. Relationships among

the predictors may obscure the role that each individual predictor plays in a
multiple linear regression model. We have already discussed ways of diagnosing
the need to transform in a multiple regression. In particular, we have looked at
partial residual and regression plots and added variable plots. However, it may
not always be clear from these plots what kind of nonlinear transformation of a
predictor or predictors may be best.
Weisberg (1985), “Applied Linear Regression (Second Edition)”, Wiley, New
York, distinguishes two situations in considering the need to transform the pre-
dictors in linear regression models. In the case where the expected response takes
a minimum or maximum within the range of the predictors, it is natural to in-
troduce powers of the original predictors and products of the original predictors
(that is, we consider polynomial functions for approximating the response). In
the case where the expected response is increasing or decreasing in the predic-
tors, it may be more natural to consider so-called power transformations of the
original predictors. For instance, if we wish to transform the first predictor x1 in
a multiple linear regression model, in the first situation we might introduce an
additional term involving x1² into the model. For the case where power transfor-
mations are considered, we replace the term involving x1 with a term of the form
x1^α1 for some exponent α1 (for instance, we might consider α1 = 1/2, the square
root of x1 ). In this subsection we discuss a procedure for estimating the expo-
nent α1 when a power transformation is considered, and in the next subsection
we briefly consider polynomial regression.

Estimating a power transformation

The method we present for estimating an exponent in a power transformation


is similar to one suggested by Box and Tidwell, and is developed in Weisberg
(“Applied Linear Regression (Second Edition)”, Wiley, New York, 1985, pp. 153–
155). Suppose we have a response y and predictors x1 , ..., xk and consider the
multiple linear regression model

yi = β0 + β1 xi1 + ... + βk xik + εi (37)

where the εi are zero mean errors and the usual linear model assumptions hold.
For simplicity suppose we wish to investigate the need to transform just the
first predictor x1 (although the discussion which follows may be generalized to
the situation where we consider the need to transform more than one predictor
simultaneously). We write the above model as


yi = β0 + β1 xi1 + ∑_{j=2}^k βj xij + εi .

Now consider the more general model



yi = β0 + β1 xi1^α1 + ∑_{j=2}^k βj xij + εi .        (38)

This model reduces to the original one if we set α1 = 1. When α1 = 0 we use


the predictor log x1 instead of x1^α1 (to see why this is natural, observe that if α1
is known, fitting a linear model with x1^α1 as a predictor is equivalent to fitting a
model with
(x1^α1 − 1)/α1
as predictor, and that this last expression approaches log x1 as α1 approaches
zero.) The model (38) is nonlinear in the unknown parameters when α1 is intro-
duced. However, Box and Tidwell suggest a method based on linear modelling
techniques that allows us to test for the need to transform x1 and which allows
us to get a crude estimate of α1 .
The method is based on performing a first order Taylor series expansion of
xi1^α1 about α1 = 1. We have that xi1^α1 is approximately

xi1 + (α1 − 1)xi1 log xi1 .

Substituting this approximation into (38) we obtain



yi = β0 + β1 xi1 + β1 (α1 − 1)xi1 log xi1 + ∑_{j=2}^k βj xij + εi .        (39)

and if we write η = β1 (α1 −1) we have a model which is linear in unknown param-
eters β0 , ..., βk , η where we have introduced a new predictor xi1 log xi1 . Examining
an estimate of η can help us to decide whether there is a need to transform (since
η = 0 when either α1 = 1 or β1 = 0).
The crude method for estimating α1 suggested by Weisberg (1985) is as fol-
lows. First we fit the linear model (37) where x1 is untransformed. Write b1 for
the estimate of β1 in this model. Then fit the model (39), and write η̂ for the
estimate of η in this model. Since η = β1 (α1 − 1), one natural estimate of α1 is
obtained by solving for α1 in this equation and plugging in b1 and η̂: we obtain
an estimate

α̂1 = η̂/b1 + 1.
Weisberg warns that this crude estimator may not perform very well in some
situations, particularly if the partial t-statistic for b1 is small (since in this case
β1 is possibly close to zero, and in the above expression where b1 appears in the
denominator we do not expect the estimate α̂1 to be reliable). We give some
examples to illustrate the use of the Box-Tidwell procedure.
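
A minimal sketch of this two-step recipe (in Python with numpy; the function name and arguments are ours, not from the notes) might look as follows, assuming y is the response vector and X is an n × k array of predictors whose first column is the positive predictor x1.

import numpy as np

def box_tidwell_alpha1(y, X):
    # Crude Box-Tidwell estimate of the exponent alpha_1 for the first predictor.
    n = X.shape[0]
    ones = np.ones(n)
    x1 = X[:, 0]

    # Step 1: fit model (37) with x1 untransformed; b1 is the coefficient of x1.
    coef37, *_ = np.linalg.lstsq(np.column_stack([ones, X]), y, rcond=None)
    b1 = coef37[1]

    # Step 2: fit model (39), which adds the predictor x1*log(x1);
    # eta_hat is the estimated coefficient of that extra term.
    design39 = np.column_stack([ones, X, x1 * np.log(x1)])
    coef39, *_ = np.linalg.lstsq(design39, y, rcond=None)
    eta_hat = coef39[-1]

    # Since eta = beta_1(alpha_1 - 1), a crude estimate is eta_hat/b1 + 1.
    return eta_hat / b1 + 1.0

As noted above, the returned value should be treated with caution when the partial t-statistic for b1 is small.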

Example: size of Romanesque churches

The following example is discussed in Weisberg (1985) and concerns a data set
supplied by Stephen J. Gould. The data set consists of two variables, which are
the measured area (in hundreds of square metres) and perimeter (in hundreds of
metres) for 25 Romanesque churches. The data are given in the table below. A

Perimeter Area Perimeter Area

3.48 38.83 3.14 34.27


3.69 43.92 2.04 17.61
1.43 9.14 1.77 13.37
2.05 16.66 0.59 2.04
3.05 36.16 0.69 2.22
4.19 38.66 0.50 1.46
2.43 17.74 0.69 1.92
2.40 19.46 0.63 1.86
2.72 23.00 0.58 1.69
2.99 29.75 0.86 3.31
4.78 51.19 0.41 1.13
1.33 6.60 1.23 6.74
1.67 9.04

scatter plot of area against perimeter is shown in Figure 5.48. From the scatter
plot, evidently a simple linear regression model of area against perimeter does not
seem to be reasonable here. Weisberg (1985) gives an argument which suggests
using the square root of the area as the response. We will return to a discussion
of transformation of the response in this example in later lectures. We investigate
the need to transform the predictor perimeter in a model for the square root of
area. We consider first the model

Areai = β0 + β1 Perimeteri + εi .

The estimate of β1 when this model is fitted is b1 = 1.544. Now consider the
model

Areai = β0 + β1 Perimeteri + ηPerimeteri log(Perimeteri ) + εi .

When we fit this model, we get η̂ = −0.6726. Hence for a model involving a
power transformation Perimeter^α1 a crude estimate of α1 is approximately

α̂1 = −0.6726/1.544 + 1 = 0.56.

Figure 5.48: Scatter plot of area against perimeter for 25 Romanesque churches.

Since this value is close to 0.5, we consider a model involving a square root trans-
formation. A scatter plot of the square root of area against the square root of
perimeter is shown in Figure 5.49. We may still not be entirely happy with this

Figure 5.49: Scatter plot of square root of area against square root of perimeter
for 25 Romanesque churches.

model. The relationship between the transformed response and transformed pre-
dictor does not seem to be quite linear (there seem to be some departures from
linearity for large and small perimeter values) and possibly the variance increases
as the mean response increases.

5.14 Polynomial regression


We mentioned in the last subsection that polynomial regression models are used
by many analysts when an expected response takes a minimum or maximum value
within the range of the predictors. In this subsection we discuss polynomial
regression models, the meaning of interaction in polynomial regression models
with multiple predictors, and give some examples.

Polynomials are of course a flexible class of models for a relationship between


the expected response and one or more predictors in a regression problem (an
argument involving a Taylor series expansion shows that we can approximate
any suitably smooth function by a polynomial of sufficiently high order). How-
ever, there are difficulties with the use of high order polynomial models (both
numerical and theoretical) as a general tool for function approximation in regres-
sion modelling. There are other (probably better) flexible ways of approximating
a general smooth response function within the framework of the linear model.
However, in many situations where a model which is linear in the original pre-
dictors is not appropriate we may wish to consider second order polynomial or
other low order polynomial models for the expected response.
For simplicity, let us just consider a situation in which we have a measured
response y and two predictors x1 and x2 . Suppose we have fitted a linear model
with x1 and x2 and added variable plots indicate the need for a nonlinear trans-
formation of x1 or x2 or both. We might consider a second order polynomial
model
yi = β0 + β1 x1 + β2 x1² + β3 x2 + β4 x2² + β5 x1 x2 + εi .
The term β5 x1 x2 is called an interaction term. To see where this name comes
from, suppose that β5 = 0 (that is, assume there is no interaction term). Then
observe that if we change x1 , the resulting change in the expected value of the
response is the same regardless of the value of x2 . Similarly, if we change x2 , the
change in the expected response is the same regardless of the value of x1 . How-
ever, if β5 ≠ 0, this is no longer the case: a change in one of the predictors does
not produce the same change in the expected response regardless of the value of
the other variable.
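
The point about interaction can be checked directly. Below is a small sketch (Python with numpy; our own illustration rather than anything from the notes) which builds the second order design and shows that the change in the fitted mean from a unit increase in x1 depends on x2 exactly through the β5 term; the coefficient values are made up.

import numpy as np

def second_order_design(x1, x2):
    # Columns: 1, x1, x1^2, x2, x2^2, x1*x2.
    ones = np.ones(len(x1))
    return np.column_stack([ones, x1, x1**2, x2, x2**2, x1 * x2])

def mean_shift_in_x1(beta, x1, x2, delta=1.0):
    # Change in the fitted mean when x1 increases by delta, holding x2 fixed.
    b0, b1, b2, b3, b4, b5 = beta
    mean = lambda a, b: b0 + b1*a + b2*a**2 + b3*b + b4*b**2 + b5*a*b
    return mean(x1 + delta, x2) - mean(x1, x2)

beta = np.array([1.0, 2.0, -0.5, 1.5, -0.2, 0.8])   # made-up coefficients
print(mean_shift_in_x1(beta, x1=1.0, x2=0.0))        # differs from ...
print(mean_shift_in_x1(beta, x1=1.0, x2=3.0))        # ... this, because beta5 != 0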

Example: fast food restaurants

As an example of the use of a polynomial regression model we consider the data


on fast food restaurants discussed in previous lectures. For this data set we were
interested in predicting annual gross sales for a store in a fast food restaurant
chain based on mean annual household income and mean age of children for
families in the area around the store. The variables are Sales (the response),
Income (the mean annual household income) and Age (the mean age of children).
We might expect that the expected response would achieve a maximum within
the range of the predictors here: we expect that increasing income and age cause
sales to increase up to a point, but this pattern would not persist indefinitely.
We might expect that middle income families with teenage children would be
most likely to consume fast food. The variables Incomesq and Agesq here are
the square of Income and Age, and the variable Income*Age is the product of
Income and Age.
Regression Analysis

The regression equation is


Sales = - 1134 + 173 Income + 23.5 Age - 3.73 Incomesq - 3.87 Agesq
+ 1.97 Income*Age

Predictor Coef StDev T P


Constant -1134.0 320.0 -3.54 0.002
Income 173.20 28.20 6.14 0.000
Age 23.55 32.23 0.73 0.474
Incomesq -3.7261 0.5422 -6.87 0.000
Agesq -3.869 1.179 -3.28 0.004
Income*A 1.9673 0.9441 2.08 0.051

S = 44.70 R-Sq = 90.7% R-Sq(adj) = 88.2%

Analysis of Variance

Source DF SS MS F P
Regression 5 368140 73628 36.86 0.000
Residual Error 19 37956 1998
Total 24 406096

Source DF Seq SS
Income 1 77008
Age 1 55063
Incomesq 1 202845
Agesq 1 24551
Income*A 1 8674

The p-value for the partial t-statistic for Income*Age is 0.051, which suggests
dropping the interaction term from the model at the five percent level in the
presence of the other terms (although this is borderline). The p-value for the
partial t-statistic for Age is large here also.

5.15 Box-Cox transformation of the response


In previous lectures we considered ways of transforming the predictors in multiple
linear regression. In particular, we considered the Box-Tidwell procedure and
polynomial regression models. These methods were possibly helpful when there
was evidence of a nonlinear relationship between the expected response and the
original predictors.
We have also considered the possibility of transforming the response in a

simple or multiple regression model. So far we have only considered the use of
a transformation of the response to achieve constancy of error variance when
certain mean/variance relationships hold. However, a transformation of the re-
sponse can also be helpful for achieving linearity of the mean response in the
original predictors, and for making the assumption of normality of errors more
reasonable. In this subsection we consider a class of transformations indexed by a
single parameter, and a way of choosing a transformation within this class which
attempts to make the specification of the mean structure, constancy of error
variance assumption and normality of the errors reasonable on the transformed
scale.
The class of transformations we consider is the class of Box-Cox transforma-
tions. We saw this class of transformations when we discussed the Box-Tidwell
procedure. If y is the response (which we assume for the moment to be positive),
then we consider
w = (y^λ − 1)/λ   if λ ≠ 0,
w = log(y)        if λ = 0.
Here λ is a real valued parameter. The reason for defining w = log(y) when λ = 0
is that (y^λ − 1)/λ approaches log(y) as λ approaches zero (see our discussion of
the Box-Tidwell procedure). If we find upon fitting a multiple linear regression
model that the assumptions of the linear model are violated (either specification
of the mean structure seems incorrect or the error variance does not seem to be
constant for instance) then one thing that might be considered is to transform
the response using the above transformation for a given value of λ. How do we
choose λ? Box and Cox have suggested an automatic way of choosing λ based
on the data, which we now describe.

Estimating λ

The proposal by Box and Cox for estimating λ is to use maximum likelihood.
If for a fixed λ the transformed responses wi are normal and uncorrelated with
common variance σ 2 , then we can write down the joint density of w1 , ..., wn . If
w = (w1 , ..., wn )⊤ , we write
w = Xβ + ε
where as usual X is the design matrix (n × p), β is a p × 1 vector of parameters
and ε is a vector of uncorrelated zero mean normal errors with variance σ 2 . We
can easily write down the density of w, which is of course a product of univariate
normal densities. From this density function and from the relationship between
wi and the untransformed response yi , we can write down the joint density of
y = (y1 , ..., yn )⊤ . In terms of the parameters β, σ 2 and λ, this joint density is
[1/((2π)^{n/2} (σ²)^{n/2})] exp( −(1/(2σ²)) (w − Xβ)⊤(w − Xβ) ) ( ∏_{i=1}^n yi )^{λ−1} .

(If you are familiar with transformations of random variables, this is easily de-
rived: if not, you may take this on trust).
This joint density function considered as a function of the parameters β, σ 2
and λ is the likelihood function. It can be shown that if we take logarithms and
maximize with respect to β and σ 2 , then ignoring additive constants and writing
RSS(λ) for the residual sum of squares when we use the value λ in the Box-Cox
transformation, we obtain a function of λ, L(λ):
( n )
n ∏
L(λ) = − log(RSS(λ)) + n(λ − 1) log ( yi )1/n .
2 i=1

To determine a rough value for λ, we can consider a grid of possible λ values


between −2.0 and 2.0 say, compute the residual sum of squares RSS(λ) by fit-
ting the model with the appropriately transformed response and then plot L(λ).
We can read a crude value for λ off the graph (it is common to round off to an
interpretable value, say to the nearest half so that we might have a log or square
root transformation for instance). Generally the function L(λ) will have a maxi-
mum in the interval [−2, 2]: if this is not the case, the Box-Cox transformation is
probably not useful. We have assumed in this discussion that the responses are
positive. If this is not the case, then we can add a constant to the responses so
that they are, but the best choice for the constant to be added may not be clear.
We give an example to illustrate the use of the Box-Cox transformation.
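
A rough sketch of this grid search (in Python with numpy; our own illustration, assuming y is a positive response vector and X is the full design matrix including the intercept column) could look like the following.

import numpy as np

def boxcox_profile(y, X, grid=None):
    # Profile log likelihood L(lambda), up to an additive constant, over a grid.
    if grid is None:
        grid = np.arange(-2.0, 2.0 + 1e-9, 0.1)
    n = len(y)
    sum_log_y = np.sum(np.log(y))
    values = []
    for lam in grid:
        # Box-Cox transform of the response for this value of lambda.
        w = np.log(y) if abs(lam) < 1e-12 else (y**lam - 1.0) / lam
        coef, *_ = np.linalg.lstsq(X, w, rcond=None)
        rss = np.sum((w - X @ coef)**2)
        # L(lambda) = -(n/2) log RSS(lambda) + (lambda - 1) * sum(log y_i),
        # which equals the form given above since n log((prod y_i)^(1/n)) = sum log y_i.
        values.append(-0.5 * n * np.log(rss) + (lam - 1.0) * sum_log_y)
    return np.asarray(grid), np.array(values)

Plotting the returned values against the grid and reading off a convenient λ near the maximum is all that is required in practice.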

Example: size of Romanesque churches

As an illustration of the Box-Cox procedure we consider once more the data on


the size of Romanesque churches considered in our discussion of the Box-Tidwell
procedure. For this data set, there were measurements of area and perimeter for
25 Romanesque churches. We consider using a Box-Cox transformation of the
response for this data set. A plot of L(λ) against λ for the data on Romanesque
churches is shown in Figure 5.50. This plot seems to indicate a Box-Cox transfor-
mation with a value for λ of approximately 0.5 would be appropriate.

Example: surgical services data

As a further example of the application of Box-Cox transformation, we con-


sider the surgical services data (also discussed in previous lectures). Recall that
for this data set we had a response y which was a measure of manpower and a
predictor x which was the number of surgical cases at 15 US Naval hospitals.
The goal of the study was to predict manpower needs. A plot of L(λ) against λ
for this data set is shown in Figure 5.51. The Box-Cox procedure does not seem
to indicate a need to transform here.

Figure 5.50: Plot of L(λ) against λ for data on Romanesque churches.



Figure 5.51: Plot of L(λ) against λ for data on surgical services.



In the above examples we chose a rough value for λ by looking at a plot of L(λ)
against λ. However, in deciding on a suitable value for λ in the Box-Cox procedure
it is helpful to have some measure of the uncertainty about the maximizer λ̂ of
L(λ). It can be shown that an approximate 100(1 − α) percent confidence
interval for λ based on λ̂ is given by

{λ : L(λ̂) − L(λ) ≤ (1/2) χ²_{1;α}}

where χ²_{1;α} is the upper 100α percentage point of a χ²₁ distribution.
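
Continuing the sketch from the Box-Cox discussion above (still Python with numpy; our own illustration, and the function name is ours), the approximate confidence set can be read off the grid of L(λ) values directly, using 3.8415 as the upper 5 percentage point of a χ²₁ distribution.

import numpy as np

def boxcox_conf_set(grid, L, chi2_1_alpha=3.8415):
    # Grid values of lambda with L(lambda_hat) - L(lambda) <= chi2_1_alpha / 2.
    grid = np.asarray(grid)
    L = np.asarray(L)
    return grid[L.max() - L <= 0.5 * chi2_1_alpha]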

Example: Romanesque churches

Consider again the data on Romanesque churches. For this data set we said that
a Box-Cox transformation with a value of λ of approximately 0.5 seemed to be
indicated. To compute an approximate 95 percent confidence interval we need
the upper 5 percentage point of a χ²₁ distribution, which is approximately 3.8415.
In generating the graph of Figure 5.50 I computed L(λ) for values of λ 0.1 units
apart between −2.0 and 2.0. The values at λ = 0.4, 0.5, 0.6, 0.7 and 0.8 are re-
spectively −21.554, −19.212, −17.925, −18.165 and −20.038. If we assume that
the maximizer is exactly at 0.6 (we can compute the maximizer of L(λ) more pre-
cisely by using a finer grid) then the approximate 95 percent confidence interval
for λ includes values of λ for which L(λ) is within 0.5 × 3.8415 = 1.92075 of the
maximum value. Based on this guideline, the value of 0.5 lies in an approximate
95 percent confidence interval, and a square root transformation of the response
seems appropriate.

Example: Surgical services data

For the surgical services data set above, we stated that the need for a transfor-
mation of the response did not seem to be indicated by the Box-Cox procedure.
Is this inference justified based on a confidence interval for λ? Again as a very
crude guideline we can consider values of λ on our grid in the plot which lie within
1.92075 of the maximum value: the value λ = 1.0 satisfies this criterion, and so
a transformation does not seem to be warranted.

In our discussion of transformations in this course we have presented a wide


variety of techniques, and it may not be clear in any given situation which tech-
nique to apply or how to use these techniques in combination with each other.
Weisberg, “Applied Linear Regression (Second Edition),” Wiley, New York, 1985,
p. 156, suggests a general algorithm to apply when the data are strictly posi-
tive. He suggests first transforming any predictors which have a maximum to
minimum ratio of larger than about ten to a log scale, then using the Box-Cox
procedure and finally considering the Box-Tidwell procedure for predictors which

have large t-values. It must be said that in many cases we may not wish to use
a transformation of the response at all, and transformations can be overused.

6 Categorical predictors
In this section we discuss model building in multiple linear regression when some
of our predictors are categorical variables. This discussion will lead on to consid-
eration of the less than full rank linear model in the next section.
As an example of a categorical predictor in multiple linear regression consider
the data set on inflation rates and central bank independence that we discussed
in previous lectures and in tutorial five. For this data set we had for a number
of countries the average annual inflation rate over 1980–1989, as well as two
measures of central bank independence (a questionnaire measure of independence
and a legal measure of independence). In addition to these two quantitative
predictor variables, there was a predictor which took the value 1 for developed
economies and 0 for developing economies. This last predictor variable is an
example of a categorical predictor: the values 1 and 0 taken by this predictor are
just arbitrary labels, numeric values that have no particular meaning except to
distinguish two different groups within the observations. Instead of the values
1 and 0 we could equally well have used the labels A and B to distinguish the
groups. We wish to develop methods for handling categorical variables like this
one within the framework of the linear model.
To take another example, suppose a new drug has been developed for the
treatment of high blood pressure. A study was undertaken with a group of
patients having high blood pressure into the effectiveness of the new drug. Each
patient was assigned at random to either a treatment group (patients in this
group receive the new drug) or a control group (patients in this group receive
the standard treatment for high blood pressure). After a month we measure the
change in the blood pressure of each patient compared to their blood pressure at
the beginning of the trial. We are interested in investigating the effect of the drug
on the response (change in blood pressure), perhaps adjusting for the effect of
other quantitative characteristics of the patients (such as age for instance). So for
each patient we have the response (change in blood pressure) some quantitative
predictors (such as age) as well as a predictor which records whether the patient
was in the treatment or control group. This last predictor is an example of a
categorical predictor variable with two levels: the two levels are “treatment” and
“control”.

6.1 Categorical predictors with two levels


We consider first the situation where we have a single quantitative predictor and
a categorical predictor with just two levels. We write y for the response variable,
x for the quantitative predictor, and z for the categorical predictor. Suppose that
the categorical variable can take on values A and B.
Now define a binary quantitative variable w (sometimes called a dummy vari-

able) from the predictor z as


w = 1 if z takes the value A, and w = 0 if z takes the value B.        (40)

and consider what happens when we use w and x as predictors in a multiple


linear regression with y as the response. That is, consider the model

yi = β0 + β1 xi + β2 wi + εi ,

i = 1, ..., n, where β0 , β1 and β2 are unknown parameters and εi , i = 1, ..., n is a


collection of zero mean errors uncorrelated with a common variance σ 2 .
If observation i has wi = 1 (that is, if this observation belongs to the A group
for the categorical predictor z) then we have

yi = β0 + β2 + β1 xi + εi .

On the other hand, suppose that wi = 0. In this case, we have

yi = β0 + β1 xi + εi .

So the effect of introducing the dummy variable w into the regression is to have
a shift in the intercept for the regression of y on x as we move between the
two groups defined by z. This may be a sensible model. We can interpret the
parameter β2 as being the effect of being in group A compared to being in group
B, and if we want to see if there is any difference between the two groups (for
instance, if we want to see in our blood pressure example whether the new drug
is more effective than the standard treatment after adjusting for the quantitative
predictor age) then we can simply look at the partial t statistic for β2 .
The appropriateness of the above model involving the dummy variable w
depends on the assumption that the slope of the regression relationship between
y and x does not change for the groups A and B. The effect due to the group
(A or B) does not depend on the level of the quantitative variable x and we say
in this situation that there is no interaction between x and z. Effectively, the
relationship between the expected value of y and x for the two groups is described
by a pair of parallel lines, with one line for each group.
We can of course envisage a more general model in which the mean response
varies linearly with x in the two groups, but with a different slope and intercept
for the line in each group (that is, the relationship between the mean of y and x is
described by a pair of lines which may not be parallel). Consider defining a new
predictor variable as the product of the dummy variable w and the quantitative
predictor x. Now suppose we fit the model

yi = β0 + β1 xi + β2 wi + β3 xi wi + εi .

What happens in this situation for observations from the two groups A and B?
If wi = 1 (observation belongs to group A) then we have that

yi = β0 + β2 + (β1 + β3 )xi + εi .

On the other hand, if wi = 0 (observation belongs to group B) then we have that

yi = β0 + β1 xi + εi .

So by fitting a model with predictors xi , wi and xi wi we are allowing a linear


relationship between the expected response and x in which there is a different
intercept and slope within each of the groups A and B. We could test for in-
teraction by looking at the partial t statistic for β3 in the above model (if β3 is
zero, then we are back in the situation we had before with parallel lines for the
two groups) and we could test whether there is any effect due to the categorical
variable by testing
H0 : β2 = β3 = 0
against the alternative that β2 and β3 are not both zero by doing an F -test.
In the above model it is important to note that we are assuming separate
linear relationships between the expected value of y and x within the two groups
A and B, but with a common error variance σ 2 . If we believe that there is a
different error variance for the two groups, then we can simply split the data and
fit separate simple linear regressions of y on x within the two groups.
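
As a small sketch (Python with numpy; our own illustration, and the function and level names are ours): given the quantitative predictor x and the two-level categorical predictor z, the dummy variable and the two design matrices described above can be built as follows.

import numpy as np

def two_group_designs(x, z, level_a="A"):
    # w = 1 for group A, w = 0 for group B.
    x = np.asarray(x, dtype=float)
    w = (np.asarray(z) == level_a).astype(float)
    ones = np.ones(len(x))
    design_parallel = np.column_stack([ones, x, w])           # common slope, shifted intercept
    design_separate = np.column_stack([ones, x, w, x * w])    # separate slope and intercept
    return design_parallel, design_separate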

Example: inflation and central bank independence

We illustrate some of the ideas we have discussed by looking at the example


on inflation and central bank independence. Here the categorical variable DEV
already has levels which are either zero or one to distinguish the two groups, so
we can use DEV as a quantitative predictor in the manner described above. The
two quantitative predictors measuring central bank independence were QUES and
LEGAL (the questionnaire and legal measures of independence) and in this example
we consider only modelling inflation rate in terms of QUES and DEV. First we fit
a multiple linear regression model with the predictors QUES and DEV. Recall that
in this model we are postulating that there is a linear relationship between the
expected inflation rate and QUES for both developed and developing economies
with the same slope but a different intercept within the two groups.
Regression Analysis

The regression equation is


INF = 59.6 - 4.07 QUES - 22.8 DEV

Predictor Coef StDev T P



Constant 59.62 14.32 4.16 0.001


QUES -4.072 2.753 -1.48 0.156
DEV -22.79 12.13 -1.88 0.076

S = 22.65 R-Sq = 43.0% R-Sq(adj) = 37.0%

Analysis of Variance

Source DF SS MS F P
Regression 2 7345.2 3672.6 7.16 0.005
Residual Error 19 9750.0 513.2
Total 21 17095.3

Source DF Seq SS
QUES 1 5533.4
DEV 1 1811.8
Assuming that this model is adequate, we can test for whether being a developed
economy is related to inflation after adjusting for QUES by looking at the partial
t-statistic for DEV and its associated p-value. The conclusion of the test for
H0 : β2 = 0 against H1 : β2 ≠ 0 here is uncertain: we would accept H0 at the 5
percent level, but reject at the 10 percent level.
Now consider the model where we have a separate linear relationship between
expected inflation rate and QUES within each DEV group. We fit a multiple linear
regression model with inflation rate as the response and QUES, DEV and QUES*DEV
as predictors.

Regression Analysis

The regression equation is


INF = 66.8 - 5.64 QUES - 57.1 DEV + 5.31 QUES*DEV

Predictor Coef StDev T P


Constant 66.80 16.58 4.03 0.001
QUES -5.641 3.300 -1.71 0.105
DEV -57.06 41.03 -1.39 0.181
QUES*DEV 5.314 6.074 0.87 0.393

S = 22.79 R-Sq = 45.3% R-Sq(adj) = 36.2%

Analysis of Variance

Source DF SS MS F P

Regression 3 7742.9 2581.0 4.97 0.011


Residual Error 18 9352.3 519.6
Total 21 17095.3

Source DF Seq SS
QUES 1 5533.4
DEV 1 1811.8
QUES*DEV 1 397.7

Assuming this model is an appropriate one, we can test for the presence of in-
teraction between QUES and DEV by looking at the partial t statistic and p-value
for the predictor QUES*DEV. The p-value here for testing H0 : β3 = 0 against the
alternative H1 : β3 ≠ 0 is 0.393, which would indicate acceptance of H0 at the 5
percent level. That is, there seems to be no real evidence of any interaction.
We can test whether the level of DEV seems to have any relationship to inflation
rate in the presence of QUES by testing

H0 : β2 = β3 = 0

against the alternative


H1 : β2 , β3 not both zero.
We use an F test. The value of the test statistic is the reduction of the residual
sum of squares when DEV and QUES*DEV are added to the model, divided by the
difference in the number of parameters for the larger and smaller models (which
is two here), all divided by the mean square error. This is a realization of an F
random variable with 2 and n − 4 = 18 degrees of freedom under H0 . The value
of the test statistic is
((1811.8 + 397.7)/2) / 519.6 = 2.13,
which can be compared with the upper 5 percentage point of an F distribution
with 2 and 18 degrees of freedom (approximately 3.55). We accept the null
hypothesis here at the 5 percent level (that is, DEV does not seem to be related
to inflation rate in the presence of QUES according to this test).
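
The F test used here only needs the residual sums of squares of the two nested fits. The following is a sketch (Python with numpy and scipy; our own helper rather than anything from the notes) which gives the same statistic as the sequential sums of squares calculation used above.

import numpy as np
from scipy.stats import f

def partial_f_test(y, X_reduced, X_full):
    # F test of H0: the extra coefficients in the full model are all zero.
    def rss(X):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ coef)**2)

    n = len(y)
    q = X_full.shape[1] - X_reduced.shape[1]      # number of added parameters
    df_resid = n - X_full.shape[1]                # residual df for the full model
    f_stat = ((rss(X_reduced) - rss(X_full)) / q) / (rss(X_full) / df_resid)
    p_value = f.sf(f_stat, q, df_resid)           # upper tail probability
    return f_stat, p_value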
In all these hypothesis tests we are assuming that the model we have fitted
is appropriate. The appropriateness of all the models we have considered is in
fact very doubtful here. Figures 6.52 and 6.53 show plots of the standardized
residuals against fitted values and the predictor DEV respectively for the model
involving QUES, DEV and QUES*DEV. Clearly here there seems to be a problem
with the constancy of error variance assumption: in particular, the variability of
the inflation rate would seem to depend on the level of DEV. A better model than
the ones we have considered so far might be simply to fit separate simple linear
regressions of INF on QUES within the two groups defined by levels of DEV.

Figure 6.52: Scatter plot of standardized residuals against fitted values.

Figure 6.53: Scatter plot of standardized residuals against DEV.



Example: age and growth of mussel species.

The following example is from Myers, “Classical and Modern Regression with
Applications (Second Edition)”, Duxbury, Belmont, California, 1990, p. 154. In
a project to study age and growth characteristics of selected mussel species from
Southwest Virginia, the data below were taken from two distinct locations. It
was desired to investigate whether location was a significant factor in the growth
of the mussels.

Table 6.2: Age and weight of mussels: location 1.

Age Weight (g) Age Weight (g)

3 0.44 11 3.96
3 0.50 11 3.84
3 0.66 12 5.58
3 0.78 12 5.64
4 1.20 12 4.26
4 1.18 13 6.00
4 1.08 13 2.54
6 1.12 13 3.82
6 1.72 14 4.50
7 1.04 14 5.18
8 2.62 14 4.04
9 1.88 15 6.38
10 2.26 15 4.08
11 4.10 16 4.56
11 2.12

We can define a dummy variable Location which is 0 for location 1 and 1 for
location 2. If we do a multiple linear regression of Weight on Age, Location and
Age*Location, then we obtain the following.

Regression Analysis

The regression equation is


Weight = - 0.483 + 0.365 Age - 0.759 Location + 0.290 Age*Location

Predictor Coef StDev T P


Constant -0.4828 0.3653 -1.32 0.191
Age 0.36494 0.03310 11.03 0.000

Table 6.3: Age and weight of mussels: location 2.

Age Weight (g) Age Weight (g)

3 0.76 8 2.52
4 1.38 8 3.90
5 1.20 10 3.94
5 1.76 10 6.22
6 2.60 10 4.96
6 2.16 13 9.02
6 2.64 13 8.20
6 2.52 13 8.26
6 3.08 14 6.40
6 2.12 15 10.06
7 2.72 15 8.60
7 2.96 18 11.06
8 4.54 19 10.78
8 5.26 22 12.04
8 5.60 24 13.92

Location -0.7592 0.5050 -1.50 0.138


Age*Loca 0.28998 0.04505 6.44 0.000

S = 0.8893 R-Sq = 91.7% R-Sq(adj) = 91.3%

Analysis of Variance

Source DF SS MS F P
Regression 3 536.45 178.82 226.09 0.000
Residual Error 61 48.25 0.79
Total 64 584.70

Source DF Seq SS
Age 1 428.11
Location 1 75.57
Age*Loca 1 32.77
We can do an F test for the significance of the Location term and its interaction
with Age. The value of the appropriate test statistic for the F test is
((75.57 + 32.77)/2) / 0.79 = 68.57.

This is to be compared with the upper 5 percentage point of an F distribution


with 2 and 61 degrees of freedom (which is approximately 3.15). So we would
reject the hypothesis that Location has no effect on mussel weight after adjusting
for age.

6.2 Categorical variables with more than two levels


Categorical variables with more than two levels can be handled in much the same
way as for the previous case. An example will help to illustrate the idea: the
following example is from Keller, Warrack and Bartel, “Statistics for Manage-
ment and Economics: A Systematic Approach (Second Edition),” Wadsworth,
Belmont, California, 1990. Suppose that a real estate agent is interested in pre-
dicting the selling price of a house (y), and that the agent believes that the selling
price of houses in the local area is related to the size of the house (x), and the
kind of energy used for heating (measured as a categorical variable z with values
“electricity”, “heating oil” or “natural gas”). The agent wishes to develop a mul-
tiple linear regression model with selling price as the response and predictors x
and z (size of house and type of heating).
Here z is a categorical variable with three levels. To handle this situation we
proceed as follows. Define two dummy variables w1 and w2 to be
w1 = 1 if z takes the value “electricity”, and w1 = 0 otherwise,

and

w2 = 1 if z takes the value “heating oil”, and w2 = 0 otherwise.
The dummy variables w1 and w2 together code the information in z, since w1
is one only for those homes heated by electricity, w2 is one only for those homes
heated by heating oil, and w1 and w2 are both zero only for those homes heated by
natural gas. As before we can consider fitting a multiple linear regression model
with y as the response and with predictors x, w1 and w2 . Fitting this model allows
a linear relationship between expected selling price and size of house for each
group defined by z, with a common slope within each group (the mean structure
is defined by three parallel lines). If we want to allow separate linear relationships
within the three groups then we can fit a model involving the predictors x, w1 ,
w2 , xw1 and xw2 . Hypothesis testing can also proceed in a similar way to before.
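
A sketch of this coding (Python with numpy; our own illustration, with the level names taken from the text) is given below; natural gas acts as the baseline level picked up by the intercept.

import numpy as np

def heating_design(x, z, interaction=False):
    # z contains "electricity", "heating oil" or "natural gas" for each house.
    x = np.asarray(x, dtype=float)
    z = np.asarray(z)
    w1 = (z == "electricity").astype(float)
    w2 = (z == "heating oil").astype(float)   # both zero means natural gas
    cols = [np.ones(len(x)), x, w1, w2]
    if interaction:
        cols += [x * w1, x * w2]              # separate slopes for each group
    return np.column_stack(cols)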

6.3 Hypothesis testing for categorical variables with more


than two levels
In the last subsection we started a discussion of what to do in the linear model
when there is a categorical predictor taking on more than two values. We began

to discuss an example on real estate prices. For this example the response y
was the selling price of fifteen different houses in a certain area, and it was
desired to predict selling price based on the size of the house and the method
used for heating the house. The method of heating is a categorical predictor,
with values “electricity”, “heating oil” and “natural gas”. The data are shown in
Table 6.4. In the case of a categorical predictor variable taking on two values,

Table 6.4: Real estate prices data.

Price (y) Size (x) Heating (z)


89.5 20.0 electricity
79.9 14.8 electricity
83.1 20.5 heating oil
56.9 12.5 heating oil
66.6 18.0 heating oil
82.5 14.3 natural gas
126.3 27.5 electricity
79.3 16.5 electricity
119.9 24.3 natural gas
87.6 20.2 heating oil
112.6 22.0 natural gas
120.8 19.0 natural gas
78.5 12.3 natural gas
74.3 14.0 electricity
74.8 16.7 heating oil

our approach was to define a binary dummy variable which was then introduced
as a quantitative predictor into the model. Using this dummy variable (and the
products of this dummy variable with any quantitative predictors) we were able
to test for the presence of an effect due to the categorical predictor, and to test
for interaction between the categorical predictor and any quantitative predictors.
A similar approach can be followed when dealing with a categorical variable with
more than two levels. For the case of the real estate prices data, and denoting
the selling price by y, the size of the house by x and the categorical predictor
heating type by z, we define binary dummy variables w1 and w2 by
w1 = 1 if z takes the value “electricity”, and w1 = 0 otherwise,

and

w2 = 1 if z takes the value “heating oil”, and w2 = 0 otherwise.

Now consider the model

yi = β0 + β1 xi + β2 wi1 + β3 wi2 + β4 xi wi1 + β5 xi wi2 + εi .

Again we can ask ourselves: what does this model reduce to for houses heated
by electricity, heating oil and natural gas?
For electricity, wi1 = 1 and wi2 = 0, and we have

yi = β0 + β2 + (β1 + β4 )xi .

For heating oil, wi1 = 0 and wi2 = 1 and we have

yi = β0 + β3 + (β1 + β5 )xi .

Finally, for natural gas wi1 = 0 and wi2 = 0 and we have

yi = β0 + β1 xi .

So by the expedient of defining multiple binary dummy variables to code levels


of a categorical predictor when that predictor can take on three or more distinct
values, and by introducing these dummy variables into a multiple linear regression
model along with the products of the dummy variables with the quantitative
predictor x, we are in effect defining separate linear relationships between the
mean of y and x within the groups.
Again there are some special cases of this general model worth considering.
If β4 and β5 are both zero, then we have a model with no interaction (sometimes
called an additive model). The meaning of interaction here is similar to before
where we considered a categorical variable with two levels. If β4 and β5 are zero,
then the three lines describing the relationship between the expected value of y
and x for the the three heating method groups have a common slope but different
intercepts. That is, the difference in the expected selling prices for two houses of
the same size but different heating types would not depend on the size of house.
Of course, we can test
H0 : β4 = β5 = 0
against the alternative
H1 : β4 , β5 not both zero
by using an F -test.
Another hypothesis test we might be interested in testing is whether there is
any group effect. That is, does the heating system have any effect on selling price
in the model involving size? In the model with interaction we could test

H0 : β2 = β3 = β4 = β5 = 0

against the alternative

H1 : Not all β2 , β3 , β4 and β5 are zero



with an appropriate F test. In the additive model (model with no interaction)


we could test
H0 : β2 = β3 = 0
against
H1 : β2 , β3 not both zero
in a similar way. Partial t tests may be useful for testing for differences between
pairs of groups. For instance, suppose we have fitted the model without inter-
action (β4 = β5 = 0). Then testing H0 : β2 = 0 against a two-sided alternative
using the partial t statistic tests whether the effect for electricity and for natural
gas is the same. Similarly, testing H0 : β3 = 0 against a two-sided alternative
tests whether the effect for heating oil and for natural gas is the same. We now
illustrate some of these tests for the real estate data.

Example: hypothesis testing for real estate prices.

We begin by doing a scatter plot of the selling price against size, using different
plotting symbols for the different heating type groups. This plot is shown in
Figure 6.54. Of course, there are only a few observations in each group here but

Figure 6.54: Scatter plot of selling price against size for real estate data using
different plotting symbols for different heating type groups.

on the basis of this plot we might believe that there is no interaction between
heating type and size (the relationship between expected selling price and size
for the three groups seems to be well described by three parallel lines). We
can test this by fitting a linear model with the dummy variables defined above as
predictors, and with the products of these dummy variables and size as predictors.

We can then do an F test in the way outlined above to test for interaction. The
computer output is shown below. The variable dummy1 is w1 and dummy2 is w2 .
Regression Analysis

The regression equation is


Price = 34.8 + 3.70 Size - 14.7 Dummy1 - 21.6 Dummy2 + 0.054 Size*Dummy1
- 0.26 Size*Dummy2

Predictor Coef StDev T P


Constant 34.77 13.92 2.50 0.034
Size 3.7048 0.7352 5.04 0.001
Dummy1 -14.66 19.02 -0.77 0.460
Dummy2 -21.55 24.69 -0.87 0.405
Size*Dum 0.0536 0.9981 0.05 0.958
Size*Dum -0.258 1.360 -0.19 0.854

S = 7.431 R-Sq = 92.0% R-Sq(adj) = 87.6%

Analysis of Variance

Source DF SS MS F P
Regression 5 5733.3 1146.7 20.77 0.000
Residual Error 9 497.0 55.2
Total 14 6230.2

Source DF Seq SS
Size 1 4034.4
Dummy1 1 1.7
Dummy2 1 1694.0
Size*Dum 1 1.1
Size*Dum 1 2.0
From the table of sequential sums of squares and the analysis of variance table,
the appropriate test statistic for testing for interaction is
((1.1 + 2.0)/2) / 55.2 = 0.028.
We compare this with the upper 5 percentage point of an F distribution with 2
and 9 degrees of freedom (approximately 4.26). We see that there is no evidence
for interaction here.
We can refit the model with no interaction (that is, fit the additive model).
Regression Analysis

The regression equation is


Price = 35.1 + 3.69 Size - 13.7 Dummy1 - 26.1 Dummy2

Predictor Coef StDev T P


Constant 35.070 8.182 4.29 0.001
Size 3.6882 0.4138 8.91 0.000
Dummy1 -13.664 4.265 -3.20 0.008
Dummy2 -26.109 4.277 -6.10 0.000

S = 6.742 R-Sq = 92.0% R-Sq(adj) = 89.8%

Analysis of Variance

Source DF SS MS F P
Regression 3 5730.2 1910.1 42.02 0.000
Residual Error 11 500.0 45.5
Total 14 6230.2

Source DF Seq SS
Size 1 4034.4
Dummy1 1 1.7
Dummy2 1 1694.0

We demonstrate testing for an effect for heating type in this model. The ap-
propriate F statistic (from the sequential sums of squares and the analysis of
variance table) is
((1.7 + 1694.0)/2) / 45.5 = 18.63.
We compare this with the upper 5 percentage point of an F distribution with
2 and 11 degrees of freedom (approximately 3.98). In this case we reject the
null hypothesis that there is no group effect. As mentioned above, the partial t
statistics can tell us something about differences between pairs of groups. The
partial t statistic for dummy1 tests for a significant difference between electricity
and natural gas, and the partial t statistic for dummy2 tests for a singificant dif-
ference between heating oil and natural gas (from the p-values, both differences
are significant here at the 5 percent level).

We end our discussion of categorical variables here for the moment. We have
focused on the case of a single categorical variable and a single quantitative
predictor. Situations involving multiple quantitative and categorical predictor
variables may be handled in much the same way. We will say more about the
general case in the next section.

7 Less than full rank linear models.


In our discussion of the linear model so far, we have always assumed that the
design matrix X has full rank (that is, we have assumed that we can’t express one
column of X, or one predictor, as a linear combination of the others). However, a
natural formulation of some linear statistical models may lead to a design matrix
which does not have full rank. For these models, there is no unique solution to
the normal equations, and so the vector of parameters β in the mean structure
cannot be uniquely estimated. However, it may still be possible to estimate
certain functions of the parameters and to formulate meaningful hypotheses to
be tested in a less than full rank model. In this section we give an introduction
to the analysis of less than full rank linear models.
We begin by giving a simple example of a less than full rank linear model. This
example is from Heinz Kohler, “Statistics for Business and Economics (Second
Edition),” Scott, Foresman and Co, Glenview, Illinois, 1988.
A doctor wants to test whether dieting men, on average, lose the same weight
during a week, regardless of which of three diets are involved. The following
sample data are available (in pounds of weight loss per week).

Diet A: 5.1, 6.7, 7.0, 5.3, 9.1, 4.3


Diet B: 6.1, 7.3, 8.1, 9.1, 10.3, 12.1
Diet C: 7.3, 8.9, 9.4, 12.0, 15.7, 9.5

A natural way to formulate a model for these data is as follows. Define µ to


be a parameter representing the global mean of pounds lost for the population
under study after dieting, and let α1 , α2 , α3 denote effects on weight loss due to
diet A, diet B and diet C respectively. In an extension of our previous notation,
let yij i = 1, 2, 3, j = 1, ..., 6 denote the jth response value in the ith group (where
the first group is the diet A group, the second group is the diet B group and the
third is the diet C group). Then a sensible model for these data might postulate

yij = µ + αi + εij

where εij , i = 1, 2, 3, j = 1, ..., 6 are a collection of zero mean errors which are
uncorrelated with a common variance. Notice that this is a linear model. Just
define a dummy variable x1 which is one for the diet A group and zero otherwise,
a dummy variable x2 which is one for the diet B group and zero otherwise, and a
dummy variable x3 which is one for the diet C group and zero otherwise. Then if
we do a multiple linear regression of the responses on the predictors x1 , x2 and x3
then we are simply fitting the model above. In this model, the coefficient for x1
is what we have written as α1 , the coefficient for x2 is α2 , the coefficient for x3 is
α3 and the intercept term is µ. The model we have considered actually involves

fitting a single categorical variable (diet group) and our discussion should remind
you of what we did in the last section.
However, in the last section we used k − 1 binary dummy variables to code
the information in a categorical variable with k levels. The reason for this is that
k binary variables are not required to code the information in a k level categorical
variable: in the example above, if x1 and x2 are both zero (so that the subject is
not on diet A or B) then we know the subject is on diet C. If we looked at the
design matrix for fitting the model involving x1 , x2 , x3 and an intercept, we would
see that the sum of the columns for x1 , x2 and x3 is equal to the first column (a
column of ones). That is, x1 + x2 + x3 = 1. So the design matrix here is not of full
rank, and there is no unique solution to the normal equations. We can express
the nature of the problem in a different way. Suppose we define a parameter µ′ as
µ′ = µ + c where c is a constant, and suppose we define parameters αi′ , i = 1, 2, 3
as αi′ = αi − c. Then we see that µ + αi = µ′ + αi′ , and so these two sets of
parameters result in the same model for the data. So we should not expect to be
able to estimate all the parameters in the model (that is, we should not expect
to be able to solve the normal equations). We have too many parameters in the
model.
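
The rank deficiency is easy to see numerically. A small sketch follows (Python with numpy; our own illustration of the diet example's design matrix): four columns, but rank three, because the three dummy columns sum to the intercept column.

import numpy as np

groups = np.repeat([0, 1, 2], 6)              # six subjects on each of diets A, B, C
X = np.column_stack([
    np.ones(18),                              # intercept column (mu)
    (groups == 0).astype(float),              # x1: diet A indicator
    (groups == 1).astype(float),              # x2: diet B indicator
    (groups == 2).astype(float),              # x3: diet C indicator
])
print(X.shape[1], np.linalg.matrix_rank(X))   # prints: 4 3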
One solution to this problem is the one presented in the last section: just
remove one of the binary dummy variables. Effectively, we set one of the pa-
rameters αi to be zero. The reduced design matrix when one of the predictors is
removed now has full rank, and analysis can proceed as before. This approach to
dealing with the less than full rank model is called reparametrization.
However, reparametrization need not be used in dealing with a less than full
rank linear model. We can in fact work with the original model, provided we only
ask sensible questions about the parameters. We may not be able to estimate all
the parameters, but we might be able to estimate functions of the parameters:
for instance, in our example we could still estimate differences between the effects
for different diets (that is, we can estimate quantities like α1 − α2 for instance).
It is also possible to formulate sensible hypotheses to be tested. There will be
some discussion of the general approach to handling the less than full rank linear
model in the coming weeks.

7.1 One way classification model with fixed effects


In the last subsection we discussed a simple example to introduce the idea of
the less than full rank model. Our example was a special case of the so-called
one way classification model with fixed effects. This model may be described in
general as follows.
Suppose we collect random samples from k different populations (think of
these k populations as being indexed by the k levels of some categorical variable).
We write ni , i = 1, ..., k for the size of the sample from the ith population.
Also, write yij , j = 1, ..., ni for the random sample collected from population i,

i = 1, ..., k. In the one way classification model with fixed effects it is assumed
that
yij = µ + αi + εij
where µ and αi , i = 1, ..., k are unknown parameters, and the errors εij have mean
zero and are uncorrelated with a common variance σ 2 . Normality of the errors
needs to be assumed for hypothesis testing and interval estimation. The param-
eter µ is a global mean parameter, and the parameters αi represent deviations
from this mean for the k populations.
It is usually of interest in the one way classification model to test whether all
the population means are the same, and to test for differences between pairs of
population means. We will describe how to conduct these tests and describe the
form of the ANOVA table for a one way classification model in the remainder of
this subsection.
It should be evident from the discussion of last week that the one way classi-
fication model above is a less than full rank linear model. An example will help
to show how we can write the one way classification model in the form of the
general linear model. Suppose there are two populations, with a random sample
of size two taken from each population. Now let

y = (y11 , y12 , y21 , y22 )⊤

ε = (ε11 , ε12 , ε21 , ε22 )⊤


β = (µ, α1 , α2 )⊤
and
 
1 1 0
 1 1 0 
X = 
 1
.
0 1 
1 0 1

Then we can write


yij = µ + αi + εij
i = 1, 2, j = 1, 2, in matrix notation as
     
( y11 )   ( 1 1 0 )            ( ε11 )
( y12 )   ( 1 1 0 )  ( µ  )    ( ε12 )
( y21 ) = ( 1 0 1 )  ( α1 )  + ( ε21 )
( y22 )   ( 1 0 1 )  ( α2 )    ( ε22 )
or
y = Xβ + ε.

We can write the general one way classification model in a similar way. In the
general case the vector β will be

β = (µ, α1 , ..., αk )⊤

and the design matrix X will have k + 1 columns, with the first column being a
column of ones, and the (i + 1)th column consisting of entries which are one for
observations in the ith population and zero otherwise. Observe that the design
matrix X does not have full rank, since the sum of the last k columns of X is
equal to the first column (see the above example for instance). Since X does not
have full rank, the normal equations do not have a unique solution, and so there is
no unique estimate of β. One approach to fixing this problem is to reparametrize
the original model: we reparametrize so that for the new model the design matrix
does have full rank, and then the parameters can be estimated.
There are in general many possible ways that we could reparametrize the one
way classification model with fixed effects. If the mean of one of the populations
can be considered to establish some reference level (without loss of generality say
the reference population is the first one) then we could set α1 = 0 (this corre-
sponds to deleting the second column of the original design matrix X). Then the
parameter µ is the mean of the reference population, and the parameters α2 , ..., αk
represent deviations from this mean for the remaining populations. This way of
reparametrizing the model is often very natural in, say, medical trials where we
are interested in comparing different methods of treatment with a control group.
The partial t-tests in the regression output for the parameters αi i = 2, ..., k
in this reparametrized model test for a significant difference between the mean of
population i and the mean of the first population (reference population). A test
for equality of all population means amounts to testing

H0 : α2 = ... = αk = 0

against the alternative

H1 : Not all α2 , ..., αk are zero.

This hypothesis is tested using the usual F test. We will discuss another way
of parametrizing the one way classification model in the next lecture. In setting
α1 = 0 in the discussion above (removing a column of the design matrix) we
have effectively added an additional equation to the normal equations for the
less than full rank model, allowing a unique solution. Other arbitrary constraints
could have been made on the original parameters: for instance, many statistical
packages introduce the constraint ∑_{i=1}^k αi = 0 on the parameters in the original
model. This will also allow a unique solution of the normal equations, where the
effects αi are required to be “centred” about the grand mean µ.
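
A sketch of the reference-level reparametrization (Python with numpy; our own illustration, and the function name is ours) is given below: the dummy for the first population is dropped, so the intercept estimates the reference mean and the remaining coefficients estimate deviations from it.

import numpy as np

def one_way_reference_fit(y, group):
    # Fit y_ij = mu + alpha_i + e_ij with alpha_1 set to zero (reference coding).
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    levels = np.unique(group)
    cols = [np.ones(len(y))]                                   # mu (reference mean)
    cols += [(group == g).astype(float) for g in levels[1:]]   # alpha_2, ..., alpha_k
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef     # [mu_hat, alpha_2_hat, ..., alpha_k_hat]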

Example: completion times for tax returns

This example is from Keller, Warrack and Bartel, “Statistics for Management
and Economics (Second Edition),” Wadsworth, Belmont, California 1990.
Three new formats have been developed recently for tax return forms (formats
A, B and C say). To determine which, if any, are superior to the current form,
20 individuals were asked to participate in an experiment. Each of the three
new forms and the old form were filled out by 5 different people. The amount of
time taken by each person to complete the task is recorded in the accompanying
table. At the 10 percent significance level, can we conclude that differences in
the completion times exist among the four forms?

   Current Format   Format A   Format B   Format C
        6.3            9.2       12.6        7.5
        9.8            8.6       10.9        9.3
       10.1           10.3       11.8        8.8
        8.5           11.4       10.8        7.8
        7.1            8.5       10.1        9.1

Table 7.5: Completion times for tax returns: old format, and three new formats
(A, B and C).

Three binary dummy variables were created (Dummy1, Dummy2 and Dummy3). The first
dummy variable is one for format A and zero otherwise, the second dummy variable
is one for format B and zero otherwise, and the third dummy variable is one for
format C and zero
otherwise. Thus we have parametrized a one way classification model by the
method discussed above, with the subjects filling out the current form acting as
a control group. Fitting a multiple linear regression model with these dummy
variables as predictors and the completion time (y) as the response gave the
following output.
Regression Analysis

The regression equation is


y = 8.36 + 1.24 Dummy1 + 2.88 Dummy2 + 0.140 Dummy3

Predictor Coef StDev T P


Constant 8.3600 0.5409 15.46 0.000
Dummy1 1.2400 0.7649 1.62 0.125
Dummy2 2.8800 0.7649 3.77 0.002
Dummy3 0.1400 0.7649 0.18 0.857

S = 1.209 R-Sq = 53.2% R-Sq(adj) = 44.4%

Analysis of Variance

Source DF SS MS F P
Regression 3 26.573 8.858 6.06 0.006
Residual Error 16 23.404 1.463
Total 19 49.977

Source DF Seq SS
Dummy1 1 0.204
Dummy2 1 26.320
Dummy3 1 0.049
The F statistic in the ANOVA table is a realization of an F3,16 random variable
under the null hypothesis that
H0 : α2 = α3 = α4 = 0
(that is, under the hypothesis that there is no difference in completion time among
the four groups). The p-value for this F test is 0.006, so that we reject the null
hypothesis that there is no difference among means in favour of the hypothesis
that at least two of the population means differ. The partial t statistics in the
above regression output for α2 , α3 and α4 relate to a test for a significant differ-
ence between mean completion time for the current form and mean completion
time for formats A, B and C respectively. We see here that there seems to be a
significant difference between mean completion time for the current form and for
format B (the current form is better).
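The coefficients in this output can be reproduced by ordinary least squares. The
sketch below is an illustration only (assuming numpy), with the data keyed in
from Table 7.5.

import numpy as np

current  = [6.3, 9.8, 10.1, 8.5, 7.1]
format_a = [9.2, 8.6, 10.3, 11.4, 8.5]
format_b = [12.6, 10.9, 11.8, 10.8, 10.1]
format_c = [7.5, 9.3, 8.8, 7.8, 9.1]

y = np.array(current + format_a + format_b + format_c)
# columns: intercept, Dummy1 (format A), Dummy2 (format B), Dummy3 (format C)
X = np.zeros((20, 4))
X[:, 0] = 1
X[5:10, 1] = 1
X[10:15, 2] = 1
X[15:20, 3] = 1

b, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(b)  # approximately [8.36, 1.24, 2.88, 0.14], matching the output above
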

It is interesting to have a closer look at the analysis of variance table for the
one way classification model. For the multiple linear regression model, if we write
yi , i = 1, ..., n for the responses, $\hat{y}_i$, i = 1, ..., n for the fitted values and ȳ for the
mean of the responses, then we know that
\[
SS_{total} = SS_{reg} + SS_{res}
\]
where
\[
SS_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SS_{reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
(this was stated in lectures and proved by the MATH2931 students on their third
assignment). We can rewrite this analysis of variance identity in an interest-
ing way for the one way classification model. However, first we introduce some
notation. Write ȳi. for the sample mean of observations in the ith population,

\[
\bar{y}_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}.
\]
Also, write ȳ.. for the sample mean of all response values,
\[
\bar{y}_{..} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}.
\]

Thus the dots in the subscripts above mean that summation is taken with respect
to that index in forming the mean.
It is easy to show that the fitted value for an observation from population i
is ȳi. , and the analysis of variance identity becomes


\[
\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2
= \sum_{i=1}^{k} n_i (\bar{y}_{i.} - \bar{y}_{..})^2
+ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2.
\]

Thus total variation can be partitioned into a component representing variation
of group sample means from the grand mean (“between group variation”) and a
component representing variation of observations about their within group sample
means (“within group variation”). Testing whether all population means are the
same via the F test in the ANOVA table amounts to looking at whether between
group variation is large with respect to within group variation, an intuitively
sensible procedure. You will be asked to prove the above identity for the one way
classification model in your final assignment. The ANOVA table in the regression
output shows the components of variation in the above partition, and the degrees
of freedom, mean squares and F ratio follow from our previous discussion of the
ANOVA table for the multiple linear regression model.
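The partition can be checked numerically for the tax return data. The following
sketch (an illustration only, assuming numpy, with the data from Table 7.5)
reproduces the sums of squares in the ANOVA table shown earlier.

import numpy as np

groups = [np.array([6.3, 9.8, 10.1, 8.5, 7.1]),       # current format
          np.array([9.2, 8.6, 10.3, 11.4, 8.5]),      # format A
          np.array([12.6, 10.9, 11.8, 10.8, 10.1]),   # format B
          np.array([7.5, 9.3, 8.8, 7.8, 9.1])]        # format C

grand_mean = np.mean(np.concatenate(groups))
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = sum(((g - grand_mean) ** 2).sum() for g in groups)

print(ss_between, ss_within, ss_total)
# about 26.57, 23.40 and 49.98: between plus within equals total
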
In our previous discussion we have indicated how to test equality of all the
population means, and how to test for equality of the mean in a reference popu-
lation with one of the other population means via the partial t statistic when the
one way classification model is parametrized in the way that we have described.
However, how do we test, say, that αi = αj for arbitrary i, j? That is, how do we
test for equality of means for an arbitrary pair of populations in our experiment?
It can be shown that the statistic
\[
\frac{(\hat{\alpha}_i - \hat{\alpha}_j) - (\alpha_i - \alpha_j)}{\hat{\sigma}\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}}
\]
(where $\hat{\sigma}$ denotes the estimated residual standard deviation) has a $t_{n-k}$ distribu-
tion. This result can be used to construct a confidence interval for αi − αj or to
do either a one-sided or two-sided test of equality of population means for the
ith and jth populations. If a test like this one is to be conducted for every pair
of populations in the experiment, then we obviously have a problem of multiple
comparisons: one conservative way to deal with this is to use Bonferroni adjust-
ment.

Example: completion times for tax returns

Suppose in the previous example that it was of particular interest to test for a
difference between formats B and C. We wish to test

H0 : α3 = α4

against the alternative


H1 : α3 ̸= α4 .
From our previous numerical computations, we can evaluate the t-statistic given
above (with α3 − α4 = 0 under H0 ) as
\[
\frac{2.88 - 0.14}{1.209\sqrt{\frac{1}{5} + \frac{1}{5}}} = 3.58.
\]

For a two-sided test at the 5 percent level, we compare this with the upper 2.5
percentage point of a t distribution with 16 degrees of freedom (which is approx-
imately 2.12). So we would reject the null hypothesis that there is no difference
in completion times for formats B and C.
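A quick numerical check of this test (an illustration only, assuming numpy and
scipy; the quantities used are taken from the regression output above):

import numpy as np
from scipy import stats

sigma_hat = 1.209                                  # estimated residual standard deviation
t_stat = (2.88 - 0.14) / (sigma_hat * np.sqrt(1/5 + 1/5))
t_crit = stats.t.ppf(0.975, df=16)                 # upper 2.5 percentage point, 16 df
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=16))

print(round(t_stat, 2), round(t_crit, 2), round(p_value, 3))
# about 3.58 and 2.12, with a two-sided p-value well below 0.05
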

An alternative parametrization and the general linear hypothesis

In the last lecture we discussed the one way classification model. The one way
classification model is of interest when we have random samples from k different
populations, and are interested in comparing population means. With a random
sample of size ni from the ith population, i = 1, ..., k, we write the one way
classification model as
yij = µ + αi + εij
where yij is the jth observation from the ith population, i = 1, ..., k, j = 1, ..., ni ,
µ is a global mean, αi is the deviation from the global mean for population i,
i = 1, .., k and the terms εij are random errors. We showed that the one way clas-
sification model was an example of a less than full rank linear model, and in order
to apply techniques from the full rank case we discussed ways of reparametrizing
to full rank. In particular, we considered a situation where one of the k popula-
tions was considered a reference population (without loss of generality say it is the
first population): it was then natural to set the deviation from the global mean
for this population to zero (α1 = 0 say). In the resulting reparametrized model µ
can be considered to be the mean of the reference population, and the parameters
α2 , ..., αk describe deviations from the mean of the reference population for the
other populations of interest in the experiment.
In this lecture we discuss an alternative parametrization of the one way clas-
sification model, and discuss hypothesis testing for this parametrization.
Instead of defining a reference population as in the previous lecture, an alter-
native way of parametrizing the one way classification model is to define
µi = µ + αi
(so that µi is the population mean for the ith population) and to write
yij = µi + εij .
We have thus reduced the number of parameters in the mean structure from the
k+1 parameters µ, α1 , ..., αk to the k parameters µ1 , ..., µk . The new parametriza-
tion gives a full rank linear model (you might like to check this for yourself). The
parametrization above is attractive, since µi has a direct interpretation (the pop-
ulation mean for population i).
As for the parametrization considered in the last lecture, we are interested in
testing whether all population means are equal, and in looking at differences in
population means. To test equality of the population means we must test
H0 : µ1 = ... = µk
against the alternative
H1 : Not all µ1 , ..., µk are equal.
This test does not quite fit into the hypothesis testing framework we have dis-
cussed previously for the full rank linear model: testing µ1 = ... = µk does not
correspond to testing whether a subset of µ1 , ..., µk are zero. However, hypothe-
ses like the one above can be tested within the framework of the general linear
hypothesis, which we now discuss.

The general linear hypothesis

If we write the full rank linear model in the usual notation as


y = Xβ + ε
(where β = (β0 , ..., βk ) so that there are p = k + 1 parameters in the mean
structure) then for the general linear hypothesis we test
H0 : Cβ = d
against the alternative


H1 : Cβ ̸= d
where C is an m × p matrix of rank m, m ≤ p, and d is an m × 1 vector.
It is worth considering a few special cases to illustrate the usefulness of the
general linear hypothesis. Suppose we wish to test the hypothesis

H0 : βr+1 = ... = βk = 0

against the alternative

H1 : Not all βr+1 , ..., βk are zero.

Then define C to be the (k − r) × p matrix consisting of the last k − r rows of
the p × p identity matrix Ip. Clearly as Ip β = β, we have that Cβ is the vector
(βr+1 , ..., βk )⊤. Thus defining d to be a (k − r) × 1 vector of zeros, the hypothesis

H0 : βr+1 = ... = βk = 0

can be written as Cβ = d. So any of the hypothesis tests we have considered
previously may be considered as special cases of the general linear hypothesis.
The general linear hypothesis also allows us to test

H0 : µ1 = ... = µk

in the parametrization we have introduced for the one way classification model.
We rewrite the above hypothesis as

H0 : µ1 − µ2 = 0, µ1 − µ3 = 0, ..., µ1 − µk = 0.

Clearly these two hypotheses are equivalent. Each of the equations µ1 − µ2 =
0, ..., µ1 − µk = 0 can be expressed in the form c⊤ β = 0 where c is a k × 1 vector
and β = (µ1 , ..., µk )⊤ . For instance, if we let c = (1, −1, 0, ..., 0)⊤ , then c⊤ β = 0
is equivalent to µ1 − µ2 = 0. Hence by defining an appropriate (k − 1) × k matrix
C and by letting d be a (k − 1) × 1 vector of zeros, we can write the equations
µ1 − µ2 = 0, ..., µ1 − µk = 0 in the form Cβ = d. So the general linear hypothesis
gives a very general framework for testing many hypotheses of interest in the
general linear model.
What is the test statistic used for testing the general linear hypothesis? It can
be shown that under H0 : Cβ = d, and writing b for the least squares estimator
of β, the statistic

\[
\frac{(Cb - d)^\top (C(X^\top X)^{-1} C^\top)^{-1} (Cb - d)}{m\hat{\sigma}^2}
\]
has an Fm,n−p distribution. For testing µ1 = ... = µk for the one way classification
model, this statistic is computed and displayed in the analysis of variance table.
Example: tax return forms.

We continue our discussion of the data on completion times for different formats
of tax return forms introduced in the last lecture. Recall that this data set was
collected by dividing a group of twenty subjects into four groups, and giving
each subject in the first group a tax return form in the format currently used,
and members of the remaining groups forms in three new formats (A, B and C).
The completion time for each of the subjects was recorded. This is a one way
classification model with four groups. We showed in the last lecture that we could
test if the mean completion time was the same for all four groups by introducing
appropriate binary variables to fit the model where we adopt a parametrization
with the mean completion time for the current form as a reference level.

One-way Analysis of Variance

Analysis of Variance for y


Source DF SS MS F P
factor 3 26.57 8.86 6.06 0.006
Error 16 23.40 1.46
Total 19 49.98
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev --+---------+---------+---------+----
1 5 8.360 1.655 (-------*------)
2 5 9.600 1.235 (-------*-------)
3 5 11.240 0.971 (-------*-------)
4 5 8.500 0.803 (-------*------)
--+---------+---------+---------+----
Pooled StDev = 1.209 7.5 9.0 10.5 12.0

Fisher's pairwise comparisons

Family error rate = 0.189


Individual error rate = 0.0500

Critical value = 2.120

Intervals for (column level mean) - (row level mean)

1 2 3

2 -2.862
0.382

3 -4.502 -3.262
-1.258 -0.018

4 -1.762 -0.522 1.118
1.482 2.722 4.362
For the moment just consider the analysis of variance table. As we have men-
tioned, the hypothesis we wish to test (equality of population means) can be
written in the form Cβ = 0 here. The test statistic in this case is
\[
\frac{(Cb)^\top (C(X^\top X)^{-1} C^\top)^{-1} (Cb)}{(k-1)\hat{\sigma}^2}
\]
for an appropriate matrix C. In the analysis of variance table, in the sum of
squares (SS) column and the factor row, 26.57 is the value of
\[
(Cb)^\top (C(X^\top X)^{-1} C^\top)^{-1} (Cb).
\]
The degrees of freedom in the factor row is 3 (which is k − 1). The sum of squares
for the error row is as usual the residual sum of squares. In the mean square (MS)
column we have in the error row the estimated residual variance $\hat{\sigma}^2$ (the residual
sum of squares divided by its degrees of freedom), and in the factor row we have
\[
\frac{(Cb)^\top (C(X^\top X)^{-1} C^\top)^{-1} (Cb)}{k-1}.
\]
The F statistic is the ratio of the two entries in the mean square column, which is
the test statistic for the general linear hypothesis for testing equality of population
means. The p-value is 0.006, indicating we reject equality of population means.
You should compare the ANOVA table here with the one we obtained in the last
lecture in the alternative parametrization we have considered.
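The quantities just described can be reproduced directly for these data. The
sketch below is an illustration only (assuming numpy), using the cell-means
parametrization with the data keyed in from Table 7.5.

import numpy as np

y = np.array([6.3, 9.8, 10.1, 8.5, 7.1,        # current format
              9.2, 8.6, 10.3, 11.4, 8.5,       # format A
              12.6, 10.9, 11.8, 10.8, 10.1,    # format B
              7.5, 9.3, 8.8, 7.8, 9.1])        # format C
X = np.kron(np.eye(4), np.ones((5, 1)))        # one indicator column per population
n, k = X.shape

b = np.linalg.solve(X.T @ X, X.T @ y)          # least squares estimates: the group means
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - k)

# C encodes mu_1 - mu_2 = 0, mu_1 - mu_3 = 0, mu_1 - mu_4 = 0, with d = 0
C = np.array([[1., -1., 0., 0.],
              [1., 0., -1., 0.],
              [1., 0., 0., -1.]])
m = C.shape[0]

num = (C @ b) @ np.linalg.inv(C @ np.linalg.inv(X.T @ X) @ C.T) @ (C @ b)
print(num)                   # about 26.57, the factor sum of squares
print(num / (m * sigma2_hat))  # about 6.06, the F statistic in the ANOVA table
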
Below the analysis of variance table is a table giving the number of obser-
vations in each of the four groups, the estimated coefficients in the model (that
is, the estimated population means) and the estimated standard errors of the
coefficients. Also, 95 percent confidence intervals for the population means are
depicted graphically. These confidence intervals are just the usual ones for the
coefficients in a full rank linear model based on the t distribution. Below this is
the estimated residual standard deviation (which is 1.209 here). Below this under
the section labelled “Fisher’s pairwise comparisons” are 95% confidence intervals
for differences in pairs of population means. If $\hat{\mu}_i$, i = 1, ..., k are the estimated
means for the k populations, it can be shown that for i ̸= j
\[
\frac{(\hat{\mu}_i - \hat{\mu}_j) - (\mu_i - \mu_j)}{\hat{\sigma}\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}}
\]
has a tn−k distribution. This result is the basis for computing the intervals in the
table.
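For example, the first interval in the table (for the level 1 mean minus the level
2 mean) can be reproduced as follows; this sketch is an illustration only,
assuming numpy and scipy, and uses the means and pooled standard deviation from
the output above.

import numpy as np
from scipy import stats

mean1, mean2 = 8.360, 9.600              # current format and format A means
n1 = n2 = 5
pooled_sd = 1.209
t_crit = stats.t.ppf(0.975, df=20 - 4)   # n - k = 16 degrees of freedom

half_width = t_crit * pooled_sd * np.sqrt(1 / n1 + 1 / n2)
print(mean1 - mean2 - half_width, mean1 - mean2 + half_width)
# roughly (-2.86, 0.38), matching the first interval in the table
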

7.2 Randomized complete block designs


In the last two lectures we have been discussing the one way classification model
with fixed effects (sometimes called a completely randomized experiment). In the
completely randomized experiment we have random samples from k populations,
and we are interested in comparing the means of these populations.
Today we will discuss an important extension of the completely randomized
design where we are interested in comparing population means for k populations
of interest, but where the members of the k populations can be divided into groups
based on a second classifying variable (we talk about blocks for the groupings
induced by the classification). We will consider the so-called randomized complete
block experiment where we attempt to reduce unexplained variation and improve
precision of comparisons between population means by taking account of variation
between blocks.
An example will help to illustrate the idea. This example is from Keller,
Warrack and Bartel, “Statistics for Management and Economics: A Systematic
Approach (Second Edition),” Wadsworth, Belmont, California. It was desired
to compare the annual incomes of business students with different majors two
years after graduation. A random sample of six students was taken of graduates
in accounting, marketing and finance and their income was determined. The
data are shown in the table below (income is given in thousands of dollars).

Accounting Marketing Finance

27 23 48
22 36 35
33 27 46
25 44 36
38 39 28
29 32 29

We are interested in making comparisons between mean income levels for students
majoring in different disciplines. One way to test for any differences between
income levels is to analyze the data as a one way classification model with fixed
effects. We have seen how to do this in previous lectures. Fitting a one way
classification model results in the following output.

One-way Analysis of Variance



Analysis of Variance
Source DF SS MS F P
Factor 2 193.0 96.5 1.77 0.205
Error 15 819.5 54.6
Total 17 1012.5
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ---+---------+---------+---------+---
Accounti 6 29.000 5.762 (---------*----------)
Marketin 6 33.500 7.765 (----------*----------)
Finance 6 37.000 8.390 (----------*---------)
---+---------+---------+---------+---
Pooled StDev = 7.391 24.0 30.0 36.0 42.0

From the p-value in the analysis of variance table, no significant difference be-
tween mean annual income levels seems to be indicated. We can see that only a
small fraction of total variation is explained by the model here. With a sample
of size six from each group, it will be difficult to detect differences in population
means which are small compared to the residual standard deviation (which is
about 7000 dollars). Now, suppose it was thought that annual income following
graduation was dependent on performance at university. We can classify the
members of each of the populations of interest (accounting majors, marketing
majors and finance majors) according to their average grade (A+, A, B+, B, C+
or C) during the course of their studies. A new experiment is conducted in which
we again take a sample of size six from the three populations, but in each of the
three groups of six students we take one student with an A+ average, one with
an A average, one with a B+ average, one with a B average, one with a C+
average and one with a C average: we say that students in different grade levels
form different blocks in this new experiment. So we have three populations, and
six blocks, and one observation for each population/block combination (eighteen
observations in all). The table below shows the data collected for this new exper-
iment. The idea of introducing the blocking variable here (average grade) is to
reduce residual variation to allow more precise estimates of differences between
population means. A key assumption here is that there is no interaction between
the blocking variable, average grade, and the major of the student: it is assumed
that the difference between mean salary for, say, an accounting and marketing
graduate with the same average grade does not depend on whether that grade is
an A+, a C or anything else. We allow the grade to have an additive effect on
mean annual income, but the amount of the effect must not depend on what the
student studied.
   Average   Accounting   Marketing   Finance
     A+          41           45         51
     A           36           38         45
     B+          27           33         31
     B           32           29         35
     C+          26           31         32
     C           23           25         27

For notational convenience we consider first the situation where there is pre-
cisely one observation for each population/block combination. We write yij for
the observation for population i, i = 1, ..., k and block j, j = 1, ..., b. The model
we are considering in a randomized complete block experiment is

yij = µ + αi + βj + εij .

Here µ is a global mean, αi is an effect for population i, βj is an effect for block


j and the εij are zero mean random errors, i = 1, ..., k, j = 1, ..., b. We make the
usual distributional assumptions about the errors.
We want to develop a test for equality of the population means (that is, we
want to test if αi = 0 for i = 1, ..., k). You have probably already realized that
the model we have written down above is a less than full rank linear model.
There are k + b + 1 parameters in the mean structure, so that the design matrix
X has dimensions n × (k + b + 1). The rank of X is k + b − 1 here: effectively
there are two redundant parameters. As for the one way classification model,
there are a number of approaches to developing a hypothesis test for equality of
population means. One approach is to reparametrize to a full rank model and to
apply methods for the full rank case. For instance, we could set α1 = β1 = 0,
and then test
H0 : α2 = ... = αk = 0
against the alternative

H1 : Not all α2 , ..., αk are zero

in this reparametrized model to test equality of population means. There are


other parametrizations that could be considered in which the general linear hy-
pothesis can be employed to test equality of means. Also, it is possible to work
directly with the less than full rank linear model, although we have not devel-
oped the necessary matrix concepts to do this. Whichever method is adopted,
the analysis of variance table will be the same, and the same conclusion will be
reached for the test for equality of population means.
It is worth discussing the general form of the analysis of variance table for
the randomized complete block design. In a randomized complete block design
we can partition up total variation into a component of variation for the popula-
tion mean differences (sometimes called the sum of squares due to treatments),
a component for block differences (sum of squares due to blocks) and residual
variation. The sum of squares due to treatments is simply the regression sum of
squares you would get if you ignored blocking and used a one way classification
model for the samples from the different populations. The sum of squares due
to blocks is simply the regression sum of squares you would get if you ignored
population mean differences and fitted a one way classification model treating
the different blocks as the populations of interest. It is a special feature of the
randomized complete block design where you have equal numbers of observations
for each population/block combination that these two sums of squares add to the
regression sum of squares you get for fitting the full model.
The analysis of variance for the income levels of business students, analyzing the
data as a randomized complete block experiment, is given below.

Two-way Analysis of Variance

Analysis of Variance for y


Source DF SS MS F P
Major 2 108.44 54.22 10.38 0.004
Grade 5 854.94 170.99 32.74 0.000
Error 10 52.22 5.22
Total 17 1015.61

Our treatment sum of squares here is 108.44, block sum of squares is 854.94,
and residual sum of squares is 52.22. The estimated residual variance is 5.22,
and we see that blocking has considerably reduced the proportion of total varia-
tion unexplained by the model: this may allow us to detect differences between
population means which we could not have detected previously for the same ex-
penditure of resources. The F statistic for testing equality of mean income for
the different majors is 10.38 here (the mean square values in the ‘MS’ column
are the sums of squares divided by the degrees of freedom, and the F statistic
10.38 is the mean square due to treatments divided by the mean square error:
we look at whether the variation between different major types is large com-
pared to unexplained variation). The p-value for this F -test is 0.004, indicating
a significant difference between means for different majors at the 5 percent level.
The p-value for the blocking variable should be ignored here, although the F
statistic for blocking does give us a rough indication of whether blocking was
effective in reducing residual variation to allow more precise comparisons among
the populations of interest.
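The sums of squares in this table can be checked directly. The sketch below is an
illustration only (assuming numpy), with the income data keyed in from the table
above (rows for grades, columns for majors).

import numpy as np

# rows: blocks (grades A+, A, B+, B, C+, C); columns: majors (accounting, marketing, finance)
y = np.array([[41., 45., 51.],
              [36., 38., 45.],
              [27., 33., 31.],
              [32., 29., 35.],
              [26., 31., 32.],
              [23., 25., 27.]])
b, k = y.shape
grand = y.mean()

ss_treat = b * np.sum((y.mean(axis=0) - grand) ** 2)   # about 108.44 (majors)
ss_block = k * np.sum((y.mean(axis=1) - grand) ** 2)   # about 854.94 (grades)
ss_total = np.sum((y - grand) ** 2)                    # about 1015.61
ss_resid = ss_total - ss_treat - ss_block              # about 52.22

F_major = (ss_treat / (k - 1)) / (ss_resid / ((k - 1) * (b - 1)))
print(F_major)                                         # about 10.38
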

7.3 Factorial Experiments


Often experiments are conducted where one or more observations of a response
variable are made for each combination of possible levels of a number of categor-
ical predictors (factors). These kind of experiments are called factorial experi-
ments.

Example: insanity and phases of the moon

The data described here are given by Blackman and Catalina (1993) in the ar-
ticle “The Moon and the Emergency Room,” Perceptual and Motor Skills, 37,
pp. 624–626. The admission rate (in patients per day) to the emergency room
of a Virginia mental health clinic was measured before, during and after the 12
full moons from August 1971 to July 1972. This is a factorial experiment: for
each level of the factor month (twelve levels) and each level of the factor moon
(three levels, before, during or after the full moon) we have an observation of the
response.

Example: paper plane experiment

The data described here was collected by Stewart Fischer and David Tippetts,
statistics students at the Queensland University of Technology in a subject taught
by Dr Margaret Mackisack.
In the experiment conducted by these students, the relationship between
weight, design and angle of release for a paper aeroplane and the distance trav-
elled by the aeroplane was investigated. Two different weights were considered
(factor Paper describing the weight of the sheet of paper used with levels 1=80
grams and 2=50 grams) as well as two different designs (1=sophisticated de-
sign, 2=simple design) and two different angles of release (1=horizontal, 2=45
degrees). For each distinct combination of levels of the three factors (2 × 2 × 2 = 8
different combinations) there are two observations of the response - 16 observa-
tions in all.

Factorial experiments arise frequently in experiments designed to investigate


process improvement in manufacturing. In a statistical approach to process im-
provement, typically a sequence of experiments is conducted. In the initial stages
of an investigation, an attempt will be made to identify any variables (factors)
which affect the process being studied. Then a factorial experiment will be con-
ducted where a response (some characteristic of the process being studied) is
measured for each combination of a small number of levels for the factors identi-
fied as possibly influential. Analysis of the factorial experiment identifies which
factors are most influential in explaining process variation. Further experiments
are then conducted to attempt to optimize settings for the factors found to be
important in the initial experiment (response surface methodology is the name


given to these kind of sequential experimentation methods which aim to optimize
some process).
When the levels of categorical variables or factors are under the control of an
experimenter, there are many advantages in designing a study in the form of a
factorial experiment. These advantages include the ability to detect interactions
between factors (we discuss what this means later) as well as the need for fewer
observations of the response to achieve the same precision of estimation of effects
of interest. You will discuss these advantages more in later courses when you
look at experimental design.

Analysis of factorial experiments

For simplicity, we consider the analysis of an experiment with two factors (A


and B say). Factor A has a levels, and factor B has b levels. For each of the
ab combinations of possible levels of the two factors, we observe the response m
times. Write yijk for the kth observation of the response at the ith level of factor
A and jth level of factor B, i = 1, ..., a, j = 1, ..., b, k = 1, ..., m. The model we
consider is

yijk = µ + αi + βj + γij + εijk (41)

where µ is an overall mean, αi is the effect for the ith level of factor A, βj is
the effect for the jth level of factor B, γij is a term representing an interaction
between the ith level of factor A and the jth level of factor B (more on this in
a moment), and the εijk are zero mean uncorrelated normal errors with constant
variance.
Write µij = E(yijk ) for the mean of observations at level i of factor A and at
level j of factor B. The main effect at the ith level of factor A is defined to be
\[
\frac{1}{b} \sum_{j'} \mu_{ij'} - \frac{1}{ab} \sum_{i',j'} \mu_{i'j'}.
\]

This is the difference between the average of means with factor A fixed at level i
and the average of all means.
The interaction between factor A at level i and factor B at level j is defined
to be
\[
\left( \mu_{ij} - \frac{1}{a} \sum_{i'} \mu_{i'j} \right)
- \left( \frac{1}{b} \sum_{j'} \mu_{ij'} - \frac{1}{ab} \sum_{i',j'} \mu_{i'j'} \right).
\]

The first bracketed term above is the main effect you would get for factor A at
level i in an experiment with factor B fixed at level j. The second bracketed term
is the main effect for factor A at level i in the full experiment. If the interaction
is nonzero, it means that the effect of factor A at level i depends on the level j
of factor B. This is the meaning of interaction.
When there is interaction between factors, it may be hard to interpret the
main effects (when the way that one factor affects the response depends on the
level of another factor, it may not be meaningful to talk about an average effect
for the factor). You may check that if all the terms γij are zero in the model (41)
then the interactions as we have defined them above are zero.
In experiments with more than two factors, higher order interactions (that is,
interactions between more than two factors) can be defined, but we leave this for
later statistics courses.
We illustrate the idea of interaction with the paper plane experiment.

Example: paper plane experiment

For simplicity we analyse the paper plane experiment as a two factor experiment,
ignoring the angle of release factor. So we consider the two factor experiment
with factors Paper (weight of paper used in the plane) and Plane (design of
plane, sophisticated or simple).
Below is shown the so-called main effects plot for the paper plane experiment.
This plot shows response means for different levels of the factors. As we can see,
lighter planes seem to travel further and the planes with a sophisticated design
seem to travel further than planes with a simple design. The differences between
the points plotted on the graph and the horizontal lines on the graphs (the overall
mean of responses) give empirical estimates of the main effects as we’ve defined
them above.
Figure 7.56 shows a so-called interaction plot. Plotted on the y-axis are re-
sponse means, and plotted on the x-axis here are the levels of the factor Plane
(plane design, 1=sophisticated and 2=simple). The two lines (dashed and solid)
are for the light paper and heavy paper respectively. If there were no interaction,
then the lines in the plot would be parallel - for no interaction, the way that the
mean response should change as we change the level of Plane should not depend
on the level of Paper.

The model (41) is overparametrized: it is a less than full rank linear model where
the normal equations have no unique solution. We can impose some constraints
on the parameters, and adding these constraint equations to the normal equations
ensures that the normal equations have a unique solution.
The usual constraints are
\[
\sum_{i} \alpha_i = 0, \qquad \sum_{j} \beta_j = 0,
\]
\[
\sum_{i=1}^{a} \gamma_{ij} = 0, \quad j = 1, ..., b, \qquad
\sum_{j=1}^{b} \gamma_{ij} = 0, \quad i = 1, ..., a.
\]

Figure 7.55: Main effects plot for data on paper planes.

Figure 7.56: Interaction plot for data on paper planes.

An alternative to adding constraints is to reparametrize the model. As in our


previous discussion, we can code the information in factor A through a set of a−1
dummy variables, and the information in factor B by a set of b − 1 dummy vari-
ables. The interaction effects can be represented by including the products of the
dummy variables for the two factors as additional predictors. Reparametrization
in this way results in a full rank linear model, and we can then estimate parame-
ters and test hypotheses in the usual way. The interpretation of parameters will
depend on the way that the factors have been coded as dummy variables.
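A sketch of this coding (an illustration only, assuming numpy and using
hypothetical factor levels) is given below: a − 1 dummy columns for factor A,
b − 1 for factor B, and their products for the interaction.

import numpy as np

A = np.array([1, 1, 2, 2, 1, 2])    # hypothetical observed levels of factor A (a = 2)
B = np.array([1, 2, 1, 2, 2, 1])    # hypothetical observed levels of factor B (b = 2)
a, b = 2, 2

dummies_A = np.column_stack([(A == lev).astype(float) for lev in range(2, a + 1)])
dummies_B = np.column_stack([(B == lev).astype(float) for lev in range(2, b + 1)])
# interaction columns: products of each A dummy with each B dummy
interaction = np.column_stack([dummies_A[:, i] * dummies_B[:, j]
                               for i in range(a - 1) for j in range(b - 1)])

X = np.column_stack([np.ones(len(A)), dummies_A, dummies_B, interaction])
print(np.linalg.matrix_rank(X))     # 4 = 1 + (a-1) + (b-1) + (a-1)(b-1): full column rank
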
Often a factorial experiment will be conducted in which m = 1 (that is,
there is only one observation for each combination of levels of the factors). In
this case, the residual sum of squares in fitting the model is zero (the number
of parameters in the mean for the reparametrized model is equal to the total
number of observations) and in order to estimate the error variance we need to
assume that at least some of the interaction parameters are zero. If we assume
all interaction parameters are zero, then we have the additive model

yijk = µ + αi + βj + εijk .

Here there is assumed to be no interaction between the factors.

Example: insanity and phases of the moon

We return to the example on admissions to the emergency room of a hospital


and phases of the moon. In this experiment the daily rate of admissions to the
emergency room of a mental hospital was measured for each of twelve months.
For each of the twleve months there are measurements for the periods before,
during and after the full moon. This is a two factor factorial experiment with
factors month of the year and phase of the moon (a three level factor with levels
before, during and after).
Below is the main effects plot and shows for each of the factors the average
of all responses for each level of the factor. As we can see, there seems to be
some seasonal pattern to the level of admissions, and perhaps also admissions
are higher during the full moon. Figure 7.58 shows the interaction plot. In the
figure, mean responses are plotted on the y-axis, the x-axis shows the month
for each mean and the different lines are for the three levels of the phase of the
moon factor. As mentioned above, in an additive model (no interaction) the lines
should be roughly parallel - if there is interaction, this means that the effect of the
phase of the moon factor depends on the level of the month factor and the lines
would not be parallel. Since we only have one observation for each combination
of levels of the factors, we cannot fit the model with interactions.
Figure 7.57: Main effects plot for data on insanity and phases of the moon.

Figure 7.58: Interaction plot for data on insanity and phases of the moon.

Two-way Analysis of Variance

Analysis of Variance for Admission


Source DF SS MS F P
Moon 2 41.51 20.76 3.57 0.045
Month 11 455.58 41.42 7.13 0.000
Error 22 127.82 5.81
Total 35 624.92

From the p-values in the table, it can be seen that there is a significant effect
for both month of year and phase of the moon - it does seem as though there is
strong evidence in the additive model that phase of the moon has an effect on
mental hospital admissions.

When we have more than one observation for each distinct combination of
levels of the factors in a factorial experiment, then we are able to estimate inter-
action terms in our model.

Example: paper plane experiment

In the paper plane experiment, we can fit a model including interactions


since we have replicate observations for each distinct combination of levels of
the factors. Again we consider a two factor experiment including only the factors
Plane (design of plane) and Paper (paper weight). The analysis of variance table
is given below. A significant interaction effect is indicated, as was shown in the
interaction plot we discussed earlier.

Two-way Analysis of Variance

Analysis of Variance for Distance


Source DF SS MS F P
Paper 1 1718721 1718721 2.16 0.168
Plane 1 385641 385641 0.48 0.500
Interaction 1 23386896 23386896 29.35 0.000
Error 12 9561029 796752
Total 15 35052287

8 Logistic Regression
This course has been concerned with the general linear model, and we have seen
how flexible the linear model is for interpreting relationships in data. However,
the assumptions of the linear model can be restrictive in some situations. Much
of the theory of the linear model we have discussed has relied on the assumption
of normal errors: this often may not be reasonable, and if the assumption of
normality is to hold the responses must also be continuous. Constancy of variance
for the errors is another assumption of the linear model which is often violated.
In this subsection we discuss how we can do regression when the responses are
binary (zero or one). The model we will consider is perhaps the simplest example
of a generalized linear model. Generalized linear models are a flexible class of
models (including normal linear models as a special case) which are often used
for analyzing discrete and other kinds of non-normal data. Generalized linear
models will be discussed in much greater depth in some of your later statistics
courses. For now, we will consider only logistic regression models for describing
the dependence of a binary response on one or more predictors.
There are many situations in regression modelling where we are interested in
a binary response. For instance, for a number of banks over a period of time we
may have a collection of predictor variables describing the financial viability of
the banks as well as a response variable which is one if the bank fails over the
period of observation and zero otherwise. We are interested in estimating the
probability of failure based on the predictors describing the financial viability of
the bank.
The use of regression models for a binary response is also common in medical
statistics. For instance, we might be interested in estimating the probability that
someone will suffer a heart attack based on lifestyle factors. We could take a
random sample of subjects, measure predictors which are thought to relate to
heart attack risk (such as weight, blood pressure, cholesterol, etc.) and follow
these subjects over a period of time. We could then record a response variable
indicating whether the subject suffered a heart attack during the period of in-
vestigation. Again we have a binary response which we are interested in relating
to predictor variables: what is the probability of heart attack for someone with
given weight, blood pressure, cholesterol, etc.?
Clearly the linear model is inappropriate for a regression analysis when the
response is binary. Normality doesn’t hold, and the assumption of a constant
variance cannot hold in general either. One method of analysis appropriate for a
binary response variable is based on the logistic regression model.
As before we write y = (y1 , ..., yn ) for a vector of n responses, but where now
each value yi is either zero or one. Regarding each response value yi as random,
we write
pi = P r(yi = 1).

Note also that

E(yi ) = 0 · Pr(yi = 0) + 1 · Pr(yi = 1) = Pr(yi = 1)

so that pi = E(yi ). In the linear model we assumed that the expected response
was linear in a vector of unknown parameters β of length p, say: that is,

E(yi ) = xTi β

where we have written xi = (xi1 , ..., xip )T for a vector of p predictors (and where
xi1 = 1). Can we assume the same thing here, that

pi = xTi β?

Clearly there are some problems with this, since pi is a probability (and hence
must lie between zero and one) but the term xTi β can potentially be negative or
greater than one. One solution is to assume that some transformation of pi is
a linear function of the parameters β: we can use a transformation which maps
a number between zero and one (a probability) to the whole real line, and then
model the probability on this transformed scale.
One appropriate transformation of the probability pi might be
\[
l(p_i) = \log\left( \frac{p_i}{1 - p_i} \right).
\]

Note that as pi → 0, l(pi ) → −∞, whereas as pi → 1, l(pi ) → ∞. Furthermore,
l(pi ) increases as pi increases. In a logistic regression model it is assumed that
\[
\log\left( \frac{p_i}{1 - p_i} \right) = x_i^T \beta. \tag{42}
\]

How are parameters estimated in the logistic regression model? An obvious
approach is to use maximum likelihood to estimate the parameters β. Rearrang-
ing (42), we have that
\[
p_i = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}. \tag{43}
\]

If we assume that the responses yi are independent, the likelihood function is the
product of the probability functions for the observed responses yi . If yi = 1, the
contribution to the likelihood from the ith response is the probability that yi = 1,
or pi . If yi = 0, the contribution to the likelihood is the probability that yi = 0,
or 1 − pi . Noting that the function

\[
p_i^{y_i} (1 - p_i)^{1 - y_i}
\]
takes the value pi when yi = 1 and the value 1 − pi when yi = 0, we can write
the likelihood function as

\[
L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}
\]

where pi is given by (43). We must maximize this likelihood with respect to β,
or equivalently maximize the log-likelihood
\[
\log L(\beta) = \sum_{i=1}^{n} y_i \log\left( \frac{p_i}{1 - p_i} \right) + \sum_{i=1}^{n} \log(1 - p_i).
\]

After differentiating with respect to β, we find that the maximum likelihood
estimator $\hat{\beta}$ of β satisfies
\[
\sum_{i=1}^{n} \frac{\exp(x_i^T \hat{\beta})}{1 + \exp(x_i^T \hat{\beta})} \, x_i = \sum_{i=1}^{n} y_i x_i \tag{44}
\]

(you might like to prove this as an exercise). Now, for the linear model the
equations we needed to solve to obtain maximum likelihood estimates of β (the
normal equations) were linear equations, and we know from linear algebra when
a unique solution will exist and how to find it. However, the equations which
need to be solved to find the maximum likelihood estimate of β in the case of
logistic regression are nonlinear equations, and in general an iterative technique
is needed to solve them. We do not discuss the algorithm used in this course.
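A minimal Newton-Raphson sketch for solving (44) is given below. This is an
illustration only, not the algorithm used by any particular package; X and y are
hypothetical inputs, with X containing a column of ones.

import numpy as np

def fit_logistic(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # current fitted probabilities
        score = X.T @ (y - p)                  # gradient of the log-likelihood; zero at a solution of (44)
        W = p * (1 - p)
        hessian = -(X * W[:, None]).T @ X      # second derivative matrix of the log-likelihood
        beta = beta - np.linalg.solve(hessian, score)
    return beta

# toy illustration with simulated data
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.0 * x))))
print(fit_logistic(X, y))    # estimates should be roughly near (-0.5, 1.0)
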
So far we have discussed the use of logistic regression for modelling a binary
response. It is easy to extend the idea of logistic regression to deal with responses
which are binomially distributed. Recall that a binomial random variable can be
regarded as the number of successes in n independent trials with a probability p
of success on each trial (or the number of heads in n tosses of a coin where the
coin has probability p of a head on each flip). If we have responses yi which are
the number of successes in ni trials with probability pi of success on each trial,
and if we are interested in modelling the probability pi in terms of predictor vari-
ables (with ni known) then logistic regression can be applied: we just regard each
binomial response yi as being equivalent to ni binary responses, with yi of these
binary responses being one, and ni − yi of them being zero. We now illustrate
the fitting of logistic regression models in MINITAB.

Example: response of cows to electric shocks

This example is from Weisberg, “Applied Linear Regression (Second Edition),”


Wiley, New York, 1985. An experiment was conducted into the effect of small
electrical currents on farm animals, with the eventual goal of understanding the
effects of high voltage power lines on livestock. Seven cows were subjected to
six different shock intensities (0, 1, 2, 3, 4 and 5 milliamps). Each cow was given
thirty shocks, five at each intensity, in a random order. The experiment was
then repeated. For each shock, the response is mouth movement, which is either
present or absent. At each shock level, there are 70 observations of the response.
We ignore any effects due to the individual cows in our analysis (although we
could employ the idea of blocking here). We can model the number of positive
responses at each shock level by considering these to be binomial random variables
with n = 70 and a probability of success depending on the shock level.
Fitting the logistic regression model in MINITAB gives the following output.

Binary Logistic Regression

Link Function: Logit

Response Information

Variable Value Count


Number o Success 200
Failure 220
Number o Total 420

Logistic Regression Table


Odds 95% CI
Predictor Coef StDev Z P Ratio Lower Upper
Constant -3.3010 0.3238 -10.20 0.000
Current 1.2459 0.1119 11.13 0.000 3.48 2.79 4.33

Log-Likelihood = -170.078
Test that all slopes are zero: G = 241.134, DF = 1, P-Value = 0.000

From the MINITAB output, we have that the fitted probability of mouth move-
ment at shock level i is
\[
\frac{\exp(-3.3010 + 1.2459 \cdot \mathrm{Current}_i)}{1 + \exp(-3.3010 + 1.2459 \cdot \mathrm{Current}_i)}.
\]

Also listed in the output are estimated standard errors of the coefficients, and
p-values for testing the hypothesis that the various coefficients in the model are
zero (these are analogous to the partial t-tests we looked at in the linear model).
We will discuss in more depth what some of the output given here means in the
next lecture.
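For instance, a quick sketch (assuming numpy, and using the coefficients from the
output above) of the fitted probabilities at each shock intensity:

import numpy as np

current = np.arange(0, 6)               # 0, 1, ..., 5 milliamps
eta = -3.3010 + 1.2459 * current
p_hat = np.exp(eta) / (1 + np.exp(eta))
print(np.round(p_hat, 2))
# the fitted probability of mouth movement rises from roughly 0.04 at 0 mA
# to roughly 0.95 at 5 mA
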

8.1 Hypothesis testing in logistic regression


When building a logistic regression model for a binary response variable we may
wish to test hypotheses in the same way that we did for the linear model. For
instance, we might wish to test whether the predictors are at all useful in a
logistic regression for explaining variation in the response, or we may wish to
test whether a subset of the predictors can be deleted from the model. In this
subsection we extend the framework we have developed for hypothesis testing in
linear models to logistic regression.
We will start by rewriting some of the things we know about the linear model
in a different way. Consider the situation where we have a set of continuous
responses (y1 , ..., yn )T . We can obtain a linear model which gives a “perfect fit”
by having a parameter for every observation: we write

yi = µi + εi

where the terms εi are zero mean errors which are here independent and normal
with a common variance σ 2 . Now, of course the model we have just suggested is
probably not a very helpful one for answering the questions motivating any real
experiment, but the perfect fit model (usually called a saturated model) can help
us to assess the fit of models with fewer parameters.
What are the maximum likelihood estimators of the parameters µi in the
above model? If we write µ = (µ1 , ..., µn )T for the vector of the unknown mean
parameters, and if we assume that σ 2 is known, we can write the likelihood in
terms of the unknown parameters µ as
\[
L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y_i - \mu_i)^2 \right)
= (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2 \right).
\]

Clearly maximizing L(µ) with respect to µ is equivalent to maximizing

\[
-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2.
\]
The above expression is clearly never positive, and the maximum value it achieves
is zero when µi = yi. So the maximum likelihood estimator of µi is $\hat{\mu}_i = y_i$. The
maximum value attained by the likelihood is thus
\[
L(\hat{\mu}) = (2\pi\sigma^2)^{-n/2}.
\]

Now consider the model

yi = xTi β + εi (45)
where β is a p × 1 vector of parameters, and xi = (xi1 , ..., xip )T is a vector of
predictors associated with the ith response. Writing $\hat{\beta}$ for the maximum likeli-
hood estimator of β in this model, the maximum value achieved by the likelihood
function is
\[
L(\hat{\beta}) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \hat{\beta})^2 \right)
\]
Clearly the maximized value of the likelihood for the saturated model is always
larger than the maximized value of the likelihood for a model with fewer param-
eters. Looking at the ratio
\[
\frac{L(\hat{\beta})}{L(\hat{\mu})}
\]
tells us something about how close the smaller model comes to attaining the ideal
of a perfect fit established by the saturated model. We have here that
\[
\frac{L(\hat{\beta})}{L(\hat{\mu})} = \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \hat{\beta})^2 \right).
\]

Hence
\[
-2 \log\left( \frac{L(\hat{\beta})}{L(\hat{\mu})} \right) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \hat{\beta})^2.
\]

We say that
\[
\lambda(\beta) = -2 \log\left( \frac{L(\hat{\beta})}{L(\hat{\mu})} \right)
\]

is the scaled deviance of the model (45). The scaled deviance for a normal linear
model is simply the residual sum of squares for the model divided by σ 2 . The
deviance of the model (45) is the scaled deviance multiplied by the variance
parameter σ 2 : the deviance for a normal linear model is thus simply the residual
sum of squares (note: you will sometimes find different definitions of the deviance
in different textbooks).
We have rewritten our usual measure of lack of fit for a normal linear model,
the residual sum of squares, by considering the maximum value of the likelihood
for the model and comparing this to the maximimum value of the likelihood for a
saturated model. Rewriting the residual sum of squares in this way enables us to
generalize to new situations. In particular, we can define the deviance for a logistic
regression model: we can consider a saturated model (with a parameter for each
observation) and define the scaled deviance as -2 times the log of the ratio of the
maximized likelihood for the model to the maximized likelihood for the saturated
model. In logistic regression the distributions of the responses are determined by
their means, and there is no additional scale or dispersion parameter (like σ 2 in
the normal linear case). Thus for logistic regression, the deviance and the scaled
deviance have the same definition. As before, the deviance tells us something
about how close our model comes to achieving the perfect fit established by the
saturated model: it is a measure of lack of fit, analogous to the residual sum of
squares.
Looking at differences of scaled deviances is the key idea in hypothesis testing
for logistic regression. Again an analogy with the case of the normal linear model
is helpful. Let β = (β0 , ..., βk )T denote the vector of parameters in the mean
structure for a normal linear model containing k predictors. Partition β into two
subvectors, β (1) = (β0 , ..., βr )T and β (2) = (βr+1 , ..., βk )T . Suppose that we wish
to test the hypothesis
H0 : β (2) = 0
against the alternative
H1 : β (2) ̸= 0.
By testing this hypothesis we are comparing two different models: we compare
a model containing all k predictors with a smaller model in which the last k − r
predictors in the original model are excluded. Recall that our procedure for
testing in this situation involved fitting the smaller model, and then looking at
the reduction in the residual sum of squares when the larger model was fitted.
We could divide this reduction in the residual sum of squares by the number
of additional parameters in the larger model (k − r here), divide all this by the
estimated residual error variance (the mean square error) and then assuming that
the smaller model was adequate this was a realization of an F random variable
with k−r and n−k−1 degrees of freedom (where n is the number of observations).
The derivation of the distribution of the F statistic used for testing β (2) = 0
involved showing that the difference in the scaled deviances for the larger and
smaller models (that is, the difference in the residual sum of squares divided by
σ 2 ) had a χ2k−r distribution. If σ 2 was known, this result would enable testing
β (2) = 0 via a χ2 statistic (the reason for using the F statistic was that σ 2 is
not known in general, and so to come up with a statistic with a distribution
free of unknown parameters which can be used for hypothesis testing we need
to divide by the mean square error). So we have seen that for testing β (2) = 0
in the normal linear model with σ 2 known we can simply refer the difference of
scaled deviances for the larger and smaller model divided by the difference in the
number of parameters for the larger and smaller model to a χ2 distribution.
This result is exact for a normal linear model. Fortunately for us, this result
also holds approximately in large samples (that is, as n → ∞) for comparing
two fixed models in logistic regression. For logistic regression there is no variance
or dispersion parameter (like σ 2 ) to estimate, and carrying over our previous
notation we can test whether β (2) is zero by comparing the difference in scaled
deviances for the larger and smaller models divided by the difference in the num-
ber of parameters for the larger and smaller models (k − r here) to a percentage
point of a χ2k−r distribution.
This result may be used to test the hypothesis that β1 = ... = βk = 0 (that is,
that none of the predictors in a logistic regression model are helpful for explaining
variation in the response) against the alternative that not all these parameters
are zero. Also, of course we can compare the full model with a model in which
one predictor is excluded to get a test analogous to a two-sided partial t-test for
significance of coefficients in the normal linear model (there is also another way
of testing significance of individual coefficients in the logistic regression based
on large sample normality results for the maximum likelihood estimator). We
will discuss some of these tests when we look at the MINITAB output for some
examples during the lecture.
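As a final illustration (assuming scipy, and interpreting the G statistic in the
cow-shock output as the difference in scaled deviances between the model with no
predictors and the model containing Current), the test that all slopes are zero
can be carried out as follows.

from scipy import stats

G = 241.134                    # deviance difference reported in the output above
df = 1                         # one extra parameter in the larger model
p_value = stats.chi2.sf(G, df)
print(p_value)                 # essentially zero, agreeing with the reported p-value
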
