Math2831 Course Pack
LINEAR MODELS
Term 3, 2019
Contents
1 Introduction
1.1 Some Basic Terminology
1.2 The Model Building Process
4 Model Selection
4.1 Out of sample prediction, PRESS residuals and the PRESS statistic
4.2 The hat matrix and computation of the PRESS residuals
4.3 Cross validation
4.4 The conceptual predictive criterion (Mallow's Cp)
4.5 Sequential variable selection procedures
4.6 Problems with forward selection, backward selection and stepwise algorithms
1 Introduction
1.1 Some Basic Terminology
In this course we will be concerned with building statistical models which describe
the dependence of some variable of interest (called the response) on a number of
other variables (called predictors).
An example will help to illustrate the idea. The Bureau of Meteorology in
Sydney issues daily maximum temperature forecasts at many locations around
the city. One of the locations at which a temperature forecast is issued is Sydney
Airport, and forecasters in the Bureau make use of a statistical model to help
them in deciding on the forecast.
Forecasters have available to them the predictions of a deterministic physical
model (based on the equations of motion for fluids) which gives predicted values
of certain meteorological variables such as temperature, wind speed and direction
and pressure. However, the physical model may contain biases, and for this reason
the Bureau uses a statistical model to relate the observed maximum temperature
at Sydney airport (the response) to the set of variables predicted by the physical
model (the predictors). The use of the statistical model to modify the output of
the physical model gives a more accurate forecast, and allows quantification of
uncertainty about predictions of the response.
In situations where the predictors are under the control of an experimenter the
response is sometimes called the dependent variable and the predictors are called
independent variables. In this course we will not be concerned with modelling the
independent variables as random quantities, but assume that they are controlled
or at least accurately measured.
A statistical model can be thought of as decomposing the response into “Signal” plus “Noise”. “Signal” here denotes systematic variation in the response, which for us will be variation which can be explained in terms of the predictors and a small number of unknown parameters to be estimated from the data. The other component of the model, the “Noise” term, captures residual variation not explained by the signal: typically there will be some unknown parameters to be estimated in this component of the model as well.
A good statistical model provides a reduction of a possibly large and complex
data set to a description in terms of a small number of parameters. When building
a model a good statistician will keep in mind that the reduction of the data
obtained must be a useful one for making the decisions which motivated the
building of the model. Finding or selecting a good statistical model is a complex
iterative process which is difficult to formally describe. Usually statisticians cycle
between stages of tentative model formulation, estimation and model criticism.
It is often an objective of a statistical analysis to describe as much of the
variation in the response as possible systematically (that is, in the “Signal”).
When we achieve this objective we say that the model fits the data well. However,
as we have mentioned, parsimony of the model (that is, having a model with a
small number of parameters) is also desirable in order to provide a reduction of
the data which is useful for human decision making and for making predictions
of future response values. So we must manage a trade off between simplicity of a
statistical model and goodness of fit. How we should manage this trade off often
depends on the use to which the model will be put.
We have talked about the process of model building in general terms. We will
now look in detail at the general linear model which is the subject of this course.
We discuss perhaps the simplest example of the general linear model first (the
simple linear regression model) and illustrate the processes of estimation, model
selection and model criticism in this situation. Once we have done this, the rest
of the course will be concerned with describing the same processes in the general
case.
2 The Simple Linear Regression Model

The simple linear regression model relates responses y1, ..., yn to the values x1, ..., xn of a single predictor through
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad (1) \]
where the εi are zero mean errors. The least squares estimates of β0 and β1 are obtained by minimizing
\[ \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2, \qquad (2) \]
where β0 and β1 are the parameters over which we minimize. If b0 and b1 are the values which minimize (2) with respect to β0 and β1, then b0 and b1 minimize in some overall sense the deviations of the responses yi from the fitted values ŷi.
Of course it is possible to define global measures of the discrepancy of the
fitted values from the responses which are different to (2). For instance, we could
consider estimators obtained by minimizing
\[ \sum_{i=1}^{n} \left| y_i - (\beta_0 + \beta_1 x_i) \right|. \qquad (3) \]
Returning to the least squares criterion (2) and differentiating with respect to β0 and β1 gives
\[ \frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \]
and
\[ \frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i). \]
Setting these derivatives to zero and rearranging, we have that the least squares estimators b0 and b1 satisfy
\[ n b_0 + \left( \sum_{i=1}^{n} x_i \right) b_1 = \sum_{i=1}^{n} y_i \qquad (4) \]
\[ \left( \sum_{i=1}^{n} x_i \right) b_0 + \left( \sum_{i=1}^{n} x_i^2 \right) b_1 = \sum_{i=1}^{n} x_i y_i. \qquad (5) \]
These are linear equations in b0 and b1 (called the normal equations) which are
easily solved. Dividing (4) by n and solving for b0 gives
\[ b_0 = \bar{y} - b_1 \bar{x}. \qquad (6) \]
Substituting (6) into (5) and rearranging gives
\[ b_1 \left( \sum_{i=1}^{n} x_i^2 - \bar{x} \sum_{i=1}^{n} x_i \right) = \sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i. \]
Solving for b1,
\[ b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}}. \qquad (7) \]
We write Sxy for the numerator in (7) and Sxx for the denominator, so that b1 = Sxy/Sxx. We can then compute b0 from (6). An example will help to reinforce some of the concepts we have discussed so far.

Zinc concentrations measured in sediment samples (x) and in plants (y) for fifteen lakes:

  Zinc in sediment (x)    Zinc in plant (y)
          37.5                  15.9
          72.5                  42.7
          85.2                  85.7
          76.5                  52.6
          64.5                  49.1
          86.8                  59.0
          90.8                  53.6
         105.8                  77.8
          85.8                  63.2
          87.9                  62.3
          53.9                  22.7
         102.3                  66.1
          90.7                  47.4
          86.0                  59.4
          79.0                  50.9
[Figure 2.1: Scatter plot showing zinc concentrations in plants (vertical axis) versus lake sediment samples (horizontal axis).]
For the zinc data,
\[ \sum_{i=1}^{15} y_i = 808.4, \qquad \sum_{i=1}^{15} x_i = 1205.2, \qquad \sum_{i=1}^{15} x_i^2 = 101228.6, \qquad \sum_{i=1}^{15} x_i y_i = 68716.78, \]
so that
\[ S_{xy} = 68716.78 - \frac{(808.4)(1205.2)}{15} = 3764.5 \]
and
\[ S_{xx} = 101228.6 - \frac{1205.2^2}{15} = 4394.8. \]
Hence the least squares estimate of the slope coefficient is 3764.5/4394.8 = 0.857 and the least squares estimate of the intercept is 808.4/15 − 0.857 × 1205.2/15 = −14.9. The fitted least squares regression line (the systematic component of the fitted model) is therefore
\[ \hat{y} = -14.9 + 0.857x. \]
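The calculation is easy to reproduce numerically. Below is a minimal sketch in Python (assuming the numpy library is available); the variable names are ours rather than part of the original example.

import numpy as np

# Zinc concentrations: x = sediment, y = plant (from the table above)
x = np.array([37.5, 72.5, 85.2, 76.5, 64.5, 86.8, 90.8, 105.8, 85.8, 87.9,
              53.9, 102.3, 90.7, 86.0, 79.0])
y = np.array([15.9, 42.7, 85.7, 52.6, 49.1, 59.0, 53.6, 77.8, 63.2, 62.3,
              22.7, 66.1, 47.4, 59.4, 50.9])

n = len(x)
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n
b1 = Sxy / Sxx                      # slope estimate, approximately 0.857
b0 = np.mean(y) - b1 * np.mean(x)   # intercept estimate, approximately -14.9
print(b0, b1)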
We would like our estimators to be equal on average to the true parameter value (no systematic bias), and we want the variation of the estimates about the true parameter value to be as small as possible (small variance).
We would like to compute the mean and variance of the least squares estima-
tors of the slope and intercept in the simple linear regression model as a way of
evaluating the performance of these estimators.
Some Notation
In the rest of the note, for convenience, we write yi to represent both the i-th
response variable (see model assumption (1)) and its realization (a data point).
Recall that the least squares estimator b1 for β1 is given by
\[ b_1 = \frac{S_{xy}}{S_{xx}}, \qquad (7) \]
where
\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} \]
and
\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}. \]
It is a very easy exercise to derive the following alternative expressions for Sxy
and Sxx :
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x}) y_i \]
and
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x}) x_i. \]
Also,
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
and
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2. \]
Expected value of b1
Expected value of b0
Recall that
\[ b_0 = \bar{y} - b_1 \bar{x}. \]
Now,
\[ E(\bar{y}) = \frac{\sum_{i=1}^{n} E(y_i)}{n} = \frac{\sum_{i=1}^{n} (\beta_0 + \beta_1 x_i)}{n} = \frac{n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i}{n} = \beta_0 + \beta_1 \bar{x}. \]
Hence, since E(b1) = β1,
\[ E(b_0) = E(\bar{y}) - \bar{x}\,E(b_1) = \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} = \beta_0. \]
So b0 is unbiased for β0.
Variance of b0 and b1
\[ \mathrm{Var}(b_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) \]
\[ \mathrm{Var}(b_1) = \frac{\sigma^2}{S_{xx}} \]
\[ \mathrm{Cov}(b_0, b_1) = -\frac{\sigma^2 \bar{x}}{S_{xx}}. \]
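These variance formulas can be checked empirically by simulation. The following is a minimal sketch in Python (numpy assumed); the particular values of β0, β1, σ and the design points are arbitrary illustrations, not values taken from these notes.

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 0.5      # true parameters (arbitrary choice)
x = np.linspace(0.0, 10.0, 20)           # fixed design points
n = len(x)
Sxx = np.sum((x - x.mean()) ** 2)

b0s, b1s = [], []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * y) / Sxx
    b0 = y.mean() - b1 * x.mean()
    b0s.append(b0)
    b1s.append(b1)

# The empirical variances should be close to the theoretical expressions
print(np.var(b1s), sigma ** 2 / Sxx)
print(np.var(b0s), sigma ** 2 * (1 / n + x.mean() ** 2 / Sxx))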
Example: market model of stock returns
The following example is from Keller, Warrack and Bartel, “Statistics for Man-
agement and Economics: a Systematic Approach (Second Edition)”, Wadsworth,
Belmont, 1990, p. 725.
A well known model in finance, called the market model, assumes that the
monthly rate of return on a stock (R) is linearly related to the monthly rate of
return on the overall stock market (Rm ). The mathematical description of the
model is
R = β0 + β1 Rm + ε
where the error term ε is assumed to satisfy the requirements of the linear re-
gression model. For practical purposes, Rm is taken to be the monthly rate of
return on some major stock market index, such as the New York Stock Exchange
(NYSE) Composite Index.
The coefficient β1 , called the stock’s beta coefficient, measures how sensitive
the stock’s rate of return is to changes in the level of the overall market. For
example, if β1 > 1 (β1 < 1), the stock’s rate of return is more (less) sensitive to
changes in the level of the overall market than is the average stock. The monthly
rates of return to Host International Inc. stock and to the overall market (as
approximated by the NYSE composite index) over a 5-year period are shown in
the table below.
Figure 2.2 shows a scatterplot of these data, together with a fitted least squares regression line.

[Figure 2.2: Scatter plot of Host International returns versus overall market returns with fitted least squares regression line.]

As usual, writing y for the response (Host International
return) and x for the predictor (overall market return) we have in this case that
\[ \sum_{i=1}^{n} x_i = 90.4, \qquad \sum_{i=1}^{n} x_i^2 = 1254.1, \qquad \sum_{i=1}^{n} y_i = 136.0, \qquad \sum_{i=1}^{n} y_i^2 = 8158.9, \qquad \sum_{i=1}^{n} x_i y_i = 1992.8. \]
Hence the estimated beta coefficient for the stock (the least squares estimate of
the slope in the linear regression) is
\[ b_1 = \frac{S_{xy}}{S_{xx}} \]
and we calculate Sxy and Sxx as
\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} = 1992.8 - \frac{(90.4)(136.0)}{60} = 1787.89 \]
and
\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} = 1254.1 - \frac{90.4^2}{60} = 1117.90, \]
giving
\[ b_1 = 1787.89/1117.90 = 1.60. \]
So it seems as though the rate of return on Host International stock is more
sensitive to the level of the overall market than is the average stock. The least
squares estimate b0 of β0 here is ȳ − b1 x̄ = 2.2667 − (1.60)(1.5067) = −0.14.
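A minimal sketch in Python of the same calculation from the summary statistics quoted above; the variable names are ours.

n = 60
sum_x, sum_x2 = 90.4, 1254.1
sum_y, sum_xy = 136.0, 1992.8

Sxy = sum_xy - sum_x * sum_y / n     # about 1787.89
Sxx = sum_x2 - sum_x ** 2 / n        # about 1117.90
b1 = Sxy / Sxx                       # estimated beta coefficient, about 1.60
b0 = sum_y / n - b1 * sum_x / n      # about -0.14
print(b1, b0)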
We have previously shown that b0 and b1 are unbiased, and derived expressions
for the variance of b0 and b1 . How reliable is the estimate of the beta coefficient in
this example? Our expressions for the variance of b0 and b1 contain the parameter
σ 2 , which is unknown. We need to estimate this if we are to obtain estimated
values for the variances of b0 and b1 .
If β0 and β1 were known, a natural estimator of σ2 would be
\[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
(since εi = yi − β0 − β1 xi, and so the above expression is a sample average of the squared εi, which is an unbiased estimator of σ2 = E(ε2)).
Since β0 and β1 are not known, we might suggest plugging the least squares
estimators of β0 and β1 into the above expression.
In particular, consider the estimator
\[ \sigma^{*2} = \frac{\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2}{n}. \]
What are the properties of this estimator? In particular, is it unbiased for σ 2
(does it give the right value for σ 2 on average with repeated use)? It can be
shown (proof is a special case of later results) that
\[ E(\sigma^{*2}) = \frac{n-2}{n}\,\sigma^2, \]
which suggests using the modified estimator
\[ \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2}{n-2}, \]
which is unbiased for σ 2 .
A useful alternative computational formula for σ̂2 is
\[ \hat{\sigma}^2 = \frac{S_{yy} - b_1^2 S_{xx}}{n-2}, \]
where
\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n}. \]
We can “plug in” the estimate σ̂2 for σ2 in our expressions for the variances of b0 and b1 to obtain estimated variances for these quantities.
For the market model data,
\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n} = 8158.9 - \frac{136.0^2}{60} = 7850.63, \]
so that
\[ \hat{\sigma}^2 = \frac{7850.63 - 1.60^2 \times 1117.90}{58} = 86.01. \]
Hence σ̂ = 9.27.
By “plugging in” this σ̂ value in the expressions for Var(b0) and Var(b1) that we obtained before, we get that the estimated variance of b0 is 1.61 (estimated standard deviation of 1.27) and the estimated variance of b1 is 0.077 (estimated standard deviation of 0.28).
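Continuing in the same way, σ̂2 and the estimated standard deviations of b0 and b1 can be reproduced from the summary statistics; a minimal sketch in Python (only the standard math module is assumed):

import math

n = 60
sum_x, sum_y, sum_y2 = 90.4, 136.0, 8158.9
Sxx, Sxy = 1117.90, 1787.89
b1 = Sxy / Sxx

Syy = sum_y2 - sum_y ** 2 / n                     # about 7850.63
sigma2_hat = (Syy - b1 ** 2 * Sxx) / (n - 2)      # about 86
sigma_hat = math.sqrt(sigma2_hat)                 # about 9.3

var_b1 = sigma2_hat / Sxx                         # about 0.077
xbar = sum_x / n
var_b0 = sigma2_hat * (1 / n + xbar ** 2 / Sxx)   # about 1.6
print(sigma_hat, math.sqrt(var_b0), math.sqrt(var_b1))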
Under the assumption of normally distributed errors, the log-likelihood for the simple linear regression model is
\[ \ell(y; \beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2. \]
For any fixed σ2, the log-likelihood is maximized over β0 and β1 by minimizing the sum of squares in the last term.
That is, the maximum likelihood estimators of β0 and β1 are just the least squares
estimators. It remains to find the maximum likelihood estimator of σ 2 . Differen-
tiating the log-likelihood with respect to σ 2 gives
\[ \frac{\partial}{\partial\sigma^2}\left( -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 \right) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2. \]
Setting this derivative to zero at β0 = b0, β1 = b1 and σ2 = σ∗2 gives
\[ \frac{n}{2\sigma^{*2}} = \frac{1}{2\sigma^{*4}}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2, \]
which gives
\[ \sigma^{*2} = \frac{1}{n}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2, \]
the estimator σ∗2 considered earlier.
An important identity for the fitted simple linear regression model is
\[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \qquad (8) \]
On the left hand side of (8) we have a term which we can think of as being the
total variation in the data – note that this term is the numerator of the expression
we would use for computing the sample variance of the y’s. On the right hand
side we have a sum of two terms: the first term represents the variation of the
predictions about the sample mean of the y’s (we can think of this as the variation
explained by the fit) and the second term represents variation of the responses
from the predictions (we can think of this as variation unexplained by the fit).
We now derive the identity above. We have
\begin{align*}
\sum_{i=1}^{n}(y_i - \bar{y})^2 &= \sum_{i=1}^{n}(y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}),
\end{align*}
so the identity (8) will follow if we can show that the cross term vanishes, that is,
\[ \sum_{i=1}^{n}(y_i - \hat{y}_i)\hat{y}_i - \bar{y}\sum_{i=1}^{n}(y_i - \hat{y}_i) = 0. \qquad (9) \]
Recall that b0 and b1 satisfy the normal equations (4) and (5),
\[ n b_0 + \left( \sum_{i=1}^{n} x_i \right) b_1 = \sum_{i=1}^{n} y_i \]
and
\[ \left( \sum_{i=1}^{n} x_i \right) b_0 + \left( \sum_{i=1}^{n} x_i^2 \right) b_1 = \sum_{i=1}^{n} x_i y_i. \qquad (12) \]
Now,
\begin{align*}
\sum_{i=1}^{n}(y_i - \hat{y}_i)\hat{y}_i &= \sum_{i=1}^{n}(y_i - \hat{y}_i)(b_0 + b_1 x_i) \\
&= b_1 \sum_{i=1}^{n}(y_i - \hat{y}_i)x_i \qquad \text{(since (4) gives } \textstyle\sum_{i=1}^{n}(y_i - \hat{y}_i) = 0\text{)} \\
&= b_1\left( \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n}(b_0 + b_1 x_i)x_i \right) \\
&= b_1\left( \sum_{i=1}^{n} x_i y_i - b_0\sum_{i=1}^{n} x_i - b_1\sum_{i=1}^{n} x_i^2 \right) \\
&= 0
\end{align*}
by (12). Since (4) also gives $\sum_{i=1}^{n}(y_i - \hat{y}_i) = 0$, both terms in (9) vanish and the identity (8) follows.
Coefficient of determination
The identity (8) motivates one way of measuring the goodness of fit of a
statistical model. We introduce the notation
\[ SS_{total} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad SS_{reg} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \qquad SS_{res} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \]
so that (8) can be written
\[ SS_{total} = SS_{reg} + SS_{res}. \qquad (13) \]
We call SStotal the total sum of squares, SSreg the regression sum of squares and
SSres the residual sum of squares: as we have mentioned, SStotal , SSreg and SSres
represent the total variation, variation explained by the fit, and residual variation
respectively. One measure for describing the fit of a linear regression model is
the coefficient of determination,
\[ R^2 = \frac{SS_{reg}}{SS_{total}}, \]
which from (13) can also be written as
\[ R^2 = 1 - \frac{SS_{res}}{SS_{total}}. \]
Clearly R2 is always non-negative and less than or equal to one: it will be one
when SSreg = SStotal , which occurs when SSres = 0 (that is, when the fitted
values are all equal to the responses). So R2 is a measure of the quality of the
fit, with values close to one indicating a good fit.
R2 needs to be used with some care for assessing the quality of linear statistical
models however. When we discuss the general linear model where we consider
multiple predictors, we will see that adding a new predictor cannot decrease R2 ,
even if the predictor is actually unrelated to the response. So according to this
criterion the best model is always the most complex one.
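The fact that adding a predictor cannot decrease R2 is easy to demonstrate numerically. The sketch below (Python, numpy assumed) fits a model by least squares with and without an extra predictor that is pure noise and compares the two R2 values; the simulated data are purely illustrative.

import numpy as np

def r_squared(X, y):
    # Least squares fit and coefficient of determination
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ b) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_total

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0.0, 1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=n)

X1 = np.column_stack([np.ones(n), x])              # intercept and x
X2 = np.column_stack([X1, rng.normal(size=n)])     # plus an unrelated predictor

print(r_squared(X1, y), r_squared(X2, y))          # the second value is never smaller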
Distribution theory for b0, b1 and σ̂2
It can be shown that
\[ b_0 \sim N\!\left( \beta_0,\ \sigma^2\left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) \right) \]
and
\[ b_1 \sim N\!\left( \beta_1,\ \frac{\sigma^2}{S_{xx}} \right). \]
For making inferences about β0 and β1, we will also need to know something about the distribution of σ̂2. It can be shown that
\[ \frac{(n-2)\hat{\sigma}^2}{\sigma^2} \]
has a chi-squared distribution with n − 2 degrees of freedom, and that this variate is independent of b0 and b1.
Consider the statistic
\[ \frac{(b_0 - \beta_0)\Big/\sqrt{\sigma^2\left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}}{\sqrt{\frac{(n-2)\hat{\sigma}^2}{\sigma^2}\Big/(n-2)}}, \]
which takes the form of a standard normal random variable (b0 minus its mean divided by its standard deviation) divided by the square root of an independent $\chi^2_{n-2}$ random variable ((n − 2)σ̂2/σ2) divided by its degrees of freedom. Hence the above statistic has a t distribution with n − 2 degrees of freedom,
\[ \frac{b_0 - \beta_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} \sim t_{n-2}. \]
Similarly,
\[ \frac{b_1 - \beta_1}{\hat{\sigma}/\sqrt{S_{xx}}} \]
has a t distribution with n − 2 degrees of freedom. These statistics are useful for creating confidence intervals for β0 and β1, as we now describe.
Confidence intervals
Consider once more the data on zinc concentrations described in the first lecture.
The data consist of fifteen measurements y1 , ..., y15 of zinc concentrations in a cer-
tain plant taken from fifteen different lakes, and fifteen measurements x1 , ..., x15
of zinc concentrations from sediment samples in the lakes. Using the fact that
\[ \sum_{i=1}^{15} y_i = 808.4, \qquad \sum_{i=1}^{15} x_i = 1205.2, \qquad \sum_{i=1}^{15} x_i^2 = 101228.6, \qquad \sum_{i=1}^{15} y_i^2 = 48130.92, \qquad \sum_{i=1}^{15} x_i y_i = 68716.78, \]
we compute the least squares estimates b0 and b1 of β0 and β1 as −14.9 and 0.857
respectively. We now derive a 95% confidence interval for β1 .
The expression for a 95% confidence interval for β1 is
\[ \left( b_1 - t_{0.025,n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}},\ \ b_1 + t_{0.025,n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}} \right). \]
Now, with b1 = 0.857, Sxx = 4394.8, σ̂ = 10.14 for these data and t0.025,13 = 2.16, the interval can be evaluated directly; a sketch of the computation is given below.
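A sketch of the numerical evaluation in Python (scipy assumed), using the quantities computed earlier for the zinc data:

import math
from scipy import stats

b1, Sxx, sigma_hat, n = 0.857, 4394.8, 10.14, 15

t_crit = stats.t.ppf(0.975, df=n - 2)            # upper 2.5% point of t with 13 df, about 2.16
half_width = t_crit * sigma_hat / math.sqrt(Sxx)
print(b1 - half_width, b1 + half_width)          # 95% confidence interval for the slope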
Hypothesis testing
We can use our results about the distributions of b0, b1 and σ̂2 to test hypotheses about β0 and β1.
To test the hypothesis
H0 : β1 = β
p-values
For H1 : β1 < β the critical region is T < −tα;n−2 and the p-value of the test is p = P(T ≤ t), where t is the observed value of the test statistic.
So far we have only dealt with the construction of tests for the coefficient β1 .
We can also construct a test of the null hypothesis H0 : β0 = β against one and
two sided alternatives.
To test the hypothesis
H0 : β0 = β
versus the alternative
H1 : β0 ̸= β
we use the test statistic
\[ T = \frac{b_0 - \beta}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}, \]
which has a tn−2 distribution under H0. The critical region for a test at significance level α is
T < −tα/2,n−2 or T > tα/2,n−2
and the p-value of the test is p = 2 P(T ≥ |t|), where t is the observed value of T.
Returning to the zinc concentration data, we test
H0 : β1 = 0
versus
H1 : β1 ≠ 0
at the 5% level. We illustrate the computation of the p-value for the test. The test statistic here is
\[ \frac{b_1}{\hat{\sigma}/\sqrt{S_{xx}}}, \]
which we compute as
\[ \frac{0.857}{10.14/\sqrt{4394.8}} = 5.59. \]
Under H0, this is a realization of a t random variable with 13 degrees of freedom, so that if T ∼ t13 we can compute the p-value as p = 2 P(T ≥ 5.59)
(from tables or R). So at the 5% level we reject H0 since p < 0.05. That is, we
believe that the zinc concentration in the sediment samples is a useful predictor
of zinc concentration in the plants.
Instead of computing the p-value we could also have computed the critical
region for the test. The critical region is T < −t0.025;13 or T > t0.025;13, and from R or tables we have t0.025;13 = 2.16, so that the value of the test statistic (5.59) lies in the critical region, and H0 is rejected.
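The test statistic, p-value and critical value can be reproduced as follows (Python, scipy assumed):

import math
from scipy import stats

b1, Sxx, sigma_hat, n = 0.857, 4394.8, 10.14, 15

t_stat = b1 / (sigma_hat / math.sqrt(Sxx))         # about 5.6
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value, far below 0.05
t_crit = stats.t.ppf(0.975, df=n - 2)              # about 2.16
print(t_stat, p_value, abs(t_stat) > t_crit)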
We now derive an F statistic equivalent to the above t test of H0 : β1 = 0, and describe a way of showing the calculations involved using the analysis of variance table. Observe that
\begin{align*}
SS_{reg} &= \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^{n}(b_0 + b_1 x_i - \bar{y})^2 \\
&= \sum_{i=1}^{n}(\bar{y} - b_1\bar{x} + b_1 x_i - \bar{y})^2 \\
&= b_1^2 \sum_{i=1}^{n}(x_i - \bar{x})^2 \\
&= b_1^2 S_{xx}. \qquad (15)
\end{align*}
Now, squaring the statistic (14) gives (under H0) a random variable with an F distribution with 1 and n − 2 degrees of freedom:
\[ \frac{b_1^2 S_{xx}}{\hat{\sigma}^2}. \]
Using (15) we can write this last statistic as
\[ F = \frac{SS_{reg}/1}{SS_{res}/(n-2)}. \qquad (16) \]
So we can write down an F statistic for testing for the usefulness of a predictor in
a simple linear regression model, and this F statistic is defined from the terms in
the fundamental partition of variation we have described previously. The critical
region for the test based on (16) is F > Fα;1,n−2 , where Fα;1,n−2 is the upper
100α percentage point of the F1,n−2 distribution (so we reject H0 : β1 = 0 if the
variation explained by the model is large compared to the residual variation).
The analysis of variance table presents for the simple linear regression model
the decomposition of total variation into SSreg and SSres and shows the compu-
tation of the F statistic above. The analysis of variance table has the form shown below.

Source           DF     SS         MS                     F
Regression       1      SSreg      MSreg = SSreg/1        MSreg/MSres
Residual Error   n-2    SSres      MSres = SSres/(n-2)
Total            n-1    SStotal

The sum of squares column shows the partitioning of the total variation.
The second column, the degrees of freedom column, shows the degrees of freedom
parameters describing the null distribution of the statistic (16) (we will come to
a deeper understanding of the degrees of freedom parameters when we talk about
the general linear model). Entries in the mean square column are computed by
dividing the corresponding entry in the sum of squares column by its degrees of
freedom, and in the final column we have the F statistic which is used for testing
for the usefulness of the predictor. We will generalize this analysis of variance
table later when we talk about the general linear model. The analysis of vari-
ance table is a useful tool for displaying the sources of variation in data, and for
organizing the calculations involved in certain tests of hypotheses. We conclude
this section with two examples.
We return to our example on the market model of stock returns (from Keller,
Warrack and Bartel, “Statistics for Management and Economics: a Systematic
Approach (Second Edition)”, Wadsworth, Belmont, 1990, p. 725). Recall that in
the market model the monthly rate of return on a stock (R) is linearly related to
the monthly rate of return on the overall stock market (Rm ). The mathematical
description of the model is
R = β0 + β1 Rm + ε
where the error term ε is assumed to satisfy the requirements of the simple
linear regression model. We are interested in the coefficient β1 , called the stock’s
beta coefficient, which measures how sensitive the stock’s rate of return is to
changes in the level of the overall market. If β1 > 1 (β1 < 1), the stock’s rate
of return is more (less) sensitive to changes in the level of the overall market
than is the average stock. Our data consist of 5 years of monthly rates of return
on Host International stock and rates of return on the NYSE composite index
(measuring the rate of return on the overall stock market). When we considered
this data set previously, we computed b1 as 1.60, b0 as −0.14 and σ̂ as 9.27. Also, Sxx = 1117.90.
Since there is uncertainty in our estimate of the beta coefficient it is of interest
to compute a confidence interval for this coefficient, and to test the hypothesis
that the beta coefficient is 1 (is there any real evidence that the stock is more or
less sensitive than average to the overall market level given the inherent variation
in the data?) A 100(1 − α) percentage confidence interval for β1 is
\[ \left( b_1 - t_{\alpha/2,n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}},\ \ b_1 + t_{\alpha/2,n-2}\frac{\hat{\sigma}}{\sqrt{S_{xx}}} \right). \]
We can also test
H0 : β1 = 1
versus
H1 : β1 ≠ 1
using the test statistic
\[ \frac{b_1 - 1}{\hat{\sigma}/\sqrt{S_{xx}}}, \]
which evaluates here to
\[ \frac{1.60 - 1}{9.27/\sqrt{1117.90}} = 2.16. \]
So if T ∼ t58, the p-value for the test is p = 2 P(T ≥ 2.16), which can be evaluated as sketched below.
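A sketch of the p-value computation (Python, scipy assumed):

from scipy import stats

t_stat = 2.16    # observed value of (b1 - 1) / (sigma_hat / sqrt(Sxx))
df = 58
p_value = 2 * stats.t.sf(t_stat, df)    # two-sided p-value, roughly 0.035
print(p_value)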
We describe a data set now that we will return to later when considering linear
regression models with multiple predictors. The following example is described in
Gail E. Farrelly, Kenneth R. Ferris and William R. Reichenstein, “Perceived Risk,
Market Risk and Accounting-Determined Risk Measures”, Accounting Review 60
(1985), pp. 287–88.
Investors are interested in assessing the riskiness of a company’s common
stock, as well as its expected rate of return. It is therefore desirable to potential
Asset growth, x4 : Average growth rate in asset size for the years 1977-1981
These seven measures were computed for 25 well-known stocks, based on data
from the companies’ annual reports from 1977-1981. These data were then sent
to a random sample of 500 financial analysts, who “were requested to assess the
risk of each of the 25 companies on a scale of 1 (low) to 9 (high), assuming
that the stock was to be added to a diversified portfolio.” The mean rating, y,
assigned by the 209 financial analysts who responded is recorded for each of the
25 stocks. This measure of the financial analysts’ risk perception was taken to
be a reasonable surrogate for the (market) risk of each stock.
Prediction of market risk based on a linear regression model with multiple
predictors is a problem we will consider later. For the moment, we consider a
simple linear regression model for predicting market risk based on asset size. A
scatterplot of market risk versus asset size is shown in Figure 2.3. We do not
describe in detail the computation of parameters or computation of test statistics
here, but the fitted line is
the estimated error standard deviation is 1.475, and the p-value for testing H0 :
β1 = 0 versus H1 : β1 ̸= 0 using the usual t-statistic is p = 0.0211. So it
seems that asset size is a useful predictor of market risk. In later work on linear
regression models with multiple predictors we consider the problem of which
[Figure 2.3: Scatter plot of assessment of market risk versus asset size with fitted least squares regression line.]
predictors should be chosen for predicting market risk from amongst the seven
predictors described above. Choosing subsets of predictors when there are many
to choose from in explaining variation in the response is a much trickier problem
than the one we have considered here, and further discussion is delayed until we
have given a thorough treatment of the general linear model.
Suppose we wish to predict a new response value y0 at a predictor value x0. Under the model,
\[ y_0 = \beta_0 + \beta_1 x_0 + \varepsilon_0, \]
where ε0 is a normally distributed error term with mean zero and variance σ2, in-
dependent of y1 , ..., yn . A confidence interval for the mean response at x0 will take
into account the uncertainty in estimating β0 and β1 , and a prediction interval for
y0 (which is a random interval containing y0 with specified probability) will take
into account both the uncertainty in estimating β0 and β1 and the variability of ε0 .
First we give an expression for the variance of ŷ(x0) = b0 + b1 x0, the fitted value at x0: it can be shown that
\[ \mathrm{Var}(\hat{y}(x_0)) = \sigma^2\left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right). \]
In discussing the general linear model later we will show that b0 and b1 are jointly normally distributed, so that ŷ(x0) (which is a linear combination of b0 and b1) is normally distributed. Hence, since E(ŷ(x0)) = E(b0 + b1 x0) = β0 + β1 x0, we have
\[ \hat{y}(x_0) \sim N\!\left( \beta_0 + \beta_1 x_0,\ \sigma^2\left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \right) \]
or
\[ \frac{\hat{y}(x_0) - \beta_0 - \beta_1 x_0}{\sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim N(0, 1). \qquad (17) \]
Replacing σ by its estimate σ̂ (and arguing as before using the independent χ2 variate (n − 2)σ̂2/σ2), the resulting statistic has a t distribution with n − 2 degrees of freedom. Hence if tα/2,n−2 is the upper 100 × α/2 percentage point of a t distribution with n − 2 degrees of freedom, we have
\[ P\!\left( -t_{\alpha/2,n-2} \le \frac{\hat{y}(x_0) - \beta_0 - \beta_1 x_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \le t_{\alpha/2,n-2} \right) = 1 - \alpha \]
or
\[ P\!\left( \hat{y}(x_0) - t_{\alpha/2,n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \le \beta_0 + \beta_1 x_0 \le \hat{y}(x_0) + t_{\alpha/2,n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right) = 1 - \alpha. \]
Hence a 100(1 − α) percentage confidence interval for the mean response β0 + β1 x0 when the predictor is x0 is
\[ \left( \hat{y}(x_0) - t_{\alpha/2,n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},\ \ \hat{y}(x_0) + t_{\alpha/2,n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right). \]
Prediction intervals

Since y0 = β0 + β1 x0 + ε0 is independent of ŷ(x0), and Var(y0 − ŷ(x0)) = Var(ε0) + Var(ŷ(x0)) = σ2(1 + 1/n + (x0 − x̄)2/Sxx), we have
\[ \frac{y_0 - \hat{y}(x_0)}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim N(0, 1) \]
and hence
\[ \frac{y_0 - \hat{y}(x_0)}{\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t_{n-2}, \]
so that a 100(1 − α) percentage prediction interval for y0 is
\[ \left( \hat{y}(x_0) - t_{\alpha/2,n-2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},\ \ \hat{y}(x_0) + t_{\alpha/2,n-2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right). \]
The expression given above for the prediction interval is a very natural one. It
tells us that our uncertainty of prediction is related to the residual error variance
(estimated by σ̂2), the sample size, and how far x0 is away from the mean of the
predictors x̄. If x0 is far away from where most of the observed predictors lie, then
our uncertainty of prediction is increased. We must in any case be very careful
about prediction for values of the predictor far away from any of the observed xi :
while linearity and the other assumptions of the simple linear regression model
may seem to hold locally, these assumptions can break down when we are far
from observed values of the predictors. Extrapolation is dangerous!
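The interval formulas above translate directly into code. Below is a minimal sketch in Python (numpy and scipy assumed) of a function returning the confidence interval for the mean response and the prediction interval at a new predictor value x0; the function and variable names are ours.

import numpy as np
from scipy import stats

def intervals_at(x, y, x0, level=0.95):
    # Confidence interval for the mean response and prediction interval
    # for a new response at x0, for a simple linear regression fit
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * y) / Sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 2)

    y0_hat = b0 + b1 * x0
    se_mean = sigma_hat * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
    se_pred = sigma_hat * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
    ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
    pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
    return ci, pi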
[Figure 2.4: Scatter plot of assessment of market risk versus asset size with fitted least squares regression line and 95 percent confidence intervals for the mean (inner bands) and 95 percent prediction intervals (outer bands).]
Suppose that for parameters θ1, ..., θk we wish to test the individual hypotheses H0 : θi ∈ Θi simultaneously; that is, to test
H0 : θ1 ∈ Θ1 , ..., θk ∈ Θk
against
H1 : Not all θi ∈ Θi , i = 1, ..., k
at significance level α. One approach to this test is to conduct individual tests on
the coefficients θi and to reject the null hypothesis if any of the individual tests
are rejected. But what significance level should be used for the individual tests
to ensure significance level α for the joint test? It turns out that if we adjust
the significance level for the individual tests to be α/k, then the joint test has
significance level less than or equal to α. This is the Bonferroni adjustment to
the significance level.
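A minimal sketch of the idea in Python (the p-values used here are purely hypothetical): each individual test is carried out at level α/k, and the joint null hypothesis is rejected if any individual test rejects.

def bonferroni_joint_test(p_values, alpha=0.05):
    # Reject the joint null hypothesis if any individual p-value falls
    # below the Bonferroni-adjusted level alpha / k
    k = len(p_values)
    return any(p < alpha / k for p in p_values)

# Hypothetical p-values for k = 3 individual tests
print(bonferroni_joint_test([0.030, 0.20, 0.70]))   # 0.030 > 0.05/3, so do not reject
print(bonferroni_joint_test([0.010, 0.20, 0.70]))   # 0.010 < 0.05/3, so reject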
We should point out that the Bonferroni adjustment is conservative in the
sense that the Bonferroni confidence intervals discussed above give a joint cov-
erage which is usually greater than 100(1 − α)%, and the Bonferroni adjustment
Residuals

Recall the simple linear regression model
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \]
where the εi , i = 1, ..., n, are uncorrelated zero mean errors with common variance
σ 2 . We can estimate β0 and β1 via the least squares estimators b0 and b1 . The
fitted values are
\[ \hat{y}_i = b_0 + b_1 x_i \]
for i = 1, ..., n. We define the residuals of the fitted model to be the differences between the responses and the fitted values,
\[ e_i = y_i - \hat{y}_i, \]
i = 1, ..., n. If we can estimate β0 and β1 precisely using b0 and b1 , then the resid-
uals should be approximately the error terms εi and so we can use the residuals
to check some of the assumptions made about the error terms in the simple linear
regression model. As well as the residuals ei (sometimes called the raw residuals)
there are other kinds of residuals which may be more useful in some situations,
but we will not discuss these at the moment.
What happens when the assumptions of the simple linear regression model
are violated? It is common for statisticians to use plots of the residuals against
the fitted values or predictor values to try to detect an incorrect specification of
the systematic part of the model or to detect departures from the constancy of
variance assumption for the errors. Since the error terms in the simple linear
regression model are supposed to have zero mean, any systematic trend in the
residuals as the fitted values or predictor values increase provides some evidence
that the errors do not have zero mean and that the mean structure is not correctly
specified. Similarly, increasing variability as the fitted values or predictor values
increase can indicate that the residuals do not have a common variance σ 2 . We
can also use the residuals to check the assumption of normality of errors, although
we do not discuss this further at the present time.
The usefulness of residual plots for model criticism is perhaps best shown via
some examples. Figure 2.5 shows a scatterplot of some synthetic data together
with a fitted linear regression model. Below is a plot of the residuals versus
the fitted values: there is a clear trend in the mean level of the residuals as
the fitted values increase, suggesting that perhaps a model which is nonlinear in
the predictor would be more appropriate than the model considered here. This
example shows how beneficial residual plots can be for detecting departures from
the assumed mean structure for the model.
We give another synthetic example which shows the usefulness of the residuals
for detection of violations of the assumption of constancy of variance of the errors.
It often happens in data sets from many areas of science, social science and
economics that the variation in the response increases as the mean response
increases. Figure 2.6 shows a scatterplot for a synthetic data set which exhibits
this kind of behaviour, together with a fitted linear regression model. Below it
is the corresponding residual plot: the increasing trend in the variance of the
residuals is obvious.
We will say much more about residuals later in this course.
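A minimal sketch of how such residual plots might be produced in Python (numpy and matplotlib assumed); the simulated data are only an illustration of increasing variance.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=60)
y = (1.0 + np.exp(x)) * rng.normal(1.0, 0.1, size=60)   # spread grows with the mean

# Least squares fit of a simple linear regression
b1 = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
residuals = y - fitted

fig, axes = plt.subplots(2, 1)
axes[0].scatter(x, y)
axes[0].plot(np.sort(x), b0 + b1 * np.sort(x))
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")
axes[1].scatter(fitted, residuals)
axes[1].axhline(0.0)
axes[1].set_xlabel("Fitted values")
axes[1].set_ylabel("Residuals")
plt.show()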
[Figure 2.5: Scatter plot of synthetic data set with fitted linear regression (top) and plot of residuals versus fitted values (bottom).]
[Figure 2.6: Scatter plot of synthetic data set with fitted linear regression (top) and plot of residuals versus fitted values (bottom).]
3 The General Linear Model
Asset growth, x4 : Average growth rate in asset size for the years 1977- 1981
These measures of risk were determined for 25 companies, and sent to 500 ran-
domly chosen financial analysts, 209 of whom responded to a request to rank the
risk of each of the 25 companies on a scale of 1 (low) to 9 (high), assuming that
the stock was to be added to a diversified portfolio. The average of the ratings
(y) for the 209 analysts was then computed for each of the 25 stocks.
We are interested in predicting risk (the response y) in terms of the accounting-
determined measures of risk x1 − x7 . One way of approaching this prediction
problem is to build a linear model for the response in terms of the multiple
predictors x1 − x7 . Specifically, we write
\[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_7 x_{i7} + \varepsilon_i, \qquad i = 1, ..., 25, \]
where yi is the risk for the ith company, xi1 , ..., xi7 are the accounting-determined
measures of risk for the ith company, and εi , i = 1, ..., 25 are error terms which
are assumed uncorrelated with a common variance.
Some of the problems we face in using this model to predict risk are the same
as problems we have faced for the simple linear regression model, but some are
more complicated. We need to estimate the parameters β0 − β7 , and the error
variance σ 2 . We also need to decide what subset of the predictors is most useful
for predicting the risk (and deciding this question can be much more complicated
than deciding whether a single predictor is useful or not in a simple linear regres-
sion, particularly if the various predictors considered contain much the same kind
of information). We must also develop tools for checking model assumptions, and
once we have a model we are happy with we must be able to use it for making
predictions and for describing our uncertainty about those predictions. These
problems and others are considered in the remainder of the course.
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i. \]
That is, we can use the general linear model with three predictors (the original predictor x together with its square and cube) as a model for the variation in the response.
[Figure 3.7: Scatter plot for synthetic data set showing non-linear relationship between mean response and predictor.]
This model is still linear in the parameters, and hence it is a linear model by our
definition. We can describe quite complicated relationships between the mean re-
sponse and a predictor using polynomials, and we may wish to develop inferential
techniques for deciding on an appropriate order for a polynomial approximation
(we have used a cubic polynomial here). We will develop hypothesis tests for
the linear model that can provide one approach to problems of model selection
such as this. There are other flexible approximation techniques for non-linear
relationships between the mean response and a predictor which perform better
than polynomial regression for many purposes.
As another example we consider a synthetic data set in which there are two
predictors of the response. Below is a meshplot showing the responses, the values
of the predictors, and the true mean response as a function of the predictors.

[Figure 3.8: Scatter plot for synthetic data set showing non-linear relationship between mean response and two predictors. The circles show the response values, and the surface shows the true mean response used in the simulation.]

I generated the responses from a bivariate polynomial in the two predictors plus an error term εi, for i = 1, ..., 50, where Var(εi ) = 1.0 here and the εi were taken to be normally
distributed. This data set also comes from a linear model, as the relationship of
the mean response to the predictors in the above model is linear in the parameters
(the coefficients in the above bivariate polynomial would be unknown for a real
data set and would need to be estimated from the data). By now you should
realize that the linear model is a very flexible tool for modelling complicated
relationships between a response variable and a set of predictors.
Before we go on we formulate the linear model in matrix notation: this matrix
notation will be used throughout the rest of the course, and the use of matrix
language is indispensible in describing the theory of linear models. Write y for
the vector of responses, y = (y1 , ..., yn )⊤ where a⊤ denotes the transpose of the
vector a here (so y is a column vector). Also, write β for the vector of the mean
parameters in the linear model, β = (β0 , ..., βk )⊤, and letting p = k + 1 (so that
p is the length of the vector β), write X for the n × p matrix whose 1st column
is a column of 1’s and whose (i + 1)th column is the vector (x1i , ..., xni )⊤ for
i = 1, ..., k. Finally write ε = (ε1 , ..., εn ) for the vector of errors in the linear
model. Then we can write the general linear model in matrix notation as
\[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \ldots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \ldots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix} \]
or
\[ y = X\beta + \varepsilon. \]
In the model y = Xβ + ε, the least squares estimator of β minimizes the sum of squared differences between the observations and predictions (that is, the sum of the squared elements of the vector y − Xβ). Note that for any n × 1 vector x, x⊤x is the scalar value $\sum_{i=1}^{n} x_i^2$, so that we can write this sum of squared differences as
\[ (y - X\beta)^\top (y - X\beta). \]
First we recall a result about matrix transposes: for two matrices A and B which
have dimensions so that A and B can be multiplied together, the transpose of
AB, (AB)⊤ is equal to B ⊤ A⊤ . If you don’t know this result, or have forgotten it,
then you should prove it as an (easy) exercise. Of course, using the above result,
if A, B and C are three matrices such that the product ABC is well defined,
then it is easy to show that (ABC)⊤ = C ⊤ B ⊤ A⊤ , and a similar result holds for
a product of four or more matrices.
Now, observe that (19) can be written
β ⊤ X ⊤ y = (X ⊤ y)⊤ β.
We must take the partial derivatives of the above expression with respect to the
elements of the vector β, and set these to zero to obtain the least squares esti-
mator of β. We do this now, but first we state some results which will help us
about differentiating vector expressions.
For a real valued function f (x) defined for an argument x = (x1 , ..., xp ) which is
a p-dimensional vector of real numbers, we define the derivative of f with respect
to x as the vector
\[ \frac{\partial f(x)}{\partial x} = \left( \frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_p} \right)^{\!\top}. \]
With this notation, in order to find the least squares estimator we need to find the value b such that
\[ \frac{\partial}{\partial \beta}(y - X\beta)^\top (y - X\beta) \]
is zero at β = b. To differentiate the least squares criterion with respect to β, the following result is useful.
If
\[ f(x) = a^\top x \]
then
\[ \frac{\partial f(x)}{\partial x} = a. \]
Also, if
\[ f(x) = x^\top A x \]
then
\[ \frac{\partial f(x)}{\partial x} = Ax + A^\top x. \]
Using the above result to differentiate the least squares criterion, we have
\begin{align*}
\frac{\partial}{\partial\beta}(y - X\beta)^\top(y - X\beta) &= \frac{\partial}{\partial\beta}\left( y^\top y - 2(X^\top y)^\top \beta + \beta^\top X^\top X \beta \right) \\
&= -2X^\top y + X^\top X\beta + (X^\top X)^\top \beta \\
&= -2X^\top y + 2X^\top X\beta.
\end{align*}
Setting this derivative to zero at β = b gives
\[ -2X^\top y + 2X^\top X b = 0. \]
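Numerically, the least squares estimator is obtained by solving the normal equations X⊤Xb = X⊤y. A minimal sketch in Python (numpy assumed), using an arbitrary simulated design matrix:

import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 0.2, size=n)

# Solve the normal equations X'X b = X'y (preferable to forming an explicit inverse)
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)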
Lemma:
We end this subsection with two examples. In the first, we give the least
squares estimates of parameters in a multiple linear regression model for the risk
assessment data. In the second example, we show for the simple linear regression
model that the expression we have just derived in matrix notation for the least
squares estimator is equivalent to the expressions we gave in our previous work
for b0 and b1 , the least squares estimators of β0 and β1 .
For the risk assessment data, the response yi was risk assessment, and there were
seven accounting determined measures of risk (predictors) xi1 , ..., xi7 . The model
is
yi = β0 + β1 xi1 + ... + β7 xi7 + εi .
The result of fitting the parameters β0 , ..., β7 using the least squares criterion is
given next.
Regression Analysis
For the simple linear regression model we have
\[ X^\top X = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} \]
and
\[ (X^\top X)^{-1} = \frac{1}{nS_{xx}} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix}. \]
Also,
\[ X^\top y = \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}. \]
Hence
\[ b = (X^\top X)^{-1}(X^\top y) = \frac{1}{nS_{xx}} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i \\ n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i \end{pmatrix}. \qquad (22) \]
Using our previous notation, we recognize the second element of (22) as (nSxy )/(nSxx ) =
Sxy /Sxx , which was the expression we derived for the least squares estimator of
the slope when studying the simple linear regression model. Our previous expres-
sion for the least squares estimator of the intercept was
b0 = ȳ − b1 x̄
Lemma:
Proof:
∑k
E(a⊤ Y ) = E( ai Y i )
i=1
∑
k
= ai E(Yi )
i=1
∑k
= ai µ i
i=1
⊤
= a µ
∑k
iii) (AY )i = m=1 Aim Ym and so
∑
k
E((AY )i ) = E( Aim Ym )
m=1
∑
k
= Aim E(Ym )
m=1
∑k
= Aim µm
m=1
= (Aµ)i .
Therefore
E(AY ) = Aµ.
Covariance matrices
Let Y be a k × 1 random vector with E(Y ) = µ. The covariance matrix of Y is the k × k matrix whose (i, j)th entry is Cov(Yi , Yj ) = E((Yi − µi )(Yj − µj )). If we define the expected value of a random matrix to be the matrix whose (i, j)th entry is just the expectation of the (i, j)th entry of the random matrix, and if we use the notation Var(Y ) for the covariance matrix of Y , we can write
\[ \mathrm{Var}(Y) = E\left( (Y - \mu)(Y - \mu)^\top \right), \]
since
(Y − µ)(Y − µ)⊤
is the k × k matrix whose (i, j)th element is (Yi − µi )(Yj − µj ). Note that the ith
diagonal element of Var(Y ) is simply the variance of Yi , i = 1, ..., k. Also, Var(Y )
is symmetric, since Cov(Yi , Yj ) = Cov(Yj , Yi ). We can now state the following
result (Raymond H. Myers and Janet S. Milton, “A First Course in the Theory
of Linear Statistical Models,” PWS-KENT, Boston, 1991, p. 54).
Lemma:
Let Y be a k × 1 random vector with mean µ and covariance matrix V = Var(Y ), let a be a k × 1 vector of constants, and let A be an m × k matrix of constants. Then i) Var(a⊤Y ) = a⊤V a, and ii) writing Z = AY , Var(Z) = AV A⊤.
Proof:
i)
\begin{align*}
\mathrm{Var}(a^\top Y) &= \mathrm{Var}\left( \sum_{i=1}^{k} a_i Y_i \right) \\
&= E\left( \sum_{i=1}^{k} a_i Y_i - \sum_{i=1}^{k} a_i \mu_i \right)^{\!2} \\
&= E\left( \sum_{i=1}^{k} a_i (Y_i - \mu_i) \right)^{\!2} \\
&= E\left( \sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j (Y_i - \mu_i)(Y_j - \mu_j) \right) \\
&= \sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j\, E((Y_i - \mu_i)(Y_j - \mu_j)) \\
&= \sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j V_{ij} \\
&= a^\top V a.
\end{align*}
ii)
\begin{align*}
\mathrm{Cov}(Z_i, Z_j) &= \mathrm{Cov}\left( \sum_{q=1}^{k} A_{iq} Y_q,\ \sum_{r=1}^{k} A_{jr} Y_r \right) \\
&= E\left( \left( \sum_{q=1}^{k} A_{iq} Y_q - \sum_{q=1}^{k} A_{iq}\mu_q \right)\left( \sum_{r=1}^{k} A_{jr} Y_r - \sum_{r=1}^{k} A_{jr}\mu_r \right) \right) \\
&= E\left( \left( \sum_{q=1}^{k} A_{iq}(Y_q - \mu_q) \right)\left( \sum_{r=1}^{k} A_{jr}(Y_r - \mu_r) \right) \right) \\
&= E\left( \sum_{q=1}^{k} \sum_{r=1}^{k} A_{iq} A_{jr} (Y_q - \mu_q)(Y_r - \mu_r) \right) \\
&= \sum_{q=1}^{k} \sum_{r=1}^{k} A_{iq} A_{jr} V_{qr} \\
&= \sum_{q=1}^{k} \sum_{r=1}^{k} A_{iq} V_{qr} A_{jr} \\
&= \sum_{q=1}^{k} A_{iq} (V A^\top)_{qj} \\
&= (A V A^\top)_{ij}.
\end{align*}
So
\[ \mathrm{Var}(Z) = A V A^\top. \]
Theorem:
In the full rank linear model, the least squares estimator b = (X ⊤ X)−1 X ⊤ y is
unbiased,
E(b) = β
with covariance matrix
Var(b) = σ 2 (X ⊤ X)−1 .
Proof:
Using the above lemmas, and noting that E(y) = Xβ, we have
\[ E(b) = E\left( (X^\top X)^{-1}X^\top y \right) = (X^\top X)^{-1}X^\top E(y) = (X^\top X)^{-1}X^\top X\beta = \beta, \]
so that b is unbiased. Next we consider Var(b). Also from the above lemma, and noting that Var(y) = σ2I, where I is the identity matrix, we have
\[ \mathrm{Var}(b) = (X^\top X)^{-1}X^\top\,\mathrm{Var}(y)\,\left( (X^\top X)^{-1}X^\top \right)^{\!\top} = \sigma^2 (X^\top X)^{-1}X^\top X (X^\top X)^{-1}, \]
where we have used the fact that for a nonsingular matrix A, (A⊤)−1 = (A−1)⊤. Hence
\[ \mathrm{Var}(b) = \sigma^2 (X^\top X)^{-1}. \]
The ability to compute the covariance matrix of the estimator b provides us with
a way of comparing the properties of this estimator with alternative estimators.
The least squares estimator of β is an example of a linear estimator which is an
estimator of the form Ay for some p × n matrix A (setting A = (X ⊤ X)−1 X ⊤
gives the least squares estimator). The Gauss-Markov Theorem states that if b∗
is any unbiased linear estimator of β, and if b is the least squares estimator, then
Var(bi) ≤ Var(b∗i), i = 1, ..., p. We say that b is the best linear unbiased estimator
(BLUE) of β. We prove this result now.
Theorem:
The best linear unbiased estimator of β in the full rank linear model is the least
squares estimator b.
Proof:
Any linear estimator of β can be written b∗ = ((X⊤X)−1X⊤ + B)y for some p × n matrix B. Then
\[ E(b^*) = \left( (X^\top X)^{-1}X^\top + B \right)X\beta = \beta + BX\beta. \]
Now, we know that b∗ is unbiased, so for the above to hold for arbitrary β we must have BX = 0. Now consider Var(b∗). We have that
\begin{align*}
\mathrm{Var}(b^*) &= \mathrm{Var}\left( \left( (X^\top X)^{-1}X^\top + B \right)y \right) \\
&= \sigma^2\left( (X^\top X)^{-1}X^\top + B \right)\left( (X^\top X)^{-1}X^\top + B \right)^{\!\top} \\
&= \sigma^2\left( (X^\top X)^{-1}X^\top + B \right)\left( X(X^\top X)^{-1} + B^\top \right) \\
&= \sigma^2\left( (X^\top X)^{-1}X^\top X(X^\top X)^{-1} + (X^\top X)^{-1}X^\top B^\top + BX(X^\top X)^{-1} + BB^\top \right).
\end{align*}
But BX = 0, and so the third term on the right in the above expression is zero:
also, the second term is zero, as (X ⊤ B ⊤ ) = (BX)⊤ = 0. Hence
\[ \mathrm{Var}(b^*) = \sigma^2\left( (X^\top X)^{-1} + BB^\top \right). \]
Note that the diagonal elements of Var(b∗ ) (which are the variances of the ele-
ments of b∗ ) are just the variances of the least squares estimators plus the diagonal
elements of BB⊤. But the ith diagonal element of BB⊤ is
\[ (BB^\top)_{ii} = \sum_{j=1}^{n} B_{ij}^2, \]
which is non-negative. Hence Var(b∗i) ≥ Var(bi) for i = 1, ..., p, which proves the result.
Under normal assumptions, we can write down the likelihood in the case of the general linear model as
\[ L(\beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left( -\frac{1}{2\sigma^2}(y_i - (X\beta)_i)^2 \right) = (2\pi\sigma^2)^{-n/2}\exp\!\left( -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - (X\beta)_i)^2 \right). \]
But
\[ \sum_{i=1}^{n}(y_i - (X\beta)_i)^2 = (y - X\beta)^\top(y - X\beta), \]
so the log-likelihood is
\[ \ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta). \qquad (23) \]
We also find the maximum likelihood estimator for σ 2 . Differentiating (23) with
respect to σ2, we get
\[ \frac{\partial}{\partial\sigma^2}\,\ell(\beta, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)^\top(y - X\beta). \]
Writing b for the least squares estimator of β, and σ ∗ 2 for the maximum likelihood
estimator of σ 2 , we have
\[ \frac{n}{2\sigma^{*2}} = \frac{1}{2\sigma^{*4}}(y - Xb)^\top(y - Xb), \]
so that
\[ \sigma^{*2} = \frac{1}{n}(y - Xb)^\top(y - Xb). \]
Note that this expression agrees with the expression we obtained before for the
simple linear regression model.
We showed for the case of the simple linear regression model that this estimator
was biased. What is E(σ ∗ 2 ) in the general linear model?
Matrices like I − X(X ⊤ X)−1 X ⊤ which are unchanged when multiplied by them-
selves have a special name: they are called idempotent matrices, and they play
an important role in the theory of linear models.
To see (24), observe that
The following properties of the trace are easy to prove (see Raymond H. Myers
and Janet S. Milton, “A First Course in the Theory of Linear Statistical Models,”
PWS-KENT, Boston, 1991, p. 26.)
Theorem:
We are now in a position to prove the result we need in order to find the mean
of σ ∗ 2 . We are interested in quadratic forms of a random vector, which are ex-
pressions of the form y ⊤ Ay for a random vector y and matrix A.
Lemma:
Let y be a random vector with mean µ and covariance matrix Σ, and let A be a matrix of constants. Then
\[ E(y^\top A y) = \mathrm{tr}(A\Sigma) + \mu^\top A \mu. \]
Proof:
Unbiased estimation of σ 2
Applying the above lemma to compute E(σ∗2) and observing that Cov(y) = σ2I, we have
\begin{align*}
E(\sigma^{*2}) &= \frac{1}{n} E\left( y^\top(I - X(X^\top X)^{-1}X^\top)y \right) \\
&= \frac{1}{n}\left( \sigma^2\,\mathrm{tr}(I - X(X^\top X)^{-1}X^\top) + (X\beta)^\top(I - X(X^\top X)^{-1}X^\top)(X\beta) \right) \\
&= \frac{1}{n}\left( \sigma^2\,\mathrm{tr}(I) - \sigma^2\,\mathrm{tr}(X(X^\top X)^{-1}X^\top) + \beta^\top X^\top(I - X(X^\top X)^{-1}X^\top)X\beta \right) \\
&= \frac{1}{n}\left( \sigma^2 n - \sigma^2\,\mathrm{tr}(X^\top X(X^\top X)^{-1}) + \beta^\top X^\top X\beta - \beta^\top X^\top X(X^\top X)^{-1}X^\top X\beta \right),
\end{align*}
where we have used the fact that tr(AB) = tr(BA) for matrices A and B for which these products are defined. Observe that (X⊤X)−1(X⊤X) is the p × p identity matrix, so that
\[ E(\sigma^{*2}) = \frac{n-p}{n}\,\sigma^2. \]
So the maximum likelihood estimator is biased, and the bias depends on the
number of mean parameters p. The above expressions suggest estimation of σ 2
by
\[ \hat{\sigma}^2 = \frac{1}{n-p}(y - Xb)^\top(y - Xb), \]
which will be unbiased.
We conclude this section with an example where we illustrate the computation
of the least squares estimates of β, the estimation of σ 2 , and the computation of
estimated standard errors for the least squares estimators.
This data set is from Gerald Keller, Brian Warrack and Henry Bartel, “Statistics
for Management and Economics: a Systematic Approach,” Wadsworth, Belmont,
California, 1990, p. 804.
Many people have attempted to get an overall picture of business failures and
bankruptcies – both by studying the incidence of business failure on a national or
industry-wide scale and by analyzing individual firms one by one – with varying
degrees of success. In the latter instance, promising results have been achieved
by using various accounting and financial ratios to predict bankruptcies.
In looking at the overall picture, some analysts have been tempted by dramatic
swings in the number of business failures from year to year to try to relate them
to overall economic conditions. The annual percentage change in certain key
economic indicators is reproduced in the table below.
In this example we fit a linear model (multiple regression model) to these
data, using percentage change in business failures as the response and percentage
change in the economic indicators shown in the table as predictors. For the
moment we do not discuss model selection or criticism of a model, but simply fit
this linear model. The coefficient estimates obtained from a common statistical
package are shown below.
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -59.4138 47.4629 -1.2518 0.2572
Real.domestic.product 2.9335 3.1052 0.9447 0.3813
Wages.and.Salaries 3.7996 2.8988 1.3108 0.2379
Unemployment.Rate 8.6019 5.7039 1.5081 0.1823
Retail.sales -5.5136 3.2695 -1.6864 0.1427
Housing.Starts -0.2079 0.5414 -0.3840 0.7142
C.P.I. 1.7576 5.0222 0.3500 0.7383
From this, we have that σ̂2(X⊤X)−1 is
2251.9 −46.619 −101.23 −237.93 81.109 −13.972 −18.089
−46.619 9.6388 −1.0828 0.89594 −4.3519 0.20687 8.3275
−101.23 −1.0828 8.4000 11.622 −4.5641 0.43326 −3.4736
−237.93 0.89594 11.622 32.522 −6.6688 1.2118 −6.4694
81.109 −4.3519 −4.5641 −6.6688 10.686 −0.91736 −10.615
−13.972 0.20687 0.43326 1.2118 −0.91737 0.29297 1.3521
−18.089 8.3275 −3.4736 −6.4694 −10.615 1.3521 25.213
Taking the square root of the diagonal elements of this matrix gives the values
which after allowing for rounding error are the entries given in the table.
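For example, the standard errors can be recovered from the diagonal of the estimated covariance matrix as follows (Python, numpy assumed); only the diagonal entries quoted above are used.

import numpy as np

# Diagonal entries of the estimated covariance matrix of b given above
diag = np.array([2251.9, 9.6388, 8.4000, 32.522, 10.686, 0.29297, 25.213])
print(np.sqrt(diag))   # approximately 47.46, 3.10, 2.90, 5.70, 3.27, 0.54, 5.02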
To construct our confidence intervals we will first need to give some results about
the distribution of the least squares estimator b of β and the distribution of our
estimator σb2 of the residual error variance.
To state the distribution of b we need to introduce the multivariate normal
distribution. You are of course already familiar with the univariate normal dis-
tribution. An n × 1 random vector Y is said to have a multivariate normal
distribution with mean vector µ and covariance matrix Σ (and we write this as
Y ∼ N (µ, Σ)) if it has the density
\[ f_Y(y; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\!\left( -\frac{1}{2}(y - \mu)^\top\Sigma^{-1}(y - \mu) \right), \]
where |Σ| denotes the determinant of the matrix Σ. You will study some of the
properties of the multivariate normal distribution in later statistics courses. It
can be shown that the n × 1 vector µ actually is the mean of a random vec-
tor with the above density. Similarly, if Y is a random vector with the above
density, Var(Y ) = Σ. The mean vector µ and covariance matrix Σ completely
characterize a multivariate normal distribution. Note that if Y = (Y1 , ..., Yn ) is a
vector of univariate independent normal random variables with E(Yi ) = µi and
Var(Yi ) = σ 2 , then Y is a multivariate normal random vector according to the
above definition with mean µ and covariance matrix σ 2 I, where I is the n × n
identity matrix here. In order to state the distribution of b, we need the following
result.
Lemma:
Let Y be an n×1 multivariate normal random vector with mean µ and covariance
matrix Σ,
Y ∼ N (µ, Σ).
If A is a matrix of constants and Z = AY , then
\[ Z \sim N(A\mu,\ A\Sigma A^\top). \]
Distribution of b
Theorem:
In the full rank linear model, the least squares estimator b of β has a multivariate
normal distribution with mean β and covariance matrix σ 2 (X ⊤ X)−1 ,
b ∼ N (β, σ 2 (X ⊤ X)−1 ).
Proof:
We have already proven that the mean and covariance matrix of b are given
by E(b) = β and Var(b) = σ2(X⊤X)−1. The fact that b is multivariate normal
follows from multivariate normality of y, the fact that b is a linear transformation
of y,
b = (X ⊤ X)−1 X ⊤ y,
and the above lemma.
Interval estimation
We can now write down test statistics which can be used for constructing confi-
dence intervals for the parameters βj .
Write cjj j = 0, ..., k for the diagonal elements of (X ⊤ X)−1 . Then we know
that
bj − β j
√ ∼ N (0, 1).
σ cjj
Also
(n − p)b
σ2
∼ χ2n−p
σ2
and (n − p)σ̂2/σ2 is independent of bj. So
\[ \frac{b_j - \beta_j}{\sigma\sqrt{c_{jj}}} \Bigg/ \sqrt{\frac{(n-p)\hat{\sigma}^2}{\sigma^2}\Big/(n-p)} \;=\; \frac{b_j - \beta_j}{\hat{\sigma}\sqrt{c_{jj}}} \;\sim\; t_{n-p}. \]
We can use the above statistic to derive a confidence interval for βj. Writing tα/2,n−p for the upper 100α/2 percentage point of a t distribution with n − p degrees of freedom, we have that
\[ P\!\left( -t_{\alpha/2,n-p} \le \frac{b_j - \beta_j}{\hat{\sigma}\sqrt{c_{jj}}} \le t_{\alpha/2,n-p} \right) = 1 - \alpha, \]
or
\[ P\!\left( -b_j - t_{\alpha/2,n-p}\,\hat{\sigma}\sqrt{c_{jj}} \le -\beta_j \le -b_j + t_{\alpha/2,n-p}\,\hat{\sigma}\sqrt{c_{jj}} \right) = 1 - \alpha, \]
which gives
\[ P\!\left( b_j - t_{\alpha/2,n-p}\,\hat{\sigma}\sqrt{c_{jj}} \le \beta_j \le b_j + t_{\alpha/2,n-p}\,\hat{\sigma}\sqrt{c_{jj}} \right) = 1 - \alpha. \]
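A minimal sketch in Python (numpy and scipy assumed) of how confidence intervals for all the coefficients might be computed from the design matrix and response vector; the function name is ours.

import numpy as np
from scipy import stats

def coefficient_intervals(X, y, level=0.95):
    # 100*level percent confidence intervals for the coefficients of a
    # full rank linear model fitted by least squares
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    sigma2_hat = resid @ resid / (n - p)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))     # sigma_hat * sqrt(c_jj)
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - p)
    return np.column_stack([b - t_crit * se, b + t_crit * se])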
For the risk assessment data, fitting a multiple linear regression model with mean
risk assessment as the response and the accounting determined measures of risk
as predictors gives the following.
Regression Analysis
We can use the above table of estimated coefficients and standard errors to construct confidence intervals for coefficients. The estimated standard errors of the estimators bj, namely σ̂√cjj, are listed in the table in the 'StDev' column. Noting that the upper 2.5 percentage point of a t distribution with n − p = 17 degrees of freedom is approximately 2.1098, a 95 percent confidence interval for β1, the coefficient for Dividend, can be computed as b1 ± 2.1098 × σ̂√c11.
To test
H0 : βj = γ
against the alternative
H1 : βj ̸= γ
we use the test statistic
\[ T = \frac{b_j - \gamma}{\hat{\sigma}\sqrt{c_{jj}}}, \]
which has a tn−p distribution under H0.
We can also conduct the test by computing a p-value. If we write t for the
observed value of T , then the p-value for the test is computed as p = 2 P(T ≥ |t|).
One-sided alternatives
For a one-sided test, we must modify the critical region and the definition of the
p-value.
For
H1 : βj > γ
the critical region is
T > tα;n−p
and we compute the p-value as
p = P(T ≥ t).
We conclude this section by looking once again at the risk assessment data.
In the column labelled ‘T’ here we have the values of the t-statistics described
above for testing whether each element of β is zero in the presence of the other
terms in the model: that is, for each parameter βj , j = 0, ..., k the statistics
\[ \frac{b_j}{\hat{\sigma}\sqrt{c_{jj}}} \]
are listed (and they are realizations of tn−p random variables under H0 : βj = 0).
Also listed are the p-values for testing H0 : βj = 0 against the two-sided
alternative H1 : βj ̸= 0. A small p-value indicates rejection of the null hypothesis.
A large p-value indicates that the given predictor seems to make no contribution
to explaining the variation in the response in the presence of the other terms.
The t-tests we have just described are sometimes called partial t-tests. We
note from the above p-values that asset growth and covariability earnings do not
seem to make a contribution to explaining the variation in the presence of the
other terms in the model. A word of caution here: partial t-tests can be hard
to interpret. We certainly can’t conclude from a partial t-test that a predictor is
unrelated to the response, or that it should not be included in a good model for the
response. For instance, it may happen that if two predictors carry much the same
information about the response then a good model for the response should contain
one of the two predictors, but that any model which contains both predictors
results in a non-significant p-value for the partial t-tests for both predictors.
Upon deleting one of the predictors from a model containing both, however, we
might then find that the p-value for the partial t-test for the remaining variable is
significant! Note that there is also a possible problem of simultaneous inference
here: it could be argued that we should make an adjustment to the significance
level used in the partial t-tests when we examine a table of p-values like the one
given above.
Consider the general linear model
\[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, ..., n, \]
where β = (β0 , ..., βk )⊤ is a vector of unknown parameters and the εi are inde-
pendent normal errors with zero mean and common variance σ 2 .
Now consider a new response value y∗ for which the values of the predictor variables are x∗1 , ..., x∗k . If we write x∗ for the vector x∗ = (1, x∗1 , ..., x∗k )⊤, then the conditional mean of y∗ given the predictor values is x∗⊤β. Also, we can write
\[ y_* = x_*^\top \beta + \varepsilon_*, \]
where ε∗ is a zero mean normal error with variance σ2, independent of the errors for the observed responses. We wish to construct a confidence interval for the conditional mean x∗⊤β and a prediction interval for y∗.
First we develop a test statistic which can be used for constructing a confidence
interval for the conditional mean. Writing b for the least squares estimator of β,
we consider x∗⊤b. We have that
E(x∗⊤b) = x∗⊤β
and
Var(x∗⊤b) = x∗⊤ Var(b) x∗ = σ² x∗⊤(X⊤X)⁻¹x∗ ,
where X is the design matrix for the fitted linear model. Furthermore, x∗⊤b is normal (as a linear combination of a multivariate normal random vector). So we have
x∗⊤b ∼ N( x∗⊤β, σ² x∗⊤(X⊤X)⁻¹x∗ )
or
(x∗⊤b − x∗⊤β) / ( σ √(x∗⊤(X⊤X)⁻¹x∗) ) ∼ N(0, 1).
Replacing σ by its estimate σ̂ gives the statistic
(x∗⊤b − x∗⊤β) / ( σ̂ √(x∗⊤(X⊤X)⁻¹x∗) ).    (26)
From the result that (26) has a t distribution with n − p degrees of freedom we can derive a confidence interval in the usual way: we can show that
P( x∗⊤b − tα/2,n−p σ̂ √(x∗⊤(X⊤X)⁻¹x∗) ≤ x∗⊤β ≤ x∗⊤b + tα/2,n−p σ̂ √(x∗⊤(X⊤X)⁻¹x∗) ) = 1 − α.
Prediction intervals
It remains to derive a prediction interval for y∗. Write ŷ∗ for the value
ŷ∗ = x∗⊤b ,
so that
y∗ − ŷ∗ = x∗⊤β + ε∗ − x∗⊤b .
Now, clearly E(y∗ − ŷ∗) = 0. Furthermore, ε∗ and b are independent, and we have just shown that
Var(x∗⊤b) = σ² x∗⊤(X⊤X)⁻¹x∗ ,
so that
Var(y∗ − ŷ∗) = Var(ε∗) + Var(x∗⊤b) = σ²(1 + x∗⊤(X⊤X)⁻¹x∗).
It can also be shown that y∗ − ŷ∗ and (n − p)σ̂²/σ² are independent, and so
[ (y∗ − ŷ∗) / ( σ √(1 + x∗⊤(X⊤X)⁻¹x∗) ) ] / √( [(n − p)σ̂²/σ²] / (n − p) ) ∼ tn−p .
Simplifying, we have
(y∗ − ŷ∗) / ( σ̂ √(1 + x∗⊤(X⊤X)⁻¹x∗) ) ∼ tn−p .
To illustrate the computation of confidence intervals for the mean and prediction intervals we consider again the risk assessment data. Suppose we are interested in predicting the market risk of the stock of a company with a dividend payout of 0.5, current ratio of 1.0, asset size of 10.0, asset growth of 0.1, leverage of 0.3, variability earnings of 2.0 and covariability earnings of 0.6. For a company with these predictor values, a 95 percent confidence interval for the conditional mean is
(2.138, 3.811)
and a 95 percent prediction interval for the new response is
(0.741, 5.208).
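In R, a confidence interval for the conditional mean and a prediction interval for a new response can be obtained from a fitted model with predict(); a minimal sketch, continuing with the assumed names from the earlier sketch:

newstock <- data.frame(dividend = 0.5, current = 1.0, assetsize = 10.0,
                       assetgrowth = 0.1, leverage = 0.3,
                       variability = 2.0, covariability = 0.6)
predict(fit, newdata = newstock, interval = "confidence", level = 0.95)  # CI for the conditional mean
predict(fit, newdata = newstock, interval = "prediction", level = 0.95)  # prediction interval for y*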
Recall that for the simple linear regression model we established the identity
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} (yi − ŷi)² ,
where yi, i = 1, ..., n were the responses, ȳ is the mean of the responses, and ŷi is the ith fitted value. We expressed this identity in the notation
SStotal = SSreg + SSres    (27)
where SStotal is called the total sum of squares (the sum of squared deviations
of the responses about their mean), SSreg is called the regression sum of squares
(the sum of the squared deviations of the fitted values about their mean, which
is ȳ) and SSres is called the residual sum of squares (the sum of the squared
deviations of the fitted values from the responses).
The identity (27) holds in the general linear model, where of course the fit-
ted values are defined as the elements of the vector Xb where X is the design
matrix and b is the least squares estimator of β. I won’t prove this identity, but
MATH2931 students will have to do this in their next assignment.
In the case of simple linear regression, the partition (27) of total variation was
presented in an analysis of variance table. This table was also a convenient
way of organizing calculations for hypothesis testing. We will cover some theory
about hypothesis testing later, but for now we will look at the output of fitting
a regression model, and try to understand the analysis of variance table and
some of the other information. Below is the result of fitting a multiple regression
model for the risk assessment data discussed in previous lectures (where mean
risk assessment is the response, and the seven accounting determined measures
of risk are predictors). See previous notes for a description of the data.
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 7 47.0282 6.7183 6.97 0.001
Residual Error 17 16.3808 0.9636
Total 24 63.4090
Source DF Seq SS
Dividend 1 18.4253
Current 1 5.6024
Asset Si 1 10.1251
Asset Gr 1 1.6456
Leverage 1 6.0841
Variabil 1 3.9497
Covariab 1 1.1961
By now you should understand how the coefficients in the fitted model are com-
puted, and how their standard errors are estimated. You should also understand
the computation of the partial t statistics and associated p-values listed in the
table of coefficient estimates, and their use for testing for the significance of a
predictor in the multiple linear regression model in the presence of the other
terms.
Below the table of coefficient estimates is listed the estimated standard de-
viation of the errors S = 0.9816 (this was σb in our notation). Also listed is R2 ,
which takes the value 0.742 here (or 74.2 percent). For the simple linear regres-
sion model, we defined R2 in terms of our fundamental partition of variation
as
R² = SSreg / SStotal    (28)
and the definition of R2 is the same in the general linear model. Beside the
quoted value for R2 is a value for the so-called adjusted R2 , which we write as
R̄2 . We have R̄2 = 0.635 here (or 63.5 percent). To define R̄2 , rewrite R2 as
R² = 1 − SSres / SStotal    (29)
(the equivalence of the two expressions for R2 comes from replacing SSreg in the
numerator of (28) by the equivalent expression SStotal − SSres , which comes from
(27)). We define R̄2 by replacing SSres in (29) by σ b2 (which is SSres /(n − p))
and replacing SStotal by SStotal /(n − 1). SStotal /(n − 1) is in fact the estimated
error variance for a linear model in which we fit only an intercept term. Hence
we can write
R̄² = 1 − (n − 1) SSres / ( (n − p) SStotal )    (30)
or
R̄² = 1 − σ̂²(n − 1) / SStotal .    (31)
We note that the column labelled ‘SS’ in the analysis of variance table lists
the components of the decomposition (27). SS stands for sum of squares here,
and this column shows the partition of variation for a multiple linear regression
model in exactly the same way as it did for the simple linear regression model.
It will be helpful to introduce some notation. Suppose we have k predictors
x1 , ..., xk in a linear model for a response y. There are p = k + 1 parameters
in the vector β = (β0 , ...βk )⊤ in the linear model with the predictors x1 , ..., xk .
Partition the vector β into two parts, β = (β(1)⊤, β(2)⊤)⊤, where β(1) is an r × 1 subvector say, and β(2) is a (p − r) × 1 subvector. We write R(β(2) | β(1)) for the
increase in SSreg in (27) when the predictors corresponding to the parameters
β (2) are added to a model involving the parameters β (1) . Think of R(β (2) |β (1) )
as the variation explained by the term involving β (2) in the presence of the term
involving β (1) . We define R(β1 , ..., βk |β0 ) to be SSreg .
The sequential sums of squares shown below the analysis of variance table are
simply the values
R(β1 | β0)
R(β2 | β0, β1)
R(β3 | β0, β1, β2)
...
R(βk | β0, ..., βk−1)
Note that these contributions add up to R(β1, ..., βk | β0). For instance, consider the case where k = 2. R(β2 | β0, β1) is the increase in the regression sum of squares when β2 is added to the model involving β1, so that by definition it equals the regression sum of squares for the model containing both x1 and x2 minus the regression sum of squares for the model containing x1 alone; hence R(β1 | β0) + R(β2 | β0, β1) is the regression sum of squares for the model containing both predictors.
The sequential sums of squares are useful when we have first ordered the variables
in our model in a meaningful way (based on the underlying science or context).
They tell us about how much a term contributes to explaining variation given all
the previous terms in the table (but ignoring the terms which come after it).
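In R, sequential (Type I) sums of squares are produced by anova(), with the terms taken in the order in which they appear in the model formula; a short sketch using the assumed model fit from the earlier sketch:

anova(fit)   # rows give R(beta_1 | beta_0), R(beta_2 | beta_0, beta_1), ... in formula order
# Reordering the terms in the formula changes the sequential sums of squares.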
While partial t or F tests can be useful, in the case of the general linear
model there are more complicated hypotheses that may be considered. For ex-
ample, we may ask: are any of the predictors helpful for explaining variation in
the response? Or is the model including all the predictors no better than the one
containing just the intercept? There is an F statistic for testing this hypothesis,
and this F statistic is usually displayed in the analysis of variance table when we
fit a multiple linear regression model.
The ANOVA table for a typical multiple regression model with k predictors and
where p = k + 1 is shown below. Here as usual we write SStotal for the total
sum of squares, SSreg for the regression sum of squares and SSres for the residual
sum of squares.

Source       DF        SS         MS                        F                   P
Regression   p − 1     SSreg      MSreg = SSreg/(p − 1)     F = MSreg/MSres     P(Fp−1,n−p > F)
Residual     n − p     SSres      MSres = SSres/(n − p)
Total        n − 1     SStotal

Writing β = (β0, β1, ..., βk)⊤ for the parameters in the general
linear model, where β0 is an intercept term and β1 , ..., βk are the coefficients
corresponding to the k predictors, it can be shown that the quantity MSreg/MSres has an F distribution with p − 1 and n − p degrees of freedom when β1 = ... = βk = 0. Hence we can use the statistic MSreg/MSres to test the null hypothesis
H0 : β1 = ... = βk = 0 .
Writing F = MSreg/MSres, and writing F∗ for an F random variable with p − 1 and n − p degrees of freedom, we compute the p-value for the test as
P(F ∗ > F ).
An alternative to the use of p-values is to find the critical region: for a test with
significance level α, the critical region is F > Fα;p−1,n−p .
We illustrate the test for the overall significance of the model using the Canadian
business failures data that we have discussed previously. These data describe
the annual percentage change in business failures and in the level of certain key
economic indicators. If we fit a multiple linear regression model with percentage
change in business failures as the response and percentage change in the economic
indicators (real domestic product, wages and salaries, unemployment rate, retail
sales, housing starts and C.P.I.) as predictors, we get the results shown below.
Regression analysis
Analysis of Variance
Source DF SS MS F P
Regression 6 1782.3 297.1 1.24 0.400
Residual Error 6 1436.4 239.4
Total 12 3218.7
Source DF Seq SS
Real dom 1 8.3
Wages an 1 2.8
Unemploy 1 478.5
Retail s 1 1164.3
Housing 1 99.1
C.P.I. 1 29.3
We may have anticipated that we would not learn very much here by fitting a
multiple linear regression model with six predictors and only thirteen observa-
tions. We see in the ANOVA table that the F statistic for testing overall model
adequacy is 1.24, and the associated p-value is 0.4. So at the 5 percent level we
accept the null hypothesis that the regression coefficients corresponding to the
six predictors are all zero.
As a further example of testing for overall model adequacy consider the risk
assessment data, which has been described in previous lectures. On pages 88
and 89 of your lecture notes I have given results when fitting a multiple linear
regression model for these data with mean risk assessment as the response and
the seven accounting determined measures of risk as predictors. It can be seen in
the ANOVA table that the F statistic for testing overall model adequacy is 6.97,
and the associated p-value
p = P(F ∗ > 6.97)
where F ∗ ∼ F7,17 is approximately 0.001. So for the risk assessment data we
reject the null hypothesis
H0 : β1 = ... = βk = 0
The F -test given in the ANOVA table illustrates a general method for comparison
of linear models via hypothesis testing.
When we conduct a hypothesis test we are always comparing two different models: the null hypothesis imposes a restriction upon the parameters of the more general model which holds under the alternative hypothesis.
For instance, in the F-test in the ANOVA table we compare the model
yi = β0 + εi
(the model which holds under the null hypothesis) to the more general model
yi = β0 + β1 xi1 + ... + βk xik + εi .
More generally, we might compare the model
yi = β0 + β1 xi1 + ... + βr xir + εi
to the model
yi = β0 + β1 xi1 + ... + βr xir + ... + βk xik + εi .
Here H0 is βr+1 = ... = βk = 0 and H1 states that at least one of βr+1, ..., βk is nonzero.
Write SSnull for the residual sum of squares for the model which holds under H0, and write SSfull for the residual sum of squares for the model which holds under H1. The extra variation explained by the more general model is SSnull − SSfull. Is the amount of extra variation
explained so large that we would favour the more complex model? We can develop
a hypothesis test to address this question.
Writing ∆p for the number of restrictions imposed by the null hypothesis on the most general model (in the H0 given above this is k − r, the number of extra parameters in the model under H1), it can be shown that under H0
F = [ (SSnull − SSfull)/∆p ] / [ SSfull/(n − p) ] ∼ F∆p,n−p ,
where p is the number of parameters in the full model.
Suppose we feel that the most important predictor of mean risk assessment for
the risk assessment data is dividend, and that we wish to see whether the other
predictors in the model are needed in a multiple linear regression model in which
dividend is included. To compute the test statistic for testing the hypothesis that
the coefficients corresponding to current ratio, asset size, asset growth, leverage,
variability earnings and covariability earnings are all zero, we need to compute
the decrease in the residual sum of squares when we add these terms to a model
involving dividend. This is obtained from the R output from the table of sequen-
tial sums of squares (see previous notes for the R output for the risk assessment
data). Let β (1) = (β0 , β1 )⊤ and β (2) = (β2 , ..., β7 )⊤ . Then we wish to test
H0 : β (2) = 0
Write R(β(2) | β(1)) for SSnull − SSfull (the reduction in residual sum of squares when the β(2) term is added to the null model). From the table of sequential sums of squares we have
R(β(2) | β(1)) = 5.6024 + 10.1251 + 1.6456 + 6.0841 + 3.9497 + 1.1961 = 28.6030,
and MSres = 0.9636 for the full model, so that
F = (28.6030/6) / 0.9636 ≈ 4.95.
Hence if F∗ is an F6,17 random variable, we compute the p-value for the test as
p = P(F∗ > 4.95),
which is less than 0.01, so we reject H0 and conclude that at least one of the other predictors is useful in the model which already contains dividend.
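The same comparison of nested models can be carried out in R by fitting both models and passing them to anova(), which reports the F statistic and its p-value; a sketch under the same assumed data frame and column names as before:

fit_null <- lm(riskassess ~ dividend, data = risk)
fit_full <- lm(riskassess ~ dividend + current + assetsize + assetgrowth +
                 leverage + variability + covariability, data = risk)
anova(fit_null, fit_full)   # F = ((SS_null - SS_full)/6) / (SS_full/17) and its p-value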
Sequential F tests
In the last lecture we discussed the table of sequential sums of squares which
appears in the computer output, and mentioned that the values which appear in
this table represent the increases in the regression sum of squares as terms are
added sequentially to the model. These sequential sums of squares can be used
as the basis for a formal test of the hypothesis of the usefulness of each predictor
in the model in the presence of the terms which appear before it but ignoring
terms which come after it. As we mentioned last time, the table of sequential
sums of squares gives the values
R(β1 | β0)
R(β2 | β0, β1)
R(β3 | β0, β1, β2)
...
R(βk | β0, ..., βk−1).
For each j = 1, ..., k, we can test
H0 : βj = 0
against
H1 : βj ≠ 0
in the model containing the terms β0, β1, ..., βj, using the statistic
F = R(βj | β0, ..., βj−1) / MSres ,
where MSres is the residual mean square for the model containing the terms β0, ..., βj. Under H0 this statistic has an F1,n−p distribution, where p = j + 1 is the number of parameters in that model. Writing F∗ for an F1,n−p random variable, the p-value for the test is
p = P(F∗ > F).
Alternatively, for a test with significance level α, the critical region is F >
Fα;1,n−p . The test we have just described is called a sequential F test. These
sequential F tests will in general be hard to interpret unless we first carefully
choose the ordering of the predictors in the model: a sequential F test examines
the usefulness of a predictor in the presence of previous predictors but ignoring
predictors which come after it.
Consider again the risk assessment data. In fitting the model, the predictors are ordered so
that dividend is the first predictor, current ratio is next, then asset size, then
asset growth, then leverage, then variability earnings and finally covariability
earnings. Suppose we wish to test whether asset size is a useful predictor in the
model which includes dividend and current ratio. The increase in the regression
sum of squares when a term for the predictor asset size is added to the model
involving dividend and current ratio is 10.1251. Writing β0 for the intercept term,
β1 for the coefficient for dividend, β2 for the coefficient for current ratio and β3
for the coefficient for asset size, we want to test
H0 : β3 = 0
versus
H1 : β3 ̸= 0
in the model involving the terms β0 , β1 and β2 . The mean square error M Sres can
be computed from the output by (16.3808+1.1961+3.9497+6.0841+1.6456)/21 =
29.2563/21 = 1.3932, and we know from the discussion above that the ratio
10.1251/1.3932 = 7.267514 is a realization of an F1,21 random variable under the
null hypothesis. If F ∗ ∼ F1,21 , then we compute the p-value for the sequential
test as
p = P(F ∗ > 7.267514)
which from tables or a computer package gives p = 0.01353433 approximately.
So adding the term asset size to a model involving dividend and current ratio
does help to explain variation in the response in a multiple linear regression model.
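The sequential F statistic and its p-value can also be computed directly from the F distribution function in R, using the sums of squares quoted above; a minimal sketch:

F_stat <- 10.1251 / 1.3932                           # R(beta_3 | beta_0, beta_1, beta_2) / MS_res
pf(F_stat, df1 = 1, df2 = 21, lower.tail = FALSE)    # p-value, approximately 0.0135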
We have seen that in the analysis of variance table the “Degrees of freedom”
column specifies the degrees of freedom which are appropriate for computing a
p-value for an F test of overall model adequacy. We have not discussed the
question, however, of where these degrees of freedom parameters come from, and
of how we derive the distributions of the test statistics we have studied. The
key theorem in proving the results we have stated about the distributions of test
statistics is Cochran’s Theorem. I will state this result for the MATH2931 stu-
dents.
Definition: A quadratic form in the random vector y is an expression of the form y⊤Ay, where A is a symmetric matrix; the rank of the quadratic form is defined to be the rank of A.
Theorem (Cochran): Suppose that y is a vector of n independent N(0, σ²) random variables, that the quadratic form Q = y⊤Ay satisfies Q/σ² ∼ χ²r, and that Q can be decomposed as
Q = Q1 + Q2 + ... + Qk ,
where each Qi = y⊤Ai y is a quadratic form. If we write ri for the rank of Ai and r for the rank of A then any one of the following conditions implies the other two:
1. r1 + r2 + ... + rk = r;
2. each Qi/σ² has a χ² distribution (with ri degrees of freedom);
3. the quadratic forms Q1, ..., Qk are mutually independent.
What we have called the total sum of squares SStotal , in the general linear
model takes the form y ⊤ Ay for an idempotent matrix A, where y here is the vector
of responses. Also, the regression sum of squares and residual sum of squares can
be written as quadratic forms in y. When we are deriving the test statistics above
we are considering decompositions of a quadratic form into meaningful parts, and
showing that the parts are independent and χ2 distributed (which is what we need
to do to prove that our test statistics, which involve ratios of quadratic forms, are F distributed). For a quadratic form y⊤Ay, the rank of the quadratic form is defined to be simply the rank of A. Each of the sums of squares in the analysis of variance table is a quadratic form, and the degrees of freedom column in the analysis of variance table specifies the ranks of those quadratic forms.
3.12 Multicollinearity
In a previous lecture where we developed a partial t test for the significance of
a predictor in the general linear model in the presence of other predictors we
mentioned that the interpretation of these tests is complicated when some of the
predictors in the model contribute similar information. In this subsection we
discuss this idea more formally.
In the course so far we have been dealing with the full rank linear model, in
which the design or model matrix X has full rank. Recall that X has full rank
if no column of X can be expressed as a linear combination of the remaining
columns. When the predictors in our linear model contribute similar informa-
tion, then it may happen that although X is of full rank, we can still nearly
express one of the predictors as a linear combination of the remaining predictors:
multicollinearity is the name given to this condition.
Effects of multicollinearity
Why is multicollinearity a potential problem, and what are its effects? If X does
not have full rank, then there is no unique solution to the normal equations: we
can get the same vector of fitted values by forming different linear combinations
of the vectors of predictor values. When multicollinearity occurs in the full rank
linear model, something close to this situation also occurs: quite different linear
combinations of the predictor values can result in almost the same vector of fitted
values. Hence for quite different β values the least squares criterion (which was
just the sum of the squared differences between the observations and fitted values)
may not change very much. Roughly speaking, there may be a large region in
the space of the possible parameter values where we have much the same fit, and
this makes it difficult to estimate β precisely.
Figure 3.9 illustrates the phenomenon of multicollinearity for an artificial data
set involving two predictors. In the Figure, y is the response and x1 and x2 are
the predictors. We can see that x1 and x2 are strongly linearly related, and
the (y, x1 , x2 ) points nearly lie on a line in three-dimensional space. Since there
are an infinite number of planes which can pass through a given line in three-
dimensional space, it’s clear that when attempting to fit a plane through these
points we could have two planes with very different coefficients for x1 and x2
resulting in almost the same quality of fit (residual sum of squares). Because
x1 and x2 are so closely related, it is hard to separate out their effects. The
coefficients will not be well estimated, and attempting to make predictions at
new points which don’t respect the dependencies in the original data could result
in nonsensical answers.
The example above is sometimes referred to as the picket fence of multi-
collinearity. Think of the heights y as representing the tops of pickets at the
(x1 , x2 ) points in a picket fence. Imagine trying to balance a table top on the
pickets: the balance is very precarious in directions perpendicular to the line of
pickets.
More formally, recall that the covariance matrix of the least squares estimator b of β in the general linear model was
σ²(X⊤X)⁻¹ .
When multicollinearity is present X⊤X is nearly singular, and some elements of (X⊤X)⁻¹ (and hence some of the variances and covariances of the coefficient estimators) become very large, so that the affected coefficients are estimated imprecisely. Note also that examining pairwise correlations between the predictors is not always enough to detect multicollinearity: it is possible for all the pairwise correlations to be small, but for one of the predictors to be capable of nearly being expressed as a linear combination of two or more of the remaining predictors. We now describe an example which illustrates some of the effects of multicollinearity.
The following data are from Raymond H. Myers, “Classical and Modern Regres-
sion with Applications (Second Edition),” Duxbury, Belmont, California, 1990,
pp. 130–133. Data were collected from seventeen U.S. Naval hospitals at various
sites around the world. The regressors are variables which are thought to predict
workload (monthly man hours) at the hospitals. The variables are as follows:
y = Monthly man-hours
x1 = Average daily patient load
x2 = Monthly X-ray exposures
x3 = Monthly occupied bed days
x4 = Eligible population in the area / 1000
x5 = Average length of patients’ stay in days.
It was desired to predict workload (i.e. the monthly man-hours y) based on the
predictor variables x1 , x2 , x3 , x4 and x5 . Fitting a multiple linear regression
model gives the following output.
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 5 490177488 98035498 237.79 0.000
Residual Error 11 4535052 412277
Total 16 494712540
Source DF Seq SS
x1 1 480612694
x2 1 7231656
x3 1 598469
x4 1 276098
x5 1 1458572
At first sight this seems like a satisfactory model: we have a very high R2 value
of 99.1 percent here. However, when we compute the correlation matrix of the
predictors we find that the correlation between x1 and x3 is greater than 0.99.
The opposite extreme to collinearity is orthogonality. Write x̄j for the mean
of the values for the jth predictor,
x̄j = ( Σ_{i=1}^{n} xij ) / n .
Also, write sj for the quantity
sj = √( Σ_{i=1}^{n} (xij − x̄j)² ) ,
and define the standardized predictor values zij = (xij − x̄j)/sj. Let Z be the n × k matrix with (i, j) element zij. The (j, k) element of Z⊤Z is then
Σ_{i=1}^{n} zij zik = Σ_{i=1}^{n} (xij − x̄j)(xik − x̄k) / (sj sk) ,
which is the sample correlation between the jth and kth predictors.
We say that the predictors are orthogonal if they are uncorrelated: that is, Z⊤Z = I where I is the identity matrix. In the case of orthogonality, if we fit the
model involving the transformed predictors, it is easy to see that the estimators of
the predictor coefficients are independent (the covariance matrix of the predictor
coefficients is σ 2 (Z ⊤ Z)−1 = σ 2 I and normality implies independence from this).
This independence makes interpretation of coefficient estimates and of the partial
t statistics much easier.
In general, if orthogonality doesn’t hold and we fit a model using the stan-
dardized predictors Z, then the diagonal elements of
σ 2 (Z ⊤ Z)−1
will be larger than σ 2 (their value in the case of orthogonality). The diagonal el-
ements of (Z ⊤ Z)−1 represent the increase in the variances of the coefficients over
the ideal case of orthogonality. These diagonal elements are called the variance
inflation factors of the coefficients, and they are very useful for detecting mul-
ticollinearity. For the hospital manpower data the variance inflation factors are
(9597.57, 7.94, 8933.09, 23.29, 4.28). The first and third elements (corresponding
to x1 and x3 , which are highly correlated) are very large here, and the quality of
estimation for the coefficients for these predictors is very poor.
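The variance inflation factors are the diagonal elements of the inverse of the predictor correlation matrix, and can be computed directly in R; a sketch assuming the hospital manpower data are in a data frame called hospital with columns x1 to x5:

R_xx <- cor(hospital[, c("x1", "x2", "x3", "x4", "x5")])  # correlation matrix of the predictors
diag(solve(R_xx))    # variance inflation factors; very large values flag multicollinearity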
More sophisticated methods for detecting multicollinearity and diagnosing
what variables are involved are based on looking at the eigenvalues and eigenvec-
tors of the correlation matrix Z ⊤ Z. (MATH2831 students can ignore the rest of
this section if they wish). Roughly speaking, the eigenvectors corresponding to
eigenvalues which are close to zero describe a near linear dependence among the
columns of Z. We can write
Q⊤(Z⊤Z)Q = Λ ,
where Q is an orthogonal matrix whose columns are the eigenvectors of Z⊤Z and Λ is a diagonal matrix containing the corresponding eigenvalues.
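In R the eigenvalues and eigenvectors of the predictor correlation matrix can be inspected with eigen(), continuing the previous sketch; eigenvalues close to zero point to near linear dependences among the standardized predictors.

ev <- eigen(R_xx)
ev$values     # eigenvalues: values near zero indicate near collinearity
ev$vectors    # the corresponding eigenvectors describe the near linear dependences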
4 Model Selection
At the beginning of this course we discussed some of the reasons why we build
statistical models. We build statistical models in order to make decisions, and in
the process of selecting a model we must manage a trade off between complexity
and goodness of fit in order to provide a reduction of the data that is useful for
decision making.
In the context of the general linear model, the problem of model selection
appears when we are deciding which predictor variables should be included for
explaining variation in the response.
However, often data may have been collected for a large number of predic-
tors, some of which might be unrelated to the response. We may wish to choose
a simple model containing a subset of the predictors (or perhaps to choose a
small collection of simple models containing subsets of the predictors) which of-
fer a simpler explanation of the observed variation than a model incorporating
all the predictors. Our interest may be in summarizing the data succinctly, in
trying to determine which predictors are related to the response and which are
not, or in prediction of future response values. Often there is no single best model.
One of the most common reasons for building a statistical model is prediction,
and we now discuss the issue of model selection when prediction is the goal. We
begin by asking the question: is it harmful for prediction if we fit a statistical
model which is more complicated than we really need? We will illustrate that
fitting an unnecessarily complicated model can be harmful by considering the
most elementary case, the simple linear regression model.
Suppose we have a data set consisting of responses y1 , ..., yn and corresponding
predictor values x1 , ..., xn , and suppose that the simple linear regression model
holds,
y i = β 0 + β 1 xi + εi
where εi, i = 1, ..., n, is a collection of uncorrelated zero mean errors with common variance σ², say. We write M0 for the model
y i = β0 + εi
in which the predictor xi is excluded and write M1 for the full model
y i = β 0 + β 1 xi + εi .
We consider fitting these two models to the data, and develop expressions for the
expected squared prediction errors for the models for a new observation y∗ when
the predictor value is x∗ . The expressions for expected squared prediction error
will help to illustrate the trade off between complexity and goodness of fit that
is involved in all model selection problems.
Consider first fitting the model M0, which involves just an intercept. To estimate β0, we must minimize
Σ_{i=1}^{n} (yi − β0)² .
Differentiating with respect to β0 and setting the derivative to zero gives Σ_{i=1}^{n} (yi − b0) = 0, or
b0 = ȳ .
For the model M0 then, our prediction of a new observation y∗ from the fitted
model is simply ȳ, the mean of the responses. We write yb0 (x∗ ) for the predicted
value of y∗ for model M0 when the predictor is x∗ .
For the model M1 , write yb1 (x∗ ) for the predicted value of y∗ when the predictor
is x∗ : that is, let
yb1 (x∗ ) = b0 + b1 x∗
where b0 and b1 are the least squares estimators of β0 and β1 in the simple linear
regression model that we developed at the beginning of the course. (At the risk
of some confusion we have used the notation b0 for the least squares estimator
of the intercept in both models M0 and M1 even though the estimator of the
intercept is different for the two models).
We now compute the two expected squared prediction errors
E((y∗ − ŷ0(x∗))²)
and
E((y∗ − ŷ1(x∗))²).
Consider first E((y∗ − yb0 (x∗ ))2 ). Recall that for a random variable Z, Var(Z) =
E(Z 2 ) − E(Z)2 so that
E(Z 2 ) = Var(Z) + E(Z)2 .
Applying this identity to y∗ − yb0 (x∗ ), we have
E((y∗ − ŷ0(x∗))²) = Var(y∗ − ŷ0(x∗)) + E(y∗ − ŷ0(x∗))²
= Var(y∗) + Var(ŷ0(x∗)) + E(y∗ − ŷ0(x∗))²
= σ² + Var(ŷ0(x∗)) + E(y∗ − ŷ0(x∗))² ,    (32)
and similarly
E((y∗ − ŷ1(x∗))²) = σ² + Var(ŷ1(x∗)) + E(y∗ − ŷ1(x∗))² .
The terms in the above expressions are easily interpreted. The second term is
just the variance of the prediction. The variance of prediction depends on the
variance of our estimators of the model parameters, and we might expect that
this is larger for the more complex model M1 where there are more parameters
to estimate. In fact, if either model M0 or M1 holds,
Var(ŷ0(x∗)) = σ²/n
and
Var(ŷ1(x∗)) = σ² ( 1/n + (x∗ − x̄)²/Sxx ) .
So prediction variance is larger for the more complex model.
The third term in both expressions for expected squared prediction error is
the square of the prediction bias. If the data conform to the model M1 , then we
showed in our discussion of prediction for the general linear model that E(y∗ −
yb1 (x∗ )) = 0 and so the squared bias term is zero for model M1 . However, if we
fit the model M0 when model M1 holds with a nonzero β1 , then the bias term
E(y∗ − yb0 (x∗ ))2 may be nonzero. If M0 holds then the bias term will be zero. So
the bias term is always smaller for the more complex model, and the variance
term is always larger.
The key idea we are coming to here is the following: a complicated model
has a higher variance of prediction than a simpler one, but a model which is too
simple and ignores important predictors may make predictions which are system-
atically biased. Good model selection involves managing a trade off between the
contributions of bias and variance to prediction error.
The following example of the difference between within sample and out of sam-
ple predictive performance is given by Myers (1990), “Classical and Modern Re-
gression with Applications,” Duxbury, Belmont, California, pp. 168–169. The
example relates to the hospital manpower data we discussed in the last lecture.
There are 17 observations here. If we fit a linear regression model with y as the
response and x1 , x2 , x3 , x4 and x5 as predictors, then the residual for the last
observation (within sample prediction error) is −466.27. Now, suppose we refit
the model with the 17th observation deleted. Then we can produce a forecast
of the response for the 17th observation based on this fitted model: when we do
this, the prediction error is −3675.121, which is much larger (in absolute value)
than the within sample prediction error of −466.27. Similarly, we can delete
the other observations one at a time, and produce forecasts of each observation
from a model not including that observation. For the fifteenth and sixteenth
observations, for instance, this gives prediction errors of −2510.842 and 2242.496
respectively (and the residuals for the fifteenth and sixteenth observations when
the model is fitted to all the data are −504.857 and 376.549 respectively, see
Myers, p. 169). So we can see that in this example, prediction out of sample can be considerably harder than prediction within sample.
This motivates the PRESS (prediction sum of squares) residuals and the PRESS statistic. The ith PRESS residual, written ei,−i, is the difference between yi and the prediction of yi obtained by fitting the model to the data with the ith observation deleted, and the PRESS statistic is the sum of the squared PRESS residuals,
PRESS = Σ_{i=1}^{n} e²i,−i .
Unlike the residual sum of squares, which the PRESS statistic resembles, the
PRESS statistic does not necessarily decrease as we make the model more com-
plex. The PRESS statistic can be computed and compared for different models,
and the model with minimum PRESS chosen as one likely to have good out of
sample predictive performance.
In order to give the computational formula for the PRESS residuals in terms
of ordinary residuals and leverages, we need to first define the leverages. The
leverages are the diagonal elements of the so-called hat matrix,
H = X(X ⊤ X)−1 X ⊤ .
To interpret the hat matrix, just observe that the vector of fitted values is
ŷ = Xb = X(X⊤X)⁻¹X⊤y = Hy .
So multiplying the vector of responses by H gives the fitted values. The ith
diagonal element hii of H (ith leverage) multiplies yi in determining ybi . So the
leverage is some measure of the influence of yi on the fit at the ith set of predictor
values.
There is another interpretation of the leverages. If we write xi for the vec-
tor of predictors corresponding to the ith observation of the response, xi =
(1, xi1 , xi2 , .., xik )⊤ , then we can write hii as
hii = xi⊤(X⊤X)⁻¹xi .
Recall that the variance of the fitted value (the estimate of the conditional mean) at xi is
Var(ŷ(xi)) = Var(xi⊤b) = xi⊤ Var(b) xi = σ² xi⊤(X⊤X)⁻¹xi = σ² hii ,
so that the ith leverage is, apart from σ², simply the prediction variance for the fitted model at xi. It can be shown that the leverage always lies between zero and
one (the prediction variance is always nonnegative of course, and the variance of
a prediction is never worse than the residual error variance at one of the observed
xi ). Note that the leverage depends only on the vector of predictors (not on
the response value): you should think of it as measuring the potential influence
of an observation, with high leverage indicating that the vector of predictors is
somehow extreme compared to the other predictor vectors.
Now that we have discussed the idea of leverage, we can describe computation
of the PRESS residuals and the PRESS statistic. If ei is the ordinary residual
for the ith observation, hii is the ith leverage value, and if ei,−i is the ith PRESS
residual, then
ei,−i = ei / (1 − hii) .
So to compute the PRESS residuals (and hence the PRESS statistic) all we need to do is compute the ordinary residuals and the diagonal of the hat matrix. If
the leverage is large (so that the ith observation is an influential one in the fit)
then the PRESS residual is made much larger in absolute value than the ordinary
residual by the divisor of 1 − hii in the above equation. In effect, the PRESS
residuals are obtained by increasing in absolute value the size of the ordinary
residuals, with the amount of the increase related to how influential the observa-
tion may be in fitting the full model.
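In R the leverages, PRESS residuals and PRESS statistic can be obtained directly from a fitted model object; a minimal sketch for an assumed fitted model fit:

h <- hatvalues(fit)                      # leverages: diagonal elements of the hat matrix
press_res <- residuals(fit) / (1 - h)    # PRESS residuals e_{i,-i} = e_i / (1 - h_ii)
sum(press_res^2)                         # the PRESS statistic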
An example will help to make the concept of the leverage clear for the simple
linear regression model. As we have mentioned, the ith leverage value is (apart
from σ 2 ) the variance of our predictor of the response at xi . For the simple linear
regression model, we showed that the variance of our predictor of the conditional
mean at x0 was
σ² ( 1/n + (x0 − x̄)²/Sxx ) ,
so that the ith leverage value is
1/n + (xi − x̄)²/Sxx .
We see that the value of the leverage will be large if xi is far away from the
mean of the predictors x̄: outlying points in the predictor space (predictor values
distant from x̄) have potentially high influence and a high leverage. Below is
the result of fitting a simple linear regression model involving stock price as the
response and TSE300 index as the predictor for the capital asset pricing model
data discussed on Tutorial Sheet 5.
Also shown is a scatterplot of stock price versus TSE300 index.
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 1 15.400 15.400 152.65 0.000
Residual Error 19 1.917 0.101
Total 20 17.317
The leverage values for the 21 observations are 0.073, 0.064, 0.057, 0.075,
0.088, 0.053, 0.052, 0.068, 0.070, 0.099, 0.059, 0.048, 0.072, 0.066, 0.048, 0.051,
0.080, 0.078, 0.168, 0.247, 0.386. We see that the largest leverage value corre-
sponds to the last observation (which happens to be the rightmost predictor value
in the scatter plot).
We now give an example illustrating the use of the PRESS statistic for model
selection.
Example: data on cheddar cheese
In this example we will consider some data from a study on cheddar cheese in
the La Trobe Valley, Victoria. For this data set, we have a subjective measure
of cheese taste as the response (taste) and as predictors we have measures of
concentration of acetic acid (acetic), concentration of hydrogen sulfide (H2S) and
concentration of lactic acid (lactic). The table below shows the values of R2 ,
adjusted R2 and the PRESS statistic for all possible models with 1, 2 and 3
predictors.
For this example, we see from the table that the best model in terms of the
PRESS statistic is the model which contains Lactic and H2S. This model is also
the best in terms of maximum adjusted R2 .
In computing the PRESS statistic, each observation was predicted from a model fitted to the data with that observation omitted. The idea of cross validation is similar: we split the data into two parts, a
training sample (which is used to estimate parameters in candidate models) and
a validation sample (which is used to assess predictive performance). Selection of
a model is based on predictive performance in the validation sample. If we have
m + n observations (responses y1 , .., ym+n and corresponding vectors of predictors
x1 , ..., xm+n ) and if the first m observations constitute the training sample and the
last n constitute the validation sample, then we might compare different models
according to the criterion
Σ_{i=m+1}^{m+n} (yi − xi⊤b)²    (33)
(where b is the least squares estimator computed from the training sample), or the criterion
Σ_{i=m+1}^{m+n} |yi − xi⊤b| .
Once a model has been selected, we refit the selected model using the whole data
set.
There are a number of difficulties with the idea of cross validation. Firstly, we
must have a large enough number of observations so that we can split the data into
two parts which are large enough to enable reliable estimation of parameters and
reliable estimation of predictive performance. This may not always be the case.
Also, there is the problem of how exactly to split the data (which observations
should go in the training and validation samples, and what should m and n be?)
We give an example to illustrate the use of cross validation and some of the
problems involved in its application.
For the cheddar cheese tastings data, there are 30 observations which we split
into two parts of 15 observations each (the first 15 observations in the sample
constitute the training sample, and the last 15 constitute the validation sample).
I have computed the criterion (33) using the validation sample after fitting each
of the possible models involving the three predictors to the training sample. The
results are shown in the table below. Also shown are the values of (33) when
the second half of the data is used as the training sample and the first half is
used as the validation sample. We can see from the table that the best model
according to the cross validation criterion for both cases is the one involving H2S
and Lactic. However, there are quite different predictor values in the first fifteen
and last fifteen observations, and our ranking of the models is highly dependent
on the choice of training set: for instance, the full model is the second best model
when the second half of the data is used for the training set, but it is the second
worst model when the first half of the data is used for the training set. How to
split the data into two parts can be a difficult question in application of the idea
of cross validation.
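A minimal cross-validation sketch in R, assuming the cheese tasting data are in a data frame called cheese with columns taste, acetic, H2S and lactic, and that the first 15 observations form the training sample:

train <- cheese[1:15, ]
valid <- cheese[16:30, ]
fit_cv <- lm(taste ~ H2S + lactic, data = train)   # fit a candidate model to the training sample
pred   <- predict(fit_cv, newdata = valid)
sum((valid$taste - pred)^2)                        # criterion (33): validation sum of squared errors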
The conceptual predictive criterion is based on the standardized total mean squared error of the fitted values for a candidate model with p parameters,
(1/σ²) Σ_{i=1}^{n} MSE(ŷ(xi)) ,    (34)
where y(xi) denotes the true mean response at xi and
MSE(ŷ(xi)) = E((ŷ(xi) − y(xi))²).
Now, using the formula Var(Z) = E(Z 2 ) − E(Z)2 , we have E(Z 2 ) = Var(Z) +
E(Z)2 and hence
MSE(ŷ(xi)) = Var(ŷ(xi)) + Bias(ŷ(xi))² ,
where Bias(ŷ(xi)) = E(ŷ(xi) − y(xi)). So (34) is equal to
Σ_{i=1}^{n} Var(ŷ(xi))/σ² + Σ_{i=1}^{n} Bias(ŷ(xi))²/σ² .
It can be shown that the total squared bias term
Σ_{i=1}^{n} Bias(ŷ(xi))²/σ²
can be estimated by (n − p)(σ̂² − σ²)/σ², where σ̂² = SSres/(n − p) is the residual variance estimate for the candidate model (this follows from E(SSres) = (n − p)σ² + Σ_{i=1}^{n} Bias(ŷ(xi))²), and that
Σ_{i=1}^{n} Var(ŷ(xi))/σ² = p ,
so that
p + (n − p)(σ̂² − σ²)/σ²
is an estimator of (34). In practice σ 2 is not known, but an unbiased estimate of
σ 2 (provided one of the models in the class of models being considered is correct)
is the estimate σ̂F² based on the full model: substituting into the expression above gives Mallow's Cp statistic,
Cp = p + (σ̂² − σ̂F²)(n − p) / σ̂F² .
To illustrate the use of Cp , we consider once more the data on cheddar cheese
tastings.
Best Subsets Regression
Response is taste
(The rows of the output, not reproduced in full here, give for each number of variables the values of R-Sq, adjusted R-Sq, C-p and S, together with indicators of which of acetic, H2S and lactic appear in each model.)
The best model according to the Cp criterion is again the one involving H2S and
Lactic. All the various model selection criteria we have considered so far have
led to the model involving H2S and Lactic as the best model. The choice of a
best model (for predictive or other purposes) will often not be so clear, and it is
always worthwhile to consider a number of model selection criteria in choosing a
model.
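Mallow's Cp for a candidate model can be computed from the two residual variance estimates; a sketch in R under the assumed cheese data frame from the earlier cross-validation sketch:

fit_full <- lm(taste ~ acetic + H2S + lactic, data = cheese)
fit_sub  <- lm(taste ~ H2S + lactic, data = cheese)
sigma2_F <- summary(fit_full)$sigma^2   # sigma_hat_F^2 from the full model
sigma2_p <- summary(fit_sub)$sigma^2    # sigma_hat^2 from the candidate model (p = 3 parameters)
n <- nrow(cheese); p <- 3
p + (sigma2_p - sigma2_F) * (n - p) / sigma2_F   # Mallow's Cp for the candidate model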
Forward selection
The idea of sequential variable selection methods is to start with some initial
model, and to then add, delete or change a variable in the initial model to im-
prove the fit. We then add, delete or change a variable in the new model, and
continue searching through the space of possible models in this way until some
stopping criterion is satisfied. By starting with some initial model and making
small changes to it we avoid the need to compute some criterion function for all
possible models. The mechanisms for adding or deleting variables in the stepwise
procedure are based on the hypothesis tests (partial t and F tests) which were
discussed earlier in the course.
There are three basic sequential variable selection techniques. The first is for-
ward selection. The idea of forward selection is to start with our best model in-
volving one variable, and then add the variable at each step which most improves
the fit until some stopping rule is satisfied. The hope with forward selection is
that by taking our best simple model and improving it as much as we can at each
step, we will arrive at a model that is good globally: this hope may not always
be realized. We can summarize the steps in forward selection as follows:
1. Set the initial model to be the best model involving a single predictor (the one-variable model with the largest partial F-statistic).
2. Until some stopping rule is satisfied, find the predictor not currently in the model (if there is one) for which the partial F-statistic is largest and add it to the model.
We haven’t said anything yet about the stopping criterion in the method
above. One approach is to stop when the model contains a certain fixed number
of variables, p∗ say. Another common stopping rule is to stop when none of the
partial F statistics for the coefficients considered in the second step is significant
at some preselected level. The default significance level used varies between
computer packages. We give an example which will illustrate the idea of forward
selection.
Example: cheddar cheese tastings
Consider the data on cheddar cheese tastings given in assignment three. The
response here was the variable taste (a subjective measure of taste) and the
predictors were measures of the concentrations of various chemicals in the samples
(acetic, lactic and H2S).
Of all the one variable models, the largest partial F -statistic occurs for the
model involving H2S (the F statistic for testing for the significance of the slope
for H2S is 37.29, with a p-value of approximately 0).
Now we consider adding the variables not currently in the model (acetic and
lactic) to the variable H2S. Fitting the model involving H2S and lactic, the partial
F statistic for testing for the significance of lactic in the presence of H2S is
approximately 6.25. If we fit the model involving H2S and acetic, the partial
F statistic for testing for the significance of acetic in the presence of H2S is
approximately 0.71. The largest of these is the F statistic for lactic, with a p-
value of 0.019. Finally we consider adding acetic to the model involving H2S
and lactic. The partial F statistic for acetic is approximately 0.005, with a p-
value exceeding 0.9, clearly not significant at the 0.25 level. Since this is the only
variable not already in the model our stopping rule is satisfied, and the result of
our forward stepwise search is to choose the model involving H2S and lactic.
Stepwise Regression: Taste versus acetic,H2S,lactic
Step 1 2
Constant -9.787 -27.592
Lactic 19.9
T-Value 2.50
P-Value 0.019
S 10.8 9.94
R-Sq 57.12 65.17
R-Sq(adj) 55.58 62.59
C-p 6.0 2.0
The “Step” columns labelled 1 and 2 here show the variables which are added
at each stage. We see that H2S was added first, and for the simple linear re-
gression model with H2S as predictor the estimated intercept was −9.787 and
the estimated slope was 5.78. Also reported is the partial t statistic for testing
significance of the slope which is 6.11 (we square this to get the partial F statis-
tic) with associated p-value 0.000. The residual standard error is 10.8, the R2 is
0.5712, the adjusted R2 is 0.5558 and Mallow’s Cp is 6.0. At the second stage,
the variable lactic is added to the model: again we have coefficient estimates,
the estimated residual standard error, the R2 , adjusted R2 , Cp and values of the
partial t statistics. The model involving H2S and lactic is the final model here.
Backward elimination
The second main sequential variable selection algorithm is called backward elim-
ination or backward selection. The idea of backward selection is to start with
an initial model containing all predictors, and to then delete terms at each step
which give the least degradation of fit until some stopping criterion is satisfied.
More formally, we can describe the algorithm as follows:
1. Set the initial model to be the model containing all k predictors.
2. Until some stopping rule is satisfied, find the predictor currently in the model for which the partial F-statistic is smallest and delete it from the model.
The stopping rule for backward elimination is usually to stop when some fixed
number of parameters is reached, or to stop when all of the partial F statistics
are significant at a certain level (this is what R does, and the default significance
level is 0.1).
If we fit the full model for the cheddar cheese tastings data, then we have that the
partial F statistics for the three variables are approximately 9.80 for H2S, 0.005
for acetic and 5.20 for lactic. So at the first stage of backward elimination we
delete the variable acetic. Now we consider the model involving H2S and lactic.
In this model, the partial F statistics are approximately 12.04 for H2S and 6.25
for lactic. Since both of these are significant at the 0.1 level, our stopping criterion
is satisfied and the final model is again the model involving H2S and lactic.
Step 1 2
Constant -28.88 -27.59
Acetic 0.3
T-Value 0.07
P-Value 0.942
S 10.1 9.94
R-Sq 65.18 65.17
R-Sq(adj) 61.16 62.59
C-p 4.0 2.0
Again the “Step” columns labelled 1 and 2 show the variables in the model at
each stage of the backward elimination procedure. We start with the full model,
and in the second step move to the model involving just H2S and lactic. Also
shown are coefficient estimates, the estimate of the error standard deviation, the
R2 value, adjusted R2 and Cp values.
Stepwise methods
The final sequential variable selection algorithm we will discuss is called stepwise
variable selection, and combines elements of the forward and backward elimina-
tion algorithms. There are many variations of the basic stepwise procedure, and
we will just discuss the following algorithm. Starting from an initial model (for instance the model containing just an intercept), repeat:
1. If any variable currently in the model has a partial F statistic which is not significant at some chosen level, delete the variable with the smallest partial F statistic.
2. Otherwise, if any variable not currently in the model would have a significant partial F statistic at some chosen level if added, add the variable with the largest partial F statistic.
3. If no variable could be added or deleted in the first two steps, then stop.
For the data on cheddar cheese tastings, we describe the steps involved in step-
wise variable selection. The steps involved are actually the same as for forward
selection in this case. The initial model contains just an intercept, and there is no
variable to delete. Then we consider adding a variable: the variable to be added
(with the largest partial F statistic of approximately 37.29) is H2S. This variable
cannot be deleted, and so we consider adding another variable: the largest partial
F is for lactic, with a partial F value of approximately 6.25 (which is significant
at the 0.15 level). Now consider deleting a variable: the partial F statistics in the
model involving Lactic and H2S are both significant at the 0.15 level, so there is
no variable to delete. The partial F statistic for acetic is not significant at the
0.15 level, so there is no variable to add. So we stop, and the final model is the
one involving Lactic and H2S.
Step 1 2
Constant -9.787 -27.592
Lactic 19.9
T-Value 2.50
P-Value 0.019
S 10.8 9.94
R-Sq 57.12 65.17
R-Sq(adj) 55.58 62.59
C-p 6.0 2.0
If you want to keep variables in the model regardless of their p-values, you can
enter them as “Predictors to include in every model” after choosing Stepwise....
As we have mentioned there are many variations on the basic stepwise procedure:
one common variation is to consider swaps of variables currently in the model
with variables not in the model if this improves R2 , in addition to additions and
deletions of variables.
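Sequential searches of this kind can be automated in R with step(), although step() adds and drops terms using the AIC criterion rather than partial F tests, so its path and stopping rule can differ from the procedures described above; a sketch for the assumed cheese data frame:

full <- lm(taste ~ acetic + H2S + lactic, data = cheese)
null <- lm(taste ~ 1, data = cheese)
step(null, scope = formula(full), direction = "forward")   # forward selection
step(full, direction = "backward")                         # backward elimination
step(null, scope = formula(full), direction = "both")      # stepwise search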
Example: regression for simulated data where the response and pre-
dictors are unrelated
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 4 24.0298 6.0074 8.08 0.000
Residual Error 95 70.6279 0.7435
Total 99 94.6576
Source DF Seq SS
C37 1 7.1434
C45 1 7.9000
C14 1 5.0881
C4 1 3.8982
We see here that the p-value for overall significance of the model is less than 0.0005, and the p-values for the partial t tests are also less than 0.05 for all four of the predictors. By ignoring the fact that the data were used to select which model was fitted, we compute inappropriate p-values which indicate overall significance of the model and significance of the coefficients, even though the response was generated independently of all the predictors.
A related procedure is the MAXR procedure, which keeps track of a best model of each subset size. It can be summarized as follows.
1. Set the initial model to be the best simple linear regression model (largest partial F-statistic or equivalently largest R2).
2. Until all the predictors are in the model:
(a) Add the variable not currently in the model which produces the largest increase in R2. Repeatedly cycle through all the predictors in the new model, doing the following until R2 does not increase in a cycle through all the predictors (and save the current model when this occurs).
i. Swap the current predictor with the predictor not in the model producing the largest increase in R2 (provided there is a swap which increases R2).
We now discuss the implementation of the MAXR procedure for the data on
cheddar cheese tastings. The MAXR procedure starts in the same way as for-
ward selection: we simply choose the best one variable model by finding the
predictor which has the highest absolute correlation with the response. As we
saw in the last lecture, the best one variable model for the data on cheddar cheese
tastings involved the predictor H2S. We save this model as being our best one
variable model. Then we consider adding one of the variables acetic or lactic
to the model involving H2S. The maximum increase in R2 occurs when lactic
is added (the R2 for H2S and lactic is 0.652, and the R2 for H2S and acetic is
0.582). Then for the model involving H2S and lactic, we consider a swap between
H2S and acetic: does the model with acetic and lactic have a higher R2 ? The
answer here is no (the model with acetic and lactic has an R2 of 0.520). Then
we consider a swap between lactic and acetic: but we have already seen that the
model with H2S and acetic is inferior to the one involving H2S and lactic. So
there is no swap of acetic with one of the predictors H2S and lactic which
can improve the model. So we store this model as our best two predictor model.
Of course, there is only one three predictor model. This is the end of the MAXR
procedure: we have a list of possible models for each subset size.
5 Residuals and Diagnostics
Recall that in the general linear model
y = Xβ + ε ,
the least squares estimator of β is
b = (X⊤X)⁻¹X⊤y ,
and the vector of fitted values is
ŷ = Xb = X(X⊤X)⁻¹X⊤y = Hy .
The vector of residuals is therefore
e = y − Hy = (I − H)y .    (35)
Substituting y = Xβ + ε gives
e = (I − H)y = (I − H)(Xβ + ε) = Xβ − HXβ − Hε + ε .
Since HX = X(X⊤X)⁻¹X⊤X = X, the first two terms cancel, and so
e = (I − H)ε .
In general, as the sample size n increases the elements of the hat matrix tend to
zero and the residuals ei are a good approximation to the errors εi . Note that
E(e) = 0 always holds, so that the residuals have mean zero. We considered plots
of the raw residuals against predictors and fitted values in the context of the sim-
ple linear regression model for detecting an incorrectly specified mean structure
or a violation of the assumption of constant error variance. Similar plots can be
useful in the context of the general linear model, as the following examples show.
The following example is from Keller, Warrack and Bartel, “Statistics for Man-
agement and Economics: A Systematic Approach (Second Edition),” Wadsworth,
Belmont, California, 1990, pp. 832–833.
In trying to find new locations for their restaurants, hamburger emporiums
like McDonald’s and Wendy’s usually consider a number of factors. Suppose
that a statistician working for a hamburger restaurant chain wants to construct a
regression model that will help her make recommendations about new locations.
She knows that this type of restaurant has as its primary market middle-income
adults and their children (particularly children between the ages of 5 and 12).
She collected the (fictitious) data in the table below showing annual sales in
twenty five randomly chosen areas containing about five hundred households and
exactly one of her employer’s restaurants and exactly one competitor’s restaurant.
Consider fitting a linear model to these data in order to predict annual gross sales
based on mean annual household income and mean age of children. Fitting the
model results in the following.
Analysis of Variance
Source DF SS MS F P
Regression 2 132071 66035 5.30 0.013
Residual Error 22 274025 12456
Total 24 406096
Source DF Seq SS
Income 1 77008
Age 1 55063
We see that the F test in the analysis of variance table indicates that this model
is to be preferred to the model involving just an intercept term. A plot of resid-
uals against the fitted values indicates some possible problems with the model
formulation, however (Figure 5.11). It seems that for both small and large fitted
values residuals tend to be negative, whereas away from the extreme fitted values
the residuals tend to be positive. This plot does suggest that the mean structure
of the model is not correctly specified: in particular, it may be that we need to
include some additional terms in our linear model such as the square of age or
square of income. We will look further at plots which can be used for assessing
the need for nonlinear terms in the predictors in later lectures.
As a further example of the use of the raw residuals for detecting violations of
model assumptions we consider the risk assessment data discussed in previous
lectures. We fit the full model with mean risk assessment as the response, and all
seven of the accounting determined measures of risk as predictors. A plot of the
raw residuals against the fitted values is shown in Figure 5.12 In this example,
we can see that the variability of the residuals seems to decrease for the most
extreme fitted values, and so the constancy of variance assumption is doubtful
here.
Figure 5.11: Scatter plot of residuals versus fitted values for fast food restaurants
data
Figure 5.12: Scatter plot of residuals versus fitted values for risk assessment data
The covariance matrix of the vector of residuals is
Var(e) = (I − H) Var(y) (I − H)⊤ = σ²(I − H)² = σ²(I − H) ,
since I − H is symmetric and idempotent.
We can learn a number of things from the above equation. Firstly, unlike the
errors εi , i = 1, ..., n, the residuals ei are not necessarily uncorrelated. In fact,
for i ̸= j,
Cov(ei , ej ) = −σ 2 hij .
This expression is not necessarily zero, although in general the elements of the
hat matrix will tend to zero as the sample size increases. Secondly, we have
illustrated that the residuals ei tend to be smaller in absolute value than the
errors εi , even when the model assumptions hold: recall that the leverages hii
(the diagonal elements of the hat matrix H) are all positive quantities, so that
the variances
Var(ei ) = σ 2 (1 − hii ),
are smaller than Var(εi ) = σ 2 . Furthermore, the potentially high influence points
with large leverages hii are the ones corresponding to residuals with a small vari-
ance. In effect, an observation with a high leverage will tend to pull the fitted
line towards itself, ensuring that the residual at that point will be small.
We illustrate some of the problems which can occur with plots of ordinary
residuals by considering the capital asset pricing model data discussed in pre-
vious lectures. Here we had 21 measurements of a stock price (the response) as
well as corresponding measurements of two predictors, the TSE 300 index and
Price/Earnings ratio. For the purposes of this example I have deleted two of the
observations and I will consider a simple linear regression model of Stock price
on the TSE300 index. Figure 5.13 shows a scatter plot of stock price against the
TSE300 index for my modified data set.
Figure 5.13: Scatter plot of stock price against TSE300 index for modified capital asset pricing model data.
For this modified data set, linearity of the expected response in the predictor seems reasonable, apart from the extreme
discrepant point. Of course, in a simple linear regression model we can display
the data very easily with a scatter plot, and the residual plot is perhaps not very
informative. But in multiple linear regression diagnostic residual plots may re-
veal patterns in the data that are not obvious. However, looking at the residual
plot after fitting a simple linear regression model in this example is instructive
for showing what can happen when there is a single outlier (Figure 5.14).
Figure 5.14: Residual plot for modified capital asset pricing model data.
We see
that the residual corresponding to the discrepant point is no larger than many of
the other residuals. This observation is a high influence point, and it has pulled
the fitted line towards itself so that a plot of the raw residuals does not reflect
the inconsistency of this observation with the rest of the data. The residual at
this point has a smaller variance than the other residuals, making it difficult to
see from the residual plot that there is an outlier.
The fact that the variance of the raw residuals is not constant can make it tricky
to interpret plots of the raw residuals, as we have just seen. This has motivated
the introduction of standardized and studentized residuals. The ith standardized
residual is simply the ith raw residual ei divided by its standard deviation
ei / (σ √(1 − hii)).
Since this depends on σ (which is in general unknown) we define the studentized
residuals (or more precisely the internally studentized residuals) in which σ is
replaced by σ̂:
ri = ei / (σ̂ √(1 − hii)).
For detection of outlying points you may be thinking that we could use the
PRESS residuals which were defined when we looked at model selection. Recall
that to define the PRESS residuals ei,−i we looked at the difference between the
ith response and a prediction of that response obtained by fitting a model to the
data with the ith response excluded. We gave a computational formula for ei,−i ,
namely
ei,−i = ei / (1 − hii).
We might think that if we could standardize the PRESS residuals to have constant
variance then these standardized PRESS residuals would be most appropriate for
detecting a single outlying observation which does not follow the pattern of the
rest of the data.
To standardize the PRESS residuals we first have to find their variances. We
have
Var(ei,−i) = Var(ei) / (1 − hii)² = σ²(1 − hii) / (1 − hii)² = σ² / (1 − hii).
Hence the standardized PRESS residual is
(ei / (1 − hii)) / (σ / √(1 − hii)) = ei / (σ √(1 − hii)),
which is simply the standardized residual we defined earlier. Hence it does seem ap-
propriate to use the studentized residuals to look for single outlying points in the
data.
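The identity above is easy to check numerically. The following small sketch (Python with simulated data, not one of the data sets in these notes) computes the PRESS residuals with the formula ei/(1 − hii) and verifies that, after standardization with σ̂ in place of σ, they coincide with the internally studentized residuals ri.

import numpy as np

rng = np.random.default_rng(1)
n, p = 15, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 1.5 * x + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
sigma2_hat = e @ e / (n - p)                        # usual unbiased estimate of sigma^2

press = e / (1 - h)                                 # PRESS residuals e_{i,-i}
r = e / np.sqrt(sigma2_hat * (1 - h))               # internally studentized residuals
std_press = press / np.sqrt(sigma2_hat / (1 - h))   # PRESS residual divided by its estimated sd

print(np.allclose(std_press, r))                    # True: the two sets of quantities coincide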
Figure 5.15: Plot of studentized residuals versus fitted values for modified capital
asset pricing model data.
For the risk assessment data, we previously plotted the raw residuals versus fitted
values (Figure 5.12). For the same fitted model considered before, we can also
consider plotting the studentized residuals against the fitted values. The basic
pattern observed (Figure 5.16) is the same as before, but notice that there seem
to be some differences in the relative magnitudes of the studentized residuals: the ones corresponding to the extreme fitted values are now relatively larger, compared to the others, than they were in the plot of raw residuals.
Figure 5.16: Plot of the studentized residuals against fitted values for risk assess-
ment data.
Stanley and Miller, in a 1987 RAND corporation technical report, have attempted
to build a descriptive model of the role of various design and performance factors
in modelling technological innovation in jet fighter aircraft. Using data on Amer-
ican jet fighters built since 1940, they use the date of the first flight as a stand-in
for a measure of technology; presumably, the level of technology is increasing
with time. In some of their work they considered FFD, the date of first flight, together with five other design and performance variables.
Data on 22 planes flown between 1940 and 1979 were considered (we do not re-
produce the full data set here). If we fit a model with F F D as response and the
remaining variables as predictors, and if we plot the studentized residuals against
fitted values, we obtain the graph in Figure 5.17.
Figure 5.17: Plot of studentized residuals versus fitted values for jet fighter data.
One of the studentized residuals in this plot stands out from the rest; we return to this observation below when we discuss the externally studentized residuals for this data set.
We have pointed out that studentized residuals are useful for detection of single
outlying observations. However, we have not yet formally discussed the question
of how large the studentized residual needs to be to cause concern. Obviously
the standardized residual has variance one, but when we estimate the unknown
σ 2 in construction of the studentized residual there is some extra dispersion due
to estimation of this parameter and we might expect that a distribution more
heavy tailed than the normal such as a t distribution might be appropriate for
assessing whether an observation is more extreme than would be expected under
the model assumptions. We now discuss a statistic that can be used to formally
test whether or not an observation seems to conform to the model. The test
statistic used is the following:
ti = ei / (σ̂−i √(1 − hii))
where σ̂−i is the estimate of the error variance obtained by fitting the model to all the data except the ith observation. Note that this statistic is exactly the same as the studentized residual ri except that we use σ̂−i rather than σ̂ to estimate the error variance. The quantities ti are usually called the externally studentized residuals (externally studentized since we have not used the ith observation in estimation of σ²). Use of the quantities ti for outlier detection makes good intuitive sense: if the ith observation really is an outlying observation, there is a danger that σ̂² overestimates σ² because of this extreme point, and the residual ri will be "deflated" as a result.
It can be shown that if the assumptions of the linear model hold, then ti has
a t-distribution with n − p − 1 degrees of freedom. We may test the hypothesis
that the ith observation is an outlier by expanding the original model in one of
two different ways. In the mean shift outlier model, we assume that the ith error
εi does not have mean zero, but has mean ∆i say. Alternatively, we may assume
that εi has zero mean, but that the constancy of error variance assumption breaks
down at the ith point, so that Var(εi ) = σ 2 + σi2 say. In this second scenario an
unusual observation occurs because of a large random fluctuation at the ith point
consistent with the larger error variance. We can test the hypothesis
H0 : ∆i = 0
H1 : ∆i ≠ 0
or test
H0 : σi2 = 0
against the alternative
H1 : σi² > 0
in the model where constancy of error variance is violated using ti as a test statis-
tic with null distribution the t distribution with n − p − 1 degrees of freedom. We
will say more about testing for outliers with the externally studentized residuals,
and about associated problems of multiple testing, in the next lecture.
In the mean shift outlier model introduced above it is assumed that
E(εi) = ∆i
where ∆i ≠ 0; that is, the assumption of a zero mean for the ith error breaks down, possibly leading to an outlier for the ith observation.
The second kind of departure we consider for the linear model is that
Var(εi ) = σ 2 + σi2
where σi2 > 0. In this scenario, an outlier can occur for the ith observation due
to a large random fluctuation in line with σi2 > 0.
We can test either
H0 : ∆i = 0
against the alternative
H1 : ∆i ≠ 0
in the mean shift outlier model, or test
H0 : σi² = 0
against the alternative
H1 : σi² > 0
in the model with an inhomogeneous error variance, in each case using the externally studentized residual ti as the test statistic. For both tests, under the null hypothesis ti is a realization
of a t random variable with n − p − 1 degrees of freedom in the linear model with
normal errors, so that if T ∼ tn−p−1 we compute the p-value for the tests above
as
p = P(|T | > |ti |).
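As an illustration of these definitions, here is a small sketch in Python (simulated data, with an artificial shift added to one response; the scipy t distribution is used only for the p-value). The externally studentized residual is computed directly from its definition, by refitting with the ith observation deleted.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 19, 2
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 0.8 * x + rng.normal(size=n)
y[5] += 5.0                                   # shift one response so it may look like an outlier

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y

i = 5                                         # the observation we have a prior reason to test
keep = np.arange(n) != i
b_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
resid_minus_i = y[keep] - X[keep] @ b_minus_i
sigma_minus_i = np.sqrt(resid_minus_i @ resid_minus_i / (n - 1 - p))   # sigma_hat_{-i}

t_i = e[i] / (sigma_minus_i * np.sqrt(1 - h[i]))       # externally studentized residual
p_value = 2 * stats.t.sf(abs(t_i), df=n - p - 1)       # P(|T| > |t_i|), T ~ t_{n-p-1}
print(t_i, p_value)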
Example: Capital asset pricing model
Consider the modified capital asset pricing model data which we examined in the
previous lecture. The data consisted of a response variable stock price and two
predictor variables price/earnings ratio and TSE300 index. For illustrating some
properties of raw and studentized residuals we deleted two observations from the
data set. Figure 5.13 shows the scatter plot of stock price versus TSE300 index
for the modified data set. Since the observation corresponding to the predictor
value on the extreme right is so far away from the other predictor values we may
have a prior interest in testing whether this observation follows the pattern of the
rest of the data. We can apply the testing procedure above to this observation.
The externally studentized residual for this point is approximately −3.37 here.
To compute the p-value for testing ∆i = 0 in the mean shift outlier model versus the alternative ∆i ≠ 0, or for testing σi² = 0 versus the alternative σi² > 0 in the model with a possibly inhomogeneous variance, we let T be a t random variable with n − p − 1 = 19 − 2 − 1 = 16 degrees of freedom and calculate
p = P(|T| > 3.37),
which is approximately 0.0038. So we would reject the null hypothesis for either
test at the 5 percent level, and we would conclude that this observation may need
to be investigated further.
When we have a prior reason for believing that one of the observations may be
unusual then the test we have described above is appropriate. However, if we
were to conduct this test for all the observations in our data set, then we have a
problem of multiple testing or simultaneous inference: although it is rare for an
externally studentized residual to be as large as the upper 2.5 percentage point
of a tn−p−1 distribution for a single observation, it is very likely that in a large
data set at least some of the externally studentized residuals are this large even
if all observations conform to the model. So we need to make an adjustment to
the significance level of our outlier test when we apply it to all the observations
in the data set. We can apply the idea of Bonferroni adjustment discussed in
earlier lectures in this situation.
In particular, if we apply our outlier test for each observation in the data set,
then we adjust the significance level of our test from α to α/n.
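A sketch of this Bonferroni-adjusted screen (Python, simulated data; the helper function simply repeats the leave-one-out calculation of the previous sketch for every observation):

import numpy as np
from scipy import stats

def outlier_pvalues(X, y):
    """Two-sided outlier test p-values based on the externally studentized residuals."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        r = y[keep] - X[keep] @ b
        sigma_i = np.sqrt(r @ r / (n - 1 - p))
        t[i] = e[i] / (sigma_i * np.sqrt(1 - h[i]))
    return 2 * stats.t.sf(np.abs(t), df=n - p - 1)

rng = np.random.default_rng(3)
n = 19
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 0.8 * x + rng.normal(size=n)

alpha = 0.05
pvals = outlier_pvalues(X, y)
flagged = np.where(pvals < alpha / n)[0]   # Bonferroni: compare each p-value with alpha/n
print(flagged)                             # usually empty when no observation is truly unusual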
Let us reconsider for the moment the modified capital asset pricing model data
we discussed in the previous example. If we did not have prior knowledge that
the last observation was unusual, then we might apply our outlier test to all
nineteen observations in this data set. Since we were testing before at the 5
percent level, Bonferroni adjustment suggests we should use a significance level
of 0.05/19 which is approximately 0.0026. Comparing this with the p-value we
obtained previously of approximately 0.0038, we see that the observation would
not be considered an outlier in this analysis. We should remember, however, that
the Bonferroni adjustment is a conservative one, and we should not in any case
treat formal tests too seriously in the context of model criticism: the purpose
of the test is simply to bring to our attention individual observations that may
require further investigation.
Recall the data set on technological innovation in jet fighters discussed on page
156 of your notes. We showed a plot of the internally studentized residuals when
we fitted a model with FFD as the response and the remaining five variables as
predictors in this data set. A plot of the externally studentized residuals is shown
in Figure 5.18. The value of the externally studentized residual on the extreme
right here is approximately 4.63. If we compare with the plot of the internally
studentized residuals given in your last set of lecture notes we see that this is an
example where it may be desirable to use an external estimate of the error variance
in the standardization: the internally studentized residual here is approximately
3.07, which is quite different to the value 4.63 when the possibly outlying point
is excluded in computation of the estimated standard deviation. There are 22
observations in this data set, so if we apply Bonferroni adjustment with an initial
5 percent significance level for testing for outliers, the Bonferroni significance
level is 0.05/22 or approximately 0.0023. To compute the p-value for the outlier
test, we let T be a t distributed random variable with n − p − 1 = 22 − 6 − 1 = 15
degrees of freedom, and calculate
p = P(|T| > 4.63),
which is approximately 0.0004. So we reject the null hypothesis, and this obser-
vation with an externally studentized residual of 4.63 might be considered to be
an unusual one.
As we mentioned when discussing the capital asset pricing model data, the formal
tests of hypotheses we have considered in this section are not to be taken too
seriously in the context of model criticism: their purpose is simply to bring to our attention individual observations that may require further investigation.
Figure 5.18: Plot of externally studentized residuals versus fitted values for jet fighter data set.
An alternative way to test for outlying observations is to expand the model with indicator predictors, one for each observation under suspicion: one predictor being one for the first observation and zero otherwise, another predictor being one for the second observation and zero otherwise, and so on.
Again we can examine estimates of coefficients for these predictors and examine
partial t-tests to highlight possibly unusual observations.
We consider again the capital asset pricing model data. For the purposes of
this example, consider the full data set and a simple linear regression of stock
price on TSE300 index. The scatter plot of stock price against TSE300 index
is shown in Figure 5.19. Suppose that the three observations corresponding to
the predictors on the extreme right were thought to be suspicious on the basis
of prior knowledge: the data were collected over time, and some event occurred
which caused a shift in the stock price for the last three observations collected.
We can set up an indicator variable to model the possible effect. The indicator
variable is zero for the first eighteen observations, and is one for the last three.
Analysis of Variance
Source DF SS MS F P
Regression 2 15.8294 7.9147 95.77 0.000
Residual Error 18 1.4876 0.0826
Total 20 17.3170
Source DF Seq SS
TSE 300 1 15.4001
Dummy 1 0.4293
We see from the above output that the partial t-test for the coefficient of our
dummy variable is significant at the 5 percent level: there does seem to be reason
to further examine the last three observations and to possibly exclude them from
our analysis.
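A minimal sketch of this kind of indicator variable analysis (Python, with simulated data standing in for the stock price data; the partial t statistic and its p-value are computed directly from the least squares output):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 21
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.4 * x + rng.normal(0, 0.3, size=n)
y[-3:] += 1.0                                  # a level shift in the last three responses

dummy = np.zeros(n)
dummy[-3:] = 1.0                               # indicator: 0 for the first 18 cases, 1 for the last 3
X = np.column_stack([np.ones(n), x, dummy])
p = X.shape[1]

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
sigma2 = e @ e / (n - p)
cov_b = sigma2 * np.linalg.inv(X.T @ X)        # estimated covariance matrix of b
t_dummy = b[2] / np.sqrt(cov_b[2, 2])          # partial t statistic for the dummy coefficient
p_dummy = 2 * stats.t.sf(abs(t_dummy), df=n - p)
print(b[2], t_dummy, p_dummy)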
To construct the partial regression plot for a predictor xj, we first fit a model with y as the response and all the predictors except xj, and write ey,−xj for the residuals from this fit. We then fit a second model with xj as the response and with x1, ..., xj−1, xj+1, ..., xk as predictors (that is, we regress xj on the remaining predictors). Write exj,−xj for the residuals from this second fitted model. We can think of the vector of residuals exj,−xj as the independent information in the predictor xj, after removing linear dependence on the other covariates. In the
partial regression plot we simply plot ey,−xj against exj ,−xj . This plot helps us
to visualize the relationship between y and xj after we have removed the linear
dependence of both y and xj on the remaining predictors.
It can be shown that the least squares slope of the partial regression plot
is actually bj , the least squares estimator of βj in the multiple linear regression
model involving all the predictors (see, for instance, p. 233 of Myers, “Classical
and Modern Regression with Applications (Second Edition)”, Duxbury, Belmont,
California, 1990). The partial regression plot gives an effective visual represen-
tation of the contribution of a predictor in a multiple linear regression model,
and it can be helpful for suggesting the need for an additional nonlinear term
in the model (we might decide to add xj² as a predictor, for example). Partial regression plots may also be helpful for identifying influential observations: the
partial regression plots in which an unusual observation shows up most clearly
correspond to predictors whose coefficients will be most affected by the unusual
point.
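The construction can be sketched as follows (Python, simulated data with two correlated predictors). The sketch also checks numerically that the least squares slope of the partial regression plot equals the coefficient of the predictor in the full multiple regression.

import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)             # two correlated predictors
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def resid(v, X):
    """Residuals from a least squares fit of v on X (X includes the intercept column)."""
    b = np.linalg.lstsq(X, v, rcond=None)[0]
    return v - X @ b

Z = np.column_stack([np.ones(n), x1])          # all predictors except x2
e_y = resid(y, Z)                              # e_{y, -x2}
e_x2 = resid(x2, Z)                            # e_{x2, -x2}: information in x2 not explained by x1

slope = (e_x2 @ e_y) / (e_x2 @ e_x2)           # least squares slope of the partial regression plot
b_full = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0]
print(slope, b_full[2])                        # the two values agree
# plotting e_y against e_x2 (with matplotlib, say) gives the partial regression plot itself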
We return to the example of the capital asset pricing model data. Recall that
in this data set the response was stock price, and there were two predictors,
price/earnings ratio and TSE300 index. We wish to consider the role of each
predictor in a multiple linear regression model involving both predictors. The
partial regression plots for price/earnings ratio and TSE300 index are shown
below. The partial regression plot for TSE300 index may suggest the need to
include an additional predictor such as the square of TSE300 index in the model.
If we do include the square of TSE300 index as an additional predictor then
the partial t test for the coefficient of this predictor leads to a p-value of 0.004,
so that this additional term does seem to contribute to the model.
Figures 5.24 and 5.25 show the partial residual plots for the predictors price/earnings ratio and TSE300 index for the capital asset pricing model data set. Again the plot for TSE300 index suggests a possible nonlinear contribution from this predictor. Figures 5.26 and 5.27 show the partial residual plots for the predictors income and age for the fast food restaurants data set. Again the possible need for nonlinear terms in these predictors is suggested.
A further alternative to the partial regression and partial residual plots for
detecting the need for nonlinear transformations of the predictors is the use of an
augmented partial residual plot. In the augmented partial residual plot for the
predictor xj, we fit a linear model involving all the predictors together with the additional predictor xj². Writing bj for the least squares estimator of the coefficient for xj, bj(2) for the least squares estimator of the coefficient for xj², and e for the vector of residuals, in the augmented partial residual plot we plot
e + bj xj + bj(2) xj²
against xj. The motivation for the introduction of this plot is that if the contribution of xj is really nonlinear, then the nonlinear contribution of xj may be masked by its relationship with the other terms, and including xj² as an additional predictor can help to alleviate this problem.
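A sketch of the quantities plotted (Python, simulated data in which the first predictor enters the model quadratically). The ordinary partial residual e + bj xj is computed from the fit without xj², and the augmented partial residual from the fit which includes xj²; in practice both would be plotted against xj.

import numpy as np

rng = np.random.default_rng(6)
n = 60
x1 = rng.uniform(0, 3, size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.7 * x1**2 - 1.0 * x2 + rng.normal(0, 0.5, size=n)   # x1 enters nonlinearly

# Augmented fit: all predictors plus the extra predictor x1^2
X_aug = np.column_stack([np.ones(n), x1, x1**2, x2])
b = np.linalg.lstsq(X_aug, y, rcond=None)[0]
e_aug = y - X_aug @ b
b1, b1_sq = b[1], b[2]                         # b_j and b_j^(2) in the notation above
aug_partial_resid = e_aug + b1 * x1 + b1_sq * x1**2    # plotted against x1

# Ordinary partial residual: from the fit without the x1^2 term
X = np.column_stack([np.ones(n), x1, x2])
b0 = np.linalg.lstsq(X, y, rcond=None)[0]
partial_resid = (y - X @ b0) + b0[1] * x1              # plotted against x1

print(aug_partial_resid[:3], partial_resid[:3])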
We can check the assumption of normality of the errors with a normal probability plot of the residuals. If Z(1) ≤ · · · ≤ Z(n) are the ordered values of a sample from a normal distribution with mean µ and standard deviation σ, and W(i) denotes the expected value of the ith smallest observation in a standard normal sample of size n, then approximately
Z(i) = µ + σW(i),
so that a plot of the ordered sample values against the W(i) should be roughly a straight line. In practice we apply this idea to the internally studentized residuals from the fitted model.
Figure 5.28: Plot of internally studentized residuals versus fitted values for ched-
dar cheese tastings data.
Figure 5.29: Normal probability plot of internally studentized residuals for fitted
model to cheddar cheese tastings data.
Figure 5.30: Plot of internally studentized residuals for data on income, education
and job experience.
Figure 5.31: Normal probability plot for internally studentized residuals and data
on income, education and job experience.
To help judge whether the points in a normal probability plot lie close to a straight line, it is useful to add a reference line on the plot. There are various ways of doing this. One method is to
draw a straight line through the two points corresponding to the first and third
quartiles of the sample (which are roughly the points which bound the middle
half of the data). We can interpret characteristics of the residuals based on the
way the probability plot deviates from the reference line. Points below the line
correspond to points which are smaller than we would expect based on normality,
and points above the line correspond to points which are larger than we expect.
The ‘S’ shape plot shown in Figure 5.32 indicates that the sample plotted
has fewer extreme observations (both large and small) than we would expect
under normality (and we say that we have a distribution which is light tailed
with respect to the normal). The points above the line on the left indicate that
the small values in the sample are larger than we would expect, whereas the
values below the line on the right indicate that the large values in the sample are
smaller than we would expect. If we reflect this shape through the line y = x,
Figure 5.32: Normal probability plot for sample with fewer extreme values than
we would expect based on the normal distribution (these data were simulated
from a uniform distribution on [0, 1]). The data are plotted on the y-axis.
then we get a probability plot like the one shown in Figure 5.33, which indicates
that the sample plotted has more extreme observations than we would expect
under normality (and we say that we have a distribution which is heavy tailed
with respect to the normal). Here the points below the line on the left indicate
that the small values in the sample are smaller than we would expect based
on normality, whereas the values above the line on the right indicate that the
large values in the sample are larger than we expect. The plots of Figure 5.34
Figure 5.33: Normal probability plot for sample with more extreme values than
we would expect based on the normal distribution (these data were simulated
from a t distribution with 10 degrees of freedom). The data are plotted on the
y-axis.
and Figure 5.35 show shapes which are characteristic of positive and negative
skewness respectively.
We conclude this subsection with an example to illustrate that apparent de-
viations from normality in the normal probability plot may in fact just be due to
a failure of other assumptions.
We discuss the data given in Tutorial five on inflation rates and central bank
independence for twenty three developed and developing countries. The predic-
tors here are two measures of independence (QUES and LEGAL) as well as a
binary variable DEV (which is 1 for developed economies and 0 for developing
economies). In this example we fit a model for annual inflation rate with predic-
tors QUES, LEGAL, DEV, QUES*DEV and LEGAL*DEV. A plot of internally
Figure 5.34: Normal probability plot for sample which is positively skewed (these
data were simulated from a Beta(2,20) distribution). The data are plotted on the
y-axis.
Figure 5.35: Normal probability plot for sample which is negatively skewed (these
data were simulated from a Beta(20,2) distribution). The data are plotted on the
y-axis.
studentized residuals against fitted values and the normal probability plot of these
residuals are shown in Figures 5.36 and 5.37.
Figure 5.36: Plot of internally studentized residuals against fitted values for data on inflation and central bank independence.
The normal probability plot here
indicates that the errors are heavy tailed with respect to the normal distribution:
but the real problem here may be a violation of the constancy of variance as-
sumption. Certainly from the residual plot it does not seem that the studentized
residuals are identically distributed, and the normal probability plot is not very
meaningful here.
As a further example consider the fast food restaurants data set we have dis-
cussed in previous lectures. Here we are trying to predict a measure of sales for
restaurants of a fast food chain in terms of the predictors income (a measure of
income levels for households in the area) and age (the mean age of children for
households in the area). For this example I have fitted a simple linear regression
model with sales as the response and income as the predictor. A plot of inter-
nally studentized residuals against fitted values and the normal probability plot
of these residuals are shown in Figures 5.38 and 5.39.
Figure 5.37: Normal probability plot for internally studentized residuals and data on inflation and central bank independence.
Figure 5.38: Plot of internally studentized residuals against fitted values for data on fast food restaurants.
Figure 5.39: Normal probability plot for internally studentized residuals and data on fast food restaurants.
The normal probability plot would indicate some negative skewness. However, if we look at the plot of
studentized residuals against fitted values this indicates that the mean structure
of the model is not correctly specified, so that again the normal probability plot
is perhaps not terribly helpful.
Figure 5.40: Scatter plot of y against x4 for fish biomass data set.
Figure 5.41: Scatterplot of transformed pressure (in units of 100 times log inches
of mercury) against boiling point (in degrees Fahrenheit).
There is an obvious outlier in this data set, apparent from the scatter plot: if we fit a simple linear
regression model to these data, the externally studentized residual corresponding
to this observation is 12.4037, which is extremely large. However, if we remove
this outlying point and refit the model, we find that the estimate of the slope
parameter in the simple linear regression model is little changed. So if this data
set were observed in a situation where the inference of interest concerned the
slope parameter in the model, the outlier would not correspond to an influential
observation.
The leverages hii sum to p, the number of regression parameters, since the trace of the hat matrix H is p. This implies that the average leverage value is p/n, and one guideline which is
sometimes used for isolating large leverage values is to examine further observa-
tions with leverages larger than 2p/n.
We give some examples now where we consider the externally studentized
residuals and leverages. These examples also serve to illustrate how a single ob-
servation in a data set can substantially alter inferences of interest.
Example: rat data
In this experiment the response y is the percentage of a dose of a drug found in the liver of each of 19 rats, and the dose each rat received was determined from its body weight. The hypothesis of interest was that, because body weight was used in determining the dose, there is no relationship between the percentage of the dose in the liver (y) and the body weight (x1), liver weight (x2) and relative dose (x3).
It can be shown that simple linear regressions of y on each of the predictors give
insignificant slope terms, in line with the hypothesis. However, fitting a multiple
linear regression for y involving predictors x1 , x2 and x3 results in the following
output.
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 3 0.051265 0.017088 2.86 0.072
Residual Error 15 0.089609 0.005974
Total 18 0.140874
Source DF Seq SS
Body Wei 1 0.003216
Liver We 1 0.003067
Dose 1 0.044982
We see that the partial t tests for the coefficients of body weight and dose are significant here. If we remove the term for liver weight, the coeffi-
cients for body weight and dose remain significant and the p value in the ANOVA
table for overall significance of the model is 0.032. Given that the simple linear
regressions indicate no relationship in line with our original hypothesis this seems
like a strange result. The explanation of the problem can be revealed by examin-
ing externally studentized residuals and leverages. In the model involving body
weight and dose, the leverage of the third observation is approximately 0.83,
indicating that this is a potentially influential case because it has an unusual
predictor vector (as a matter of fact, for this rat the usual rule for assigning
dosage based on body weight was not used). If we omit this observation and refit
the model we get the following output.
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 2 0.000423 0.000211 0.04 0.964
Residual Error 15 0.087138 0.005809
Total 17 0.087561
Source DF Seq SS
Body Wei 1 0.000038
Dose 1 0.000385
As we can see, no relationship between the predictors and the response in this
multiple linear regression model is indicated, in line with our hypothesis. The
apparent relationship was due to a single influential observation.
While the leverages and residuals can help to expose potential influence, it
would be nice to have measures which expose directly the influence of each obser-
vation. We will look at a number of different influence measures in this course,
each of which reflects the way that an observation will affect a certain aspect of
the fitted model. Perhaps the most commonly used measure of influence (and one
that is computed in many statistical packages) is Cook’s distance. The Cook’s
distance attempts to measure the change in the estimated parameters when an
observation is deleted. Write b−i for the estimate of the regression parameters β
when we fit using all the data except the ith observation, and write b for the es-
timate of β based on the complete data set. If we write σ̂² for the usual unbiased estimator of σ², then the Cook's distance for the ith observation is simply
Di = (b − b−i)⊤ X⊤X (b − b−i) / (p σ̂²).
To understand the motivation for this influence measure, recall that a 100(1 − α)
percentage confidence region for the vector of parameters β in the general linear
model is given by
{β* : (b − β*)⊤ X⊤X (b − β*) / (p σ̂²) ≤ Fp,n−p;α},
which is based on the distributional result
(b − β)⊤ X⊤X (b − β) / (p σ̂²) ∼ Fp,n−p.
So if we compare the Cook’s distance to, say, the 50 percentage point of an F
distribution with p and n − p degrees of freedom, we are saying that deletion of
the ith case moves the estimate of the parameters outside a 50 percent confidence
region for β based on the fit to the full data. We point out that it is not necessary
to refit the model with one observation deleted in order to compute the Cook’s
distance. It can be shown that
Di = (ri² / p) · (hii / (1 − hii))
where ri is the internally studentized residual and hii is the leverage. So the
Cook’s distance gives an easily computed and interpreted measure of how much
the parameter estimates are influenced by each observation.
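A sketch of this computation (Python, simulated data): Cook's distances are obtained from the leverages and internally studentized residuals, and the value for the most influential case is checked against the definition in terms of b − b−i by actually refitting with that case deleted.

import numpy as np

rng = np.random.default_rng(7)
n = 25
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.3 * x + rng.normal(size=n)
p = X.shape[1]

XtX = X.T @ X
H = X @ np.linalg.inv(XtX) @ X.T
h = np.diag(H)
e = y - H @ y
sigma2 = e @ e / (n - p)
r = e / np.sqrt(sigma2 * (1 - h))              # internally studentized residuals

D = (r**2 / p) * h / (1 - h)                   # Cook's distances from the computational formula

# Check the most influential case against the definition, refitting without it
i = int(np.argmax(D))
keep = np.arange(n) != i
b = np.linalg.lstsq(X, y, rcond=None)[0]
b_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
D_def = (b - b_minus_i) @ XtX @ (b - b_minus_i) / (p * sigma2)
print(D[i], D_def)                             # the two values agree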
We return briefly to the rat data. We identified the third case here as being a po-
tentially influential one, and confirmed this by deleting this case and reanalyzing
the data. Examining the Cook’s distance shows directly from the fit to the full
model that this observation is an influential one: the Cook’s distance for this ob-
servation is 1.82 (the next largest value for the Cook’s distance is approximately
0.21).
Consider once again the data on manpower needs for bachelor officer’s quarters.
In this data set, observation 23 was influential: does the Cook’s distance reveal
this as we expect? The Cook’s distance for this observation is 115.041 (the next
largest value is 5.89). We will illustrate in the lecture computation of percentage
points of an F distribution for assessment of when the Cook’s distance is large.
Now, note that Xb is simply ŷ, the vector of fitted values for the complete data set. Also, Xb−i is the vector of fitted values based on the fit with the ith case deleted, and we write ŷ−i = Xb−i. So Di becomes
Di = (ŷ − ŷ−i)⊤ (ŷ − ŷ−i) / (p σ̂²).
So we can interpret Di as the squared Euclidean distance between the vectors of fitted values for the full data set and the data set with the ith case deleted, scaled by the constant 1/(p σ̂²). You may have felt that looking at
the change in the vector of fitted values would be the appropriate way of assessing
influence, particularly if prediction is the goal of a regression analysis: we now
see that this is closely related to looking at the Cook’s distance, which measures
the change in the estimate of β upon deletion of the ith case.
A statistic which is sometimes used to look at the change in the fitted value
at the ith point upon deletion of the ith case is DFFITSi , defined to be
DFFITSi = (ŷi − ŷi,−i) / (σ̂−i √hii),
where we have written ŷi,−i for the ith element of ŷ−i and σ̂−i is our estimate of the residual standard deviation based on the fit with the ith case deleted. Note that the standard error of ŷi is just σ√hii, so DFFITSi is estimating the number
of standard errors by which the fit changes at the ith point when the ith case
is deleted from the fit. One rule of thumb for assessing the magnitude of values
of DFFITS is to say that an absolute value larger than 2 is large. The statistic
DFFITSi is closely related to the Cook's distance Di : it can be shown that
DFFITSi² = (b − b−i)⊤ X⊤X (b − b−i) / σ̂−i²,
so that the square of DFFITSi is simply p times a statistic that is very similar
to the Cook’s distance. Evidently the Cook’s distance and DFFITS give much
the same information, although DFFITS does give information about the sign of
the change in the fitted value upon deletion of the ith case. As for the Cook’s
distance, we can compute DFFITS without having to fit a new model when we
delete each case: the computational formula for DFFITSi is
DFFITSi = ti √(hii / (1 − hii))
where ti is the externally studentized residual and hii is the leverage. We return
to some of the examples in the last lecture (the rat data and the BOQ data) to
illustrate the use of DFFITS.
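Before those examples, here is a sketch of the DFFITS computation (Python, simulated data with one response shifted artificially); the externally studentized residuals are obtained by refitting without each case, and the rough |DFFITS| > 2 rule of thumb is applied.

import numpy as np

rng = np.random.default_rng(8)
n = 25
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.3 * x + rng.normal(size=n)
y[0] += 4.0                                    # make one case unusual
p = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y

dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_minus_i = y[keep] - X[keep] @ b_minus_i
    sigma_minus_i = np.sqrt(resid_minus_i @ resid_minus_i / (n - 1 - p))
    t_i = e[i] / (sigma_minus_i * np.sqrt(1 - h[i]))   # externally studentized residual
    dffits[i] = t_i * np.sqrt(h[i] / (1 - h[i]))       # computational formula for DFFITS

print(np.where(np.abs(dffits) > 2)[0])         # cases flagged by the |DFFITS| > 2 rule of thumb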
Recall from last lecture the rat data, where we were interested in modelling the
percentage of a drug in the livers of 19 rats after a period of time in terms of pre-
dictors body weight, liver weight and relative dose. When discussing this example
in the last lecture we determined that the 3rd observation was very influential: it
has a high leverage value, since the method for assigning doses for the other rats
was not followed in this case, and so the vector of predictors is unusual here. The
Cook’s distance for this observation based on a model involving body weight and
relative dose is 1.82, which may be compared to the upper 50 percentage point
of an F distribution with 3 and 16 degrees of freedom (approximately 0.82). The
DFFITS for the 3rd case and the model involving body weight and relative dose
is approximately 2.35 here, which suggests that deletion of this case does sub-
stantially influence the fit at this point. Note that with p = 3 we have that p
times the Cook’s distance is approximately 5.46, which is roughly the square of
DFFITS (2.352 or approximately 5.53). So the Cook’s distance and DFFITS do
give much the same kind of information.
As a further example we consider the BOQ data. Recall that the goal of the
analysis with this data set was to predict manpower needs for bachelor officer’s
quarters in the US Navy. The response is man hours (a measure of manpower
needs), and there are seven predictors measured for 25 different naval installa-
tions. In the last lecture we considered fitting a model for the response including
all seven predictors. We identified the 23rd case as being an unusual one: it has
both a large externally studentized residual and a high leverage value. The Cook’s
distance for this observation is large (approximately 115.04) which is much larger
than the upper 50 percentage point of an F distribution with 8 and 17 degrees of
freedom (approximately 0.96). The DFFITS for the 23rd case is approximately
−48.52, which shows that the fitted value at this point changes by approximately
48.52 times the standard error of the fitted value at this point when this case is
deleted: this is certainly an influential case.
In defining the Cook’s distance our motivation was to identify observations which
have a substantial effect on the estimate of β. Once we have identified an ob-
servation which does appear to have an effect on estimation of β, it may be of
interest to ask which components of the estimate b of β are influenced most. It
may be that an unusual observation has a large influence on some of the coef-
ficient estimates but not on others. An influence statistic which can help us to
determine which of the coefficient estimates are most influenced by an unusual
observation is the DFBETAS statistic. For a given observation i and the jth coefficient, DFBETAS measures the change in the estimate bj of βj when the ith observation is deleted, scaled by an estimate of the standard error of bj; large absolute values point to coefficients whose estimates are strongly influenced by that observation.
As a further example we consider the BOQ data. From the Cook's distance we saw that observation 23 exerts substantial influence on the coefficient estimates: which
coefficient estimates are most influenced? Again we can compute the DFBETAS
to answer this question. For the intercept the DFBETAS for observation 23 is
approximately −0.25, for x1 it is −44.38, for x2 it is 1.17, for x3 it is 0.66, for x4
it is −4.57, for x5 it is 1.18, for x6 it is −3.94 and for x7 it is 7.28. Obviously the
DFBETAS for x4 , x6 and x7 are all large here, and the DFBETAS for x1 is very
large at −44.38.
We discuss one more influence measure which is less commonly used than
DFFITS and DFBETAS, and which gives an overall measure of how much the
standard errors of parameter estimates are influenced by deletion of cases. In
particular, we define COVRATIOi to be
COVRATIOi = (σ̂−i² / σ̂²)^p · (1 / (1 − hii)).
We won’t discuss the motivation for the definition of this statistic in too much
detail. You should just think of this statistic as measuring in a global sense how
much standard error estimates of the coefficients are influenced by deletion of the
ith case. A rule of thumb for assessing the magnitude of this statistic is to further
examine observations which have a COVRATIO larger than 1 + 3p/n or smaller
than 1 − 3p/n. We illustrate the use of the COVRATIO statistic by looking at
the fish biomass data discussed last lecture.
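First, a sketch of the COVRATIO computation and the 1 ± 3p/n rule of thumb (Python, simulated data; σ̂−i² is obtained by refitting without the ith case):

import numpy as np

rng = np.random.default_rng(9)
n = 25
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.3 * x + rng.normal(size=n)
p = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
sigma2 = e @ e / (n - p)

covratio = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_minus_i = y[keep] - X[keep] @ b_minus_i
    sigma2_minus_i = resid_minus_i @ resid_minus_i / (n - 1 - p)       # sigma_hat_{-i}^2
    covratio[i] = (sigma2_minus_i / sigma2) ** p / (1 - h[i])          # COVRATIO_i

flag = (covratio > 1 + 3 * p / n) | (covratio < 1 - 3 * p / n)
print(np.where(flag)[0])                       # observations worth a further look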
Consider again the fish biomass data. For this data set the response y was fish
biomass, and there were four predictors consisting of measurements of stream
characteristics which were thought to relate to fish biomass. Recall that when
we considered a simple linear regression model using the fourth predictor (area
of stream with a depth greater than 25 centimetres) we saw that there were two
observations with very high leverages (see Figure 5.40). We might expect that
these observations exert a substantial influence on standard errors of coefficient
estimates: computing the COVRATIO for these observations we get approxi-
mately 0.00589 for the third observation and approximately 0.492 for the fourth
observation. These very small COVRATIO values reflect the fact that precision
of estimation is substantially reduced by inclusion of these observations.
To see why the square root transformation can stabilize the variance, consider a first order Taylor expansion of √y about the expected value E(y) of y. We have that √y is approximately
√E(y) + (1 / (2√E(y))) (y − E(y)).
From this we have that the variance of the square root of y is approximately
Var(y) / (4E(y)),
and hence if the variance of y is proportional to its mean, the variance of the
square root of y should be approximately constant. The square root transforma-
tion is often considered when a linear model is used for an approximate analysis
of count data. If a Poisson distribution is appropriate for the counts, then the
variance is proportional to the mean, and so the square root transformation has
some theoretical rationale in this situation.
A log transformation is appropriate when the error standard deviation is
proportional to the mean (that is, when the standard deviation is a percentage
of the response). We can consider a Taylor expansion as in our discussion of the
square root transformation to see why the log function will stabilize the variance
in this situation. We have that log y is approximately
log E(y) + (y − E(y)) / E(y),
so that the variance of log y is approximately
Var(y) / E(y)².
Again we require the responses to be positive for a log transformation to be appli-
cable. If some of the responses are zero, the transformation log(y+1) is sometimes
used. Another common variance stabilizing transformation is the inverse func-
tion, 1/y. This transformation is applicable when the responses are positive and
when the error standard deviation is proportional to the mean squared. If some
of the responses are zero, the transformation 1/(y + 1) is sometimes used.
Evaluating transformations
One method for comparing models is the following. As usual write y for the
response, and suppose that after fitting a multiple linear regression to y an exam-
ination of diagnostic statistics indicates the need for a variance stabilizing trans-
formation. We apply the variance stabilizing transformation z = f (y) (where f
is invertible) and fit a multiple linear regression model to z. Is the model on the
transformed scale to be preferred if our goal is to predict y? We can develop a
statistic for the model on the transformed response which can be compared with
the PRESS statistic for the model for y (or with the corresponding statistic for a
model involving a different transformation of the response). As in our discussion
of the PRESS statistic, write ẑi,−i for the fitted value for z obtained from fitting a model to all the data except the ith observation. The ith PRESS residual on the transformed scale is of course simply zi − ẑi,−i. However, as we are interested in prediction on the original scale we consider transforming ẑi,−i by f⁻¹(·) (where f⁻¹(·) denotes the inverse of f(·)) to get a prediction of yi, and then the analogue of the PRESS residual on the original scale is yi − f⁻¹(ẑi,−i). We can calculate
Σ_{i=1}^{n} (yi − f⁻¹(ẑi,−i))²
and compare this to the PRESS statistic for the model for y or to a similar statistic
for a different transformation to get an idea of whether predictive performance is
improved on the original scale by using the variance stabilizing transformation.
Alternatively, we could compare the sum of the absolute PRESS residuals for the
model of the untransformed response with
Σ_{i=1}^{n} |yi − f⁻¹(ẑi,−i)|.
It is important to stress that models must be compared on the original scale and
that we cannot simply look at the R2 or error standard deviation for models at
different scales. We give some examples to illustrate the use of variance stabiliz-
ing transformations.
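Before the examples, here is a sketch of the comparison just described for a log transformation, f(y) = log y with f⁻¹ = exp (Python, simulated data in which the error variability grows with the mean). The leave-one-out predictions are computed with the PRESS shortcut ei/(1 − hii) rather than by refitting n times.

import numpy as np

rng = np.random.default_rng(10)
n = 40
x = rng.uniform(1, 10, size=n)
y = np.exp(0.2 + 0.25 * x + rng.normal(0, 0.2, size=n))   # variability grows with the mean
X = np.column_stack([np.ones(n), x])

def loo_predictions(X, t):
    """Leave-one-out fitted values t_hat_{i,-i}, using the PRESS shortcut e_i/(1 - h_ii)."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = t - H @ t
    return t - e / (1 - h)                     # t_hat_{i,-i} = t_i - e_i/(1 - h_ii)

# Untransformed response: the ordinary PRESS statistic
press_y = np.sum((y - loo_predictions(X, y)) ** 2)

# Transformed response z = log(y): back-transform the leave-one-out predictions with exp
zhat_loo = loo_predictions(X, np.log(y))
press_log = np.sum((y - np.exp(zhat_loo)) ** 2)

print(press_y, press_log)                      # compare predictive performance on the original scale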
The following example is from Weisberg, S., “Applied Linear Regression (Second
Edition),” Wiley, New York, 1985, p. 102.
Aerial survey methods are regularly used to estimate the number of snow
geese in their summer range areas west of Hudson Bay in Canada. To obtain
estimates, small aircraft fly over the range and, when a flock of geese is spotted,
an experienced person estimates the number of geese in the flock. To investigate
the reliability of this method of counting, an experiment was conducted in which
an airplane carrying two observers flew over 45 flocks, and each observer made an estimate of the number of geese in each flock. A photograph of each flock was also taken, from which an accurate count (the photo count) could be obtained. The photo counts and observer counts are shown in the table below.
46 50 119 75
38 25 165 100
25 30 152 150
48 35 205 120
38 25 409 250
22 20 342 500
22 12 200 200
42 34 73 50
34 20 123 75
14 10 150 150
30 25 70 50
9 10 90 60
18 15 110 75
25 20 95 150
62 40 57 40
26 30 43 25
88 75 55 100
56 35 325 200
11 9 114 60
66 55 83 40
42 30 91 35
30 25 56 20
90 40
Figure 5.42: Scatter plot of photo count against observer count for data on snow
geese.
Figure 5.43: Scatter plot of square root of photo count against square root of
observer count for data on snow geese.
After taking square roots of both counts (Figure 5.43), the variability appears more nearly constant across the range of the predictor, although there is still a suggestion that the variance increases with the
mean. Figure 5.44 shows the scatter plot of the log of photo count against the
log of observer count.
Figure 5.44: Scatter plot of log of photo count against log of observer count for data on snow geese.
The log transformation would appear to be better than
the square root transformation for stabilizing the variance, although there are
reasons for preferring the square root for interpretability (since the data consist
of counts).
We can compare the models for y, √y and log(y) in terms of predictive per-
formance on the scale of y by using the analogue of the PRESS statistic that we
have just developed. Computation of these statistics will be illustrated in lectures.
The PRESS statistic for the simple linear regression model with y as response
and x as predictor is approximately 172738. The analogue of this statistic for the
model involving the square root of y is 137603, and for the log of y is 122740. The
sum of the absolute PRESS residuals for the model involving y is approximately
1475.55, and the analogue of this statistic is 1295.89 and 1257.81 for the models
involving the square root of y and log of y respectively. So for predictive perfor-
mance on the untransformed scale it would seem that the model for log(y) is best.
We consider the data on inflation rates for 22 different countries which was con-
sidered in previous lectures. The response here was INF (inflation rate), and we
were interested in predicting INF using measures of central bank independence
(QUES, LEGAL) and an indicator variable DEV (one for developed countries,
zero for developing countries). For the purposes of this example we consider just
a simple linear regression model for INF with QUES as the predictor. The scatter
plot of INF against QUES is given in Figure 5.45. Variation in the response seems
Figure 5.45: Scatter plot of INF against QUES for data on inflation rates and
central bank independence.
to increase with the mean here, and a variance stabilizing transformation might
be considered. Figure 5.46 shows a scatter plot of the log of INF against QUES,
which seems to indicate that the assumption of constancy of variance would be
more nearly reasonable if the response were modelled on this scale. There is a
clear outlier in this scatter plot.
Again we can fit models at the different scales and compare predictive perfor-
mance via the analogue of the PRESS statistic that we have developed. The
PRESS statistic for the model for INF in terms of QUES is approximately 16070.9, and the analogue of this for the model with the log of INF as response is 21610.8.
Figure 5.46: Scatter plot of log(INF) against QUES for data on inflation rates and central bank independence.
However, if we look at the sum of the absolute PRESS residuals for the
model of INF we get 433.84, whereas for the model of log INF the corresponding
statistic is 431.193. The conflict between the two measures of predictive perfor-
mance is due to the outlier, and the model for the log of INF would be preferable
on the untransformed scale if the outlier is removed.
When the errors have a covariance matrix V which is not of the usual form σ²I, the appropriate (generalized, or weighted, least squares) estimator of β is
β̂ = (X⊤V⁻¹X)⁻¹ X⊤V⁻¹y,
where X is the design matrix and y is the vector of responses. In the situation where V = σ²W and W is the diagonal matrix with diagonal elements w1, ..., wn we can write
β̂ = (X⊤W⁻¹X)⁻¹ X⊤W⁻¹y,
and we see that β̂ doesn't depend on the (unknown) σ². The covariance matrix of β̂ can be shown to be
(X⊤V⁻¹X)⁻¹,
which reduces to
σ²(X⊤W⁻¹X)⁻¹
when V = σ²W. It can be shown that β̂ minimizes
Σ_{i=1}^{n} wi⁻¹ (yi − xi⊤β)²,
which is a least squares type criterion in which observations with larger variances
receive less weight. Much of the theory of linear models with a constant error
variance can be carried over to the situation we have just described, and if we
can specify the variances σi2 of the errors or weights wi in a natural way then
modelling the inhomogeneity of the variances may be preferable to the application
of a variance stabilizing transformation. Sometimes the variances σi2 can be
estimated: for instance, if we have many observations for each set of predictor
values, then we could estimate σi² from the data. Another possibility which is
reasonable in some situations is to take the weights wi to be given by the values
of one of the predictors.
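A sketch of a weighted least squares fit (Python, simulated data in which the error variance is proportional to a known weight variable; here the weights wi are simply taken to be the values of the predictor, as suggested above):

import numpy as np

rng = np.random.default_rng(11)
n = 40
x = rng.uniform(1, 10, size=n)
w = x                                          # suppose Var(eps_i) = sigma^2 * w_i with w_i = x_i
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * np.sqrt(w))

X = np.column_stack([np.ones(n), x])
W_inv = np.diag(1.0 / w)

# beta_hat = (X' W^{-1} X)^{-1} X' W^{-1} y
beta_wls = np.linalg.solve(X.T @ W_inv @ X, X.T @ W_inv @ y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_wls, beta_ols)

# sigma^2 is estimated from the weighted residual sum of squares, and the covariance
# matrix of beta_hat is then sigma^2 (X' W^{-1} X)^{-1}
resid = y - X @ beta_wls
sigma2_hat = np.sum(resid**2 / w) / (n - 2)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ W_inv @ X)
print(np.sqrt(np.diag(cov_beta)))              # estimated standard errors of the coefficients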
Example: transfer efficiency data
We illustrate the use of weighted regression with an example from Myers (1990),
“Classical and Modern Regression Analysis with Applications (Second Edition)”,
Duxbury, Belmont, California, p. 281–282. The response y here is a measure of
the efficiency of a particular electrostatic type of spray paint equipment. We are
interested in modelling the response as a function of two predictors, air velocity
and voltage. An experiment was conducted in which two levels of air velocity and
two levels of voltage were considered, and ten observations were taken for each of
the four possible air velocity/voltage combinations. The data are shown in the
table below. Since we have ten replicates for each distinct combination of the
Voltage
50 70
Air Velocity 60 87.5 88.2 77.4 68.1
88.1 87.3 70.7 65.3
89.5 89.2 67.0 61.0
86.2 85.9 71.7 81.7
90.0 87.0 79.2 60.3
120 82.5 81.3 61.2 50.7
81.6 80.7 67.2 52.3
77.4 79.3 55.9 68.6
81.5 82.0 52.0 69.5
79.7 79.2 63.5 70.1
predictor values, we can estimate the variance at each set of predictor values. We
can use these estimated variances in a weighted regression analysis of the data.
Weighted analysis using weights in Weights
It is important to note that while the ordinary least squares estimator will be
consistent even in the case where the error variances are unequal in the linear
model, an appropriate weighted least squares estimator will generally be much
more efficient (have a smaller variance). The ordinary least squares estimator
will be heavily influenced by unreliable high variance observations, whereas the
role of these observations is downplayed in the weighted analysis.
Transformations can also be used to linearize a nonlinear relationship between the response and a predictor. Some common cases are the following.
Parabolic relationship:
y = β0 + β1x + β2x².
Exponential relationship:
y = β0 exp(β1x),
so that taking logarithms gives
log y = log β0 + β1x,
or
log y = β0* + β1*x.
So if we see an exponential pattern in a scatter plot of y against x, a transformation of y to log y might be considered.
Inverse exponential relationship:
y = β0 exp(β1/x),
so that
log y = log β0 + β1/x,
and a linear model with log y as the response and 1/x as the predictor might be considered.
This example is from your text book (Myers (1990), “Classical and Modern Re-
gression with Applications (Second Edition),” Duxbury, Belmont, California, p.
299). These data were collected in an attempt to predict manpower requirements
for surgical services at US naval hospitals. The response y is man hours per
month measured for fifteen hospitals, and the predictor x is the number of sur-
gical cases at these hospitals. The data are shown in the table below. A scatter
plot of y against x is shown in Figure 5.47. The scatter plot certainly seems to
show a nonlinear relationship between y and x. However, there are a number of
different possibilities when it comes to modelling this nonlinear relationship. On
the basis of the above discussion, we consider the simple linear regression model
involving y and x, as well as a parabolic model and inverse exponential model.
1275 230
1350 235
1650 250
2000 277
3750 522
4222 545
5018 625
6125 713
6200 735
8150 820
9975 992
12200 1322
12750 1900
13014 2022
13275 2155
For the simple linear regression model with y as response and x as the predictor, the PRESS statistic is 35,927,143 and the sum of the absolute PRESS residuals is 19064.7. For the parabolic model (where y is the response and we use x and x² as predictors), the PRESS statistic is 7,057,692 and the sum of the absolute PRESS residuals is 9348.06.
For the inverse exponential model, we fit a model with log y as the response
and 1/x as the predictor. We use the analogue of the PRESS residuals developed
in the last lecture to derive statistics for comparison with the PRESS statistic
and sum of absolute PRESS residuals. The sum of squared prediction errors on
the original scale for this model is 20,684,140, and the sum of absolute prediction errors is 14,780.7.
From the above, it would appear that for predictive purposes the parabolic
model is best. The inverse exponential model also seems better than the simple
linear regression model with y as response and x as the predictor.
Suppose now that we believe the response is related to a predictor through an exponential relationship with an additive error, say yi = β0 exp(β1xi) + εi, where the εi are zero mean errors with constant variance. Taking logarithms of the mean function might suggest fitting the linear model
log yi = log β0 + β1xi + ηi,
where the ηi are zero mean with a constant variance. In other words, if we have
reasons for believing the first model stated above holds, then applying a trans-
formation which linearizes the mean in unknown parameters does not justify the
linear model above: we have to consider the effect on the errors of the transfor-
mation. It may be better to work with the original nonlinear model: we will not
discuss nonlinear regression in this course, but you may see more on this if you
do further statistics courses.
Relationships among the predictors may obscure the role that each individual predictor plays in a
multiple linear regression model. We have already discussed ways of diagnosing
the need to transform in a multiple regression. In particular, we have looked at
partial residual and regression plots and added variable plots. However, it may
not always be clear from these plots what kind of nonlinear transformation of a
predictor or predictors may be best.
Weisberg (1985), “Applied Linear Regression (Second Edition)”, Wiley, New
York, distinguishes two situations in considering the need to transform the pre-
dictors in linear regression models. In the case where the expected response takes
a minimum or maximum within the range of the predictors, it is natural to in-
troduce powers of the original predictors and products of the original predictors
(that is, we consider polynomial functions for approximating the response). In
the case where the expected response is increasing or decreasing in the predic-
tors, it may be more natural to consider so-called power transformations of the
original predictors. For instance, if we wish to transform the first predictor x1 in
a multiple linear regression model, in the first situation we might introduce an
additional term involving x1² into the model. For the case where power transformations are considered, we replace the term involving x1 with a term of the form x1^α1 for some exponent α1 (for instance, we might consider α1 = 1/2, the square
root of x1 ). In this subsection we discuss a procedure for estimating the expo-
nent α1 when a power transformation is considered, and in the next subsection
we briefly consider polynomial regression.
Consider the multiple linear regression model
yi = β0 + β1xi1 + · · · + βkxik + εi,
where the εi are zero mean errors and the usual linear model assumptions hold.
For simplicity suppose we wish to investigate the need to transform just the
first predictor x1 (although the discussion which follows may be generalized to
the situation where we consider the need to transform more than one predictor
simultaneously). We write the above model as
yi = β0 + β1xi1 + Σ_{j=2}^{k} βjxij + εi.
If we now replace the term β1xi1 by β1xi1^α1 and expand xi1^α1 in a first order Taylor series about α1 = 1 (so that xi1^α1 is approximately xi1 + (α1 − 1)xi1 log xi1), we obtain the approximate model
yi = β0 + β1xi1 + β1(α1 − 1)xi1 log xi1 + Σ_{j=2}^{k} βjxij + εi,
and if we write η = β1(α1 − 1) we have a model which is linear in unknown param-
eters β0 , ..., βk , η where we have introduced a new predictor xi1 log xi1 . Examining
an estimate of η can help us to decide whether there is a need to transform (since
η = 0 when either α1 = 1 or β1 = 0).
The crude method for estimating α1 suggested by Weisberg (1985) is as fol-
lows. First we fit the linear model in which x1 enters untransformed, and write b1 for the estimate of β1 in this model. Then fit the model with the additional predictor xi1 log xi1, and write η̂ for the estimate of η in this model. Since η = β1(α1 − 1), one natural estimate of α1 is
obtained by solving for α1 in this equation and plugging in b1 and η̂: we obtain the estimate
α̂1 = η̂ / b1 + 1.
Weisberg warns that this crude estimator may not perform very well in some
situations, particularly if the partial t-statistic for b1 is small (since in this case
β1 is possibly close to zero, and in the above expression where b1 appears in the
denominator we do not expect the estimate α̂1 to be reliable). We give some
examples to illustrate the use of the Box-Tidwell procedure.
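Before the examples, the procedure can be sketched as follows (Python, simulated data in which the first predictor really enters through a square root, so the true exponent is 1/2):

import numpy as np

rng = np.random.default_rng(12)
n = 60
x1 = rng.uniform(0.5, 10, size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * np.sqrt(x1) + 0.5 * x2 + rng.normal(0, 0.2, size=n)   # true exponent is 1/2

# Step 1: fit the model with x1 untransformed and record b1
X1 = np.column_stack([np.ones(n), x1, x2])
b1 = np.linalg.lstsq(X1, y, rcond=None)[0][1]

# Step 2: fit the model with the additional predictor x1*log(x1) and record eta_hat
X2 = np.column_stack([np.ones(n), x1, x1 * np.log(x1), x2])
eta_hat = np.linalg.lstsq(X2, y, rcond=None)[0][2]

alpha1_hat = eta_hat / b1 + 1.0                # crude one-step estimate of the exponent alpha_1
print(alpha1_hat)                              # the true value here is 0.5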
The following example is discussed in Weisberg (1985) and concerns a data set
supplied by Stephen J. Gould. The data set consists of two variables, which are
the measured area (in hundreds of square metres) and perimeter (in hundreds of
metres) for 25 Romanesque churches. The data are given in the table below. A
scatter plot of area against perimeter is shown in Figure 5.48. From the scatter
plot, evidently a simple linear regression model of area against perimeter does not
seem to be reasonable here. Weisberg (1985) gives an argument which suggests
using the square root of the area as the response. We will return to a discussion
of transformation of the response in this example in later lectures. We investigate
the need to transform the predictor perimeter in a model for the square root of
area. We consider first the model
√Areai = β0 + β1 Perimeteri + εi.
The estimate of β1 when this model is fitted is b1 = 1.544. Now consider the
model
√Areai = β0 + β1 Perimeteri + η Perimeteri log(Perimeteri) + εi.
When we fit this model, we get η̂ = −0.6726. Hence for a model involving a power transformation Perimeter^α1 a crude estimate of α1 is approximately
α̂1 = −0.6726 / 1.544 + 1 = 0.56.
Figure 5.48: Scatter plot of area against perimeter for 25 Romanesque churches.
Since this value is close to 0.5, we consider a model involving a square root trans-
formation. A scatter plot of the square root of area against the square root of
perimeter is shown in Figure 5.49. We may still not be entirely happy with this
Figure 5.49: Scatter plot of square root of area against square root of perimeter
for 25 Romanesque churches.
model. The relationship between the transformed response and transformed pre-
dictor does not seem to be quite linear (there seem to be some departures from
linearity for large and small perimeter values) and possibly the variance increases
as the mean response increases.
As an example, consider again the fast food restaurants data, where we fit a model for sales with predictors income, age, the squares of income and age, and the interaction income*age. Part of the output from fitting this model is shown below.
Analysis of Variance
Source DF SS MS F P
Regression 5 368140 73628 36.86 0.000
Residual Error 19 37956 1998
Total 24 406096
Source DF Seq SS
Income 1 77008
Age 1 55063
Incomesq 1 202845
Agesq 1 24551
Income*A 1 8674
The p-value for the partial t-statistic for Income*Age is 0.051, which suggests
dropping the interaction term from the model at the five percent level in the
presence of the other terms (although this is borderline). The p-value for the
partial t-statistic for Age is large here also.
We return now to transformations of the response in a simple or multiple regression model. So far we have only considered the use of
a transformation of the response to achieve constancy of error variance when
certain mean/variance relationships hold. However, a transformation of the re-
sponse can also be helpful for achieving linearity of the mean response in the
original predictors, and for making the assumption of normality of errors more
reasonable. In this subsection we consider a class of transformations indexed by a
single parameter, and a way of choosing a transformation within this class which
attempts to make the specification of the mean structure, constancy of error
variance assumption and normality of the errors reasonable on the transformed
scale.
The class of transformations we consider is the class of Box-Cox transforma-
tions. We saw this class of transformations when we discussed the Box-Tidwell
procedure. If y is the response (which we assume for the moment to be positive),
then we consider
w = (y^λ − 1)/λ   if λ ≠ 0,
w = log(y)        if λ = 0.
Here λ is a real valued parameter. The reason for defining w = log(y) when λ = 0
is that (y^λ − 1)/λ approaches log(y) as λ approaches zero (see our discussion of
the Box-Tidwell procedure). If we find upon fitting a multiple linear regression
model that the assumptions of the linear model are violated (either specification
of the mean structure seems incorrect or the error variance does not seem to be
constant for instance) then one thing that might be considered is to transform
the response using the above transformation for a given value of λ. How do we
choose λ? Box and Cox have suggested an automatic way of choosing λ based
on the data, which we now describe.
Estimating λ
The proposal by Box and Cox for estimating λ is to use maximum likelihood.
If for a fixed λ the transformed responses wi are normal and uncorrelated with
common variance σ 2 , then we can write down the joint density of w1 , ..., wn . If
w = (w1 , ..., wn )⊤ , we write
w = Xβ + ε
where as usual X is the design matrix (n × p), β is a p × 1 vector of parameters
and ε is a vector of uncorrelated zero mean normal errors with variance σ 2 . We
can easily write down the density of w, which is of course a product of univariate
normal densities. From this density function and from the relationship between
wi and the untransformed response yi , we can write down the joint density of
y = (y1 , ..., yn )⊤ . In terms of the parameters β, σ 2 and λ, this joint density is
(2π)^{−n/2} (σ²)^{−n/2} exp( −(1/(2σ²)) (w − Xβ)⊤(w − Xβ) ) ( ∏_{i=1}^n y_i )^{λ−1}.
(If you are familiar with transformations of random variables, this is easily de-
rived: if not, you may take this on trust).
This joint density function considered as a function of the parameters β, σ 2
and λ is the likelihood function. It can be shown that if we take logarithms and
maximize with respect to β and σ 2 , then ignoring additive constants and writing
RSS(λ) for the residual sum of squares when we use the value λ in the Box-Cox
transformation, we obtain a function of λ, L(λ):
L(λ) = −(n/2) log(RSS(λ)) + n(λ − 1) log( ( ∏_{i=1}^n y_i )^{1/n} ).
In the above examples we chose a rough value for λ by looking at a plot of L(λ)
against λ. However, in deciding on a suitable value for λ in the Box-Cox procedure
it is helpful to have some measure of the uncertainty about the maximizer λ̂ of L(λ). It can be shown that an approximate 100(1 − α) percent confidence interval for λ based on λ̂ is given by

{λ : L(λ̂) − L(λ) ≤ (1/2) χ²_{1;α}}

where χ²_{1;α} is the upper 100α percentage point of a χ²_1 distribution.
Consider again the data on Romanesque churches. For this data set we said that
a Box-Cox transformation with a value of λ of approximately 0.5 seemed to be
indicated. To compute an approximate 95 percent confidence interval we need
the upper 5 percentage point of a χ²_1 distribution, which is approximately 3.8415.
In generating the graph of Figure 5.50 I computed L(λ) for values of λ 0.1 units
apart between −2.0 and 2.0. The values at λ = 0.4, 0.5, 0.6, 0.7 and 0.8 are re-
spectively −21.554, −19.212, −17.925, −18.165 and −20.038. If we assume that
the maximizer is exactly at 0.6 (we can compute the maximizer of L(λ) more pre-
cisely by using a finer grid) then the approximate 95 percent confidence interval
for λ includes values of λ for which L(λ) is within 0.5 × 3.8415 = 1.92075 of the
maximum value. Based on this guideline, the value of 0.5 lies in an approximate
95 percent confidence interval, and a square root transformation of the response
seems appropriate.
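The calculation just described is easy to program. The sketch below (Python with numpy and scipy) computes L(λ) on a grid and reads off the approximate 95 percent confidence interval using the 1.92 cut-off; the design matrix X and positive response y are simulated placeholders rather than the church data.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 25
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
y = (1.0 + 0.5 * x + rng.normal(0, 0.1, n)) ** 2   # hypothetical positive response

def profile_loglik(lam):
    # Box-Cox transform of the response for this value of lambda.
    w = np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1) / lam
    rss = np.sum((w - X @ np.linalg.lstsq(X, w, rcond=None)[0]) ** 2)
    # L(lambda) = -(n/2) log RSS(lambda) + n (lambda - 1) log((prod y_i)^(1/n))
    return -n / 2 * np.log(rss) + n * (lam - 1) * np.mean(np.log(y))

grid = np.arange(-2.0, 2.01, 0.1)
L = np.array([profile_loglik(lam) for lam in grid])
lam_hat = grid[np.argmax(L)]

# Approximate 95 percent interval: lambda values whose L(lambda) lies within
# 0.5 * chi2_{1;0.05} (about 1.92) of the maximum.
cutoff = 0.5 * chi2.ppf(0.95, df=1)
in_interval = grid[L >= L.max() - cutoff]
print(lam_hat, in_interval.min(), in_interval.max())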
For the surgical services data set above, we stated that the need for a transfor-
mation of the response did not seem to be indicated by the Box-Cox procedure.
Is this inference justified based on a confidence interval for λ? Again as a very
crude guideline we can consider values of λ on our grid in the plot which lie within
1.92075 of the maximum value: the value λ = 1.0 satisfies this criterion, and so
a transformation does not seem to be warranted.
have large t-values. It must be said that in many cases we may not wish to use
a transformation of the response at all, and transformations can be overused.
6 Categorical predictors
In this section we discuss model building in multiple linear regression when some
of our predictors are categorical variables. This discussion will lead on to consid-
eration of the less than full rank linear model in the next section.
As an example of a categorical predictor in multiple linear regression consider
the data set on inflation rates and central bank independence that we discussed
in previous lectures and in tutorial five. For this data set we had for a number
of countries the average annual inflation rate over 1980 − 1989, as well as two
measures of central bank independence (a questionnaire measure of independence
and a legal measure of independence). In addition to these two quantitative
predictor variables, there was a predictor which took the value 1 for developed
economies and 0 for developing economies. This last predictor variable is an
example of a categorical predictor: the values 1 and 0 taken by this predictor are
just arbitrary labels, numeric values that have no particular meaning except to
distinguish two different groups within the observations. Instead of the values
1 and 0 we could equally well have used the labels A and B to distinguish the
groups. We wish to develop methods for handling categorical variables like this
one within the framework of the linear model.
To take another example, suppose a new drug has been developed for the
treatment of high blood pressure. A study was undertaken with a group of
patients having high blood pressure into the effectiveness of the new drug. Each
patient was assigned at random to either a treatment group (patients in this
group receive the new drug) or a control group (patients in this group receive
the standard treatment for high blood pressure). After a month we measure the
change in the blood pressure of each patient compared to their blood pressure at
the beginning of the trial. We are interested in investigating the effect of the drug
on the response (change in blood pressure), perhaps adjusting for the effect of
other quantitative characteristics of the patients (such as age for instance). So for
each patient we have the response (change in blood pressure) some quantitative
predictors (such as age) as well as a predictor which records whether the patient
was in the treatment or control group. This last predictor is an example of a
categorical predictor variable with two levels: the two levels are “treatment” and
“control”.
Suppose we have a quantitative predictor x and a categorical predictor z with two levels, A and B, and define a dummy variable w which takes the value 1 for observations in group A and 0 for observations in group B. Consider the model

y_i = β0 + β1 x_i + β2 w_i + ε_i.

For an observation in group A (w_i = 1) this reduces to

y_i = β0 + β2 + β1 x_i + ε_i,

while for an observation in group B (w_i = 0) we have

y_i = β0 + β1 x_i + ε_i.
So the effect of introducing the dummy variable w into the regression is to have
a shift in the intercept for the regression of y on x as we move between the
two groups defined by z. This may be a sensible model. We can interpret the
parameter β2 as being the effect of being in group A compared to being in group
B, and if we want to see if there is any difference between the two groups (for
instance, if we want to see in our blood pressure example whether the new drug
is more effective than the standard treatment after adjusting for the quantitative
predictor age) then we can simply look at the partial t statistic for β2 .
The appropriateness of the above model involving the dummy variable w
depends on the assumption that the slope of the regression relationship between
y and x does not change for the groups A and B. The effect due to the group
(A or B) does not depend on the level of the quantitative variable x and we say
in this situation that there is no interaction between x and z. Effectively, the
relationship between the expected value of y and x for the two groups is described
by a pair of parallel lines, with one line for each group.
We can of course envisage a more general model in which the mean response
varies linearly with x in the two groups, but with a different slope and intercept
for the line in each group (that is, the relationship between the mean of y and x is
described by a pair of lines which may not be parallel). Consider defining a new
predictor variable as the product of the dummy variable w and the quantitative
predictor x. Now suppose we fit the model
yi = β0 + β1 xi + β2 wi + β3 xi wi + εi .
What happens in this situation for observations from the two groups A and B?
If wi = 1 (observation belongs to group A) then we have that
y_i = β0 + β2 + (β1 + β3) x_i + ε_i,

while if w_i = 0 (observation belongs to group B) we have

y_i = β0 + β1 x_i + ε_i.
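A sketch of how the two models above can be fitted by ordinary least squares is given below (Python with numpy). The data values are hypothetical; the point is only that the two fitted lines for groups A and B are read off from the coefficients as described.

import numpy as np

# Hypothetical data: a quantitative predictor x, a two-level group label, and y.
x = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
y = np.array([3.1, 5.2, 6.8, 9.1, 1.4, 2.3, 3.6, 4.4])

w = (group == "A").astype(float)              # dummy variable: 1 for group A, 0 for B
X = np.column_stack([np.ones_like(x), x, w, x * w])

b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

print("group B line: intercept", b0, "slope", b1)
print("group A line: intercept", b0 + b2, "slope", b1 + b3)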
Analysis of Variance
Source DF SS MS F P
Regression 2 7345.2 3672.6 7.16 0.005
Residual Error 19 9750.0 513.2
Total 21 17095.3
Source DF Seq SS
QUES 1 5533.4
DEV 1 1811.8
Assuming that this model is adequate, we can test for whether being a developed
economy is related to inflation after adjusting for QUES by looking at the partial
t-statistic for DEV and its associated p-value. The conclusion of the test for
H0 : β2 = 0 against H1 : β2 ̸= 0 here is uncertain: we would accept H0 at the 5
percent level, but reject at the 10 percent level.
Now consider the model where we have a separate linear relationship between
expected inflation rate and QUES within each DEV group. We fit a multiple linear
regression model with inflation rate as the response and QUES, DEV and QUES*DEV
as predictors.
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Source DF Seq SS
QUES 1 5533.4
DEV 1 1811.8
QUES*DEV 1 397.7
Assuming this model is an appropriate one, we can test for the presence of in-
teraction between QUES and DEV by looking at the partial t statistic and p-value
for the predictor QUES*DEV. The p-value here for testing H0 : β3 = 0 against the
alternative H1 : β3 ̸= 0 is 0.393, which would indicate acceptance of H0 at the 5
percent level. That is, there seems to be no real evidence of any interaction.
We can test whether the level of DEV seems to have any relationship to inflation rate in the presence of QUES by testing

H0 : β2 = β3 = 0

with an F test based on the sequential sums of squares for DEV and QUES*DEV.
The following example is from Myers, “Classical and Modern Regression with
Applications (Second Edition)”, Duxbury, Belmont, California, 1990, p. 154. In
a project to study age and growth characteristics of selected mussel species from
Southwest Virginia, the data below were taken from two distinct locations. It
was desired to investigate whether location was a significant factor in the growth
of the mussels.
Age Weight   Age Weight
  3   0.44    11   3.96
3 0.50 11 3.84
3 0.66 12 5.58
3 0.78 12 5.64
4 1.20 12 4.26
4 1.18 13 6.00
4 1.08 13 2.54
6 1.12 13 3.82
6 1.72 14 4.50
7 1.04 14 5.18
8 2.62 14 4.04
9 1.88 15 6.38
10 2.26 15 4.08
11 4.10 16 4.56
11 2.12
We can define a dummy variable Location which is 0 for location 1 and 1 for
location 2. If we do a multiple linear regression of Weight on Age, Location and
Age*Location, then we obtain the following.
Regression Analysis
Age Weight   Age Weight
  3   0.76     8   2.52
4 1.38 8 3.90
5 1.20 10 3.94
5 1.76 10 6.22
6 2.60 10 4.96
6 2.16 13 9.02
6 2.64 13 8.20
6 2.52 13 8.26
6 3.08 14 6.40
6 2.12 15 10.06
7 2.72 15 8.60
7 2.96 18 11.06
8 4.54 19 10.78
8 5.26 22 12.04
8 5.60 24 13.92
Analysis of Variance
Source DF SS MS F P
Regression 3 536.45 178.82 226.09 0.000
Residual Error 61 48.25 0.79
Total 64 584.70
Source DF Seq SS
Age 1 428.11
Location 1 75.57
Age*Loca 1 32.77
We can do an F test for the significance of the Location term and its interaction
with Age. The value of the appropriate test statistic for the F test is
((75.57 + 32.77)/2) / 0.79 = 68.57.
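The arithmetic for this F statistic, together with its p-value, can be checked with a few lines of Python (the numbers are taken from the output above; scipy supplies the F distribution).

from scipy.stats import f

seq_ss = 75.57 + 32.77     # sequential SS for Location and Age*Location
df_num = 2                 # two terms are being tested
ms_res = 0.79              # residual mean square from the ANOVA table
df_res = 61                # residual degrees of freedom

F = (seq_ss / df_num) / ms_res
print(F, f.sf(F, df_num, df_res))   # F is about 68.6; the p-value is essentially zero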
We now discuss an example on real estate prices. For this example the response y
was the selling price of fifteen different houses in a certain area, and it was
desired to predict selling price based on the size of the house and the method
used for heating the house. The method of heating is a categorical predictor,
with values “electricity”, “heating oil” and “natural gas”. The data are shown in
Table 6.4. In the case of a categorical predictor variable taking on two values,
our approach was to define a binary dummy variable which was then introduced
as a quantitative predictor into the model. Using this dummy variable (and the
products of this dummy variable with any quantitative predictors) we were able
to test for the presence of an effect due to the categorical predictor, and to test
for interaction between the categorical predictor and any quantitative predictors.
A similar approach can be followed when dealing with a categorical variable with
more than two levels. For the case of the real estate prices data, and denoting
the selling price by y, the size of the house by x and the categorical predictor
heating type by z, we define binary dummy variables w1 and w2 by
w1 = 1 if z takes the value “electricity”, and w1 = 0 otherwise,

and

w2 = 1 if z takes the value “heating oil”, and w2 = 0 otherwise.
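Constructing these dummy variables from the recorded heating types is straightforward; a small Python sketch is given below (the vector of labels shown is hypothetical).

import numpy as np

heating = np.array(["electricity", "heating oil", "natural gas",
                    "electricity", "natural gas"])   # hypothetical labels

w1 = (heating == "electricity").astype(float)
w2 = (heating == "heating oil").astype(float)

# Natural gas acts as the reference level: both dummies are zero for it.
print(np.column_stack([w1, w2]))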
Again we can ask ourselves: what does the model

y_i = β0 + β1 x_i + β2 w_{i1} + β3 w_{i2} + β4 x_i w_{i1} + β5 x_i w_{i2} + ε_i

reduce to for houses heated by electricity, heating oil and natural gas? For electricity, w_{i1} = 1 and w_{i2} = 0, and the mean response is

y_i = β0 + β2 + (β1 + β4) x_i.

For heating oil, w_{i1} = 0 and w_{i2} = 1, and we have

y_i = β0 + β3 + (β1 + β5) x_i,

while for natural gas, w_{i1} = w_{i2} = 0, and

y_i = β0 + β1 x_i.

To test whether heating type has any effect at all (either on the intercept or on the slope) we can test

H0 : β2 = β3 = β4 = β5 = 0

with an F test.
We begin by doing a scatter plot of the selling price against size, using different
plotting symbols for the different heating type groups. This plot is shown in
Figure 6.54.

Figure 6.54: Scatter plot of selling price against size for real estate data using different plotting symbols for different heating type groups.

Of course, there are only a few observations in each group here but on the basis of this plot we might believe that there is no interaction between
heating type and size (the relationship between expected selling price and size
for the three groups seems to be well described by three parallel lines). We
can test this by fitting a linear model with the dummy variables defined above as
predictors, and with the products of these dummy variables and size as predictors.
We can then do an F test in the way outlined above to test for interaction. The
computer output is shown below. The variable dummy1 is w1 and dummy2 is w2 .
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 5 5733.3 1146.7 20.77 0.000
Residual Error 9 497.0 55.2
Total 14 6230.2
Source DF Seq SS
Size 1 4034.4
Dummy1 1 1.7
Dummy2 1 1694.0
Size*Dum 1 1.1
Size*Dum 1 2.0
From the table of sequential sums of squares and the analysis of variance table,
the appropriate test statistic for testing for interaction is
((1.1 + 2.0)/2) / 55.2 = 0.028.
We compare this with the upper 5 percentage point of an F distribution with 2
and 9 degrees of freedom (approximately 4.26). We see that there is no evidence
for interaction here.
We can refit the model with no interaction (that is, fit the additive model).
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 3 5730.2 1910.1 42.02 0.000
Residual Error 11 500.0 45.5
Total 14 6230.2
Source DF Seq SS
Size 1 4034.4
Dummy1 1 1.7
Dummy2 1 1694.0
We demonstrate testing for an effect for heating type in this model. The ap-
propriate F statistic (from the sequential sums of squares and the analysis of
variance table) is
((1.7 + 1694.0)/2) / 45.5 = 18.63.
We compare this with the upper 5 percentage point of an F distribution with
2 and 11 degrees of freedom (approximately 3.98). In this case we reject the
null hypothesis that there is no group effect. As mentioned above, the partial t
statistics can tell us something about differences between pairs of groups. The
partial t statistic for dummy1 tests for a significant difference between electricity
and natural gas, and the partial t statistic for dummy2 tests for a significant dif-
ference between heating oil and natural gas (from the p-values, both differences
are significant here at the 5 percent level).
We end our discussion of categorical variables here for the moment. We have
focused on the case of a single categorical variable and a single quantitative
predictor. Situations involving multiple quantitative and categorical predictor
variables may be handled in much the same way. We will say more about the
general case in the next section.
7 Less than full rank linear models
Consider an experiment in which eighteen subjects are assigned to one of three diets (A, B and C), with six subjects on each diet. Writing y_ij for the response of the jth subject on the ith diet, a natural model is

y_ij = µ + α_i + ε_ij

where ε_ij, i = 1, 2, 3, j = 1, ..., 6 are a collection of zero mean errors which are uncorrelated with a common variance. Notice that this is a linear model. Just
define a dummy variable x1 which is one for the diet A group and zero otherwise,
a dummy variable x2 which is one for the diet B group and zero otherwise, and a
dummy variable x3 which is one for the diet C group and zero otherwise. Then if
we do a multiple linear regression of the responses on the predictors x1 , x2 and x3
then we are simply fitting the model above. In this model, the coefficient for x1
is what we have written as α1 , the coefficient for x2 is α2 , the coefficient for x3 is
α3 and the intercept term is µ. The model we have considered actually involves
fitting a single categorical variable (diet group) and our discussion should remind
you of what we did in the last section.
However, in the last section we used k − 1 binary dummy variables to code
the information in a categorical variable with k levels. The reason for this is that
k binary variables are not required to code the information a k level categorical
variable: in the example above, if x1 and x2 are both zero (so that the subject is
not on diet A or B) then we know the subject is on diet C. If we looked at the
design matrix for fitting the model involving x1 , x2 , x3 and an intercept, we would
see that the sum of the columns for x1 , x2 and x3 is equal to the first column (a
column of ones). That is, x1 + x2 + x3 = 1. So the design matrix here is not of full
rank, and there is no unique solution to the normal equations. We can express
the nature of the problem in a different way. Suppose we define a parameter µ′ as
µ′ = µ + c where c is a constant, and suppose we define parameters αi′ , i = 1, 2, 3
as αi′ = αi − c. Then we see that µ + αi = µ′ + αi′ , and so these two sets of
parameters result in the same model for the data. So we should not expect to be
able to estimate all the parameters in the model (that is, we should not expect
to be able to solve the normal equations). We have too many parameters in the
model.
One solution to this problem is the one presented in the last section: just
remove one of the binary dummy variables. Effectively, we set one of the pa-
rameters αi to be zero. The reduced design matrix when one of the predictors is
removed now has full rank, and analysis can proceed as before. This approach to
dealing with the less than full rank model is called reparametrization.
However, reparameterization need not be used in dealing with a less than full
rank linear model. We can in fact work with the original model, provided we only
ask sensible questions about the parameters. We may not be able to estimate all
the parameters, but we might be able to estimate functions of the parameters:
for instance, in our example we could still estimate differences between the effects
for different diets (that is, we can estimate quantities like α1 − α2 for instance).
It is also possible to formulate sensible hypotheses to be tested. There will be
some discussion of the general approach to handling the less than full rank linear
model in the coming weeks.
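The rank deficiency in the diet example is easy to verify numerically; the following Python sketch builds the design matrix with an intercept and all three diet dummies and checks its rank.

import numpy as np

n_per_group = 6
x1 = np.repeat([1.0, 0.0, 0.0], n_per_group)   # dummy for diet A
x2 = np.repeat([0.0, 1.0, 0.0], n_per_group)   # dummy for diet B
x3 = np.repeat([0.0, 0.0, 1.0], n_per_group)   # dummy for diet C
X = np.column_stack([np.ones(3 * n_per_group), x1, x2, x3])

print(np.linalg.matrix_rank(X))         # 3, not 4: x1 + x2 + x3 equals the column of ones
print(np.linalg.matrix_rank(X[:, :3]))  # 3: dropping one dummy gives a full rank design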
Suppose we have independent random samples from k populations, with a sample of size n_i from the ith population, i = 1, ..., k, and write y_ij for the jth observation from the ith population. In the one way classification model with fixed effects it is assumed that
yij = µ + αi + εij
where µ and αi , i = 1, ..., k are unknown parameters, and the errors εij have mean
zero and are uncorrelated with a common variance σ 2 . Normality of the errors
needs to be assumed for hypothesis testing and interval estimation. The param-
eter µ is a global mean parameter, and the parameters αi represent deviations
from this mean for the k populations.
It is usually of interest in the one way classification model to test whether all
the population means are the same, and to test for differences between pairs of
population means. We will describe how to conduct these tests and describe the
form of the ANOVA table for a one way classification model in the remainder of
this subsection.
It should be evident from the discussion of last week that the one way classi-
fication model above is a less than full rank linear model. An example will help
to show how we can write the one way classification model in the form of the
general linear model. Suppose there are two populations, with a random sample
of size two taken from each population. Now let y = (y_11, y_12, y_21, y_22)⊤, β = (µ, α1, α2)⊤ and

X = [ 1 1 0
      1 1 0
      1 0 1
      1 0 1 ],

so that y = Xβ + ε is exactly the one way classification model for this example.
We can write the general one way classification model in a similar way.
general case the vector β will be
β = (µ, α1 , ..., αk )⊤
and the design matrix X will have k + 1 columns, with the first column being a
column of ones, and the (i + 1)th column consisting of entries which are one for
observations in the ith population and zero otherwise. Observe that the design
matrix X does not have full rank, since the sum of the last k columns of X is
equal to the first column (see the above example for instance). Since X does not
have full rank, the normal equations do not have a unique solution, and so there is
no unique estimate of β. One approach to fixing this problem is to reparametrize
the original model: we reparametrize so that for the new model the design matrix
does have full rank, and then the parameters can be estimated.
There are in general many possible ways that we could reparametrize the one
way classification model with fixed effects. If the mean of one of the populations
can be considered to establish some reference level (without loss of generality say
the reference population is the first one) then we could set α1 = 0 (this corre-
sponds to deleting the second column of the original design matrix X). Then the
parameter µ is the mean of the reference population, and the parameters α2 , ..., αk
represent deviations from this mean for the remaining populations. This way of
reparametrizing the model is often very natural in, say, medical trials where we
are interested in comparing different methods of treatment with a control group.
The partial t-tests in the regression output for the parameters αi i = 2, ..., k
in this reparametrized model test for a significant difference between the mean of
population i and the mean of the first population (reference population). A test
for equality of all population means amounts to testing
H0 : α2 = ... = αk = 0
This hypothesis is tested using the usual F test. We will discuss another way
of parametrizing the one way classification model in the next lecture. In setting
α1 = 0 in the discussion above (removing a column of the design matrix) we
have effectively added an additional equation to the normal equations for the
less than full rank model, allowing a unique solution. Other arbitrary constraints
could have been made on the original parameters: for instance, many statistical packages introduce the constraint Σ_{i=1}^k α_i = 0 on the parameters in the original model. This will also allow a unique solution of the normal equations, where the effects α_i are required to be “centred” about the grand mean µ.
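The different reparametrizations describe the same model and in particular give the same fitted values. The following Python sketch (with hypothetical data from three groups) fits the one way model under the reference-level coding and under the sum-to-zero coding and confirms that the fitted group means agree.

import numpy as np

y = np.array([5.1, 4.8, 5.5,  6.2, 6.0, 6.7,  4.1, 3.9, 4.4])   # hypothetical
group = np.repeat([0, 1, 2], 3)
n = len(y)

# Reference coding: set alpha_1 = 0, keep dummies for groups 2 and 3.
X_ref = np.column_stack([np.ones(n),
                         (group == 1).astype(float),
                         (group == 2).astype(float)])

# Sum-to-zero coding: the effect for the last group is minus the sum of the others.
e1 = np.where(group == 0, 1.0, np.where(group == 2, -1.0, 0.0))
e2 = np.where(group == 1, 1.0, np.where(group == 2, -1.0, 0.0))
X_sum = np.column_stack([np.ones(n), e1, e2])

fit_ref = X_ref @ np.linalg.lstsq(X_ref, y, rcond=None)[0]
fit_sum = X_sum @ np.linalg.lstsq(X_sum, y, rcond=None)[0]
print(np.allclose(fit_ref, fit_sum))    # True: both codings give the group means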
This example is from Keller, Warrack and Bartel, “Statistics for Management
and Economics (Second Edition),” Wadsworth, Belmont, California 1990.
Three new formats have been developed recently for tax return forms (formats
A, B and C say). To determine which, if any, are superior to the current form,
20 individuals were asked to participate in an experiment. Each of the three
new forms and the old form were filled out by 5 different people. The amount of
time taken by each person to complete the task is recorded in the accompanying
table. At the 10 percent significance level, can we conclude that differences in
the completion times exist among the four forms?

Table 7.5: Completion times for tax returns: old format, and three new formats (A, B and C).

Three binary dummy variables were created (Dummy1, Dummy2 and Dummy3). The first dummy variable is one for
format A and zero otherwise, the second dummy variable is one for format B
and zero otherwise, and the third dummy variable is one for format C and zero
otherwise. Thus we have parametrized a one way classification model by the
method discussed above, with the subjects filling out the current form acting as
a control group. Fitting a multiple linear regression model with these dummy
variables as predictors and the completion time (y) as the response gave the
following output.
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 3 26.573 8.858 6.06 0.006
Residual Error 16 23.404 1.463
Total 19 49.977
Source DF Seq SS
Dummy1 1 0.204
Dummy2 1 26.320
Dummy3 1 0.049
The F statistic in the ANOVA table is a realization of an F3,16 random variable
under the null hypothesis that
H0 : α2 = α3 = α4 = 0
(that is, under the hypothesis that there is no difference in completion time among
the four groups). The p-value for this F test is 0.006, so that we reject the null
hypothesis that there is no difference among means in favour of the hypothesis
that at least two of the population means differ. The partial t statistics in the
above regression output for α2 , α3 and α4 relate to a test for a significant differ-
ence between mean completion time for the current form and mean completion
time for formats A, B and C respectively. We see here that there seems to be a
significant difference between mean completion time for the current form and for
format B (the current form is better).
It is interesting to have a closer look at the analysis of variance table for the
one way classification model. For the multiple linear regression model, if we write
y_i, i = 1, ..., n for the responses, ŷ_i, i = 1, ..., n for the fitted values and ȳ for
mean of the responses, then we know that
SStotal = SSreg + SSres
where

SS_total = Σ_{i=1}^n (y_i − ȳ)²,   SS_reg = Σ_{i=1}^n (ŷ_i − ȳ)²   and   SS_res = Σ_{i=1}^n (y_i − ŷ_i)²
(this was stated in lectures and proved by the MATH2931 students on their third
assignment). We can rewrite this analysis of variance identity in an interest-
ing way for the one way classification model. However, first we introduce some
notation. Write ȳi. for the sample mean of observations in the ith population,
ȳ_i. = (1/n_i) Σ_{j=1}^{n_i} y_ij.

Also, write ȳ.. for the sample mean of all response values,

ȳ.. = (1/n) Σ_{i=1}^k Σ_{j=1}^{n_i} y_ij.
Thus the dots in the subscripts above mean that summation is taken with respect
to that index in forming the mean.
It is easy to show that the fitted value for an observation from population i
is ȳi. , and the analysis of variance identity becomes
Σ_{i=1}^k Σ_{j=1}^{n_i} (y_ij − ȳ..)² = Σ_{i=1}^k n_i (ȳ_i. − ȳ..)² + Σ_{i=1}^k Σ_{j=1}^{n_i} (y_ij − ȳ_i.)².
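The identity is easy to check numerically. The sketch below (Python with numpy) uses a small hypothetical data set with unequal group sizes and verifies that the between-group and within-group sums of squares add to the total sum of squares.

import numpy as np

groups = [np.array([3.1, 2.9, 3.5]),
          np.array([4.0, 4.4, 3.8, 4.1]),
          np.array([2.2, 2.6])]          # hypothetical samples from k = 3 populations

y_all = np.concatenate(groups)
grand_mean = y_all.mean()

ss_total = np.sum((y_all - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print(np.isclose(ss_total, ss_between + ss_within))   # True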
Suppose in the previous example that it was of particular interest to test for a
difference between formats B and C. We wish to test
H0 : α3 = α4 against H1 : α3 ≠ α4, using a t statistic based on the difference of the corresponding estimates of α3 and α4.
For a two-sided test at the 5 percent level, we compare this with the upper 2.5
percentage point of a t distribution with 16 degrees of freedom (which is approx-
imately 2.12). So we would reject the null hypothesis that there is no difference
in completion times for formats B and C.
In the last lecture we discussed the one way classification model. The one way
classification model is of interest when we have random samples from k different
populations, and are interested in comparing population means. With a random
sample of size ni from the ith population, i = 1, ..., k, we write the one way
classification model as
yij = µ + αi + εij
where yij is the jth observation from the ith population, i = 1, ..., k, j = 1, ..., ni ,
µ is a global mean, αi is the deviation from the global mean for population i,
i = 1, .., k and the terms εij are random errors. We showed that the one way clas-
sification model was an example of a less than full rank linear model, and in order
to apply techniques from the full rank case we discussed ways of reparametrizing
to full rank. In particular, we considered a situation where one of the k popula-
tions was considered a reference population (without loss of generality say it is the
first population): it was then natural to set the deviation from the global mean
for this population to zero (α1 = 0 say). In the resulting reparametrized model µ
can be considered to be the mean of the reference population, and the parameters
α2 , ..., αk describe deviations from the mean of the reference population for the
other populations of interest in the experiment.
In this lecture we discuss an alternative parametrization of the one way clas-
sification model, and discuss hypothesis testing for this parametrization.
Instead of defining a reference population as in the previous lecture, an alter-
native way of parametrizing the one way classification model is to define
µ_i = µ + α_i
(so that µi is the population mean for the ith population) and to write
yij = µi + εij .
We have thus reduced the number of parameters in the mean structure from the
k+1 parameters µ, α1 , ..., αk to the k parameters µ1 , ..., µk . The new parametriza-
tion gives a full rank linear model (you might like to check this for yourself). The
parametrization above is attractive, since µi has a direct interpretation (the pop-
ulation mean for population i).
As for the parametrization considered in the last lecture, we are interested in
testing whether all population means are equal, and in looking at differences in
population means. To test equality of the population means we must test
H0 : µ1 = ... = µk
against the alternative
H1 : Not all µ1 , ..., µk are equal.
This test does not quite fit into the hypothesis testing framework we have dis-
cussed previously for the full rank linear model: testing µ1 = ... = µk does not
correspond to testing whether a subset of µ1 , ..., µk are zero. However, hypothe-
ses like the one above can be tested within the framework of the general linear
hypothesis, which we now discuss.
Recall that for the full rank linear model we discussed testing hypotheses of the form

H0 : β_{r+1} = ... = β_k = 0,

that is, hypotheses which state that a subset (β_{r+1}, ..., β_k)⊤ of the coefficients is zero. The hypothesis

H0 : µ1 = ... = µ_k

in the parametrization we have introduced for the one way classification model is not of this form. However, we can rewrite the above hypothesis as

H0 : µ1 − µ2 = 0, µ1 − µ3 = 0, ..., µ1 − µ_k = 0,

that is, as a set of m = k − 1 linear restrictions on the parameters. A general linear hypothesis specifies m linearly independent linear restrictions of this kind on the parameter vector. It can be shown that the F statistic comparing the residual sum of squares for the model fitted under the restrictions with the residual sum of squares for the full model,

F = ((RSS_reduced − RSS_full)/m) / (RSS_full/(n − p)),

has an F_{m,n−p} distribution under H0. For testing µ1 = ... = µ_k for the one way classification model, this statistic is computed and displayed in the analysis of variance table.
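A sketch of this F test, computed directly from the residual sums of squares of the full model and the model fitted under H0, is given below in Python (the group samples are hypothetical).

import numpy as np
from scipy.stats import f

groups = [np.array([23.1, 20.4, 25.0, 22.5]),
          np.array([27.3, 29.0, 26.1, 28.4]),
          np.array([24.2, 23.7, 25.9, 22.8])]   # hypothetical samples, k = 3

y = np.concatenate(groups)
n, k = len(y), len(groups)

rss_full = sum(np.sum((g - g.mean()) ** 2) for g in groups)   # fitted values are group means
rss_reduced = np.sum((y - y.mean()) ** 2)                     # one common mean under H0

m = k - 1                                                     # number of restrictions
F = ((rss_reduced - rss_full) / m) / (rss_full / (n - k))
print(F, f.sf(F, m, n - k))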
Example: tax return forms.
We continue our discussion of the data on completion times for different formats
of tax return forms introduced in the last lecture. Recall that this data set was
collected by dividing a group of twenty subjects into four groups, and giving
each subject in the first group a tax return form in the format currently used,
and members of the remaining groups forms in three new formats (A, B and C).
The completion time for each of the subjects was recorded. This is a one way
classification model with four groups. We showed in the last lecture that we could
test if the mean completion time was the same for all four groups by introducing
appropriate binary variables to fit the model where we adopt a parametrization
with the mean completion time for the current form as a reference level.
          1                   2
2   (-2.862,  0.382)
3   (-4.502, -1.258)   (-3.262, -0.018)
It can be shown that (ȳ_i. − ȳ_j. − (µ_i − µ_j)) / (s √(1/n_i + 1/n_j)), where s is the pooled estimate of σ, has a t_{n−k} distribution. This result is the basis for computing the intervals in the table.
Accounting  Marketing  Finance
    27          23        48
    22          36        35
    33          27        46
    25          44        36
    38          39        28
    29          32        29
We are interested in making comparisons between mean income levels for students
majoring in different disciplines. One way to test for any differences between
income levels is to analyze the data as a one way classification model with fixed
effects. We have seen how to do this in previous lectures. Fitting a one way
classification model results in the following output.
Analysis of Variance
Source DF SS MS F P
Factor 2 193.0 96.5 1.77 0.205
Error 15 819.5 54.6
Total 17 1012.5
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ---+---------+---------+---------+---
Accounti 6 29.000 5.762 (---------*----------)
Marketin 6 33.500 7.765 (----------*----------)
Finance 6 37.000 8.390 (----------*---------)
---+---------+---------+---------+---
Pooled StDev = 7.391 24.0 30.0 36.0 42.0
From the p-value in the analysis of variance table, no significant difference be-
tween mean annual income levels seems to be indicated. We can see that only a
small fraction of total variation is explained by the model here. With a sample
of size six from each group, it will be difficult to detect differences in population
means which are small compared to the residual standard deviation (which is
about 7000 dollars). Now, suppose it was thought that annual income following
graduation was dependent on performance at university. We can classify the
members of each of the populations of interest (accounting majors, marketing
majors and finance majors) according to their average grade (A+, A, B+, B, C+
or C) during the course of their studies. A new experiment is conducted in which
we again take a sample of size six from the three populations, but in each of the
three groups of six students we take one student with an A+ average, one with
an A average, one with a B+ average, one with a B average, one with a C+
average and one with a C average: we say that students in different grade levels
form different blocks in this new experiment. So we have three populations, and
six blocks, and one observation for each population/block combination (eighteen
observations in all). The table below shows the data collected for this new exper-
iment. The idea of introducing the blocking variable here (average grade) is to
reduce residual variation to allow more precise estimates of differences between
population means. A key assumption here is that there is no interaction between
the blocking variable, average grade, and the major of the student: it is assumed
that the difference between mean salary for, say, an accounting and marketing
graduate with the same average grade does not depend on whether that grade is
an A+, a C or anything else. We allow the grade to have an additive effect on
mean annual income, but the amount of the effect must not depend on what the
student studied.
For notational convenience we consider first the situation where there is pre-
cisely one observation for each population/block combination. We write yij for
Grade   Accounting  Marketing  Finance
 A+         41          45        51
 A          36          38        45
 B+         27          33        31
 B          32          29        35
 C+         26          31        32
 C          23          25        27
the observation for population i, i = 1, ..., k, and block j, j = 1, ..., b. The model
we are considering in a randomized complete block experiment is
yij = µ + αi + βj + εij .
As in the one way classification model, we can partition total variation into a component of variation for the popula-
tion mean differences (sometimes called the sum of squares due to treatments),
a component for block differences (sum of squares due to blocks) and residual
variation. The sum of squares due to treatments is simply the regression sum of
squares you would get if you ignored blocking and used a one way classification
model for the samples from the different populations. The sum of squares due
to blocks is simply the regression sum of squares you would get if you ignored
population mean differences and fitted a one way classification model treating
the different blocks as the populations of interest. It is a special feature of the
randomized complete block design where you have equal numbers of observations
for each population/block combination that these two sums of squares add to the
regression sum of squares you get for fitting the full model.
The analysis of the income levels for business students, treating the data as a randomized complete block experiment, is summarized below.
Our treatment sum of squares here is 108.44, block sum of squares is 854.94,
and residual sum of squares is 52.22. The estimated residual variance is 5.22,
and we see that blocking has considerably reduced the proportion of total varia-
tion unexplained by the model: this may allow us to detect differences between
population means which we could not have detected previously for the same ex-
penditure of resources. The F statistic for testing equality of mean income for
the different majors is 10.38 here (the mean square values in the ‘MS’ column
are the sums of squares divided by the degrees of freedom, and the F statistic
10.38 is the mean square due to treatments divided by the mean square error:
we look at whether the variation between different major types is large com-
pared to unexplained variation). The p-value for this F -test is 0.004, indicating
a significant difference between means for different majors at the 5 percent level.
The p-value for the blocking variable should be ignored here, although the F
statistic for blocking does give us a rough indication of whether blocking was
effective in reducing residual variation to allow more precise comparisons among
the populations of interest.
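The sums of squares and the F statistic quoted above can be reproduced from the table of incomes: the sketch below (Python with numpy and scipy) takes the rows of the table as the six grade blocks and the columns as the three majors in the order shown.

import numpy as np
from scipy.stats import f

y = np.array([[41, 45, 51],
              [36, 38, 45],
              [27, 33, 31],
              [32, 29, 35],
              [26, 31, 32],
              [23, 25, 27]], dtype=float)   # rows: grade blocks, columns: majors

b, k = y.shape                              # b = 6 blocks, k = 3 treatments
grand = y.mean()

ss_treat = b * np.sum((y.mean(axis=0) - grand) ** 2)     # about 108.44
ss_block = k * np.sum((y.mean(axis=1) - grand) ** 2)     # about 854.94
ss_res = np.sum((y - grand) ** 2) - ss_treat - ss_block  # about 52.22

F = (ss_treat / (k - 1)) / (ss_res / ((k - 1) * (b - 1)))
print(ss_treat, ss_block, ss_res, F, f.sf(F, k - 1, (k - 1) * (b - 1)))   # F is about 10.38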
The data described here are given by Blackman and Catalina (1993) in the ar-
ticle “The Moon and the Emergency Room,” Perceptual and Motor Skills, 37,
pp. 624–626. The admission rate (in patients per day) to the emergency room
of a Virginia mental health clinic was measured before, during and after the 12
full moons from August 1971 to July 1972. This is a factorial experiment: for
each level of the factor month (twelve levels) and each level of the factor moon
(three levels, before, during or after the full moon) we have an observation of the
response.
The data described here were collected by Stewart Fischer and David Tippetts,
statistics students at the Queensland University of Technology in a subject taught
by Dr Margaret Mackisack.
In the experiment conducted by these students, the relationship between
weight, design and angle of release for a paper aeroplane and the distance trav-
elled by the aeroplane was investigated. Two different weights were considered
(factor Paper describing the weight of the sheet of paper used with levels 1=80
grams and 2=50 grams) as well as two different designs (1=sophisticated de-
sign, 2=simple design) and two different angles of release (1=horizontal, 2=45
degrees). For each distinct combination of levels of the three factors (2 × 2 × 2 = 8 different combinations) there are two observations of the response, giving 16 observations in all.
For a two factor experiment with a levels of factor A and b levels of factor B, write y_ijk for the kth observation at level i of factor A and level j of factor B. The model we consider is

y_ijk = µ + α_i + β_j + γ_ij + ε_ijk                  (41)

where µ is an overall mean, α_i is the effect for the ith level of factor A, β_j is the effect for the jth level of factor B, γ_ij is a term representing an interaction between the ith level of factor A and the jth level of factor B (more on this in a moment), and the ε_ijk are zero mean uncorrelated normal errors with constant variance.
Write µij = E(yijk ) for the mean of observations at level i of factor A and at
level j of factor B. The main effect at the ith level of factor A is defined to be
(1/b) Σ_{j′} µ_{ij′} − (1/(ab)) Σ_{i′,j′} µ_{i′j′}.
This is the difference between the average of means with factor A fixed at level i
and the average of all means.
The interaction between factor A at level i and factor B at level j is defined
to be

( µ_ij − (1/a) Σ_{i′} µ_{i′j} ) − ( (1/b) Σ_{j′} µ_{ij′} − (1/(ab)) Σ_{i′,j′} µ_{i′j′} ).
The first bracketed term above is the main effect you would get for factor A at
level i in an experiment with factor B fixed at level j. The second bracketed term
is the main effect for factor A at level i in the full experiment. If the interaction
is nonzero, it means that the effect of factor A at level i depends on the level j
of factor B. This is the meaning of interaction.
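For a small table of cell means the main effects and interactions defined above are easy to compute; the Python sketch below uses a hypothetical 2 × 2 table of means µ_ij.

import numpy as np

mu = np.array([[10.0, 12.0],
               [13.0, 18.0]])             # hypothetical 2 x 2 table of cell means mu_ij

overall = mu.mean()
main_A = mu.mean(axis=1) - overall        # main effect for each level of factor A
main_B = mu.mean(axis=0) - overall        # main effect for each level of factor B

# Interaction: cell mean minus row mean minus column mean plus overall mean,
# which is the definition above after rearranging the brackets.
interaction = (mu - mu.mean(axis=1, keepdims=True)
                  - mu.mean(axis=0, keepdims=True) + overall)
print(main_A, main_B)
print(interaction)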
When there is interaction between factors, it may be hard to interpret the
main effects (when the way that one factor affects the response depends on the
level of another factor, it may not be meaningful to talk about an average effect
for the factor). You may check that if all the terms γij are zero in the model (41)
then the interactions as we have defined them above are zero.
In experiments with more than two factors, higher order interactions (that is,
interactions between more than two factors) can be defined, but we leave this for
later statistics courses.
We illustrate the idea of interaction with the paper plane experiment.
For simplicity we analyse the paper plane experiment as a two factor experiment,
ignoring the angle of release factor. So we consider the two factor experiment
with factors Paper (weight of paper used in the plane) and Plane (design of
plane, sophisticated or simple).
Below is shown the so-called main effects plot for the paper plane experiment.
This plot shows response means for different levels of the factors. As we can see,
lighter planes seem to travel further and the planes with a sophisticated design
seem to travel further than planes with a simple design. The differences between
the points plotted on the graph and the horizontal lines on the graphs (the overall
mean of responses) give empirical estimates of the main effects as we’ve defined
them above.
Figure 7.56 shows a so-called interaction plot. Plotted on the y-axis are re-
sponse means, and plotted on the x-axis here are the levels of the factor Plane
(plane design, 1=sophisticated and 2=simple). The two lines (dashed and solid)
are for the light paper and heavy paper respectively. If there were no interaction,
then the lines in the plot would be parallel - for no interaction, the way that the
mean response should change as we change the level of Plane should not depend
on the level of Paper.
The model (41) is overparametrized: it is a less than full rank linear model where
the normal equations have no unique solution. We can impose some constraints
on the parameters, and adding these constraint equations to the normal equations
ensures that the normal equations have a unique solution.
The usual constraints are
Σ_i α_i = 0,    Σ_j β_j = 0,

Σ_{i=1}^a γ_ij = 0 for j = 1, ..., b,    and    Σ_{j=1}^b γ_ij = 0 for i = 1, ..., a.
For the data on emergency room admissions and phases of the moon there is only one observation for each month/moon combination, so we cannot estimate the interaction terms and we fit the additive model

y_ijk = µ + α_i + β_j + ε_ijk.
Figure 7.57: Main effects plot for data on insanity and phases of the moon.
Figure 7.58: Interaction plot for data on insanity and phases of the moon.
From the p-values in the table, it can be seen that there is a significant effect
for both month of year and phase of the moon - it does seem as though there is
strong evidence in the additive model that phase of the moon has an effect on
mental hospital admissions.
When we have more than one observation for each distinct combination of
levels of the factors in a factorial experiment, then we are able to estimate inter-
action terms in our model.
8 Logistic Regression
This course has been concerned with the general linear model, and we have seen
how flexible the linear model is for interpreting relationships in data. However,
the assumptions of the linear model can be restrictive in some situations. Much
of the theory of the linear model we have discussed has relied on the assumption
of normal errors: this often may not be reasonable, and if the assumption of
normality is to hold the responses must also be continuous. Constancy of variance
for the errors is another assumption of the linear model which is often violated.
In this subsection we discuss how we can do regression when the responses are
binary (zero or one). The model we will consider is perhaps the simplest example
of a generalized linear model. Generalized linear models are a flexible class of
models (including normal linear models as a special case) which are often used
for analyzing discrete and other kinds of non-normal data. Generalized linear
models will be discussed in much greater depth in some of your later statistics
courses. For now, we will consider only logistic regression models for describing
the dependence of a binary response on one or more predictors.
There are many situations in regression modelling where we are interested in
a binary response. For instance, for a number of banks over a period of time we
may have a collection of predictor variables describing the financial viability of
the banks as well as a response variable which is one if the bank fails over the
period of observation and zero otherwise. We are interested in estimating the
probability of failure based on the predictors describing the financial viability of
the bank.
The use of regression models for a binary response is also common in medical
statistics. For instance, we might be interested in estimating the probability that
someone will suffer a heart attack based on lifestyle factors. We could take a
random sample of subjects, measure predictors which are thought to relate to
heart attack risk (such as weight, blood pressure, cholesterol, etc.) and follow
these subjects over a period of time. We could then record a response variable
indicating whether the subject suffered a heart attack during the period of in-
vestigation. Again we have a binary response which we are interested in relating
to predictor variables: what is the probability of heart attack for someone with
given weight, blood pressure, cholesterol, etc.?
Clearly the linear model is inappropriate for a regression analysis when the
response is binary. Normality doesn’t hold, and the assumption of a constant
variance cannot hold in general either. One method of analysis appropriate for a
binary response variable is based on the logistic regression model.
As before we write y = (y1 , ..., yn ) for a vector of n responses, but where now
each value yi is either zero or one. Regarding each response value yi as random,
we write

p_i = Pr(y_i = 1),

so that p_i = E(y_i). In the linear model we assumed that the expected response
was linear in a vector of unknown parameters β of length p, say: that is,
E(y_i) = x_i⊤β,

where we have written x_i = (x_i1, ..., x_ip)⊤ for a vector of p predictors (and where x_i1 = 1). Can we assume the same thing here, that

p_i = x_i⊤β?

Clearly there are some problems with this, since p_i is a probability (and hence must lie between zero and one) but the term x_i⊤β can potentially be negative or
greater than one. One solution is to assume that some transformation of pi is
a linear function of the parameters β: we can use a transformation which maps
a number between zero and one (a probability) to the whole real line, and then
model the probability on this transformed scale.
One appropriate transformation of the probability pi might be
l(p_i) = log( p_i / (1 − p_i) ),

which maps a probability in (0, 1) to the whole real line. If we assume that l(p_i) = x_i⊤β and solve for p_i, we obtain

p_i = exp(x_i⊤β) / (1 + exp(x_i⊤β)).        (43)
If we assume that the responses yi are independent, the likelihood function is the
product of the probability functions for the observed responses yi . If yi = 1, the
contribution to the likelihood from the ith response is the probability that yi = 1,
or pi . If yi = 0, the contribution to the likelihood is the probability that yi = 0,
or 1 − pi . Noting that the function
p_i^{y_i} (1 − p_i)^{1−y_i}
takes the value pi when yi = 1 and the value 1 − pi when yi = 0, we can write
the likelihood function as
L(β) = ∏_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i}
(you might like to prove this as an exercise). Now, for the linear model the
equations we needed to solve to obtain maximum likelihood estimates of β (the
normal equations) were linear equations, and we know from linear algebra when
a unique solution will exist and how to find it. However, the equations which
need to be solved to find the maximum likelihood estimate of β in the case of
logistic regression are nonlinear equations, and in general an iterative technique
is needed to solve them. We do not discuss the algorithm used in this course.
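Although we do not discuss the fitting algorithm, the likelihood itself is simple to evaluate. The sketch below (Python with numpy) computes the log likelihood for a candidate β on a tiny hypothetical data set; a numerical optimizer (for example scipy.optimize.minimize applied to the negative log likelihood) could then be used to maximize it.

import numpy as np

X = np.array([[1.0, 0.2],
              [1.0, 1.1],
              [1.0, 2.4],
              [1.0, 3.0]])       # hypothetical design matrix, first column an intercept
y = np.array([0, 0, 1, 1])       # binary responses
beta = np.array([-2.0, 1.5])     # a candidate value of the parameter vector

p = 1.0 / (1.0 + np.exp(-X @ beta))      # p_i = exp(x_i' beta) / (1 + exp(x_i' beta))

# log L(beta) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_lik)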
So far we have discussed the use of logistic regression for modelling a binary
response. It is easy to extend the idea of logistic regression to deal with responses
which are binomially distributed. Recall that a binomial random variable can be
regarded as the number of successes in n independent trials with a probability p
of success on each trial (or the number of heads in n tosses of a coin where the
coin has probability p of a head on each flip). If we have responses yi which are
the number of successes in ni trials with probability pi of success on each trial,
and if we are interested in modelling the probability pi in terms of predictor vari-
ables (with ni known) then logistic regression can be applied: we just regard each
binomial response yi as being equivalent to ni binary responses, with yi of these
binary responses being one, and ni − yi of them being zero. We now illustrate
the fitting of logistic regression models in MINITAB.
This example concerns the effects of high voltage power lines on livestock. Seven cows were subjected to
six different shock intensities (0, 1, 2, 3, 4 and 5 milliamps). Each cow was given
thirty shocks, five at each intensity, in a random order. The experiment was
then repeated. For each shock, the response is mouth movement, which is either
present or absent. At each shock level, there are 70 observations of the response.
We ignore any effects due to the individual cows in our analysis (although we
could employ the idea of blocking here). We can model the number of positive
responses at each shock level by considering these to be binomial random variables
with n = 70 and a probability of success depending on the shock level.
Fitting the logistic regression model in MINITAB gives the following output.
Response Information
Log-Likelihood = -170.078
Test that all slopes are zero: G = 241.134, DF = 1, P-Value = 0.000
From the MINITAB output, we have that the fitted probability of mouth move-
ment at shock level i is
exp(−3.3010 + 1.2459 Current_i) / (1 + exp(−3.3010 + 1.2459 Current_i)).
Also listed in the output are estimated standard errors of the coefficients, and
p-values for testing the hypothesis that the various coefficients in the model are
zero (these are analogous to the partial t-tests we looked at in the linear model).
We will discuss in more depth what some of the output given here means in the
next lecture.
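As a check on the fitted model, the following short Python sketch evaluates the fitted probability of mouth movement at each of the six shock intensities using the coefficients reported in the output above.

import numpy as np

current = np.arange(0, 6)                      # shock intensities 0, 1, ..., 5 milliamps
eta = -3.3010 + 1.2459 * current               # fitted linear predictor
p_hat = np.exp(eta) / (1 + np.exp(eta))        # fitted probabilities of mouth movement
print(np.round(p_hat, 3))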
Consider a normal linear model with a separate mean parameter for each observation,

y_i = µ_i + ε_i,
where the terms εi are zero mean errors which are here independent and normal
with a common variance σ 2 . Now, of course the model we have just suggested is
probably not a very helpful one for answering the questions motivating any real
experiment, but the perfect fit model (usually called a saturated model) can help
us to assess the fit of models with fewer parameters.
What are the maximum likelihood estimators of the parameters µi in the
above model? If we write µ = (µ1 , ..., µn )T for the vector of the unknown mean
parameters, and if we assume that σ 2 is known, we can write the likelihood in
terms of the unknown parameters µ as
L(µ) = ∏_{i=1}^n (1/√(2πσ²)) exp( −(1/(2σ²)) (y_i − µ_i)² )
     = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (y_i − µ_i)² ).
Taking logarithms, the log likelihood is, up to an additive constant not depending on µ,

−(1/(2σ²)) Σ_{i=1}^n (y_i − µ_i)².
The above expression is clearly never positive, and the maximum value it achieves is zero when µ_i = y_i. So the maximum likelihood estimator of µ_i is µ̂_i = y_i. The maximum value attained by the likelihood is thus

L(µ̂) = (2πσ²)^{−n/2}.
Now consider instead the normal linear model

y_i = x_i⊤β + ε_i.        (45)

For fixed σ², the maximum likelihood estimator of β is the least squares estimator β̂, and the maximized likelihood is

L(β̂) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (y_i − x_i⊤β̂)² ).
Hence

−2 log( L(β̂) / L(µ̂) ) = (1/σ²) Σ_{i=1}^n (y_i − x_i⊤β̂)².
We say that

λ(β̂) = −2 log( L(β̂) / L(µ̂) )
is the scaled deviance of the model (45). The scaled deviance for a normal linear
model is simply the residual sum of squares for the model divided by σ 2 . The
deviance of the model (45) is the scaled deviance multiplied by the variance
parameter σ 2 : the deviance for a normal linear model is thus simply the residual
sum of squares (note: you will sometimes find different definitions of the deviance
in different textbooks).
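A quick numerical check of this statement, for simulated data and a known σ², is sketched below in Python: the scaled deviance computed from the two maximized log likelihoods equals the residual sum of squares divided by σ².

import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 20, 0.25
x = rng.uniform(0, 4, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.8 * x + rng.normal(0, np.sqrt(sigma2), n)   # simulated responses

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)

def log_lik(mu):
    # log of the normal likelihood with mean vector mu and known variance sigma2
    return -n / 2 * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

scaled_deviance = -2 * (log_lik(X @ beta_hat) - log_lik(y))   # saturated model: mu_hat = y
print(np.isclose(scaled_deviance, rss / sigma2))              # True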
We have rewritten our usual measure of lack of fit for a normal linear model,
the residual sum of squares, by considering the maximum value of the likelihood
for the model and comparing this to the maximum value of the likelihood for a
saturated model. Rewriting the residual sum of squares in this way enables us to
generalize to new situations. In particular, we can define the deviance for a logistic
regression model: we can consider a saturated model (with a parameter for each
observation) and define the scaled deviance as -2 times the log of the ratio of the
maximized likelihood for the model to the maximized likelihood for the saturated
model. In logistic regression the distributions of the responses are determined by
8 LOGISTIC REGRESSION 244
ber of parameters for the larger and smaller models (k − r here) to a percentage
point of a χ2k−r distribution.
This result may be used to test the hypothesis that β1 = ... = βk = 0 (that is,
that none of the predictors in a logistic regression model are helpful for explaining
variation in the response) against the alternative that not all these parameters
are zero. Also, of course we can compare the full model with a model in which
one predictor is excluded to get a test analogous to a two-sided partial t-test for
significance of coefficients in the normal linear model (there is also another way
of testing significance of individual coefficients in the logistic regression based
on large sample normality results for the maximum likelihood estimator). We
will discuss some of these tests when we look at the MINITAB output for some
examples during the lecture.