
CHAPTER 5

VARIABLE SELECTION AND MODEL BUILDING

5.1 MODEL-BUILDING PROCEDURE/PROBLEM


So far we have studied how to fit simple and multiple regression models, how to conduct inferences on these models, how to evaluate model performance, and how to carry out diagnostic checks on the error assumptions. The independent variables included in the model were assumed to be important and to be the only variables available. In addition, the focus was on techniques to ensure that the functional form of the model (Chapter 8) was correct and that the underlying assumptions were not violated.
In some applications, theoretical considerations (literature review) or prior
experience can be helpful in selecting the independent variables to be used in the
model. However, in most practical problems, the analyst has a pool of candidate
independent variables that should include all the influential factors, but the actual
subset of (the important or significant) variables that should be used in the model
needs to be determined. Finding an appropriate subset of variables for the model is
called the variable selection problem. Recall that the resulting regression model
should:
a) provide a good fit to the variation in the response variable (small MSE)
b) provide good predictions of the response variable (narrow prediction intervals)
c) have good estimates of the slope coefficients (significant coefficients)
Building a regression model that includes only a subset of the available independent variables involves two conflicting objectives. First, to make the model as realistic as possible, the analyst may wish to include as many explanatory variables as possible, so that the information content in these factors can better predict the value of the response variable, y. Second, the model should be as simple as possible, that is, it should contain as few explanatory variables as possible. A parsimonious model is preferable because the variance of the prediction ŷ increases with the number of explanatory variables, and the more variables there are in the model, the greater the costs of data collection and model maintenance.
The process of finding a model that is a compromise between these two objectives
is called selecting the “best” regression equation. Unfortunately, as we will see in
this chapter, there is no unique definition of “best”. Furthermore, there are several
algorithms that can be used for variable selection, and these procedures frequently
specify different subsets of the explanatory variables in the best regression model.
The problem of variable selection is addressed assuming that the correct functional form of the explanatory variables is known and that no outliers or influential observations are present in the data. Various statistical diagnostic tools for model adequacy, such as residual analysis and the detection of influential or leverage observations, are closely linked to the variable selection procedure. In fact, diagnostics on model adequacy and selection of the most appropriate variables should be done simultaneously, and these steps are usually employed iteratively. In the first step, a strategy for variable selection is chosen and a regression model is fitted with the selected variables. The fitted model is then examined for parameter significance, functional form, and influential observations. Based on the outcome, the model is re-examined and the selection of variables is reviewed. Several iterations may be required before a final, statistically adequate model is produced.
None of the variable selection procedures described in this chapter are guaranteed
to produce the best regression equation for a given data set. In fact, there usually is
not a single best equation but rather several equally good ones. Because variable
selection algorithms are heavily computer dependent, the analyst is sometimes
tempted to place too much reliance on the results of a particular procedure. Such
temptation is to be avoided. Experience, professional judgment in the subject
matter field, and subjective considerations all enter into the variable selection
problem. Variable selection procedures should be used by the analyst as methods
to explore the structure of the data.

5.2 MODEL SPECIFICATION


There are four possible outcomes when formulating a regression model for a particular set of data; the fitted model may contain the wrong or inappropriate variables. The outcomes are discussed below.
a) Regression model is correctly specified.
This occurs when the regression model contains all the relevant explanatory variables, including any necessary transformations and polynomial terms. That is, there are no missing, redundant, or unnecessary variables in the model. This is the best possible outcome and the one we hope to achieve. A correctly specified regression model yields unbiased regression coefficients and unbiased predictions of the response variable. In addition, the mean square error (MSE), which appears in every hypothesis test and confidence interval, is an unbiased estimate of the error variance, $\sigma^2$.

b) Regression model is underspecified


This occurs when the regression model is missing one or more important explanatory variables. This situation is perhaps the worst-case scenario, as an underspecified model yields biased estimated regression coefficients and biased predictions of the response variable. That is, using the regression model would consistently underestimate or overestimate the population slopes and the population means. To make matters even worse, the MSE tends to overestimate $\sigma^2$, giving wider confidence intervals than they should be.

c) Regression model contains one or more extraneous (unnecessary) variables
This occurs when the regression model contains extraneous variables that are related neither to the response variable nor to any of the explanatory variables. It is as if we include extra explanatory variables in the model that are not needed (not statistically significant). Note, however, that such a model has unbiased regression coefficients, unbiased predictions of the response variable, and an unbiased MSE. However, since the model has more parameters, the MSE has fewer degrees of freedom, producing wider confidence intervals, and the corresponding hypothesis tests have lower power. In addition, by including extraneous variables, the model becomes more complicated and harder to understand than necessary.
d) Regression model is overspecified
This occurs when the regression model contains one or more redundant (repeating) explanatory variables. Although the model is correct, we have added variables that are redundant. Redundant explanatory variables lead to problems such as inflated standard errors for the regression coefficients (the issue of multicollinearity; see Chapter 7). An overspecified regression model produces unbiased regression coefficients, unbiased predictions of the response variable, and an unbiased MSE. Such a model can be used, with caution, for prediction of the response, but it should not be used to evaluate the effect of an explanatory variable on the response variable. As with extraneous variables, the model becomes more complicated and harder to understand than necessary.
The motivation for variable selection can be summarized as follows. By deleting variables from the model, the precision of the parameter estimates of the retained variables may be improved, even though some of the deleted variables are not negligible. The same holds for the variance of a predicted response. Deleting variables potentially introduces bias into the estimates of the coefficients of the retained variables and of the response. However, if the deleted variables have small effects, the MSE of the biased estimates will be less than the variance of the unbiased estimates; that is, the amount of bias introduced is less than the reduction in variance. On the other hand, there is a danger in retaining negligible variables, that is, variables with zero coefficients or with coefficients smaller than their corresponding standard errors in the full model: the variances of the parameter estimates and of the predicted response are inflated.

Finally, remember that regression models are frequently built using retrospective data, that is, data that have been extracted from historical records. These data are often saturated with defects, including outliers, "wild" points, and inconsistencies resulting from changes in the organization's data collection and information processing systems over time. Such data defects can have a great impact on the variable selection process and lead to model misspecification. A very common problem in historical data is that some candidate explanatory variables vary over a very limited range; because of this limited range, the variable may seem unimportant in the least squares fit.
5.3 CRITERIA FOR EVALUATING SUBSET REGRESSION MODELS
Two key aspects of the variable selection problem are generating the subset models
and deciding if one subset is better than another. In this section, the criteria for
evaluating and comparing subset regression models will be discussed. Section 5.4
presents the computational methods for variable selection.

5.3.1 COEFFICIENT OF MULTIPLE DETERMINATION


A measure of the adequacy of a regression model that has been widely used is the coefficient of multiple determination, $R^2$. Let $R^2_{k+1}$ denote the coefficient of multiple determination for a subset regression model with $(k+1)$ terms, that is, $k$ independent variables and an intercept term, $\beta_0$. Recall that:
$$R^2_{k+1} = \frac{SSR_{k+1}}{SST} = 1 - \frac{SSE_{k+1}}{SST}$$
where $SSR_{k+1}$ and $SSE_{k+1}$ denote the regression sum of squares and the residual sum of squares, respectively, for a $(k+1)$-term subset model. Note that there are $\binom{k_{\max}}{k}$ values of $R^2_{k+1}$ for each value of $(k+1)$, one for each possible subset model of size $(k+1)$. Now $R^2_{k+1}$ increases as $(k+1)$ increases and is a maximum when $k = k_{\max}$. Therefore, the analyst uses this criterion by adding explanatory variables to the model up to the point where an additional variable is not useful, in that it provides only a small increase in $R^2_{k+1}$.

[Figure: hypothetical plot of the maximum $R^2_{k+1}$ (vertical axis, 0.0 to 1.0) versus $(k+1)$]
The general approach is illustrated in the figure above, which presents a hypothetical plot of the maximum value of $R^2_{k+1}$ for each subset size $(k+1)$ against $(k+1)$. Typically one examines such a display and then specifies the number of variables for the final model as the point at which the "knee" in the curve becomes apparent. Clearly this requires judgment on the part of the analyst.

Since an "optimum" value of $R^2_{k+1}$ cannot be found for a subset regression model, a "satisfactory" value needs to be determined. Aitkin [1974] proposed one solution to this problem by providing a test by which all subset regression models that have an $R^2$ not significantly different from the $R^2$ for the full model can be identified. Let
$$R^2_{\text{good}} = 1 - \left(1 - R^2_{k_{\max}+1}\right)\left(1 + d_{\alpha,n,k_{\max}}\right)$$
where
$$d_{\alpha,n,k_{\max}} = \frac{k_{\max}\, F_{\alpha,\,k_{\max},\,n-k_{\max}-1}}{n - k_{\max} - 1}$$
and $R^2_{k_{\max}+1}$ is the value of $R^2$ for the full model. Aitkin calls any subset of regressor variables producing an $R^2$ greater than $R^2_{\text{good}}$ an $R^2$-adequate ($\alpha$) subset.
Generally, it is not straightforward to use $R^2$ as a criterion for choosing the number of independent variables to include in the model. Typically, the analyst is looking for a simple model (with few variables) that is as good as, or nearly as good as, the model with all $k$ candidate independent variables.
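As an illustration of this criterion (not part of the original text), the following Python sketch computes Aitkin's cut-off $R^2_{\text{good}}$ for given values of $n$, $k_{\max}$, the full-model $R^2$, and $\alpha$. It assumes scipy is available for the F quantile, and the numbers used are similar to those of the Hald cement example discussed later in the chapter.

```python
from scipy.stats import f


def r2_adequate_threshold(r2_full, n, k_max, alpha=0.05):
    """Aitkin's R^2-adequate cut-off: subsets whose R^2 exceeds this
    value are not significantly worse than the full model."""
    d = k_max * f.ppf(1 - alpha, k_max, n - k_max - 1) / (n - k_max - 1)
    return 1 - (1 - r2_full) * (1 + d)


# Illustration with Hald-type dimensions (n = 13 observations, k_max = 4 candidates)
print(r2_adequate_threshold(r2_full=0.982, n=13, k_max=4, alpha=0.05))
```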

5.3.2 ADJUSTED $R^2$, $\bar{R}^2$
Recall that $R^2$ is a goodness-of-fit measure for the linear regression model. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively; it measures the strength of the relationship between the model and the dependent variable on a convenient 0–100% scale. $R^2$ increases with every predictor added to a model. Because $R^2$ never decreases, the fit can appear to improve with every term added, which can be completely misleading. In addition, if the model has too many terms and too many high-order polynomials, you can run into the problem of over-fitting the data, and a misleadingly high $R^2$ value can lead to misleading predictions.
In regression analysis, it can be tempting to add more explanatory variables as you think of them. Some of those variables will be significant, but some may not be. Like $R^2$, the adjusted $R^2$, denoted $\bar{R}^2$, indicates how well the model fits the data, but it adjusts for the number of variables in the model. $\bar{R}^2$ penalizes the addition of independent variables that do not help explain the response variable: if useful (significant) variables are added to the model, $\bar{R}^2$ increases, but if more and more useless (insignificant) variables are added, $\bar{R}^2$ decreases. In other words, $\bar{R}^2$ increases only when a new variable improves the model fit by more than a certain amount, and it decreases when the variable does not improve the fit sufficiently.

The adjusted $R^2$ statistic, $\bar{R}^2$, for a $(k+1)$-term equation is defined as:
$$\bar{R}^2_{k+1} = 1 - \left(\frac{n-1}{n-k-1}\right)\left(1 - R^2_{k+1}\right)$$
It can be shown (Edwards [1969], Haitovski [1969], and Seber [1977]) that if $s$ regressors are added to the model, $\bar{R}^2_{k+1+s}$ will exceed $\bar{R}^2_{k+1}$ if and only if the partial F-statistic for testing the significance of the $s$ additional explanatory variables exceeds 1. Consequently, one criterion for selecting an optimum subset model is to choose the model that has a maximum $\bar{R}^2_{k+1}$.

Owing to this property, $\bar{R}^2$ is often used to compare the goodness of fit of regression models that contain different numbers of independent variables. Suppose you are comparing a model with five or six independent variables to a model with one or two variables. The larger model will always have the higher $R^2$. However, is the model with more variables actually a better model, or does it just have more variables, one or two of which are insignificant? To determine this, compare the $\bar{R}^2$ values instead.
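The contrast between $R^2$ and $\bar{R}^2$ can be seen with a small simulation. The sketch below is a hedged illustration (not from the text); it fits a model with one useful regressor and a model that also includes four pure-noise regressors, using statsmodels, and the variable names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)                         # the only truly useful regressor
x_noise = rng.normal(size=(n, 4))               # irrelevant candidate variables
y = 3 + 2 * x1 + rng.normal(scale=1.0, size=n)

small = sm.OLS(y, sm.add_constant(x1)).fit()
large = sm.OLS(y, sm.add_constant(np.column_stack([x1, x_noise]))).fit()

# R^2 never decreases when variables are added, but adjusted R^2 can decrease.
print(small.rsquared, large.rsquared)           # large.rsquared >= small.rsquared
print(small.rsquared_adj, large.rsquared_adj)   # the adjusted value typically drops
```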

5.3.3 RESIDUAL MEAN SQUARE


A model fits better when its residuals are smaller, that is, when $SSE$ is smaller, so a model with a smaller $SSE$ is preferable. Based on this, the residual mean square for a subset regression model with $(k+1)$ terms is defined as:
$$MSE_{k+1} = \frac{SSE_{k+1}}{n - (k+1)}$$
So $MSE_{k+1}$ can be used as a criterion for model selection, just like $SSE$. Generally, $SSE_{k+1}$ decreases as $(k+1)$ increases. Taking into account the degrees of freedom attached to $SSE$, as $(k+1)$ increases, $MSE_{k+1}$ initially decreases, then stabilizes, and eventually increases. The eventual increase in $MSE_{k+1}$ occurs when the reduction in $SSE_{k+1}$ from adding an extra variable to the model is not sufficient to compensate for the loss of one degree of freedom in the denominator $(n-(k+1))$. When $MSE_{k+1}$ is plotted against $(k+1)$, the curve looks like the following figure.
[Figure: plot of $MSE_{k+1}$ versus $(k+1)$]
The subset regression model that minimizes $MSE_{k+1}$ will also maximize $\bar{R}^2_{k+1}$. To see this, note that
$$\bar{R}^2_{k+1} = 1 - \left(\frac{n-1}{n-(k+1)}\right)\left(1 - R^2_{k+1}\right) = 1 - \frac{n-1}{n-k-1}\cdot\frac{SSE_{k+1}}{SST} = 1 - \frac{MSE_{k+1}}{SST/(n-1)}$$
Thus, the criteria of minimum $MSE_{k+1}$ and maximum $\bar{R}^2_{k+1}$ are equivalent.
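A quick numerical check of this equivalence, assuming statsmodels is available (the attribute names .mse_resid and .rsquared_adj are statsmodels conventions, not notation from the text):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 40
X = rng.normal(size=(n, 3))
y = 1 + X @ np.array([0.5, -1.0, 0.0]) + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(X)).fit()
sst = np.sum((y - y.mean()) ** 2)

# Adjusted R^2 computed from the residual mean square matches the built-in value
adj_r2_from_mse = 1 - res.mse_resid / (sst / (n - 1))
print(np.isclose(adj_r2_from_mse, res.rsquared_adj))   # True: the two criteria coincide
```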

5.3.4 MALLOWS'S $C_p$ STATISTIC

Mallows [1964, 1966, 1973, 1995] proposed a model selection criterion that is related to the mean square error of a fitted value, that is,
$$E\left[\hat{y}_i - E(y_i)\right]^2 = \left[E(y_i) - E(\hat{y}_i)\right]^2 + \operatorname{Var}(\hat{y}_i)$$
Note that $E(y_i)$ is the expected response from the true regression equation and $E(\hat{y}_i)$ is the expected response from the $p$-term subset model. Thus, $E(y_i) - E(\hat{y}_i)$ is the bias at the $i$th data point. Consequently, the two terms on the right-hand side are the squared bias and variance components, respectively, of the mean square error. Let the total squared bias for a $p$-term equation be
$$SSB_p = \sum_{i=1}^{n}\left[E(y_i) - E(\hat{y}_i)\right]^2$$
and define the standardized total mean square error as:
$$\Gamma_p = \frac{1}{\sigma^2}\left\{\sum_{i=1}^{n}\left[E(y_i) - E(\hat{y}_i)\right]^2 + \sum_{i=1}^{n}\operatorname{Var}(\hat{y}_i)\right\} = \frac{SSB_p}{\sigma^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n}\operatorname{Var}(\hat{y}_i)$$
It can be shown that $\sum_{i=1}^{n}\operatorname{Var}(\hat{y}_i) = p\sigma^2$ and that the expected value of the residual sum of squares from a $p$-term equation is:
$$E\left[SSE_p\right] = SSB_p + (n-p)\sigma^2$$
Substituting for $\sum_{i=1}^{n}\operatorname{Var}(\hat{y}_i)$ and $SSB_p$ gives:
$$\Gamma_p = \frac{1}{\sigma^2}\left\{E\left[SSE_p\right] - (n-p)\sigma^2 + p\sigma^2\right\} = \frac{E\left[SSE_p\right]}{\sigma^2} - n + 2p$$

Suppose that $\hat{\sigma}^2$ is a good estimate of $\sigma^2$. Then replacing $E\left[SSE_p\right]$ by the observed value $SSE_p$ produces an estimate of $\Gamma_p$, say
$$C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2p$$
where $\hat{\sigma}^2$ is the MSE estimated from the full model. If the $p$-term model has negligible bias, then $SSB_p = 0$. Consequently, $E\left[SSE_p\right] = (n-p)\sigma^2$, and
$$E\left[C_p \mid \text{Bias} = 0\right] = \frac{(n-p)\sigma^2}{\sigma^2} - n + 2p = p$$
When using the $C_p$ criterion, it can be helpful to visualize the plot of $C_p$ as a function of $p$ for each regression equation, such as shown in the plot below. Regression equations with little bias will have values of $C_p$ that fall near the line $C_p = p$ (point A), while equations with substantial bias will fall above this line (point B). Generally, small values of $C_p$ are desirable. For example, although point C is above the line $C_p = p$, it is below point A and thus represents a model with a lower total error. It may be preferable to accept some bias in the equation in order to reduce the average error of prediction.

[Figure: plot of $C_p$ versus $p$ showing the reference line $C_p = p$ and points A, B, and C]
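As a hedged sketch of how $C_p$ is computed in practice (simulated data, illustrative only), the full-model residual mean square is used as $\hat{\sigma}^2$ and each subset's $C_p$ follows directly from its $SSE_p$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 30
X = rng.normal(size=(n, 4))
y = 2 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(X)).fit()
sigma2_hat = full.mse_resid                      # full-model MSE estimates sigma^2


def mallows_cp(subset_cols):
    sub = sm.OLS(y, sm.add_constant(X[:, subset_cols])).fit()
    p = len(subset_cols) + 1                     # number of terms, intercept included
    return sub.ssr / sigma2_hat - n + 2 * p


print(mallows_cp([0, 1]))    # should be near p = 3 if the subset has little bias
print(mallows_cp([2, 3]))    # omitting the active variables inflates Cp
```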

5.3.5 $PRESS_p$ STATISTIC

Frequently, regression models are used for prediction of future observations or estimation of the mean response. Generally, the analyst would like to select the independent variables so that the mean square error of prediction is minimized. Allen [1971, 1974] suggested using the prediction error sum of squares (PRESS) statistic as a measure of a model's prediction ability:
$$PRESS_p = \sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2$$
where $\hat{y}_{(i)}$ is the prediction of the $i$th observation from a model fitted without that observation, $e_i$ is the ordinary residual, and $h_{ii}$ is the $i$th diagonal element of the hat matrix. Models with small values of $PRESS_p$ are preferred.
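A minimal numpy sketch of the PRESS computation (illustrative, not from the text): the ordinary residuals and the diagonal of the hat matrix are enough, so the model does not have to be refitted n times.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept plus two regressors
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(size=n)

# Ordinary least squares fit
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b

# Diagonal of the hat matrix H = X (X'X)^{-1} X'
h = np.sum(X @ np.linalg.inv(X.T @ X) * X, axis=1)

# PRESS: each residual is rescaled as if its observation had been left out
press = np.sum((resid / (1 - h)) ** 2)
print(press)
```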

5.4 TECHNIQUES FOR VARIABLE SELECTION
It is desirable to consider regression models that employ only a subset of the
candidate explanatory variables. To find the subset of variables to be included in
the final equation, it is natural to consider fitting models with various combinations
of the candidate regressors. In this section, several computational techniques for reducing a large list of potential explanatory variables to a more manageable one are discussed. In particular, these techniques determine which explanatory variables in the list are the most important predictors of y and which are the least important.

5.4.1 ALL POSSIBLE REGRESSIONS


This procedure requires the analyst to fit all the regression models involving one candidate explanatory variable, two candidate explanatory variables, and so on. These models are then evaluated according to some suitable criterion, as discussed in Section 5.3, and the "best" regression model is selected. If we assume that the intercept term $\beta_0$ is included in all equations, then with $k$ candidate variables there are $2^k$ total models to be estimated and examined. For example, if $k = 4$, there are $2^4 = 16$ possible equations, while for $k = 10$ there are $2^{10} = 1024$ possible regression equations. Clearly the number of models to be estimated increases rapidly as the number of candidate explanatory variables increases. Prior to the development of efficient computer codes, generating all possible regressions was impractical for problems involving more than a few variables. The availability of high-speed computers has motivated the development of several very efficient algorithms for all possible regressions.

Example: Hald Cement Data


Hald Cement Data
Observation i    y_i     x_i1    x_i2    x_i3    x_i4
1 78.5 7 26 6 60
2 74.3 1 29 15 52
3 104.3 11 56 8 20
4 87.6 11 31 8 47
5 95.9 7 52 6 33
6 109.2 11 55 9 22
7 102.7 3 71 17 6
8 72.5 1 31 22 44
9 93.1 2 54 18 22
10 115.9 21 47 4 26
11 83.8 1 40 23 34
12 113.3 11 66 9 12
13 109.4 10 68 8 12

Hald [1952] presents data concerning the heat evolved in calories per gram of
cement (y) as a function of the amount of each of four ingredients in the mix:

tricalcium aluminate ( x1 ), tricalcium silicate ( x2 ), tetracalcium alumino ferrite ( x3 ),
and dicalcium silicate ( x4 ). The data are shown in table above.

Since there are k = 4 candidate explanatory, there are 24 = 16 possible regression


models if the intercept 0 is always included. The statistics from fitting these 16
models are displayed in table below.
Summary of All Possible Regressions
Number of IV   p   IV in Model    SSE_{k+1}   R^2_{k+1}   adj R^2_{k+1}   MSE_{k+1}   C_p
None           1   None           2715.76     0.000       0.000           226.31      442.92
1              2   x1             1265.69     0.534       0.492           115.06      202.55
1              2   x2              906.34     0.666       0.636            82.39      142.49
1              2   x3             1939.40     0.286       0.221           176.31      315.16
1              2   x4              883.87     0.675       0.645            80.35      138.73
2              3   x1 x2            57.90     0.979       0.974             5.79        2.68
2              3   x1 x3          1227.07     0.548       0.458           122.71      198.10
2              3   x1 x4            74.76     0.972       0.967             7.48        5.50
2              3   x2 x3           415.44     0.847       0.816            41.54       62.44
2              3   x2 x4           868.88     0.680       0.617            86.89      138.23
2              3   x3 x4           175.74     0.935       0.922            17.57       22.37
3              4   x1 x2 x3         48.11     0.982       0.976             5.35        3.04
3              4   x1 x2 x4         47.97     0.982       0.976             5.33        3.02
3              4   x1 x3 x4         50.84     0.981       0.975             5.65        3.50
3              4   x2 x3 x4         73.81     0.973       0.964             8.20        7.34
4              5   x1 x2 x3 x4      47.86     0.982       0.974             5.98        5.00
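The table above can be reproduced (up to rounding) with a short all-possible-regressions loop. The following Python sketch is illustrative only and assumes statsmodels is available; the Hald data values are taken from the table earlier in this example.

```python
from itertools import combinations

import numpy as np
import statsmodels.api as sm

# Hald cement data, taken from the table earlier in this example
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n, k = X.shape

# sigma^2 is estimated by the residual mean square of the full model
sigma2_hat = sm.OLS(y, sm.add_constant(X)).fit().mse_resid

rows = []
for size in range(k + 1):
    for cols in combinations(range(k), size):
        cols = list(cols)
        exog = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
        res = sm.OLS(y, exog).fit()
        p = len(cols) + 1                       # number of terms, intercept included
        cp = res.ssr / sigma2_hat - n + 2 * p
        name = " ".join(f"x{c + 1}" for c in cols) if cols else "None"
        rows.append((name, res.ssr, res.rsquared, res.rsquared_adj, res.mse_resid, cp))

for name, sse, r2, ar2, mse, cp in sorted(rows, key=lambda r: r[-1]):
    print(f"{name:<12} SSE={sse:8.2f}  R2={r2:.3f}  adjR2={ar2:.3f}  "
          f"MSE={mse:7.2f}  Cp={cp:7.2f}")
```

Sorting by $C_p$ should place $(x_1, x_2)$ first, consistent with the $C_p$ discussion later in this example.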

The table below displays the least squares estimates of the regression coefficients.
The partial nature of regression coefficients is readily apparent from examination
of this table. For example, consider x2 . When the model contains only x2 , the least
squares estimate of x2 effect is 0.789. If x4 is added to the model, the x2 effect is
0.311, a reduction of over 50%. Further addition of x3 changes the x2 effect to
−0.923.
It can be seen that while the estimates of $x_1$ are consistently positive and those of $x_4$ consistently negative in all combinations of variables, the estimates for $x_2$ and $x_3$ are a mixture of positive and negative values. Clearly the least squares estimate of an individual regression coefficient depends heavily on the presence of other variables in the model. The large changes in the magnitude of the regression coefficients observed in the Hald cement data when variables are added or removed indicate that there is substantial correlation between the four variables (the multicollinearity problem that will be discussed later in Chapter 7).
Least Squares Estimates for All Possible Regressions
Variables in Model    β̂_0        β̂_1      β̂_2      β̂_3      β̂_4
x1                    81.479     1.869
x2                    57.424               0.789
x3                   110.203                        −
x4                   117.568                                 −0.738
x1 x2                 52.577     1.468     0.662
x1 x3                 72.349     2.312              0.494
x1 x4                103.097     1.440                       −0.614
x2 x3                 72.075               0.731    −
x2 x4                 94.160               0.311             −
x3 x4                131.282                        −        −
x1 x2 x3              48.194     1.696     0.657    0.250
x1 x2 x4              71.648     1.452     0.416             −0.237
x1 x3 x4             111.684     1.052              −        −
x2 x3 x4             203.642              −0.923    −        −
x1 x2 x3 x4           62.405     1.551     0.510    0.102    −0.144

Consider evaluating the subset models by the $R^2_{k+1}$ criterion. From examining the plot of $R^2_{k+1}$ versus $(k+1)$, it is clear that after two variables are in the model, there is little to be gained in terms of $R^2$ by introducing additional variables. Both of the 2-variable models $(x_1, x_2)$ and $(x_1, x_4)$ have essentially the same $R^2$ values, and in terms of this criterion it would make little difference which model is selected as the final regression model. It may be preferable to use $(x_1, x_4)$ because $x_4$ provides the best 1-variable model. Taking $\alpha = 0.05$,
$$R^2_{\text{good}} = 1 - \left(1 - R^2_{\text{full}}\right)\left(1 + \frac{4 F_{0.05,4,8}}{8}\right) = 1 - 0.01762\left(1 + \frac{4(3.84)}{8}\right) = 0.94855$$
Therefore, any subset regression model for which $R^2_{k+1} > R^2_{\text{good}} = 0.94855$ is $R^2$-adequate (0.05); that is, its $R^2$ is not significantly different from the $R^2$ of the full model, which is 0.982. Clearly, several models satisfy this criterion, and so the choice of the final model is still not clear.
It is instructive to examine the pairwise correlations between xi and x j and
between xi and y. Note from table below that the pairs of variables ( x1 , x3 ) and
( x2 , x4 ) are highly correlated. Consequently, adding further variables when x1 and
x2 or when x1 and x4 are already in the model will be of little use since the
information content in the excluded variables is essentially present in the variables

that are in the model. This correlative structure is partially responsible for the large changes in the regression coefficients noted in the coefficient table above.

Matrix of Simple Correlations for the Hald Cement Data

        x1        x2        x3        x4        y
x1      1.0
x2      0.229     1.0
x3      −         −         1.0
x4      −         −         0.030     1.0
y       0.731     0.816     −0.535    −0.821    1.0
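Because several entries in the printed correlation matrix are not legible, the full matrix can be recomputed directly from the data. The following sketch is illustrative only and assumes numpy is available.

```python
import numpy as np

# Hald cement data (y in the last column), taken from the table earlier in the chapter
data = np.array([
    [7, 26, 6, 60, 78.5], [1, 29, 15, 52, 74.3], [11, 56, 8, 20, 104.3],
    [11, 31, 8, 47, 87.6], [7, 52, 6, 33, 95.9], [11, 55, 9, 22, 109.2],
    [3, 71, 17, 6, 102.7], [1, 31, 22, 44, 72.5], [2, 54, 18, 22, 93.1],
    [21, 47, 4, 26, 115.9], [1, 40, 23, 34, 83.8], [11, 66, 9, 12, 113.3],
    [10, 68, 8, 12, 109.4]])

labels = ["x1", "x2", "x3", "x4", "y"]
corr = np.corrcoef(data, rowvar=False)          # 5 x 5 matrix of simple correlations
for label, row in zip(labels, corr):
    print(label, np.round(row, 3))
```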

[Figure: plot of the maximum $R^2_{k+1}$ for each subset size versus $(k+1)$ for the Hald data subset models]

A plot of $MSE_{k+1}$ versus $(k+1)$ shows that the minimum residual mean square is attained by the model $(x_1, x_2, x_4)$, with $MSE = 5.3303$. Note that, as expected, the model that minimizes $MSE_{k+1}$ also maximizes $\bar{R}^2_{k+1}$. Two of the other 3-variable models, $(x_1, x_2, x_3)$ and $(x_1, x_3, x_4)$, and the 2-variable models $(x_1, x_2)$ and $(x_1, x_4)$ have comparable values of $MSE$. If $(x_1, x_2)$ or $(x_1, x_4)$ is in the model, there is little reduction in $MSE$ from adding further variables. Note, however, that adding $x_3$ to $(x_1, x_2, x_4)$ increases the $MSE$. If the model $(x_1, x_2, x_4)$ does not possess the required statistical properties, a better model would be chosen from among $(x_1, x_2, x_3)$, $(x_1, x_3, x_4)$, or $(x_1, x_2, x_3, x_4)$.

[Figure: plot of $MSE_{k+1}$ versus $(k+1)$ for the Hald data subset models]

[Figure: plot of $C_p$ versus $p$ for the Hald data subset models, with the line $C_p = p$]
Examining the $C_p$ plot, we find four models that could be acceptable: $(x_1, x_2)$, $(x_1, x_2, x_3)$, $(x_1, x_2, x_4)$, and $(x_1, x_3, x_4)$. It may be appropriate to choose the simpler model $(x_1, x_2)$ as the final model because it has the smallest $C_p$ and its value lies closest to the line $C_p = p$.

This example has illustrated the computational procedure associated with model building by all possible regressions. Note that there is no clear-cut choice of the best regression equation; very often the different criteria suggest different equations. For example, the best $C_p$ model is $(x_1, x_2)$, while the best $MSE$ and $\bar{R}^2$ model is $(x_1, x_2, x_4)$. All "final" candidate models should be subjected to the usual tests for adequacy, including investigation of leverage points, influence, and multicollinearity.
A few notes on a reasonable strategy for using $C_p$ to identify the "best" model:
1. Identify combinations of variables for which the $C_p$ value is near $p$.
2. Since $C_p = p$ for the full model, do not use $C_p$ to evaluate the full model.
3. Models that yield a large $C_p$ not near $p$ suggest that some important explanatory variables are missing from the analysis/model.
4. If a number of models have $C_p$ near $p$, choose the model with the smallest $C_p$, as this ensures that the combination of bias and variance is at a minimum.
5. When more than one model has a small value of $C_p$ near $p$, choose the simpler model or the model that meets your research needs.
More advice:
To calculate $C_p$, an unbiased estimate of $\sigma^2$ is needed. Frequently, the residual mean square for the full equation is used for this purpose. However, this forces $C_p = p = k + 1$ for the full equation. Using $MSE_{\text{full}}$ from the full model as an estimate of $\sigma^2$ assumes that the full model has negligible bias. If the full model has several explanatory variables that do not contribute significantly to the model (zero regression coefficients), then $MSE_{\text{full}}$ will often overestimate $\sigma^2$, and consequently the values of $C_p$ will be small. If the $C_p$ statistic is to work properly, a good estimate of $\sigma^2$ must be used.

5.4.2 STEPWISE REGRESSION: FORWARD SELECTION


Because evaluating all possible regressions can be burdensome computationally,
various methods have been developed for evaluating only a small number of subset
regression models by either adding or deleting variable one at a time. These
methods are generally referred to as stepwise type procedures.
The forward selection procedure begins with the assumption that there are no explanatory variables in the model other than the intercept. An effort is made to find an optimal subset by adding variables to the model one at a time. The first variable selected for entry into the equation is the one that has the largest simple correlation with the response variable y. Suppose that this variable is $x_1$. This is also the variable that will produce the largest value of the F-statistic for testing significance of regression. This variable is entered if its F-statistic exceeds a preselected F value, say $F_{IN}$ (or F-to-enter).

The second variable chosen for entry is the one that now has the largest correlation with y after adjusting for the effect of the first variable entered ($x_1$) on y. These correlations are referred to as partial correlations. They are the simple correlations between the residuals from the regression $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$ and the residuals from the regressions of each of the other candidate variables on $x_1$, say $\hat{x}_j = \hat{\alpha}_{0j} + \hat{\alpha}_{1j} x_1$, $j = 2, \ldots, k$. Suppose that at step 2 the variable with the highest partial correlation with y is $x_2$. This implies that the largest partial F-statistic is
$$F = \frac{SSR(x_2 \mid x_1)}{MSE(x_1, x_2)} = \frac{SSR(x_1, x_2) - SSR(x_1)}{MSE(x_1, x_2)}$$
If this F value exceeds $F_{IN}$, then $x_2$ is added to the model. In general, at each step the variable having the highest partial correlation with y (or, equivalently, the largest partial F-statistic given the other variables already in the model) is added to the model if its partial F-statistic exceeds the preselected entry level $F_{IN}$. The procedure terminates either when the largest partial F-statistic at a particular step does not exceed $F_{IN}$ or when the last candidate variable is added to the model.

In many statistical software packages, $F_{IN}$ is determined by choosing a type-I error rate $\alpha$, so that the explanatory variable with the highest partial correlation with y is added to the model if its partial F-statistic exceeds $F_{IN} = F_{\alpha,1,n-k-1}$.
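A generic forward-selection routine based on the partial F test just described might look as follows. This is a hedged sketch (illustrative function and threshold names, statsmodels and scipy assumed), not the algorithm of any particular package; applied to the Hald data with $\alpha = 0.05$ it should reproduce the $(x_4, x_1)$ selection worked out in the example below.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f


def forward_selection(X, y, alpha=0.05):
    """Forward selection by partial F-statistics: at each step add the candidate
    with the largest partial F, stopping when it no longer exceeds F-to-enter."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        base = sm.OLS(y, sm.add_constant(X[:, selected]) if selected
                      else np.ones((n, 1))).fit()
        best_j, best_f, best_df = None, -np.inf, None
        for j in remaining:
            trial = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            partial_f = (base.ssr - trial.ssr) / trial.mse_resid   # 1 numerator df
            if partial_f > best_f:
                best_j, best_f, best_df = j, partial_f, trial.df_resid
        if best_f <= f.ppf(1 - alpha, 1, best_df):                 # F-to-enter not exceeded
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```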

Example: Forward Selection - Hald Cement Data


The forward selection procedure will be applied to the Hald Cement Data. Outputs
show the results obtained using a computer program, SPSS. In this example,
 = 0.05 is used to determine FIN . Some computer codes require that a numerical
value be selected for FIN , popular choice being between 2.0 and 4.0 for large n.

From the correlation table shown previously, the variable most highly correlated with y is $x_4$ ($r_{4,y} = -0.821$), and since the F-statistic associated with the model using $x_4$ is $F = 22.80 > F_{0.05,1,11} = 4.844$, $x_4$ is added to the model.

At step 2 the variable having the largest partial correlation with y (or the largest partial F-statistic given that $x_4$ is in the model) is $x_1$ (why? see below). The partial F-statistic for this variable is
$$F = \frac{SSR(x_1 \mid x_4)}{MSE(x_1, x_4)} = \frac{2641.001 - 1831.896}{7.476} = \frac{809.105}{7.476} = 108.22$$
which is larger than $F_{IN} = F_{0.05,1,10} = 4.96$, and so $x_1$ is added to the model.

$$F = \frac{SSR(x_2 \mid x_4)}{MSE(x_2, x_4)} = \frac{1846.883 - 1831.896}{86.888} = \frac{14.987}{86.888} = 0.172$$

$$F = \frac{SSR(x_3 \mid x_4)}{MSE(x_3, x_4)} = \frac{2540.025 - 1831.896}{17.574} = \frac{708.129}{17.574} = 40.294$$

Model Summaryb

Adjusted R Std. Error of the


Model R R Square Square Estimate

1 .821a .675 .645 8.96390

a. Predictors: (Constant), x4

b. Dependent Variable: y

ANOVAb

Model Sum of Squares df Mean Square F Sig.

1 Regression 1831.896 1 1831.896 22.799 .001a

Residual 883.867 11 80.352

Total 2715.763 12

a. Predictors: (Constant), x4

b. Dependent Variable: y

Coefficientsa

Standardized
Unstandardized Coefficients Coefficients

Model B Std. Error Beta t Sig.

1 (Constant) 117.568 5.262 22.342 .000

x4 -.738 .155 -.821 -4.775 .001

a. Dependent Variable: y

In step 3, $x_2$ shows the largest partial correlation with y (why? see below), with partial F-statistic
$$F = \frac{SSR(x_2 \mid x_1, x_4)}{MSE(x_1, x_2, x_4)} = \frac{2667.790 - 2641.001}{5.330} = \frac{26.789}{5.330} = 5.03$$

$$F = \frac{SSR(x_3 \mid x_1, x_4)}{MSE(x_1, x_3, x_4)} = \frac{2664.927 - 2641.001}{5.648} = \frac{23.926}{5.648} = 4.236$$

for which the partial F-statistic does not exceed $F_{IN} = F_{0.05,1,9} = 5.12$, so at $\alpha = 0.05$ the forward selection procedure terminates with only $(x_1, x_4)$ in the model. Note that $(x_1, x_4)$ has only the 6th largest value of $\bar{R}^2$ (6th smallest MSE), with a $C_p$ that is reasonably small but lies above the line $C_p = p$. Note also that $F = 5.03 > F_{IN} = F_{0.10,1,9} = 3.36$, so at $\alpha = 0.10$ the selection would continue (see below).

Model Summary

Adjusted R Std. Error of the


Model R R Square Square Estimate

1 .821a .675 .645 8.96390

2 .986b .972 .967 2.73427

a. Predictors: (Constant), x4

b. Predictors: (Constant), x4, x1

ANOVAc

Model Sum of Squares df Mean Square F Sig.

1 Regression 1831.896 1 1831.896 22.799 .001a

Residual 883.867 11 80.352

Total 2715.763 12

2 Regression 2641.001 2 1320.500 176.627 .000b

Residual 74.762 10 7.476

Total 2715.763 12

a. Predictors: (Constant), x4

b. Predictors: (Constant), x4, x1

c. Dependent Variable: y

Coefficientsa

Standardized
Unstandardized Coefficients Coefficients

Model B Std. Error Beta t Sig.

1 (Constant) 117.568 5.262 22.342 .000

x4 -.738 .155 -.821 -4.775 .001

2 (Constant) 103.097 2.124 48.540 .000

x4 -.614 .049 -.683 -12.621 .000

x1 1.440 .138 .563 10.403 .000

a. Dependent Variable: y

Model Summary

Model R R Square Adjusted R Std. Error of the


Square Estimate

1 .821a .675 .645 8.9639


2 .986b .972 .967 2.7343
3 .991c .982 .976 2.3087

a. Predictors: (Constant), x4
b. Predictors: (Constant), x4, x1
c. Predictors: (Constant), x4, x1, x2

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 1831.896 1 1831.896 22.799 .001b

1 Residual 883.867 11 80.352

Total 2715.763 12
Regression 2641.001 2 1320.500 176.627 .000c
2 Residual 74.762 10 7.476
Total 2715.763 12
Regression 2667.790 3 889.263 166.832 .000d

3 Residual 47.973 9 5.330

Total 2715.763 12

a. Dependent Variable: y
b. Predictors: (Constant), x4
c. Predictors: (Constant), x4, x1
d. Predictors: (Constant), x4, x1, x2

Coefficientsa

Model Unstandardized Coefficients Standardized t Sig.


Coefficients

B Std. Error Beta


(Constant) 117.568 5.262 22.342 .000
1
x4 -.738 .155 -.821 -4.775 .001
(Constant) 103.097 2.124 48.540 .000
2 x4 -.614 .049 -.683 -12.621 .000
x1 1.440 .138 .563 10.403 .000
(Constant) 71.648 14.142 5.066 .001

x4 -.237 .173 -.263 -1.365 .205


3
x1 1.452 .117 .568 12.410 .000

x2 .416 .186 .430 2.242 .052

a. Dependent Variable: y

Note that at $\alpha = 0.10$, $F = 5.03 > F_{IN} = F_{0.10,1,9} = 3.36$, and therefore $x_2$ can be added to the model. At this point the only remaining candidate variable is $x_3$, for which the partial F-statistic does not exceed the entry threshold ($F_{0.05,1,8} = 5.32$), so the forward selection procedure terminates with $(x_1, x_2, x_4)$ in the model. Note that once $x_2$ is included in the model, the variable $x_4$ becomes insignificant. Thus, the forward selection procedure does not guarantee that the final model contains only significant variables.

Adding $x_1$ to the model that contains $x_4$ substantially increases $\bar{R}^2$ (from 64.5% to 96.7%) and substantially decreases MSE (from $8.96^2$ to $2.73^2$). Further adding $x_2$ to the model that contains $(x_1, x_4)$ continues to increase $\bar{R}^2$ (from 96.7% to 97.6%) and to decrease MSE (from $2.73^2$ to $2.31^2$), but only by a small amount. Recall that all possible regressions based on both $\bar{R}^2$ and MSE would choose $(x_1, x_2, x_4)$ as the best model.
Other required outputs (not shown in SPSS)
ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 1846.883 2 923.441 10.628 .003b

1 Residual 868.880 10 86.888

Total 2715.763 12

a. Dependent Variable: y
b. Predictors: (Constant), x2, x4

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 2540.025 2 1270.013 72.267 .000b

1 Residual 175.738 10 17.574

Total 2715.763 12

a. Dependent Variable: y
b. Predictors: (Constant), x3, x4

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 2664.927 3 888.309 157.266 .000b

1 Residual 50.836 9 5.648

Total 2715.763 12

a. Dependent Variable: y

b. Predictors: (Constant), x3, x4, x1

5.4.3 STEPWISE REGRESSION: BACKWARD ELIMINATION
The forward selection procedure starts with no explanatory variables in the model and keeps adding one variable at a time until a suitable model is obtained (until no more variables are worth adding). The backward elimination procedure works in the opposite direction.
The backward elimination methodology begins with all explanatory variables in the model and deletes one variable at a time until a suitable model is obtained (i.e., until no more variables should be removed). That is, the procedure begins with a model that includes all k candidate variables. Then the partial F-statistic is computed for each variable as if it were the last variable to enter the model. The smallest of these partial F-statistics is compared with a preselected value, $F_{OUT}$ (or F-to-remove), and if the smallest partial F-statistic is less than $F_{OUT}$, that variable is removed from the model.

Now a regression model with (k − 1) variables is fitted, the partial F-statistics for this reduced model are calculated, and the procedure is repeated. The backward elimination algorithm terminates when the smallest partial F-statistic is not less than the cut-off value $F_{OUT}$. Backward elimination is often a very good variable selection procedure. It is particularly favoured by analysts who like to see the effect of including all the candidate variables, so that no important variables will be missed.
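A corresponding backward-elimination sketch, under the same assumptions as the forward-selection sketch given earlier (illustrative only, not the algorithm of any particular package):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f


def backward_elimination(X, y, alpha=0.05):
    """Backward elimination by partial F-statistics: start from the full model and
    drop the weakest variable until every remaining one exceeds F-to-remove."""
    n, k = X.shape
    selected = list(range(k))
    while selected:
        full = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
        worst_j, worst_f = None, np.inf
        for j in selected:
            cols = [c for c in selected if c != j]
            reduced = sm.OLS(y, sm.add_constant(X[:, cols]) if cols
                             else np.ones((n, 1))).fit()
            partial_f = (reduced.ssr - full.ssr) / full.mse_resid
            if partial_f < worst_f:
                worst_j, worst_f = j, partial_f
        if worst_f >= f.ppf(1 - alpha, 1, full.df_resid):   # smallest F clears F-to-remove
            break
        selected.remove(worst_j)
    return selected
```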

Example: Backward Elimination - Hald Cement Data


Backward elimination will be applied to the Hald cement data with the cut-off value $F_{OUT}$ selected using $\alpha = 0.05$; thus, a variable is eliminated if its partial F-statistic is less than $F_{0.05,1,n-k-1}$.
Step 1 shows the results of fitting the full model. The smallest partial F value is associated with $x_3$ (why? see below):
$$F = \frac{SSR(x_3 \mid x_1, x_2, x_4)}{MSE(x_1, x_2, x_3, x_4)} = \frac{2667.899 - 2667.790}{5.983} = \frac{0.109}{5.983} = 0.018$$
Since $F = 0.018 < F_{OUT} = F_{0.05,1,8} = 5.32$, $x_3$ is removed from the model.

$$F = \frac{SSR(x_4 \mid x_1, x_2, x_3)}{MSE(x_1, x_2, x_3, x_4)} = \frac{2667.899 - 2667.652}{5.983} = \frac{0.247}{5.983} = 0.041$$

$$F = \frac{SSR(x_2 \mid x_1, x_3, x_4)}{MSE(x_1, x_2, x_3, x_4)} = \frac{2667.899 - 2664.927}{5.983} = \frac{2.972}{5.983} = 0.497$$

$$F = \frac{SSR(x_1 \mid x_2, x_3, x_4)}{MSE(x_1, x_2, x_3, x_4)} = \frac{2667.899 - 2641.949}{5.983} = \frac{25.950}{5.983} = 4.337$$

Model Summary

Model R R Square Adjusted R Square Std. Error of the


Estimate

1 .991a .982 .974 2.4460


2 .991b .982 .976 2.3087

3 .989c .979 .974 2.4063

a. Predictors: (Constant), x4, x3, x1, x2


b. Predictors: (Constant), x4, x1, x2 c. Predictors: (Constant), x1, x2

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 2667.899 4 666.975 111.479 .000b

1 Residual 47.864 8 5.983

Total 2715.763 12

Regression 2667.790 3 889.263 166.832 .000c


2 Residual 47.973 9 5.330
Total 2715.763 12
Regression 2657.859 2 1328.929 229.504 .000d

3 Residual 57.904 10 5.790

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x4, x3, x1, x2


c. Predictors: (Constant), x4, x1, x2 d. Predictors: (Constant), x1, x2

Coefficientsa

Model Unstandardized Coefficients Standardized t Sig.


Coefficients

B Std. Error Beta

(Constant) 62.405 70.071 .891 .399

x1 1.551 .745 .607 2.083 .071

1 x2 .510 .724 .528 .705 .501

x3 .102 .755 .043 .135 .896

x4 -.144 .709 -.160 -.203 .844


(Constant) 71.648 14.142 5.066 .001
x1 1.452 .117 .568 12.410 .000
2
x2 .416 .186 .430 2.242 .052
x4 -.237 .173 -.263 -1.365 .205
(Constant) 52.577 2.286 22.998 .000

3 x1 1.468 .121 .574 12.105 .000

x2 .662 .046 .685 14.442 .000

a. Dependent Variable: y

Step 2 shows the results of fitting the 3-variable model involving $(x_1, x_2, x_4)$. The smallest partial F-statistic in this model,
$$F = \frac{SSR(x_4 \mid x_1, x_2)}{MSE(x_1, x_2, x_4)} = \frac{2667.790 - 2657.859}{5.33} = \frac{9.931}{5.33} = 1.86$$
is associated with $x_4$ (why? see below). Since $F = 1.86 < F_{OUT} = F_{0.05,1,9} = 5.12$, $x_4$ is eliminated from the model.
$$F = \frac{SSR(x_2 \mid x_1, x_4)}{MSE(x_1, x_2, x_4)} = \frac{2667.790 - 2641.001}{5.33} = \frac{26.789}{5.33} = 5.026$$

$$F = \frac{SSR(x_1 \mid x_2, x_4)}{MSE(x_1, x_2, x_4)} = \frac{2667.790 - 1846.883}{5.33} = \frac{820.907}{5.33} = 154.016$$

At step 3, the results of fitting the 2-variable model involving $(x_1, x_2)$ are examined. The two partial F-statistics in this model are:
$$F = \frac{SSR(x_1 \mid x_2)}{MSE(x_1, x_2)} = \frac{2657.859 - 1809.427}{5.79} = 146.53$$

$$F = \frac{SSR(x_2 \mid x_1)}{MSE(x_1, x_2)} = \frac{2657.859 - 1450.076}{5.79} = 208.60$$

ANOVAa
Model Sum of Squares df Mean Square F Sig.

Regression 1809.427 1 1809.427 21.961 .001b

1 Residual 906.336 11 82.394

Total 2715.763 12

a. Dependent Variable: y
b. Predictors: (Constant), x2

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 1450.076 1 1450.076 12.603 .005b

1 Residual 1265.687 11 115.062

Total 2715.763 12
a. Dependent Variable: y
b. Predictors: (Constant), x1

The smaller of these, $F = 146.53$, is associated with $x_1$, and since this exceeds $F_{OUT} = F_{0.05,1,10} = 4.965$, no further variables can be eliminated from the model. Therefore, backward elimination terminates, yielding the final model $(x_1, x_2)$. Note that this is a different model from that found by forward selection, $(x_1, x_4)$, and different from the one tentatively identified as best by the all-possible-regressions procedure.
Other required outputs (not shown in SPSS)
ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 2667.652 3 889.217 166.345 .000b

1 Residual 48.111 9 5.346

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x3, x2, x1

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 2664.927 3 888.309 157.266 .000b

1 Residual 50.836 9 5.648

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x4, x3, x1

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 2641.949 3 880.650 107.375 .000b

1 Residual 73.815 9 8.202

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x2, x3, x4

ANOVAa
Model Sum of Squares df Mean Square F Sig.

Regression 1846.883 2 923.441 10.628 .003b

1 Residual 868.880 10 86.888

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x2, x4

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 2641.001 2 1320.500 176.627 .000b

1 Residual 74.762 10 7.476

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x1, x4

5.4.4 STEPWISE REGRESSION
Another popular procedure for variable selection is the stepwise regression algorithm of Efroymson [1960]. Stepwise regression is a modification of forward selection in which, at each step, all variables previously entered into the model are reassessed via their partial F-statistics. A variable added at an earlier step may now be redundant (insignificant) because of its relationships with (or relative importance compared to) the other variables now in the equation. If the partial F-statistic for a variable is less than $F_{OUT}$, that variable is eliminated from the model. The forward step is then applied to the variables that have not yet been considered.

Stepwise regression requires two cut-off values, $F_{IN}$ and $F_{OUT}$. Some analysts prefer to choose $F_{IN} = F_{OUT}$, although this is not necessary. In most statistical software, $F_{IN} > F_{OUT}$ is chosen, so that a variable must clear a higher bar to enter the model than it must to remain in it once added; this guards against model underspecification.
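Combining the two ideas gives a basic stepwise routine: a forward step followed by a re-test of every variable already in the model. The sketch below is illustrative only (statsmodels and scipy assumed, not the algorithm of any particular package); production implementations add safeguards against cycling, typically by enforcing $F_{IN} \ge F_{OUT}$.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f


def stepwise_selection(X, y, alpha_in=0.10, alpha_out=0.10):
    """Efroymson-style stepwise selection: a forward step followed by a re-test
    of every variable already in the model, repeated until nothing changes.
    (Real implementations usually enforce F_IN >= F_OUT to rule out cycling.)"""
    n, k = X.shape
    selected, remaining = [], list(range(k))

    def fit(cols):
        return sm.OLS(y, sm.add_constant(X[:, cols]) if cols
                      else np.ones((n, 1))).fit()

    while True:
        changed = False
        # Forward step: try to add the candidate with the largest partial F
        base = fit(selected)
        best_j, best_f, best_df = None, -np.inf, None
        for j in remaining:
            trial = fit(selected + [j])
            partial_f = (base.ssr - trial.ssr) / trial.mse_resid
            if partial_f > best_f:
                best_j, best_f, best_df = j, partial_f, trial.df_resid
        if best_j is not None and best_f > f.ppf(1 - alpha_in, 1, best_df):
            selected.append(best_j)
            remaining.remove(best_j)
            changed = True
        # Backward step: re-test every variable currently in the model
        for j in list(selected):
            full = fit(selected)
            reduced = fit([c for c in selected if c != j])
            partial_f = (reduced.ssr - full.ssr) / full.mse_resid
            if partial_f < f.ppf(1 - alpha_out, 1, full.df_resid):
                selected.remove(j)
                remaining.append(j)
                changed = True
        if not changed:
            return selected
```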

Example: Stepwise Regression - Hald Cement Data


For demonstration purposes, $\alpha = 0.10$ has been specified for either adding or eliminating variables. The procedure begins with no variables in the model. Since the partial F-statistic associated with $x_4$, $F = 22.799$, is the largest and exceeds $F_{IN} = F_{0.10,1,11} = 3.225$, $x_4$ is added to the model.

At step 2 (as in the forward selection procedure), the largest partial F-statistic is associated with $x_1$ and is given by:
$$F = \frac{SSR(x_1 \mid x_4)}{MSE(x_1, x_4)} = \frac{2641.001 - 1831.896}{7.476} = \frac{809.105}{7.476} = 108.22$$

$$F = \frac{SSR(x_2 \mid x_4)}{MSE(x_2, x_4)} = \frac{1846.883 - 1831.896}{86.888} = \frac{14.987}{86.888} = 0.172$$

$$F = \frac{SSR(x_3 \mid x_4)}{MSE(x_3, x_4)} = \frac{2540.025 - 1831.896}{17.574} = \frac{708.129}{17.574} = 40.294$$
which is larger than $F_{IN} = F_{0.10,1,10} = 3.285$, and so $x_1$ is added to the model.

At step 3a, the procedure considers eliminating $x_4$ (output not shown). The partial F-statistic for eliminating $x_4$ is
$$F = \frac{SSR(x_4 \mid x_1)}{MSE(x_1, x_4)} = \frac{2641.001 - 1450.076}{7.476} = \frac{1190.925}{7.476} = 159.30$$
which is larger than $F_{OUT} = F_{0.10,1,10} = 3.285$, and so $x_4$ is not eliminated but retained in the model.
Model Summary

Model R R Square Adjusted R Std. Error of the


Square Estimate

1 .821a .675 .645 8.9639


2 .986b .972 .967 2.7343
3 .991c .982 .976 2.3087
4 .989d .979 .974 2.4063

a. Predictors: (Constant), x4 b. Predictors: (Constant), x4, x1


c. Predictors: (Constant), x4, x1, x2 d. Predictors: (Constant), x1, x2

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 1831.896 1 1831.896 22.799 .001b

1 Residual 883.867 11 80.352

Total 2715.763 12
Regression 2641.001 2 1320.500 176.627 .000c
2 Residual 74.762 10 7.476
Total 2715.763 12
Regression 2667.790 3 889.263 166.832 .000d
3 Residual 47.973 9 5.330
Total 2715.763 12
Regression 2657.859 2 1328.929 229.504 .000e

4 Residual 57.904 10 5.790

Total 2715.763 12
a. Dependent Variable: y
b. Predictors: (Constant), x4 c. Predictors: (Constant), x4, x1
d. Predictors: (Constant), x4, x1, x2 e. Predictors: (Constant), x1, x2

At step 3b, the procedure considers adding one of the remaining variables, $x_2$ or $x_3$, to the model (the same as step 3 in forward selection). For these two variables the partial F-statistics are:
$$F = \frac{SSR(x_2 \mid x_1, x_4)}{MSE(x_1, x_2, x_4)} = \frac{2667.790 - 2641.001}{5.33} = \frac{26.789}{5.33} = 5.03$$

$$F = \frac{SSR(x_3 \mid x_1, x_4)}{MSE(x_1, x_3, x_4)} = \frac{2664.927 - 2641.001}{5.648} = \frac{23.926}{5.648} = 4.236$$
The larger of these, $F = 5.03 > F_{IN} = F_{0.10,1,9} = 3.36$, and therefore $x_2$ is added to the model. Note that $F = 5.03 < F_{IN} = F_{0.05,1,9} = 5.12$, so at $\alpha = 0.05$ it would not have been added.

At step 4a, the procedure considers eliminating either of the previously added variables $x_1$ or $x_4$ from the model. The partial F-statistics for eliminating $x_4$ and $x_1$ are, respectively,
$$F = \frac{SSR(x_4 \mid x_1, x_2)}{MSE(x_1, x_2, x_4)} = \frac{2667.790 - 2657.859}{5.33} = \frac{9.931}{5.33} = 1.86$$

$$F = \frac{SSR(x_1 \mid x_2, x_4)}{MSE(x_1, x_2, x_4)} = \frac{2667.790 - 1846.883}{5.33} = \frac{820.907}{5.33} = 154.02$$
Since $F_{x_4} = 1.86 < F_{OUT} = F_{0.10,1,9} = 3.36$ while $F_{x_1} = 154.02 > F_{OUT} = F_{0.10,1,9} = 3.36$, $x_4$ is eliminated from the model, leaving the reduced model $(x_1, x_2)$.

At step 4b, the procedure considers adding $x_3$ to the model. The partial F-statistic is
$$F = \frac{SSR(x_3 \mid x_1, x_2)}{MSE(x_1, x_2, x_3)} = \frac{2667.652 - 2657.859}{5.346} = \frac{9.793}{5.346} = 1.83$$
The partial F value for $x_3$ is less than $F_{IN} = F_{0.10,1,9} = 3.36$, and therefore $x_3$ cannot be added to the model. The stepwise procedure therefore terminates with $(x_1, x_2)$ as the final model.
**Extra Output (not shown by SPSS)
ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 1450.076 1 1450.076 12.603 .005b

1 Residual 1265.687 11 115.062

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x1

ANOVAa

Model Sum of Squares df Mean Square F Sig.


Regression 2664.927 3 888.309 157.266 .000b

1 Residual 50.836 9 5.648

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x4, x3, x1

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 1846.883 2 923.441 10.628 .003b

1 Residual 868.880 10 86.888

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x2, x4

ANOVAa

Model Sum of Squares df Mean Square F Sig.

Regression 2667.652 3 889.217 166.345 .000b

1 Residual 48.111 9 5.346

Total 2715.763 12

a. Dependent Variable: y b. Predictors: (Constant), x3, x2, x1

5.5 A FEW POINTS ON VARIABLE SELECTION TECHNIQUES


5.5.1 General Comments on Stepwise Type Procedures
The stepwise regression procedures described above have been criticized on
various grounds, the most common being that none of the procedures generally
guarantees that the best subset regression model will be identified. Furthermore,
since all the stepwise type procedures terminate with one final equation,
inexperienced analysts may conclude that they have found a model that is in some
sense optimal. Part of the problem is that it is likely that there is not only one best
subset model, but there are several equally good models.
The analyst should also keep in mind that the order in which the variables enter or leave the model does not necessarily imply an order of importance of the variables. It is not unusual to find that a variable included in the model early in the procedure becomes negligible at a subsequent step. This is evident in the Hald cement data, for which forward selection chose $x_4$ as the first variable to enter. However, when $x_2$ is added at a subsequent step, $x_4$ is no longer required (not significant) because of the high inter-correlation between $x_2$ and $x_4$. This is in fact a general problem with the forward selection procedure: once a variable has been added, it cannot be removed at a later step.
Note that forward selection, backward elimination, and stepwise regression do not
necessarily lead to the same choice of final model. The inter-correlation between
the variables affects the order of entry and removal. For example, using the Hald
cement data, the variables selected by each procedure were as follows:

Forward selection:                        5%: x1, x4          10%: x1, x2, x4
Backward elimination:                     5% or 10%: x1, x2
Stepwise regression:                      5%: x1, x4          10%: x1, x2
All possible regressions ($\bar{R}^2$):   x1, x2, x4 or x1, x2, x3
All possible regressions (MSE):           x1, x2, x4
All possible regressions ($C_p$):         x1, x2

Some users have recommended that all the procedures be applied in the hope of either seeing some agreement or learning something about the structure of the data that might be overlooked by using only a single selection procedure. Furthermore, there is not necessarily any agreement between any of the stepwise-type procedures and all possible regressions. However, Berk [1978] has noted that forward selection tends to agree with all possible regressions for small subset sizes but not for large ones, while backward elimination tends to agree with all possible regressions for large subset sizes but not for small ones.
For these reasons stepwise-type variable selection procedures should be used with
caution. The preferable procedure is the stepwise regression algorithm followed by
backward elimination. The backward elimination algorithm is often less adversely
affected by the correlative structure of the variables than forward selection (see
Mantel [1970]).

5.5.2 Stopping Rules for Stepwise Procedures


Choosing the cut-off values $F_{IN}$ and/or $F_{OUT}$ in stepwise-type procedures can be thought of as specifying a stopping rule for these algorithms. Some computer programs allow the analyst to specify these numbers directly, while others require the choice of a type I error rate $\alpha$ to generate $F_{IN}$ and/or $F_{OUT}$. However, because the partial F value examined at each stage is the maximum of several correlated partial F variables, thinking of $\alpha$ as a level of significance or type I error rate is misleading. Several authors (e.g., Draper, Guttman, and Kanemasu [1971] and Pope and Webster [1972]) have investigated this problem, and little progress has been made toward either finding conditions under which the "advertised" level of significance on F is meaningful or developing the exact distribution of the F-to-enter and F-to-remove statistics.

Some users prefer to choose relatively small values of FIN and FOUT so that several
additional variables that would ordinarily be rejected by more conservative F
values may be investigated. In the extreme we may choose FIN and FOUT so that all
variables are entered by forward selection or removed by backward elimination
revealing one subset model of each size for p = 2, 3, …, k + 1. These subset models
may then be evaluated by criteria such as C p or MSE to determine the final model.
We do not recommend this extreme strategy because the analyst may think that the
subsets so determined are in some sense optimal, when it is likely that the best
subset model was overlooked. A very popular procedure is to set FIN = FOUT = 4 , as
this corresponds roughly to the upper 5% point of the F distribution. Still another
possibility is to make several runs using different values for $F_{IN}$ and $F_{OUT}$ and observe the effect of the choice of criteria on the subsets obtained.
There have been several studies directed toward providing practical guidelines for the choice of stopping rules. Bendel and Afifi [1974] recommend $\alpha = 0.25$ for forward selection; this would typically result in a numerical value of $F_{IN}$ between 1.3 and 2. Kennedy and Bancroft [1971] also suggest $\alpha = 0.25$ for forward selection and recommend $\alpha = 0.10$ for backward elimination. The choice of values for $F_{IN}$ and $F_{OUT}$ is largely a matter of the personal preference of the analyst, and considerable latitude is often taken in this area.

5.5.3 Some Final Recommendations for Practice


This chapter has discussed several procedures for variable selection in linear
regression. The methods may be generally classified as stepwise-type methods or
all possible regressions (with variations). The primary advantages of the stepwise-
type methods are that they are fast, easy to implement on digital computers, and
readily available for almost all computer systems. Their disadvantages are that they
do not produce subset models that are necessarily best with respect to any standard
criterion, and furthermore as they are oriented toward producing a single final
equation, the unsophisticated user may be led to believe that this model is in some
sense optimal. On the other hand, all possible regressions will identify the subset
models that are best with respect to whatever criterion the analyst imposes. For up
to about 20 or 30 candidate variables, the cost of computing with all possible
regressions is approximately the same as the stepwise-type procedures.
When the number of candidate variables is too large to initially employ the all-possible-regressions approach, we recommend a two-stage strategy. Stepwise-type
methods can be used to “screen” the candidate variables, eliminating those that
have negligible effects so that a smaller list of candidate variables results. This
reduced set of candidate variables can then be investigated by all possible
regressions. The analyst should always use knowledge of the problem environment
and common sense in evaluating candidate variables. When confronted with a
large list of candidate variables, it is usually useful to invest in some serious
thinking before resorting to the computer. Often we find that some variables can be
eliminated on the basis of logic or engineering sense.
We have discussed several formal criteria for evaluating subset regression models, such as Mallows's $C_p$ statistic and the residual mean square. Usually, however, the choice of a final model is not clear-cut. In addition to the formal evaluation criteria, we suggest that the analyst ask the following questions:
1. Is the equation reasonable? That is, do the variables in the model make sense in
light of the problem environment?
2. Is the model usable for its intended purpose? For example, a model intended for prediction that contains a regressor which is unobservable at the time the prediction is required is unusable. If the cost of collecting data on a variable is prohibitive, this would also render the model unusable.
3. Are the regression coefficients reasonable? That is, are the signs and
magnitudes of the coefficients realistic and are the standard errors relatively
small?
4. Are the usual diagnostic checks for model adequacy satisfactory? For example,
do the residual plots indicate unexplained structure or outliers, or are there one
or more high-leverage points that may be controlling the fit?
If these four questions are taken seriously and the answers strictly applied, in some
(perhaps many) instances there would be no final satisfactory regression equation.
Clearly judgment and experience in the model’s intended operating environment
are required. Finally, even if the equation fits the data well and passes the usual diagnostic checks, there is no assurance that it will predict new observations accurately. It is recommended that the predictive ability of a model be assessed by observing its performance on new data that were not used to build the model. If this cannot be done easily, some of the original data may be set aside for this purpose.
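A minimal sketch of this kind of holdout assessment (simulated data, illustrative only): part of the data is set aside, the model is fitted on the remainder, and the prediction mean square error on the held-out observations is compared with the in-sample residual mean square.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 100
X = rng.normal(size=(n, 3))
y = 1 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Set aside roughly 25% of the observations for validation
idx = rng.permutation(n)
train, test = idx[:75], idx[75:]

res = sm.OLS(y[train], sm.add_constant(X[train])).fit()
pred = res.predict(sm.add_constant(X[test]))

mse_fit = res.mse_resid                       # in-sample residual mean square
mse_new = np.mean((y[test] - pred) ** 2)      # prediction error on data not used in fitting
print(mse_fit, mse_new)
```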
