Chapter 5
c) Regression model contains one or more extraneous (unnecessary) variables
This occurs when the regression model contains extraneous variables that are related neither to the response variable nor to any of the explanatory variables; that is, the model includes extra explanatory variables that are not needed (not statistically significant). Such a model still has unbiased regression coefficients, unbiased predictions of the response variable, and an unbiased MSE. However, because the model has more parameters, the MSE has fewer degrees of freedom, producing wider confidence intervals, and the corresponding hypothesis tests have lower power. In addition, by including extraneous variables the model becomes more complicated and harder to understand than necessary.
d) Regression model is overspecified
This occurs when the regression model contains one or more redundant (repeating) explanatory variables. Although the model is correct, redundant explanatory variables lead to problems such as inflated standard errors for the regression coefficients (the issue of multicollinearity, see Chapter 7). An overspecified regression model produces unbiased regression coefficients, unbiased predictions of the response variable, and an unbiased MSE. Such a model can be used, with caution, for prediction of the response, but should not be used to evaluate the effect of an explanatory variable on the response variable. As with adding extraneous variables, the model becomes more complicated and harder to understand than necessary.
The motivation for variable selection can be summarized as follows. By deleting variables from the model, the precision of the parameter estimates of the retained variables may be improved, even though some of the deleted variables are not negligible; the same is true for the variance of a predicted response. Deleting variables potentially introduces bias into the estimates of the coefficients of the retained variables and of the response. However, if the deleted variables have small effects, the MSE of the biased estimates will be less than the variance of the unbiased estimates; that is, the amount of bias introduced is less than the reduction in variance. On the other hand, there is a danger in retaining negligible variables, that is, variables with zero coefficients or with coefficients smaller than their corresponding standard errors in the full model: the variances of the parameter estimates and of the predicted response are increased.
Finally, remember that regression models are frequently built using retrospective data, that is, data extracted from historical records. Such data are often saturated with defects, including outliers, “wild” points, and inconsistencies resulting from changes in the organization’s data collection and information processing systems over time. These data defects can have a great impact on the variable selection process and lead to model misspecification. A very common problem in historical data is to find that some candidate explanatory variables vary over a very limited range. Because of this limited range, such a variable may appear unimportant in the least squares fit.
5.3 CRITERIA FOR EVALUATING SUBSET REGRESSION MODELS
Two key aspects of the variable selection problem are generating the subset models
and deciding if one subset is better than another. In this section, the criteria for
evaluating and comparing subset regression models will be discussed. Section 5.4
presents the computational methods for variable selection.
5.3.1 COEFFICIENT OF DETERMINATION, R²
Let R²_{k+1} denote the coefficient of determination for a subset regression model containing k explanatory variables plus an intercept (a (k + 1)-term equation). R² never decreases as variables are added, so it is always maximized by the full model.
[Plot of R²_{k+1} versus k + 1]
Since an “optimum” value of R²_{k+1} cannot be found for a subset regression model, a “satisfactory” value needs to be determined. Aitkin [1974] proposed one solution to this problem by providing a test by which all subset regression models that have an R² not significantly different from the R² of the full model can be identified. Let
R²_good = 1 − (1 − R²_{kmax+1})(1 + d_{α,n,kmax})
where
d_{α,n,kmax} = kmax F_{α,kmax,n−kmax−1} / (n − kmax − 1)
and R²_{kmax+1} is the value of R² for the full model. Aitkin calls any subset of regressor variables producing an R² greater than R²_good an R²-adequate (α) subset.
Generally, however, it is not straightforward to use R² alone as a criterion for choosing the number of independent variables to include in the model. The analyst is usually looking for a simple model (with few variables) that is as good as, or nearly as good as, the model with all k independent variables.
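As a quick illustration, the cut-off R²_good can be computed directly from its definition. The sketch below is in Python; the function name and the use of scipy.stats for the F quantile are my own choices, not part of the text.

```python
from scipy.stats import f

def r2_adequate_cutoff(r2_full, n, kmax, alpha=0.05):
    """Aitkin's R^2_good: any subset model whose R^2 exceeds this value
    is called R^2-adequate at level alpha."""
    f_crit = f.ppf(1.0 - alpha, kmax, n - kmax - 1)   # upper-alpha point of F
    d = kmax * f_crit / (n - kmax - 1)                # d_{alpha, n, kmax}
    return 1.0 - (1.0 - r2_full) * (1.0 + d)

# For the Hald cement example later in the chapter (n = 13, kmax = 4,
# R^2 of the full model about 0.9824), this gives a cut-off of about 0.9486.
```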
5.3.2 ADJUSTED R², R̄²
Recall that R² is a goodness-of-fit measure for the linear regression model. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively; it measures the strength of the relationship between the model and the dependent variable on a convenient 0 to 100% scale. R² increases with every predictor added to a model. Because R² never decreases, the fit can appear to improve as more terms are added to the model, which can be completely misleading. In addition, if the model has too many terms and too many high-order polynomials, you can run into the problem of over-fitting the data, in which case a misleadingly high R² value can lead to misleading predictions.
In regression analysis it can be tempting to add more explanatory variables to the model as you think of them. Some of those variables will be significant, but some may not be statistically significant. Like R², the adjusted R², denoted R̄², indicates how well the variables fit the data, but this measure adjusts for the number of variables in the model. R̄² compensates by penalizing the model for extra insignificant variables: if useful (significant) variables are added to the model, the value of R̄² will increase, but if more and more useless (insignificant) variables are added, R̄² will decrease. In other words, R̄² increases only when a new variable improves the model fit by more than a certain amount, and it decreases when the variable does not improve the fit by a sufficient amount.
The adjusted R² statistic, R̄², defined for a (k + 1)-term equation, is given as
R̄²_{k+1} = 1 − (1 − R²_{k+1}) (n − 1) / (n − k − 1)
It can be shown (Edwards [1969], Haitovski [1969], and Seber [1977]) that if s regressors are added to the model, R̄²_{k+1+s} will exceed R̄²_{k+1} if and only if the partial F-statistic for testing the significance of the s additional explanatory variables exceeds 1. Consequently, one criterion for selecting an optimum subset model is to choose the model that has a maximum R̄²_{k+1}.
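This if-and-only-if relationship is easy to verify numerically. The sketch below (Python with numpy, using simulated data whose names and design are purely illustrative) fits a k-regressor model and a (k + s)-regressor model and checks that the adjusted R² increases exactly when the partial F-statistic of the added regressors exceeds 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, s = 30, 2, 1                            # k base regressors plus s added regressors
X = rng.normal(size=(n, k + s))
y = 5 + 2 * X[:, 0] + rng.normal(size=n)      # the added regressor is pure noise here

def sse_for(cols):
    """Residual sum of squares for an intercept plus the given columns of X."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

sst = np.sum((y - y.mean()) ** 2)
sse_small = sse_for([0, 1])                   # (k + 1)-term model
sse_big = sse_for([0, 1, 2])                  # (k + 1 + s)-term model

def adj_r2(sse, n_regressors):
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - n_regressors - 1)

# Partial F for the s added regressors, and the corresponding change in adjusted R^2
partial_f = ((sse_small - sse_big) / s) / (sse_big / (n - (k + s) - 1))
print(partial_f > 1, adj_r2(sse_big, k + s) > adj_r2(sse_small, k))   # the two always agree
```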
The subset regression model that minimizes MSE_{k+1} will also maximize R̄²_{k+1}. To see this, note that
R̄²_{k+1} = 1 − (1 − R²_{k+1}) (n − 1) / (n − k − 1)
         = 1 − [(n − 1) / (n − k − 1)] (SSE_{k+1} / SST)
         = 1 − MSE_{k+1} / [SST / (n − 1)]
Thus, the criteria minimum MSE_{k+1} and maximum R̄²_{k+1} are equivalent.
Mallows [1964, 1966, 1973, 1995] proposed a model selection criterion that is related to the mean square error of a fitted value, that is,
E[ŷ_i − E(y_i)]² = [E(y_i) − E(ŷ_i)]² + Var(ŷ_i)
Note that E(y_i) is the expected response from the true regression equation and E(ŷ_i) is the expected response from the p-term subset model. Thus, E(y_i) − E(ŷ_i) is the bias at the ith data point. Consequently, the two terms on the right-hand side are the squared bias and variance components, respectively, of the mean square error. Let the total squared bias for a p-term equation be
SSB_p = Σ_{i=1}^{n} [E(y_i) − E(ŷ_i)]²
and define the standardized total mean square error as
Γ_p = (1/σ²) { Σ_{i=1}^{n} [E(y_i) − E(ŷ_i)]² + Σ_{i=1}^{n} Var(ŷ_i) }
    = SSB_p/σ² + (1/σ²) Σ_{i=1}^{n} Var(ŷ_i)
It can be shown that Σ_{i=1}^{n} Var(ŷ_i) = pσ², and that the expected value of the residual sum of squares from a p-term equation is
E[SSE_p] = SSB_p + (n − p)σ²
so that
Γ_p = (1/σ²) { E[SSE_p] − (n − p)σ² + pσ² } = E[SSE_p]/σ² − n + 2p
Suppose that σ̂² is a good estimate of σ². Then replacing E[SSE_p] by the observed value SSE_p produces an estimate of Γ_p, say
C_p = SSE_p/σ̂² − n + 2p
where σ̂² is the MSE estimated from the full model. If the p-term model has negligible bias, then SSB_p = 0. Consequently, E[SSE_p] = (n − p)σ², and
E[C_p | Bias = 0] = (n − p)σ²/σ² − n + 2p = p
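The resulting statistic is simple to compute once a candidate model's SSE and an estimate of σ² are available. A minimal Python sketch (the function name is my own; as stated above, σ̂² would normally be the MSE of the full model):

```python
def mallows_cp(sse_p, sigma2_hat, n, p):
    """Mallows' C_p for a p-term model (p counts the intercept).
    sse_p      : residual sum of squares of the candidate model
    sigma2_hat : estimate of sigma^2, usually the MSE of the full model
    For a model with negligible bias, E[C_p] is approximately p."""
    return sse_p / sigma2_hat - n + 2 * p
```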
When using the C_p criterion, it can be helpful to visualize the plot of C_p as a function of p for each regression equation, such as the plot shown below. Regression equations with little bias will have values of C_p that fall near the line C_p = p (point A), while equations with substantial bias will fall above this line (point B). Generally, small values of C_p are desirable. For example, although point C is above the line C_p = p, it is below point A and thus represents a model with lower total error. It may be preferable to accept some bias in the equation in order to reduce the average error of prediction.
Plot of C_p versus p, with the reference line C_p = p
The prediction error sum of squares (PRESS) can also be used as a criterion for evaluating subset models. For a p-term equation,
PRESS_p = Σ_{i=1}^{n} [y_i − ŷ_(i)]² = Σ_{i=1}^{n} [e_i / (1 − h_ii)]²
where ŷ_(i) is the predicted value of the ith response based on the model fitted without the ith observation, e_i is the ordinary residual, and h_ii is the ith diagonal element of the hat matrix. Models with small values of PRESS_p are preferred.
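Because of the identity above, PRESS can be obtained from a single fit rather than n leave-one-out fits. A minimal numpy sketch (the function name and the convention that X already contains the intercept column are my own assumptions):

```python
import numpy as np

def press(X, y):
    """PRESS statistic via the hat-matrix diagonal: sum of (e_i / (1 - h_ii))^2.
    X must already include a column of ones for the intercept."""
    H = X @ np.linalg.pinv(X.T @ X) @ X.T      # hat matrix
    h = np.diag(H)
    resid = y - H @ y                          # ordinary residuals e_i
    return float(np.sum((resid / (1.0 - h)) ** 2))
```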
5.4 TECHNIQUES FOR VARIABLE SELECTION
It is desirable to consider regression models that employ only a subset of the
candidate explanatory variables. To find the subset of variables to be included in
the final equation, it is natural to consider fitting models with various combinations
of the candidate regressors. In this section, several computational techniques for
reducing a large list of potential explanatory variables to a more manageable one
will be discussed. In particular, these techniques determine which explanatory variables in the list are the most important predictors of y and which are the least important.
Hald [1952] presents data concerning the heat evolved in calories per gram of
cement (y) as a function of the amount of each of four ingredients in the mix:
tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium aluminoferrite (x3), and dicalcium silicate (x4). The data are shown in the table above.
The table below displays the least squares estimates of the regression coefficients for all possible subsets. The partial nature of regression coefficients is readily apparent from examination of this table. For example, consider x2. When the model contains only x2, the least squares estimate of the x2 effect is 0.789. If x4 is added to the model, the x2 effect becomes 0.311, a reduction of over 50%. Further addition of x3 changes the x2 effect to −0.923.
It can be seen that while the estimates for x1 and x4 keep a consistent sign in all combinations of variables, the estimates for x2 and x3 are a mixture of positive and negative values. Clearly, the least squares estimate of an individual regression coefficient depends heavily on which other variables are present in the model. The large changes in the magnitude of the regression coefficients observed in the Hald cement data when variables are added or removed indicate that there is substantial correlation among the four variables (the multicollinearity problem that will be discussed later in Chapter 6).
Least Squares Estimation for All Possible Regressions

Variables in Model     β̂0        β̂1       β̂2       β̂3      β̂4
x1                      81.479    1.869
x2                      57.424             0.789
x3                     110.203                       −
x4                     117.568                                −
x1 x2                   52.577    1.468    0.662
x1 x3                   72.349    2.312             0.494
x1 x4                  103.097    1.440                       −
x2 x3                   72.075             0.731     −
x2 x4                   94.160             0.311              −
x3 x4                  131.282                       −        −
x1 x2 x3                48.194    1.696    0.657     0.250
x1 x2 x4                71.648    1.452    0.416              −
x1 x3 x4               111.684    1.052              −        −
x2 x3 x4               203.642            −0.923     −        −
x1 x2 x3 x4             62.405    1.551    0.510     0.102    −
Consider evaluating the subset models by the R²_{k+1} criterion. From the plot of R²_{k+1} versus (k + 1), it is clear that after two variables are in the model, there is little to be gained in terms of R² by introducing additional variables. Both of the 2-variable models (x1, x2) and (x1, x4) have essentially the same R² values, and in terms of this criterion it would make little difference which model is selected as the final regression model. It may be preferable to use (x1, x4) because x4 provides the best 1-variable model. Taking α = 0.05 and F_{0.05,4,8} = 3.84,
R²_good = 1 − (1 − R²_full)(1 + d_{0.05,13,4}) = 1 − 0.01762 [1 + 4(3.84)/8] = 0.94855
Therefore, any subset regression model for which R²_{k+1} > R²_good = 0.94855 is R²-adequate (0.05); that is, its R² is not significantly different from the R² of the full model, which is 0.982. Clearly, several models satisfy this criterion, and so the choice of the final model is still not clear.
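As a quick check of this arithmetic (in Python, using only the numbers quoted above):

```python
# n = 13, kmax = 4, F_{0.05,4,8} = 3.84, and 1 - R^2_full = 0.01762
d = 4 * 3.84 / 8                      # d_{0.05, 13, 4} = 1.92
r2_good = 1 - 0.01762 * (1 + d)
print(round(r2_good, 5))              # 0.94855
```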
It is instructive to examine the pairwise correlations between x_i and x_j and between x_i and y. Note from the pairwise correlation table that the pairs of variables (x1, x3) and (x2, x4) are highly correlated. Consequently, adding further variables when x1 and x2, or when x1 and x4, are already in the model will be of little use, since the information content of the excluded variables is essentially already present in the variables that are in the model. This correlative structure is partially responsible for the large changes in the regression coefficients noted in the table of least squares estimates shown earlier.
Plot of R²_{k+1} versus k + 1 for the Hald cement data
A plot of MSE_{k+1} versus p shows that the minimum residual mean square model is (x1, x2, x4), with MSE = 5.3303. Note that, as expected, the model that minimizes MSE_{k+1} also maximizes R̄²_{k+1}. Two of the other 3-variable models, (x1, x2, x3) and (x1, x3, x4), and the 2-variable models (x1, x2) and (x1, x4) have comparable values of MSE. If (x1, x2) or (x1, x4) is in the model, there is little reduction in MSE from adding further variables. Note, however, that adding x3 to (x1, x2, x4) increases the MSE. If the model (x1, x2, x4) turns out not to possess the required statistical properties, a better model would be expected from among (x1, x2, x3), (x1, x3, x4), and (x1, x2, x3, x4).
Plot of the residual mean square, MSE_{k+1}, versus p
Plot of C_p versus p for all possible regressions
Examining the C_p plot, there are four models that could be acceptable: (x1, x2), (x1, x2, x3), (x1, x2, x4), and (x1, x3, x4). It may be appropriate to choose the simpler model (x1, x2) as the final model, because it has the smallest C_p and its value is closest to the line C_p = p.
This example has illustrated the computational procedure associated with model building by all possible regressions. Note that there is no clear-cut choice of the best regression equation; very often the different criteria suggest different equations. For example, the best C_p model is (x1, x2), while the best MSE and R̄² model is (x1, x2, x4). All “final” candidate models should be subjected to the usual tests for adequacy, including investigation of leverage points, influence, and multicollinearity.
A few notes on a reasonable strategy for using C_p to identify the “best” model:
1. Identify combinations of variables for which the C_p value is near p.
2. Since C_p = p for the full model by construction, do not use C_p to evaluate the full model.
3. Models that yield a large C_p, not near p, suggest that important explanatory variables are missing from the analysis/model.
4. If a number of models have C_p near p, choose the model with the smallest C_p, as this keeps the combined bias and variance at a minimum.
5. When more than one model has a small value of C_p near p, choose the simpler model or the model that best meets your research needs.
More advice:
To calculate C_p, an unbiased estimate of σ² is needed. Frequently, the residual mean square from the full equation is used for this purpose. However, this forces C_p = p = k + 1 for the full equation. Using MSE_full from the full model as an estimate of σ² assumes that the full model has negligible bias. If the full model has several explanatory variables that do not contribute significantly to the model (zero regression coefficients), then MSE_full will often overestimate σ², and consequently the values of C_p will be small. If the C_p statistic is to work properly, a good estimate of σ² must be used.
5.4.2 STEPWISE REGRESSION: FORWARD SELECTION
The forward selection procedure begins with no explanatory variables in the model and adds variables one at a time. The first variable chosen for entry is the one most highly correlated with the response y; it is added to the model if its F-statistic exceeds a preselected entry value F_IN (or F-to-enter). The second variable chosen for entry is the one that now has the largest correlation with y after adjusting for the effect of the first variable entered (x1) on y. These correlations are referred to as partial correlations; they are the simple correlations between the residuals from the regression ŷ = β̂_0 + β̂_1 x1 and the residuals from the regressions of the other candidate variables on x1, say x̂_j = α̂_0j + α̂_1j x1, j = 2, 3, ..., k. Suppose that at step 2 the variable with the highest partial correlation with y is x2. This implies that the largest partial F-statistic is
F = SSR(x2 | x1) / MSE(x1, x2) = [SSR(x1, x2) − SSR(x1)] / MSE(x1, x2)
If this F value exceeds F_IN, then x2 is added to the model. In general, at each step the variable having the highest partial correlation with y (or, equivalently, the largest partial F-statistic given the other variables already in the model) is added to the model if its partial F-statistic exceeds the preselected entry level F_IN. The procedure terminates either when the largest partial F-statistic at a particular step does not exceed F_IN or when the last candidate variable has been added to the model.
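A compact sketch of this forward selection loop is given below, written in Python with numpy and scipy. The helper names, the data placeholders, and the use of scipy.stats.f.ppf for the entry cut-off are my own conventions, not taken from the text.

```python
import numpy as np
from scipy.stats import f

def sse(X, y, cols):
    """Residual sum of squares for an intercept plus the given columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

def forward_selection(X, y, alpha_in=0.05):
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        df_err = n - (len(selected) + 1) - 1          # error df after adding one variable
        # Partial F for adding each remaining candidate to the current model
        f_stats = {j: (sse(X, y, selected) - sse(X, y, selected + [j]))
                      / (sse(X, y, selected + [j]) / df_err)
                   for j in remaining}
        best = max(f_stats, key=f_stats.get)
        if f_stats[best] <= f.ppf(1 - alpha_in, 1, df_err):   # F-to-enter
            break                                     # no candidate is worth adding
        selected.append(best)
        remaining.remove(best)
    return selected

# With X holding the four Hald ingredient variables (columns x1 to x4), y the heat
# evolved, and alpha_in = 0.05, this loop should stop with x4 and x1 selected,
# as in the worked example that follows.
```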
From the correlation table shown previously, the variable most highly correlated with y is x4 (r_{4,y} = −0.821), and since the F-statistic associated with the model using x4 is F = 22.80 > F_{0.05,1,11} = 4.844, x4 is added to the model.
At step 2 the variable having the largest partial correlation with y (or the largest
partial F-statistic given that x4 is in the model) is x1 (why? See below). The partial
F-statistic for this variable is given as:
F = SSR(x1 | x4) / MSE(x1, x4) = (2641.001 − 1831.896) / 7.476 = 809.105 / 7.476 = 108.22
which is larger than FIN = F0.05, 1, 10 = 4.96 , and so x1 is added to the model.
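The two partial F values quoted so far can be reproduced directly from the sums of squares given above (SST = 2715.763 from the ANOVA tables):

```python
sst = 2715.763                                   # total sum of squares
ssr_x4 = 1831.896                                # SSR for the model with x4 only
f_x4 = ssr_x4 / ((sst - ssr_x4) / 11)            # n - k - 1 = 13 - 1 - 1 = 11
ssr_x1_x4, mse_x1_x4 = 2641.001, 7.476           # SSR and MSE for the model with x1 and x4
f_x1_given_x4 = (ssr_x1_x4 - ssr_x4) / mse_x1_x4
print(f_x4, f_x1_given_x4)                       # about 22.8 and 108.2, as quoted above
```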
[SPSS output for step 1: Model Summary, ANOVA, and Coefficients tables for the model with predictors (Constant), x4; dependent variable y]
In step 3, x2 shows the largest partial correlation with y (why? see below), with a partial F-statistic of
F = SSR(x2 | x1, x4) / MSE(x1, x2, x4) = (2667.790 − 2641.001) / 5.330 = 26.789 / 5.330 = 5.03
Since this partial F-statistic does not exceed F_IN = F_{0.05,1,9} = 5.12, the forward selection procedure terminates with only (x1, x4) in the model. Note that (x1, x4) has only the sixth largest R² (equivalently, the sixth smallest MSE), and its C_p is reasonably small but lies above the line C_p = p. Note also that F = 5.03 > F_{0.10,1,9} = 3.36.
[SPSS output: Model Summary, ANOVA, and Coefficients tables for the forward selection steps, with predictors (Constant), x4 and then (Constant), x4, x1; dependent variable y]
SPSS forward selection output (dependent variable y). Model Summary predictors: Model 1: (Constant), x4; Model 2: (Constant), x4, x1; Model 3: (Constant), x4, x1, x2.
ANOVA (Total SS = 2715.763 with 12 df for every model):
Model 2 (x4, x1): Regression SS = 2641.001, df = 2, MS = 1320.500, F = 176.627, Sig. = .000; Residual SS = 74.762, df = 10, MS = 7.476
Model 3 (x4, x1, x2): Regression SS = 2667.790, df = 3, MS = 889.263, F = 166.832, Sig. = .000; Residual SS = 47.973, df = 9, MS = 5.330
[Coefficients tables not shown]
Note that at α = 0.10, F = 5.03 > F_IN = F_{0.10,1,9} = 3.36, and therefore x2 can be added to the model. At this point the only remaining candidate variable is x3, for which the partial F-statistic does not exceed F_IN = F_{0.05,1,8} = 5.32, so the forward selection procedure terminates with (x1, x2, x4) in the model. Note that once x2 is included in the model, the variable x4 becomes insignificant. Thus, the forward selection procedure does not guarantee that the final model contains only significant variables.
Adding x1 to the model that contains x4 significantly increases R̄² (from 64.5% to 96.7%) and significantly decreases MSE (from 8.962² to 2.732²). Further adding x2 to the model that contains x1 and x4 continues to increase R̄² (from 96.7% to 97.6%) and to decrease MSE (from 2.732² to 2.312²), but only by a small amount. Recall that all possible regressions based on both R̄² and MSE would choose (x1, x2, x4) as the best model.
Other required outputs (not shown by SPSS):
[ANOVA tables for the additional candidate models, including those with predictors (Constant), x2, x4 and (Constant), x3, x4; dependent variable y]
5.4.3 STEPWISE REGRESSION: BACKWARD ELIMINATION
The forward selection procedure starts with no explanatory variables in the model and keeps adding one variable at a time until a suitable model is obtained (that is, until no further variable is worth adding). The backward elimination procedure works in the opposite direction.
The backward elimination methodology begins with all explanatory variables in the model and keeps deleting one variable at a time until a suitable model is obtained (i.e., until no further variable should be removed). That is, the procedure begins with a model that includes all k candidate variables. The partial F-statistic is then computed for each variable as if it were the last variable to enter the model. The smallest of these partial F-statistics is compared with a preselected value, F_OUT (or F-to-remove); if the smallest partial F-statistic is less than F_OUT, that variable is removed from the model.
A regression model with (k − 1) variables is then fitted, the partial F-statistics for this reduced model are calculated, and the procedure is repeated. The backward elimination algorithm terminates when the smallest partial F-statistic is not less than the cut-off value F_OUT. Backward elimination is often a very good variable selection procedure. It is particularly favoured by analysts who like to see the effect of including all the candidate variables, so that no important variable will be missed.
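A sketch of the backward elimination loop, parallel to the forward selection sketch given earlier (again Python; the helper names and the scipy F cut-off are my own conventions):

```python
import numpy as np
from scipy.stats import f

def sse(X, y, cols):
    """Residual sum of squares for an intercept plus the given columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

def backward_elimination(X, y, alpha_out=0.05):
    n, k = X.shape
    selected = list(range(k))                          # start with the full model
    while selected:
        df_err = n - len(selected) - 1
        mse_cur = sse(X, y, selected) / df_err
        # Partial F of each variable, treated as if it were the last one to enter
        f_stats = {j: (sse(X, y, [c for c in selected if c != j]) - sse(X, y, selected)) / mse_cur
                   for j in selected}
        worst = min(f_stats, key=f_stats.get)
        if f_stats[worst] >= f.ppf(1 - alpha_out, 1, df_err):   # F-to-remove
            break                                      # every remaining variable stays
        selected.remove(worst)
    return selected

# On the Hald data with alpha_out = 0.05 this should drop x3 first, then x4,
# and terminate with (x1, x2), as in the SPSS run summarized below.
```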
[SPSS backward elimination output: Model Summary, ANOVA, and Coefficients tables for each step; dependent variable y]
Step 2 shows the results of fitting the 3-variable model involving (x1, x2, x4). The smallest partial F-statistic in this model,
F = SSR(x4 | x1, x2) / MSE(x1, x2, x4) = (2667.790 − 2657.859) / 5.33 = 9.931 / 5.33 = 1.86
is associated with x4 (why? see below). Since F = 1.86 < F_OUT = F_{0.05,1,9} = 5.12, x4 is eliminated from the model. For comparison, the partial F-statistic for x2 in this model is
F = SSR(x2 | x1, x4) / MSE(x1, x2, x4) = (2667.790 − 2641.001) / 5.33 = 26.789 / 5.33 = 5.026
At step 3, the 2-variable model involving (x1, x2) is fitted. The two partial F-statistics in this model are
F = SSR(x1 | x2) / MSE(x1, x2) = (2657.859 − 1809.427) / 5.79 = 146.53
and the corresponding, larger, partial F-statistic for x2 (see the ANOVA outputs below).
[ANOVA tables for the single-variable models with predictors (Constant), x2 and (Constant), x1; dependent variable y]
The smaller of the two, F = 146.53, is associated with x1, and since this exceeds F_OUT = F_{0.05,1,10} = 4.965, no further variable can be eliminated from the model. Therefore, backward elimination terminates, yielding the final model (x1, x2). Note that this is a different model from the one found by forward selection, (x1, x4), and different from the model tentatively identified as best by the all-possible-regressions procedure.
Other required outputs (not shown by SPSS):
[ANOVA tables for the remaining candidate models needed to compute the partial F-statistics above; dependent variable y]
5.4.4 STEPWISE REGRESSION
Another popular variable selection procedure is the stepwise regression algorithm of Efroymson [1960]. Stepwise regression is a modification of forward selection in which, at each step, all variables previously entered into the model are reassessed via their partial F-statistics. A variable added at an earlier step may now be redundant (insignificant) because of its relationships with, or relative importance compared to, the other variables now in the equation. If the partial F-statistic for a variable is less than F_OUT, that variable is eliminated from the model. The forward step is then applied to the candidate variables that have not yet entered the model.
Stepwise regression requires two cut-off values, F_IN and F_OUT. Some analysts prefer to choose F_IN = F_OUT, although this is not necessary. In most statistical software F_IN ≥ F_OUT is chosen, so that a variable, once added, is relatively harder to delete than it was to add; this helps avoid model underspecification.
On the Hald data, the first two steps of stepwise regression are the same as in forward selection: x4 enters at step 1 and x1 at step 2. At step 3a, the procedure looks at the possibility of eliminating x4 (output not shown). The partial F-statistic for eliminating x4 is
F = SSR(x4 | x1) / MSE(x1, x4) = (2641.001 − 1450.076) / 7.476 = 1190.925 / 7.476 = 159.30
which is larger than F_OUT = F_{0.10,1,10} = 3.285, and so x4 cannot be eliminated and is retained in the model.
SPSS stepwise regression output (dependent variable y). Model Summary predictors: Model 1: (Constant), x4; Model 2: (Constant), x4, x1; Model 3: (Constant), x4, x1, x2; Model 4: (Constant), x1, x2.
ANOVA (Total SS = 2715.763 with 12 df for every model):
Model 2 (x4, x1): Regression SS = 2641.001, df = 2, MS = 1320.500, F = 176.627, Sig. = .000; Residual SS = 74.762, df = 10, MS = 7.476
Model 3 (x4, x1, x2): Regression SS = 2667.790, df = 3, MS = 889.263, F = 166.832, Sig. = .000; Residual SS = 47.973, df = 9, MS = 5.330
Model 4 (x1, x2): Regression SS = 2657.859, df = 2, MS = 1328.929, F = 229.504, Sig. = .000
At step 3b, the procedure looks at the possibility of adding one of the remaining variables, x2 or x3, to the model (the same as step 3 in forward selection). Of these two variables, the larger partial F-statistic is that of x2:
F = SSR(x2 | x1, x4) / MSE(x1, x2, x4) = (2667.790 − 2641.001) / 5.33 = 26.789 / 5.33 = 5.03
Since this exceeds F_IN = F_{0.10,1,9} = 3.36, x2 is added to the model.
At step 4a, the procedure looks at the possibility of eliminating either of the previously added variables, x1 or x4, from the model. The smaller of the two partial F-statistics is the one for eliminating x4,
F = SSR(x4 | x1, x2) / MSE(x1, x2, x4) = (2667.790 − 2657.859) / 5.33 = 9.931 / 5.33 = 1.86
Since this is less than F_OUT = F_{0.10,1,9} = 3.36, x4 is eliminated from the model.
At step 4b, the procedure looks at the possibility of adding x3 to the model. The partial F-statistic is
F = SSR(x3 | x1, x2) / MSE(x1, x2, x3) = (2667.652 − 2657.859) / 5.346 = 9.793 / 5.346 = 1.83
The partial F value for x3 is less than F_IN = F_{0.10,1,9} = 3.36, and therefore x3 cannot be added to the model. No further additions or eliminations are possible, so the stepwise procedure terminates with (x1, x2) as the final model, the same model found by backward elimination.
Extra output (not shown by SPSS):
[ANOVA tables for the additional candidate models needed to compute the partial F-statistics above; dependent variable y]
Some users have recommended that all the procedures be applied in the hope of either seeing some agreement or learning something about the structure of the data that might be overlooked by using only one selection procedure. Furthermore, there is not necessarily any agreement between any of the stepwise-type procedures and all possible regressions. However, Berk [1978] has noted that forward selection tends to agree with all possible regressions for small subset sizes but not for large ones, while backward elimination tends to agree with all possible regressions for large subset sizes but not for small ones.
For these reasons stepwise-type variable selection procedures should be used with
caution. The preferable procedure is the stepwise regression algorithm followed by
backward elimination. The backward elimination algorithm is often less adversely
affected by the correlative structure of the variables than forward selection (see
Mantel [1970]).
Some users prefer to choose relatively small values of F_IN and F_OUT so that several additional variables that would ordinarily be rejected by more conservative F values may be investigated. In the extreme, we may choose F_IN and F_OUT so that all variables are entered by forward selection or removed by backward elimination, revealing one subset model of each size for p = 2, 3, …, k + 1. These subset models may then be evaluated by criteria such as C_p or MSE to determine the final model. We do not recommend this extreme strategy, because the analyst may think that the subsets so determined are in some sense optimal when it is likely that the best subset model has been overlooked. A very popular procedure is to set F_IN = F_OUT = 4, as this corresponds roughly to the upper 5% point of the F distribution. Still another possibility is to make several runs using different values of F_IN and F_OUT and observe the effect of the choice of criteria on the subsets obtained.
There have been several studies directed toward providing practical guidelines for the choice of stopping rules. Bendel and Afifi [1974] recommend α = 0.25 for forward selection; this typically results in a numerical value of F_IN between 1.3 and 2. Kennedy and Bancroft [1971] also suggest α = 0.25 for forward selection and recommend α = 0.10 for backward elimination. The choice of values for F_IN and F_OUT is largely a matter of the personal preference of the analyst, and considerable latitude is often taken in this area.