
What is stepAIC in R?

In R, stepAIC is one of the most commonly used search methods for feature selection. We keep minimizing the AIC value to arrive at the final set of features. stepAIC does not necessarily improve model performance; rather, it is used to simplify the model without sacrificing much performance, and AIC quantifies the amount of information lost through this simplification. AIC stands for Akaike Information Criterion.

Given two models, we prefer the one with the lower AIC value; in this sense, AIC provides a means for model selection. AIC is only a relative measure among multiple models.

AIC is similar to adjusted R-squared in that it also penalizes adding more variables to the model. The absolute value of AIC has no meaning on its own; we only track whether the AIC increases or decreases as variables are added, and when comparing multiple models, the one with the lower AIC value is preferred.
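
As a quick illustration of comparing models by AIC (this example is ours, not the article's; the variables from the built-in mtcars data are chosen arbitrarily), base R's AIC() function can be applied directly to two fitted models:

mod_small <- lm(mpg ~ wt, data = mtcars)
mod_large <- lm(mpg ~ wt + hp, data = mtcars)
AIC(mod_small, mod_large)  # the fit with the smaller AIC value is preferred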
So let's see how stepAIC works in R. We will use the mtcars data set. First, remove the feature “x” by setting it to NULL, as it contains only the car model names, which carry little meaning here. Then remove the rows that contain missing values in any column with the na.omit function; missing values must be handled, otherwise stepAIC will throw an error. Finally, build the model and run stepAIC. For this we need the MASS and car packages.
The first parameter of stepAIC is the fitted model and the second is direction, which specifies the feature selection technique to use and can take the following values (a short sketch of the full workflow follows the list):
“both” (for stepwise regression, i.e. forward and backward selection combined);
“backward” (for backward selection); and
“forward” (for forward selection).
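
The following is a minimal sketch of the workflow described above, not the author's exact script; it assumes the data were exported to a hypothetical file mtcars.csv whose first column, named “x”, holds the car model names:

library(MASS)
library(car)

cars <- read.csv("mtcars.csv")  # hypothetical CSV export of the mtcars data
cars$x <- NULL                  # drop the model-name column
cars <- na.omit(cars)           # stepAIC errors out on missing values

full_model <- lm(mpg ~ ., data = cars)
step_model <- stepAIC(full_model, direction = "both")
summary(step_model)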
At the very last step, stepAIC produces the optimal set of features, {drat, wt, gear, carb}. stepAIC can also help remove multicollinearity from the model, if it exists, which I will explain in an upcoming article.
In the previous post, Feature Selection Techniques in Regression Model, we learnt how to perform stepwise regression, forward selection and backward elimination in detail. stepAIC is an automated method that returns the optimal set of features.
This article first appeared on the “Tech Tunnel” blog at https://ashutoshtripathi.com/2019/06/07/feature-selection-techniques-in-regression-model/
stepAIC {MASS}

Choose a model by AIC in a Stepwise Algorithm

Description

Performs stepwise model selection by AIC.

Usage
stepAIC(object, scope, scale = 0,
direction = c("both", "backward", "forward"),
trace = 1, keep = NULL, steps = 1000, use.start = FALSE,
k = 2, ...)

Arguments
object an object representing a model of an appropriate class. This is used as
the initial model in the stepwise search.
scope defines the range of models examined in the stepwise search. This
should be either a single formula, or a list containing
components upper and lower, both formulae. See the details for how to
specify the formulae and how they are used.
scale used in the definition of the AIC statistic for selecting the models,
currently only for lm and aov models (see extractAIC for details).
direction the mode of stepwise search, can be one of "both", "backward",
or "forward", with a default of "both". If the scope argument is missing
the default for direction is "backward".
trace if positive, information is printed during the running of stepAIC. Larger
values may give more information on the fitting process.
keep a filter function whose input is a fitted model object and the
associated AIC statistic, and whose output is arbitrary.
Typically keep will select a subset of the components of the object and
return them. The default is not to keep anything.
steps the maximum number of steps to be considered. The default is 1000
(essentially as many as required). It is typically used to stop the process
early.
use.start if true the updated fits are done starting at the linear predictor for the
currently selected model. This may speed up the iterative calculations
for glm (and other fits), but it can also slow them down. Not used in R.
k the multiple of the number of degrees of freedom used for the penalty.
Only k = 2 gives the genuine AIC: k = log(n) is sometimes referred to
as BIC or SBC.
... any additional arguments to extractAIC. (None are currently used.)

Details

The set of models searched is determined by the scope argument. The right-hand-
side of its lower component is always included in the model, and right-hand-side of
the model is included in the upper component. If scope is a single formula, it
specifies the upper component, and the lower model is empty. If scope is missing,
the initial model is used as the upper model.

Models specified by scope can be templates to update object as used by update.formula.
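
As an illustration of the scope formulae (this snippet is ours, not part of the help page; the model and formulae on the built-in mtcars data are arbitrary):

library(MASS)
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
# keep wt in every candidate model; allow at most the four fitted predictors
stepAIC(fit, scope = list(lower = ~ wt, upper = ~ wt + hp + qsec + drat),
        direction = "both", trace = FALSE)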

There is a potential problem in using glm fits with a variable scale, as in that case
the deviance is not simply related to the maximized log-likelihood. The glm method
for extractAIC makes the appropriate adjustment for a gaussian family, but may
need to be amended for other cases. (The binomial and poisson families have
fixed scale by default and do not correspond to a particular maximum-likelihood
problem for variable scale.)

Where a conventional deviance exists (e.g. for lm, aov and glm fits) this is quoted in
the analysis of variance table: it is the unscaled deviance.

Value

the stepwise-selected model is returned, with up to two additional components.


There is an "anova" component corresponding to the steps taken in the search, as
well as a "keep" component if the keep= argument was supplied in the call.
The "Resid. Dev" column of the analysis of deviance table refers to a constant
minus twice the maximized log likelihood: it will be a deviance only in cases
where a saturated model is well-defined (thus excluding lm, aov and survreg fits,
for example).

Note

The model fitting must apply the models to the same dataset. This may be a
problem if there are missing values and an na.action other than na.fail is used (as
is the default in R). We suggest you remove the missing values first.

References

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
See Also

addterm, dropterm, step

Examples
quine.hi <- aov(log(Days + 2.5) ~ .^4, quine)
quine.nxt <- update(quine.hi, . ~ . - Eth:Sex:Age:Lrn)
quine.stp <- stepAIC(quine.nxt,
scope = list(upper = ~Eth*Sex*Age*Lrn, lower = ~1),
trace = FALSE)
quine.stp$anova

cpus1 <- cpus
for(v in names(cpus)[2:7])
  cpus1[[v]] <- cut(cpus[[v]], unique(quantile(cpus[[v]])),
                    include.lowest = TRUE)
cpus0 <- cpus1[, 2:8]  # excludes names, authors' predictions
cpus.samp <- sample(1:209, 100)
cpus.lm <- lm(log10(perf) ~ ., data = cpus1[cpus.samp,2:8])
cpus.lm2 <- stepAIC(cpus.lm, trace = FALSE)
cpus.lm2$anova

example(birthwt)
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)
birthwt.step <- stepAIC(birthwt.glm, trace = FALSE)
birthwt.step$anova
birthwt.step2 <- stepAIC(birthwt.glm, ~ .^2 + I(scale(age)^2)
+ I(scale(lwt)^2), trace = FALSE)
birthwt.step2$anova

quine.nb <- glm.nb(Days ~ .^4, data = quine)
quine.nb2 <- stepAIC(quine.nb)
quine.nb2$anova
boot.stepAIC {bootStepAIC}

Bootstraps the Stepwise Algorithm of stepAIC() for Choosing a Model by AIC

Description

Implements a Bootstrap procedure to investigate the variability of model selection under the stepAIC() stepwise algorithm of package MASS.

Usage
boot.stepAIC(object, data, B = 100, alpha = 0.05, direction = "backward",
k = 2, verbose = FALSE, seed = 1L, ...)

Arguments
object an object representing a model of an appropriate class;
currently, "lm", "aov", "glm", "negbin", "polr", "survreg",
and "coxph" objects are supported.
data a data.frame or a matrix that contains the response variable and
covariates.
B the number of Bootstrap samples.
alpha the significance level.
direction the direction argument of stepAIC().
k the k argument of stepAIC().
verbose logical; if TRUE information about the evolution of the procedure is
printed in the screen.
seed numeric scalar denoting the seed used to create the Bootstrap samples.
... extra arguments to stepAIC(), e.g., scope.

Details

The following procedure is replicated B times:

Step 1:

Simulate a new data-set taking a sample with replacement from the rows
of data.
Step 2:

Refit the model using the data-set from Step 1.

Step 3:

For the refitted model of Step 2 run the stepAIC() algorithm.

Summarize the results by counting how many times (out of the B data-sets) each variable was selected, how many times the estimate of the regression coefficient of each variable (out of the times it was selected) was statistically significant at significance level alpha, and how many times the estimate of the regression coefficient of each variable (out of the times it was selected) changed sign (see also Austin and Tu, 2004).

Value

An object of class BootStep with components


Covariates a numeric matrix containing the percentage of times each variable
was selected.
Sign a numeric matrix containing the percentage of times the regression
coefficient of each variable had sign + and −.
Significance a numeric matrix containing the percentage of times the regression
coefficient of each variable was significant under
the alpha significance level.
OrigModel a copy of object.
OrigStepAIC the result of applying stepAIC() in object.
direction a copy of the direction argument.
k a copy of the k argument.
BootStepAIC a list of length B containing the results of stepAIC() for each
Bootstrap data-set.
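
As a brief usage sketch (ours, not from the package documentation), the components listed above can be inspected directly on the returned object; lmFit and data refer to the objects constructed in the Examples section below:

library(bootStepAIC)
res <- boot.stepAIC(lmFit, data, B = 100)
res$Covariates    # % of bootstrap samples in which each variable was selected
res$Sign          # % of positive / negative signs among the selected coefficients
res$Significance  # % of times each coefficient was significant at level alpha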

Author(s)

Dimitris Rizopoulos [email protected]

References

Austin, P. and Tu, J. (2004). Bootstrap methods for developing predictive models. The American Statistician, 58, 131–137.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th ed. Springer, New York.

See Also

stepAIC in package MASS

Examples

## lm() Example ##
n <- 350
x1 <- runif(n, -4, 4)
x2 <- runif(n, -4, 4)
x3 <- runif(n, -4, 4)
x4 <- runif(n, -4, 4)
x5 <- runif(n, -4, 4)
x6 <- runif(n, -4, 4)
x7 <- factor(sample(letters[1:3], n, rep = TRUE))
y <- 5 + 3 * x1 + 2 * x2 - 1.5 * x3 - 0.8 * x4 + rnorm(n, sd = 2.5)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7)
rm(n, x1, x2, x3, x4, x5, x6, x7, y)

lmFit <- lm(y ~ (. - x7) * x7, data = data)

boot.stepAIC(lmFit, data)

#####################################################################

## glm() Example ##
n <- 200
x1 <- runif(n, -3, 3)
x2 <- runif(n, -3, 3)
x3 <- runif(n, -3, 3)
x4 <- runif(n, -3, 3)
x5 <- factor(sample(letters[1:2], n, rep = TRUE))
eta <- 0.1 + 1.6 * x1 - 2.5 * as.numeric(as.character(x5) == levels(x5)[1])
y1 <- rbinom(n, 1, plogis(eta))
y2 <- rbinom(n, 1, 0.6)
data <- data.frame(y1, y2, x1, x2, x3, x4, x5)
rm(n, x1, x2, x3, x4, x5, eta, y1, y2)

glmFit1 <- glm(y1 ~ x1 + x2 + x3 + x4 + x5, family = binomial(), data = data)
glmFit2 <- glm(y2 ~ x1 + x2 + x3 + x4 + x5, family = binomial(), data = data)

boot.stepAIC(glmFit1, data, B = 50)
boot.stepAIC(glmFit2, data, B = 50)
Choose model variables by AIC in a stepwise algorithm with the MASS package in R

R statistics for Political Science | diagnostics, modelling, r | October 23, 2020 | 3 minutes

Running a regression model with too many variables – especially irrelevant ones – will lead to a needlessly complex model. Stepwise selection can help to choose the best variables to add.

Packages you need:

library(MASS)

First, choose a model and throw in every variable you think has an impact on your dependent variable!

I hear the voice of my undergrad professor in my ear: “DO NOT go for the ‘throw spaghetti at the wall and just see what STICKS’ approach. A cardinal sin.”

We must choose variables because we have some theoretical rationale for any potential relationship. Or else we could end up stumbling on spurious relationships.

Like the one between Nick Cage movies and incidence of pool drowning.
However …

… beyond just using our sound theoretical understanding of the complex phenomena we study in order to choose our model variables …

… one additional way to supplement and gauge which variables add to – or, more importantly, to omit from – the model is to choose the one with the smallest amount of error.

We can operationalise this as the model with the lowest Akaike information criterion (AIC).

AIC is an estimator of in-sample prediction error and is similar to the adjusted R-squared measures we see in our regression output summaries.

It effectively penalises us for adding more variables to the model. Lower scores can indicate a more parsimonious model, relative to a model fit with a higher AIC. It can therefore give an indication of the relative quality of statistical models for a given set of data.

As a caveat, we can only compare AIC scores of models that are fit to explain variance of the same dependent / response variable.

data(mtcars)
summary(car_model <- lm(mpg ~., data = mtcars))

With our model, we can now feed it into the stepwise function. For the direction argument, you can choose between backward and forward stepwise selection:

 Forward steps: start the model with no predictors, just an intercept, and search through all the single-variable models, adding variables one at a time, until we find the best one (the one that results in the lowest residual sum of squares).

 Backward steps: start with all the predictors and remove the least statistically significant variable (the one with the largest p-value) one at a time until we reach the lowest AIC.

Backward stepwise is generally better because starting with the full model has the advantage of considering the effects of all variables simultaneously.

Unlike backward elimination, forward stepwise selection is more suitable in settings where the number of variables is bigger than the sample size.

So, tl;dr: unless the number of candidate variables is greater than the sample size (such as when dealing with genes), the backward stepwise approach is the default choice. A short sketch of both directions follows.
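
A minimal sketch of explicit backward and forward runs on the same car_model (these calls are our illustration, not the post's original code; the forward search needs a scope, here taken as all predictors of mtcars):

# backward: start from the full model and drop predictors
step_back <- stepAIC(car_model, direction = "backward", trace = FALSE)

# forward: start from the intercept-only model and add predictors
null_model <- lm(mpg ~ 1, data = mtcars)
step_fwd <- stepAIC(null_model, direction = "forward",
                    scope = list(lower = null_model, upper = car_model),
                    trace = FALSE)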
You can also choose direction = "both":

step_car <- stepAIC(car_model, trace = TRUE, direction= "both")

If you set trace = TRUE, R prints out all the steps.

I'll show only the last step here so you can see the output.

The goal is to have the combination of variables that has the lowest AIC or lowest residual sum of squares (RSS).

The last line is the final model that we assign to the step_car object.

library(stargazer)
stargazer(car_model, step_car, type = "text")


We can see that the stepwise model has only three variables compared to the ten variables in my original model.

And even with far fewer variables, the R2 has decreased by an insignificant amount. In fact, the adjusted R2 increased because we are no longer being penalised for throwing in so many unnecessary variables. So we can quickly find a model that loses no explanatory power but is far more parsimonious.

Plus, in the original model only one variable is significant, but in the stepwise model all three of the variables are significant.

From the olsrr package:

library(olsrr)
step_plot <- ols_step_both_aic(car_model)
plot(step_plot)
3.2 Model selection
In Chapter 2 we briefly saw that the inclusion of more predictors is not for free: there is a price to pay in terms of more variability in the coefficient estimates, harder interpretation, and possible inclusion of highly-dependent predictors. Indeed, there is a maximum number of predictors $p$ that can be considered in a linear model for a sample size $n$: $p \le n - 2$. Equivalently, there is a minimum sample size $n$ required for fitting a model with $p$ predictors: $n \ge p + 2$.
The interpretation of this fact is simple if we think of the geometry for $p = 1$ and $p = 2$:

 If $p = 1$, we need at least $n = 2$ points to uniquely fit a line. However, this line gives no information on the vertical variation about it, hence $\sigma^2$ cannot be estimated52. Therefore, we need at least $n = 3$ points, that is, $n \ge p + 2 = 3$.
 If $p = 2$, we need at least $n = 3$ points to uniquely fit a plane. But again this plane gives no information on the variation of the data about it and hence $\sigma^2$ cannot be estimated. Therefore, we need $n \ge p + 2 = 4$.

Another interpretation is the following:

The fitting of a linear model with $p$ predictors involves the estimation of $p + 2$ parameters $(\beta, \sigma^2)$ from $n$ data points. The closer $p + 2$ and $n$ are, the more variable the estimates $(\hat\beta, \hat\sigma^2)$ will be, since less information is available for estimating each one. In the limit case $n = p + 2$, each sample point determines a parameter estimate.

In the above discussion the degenerate case $p = n - 1$ was excluded, as it gives a perfect and useless fit for which $\hat\sigma^2$ is not defined. However, $\hat\beta$ is actually computable if $p = n - 1$. The output of the next chunk of code clarifies the distinction between $p = n - 1$ and $p = n - 2$.
# Data: n observations and p = n - 1 predictors
set.seed(123456)
n <- 5
p <- n - 1
df <- data.frame(y = rnorm(n), x = matrix(rnorm(n * p), nrow = n, ncol = p))

# Case p = n - 1 = 4: beta can be estimated, but sigma^2 cannot (hence, no
# inference can be performed since it requires the estimated sigma^2)
summary(lm(y ~ ., data = df))

# Case p = n - 2 = 3: both beta and sigma^2 can be estimated (hence, inference
# can be performed)
summary(lm(y ~ . - x.1, data = df))
The degrees of freedom $n - p - 1$ quantify the increase in the variability of $(\hat\beta, \hat\sigma^2)$ when $p$ approaches $n - 2$. For example (a small numerical sketch follows Figure 3.2 below):

 $t_{n-p-1;\alpha/2}$ appears in (2.16) and influences the length of the CIs for $\beta_j$, see (2.17). It also influences the length of the CIs for the prediction. As Figure 3.2 shows, when the degrees of freedom decrease, $t_{n-p-1;\alpha/2}$ increases, thus the intervals widen.
 $\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^n \hat\varepsilon_i^2$ influences the $R^2$ and $R^2_{\text{Adj}}$. If no relevant variables are added to the model then $\sum_{i=1}^n \hat\varepsilon_i^2$ will not change substantially. However, the factor $\frac{1}{n-p-1}$ will increase as $p$ augments, inflating $\hat\sigma^2$ and its variance. This is exactly what happened in Figure 2.19.

Figure 3.2: Effect of $\mathrm{df} = n - p - 1$ in $t_{\mathrm{df};\alpha/2}$ for $\alpha = 0.10, 0.05, 0.01$.
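
As a small numerical sketch of the first bullet (ours, not part of the book's code), the $t$ quantile used in the confidence intervals grows as the degrees of freedom shrink:

# t_{df; alpha/2} for alpha = 0.05 and decreasing degrees of freedom
qt(p = 1 - 0.05 / 2, df = c(100, 30, 10, 5, 2))
## approximately 1.98 2.04 2.23 2.57 4.30 -- smaller df, wider intervals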
Now that we have shed more light on the problem of having an excess of
predictors, we turn the focus on selecting the most adequate predictors for a
multiple regression model. This is a challenging task without a unique solution,
and what is worse, without a method that is guaranteed to work in all the cases.
However, there is a well-established procedure that usually gives good results:
the stepwise model selection. Its principle is to sequentially compare multiple
linear regression models with different predictors 53, improving iteratively a
performance measure through a greedy search.

Stepwise model selection typically uses as measure of performance an information criterion. An information criterion balances the fitness of a model with the number of predictors employed. Hence, it determines objectively the best model as the one that minimizes the considered information criterion. Two common criteria are the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). Both are based on balancing the model fitness and its complexity:

$$\mathrm{BIC}(\text{model}) = \underbrace{-2\ell(\text{model})}_{\text{Model fitness}} + \underbrace{\mathrm{npar}(\text{model}) \times \log(n)}_{\text{Complexity}}, \tag{3.1}$$

where $\ell(\text{model})$ is the log-likelihood of the model (how well the model fits the data) and $\mathrm{npar}(\text{model})$ is the number of parameters considered in the model (how complex the model is). In the case of a multiple linear regression model with $p$ predictors, $\mathrm{npar}(\text{model}) = p + 2$. The AIC replaces the $\log(n)$ factor by a $2$ in (3.1) so, compared with the BIC, it penalizes the more complex models less54. This is one of the reasons why BIC is preferred by some practitioners for performing model comparison55.
The BIC and AIC can be computed through the functions BIC and AIC. They take a
model as the input.
# Two models with different predictors
mod1 <- lm(medv ~ age + crim, data = Boston)
mod2 <- lm(medv ~ age + crim + lstat, data = Boston)

# BICs
BIC(mod1)
## [1] 3581.893
BIC(mod2) # Smaller -> better
## [1] 3300.841

# AICs
AIC(mod1)
## [1] 3564.987
AIC(mod2) # Smaller -> better
## [1] 3279.708

# Check the summaries
summary(mod1)
##
## Call:
## lm(formula = medv ~ age + crim, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.940 -4.991 -2.420 2.110 32.033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.80067 0.97078 30.698 < 2e-16 ***
## age -0.08955 0.01378 -6.499 1.95e-10 ***
## crim -0.31182 0.04510 -6.914 1.43e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.157 on 503 degrees of freedom
## Multiple R-squared: 0.2166, Adjusted R-squared: 0.2134
## F-statistic: 69.52 on 2 and 503 DF, p-value: < 2.2e-16
summary(mod2)
##
## Call:
## lm(formula = medv ~ age + crim + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.133 -3.848 -1.380 1.970 23.644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.82804 0.74774 43.903 < 2e-16 ***
## age 0.03765 0.01225 3.074 0.00223 **
## crim -0.08262 0.03594 -2.299 0.02193 *
## lstat -0.99409 0.05075 -19.587 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.147 on 502 degrees of freedom
## Multiple R-squared: 0.5559, Adjusted R-squared: 0.5533
## F-statistic: 209.5 on 3 and 502 DF, p-value: < 2.2e-16
Let’s go back to the selection of predictors. If we have $p$ predictors, a naive procedure would be to check all the possible models that can be constructed with them and then select the best one in terms of BIC/AIC. This exhaustive search is the so-called best subset selection – its application is to be seen in Section 4.4. The problem is that there are $2^p$ possible models to inspect!56 Fortunately, the MASS::stepAIC function helps us navigate this ocean of models by implementing stepwise model selection. Stepwise selection will iteratively add predictors that decrease an information criterion and/or remove those that increase it, depending on the mode of stepwise search that is performed. stepAIC takes as input an initial model to be improved and has several variants. Let’s see first how it works in its default mode using the already studied wine dataset.
# Load data -- notice that "Year" is also included
wine <- read.csv(file = "wine.csv", header = TRUE)
We use as initial model the one featuring all the available predictors. The argument k of stepAIC stands for the factor post-multiplying $\mathrm{npar}(\text{model})$ in (3.1). It defaults to $2$, which gives the AIC. If we set it to k = log(n), the function considers the BIC.
# Full model
mod <- lm(Price ~ ., data = wine)

# With AIC
modAIC <- MASS::stepAIC(mod, k = 2)
## Start: AIC=-61.07
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
##
##
## Step: AIC=-61.07
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0026 1.8058 -63.031
## - Year 1 0.0048 1.8080 -62.998
## <none> 1.8032 -61.070
## - WinterRain 1 0.4585 2.2617 -56.952
## - HarvestRain 1 1.8063 3.6095 -44.331
## - AGST 1 3.3756 5.1788 -34.584
##
## Step: AIC=-63.03
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -63.031
## - WinterRain 1 0.4809 2.2867 -58.656
## - Year 1 0.9089 2.7147 -54.023
## - HarvestRain 1 1.8760 3.6818 -45.796
## - AGST 1 3.4428 5.2486 -36.222

# The result is an lm object
summary(modAIC)
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.6390418 14.6939240 2.970 0.00707 **
## Year -0.0238480 0.0071667 -3.328 0.00305 **
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08

# With BIC
modBIC <- MASS::stepAIC(mod, k = log(nrow(wine)))
## Start: AIC=-53.29
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
##
##
## Step: AIC=-53.29
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0026 1.8058 -56.551
## - Year 1 0.0048 1.8080 -56.519
## <none> 1.8032 -53.295
## - WinterRain 1 0.4585 2.2617 -50.473
## - HarvestRain 1 1.8063 3.6095 -37.852
## - AGST 1 3.3756 5.1788 -28.105
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
An explanation of what stepAIC did for modBIC:

 At each step, stepAIC displayed information about the current value of the information
criterion. For example, the BIC at the first step was Step: AIC=-53.29 and then it
improved to Step: AIC=-56.55 in the second step.57
 The next model to move on was decided by exploring the information criteria of the
different models resulting from adding or removing a predictor (depending on
the direction argument, explained later). For example, in the first step the model arising
from removing58 FrancePop had a BIC equal to -56.551.
 The stepwise regression proceeded then by removing FrancePop, as it gave the lowest
BIC. When repeating the previous exploration, it was found that
removing <none> predictors was the best possible action.

The selected models modBIC and modAIC are equivalent to the modWine2 we
selected in Section 2.7.3 as the best model. This is a simple illustration that the
model selected by stepAIC is often a good starting point for further additions or
deletions of predictors.
The direction argument of stepAIC controls the mode of the stepwise model search:

 "backward": removes predictors sequentially from the given model. It produces a sequence of models of decreasing complexity until attaining the optimal one.
 "forward": adds predictors sequentially to the given model, using the available ones in the data argument of lm. It produces a sequence of models of increasing complexity until attaining the optimal one.
 "both" (default): a forward-backward search that, at each step, decides whether to include or exclude a predictor. Differently from the previous modes, a predictor that was excluded/included previously can be later included/excluded.

The next chunk of code clearly explains how to exploit the direction argument, and other options of stepAIC, with a modified version of the wine dataset. An important warning is that in order to use direction = "forward" or direction = "both", scope needs to be properly defined. The practical advice for model selection is to run59 several of these three search modes and retain the model with the minimum BIC/AIC, being especially careful with the scope argument.
# Add an irrelevant predictor to the wine dataset
set.seed(123456)
wineNoise <- wine
n <- nrow(wineNoise)
wineNoise$noisePredictor <- rnorm(n)
# Backward selection: removes predictors sequentially from the given model

# Starting from the model with all the predictors
modAll <- lm(formula = Price ~ ., data = wineNoise)
MASS::stepAIC(modAll, direction = "backward", k = log(n))
## Start: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop +
## noisePredictor
##
##
## Step: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop +
## noisePredictor
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0036 1.7977 -53.376
## - Year 1 0.0038 1.7979 -53.374
## - noisePredictor 1 0.0090 1.8032 -53.295
## <none> 1.7941 -50.135
## - WinterRain 1 0.4598 2.2539 -47.271
## - HarvestRain 1 1.7666 3.5607 -34.923
## - AGST 1 3.3658 5.1599 -24.908
##
## Step: AIC=-53.38
## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0081 1.8058 -56.551
## <none> 1.7977 -53.376
## - WinterRain 1 0.4771 2.2748 -50.317
## - Year 1 0.9162 2.7139 -45.552
## - HarvestRain 1 1.8449 3.6426 -37.606
## - AGST 1 3.4234 5.2212 -27.885
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year WinterRain AGST HarvestRain
## 43.639042 -0.023848 0.001167 0.616392 -0.003861

# Starting from an intermediate model
modInter <- lm(formula = Price ~ noisePredictor + Year + AGST, data = wineNoise)
MASS::stepAIC(modInter, direction = "backward", k = log(n))
## Start: AIC=-32.38
## Price ~ noisePredictor + Year + AGST
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0146 5.0082 -35.601
## <none> 4.9936 -32.384
## - Year 1 0.7522 5.7459 -31.891
## - AGST 1 3.2211 8.2147 -22.240
##
## Step: AIC=-35.6
## Price ~ Year + AGST
##
## Df Sum of Sq RSS AIC
## <none> 5.0082 -35.601
## - Year 1 0.7966 5.8049 -34.911
## - AGST 1 3.2426 8.2509 -25.417
##
## Call:
## lm(formula = Price ~ Year + AGST, data = wineNoise)
##
## Coefficients:
## (Intercept) Year AGST
## 41.49441 -0.02221 0.56067
# Recall that other predictors not included in modInter are not explored
# during the search (so the relevant predictor HarvestRain is not added)

# Forward selection: adds predictors sequentially from the given model

# Starting from the model with no predictors, only intercept (denoted as ~ 1)
modZero <- lm(formula = Price ~ 1, data = wineNoise)
MASS::stepAIC(modZero, direction = "forward",
              scope = list(lower = modZero, upper = modAll), k = log(n))
## Start: AIC=-22.28
## Price ~ 1
##
## Df Sum of Sq RSS AIC
## + AGST 1 4.6655 5.8049 -34.911
## + HarvestRain 1 2.6933 7.7770 -27.014
## + FrancePop 1 2.4231 8.0472 -26.092
## + Year 1 2.2195 8.2509 -25.417
## + Age 1 2.2195 8.2509 -25.417
## <none> 10.4703 -22.281
## + WinterRain 1 0.1905 10.2798 -19.481
## + noisePredictor 1 0.1761 10.2942 -19.443
##
## Step: AIC=-34.91
## Price ~ AGST
##
## Df Sum of Sq RSS AIC
## + HarvestRain 1 2.50659 3.2983 -46.878
## + WinterRain 1 1.42392 4.3809 -39.214
## + FrancePop 1 0.90263 4.9022 -36.178
## + Year 1 0.79662 5.0082 -35.601
## + Age 1 0.79662 5.0082 -35.601
## <none> 5.8049 -34.911
## + noisePredictor 1 0.05900 5.7459 -31.891
##
## Step: AIC=-46.88
## Price ~ AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## + FrancePop 1 1.03572 2.2625 -53.759
## + Year 1 1.01159 2.2867 -53.473
## + Age 1 1.01159 2.2867 -53.473
## + WinterRain 1 0.58356 2.7147 -48.840
## <none> 3.2983 -46.878
## + noisePredictor 1 0.06084 3.2374 -44.085
##
## Step: AIC=-53.76
## Price ~ AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## + WinterRain 1 0.45456 1.8080 -56.519
## <none> 2.2625 -53.759
## + noisePredictor 1 0.00829 2.2542 -50.562
## + Year 1 0.00085 2.2617 -50.473
## + Age 1 0.00085 2.2617 -50.473
##
## Step: AIC=-56.52
## Price ~ AGST + HarvestRain + FrancePop + WinterRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8080 -56.519
## + noisePredictor 1 0.0100635 1.7979 -53.374
## + Year 1 0.0048039 1.8032 -53.295
## + Age 1 0.0048039 1.8032 -53.295
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + FrancePop + WinterRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) AGST HarvestRain FrancePop WinterRain
## -5.945e-01 6.127e-01 -3.804e-03 -5.189e-05 1.136e-03
# It is very important to set the scope argument adequately when doing forward
# search! In the scope you have to define the "minimum" (lower) and "maximum"
# (upper) models that contain the set of explorable models. If not provided,
# the maximum model will be taken as the passed starting model (in this case
# modZero) and stepAIC will not do any search

# Starting from an intermediate model
MASS::stepAIC(modInter, direction = "forward",
              scope = list(lower = modZero, upper = modAll), k = log(n))
## Start: AIC=-32.38
## Price ~ noisePredictor + Year + AGST
##
## Df Sum of Sq RSS AIC
## + HarvestRain 1 2.71878 2.2748 -50.317
## + WinterRain 1 1.35102 3.6426 -37.606
## <none> 4.9936 -32.384
## + FrancePop 1 0.24004 4.7536 -30.418
##
## Step: AIC=-50.32
## Price ~ noisePredictor + Year + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## + WinterRain 1 0.47710 1.7977 -53.376
## <none> 2.2748 -50.317
## + FrancePop 1 0.02094 2.2539 -47.271
##
## Step: AIC=-53.38
## Price ~ noisePredictor + Year + AGST + HarvestRain + WinterRain
##
## Df Sum of Sq RSS AIC
## <none> 1.7977 -53.376
## + FrancePop 1 0.0036037 1.7941 -50.135
##
## Call:
## lm(formula = Price ~ noisePredictor + Year + AGST + HarvestRain +
## WinterRain, data = wineNoise)
##
## Coefficients:
## (Intercept) noisePredictor Year AGST HarvestRain WinterRain
## 44.096639 -0.019617 -0.024126 0.620522 -0.003840 0.001211
# Recall that predictors included in modInter are not dropped during the
# search (so the irrelevant noisePredictor is kept)

# Both selection: useful if starting from an intermediate model
# Removes the problems associated to "backward" and "forward" searches done
# from intermediate models
MASS::stepAIC(modInter, direction = "both",
              scope = list(lower = modZero, upper = modAll), k = log(n))
## Start: AIC=-32.38
## Price ~ noisePredictor + Year + AGST
##
## Df Sum of Sq RSS AIC
## + HarvestRain 1 2.7188 2.2748 -50.317
## + WinterRain 1 1.3510 3.6426 -37.606
## - noisePredictor 1 0.0146 5.0082 -35.601
## <none> 4.9936 -32.384
## - Year 1 0.7522 5.7459 -31.891
## + FrancePop 1 0.2400 4.7536 -30.418
## - AGST 1 3.2211 8.2147 -22.240
##
## Step: AIC=-50.32
## Price ~ noisePredictor + Year + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.01182 2.2867 -53.473
## + WinterRain 1 0.47710 1.7977 -53.376
## <none> 2.2748 -50.317
## + FrancePop 1 0.02094 2.2539 -47.271
## - Year 1 0.96258 3.2374 -44.085
## - HarvestRain 1 2.71878 4.9936 -32.384
## - AGST 1 2.94647 5.2213 -31.180
##
## Step: AIC=-53.47
## Price ~ Year + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## + WinterRain 1 0.48087 1.8058 -56.551
## <none> 2.2867 -53.473
## + FrancePop 1 0.02497 2.2617 -50.473
## + noisePredictor 1 0.01182 2.2748 -50.317
## - Year 1 1.01159 3.2983 -46.878
## - HarvestRain 1 2.72157 5.0082 -35.601
## - AGST 1 2.96500 5.2517 -34.319
##
## Step: AIC=-56.55
## Price ~ Year + AGST + HarvestRain + WinterRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## + noisePredictor 1 0.0081 1.7977 -53.376
## + FrancePop 1 0.0026 1.8032 -53.295
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
##
## Call:
## lm(formula = Price ~ Year + AGST + HarvestRain + WinterRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year AGST HarvestRain WinterRain
## 43.639042 -0.023848 0.616392 -0.003861 0.001167
# It is very important as well to correctly define the scope, because "both"
# resorts to "forward" (as well as to "backward")

# Using the defaults from the full model essentially does backward selection,
# but allowing predictors that were removed to enter again at later steps
MASS::stepAIC(modAll, direction = "both", k = log(n))
## Start: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop +
## noisePredictor
##
##
## Step: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop +
## noisePredictor
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0036 1.7977 -53.376
## - Year 1 0.0038 1.7979 -53.374
## - noisePredictor 1 0.0090 1.8032 -53.295
## <none> 1.7941 -50.135
## - WinterRain 1 0.4598 2.2539 -47.271
## - HarvestRain 1 1.7666 3.5607 -34.923
## - AGST 1 3.3658 5.1599 -24.908
##
## Step: AIC=-53.38
## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0081 1.8058 -56.551
## <none> 1.7977 -53.376
## - WinterRain 1 0.4771 2.2748 -50.317
## + FrancePop 1 0.0036 1.7941 -50.135
## - Year 1 0.9162 2.7139 -45.552
## - HarvestRain 1 1.8449 3.6426 -37.606
## - AGST 1 3.4234 5.2212 -27.885
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## + noisePredictor 1 0.0081 1.7977 -53.376
## + FrancePop 1 0.0026 1.8032 -53.295
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year WinterRain AGST HarvestRain
## 43.639042 -0.023848 0.001167 0.616392 -0.003861

# Omit lengthy outputs
MASS::stepAIC(modAll, direction = "both", trace = 0,
              scope = list(lower = modZero, upper = modAll), k = log(n))
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year WinterRain AGST HarvestRain
## 43.639042 -0.023848 0.001167 0.616392 -0.003861
Run the previous stepwise selections for the Boston dataset, with the aim of clearly understanding the different search directions. Specifically:

a. Do a "forward" stepwise fit starting from medv ~ 1.
b. Do a "forward" stepwise fit starting from medv ~ crim + lstat + age.
c. Do a "both" stepwise fit starting from medv ~ crim + lstat + age.
d. Do a "both" stepwise fit starting from medv ~ ..
e. Do a "backward" stepwise fit starting from medv ~ ..

We conclude with a couple of quirks of stepAIC to be aware of.

stepAIC assumes that no NA's (missing values) are present in the data. It is advised to remove the missing values from the data beforehand, as their presence might lead to errors. To do so, employ data = na.omit(dataset) in the call to lm (if your dataset is dataset). Also, see Appendix A.4 for possible alternatives to deal with missing data.

stepAIC and friends (addterm and dropterm) compute a slightly different version of the BIC/AIC than the BIC/AIC functions. Precisely, the BIC/AIC they report come from the extractAIC function, which differs by an additive constant from the output of BIC/AIC. This is not relevant for model comparison, since shifting the BIC/AIC by a common constant does not change the lower-to-higher BIC/AIC ordering of models. However, it is important to be aware of this fact in order not to directly compare the output of stepAIC with that of BIC/AIC. The additive constant (included in BIC/AIC but not in extractAIC) is $n(\log(2\pi) + 1) + \log(n)$ for the BIC and $n(\log(2\pi) + 1) + 2$ for the AIC. The discrepancy arises from simplifying the computation of the BIC/AIC for linear models and from counting $\hat\sigma^2$ as an estimated parameter.
The following chunk of code illustrates the relation between the AIC reported by stepAIC, the output of extractAIC, and the BIC/AIC reported by BIC/AIC.
# Same BICs, different scale
n <- nobs(modBIC)
extractAIC(modBIC, k = log(n))[2]
## [1] -56.55135
BIC(modBIC)
## [1] 23.36717
# Observe that MASS::stepAIC(mod, k = log(nrow(wine))) returned as final BIC
# the one given by extractAIC(), not by BIC()! But both are equivalent, they
# just differ in a constant shift

# Same AICs, different scale
extractAIC(modAIC, k = 2)[2]
## [1] -63.03053
AIC(modAIC)
## [1] 15.59215

# The additive constant: BIC() includes it but extractAIC() does not
BIC(modBIC) - extractAIC(modBIC, k = log(n))[2]
## [1] 79.91852
n * (log(2 * pi) + 1) + log(n)
## [1] 79.91852

# Same for the AIC
AIC(modAIC) - extractAIC(modAIC)[2]
## [1] 78.62268
n * (log(2 * pi) + 1) + 2
## [1] 78.62268

3.2.1 Case study application

We want to build a linear model for predicting and explaining medv. There are a good number of predictors and some of them might be of little use for predicting medv. However, there is no clear intuition of which predictors will yield better explanations of medv with the information at hand. Therefore, we can start by doing a linear model on all the predictors:
modHouse <- lm(medv ~ ., data = Boston)
summary(modHouse)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
There are a couple of non-significant variables, but so far the model has an $R^2 = 0.74$ and the fitted coefficients are sensible with what would be expected. For example, crim, tax, ptratio, and nox have negative effects on medv, while rm, rad, and chas have positive effects. However, the non-significant coefficients are only adding noise to the prediction and decreasing the overall accuracy of the coefficient estimates.
Let’s polish the previous model a little bit. Instead of manually removing each non-
significant variable to reduce the complexity, we employ stepAIC for performing a
forward-backward search starting from the full model. This gives us a
candidate best model.
# Best BIC and AIC models
modBIC <- MASS::stepAIC(modHouse, k = log(nrow(Boston)))
## Start: AIC=1648.81
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 0.06 11079 1642.6
## - indus 1 2.52 11081 1642.7
## <none> 11079 1648.8
## - chas 1 218.97 11298 1652.5
## - tax 1 242.26 11321 1653.5
## - crim 1 243.22 11322 1653.6
## - zn 1 257.49 11336 1654.2
## - black 1 270.63 11349 1654.8
## - rad 1 479.15 11558 1664.0
## - nox 1 487.16 11566 1664.4
## - ptratio 1 1194.23 12273 1694.4
## - dis 1 1232.41 12311 1696.0
## - rm 1 1871.32 12950 1721.6
## - lstat 1 2410.84 13490 1742.2
##
## Step: AIC=1642.59
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 2.52 11081 1636.5
## <none> 11079 1642.6
## - chas 1 219.91 11299 1646.3
## - tax 1 242.24 11321 1647.3
## - crim 1 243.20 11322 1647.3
## - zn 1 260.32 11339 1648.1
## - black 1 272.26 11351 1648.7
## - rad 1 481.09 11560 1657.9
## - nox 1 520.87 11600 1659.6
## - ptratio 1 1200.23 12279 1688.4
## - dis 1 1352.26 12431 1694.6
## - rm 1 1959.55 13038 1718.8
## - lstat 1 2718.88 13798 1747.4
##
## Step: AIC=1636.48
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 11081 1636.5
## - chas 1 227.21 11309 1640.5
## - crim 1 245.37 11327 1641.3
## - zn 1 257.82 11339 1641.9
## - black 1 270.82 11352 1642.5
## - tax 1 273.62 11355 1642.6
## - rad 1 500.92 11582 1652.6
## - nox 1 541.91 11623 1654.4
## - ptratio 1 1206.45 12288 1682.5
## - dis 1 1448.94 12530 1692.4
## - rm 1 1963.66 13045 1712.8
## - lstat 1 2723.48 13805 1741.5
modAIC <- MASS::stepAIC(modHouse, trace = 0, k = 2)

# Comparison: same fits
car::compareCoefs(modBIC, modAIC)
## Calls:
## 1: lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + tax +
##   ptratio + black + lstat, data = Boston)
## 2: lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + tax +
##   ptratio + black + lstat, data = Boston)
##
## Model 1 Model 2
## (Intercept) 36.34 36.34
## SE 5.07 5.07
##
## crim -0.1084 -0.1084
## SE 0.0328 0.0328
##
## zn 0.0458 0.0458
## SE 0.0135 0.0135
##
## chas 2.719 2.719
## SE 0.854 0.854
##
## nox -17.38 -17.38
## SE 3.54 3.54
##
## rm 3.802 3.802
## SE 0.406 0.406
##
## dis -1.493 -1.493
## SE 0.186 0.186
##
## rad 0.2996 0.2996
## SE 0.0634 0.0634
##
## tax -0.01178 -0.01178
## SE 0.00337 0.00337
##
## ptratio -0.947 -0.947
## SE 0.129 0.129
##
## black 0.00929 0.00929
## SE 0.00267 0.00267
##
## lstat -0.5226 -0.5226
## SE 0.0474 0.0474
##
summary(modBIC)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16

# Confidence intervals
confint(modBIC)
## 2.5 % 97.5 %
## (Intercept) 26.384649126 46.29764088
## crim -0.172817670 -0.04400902
## zn 0.019275889 0.07241397
## chas 1.040324913 4.39710769
## nox -24.321990312 -10.43005655
## rm 3.003258393 4.59989929
## dis -1.857631161 -1.12779176
## rad 0.175037411 0.42417950
## tax -0.018403857 -0.00515209
## ptratio -1.200109823 -0.69293932
## black 0.004037216 0.01454447
## lstat -0.615731781 -0.42937513
Note how the $R^2_{\text{Adj}}$ has slightly increased with respect to the full model and how all the predictors are significant. Note also that modBIC and modAIC are the same.
Using modBIC, we can quantify the influence of the predictor variables on the housing prices (Q1) and we can conclude that, in the final model (Q2) and with significance level $\alpha = 0.05$:

 zn, chas, rm, rad, and black have a significantly positive influence on medv;
 crim, nox, dis, tax, ptratio, and lstat have a significantly negative influence on medv.

The functions MASS::addterm and MASS::dropterm allow adding and removing all individual predictors to a given model, and inform of the BICs / AICs of the possible combinations. Check that:

a. modBIC cannot be improved in terms of BIC by removing predictors. Use dropterm(modBIC, k = log(nobs(modBIC))) for that.
b. modBIC cannot be improved in terms of BIC by adding predictors. Use addterm(modBIC, scope = lm(medv ~ ., data = Boston), k = log(nobs(modBIC))) for that.

For the second point, recall that scope must specify the maximal model or formula. However, be careful because, if using the formula approach, addterm(modBIC, scope = medv ~ ., k = log(nobs(modBIC))) will understand that . refers to all the predictors in modBIC, not in the Boston dataset, and will return an error. Calling addterm(modBIC, scope = medv ~ . + indus + age, k = log(nobs(modBIC))) gives the required result in terms of a formula, at the expense of manually adding the remaining predictors.
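
Put together as a runnable sketch (our arrangement of the calls suggested above, assuming modBIC has been fitted as in this section):

library(MASS)
# (a) no single-predictor deletion lowers the BIC of modBIC
dropterm(modBIC, k = log(nobs(modBIC)))
# (b) no single-predictor addition lowers the BIC of modBIC
addterm(modBIC, scope = lm(medv ~ ., data = Boston), k = log(nobs(modBIC)))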

3.2.2 Consistency in model selection

A caveat on the use of the BIC/AIC is that both criteria are constructed assuming that the sample size $n$ is much larger than the number of parameters in the model ($p + 2$), that is, assuming that $n \gg p + 2$. Therefore, they will work reasonably well if $n \gg p + 2$ but, when this is not true, they enter in “overfitting mode” and start favoring unrealistically complex models. An illustration of this phenomenon is shown in Figure 3.3, which is the BIC/AIC version of Figure 2.19 and corresponds to the simulation study of Section 2.6. The BIC/AIC curves tend to have local minima close to $p = 2$ and then increase. But, when $p + 2$ gets close to $n$, they quickly drop below their values at $p = 2$.
In particular, Figure 3.3 gives the following insights:
 The steeper BIC curves are a consequence of the higher penalization that BIC introduces on model complexity with respect to AIC. Therefore, the local minima of the BIC curves are better identified than those of the flatter AIC curves.
 The BIC resists overfitting better than the AIC. The average BIC curve, $\overline{\mathrm{BIC}}(p)$, starts giving smaller values than $\overline{\mathrm{BIC}}(2) \approx 1023.76$ when $p = 198$ ($\overline{\mathrm{BIC}}(198) \approx 788.30$). Overfitting appears earlier in the AIC, and $\overline{\mathrm{AIC}}(2) \approx 1010.57$ is surpassed for $p \ge 155$ ($\overline{\mathrm{AIC}}(155) \approx 1009.57$).
 The variabilities of both AIC and BIC curves are comparable, with no significant differences between them.

Figure 3.3: Comparison of BIC and AIC on the model (2.26) fitted with data generated by (2.25). The number of predictors $p$ ranges from $1$ to $198$, with only the first two predictors being significant. The $M = 500$ curves for each color arise from $M$ simulated datasets of sample size $n = 200$. The thicker curves are the mean of each color's curves.
Another big difference between the AIC and BIC, which is indeed behind the behaviors seen in Figure 3.3, is the consistency of the BIC in performing model selection. In simple terms, “consistency” means that, if enough data is provided, the BIC is guaranteed to identify the true data-generating model among a list of candidate models, if the true model is included in the list. Mathematically, it means that, given a collection of models $M_0, M_1, \ldots, M_m$, where $M_0$ is the generating model of a sample of size $n$, then

$$\mathbb{P}\left[\arg\min_{k = 0, \ldots, m} \mathrm{BIC}(\hat M_k) = 0\right] \to 1 \quad \text{as } n \to \infty, \tag{3.2}$$

where $\hat M_k$ represents the $M_k$ model fitted with a sample of size $n$ generated from $M_0$.60 Note that, despite being a nice theoretical result, its application may be unrealistic in practice, as most likely the true model is nonlinear or not present in the list of candidate models we examine.
The AIC is inconsistent, in the sense that (3.2) is not true if BIC is replaced by AIC. Indeed, this result can be seen as a consequence of the asymptotic equivalence of model selection by AIC and leave-one-out cross-validation61, and the inconsistency of the latter. This is beautifully described in the paper by Shao (1993), whose abstract is given in Figure 3.4. The paper made a shocking discovery in terms of what is the required modification to induce consistency in model selection by cross-validation.

Figure 3.4: The abstract of Jun Shao's Linear model selection by cross-validation (Shao 1993).

The Leave-One-Out Cross-Validation (LOOCV) error in a fitted linear model is defined as

$$\mathrm{LOOCV} := \frac{1}{n}\sum_{i=1}^n \left(Y_i - \hat Y_{i,-i}\right)^2, \tag{3.3}$$

where $\hat Y_{i,-i}$ is the prediction of the $i$-th observation in a linear model that has been fitted without the $i$-th datum $(X_{i1}, \ldots, X_{ip}, Y_i)$. That is, $\hat Y_{i,-i} = \hat\beta_{0,-i} + \hat\beta_{1,-i} X_{i1} + \cdots + \hat\beta_{p,-i} X_{ip}$, where $\hat\beta_{-i}$ are the estimated coefficients from the sample $\{(X_{j1}, \ldots, X_{jp}, Y_j)\}_{j=1, j \ne i}^n$. Model selection based on LOOCV chooses the model with minimum error (3.3) within a list of candidate models. Note that the computation of LOOCV requires, in principle, fitting $n$ separate linear models, predicting, and then aggregating. However, an algebraic shortcut based on (4.25) allows computing (3.3) using a single linear fit.
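
As a side note (ours, not the book's code), the single-fit shortcut for linear models can be written with the leverages (hat values) of the fitted model; the Boston model below is just an arbitrary example:

library(MASS)  # for the Boston data
# LOOCV error of a fitted linear model via the leverage shortcut
loocv <- function(fit) mean((residuals(fit) / (1 - hatvalues(fit)))^2)
loocv(lm(medv ~ crim + lstat, data = Boston))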
We carry out a small simulation study in order to illustrate the consistency property (3.2) of BIC and the inconsistency of model selection by AIC and LOOCV. For that, we consider the linear model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon, \quad \beta = (0.5, 1, 1, 0, 0, 0)', \tag{3.4}$$

where $X_j \sim \mathcal{N}(0, 1)$ for $j = 1, \ldots, 5$ (independently) and $\varepsilon \sim \mathcal{N}(0, 1)$. Only the first two predictors are relevant, the last three are “garbage” predictors. For a given sample, model selection is performed by selecting among the $2^5$ possible fitted models the ones with minimum AIC, BIC, and LOOCV. The experiment is repeated $M = 500$ times for sample sizes $n = 2^\ell$, $\ell = 3, \ldots, 12$, and the estimated probability of selecting the correct model (the one only involving $X_1$ and $X_2$) is displayed in Figure 3.5. The figure evidences empirically several interesting results:

 LOOCV and AIC are asymptotically equivalent. For large $n$, they tend to select the same model and hence their estimated probabilities of selecting the true model are almost equal. For small $n$, there are significant differences between them.
 The BIC is consistent in selecting the true model, and its probability of doing so quickly approaches $1$, as anticipated by (3.2).
 The AIC and LOOCV are inconsistent in selecting the true model. Despite the sample size $n$ doubling at each step, their probability of recovering the true model gets stuck at about $0.60$.
 Even for moderate $n$'s, the probability of recovering the true model by BIC quickly outperforms those of AIC/LOOCV.

Figure 3.5: Estimation of the probability of selecting the correct model by minimizing the AIC, BIC, and LOOCV criteria, done for an exhaustive search with $p = 5$ predictors. The correct model contained two predictors. The probability was estimated with $M = 500$ Monte Carlo runs. The horizontal axis is in logarithmic scale. The estimated proportion of true model recoveries with BIC for $n = 4096$ is $0.984$.
Implement the simulation study behind Figure 3.5 (a possible skeleton is sketched below):

1. Sample from (3.4), but take $p = 4$.
2. Fit the $2^p$ possible models.
3. Select the best model according to the AIC and BIC.
4. Repeat Steps 1–3 $M = 100$ times. Estimate by Monte Carlo the probability of selecting the true model.
5. Move $n$ in a range from $10$ to $200$.
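
A possible skeleton for Steps 1–4 at a single sample size $n$ (our sketch, not the book's solution; object names such as hits_aic are ours, and with $p = 4$ the true model uses only the first two predictors):

set.seed(1)
p <- 4; n <- 50; M <- 100
beta <- c(0.5, 1, 1, 0, 0)  # intercept and p slopes; only X1 and X2 are relevant
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), p))  # the 2^p predictor subsets
true_row <- which(apply(subsets, 1, function(s) all(s == c(TRUE, TRUE, FALSE, FALSE))))
hits_aic <- hits_bic <- 0
for (m in 1:M) {
  X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
  dat <- data.frame(y = beta[1] + X %*% beta[-1] + rnorm(n), X)
  fits <- lapply(seq_len(nrow(subsets)), function(k) {
    s <- unlist(subsets[k, ])
    f <- if (any(s)) reformulate(colnames(X)[s], response = "y") else y ~ 1
    lm(f, data = dat)
  })
  hits_aic <- hits_aic + (which.min(sapply(fits, AIC)) == true_row)
  hits_bic <- hits_bic + (which.min(sapply(fits, BIC)) == true_row)
}
c(AIC = hits_aic / M, BIC = hits_bic / M)  # estimated probabilities of true-model recovery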

Once you have a working solution, increase $(p, n, M)$ to approach the settings in Figure 3.5 (or go beyond!).
Modify Step 3 of the previous exercise to:

a. Select the best model according to the $R^2_{\text{Adj}}$ and investigate its consistency in model selection.
b. Add the LOOCV criterion in order to fully replicate Figure 3.5. Hint: you may want to adapt (4.25) to your needs in order to reduce computation time.

Investigate what happens with the probability of selecting the true model using BIC and AIC if the exhaustive search is replaced by a stepwise selection. Precisely, do:

1. Sample from (3.4), but take $p = 10$ (pad $\beta$ with zeros).
2. Run a forward-backward stepwise search, both for the AIC and BIC.
3. Repeat Steps 1–2 $M = 100$ times. Estimate by Monte Carlo the probability of selecting the true model.
4. Move $n$ in a range from $20$ to $200$.

Once you have a working solution, increase $(n, M)$ to approach the settings in Figure 3.5 (or go beyond!).
Shao (1993)’s modification on how to make the model selection by cross-validation consistent is shocking. As in LOOCV, one has to split the sample of size $n$ into two subsets: a training set, of size $n_t$, and a validation set, of size $n_v$. Of course, $n = n_t + n_v$. However, totally opposite to LOOCV, where $n_t = n - 1$ and $n_v = 1$, the modification required is to choose $n_v$ asymptotically equal to $n$, in the sense that $n_v / n \to 1$ as $n \to \infty$. An example would be to take $n_v = n - 20$ and $n_t = 20$. But $n_v = n/2$ and $n_t = n/2$, a typical choice, would not make the selection consistent! Shao (1993)’s result gives a great insight: the difficulty in model selection lies in the comparison of models, not in their estimation, and the sample information has to be disproportionately allocated to the comparison task to achieve a consistent model selection device.
Verify Shao (1993)’s result by constructing the analogous of Figure 3.5 for the following choices of $(n_t, n_v)$:
a. $n_t = n - 1$, $n_v = 1$ (LOOCV).
b. $n_t = n/2$, $n_v = n/2$.
c. $n_t = n/4$, $n_v = (3/4)n$.
d. $n_t = 5p$, $n_v = n - 5p$.

Use first $p = 4$ predictors, $M = 100$ Monte Carlo runs, and move $n$ in a range from $20$ to $200$. Then increase the settings to approach those of Figure 3.5.
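
For reference, the validation criterion for one candidate model and one random split could be computed as in the following sketch (ours; the helper name split_cv, the split size, and the example formula are all illustrative):

library(MASS)  # for the Boston data
# validation-set mean squared error, training on n_t randomly chosen rows
split_cv <- function(formula, data, n_t) {
  train_idx <- sample(nrow(data), n_t)
  fit <- lm(formula, data = data[train_idx, ])
  pred <- predict(fit, newdata = data[-train_idx, ])
  mean((data[-train_idx, all.vars(formula)[1]] - pred)^2)
}
split_cv(medv ~ crim + lstat, data = Boston, n_t = 20)  # n_v = n - 20, as in Shao's choice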

References
Schwarz, G. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2): 461–64. https://doi.org/10.1214/aos/1176344136.
Shao, J. 1993. “Linear Model Selection by Cross-Validation.” Journal of the American Statistical Association 88 (422): 486–94. https://doi.org/10.2307/2290328.

52. Applying its formula, we would obtain $\hat\sigma^2 = 0/0$ and so $\hat\sigma^2$ will not be defined.↩︎
53. Of course, with the same responses.↩︎
54. Recall that $\log(n) > 2$ if $n \ge 8$.↩︎
55. Also, because the BIC is consistent, see Section 3.2.2.↩︎
56. Since we have to take $p$ binary decisions on whether or not to include each of the $p$ predictors.↩︎
57. The function always prints ‘AIC’, even if ‘BIC’ is employed.↩︎
58. Note the - before FrancePop. Note also the convenient sorting of the BICs in an increasing form, so that the next best action always corresponds to the first row.↩︎
59. If possible!↩︎
60. The specifics of this result for the linear model are given in Schwarz (1978).↩︎
61. Cross-validation and its different variants are formally addressed at the end of Section 3.6.2.↩︎
