STEPAIC
In R, stepAIC is one of the most commonly used search methods for feature selection. We
keep minimizing the AIC value to arrive at the final set of features. stepAIC does not
necessarily improve model performance; rather, it is used to simplify the model without
much impact on performance. AIC quantifies the amount of information lost due to this
simplification. AIC stands for Akaike Information Criterion.
Given two models, we prefer the one with the lower AIC value. Hence AIC provides a means
for model selection. AIC is only a relative measure among multiple models.
AIC is similar to adjusted R-squared in that it also penalizes adding more variables to the
model. The absolute value of AIC has no significance on its own. We only compare whether
the AIC value increases or decreases as variables are added; among several models, the one
with the lowest AIC value is preferred.
So let's see how stepAIC works in R. We will use the mtcars data set. First, remove the
feature “x” by setting it to NULL, as it contains only the car model names, which do not
carry much meaning in this case. Then remove the rows that contain missing values (NA) in
any of the columns using the na.omit function. Missing values must be handled, otherwise
stepAIC can stop with an error. Then build the model and run stepAIC. For this, we need the
MASS and car packages.
The first parameter of stepAIC is the fitted model, and the direction parameter selects the
feature selection technique to use. It can take the following values:
“both” (for stepwise regression, both forward and backward selection);
“backward” (for backward selection) and
“forward” (for forward selection).
At the very last step, stepAIC produced the selected set of features {drat, wt, gear, carb}.
stepAIC can also reduce multicollinearity in the model when it drops redundant predictors,
although this is not guaranteed; I will explain this in an upcoming article.
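The workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the built-in mtcars data with an artificial name column "x" standing in for the article's file, and the object names cars, full_model and selected are hypothetical. Because the data need not match the article's file, the selected features may differ from the set reported above.

```r
library(MASS)  # provides stepAIC

# Assumption: mimic the article's data by storing the car names in a column "x".
cars <- data.frame(x = rownames(mtcars), mtcars, row.names = NULL)

cars$x <- NULL         # drop the car-name column
cars <- na.omit(cars)  # drop rows with missing values (none in the built-in data)

# Full model with every remaining feature as a predictor of mpg
full_model <- lm(mpg ~ ., data = cars)

# Stepwise search starting from the full model; trace = FALSE hides the step log
selected <- stepAIC(full_model, direction = "both", trace = FALSE)

formula(selected)
c(full = AIC(full_model), selected = AIC(selected))
```

The last line compares the AIC of the full and the selected model; the selected one should not be larger, since the search only moves to models with a lower criterion.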
In the previous post, Feature Selection Techniques in Regression Model, we learnt how to
perform Stepwise Regression, Forward Selection and Backward Elimination in detail.
stepAIC is an automated method that returns the set of features with the lowest AIC found
by the search.
This article first appeared on the “Tech Tunnel” blog at
https://ashutoshtripathi.com/2019/06/07/feature-selection-techniques-in-regression-model/
stepAIC {MASS}
Description
Usage
stepAIC(object, scope, scale = 0,
direction = c("both", "backward", "forward"),
trace = 1, keep = NULL, steps = 1000, use.start = FALSE,
k = 2, ...)
Arguments
object an object representing a model of an appropriate class. This is used as
the initial model in the stepwise search.
scope defines the range of models examined in the stepwise search. This
should be either a single formula, or a list containing
components upper and lower, both formulae. See the details for how to
specify the formulae and how they are used.
scale used in the definition of the AIC statistic for selecting the models,
currently only for lm and aov models (see extractAIC for details).
direction the mode of stepwise search, can be one of "both", "backward",
or "forward", with a default of "both". If the scope argument is missing
the default for direction is "backward".
trace if positive, information is printed during the running of stepAIC. Larger
values may give more information on the fitting process.
keep a filter function whose input is a fitted model object and the
associated AIC statistic, and whose output is arbitrary.
Typically keep will select a subset of the components of the object and
return them. The default is not to keep anything.
steps the maximum number of steps to be considered. The default is 1000
(essentially as many as required). It is typically used to stop the process
early.
use.start if true the updated fits are done starting at the linear predictor for the
currently selected model. This may speed up the iterative calculations
for glm (and other fits), but it can also slow them down. Not used in R.
k the multiple of the number of degrees of freedom used for the penalty.
Only k = 2 gives the genuine AIC: k = log(n) is sometimes referred to
as BIC or SBC.
... any additional arguments to extractAIC. (None are currently used.)
Details
The set of models searched is determined by the scope argument. The right-hand-
side of its lower component is always included in the model, and right-hand-side of
the model is included in the upper component. If scope is a single formula, it
specifies the upper component, and the lower model is empty. If scope is missing,
the initial model is used as the upper model.
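As a hedged illustration of how scope bounds the search, the sketch below uses the built-in mtcars data; the choice of predictors and the object names (null_model, fwd, big_model, bwd) are assumptions made for the example only.

```r
library(MASS)

# Forward search: start from the intercept-only model and allow terms
# up to the upper formula to enter.
null_model <- lm(mpg ~ 1, data = mtcars)
fwd <- stepAIC(
  null_model,
  scope = list(lower = ~ 1, upper = ~ wt + hp + qsec + am),
  direction = "forward",
  trace = FALSE
)
formula(fwd)
fwd$anova  # terms added at each step and the resulting AIC

# Backward search in which wt can never be dropped, because it is
# part of the lower component of scope.
big_model <- lm(mpg ~ wt + hp + qsec + am, data = mtcars)
bwd <- stepAIC(
  big_model,
  scope = list(lower = ~ wt, upper = ~ wt + hp + qsec + am),
  direction = "backward",
  trace = FALSE
)
formula(bwd)
```

Without a scope argument, the forward search from null_model would have nothing to add, since the initial model is then taken as the upper model.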
There is a potential problem in using glm fits with a variable scale, as in that case
the deviance is not simply related to the maximized log-likelihood. The glm method
for extractAIC makes the appropriate adjustment for a gaussian family, but may
need to be amended for other cases. (The binomial and poisson families have
fixed scale by default and do not correspond to a particular maximum-likelihood
problem for variable scale.)
Where a conventional deviance exists (e.g. for lm, aov and glm fits) this is quoted in
the analysis of variance table: it is the unscaled deviance.
Value
Note
The model fitting must apply the models to the same dataset. This may be a
problem if there are missing values and an na.action other than na.fail is used (as
is the default in R). We suggest you remove the missing values first.
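A small sketch of the advice in this note, using the built-in airquality data (an assumption; any data frame with missing values would do). Removing incomplete rows first ensures that every candidate model is fitted to the same observations.

```r
library(MASS)

# airquality has missing values in some columns
colSums(is.na(airquality))

# Keep only the complete rows before fitting anything
aq <- na.omit(airquality)

fit <- lm(Ozone ~ ., data = aq)
sel <- stepAIC(fit, trace = FALSE)

# Both models are fitted to the same set of rows
nobs(fit) == nobs(sel)
```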
References
Examples
quine.hi <- aov(log(Days + 2.5) ~ .^4, quine)
quine.nxt <- update(quine.hi, . ~ . - Eth:Sex:Age:Lrn)
quine.stp <- stepAIC(quine.nxt,
scope = list(upper = ~Eth*Sex*Age*Lrn, lower = ~1),
trace = FALSE)
quine.stp$anova
example(birthwt)
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)
birthwt.step <- stepAIC(birthwt.glm, trace = FALSE)
birthwt.step$anova
birthwt.step2 <- stepAIC(birthwt.glm, ~ .^2 + I(scale(age)^2)
+ I(scale(lwt)^2), trace = FALSE)
birthwt.step2$anova
boot.stepAIC {bootStepAIC}
Description
Usage
boot.stepAIC(object, data, B = 100, alpha = 0.05, direction = "backward",
k = 2, verbose = FALSE, seed = 1L, ...)
Arguments
object an object representing a model of an appropriate class;
currently, "lm", "aov", "glm", "negbin", "polr", "survreg",
and "coxph" objects are supported.
data a data.frame or a matrix that contains the response variable and
covariates.
B the number of Bootstrap samples.
alpha the significance level.
direction the direction argument of stepAIC().
k the k argument of stepAIC().
verbose logical; if TRUE information about the evolution of the procedure is
printed in the screen.
seed numeric scalar denoting the seed used to create the Bootstrap samples.
... extra arguments to stepAIC(), e.g., scope.
Details
Step 1:
Simulate a new data-set taking a sample with replacement from the rows
of data.
Step 2:
Step 3:
Summarize the results by counting how many times (out of the B data-sets) each
variable was selected, how many times the estimated regression coefficient of
each variable (out of the times it was selected) was statistically significant at
significance level alpha, and how many times the estimated regression
coefficient of each variable (out of the times it was selected) changed sign (see
also Austin and Tu, 2004).
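The resampling idea can be sketched by hand with MASS alone. This is not the implementation of boot.stepAIC: it is a simplified illustration that only tracks how often each variable is selected, under the assumptions of a small B, the built-in mtcars data, and a hypothetical set of candidate predictors.

```r
library(MASS)

set.seed(1)  # arbitrary seed, for reproducibility
B <- 20      # few replicates to keep the sketch fast
full_formula <- mpg ~ wt + hp + qsec + drat + am
predictors <- all.vars(full_formula)[-1]

# One row per bootstrap sample, one column per candidate predictor
selected <- matrix(FALSE, nrow = B, ncol = length(predictors),
                   dimnames = list(NULL, predictors))

for (b in seq_len(B)) {
  # Resample the rows of the data with replacement
  boot_data <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  # Run the stepwise search on the resampled data
  fit <- lm(full_formula, data = boot_data)
  sel <- stepAIC(fit, direction = "backward", trace = FALSE)
  selected[b, ] <- predictors %in% attr(terms(sel), "term.labels")
}

# Percentage of bootstrap samples in which each variable was retained
round(100 * colMeans(selected), 1)
```

Variables retained in most samples are the ones whose selection is stable under resampling; the significance and sign-change counts described above would require inspecting the coefficients of each sel as well.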
Value
Author(s)
References
See Also
Examples
## lm() Example ##
n <- 350
x1 <- runif(n, -4, 4)
x2 <- runif(n, -4, 4)
x3 <- runif(n, -4, 4)
x4 <- runif(n, -4, 4)
x5 <- runif(n, -4, 4)
x6 <- runif(n, -4, 4)
x7 <- factor(sample(letters[1:3], n, rep = TRUE))
y <- 5 + 3 * x1 + 2 * x2 - 1.5 * x3 - 0.8 * x4 + rnorm(n, sd = 2.5)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7)
rm(n, x1, x2, x3, x4, x5, x6, x7, y)
#####################################################################
## glm() Example ##
n <- 200
x1 <- runif(n, -3, 3)
x2 <- runif(n, -3, 3)
x3 <- runif(n, -3, 3)
x4 <- runif(n, -3, 3)
x5 <- factor(sample(letters[1:2], n, rep = TRUE))
eta <- 0.1 + 1.6 * x1 - 2.5 * as.numeric(as.character(x5) == levels(x5)[1])
y1 <- rbinom(n, 1, plogis(eta))
y2 <- rbinom(n, 1, 0.6)
data <- data.frame(y1, y2, x1, x2, x3, x4, x5)
rm(n, x1, x2, x3, x4, x5, eta, y1, y2)
library(MASS)
First, choose a model and throw in every variable you think has an
impact on your dependent variable!
Like the one between Nick Cage movies and incidence of pool
drowning.
However …
data(mtcars)
summary(car_model <- lm(mpg ~., data = mtcars))
With our model fitted, we can now feed it into the stepwise function.
For the direction argument, you can choose between backward
and forward stepwise selection.
If you add trace = TRUE, R prints out all the steps.
The last line is the final model, which we assign to the step_car object.
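The call described above is not shown in the text; one plausible form is sketched here. The object names car_model and step_car come from the text, while the choice of direction = "backward" is an assumption.

```r
library(MASS)

car_model <- lm(mpg ~ ., data = mtcars)

# trace = TRUE prints every step of the search;
# the value returned is the final fitted model.
step_car <- stepAIC(car_model, direction = "backward", trace = TRUE)

summary(step_car)
```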
If $p=1$, we need at least $n=2$ points to uniquely fit a line. However, this
line gives no information on the vertical variation about it, hence $\sigma^2$ cannot be
estimated. Therefore, we need at least $n=3$ points, that
is, $n\geq p+2=3$.
If $p=2$, we need at least $n=3$ points to uniquely fit a plane. But again this
plane gives no information on the variation of the data about it and hence $\sigma^2$ cannot
be estimated. Therefore, we need $n\geq p+2=4$.
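The case $p=1$ can be checked numerically. The data values below are made up for illustration only.

```r
# p = 1 predictor and n = 2 points: the line passes through both points
x2 <- c(1, 2)
y2 <- c(1.5, 3.1)
fit2 <- lm(y2 ~ x2)
df.residual(fit2)  # 0: no degrees of freedom left to estimate sigma^2

# With a third point, n = p + 2 = 3 and the variance becomes estimable
x3 <- c(1, 2, 3)
y3 <- c(1.5, 3.1, 4.2)
fit3 <- lm(y3 ~ x3)
df.residual(fit3)  # 1
sigma(fit3)        # a finite estimate of the residual standard deviation
```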
$$\mathrm{BIC}(\text{model}) = \underbrace{-2\ell(\text{model})}_{\text{Model fitness}} + \underbrace{\mathrm{npar}(\text{model})\times\log(n)}_{\text{Complexity}}, \tag{3.1}$$
where $\ell(\text{model})$ is the log-likelihood of the model (how well the model
fits the data) and $\mathrm{npar}(\text{model})$ is the number of parameters
considered in the model (how complex the model is). In the case of a multiple
linear regression model with $p$ predictors, $\mathrm{npar}(\text{model})=p+2$. The AIC
replaces the $\log(n)$ factor by a $2$ in (3.1) so, compared with the BIC,
it penalizes the more complex models less. This is one of the reasons why
BIC is preferred by some practitioners for performing model comparison.
The BIC and AIC can be computed through the functions BIC and AIC. They take a
model as the input.
# Two models with different predictors
mod1 <- lm(medv ~ age + crim, data = Boston)
mod2 <- lm(medv ~ age + crim + lstat, data = Boston)
# BICs
BIC(mod1)
## [1] 3581.893
BIC(mod2) # Smaller -> better
## [1] 3300.841
# AICs
AIC(mod1)
## [1] 3564.987
AIC(mod2) # Smaller -> better
## [1] 3279.708
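As a check of (3.1), the BIC and AIC can be computed by hand from the log-likelihood and compared with the built-in functions. The model below, fitted on the built-in mtcars data, is an arbitrary choice made for the illustration.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)
p <- length(coef(fit)) - 1  # number of predictors (intercept excluded)
npar <- p + 2               # intercept, p slopes, and sigma^2
loglik <- as.numeric(logLik(fit))

bic_manual <- -2 * loglik + npar * log(n)
aic_manual <- -2 * loglik + npar * 2

all.equal(bic_manual, BIC(fit))
all.equal(aic_manual, AIC(fit))
```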
# With AIC
modAIC <- MASS::stepAIC(mod, k = 2)
## Start: AIC=-61.07
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
##
##
## Step: AIC=-61.07
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0026 1.8058 -63.031
## - Year 1 0.0048 1.8080 -62.998
## <none> 1.8032 -61.070
## - WinterRain 1 0.4585 2.2617 -56.952
## - HarvestRain 1 1.8063 3.6095 -44.331
## - AGST 1 3.3756 5.1788 -34.584
##
## Step: AIC=-63.03
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -63.031
## - WinterRain 1 0.4809 2.2867 -58.656
## - Year 1 0.9089 2.7147 -54.023
## - HarvestRain 1 1.8760 3.6818 -45.796
## - AGST 1 3.4428 5.2486 -36.222
# With BIC
modBIC <- MASS::stepAIC(mod, k = log(nrow(wine)))
## Start: AIC=-53.29
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
##
##
## Step: AIC=-53.29
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0026 1.8058 -56.551
## - Year 1 0.0048 1.8080 -56.519
## <none> 1.8032 -53.295
## - WinterRain 1 0.4585 2.2617 -50.473
## - HarvestRain 1 1.8063 3.6095 -37.852
## - AGST 1 3.3756 5.1788 -28.105
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
An explanation of what stepAIC did for modBIC:
At each step, stepAIC displayed information about the current value of the information
criterion. For example, the BIC at the first step was Step: AIC=-53.29 and then it
improved to Step: AIC=-56.55 in the second step.
The next model to move on to was decided by exploring the information criteria of the
different models resulting from adding or removing a predictor (depending on
the direction argument, explained later). For example, in the first step the model arising
from removing FrancePop had a BIC equal to -56.551.
The stepwise regression then proceeded by removing FrancePop, as it gave the lowest
BIC. When repeating the previous exploration, it was found that
removing <none> predictors was the best possible action.
The selected models modBIC and modAIC are equivalent to the modWine2 we
selected in Section 2.7.3 as the best model. This is a simple illustration that the
model selected by stepAIC is often a good starting point for further additions or
deletions of predictors.
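The direction argument of stepAIC controls the mode of the stepwise model
search:
"backward": removes predictors sequentially from the given model.
"forward": adds predictors sequentially to the given model.
"both" (the default when scope is given): at each step, predictors can be either added or removed.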
The next chunk of code shows how to exploit the direction argument,
and other options of stepAIC, with a modified version of the wine dataset. An
important warning is that in order to use direction = "forward" or direction =
"both", scope needs to be properly defined. The practical advice for model
selection is to run several of these three search modes and retain the model with
minimum BIC/AIC, being especially careful with the scope argument.
# Add an irrelevant predictor to the wine dataset
set.seed(123456)
wineNoise <- wine
n <- nrow(wineNoise)
wineNoise$noisePredictor <- rnorm(n)
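# Backward selection: removes predictors sequentially from the given model
# Full model with all the predictors; its formula matches the "Start" model
# printed in the output below
modAll <- lm(Price ~ ., data = wineNoise)
# Using the defaults from the full model essentially does backward selection,
# but allowing predictors that were removed to enter again at later steps
MASS::stepAIC(modAll, direction = "both", k = log(n))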
## Start: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop +
## noisePredictor
##
##
## Step: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop +
## noisePredictor
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0036 1.7977 -53.376
## - Year 1 0.0038 1.7979 -53.374
## - noisePredictor 1 0.0090 1.8032 -53.295
## <none> 1.7941 -50.135
## - WinterRain 1 0.4598 2.2539 -47.271
## - HarvestRain 1 1.7666 3.5607 -34.923
## - AGST 1 3.3658 5.1599 -24.908
##
## Step: AIC=-53.38
## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0081 1.8058 -56.551
## <none> 1.7977 -53.376
## - WinterRain 1 0.4771 2.2748 -50.317
## + FrancePop 1 0.0036 1.7941 -50.135
## - Year 1 0.9162 2.7139 -45.552
## - HarvestRain 1 1.8449 3.6426 -37.606
## - AGST 1 3.4234 5.2212 -27.885
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## + noisePredictor 1 0.0081 1.7977 -53.376
## + FrancePop 1 0.0026 1.8032 -53.295
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year WinterRain AGST HarvestRain
## 43.639042 -0.023848 0.001167 0.616392 -0.003861
stepAIC and friends (addterm and dropterm) compute a slightly different version
of the BIC/AIC than the BIC/AIC functions. Precisely, the BIC/AIC they report
come from the extractAIC function, which differs by an additive constant from
the output of BIC/AIC. This is not relevant for model comparison, since shifting the
BIC/AIC by a common constant does not change the lower-to-higher BIC/AIC
ordering of models. However, it is important to be aware of this fact so as
not to compare directly the output of stepAIC with that of BIC/AIC. The additive
constant (included in BIC/AIC but not in extractAIC) is $n(\log(2\pi)+1)+\log(n)$
for the BIC and $n(\log(2\pi)+1)+2$ for the AIC. The discrepancy arises from simplifying
the computation of the BIC/AIC for linear models and from counting $\hat\sigma^2$ as
an estimated parameter.
The following chunk of code illustrates the relation of the AIC reported in stepAIC,
the output of extractAIC, and the BIC/AIC reported by BIC/AIC.
# Same BICs, different scale
n <- nobs(modBIC)
extractAIC(modBIC, k = log(n))[2]
## [1] -56.55135
BIC(modBIC)
## [1] 23.36717
# Observe that MASS::stepAIC(mod, k = log(nrow(wine))) returned as final BIC
# the one given by extractAIC(), not by BIC()! But both are equivalent, they
# just differ in a constant shift
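The size of the constant shift can be verified numerically for a linear model. The sketch below uses an arbitrary model on the built-in mtcars data rather than the wine data, which is an assumption made so that the example is self-contained.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)

# Part of the additive constant shared by the AIC and the BIC
shift <- n * (log(2 * pi) + 1)

# AIC() minus extractAIC() with k = 2 equals shift + 2
all.equal(AIC(fit) - extractAIC(fit, k = 2)[2], shift + 2)

# BIC() minus extractAIC() with k = log(n) equals shift + log(n)
all.equal(BIC(fit) - extractAIC(fit, k = log(n))[2], shift + log(n))
```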
# Confidence intervals
confint(modBIC)
## 2.5 % 97.5 %
## (Intercept) 26.384649126 46.29764088
## crim -0.172817670 -0.04400902
## zn 0.019275889 0.07241397
## chas 1.040324913 4.39710769
## nox -24.321990312 -10.43005655
## rm 3.003258393 4.59989929
## dis -1.857631161 -1.12779176
## rad 0.175037411 0.42417950
## tax -0.018403857 -0.00515209
## ptratio -1.200109823 -0.69293932
## black 0.004037216 0.01454447
## lstat -0.615731781 -0.42937513
Note how the $R^2_{\text{Adj}}$ has slightly increased with respect to the full model
and how all the predictors are significant. Note also that modBIC and modAIC are the
same.
Using modBIC, we can quantify the influence of the predictor variables on the
housing prices (Q1) and we can conclude that, in the final model (Q2) and with
significance level $\alpha=0.05$:
zn, chas, rm, rad, and black have a significantly positive influence on medv;
crim, nox, dis, tax, ptratio, and lstat have a significantly negative influence
on medv.
The functions MASS::addterm and MASS::dropterm allow adding and removing all
individual predictors to and from a given model, and report the BICs / AICs of the
possible combinations. Check that:
For the second point, recall that scope must specify the maximal model or formula.
However, be careful because if using the formula approach, addterm(modBIC, scope =
medv ~ ., k = log(nobs(modBIC))) will understand that . refers to all the
predictors in modBIC, not in the Boston dataset, and will return an error.
Calling addterm(modBIC, scope = medv ~ . + indus + age, k =
log(nobs(modBIC))) gives the required result in terms of a formula, at the expense of
manually adding the remaining predictors.
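A short sketch of both functions on the Boston data from MASS follows. The starting model (crim, rm and lstat) and the two candidate predictors added through the formula are assumptions chosen for illustration, not the modBIC of the text.

```r
library(MASS)

mod <- lm(medv ~ crim + rm + lstat, data = Boston)
n <- nobs(mod)

# BIC of the current model and of each model obtained by dropping one predictor
dropped <- dropterm(mod, k = log(n))
dropped

# BIC of the current model and of each model obtained by adding one of the
# predictors listed explicitly in the scope formula; the "." stands for
# the predictors already in mod
added <- addterm(mod, scope = medv ~ . + indus + age, k = log(n))
added
```

In both tables the row labelled <none> corresponds to the current model, and the column labelled AIC holds the BIC because k = log(n) was supplied.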
Figure 3.3: Comparison of BIC and AIC on the model (2.26) fitted with data
generated by (2.25). The number of predictors $p$ ranges
from $1$ to $198$, with only the first two predictors being significant.
The $M=500$ curves for each color arise from $M$ simulated datasets of
sample size $n=200$. The thicker curves are the mean of each color’s
curves.
Another big difference between the AIC and BIC, which is indeed behind the
behaviors seen in Figure 3.3, is the consistency of the BIC in performing model
selection. In simple terms, “consistency” means that, if enough data is provided,
the BIC is guaranteed to identify the true data-generating model among a list of
candidate models if the true model is included in the list. Mathematically, it means
that, given a collection of models $M_0,M_1,\ldots,M_m$, where $M_0$ is the
generating model of a sample of size $n$, then
$$\mathbb{P}\left[\arg\min_{k=0,\ldots,m}\mathrm{BIC}(\hat{M}_k)=0\right]\to1\quad\text{as } n\to\infty, \tag{3.2}$$
where $\hat{M}_k$ represents the $M_k$ model fitted with a sample of size $n$ generated
from $M_0$. Note that, despite being a nice theoretical result, its application
may be unrealistic in practice, as most likely the true model is nonlinear or not
present in the list of candidate models we examine.
The AIC is inconsistent, in the sense that (3.2) is not true if $\mathrm{BIC}$ is replaced
by $\mathrm{AIC}$. Indeed, this result can be seen as a consequence of the asymptotic
equivalence of model selection by AIC and leave-one-out cross-validation,
and the inconsistency of the latter. This is beautifully described in the paper
by Shao (1993), whose abstract is given in Figure 3.4. The paper made a shocking
discovery in terms of what is the required modification to induce consistency in
model selection by cross-validation.
Figure 3.4: The abstract of Jun Shao’s Linear model selection by cross-
validation (Shao 1993).
and $\varepsilon\sim\mathcal{N}(0,1)$. Only the first two predictors are relevant, the last
three are “garbage” predictors. For a given sample, model selection is performed
by selecting among the $2^5$ possible fitted models the ones with minimum AIC,
BIC, and LOOCV. The experiment is repeated $M=500$ times for sample
sizes $n=2^\ell$, $\ell=3,\ldots,12$, and the estimated probability of
selecting the correct model (the one only involving $X_1$ and $X_2$) is
displayed in Figure 3.5. The figure evidences empirically several interesting
results:
LOOCV and AIC are asymptotically equivalent. For large $n$, they tend to select the
same model and hence their estimated probabilities of selecting the true model are almost
equal. For small $n$, there are significant differences between them.
The BIC is consistent in selecting the true model, and its probability of doing so quickly
approaches $1$, as anticipated by (3.2).
The AIC and LOOCV are inconsistent in selecting the true model. Despite the sample
size $n$ doubling at each step, their probability of recovering the true model gets stuck
at about $0.60$.
Even for moderate $n$’s, the probability of recovering the true model by BIC quickly
outperforms those of AIC/LOOCV.
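A reduced version of this experiment can be sketched as follows. It is only an approximation of the setup described above: the regression coefficients, the standard normal predictors, the three sample sizes, and the small number of repetitions M are all assumptions made to keep the run short, and LOOCV is left out.

```r
set.seed(42)  # arbitrary seed, for reproducibility
M <- 50       # repetitions (the text uses M = 500)
p <- 5
pred_names <- paste0("X", seq_len(p))

# All 2^5 subsets of the predictors, one per row of a logical matrix
subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))
true_subset <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

prob_correct <- function(n) {
  hits <- c(AIC = 0, BIC = 0)
  for (m in seq_len(M)) {
    X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, pred_names))
    # Assumed true model: only X1 and X2 are relevant
    d <- data.frame(y = 0.5 + X[, 1] - X[, 2] + rnorm(n), X)
    # Exhaustive search: AIC and BIC of every candidate model
    crit <- t(apply(subsets, 1, function(s) {
      rhs <- if (any(s)) paste(pred_names[s], collapse = " + ") else "1"
      fit <- lm(as.formula(paste("y ~", rhs)), data = d)
      c(AIC(fit), BIC(fit))
    }))
    hits["AIC"] <- hits["AIC"] + all(subsets[which.min(crit[, 1]), ] == true_subset)
    hits["BIC"] <- hits["BIC"] + all(subsets[which.min(crit[, 2]), ] == true_subset)
  }
  hits / M
}

# Estimated probability of selecting the true model for each sample size
probs <- sapply(c(16, 64, 256), prob_correct)
probs
```

Each column of probs corresponds to one sample size; with enough repetitions, the BIC row would be expected to approach 1 as n grows while the AIC row levels off, in line with the results listed above.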
a. Select the best model according to the $R^2_{\text{Adj}}$ and investigate its consistency in
model selection.
b. Add the LOOCV criterion in order to fully replicate Figure 3.5. Hint: you may want to
adapt (4.25) to your needs in order to reduce computation time.
Investigate what happens with the probability of selecting the true model using BIC
and AIC if the exhaustive search is replaced by a stepwise selection. Precisely,
do:
References
Schwarz, G. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2):
461–64. https://doi.org/10.1214/aos/1176344136.
Shao, J. 1993. “Linear Model Selection by Cross-Validation.” Journal of the American
Statistical Association 88 (422): 486–94. https://doi.org/10.2307/2290328.