Variable Selection Using Random Forests
1 Introduction
In many empirical analyses a crucial problem is the presence in the data of a set of variables that do not contribute significantly to explaining the analyzed phenomenon, but are capable of creating random noise which prevents the main effects and the relevant predictors from being distinguished. In this context proper methods are necessary in order to identify the variables that are predictors of a given outcome. Many automatic variable selection techniques have been proposed in the literature, for example backward or forward stepwise regression (see Miller (1984) and Hocking (1976)) or the recent stepwise bootstrap method of Austin and Tu (2004). These methods are for the most part based on assumptions about the functional form of the models or on the distribution of the residuals. These hypotheses can be dangerously strong in the presence of one or more of the following situations: (i) a large number of observed variables is available, (ii) collinearity is present, (iii) the data generating process is complex, (iv) the sample size is small with respect to all these conditions.
Data analysis can be approached basically from two points of view: data modeling and algorithmic modeling (Breiman (2001b)). The former assumes that data are generated by a given stochastic model, while the latter treats the data mechanism as unknown, a black box whose insides are complex and often partly unknowable. The aim of the present paper is to propose a variable selection method based on the algorithmic approach and to examine its performance on a particular dataset. In the mid-1980s two powerful new algorithms for fitting data were developed, neural nets and decision trees, and they were applied in a wide range of fields, from physics to medicine to economics, even if in some applications (see e.g. Ennis et al. (1998)) their performance was poorer than that of simpler models like linear logistic regression. The main shortcomings of these two methods were overfitting and instability, the latter with particular reference to decision trees.
While overfitting has long been discussed and many techniques are available to overcome the problem (stopping rules, cross-validation, pruning, ...), little has been done to handle instability, a problem occurring when there are many different models with similar predictive accuracy and a slight perturbation in the data or in the model construction can cause a jump from one model to another, close in terms of error but distant in terms of meaning (Breiman (1996a)). The proposal of Random Forests (Breiman (2001a)), a method for classification or regression based on the repeated growing of trees through the introduction of a random perturbation, tries to manage these situations by averaging the outcomes of a great number of models fitted to the same dataset. As a by-product of this technique, the identification of the variables which are important in a great number of models provides suggestions in terms of variable selection. The proposal of this paper is to use the technique of Random Forests (RF) as a tool for variable selection, and a procedure is introduced and evaluated on a real dataset. The paper is organized as follows: in section 2 the technique of RF is briefly recalled, confining the attention to the case of classification; in section 3 a variable selection method based on RF is proposed; the application to a real dataset is reported in section 4; conclusive remarks follow in section 5.
2 Random Forests
2.1 Variable Importance Measures
The main drawback of using a set of random classifiers lies in its explanatory power: predictions are the outcome of a black box where it is impossible to distinguish the contribution of the single predictors. With RF this problem is even more crucial, because the method performs very well especially in the presence of a small number of informative predictors hidden among a great number of noise variables. To overcome this weakness the following four measures of variable importance are available in order to identify the informative predictors and exclude the others (Breiman (2002)):
then computed and compared with $e$. The $M1$ measure for the $h$-th variable is given by
$$M1_h = \max\{0,\ e_h - e\}.$$
$$M3_h = \max\{0,\ \#[mg(y,x) < mg_h(y,x)] - \#[mg(y,x) > mg_h(y,x)]\}.$$
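To make the permutation-based measure concrete, the following is a minimal sketch in Python with scikit-learn (an assumption: the original analyses rely on Breiman's own implementation, and $M1$ is defined on the out-of-bag samples of each tree, while this sketch permutes each predictor on a held-out set as an approximation; the synthetic data are purely illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: a few informative predictors hidden among noise variables.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
e = 1.0 - rf.score(X_te, y_te)                     # baseline error rate

rng = np.random.default_rng(0)
m1 = np.empty(X.shape[1])
for h in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, h] = rng.permutation(X_perm[:, h])   # destroy the h-th predictor
    e_h = 1.0 - rf.score(X_perm, y_te)             # error with x_h permuted
    m1[h] = max(0.0, e_h - e)                      # M1_h = max{0, e_h - e}

print(np.argsort(m1)[::-1][:5])                    # five most important variables

Applying the same loop tree by tree, restricted to the out-of-bag cases of each tree, would reproduce the original definition more closely.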
a unique group and the predictors appear like outliers. A refinement of this proposal, which provides a useful graphical representation, can be obtained by observing that the four measures are often correlated, and this allows a dimensional reduction of the space where the variables are defined. With a simple Principal Component Analysis (PCA) the first two factors can be selected and a scatterplot of the variables can be drawn in the two-dimensional factorial space, where the cluster of noise variables and the "outliers" can be recognized. The procedure described above, based on the calculation of the distances from an average centroid, can also be applied in this context and helps in deciding which points have to be effectively considered outliers¹.
Simulation studies show that these methods compare very favorably with a forward stepwise logistic regression, even when the real data generating mechanism is a logistic one. Their major advantage lies in a considerably smaller number of wrongly identified predictors. The main problem of these methods lies in the definition of the threshold between predictive and non-predictive variables. To help in deciding whether this threshold exists and where it could be placed, a useful graphical representation is a sort of scree plot of the distances from the centroid, where the actual existence of two groups of variables, and the positioning of the threshold between them, can be easily recognized.
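As an illustration of the screening just described, here is a minimal sketch in Python with scikit-learn (an assumption; the matrix imp, which is supposed to collect the four importance measures with one row per variable, is replaced by a random placeholder, and the average-distance threshold is only one possible choice):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the real p x 4 matrix of the four importance measures
# (one row per variable); in practice it comes from the fitted forest.
imp = np.random.default_rng(0).random((20, 4))

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(imp))

# Distance of each variable from the centroid of the point cloud; after the
# PCA centring the centroid is the origin, so the subtraction only mirrors
# the description in the text. Noise variables cluster near the centroid,
# candidate predictors show up as outliers.
dist = np.linalg.norm(scores - scores.mean(axis=0), axis=1)

# Scree-plot-like ordering of the distances, with the average distance as a
# tentative threshold between predictive and non-predictive variables;
# pca.explained_variance_ratio_ could be used to weight the two axes.
order = np.argsort(dist)[::-1]
selected = order[dist[order] > dist.mean()]
print("candidate predictors (variable indices):", selected)

Plotting dist[order] against its rank gives the scree-plot representation shown in the left panel of Figure 1.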
4 Case Study
¹ In this case the distance function can take into account the relative importance of the two factors, introducing a weight given, for example, by the fraction of total variance accounted for by each factor or by the corresponding eigenvalue. Actually this procedure could be redundant, as the space rotation implied by the PCA already involves an overdispersion of the points along the more informative dimensions.
Fig. 1. Left: Scree plots of the variable distances from the centroid (thresholds: average distances). Right: Scatterplot of the variables in the space of the first two principal components.
5 Concluding Remarks
In this paper a variable selection method based on Breiman's Random Forests is proposed and applied to a real dataset of patients affected by acute peptic ulcers, in order to identify risk factors for recurrence of hemorrhage. The main advantage of selecting the relevant variables through an algorithmic modeling technique is the independence from any assumption about the relationships among variables and about the distribution of the errors. After having selected the predictors, a model could be developed under some given hypotheses, and this outlines Random Forests as a technique for preliminary analysis and variable selection, and not only for classification or regression, which are its main purposes. The results on real data confirm what was expected on the basis of simulation studies: the RF-based variable selection identifies a smaller number of relevant predictors and allows the construction of a more parsimonious model with predictive performance similar to that of the logistic model selected by the AIC stepwise procedure. Further research is currently exploring the advantages deriving from the combination of measures coming from model-based prediction methods and algorithmic modeling techniques. Moreover, simulation studies have highlighted the presence of a bias effect in a commonly used algorithmic variable importance measure; an adjustment strategy is under development (Sandri and Zuccolotto (2006)).
References
AUSTIN, P. and TU, J. (2004): Bootstrap methods for developing predictive mod-
els. The American Statistician, 58, 131-137.
BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R.A. and STONE, C.J. (1984): Clas-
sification and Regression Trees. Chapman & Hall, London.
BREIMAN, L. (1996a): Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350-2383.
BREIMAN, L. (1996b): Bagging predictors. Machine Learning, 24, 123-140.
BREIMAN, L. (2001a): Random Forests. Machine Learning, 45, 5-32.
BREIMAN, L. (2001b): Statistical modeling: the two cultures. Statistical Science,
16, 199-231.
BREIMAN, L. (2002): Manual on setting up, using, and understanding Random Forests v3.1. Technical Report, http://oz.berkeley.edu/users/breiman.
DIETTERICH, T. (2000): An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Machine Learning, 40, 139-157.
ENNIS, M., HINTON, G., NAYLOR, D., REVOW, M. and TIBSHIRANI, R. (1998): A comparison of statistical learning methods on the GUSTO database. Statistics in Medicine, 17, 2501-2508.
GUGLIELMI, A., RUZZENENTE, A., SANDRI, M., KIND, R., LOMBARDO, F.,
RODELLA, L., CATALANO, F., DE MANZONI, G. and CORDIANO, C.
(2002): Risk assessment and prediction of rebleeding in bleeding gastroduode-
nal ulcer. Endoscopy, 34, 771-779.
HOCKING, R.R. (1976): The analysis and selection of variables in linear regression. Biometrics, 32, 1-49.
MILLER, A.J. (1984): Selection of subsets of regression variables. Journal of the
Royal Statistical Society, Series A, 147, 389-425.
SANDRI, M. and ZUCCOLOTTO, P. (2004): Classification with Random Forests: the theoretical framework. Rapporto di Ricerca del Dipartimento Metodi Quantitativi, Università degli Studi di Brescia, 235.
SANDRI, M. and ZUCCOLOTTO, P. (2006): Analysis of a bias effect on a tree-
based variable importance measure. Evaluation of an empirical adjustment
strategy. Manuscript.