
Variable Selection Using Random Forests

Marco Sandri and Paola Zuccolotto

Dipartimento Metodi Quantitativi, Università di Brescia
c.da S. Chiara, 50 - 25122 Brescia, Italy
[email protected]

Abstract. One of the main topics in the development of predictive models is the identification of variables which are predictors of a given outcome. Automated model selection methods, such as backward or forward stepwise regression, are classical solutions to this problem, but are generally based on strong assumptions about the functional form of the model or the distribution of residuals. In this paper an alternative selection method, based on the technique of Random Forests, is proposed in the context of classification, with an application to a real dataset.

1 Introduction
In many empirical analyses a crucial problem is the presence in the data of a set of variables that do not significantly contribute to explaining the analyzed phenomenon, but are capable of creating random noise which prevents the main effects and the relevant predictors from being distinguished. In this context proper methods are necessary in order to identify variables that are predictors of a given outcome. Many automatic variable selection techniques have been proposed in the literature, for example backward or forward stepwise regression (see Miller (1984) and Hocking (1976)) or the recent stepwise bootstrap method of Austin and Tu (2004). These methods are for the most part based on assumptions about the functional form of the models or on the distribution of residuals. These hypotheses can be dangerously strong in the presence of one or more of the following situations: (i) a large number of observed variables is available, (ii) collinearity is present, (iii) the data generating process is complex, (iv) the sample size is small with reference to all these conditions. Data analysis can basically be approached from two points of view: data modeling and algorithmic modeling (Breiman (2001b)). The former assumes that data are generated by a given stochastic model, while the latter treats the data mechanism as unknown, a black box whose insides are complex and often partly unknowable. The aim of the present paper is to propose a variable selection method based on the algorithmic approach and to examine its performance on a particular dataset. In the mid-1980s two powerful new algorithms for fitting data were developed, neural nets and decision trees, and were applied in a wide range of fields, from physics, to medicine, to economics, even if in some applications (see e.g. Ennis et al. (1998)) their performance was poorer than that of simpler models like linear logistic regression. The main shortcomings of these two methods were overfitting and instability, the latter with particular reference to decision trees.
While overfitting has been long discussed and many techniques are available to overcome the problem (stopping rules, cross-validation, pruning, ...), little has been done to handle instability, a problem occurring when there are many different models with similar predictive accuracy and a slight perturbation in the data or in the model construction can cause a skip from one model to another, close in terms of error, but distant in terms of meaning (Breiman (1996a)). The proposal of Random Forests (Breiman (2001a)), a method for classification or regression based on the repeated growing of trees through the introduction of a random perturbation, tries to manage these situations by averaging the outcome of a great number of models fitted to the same dataset. As a by-product of this technique, the identification of variables which are important in a great number of models provides suggestions in terms of variable selection. The proposal of this paper is to use the technique of Random Forests (RF) as a tool for variable selection, and a procedure is introduced and evaluated on a real dataset. The paper is organized as follows: in section 2 the technique of RF is briefly recalled, confining the attention to the case of classification; in section 3 a variable selection method based on RF is proposed; the application to a real dataset is reported in section 4; conclusive remarks follow in section 5.

2 Random Forests

A population is partitioned into two or more groups, according to some qualitative feature. It follows that each individual in the population belongs to (only) one group. The information about the group is contained in the categorical variable Y, while relevant further information is collected in a set of exogenous variables X, always known, which is assumed to somewhat affect Y. Given a random sample S = {(y_1, x_1); ...; (y_n, x_n)}, several statistical techniques are available in order to determine an operative rule h(x), called classifier, used to assign to one group an individual of the population, not contained in the sample, for which only the exogenous variables x_{n+1} are known. A random classifier h(x, θ) is a classifier whose prediction about y depends, besides on the input vector x, on a random vector θ drawn from a known distribution Θ. Given an i.i.d. sequence {θ_k} = {θ_1, θ_2, ..., θ_k, ...} of random vectors from the known distribution Θ, a Random Forest RF(x, {θ_k}) is itself a random classifier, consisting of a sequence of random classifiers {h(x, θ_1), h(x, θ_2), ..., h(x, θ_k), ...}, each predicting a value for y at input x. The RF prediction for y is expressed in terms of the probability of Y assuming the value y, Pr{Y = y}. By definition a RF is composed of an infinite number of classifiers, but from an operational point of view the term is used to indicate a finite set of classifiers {h(x, θ_1), h(x, θ_2), ..., h(x, θ_K)}. The K-set's prediction for y corresponds to the prediction whose frequency exceeds a given threshold^. Asymptotic results have been derived in order to know the behavior of the set as the number of classifiers increases. Limiting laws and statistical features of RF have been developed by Breiman (2001a) and a detailed explanation can be found in Sandri and Zuccolotto (2004). The theory of RF is quite general and can be applied to several kinds of classifiers and randomizations: examples are already present in the literature, for instance the bagging technique of Breiman (1996b) or the random split selection of Dietterich (2000). Moreover other well-known techniques, like the bootstrap itself, although introduced in different contexts, can be led back to the RF framework. Nevertheless, by now the methodology called Random Forests is used uniquely with reference to its original formulation, due to Breiman (2001a), which uses CART-structured (Classification And Regression Trees, Breiman et al. (1984)) classifiers. RF with randomly selected inputs are sequences of trees grown by selecting at random, at each node, a small group of F input variables to split on. This procedure is often used in tandem with bagging (Breiman (1996b)), that is with a random selection of a subsample of the original training set for each tree. The trees obtained in this way are a RF, that is a K-set of random classifiers {h(x, θ_1), h(x, θ_2), ..., h(x, θ_K)}, where the vectors θ_i denote the randomization injected by the subsample drawing and by the selection of the F variables at each node.

^ In the standard case, the K-set's prediction for y corresponds to the most voted prediction, but a generalization is needed, as sometimes real datasets are characterized by extremely unbalanced class frequencies, so that the prediction rule of the RF has to be changed to other than majority vote. The optimal cutoff value can be determined for example with the usual method based on the joint maximization of sensitivity and specificity.
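As an illustration of this construction (a minimal sketch, not the authors' implementation, assuming scikit-learn's RandomForestClassifier as a stand-in for Breiman's original program), the vote frequency across trees estimates Pr{Y = y} and can be compared with a cutoff other than the majority-vote threshold of 0.5, as discussed in the footnote above:

```python
# Minimal sketch (not from the paper): a forest as a finite set of randomized
# trees whose vote frequency estimates Pr{Y = y}; the class is assigned by
# comparing that frequency with a chosen cutoff rather than by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with unbalanced classes (an assumption for illustration only).
X, y = make_classification(n_samples=400, n_features=20, weights=[0.85, 0.15],
                           random_state=1)
rf = RandomForestClassifier(n_estimators=200,
                            max_features=4,      # F inputs tried at each node
                            bootstrap=True,      # bagging of the training set
                            random_state=1).fit(X, y)

# Fraction of trees voting for class 1 at each point (estimates Pr{Y = 1}).
votes = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)

cutoff = 0.30                                    # a threshold below majority vote
y_hat = (votes >= cutoff).astype(int)
```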

2.1 Variable Importance Measures

The main drawback of using a set of random classifiers lies in its explanatory power: predictions are the outcome of a black box where it is impossible to distinguish the contribution of the single predictors. With RF this problem is even more crucial, because the method performs very well especially in presence of a small number of informative predictors hidden among a great number of noise variables. To overcome this weakness the following four measures of variable importance are available in order to identify the informative predictors and exclude the others (Breiman (2002)); a computational sketch of two of these measures is given after the list:

• Measure 1: at each tree of the RF all the values of the h-th variable are randomly permuted and new classifications are obtained with this new dataset, over only those individuals who have not contributed to the growing of the tree. At the end a new misclassification error rate e_h is computed and compared with the error rate e obtained on the original (non-permuted) data. The M1 measure for the h-th variable is given by

  M1_h = max{0; e_h − e} .

• Measure 2: for an individual (y, x) the margin function mg(y, x) is defined as a measure of the extent to which the proportion of correct classifications exceeds the proportion of the most voted incorrect classifications. If at each tree all the values of the h-th variable are randomly permuted, new margins mg_h(y, x) can be calculated over only those trees which have not been grown with that subject. The M2 measure of importance is given by the average lowering of the margin across all cases:

  M2_h = max{0; avg_S[mg(y, x) − mg_h(y, x)]} .

• Measure 3: in the framework just described for M2, the M3 measure is given by the difference between the number of lowered and raised margins:

  M3_h = max{0; #[mg_h(y, x) < mg(y, x)] − #[mg_h(y, x) > mg(y, x)]} .

• Measure 4: at each node z in every tree only a small number of variables is randomly chosen to split on, relying on some splitting criterion given by a heterogeneity index such as the Gini index or the Shannon entropy. Let d(h, z) be the decrease in the heterogeneity index allowed by variable X_h at node z; then X_h is used to split at node z if d(h, z) > d(w, z) for all variables X_w randomly chosen at node z. The M4 measure is calculated as the sum of all decreases in the RF due to the h-th variable, divided by the number K of trees:

  M4_h = (1/K) Σ_z d(h, z) I(h, z)

where I(h, z) is the indicator function equal to 1 if the h-th variable is used to split at node z and 0 otherwise.
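As an illustration (a minimal sketch, not the authors' implementation), rough analogues of M1 and M4 can be obtained with scikit-learn: permutation importance computed on a held-out set stands in for the out-of-bag permutation used by M1, and the mean decrease in impurity (feature_importances_, which scikit-learn normalizes to sum to one, unlike the raw sum of decreases divided by the number of trees) stands in for M4.

```python
# Sketch under the assumptions stated above; synthetic data for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           n_redundant=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            bootstrap=True, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

# M4-like measure: impurity decrease accumulated over all nodes, averaged over trees.
m4 = rf.feature_importances_

# M1-like measure: increase in misclassification error after permuting each
# variable, truncated at zero as in M1_h = max{0; e_h - e}.
perm = permutation_importance(rf, X_te, y_te, n_repeats=20, scoring="accuracy",
                              random_state=0)
m1 = np.maximum(0.0, perm.importances_mean)  # accuracy drop = error increase
```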

3 Variable Selection Using Random Forests

In this paper the possible use of RF as a method for variable selection is emphasized, relying on the above-mentioned four importance measures. A selection procedure can be defined, observing that the exogenous variables described by the four measures can be considered as points in a four-dimensional space, with the following steps: (1) calculate a four-dimensional centroid with coordinates given by an average (or a median) of the four measures; (2) calculate the distance of each point-variable from the centroid and arrange the calculated distances in non-increasing order; (3) select the variables whose distance from the centroid exceeds a given threshold, for example the average distance. This simple method is often quite effective, because the noise variables represented in the four-dimensional space tend to cluster together in a single group and the predictors appear as outliers. A refinement of this proposal, which provides a useful graphical representation, can be obtained by observing that the four measures are often correlated, and this allows a dimensional reduction of the space where the variables are defined. With a simple Principal Component Analysis (PCA) the first two factors can be selected and a scatterplot of the variables can be represented in the two-dimensional factorial space, where the cluster of noise variables and the "outliers" can be recognized. The above-described procedure based on the calculation of the distances from an average centroid can be applied in this context as well and helps in deciding which points have to be effectively considered outliers^. Simulation studies show that these methods compare very favorably with a forward stepwise logistic regression, even when the real data generating mechanism is a logistic one. Their major advantage lies in a sensibly smaller number of wrongly identified predictors. The main problem of these methods consists in the definition of the threshold between predictive and non-predictive variables. To help deciding if this threshold exists and where it could be placed, a useful graphical representation could be a sort of scree plot of the distances from the centroid, where the actual existence of two groups of variables, and the positioning of the threshold between them, can be easily recognized.

^ In this case the distance function can take into consideration the importance of the two factors, and a weight can be introduced, given for example by the fraction of total variance accounted for by each factor or by the corresponding eigenvalue. Actually this procedure could be redundant, as the space rotation implied by the PCA already involves an overdispersion of points along the more informative dimensions.
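A minimal sketch of the selection steps just described (assuming the four measures have already been computed and stacked into a matrix imp with one row per variable; standardizing the measures before computing distances is an extra assumption, not stated above):

```python
# Sketch: centroid-distance selection over the four importance measures,
# with an optional PCA refinement to the first two principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def select_by_centroid(imp, use_pca=False):
    """Return indices of variables whose distance from the centroid exceeds
    the average distance, sorted by decreasing distance, plus all distances."""
    Z = StandardScaler().fit_transform(imp)      # put the measures on comparable scales
    if use_pca:                                  # two-dimensional refinement
        Z = PCA(n_components=2).fit_transform(Z)
    centroid = Z.mean(axis=0)                    # step 1: centroid (average of measures)
    dist = np.linalg.norm(Z - centroid, axis=1)  # step 2: distances from the centroid
    order = np.argsort(dist)[::-1]               # ... in non-increasing order
    threshold = dist.mean()                      # step 3: e.g. the average distance
    return [i for i in order if dist[i] > threshold], dist
```

Plotting dist sorted in decreasing order gives the scree plot of distances mentioned above, from which the existence and position of the threshold can be judged by eye.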

4 Case Study

A prospective study was conducted from January 1995 to December 1998 by the First Department of General Surgery (Ospedale Maggiore di Borgo Trento, Verona, Italy) in patients affected by acute peptic ulcer who underwent endoscopic examination and were treated with a particular injection therapy. The aims of the study were to identify risk factors for recurrence of hemorrhage, as early prediction and treatment of rebleeding would improve the overall outcome of the therapy. The dataset consists of 499 cases, observed according to 32 exogenous variables related to patient history (gender, age, bleeding at home or during hospitalization, previous peptic ulcer disease, previous gastrointestinal hemorrhage, intake of nonsteroidal anti-inflammatory drugs, intake of anticoagulant drugs, associated diseases, recent - within 30 days - or past - more than 30 days - surgical operations), to the magnitude of bleeding (symptoms: haematemesis, coffee-ground vomit, melena, anemia; systolic blood pressure, heart rate, hypovolemic shock, hematocrit and hemoglobin level, units of blood transfused), and to endoscopic state (number, size, location of peptic ulcers, Forrest classification, presence of gastritis or duodenitis). The values of all the variables are classified into categories according to medical suggestions; we think that the use of the raw data could allow a more detailed analysis. The results were presented in a paper (Guglielmi et al. (2002)) where a logistic regression, with variables selected relying both on statistical evidence and on medical experience, was able to provide an (in-sample) 24% misclassification error with sensitivity and specificity equal to 76%^. In this paper two logistic regressions are fitted to the same data, with variables selected respectively by an AIC stepwise procedure^ (Model A) and by our RF-based method (Model B).
The AIC stepwise variable selection method identifies nine relevant predictors^, while using the RF procedure eight predictors are selected^. In both cases the resulting predictors have been judged reasonable on the basis of medical experience. In the left part of Figure 1 the scree plots of variable distances from the centroid are represented for three approaches (the basic method in the four-dimensional space, the refinement based on the PCA of the four measures with Euclidean distance or weighted Euclidean distance), while the right part of Figure 1 shows the two-dimensional scatterplot of the variables in the first two principal components space, with a virtual line separating the outlier variables selected as predictors.
In order to evaluate the performance of the two models, a cross-validation study has been carried out with validation sets of size 125 (25% of the sample) and r = 1000 repeated data splittings. The estimated probabilities of the two models are used to classify a patient as being or not being at risk of rebleeding, according to a cutoff point determined by minimizing the absolute difference between sensitivity and specificity in each validation set. Results are reported in Table 1, where the corresponding in-sample statistics are also shown.
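The evaluation scheme can be sketched as follows (not the authors' code: a scikit-learn logistic regression on already-selected variables is used as a stand-in for Models A and B, and X, y are assumed to be numeric arrays); the cutoff is chosen in each validation set by minimizing the absolute difference between sensitivity and specificity.

```python
# Sketch: r repeated 75/25 splits; in each split the model is refitted on the
# training part and evaluated on the validation part at the chosen cutoff.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

def sens_spec(y_true, y_hat):
    tp = np.sum((y_true == 1) & (y_hat == 1)); fn = np.sum((y_true == 1) & (y_hat == 0))
    tn = np.sum((y_true == 0) & (y_hat == 0)); fp = np.sum((y_true == 0) & (y_hat == 1))
    return tp / (tp + fn), tn / (tn + fp)

def repeated_validation(X, y, n_splits=1000, test_size=0.25, seed=0):
    errors = []
    for tr, te in ShuffleSplit(n_splits=n_splits, test_size=test_size,
                               random_state=seed).split(X):
        p = LogisticRegression(max_iter=1000).fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]
        # cutoff minimizing |sensitivity - specificity| on the validation set
        grid = np.linspace(0.01, 0.99, 99)
        gaps = []
        for c in grid:
            s, sp = sens_spec(y[te], (p >= c).astype(int))
            gaps.append(abs(s - sp))
        cutoff = grid[int(np.argmin(gaps))]
        errors.append(np.mean((p >= cutoff).astype(int) != y[te]))
    return np.percentile(errors, [25, 50, 75])   # quartiles as in Table 1
```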
The two models exhibit a substantially equal goodness-of-fit and also have a high agreement rate (in the in-sample analysis 91.58% of the individuals are classified in the same class by the two models).

^ The predictors included in the model were: associated diseases/liver cirrhosis (livcir), recent surgical operations (recsurg), systolic blood pressure (sbp), symptoms/haematemesis (hematem), ulcer size (size), ulcer location (location(2)), Forrest class (Forrest).
^ Coherently with our previous simulation studies, a forward selection is used. Anyway, the backward option was also tried: it leads to a less parsimonious model with substantially the same predictive performance.
^ Forrest class, systolic blood pressure, ulcer size, recent surgical operations, ulcer location, units of blood transfused (uobt), age (age), symptoms/haematemesis, intake of anticoagulant drugs (anticoag).
^ Systolic blood pressure, Forrest class, hypovolemic shock (shock), recent surgical operations, age, ulcer size, symptoms/haematemesis and/or melena (symptoms), ulcer location.

[Figure 1 appears here: left panel "Basic method (four-dimensional)", a scree plot of distances from the centroid; right panel "PCA (weighted distance)", a scatterplot against the 1st and 2nd principal components with labelled variables such as sbp, Forrest, age, shock, recsurg, size, symptoms, hematem, location (2), uobt, anticoag.]
Fig. 1. Left: Scree plots of variable distances from centroid (thresholds: average
distances); Right: Scatterplot of the variables in the first two principal components
space.

                        Misclassification error   Cohen k   Sensitivity   Specificity   Cutoff
In-sample analysis
  Model A               23.05%                    0.3482    77.03%        76.94%        0.1783
  Model B               22.65%                    0.3556    77.03%        77.41%        0.1642
Model A - Out-of-sample analysis
  25th percentile       21.60%                    0.2224    55.55%        73.15%        0.1629
  Median                24.00%                    0.2751    63.63%        76.19%        0.1742
  75th percentile       26.40%                    0.3320    70.65%        79.25%        0.1841
Model B - Out-of-sample analysis
  25th percentile       22.40%                    0.2138    55.55%        71.96%        0.1558
  Median                24.80%                    0.2686    63.63%        75.23%        0.1658
  75th percentile       27.20%                    0.3196    72.22%        78.57%        0.1771

Table 1. Misclassification error, Cohen k, sensitivity, specificity and cutoff value of the two logistic models (in-sample and out-of-sample analyses)

However it has to be noticed that Model B, built with the RF variable selection, has a reduced number of predictors, coherently with the simulation results showing a good capability of the method in terms of the false variable selection rate.

5 Concluding Remarks
In this paper a variable selection method based on Breiman's Random Forests is proposed and applied to a real dataset of patients affected by acute peptic ulcers, in order to identify risk factors for recurrence of hemorrhage. The main advantage of selecting relevant variables through an algorithmic modeling technique is the independence from any assumptions on the relationships among variables and on the distribution of errors. After having selected the predictors, a model could be developed under some given hypotheses, and this outlines Random Forests as a technique for preliminary analysis and variable selection and not only for classification or regression, which are its main purposes. The results on real data confirm what was expected on the basis of simulation studies: the RF-based variable selection identifies a smaller number of relevant predictors and allows the construction of a more parsimonious model, but with predictive performance similar to that of the logistic model selected by the AIC stepwise procedure. Further research is currently exploring the advantages deriving from the combination of measures coming from model-based prediction methods and algorithmic modeling techniques. Moreover, simulation studies have highlighted the presence of a bias effect in a commonly used algorithmic variable importance measure; an adjustment strategy is under development (Sandri and Zuccolotto (2006)).

References
AUSTIN, P. and TU, J. (2004): Bootstrap methods for developing predictive mod-
els. The American Statistician, 58, 131-137.
BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R.A. and STONE, C.J. (1984): Clas-
sification and Regression Trees. Chapman & Hall, London.
BREIMAN, L. (1996a): The heuristic of instability in model selection. Annals of
Statistics, 24, 2350-2383.
BREIMAN, L. (1996b): Bagging predictors. Machine Learning, 24, 123-140.
BREIMAN, L. (2001a): Random Forests. Machine Learning, 45, 5-32.
BREIMAN, L. (2001b): Statistical modeling: the two cultures. Statistical Science,
16, 199-231.
BREIMAN, L. (2002): Manual on setting up, using, and understanding Random Forests v3.1. Technical Report, http://oz.berkeley.edu/users/breiman.
DIETTERICH, T. (2000): An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Machine Learning, 40, 139-157.
ENNIS, M., HINTON, G., NAYLOR, D., REVOW, M. and TIBSHIRANI, R. (1998): A comparison of statistical learning methods on the GUSTO database. Statistics in Medicine, 17, 2501-2508.
GUGLIELMI, A., RUZZENENTE, A., SANDRI, M., KIND, R., LOMBARDO, F.,
RODELLA, L., CATALANO, F., DE MANZONI, G. and CORDIANO, C.
(2002): Risk assessment and prediction of rebleeding in bleeding gastroduode-
nal ulcer. Endoscopy, 34, 771-779.
HOCKING, R.R. (1976): The analysis and selection of variables in linear regression. Biometrics, 32, 1-49.
MILLER, A.J. (1984): Selection of subsets of regression variables. Journal of the
Royal Statistical Society, Series A, 147, 389-425.
SANDRI, M. and ZUCCOLOTTO, P. (2004): Classification with Random Forests: the theoretical framework. Rapporto di Ricerca del Dipartimento Metodi Quantitativi, Università degli Studi di Brescia, 235.
SANDRI, M. and ZUCCOLOTTO, P. (2006): Analysis of a bias effect on a tree-
based variable importance measure. Evaluation of an empirical adjustment
strategy. Manuscript.
