Decision Trees
Sumanta Basu
Machine Learning 10601, Lecture 11 / Recitation 8, Oct 21, 2009
David Sontag, New York University
Oznur Tastan
• However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.
• Hence we also discuss bagging, random forests, and boosting. These methods grow multiple trees which are then combined to yield a single consensus prediction.
• Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss of interpretation.

[Figure: Top Left: a partition of the (X1, X2) space that could not result from recursive binary splits; Top Right: the output of recursive binary splits with split points t1-t4 and regions R1-R5; Bottom Left: the associated regression tree; Bottom Right: perspective plot of the response surface of the regression tree.]
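To make recursive binary splitting concrete, here is a small sketch (synthetic data and scikit-learn are my assumptions, not part of the slides): a regression tree fit to two predictors produces axis-aligned splits of the form X1 <= t that partition the plane into rectangular regions like R1-R5.

```python
# Sketch: recursive binary splitting with a scikit-learn regression tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))                 # columns play the role of X1, X2
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + np.where(X[:, 1] < 0.4, 0.0, 2.0)
y += rng.normal(scale=0.1, size=200)                 # noisy piecewise-constant response

tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["X1", "X2"]))  # prints splits such as "X1 <= 0.50"
```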
• These data contain a binary outcome HD for 303 patients who presented with chest pain.
• An outcome value of Yes indicates the presence of heart disease.
• There are 13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements.
• Cross-validation yields a tree with six terminal nodes. See next figure.

[Figure: the unpruned classification tree for the Heart data, with splits such as Thal:a, Ca < 0.5, MaxHR < 161.5, ChestPain:bc, Age < 52, RestECG < 1, RestBP < 157, Chol < 244, MaxHR < 156, MaxHR < 145.5, and Sex < 0.5, each leaf labeled Yes or No.]
[Figure: training, cross-validation, and test error as a function of tree size; the cross-validation error is minimized at six terminal nodes.]
[Figure: the pruned tree with six terminal nodes, with splits including Ca < 0.5, MaxHR < 161.5, and ChestPain:bc.]
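A hedged sketch of choosing the tree size by cross-validation, using scikit-learn's cost-complexity pruning rather than the R tooling behind the figure; the file name Heart.csv and the outcome column AHD are assumptions about a local copy of the data.

```python
# Sketch: pick a pruning level for a classification tree by 5-fold cross-validation.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

heart = pd.read_csv("Heart.csv").dropna()              # hypothetical local copy of the data
X = pd.get_dummies(heart.drop(columns=["AHD"]), drop_first=True)
y = heart["AHD"]                                       # "Yes"/"No" heart-disease outcome

# Candidate complexity parameters from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
print("chosen ccp_alpha:", best_alpha)
```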
Trees Versus Linear Models

[Figure: Top row: true linear boundary; Bottom row: true non-linear boundary. Left column: linear model; Right column: tree-based model. Axes are X1 and X2.]

Advantages and Disadvantages of Trees
• Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
• Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
• Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
• Trees can easily handle qualitative predictors without the need to create dummy variables.
• Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.

However, by aggregating many decision trees, the predictive performance of trees can be substantially improved. We introduce these concepts next.
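To make the comparison in the figure concrete, here is a small sketch (synthetic data and scikit-learn are assumptions, not part of the slides): when the true class boundary is non-linear and axis-aligned, a shallow tree can beat a linear classifier.

```python
# Sketch: linear model vs. decision tree when the true boundary is non-linear.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
# Class 1 inside the central box |X1| < 1 and |X2| < 1: a non-linear boundary.
y = ((np.abs(X[:, 0]) < 1) & (np.abs(X[:, 1]) < 1)).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
print("linear model:", LogisticRegression().fit(Xtr, ytr).score(Xte, yte))
print("tree        :", DecisionTreeClassifier(max_depth=4, random_state=0)
      .fit(Xtr, ytr).score(Xte, yte))
```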
Bias/Variance Tradeoff: Reduce Variance Without Increasing Bias
• Averaging reduces variance: when the predictions are independent, Var(X̄) = Var(X)/N.
• An example has probability 1/n of being drawn into the bootstrap training set at each draw, and therefore probability (1 - 1/n) of not being drawn.
• The probability that a particular instance is never drawn is (1 - 1/n)^n ≈ e^(-1) = 0.368; for large data sets, about 36.8% of the instances therefore fall outside the bootstrap training set.
• Generate a data set by bootstrap: Etrain, with Etest = E - Etrain.
• Build a classifier from the training data.
• Estimate the error rate as 0.632 * etest + 0.368 * etrain.
• Average the results.
Adapted from A. Osmani, Cours Apprentissage symbolique, 09/17/07.
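A sketch of the 0.632 bootstrap procedure above (my code, with a scikit-learn decision tree as the classifier and X, y assumed to be NumPy arrays): draw a bootstrap training set, measure the error on the held-out instances and on the training instances, weight them 0.632/0.368, and average over rounds.

```python
# Sketch: 0.632 bootstrap estimate of the error rate of a decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap632_error(X, y, n_rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)              # draw n indices with replacement
        oob = np.setdiff1d(np.arange(n), idx)         # ~36.8% of instances left out
        clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        e_train = 1 - clf.score(X[idx], y[idx])       # error on the bootstrap training set
        e_test = 1 - clf.score(X[oob], y[oob])        # error on the held-out instances
        estimates.append(0.632 * e_test + 0.368 * e_train)
    return float(np.mean(estimates))                  # average the results
```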
• Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.
• Recall that given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the mean Z̄ of the observations is given by σ²/n.
• In other words, averaging a set of observations reduces variance. Of course, this is not practical because we generally do not have access to multiple training sets.
• Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
• In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get f̂*b(x), the prediction at a point x. We then average all the predictions to obtain

  f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*b(x).

This is called bagging.
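A minimal sketch of the bagging formula above (assuming NumPy arrays and scikit-learn regression trees, not the lecture's own code): fit a tree to each of B bootstrapped training sets and average the B predictions at each test point.

```python
# Sketch: bagged regression trees, averaging predictions over B bootstrap fits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                        # bootstrap sample b
        tree = DecisionTreeRegressor(random_state=b).fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)                         # f̂*b(x) at each test point
    return preds.mean(axis=0)                                   # average over b = 1, ..., B
```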
Bagging classification trees
• The above prescription applies to regression trees.
• For classification trees: for each test observation, we record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.

Out-of-Bag Error Estimation
• It turns out that there is a very straightforward way to estimate the test error of a bagged model.
• Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that on average, each bagged tree makes use of around two-thirds of the observations.
• The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.
• We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation, which we average.
• This estimate is essentially the LOO cross-validation error for bagging, if B is large.
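A hedged sketch of both ideas on these slides using scikit-learn's BaggingClassifier, whose default base learner is a decision tree: predictions are majority votes over the B trees, and oob_score=True computes an out-of-bag accuracy estimate from the roughly one-third of observations each tree never saw. The synthetic data set is an assumption.

```python
# Sketch: bagged classification trees with a built-in out-of-bag accuracy estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bag = BaggingClassifier(n_estimators=200,   # B = 200 bootstrapped trees
                        oob_score=True,     # score each point only with its OOB trees
                        random_state=0).fit(X, y)

print("OOB accuracy estimate:", bag.oob_score_)       # ~ LOO CV accuracy when B is large
print("majority-vote predictions:", bag.predict(X[:5]))
```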
[Figure: test error and OOB error as a function of the number of trees, for bagging and random forests on the Heart data; curves: Test: Bagging, Test: RandomForest, OOB: Bagging, OOB: RandomForest.]
• The green and blue traces show the OOB error, which in this case is considerably lower.
Bagging

[Figure: schematic of bagging: each of several bootstrap samples of the N examples (with their M features) is used to grow a tree, and the trees' predictions are combined by taking the majority vote.]

Bagging: a simulated example
Generated a sample of size N = 30, with two classes and p = 5 features, each having a standard Gaussian distribution with pairwise correlation 0.95.
Notice the bootstrap trees are different than the original tree. (Hastie)
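A sketch of generating data like this simulated example (the class-labeling rule used in Hastie's example is not shown on the slide, so the one below is an assumption): p = 5 standard Gaussian features with pairwise correlation 0.95 and N = 30 observations.

```python
# Sketch: correlated Gaussian features as in the simulated bagging example.
import numpy as np

N, p, rho = 30, 5, 0.95
cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)      # 1 on the diagonal, 0.95 off it
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(p), cov=cov, size=N)
# Assumed labeling rule for illustration only (not the rule used in the lecture).
y = (X[:, 0] + rng.normal(scale=0.5, size=N) > 0).astype(int)
```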
Random Forest Classifier

[Figure: schematic of a random forest classifier: multiple trees, each grown on a bootstrap sample of the N examples and their M features, combined into a single prediction.]

Example: gene expression data
• We applied random forests to a high-dimensional biological data set consisting of expression measurements of 4,718 genes measured on tissue samples from 349 patients.
• There are around 20,000 genes in humans, and individual genes have different levels of activity, or expression, in particular cells, tissues, and biological conditions.

[Figure: test classification error as a function of the number of trees for the gene expression data; the curves correspond to different numbers m of candidate predictors per split, e.g. m = √p.]
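A sketch of a random forest with m = √p candidate predictors per split, in the spirit of the gene expression example; the synthetic stand-in data set and scikit-learn's max_features="sqrt" are my assumptions, not the lecture's setup.

```python
# Sketch: random forest with m = sqrt(p) features considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=349, n_features=500, n_informative=20,
                           random_state=0)      # synthetic stand-in for the gene data
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500,
                            max_features="sqrt",  # m = sqrt(p) candidate predictors per split
                            random_state=0).fit(Xtr, ytr)
print("test accuracy:", rf.score(Xte, yte))
```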
Variable importance measure
• For bagged/RF regression trees, we record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor.
• Similarly, for bagged/RF classification trees, we add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all B trees.

[Figure: variable importance plot for the Heart data; the predictors Fbs, RestECG, ExAng, Sex, Slope, Chol, Age, RestBP, MaxHR, Oldpeak, ChestPain, Ca, and Thal are ranked by variable importance on a 0-100 scale.]

Random forest, to read more:
http://www-stat.stanford.edu/~hastie/Papers/ESLII.pdf
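A sketch of the variable importance idea using scikit-learn, where feature_importances_ reports the normalized total decrease in node impurity (Gini, by default) attributable to splits on each predictor, averaged over the trees; the synthetic data and generic feature names are assumptions.

```python
# Sketch: impurity-based variable importance from a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
names = [f"x{j}" for j in range(13)]        # stand-ins for Age, Sex, Chol, ...

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]   # most important first
for j in order:
    print(f"{names[j]:>4s}  {100 * rf.feature_importances_[j]:.1f}")
```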
Summary