
RandomForest2324 CR - 4p

The document discusses decision trees, which are a machine learning method for classification and regression. Decision trees work by recursively splitting the predictor variable space into mutually exclusive regions called nodes. Each region or node is assigned a prediction, such as a class for classification trees or a continuous value for regression trees. The splits are determined by choosing a predictor variable and cut point that minimize the residual sum of squares at each step. This allows the tree to be built efficiently in a top-down, greedy manner. While interpretable, decision trees often have lower predictive accuracy than other methods, so ensemble methods like bagging and random forests are discussed which combine many decision trees to improve performance. An example decision tree for a heart disease dataset is also presented.


Sources

• Decision Trees. Sumanta Basu, September 28, 2017.
• Decision Trees (Lecture 11). David Sontag, New York University. Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore.
• Machine Learning 10-601, Recitation 8. Oznur Tastan, Oct 21, 2009.
Acknowledgement

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Regression trees, more formally

Regression: response $y \in \mathbb{R}$, predictors $X = (X_1, \ldots, X_p) \in \mathbb{R}^p$.

At a high level, a two-step process:
• Divide the predictor space $X_1, X_2, \ldots, X_p$ into $J$ non-overlapping regions $R_1, R_2, \ldots, R_J$.
• Every observation in a region $R_j$ gets the same prediction, i.e. $\hat{y}_{R_j} = \frac{1}{n} \sum_{i \in R_j} y_i$, where $n$ is the number of training observations in $R_j$.

A closer look at Step 1:
• We want to find boxes $R_1, R_2, \ldots, R_J$ that minimize the RSS, given by
  $$\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2.$$
• Unfortunately, this minimization problem is computationally intensive.
• Method: recursive binary splitting.
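To make the two-step recipe concrete, here is a minimal Python sketch (our own illustration on toy data, not something from the slides) that assigns observations to regions defined by fixed cutpoints on a single predictor, predicts the region mean, and reports the resulting RSS:

```python
import numpy as np

def region_rss(x, y, cutpoints):
    """Split a single predictor x at the given cutpoints, predict the mean of
    y in each region, and return the total residual sum of squares."""
    region = np.searchsorted(np.sort(cutpoints), x)  # region index per point
    rss = 0.0
    for j in np.unique(region):
        y_j = y[region == j]
        y_hat = y_j.mean()                 # constant prediction in region R_j
        rss += np.sum((y_j - y_hat) ** 2)
    return rss

# Toy data: a noisy step function of one predictor.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x < 4, 1.0, 3.0) + rng.normal(scale=0.3, size=200)

print(region_rss(x, y, cutpoints=[4.0]))   # small RSS: cut matches the step
print(region_rss(x, y, cutpoints=[8.0]))   # larger RSS: poorly placed cut
```

A cut placed at the true step gives a much smaller RSS than a poorly placed one, which is exactly the criterion the tree uses to choose splits.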

Recursive binary splitting

• Top-down recursive greedy approach.
• Top-down recursive: start with all observations and split into two branches at each level of the tree.
• Greedy: the best split is made at each step without looking ahead.
• Choose a predictor $X_j$ and a cutpoint $s$ that minimize the RSS for the resulting tree:
  $$R_1(j, s) = \{X \mid X_j < s\}, \qquad R_2(j, s) = \{X \mid X_j \geq s\},$$
  $$\mathrm{RSS} = \sum_{x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2.$$
• This minimization problem can be solved efficiently!

Prediction

Recall the two-step process:
• Divide the predictor space $X_1, X_2, \ldots, X_p$ into $J$ non-overlapping regions $R_1, R_2, \ldots, R_J$.
• Every observation in a region $R_j$ gets the same prediction, i.e. $\hat{y}_{R_j} = \frac{1}{n} \sum_{i \in R_j} y_i$, where $n$ is the number of training observations in $R_j$.
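The greedy split search is short enough to write out directly. The sketch below is our own brute-force illustration (real CART implementations use a much faster sorted sweep); it scans every predictor $j$ and candidate cutpoint $s$ and returns the pair with the smallest RSS:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the single split (predictor j, cutpoint s) that
    minimizes RSS over the two resulting half-spaces R1 and R2."""
    n, p = X.shape
    best = (None, None, np.inf)            # (j, s, rss)
    for j in range(p):
        # Candidate cutpoints: midpoints between consecutive distinct values.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < s
            right = ~left
            rss = (np.sum((y[left] - y[left].mean()) ** 2)
                   + np.sum((y[right] - y[right].mean()) ** 2))
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 3))
y = np.where(X[:, 1] < 0.6, 0.0, 2.0) + rng.normal(scale=0.2, size=100)
print(best_split(X, y))   # should recover a cutpoint near 0.6 on predictor 1
```

Recursive binary splitting simply repeats this search inside each of the two half-spaces, and so on down the tree.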


Prediction with a Regression Tree

[Figure: top left, a partition of the (X1, X2) plane that could not result from recursive binary splits; top right, a partition produced by recursive binary splits with cutpoints t1, ..., t4 and regions R1, ..., R5; bottom left, the associated regression tree; bottom right, a perspective plot of the response surface of the regression tree.]

Pros and Cons

• Tree-based methods are simple and useful for interpretation.
• However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.
• Hence we also discuss bagging, random forests, and boosting. These methods grow multiple trees which are then combined to yield a single consensus prediction.
• Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.
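The correspondence between a fitted tree and a rectangular partition can be inspected with any CART implementation; here is a brief sketch using scikit-learn (our choice of tool, not the slides') on simulated data with two predictors:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 2))                      # two predictors X1, X2
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# A shallow tree: each leaf is one rectangular region R_j of the (X1, X2) plane.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["X1", "X2"]))  # the split rules / cutpoints
print(tree.predict([[0.2, 0.9]]))                     # constant leaf prediction
```

The printed split rules are exactly the cutpoints t1, t2, ... of the partition, and each leaf value is the mean response of the training observations in that region.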

Example: heart data

• These data contain a binary outcome HD for 303 patients who presented with chest pain.
• An outcome value of Yes indicates the presence of heart disease based on an angiographic test, while No means no heart disease.
• There are 13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements.
• Cross-validation yields a tree with six terminal nodes. See next figure.

[Figure: the full unpruned classification tree for the Heart data, with splits on Thal, Ca, MaxHR, ChestPain, Slope, Oldpeak, Age, RestECG, RestBP, Chol and Sex; training, cross-validation and test error as a function of tree size; and the pruned tree with six terminal nodes.]
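The prune-by-cross-validation step in the figure can be reproduced in scikit-learn via cost-complexity pruning. The sketch below uses a synthetic stand-in for the Heart data (the real data would be read from the ISLR Heart.csv file); it illustrates the recipe, not the actual numbers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the Heart data: binary outcome, 13 predictors, 303 observations.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6,
                           random_state=0)

# Grow a large tree, then cross-validate over the cost-complexity pruning path
# to pick the subtree size (the slides report six terminal nodes for Heart).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(pruned.get_n_leaves(), "terminal nodes")
```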
Trees Versus Linear Models

[Figure: top row, a true linear decision boundary; bottom row, a true non-linear boundary. Left column: a linear model; right column: a tree-based model.]

Advantages and Disadvantages of Trees

✓ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
✓ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
✓ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
✓ Trees can easily handle qualitative predictors without the need to create dummy variables.
✗ Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.

However, by aggregating many decision trees, the predictive performance of trees can be substantially improved. We introduce these concepts next.
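A quick simulation makes the figure's point quantitatively. This is our own construction: a linear boundary versus an XOR-style boundary, which are assumptions standing in for the boundaries drawn in the figure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 2))

for name, y in [("linear boundary    ", (X[:, 0] + X[:, 1] > 0).astype(int)),
                ("non-linear boundary", ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    lin = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
    tree = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: logistic={lin:.2f}  tree={tree:.2f}")
```

The linear model wins when the true boundary is linear, while the tree wins decisively on the axis-aligned non-linear boundary.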

Bias/Variance Tradeoff; Reduce Variance Without Increasing Bias

• Averaging reduces variance: $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X)/N$ (when the predictions are independent).
• Average models to reduce model variance.
• One problem: we have only one training set. Where do multiple models come from?

Hastie, Tibshirani, Friedman, "Elements of Statistical Learning", 2001
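A two-line numerical check of the variance claim (a sketch with synthetic independent "predictions"; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 25, 100_000

single = rng.normal(size=trials)                       # one model per trial
averaged = rng.normal(size=(trials, N)).mean(axis=1)   # average of N independent models

print(single.var())    # close to 1
print(averaged.var())  # close to 1/N = 0.04
```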


Bagging

• Bagging, or bootstrap aggregation, is a technique for reducing the variance of an estimated prediction function.
• For classification, a committee of trees each cast a vote for the predicted class.

Bootstrap

The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.

The 0.632 Bootstrap

• The previous methods do not allow replacement of examples (an example that has been selected cannot be selected again).
• Form a training set by sampling (with replacement) n times from the base of n example instances.
• Consequence: some examples are drawn several times and others not at all.
• The examples that are never drawn are put into the test set.

After A. OSMANI, Cours Apprentissage symbolique

An aside: Leave-one-out

Special case of k-fold cross-validation where k = the number of examples.
• Advantages:
  • each iteration uses a large number of data points for the training phase;
  • deterministic: no variance.
• Disadvantages:
  • the learning algorithm is run n times;
  • no guarantee that the examples are stratified.
  => used when n < 100.

After A. OSMANI, Cours Apprentissage symbolique

The 0.632 Bootstrap (continued)

For large data sets, 36.8% of the instances will appear in the test set.

Proof:
• an example has probability 1/n of being drawn into the training set at each draw, and therefore probability (1 - 1/n) of not being drawn;
• the probability that a particular instance is never drawn is $(1 - 1/n)^n \approx e^{-1} \approx 0.368$.

Training and testing with the 0.632 Bootstrap

• Start from the data set E.
• Repeat with different random samplings:
  • generate a bootstrap training set: E_train, with E_test = E - E_train;
  • build a classifier from the training data;
  • estimate the error rate as 0.632 · e_test + 0.368 · e_train.
• Average the results.

After A. OSMANI, Cours Apprentissage symbolique
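A sketch of the 0.632 bootstrap estimate described above, in Python with scikit-learn (the data set and the choice of classifier are placeholders of ours):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
n = len(y)
rng = np.random.default_rng(5)
estimates = []

for _ in range(50):                                  # repeat with different resamples
    train_idx = rng.integers(0, n, size=n)           # draw n times with replacement
    test_mask = np.ones(n, dtype=bool)
    test_mask[train_idx] = False                     # never-drawn examples -> test set
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    e_train = 1 - clf.score(X[train_idx], y[train_idx])
    e_test = 1 - clf.score(X[test_mask], y[test_mask])
    estimates.append(0.632 * e_test + 0.368 * e_train)

print(np.mean(estimates))                            # averaged 0.632 bootstrap estimate
```

The training error of an unpruned tree is essentially zero here, which is precisely why the estimate down-weights it to 0.368 and leans on the out-of-sample term.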

Bagging

• Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.
• Recall that given a set of n independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$.
• In other words, averaging a set of observations reduces variance. Of course, this is not practical because we generally do not have access to multiple training sets.

Bagging (continued)

• Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
• In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get $\hat{f}^{*b}(x)$, the prediction at a point x. We then average all the predictions to obtain
  $$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$
  This is called bagging.
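A direct transcription of the averaging formula (a sketch; the regression function and the query point x are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(size=(300, 5))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=300)
x_new = rng.uniform(size=(1, 5))                      # a point x at which to predict

B = 100
preds = []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))        # b-th bootstrapped training set
    f_b = DecisionTreeRegressor().fit(X[idx], y[idx]) # fitted \hat{f}^{*b}
    preds.append(f_b.predict(x_new)[0])

f_bag = np.mean(preds)   # \hat{f}_bag(x) = (1/B) * sum_b \hat{f}^{*b}(x)
print(f_bag)
```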
Bagging classification trees

• The above prescription applied to regression trees.
• For classification trees: for each test observation, we record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.

Out-of-Bag Error Estimation

• It turns out that there is a very straightforward way to estimate the test error of a bagged model.
• Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that on average, each bagged tree makes use of around two-thirds of the observations.
• The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.
• We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation, which we average.
• This estimate is essentially the LOO cross-validation error for bagging, if B is large.
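In scikit-learn the OOB machinery is built in; a brief sketch with a synthetic data set standing in for real data (BaggingClassifier defaults to decision trees as base learners):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagged classification trees; each tree votes and the majority wins.
bag = BaggingClassifier(n_estimators=200, oob_score=True,
                        random_state=0).fit(X_tr, y_tr)

print("OOB accuracy :", bag.oob_score_)     # estimated from the ~1/3 held-out points
print("Test accuracy:", bag.score(X_te, y_te))
```

With enough trees the OOB accuracy tracks the held-out test accuracy closely, which is the practical content of the last bullet above.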

Bagging the heart data

[Figure: bagging and random forest results for the Heart data; test error and OOB error plotted against the number of trees (0 to 300), with traces for Test: Bagging, Test: RandomForest, OOB: Bagging and OOB: RandomForest.]

Details of previous figure

• The test error (black and orange) is shown as a function of B, the number of bootstrapped training sets used.
• Random forests were applied with $m = \sqrt{p}$.
• The dashed line indicates the test error resulting from a single classification tree.
• The green and blue traces show the OOB error, which in this case is considerably lower.
Bagging (schematic)

From N examples with M features, draw bootstrap samples, grow a tree on each sample, and take the majority vote over the trees.

Bagging: a simulated example

• Generated a sample of size N = 30, with two classes and p = 5 features, each having a standard Gaussian distribution with pairwise correlation 0.95.
• The response Y was generated according to Pr(Y = 1 | x1 ≤ 0.5) = 0.2 and Pr(Y = 1 | x1 > 0.5) = 0.8.
• Notice that the bootstrap trees are different from the original tree.
• Treat the voting proportions as probabilities.

Hastie, Elements of Statistical Learning, Example 8.7.1: http://www-stat.stanford.edu/~hastie/Papers/ESLII.pdf
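The simulated setup can be reproduced in a few lines. This is a sketch: the tree depth and the inspection of the root split are our own choices, intended only to show how the highly correlated features make the bootstrap trees disagree with one another:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
N, p, rho = 30, 5, 0.95
cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # unit variance, pairwise corr 0.95
X = rng.multivariate_normal(np.zeros(p), cov, size=N)
prob1 = np.where(X[:, 0] <= 0.5, 0.2, 0.8)           # Pr(Y = 1) depends only on x1
y = rng.binomial(1, prob1)

# Original tree vs. trees grown on bootstrap resamples of the same 30 points.
original = DecisionTreeClassifier(max_depth=2).fit(X, y)
for b in range(3):
    idx = rng.integers(0, N, size=N)
    boot = DecisionTreeClassifier(max_depth=2).fit(X[idx], y[idx])
    # tree_.feature[0] is the index of the feature used at the root split.
    print("bootstrap tree", b, "splits first on feature", boot.tree_.feature[0])
print("original tree splits first on feature", original.tree_.feature[0])
```

Because all five features are nearly interchangeable, different resamples often pick different root splits, which is exactly the instability that bagging exploits.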


Random Forests

• Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees.
• As in bagging, we build a number of decision trees on bootstrapped training samples.
• But when building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
• A fresh selection of m predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$, that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (4 out of the 13 for the Heart data).

Random Forests Algorithm

[Algorithm box figure, not reproduced in this transcription.]
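In scikit-learn terms, m corresponds to max_features; a sketch comparing bagging (m = p) with the square-root rule on a synthetic data set of our choosing:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=13, n_informative=6,
                           random_state=0)

# Bagging corresponds to m = p (max_features=None); a random forest typically
# uses m ~ sqrt(p) (max_features="sqrt"), i.e. about 4 of 13 predictors per split.
bagging = RandomForestClassifier(n_estimators=300, max_features=None, random_state=0)
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)

print("bagging (m = p)        :", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest (m ~ √p) :", cross_val_score(forest, X, y, cv=5).mean())
```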

Random Forest Classifier

• Create a decision tree from each bootstrap sample of the N examples (each with M features).
• At each node, choose the split feature from only m < M randomly selected features.
• Take the majority vote over the trees.

Example: gene expression data

• We applied random forests to a high-dimensional biological data set consisting of expression measurements of 4,718 genes measured on tissue samples from 349 patients.
• There are around 20,000 genes in humans, and individual genes have different levels of activity, or expression, in particular cells, tissues, and biological conditions.
• Each of the patient samples has a qualitative label with 15 different levels: either normal or one of 14 different types of cancer.
• We use random forests to predict cancer type based on the 500 genes that have the largest variance in the training set.
• We randomly divided the observations into a training and a test set, and applied random forests to the training set for three different values of the number of splitting variables m.
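The gene expression experiment cannot be reproduced here without the data, but the recipe translates directly. The sketch below uses a randomly generated matrix of the same shape purely to show the pipeline (top-500-variance filter, then three values of m); the results it prints are meaningless for random labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n_patients, n_genes, n_classes = 349, 4718, 15
X = rng.normal(size=(n_patients, n_genes))       # placeholder expression matrix
y = rng.integers(0, n_classes, size=n_patients)  # placeholder 15-level label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Keep the 500 genes with the largest variance in the training set.
top = np.argsort(X_tr.var(axis=0))[-500:]
X_tr, X_te = X_tr[:, top], X_te[:, top]

p = X_tr.shape[1]
for m in (p, p // 2, int(np.sqrt(p))):           # the three values of m in the slides
    rf = RandomForestClassifier(n_estimators=300, max_features=m, random_state=0)
    err = 1 - rf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"m = {m:3d}: test error = {err:.3f}")
```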

Results: gene expression data

[Figure: test classification error versus number of trees (0 to 500), with one curve for each of m = p, m = p/2 and m = √p.]

Details of previous figure

• Results from random forests for the fifteen-class gene expression data set with p = 500 predictors.
• The test error is displayed as a function of the number of trees. Each colored line corresponds to a different value of m, the number of predictors available for splitting at each interior tree node.
• Random forests (m < p) lead to a slight improvement over bagging (m = p). A single classification tree has an error rate of 45.7%.
Variable importance measure

• For bagged/RF regression trees, we record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor.
• Similarly, for bagged/RF classification trees, we add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all B trees.

[Figure: variable importance plot for the Heart data over the 13 predictors (Fbs, RestECG, ExAng, Sex, Slope, Chol, Age, RestBP, MaxHR, Oldpeak, ChestPain, Ca, Thal), on a 0-100 importance scale.]

Random forest: to read more

http://www-stat.stanford.edu/~hastie/Papers/ESLII.pdf
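The importance numbers behind such a plot are exposed directly by scikit-learn; a sketch on a synthetic stand-in for the 13 Heart predictors (the generic names X0..X12 are placeholders, not the real variable names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the 13 Heart predictors (Thal, Ca, ChestPain, MaxHR, ...).
X, y = make_classification(n_samples=303, n_features=13, n_informative=5,
                           random_state=0)
names = [f"X{j}" for j in range(13)]

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# feature_importances_ is the mean decrease in Gini impurity from splits on each
# predictor, averaged over the trees; rescale so the top predictor reads 100,
# as in the slide's plot.
imp = 100 * rf.feature_importances_ / rf.feature_importances_.max()
for name, value in sorted(zip(names, imp), key=lambda t: -t[1]):
    print(f"{name:4s} {value:6.1f}")
```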

Summary

• Decision trees are simple and interpretable models for regression and classification.
• However, they are often not competitive with other methods in terms of prediction accuracy.
• Bagging, random forests and boosting are good methods for improving the prediction accuracy of trees. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.
• The latter two methods, random forests and boosting, are among the state-of-the-art methods for supervised learning. However, their results can be difficult to interpret.
