
College of Science and Engineering

James Cook University

Week 12: Boosted, Bagging and Random Forest

Dr Carla Ewels
[email protected]

Oct 2021
Recap

- Regression and logistic regression
- Classification and regression trees

[Figure: four panels plotting X2 against X1, comparing fitted decision boundaries]

- Top row: true linear boundary
- Bottom row: non-linear boundary

Recap

1. Grow: use a greedy algorithm to find the variable and split point that minimise a loss function
   - Regression: RSS
   - Classification: misclassification rate, Gini index or cross-entropy
2. Prune: cost-complexity pruning. For each value of the tuning parameter α, find the subtree with the lowest cost-complexity criterion
3. Choose the tuning parameter α by K-fold CV (see the sketch after this list):
   3.1 Divide the data into K sets, for k = 1, . . . , K
   3.2 Repeatedly grow and prune on the training folds
   3.3 Use the held-out fold to estimate the prediction error
   For each α, average the errors over the K repetitions and choose the α with the smallest average error
4. The optimal tree is the subtree corresponding to the α value identified in Step 3

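This grow/prune/tune workflow can be run in R with the rpart package; the package is not named on the slides, so treat this as a hedged sketch in which the data frame train and the formula are placeholders.

library(rpart)

# Grow a large tree: cp = 0 disables early stopping, xval = 10 gives 10-fold CV
fit <- rpart(y ~ ., data = train, method = "class",
             control = rpart.control(cp = 0, xval = 10))

# Cross-validated error for each value of the complexity parameter (alpha ~ cp)
printcp(fit)

# Prune back to the cp value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
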
Example

[Figure: fitted classification tree with splits on Blair < 3.5, Hague >= 3.5 and Europe >= 7.5; terminal nodes predict Conservative or Labour; the rightmost terminal node predicts Labour, contains 43% of the observations and has class proportions .06, .70, .24]

What are the misclassification rate, Gini index and cross-entropy of the rightmost node?

Classification tree: loss functions

- Misclassification error: at a terminal node m, the impurity of the node is the proportion of cases that are misclassified, i.e.

  \frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{m k(m)}

- Gini index: at a terminal node m, the impurity of the node is

  \sum_{k \neq k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})

- Cross-entropy (or deviance): at a terminal node m, the impurity of node m is

  -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}

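To answer the question on the example slide, a short R sketch evaluates the three impurity measures for the rightmost node, assuming its class proportions are the .06, .70, .24 read off the tree.

p <- c(0.06, 0.70, 0.24)        # estimated class proportions p_mk in the rightmost node
misclass <- 1 - max(p)          # misclassification error = 1 - p_mk(m)
gini     <- sum(p * (1 - p))    # Gini index
entropy  <- -sum(p * log(p))    # cross-entropy (deviance), natural log
c(misclass = misclass, gini = gini, entropy = entropy)
# approximately 0.30, 0.45 and 0.76
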
Advanced trees

- Simple and easy to interpret
- Low prediction accuracy
- Unstable
- Many ways to improve tree-based algorithms
  - Bagging
  - Random forest
  - Boosting
- Con: black box
- Key feature: variable selection

Bagging

- Builds on the CART algorithm
- Generates a large number of trees using bootstrap samples
- Combines the predictions of the different trees
- Wisdom of crowds

Bootstrapping

- Resampling method
- Draw a random sample with replacement from the training set
- Grow a tree
- Predict
- Repeat many times
- Prediction for a regression tree:

  \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)

  where B is the total number of bootstrap samples and \hat{f}^{*b}(x) is the prediction from the b-th sample
- This is called bagging

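A minimal sketch of this bagging loop for a regression tree, written with rpart; the data frames train (with a numeric response column y) and test are placeholders.

library(rpart)

B     <- 500
preds <- matrix(NA, nrow = nrow(test), ncol = B)

for (b in 1:B) {
  # Draw a bootstrap sample (same size as the training set, with replacement)
  idx  <- sample(nrow(train), replace = TRUE)
  boot <- train[idx, ]
  # Grow a deep, unpruned regression tree on the bootstrap sample
  tree <- rpart(y ~ ., data = boot, method = "anova",
                control = rpart.control(cp = 0))
  preds[, b] <- predict(tree, newdata = test)
}

# Bagged prediction: average the B tree predictions for each test observation
f_bag <- rowMeans(preds)
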
Step illustration

[Figure: step-by-step illustration of bagging]

Bagging classification trees

- Classification trees: for each test observation, record the class predicted by each of the B trees; the class with the most votes is the prediction (see the sketch after this list)
- Does this work?
  - For regression trees we can show mathematically (using bias and variance) that averaging the model predictions gives a lower MSE than an individual model prediction
  - For classification trees, not always (no bias-variance decomposition)
  - Bagging a good classifier can improve the predictions, but bagging a bad classifier can produce worse results
  - Wisdom of crowds assumes that the individuals in the crowd are independent
  - Bagged trees are not independent, so the wisdom-of-crowds advantage does not always hold

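As a small sketch of the majority vote, assume class_preds is an n x B matrix of class labels, one column per bagged classification tree; both the matrix and its name are illustrative assumptions.

# Majority vote across the B trees for each test observation
majority_vote <- function(votes) {
  names(which.max(table(votes)))
}
y_hat <- apply(class_preds, 1, majority_vote)
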
Other issues with bagging

- A bagged tree does not retain the tree structure, therefore it is hard to interpret (black box)
- A bagged tree is not a tree

Random Forest

- Overcomes the dependency (correlation) among trees
- RF was developed after boosting (next part), yet is more popular than boosting
- Breiman (2001) found that trees can be de-correlated when a different subset of predictors is used at each split

Random Forest

1. For b = 1, . . . , B:
   (a) Draw a bootstrap sample Z* of size N from the training data (with replacement)
   (b) Grow a random-forest tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
       i.   Select m variables at random from the p variables
       ii.  Pick the best variable/split-point among the m
       iii. Split the node
   (c) Output the ensemble of trees \{T_b\}_1^B

Random Forest

- Let p be the number of predictors; typical sizes of the subset m are
  - √p for classification
  - p/3 for regression
- When m = p, random forest is the same as bagging
- Prediction
  - Regression:

    \hat{f}_{\text{rf}}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)

  - Classification: let \hat{C}_b(x) be the prediction of the b-th random-forest tree, then

    \hat{C}_{\text{rf}}^{B}(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_1^B

Bagging and Random Forest in R

- Share a common algorithm
- Fully grown trees: no pruning required
- Same R package: randomForest
  - mtry equal to the number of predictors: bagging
  - mtry less than the number of predictors: random forest

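A hedged sketch of both fits with the randomForest package; the Heart-style data frame heart, with a factor outcome AHD and 13 predictors, is an assumed name rather than one given on the slides.

library(randomForest)

p <- ncol(heart) - 1                       # number of predictors

# Bagging: mtry equal to the number of predictors
bag <- randomForest(AHD ~ ., data = heart, mtry = p, ntree = 500)

# Random forest: mtry below the number of predictors
# (the package default for classification is floor(sqrt(p)))
rf <- randomForest(AHD ~ ., data = heart, mtry = floor(sqrt(p)), ntree = 500)

bag   # printing the fit shows the OOB error estimate
rf
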
Example: Heart data

- Binary outcome (yes/no): heart disease for 303 patients who presented with chest pain
- 13 predictors
- CV yields a tree with six terminal nodes

Example: Heart

[Figure: unpruned classification tree for the Heart data; training, cross-validation and test error against tree size; and the pruned subtree with six terminal nodes]

Example: Heart

[Figure: test and OOB error against the number of trees for bagging and random forest on the Heart data]

- Dashed line: single tree
- Bagging resulted in slightly better predictions than the single-tree approach
- RF has the lowest error, better than bagging
- No overfitting issue with bagging: the error settles down after around 100 trees

Example: Random forest with different subset sizes

- p = 500 gene-expression measurements
- n = 349 patients
- There are around 20,000 genes in humans, and individual genes have different levels of activity, or expression, in particular cells, tissues and biological conditions
- Response: normal or one of 14 different types of cancer
- Training and test sets

Example: Random forest with different subset sizes

[Figure: test classification error against the number of trees for m = p, m = p/2 and m = √p]

- Slight improvement over bagging (orange: bagging, m = p)
- Better than a single tree, which has an error rate of 45.7%

Boosted tree

- Unlike bagging and random forest, boosted trees add "boosting" to the tree algorithm

What is boosting?
- One of the most powerful learning ideas introduced in the late 90s/00s
- Originally designed for classification problems, but extended to regression problems
- Fits "weak" classifiers to the original but modified data
- Combines the weak classifiers to produce a strong committee
- Learns slowly

Example: AdaBoost

- AdaBoost
- Classification problem with K = 2, Y ∈ {−1, 1}
- Prediction:

  G(x) = \text{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right)

  where the coefficient \alpha_m is the weight that each subtree G_m(x) contributes to the overall prediction

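A minimal sketch of AdaBoost.M1 with rpart stumps as the weak classifiers G_m(x); the data frames train and test and the response column y (coded −1/+1) are placeholder assumptions, and the weight-update rule shown here is the standard AdaBoost.M1 one rather than a quote from the slides.

library(rpart)

M <- 100
N <- nrow(train)

dat   <- train
dat$y <- factor(dat$y)                 # rpart needs a factor response for classification

w      <- rep(1, N)                    # observation weights
alpha  <- numeric(M)                   # weight of each weak classifier
scores <- rep(0, nrow(test))           # running sum of alpha_m * G_m(x) on the test set

for (m in 1:M) {
  # Weak classifier: a stump fitted to the weighted training data
  G_m  <- rpart(y ~ ., data = dat, weights = w, method = "class",
                control = rpart.control(maxdepth = 1, cp = 0))
  pred <- as.numeric(as.character(predict(G_m, dat, type = "class")))
  miss <- as.numeric(pred != train$y)

  err      <- sum(w * miss) / sum(w)   # weighted misclassification rate
  alpha[m] <- log((1 - err) / err)

  w <- w * exp(alpha[m] * miss)        # up-weight the misclassified observations
  w <- w * N / sum(w)                  # rescale (the scale does not affect err)

  scores <- scores + alpha[m] *
    as.numeric(as.character(predict(G_m, test, type = "class")))
}

G_hat <- sign(scores)                  # final committee prediction, G(x)
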
Example: AdaBoost

[Figures: AdaBoost worked example]

Boosting algorithm for regression trees

1. Set \hat{f}(x) = 0 and r_i = y_i for all i in the training set
2. For b = 1, 2, . . . , B, repeat:
   2.1 Fit a tree \hat{f}^b with d splits (d + 1 terminal nodes) to the training data (X, r)
   2.2 Update \hat{f} by adding a shrunken version of the new tree:

       \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)

   2.3 Update the residuals:

       r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)

3. Output the boosted model (see the R sketch below):

   \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)

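A minimal R sketch of this loop using rpart trees with a single split (d = 1); train, test, the numeric response column y, λ and B are placeholders.

library(rpart)

B      <- 1000
lambda <- 0.01
d      <- 1                                     # depth 1 = a stump (one split)

X      <- train[, setdiff(names(train), "y")]   # predictors only
r      <- train$y                               # step 1: residuals start at y_i
f_hat  <- rep(0, nrow(train))                   # step 1: f_hat(x) = 0
f_test <- rep(0, nrow(test))                    # boosted prediction for the test set

for (b in 1:B) {
  # 2.1: fit a small tree to the current residuals
  fit_b <- rpart(r ~ ., data = data.frame(X, r = r), method = "anova",
                 control = rpart.control(maxdepth = d, cp = 0))
  pred  <- predict(fit_b, newdata = X)

  # 2.2 and 2.3: add a shrunken version of the tree and update the residuals
  f_hat  <- f_hat + lambda * pred
  r      <- r - lambda * pred
  f_test <- f_test + lambda * predict(fit_b, newdata = test)
}
# 3: f_hat and f_test hold the boosted model's fitted and test-set predictions
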
Extension of boosted trees

- General framework: forward stagewise additive modelling
- Speeds up the search
- Different optimisation algorithms: gradient descent and steepest descent
- Loss function limited to differentiable ones
- Gradient tree boosting algorithm

Tuning parameters

- Number of trees B. A large B can result in overfitting; use CV to find B (see the sketch below)
- Shrinkage parameter λ controls the rate of learning, typically between 0.01 and 0.001. A small λ requires B to be large
- Number of splits in each subtree, d, controls the complexity of the ensemble
  - d = 1 gives a stump
  - Depends on the problem in hand
  - Seldom any improvement when d is over 6

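These parameters map directly onto the arguments of the gbm package; a hedged sketch for a regression problem, with the data frame train and the response y as placeholders.

library(gbm)

boost <- gbm(y ~ ., data = train,
             distribution = "gaussian",    # squared-error loss for regression
             n.trees = 5000,               # B: number of trees
             shrinkage = 0.01,             # lambda: learning rate
             interaction.depth = 2,        # d: splits per subtree
             cv.folds = 5)                 # cross-validation to choose B

# Number of trees that minimises the cross-validated error
best_B <- gbm.perf(boost, method = "cv")
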
Example

[Figures: boosting example]

Interpretation of trees

- Ensembles of trees have better predictive power
- But we are not able to interpret the findings directly
- Options
  - Relative variable importance
  - Partial dependence plots

Variable importance

- CART: not as reliable; based on the amount of error reduced by primary and surrogate splits
- Bagging and random forest
  - The variable importance ranking is derived from all trees
  - Within each tree, a variable's importance is quantified by the improvement the variable contributes at each internal node, and this is summed over all trees

Random Forest

- Random forest also uses the OOB samples to measure variable importance (prediction power)
- The OOB samples are first passed down the tree to estimate the baseline prediction accuracy
- The values of predictor x_ℓ in the OOB samples are then permuted and passed down the tree again to estimate the prediction accuracy
- The differences between the two prediction accuracies are averaged over all trees in the forest to quantify the importance of x_ℓ (see the sketch below)

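In the randomForest package this OOB permutation measure is computed when the forest is grown with importance = TRUE; a brief sketch, reusing the assumed heart data frame and AHD outcome from the earlier example.

library(randomForest)

rf <- randomForest(AHD ~ ., data = heart, importance = TRUE, ntree = 500)

# Mean decrease in OOB accuracy when each predictor is permuted,
# alongside the impurity-based (mean decrease in Gini) measure
importance(rf)
varImpPlot(rf)
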
Boosting

- In boosted trees, a new tree is constructed at each iteration and added to the final model
- Variable importance is evaluated at each "subtree"
- In a single decision tree, the importance of a predictor ℓ is the sum of the improvements it contributed at each internal node, hence

  I_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\iota}_t^2 \, I(v(t) = \ell)

  where \hat{\iota}_t is the improvement contributed by variable ℓ at internal node t
- Therefore, the importance of ℓ is the average contribution of the variable over all M trees:

  I_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} I_\ell^2(T_m)

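For a boosted model fitted with gbm (such as the earlier sketch), this averaged importance is reported as the relative influence of each predictor.

# Relative influence, averaged over the boosted trees chosen by CV
summary(boost, n.trees = best_B)
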
Partial dependence plot

[Figure: partial dependence plots]

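As a sketch, the randomForest package offers partialPlot() for single-variable partial dependence; rf and heart are the assumed objects from the earlier sketches, and MaxHR / "Yes" are example names for a predictor and the class of interest.

library(randomForest)

# Partial dependence of the predicted probability of class "Yes" on one predictor
partialPlot(rf, pred.data = heart, x.var = "MaxHR", which.class = "Yes")
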
