
STAT 432: Basics of Statistical Learning

Tree and Random Forests

Ruoqing Zhu, Ph.D. <[email protected]>

https://teazrq.github.io/stat432/

University of Illinois at Urbana-Champaign


November 20, 2018

1/50
Classification and Regression Trees (CART)
Tree-based Methods

• Tree-based methods are nonparametric methods that recursively partition the feature space into hyper-rectangular subsets and make predictions on each subset.
• Two main streams of models:
– Classification and Regression Trees (CART): Breiman, Friedman, Olshen and Stone (1984)
– ID3/C4.5: Quinlan (1986, 1993)
• Both are among the top algorithms in data mining (Wu et al., 2008)
• In statistics, CART is more popular.

3/50
Titanic Survival

(Figure: an example classification tree built on the Titanic survival data.)

4/50
Classification and Regression Trees

• Example: independent x1 and x2 from Uniform[−1, 1],

  P(Y = blue | x1² + x2² < 0.6) = 90%
  P(Y = orange | x1² + x2² ≥ 0.6) = 90%

• Existing methods require a transformation of the feature space to deal with this model; trees and random forests do not.
• How does a tree work for classification?

5/50
Example

(Figure: scatter plot of the simulated points; blue points concentrate inside the circle x1² + x2² < 0.6 and orange points outside.)
6/50
Example

(Figure: the same scatter plot shown with successive axis-aligned splits, illustrating how the tree partitions the feature space into rectangles.)
7/50
Example

• There are many popular packages that can fit a CART model: rpart, tree, and party.
• Read the reference manual carefully!

> library(rpart)
> x1 = runif(500, -1, 1)
> x2 = runif(500, -1, 1)
> y = rbinom(500, size = 1, prob = ifelse(x1^2 + x2^2 < 0.6, 0.9, 0.1))
> cart.fit = rpart(as.factor(y) ~ x1 + x2, data = data.frame(x1, x2, y))
> plot(cart.fit)
> text(cart.fit)
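As a quick sanity check (not on the original slide), one can look at the in-sample predictions; predict() with type = "class" is standard rpart usage:

> pred = predict(cart.fit, type = "class")
> table(truth = y, predicted = pred)   # confusion table on the training data
> mean(pred != y)                      # training misclassification rate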

8/50
Example

(Figure: the fitted classification tree as drawn by plot() and text(). The root split is x2 < −0.644432; subsequent splits include x1 < 0.767078, x2 < 0.722725, x1 < −0.76759, x1 < −0.61653, x2 < 0.356438, x1 < 0.547162, x2 < −0.28598, and x2 < 0.369569; terminal nodes are labeled 0 or 1.)
9/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data

Root node

10/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data
• Find a splitting rule 1{X^(j) ≤ c} and split the node

Root node

Splitting rule 1{X^(j) ≤ c}

10/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data
• Find a splitting rule 1{X^(j) ≤ c} and split the node
• Recursively apply the procedure on each daughter node

(Diagram: the root node is split by Age ≤ 45; "No" leads to an internal node and "Yes" leads to node A1.)

10/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data
• Find a splitting rule 1{X^(j) ≤ c} and split the node
• Recursively apply the procedure on each daughter node

(Diagram: the root node is split by Age ≤ 45 into an internal node and node A1; the internal node is split again by Female into nodes A2 and A3.)

10/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data
• Find a splitting rule 1{X^(j) ≤ c} and split the node
• Recursively apply the procedure on each daughter node
• Predict each terminal node using within-node data

(Diagram: the same tree with a prediction attached to each terminal node: f̂(x) for x ∈ A1, f̂(x) for x ∈ A2, and f̂(x) for x ∈ A3.)

10/50
Classification and Regression Trees

• How to construct the splitting rules?


• Classification problems
• Regression problems
• How to deal with categorical predictors
• Tree pruning

11/50
Constructing Splitting Rules
Splitting Using Continuous Covariates

• Splits on continuous predictors take the form 1{X^(j) ≤ c}
• At a node A with |A| observations

  {(x_i, y_i) : x_i ∈ A, 1 ≤ i ≤ n}

• We want to split this node into two child nodes AL and AR:

  AL = {x ∈ A : x^(j) ≤ c}
  AR = {x ∈ A : x^(j) > c}

• This is done by calculating and comparing the impurity before and after a split.

13/50
Impurity for Classification

• We need to define the impurity criteria for classification and regression problems separately.
• Before the split, we evaluate the impurity of the entire node A using the Gini index.
• Gini impurity is used as the measurement. Suppose we have K different classes; then

  Gini = Σ_{k=1}^K p_k (1 − p_k) = 1 − Σ_{k=1}^K p_k²

• Interpretation: Gini = 0 means a pure node (only one class); a larger Gini means a more diverse node.

14/50
Impurity for Classification

• After the split, we want each child node to be as pure as possible, i.e., the sum of their Gini impurities should be as small as possible.
• Maximize the Gini impurity reduction of the split:

  score = Gini(A) − (|AL|/|A|) Gini(AL) − (|AR|/|A|) Gini(AR),

  where | · | denotes the cardinality (sample size) of a node.

• Note 1: Gini(AL) and Gini(AR) are calculated within their respective nodes.
• Note 2: An alternative (and equivalent) definition is to minimize (|AL|/|A|) Gini(AL) + (|AR|/|A|) Gini(AR).

15/50
Impurity for Classification

• Calculating the Gini index based on the samples is very simple:


• First, for any node A, we estimate the class frequencies p̂_k:

  p̂_k = ( Σ_i 1{y_i = k} 1{x_i ∈ A} ) / ( Σ_i 1{x_i ∈ A} ),

  which is the proportion of samples with class label k in node A.

• Then the Gini impurity is

  Gini(A) = Σ_{k=1}^K p̂_k (1 − p̂_k) = 1 − Σ_{k=1}^K p̂_k²

• Do the same for AL and AR , then calculate the score of a split.
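A small R sketch of these calculations (not from the slides; gini_node() and split_score() are made-up helper names):

# Gini impurity of a node, estimated from its class labels y
gini_node <- function(y) {
  p <- table(y) / length(y)          # p-hat_k for each class k
  1 - sum(p^2)
}

# Gini reduction score of the split 1{x <= c}, given the node's covariate
# column x and labels y (assumes c is an interior cut, so both children are non-empty)
split_score <- function(x, y, c) {
  left <- x <= c
  gini_node(y) -
    mean(left)  * gini_node(y[left]) -
    mean(!left) * gini_node(y[!left])
}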

16/50
Choosing the Split

• To define a split 1{X^(j) ≤ c}, we need to know
• the variable index j
• the cutting point c
• To find the best split at a node, we do an exhaustive search:
• Go through each variable j and all of its possible cutting points c
• For each combination of j and c, calculate the score of that split
• Compare all such splits and choose the one with the best score
• Note: to exhaust all cutting points, we only need to examine the midpoints between consecutive order statistics (see the sketch below).
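A minimal exhaustive search under these assumptions (best_split() is a hypothetical helper, not part of rpart):

# Exhaustive search for the best split at a node (illustration only)
best_split <- function(X, y) {
  gini <- function(y) 1 - sum((table(y) / length(y))^2)
  best <- list(score = -Inf, j = NA, c = NA)
  for (j in seq_len(ncol(X))) {
    xs   <- sort(unique(X[, j]))
    cuts <- (head(xs, -1) + tail(xs, -1)) / 2     # midpoints of order statistics
    for (c in cuts) {
      left  <- X[, j] <= c
      score <- gini(y) - mean(left) * gini(y[left]) - mean(!left) * gini(y[!left])
      if (score > best$score) best <- list(score = score, j = j, c = c)
    }
  }
  best
}

# e.g. best_split(cbind(x1, x2), y) returns the chosen variable index j,
# the cut point c, and the Gini reduction score for the simulated data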

17/50
Other Impurity Measures

• Gini index is not the only measurement.


• ID3/C4.5 uses Shannon entropy from information theory:

  Entropy(A) = − Σ_{k=1}^K p̂_k log(p̂_k)

• Misclassification error:

  Error(A) = 1 − max_{k=1,...,K} p̂_k

• Similarly, we can use these measures to define the reduction of impurity and search for the best splitting rule

18/50
Comparing Impurity Measures

       Class 1   Class 2   p̂1     p̂2     Gini    Entropy   Error
  A       7         3      7/10    3/10   0.420   0.611     0.3
  AL      3         0      3/3     0      0       0         0
  AR      4         3      4/7     3/7    0.490   0.683     3/7

  score_Gini    = 0.420 − (3/10 · 0 + 7/10 · 0.490) = 0.077
  score_Entropy = 0.611 − (3/10 · 0 + 7/10 · 0.683) = 0.133
  score_Error   = 3/10 − (3/10 · 0 + 7/10 · 3/7) = 0
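These numbers are easy to verify; a throwaway R check (natural log for the entropy, which matches the 0.611 and 0.683 above):

p <- function(counts) counts / sum(counts)           # class proportions
gini    <- function(counts) 1 - sum(p(counts)^2)
entropy <- function(counts) -sum(ifelse(p(counts) > 0, p(counts) * log(p(counts)), 0))
error   <- function(counts) 1 - max(p(counts))

A <- c(7, 3); AL <- c(3, 0); AR <- c(4, 3)
gini(A)    - (3/10 * gini(AL)    + 7/10 * gini(AR))     # 0.077
entropy(A) - (3/10 * entropy(AL) + 7/10 * entropy(AR))  # 0.133
error(A)   - (3/10 * error(AL)   + 7/10 * error(AR))    # 0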

19/50
Comparing Different Measures

• Gini index and Shannon entropy are more sensitive to changes in the node probabilities
• They prefer to create more "pure" nodes
• Misclassification error can be used for evaluating a tree, but may not be sensitive enough for building a tree

20/50
Regression Problems

• When the outcome Y is continuous, all we need is a corresponding impurity measure
• Use variance instead of Gini, and consider the weighted variance reduction:

  score = Var(A) − (|AL|/|A|) Var(AL) − (|AR|/|A|) Var(AR),

  where for any node A, Var(A) is just the variance of the node samples:

  Var(A) = (1/|A|) Σ_{i ∈ A} (y_i − ȳ_A)²,

  |A| is the cardinality of A and ȳ_A is the within-node mean.
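The same split-score logic in R for a continuous outcome (a sketch; var_node() and reg_split_score() are made-up names):

# Within-node "variance" as defined on the slide (divide by |A|, not |A| - 1)
var_node <- function(y) mean((y - mean(y))^2)

# Variance reduction of the split 1{x <= c}
reg_split_score <- function(x, y, c) {
  left <- x <= c
  var_node(y) - mean(left) * var_node(y[left]) - mean(!left) * var_node(y[!left])
}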

21/50
Categorical Predictors

• If X^(j) is a categorical variable taking values in {1, . . . , C}, we search for a subset of categories S ⊂ {1, . . . , C} and define the child nodes

  AL = {x ∈ A : x^(j) ∈ S}
  AR = {x ∈ A : x^(j) ∉ S}

• There are at most 2^(C−1) − 1 possible splits
• When C is large, exhaustively searching for the best subset S can be computationally intense.
• In the R randomForest package, C can be at most 53 (and when C is larger than 10, the search is not exhaustive).
• Some heuristic methods are used, such as randomly sampling a subset of {1, . . . , C} as S.
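For illustration (not from the slides), categorical predictors are simply passed as factors; the toy data below are made up:

library(randomForest)
set.seed(1)
n   <- 300
grp <- factor(sample(LETTERS[1:6], n, replace = TRUE))   # categorical predictor, C = 6
x   <- runif(n)
y2  <- factor(rbinom(n, 1, ifelse(grp %in% c("A", "B"), 0.8, 0.2)))
fit <- randomForest(y2 ~ grp + x, data = data.frame(y2, grp, x))
fit$confusion    # out-of-bag confusion matrix; splits on grp use category subsets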

22/50
Overfitting and Tree Pruning

• There is a close connection with the (adaptive) histogram estimator
• A large tree (with too many splits) can easily overfit the data
• Small terminal nodes ⇐⇒ small bias, large variance
• A small tree may not capture important structures
• Large terminal nodes ⇐⇒ large bias, small variance
• Tree size is measured by the number of splits

23/50
Overfitting and Tree Pruning

• Balancing tree size and accuracy follows the same "loss + penalty" framework
• One possible approach is to split tree nodes only if the decrease in the loss exceeds a certain threshold; however, this can be short-sighted
• A better approach is to grow a large tree, then prune it

24/50
Cost-Complexity Pruning

• First, fit the maximum tree Tmax (possibly one observation per
terminal node).
• Specify a complexity penalty parameter α.
• For any sub-tree T ⪯ Tmax of Tmax, calculate

  Cα(T) = Σ_{terminal nodes A of T} |A| · Gini(A) + α|T|
        = C(T) + α|T|,

  where |A| is the cardinality of node A and |T| is the cardinality (number of terminal nodes) of tree T.

• Find the T that minimizes Cα(T)
• A large α gives small trees
• Choose α using cross-validation (or a plot); see the sketch below
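In rpart, cost-complexity pruning is driven by the complexity parameter cp (essentially α rescaled by the root-node risk); a sketch using the cart.fit object from the earlier example:

> printcp(cart.fit)                     # CP table with cross-validated error (xerror)
> plotcp(cart.fit)                      # plot of cross-validated error vs. cp
> best.cp = cart.fit$cptable[which.min(cart.fit$cptable[, "xerror"]), "CP"]
> pruned.fit = prune(cart.fit, cp = best.cp)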

25/50
Missing Values

• If each variable independently has a 5% chance of being missing, then with 50 variables only about 7.7% of the samples have complete measurements (0.95^50 ≈ 0.077).
• The traditional approaches are to discard observations with missing values, or to impute them
• Tree-based methods can handle missing values by either treating them as a separate category, or by using surrogate variables whenever the splitting variable is missing.
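For the surrogate-variable approach, rpart exposes controls in rpart.control(); a sketch (see ?rpart.control for the exact semantics):

> # surrogate splits are on by default; these arguments control how they are used
> fit.na = rpart(as.factor(y) ~ x1 + x2, data = data.frame(x1, x2, y),
+               control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
> summary(fit.na)   # lists the surrogate variables chosen at each split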

26/50
Remark

• Advantages of tree-based methods:
• Handle both categorical and continuous variables in a simple and natural way
• Invariant under all monotone transformations of the variables
• Robust to outliers
• Flexible model structure, captures interactions, easy to interpret
• Limitations:
• Small changes in the data can result in a very different series of splits
• Non-smooth. Other techniques such as multivariate adaptive regression splines (MARS, Friedman 1991) can be used to generate smoother models.

27/50
Random Forests
Weak and Strong Learners

• Back in the mid-to-late 90's, researchers started to investigate whether aggregating "weak learners" (unstable, less accurate) can produce a "strong learner".
• Bagging, boosting, and random forests are all methods along this line.
• Bagging and random forests learn individual trees with some random perturbations, and "average" them.
• Boosting progressively learns models of small magnitude, then "adds" them.
• In general, Boosting, Random Forests ≻ Bagging ≻ Single Tree (in terms of predictive performance).

29/50
Bagging Predictors

• Bagging stands for "Bootstrap aggregating"
• Draw B bootstrap samples from the training dataset, fit CART to each of them, then average the trees
• "Averaging" is symbolic: what we are really doing is getting the prediction from each tree and averaging the predicted values.
• Motivation: CART is unstable; perturbing and averaging can improve stability and lead to better accuracy

30/50
Ensemble of Trees

(Diagram: the training data Dn generate bootstrap samples 1, 2, . . . , B; a tree f̂_1(x), f̂_2(x), . . . , f̂_B(x) is fit to each, and the trees are combined into the final predictor f̂(x).)

31/50
Bagging Predictors

• Bootstrap samples are drawn with replacement. Fit a CART model to each bootstrap sample (each tree may require pruning).
• To combine the bootstrap learners, for classification:

  f̂_bagging(x) = Majority Vote { f̂_b(x) }_{b=1}^B,

  and for regression:

  f̂_bagging(x) = (1/B) Σ_{b=1}^B f̂_b(x)

• Bagging dramatically reduces the variance of the individual learners
• CART can be replaced by other weak learners
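A bare-bones bagging loop in R (a sketch reusing the earlier simulated x1, x2, y; not the implementation behind the slides' figures):

library(rpart)
set.seed(42)
B     <- 100
n     <- length(y)
dat   <- data.frame(x1, x2, y = as.factor(y))
votes <- matrix(0, nrow = n, ncol = B)
for (b in 1:B) {
  idx        <- sample(n, n, replace = TRUE)            # bootstrap sample
  tree.b     <- rpart(y ~ x1 + x2, data = dat[idx, ])   # the b-th tree
  votes[, b] <- as.numeric(as.character(predict(tree.b, dat, type = "class")))
}
bag.pred <- ifelse(rowMeans(votes) > 0.5, 1, 0)          # majority vote
mean(bag.pred != y)                                      # training error of the bagged classifier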

32/50
CART vs. Bagging

(Figure: side-by-side comparison of a single CART fit and the bagged fit.)

33/50
Remarks about Bagging

• Why does bagging work?
• Averaging (nearly) independent copies of f̂(x) can lead to reduced variance
• The "independence" is introduced by bootstrapping
• However, the simple structure of trees is lost after averaging, so the result is difficult to interpret

34/50
Remarks about Bagging

• But the performance of bagging in practice is oftentimes not satisfactory. Why?
• The trees are not really independent...
• Different trees are highly correlated, which makes averaging less effective
• How can we further de-correlate the trees?

35/50
Random Forests

• Several articles came out in the late 90's discussing the advantages of using random features; these papers greatly influenced Breiman's idea of random forests.
• For example, in Ho (1998), each tree is constructed using a randomly selected subset of features
• Random forests take this a step further: at each split, only a random subset of features is considered
• Important tuning parameters: mtry and nodesize

36/50
Tuning Parameter: mtry

• An important tuning parameter of random forests is mtry


• At each split, randomly select mtry variables from the entire set
of features {1, . . . , p}
• Search for the best variable and the splitting point out of these
mtry variables
• Split and proceed to child nodes

37/50
Tuning Parameter: nodesize

• Another important tuning parameter is (terminal) nodesize


• Random forests do not perform pruning!
• Instead, splitting does not stop until the terminal node size is less than or equal to nodesize, and the entire (unpruned) tree is used.
• nodesize controls the trade-off between bias and variance in each tree, similar to k in kNN
• In the most extreme case, nodesize = 1 means each observation is fit exactly, but this is not 1NN!

38/50
Tuning parameters

• A summary of important tuning parameters in random forests (using the R package randomForest):
– ntree: number of trees; set it to be large. Default 500.
– mtry: number of variables considered at each split. Default p/3 for regression, √p for classification.
– nodesize: terminal node size. Default 5 for regression, 1 for classification.
– sampsize: bootstrap sample size, usually n with replacement.
• Overall, tuning is very crucial in random forests (see the sketch below)
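A hedged sketch of fitting a forest with these parameters on the earlier simulated data (the mtry values compared are arbitrary, since p = 2 here):

> library(randomForest)
> dat = data.frame(x1, x2, y = as.factor(y))
> rf.fit = randomForest(y ~ x1 + x2, data = dat,
+                       ntree = 1000, mtry = 1, nodesize = 25)
> rf.fit                      # prints the out-of-bag (OOB) error estimate
> # crude tuning: compare the final OOB error across candidate mtry values
> sapply(1:2, function(m)
+   randomForest(y ~ x1 + x2, data = dat, ntree = 1000, mtry = m)$err.rate[1000, "OOB"])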

39/50
CART vs. Bagging vs. RF

(Figure: three panels comparing the CART, Bagging, and RF fits.)

RF: ntree = 1000, mtry = 1, nodesize = 25

40/50
Smoothness Effect of Random Forests

(Figure: two panels, CART vs. RF, illustrating the smoothing effect of averaging. Age: continuous; Diagnosis: categorical.)

41/50
Random Forests vs. Kernel
Random Forests vs. Kernel

• Random forests are essentially kernel methods


• However, the distance used in random forests is adaptive to the
true underlying structure
• This can be seen from the kernel weights derived from a random forest

43/50
RF vs. Kernel

Random forest kernel at two different target points

44/50
RF vs. Kernel

Gaussian Kernel at two different target points

45/50
Variable Importance
Variable Importance

• Random forests have a built-in variable selection tool: variable importance
• Variable importance utilizes the samples that are not selected by bootstrapping (the out-of-bag data):
• For the b-th tree, use the corresponding out-of-bag data as the testing set to obtain the prediction error Err_b^0
• For each variable j, randomly permute its values among these testing samples and recalculate the prediction error Err_b^j
• Calculate, for each j,

  VI_b^j = Err_b^j / Err_b^0 − 1

• Average VI_b^j across all trees:

  VI_j = (1/B) Σ_{b=1}^B VI_b^j
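In the randomForest package, this permutation importance is computed when importance = TRUE is set at fit time; a sketch using the simulated data from earlier:

> dat = data.frame(x1, x2, y = as.factor(y))
> rf.fit = randomForest(y ~ x1 + x2, data = dat, ntree = 1000, importance = TRUE)
> importance(rf.fit, type = 1)   # type 1: permutation-based mean decrease in accuracy
> varImpPlot(rf.fit)             # plots both importance measures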

47/50
Variable Importance

• This essentially works like a cross-validation:


• the in-bag samples are training samples,
• the out-of-bag samples are testing samples
• a bootstrapped cross-validation
• Usually the misclassification error is used instead of the Gini index
• A higher VI means a larger loss of accuracy when the information in X^(j) is destroyed, hence a more important variable.

48/50
Variable Importance in RF

(Figure: variable importance bar plot; the horizontal axis lists the variables x1 through x50 and the vertical axis shows importance values from 0 to 10.)

Same simulation setting as the "circle" example, with 48 additional noise variables.
49/50
Remarks about Random Forests

• Performs well on high-dimensional data


• Tuning parameters are crucial
• Difficult to interpret
• Adaptive kernel

50/50
