
Random Forest

Mathieu Ribatet—Full Professor of Statistics



⊲ 1. Introduction
2. CART

3. Random forest

4. Feature importance

1. Introduction



Some references

[1] Gérard Biau and Erwan Scornet. A random forest guided tour. TEST,
25(2):197–227, 2016.

[2] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[3] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts.
Understanding variable importances in forests of randomized trees. In C.J.
Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems, volume 26. Curran
Associates, Inc., 2013.

[4] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model
predictions. In Proceedings of the 31st International Conference on Neural
Information Processing Systems, NIPS'17, pages 4768–4777, Red Hook, NY,
USA, 2017. Curran Associates Inc.



Quick overview

 Random forests are a fairly recent learning strategy (early 2000s).
 They are based on classification and regression trees, or CART for short.
 They modify bagging so as to mitigate the dependence between trees.

Note: Bagging and random forests heavily rely on the bootstrap.



A simple statement

Proposition 1. Let $T_1, \ldots, T_B$ be independent copies of $T$ with $\mathrm{Var}(T) = \sigma^2$. Then
$$\mathrm{Var}\big(\bar{T}_B\big) = \frac{\sigma^2}{B}, \qquad \bar{T}_B = \frac{1}{B}\sum_{b=1}^{B} T_b.$$

Proposition 2. Let $T_1, \ldots, T_B$ be dependent copies of $T$ with $\mathrm{Var}(T) = \sigma^2$ and pairwise correlation $\rho > 0$. Then
$$\mathrm{Var}\big(\bar{T}_B\big) = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2, \qquad \bar{T}_B = \frac{1}{B}\sum_{b=1}^{B} T_b.$$

Note: Since $\mathrm{Var}\big(\bar{T}_B\big) \to \rho\sigma^2$ as $B \to \infty$, the pairwise correlation $\rho$ mainly controls the variance of $\bar{T}_B$ once $B$ is large enough. Random forests aim at reducing $\rho$ without increasing $\sigma^2$ too much.
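The two propositions are easy to check numerically. Below is a minimal R sketch (not part of the original slides; all values are illustrative) that simulates equicorrelated variables and compares the Monte-Carlo variance of the average with the formula of Proposition 2.

## Monte-Carlo check of Proposition 2 (illustrative values only).
set.seed(1)
B <- 100; rho <- 0.3; sigma2 <- 1; n_rep <- 1e4

## Equicorrelated T_1, ..., T_B built from a shared plus an idiosyncratic component
shared <- matrix(rnorm(n_rep, sd = sqrt(rho * sigma2)), n_rep, B)        # same value in each row
idio   <- matrix(rnorm(n_rep * B, sd = sqrt((1 - rho) * sigma2)), n_rep, B)
Tbar   <- rowMeans(shared + idio)

var(Tbar)                               # Monte-Carlo estimate
rho * sigma2 + (1 - rho) * sigma2 / B   # Proposition 2: 0.3 + 0.7/100 = 0.307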



1. Introduction

⊲ 2. CART
3. Random forest

4. Feature importance

2. CART



What is a binary tree?

Definition 1. A tree is a collection of connected nodes. It is often used to display a hierarchical structure in a graphical way. Nodes without any children are called leaves or terminal nodes.

Definition 2. A binary tree is a tree whose nodes have at most two children.

Figure 1: A binary tree (diagram not reproduced).



Classification And Regression Trees (CART)

 CART are binary trees and are widely used in statistics.
 They can be used both for regression and classification problems.
 Each terminal node carries an estimator.
 The estimator has the following form:
$$\hat{f}(X) = \sum_{j=1}^{n_{\mathrm{terminal}}} c_j\, \mathbf{1}\{X \in R_j\}, \qquad X \in \mathcal{X},$$
where $n_{\mathrm{terminal}}$ is the number of terminal nodes, $R_j$ is a subset of $\mathcal{X}$ and $c_j$ is an estimator for region $R_j$.
 CART are built by recursive binary splitting (see the R sketch below).
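As an illustration, here is a minimal R sketch fitting a CART with the rpart package on the Boston housing data used later in the slides; this example is an addition of mine, not the author's code.

## Fit a regression tree (CART) on the Boston housing data.
library(MASS)    # Boston data set
library(rpart)   # CART implementation

fit <- rpart(medv ~ ., data = Boston, method = "anova")
print(fit)                     # each leaf is a region R_j with its fitted constant c_j
predict(fit, Boston[1:3, ])    # piecewise-constant predictions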



Region Rj

 Within a CART, any $X$ has to lie in a single terminal node, i.e.,
$$X \in R_{j(X)}, \quad \text{for some unique } j(X) \in \{1, \ldots, n_{\mathrm{terminal}}\}.$$
 Hence we get a partition of $\mathcal{X}$, i.e.,
$$\bigcup_{j=1}^{n_{\mathrm{terminal}}} R_j = \mathcal{X}, \qquad R_{j_1} \cap R_{j_2} = \emptyset, \quad j_1 \neq j_2.$$

Warning: Due to binary splits, not all partitions of $\mathcal{X}$ are valid!



Figure 2: Two partitions of X . Only one is admissible! Taken from Elements of Statistical Learning (Second
edition).



How to split a node into two children?

 Consider the case where $X = (X_1, \ldots, X_p) \in \mathcal{X}$.
 The $X_j$'s can be a mix of numerical and categorical variables.
 A node $N_{\mathrm{parent}}$ will have two children $N_{\mathrm{child}\,1}$ and $N_{\mathrm{child}\,2}$ such that
$$N_{\mathrm{child}\,1} \in S_{\mathrm{parent}}, \qquad N_{\mathrm{child}\,2} \notin S_{\mathrm{parent}},$$
with
$$S_{\mathrm{parent}} = \mathcal{X}^{p-1} \times C_j,$$
where $C_j$ is a subset of the possible outcomes of feature $X_j$.
 At each split, the feature $X_j$ is selected according to a relevant criterion.

Note: To ease the explanations, we will assume that all covariates $X_j$ and the outcome $Y$ are numerical. We will come back to categorical variables later.



Figure 3: A CART with $\mathcal{X} \subset \mathbb{R}^2$ (left) and the corresponding partition of $\mathcal{X}$ (right). The tree splits successively on $X_1 \leq t_1$, $X_2 \leq t_2$, $X_1 \leq t_3$ and $X_2 \leq t_4$, producing the regions $R_1, \ldots, R_5$. (Diagram not reproduced.)



Splitting a node

Given a current node, we want to find splitting regions
$$R_1(j, s) = \{X : X_j \leq s\}, \qquad R_2(j, s) = \{X : X_j > s\},$$
using the following optimization problem:
$$\operatorname*{argmin}_{j,\,s} \left\{ \min_{c_1} \sum_{i\,:\,X_i \in R_1(j,s)} (Y_i - c_1)^2 + \min_{c_2} \sum_{i\,:\,X_i \in R_2(j,s)} (Y_i - c_2)^2 \right\}.$$

Note: For any $(j, s)$, the optimal $c_1$ and $c_2$ are always
$$\hat{c}_1 = \frac{1}{|R_1(j,s)|} \sum_{i\,:\,X_i \in R_1(j,s)} Y_i, \qquad \hat{c}_2 = \frac{1}{|R_2(j,s)|} \sum_{i\,:\,X_i \in R_2(j,s)} Y_i.$$



 
$$\operatorname*{argmin}_{j,\,s} \left\{ \sum_{i\,:\,X_i \in R_1(j,s)} (Y_i - \hat{c}_1)^2 + \sum_{i\,:\,X_i \in R_2(j,s)} (Y_i - \hat{c}_2)^2 \right\}$$

 Finding the optimal cutoff value $s$ and feature $j$ is relatively easy.
 For feature $X_j$, the candidate cutoff values $s$ are the observed values of $X_j$.
 Hence the above optimization problem is solved by a brute-force search (see the sketch below).

Note: All possible cutoff values $s$ can be computed once and for all at the beginning of the learning stage.
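A minimal R sketch of this brute-force search (my own illustration with made-up function names, not the implementation used inside rpart): for every feature and every observed cutoff, compute the two within-region sums of squares and keep the best pair $(j, s)$.

## Brute-force search of the best split (j, s) for a regression tree node.
best_split <- function(X, y) {
  best <- list(sse = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[, j])) {                 # candidate cutoffs = observed values
      left <- X[, j] <= s
      if (!any(left) || all(left)) next         # skip splits with an empty child
      sse <- sum((y[left]  - mean(y[left]))^2) +
             sum((y[!left] - mean(y[!left]))^2)
      if (sse < best$sse) best <- list(sse = sse, j = j, s = s)
    }
  }
  best
}

## Example: root split on the Boston housing data (column 14 is the outcome medv)
library(MASS)
best_split(as.matrix(Boston[, -14]), Boston$medv)

On these data the root split lands on one of the strong predictors (typically rm or lstat).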



Growing and pruning a tree

 The main strategy is to:
1. repeatedly split nodes until some minimum node size is reached;
2. "simplify" the obtained tree.
 The last stage is called pruning a tree.
 It consists in "collapsing the internal nodes of a tree" based on some criterion.
 The criterion is often cost-complexity pruning:
$$P_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|, \qquad \alpha \geq 0,$$
where $T$ is a binary tree with $|T|$ terminal nodes $R_1, \ldots, R_{|T|}$ and
$$N_m = \sum_{i=1}^{n} \mathbf{1}\{X_i \in R_m\}, \qquad Q_m(T) = \frac{1}{N_m} \sum_{i=1}^{n} (Y_i - \hat{c}_m)^2\, \mathbf{1}\{X_i \in R_m\}.$$



$$P_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|, \qquad \alpha \geq 0.$$

 The tuning parameter $\alpha$ drives the tradeoff between goodness of fit and complexity:
– large values of $\alpha$ yield small trees (simple models);
– small values of $\alpha$ yield large trees (complex models).
 Pruning is done iteratively by
1. collapsing the internal node that gives the smallest increase in quadratic loss, to get a sub-tree $\tilde{T}$;
2. iterating on the new tree until we get the single-node tree, i.e., the new tree is the root.
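In R, cost-complexity pruning is available through rpart, whose complexity parameter cp plays the role of $\alpha$ (up to a rescaling by the root-node error). A short sketch, again my own illustration rather than the author's code:

## Grow a deliberately large tree, then prune it with the cost-complexity criterion.
library(MASS); library(rpart)
set.seed(1)
fit <- rpart(medv ~ ., data = Boston, method = "anova", cp = 0.001)
printcp(fit)                                            # cross-validated error for each cp value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                     # collapse the weakest internal nodes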



Dealing with categorical covariates

 Suppose that $X_j$ is a categorical variable with levels $E = \{1, \ldots, K\}$.
 When $K = 2$, splitting is straightforward since either $X_j = 1$ or $X_j \neq 1$.
 When $K > 2$, we quickly face a computational burden since there are
$$\frac{2^K - 2}{2} = 2^{K-1} - 1$$
ways to partition $E$ into two non-overlapping sets (the $-2$ omits $\emptyset$ and $E$; the division by $2$ accounts for symmetry).
 The optimal split can nevertheless be found from only $K - 1$ evaluations (a non-trivial result; a sketch of the classical trick follows).
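The usual shortcut behind the $K - 1$ evaluations (valid for regression and binary classification) is to order the levels by their mean response and only consider splits that respect this ordering. The slides do not detail it, so the R sketch below is only my illustration of that classical trick on made-up data.

## Order the K levels by mean response; only K - 1 "ordered" subsets need to be checked.
set.seed(2)
xj <- factor(sample(letters[1:4], 100, replace = TRUE))   # categorical feature, K = 4
y  <- rnorm(100, mean = as.integer(xj))                   # numerical outcome

mean_by_level  <- tapply(y, xj, mean)
ordered_levels <- names(sort(mean_by_level))
candidate_splits <- lapply(seq_len(nlevels(xj) - 1),
                           function(l) ordered_levels[seq_len(l)])
candidate_splits    # 3 candidate subsets instead of 2^(K-1) - 1 = 7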



Classification trees

 For classification problems, i.e., $Y \in \{1, \ldots, K\}$, we cannot use the quadratic loss $\mathbb{E}_{(Y,X)}\big[(Y - \hat{Y})^2\big]$ anymore.
 We must use a measure of non-homogeneity, i.e., of node impurity:

Classification error: $1 - \Pr_{(Y,X)}(\hat{Y} = Y)$

Gini's index: $\mathbb{E}_{(Y,X)}\big[1 - \Pr(\hat{Y} = Y)\big]$

Cross-entropy: $-\mathbb{E}_{(Y,X)}\big[\log \Pr(\hat{Y} = Y)\big]$

Note: Remember that $\hat{Y}$ obviously is a function of $X$.
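For a two-class node with class-1 proportion p, these impurities have simple closed forms. A small R sketch of mine, for illustration only:

## Node impurities for a binary node as functions of the class-1 proportion p.
misclass <- function(p) 1 - pmax(p, 1 - p)
gini     <- function(p) 2 * p * (1 - p)
xentropy <- function(p) -(p * log(p) + (1 - p) * log(1 - p))   # natural logarithm

p <- 0.25
c(misclassification = misclass(p), gini = gini(p), cross_entropy = xentropy(p))
## Figure 4 below plots these curves (with a rescaled cross-entropy) over p in [0, 1].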



 The three impurity measures share a similar overall pattern.
 The misclassification error is not differentiable.
 Gini and cross-entropy are:
– differentiable, which helps optimization;
– in favour of pure nodes.

Figure 4: Different node impurity measures (classification error, Gini's index, cross-entropy) for a binary classification problem, plotted against $\Pr(\hat{Y} = Y)$. The cross-entropy impurity has been rescaled so that it equals 0.5 at $x = 0.5$. (Plot not reproduced.)

Note: Gini and cross-entropy should be used to grow the tree, while any impurity measure can be used for pruning; the misclassification rate is often used there though.



Pros and cons

+ Almost no data pre-processing, e.g., no data scaling.
+ Robust to outliers.
+ Allows for missing values.
+ Intuitive and easy to explain to non-specialists.
+ Somewhat interpretable.
− Highly unstable: a small change in X may give a completely different answer.
− CPU demanding.
− Prone to overfitting (pruning mitigates this drawback).
− The predictor is not continuous in regression.
− High bias for unbalanced designs; the data must be re-balanced.

Warning: It is highly recommended not to use a single CART but rather a random forest.



1. Introduction

2. CART

⊲ 3. Random forest
4. Feature importance

3. Random forest



Towards random forest

$$\mathrm{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) \approx \rho\sigma^2, \qquad \mathrm{Cov}(T_b, T_{b'}) = \rho\sigma^2, \quad B \gg 1, \; \rho > 0.$$

 If we have a low (positive) correlation, the variance is reduced.
 We therefore need a way to get almost uncorrelated trees.
 We use the same rationale as bagging, i.e.,
1. generate a synthetic data set by bootstrapping both the individuals and the covariates;
2. fit a CART;
3. repeat.
 Having "different" data sets reduces the correlation between trees.



Algorithm 1: Random forest for regression and classification.
input: supervised data set $\mathcal{D}_n = \{(X_i, Y_i) : i = 1, \ldots, n\}$, number of trees $B$
1 for $b \leftarrow 1$ to $B$ do
2   Draw a bootstrap sample $\mathcal{D}_b$ of size $n$ from $\mathcal{D}_n$;
3   Grow a tree $T_b$ from $\mathcal{D}_b$ by repeating the following steps until the minimum node size is reached:
    1. select $m$ variables at random from the $p$ variables;
    2. pick the best variable / split-point among these $m$;
    3. split the node into two child nodes.
4 Output the ensemble of trees $\{T_b : b = 1, \ldots, B\}$ and the predictors

Regression (averaging): $\hat{f}_B : x \longmapsto \frac{1}{B}\sum_{b=1}^{B} T_b(x)$

Classification (majority vote): $\hat{C}_B : x \longmapsto \operatorname*{argmax}_k \sum_{b=1}^{B} \mathbf{1}\{T_b(x) = k\}$

Note: As a rule of thumb, for regression use $m = \lfloor p/3 \rfloor$ with a minimum node size of 5; for classification use $m = \lfloor \sqrt{p} \rfloor$ with a minimum node size of 1. But these are just guidelines and in practice you should consider fine tuning.
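In R, this algorithm is implemented in the randomForest package; its mtry argument plays the role of $m$ above and nodesize that of the minimum node size. A brief sketch of mine, with settings matching the rule of thumb:

## Random forest for the Boston regression problem (mtry = floor(p/3), nodesize = 5).
library(MASS); library(randomForest)
set.seed(3)
p  <- ncol(Boston) - 1
rf <- randomForest(medv ~ ., data = Boston, ntree = 500,
                   mtry = floor(p / 3), nodesize = 5)
predict(rf, Boston[1:3, ])   # predictions averaged over the 500 trees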



Out of bag samples

Algorithm 2: Bootstrapping the observations gives OOB samples.
input: supervised data set $\mathcal{D}_n = \{(X_i, Y_i) : i = 1, \ldots, n\}$, number of trees $B$
1 for $b \leftarrow 1$ to $B$ do
2   Draw a bootstrap sample $\mathcal{D}_b$ of size $n$ from $\mathcal{D}_n$;
3   Grow a tree $T_b$ from $\mathcal{D}_b$ using the following steps:
    1. select $m$ variables at random from the $p$ variables;
    2. pick the best variable / split-point among these $m$;
    3. split the node into two child nodes.

 By construction, some observations are discarded while fitting tree $T_b$.
 These observations are called Out Of Bag (OOB).
 As a consequence, we can estimate the generalization error from the OOB samples:
$$\frac{1}{n}\sum_{i=1}^{n} \frac{1}{N_i} \sum_{b=1}^{B} \mathrm{loss}\{Y_i, T_b(X_i)\}\, \mathbf{1}\{(X_i, Y_i) \notin \mathcal{D}_b\}, \qquad N_i = \sum_{b=1}^{B} \mathbf{1}\{(X_i, Y_i) \notin \mathcal{D}_b\}.$$
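With randomForest, out-of-bag predictions are returned by predict() when no new data are supplied, which gives the OOB error directly. A short, purely illustrative sketch:

## OOB estimate of the generalization error for the Boston regression forest.
library(MASS); library(randomForest)
set.seed(3)
rf <- randomForest(medv ~ ., data = Boston, ntree = 500)
oob_pred <- predict(rf)                    # OOB predictions (no newdata argument)
mean((Boston$medv - oob_pred)^2)           # OOB mean squared error
tail(rf$mse, 1)                            # essentially the same quantity, stored by the package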



Variable importance

 Within a tree, variable importance can be assessed from the improvement in the splitting criterion.
 Since a random forest is a collection of trees, we simply accumulate these importances over all trees.



1. Introduction

2. CART

3. Random forest
⊲ 4. Feature importance

4. Feature importance



Motivation

 A CART is (quite) interpretable since it is based on binary splits.
 Interpretation is more challenging for random forests since we are "averaging" over multiple binary trees.
 How do we know which features have a large impact on predictions?
 This stage is known as assessing variable importance.
 Not all statistical models enable variable importance measures, but random forests do!
 Let's see how.



Mean decrease impurity

 Binary trees split a node $N$ into $(N_L, N_R)$ maximizing the decrease in impurity
$$\Delta i(N) = i(N) - \{p(L)\, i(N_L) + p(R)\, i(N_R)\}, \qquad p(L) = \frac{|N_L|}{|N|}, \quad p(R) = 1 - p(L),$$
where $i(N)$ and $|N|$ are the impurity and the cardinality of node $N$.
 The Mean Decrease Impurity (MDI) aggregates these decreases over the nodes of $T$, weighted by the fraction $p(N)$ of observations reaching $N$:
$$\mathrm{MDI}_T(X_j) = \sum_{N \in T} p(N)\, \Delta i(N)\, \mathbf{1}\{\text{split of } N \text{ uses } X_j\}.$$
 For a random forest $F = \{T_1, \ldots, T_B\}$, we average over trees, i.e.,
$$\mathrm{MDI}_F(X_j) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{MDI}_{T_b}(X_j).$$
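randomForest reports this quantity when importance is requested: for regression it is the total decrease in residual sum of squares due to splits on each variable, averaged over the trees. A sketch, again my own illustration:

## Mean decrease impurity on the Boston forest ("IncNodePurity" column).
library(MASS); library(randomForest)
set.seed(4)
rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)
importance(rf, type = 2)      # MDI; type = 1 gives the permutation importance (MDA)
varImpPlot(rf)                # plots both measures, similar to Figures 5 and 6 below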



Boston data set

> head(Boston)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7

 The aim is to predict house prices (a regression problem).
 The sample size is $n = 506$.
 There are $p = 13$ covariates (only a few of them categorical).



Figure 5: Mean decrease impurity for the Boston housing regression problem (bar chart not reproduced). Variables ranked from most to least important: lstat, rm, indus, dis, ptratio, crim, nox, tax, rad, age, black, chas, zn.



Mean decrease accuracy (a.k.a. permutation importance)

 The Mean Decrease Accuracy (MDA) of feature $X_j$ for a (fitted) tree $T$ is
$$\mathrm{MDA}_T(X_j; \mathcal{D}_n) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{loss}\{Y_i, T(X_i^{(j)})\} - \frac{1}{n}\sum_{i=1}^{n} \mathrm{loss}\{Y_i, T(X_i)\},$$
where $X_i^{(j)}$ is the $i$-th observation of $\mathcal{D}_{j,n}$, a copy of the original data $\mathcal{D}_n$ in which feature $X_j$ has been randomly shuffled.
 For a random forest $F = (T_1, \ldots, T_B)$, we average over all trees, i.e.,
$$\mathrm{MDA}(F, \mathcal{D}_n) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{MDA}_{T_b}(X_j; \tilde{\mathcal{D}}_{n,b}),$$
where $\tilde{\mathcal{D}}_{n,b}$ is the out-of-bag sample of tree $T_b$.

Note: Intuitively, if $X_j$ is not influential, shuffling it should not degrade the prediction performance too much, hence the MDA should be small (see the sketch below).
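The permutation logic is easy to reproduce by hand. The sketch below is my own simplified version: it uses the whole training sample and the squared-error loss, whereas the slides (and randomForest) compute the difference on the out-of-bag samples of each tree.

## Hand-rolled permutation importance (MDA-like) for a few features.
library(MASS); library(randomForest)
set.seed(5)
rf <- randomForest(medv ~ ., data = Boston)

mda_sketch <- function(feature, model, data, y = "medv") {
  base <- mean((data[[y]] - predict(model, data))^2)
  shuffled <- data
  shuffled[[feature]] <- sample(shuffled[[feature]])     # break the link between X_j and Y
  mean((data[[y]] - predict(model, shuffled))^2) - base  # increase in loss
}

sapply(c("lstat", "rm", "chas"), mda_sketch, model = rf, data = Boston)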



Figure 6: Mean decrease accuracy for the Boston housing regression problem (bar chart not reproduced). Variables ranked from most to least important: rm, lstat, nox, dis, indus, ptratio, rad, crim, age, tax, zn, black, chas.



Figure 7: Comparison of the two feature importance measures for the Boston housing data (charts not reproduced): the mean decrease impurity ranking (lstat, rm, indus, dis, ptratio, crim, nox, tax, rad, age, black, chas, zn) versus the mean decrease accuracy ranking (rm, lstat, nox, dis, indus, ptratio, rad, crim, age, tax, zn, black, chas).



Shapley values

 Shapley values come from game theory and hence are model-agnostic.
 Within a (coalitional) game with $n$ players, they give the "fair" distribution of the (maximal) profit.
 In our setting, we have
Game: predict $Y$ given $X$;
Players: the features;
Profit: the model's prediction for an observation $(Y_i, X_i)$.
 The Shapley value of the $i$-th observation and $j$-th feature ($p$ features in total) is
$$\mathrm{Shapley}(X_{i,j}) = \frac{1}{p} \sum_{S \subseteq \{1,\ldots,p\}\setminus\{j\}} \binom{p-1}{|S|}^{-1} \underbrace{\{\nu(X_{i,S} \cup X_{i,j}) - \nu(X_{i,S})\}}_{\text{marginal contribution of } X_{i,j} \text{ to } X_{i,S} \cup X_{i,j}},$$
where $\binom{p-1}{|S|}$ is the number of subsets of size $|S|$ that do not contain $j$.

Note: Intuitively, if $X_{i,j}$ is not influential, as before the predictions should not change too much when it joins a coalition, hence its Shapley value should be small.



Shapley value estimation

 Recall that Shapley values use terms of the form $\nu(X_S)$, $S \subseteq \{1, \ldots, p\}$.
 In principle, this requires fitting the model for every subset $S$, i.e., $2^p - 1$ models.
 To reduce the computational burden, one can use an estimator based on a marginalization approach to get $\nu(X_{S_1})$ from $\nu(X_{S_1} \cup X_{S_2})$, $S_1 \cap S_2 = \emptyset$, i.e.,
$$\hat{\nu}(X_{i,S_1}) = \frac{1}{n} \sum_{\ell=1}^{n} \nu(X_{i,S_1} \cup X_{\ell,S_2}).$$

Note: The above estimator is called Shapley sampling values.
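When $\nu$ applied to a full feature vector is simply the model prediction $\hat{f}$ (as for the SHAP values defined on the next slide), the marginalization estimator is a one-liner. The sketch below is my own illustration with made-up names (nu_hat, synth), not code from the slides:

## Marginalisation estimator nu_hat(X_{i,S1}): freeze the features in S1 at observation i
## and average the model prediction over the empirical distribution of the others.
nu_hat <- function(model, data, i, S1) {
  synth <- data                                   # n synthetic rows
  synth[, S1] <- data[rep(i, nrow(data)), S1]     # X_{i,S1} repeated on every row
  mean(predict(model, newdata = synth))           # average over the X_{l,S2}'s
}

## Example with the random forest fitted earlier on Boston:
## nu_hat(rf, Boston[, -14], i = 1, S1 = c("rm", "lstat"))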



Shapley additive explanation values

 SHapley Additive exPlanation (SHAP) values are Shapley values with
$$\nu(X_S) = \mathbb{E}\big[\hat{f}(X) \mid X_S\big].$$
 From basic probability theory, we easily get
$$\mathbb{E}\big[\hat{f}(X) \mid X_S = x_S\big] = \int \hat{f}(x)\, p(x \mid x_S)\, \mathrm{d}x_{-S}.$$
 SHAP values can be estimated in two ways:
– using the same marginalization strategy, i.e., SHAP sampling values;
– using a kernel approach, i.e., Kernel SHAP.
 We will now focus on Kernel SHAP.



Kernel SHAP

 Kernel SHAP assumes independence between covariates, i.e., $X_{S_1} \mid X_{S_2} \sim X_{S_1}$ when $S_1 \cap S_2 = \emptyset$, so that
$$\mathbb{E}\big[\hat{f}(X) \mid X_S\big] = \int \hat{f}(x)\, p(x \mid x_S)\, \mathrm{d}x_{-S} = \int \hat{f}(x)\, p(x_{-S})\, \mathrm{d}x_{-S}.$$
 Using sampling (as is often done), we can define the Kernel SHAP estimator
$$\hat{\nu}(X_{i,j}) = \frac{1}{L} \sum_{\ell=1}^{L} \hat{f}(X_{i,j}, \tilde{X}_{\ell,-j}),$$
where the $\tilde{X}_{\ell,-j}$'s are sampled independently of $X_{i,j}$.

Warning: If the covariates are highly dependent, the estimates will be completely off. Some variations exist to handle dependent features, e.g., using a multivariate Gaussian distribution.



Figure 8: Boxplots of the (scaled) Boston covariates and Shapley values (phi) for three houses from the Boston housing regression problem: an expensive one (actual prediction 45.00), a cheap one (actual prediction 10.57) and an average one (actual prediction 25.72), each compared to the average prediction of 22.52. (Plot not reproduced.)

