Random Forest Explained
1. Introduction
Since
$$\operatorname{Var}(\bar{T}_B) \longrightarrow \rho\sigma^2, \qquad B \to \infty,$$
the pairwise correlation $\rho$ mainly controls the variance of $\bar{T}_B$ as long as $B$ is large enough. Random forests aim at reducing $\rho$ without increasing $\sigma^2$ too much.
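This limit follows from a short, standard computation, assuming as above that the trees $T_b$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$:
$$\operatorname{Var}(\bar{T}_B) = \frac{1}{B^2}\Big\{ \sum_{b=1}^{B} \operatorname{Var}(T_b) + \sum_{b \neq b'} \operatorname{Cov}(T_b, T_{b'}) \Big\} = \frac{\sigma^2}{B} + \frac{B-1}{B}\,\rho\sigma^2 = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 \xrightarrow[B \to \infty]{} \rho\sigma^2.$$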
2. CART
The terminal regions form a partition of the feature space $\mathcal{X}$:
$$\bigcup_{j=1}^{n_{\mathrm{terminal}}} R_j = \mathcal{X}, \qquad R_{j_1} \cap R_{j_2} = \emptyset, \quad j_1 \neq j_2,$$
with splits of the form
$$S_{\mathrm{parent}} = \mathcal{X}^{p-1} \times C_j,$$
where $C_j$ is a subset of possible outcomes of feature $X_j$.
At each split, the feature $X_j$ (and the corresponding subset $C_j$, e.g., a half-line $(-\infty, t]$ for a numerical feature) is selected according to a relevant criterion.
To ease explanations, we will assume that all covariates $X_j$ and the outcome $Y$ are numerical. We will come back to categorical variables later.
[Figure: a binary regression tree with splits on $X_1$ and $X_2$ at cutoffs $t_1, \ldots, t_4$ (e.g., $X_2 \le t_2$, $X_1 \le t_3$, $X_2 \le t_4$), together with the induced partition of the $(X_1, X_2)$ plane into regions $R_1, \ldots, R_5$.]
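As an illustrative sketch (mine, not from the slides), a small tree of this kind can be grown with the rpart package on two covariates of the Boston housing data used later in these slides; the terminal nodes then correspond to rectangles $R_j$ in the (lstat, rm) plane.

library(MASS)    # Boston housing data
library(rpart)   # CART implementation

# Shallow regression tree on two covariates, so each terminal node is a rectangle.
fit_cart <- rpart(medv ~ lstat + rm, data = Boston, method = "anova",
                  control = rpart.control(maxdepth = 3))
print(fit_cart)  # the split rules (lstat < ..., rm < ...) delimit the regions R_j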
Finding the optimal cutoff value $s$ and feature $j$ is relatively easy: for feature $X_j$, the candidate cutoff values $s$ are the observed values of $X_j$.
Hence the above optimization problem is solved using a brute-force search.
× All possible cutoff values $s$ can be computed once and for all at the beginning of the learning stage.
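To make the brute-force search concrete, here is a minimal sketch (the function best_split is my own illustration, not code from the slides) that scans the observed values of one numerical feature and returns the cutoff minimizing the within-child sum of squares, i.e., the quadratic loss of a regression tree.

# Exhaustive split search for one numerical feature x and a numerical response y.
best_split <- function(x, y) {
  cutoffs <- sort(unique(x))               # candidate cutoffs: observed values of x
  sse <- function(v) sum((v - mean(v))^2)  # within-node sum of squares
  scores <- sapply(cutoffs, function(s) {
    left <- y[x <= s]; right <- y[x > s]
    if (length(right) == 0) return(Inf)    # skip the degenerate split
    sse(left) + sse(right)
  })
  list(cutoff = cutoffs[which.min(scores)], score = min(scores))
}

# Example: best cutoff of 'lstat' for predicting 'medv' on the Boston data.
# best_split(Boston$lstat, Boston$medv)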
For classification problems, i.e., $Y \in \{1, \ldots, K\}$, we cannot use the quadratic loss
$$\mathbb{E}_{(Y,X)}\big[(Y - \hat{Y})^2\big]$$
anymore.
Gini's index
$$\mathbb{E}_{(Y,X)}\big[1 - \Pr(\hat{Y} = Y)\big]$$
Cross-entropy
$$-\mathbb{E}_{(Y,X)}\big[\log \Pr(\hat{Y} = Y)\big]$$
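For a two-class node with class-1 proportion p, the node-level versions of these impurity measures are easy to write down; the sketch below (my own illustration, with the misclassification rate added for comparison) plots them, similar in spirit to the figure that follows.

# Node impurity of a two-class node as a function of the class-1 proportion p.
p    <- seq(0.001, 0.999, length.out = 200)
gini <- 2 * p * (1 - p)                        # Gini's index
xent <- -(p * log(p) + (1 - p) * log(1 - p))   # cross-entropy
misc <- pmin(p, 1 - p)                         # misclassification rate (for comparison)

matplot(p, cbind(gini, xent, misc), type = "l", lty = 1,
        xlab = "proportion of class 1 in the node", ylab = "impurity")
legend("topright", c("Gini", "cross-entropy", "misclassification"), col = 1:3, lty = 1)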
[Figure: Gini's index and cross-entropy plotted as functions of the class proportion within a node.]
× Gini and cross-entropy should be used to grow the tree, while any impurity measure can be used for pruning; the misclassification rate is often used, though.
3. Random forest
$$\operatorname{Var}\!\Big(\frac{1}{B}\sum_{b=1}^{B} T_b\Big) \approx \rho\sigma^2, \qquad \operatorname{Cov}(T_b, T_{b'}) = \rho\sigma^2, \quad B \gg 1, \ \rho > 0.$$
$$\hat{f}_B \colon x \longmapsto \frac{1}{B}\sum_{b=1}^{B} T_b(x), \qquad \hat{C}_B \colon x \longmapsto \operatorname*{argmax}_{k} \sum_{b=1}^{B} \mathbf{1}_{\{T_b(x) = k\}}$$
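As a sketch of how this aggregation looks in practice (assuming the randomForest package; the code is my own illustration, not from the slides), one can recover the per-tree predictions and check that the regression forest prediction is indeed their average.

library(MASS)           # Boston housing data
library(randomForest)

set.seed(42)
fit_rf <- randomForest(medv ~ ., data = Boston, ntree = 500, importance = TRUE)

pred <- predict(fit_rf, newdata = Boston[1:5, ], predict.all = TRUE)
pred$aggregate              # the forest prediction \hat f_B(x)
rowMeans(pred$individual)   # average of the B individual trees T_b(x): identical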
Recommendations: for regression use $m = \lfloor p/3 \rfloor$ with a minimum node size of 5; for classification use $m = \lfloor \sqrt{p} \rfloor$ with a minimum node size of 1. But these are just guidelines, and in practice you should consider fine tuning.
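These guidelines match what I understand to be the defaults of the randomForest package (arguments mtry and nodesize); here is a short sketch, again my own illustration, of setting them explicitly and tuning mtry on the out-of-bag error.

# Set m (mtry) and the minimum node size explicitly (p = 13 predictors here).
fit_tuned <- randomForest(medv ~ ., data = Boston,
                          mtry = floor((ncol(Boston) - 1) / 3),  # m = floor(p/3)
                          nodesize = 5)                          # minimum node size

# Simple search over mtry driven by the out-of-bag error.
tuneRF(x = Boston[, names(Boston) != "medv"], y = Boston$medv,
       mtryStart = 4, stepFactor = 1.5, improve = 0.01, ntreeTry = 200)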
4. Feature importance
Binary trees split node $N$ into $(N_L, N_R)$ maximizing the decrease in impurity
$$\Delta i(N) = \underbrace{i(N) - \{p(L)\, i(N_L) + p(R)\, i(N_R)\}}_{\text{impurity decrease after splitting } N}, \qquad p(L) = \frac{|N_L|}{|N|}, \quad p(R) = 1 - p(L).$$
> head(Boston)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
Figure 5: Mean decrease impurity for the Boston housing regression problem.
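A sketch of how importances like those in Figure 5 can be obtained (assuming the randomForest package and the forest fit_rf from the sketch above; type = 2 selects the mean decrease in node impurity, i.e., MDI):

importance(fit_rf, type = 2)   # total decrease in node impurity attributed to each feature
varImpPlot(fit_rf, type = 2)   # dot chart of the MDI importances, in the spirit of Figure 5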
where $\mathcal{D}_{j,n}$ is similar to the original data $\mathcal{D}_n$ except that feature $X_j$ has been randomly shuffled.
For a random forest $\mathcal{F} = (T_1, \ldots, T_B)$, we average over all trees, i.e.,
$$\mathrm{MDA}(\mathcal{F}, \mathcal{D}_n) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{MDA}_{T_b}(X_j, \tilde{\mathcal{D}}_{n,b}),$$
where $\tilde{\mathcal{D}}_{n,b}$ denotes the shuffled data used for tree $T_b$.
Figure 6: Mean decrease accuracy for the Boston housing regression problem.
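Correspondingly, a sketch of how the permutation-based importances of Figure 6 can be obtained (same assumptions as the MDI sketch: randomForest package, forest fit_rf fitted with importance = TRUE; type = 1 selects the mean decrease accuracy):

importance(fit_rf, type = 1)   # %IncMSE: increase in MSE after permuting each feature
varImpPlot(fit_rf, type = 1)   # dot chart of the MDA importances, in the spirit of Figure 6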
Shapley values come from game theory and hence are model-agnostic.
Within a (coalitional) game with n players, they give the "fair" distribution of the (maximal) profit among the players.
For our concern, we have
Game: predict Y given X;
Players: the features;
Profit: the model's prediction for an observation $(Y_i, X_i)$.
Shapley value of the i–th observation and j–th feature (p features in total) is
−1
1 X n−1
Shapley(Xi,j ) = {ν(Xi,S ∪ Xi,j ) − ν(Xi,S )}
p |S| | {z }
S⊆{1,...,p}\{j} | {z } marginal contribution of
number of partitions Xi,j to Xi,S ∪ Xi,j
of size |S| without j
Recall that Shapley values use terms of the form $\nu(X_S)$, $S \subseteq \{1, \ldots, p\}$.
For our concern, this implies fitting the model to every subset $S$, i.e., $2^p - 1$ models.
To reduce the computational burden, one can use an estimate based on a marginalization approach to get $\nu(X_{S_1})$ from $\nu(X_{S_1} \cup X_{S_2})$, $S_1 \cap S_2 = \emptyset$, i.e.,
$$\hat{\nu}(X_{i,S_1}) = \frac{1}{n} \sum_{\ell=1}^{n} \nu(X_{i,S_1} \cup X_{\ell,S_2}).$$
Kernel SHAP assumes independence between covariates, i.e., $X_{S_1} \mid X_{S_2} \sim X_{S_1}$ when $S_1 \cap S_2 = \emptyset$, so that
$$\mathbb{E}\big[\hat{f}(X) \mid X_S\big] = \int \hat{f}(x)\, p(x \mid x_S)\, dx_{-S} = \int \hat{f}(x)\, p(x_{-S})\, dx_{-S}.$$
- If covariates are highly dependent, estimates will be completely off. Some variations exist to handle dependent features, e.g., using a multivariate Gaussian distribution.
Figure 8: Boxplot (scaled Boston) and Shapley values for the Boston housing regression problem.
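As a final illustrative sketch (my own code, not from the slides, and not necessarily what produced Figure 8): combining the marginalization estimator above with randomly sampled coalitions gives a simple Monte Carlo approximation of the Shapley value of one feature for one observation, under the Kernel SHAP independence assumption. The function shapley_mc and its arguments are names introduced here for illustration only.

# Monte Carlo approximation of Shapley(X_{i,j}) for a fitted model:
# features outside the coalition are marginalized by borrowing their values
# from a randomly drawn row of the data (independence assumption).
shapley_mc <- function(model, data, i, j, M = 500) {
  p <- ncol(data)
  contrib <- numeric(M)
  for (m in seq_len(M)) {
    perm <- sample(p)                              # random feature ordering
    S <- perm[seq_len(which(perm == j) - 1)]       # coalition: features preceding j
    z <- data[sample(nrow(data), 1), ]             # background row for the other features
    x_with <- z
    x_with[c(S, j)] <- data[i, c(S, j)]            # coalition plus feature j taken from observation i
    x_without <- z
    if (length(S) > 0) x_without[S] <- data[i, S]  # coalition only
    contrib[m] <- predict(model, x_with) - predict(model, x_without)
  }
  mean(contrib)                                    # averaged marginal contribution of feature j
}

# Example: contribution of 'lstat' to the forest prediction for the first house.
# X <- Boston[, names(Boston) != "medv"]
# shapley_mc(fit_rf, X, i = 1, j = which(names(X) == "lstat"))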