
Random Forests

One of the best known classifiers is the random forest. It is very simple and effective but
there is still a large gap between theory and practice. Basically, a random forest is an average
of tree estimators.

These notes rely heavily on Biau and Scornet (2016) as well as the other references at the
end of the notes.

1 Partitions and Trees

We begin by reviewing trees. As with nonparametric regression, simple and interpretable
classifiers can be derived by partitioning the range of $X$. Let $\Pi_n = \{A_1, \ldots, A_N\}$ be
a partition of $\mathcal{X}$ and let $A_j$ be the partition element that contains $x$. Then $\hat h(x) = 1$ if
$\sum_{X_i \in A_j} Y_i \ge \sum_{X_i \in A_j} (1 - Y_i)$ and $\hat h(x) = 0$ otherwise. This is nothing other than the plug-in
classifier based on the partition regression estimator
$$
\hat m(x) = \sum_{j=1}^{N} \overline{Y}_j \, I(x \in A_j)
$$

where $\overline{Y}_j = n_j^{-1} \sum_{i=1}^n Y_i I(X_i \in A_j)$ is the average of the $Y_i$'s in $A_j$ and $n_j = \#\{X_i \in A_j\}$.
(We define $\overline{Y}_j$ to be 0 if $n_j = 0$.)

Recall from the results on regression that if $m \in H_1(1, L)$ and the binwidth $b$ of a regular
partition satisfies $b \asymp n^{-1/(d+2)}$ then
$$
\mathbb{E}\,\|\hat m - m\|_P^2 \le \frac{c}{n^{2/(d+2)}}. \tag{1}
$$
We conclude that the corresponding classification risk satisfies $R(\hat h) - R(h^*) = O(n^{-1/(d+2)})$.

Regression trees and classification trees (also called decision trees) are partition classifiers
where the partition is built recursively. For illustration, suppose there are two covariates,
X1 = age and X2 = blood pressure. Figure 1 shows a classification tree using these variables.

The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as
Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure
is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 2 shows
the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that there is only a single covariate $X$. We
choose a split point $t$ that divides the real line into two sets $A_1 = (-\infty, t]$ and $A_2 = (t, \infty)$.
Let $\overline{Y}_1$ be the mean of the $Y_i$'s in $A_1$ and let $\overline{Y}_2$ be the mean of the $Y_i$'s in $A_2$.

[Figure 1: A simple classification tree. The root splits on Age at 50; the Age < 50 branch splits on Blood Pressure at 100.]

[Figure 2: Partition representation of the classification tree, with Age on the horizontal axis and Blood Pressure on the vertical axis.]

For continuous $Y$ (regression), the split is chosen to minimize the training error. For binary
$Y$ (classification), the split is chosen to minimize a surrogate for the classification error. A
common choice is the impurity defined by $I(t) = \sum_{s=1}^{2} \gamma_s$ where
$$
\gamma_s = 1 - \left[\overline{Y}_s^{\,2} + (1 - \overline{Y}_s)^2\right]. \tag{2}
$$
This particular measure of impurity is known as the Gini index. If a partition element $A_s$
contains all 0's or all 1's, then $\gamma_s = 0$; otherwise $\gamma_s > 0$. We choose the split point $t$ to
minimize the impurity. Other indices of impurity besides the Gini index can be used, such
as the entropy. The reason for using impurity rather than classification error is that impurity
is a smooth function and hence is easy to minimize.
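
Below is a short sketch of the one-covariate split search that minimizes $I(t) = \gamma_1 + \gamma_2$ from (2). Many practical implementations weight the child impurities by their sizes; this sketch follows the unweighted sum used in the text, and the data are illustrative.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - [p^2 + (1 - p)^2] for binary labels y."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def best_split(x, y):
    """Split point t minimizing the summed impurity of A1 = (-inf, t], A2 = (t, inf)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_impurity = None, np.inf
    # Candidate splits: midpoints between consecutive distinct x values.
    for i in range(len(x_sorted) - 1):
        if x_sorted[i] == x_sorted[i + 1]:
            continue
        t = 0.5 * (x_sorted[i] + x_sorted[i + 1])
        impurity = gini(y_sorted[:i + 1]) + gini(y_sorted[i + 1:])
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = (x > 0.3).astype(int)          # labels change at 0.3
print(best_split(x, y))            # recovers a split near 0.3 with impurity 0
```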

Now we continue recursively splitting until some stopping criterion is met. For example, we
might stop when every partition element has fewer than $n_0$ data points, where $n_0$ is some
fixed number. The bottom nodes of the tree are called the leaves. Each leaf has an estimate
$\hat m(x)$, which is the mean of the $Y_i$'s in that leaf. For classification, we take $\hat h(x) = I(\hat m(x) > 1/2)$.
When there are several covariates, we choose whichever covariate and split point lead to the
lowest impurity.

The result is a piecewise constant estimator that can be represented as a tree.

2 Example

The following data are from simulated images of gamma ray events for the Major Atmo-
spheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The
data are from archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The telescope
studies gamma ray bursts, active galactic nuclei and supernova remnants. The goal is to
predict whether an event is a real gamma-ray event or background (a hadronic shower). There
are 10 predictors that are numerical summaries of the images. We randomly selected 400
training points (200 positive and 200 negative) and 1000 test cases (500 positive and 500
negative). The results of various methods are in Table 1. See Figures 3, 4, 5 and 6.
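
Here is a hedged sketch of how the tree fit in this example could be reproduced with scikit-learn. The file name magic04.data, its column layout, the class coding ('g' for gamma, 'h' for hadron) and the stopping rule min_samples_leaf=10 are assumptions about the UCI distribution and a stand-in for the cross-validated pruning used in the notes.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

cols = [f"V{i}" for i in range(1, 11)] + ["cls"]
df = pd.read_csv("magic04.data", header=None, names=cols)   # assumed file name/layout
y = (df["cls"] == "g").astype(int).to_numpy()               # 1 = gamma (signal), 0 = hadron
X = df[cols[:-1]].to_numpy()

rng = np.random.default_rng(0)

def balanced_sample(label, size):
    # Draw `size` indices with the given label, without replacement.
    idx = np.flatnonzero(y == label)
    return rng.choice(idx, size=size, replace=False)

train = np.concatenate([balanced_sample(1, 200), balanced_sample(0, 200)])
rest = np.setdiff1d(np.arange(len(y)), train)
test = np.concatenate([rng.choice(rest[y[rest] == 1], 500, replace=False),
                       rng.choice(rest[y[rest] == 0], 500, replace=False)])

tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0)
tree.fit(X[train], y[train])
print("tree test error:", np.mean(tree.predict(X[test]) != y[test]))
```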

3 Bagging

Trees are useful for their simplicity and interpretability. But the prediction error can be
reduced by combining many trees. A common approach, called bagging, is as follows.

Suppose we draw $B$ bootstrap samples and each time we construct a classifier. This gives
tree classifiers $h_1, \ldots, h_B$. (The same idea applies to regression.)

Method Test Error
Logistic regression 0.23
SVM (Gaussian Kernel) 0.20
Kernel Regression 0.24
Additive Model 0.20
Reduced Additive Model 0.20
11-NN 0.25
Trees 0.20

Table 1: Various methods on the MAGIC data. The reduced additive model is based on
using the three most significant variables from the additive model.
[Figure 3: Estimated functions for the additive model.]

[Figure 4: Test error versus k for the nearest neighbor estimator.]

[Figure 5: Full tree. The splits are on the standardized predictors xtrain.V1–xtrain.V10.]

[Figure 6: Classification tree. The size of the tree was chosen by cross-validation.]

We now classify by combining them by majority vote:
$$
\hat h(x) =
\begin{cases}
1 & \text{if } \frac{1}{B} \sum_{j=1}^{B} h_j(x) \ge \frac{1}{2} \\
0 & \text{otherwise.}
\end{cases}
$$
This is called bagging, which stands for bootstrap aggregation. A variation is sub-bagging,
where we use subsamples instead of bootstrap samples.
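
A minimal sketch of bagged classification trees follows; the base learner, the number of bootstrap samples and the toy data are illustrative. Sub-bagging corresponds to drawing the index sets without replacement and with size smaller than $n$.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_predict(X_train, y_train, X_test, B=100, seed=0):
    """Majority vote over B trees, each fit on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = np.zeros(len(X_test))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes += tree.predict(X_test)
    return (votes / B >= 0.5).astype(int)         # h(x) = 1 if the average vote >= 1/2

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] ** 2 + 0.5 * rng.normal(size=400) > 1).astype(int)
X_new = rng.normal(size=(200, 5))
print(bagged_trees_predict(X, y, X_new)[:10])
```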

To get some intuition about why bagging is useful, consider this example from Buhlmann
and Yu (2002). Suppose that $x \in \mathbb{R}$ and consider the simple decision rule $\hat\theta_n = I(\overline{Y}_n \le x)$.
Let $\mu = \mathbb{E}[Y_i]$ and for simplicity assume that $\mathrm{Var}(Y_i) = 1$. Suppose that $x$ is close to $\mu$
relative to the sample size. We can model this by setting $x \equiv x_n = \mu + c/\sqrt{n}$. Then $\hat\theta_n$
converges to $I(Z \le c)$ where $Z \sim N(0, 1)$. So the limiting mean and variance of $\hat\theta_n$ are
$\Phi(c)$ and $\Phi(c)(1 - \Phi(c))$. Now the bootstrap distribution of $\overline{Y}^*$ (conditional on $Y_1, \ldots, Y_n$)
is approximately $N(\overline{Y}, 1/n)$. That is, $\sqrt{n}(\overline{Y}^* - \overline{Y}) \approx N(0, 1)$. Let $\mathbb{E}^*$ denote the average
with respect to the bootstrap randomness. Then, if $\tilde\theta_n$ is the bagged estimator, we have
$$
\tilde\theta_n = \mathbb{E}^*\left[ I(\overline{Y}^* \le x_n) \right]
= \mathbb{E}^*\left[ I\!\left( \sqrt{n}(\overline{Y}^* - \overline{Y}) \le \sqrt{n}(x_n - \overline{Y}) \right) \right]
= \Phi\!\left( \sqrt{n}(x_n - \overline{Y}) \right) + o(1) = \Phi(c + Z) + o(1)
$$
where $Z \sim N(0, 1)$, and we used the fact that $\overline{Y} \approx N(\mu, 1/n)$.

To summarize, $\hat\theta_n \approx I(Z \le c)$ while $\tilde\theta_n \approx \Phi(c + Z)$, which is a smoothed version of $I(Z \le c)$.

In other words, bagging is a smoothing operator. In particular, suppose we take $c = 0$.
Then $\hat\theta_n$ converges to a Bernoulli with mean 1/2 and variance 1/4. The bagged estimator
converges to $\Phi(Z) \sim \mathrm{Unif}(0, 1)$, which has mean 1/2 and variance 1/12. The reduction in
variance is due to the smoothing effect of bagging.
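
A quick Monte Carlo check of the $c = 0$ case is sketched below; the sample size, the number of bootstrap replicates and the number of repetitions are arbitrary choices.

```python
import numpy as np

# The plain rule I(Ybar <= mu) has limiting mean 1/2 and variance 1/4;
# the bagged (bootstrap-smoothed) rule has limiting variance roughly 1/12.
rng = np.random.default_rng(0)
n, B, reps, mu, c = 200, 200, 1000, 0.0, 0.0
x_n = mu + c / np.sqrt(n)

plain, bagged = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = rng.normal(loc=mu, scale=1.0, size=n)
    plain[r] = float(y.mean() <= x_n)
    # Average the indicator over bootstrap resamples of the data.
    boot_means = rng.choice(y, size=(B, n), replace=True).mean(axis=1)
    bagged[r] = np.mean(boot_means <= x_n)

print("plain  mean/var:", plain.mean(), plain.var())    # approx 1/2 and 1/4
print("bagged mean/var:", bagged.mean(), bagged.var())  # approx 1/2 and 1/12
```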

4 Random Forests

Finally we get to random forests. These are bagged trees except that we also choose random
subsets of features for each tree. The estimator can be written as
$$
\hat m(x) = \frac{1}{M} \sum_{j=1}^{M} \hat m_j(x)
$$
where $\hat m_j$ is a tree estimator based on a subsample (or bootstrap sample) of size $a$ using $p$ randomly
selected features. The trees are usually required to have some number $k$ of observations in
the leaves. There are three tuning parameters: $a$, $p$ and $k$. You could also think of $M$ as a
tuning parameter, but generally we can think of $M$ as tending to $\infty$.

For each tree, we can estimate the prediction error on the unused data. (The tree is built
on a subsample.) Averaging these prediction errors gives an estimate called the out-of-bag
error estimate.
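
The sketch below shows how the tuning parameters map onto scikit-learn's implementation. The correspondence is approximate: scikit-learn draws max_features candidate features at every split rather than once per tree, max_samples is used together with bootstrap sampling, and oob_score_ reports out-of-bag $R^2$ rather than mean squared error. The data and parameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 10))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=1000)

# Rough correspondence with the notation in the notes:
#   a (subsample size)        -> max_samples   (with bootstrap=True)
#   p (features per split)    -> max_features
#   k (observations per leaf) -> min_samples_leaf
#   M (number of trees)       -> n_estimators
forest = RandomForestRegressor(
    n_estimators=500,
    max_samples=0.5,        # a = n/2
    max_features=3,         # p = 3 of d = 10 features
    min_samples_leaf=5,     # k = 5
    bootstrap=True,
    oob_score=True,         # out-of-bag estimate on the unused observations
    random_state=0,
).fit(X, y)

print("OOB score (R^2):", forest.oob_score_)
print("prediction at x = (0.5, ..., 0.5):", forest.predict(np.full((1, 10), 0.5)))
```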

Unfortunately, it is very difficult to develop theory for random forests since the splitting
is done using greedy methods. Much of the theoretical analysis is done using simplified
versions of random forests. For example, the centered forest is defined as follows. Suppose
the data are on $[0, 1]^d$. Choose a random feature and split at the center of the current cell.
Repeat until there are $k$ leaves. This defines one tree. Now we average $M$ such trees.
Breiman (2004) and Biau (2012) proved the following.

Theorem 1 If each feature is selected with probability $1/d$, $k = o(n)$ and $k \to \infty$, then
$$
\mathbb{E}\left[ |\hat m(X) - m(X)|^2 \right] \to 0
$$
as $n \to \infty$.
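
The centered forest is simple enough to implement directly. The sketch below is one way to realize the construction (the leaf to be split is chosen uniformly at random, which is an implementation choice, and the splits never look at the responses); the data are illustrative.

```python
import numpy as np

def centered_tree(X, n_leaves, rng):
    """Grow one centered tree on [0, 1]^d: pick a leaf, pick a random feature,
    and split the leaf at the midpoint of that side. The splits ignore Y."""
    d = X.shape[1]
    # Each leaf: (lower bounds, upper bounds, indices of training points inside).
    leaves = [(np.zeros(d), np.ones(d), np.arange(len(X)))]
    while len(leaves) < n_leaves:
        lo, hi, idx = leaves.pop(rng.integers(len(leaves)))
        j = rng.integers(d)                      # random feature
        mid = 0.5 * (lo[j] + hi[j])              # split at the center of the cell
        left, right = idx[X[idx, j] <= mid], idx[X[idx, j] > mid]
        hi_l, lo_r = hi.copy(), lo.copy()
        hi_l[j], lo_r[j] = mid, mid
        leaves += [(lo, hi_l, left), (lo_r, hi, right)]
    return leaves

def centered_forest_predict(x, X, y, M=200, n_leaves=64, seed=0):
    """Average the leaf means of M independently grown centered trees."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(M):
        for lo, hi, idx in centered_tree(X, n_leaves, rng):
            if np.all(x >= lo) and np.all(x <= hi):
                preds.append(y[idx].mean() if len(idx) else 0.0)
                break
    return float(np.mean(preds))

rng = np.random.default_rng(1)
X = rng.uniform(size=(2000, 2))
y = np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=2000)
x0 = np.array([0.3, 0.7])
print(centered_forest_predict(x0, X, y), "true value:", np.sin(np.pi * 0.3) * 0.7)
```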

Under stronger assumptions we can say more:

Theorem 2 Suppose that $m$ is Lipschitz, that $m$ depends only on a subset $S$ of the
features, and that the probability of selecting $j \in S$ is $(1/|S|)(1 + o(1))$. Then
$$
\mathbb{E}\,|\hat m(X) - m(X)|^2 = O\left( \left(\frac{1}{n}\right)^{\frac{3}{4|S|\log 2 + 3}} \right).
$$

This is better than the usual Lipschitz rate $n^{-2/(d+2)}$ when $|S| \le d/2$. But the condition that
we select relevant variables with high probability is very strong, and proving that it holds
is a research problem.

A significant step forward was made by Scornet, Biau and Vert (2015). Here is their result.

Theorem 3 Suppose that $Y = \sum_j m_j(X(j)) + \epsilon$ where $X \sim \mathrm{Uniform}[0, 1]^d$, $\epsilon \sim N(0, \sigma^2)$
and each $m_j$ is continuous. Assume that the split is chosen using the maximum drop in sums
of squares. Let $t_n$ be the number of leaves on each tree and let $a_n$ be the subsample size. If
$t_n \to \infty$, $a_n \to \infty$ and $t_n (\log a_n)^9 / a_n \to 0$, then
$$
\mathbb{E}\left[ |\hat m(X) - m(X)|^2 \right] \to 0
$$
as $n \to \infty$.

Again, the theorem has strong assumptions but it does allow a greedy split selection. Scornet,
Biau and Vert (2015) provide another interesting result. Suppose that (i) there is a subset
S of relevant features, (ii) p = d, (iii) mj is not constant on any interval for j ∈ S. Then
with high probability, we always split only on relevant variables.

5 Connection to Nearest Neighbors

Lin and Jeon (2006) showed that there is a connection between random forests and $k$-NN
methods. We say that $X_i$ is a layered nearest neighbor (LNN) of $x$ if the hyper-rectangle
defined by $x$ and $X_i$ contains no data points other than $X_i$. Note that if a tree is grown until
each leaf has one point, then $\hat m(x)$ is simply a weighted average of the LNN's. More generally,
Lin and Jeon (2006) call $X_i$ a $k$-potential nearest neighbor ($k$-PNN) if there are fewer than
$k$ samples in the hyper-rectangle defined by $x$ and $X_i$. If we restrict to random forests
whose leaves have $k$ points, then it follows easily that $\hat m(x)$ is some weighted average of the
$k$-PNN's.
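
Here is a brute-force sketch that finds the layered nearest neighbors of a point by checking, for each $X_i$, whether the hyper-rectangle spanned by $x$ and $X_i$ contains any other sample point; the data are illustrative.

```python
import numpy as np

def layered_nearest_neighbors(x, X):
    """Indices i such that the hyper-rectangle spanned by x and X[i]
    contains no other sample point (the LNNs of x)."""
    lnn = []
    for i in range(len(X)):
        lo = np.minimum(x, X[i])
        hi = np.maximum(x, X[i])
        inside = np.all((X >= lo) & (X <= hi), axis=1)
        inside[i] = False                      # ignore X[i] itself
        if not inside.any():
            lnn.append(i)
    return np.array(lnn)

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=500)
x0 = np.array([0.5, 0.5])
lnn = layered_nearest_neighbors(x0, X)
print(len(lnn), y[lnn].mean())   # the number of LNNs grows like (log n)^{d-1}
```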

Let us now return to LNN's. Let $\mathcal{L}_n(x)$ denote the set of LNN's of $x$ and let $L_n(x) = |\mathcal{L}_n(x)|$. We
could directly define
$$
\hat m(x) = \frac{1}{L_n(x)} \sum_i Y_i I(X_i \in \mathcal{L}_n(x)).
$$
Biau and Devroye (2010) showed that, if $X$ has a continuous density,
$$
\frac{(d-1)! \, \mathbb{E}[L_n(x)]}{2^d (\log n)^{d-1}} \to 1.
$$
Moreover, if $Y$ is bounded and $m$ is continuous then, for all $p \ge 1$,
$$
\mathbb{E}\,|\hat m_n(X) - m(X)|^p \to 0
$$
as $n \to \infty$. Unfortunately, the rate of convergence is slow. Suppose that $\mathrm{Var}(Y \mid X = x) = \sigma^2$
is constant. Then
$$
\mathbb{E}\,|\hat m_n(X) - m(X)|^p \ge \frac{\sigma^2}{\mathbb{E}[L_n(x)]} \sim \frac{\sigma^2 (d-1)!}{2^d (\log n)^{d-1}}.
$$

If we use $k$-PNN's, with $k \to \infty$ and $k = o(n)$, then the results of Lin and Jeon (2006) show that
the estimator is consistent and has variance of order $O\!\left(1/(k (\log n)^{d-1})\right)$.

As an aside, Biau and Devroye (2010) also show that if we apply the usual 1-NN rule to
subsamples of size $k$ and then average over subsamples, then, provided $k \to \infty$ and $k = o(n)$,
for all $p \ge 1$ and all distributions $P$ we have $\mathbb{E}\,|\hat m(X) - m(X)|^p \to 0$. So bagged
1-NN is universally consistent. But at this point we have wandered quite far from random
forests.

6 Connection to Kernel Methods

There is also a connection between random forests and kernel methods (Scornet 2016). Let
$A_j(x)$ be the cell containing $x$ in the $j$th tree. Then we can write the forest estimator as
$$
\hat m(x) = \frac{1}{M} \sum_j \sum_i \frac{Y_i I(X_i \in A_j(x))}{N_j(x)} = \frac{1}{M} \sum_j \sum_i W_{ij} Y_i
$$
where $N_j(x)$ is the number of data points in $A_j(x)$ and $W_{ij} = I(X_i \in A_j(x))/N_j(x)$. This
suggests that points in a cell $A_j(x)$ with low density (and hence small $N_j(x)$) get high weight. Based
on this observation, Scornet (2016) defined the kernel-based random forest (KeRF) by
$$
\hat m(x) = \frac{\sum_j \sum_i Y_i I(X_i \in A_j(x))}{\sum_j N_j(x)}.
$$

With this modification, $\hat m(x)$ is an average of the $Y_i$'s, each weighted by how often $X_i$ falls in
the same cell as $x$ across the trees. The KeRF can be written as
$$
\hat m(x) = \frac{\sum_i Y_i K_n(x, X_i)}{\sum_s K_n(x, X_s)}
$$
where
$$
K_n(x, z) = \frac{1}{M} \sum_j I(z \in A_j(x)).
$$

The trees are random, so let us write the $j$th tree as $T_j = T(\Theta_j)$ for some random quantity
$\Theta_j$. The forest is then built from $T(\Theta_1), \ldots, T(\Theta_M)$, and we can write $A_j(x)$ as $A(x, \Theta_j)$.
Then $K_n(x, z)$ converges almost surely (as $M \to \infty$) to $\kappa_n(x, z) = P_\Theta(z \in A(x, \Theta))$, which is
just the probability that $x$ and $z$ are connected, in the sense that they fall in the same cell.
Under some assumptions, Scornet (2016) showed that KeRF's and forests are close to each
other, thus providing a kernel interpretation of forests.
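
The connection kernel $K_n$ can be computed from a fitted scikit-learn forest via apply(), which returns the leaf index of every point in every tree. The sketch below ignores the resampling weights (it counts all training points sharing a cell, whether or not they were in that tree's bootstrap sample), so it is only an approximation to the KeRF defined above; the data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))
y = np.sum(np.sin(np.pi * X), axis=1) + 0.2 * rng.normal(size=1000)

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                               random_state=0).fit(X, y)

# Leaf membership of every training point in every tree: shape (n, M).
train_leaves = forest.apply(X)

def kerf_predict(x):
    """KeRF-style prediction: weight Y_i by how often X_i shares a leaf with x."""
    x_leaves = forest.apply(x.reshape(1, -1))[0]       # leaf of x in each tree
    K = (train_leaves == x_leaves).mean(axis=1)        # K_n(x, X_i)
    return np.sum(y * K) / np.sum(K)

x0 = np.full(5, 0.5)
print("forest:", forest.predict(x0.reshape(1, -1))[0], " KeRF:", kerf_predict(x0))
```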

Recall the centered forest we discussed earlier. This is a stylized forest, quite different
from the forests used in practice, but it provides a nice way to study the properties
of forests. For the centered KeRF, Scornet (2016) shows that if $m$ is Lipschitz and
$X \sim \mathrm{Unif}([0, 1]^d)$ then
$$
\mathbb{E}\left[ (\hat m(x) - m(x))^2 \right] \le C (\log n)^2 \left(\frac{1}{n}\right)^{\frac{1}{3 + d \log 2}}.
$$

This is slower than the minimax rate $n^{-2/(d+2)}$, but this probably reflects the difficulty of
analyzing forests.

7 Variable Importance

Let $\hat m$ be a random forest estimator. How important is feature $X(j)$?

LOCO. One way to answer this question is to fit the forest with all the data and fit it
again without using $X(j)$. When we construct a forest, we randomly select features for each
tree, so this second forest can be obtained by simply averaging the trees where feature $j$ was
not selected. Call this $\hat m^{(-j)}$. Let $H$ be a hold-out sample of size $m$. Then let
$$
\hat\Delta_j = \frac{1}{m} \sum_{i \in H} W_i
$$
where
$$
W_i = (Y_i - \hat m^{(-j)}(X_i))^2 - (Y_i - \hat m(X_i))^2.
$$
Then $\hat\Delta_j$ is a consistent estimate of the inflation in prediction risk that occurs from not having
access to $X(j)$. Formally, if $T$ denotes the training data, then
$$
\mathbb{E}[\hat\Delta_j \mid T] = \mathbb{E}\left[ (Y - \hat m^{(-j)}(X))^2 - (Y - \hat m(X))^2 \,\middle|\, T \right] \equiv \Delta_j.
$$

In fact, since $\hat\Delta_j$ is simply an average, we can easily construct a confidence interval. This
approach is called LOCO (Leave-Out-COvariates). Of course, it is easily extended to sets
of features. The method is explored in Lei, G'Sell, Rinaldo, Tibshirani and Wasserman (2017)
and Rinaldo, Tibshirani and Wasserman (2015).
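
Below is a hold-out sketch of LOCO with a normal confidence interval for $\Delta_j$. For simplicity it refits the forest without feature $j$ rather than averaging the trees that never selected $X(j)$, as the notes describe; the data and forest settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 6))
y = 4 * X[:, 0] + np.sin(2 * np.pi * X[:, 1]) + 0.3 * rng.normal(size=2000)

# Split into training data T and a hold-out sample H.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.5, random_state=0)
full = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

def loco(j):
    """Hold-out LOCO estimate for feature j with a normal 95% interval.
    Refits the forest without feature j (a simplification of the scheme above)."""
    drop = RandomForestRegressor(n_estimators=300, random_state=0).fit(
        np.delete(X_tr, j, axis=1), y_tr)
    W = ((y_ho - drop.predict(np.delete(X_ho, j, axis=1))) ** 2
         - (y_ho - full.predict(X_ho)) ** 2)
    delta = W.mean()
    se = W.std(ddof=1) / np.sqrt(len(W))
    return delta, (delta - 1.96 * se, delta + 1.96 * se)

for j in range(3):
    print(f"feature {j}: Delta_hat and 95% CI =", loco(j))
```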

Permutation Importance. A different approach is to permute the values of $X(j)$ for the
out-of-bag observations, separately for each tree. Let $O_t$ be the out-of-bag observations for
tree $t$ (with $m_t = |O_t|$), and let $O_t^*$ be the same observations with $X(j)$ permuted. Define
$$
\hat\Gamma_j = \frac{1}{M} \sum_{t=1}^{M} W_{tj}
$$
where
$$
W_{tj} = \frac{1}{m_t} \sum_{i \in O_t^*} (Y_i - \hat m_t(X_i))^2 - \frac{1}{m_t} \sum_{i \in O_t} (Y_i - \hat m_t(X_i))^2.
$$
This avoids using a hold-out sample. It is estimating
$$
\Gamma_j = \mathbb{E}[(Y - \hat m(X_j'))^2] - \mathbb{E}[(Y - \hat m(X))^2]
$$
where $X_j'$ has the same distribution as $X$ except that $X_j'(j)$ is an independent draw from the
distribution of $X(j)$. This is a lot like LOCO but its meaning is less clear. Note that the trees
are not refit when $X(j)$ is permuted. Gregorutti, Michel and Saint Pierre (2013) show that if
$(X, \epsilon)$ is Gaussian, $\mathrm{Var}(X) = (1 - c)I + c\,\mathbf{1}\mathbf{1}^T$ and $\mathrm{Cov}(Y, X(j)) = \tau$ for all $j$, then
$$
\Gamma_j = 2 \left( \frac{\tau}{1 - c + dc} \right)^2.
$$
It is not clear how this connects to the actual importance of $X(j)$. In the case where
$Y = \sum_j m_j(X(j)) + \epsilon$ with $\mathbb{E}[\epsilon \mid X] = 0$ and $\mathbb{E}[\epsilon^2 \mid X] < \infty$, they show that $\Gamma_j = 2\,\mathrm{Var}(m_j(X(j)))$.
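
A sketch of permutation importance follows. The notes compute it per tree on the out-of-bag observations; scikit-learn does not publicly expose per-tree out-of-bag indices, so this version permutes a column of a single hold-out set instead (in the spirit of sklearn.inspection.permutation_importance). The data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 6))
y = 4 * X[:, 0] + np.sin(2 * np.pi * X[:, 1]) + 0.3 * rng.normal(size=2000)

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.5, random_state=0)
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
base_mse = np.mean((y_ho - forest.predict(X_ho)) ** 2)

def permutation_importance_holdout(j, n_repeats=20):
    """Increase in hold-out MSE when column j is shuffled (the forest is not refit)."""
    gains = []
    for _ in range(n_repeats):
        X_perm = X_ho.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        gains.append(np.mean((y_ho - forest.predict(X_perm)) ** 2) - base_mse)
    return np.mean(gains)

for j in range(6):
    print(f"Gamma_hat[{j}] = {permutation_importance_holdout(j):.3f}")
```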

8 Inference

Using the theory of infinite order $U$-statistics, Mentch and Hooker (2015) showed that
$\sqrt{n}\,(\hat m(x) - \mathbb{E}[\hat m(x)])/\sigma$ converges to a $N(0, 1)$ distribution, and they show how to estimate $\sigma$.

Wager and Athey (2017) show asymptotic normality if we use sample splitting: part of the
data is used to build the tree and part is used to estimate the averages in the leaves of the
tree. Under a number of technical conditions, including the requirement that we use subsamples
of size $s = n^\beta$ with $\beta < 1$, they show that $(\hat m(x) - m(x))/\sigma_n(x) \rightsquigarrow N(0, 1)$, and they show
how to estimate $\sigma_n(x)$. Specifically,
$$
\hat\sigma_n^2(x) = \frac{n-1}{n} \left( \frac{n}{n-s} \right)^2 \sum_{i=1}^{n} \mathrm{Cov}(\hat m_j(x), N_{ij})^2
$$
where the covariance is taken with respect to the trees in the forest, and $N_{ij} = 1$ if $(X_i, Y_i)$ was in
the $j$th subsample and $N_{ij} = 0$ otherwise.
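
Here is a rough sketch of the variance estimate above for a forest built on subsamples of size $s = n^\beta$, ignoring the honesty/sample-splitting requirement for brevity; $\beta = 0.7$, the tree settings and the data are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.uniform(size=(n, d))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.normal(size=n)
x0 = np.full((1, d), 0.5)

M = 1000
s = int(n ** 0.7)                    # subsample size s = n^beta with beta < 1
preds = np.empty(M)                  # per-tree predictions m_hat_j(x0)
N = np.zeros((n, M))                 # N_ij = 1 if observation i is in subsample j

for j in range(M):
    idx = rng.choice(n, size=s, replace=False)
    N[idx, j] = 1.0
    tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx])
    preds[j] = tree.predict(x0)[0]

# Covariance of m_hat_j(x) with the membership indicators N_ij, over the M trees,
# plugged into the variance formula from the text.
cov = (N - N.mean(axis=1, keepdims=True)) @ (preds - preds.mean()) / M
sigma2 = (n - 1) / n * (n / (n - s)) ** 2 * np.sum(cov ** 2)

print("forest estimate:", preds.mean(), " estimated std. error:", np.sqrt(sigma2))
```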

9 Summary

Random forests are considered one of the best all-purpose classifiers. But it is still a mystery
why they work so well. The situation is very similar to deep learning. We have seen that
there are now many interesting theoretical results about forests. But the results make strong
assumptions that create a gap between practice and theory. Furthermore, there is no theory
to say why forests outperform other methods. The gap between theory and practice is due
to the fact that forests — as actually used in practice — are complex functions of the data.

10 References

Biau, G., Devroye, L. and Lugosi, G. (2008). Consistency of random forests and other averaging
classifiers. Journal of Machine Learning Research (JMLR).

Biau, G. and Scornet, E. (2016). A random forest guided tour. TEST, 25(2), 197-227.

Biau, G. (2012). Analysis of a Random Forests Model. arXiv:1005.0208.

Buhlmann, P., and Yu, B. (2002). Analyzing bagging. Annals of Statistics, 927-961.

Gregorutti, Michel, and Saint Pierre. (2013). Correlation and variable importance in random
forests. arXiv:1310.5726.

Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R.J. and Wasserman, L. (2017). Distribution-free
predictive inference for regression. Journal of the American Statistical Association.

Lin, Y. and Jeon, Y. (2006). Random Forests and Adaptive Nearest Neighbors. Journal of
the American Statistical Association, 101, p 578.

Mentch, L. and Hooker, G. (2015). Ensemble trees and CLTs: Statistical inference for
supervised learning. Journal of Machine Learning Research.

Rinaldo, A., Tibshirani, R. and Wasserman, L. (2015). Uniform asymptotic inference and the
bootstrap after model selection. arXiv:1506.06266.

Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information
Theory, 62(3), 1485-1500.

Wager, S. (2014). Asymptotic Theory for Random Forests. arXiv:1405.0352.

Wager, S. (2015). Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388.

Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects
using random forests. Journal of the American Statistical Association.
