CHP 8.2 Intro To Statistical Learning
Tree-Based Methods
8.2.1 Bagging
The bootstrap, introduced in Chapter 5, is an extremely powerful idea. It is
used in many situations in which it is hard or even impossible to directly
compute the standard deviation of a quantity of interest. We see here that
the bootstrap can be used in a completely different context, in order to
improve statistical learning methods such as decision trees.
The decision trees discussed in Section 8.1 suffer from high variance.
This means that if we split the training data into two parts at random,
and fit a decision tree to both halves, the results that we get could be
quite different. In contrast, a procedure with low variance will yield similar
results if applied repeatedly to distinct data sets; linear regression tends
to have low variance, if the ratio of n to p is moderately large. Bootstrap
aggregation, or bagging, is a general-purpose procedure for reducing the
variance of a statistical learning method; we introduce it here because it is
particularly useful and frequently used in the context of decision trees.
Recall that given a set of n independent observations Z1 , . . . , Zn , each
with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given
by $\sigma^2/n$. In other words, averaging a set of observations reduces variance.
Hence a natural way to reduce the variance and increase the test set ac-
curacy of a statistical learning method is to take many training sets from
the population, build a separate prediction model using each training set,
and average the resulting predictions. In other words, we could calculate
$\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)$ using $B$ separate training sets, and average them
in order to obtain a single low-variance statistical learning model, given by
$$ \hat{f}_{\mathrm{avg}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^b(x). $$
Of course, this is not practical because we generally do not have access to
multiple training sets. Instead, we can bootstrap, by taking repeated samples
from the (single) training data set. In this approach we generate $B$ different
bootstrapped training data sets, train our method on the $b$th bootstrapped
training set in order to get $\hat{f}^{*b}(x)$, and finally average all the predictions,
to obtain
$$ \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x). $$
This is called bagging.
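To make the averaging concrete, here is a minimal Python sketch of bagging regression trees, assuming scikit-learn's DecisionTreeRegressor as the base learner and a small synthetic data set standing in for a real training set; it is an illustration of the formula above, not the book's own code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data standing in for a real training set.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
X_test = rng.normal(size=(50, p))

B = 100                                   # number of bootstrapped training sets
preds = np.zeros((B, X_test.shape[0]))

for b in range(B):
    # Draw the bth bootstrap sample of size n from the training data.
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeRegressor()        # deep, unpruned tree
    tree.fit(X[idx], y[idx])
    preds[b] = tree.predict(X_test)       # f_hat^{*b}(x)

# f_hat_bag(x) = (1/B) * sum over b of f_hat^{*b}(x).
f_bag = preds.mean(axis=0)
```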
FIGURE 8.8. Bagging and random forest results for the Heart data. The test
error (black and orange) is shown as a function of B, the number of bootstrapped
training sets used. Random forests were applied with $m = \sqrt{p}$. The dashed line
indicates the test error resulting from a single classification tree. The green and
blue traces show the OOB error, which in this case is — by chance — considerably
lower.
[Figure 8.9 displays here: a horizontal bar chart of variable importance for the 13 Heart predictors, with Thal, Ca, and ChestPain the most important and Fbs the least.]
FIGURE 8.9. A variable importance plot for the Heart data. Variable impor-
tance is computed using the mean decrease in Gini index, and expressed relative
to the maximum.
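An importance ranking of this kind can be produced from a fitted ensemble; the sketch below assumes scikit-learn, whose feature_importances_ attribute reports the mean decrease in impurity (the Gini-based measure for classification trees), rescaled here so the maximum is 100. The data and predictor names are synthetic stand-ins for the Heart data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Heart data: 13 predictors, binary response.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)
names = [f"X{j}" for j in range(X.shape[1])]      # hypothetical predictor names

# max_features=None makes every split consider all predictors, i.e. bagging.
bag = RandomForestClassifier(n_estimators=500, max_features=None,
                             random_state=0).fit(X, y)

# Mean decrease in Gini index, expressed relative to the maximum (0-100 scale).
imp = 100 * bag.feature_importances_ / bag.feature_importances_.max()
for name, value in sorted(zip(names, imp), key=lambda t: -t[1]):
    print(f"{name:>4s}: {value:6.1f}")
```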
8.2.2 Random Forests
Random forests provide an improvement over bagged trees by way of a small
tweak that decorrelates the trees. As in bagging, we build a number of decision
trees on bootstrapped training samples, but each time a split in a tree is
considered, a random sample of m predictors is chosen as split candidates
from the full set of p predictors, and the split is allowed to use only one of
those m. A fresh sample of m predictors is taken at each split, and typically
we choose m ≈ √p; that is, the number of predictors considered at each split
is approximately equal to the square root of the total number of predictors
(4 out of the 13 for the Heart data).
In other words, in building a random forest, at each split in the tree,
the algorithm is not even allowed to consider a majority of the available
predictors. This may sound crazy, but it has a clever rationale. Suppose
that there is one very strong predictor in the data set, along with a num-
ber of other moderately strong predictors. Then in the collection of bagged
trees, most or all of the trees will use this strong predictor in the top split.
Consequently, all of the bagged trees will look quite similar to each other.
Hence the predictions from the bagged trees will be highly correlated. Un-
fortunately, averaging many highly correlated quantities does not lead to
as large of a reduction in variance as averaging many uncorrelated quan-
tities. In particular, this means that bagging will not lead to a substantial
reduction in variance over a single tree in this setting.
Random forests overcome this problem by forcing each split to consider
only a subset of the predictors. Therefore, on average (p − m)/p of the
splits will not even consider the strong predictor, and so other predictors
will have more of a chance. We can think of this process as decorrelating
the trees, thereby making the average of the resulting trees less variable
and hence more reliable.
The main difference between bagging and random forests is the choice
of predictor subset size m. For instance, if a random forest is built using
m = p, then this amounts simply to bagging. On the Heart data, random
forests using $m = \sqrt{p}$ lead to a reduction in both test error and OOB error
over bagging (Figure 8.8).
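The bagging versus random forest comparison can be sketched with scikit-learn (an assumption about tooling; the Heart data themselves are replaced by a synthetic stand-in with 13 predictors). Setting max_features=None lets every split consider all p predictors, which is bagging, while max_features="sqrt" gives the m = √p rule; oob_score=True reports the out-of-bag error plotted in Figure 8.8.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Heart data: 13 predictors, binary response.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# m = p: every split may consider all predictors, i.e. plain bagging.
bag = RandomForestClassifier(n_estimators=500, max_features=None,
                             oob_score=True, random_state=0).fit(X, y)

# m = sqrt(p): the usual random forest choice for classification.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

print("OOB error, bagging:      ", round(1 - bag.oob_score_, 3))
print("OOB error, random forest:", round(1 - rf.oob_score_, 3))
```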
Using a small value of m in building a random forest will typically be
helpful when we have a large number of correlated predictors. We applied
random forests to a high-dimensional biological data set consisting of ex-
pression measurements of 4,718 genes measured on tissue samples from 349
patients. There are around 20,000 genes in humans, and individual genes
have different levels of activity, or expression, in particular cells, tissues,
and biological conditions. In this data set, each of the patient samples has
a qualitative label with 15 different levels: either normal or 1 of 14 different
types of cancer. Our goal was to use random forests to predict cancer type
based on the 500 genes that have the largest variance in the training set.
We randomly divided the observations into a training and a test set, and
applied random forests to the training set for three different values of the
number of splitting variables m. The results are shown in Figure 8.10. The
error rate of a single tree is 45.7 %, and the null rate is 75.4 %.4 We see that
using 400 trees is sufficient to give good performance, and that the choice
$m = \sqrt{p}$ gave a small improvement in test error over bagging ($m = p$) in
this example. As with bagging, random forests will not overfit if we increase
B, so in practice we use a value of B sufficiently large for the error rate to
have settled down.
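The workflow just described, filtering to the 500 highest-variance genes and comparing several values of m, might look roughly as follows. The expression matrix and class labels below are synthetic placeholders, so the error rates will not reproduce Figure 8.10; the point is the structure of the analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic placeholder: 349 samples, 4,718 "genes", 15 class labels.
n_samples, n_genes = 349, 4718
X_all = rng.normal(size=(n_samples, n_genes))
y = rng.integers(0, 15, size=n_samples)

X_train_all, X_test_all, y_train, y_test = train_test_split(
    X_all, y, random_state=1)

# Keep the 500 genes with the largest variance in the training set.
top = np.argsort(X_train_all.var(axis=0))[-500:]
X_train, X_test = X_train_all[:, top], X_test_all[:, top]

p = X_train.shape[1]
for m in (p, p // 2, int(np.sqrt(p))):            # m = p corresponds to bagging
    rf = RandomForestClassifier(n_estimators=400, max_features=m,
                                random_state=1).fit(X_train, y_train)
    print(f"m = {m:3d}: test error = {1 - rf.score(X_test, y_test):.3f}")
```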
8.2.3 Boosting
We now discuss boosting, yet another approach for improving the predic-
tions resulting from a decision tree. Like bagging, boosting is a general
approach that can be applied to many statistical learning methods for re-
gression or classification. Here we restrict our discussion of boosting to the
context of decision trees.
Recall that bagging involves creating multiple copies of the original train-
ing data set using the bootstrap, fitting a separate decision tree to each
copy, and then combining all of the trees in order to create a single predic-
tive model. Notably, each tree is built on a bootstrap data set, independent
of the other trees. Boosting works in a similar way, except that the trees are
grown sequentially: each tree is grown using information from previously
grown trees. Boosting does not involve bootstrap sampling; instead each
tree is fit on a modified version of the original data set.
4 The null rate results from simply classifying each observation to the dominant class
overall, which in this case is the normal class.
[Figure 8.10 displays here: test error as a function of the number of trees for m = p, m = p/2, and m = √p.]
FIGURE 8.10. Results from random forests for the 15-class gene expression
data set with p = 500 predictors. The test error is displayed as a function of
the number of trees. Each colored line corresponds to a different value of m, the
number of predictors available for splitting at each interior tree node. Random
forests (m < p) lead to a slight improvement over bagging (m = p). A single
classification tree has an error rate of 45.7 %.
Consider first the regression setting. Like bagging, boosting involves com-
bining a large number of decision trees, $\hat{f}^1, \ldots, \hat{f}^B$. Boosting is described
in Algorithm 8.2.
What is the idea behind this procedure? Unlike fitting a single large deci-
sion tree to the data, which amounts to fitting the data hard and potentially
overfitting, the boosting approach instead learns slowly. Given the current
model, we fit a decision tree to the residuals from the model. That is, we
fit a tree using the current residuals, rather than the outcome Y , as the re-
sponse. We then add this new decision tree into the fitted function in order
to update the residuals. Each of these trees can be rather small, with just
a few terminal nodes, determined by the parameter d in the algorithm. By
fitting small trees to the residuals, we slowly improve fˆ in areas where it
does not perform well. The shrinkage parameter λ slows the process down
even further, allowing more and different shaped trees to attack the resid-
uals. In general, statistical learning approaches that learn slowly tend to
perform well. Note that in boosting, unlike in bagging, the construction of
each tree depends strongly on the trees that have already been grown.
We have just described the process of boosting regression trees. Boosting
classification trees proceeds in a similar but slightly more complex way, and
the details are omitted here.
Boosting has three tuning parameters: the number of trees B (unlike bagging
and random forests, boosting can overfit if B is too large), the shrinkage
parameter λ (a small positive number, typically 0.01 or 0.001, that controls
the rate at which boosting learns), and the number of splits d in each tree,
which controls the complexity of the boosted ensemble.
Algorithm 8.2 Boosting for Regression Trees
1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set.
2. For b = 1, 2, . . . , B, repeat:
(a) Fit a tree $\hat{f}^b$ with d splits (d + 1 terminal nodes) to the training data (X, r).
(b) Update $\hat{f}$ by adding in a shrunken version of the new tree:
$$ \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x). $$
(c) Update the residuals,
$$ r_i \leftarrow r_i - \lambda \hat{f}^b(x_i). $$
3. Output the boosted model,
$$ \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x). $$
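A from-scratch Python sketch of Algorithm 8.2, assuming scikit-learn's DecisionTreeRegressor for the small trees; lam and d play the roles of the shrinkage parameter λ and the number of splits d.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, B=1000, lam=0.01, d=1):
    """Sketch of Algorithm 8.2: boosting for regression trees."""
    trees = []
    r = np.asarray(y, dtype=float).copy()      # 1. f_hat(x) = 0, r_i = y_i
    for _ in range(B):                         # 2. for b = 1, ..., B
        # (a) fit a tree with d splits (d + 1 terminal nodes) to (X, r)
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)
        tree.fit(X, r)
        # (c) update the residuals with a shrunken version of the new tree
        r -= lam * tree.predict(X)
        trees.append(tree)                     # (b) the model is the list of trees
    return trees

def boost_predict(trees, X, lam=0.01):
    # 3. f_hat(x) = sum over b of lam * f_hat^b(x)
    return lam * sum(tree.predict(X) for tree in trees)

# Hypothetical usage:
# trees = boost_regression_trees(X_train, y_train, B=1000, lam=0.01, d=1)
# y_hat = boost_predict(trees, X_test, lam=0.01)
```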
FIGURE 8.11. Results from performing boosting and random forests on the
15-class gene expression data set in order to predict cancer versus normal. The test
error is displayed as a function of the number of trees. For the two boosted models,
λ = 0.01. Depth-1 trees slightly outperform depth-2 trees, and both outperform
the random forest, although the standard errors are around 0.02, making none of
these differences significant. The test error rate for a single tree is 24 %.
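For classification, one would typically rely on a library implementation; the sketch below uses scikit-learn's GradientBoostingClassifier and RandomForestClassifier on a synthetic multi-class data set as a stand-in, mirroring the settings in Figure 8.11 (λ = 0.01, interaction depth 1 or 2). This is an assumption about tooling, not the code behind the figure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small synthetic multi-class data set standing in for the gene expression data.
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "boosting, depth 1, lambda = 0.01": GradientBoostingClassifier(
        n_estimators=500, learning_rate=0.01, max_depth=1, random_state=0),
    "boosting, depth 2, lambda = 0.01": GradientBoostingClassifier(
        n_estimators=500, learning_rate=0.01, max_depth=2, random_state=0),
    "random forest, m = sqrt(p)": RandomForestClassifier(
        n_estimators=500, max_features="sqrt", random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test error = {1 - model.score(X_te, y_te):.3f}")
```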
8.2.4 Bayesian Additive Regression Trees
FIGURE 8.12. A schematic of perturbed trees from the BART algorithm. (a):
The kth tree at the (b−1)st iteration, $\hat{f}_k^{b-1}(X)$, is displayed. Panels (b)–(d)
display three of many possibilities for $\hat{f}_k^b(X)$, given the form of $\hat{f}_k^{b-1}(X)$. (b): One
possibility is that $\hat{f}_k^b(X)$ has the same structure as $\hat{f}_k^{b-1}(X)$, but with different
predictions at the terminal nodes. (c): Another possibility is that $\hat{f}_k^b(X)$ results
from pruning $\hat{f}_k^{b-1}(X)$. (d): Alternatively, $\hat{f}_k^b(X)$ may have more terminal nodes
than $\hat{f}_k^{b-1}(X)$.
We let $\hat{f}_k^b(x)$ denote the prediction at $x$ for the $k$th regression tree used in
the $b$th iteration. At the end of each iteration, the $K$ trees from that iteration
will be summed, i.e. $\hat{f}^b(x) = \sum_{k=1}^{K} \hat{f}_k^b(x)$ for $b = 1, \ldots, B$.
In the first iteration of the BART algorithm, all trees are initialized to
have a single root node, with $\hat{f}_k^1(x) = \frac{1}{nK}\sum_{i=1}^{n} y_i$, the mean of the response
values divided by the total number of trees. Thus, $\hat{f}^1(x) = \sum_{k=1}^{K} \hat{f}_k^1(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$.
In subsequent iterations, BART updates each of the K trees, one at a
time. In the bth iteration, to update the kth tree, we subtract from each
response value the predictions from all but the kth tree, in order to obtain
a partial residual
$$ r_i = y_i - \sum_{k' < k} \hat{f}_{k'}^{b}(x_i) - \sum_{k' > k} \hat{f}_{k'}^{b-1}(x_i) $$
for the ith observation, i = 1, . . . , n. Rather than fitting a fresh tree to this
partial residual, BART randomly chooses a perturbation to the tree from
the previous iteration, $\hat{f}_k^{b-1}$, from a set of possible perturbations, favoring
ones that improve the fit to the partial residual. There are two components
to this perturbation: we may change the structure of the tree, by adding or
pruning branches, or we may change the predictions in the terminal nodes
of the tree. Figure 8.12 illustrates examples of possible perturbations to a tree.
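A deliberately simplified Python sketch of this backfitting update: it maintains K per-tree prediction vectors, forms the partial residual for the kth tree by subtracting the other trees' predictions, and then, as a simplification, refits a small tree to that residual rather than drawing a random perturbation as BART actually does.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Synthetic stand-in for (x_i, y_i), i = 1, ..., n.
n, p, K = 200, 5, 20
X = rng.normal(size=(n, p))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=n)

# State from iteration b - 1: K trees, each starting at the BART initialization
# (a single root node predicting the mean response divided by K).
tree_preds = np.full((K, n), y.mean() / K)

# One pass of the bth iteration: update the K trees one at a time.
for k in range(K):
    # Partial residual r_i: subtract the predictions of all trees except the kth.
    r = y - (tree_preds.sum(axis=0) - tree_preds[k])
    # Simplification: refit a small tree to r; real BART instead randomly
    # perturbs the kth tree from iteration b - 1, favoring better-fitting moves.
    tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X, r)
    tree_preds[k] = tree.predict(X)

# f_hat^b(x) = sum over k of f_hat_k^b(x).
f_b = tree_preds.sum(axis=0)
```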
We typically throw away the first few of these prediction models, since
models obtained in the earlier iterations — known as the burn-in period
— tend not to provide very good results. We can let L denote the num-
ber of burn-in iterations; for instance, we might take L = 200. Then, to
obtain a single prediction, we simply take the average after the burn-in
iterations, $\hat{f}(x) = \frac{1}{B-L}\sum_{b=L+1}^{B} \hat{f}^b(x)$. However, it is also possible to com-
pute quantities other than the average: for instance, the percentiles of
fˆL+1 (x), . . . , fˆB (x) provide a measure of uncertainty in the final predic-
tion. The overall BART procedure is summarized in Algorithm 8.3.
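The burn-in averaging and the percentile-based uncertainty measure amount to simple array operations once the per-iteration predictions have been stored; a small sketch with placeholder values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder for the B per-iteration predictions f_hat^1(x), ..., f_hat^B(x)
# at a single point x; in practice these come from the BART iterations.
B, L = 1000, 200
preds = rng.normal(loc=5.0, scale=0.4, size=B)

post_burn_in = preds[L:]                      # iterations L + 1, ..., B

point_estimate = post_burn_in.mean()          # f_hat(x): average after burn-in
lower, upper = np.percentile(post_burn_in, [2.5, 97.5])  # uncertainty measure

print(f"prediction: {point_estimate:.2f}, 95% interval: ({lower:.2f}, {upper:.2f})")
```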
A key element of the BART approach is that in Step 3(a)ii., we do not fit
a fresh tree to the current partial residual: instead, we try to improve the fit
to the current partial residual by slightly modifying the tree obtained in the
previous iteration (see Figure 8.12). Roughly speaking, this guards against
overfitting since it limits how “hard” we fit the data in each iteration.
Furthermore, the individual trees are typically quite small. We limit the
tree size in order to avoid overfitting the data, which would be more likely
to occur if we grew very large trees.
Figure 8.13 shows the result of applying BART to the Heart data, using
K = 200 trees, as the number of iterations is increased to 10,000. During
the initial iterations, the test and training errors jump around a bit. After
this initial burn-in period, the error rates settle down. We note that there
is only a small difference between the training error and the test error,
indicating that the tree perturbation process largely avoids overfitting.
The training and test errors for boosting are also displayed in Figure 8.13.
We see that the test error for boosting approaches that of BART, but then
begins to increase as the number of iterations increases. Furthermore, the
training error for boosting decreases as the number of iterations increases,
indicating that boosting has overfit the data.
Though the details are outside of the scope of this book, it turns out
that the BART method can be viewed as a Bayesian approach to fitting an
ensemble of trees: each time we randomly perturb a tree in order to fit the
residuals, we are in fact drawing a new tree from a posterior distribution.
(Of course, this Bayesian connection is the motivation for BART’s name.)
Furthermore, Algorithm 8.3 can be viewed as a Markov chain Monte Carlo
algorithm for fitting the BART model.
Algorithm 8.3 Bayesian Additive Regression Trees
1. Let $\hat{f}_1^1(x) = \hat{f}_2^1(x) = \cdots = \hat{f}_K^1(x) = \frac{1}{nK}\sum_{i=1}^{n} y_i$.
2. Compute $\hat{f}^1(x) = \sum_{k=1}^{K} \hat{f}_k^1(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$.
3. For b = 2, . . . , B:
(a) For k = 1, 2, . . . , K:
i. For i = 1, . . . , n, compute the current partial residual
$$ r_i = y_i - \sum_{k' < k} \hat{f}_{k'}^{b}(x_i) - \sum_{k' > k} \hat{f}_{k'}^{b-1}(x_i). $$
ii. Fit a new tree, $\hat{f}_k^b(x)$, to $r_i$, by randomly perturbing the $k$th tree from the previous iteration, $\hat{f}_k^{b-1}(x)$; perturbations that improve the fit are favored.
(b) Compute $\hat{f}^b(x) = \sum_{k=1}^{K} \hat{f}_k^b(x)$.
4. Compute the mean after $L$ burn-in samples, $\hat{f}(x) = \frac{1}{B-L}\sum_{b=L+1}^{B} \hat{f}^b(x)$.
When we apply BART, we must select the number of trees K, the number
of iterations B, and the number of burn-in iterations L. We typically choose
large values for B and K, and a moderate value for L: for instance, K = 200,
B = 1,000, and L = 100 is a reasonable choice. BART has been shown to
have very impressive out-of-the-box performance — that is, it performs well
with minimal tuning.
FIGURE 8.13. BART and boosting results for the Heart data. Both training
and test errors are displayed. After a burn-in period of 100 iterations (shown in
gray), the error rates for BART settle down. Boosting begins to overfit after a
few hundred iterations.