CHP 8.2 Intro To Statistical Learning

Tree-based methods, such as decision trees, often lack predictive accuracy and robustness, but can be improved through ensemble techniques like bagging, random forests, and boosting. Bagging reduces variance by averaging predictions from multiple bootstrapped trees, while random forests enhance this by decorrelating trees through random selection of predictors at each split. Boosting, on the other hand, builds trees sequentially, focusing on correcting errors from previous trees, leading to improved predictions over single tree models.



▲ Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.

▲ Additionally, trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree.
However, by aggregating many decision trees, using methods like bagging,
random forests, and boosting, the predictive performance of trees can be
substantially improved. We introduce these concepts in the next section.

8.2 Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees
An ensemble method is an approach that combines many simple “building block” models in order to obtain a single and potentially very powerful model. These simple building block models are sometimes known as weak learners, since they may lead to mediocre predictions on their own.
We will now discuss bagging, random forests, boosting, and Bayesian additive regression trees. These are ensemble methods for which the simple building block is a regression or a classification tree.

8.2.1 Bagging
The bootstrap, introduced in Chapter 5, is an extremely powerful idea. It is
used in many situations in which it is hard or even impossible to directly
compute the standard deviation of a quantity of interest. We see here that
the bootstrap can be used in a completely different context, in order to
improve statistical learning methods such as decision trees.
The decision trees discussed in Section 8.1 suffer from high variance.
This means that if we split the training data into two parts at random,
and fit a decision tree to both halves, the results that we get could be
quite different. In contrast, a procedure with low variance will yield similar
results if applied repeatedly to distinct data sets; linear regression tends
to have low variance, if the ratio of n to p is moderately large. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.
Recall that given a set of n independent observations Z1, . . . , Zn, each with variance σ², the variance of the mean Z̄ of the observations is given by σ²/n. In other words, averaging a set of observations reduces variance. Hence a natural way to reduce the variance and increase the test set accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. In other words, we could calculate $\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)$ using B separate training sets, and average them in order to obtain a single low-variance statistical learning model, given by

$$\hat{f}_{\mathrm{avg}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{b}(x).$$

Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set. In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get $\hat{f}^{*b}(x)$, and finally average all the predictions, to obtain

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x).$$

This is called bagging.


While bagging can improve predictions for many regression methods,
it is particularly useful for decision trees. To apply bagging to regression
trees, we simply construct B regression trees using B bootstrapped training
sets, and average the resulting predictions. These trees are grown deep,
and are not pruned. Hence each individual tree has high variance, but
low bias. Averaging these B trees reduces the variance. Bagging has been
demonstrated to give impressive improvements in accuracy by combining
together hundreds or even thousands of trees into a single procedure.
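To make the procedure concrete, here is a minimal Python sketch of bagging regression trees, assuming scikit-learn and NumPy are available; the synthetic data and the helper names (bagged_trees, predict_bagged) are our own illustration, not code from the book.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic training data standing in for a real regression problem.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

def bagged_trees(X, y, B=100):
    """Fit B deep (unpruned) regression trees, each to a bootstrap sample."""
    n = X.shape[0]
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # sample n observations with replacement
        tree = DecisionTreeRegressor()        # grown deep: low bias, high variance
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X_new):
    """The bagging estimate: average the B individual tree predictions."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)

trees = bagged_trees(X, y, B=100)
print(predict_bagged(trees, np.array([[0.0], [1.5]])))
```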
Thus far, we have described the bagging procedure in the regression
context, to predict a quantitative outcome Y . How can bagging be extended
to a classification problem where Y is qualitative? In that situation, there
are a few possible approaches, but the simplest is as follows. For a given test
observation, we can record the class predicted by each of the B trees, and
take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
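For a qualitative response, the averaging step in the sketch above would be replaced by a vote; a minimal sketch of that step (assuming the trees are now classification trees and the class labels are coded as the integers 0, 1, . . .) might look like this.

```python
import numpy as np

def predict_majority_vote(trees, X_new):
    """Each bagged classification tree votes; return the most common class per observation."""
    votes = np.stack([t.predict(X_new) for t in trees])   # shape (B, n_new)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```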
Figure 8.8 shows the results from bagging trees on the Heart data. The
test error rate is shown as a function of B, the number of trees constructed
using bootstrapped training data sets. We see that the bagging test error
rate is slightly lower in this case than the test error rate obtained from a
single tree. The number of trees B is not a critical parameter with bagging;
using a very large value of B will not lead to overfitting. In practice we
use a value of B sufficiently large that the error has settled down. Using
B = 100 is sufficient to achieve good performance in this example.

[Figure 8.8 appears here: Error versus Number of Trees (0 to 300) for Test: Bagging, Test: RandomForest, OOB: Bagging, and OOB: RandomForest.]

FIGURE 8.8. Bagging and random forest results for the Heart data. The test
error (black and orange) is shown as a function of B, the number of bootstrapped

training sets used. Random forests were applied with m = √p. The dashed line
indicates the test error resulting from a single classification tree. The green and
blue traces show the OOB error, which in this case is — by chance — considerably
lower.

Out-of-Bag Error Estimation


It turns out that there is a very straightforward way to estimate the test
error of a bagged model, without the need to perform cross-validation or
the validation set approach. Recall that the key to bagging is that trees are
repeatedly fit to bootstrapped subsets of the observations. One can show
that on average, each bagged tree makes use of around two-thirds of the
observations.³ The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the ith observation using each of the trees in
which that observation was OOB. This will yield around B/3 predictions
for the ith observation. In order to obtain a single prediction for the ith
observation, we can average these predicted responses (if regression is the
goal) or can take a majority vote (if classification is the goal). This leads
to a single OOB prediction for the ith observation. An OOB prediction
can be obtained in this way for each of the n observations, from which the overall OOB MSE (for a regression problem) or classification error (for a classification problem) can be computed. The resulting OOB error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation. Figure 8.8 displays the OOB error on the Heart data. It can be shown that with B sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error. The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets for which cross-validation would be computationally onerous.

³ This relates to Exercise 2 of Chapter 5.
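As an illustration of the idea (not the book's code), the following sketch estimates the OOB classification error by recording, for each observation, the votes of only those trees whose bootstrap sample did not contain it; integer-coded class labels are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, B=200, seed=0):
    """Out-of-bag error estimate for bagged classification trees."""
    rng = np.random.default_rng(seed)
    n, n_classes = X.shape[0], int(y.max()) + 1
    votes = np.zeros((n, n_classes))              # OOB votes for each observation
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (roughly 2/3 of the rows)
        oob = np.setdiff1d(np.arange(n), idx)     # rows left out of this sample
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes[oob, tree.predict(X[oob])] += 1     # record each tree's OOB vote
    voted = votes.sum(axis=1) > 0                 # rows with at least one OOB prediction
    oob_pred = votes[voted].argmax(axis=1)        # majority vote over the OOB trees
    return np.mean(oob_pred != y[voted])
```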

Variable Importance Measures


As we have discussed, bagging typically results in improved accuracy over
prediction using a single tree. Unfortunately, however, it can be difficult to
interpret the resulting model. Recall that one of the advantages of decision
trees is the attractive and easily interpreted diagram that results, such as
the one displayed in Figure 8.1. However, when we bag a large number of
trees, it is no longer possible to represent the resulting statistical learning
procedure using a single tree, and it is no longer clear which variables
are most important to the procedure. Thus, bagging improves prediction
accuracy at the expense of interpretability.
Although the collection of bagged trees is much more difficult to interpret
than a single tree, one can obtain an overall summary of the importance of
each predictor using the RSS (for bagging regression trees) or the Gini index
(for bagging classification trees). In the case of bagging regression trees, we
can record the total amount that the RSS (8.1) is decreased due to splits
over a given predictor, averaged over all B trees. A large value indicates
an important predictor. Similarly, in the context of bagging classification
trees, we can add up the total amount that the Gini index (8.6) is decreased
by splits over a given predictor, averaged over all B trees.
A graphical representation of the variable importances in the Heart data is shown in Figure 8.9. We see the mean decrease in Gini index for each variable, relative to the largest. The variables with the largest mean decrease in Gini index are Thal, Ca, and ChestPain.
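In scikit-learn, each fitted tree exposes a feature_importances_ attribute based on the total decrease in node impurity, so an importance measure in the spirit of the text can be obtained by averaging it over the bagged trees. The sketch below uses synthetic stand-in data, and note that scikit-learn normalizes each tree's importances rather than reporting raw RSS or Gini decreases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data; replace with a real design matrix and labels.
X, y = make_classification(n_samples=300, n_features=13, n_informative=5, random_state=0)
feature_names = [f"X{j}" for j in range(X.shape[1])]

# BaggingClassifier uses a decision tree as its default base learner.
bag = BaggingClassifier(n_estimators=500, random_state=0).fit(X, y)

# Average each tree's impurity-based importances, then rescale so the largest is 100,
# as in the variable importance plot of Figure 8.9.
imp = np.mean([t.feature_importances_ for t in bag.estimators_], axis=0)
imp = 100 * imp / imp.max()
for j in np.argsort(imp)[::-1]:
    print(f"{feature_names[j]}: {imp[j]:.1f}")
```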

8.2.2 Random Forests


Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p—that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (4 out of the 13 for the Heart data).

[Figure 8.9 appears here: a horizontal bar chart of Variable Importance (0 to 100) for Fbs, RestECG, ExAng, Sex, Slope, Chol, Age, RestBP, MaxHR, Oldpeak, ChestPain, Ca, and Thal.]

FIGURE 8.9. A variable importance plot for the Heart data. Variable importance is computed using the mean decrease in Gini index, and expressed relative to the maximum.
In other words, in building a random forest, at each split in the tree,
the algorithm is not even allowed to consider a majority of the available
predictors. This may sound crazy, but it has a clever rationale. Suppose
that there is one very strong predictor in the data set, along with a num-
ber of other moderately strong predictors. Then in the collection of bagged
trees, most or all of the trees will use this strong predictor in the top split.
Consequently, all of the bagged trees will look quite similar to each other.
Hence the predictions from the bagged trees will be highly correlated. Un-
fortunately, averaging many highly correlated quantities does not lead to
as large of a reduction in variance as averaging many uncorrelated quan-
tities. In particular, this means that bagging will not lead to a substantial
reduction in variance over a single tree in this setting.
Random forests overcome this problem by forcing each split to consider
only a subset of the predictors. Therefore, on average (p − m)/p of the
splits will not even consider the strong predictor, and so other predictors
will have more of a chance. We can think of this process as decorrelating
the trees, thereby making the average of the resulting trees less variable
and hence more reliable.
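For instance, with the Heart data values quoted above (p = 13 predictors and m = 4), roughly

$$\frac{p - m}{p} = \frac{13 - 4}{13} \approx 0.69,$$

so about 69% of the splits will not even consider the strong predictor.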

The main difference between bagging and random forests is the choice
of predictor subset size m. For instance, if a random forest is built using
m = p, then this amounts simply to bagging. On the Heart data, random forests using m = √p leads to a reduction in both test error and OOB error
over bagging (Figure 8.8).
Using a small value of m in building a random forest will typically be
helpful when we have a large number of correlated predictors. We applied
random forests to a high-dimensional biological data set consisting of ex-
pression measurements of 4,718 genes measured on tissue samples from 349
patients. There are around 20,000 genes in humans, and individual genes
have different levels of activity, or expression, in particular cells, tissues,
and biological conditions. In this data set, each of the patient samples has
a qualitative label with 15 different levels: either normal or 1 of 14 different
types of cancer. Our goal was to use random forests to predict cancer type
based on the 500 genes that have the largest variance in the training set.
We randomly divided the observations into a training and a test set, and
applied random forests to the training set for three different values of the
number of splitting variables m. The results are shown in Figure 8.10. The
error rate of a single tree is 45.7 %, and the null rate is 75.4 %.4 We see that
using 400 trees is sufficient to give good performance, and that the choice m = √p gave a small improvement in test error over bagging (m = p) in
this example. As with bagging, random forests will not overfit if we increase
B, so in practice we use a value of B sufficiently large for the error rate to
have settled down.
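For illustration only, a comparison of this kind could be mimicked in Python with scikit-learn, where max_features="sqrt" corresponds to m = √p and max_features=None to bagging (m = p); the synthetic data below merely stands in for the gene expression data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the p = 500 gene expression predictors.
X, y = make_classification(n_samples=349, n_features=500, n_informative=20,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: m = sqrt(p) predictors considered at each split.
rf = RandomForestClassifier(n_estimators=400, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X_train, y_train)

# Bagging corresponds to m = p: every predictor is considered at every split.
bag = RandomForestClassifier(n_estimators=400, max_features=None,
                             random_state=0).fit(X_train, y_train)

print("Random forest test error:", 1 - accuracy_score(y_test, rf.predict(X_test)))
print("Bagging test error:      ", 1 - accuracy_score(y_test, bag.predict(X_test)))
print("Random forest OOB error: ", 1 - rf.oob_score_)
```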

8.2.3 Boosting
We now discuss boosting, yet another approach for improving the predictions resulting from a decision tree. Like bagging, boosting is a general
approach that can be applied to many statistical learning methods for re-
gression or classification. Here we restrict our discussion of boosting to the
context of decision trees.
Recall that bagging involves creating multiple copies of the original train-
ing data set using the bootstrap, fitting a separate decision tree to each
copy, and then combining all of the trees in order to create a single predic-
tive model. Notably, each tree is built on a bootstrap data set, independent
of the other trees. Boosting works in a similar way, except that the trees are
grown sequentially: each tree is grown using information from previously
grown trees. Boosting does not involve bootstrap sampling; instead each
tree is fit on a modified version of the original data set.

4 The null rate results from simply classifying each observation to the dominant class
overall, which is in this case the normal class.

[Figure 8.10 appears here: test classification error versus number of trees (up to 500), with one curve for each of m = p, m = p/2, and m = √p.]

FIGURE 8.10. Results from random forests for the 15-class gene expression
data set with p = 500 predictors. The test error is displayed as a function of
the number of trees. Each colored line corresponds to a different value of m, the
number of predictors available for splitting at each interior tree node. Random
forests (m < p) lead to a slight improvement over bagging (m = p). A single
classification tree has an error rate of 45.7 %.

Consider first the regression setting. Like bagging, boosting involves com-
bining a large number of decision trees, fˆ1 , . . . , fˆB . Boosting is described
in Algorithm 8.2.
What is the idea behind this procedure? Unlike fitting a single large deci-
sion tree to the data, which amounts to fitting the data hard and potentially
overfitting, the boosting approach instead learns slowly. Given the current
model, we fit a decision tree to the residuals from the model. That is, we
fit a tree using the current residuals, rather than the outcome Y , as the re-
sponse. We then add this new decision tree into the fitted function in order
to update the residuals. Each of these trees can be rather small, with just
a few terminal nodes, determined by the parameter d in the algorithm. By
fitting small trees to the residuals, we slowly improve fˆ in areas where it
does not perform well. The shrinkage parameter ϵ slows the process down
even further, allowing more and different shaped trees to attack the resid-
uals. In general, statistical learning approaches that learn slowly tend to
perform well. Note that in boosting, unlike in bagging, the construction of
each tree depends strongly on the trees that have already been grown.
We have just described the process of boosting regression trees. Boosting
classification trees proceeds in a similar but slightly more complex way, and
the details are omitted here.

Algorithm 8.2 Boosting for Regression Trees

1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all i in the training set.

2. For b = 1, 2, . . . , B, repeat:

   (a) Fit a tree $\hat{f}^b$ with d splits (d + 1 terminal nodes) to the training data (X, r).

   (b) Update $\hat{f}$ by adding in a shrunken version of the new tree:
       $$\hat{f}(x) \leftarrow \hat{f}(x) + \epsilon \hat{f}^b(x). \qquad (8.10)$$

   (c) Update the residuals,
       $$r_i \leftarrow r_i - \epsilon \hat{f}^b(x_i). \qquad (8.11)$$

3. Output the boosted model,
   $$\hat{f}(x) = \sum_{b=1}^{B} \epsilon \hat{f}^b(x). \qquad (8.12)$$

Boosting has three tuning parameters:

1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.

2. The shrinkage parameter ϵ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small ϵ can require using a very large value of B in order to achieve good performance.

3. The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split. In this case, the boosted ensemble is fitting an additive model, since each term involves only a single variable. More generally d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables.
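Algorithm 8.2 is short enough to sketch directly in Python; the following is our illustration (not the book's code), with B, the shrinkage ϵ, and the number of splits d exposed as the three tuning parameters, and scikit-learn's DecisionTreeRegressor used for the individual trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, B=1000, eps=0.01, d=1):
    """Algorithm 8.2: sequentially fit small trees to the current residuals."""
    r = y.astype(float).copy()                # step 1: r_i = y_i (and f_hat = 0)
    trees = []
    for _ in range(B):                        # step 2
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)   # d splits => d + 1 terminal nodes
        tree.fit(X, r)                        # (a) fit a tree to (X, r)
        r -= eps * tree.predict(X)            # (b)-(c) add a shrunken copy, update residuals
        trees.append(tree)
    return trees

def predict_boosted(trees, X_new, eps=0.01):
    """Step 3: the boosted model is the sum of the shrunken trees (equation 8.12)."""
    return eps * sum(t.predict(X_new) for t in trees)

# Toy usage with synthetic data; a stump ensemble (d = 1) fits an additive model.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)
trees = boost_regression_trees(X, y, B=500, eps=0.01, d=1)
print(predict_boosted(trees, X[:5], eps=0.01))
```

In practice a library implementation such as scikit-learn's GradientBoostingRegressor would normally be used instead; the loop above is only meant to mirror the algorithm's steps.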
In Figure 8.11, we applied boosting to the 15-class cancer gene expression
data set, in order to develop a classifier that can distinguish the normal
class from the 14 cancer classes. We display the test error as a function of
the total number of trees and the interaction depth d. We see that simple
stumps with an interaction depth of one perform well if enough of them
are included. This model outperforms the depth-two model, and both out-
perform a random forest. This highlights one difference between boosting

[Figure 8.11 appears here: test classification error versus number of trees (up to 5,000) for Boosting: depth=1, Boosting: depth=2, and RandomForest: m = √p.]

FIGURE 8.11. Results from performing boosting and random forests on the
15-class gene expression data set in order to predict cancer versus normal. The test
error is displayed as a function of the number of trees. For the two boosted models,
ϵ = 0.01. Depth-1 trees slightly outperform depth-2 trees, and both outperform
the random forest, although the standard errors are around 0.02, making none of
these differences significant. The test error rate for a single tree is 24 %.

and random forests: in boosting, because the growth of a particular tree takes into account the other trees that have already been grown, smaller
trees are typically sufficient. Using smaller trees can aid in interpretability
as well; for instance, using stumps leads to an additive model.

8.2.4 Bayesian Additive Regression Trees


Finally, we discuss Bayesian additive regression trees (BART), another ensemble method that uses decision trees as its building blocks. For simplicity, we present BART for regression (as opposed to classification).
Recall that bagging and random forests make predictions from an average of regression trees, each of which is built using a random sample of data
and/or predictors. Each tree is built separately from the others. By con-
trast, boosting uses a weighted sum of trees, each of which is constructed
by fitting a tree to the residual of the current fit. Thus, each new tree at-
tempts to capture signal that is not yet accounted for by the current set
of trees. BART is related to both approaches: each tree is constructed in
a random manner as in bagging and random forests, and each tree tries to
capture signal not yet accounted for by the current model, as in boosting.
The main novelty in BART is the way in which new trees are generated.
Before we introduce the BART algorithm, we define some notation. We
let K denote the number of regression trees, and B the number of iterations
for which the BART algorithm will be run. The notation $\hat{f}_k^b(x)$ represents the prediction at x for the kth regression tree used in the bth iteration. At the end of each iteration, the K trees from that iteration will be summed, i.e. $\hat{f}^b(x) = \sum_{k=1}^{K} \hat{f}_k^b(x)$ for b = 1, . . . , B.

In the first iteration of the BART algorithm, all trees are initialized to have a single root node, with $\hat{f}_k^1(x) = \frac{1}{nK}\sum_{i=1}^{n} y_i$, the mean of the response values divided by the total number of trees. Thus, $\hat{f}^1(x) = \sum_{k=1}^{K} \hat{f}_k^1(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$.

[Figure 8.12 appears here: four tree diagrams, panels (a)–(d), showing the kth tree at the (b − 1)st iteration and three possible perturbed versions of it.]

FIGURE 8.12. A schematic of perturbed trees from the BART algorithm. (a): The kth tree at the (b − 1)st iteration, $\hat{f}_k^{b-1}(X)$, is displayed. Panels (b)–(d) display three of many possibilities for $\hat{f}_k^b(X)$, given the form of $\hat{f}_k^{b-1}(X)$. (b): One possibility is that $\hat{f}_k^b(X)$ has the same structure as $\hat{f}_k^{b-1}(X)$, but with different predictions at the terminal nodes. (c): Another possibility is that $\hat{f}_k^b(X)$ results from pruning $\hat{f}_k^{b-1}(X)$. (d): Alternatively, $\hat{f}_k^b(X)$ may have more terminal nodes than $\hat{f}_k^{b-1}(X)$.
In subsequent iterations, BART updates each of the K trees, one at a
time. In the bth iteration, to update the kth tree, we subtract from each
response value the predictions from all but the kth tree, in order to obtain
a partial residual

$$r_i = y_i - \sum_{k' < k} \hat{f}_{k'}^{b}(x_i) - \sum_{k' > k} \hat{f}_{k'}^{b-1}(x_i)$$

for the ith observation, i = 1, . . . , n. Rather than fitting a fresh tree to this partial residual, BART randomly chooses a perturbation to the tree from the previous iteration ($\hat{f}_k^{b-1}$) from a set of possible perturbations, favoring
ones that improve the fit to the partial residual. There are two components
to this perturbation:

1. We may change the structure of the tree by adding or pruning branches.

2. We may change the prediction in each terminal node of the tree.

Figure 8.12 illustrates examples of possible perturbations to a tree.


The output of BART is a collection of prediction models,

$$\hat{f}^b(x) = \sum_{k=1}^{K} \hat{f}_k^b(x), \quad \text{for } b = 1, 2, \ldots, B.$$

We typically throw away the first few of these prediction models, since models obtained in the earlier iterations — known as the burn-in period — tend not to provide very good results. We can let L denote the number of burn-in iterations; for instance, we might take L = 200. Then, to obtain a single prediction, we simply take the average after the burn-in iterations, $\hat{f}(x) = \frac{1}{B-L}\sum_{b=L+1}^{B} \hat{f}^b(x)$. However, it is also possible to compute quantities other than the average: for instance, the percentiles of $\hat{f}^{L+1}(x), \ldots, \hat{f}^B(x)$ provide a measure of uncertainty in the final prediction. The overall BART procedure is summarized in Algorithm 8.3.
A key element of the BART approach is that in Step 3(a)ii., we do not fit
a fresh tree to the current partial residual: instead, we try to improve the fit
to the current partial residual by slightly modifying the tree obtained in the
previous iteration (see Figure 8.12). Roughly speaking, this guards against
overfitting since it limits how “hard” we fit the data in each iteration.
Furthermore, the individual trees are typically quite small. We limit the
tree size in order to avoid overfitting the data, which would be more likely
to occur if we grew very large trees.
Figure 8.13 shows the result of applying BART to the Heart data, using
K = 200 trees, as the number of iterations is increased to 10,000. During
the initial iterations, the test and training errors jump around a bit. After
this initial burn-in period, the error rates settle down. We note that there
is only a small difference between the training error and the test error,
indicating that the tree perturbation process largely avoids overfitting.
The training and test errors for boosting are also displayed in Figure 8.13.
We see that the test error for boosting approaches that of BART, but then
begins to increase as the number of iterations increases. Furthermore, the
training error for boosting decreases as the number of iterations increases,
indicating that boosting has overfit the data.
Though the details are outside of the scope of this book, it turns out
that the BART method can be viewed as a Bayesian approach to fitting an
ensemble of trees: each time we randomly perturb a tree in order to fit the
residuals, we are in fact drawing a new tree from a posterior distribution.
(Of course, this Bayesian connection is the motivation for BART’s name.)
Furthermore, Algorithm 8.3 can be viewed as a Markov chain Monte Carlo algorithm for fitting the BART model.

Algorithm 8.3 Bayesian Additive Regression Trees

1. Let $\hat{f}_1^1(x) = \hat{f}_2^1(x) = \cdots = \hat{f}_K^1(x) = \frac{1}{nK}\sum_{i=1}^{n} y_i$.

2. Compute $\hat{f}^1(x) = \sum_{k=1}^{K} \hat{f}_k^1(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$.

3. For b = 2, . . . , B:

   (a) For k = 1, 2, . . . , K:

      i. For i = 1, . . . , n, compute the current partial residual
         $$r_i = y_i - \sum_{k' < k} \hat{f}_{k'}^{b}(x_i) - \sum_{k' > k} \hat{f}_{k'}^{b-1}(x_i).$$

      ii. Fit a new tree, $\hat{f}_k^b(x)$, to $r_i$, by randomly perturbing the kth tree from the previous iteration, $\hat{f}_k^{b-1}(x)$. Perturbations that improve the fit are favored.

   (b) Compute $\hat{f}^b(x) = \sum_{k=1}^{K} \hat{f}_k^b(x)$.

4. Compute the mean after L burn-in samples,
   $$\hat{f}(x) = \frac{1}{B-L}\sum_{b=L+1}^{B} \hat{f}^b(x).$$

When we apply BART, we must select the number of trees K, the number
of iterations B, and the number of burn-in iterations L. We typically choose
large values for B and K, and a moderate value for L: for instance, K = 200,
B = 1,000, and L = 100 is a reasonable choice. BART has been shown to
have very impressive out-of-box performance — that is, it performs well
with minimal tuning.
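To see how the pieces of Algorithm 8.3 fit together, here is a deliberately simplified Python sketch of the backfitting loop. It is our illustration, not the book's or any package's implementation: in particular, Step 3(a)ii. is replaced by simply refitting a small tree to the partial residual rather than drawing a Bayesian perturbation of the previous tree, and the function returns only fitted values on the training data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bart_like_fit(X, y, K=200, B=1000, L=100):
    """Skeleton of Algorithm 8.3 with a simplified (non-Bayesian) tree update."""
    n = len(y)
    # Steps 1-2: every tree starts as a single root node predicting mean(y) / K.
    tree_preds = np.full((K, n), y.mean() / K)
    f_hats = [tree_preds.sum(axis=0)]                 # f^1 evaluated on the training data
    for b in range(2, B + 1):                         # step 3
        for k in range(K):                            # 3(a): update the K trees one at a time
            # 3(a)i.: partial residual, leaving out only the kth tree's contribution
            r = y - tree_preds.sum(axis=0) + tree_preds[k]
            # 3(a)ii. (simplified): refit a small tree to r instead of perturbing the old tree
            tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
            tree_preds[k] = tree.predict(X)
        f_hats.append(tree_preds.sum(axis=0))         # 3(b): f^b on the training data
    # Step 4: average the models obtained after the L burn-in iterations.
    return np.mean(f_hats[L:], axis=0)
```

With the default K = 200 and B = 1,000 this loop fits 200,000 small trees, so in practice one would use a dedicated BART package; the point here is only the structure of the partial-residual update and the burn-in averaging.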

[Figure 8.13 appears here: Error versus Number of Iterations (5 to 5,000, log scale) for BART Training Error, BART Test Error, Boosting Training Error, and Boosting Test Error.]

FIGURE 8.13. BART and boosting results for the Heart data. Both training and test errors are displayed. After a burn-in period of 100 iterations (shown in gray), the error rates for BART settle down. Boosting begins to overfit after a few hundred iterations.

8.2.5 Summary of Tree Ensemble Methods

Trees are an attractive choice of weak learner for an ensemble method for a number of reasons, including their flexibility and ability to handle predictors of mixed types (i.e. qualitative as well as quantitative). We have now seen four approaches for fitting an ensemble of trees: bagging, random forests, boosting, and BART.

• In bagging, the trees are grown independently on random samples of the observations. Consequently, the trees tend to be quite similar to each other. Thus, bagging can get caught in local optima and can fail to thoroughly explore the model space.

• In random forests, the trees are once again grown independently on random samples of the observations. However, each split on each tree is performed using a random subset of the features, thereby decorrelating the trees, and leading to a more thorough exploration of model space relative to bagging.
• In boosting, we only use the original data, and do not draw any random samples. The trees are grown successively, using a “slow” learning approach: each new tree is fit to the signal that is left over from the earlier trees, and shrunken down before it is used.

• In BART, we once again only make use of the original data, and we grow the trees successively. However, each tree is perturbed in order to avoid local minima and achieve a more thorough exploration of the model space.
