Confidence Intervals for Random Forests
Stefan Wager, Trevor Hastie, and Bradley Efron
Abstract
We study the variability of predictions made by bagged learners and random forests, and
show how to estimate standard errors for these methods. Our work builds on variance
estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and
the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite
number B of bootstrap replicates, and working with a large B can be computationally
expensive. Direct applications of jackknife and IJ estimators to bagging require B =
Θ(n1.5 ) bootstrap replicates to converge, where n is the size of the training set. We propose
improved versions that only require B = Θ(n) replicates. Moreover, we show that the IJ
estimator requires 1.7 times less bootstrap replicates than the jackknife to achieve a given
accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance
estimates themselves. We illustrate our findings with multiple experiments and simulation
studies.
Keywords: bagging, jackknife methods, Monte Carlo noise, variance estimation
1. Introduction
Bagging (Breiman, 1996) is a popular technique for stabilizing statistical learners. Bag-
ging is often conceptualized as a variance reduction technique, and so it is important to
understand how the sampling variance of a bagged learner compares to the variance of the
original learner. In this paper, we develop and study methods for estimating the variance
of bagged predictors and random forests (Breiman, 2001), a popular extension of bagged
trees. These variance estimates only require the bootstrap replicates that were used to form
the bagged prediction itself, and so can be obtained with moderate computational overhead.
The results presented here build on the jackknife-after-bootstrap methodology introduced
by Efron (1992) and on the infinitesimal jackknife for bagging (IJ) (Efron, 2013).
Figure 1 shows the results from applying our method to a random forest trained on the
“Auto MPG” data set, a regression task where we aim to predict the miles-per-gallon (MPG)
gas consumption of an automobile based on 7 features including weight and horsepower. The
error bars shown in Figure 1 give an estimate of the sampling variance of the random forest;
in other words, they tell us how much the random forest’s predictions might change if we
[Figure 1: scatter plot of predicted MPG against reported MPG, with error bars on each prediction; see caption below.]
Figure 1: Random forest predictions on the “Auto MPG” data set. The random forest was
trained using 314 examples; the graph shows results on a test set of size 78. The
error bars are 1 standard error in each direction. Because this is a fairly small
data set, we estimated standard errors for the random forest using the averaged
estimator from Section 5.2. A more detailed description of the experiment is
provided in Appendix C.
trained it on a new training set. The fact that the error bars do not in general cross the
prediction-equals-observation diagonal suggests that there is some residual noise in the MPG
of a car that cannot be explained by a random forest model based on the available predictor
variables.1
Figure 1 tells us that the random forest was more confident about some predictions than
others. Rather reassuringly, we observe that the random forest was in general less confident
about the predictions for which the reported MPG and predicted MPG were very different.
There is not a perfect correlation, however, between the error level and the size of the error
bars. One of the points, circled in red near (32, 32), appears particularly surprising: the
random forest got the prediction almost exactly right, but gave the prediction large error
bars of ±2. This curious datapoint corresponds to the 1982 Dodge Rampage, a two-door
Coupe Utility that is a mix between a passenger car and a truck with a cargo tray. Perhaps
our random forest had a hard time confidently estimating the mileage of the Rampage
1. Our method produces standard error estimates σ̂ for random forest predictions. We then represent these
standard error estimates as Gaussian confidence intervals ŷ ± zα σ̂, where zα is a quantile of the normal
distribution.
because it could not quite decide whether to cluster it with cars or with trucks. We present
experiments on larger data sets in Section 3.
Estimating the variance of bagged learners based on the preexisting bootstrap replicates
can be challenging, as there are two distinct sources of noise. In addition to the sampling
noise (i.e., the noise arising from randomness during data collection), we also need to control
the Monte Carlo noise arising from the use of a finite number of bootstrap replicates. We
study the effects of both sampling noise and Monte Carlo noise.
In our experience, the errors of the jackknife and IJ estimates of variance are often
dominated by Monte Carlo effects. Monte Carlo bias can be particularly troublesome: if
we are not careful, the jackknife and IJ estimators can conflate Monte Carlo noise with the
underlying sampling noise and badly overestimate the sampling variance. We show how to
estimate the magnitude of this Monte Carlo bias and develop bias-corrected versions of the
jackknife and IJ estimators that outperform the original ones. We also show that the IJ
estimate of variance is able to use the preexisting bootstrap replicates more efficiently than
the jackknife estimator by having a lower Monte Carlo variance, and needs 1.7 times fewer
bootstrap replicates than the jackknife to achieve a given accuracy.
If we take the number of bootstrap replicates to infinity, Monte Carlo effects disappear
and only sampling errors remain. We compare the sampling biases of both the jackknife and
IJ rules and present some evidence that, while the jackknife rule has an upward sampling
bias and the IJ estimator can have a downward bias, the arithmetic mean of the two variance
estimates can be close to unbiased. We also propose a simple method for estimating the
sampling variance of the IJ estimator itself.
Our paper is structured as follows. We first present an overview of our main results
in Section 2, and apply them to random forest examples in Section 3. We then take a
closer look at Monte Carlo effects in Section 4 and analyze the sampling distribution of the
limiting IJ and jackknife rules with B → ∞ in Section 5. We spread simulation experiments
throughout Sections 4 and 5 to validate our theoretical analysis.
The ideal bagged estimator is a bootstrap expectation,

$$\hat\theta^\infty(x) = E_*\left[t(x;\, Z_1^*, \ldots, Z_n^*)\right], \qquad (1)$$

where the $Z_i^*$ are drawn independently with replacement from the original data (i.e., they form a bootstrap sample). The expectation $E_*$ is taken with respect to the bootstrap measure.

The expectation in (1) cannot in general be evaluated exactly, and so we form the bagged estimator by Monte Carlo:

$$\hat\theta^B(x) = \frac{1}{B}\sum_{b=1}^B t_b^*(x), \qquad \text{where } t_b^*(x) = t(x;\, Z_{b1}^*, \ldots, Z_{bn}^*). \qquad (2)$$
Our goal is to estimate V, the sampling variance of the ideal bagged estimator $\hat\theta^\infty(x)$; in other words, we ask how much variance $\hat\theta^B$ would have once we make B large enough to eliminate the bootstrap effects. We consider two basic estimates of V: the Infinitesimal Jackknife estimate (Efron, 2013), which results in the simple expression

$$\hat{V}_{IJ}^\infty = \sum_{i=1}^n \mathrm{Cov}_*\left[N_i^*,\, t^*(x)\right]^2, \qquad (3)$$

where $\mathrm{Cov}_*[N_i^*, t^*(x)]$ is the covariance between $t^*(x)$ and the number of times $N_i^*$ the ith training example appears in a bootstrap sample; and the Jackknife-after-Bootstrap estimate (Efron, 1992)

$$\hat{V}_J^\infty = \frac{n-1}{n}\sum_{i=1}^n \left(\bar{t}_{(-i)}^*(x) - \bar{t}^*(x)\right)^2, \qquad (4)$$

where $\bar{t}_{(-i)}^*(x)$ is the average of $t^*(x)$ over all the bootstrap samples not containing the ith example and $\bar{t}^*(x)$ is the mean of all the $t^*(x)$.
The jackknife-after-bootstrap estimate $\hat{V}_J^\infty$ arises directly by applying the jackknife to the bootstrap distribution. The infinitesimal jackknife (Jaeckel, 1972), also called the non-parametric delta method, is an alternative to the jackknife where, instead of studying the behavior of a statistic when we remove one observation at a time, we look at what happens to the statistic when we individually down-weight each observation by an infinitesimal amount. When the infinitesimal jackknife is available, it sometimes gives more stable predictions than the regular jackknife. Efron (2013) shows how an application of the infinitesimal jackknife principle to the bootstrap distribution leads to the simple estimate $\hat{V}_{IJ}^\infty$.
In practice, we only have a finite number B of bootstrap replicates, and we estimate (3) and (4) by their Monte Carlo analogues:

$$\hat{V}_{IJ}^B = \sum_{i=1}^n \hat{C}_i^2, \qquad \text{where } \hat{C}_i = \frac{1}{B}\sum_{b=1}^B \left(N_{bi}^* - 1\right)\left(t_b^*(x) - \bar{t}^*(x)\right), \qquad (5)$$

and

$$\hat{V}_J^B = \frac{n-1}{n}\sum_{i=1}^n \hat\Delta_i^2, \qquad \text{where } \hat\Delta_i = \hat\theta_{(-i)}^B(x) - \hat\theta^B(x) \quad \text{and} \quad \hat\theta_{(-i)}^B(x) = \frac{\sum_{\{b\,:\,N_{bi}^* = 0\}} t_b^*(x)}{\left|\left\{b : N_{bi}^* = 0\right\}\right|}. \qquad (6)$$

Here, $N_{bi}^*$ denotes the number of times the ith observation appears in the bootstrap sample b.
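For concreteness, the following R sketch computes these two Monte Carlo estimates. This is illustrative code rather than the authors' implementation; it assumes the bootstrap counts N and the per-replicate predictions t at a fixed test point have already been stored.

```r
# Illustrative sketch of (5) and (6); not the authors' code.
# N: B x n matrix, N[b, i] = number of times training example i appears in bootstrap sample b
# t: length-B vector of per-replicate predictions t*_b(x) at a fixed test point x
ij_var_B <- function(N, t) {
  B <- length(t)
  # Monte Carlo estimate of Cov_*[N_i*, t*(x)], centering N at its bootstrap expectation 1
  C <- crossprod(N - 1, t - mean(t)) / B      # n x 1 vector of covariances C_i
  sum(C^2)                                    # equation (5)
}

jab_var_B <- function(N, t) {
  n <- ncol(N)
  # average prediction over the replicates that do not contain example i;
  # assumes every example is excluded from at least one replicate (true for moderate B)
  t_minus_i <- vapply(seq_len(n), function(i) mean(t[N[, i] == 0]), numeric(1))
  (n - 1) / n * sum((t_minus_i - mean(t))^2)  # equation (6)
}
```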
In our experience, these finite-B estimates of variance are often badly biased upwards
if the number of bootstrap samples B is too small. Fortunately, bias-corrected versions are
available:
$$\hat{V}_{IJ-U}^B = \hat{V}_{IJ}^B - \frac{n}{B^2}\sum_{b=1}^B \left(t_b^*(x) - \bar{t}^*(x)\right)^2, \qquad \text{and} \qquad (7)$$

$$\hat{V}_{J-U}^B = \hat{V}_J^B - (e-1)\,\frac{n}{B^2}\sum_{b=1}^B \left(t_b^*(x) - \bar{t}^*(x)\right)^2. \qquad (8)$$
These bias corrections are derived in Section 4. In many applications, the simple estimators (5) and (6) require B = Θ(n^{1.5}) bootstrap replicates to reduce Monte Carlo noise down to the level of the inherent sampling noise, whereas our bias-corrected versions only require B = Θ(n) replicates. The bias-corrected jackknife (8) was also discussed by Sexton and Laake (2009).
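Continuing the illustrative sketch above, the corrections can be implemented in a few lines; the inputs N and t are the same assumed quantities as before.

```r
# Illustrative sketch of the bias corrections (7) and (8), reusing ij_var_B and jab_var_B.
bias_corrected <- function(N, t) {
  B <- length(t)
  n <- ncol(N)
  v_hat <- sum((t - mean(t))^2) / B                              # bootstrap variance of the base learner
  list(ij_u = ij_var_B(N, t)  - n * v_hat / B,                   # equation (7)
       j_u  = jab_var_B(N, t) - (exp(1) - 1) * n * v_hat / B)    # equation (8)
}
```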
In Figure 2, we show how $\hat{V}_{IJ-U}^B$ can be used to accurately estimate the variance of a bagged tree. We compare the true sampling variance of a bagged regression tree with our variance estimate. The underlying signal is a step function with four jumps that are reflected as spikes in the variance of the bagged tree. On average, our variance estimator accurately identifies the location and magnitude of these spikes.

[Figure 2: the true sampling variance of the bagged tree and the mean estimated variance, with a ±1 standard deviation band.]
Figure 3 compares the performance of the four considered variance estimates on a bagged adaptive polynomial regression example described in detail in Section 4.4. We see that the uncorrected estimators $\hat{V}_J^B$ and $\hat{V}_{IJ}^B$ are badly biased: the lower whiskers of their boxplots do not even touch the limiting estimate with B → ∞. We also see that $\hat{V}_{IJ-U}^B$ has about half the variance of $\hat{V}_{J-U}^B$. This example highlights the importance of using estimators that use available bootstrap replicates efficiently: with B = 500 bootstrap replicates, $\hat{V}_{IJ-U}^B$ can give us a reasonable estimate of V, whereas $\hat{V}_J^B$ is quite unstable and biased upwards by a factor of 2.
The figure also suggests that the Monte Carlo noise of $\hat{V}_{IJ}^B$ decays faster (as a function of B) than that of $\hat{V}_J^B$. This is no accident: as we show in Section 4.2, the infinitesimal jackknife requires 1.7 times fewer bootstrap replicates than the jackknife to achieve a given level of Monte Carlo error.
In practice, we want both the sampling error of $\hat{V}_J^\infty$, namely $\hat{V}_J^\infty - V$, and the Monte Carlo error $\hat{V}_J^B - \hat{V}_J^\infty$, to be small.
[Figure 3: boxplots of the variance estimates from the Jackknife, Infinitesimal Jackknife, Jackknife-U, and Infinitesimal Jackknife-U rules.]
It is well known that jackknife estimates of variance are in general biased upwards (Efron and Stein, 1981). This phenomenon also holds for bagging: $\hat{V}_J^\infty$ is somewhat biased upwards for V. We present some evidence suggesting that $\hat{V}_{IJ}^\infty$ is biased downwards by a similar amount, and that the arithmetic mean of $\hat{V}_J^\infty$ and $\hat{V}_{IJ}^\infty$ is closer to being unbiased for V than either estimate on its own. We also give a simple estimate for the sampling variance of the IJ estimator itself:

$$\widehat{\mathrm{Var}}\left[\hat{V}_{IJ}^\infty\right] = \sum_{i=1}^n \left(C_i^{*\,2} - \overline{C^{*\,2}}\right)^2,$$

where $C_i^* = \mathrm{Cov}_*[N_{bi}^*, t_b^*(x)]$ and $\overline{C^{*\,2}}$ is the mean of the $C_i^{*\,2}$.
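This formula can also be evaluated directly from the stored bootstrap replicates; the sketch below reuses the notation of the earlier illustrative code and is, again, only a plausible implementation rather than the authors' own.

```r
# Illustrative sketch of the sampling-variance formula for the IJ estimator:
# sum over i of (C_i^2 - mean of the C_i^2)^2.
ij_var_of_var <- function(N, t) {
  B  <- length(t)
  C  <- as.vector(crossprod(N - 1, t - mean(t))) / B  # C_i* = Cov_*[N_bi*, t_b*(x)]
  C2 <- C^2
  sum((C2 - mean(C2))^2)
}
```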
Random forests extend bagged trees by allowing the individual trees $t_b^*$ to depend on an auxiliary noise source $\xi_b$. The main idea is that the auxiliary noise $\xi_b$ encourages more diversity among the individual trees, and allows for more variance reduction than bagging.
Several variants of random forests have been analyzed theoretically by, e.g., Biau et al.
(2008), Biau (2012), Lin and Jeon (2006), and Meinshausen (2006).
Standard implementations of random forests use the auxiliary noise ξb to randomly
restrict the number of variables on which the bootstrapped trees can split at any given
training step. At each step, m features are randomly selected from the pool of all p possible
features and the tree predictor must then split on one of these m features. If m = p the
tree can always split on any feature and the random forest becomes a bagged tree; if m = 1,
then the tree has no freedom in choosing which feature to split on.
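As a usage illustration (not part of the paper), the randomForest package of Liaw and Wiener (2002) exposes this parameter as mtry; the snippet below is only a sketch, where train is an assumed data frame with a numeric response y and the tree count is an arbitrary illustrative setting.

```r
# Usage sketch only; 'train' and its response 'y' are assumed placeholders.
library(randomForest)
p <- ncol(train) - 1                                                     # number of predictors
rf_bagged <- randomForest(y ~ ., data = train, mtry = p, ntree = 1000)   # m = p: bagged trees
rf_small  <- randomForest(y ~ ., data = train, mtry = 5, ntree = 1000)   # m = 5 splitting variables
```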
Following Breiman (2001), random forests are usually defined more abstractly for theoretical analysis: any predictor of the form

$$\hat\theta_{RF}(x) = \frac{1}{B}\sum_{b=1}^B t_b^*(x;\, \xi_b, Z_{b1}^*, \ldots, Z_{bn}^*) \qquad \text{with } \xi_b \overset{iid}{\sim} \Xi \qquad (9)$$
is called a random forest. Various choices of noise distribution Ξ lead to different random
forest predictors. In particular, trivial noise sources are allowed and so the class of random
forests includes bagged trees as a special case. In this paper, we only consider random forests
of type (9) where individual trees are all trained on bootstrap samples of the training data.
We note, however, that variants of random forests that do not use bootstrap noise have also been found to work well (e.g., Dietterich, 2000; Geurts et al., 2006).
All our results about bagged predictors apply directly to random forests. The reason
for this is that random forests can also be defined as bagged predictors with different base
learners. Suppose that, on each bootstrap replicate, we drew K times from the auxiliary
noise distribution Ξ instead of just once. This would give us a predictor of the form
$$\hat\theta_{RF}(x) = \frac{1}{B}\sum_{b=1}^B \frac{1}{K}\sum_{k=1}^K t_b^*(x;\, \xi_{kb}, Z_{b1}^*, \ldots, Z_{bn}^*) \qquad \text{with } \xi_{kb} \overset{iid}{\sim} \Xi.$$
Adding the extra draws from Ξ to the random forest does not change the B → ∞ limit of
the random forest. If we take K → ∞, we effectively marginalize over the noise from Ξ, and
get a predictor
$$\tilde\theta_{RF}(x) = \frac{1}{B}\sum_{b=1}^B \tilde t_b^*(x;\, Z_{b1}^*, \ldots, Z_{bn}^*), \qquad \text{where} \qquad \tilde t(x;\, Z_1, \ldots, Z_n) = E_{\xi\sim\Xi}\left[t(x;\, \xi, Z_1, \ldots, Z_n)\right].$$
[Figure 4: three panels plotting the standard deviation estimate against the class prediction, one panel per random forest; see caption below.]
Figure 4: Standard errors of random forest predictions on the e-mail spam data. The random
forests with m = 5, 19, and 57 splitting variables were all trained on a training set
of size n = 3,065; the panels above show class predictions and IJ-U estimates of
standard errors on a test set of size 1,536. The solid curves are smoothing splines
standard errors on a test set of size 1,536. The solid curves are smoothing splines
(df = 4) fit through the data (including both correct and incorrect predictions).
In other words, the random forest $\hat\theta_{RF}$ as defined in (9) is just a noisy estimate of a bagged predictor with base learner $\tilde t$.

It is straightforward to check that our results about $\hat{V}_{IJ}^B$ and $\hat{V}_J^B$ also hold for bagged predictors with randomized base learners. The extra noise from using $t(\cdot;\xi)$ instead of $\tilde t(\cdot)$ does not affect the limiting correlations in (3) and (4); meanwhile, the bias corrections from (7) and (8) do not depend on how we produced the $t^*$ and remain valid with random forests. Thus, we can estimate confidence intervals for random forests from $N^*$ and $t^*$ using exactly the same formulas as for bagging.
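As an end-to-end illustration of this point, the sketch below hand-rolls a small bagged ensemble so that the in-bag counts $N^*$ and per-replicate predictions $t^*$ are explicitly available, and then applies the estimator functions sketched earlier. This is illustrative code only: rpart stands in for the (possibly randomized) regression tree learner, and train, its response y, and the single test row x0 are assumed placeholders rather than objects from the paper.

```r
# Illustrative sketch: collect N* and t* from a hand-rolled bagged regression tree ensemble,
# then apply the variance estimators from the earlier snippets at one test point x0.
library(rpart)
bagged_replicates <- function(train, x0, B = 2000) {
  n <- nrow(train)
  N <- matrix(0L, nrow = B, ncol = n)
  t <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)      # bootstrap sample Z*_b1, ..., Z*_bn
    N[b, ] <- tabulate(idx, nbins = n)           # in-bag counts N*_bi
    fit <- rpart(y ~ ., data = train[idx, ])     # base learner; randomization could be added here
    t[b] <- predict(fit, newdata = x0)
  }
  list(N = N, t = t)
}
# reps <- bagged_replicates(train, x0)
# bias_corrected(reps$N, reps$t)   # IJ-U and J-U variance estimates at x0
```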
In the rest of this section, we show how the variance estimates studied in this paper can be used to gain valuable insights in applications of random forests. We use the $\hat{V}_{IJ-U}^B$ estimator for these experiments.
[Figure 5: comparison of the class probability estimates from the m = 5 and m = 19 random forests on the test set.]
In Figure 4, we plot test-set predictions against IJ-U estimates of standard error for all
three random forests. The m = 57 random forest appears to be quite unstable, in that the
estimated errors are high. Because many of its predictions have large standard errors, it
is plausible that the predictions made by the random forest could change drastically if we
got more training data. Thus, the m = 57 forest appears to suffer from overfitting, and the
quality of its predictions could improve substantially with more data.
Conversely, predictions made by the m = 5 random forest appear to be remarkably
stable, and almost all predictions have standard errors that lie below 0.1. This suggests that
the m = 5 forest may be mostly constrained by bias: if the predictor reports that a certain
e-mail is spam with probability 0.5 ± 0.1, then the predictor has effectively abandoned any
hope of unambiguously classifying the e-mail. Even if we managed to acquire much more
training data, the class prediction for that e-mail would probably not converge to a strong
vote for spam or non-spam.
The m = 19 forest appears to have balanced the bias-variance trade-off well. We can
further corroborate our intuition about the bias problem faced by the m = 5 forest by
comparing its predictions with those of the m = 19 forest. As shown in Figure 5, whenever
the m = 5 forest made a cautious prediction that an e-mail might be spam (e.g., a prediction
of around 0.8), the m = 19 forest made the same classification decision but with more
confidence (i.e., with a more extreme class probability estimate p̂). Similarly, the m = 19
forest tended to lower cautious non-spam predictions made by the m = 5 forest. In other
words, the m = 5 forest appears to have often made lukewarm predictions with mid-range
values of p̂ on e-mails for which there was sufficient information in the data to make confident
predictions. This analysis again suggests that the m = 5 forest was constrained by bias and
was not able to efficiently use all the information present in the data set.
[Figure 6: left panel, squared error and mean sampling variance against the number of splitting variables; right panel, bootstrap variance and tree correlation against the number of splitting variables; see caption below.]
Figure 6: Performance of random forests on the California housing data. The left panel plots MSE and mean sampling variance as a function of the number m of splitting variables. The MSE estimate is the out-of-bag error, while the mean sampling variance is the average estimate of variance $\hat{V}_{IJ-U}^B$ computed over all training examples. The right panel displays the drivers of sampling variance, namely the variance of the individual bootstrapped trees (bootstrap variance v) and their correlation (tree correlation ρ).
the variance v of the individual trees shoots up, and so the decrease in ρ is no longer sufficient
to bring down the variance of the whole forest. The increasing ρ-curve and the decreasing
v-curve thus jointly produce a U-shaped relationship between m and the variance of the
random forest. The m = 4 forest achieves a low variance by matching fairly stable base
learners with a small correlation ρ.
Let $\hat{V}_{IJ}^\infty$ be the perfect IJ estimator with B = ∞ (Efron, 2013). Then, the Monte Carlo bias of $\hat{V}_{IJ}^B$ is

$$E_*\left[\hat{V}_{IJ}^B - \hat{V}_{IJ}^\infty\right] = \sum_{i=1}^n \mathrm{Var}_*\left[C_i\right], \qquad \text{where } C_i = \frac{\sum_b \left(N_{bi}^* - 1\right)\left(t_b^* - \bar t^*\right)}{B}$$
is the Monte Carlo estimate of the bootstrap covariance. Since $t_b^*$ depends on all n observations, $N_{bi}^*$ and $t_b^*$ can in practice be treated as independent for computing $\mathrm{Var}_*[C_i]$, especially when n is large (see the remark below). Thus, as $\mathrm{Var}_*[N_{bi}^*] = 1$, we see that
$$E_*\left[\hat{V}_{IJ}^B - \hat{V}_{IJ}^\infty\right] \approx \frac{n\,\hat v}{B}, \qquad \text{where } \hat v = \frac{1}{B}\sum_{b=1}^B \left(t_b^* - \bar t^*\right)^2. \qquad (11)$$
Notice that $\hat v$ is the standard bootstrap estimate for the variance of the base learner $\hat\theta(x)$. Thus, the bias of $\hat{V}_{IJ}^B$ grows linearly in the variance of the original estimator that is being bagged.
Meanwhile, by the central limit theorem, $C_i$ converges to a Gaussian random variable as B gets large. Thus, the Monte Carlo asymptotic variance of $C_i^2$ is approximately $2\,\mathrm{Var}_*[C_i]^2 + 4\,E_*[C_i]^2\,\mathrm{Var}_*[C_i]$. The $C_i$ can be treated as roughly independent, and so the limiting distribution of the IJ estimate of variance has approximate moments (mean and variance)

$$\hat{V}_{IJ}^B - \hat{V}_{IJ}^\infty \;\dot\sim\; \left(\frac{n\,\hat v}{B},\; 2\,\frac{n\,\hat v^2}{B^2} + 4\,\frac{\hat{V}_{IJ}^\infty\,\hat v}{B}\right). \qquad (12)$$
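As a practical aside (not from the paper), (11) and (12) suggest a simple diagnostic for whether B is large enough: the implied Monte Carlo bias and standard deviation should both be small relative to the variance estimate itself. The sketch below plugs bootstrap quantities into these formulas, using the bias-corrected estimate (truncated at zero) as a stand-in for the unknown $\hat{V}_{IJ}^\infty$, and continues the earlier illustrative functions.

```r
# Illustrative diagnostic based on (11)-(12); reuses ij_var_B from the earlier sketch.
ij_monte_carlo_error <- function(N, t) {
  B <- length(t)
  n <- ncol(N)
  v_hat <- sum((t - mean(t))^2) / B                    # bootstrap variance of the base learner
  v_inf <- max(ij_var_B(N, t) - n * v_hat / B, 0)      # proxy for the B = infinity IJ estimate
  c(bias = n * v_hat / B,                              # approximate Monte Carlo bias, from (11)
    sd   = sqrt(2 * n * v_hat^2 / B^2 + 4 * v_inf * v_hat / B))  # approximate Monte Carlo sd, from (12)
}
```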
Interestingly, the Monte Carlo mean squared error (MSE) of $\hat{V}_{IJ}^B$ mostly depends on the problem through $\hat v$, where $\hat v$ is the bootstrap estimate of the variance of the base learner.
In other words, the computational difficulty of obtaining confidence intervals for bagged
learners depends on the variance of the base learner.
In the case of the sample mean $t(Z_1^*, \ldots, Z_n^*) = \frac{1}{n}\sum_i Z_i^*$ paired with the Poisson bootstrap, this term reduces to

$$\mathrm{Cov}_*\!\left[\left(N_{bi}^* - 1\right)^2,\, \left(t_b^* - \bar t^*\right)^2\right] - \mathrm{Cov}_*\!\left[\left(N_{bi}^* - 1\right),\, \left(t_b^* - \bar t^*\right)\right]^2 = 2\,\frac{\left(Z_i - \bar Z\right)^2}{n^2},$$

and the correction to (11) would be $2\hat v/(nB) \ll n\hat v/B$.
where $\hat{V}_J^\infty$ is the jackknife estimate computed with B = ∞ bootstrap replicates. The Monte Carlo stability of $\hat{V}_J^B$ again primarily depends on $\hat v$.
By comparing (12) with (14), we notice that the IJ estimator makes better use of a finite number B of bootstrap replicates than the jackknife estimator. For a fixed value of B, the Monte Carlo bias of $\hat{V}_J^B$ is about e − 1, or 1.7, times as large as that of $\hat{V}_{IJ}^B$; the ratio of Monte Carlo variances starts off at 3 for small values of B and decays down to 1.7 as B gets much larger than n. Alternatively, we see that the IJ estimate with B bootstrap replicates has errors on the same scale as the jackknife estimate with 1.7 · B replicates.

This suggests that if computational considerations matter and there is a desire to use as few bootstrap replicates B as possible while controlling Monte Carlo error, the infinitesimal jackknife method may be preferable to the jackknife-after-bootstrap.
These observations motivate the bias-corrected estimators

$$\hat{V}_{IJ-U}^B = \hat{V}_{IJ}^B - \frac{n\,\hat v}{B}, \qquad \text{and} \qquad (15)$$

$$\hat{V}_{J-U}^B = \hat{V}_J^B - (e-1)\,\frac{n\,\hat v}{B}. \qquad (16)$$

Here $\hat{V}_{IJ}^B$ and $\hat{V}_J^B$ are as defined in (5) and (6), and $\hat v$ is the bootstrap estimate of variance from (11). The letter U stands for unbiased. This transformation effectively removes the Monte Carlo bias in our experiments without noticeably increasing variance. The bias-corrected estimates only need B = Θ(n) bootstrap replicates to control Monte Carlo MSE at level 1/n.
[Figure 7: plot of the bias ratio and variance ratio.]

Figure 7: Predicted and actual performance ratios for the uncorrected $\hat{V}_J^B$ and $\hat{V}_{IJ}^B$.
The example concerns predicting the cholesterol decrease of a new patient with compliance level c = −2.25; this corresponds to the patient with the lowest observed compliance level.
In Figure 3, we compare the performance of the variance estimates for bagged predictors
studied in this paper. The boxplots depict repeated realizations of the variance estimates
with a finite B. We can immediately verify the qualitative insights presented in this section.
Both the jackknife and IJ rules are badly biased for small B, and this bias goes away more
slowly than the Monte Carlo variance. Moreover, at any given B, the jackknife estimator is
noticeably less stable than the IJ estimator.
The J-U and IJ-U estimators appear to fix the bias problem without introducing in-
stability. The J-U estimator has a slightly higher mean than the IJ-U one. As discussed
in Section 5.2, this is not surprising, as the limiting (B → ∞) jackknife estimator has an
upward sampling bias while the limiting IJ estimator can have a downward sampling bias.
The fact that the J-U and IJ-U estimators are so close suggests that both methods work
well for this problem.
The insights developed here also appear to hold quantitatively. In Figure 7, we compare
the ratios of Monte Carlo bias and variance for the jackknife and IJ estimators with theo-
retical approximations implied by (12) and (14). The theoretical formulas appear to present
a credible picture of the relative merits of the jackknife and IJ rules.
In the process of developing this variance formula, we obtain an ANOVA expansion of $\hat{V}_{IJ}^\infty$ that we then use in Section 5.2 to compare the sampling biases of the jackknife and infinitesimal jackknife estimators.

$$\widehat{\mathrm{Var}}\left[\hat{V}_{IJ}^\infty\right] = \sum_{i=1}^n \left(C_i^{*\,2} - \overline{C^{*\,2}}\right)^2, \qquad (19)$$

where $C_i^* = \mathrm{Cov}_*[N_{bi}^*, t_b^*]$ is a bootstrap estimate for $h_F(Z_i)$ and $\overline{C^{*\,2}}$ is the mean of the $C_i^{*\,2}$. The rest of the notation is as in Section 2.
The relation (17) arises from a general connection between the infinitesimal jackknife
and the theory of Hájek projections. The Hájek projection of an estimator is the best
approximation to that estimator that only considers first-order effects. In our case, the
Hájek projection of $\hat\theta^\infty$ is

$$\hat\theta_H^\infty = E_F\left[\hat\theta^\infty\right] + \sum_{i=1}^n h_F(Z_i), \qquad (20)$$

where $h_F(Z_i)$ is as in (18). The variance of the Hájek projection is $\mathrm{Var}\left[\hat\theta_H^\infty\right] = n\,\mathrm{Var}_F\left[h_F(Z)\right]$.
The key insight behind (17) is that the IJ estimator is effectively trying to estimate the variance of the Hájek projection of $\hat\theta^B$, and that

$$\hat{V}_{IJ}^\infty \approx \sum_{i=1}^n h_F^2(Z_i). \qquad (21)$$
The approximation (17) then follows immediately, as the right-hand side of the above expres-
sion is a sum of independent random variables. Note that we cannot apply this right-hand
side expression directly, as h depends on the unknown underlying distribution F .
The connections between Hájek projections and the infinitesimal jackknife have been
understood for a long time. Jaeckel (1972) originally introduced the infinitesimal jackknife
[Figure 8: two panels plotted against compliance level; left, the estimated cholesterol decrease with IJ error bars; right, error bars for the error bars; see caption below.]
Figure 8: Stability of the IJ estimate of variance on the cholesterol data. The left panel
shows the bagged fit to the data, along with error bars generated by the IJ method;
the stars denote the data (some data points have x-values that exceed the range
of the plot). In the right panel, we use (19) to estimate error bars for the error
bars in the first panel. All error bars are one standard deviation in each direction.
as a practical approximation to the first-order variance of an estimator (in our case, the right-
hand side of (21)). More recently, Efron (2013) showed that $\hat{V}_{IJ}^\infty$ is equal to the variance of a “bootstrap Hájek projection.” In Appendix B, we build on these ideas and show that, in
cases where a plug-in approximation is valid, (21) holds very nearly for bagged estimators.
We apply our variance formula to the cholesterol data set of Efron (2013), following the methodology described in Section 4.4. In Figure 8, we use the formula (19) to study the sampling variance of $\hat{V}_{IJ}^\infty$ as a function of the compliance level c. The main message here is rather reassuring: as seen in Figure 8b, the coefficient of variation of $\hat{V}_{IJ}^\infty$ appears to be fairly low, suggesting that the IJ variance estimates can be trusted in this example. Note that the formula from (19) can require many bootstrap replicates to stabilize and suffers from an upward Monte Carlo bias just like $\hat{V}_{IJ}^B$. We used B = 100,000 bootstrap replicates to generate Figure 8.
where

$$V_1 = n\,\mathrm{Var}_F\left[E_F\left[\hat\theta^\infty \,\middle|\, Z_1\right]\right]$$

is the variance due to first-order effects, V_2 is the variance due to second-order effects of the form

$$E_F\left[\hat\theta^\infty \,\middle|\, Z_1, Z_2\right] - E_F\left[\hat\theta^\infty \,\middle|\, Z_1\right] - E_F\left[\hat\theta^\infty \,\middle|\, Z_2\right] + E_F\left[\hat\theta^\infty\right],$$
and so the arithmetic mean of $\hat{V}_J^\infty$ and $\hat{V}_{IJ}^\infty$ has an upward bias that depends only on third- and higher-order effects. Thus, we might expect that in small-sample situations where $\hat{V}_J^\infty$ and $\hat{V}_{IJ}^\infty$ exhibit some bias, the mean of the two estimates may work better than either of them taken individually.
To test this idea, we used both the jackknife and IJ methods to estimate the variance of
a bagged tree trained on a sample of size n = 25. (See Appendix C for details.) Since the
sample size is so small, both the jackknife and IJ estimators exhibit some bias as seen in
Figure 9a. However, the mean of the two estimators is nearly unbiased for the true variance
of the bagged tree. (It appears that this mean has a very slight upward bias, just as we
would expect from (25).)
This issue can arise in real data sets too. When training bagged forward stepwise re-
gression on a prostate cancer data set discussed by Hastie et al. (2009), the jackknife and
IJ methods give fairly different estimates of variance: the jackknife estimator converged to
0.093, while the IJ estimator stabilized at 0.067 (Figure 9b). Based on the discussion in this section, it appears that (0.093 + 0.067)/2 = 0.08 should be considered a less biased estimate of variance than either of the two numbers on their own.
In the more extensive simulations presented in Table 1, averaging $\hat{V}_{IJ-U}^B$ and $\hat{V}_{J-U}^B$ is in general less biased than either of the original estimators (although the “AND” experiment seems to provide an exception to this rule, suggesting that most of the bias of $\hat{V}_{J-U}^B$ for this function is due to higher-order interactions). However, $\hat{V}_{IJ-U}^B$ has systematically lower
[Figure 9: left panel, expected values of the Jackknife-U, Inf. Jackknife-U, and Average estimates of variance; right panel, Jackknife-U and Inf. Jackknife-U variance estimates for the prostate data; see caption below.]
Figure 9: Sampling bias of the jackknife and IJ rules. In the left panel, we compare the
expected values of the jackknife and IJ estimators as well as their mean with
the true variance of a bagged tree. In this example, the features take values in
$(x_1, x_2) \in [-1, 1]^2$; we depict variance estimates along the diagonal $x_1 = x_2$. The prostate cancer plot can be interpreted in the same way as Figure 3, except that we now indicate the weighted means of the J-U and IJ-U estimators separately.
variance, which allows it to win in terms of overall mean squared error. Thus, if unbiasedness is important, averaging $\hat{V}_{IJ-U}^B$ and $\hat{V}_{J-U}^B$ seems like a promising idea, but $\hat{V}_{IJ-U}^B$ appears to be the better rule in terms of raw MSE minimization.
Finally, we emphasize that this relative bias result relies on the heuristic relationship (24).
While this approximation does not seem problematic for the first-order analysis presented in
Section 5.1, we may be concerned that the plug-in argument from Appendix B used to justify
it may not give us correct second- and higher-order terms. Thus, although our simulation
results seem promising, developing a formal and general understanding of the relative biases
∞ and V
of VbIJ b ∞ remains an open topic for follow-up research.
J
6. Conclusion
In this paper, we studied the jackknife-after-bootstrap and infinitesimal jackknife (IJ) meth-
ods (Efron, 1992, 2013) for estimating the variance of bagged predictors. We demonstrated
that both estimators suffer from considerable Monte Carlo bias, and we proposed bias-
corrected versions of the methods that appear to work well in practice. We also provided
a simple formula for the sampling variance of the IJ estimator, and showed that from a
sampling bias point of view the arithmetic mean of the jackknife and IJ estimators is often
preferable to either of the original methods. Finally, we applied these methods in numerous
Function   n     p     B      ERR    V̂_IJ-U^B        V̂_J-U^B         (V̂_IJ-U^B + V̂_J-U^B)/2
Cosine     50    2     200    Bias   −0.15 (±0.03)   0.14 (±0.02)    −0.01 (±0.02)
                              Var     0.08 (±0.02)   0.41 (±0.13)     0.2 (±0.06)
                              MSE     0.11 (±0.03)   0.43 (±0.13)     0.2 (±0.06)
Cosine     200   2     500    Bias   −0.05 (±0.01)   0.07 (±0.01)     0.01 (±0.01)
                              Var     0.02 (±0)      0.07 (±0.01)     0.04 (±0.01)
                              MSE     0.02 (±0)      0.07 (±0.01)     0.04 (±0.01)
XOR        50    50    200    Bias   −0.3 (±0.03)    0.37 (±0.04)     0.03 (±0.03)
                              Var     0.48 (±0.03)   1.82 (±0.12)     0.89 (±0.05)
                              MSE     0.58 (±0.03)   1.96 (±0.13)     0.89 (±0.05)
XOR        200   50    500    Bias   −0.08 (±0.02)   0.24 (±0.03)     0.08 (±0.02)
                              Var     0.26 (±0.02)   0.77 (±0.04)     0.4 (±0.02)
                              MSE     0.27 (±0.01)   0.83 (±0.04)     0.41 (±0.02)
AND        50    500   200    Bias   −0.23 (±0.04)   0.65 (±0.05)     0.21 (±0.04)
                              Var     1.15 (±0.05)   4.23 (±0.18)     2.05 (±0.09)
                              MSE     1.21 (±0.06)   4.64 (±0.21)     2.09 (±0.09)
AND        200   500   500    Bias   −0.04 (±0.04)   0.32 (±0.04)     0.14 (±0.03)
                              Var     0.55 (±0.07)   1.71 (±0.22)     0.85 (±0.11)
                              MSE     0.57 (±0.08)   1.82 (±0.24)     0.88 (±0.11)
Auto       314   7     1000   Bias   −0.11 (±0.02)   0.23 (±0.05)     0.06 (±0.03)
                              Var     0.13 (±0.04)   0.49 (±0.19)     0.27 (±0.1)
                              MSE     0.15 (±0.04)   0.58 (±0.24)     0.29 (±0.11)
Table 1: Simulation study. We evaluate the mean bias, variance, and MSE of different variance estimates $\hat{V}$ for random forests. Here, n is the number of training examples used, p is the number of features, and B is the number of trees grown; the numbers in parentheses are 95% confidence errors from sampling. The best methods for each evaluation metric are highlighted in bold. The data-generating functions are described in Appendix C.
experiments, including some random forest examples, and showed how they can be used to
gain valuable insights in realistic problems.
Acknowledgments
The authors are grateful for helpful suggestions from the action editor and three anonymous
referees. S.W. is supported by a B.C. and E.J. Eaves Stanford Graduate Fellowship.
and $N_{bi}^*$ indicates the number of times the ith observation appears in the bootstrap sample b. If $\hat\Delta_i$ is not defined because $N_{bi}^* = 0$ for either all or none of the b = 1, ..., B, then we just set $\hat\Delta_i = 0$.
Now $\hat{V}_J^B$ is the sum of squares of noisy quantities, and so $\hat{V}_J^B$ will be biased upwards. Specifically,

$$E_*\left[\hat{V}_J^B - \hat{V}_J^\infty\right] = \frac{n-1}{n}\sum_{i=1}^n \mathrm{Var}_*\left[\hat\Delta_i\right],$$

where $\hat{V}_J^\infty$ is the jackknife estimate computed with B = ∞ bootstrap replicates. For convenience, let

$$B_i = \left|\left\{b : N_{bi}^* = 0\right\}\right|,$$

and recall that

$$\mathrm{Var}_*\left[\hat\Delta_i\right] = E_*\left[\mathrm{Var}_*\left[\hat\Delta_i \,\middle|\, B_i\right]\right] + \mathrm{Var}_*\left[E_*\left[\hat\Delta_i \,\middle|\, B_i\right]\right].$$
and so

$$\mathrm{Var}_*\left[\hat\Delta_i\right] = E_*\left[\mathrm{Var}_*\left[\hat\Delta_i \,\middle|\, B_i\right]\right] + O\!\left(\Delta_i^2 / B\right).$$

Meanwhile, for $B_i \notin \{0, B\}$,

$$\mathrm{Var}_*\left[\hat\Delta_i \,\middle|\, B_i\right] = \frac{1}{B^2}\left(\left(\frac{B}{B_i} - 1\right)^2 B_i\, \tilde v_i^{(0)} + (B - B_i)\, \tilde v_i^{(+)}\right) = \frac{1}{B}\left(\frac{(B - B_i)^2}{B B_i}\, \tilde v_i^{(0)} + \frac{B - B_i}{B}\, \tilde v_i^{(+)}\right),$$

where

$$\tilde v_i^{(0)} = \mathrm{Var}_*\left[t_b^* \,\middle|\, N_{bi}^* = 0\right] \qquad \text{and} \qquad \tilde v_i^{(+)} = \mathrm{Var}_*\left[t_b^* \,\middle|\, N_{bi}^* \neq 0\right].$$
Thus,

$$\mathrm{Var}_*\left[\hat\Delta_i\right] = \frac{1}{B}\left(E_*\!\left[\mathbb{1}_i\, \frac{(B - B_i)^2}{B B_i}\right] \tilde v_i^{(0)} + E_*\!\left[\mathbb{1}_i\, \frac{B - B_i}{B}\right] \tilde v_i^{(+)}\right) + O\!\left(\Delta_i^2 / B\right),$$

where $\mathbb{1}_i = \mathbb{1}\left(\left\{B_i \notin \{0, B\}\right\}\right)$.
As n and B get large, $B_i$ converges in law to a Gaussian random variable,

$$\frac{B_i - B e^{-1}}{\sqrt{B}} \Rightarrow \mathcal{N}\!\left(0,\; e^{-1}\left(1 - e^{-1}\right)\right),$$

and the above expressions are uniformly integrable. We can verify that

$$E_*\!\left[\mathbb{1}_i\, \frac{(B - B_i)^2}{B B_i}\right] = e - 2 + e^{-1} + O\!\left(\frac{1}{B}\right), \qquad \text{and} \qquad E_*\!\left[\mathbb{1}_i\, \frac{B - B_i}{B}\right] = \frac{e - 1}{e} + O\!\left(\left(1 - e^{-1}\right)^B\right).$$
Finally, this lets us conclude that

$$E_*\left[\hat{V}_J^B - \hat{V}_J^\infty\right] = \frac{1}{B}\,\frac{n-1}{n}\sum_{i=1}^n \left(\frac{(e-1)^2}{e}\, \tilde v_i^{(0)} + \frac{e-1}{e}\, \tilde v_i^{(+)}\right) + O\!\left(\frac{1}{B} + \frac{n}{B^2}\right),$$

where the error term depends on $\tilde v_i^{(0)}$, $\tilde v_i^{(+)}$, and $\hat{V}_J^\infty = \frac{n-1}{n}\sum_{i=1}^n \Delta_i^2$.
P
We now address Monte Carlo variance. By the central limit theorem, ∆ ˆ i converges to a
Gaussian random variable as B gets large. Thus, the asymptotic Monte Carlo variance of
ˆ 2 is approximately 2 Var∗ [∆
∆ ˆ i ]2 + 4E∗ [∆
ˆ i ]2 Var∗ [∆
ˆ i ], and so
i
n 2
1 n−1 2X (e − 1)2
h i
(0) e−1 (+)
Var∗ VbJB ≈ 2 ṽi + ṽi
B n e e
i=1
n
(e − 1)2
1 n−1X 2 (0) e−1 (+)
+4 ∆i ṽi + ṽi .
B n e e
i=1
In practice, the terms $\tilde v_i^{(0)}$ and $\tilde v_i^{(+)}$ can be well approximated by $\hat v = \mathrm{Var}_*[t_b^*]$, namely the bootstrap estimate of variance for the base learner. (Note that $\tilde v_i^{(0)}$, $\tilde v_i^{(+)}$, and $\hat v$ can always be inspected on a random forest, so this assumption can be checked in applications.) This lets us considerably simplify our expressions for Monte Carlo bias and variance:

$$E_*\left[\hat{V}_J^B - \hat{V}_J^\infty\right] \approx \frac{n}{B}\,(e-1)\,\hat v, \qquad \text{and} \qquad \mathrm{Var}_*\left[\hat{V}_J^B\right] \approx 2\,\frac{n}{B^2}\,(e-1)^2\,\hat v^2 + 4\,\frac{1}{B}\,(e-1)\,\hat{V}_J^\infty\,\hat v.$$
We consider functionals of the form

$$T(G) = E_G\left[\tau(Y_1, \ldots, Y_n)\right], \qquad (26)$$

where the $Y_1, \ldots, Y_n$ are drawn independently from G. We call functionals T satisfying (26) averaging. Clearly, $\hat\theta^B$ can be expressed as an averaging functional applied to the empirical distribution $\hat F$ defined by the observations $Z_1, \ldots, Z_n$.

Suppose that we have an averaging functional T, a sample $Z_1, \ldots, Z_n$ forming an empirical distribution $\hat F$, and want to study the variance of $T(\hat F)$. The infinitesimal jackknife estimate for the variance of $T(\hat F)$ is given by

$$\hat V = \sum_{i=1}^n \left(\frac{1}{n}\,\frac{\partial}{\partial\varepsilon}\, T\!\left(\hat F_i(\varepsilon)\right)\right)^2,$$

where $\hat F_i(\varepsilon)$ is the discrete distribution that places weight $1/n + (n-1)\varepsilon/n$ at $Z_i$ and weight $1/n - \varepsilon/n$ at all the other $Z_j$.
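To make this definition concrete, here is a small numerical illustration (not from the paper; the toy sample z is made up) for the simplest averaging functional, the mean, where the infinitesimal jackknife can be checked against its closed form.

```r
# Illustrative numerical check of the infinitesimal jackknife definition above for the
# mean functional T(G) = E_G[mean(Y_1, ..., Y_n)], i.e., the mean of G.
z <- c(2.1, 0.4, 1.7, 3.0, 0.9)                  # toy sample (assumed, for illustration)
n <- length(z)
T_eps <- function(i, eps) {
  w <- rep(1 / n - eps / n, n)                   # weights of F_i(eps) on the other points
  w[i] <- 1 / n + (n - 1) * eps / n              # weight of F_i(eps) on Z_i
  sum(w * z)                                     # mean of the reweighted distribution
}
eps <- 1e-6
deriv <- sapply(seq_len(n), function(i) (T_eps(i, eps) - T_eps(i, 0)) / eps)
V_IJ <- sum((deriv / n)^2)                       # equals sum((z - mean(z))^2) / n^2
```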
We can transform samples from $\hat F$ into samples from $\hat F_i(\varepsilon)$ by the following method. Let $Z_1^*, \ldots, Z_n^*$ be a sample from $\hat F$. Go through the whole sample and, independently for each j, take $Z_j^*$ and with probability $\varepsilon$ replace it with $Z_i$. The sample can now be considered a sample from $\hat F_i(\varepsilon)$.

When $\varepsilon \to 0$, the probability of replacing two of the $Z_j^*$ with this procedure becomes negligible, and we can equivalently transform our sample into a sample from $\hat F_i(\varepsilon)$ by transforming a single random element from $\{Z_j^*\}$ into $Z_i$ with probability $n\varepsilon$. Without loss of generality this element is the first one, and so we conclude that

$$\lim_{\varepsilon \to 0} \frac{1}{\varepsilon}\left(E_{\hat F_i(\varepsilon)}\left[\tau(Z_1^*, \ldots, Z_n^*)\right] - E_{\hat F}\left[\tau(Z_1^*, \ldots, Z_n^*)\right]\right) = n\left(E_{\hat F}\left[\tau(Z_1^*, \ldots, Z_n^*) \,\middle|\, Z_1^* = Z_i\right] - E_{\hat F}\left[\tau(Z_1^*, \ldots, Z_n^*)\right]\right),$$
$$\frac{1}{n}\,\frac{\partial}{\partial\varepsilon}\, T\!\left(\hat F_i(\varepsilon)\right) = E_{\hat F}\left[T \,\middle|\, Z_1^* = Z_i\right] - E_{\hat F}\left[T\right],$$

and so

$$\hat V = \sum_{i=1}^n \left(E_{\hat F}\left[T \,\middle|\, Z_1^* = Z_i\right] - E_{\hat F}\left[T\right]\right)^2 \qquad (27)$$
$$\approx \sum_{i=1}^n \left(E_F\left[T \,\middle|\, Z_1^* = Z_i\right] - E_F\left[T\right]\right)^2, \qquad (28)$$
Figure 10: Underlying model for the bagged tree example from Figure 2.
where on the last line we only replaced the empirical approximation $\hat F$ with its true value F. In the case of bagging, this last expression is equivalent to (21).

A crucial step in the above argument is the plug-in approximation (28). If T is just a sum, then the error of (28) is within O(1/n); presumably, similar statements hold whenever T is sufficiently well-behaved. That being said, it is possible to construct counter-examples where (28) fails; a simple such example is when T counts the number of times $Z_1^*$ is matched in the rest of the training data. Establishing general conditions under which (28) holds is an interesting topic for further research.
This section provides a more detailed description of the experiments reported in this paper.
The Auto MPG data set, available from the UCI Machine Learning Repository (Bache
and Lichman, 2013), is a regression task with 7 features. After discarding examples with
missing entries, the data set had 392 rows, which we divided into a test set of size 78 and a training set of size 314. We estimated the variance of the random forest predictions using the $(\hat{V}_{J-U}^B + \hat{V}_{IJ-U}^B)/2$ estimator advocated in Section 5.2, with B = 10,000 bootstrap replicates.
The data for this simulation was drawn from a model $y_i = f(x_i) + \varepsilon_i$, where $x_i \sim U([0, 1])$, $\varepsilon_i \sim \mathcal{N}(0, 1/2^2)$, and f(x) is the step function shown in Figure 10. We modeled the data using 5-leaf regression trees generated using the R package tree (Venables and Ripley, 2002); for bagging, we used B = 10,000 bootstrap replicates. The reported data is compiled over 1,000 simulation runs with n = 500 data points each.
• XOR: $Y = 5 \cdot \left[\mathrm{XOR}(X_1 > 0.6,\, X_2 > 0.6) + \mathrm{XOR}(X_3 > 0.6,\, X_4 > 0.6)\right] + \varepsilon$ and p = 50.
• AND: $Y = 10 \cdot \mathrm{AND}(X_1 > 0.3,\, X_2 > 0.3,\, X_3 > 0.3,\, X_4 > 0.3) + \varepsilon$ and p = 500.
• Auto: This example is based on a parametric bootstrap built on the same data set as used in Figure 1. We first fit a random forest to the training set, and evaluated the MSE $\hat\sigma^2$ on the test set. We then generated new training sets by replacing the labels $Y_i$ from the original training set with $\hat{Y}_i + \hat\sigma\varepsilon$, where $\hat{Y}_i$ is the original random forest prediction at the ith training example and ε is fresh residual noise.
During the simulation, we first generated a random test set of size 50 (except for the auto example, where we just used the original test set of size 78). Then, while keeping the test set fixed, we generated 100 training sets and produced variance estimates $\hat{V}$ at each test point. Table 1 reports average performance over the test set.
References
Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013. URL http:
//archive.ics.uci.edu/ml.
Gérard Biau. Analysis of a random forests model. The Journal of Machine Learning Re-
search, 13(4):1063–1095, 2012.
Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other
averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033, 2008.
Peter Bühlmann and Bin Yu. Analyzing bagging. The Annals of Statistics, 30(4):927–961,
2002.
Andreas Buja and Werner Stuetzle. Observations on bagging. Statistica Sinica, 16(2):323,
2006.
Song Xi Chen and Peter Hall. Effects of bagging and bias correction on estimators defined
by estimating equations. Statistica Sinica, 13(1):97–110, 2003.
Jiangtao Duan. Bootstrap-Based Variance Estimators for a Bagging Predictor. PhD thesis,
North Carolina State University, 2011.
Bradley Efron. Estimation and accuracy after model selection. Journal of the American
Statistical Association, (just-accepted), 2013.
Bradley Efron and David Feldman. Compliance as an explanatory variable in clinical trials.
Journal of the American Statistical Association, 86(413):9–17, 1991.
Bradley Efron and Charles Stein. The jackknife estimate of variance. The Annals of Statis-
tics, pages 586–596, 1981.
Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Anal-
ysis, 38(4):367–378, 2002.
Jerome H Friedman and Peter Hall. On bagging and nonlinear estimation. Journal of
Statistical Planning and Inference, 137(3):669–683, 2007.
Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine
Learning, 63(1):3–42, 2006.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learn-
ing. New York: Springer, 2009.
Louis A Jaeckel. The Infinitesimal Jackknife. 1972.
Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News,
2(3):18–22, 2002. URL http://CRAN.R-project.org/doc/Rnews/.
Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the
American Statistical Association, 101(474):578–590, 2006.
Colin L Mallows. Some comments on Cp. Technometrics, 15(4):661–675, 1973.
Nicolai Meinshausen. Quantile regression forests. The Journal of Machine Learning Re-
search, 7:983–999, 2006.
Joseph Sexton and Petter Laake. Standard errors for bagged and random forest estimators.
Computational Statistics & Data Analysis, 53(3):801–811, 2009.
Marina Skurichina and Robert PW Duin. Bagging for linear classifiers. Pattern Recognition,
31(7):909–930, 1998.
Thomas A Stamey, John N Kabalin, John E McNeal, Iain M Johnstone, Fuad Freiha,
EA Redwine, and N Yang. Prostate specific antigen in the diagnosis and treatment of
adenocarcinoma of the prostate. II. radical prostatectomy treated patients. The Journal
of Urology, 141(5):1076–1083, 1989.
Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. Bias in ran-
dom forest variable importance measures: Illustrations, sources and a solution. BMC
Bioinformatics, 8(1):25, 2007.
William N Venables and Brian D Ripley. Modern Applied Statistics with S. Springer, New
York, fourth edition, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-
95457-0.