Journal of Statistical Software
Abstract
In this article, we introduce the BART R package which is an acronym for Bayesian ad-
ditive regression trees. BART is a Bayesian nonparametric, machine learning, ensemble
predictive modeling method for continuous, binary, categorical and time-to-event out-
comes. Furthermore, BART is a tree-based, black-box method which fits the outcome
to an arbitrary random function, f , of the covariates. The BART technique is relatively
computationally efficient as compared to its competitors, but large sample sizes can be
demanding. Therefore, the BART package includes efficient state-of-the-art implemen-
tations for continuous, binary, categorical and time-to-event outcomes that can take ad-
vantage of modern off-the-shelf hardware and software multi-threading technology. The
BART package is written in C++ for both programmer and execution efficiency. The
BART package takes advantage of multi-threading via forking as provided by the parallel
package and OpenMP when available and supported by the platform. The ensemble of
binary trees produced by a BART fit can be stored and re-used later via the R predict
function. In addition to being an R package, the installed BART routines can be called
directly from C++. The BART package provides the tools for your BART toolbox.
Keywords: binary trees, black-box, categorical, competing risks, continuous, ensemble pre-
dictive model, forking, multinomial, multi-threading, OpenMP, recurrent events, survival
analysis.
1. Introduction
Bayesian additive regression trees (BART) arose out of earlier research on Bayesian model
fitting of an outcome to a single tree (Chipman, George, and McCulloch 1998). In this era
from 1996 to 2001, the excellent predictive performance of ensemble models became apparent
(Breiman 1996; Krogh and Solich 1997; Freund and Schapire 1997; Breiman 2001; Friedman
2001; Baldi and Brunak 2001). Instead of making a single prediction from a complex model,
ensemble models make a single prediction which is the summary of the predictions from many
simple models. Generally, ensemble models have desirable properties, e.g., they do not suffer
from over-fitting (Kuhn and Johnson 2013). Like bagging (Breiman 1996), boosting (Freund
and Schapire 1997; Friedman 2001) and random forests (Breiman 2001), BART relies on an
ensemble of trees to predict the outcome; and, although there are similarities, there are also
differences between these approaches.
BART is a Bayesian nonparametric, sum of trees method for continuous, dichotomous, cat-
egorical and time-to-event outcomes. Furthermore, BART is a black-box, machine learn-
ing method which fits the outcome via an arbitrary random function, f , of the covariates.
So-called black-box models generate functions of the covariates which are so complex that
interpreting the internal details of the fitted model is generally abandoned in favor of assess-
ment via evaluations of the fitted function, f , at chosen values of the covariates. As shown
by Chipman, George, and McCulloch (2010), BART’s out-of-sample predictive performance
is generally equivalent to, or exceeds, that of alternatives like the lasso with L1 regularization
(Efron, Hastie, Johnstone, and Tibshirani 2004) or black-box models such as gradient boost-
ing (Freund and Schapire 1997; Friedman 2001), neural nets with one hidden layer (Venables
and Ripley 2002) and random forests (Breiman 2001). Over-fitting is the tendency to overly
fit a model to an in-sample training data set at the expense of poor predictive performance
for unseen out-of-sample data. Typically, BART does not over-fit to the training data due to
the regularization tree-branching penalty of the BART prior, i.e., generally, each tree has few
branches and plays a small part in the overall fit. So, the resulting fit from the ensemble of
trees as a whole is generally a good fit that does not over-fit. Essentially, BART is a Bayesian
nonlinear model with all the advantages of the Bayesian paradigm such as posterior infer-
ence including point and interval estimation. Conveniently, BART naturally scales to large
numbers of covariates and facilitates variable selection; it does not require the covariates to
be rescaled; neither does it require the covariate functional relationship, nor the interactions
considered, to be pre-specified.
In this article, we give an overview of data analysis with BART and the BART R package.
In Section 2, we describe the R functions provided by the BART package for analyzing
continuous outcomes with BART. In Section 3, we demonstrate the typical usage of BART
via the classic example of Boston housing values. In Section 4, we describe how BART can
be used to analyze binary and categorical outcomes. In Section 5, we describe how BART
can be used to analyze time-to-event outcomes with censoring including competing risks and
recurrent events. In Appendix Section A, we describe how to get and install the BART
package. In Appendix Section B, we describe the basis of BART on binary trees along with
the details of the BART prior. In Appendix Section C, we briefly describe the posterior
computations required to use BART. In Appendix Section D, we describe how to perform the
BART computations efficiently by resorting to parallel processing with multi-threading (N.B. by default, the Microsoft Windows operating system does not provide the multi-threading interfaces employed by the R environment; without them, the BART package is single-threaded on Windows, yet otherwise completely functional; see Appendix D for more details).
set.seed(99)
post <- wbart(x.train, y.train, x.test, ndpost = M)
post <- mc.wbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
seed = 99)
post <- gbart(x.train, y.train, x.test, ndpost = M)
post <- mc.gbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
seed = 99)
The value returned, post as shown above, is of type ‘wbart’, which is essentially a list containing named components. Of particular interest are post$yhat.train and post$yhat.test, described as follows.
• post$yhat.train is an M × N matrix of predictions,
      ŷ11 · · · ŷN1
       ..   ..   ..
      ŷ1M · · · ŷNM
  where ŷim = µ0 + fm(xi) is the m-th posterior draw.
• post$yhat.test is a similar matrix corresponding to x.test if provided.
• nkeeptreedraws: Number of tree ensemble draws to return for use with predict.
Members of the object returned (which is essentially a list) include varprob and varcount, which correspond to the variable selection probabilities and the observed counts in the ensemble of trees. When sparse = TRUE, varprob contains the random variable selection probabilities, sj; otherwise, it is the fixed constant sj = P⁻¹. Besides the posterior samples, the means over the posterior are also provided as varprob.mean and varcount.mean.
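For example, a minimal sketch of inspecting these components after the mc.gbart call shown above (the component names are those listed here; sparse = FALSE is the default, so varprob.mean is constant):

R> sort(post$varcount.mean, decreasing = TRUE)  # covariates used most often as branch rules
R> post$varprob.mean  # each equal to P^-1 since sparse = FALSE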
R> library("MASS")
R> x <- Boston[, c(6, 13)]
R> y <- Boston$medv
R> head(cbind(x, y))
rm lstat y
1 6.575 4.98 24.0
2 6.421 9.14 21.6
3 7.185 4.03 34.7
4 6.998 2.94 33.4
5 7.147 5.33 36.2
6 6.430 5.21 28.7
Figure 1: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
y = mdev, the average number of rooms, x1 = rm, and the percent of the population that is
lower status, x2 = lstat. Here, we show scatter plots of the data.
R> library("BART")
R> set.seed(99)
R> nd <- 200
R> burn <- 50
R> post <- wbart(x, y, nskip = burn, ndpost = nd)
MCMC
done 0 (out of 250)
done 100 (out of 250)
done 200 (out of 250)
time: 1s
check counts
trcnt,tecnt,temecnt,treedrawscnt: 200,0,0,200
R> names(post)
R> length(post$sigma)
[1] 250
R> length(post$yhat.train.mean)
[1] 506
R> dim(post$yhat.train)
Remember, the training data has n = 506 observations; we had burn = 50 burn-in draws discarded and nd = M = 200 draws kept. Let us look at a couple of the key list components.
• $sigma: Both the 50 burn-in draws and the 200 post burn-in draws are kept for σ (250 in total); burn-in draws are kept only for this parameter.
• $yhat.train: The m-th row and i-th column is fm (xi ) (the m-th kept MCMC draw
evaluated at the i-th training observation).
In Figure 2, you can see that BART burned in very quickly; just one initial draw looks a bit bigger than the rest. Apparently, subsequent variation is legitimate posterior variation. In a more difficult problem, you may see the σ draws initially declining as the MCMC searches for a good fit.
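The fitted-value matrix fitmat used below can be constructed along the following lines (a minimal sketch; lmf, an ordinary least squares fit to the same two covariates, is an assumption of this illustration):

R> lmf <- lm(y ~ ., data.frame(x, y))
R> fitmat <- cbind(y, post$yhat.train.mean, lmf$fitted.values)
R> colnames(fitmat) <- c("y", "BART", "Linear")
R> cor(fitmat)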
y BART Linear
y 1.0000000 0.9051200 0.7991005
BART 0.9051200 1.0000000 0.8978003
Linear 0.7991005 0.8978003 1.0000000
R> pairs(fitmat)
Figure 2: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
y = mdev, the average number of rooms, x1 = rm, and the percent of the population that is
lower status, x2 = lstat. With BART, we predict y = mdev from rm and lstat. Here, we show a trace plot of the error standard deviation, σ, which demonstrates that BART converges rather quickly, i.e., by 50 iterations or earlier.
In Figure 3, we present scatter plots between mdev, the BART fit and the multiple linear
regression. The BART fit is noticeably different from the linear fit.
There is substantial predictive uncertainty, but you can still be fairly certain that some houses should cost more than others; see Figure 4.
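Figure 4 can be produced along the following lines (a minimal sketch; ord is introduced here only for illustration):

R> ord <- order(post$yhat.train.mean)
R> boxplot(post$yhat.train[, ord], ylab = "post$yhat.train")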
Figure 3: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
y = mdev, the average number of rooms, x1 = rm, and the percent of the population that
is lower status, x2 = lstat. With BART, we predict y = mdev from rm and lstat. Here,
we show scatter plots comparing y = mdev, the BART fit (“BART”) and multiple linear
regression (“Linear”).
Figure 4: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
mdev, the average number of rooms, rm, and the percent of the population that is lower status,
lstat. With BART, we predict y = mdev from rm and lstat. Here, we show boxplots of the
posterior samples of predictions (on the y-axis) ordered by the average predicted home value
per tract (on the x-axis).
And now we can run wbart using the training data to learn and predict at x.test. First,
we’ll just pass x.test to the wbart call.
R> set.seed(99)
R> post1 <- wbart(x.train, y.train, x.test)
R> dim(post1$yhat.test)
R> length(post1$yhat.test.mean)
[1] 127
• $yhat.test: the m-th row and h-th column is fm (xh ) (the m-th kept MCMC draw
evaluated at the h-th testing observation).
Alternatively, we could run wbart saving all the MCMC results and then call predict.
R> set.seed(99)
R> post2 <- wbart(x.train, y.train)
R> yhat <- predict(post2, x.test)
R> dim(yhat)
R> set.seed(4)
R> post3 <- wbart(x.train, y.train, nskip = 1000, ndpost = 10000,
+ nkeeptrain = 0, nkeeptest = 0, nkeeptestmean = 0,
+ nkeeptreedraws = 200)
R> yhatthin <- predict(post3, x.test)
R> dim(post3$yhat.train)
[1] 0 379
R> dim(yhatthin)
Now, there are no kept draws of f (x) for training x, and we have 200 tree ensemble draws to
use with predict. Of course, if we keep 200 out of 10000, then every 50th draw is kept.
The default values are to keep all the draws (e.g., nkeeptrain = ndpost). Now, let us have
a look at the predictions.
In Figure 5, we present scatter plots between mdev, “yhat” and “yhatThin”. Recall, the
predictions labeled “yhat” are from a BART run with seed = 99 and all default values. The
predictions labeled “yhatThin” are thinned by 50 (after 1000 burnin discarded, 200 kept out
of 10000 draws) with seed = 4. It is very interesting how similar they are!
Figure 5: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at
50), mdev, the average number of rooms, rm, and the percent of the population that is lower
status, lstat. With BART, we predict y = mdev by rm and lstat. The predictions labeled
“yhat” are from a BART run with seed = 99 and all default values. The predictions labeled
“yhatThin” are thinned by 50 (after 1000 burnin discarded, 200 kept out of 10000 draws)
with seed = 4. It is very interesting how similar they are!
can be obtained in a similar fashion. Estimates can be derived via functions of the posterior samples such as means, quantiles, etc., e.g., f̂(xhS) = M⁻¹ ∑_{m=1}^{M} N⁻¹ ∑_{i=1}^{N} fm(xhS, xiC), where m indexes posterior samples. However, care must be taken in the interpretation of the
marginal effect as estimated by Friedman’s partial dependence function. If there are strong
relationships among the covariates, it may be unrealistic to assume that individual covariates
can be manipulated independently.
For example, suppose that we want to summarize the median home value, medv (variable
14 of the Boston data frame), by the percent of the population with lower status, lstat
(variable 13), while aggregating over the other twelve covariates in the Boston housing data.
In Figure 6, we demonstrate the marginal estimate and its 95% credible interval.
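A minimal sketch of this computation follows (the grid of lstat settings, here its observed deciles, the use of mc.gbart with 8 threads and the object names are illustrative assumptions; the package demo referenced below performs a similar calculation):

R> x <- Boston[, -14]                     # all thirteen covariates
R> y <- Boston$medv
R> L <- quantile(x$lstat, (1:9) / 10)     # grid of lstat settings
R> x.test <- do.call(rbind, lapply(L, function(v) {
+      x.tmp <- x
+      x.tmp$lstat <- v                   # set lstat to v for every tract
+      x.tmp }))
R> post <- mc.gbart(x, y, x.test, mc.cores = 8, seed = 99)
R> pd <- sapply(1:9, function(h)          # aggregate over the other covariates
+      apply(post$yhat.test[, (h - 1) * nrow(x) + 1:nrow(x)], 1, mean))
R> plot(L, apply(pd, 2, mean), type = "l", ylim = c(10, 50),
+      xlab = "lstat", ylab = "mdev")
R> lines(L, apply(pd, 2, quantile, probs = 0.025), lty = 2)
R> lines(L, apply(pd, 2, quantile, probs = 0.975), lty = 2)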
Figure 6: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
mdev, and the percent of the population that is lower status, lstat, along with eleven other
covariates. We summarize the marginal effect of lstat on mdev while aggregating over the
other covariates with Friedman’s partial dependence function. The marginal estimate and its
95% credible interval are shown.
Besides the marginal effect, we can define the conditional effect of x1 given x2 as [f(x1 + δ, x2) − f(x1, x2)]/δ. However, BART is not fitting simple linear functions. For example, suppose the data follow a sufficiently complex function such as f(x1, x2) = b1 x1 + b2 x1² + b3 x1 x2. Then the conditional effect that BART is likely to fit is approximately b1 + 2 b2 x1 + b2 δ + b3 x2. This function is not as easy to characterize as the marginal effect since it involves x1, x2 and δ. Nevertheless, these functions can be estimated by BART if these inputs are provided. But these functions have the same limitations as Friedman's partial dependence function and, perhaps, even more so.
See the conditional effect example at the end of demo("boston.R", package = "BART").
To extend BART to binary outcomes, we employ the technique of Albert and Chib (1993)
that assumes there is an unobserved latent, zi , where yi = I (zi > 0) and i = 1, . . . , n indexes
subjects. Given yi , we generate the truncated normal latents, zi ; these auxiliary latents are
efficiently sampled (Robert 1995) and recast as the outcome for a continuous BART with unit
variance as follows.
zi | yi, f ∼ N(µ0 + f(xi), 1) truncated to (−∞, 0) if yi = 0, or to (0, ∞) if yi = 1.
Centering the latent zi around the constant µ0 is analogous to quasi-centering the probabili-
ties, pi , at p0 = Φ(µ0 ), i.e., E [pi ] is approximately equal to p0 which is all that is necessary
for inference to be performed. The default value of µ0 is Φ−1 (ȳ) (which you can over-ride
with the binaryOffset argument).
The pbart (mc.pbart) and gbart (mc.gbart) functions are for serial (parallel) computation.
The outcome y.train is a vector containing zeros and ones. The covariates for training
(validation, if any) are x.train (x.test) which can be matrices or data frames containing
factors; in the display below, we assume matrices for simplicity. Notation: M for the number
of posterior samples, B for the number of threads (generally, B = 1 for Windows), N for the
number of observations in the training set, and Q for the number of observations in the test
set.
set.seed(99)
post <- pbart(x.train, y.train, x.test, ndpost = M)
post <- mc.pbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
seed = 99)
post <- gbart(x.train, y.train, x.test, type = "pbart", ndpost = M)
post <- mc.gbart(x.train, y.train, x.test, type = "pbart", ndpost = M,
seed = 99)
N.B. for pbart, the thinning argument keepevery defaults to 1, while for gbart with type = "pbart", keepevery defaults to 10.
The data inputs, as shown above, are as follows.
• x.train is a matrix or a data frame of covariates for training represented as
      x1
      x2
      ...
      xN
  where the xi are row vectors.
The return value, post as shown above, is of type ‘pbart’, which is essentially a list of named items; of particular interest are post$prob.train and post$prob.test. As with a continuous outcome, the columns of post$yhat.train and post$yhat.test represent different covariate settings and the rows the M draws from the posterior. However, post$prob.train and post$prob.test (when requested) are generally of more interest, as are post$prob.train.mean and post$prob.test.mean, which are the means of the posterior sample columns (not shown).
Often it is impractical to provide x.test in the call to pbart, either due to the number of predictions considered or because all the settings to evaluate are simply not known at that time. To allow for this common problem, the BART package returns the trees encoded in an ASCII string, treedraws$trees, and provides a predict function to generate any predictions needed. Note that if you need to perform the prediction in some later R instance, then you can save the ‘pbart’ object returned and reload it when needed, e.g., save with saveRDS(post, "post.rds") and reload with post <- readRDS("post.rds").
The returned value, pred as shown above, is of type ‘pbart’ that is essentially a list with the
following named components.
Figure 7: NHANES, BMI and the probability of chronic pain: the left panel for lower-back
pain and the right panel for neck pain. The unweighted Friedman’s partial dependence rela-
tionship between chronic pain, BMI and gender are displayed as ascertained from NHANES
data: males (females) are represented by blue (red) lines with the corresponding 95% credible
intervals (dashed lines). We want to explore the hypothesis that obesity is a risk factor for
chronic lower-back pain (which includes buttock pain in this definition). A corollary to this
hypothesis is that obesity is not considered to be a risk factor for chronic neck pain. Although
there is a generous amount of uncertainty, it does not appear that the probability of chronic
lower-back pain increases with BMI for either gender. Conversely, chronic neck pain does
appear to be rising, yet again, the intervals are wide. In both cases, these findings are not
anticipated.
The BART package provides two examples for the relationship between chronic pain and BMI:
demo("nhanes.pbart1", package = "BART"), probabilities; and demo("nhanes.pbart2",
package = "BART"), differences in probabilities. In Figure 7, the left panel for lower-back
pain and the right panel for neck pain, the unweighted relationship between chronic pain,
BMI and gender are displayed: males (females) are represented by blue (red) solid lines
with corresponding 95% credible intervals in dashed lines. Although there is a generous
amount of uncertainty, it does not appear that the probability of chronic lower-back pain
Figure 8: NHANES, BMI and the probability of chronic pain for females only: the left panel
for lower-back pain and the right panel for neck pain. The unweighted Friedman’s partial
dependence relationship between chronic pain and BMI are displayed as ascertained from
NHANES data for females only: lower-back (blue) and neck pain (red) are presented with
the corresponding 95% credible intervals (dashed lines). The difference in probability of
chronic pain from a baseline BMI of 25 (which is the upper limit of normal) is presented,
i.e., p(x) − p(25). We want to explore the hypothesis that obesity is a risk factor for chronic
lower-back pain (which includes buttock pain in this definition). A corollary to this hypothesis
is that obesity is not considered to be a risk factor for chronic neck pain. Although there is a
generous amount of uncertainty, it does not appear that the probability of chronic lower-back
pain increases with BMI. Conversely, chronic neck pain does appear to be rising, yet again,
the intervals are wide. In both cases, these findings are not anticipated.
increases with BMI for either gender. Conversely, chronic neck pain does appear to be rising,
yet again, the intervals are wide. In both cases, these findings are not anticipated given
the original hypotheses. Based on survey weights (not shown), the results are basically the
same. In Figure 8, the unweighted relationship for females between BMI and the difference in probability of chronic pain from a baseline BMI of 25 (which is the upper limit of normal) is shown with corresponding 95% credible intervals in dashed lines: the left panel for lower-back pain (blue solid lines) and the right panel for neck pain (red solid lines). Again, we have roughly
the same impression, i.e., there is no increase of lower-back chronic pain with BMI and it is
possibly dropping while neck pain might be increasing, but the intervals are wide for both.
The results are basically the same for males (not shown).
pi1 = P [yi1 = 1]
pi2 = P [yi2 = 1 | yi1 = 0]
pi3 = P [yi3 = 1 | yi1 = yi2 = 0]
..
.
pi,K−1 = P [yi,K−1 = 1 | yi1 = · · · = yi,K−2 = 0]
piK = P [yi,K−1 = 0 | yi1 = · · · = yi,K−2 = 0]
Notice that piK = 1 − pi,K−1 so we can specify the K conditional probabilities via K − 1
parameters. Furthermore, these conditional probabilities are, by construction, defined for
subsets of subjects: let S1 = {1, . . . , N } and Sj = {i : yi1 = · · · = yi,j−1 = 0} where j =
2, . . . , K − 1. Now, the unconditional probability of these outcome indicators, πij , can be
defined in terms of the conditional probabilities and their complements, qij = 1 − pij , for all
subjects.
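For example, the unconditional probabilities follow directly from the sequential construction above: πi1 = pi1, πij = pij ∏_{l=1}^{j−1} qil for j = 2, . . . , K − 1, and πiK = ∏_{l=1}^{K−1} qil.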
This approach is provided by the BART package as the mbart function. The input for mbart is
essentially identical to gbart, but the output is slightly different. For example, due to the way
the model is estimated, the prediction for x.train is not available; therefore, to request it set
the argument x.test = x.train. By default, probit BART is employed for computational
efficiency, but logit BART can be specified with the argument type = "lbart". Notation:
M for the number of posterior samples, B for the number of threads (generally, B = 1 for
Windows), N for the number of observations in the training set, and Q for the number of
observations in the test set.
set.seed(99)
post <- mbart(x.train, y.train, x.test, ndpost = M)
post <- mc.mbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
seed = 99)
The returned value, post as shown above, is of type ‘mbart’ that is essentially a list with
named components, particularly, post$prob.test.
The columns of post$prob.test represent different covariate settings crossed with the K
categories. The predict function for objects of type ‘mbart’ is analogous.
P[yi = j] = πij = exp(µj + fj(xi)) / ∑_{j′=1}^{K} exp(µj′ + fj′(xi)),  where fj ∼ BART a priori, j = 1, . . . , K.
Suppose for the moment, the centering parameters, µj , are defined as in logit BART.
P[yi = c] = exp fc(xi) / (exp fc(xi) + ∑_{k≠c} exp fk(xi))
          = exp fc(xi) / (exp fc(xi) + exp S)   where S = log ∑_{k≠c} exp fk(xi)
          = exp(−S) exp fc(xi) / (exp(−S) exp fc(xi) + 1)
          = exp(fc(xi) − S) / (exp(fc(xi) − S) + 1).

Similarly, for j ≠ c,

P[yi = j] = exp fj(xi) / (exp fc(xi) + ∑_{k≠c} exp fk(xi))
          ∝ 1 / (exp fc(xi) + exp S)
          ∝ 1 / (exp(fc(xi) − S) + 1).
Thus, the conditional inference for fc is equivalent to that for the binary indicator I(y = c). Therefore, mbart2 computes a full series of K BART functions for binary indicators.
The mbart2 function defaults to type = "lbart", i.e., logistic latents are used to compute
the fj ’s which fits nicely with the logit development of this approach. However, the logistic
latent fitting method can be computationally demanding. Therefore, normal latents can be
specified by type = "pbart". This latter setting would appear to contradict the development
of this approach; but notice that πij is still a probability in this case and, in our experience,
the results produced are often reasonable.
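A minimal sketch of the corresponding calls follows (the argument names mirror the mbart and gbart conventions shown earlier and are assumptions of this illustration):

set.seed(99)
post <- mbart2(x.train, y.train, x.test, ndpost = M)
post <- mc.mbart2(x.train, y.train, x.test, type = "pbart", ndpost = M,
    mc.cores = B, seed = 99)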
Figure 9: In 1985, American alligators were harvested by hunters in peninsular Florida from
four lakes. Lake, length and sex were recorded for each alligator. The stomach contents of
219 alligators were classified into five categories based on the primary food preference: bird,
fish, invertebrate, reptile and other. The length of alligators was dichotomized into small,
≤2.3m, vs. large, >2.3m. We estimate the probability of each food preference category for
the marginal effect of size by resorting to Friedman’s partial dependence function (Friedman
2001). The 95% credible intervals are wide, but it appears that large alligators are more likely
to rely on a diet of fish while small alligators are more likely to rely on invertebrates.
and Trafford (Collier County). Lake, length and sex were recorded for each alligator. Stom-
achs from a sample of alligators 1.09:3.89m long were frozen prior to analysis. After thawing,
stomach contents were removed and separated and food items were identified and tallied.
Volumes were determined by water displacement. The stomach contents of 219 alligators were classified into five categories of primary food preference: bird, fish (the most common primary food choice), invertebrate (snails, insects, crayfish, etc.), reptile (turtles, alligators), and other (amphibians, plants, household pets, stones, and other debris). The length of alligators was dichotomized into small, ≤ 2.3m, vs. large, > 2.3m. We estimate the probabil-
ity of each food preference category for the marginal effect of size by resorting to Friedman’s
partial dependence function (Friedman 2001). We have supplied Figure 9 which summarizes
the BART results generated by the example alligator.R: you can find this demo with the
command demo("alligator", package = "BART"). The mbart function was used since the
number of categories is small. The 95% credible intervals are wide, but it appears that large
alligators are more likely to rely on a diet of fish while small alligators are more likely to rely
on invertebrates. Although the true probabilities are obviously unknown, we compared mbart
to an analysis by a single hidden-layer/feed-forward Neural Network via the nnet R package
(Ripley 2007; Venables and Ripley 2002) and the results were essentially identical (see the
demo for details).
i sin(tw). This leads us to an estimator of the asymptotic variance which is σ̂θ̂2 = γ̂ 2 (0). We
divide our chain into two segments, A and B, as follows: m ∈ A = {1, . . . , MA } where MA =
aM ; and m ∈ B = {M − MB + 1, . . . , M } where MB = bM . Note that a + b < 1. Geweke
suggests a = 0.1, b = 0.5 and recommends the following normal test for convergence.
σ̂²_θ̂_A = γ̂²_{m∈A}(0),   σ̂²_θ̂_B = γ̂²_{m∈B}(0)

Z_AB = √M (θ̂_A − θ̂_B) / √(a⁻¹ σ̂²_θ̂_A + b⁻¹ σ̂²_θ̂_B) ∼ N(0, 1)
In our BART package, we supply R functions adapted from the coda R package (Plummer,
Best, Cowles, and Vines 2006) to perform Geweke diagnostics: spectrum0ar and gewekediag.
But how do we apply Geweke's diagnostic to BART? We can check convergence for any estimator of the form θ = h(f(x)), but often setting h to the identity function will suffice, i.e., θ = f(x). However, BART being a Bayesian nonparametric technique means that we
have many potential estimators to check, i.e., essentially one estimator for every possible
choice of x.
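For example, a minimal sketch of checking θ = f(xi) for every training subject follows (this assumes gewekediag accepts a matrix of posterior draws, samples in rows and settings in columns, and returns one Z_AB statistic per column, following the coda convention):

R> geweke <- gewekediag(post$yhat.train)
R> plot(geweke$z, xlab = "i", ylab = "Geweke Z_AB")
R> abline(h = c(-1.96, 1.96), lty = 2)  # approximate 95% limits of the N(0, 1) reference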
We have supplied Figures 10, 11 and 12 generated by the example geweke.pbart2.R.
Figure 10: Geweke convergence diagnostics for probit BART: N = 200. In the upper left
quadrant, we have plotted Friedman’s partial dependence function for f (xi4 ) vs. xi4 for 10
values of xi4 . This is a check that can’t be performed for real data, but it is informative in
this case. Notice that f (xi4 ) vs. xi4 is mainly directly proportional expected. In the upper
right quadrant, we plot the auto-correlations of f (xi ) for 10 randomly selected xi where i
indexes subjects. Notice that there is very little auto-correlation. In the lower left quadrant,
we display the corresponding trace plots for these same settings. The traces demonstrate
that samples of f (xi ) appear to adequately traverse the sample space. In the lower right
quadrant, we plot the Geweke ZAB statistics for each subject i. Notice that the ZAB exceed
the 95% limits only a handful of times. Based on this figure, we conclude that the chains
have converged.
The data are simulated by Friedman’s five-dimensional test function (Friedman 1991) where
50 covariates are generated as xij ∼ U(0, 1), but only the first 5 covariates have an impact on the outcome.
Figure 11: Geweke convergence diagnostics for probit BART: N = 1000. In the upper left
quadrant, we have plotted Friedman’s partial dependence function for f (xi4 ) vs. xi4 for 10
values of xi4 . This is a check that can’t be performed for real data, but it is informative
in this case. Notice that f (xi4 ) vs. xi4 is directly proportional as expected. In the upper
right quadrant, we plot the auto-correlations of f (xi ) for 10 randomly selected xi where i
indexes subjects. Notice that there is very little auto-correlation. In the lower left quadrant,
we display the corresponding trace plots for these same settings. The traces demonstrate
that samples of f (xi ) appear to adequately traverse the sample space. In the lower right
quadrant, we plot the Geweke ZAB statistics for each subject i. Notice that there appear
to be a considerable number exceeding the 95% limits. Based on this figure, we conclude
that convergence is questionable. We would suggest that more thinning be employed via the
keepevery argument to pbart; perhaps, keepevery = 50.
The convergence for each of these data sets is graphically displayed in Figures 10, 11 and 12
Figure 12: Geweke convergence diagnostics for probit BART: N = 5000. In the upper left
quadrant, we have plotted Friedman’s partial dependence function for f (xi4 ) vs. xi4 for 10
values of xi4 . This is a check that can’t be performed for real data, but it is informative in
this case. Notice that f (xi4 ) vs. xi4 is directly proportional as expected. In the upper right
quadrant, we plot the auto-correlations of f (xi ) for 10 randomly selected xi where i indexes
subjects. Notice that there is some auto-correlation. In the lower left quadrant, we display
the corresponding trace plots for these same settings. The traces demonstrate that samples
of f (xi ) appear to traverse the sample space, but there are some slower oscillations. In the
lower right quadrant, we plot the Geweke ZAB statistics for each subject i. Notice that there
appear to be far too many exceeding the 95% limits. Based on these figures, we conclude
that convergence has not been attained. We would suggest that more thinning be employed
via the keepevery argument to pbart; perhaps, keepevery = 250.
where each figure is broken into four quadrants. In the upper left quadrant, we have plotted
Friedman’s partial dependence function for f (xi4 ) vs. xi4 for 10 values of xi4 . This is a check
that can’t be performed for real data, but it is informative in this case. Notice that f (xi4 ) vs.
xi4 is directly proportional in each figure as expected. In the upper right quadrant, we plot
the auto-correlations of f (xi ) for 10 randomly selected xi where i indexes subjects. Notice
that there is very little auto-correlation for N = 200, 1000, but a more notable amount for
N = 5000. In the lower left quadrant, we display the corresponding trace plots for these
same settings. The traces demonstrate that samples of f (xi ) appear to adequately traverse
the sample space for N = 200, 1000, but less notably for N = 5000. In the lower right
quadrant, we plot the Geweke ZAB statistics for each subject i. Notice that for N = 200,
the ZAB exceed the 95% limits only a handful of times. Although there are five times more comparisons, N = 1000 has seemingly more than five times as many values exceeding the
95% limits. And, for N = 5000, there are dramatically more values exceeding the 95% limits.
Based on these figures, we conclude that the chains have converged for N = 200; for N = 1000,
convergence is questionable; and, for N = 5000, convergence has not been attained. We would
suggest that more thinning be employed for N = 1000, 5000 via the keepevery argument to
pbart; perhaps, keepevery = 50 for N = 1000 and keepevery = 250 for N = 5000.
Typical settings are b = 1 and ρ = P (the defaults) which you can over-ride with the b
and rho arguments respectively. The value a = 0.5 (the default) is a sparse setting whereas
an alternative setting a = 1 is not sparse; you can specify this parameter with argument
a. If additional sparsity is desired, then you can set the argument rho to a value smaller
than P : for more details, see Appendix B. Furthermore, Linero discusses two assumptions:
Assumption 2.1 and Assumption 2.2 (see Linero (2018) for more details). Basically, Assump-
tion 2.2 (2.1) is more (less) friendly to binary/ordinal covariates and is (is not) the default
corresponding to augment = FALSE (augment = TRUE).
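For example, a minimal sketch of turning on the sparse prior and inspecting the resulting variable selection probabilities (the cutoff P⁻¹ corresponds to the chance-association line in Figure 13 below):

R> set.seed(99)
R> post <- pbart(x.train, y.train, sparse = TRUE)
R> P <- ncol(x.train)
R> which(post$varprob.mean > 1 / P)  # covariates selected more often than chance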
Let us return to the simulated probit BART example explored above (in the BART package):
demo("sparse.pbart", package = "BART"). For sample sizes of N = 200, 1000, 5000, there
are P = 100 covariates, but only the first 5 are active. In Figure 13, the 5 (95) active (inactive) covariates are red (black), and circles (dots) denote selection probabilities > (≤) P⁻¹, the chance association
represented by a black line. For N = 200, all five active variables are identified, but notice
that there are 20 false positives. For N = 1000, all five active covariates are identified, but
notice that there are still 14 false positives. For N = 5000, all five active covariates are
identified and notice that there is only one false positive. We are often interested in the
inter-relationship between covariates within our model. We can assess these relationships by
inspecting the binary trees. For example, we can ascertain how often x1 is chosen as a branch
decision rule leading to a branch decision rule with x2 further up the tree or vice versa. In
this case, we call x1 and x2 a concordant pair, denoted by x1 ↔ x2, which is a symmetric
relationship, i.e., x1 ↔ x2 implies x2 ↔ x1 . If Bh is the number of branches in tree Th , then
Figure 13: Probit BART and variable selection example. For sample sizes of N =
200, 1000, 5000, there are P = 100 covariates, but only the first 5 are active. The 5 (95)
active (inactive) covariates are red (black) and circles (dots) are > (≤) P −1 which is chance
association represented by a black line. For N = 200, all five active variables are identified,
but notice that there are 20 false positives. For N = 1000, all five active covariates are iden-
tified, but notice that there are still 14 false positives. For N = 5000, all five active covariates
are identified and notice that there is only one false positive.
the concordant pair probability is: κij = P [xi ↔ xj ∈ Th | Bh > 1] for i = 1, . . . , P − 1 and
j = i + 1, . . . , P . See an example of calculating these probabilities in demo("trees.pbart",
package = "BART").
where i indexes subjects, i = 1, . . . , N; and Φ(·) is the standard normal cumulative distribution function. This formulation creates the likelihood [y | f] = ∏_{i=1}^{N} ∏_{j=1}^{ni} pij^{yij} (1 − pij)^{1−yij}.
If the event indicators, yij , have already been computed, then you can specify them with the
y.train argument. However, it is likely that the indicators would need to be constructed,
so for convenience, you can specify (si , δi ) by the arguments times and delta respectively.
In either case, the default value of µ0 is Φ−1 (ȳ) (which you can over-ride with the offset
argument). For computational efficiency, probit (Albert and Chib 1993) is the default, but
logit (Holmes and Held 2006; Gramacy and Polson 2012) can be specified as an option via
type = "lbart".
Based on the posterior samples, we construct quantities of interest with BART for survival
analysis. In discrete-time survival analysis, the instantaneous hazard from continuous-time
survival is essentially replaced with the probability of an event in an interval, i.e., p(t(j), x) = Φ(µ0 + f(t(j), x)). Now, the survival function is constructed as follows: S(t(j) | x) = Pr(T > t(j) | x) = ∏_{l=1}^{j} (1 − p(t(l), x)).
Survival data pairs (s, δ) are converted to indicators by the helper function surv.pre.bart
which is called automatically by surv.bart if y.train is not provided. surv.pre.bart
returns a list which contains y.train for the indicators; tx.train for the covariates corre-
sponding to y.train for training f (t, x) (which includes time in the first column, and the rest
of the covariates afterward, if any, i.e., rows of [t, x], hence the name tx.train to distinguish
it from the original x.train); tx.test for the covariates to predict f (t, x) rather than to
train; times which is the grid of ordered distinct time points; and K which is the length of
times. Here is a very simple example of a data set with three observations and no covariates
re-formatted for display (no covariates is an interesting special case but we will discuss the
more common case with covariates further below).
Here is a diagram of the input and output for the surv.pre.bart function. pre is a list that
is generated to contain the matrix pre$tx.train and the vector pre$y.train.
tx.train          y.train
t(1)    x1        y11 = 0
  ..    ..          ..
t(n1)   x1        y1n1 = δ1
  ..    ..          ..
t(1)    xN        yN1 = 0
  ..    ..          ..
t(nN)   xN        yNnN = δN
For pre$tx.test, ni is replaced by K which is very helpful so that each subject contributes
an equal number of settings for programmatic convenience and non-informative estimation,
i.e., if high-risk subjects with earlier events did not appear beyond their event, then estimates
of survival for latter times would be biased upward. For other outcomes besides time-to-event,
we provide two matrices of covariates, x.train and x.test, where x.train is for training
and x.test is for validation. However, due to the variable ni for time-to-event outcomes, we
generally provide two arguments as follows: x.train, x.test = x.train, where the former matrix will be expanded by surv.pre.bart to ∑_{i=1}^{N} ni rows for training f(t, x) while the latter matrix will be expanded to N × K rows for f(t, x) estimation only. If you still need to
perform validation, then you can make a separate call to the predict function.
N.B. the argument ndpost = M is the length of the chain to be returned and the argument keepevery is used for thinning, i.e., M draws are returned where keepevery draws are culled in between each returned value. For BART with time-to-event outcomes, which is based on gbart,
the default is keepevery = 10 since the grid of time points creates data set observations
of order N × K which have a tendency towards higher auto-correlation, therefore, making
thinning more necessary. To avoid unnecessarily enlarged data sets, it is often prudent to
coarsen the time axis appropriately. Although this might seem drastic, times are often col-
lected orders of magnitude more precisely than necessary for the problem under study. For
example, cancer registries often collect survival times in days while time in months or quarters
would suffice for many typical applications. You can coarsen automatically by supplying the
optional K argument to coarsen the times to a grid of time quantiles: 1/K, 2/K, . . . , K/K (not
to be confused with the k argument which is a prior parameter for the distribution of the leaf
terminal values).
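For example, a minimal sketch of a call that coarsens the time axis (K = 50 quantiles is an arbitrary illustrative choice):

set.seed(99)
post <- surv.bart(x.train, times = times, delta = delta, K = 50, ndpost = M)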
Here is a diagram of the input and output for the surv.bart function for serial computation
and mc.surv.bart for parallel computation.
Serial call:
set.seed(99)
post <- surv.bart(x.train, times = times, delta = delta,
x.test = x.train, ndpost = M)
Parallel call:
post <- mc.surv.bart(x.train, times = times, delta = delta,
    x.test = x.train, ndpost = M, mc.cores = B, seed = 99)
The returned value, post as shown above, is of type ‘survbart’ that is essentially a list with
named components, particularly, post$surv.test.
Here is a diagram of the input and output for the predict.survbart function.
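A minimal sketch of such a call (assuming post is the ‘survbart’ object fit above and that pre was created by surv.pre.bart so that pre$tx.test holds the settings at which to predict):

pre <- surv.pre.bart(times = times, delta = delta, x.train = x.train)
pred <- predict(post, pre$tx.test)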
The returned value, pred as shown above, is an object of type ‘survbart’ that is essentially
a list of named components, particularly, pred$surv.test.
For an overview of Friedman’s partial dependence function (including the notation adopted
in this article and its meaning), please see Section 3.8 which discusses continuous out-
comes. For survival analysis, we use Friedman’s partial dependence function (Friedman
2001) with BART to summarize the marginal effect due to a subset of the covariates set-
tings which, naturally, includes time, (t(j) , xhS ). For survival analysis, the f function is
often not directly of interest; rather, the survival function is more readily interpretable:
S(t(j), xhS) = N⁻¹ ∑_{i=1}^{N} S(t(j), xhS, xiC).
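A minimal sketch of this calculation for the marginal survival function aggregated over all N subjects follows (it assumes post was fit with x.test = x.train as above, that post$surv.test is an M × (N × K) matrix of posterior draws of S(t(j), xi) with the K time points nested within each subject, and that post$times and post$K hold the grid of event times and its length):

K <- post$K
N <- ncol(post$surv.test) / K
S <- matrix(nrow = nrow(post$surv.test), ncol = K)
for (j in 1:K)  # aggregate over subjects at each time point
    S[, j] <- apply(post$surv.test[, seq(j, by = K, length.out = N)], 1, mean)
plot(post$times, apply(S, 2, mean), type = "s", ylim = 0:1, xlab = "t", ylab = "S(t, x)")
lines(post$times, apply(S, 2, quantile, probs = 0.025), type = "s", lty = 2)
lines(post$times, apply(S, 2, quantile, probs = 0.975), type = "s", lty = 2)

A curve for a chosen covariate subset, such as sex in Figure 14, is obtained the same way after restricting the columns to the corresponding settings.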
Figure 14: Advanced lung cancer example: Friedman’s partial dependence function with 95%
credible intervals: males (blue) vs. females (red). A cohort of advanced lung cancer patients
was recruited from the North Central Cancer Treatment Group. For survival time, these
patients were followed for nearly 3 years or until lost to follow-up.
each based on a different subject profile), then the concordance probability is defined as
κt1 ,t2 = P [t1 < t2 ]. A simple analytic example with the Exponential distribution is as follows.
ti | λi ∼ind Exp(λi) where i ∈ {1, 2}

P[t1 < t2 | λ1, λ2] = ∫_0^∞ ∫_0^{t2} λ2 e^{−λ2 t2} λ1 e^{−λ1 t1} dt1 dt2 = λ1 / (λ1 + λ2)

1 − P[t1 > t2 | λ1, λ2] = 1 − λ2 / (λ1 + λ2) = λ1 / (λ1 + λ2) = P[t1 < t2 | λ1, λ2]
notational convenience).
P[s1 < s2] = P[s1 = t(1), s2 > t(1)]
             + P[s1 = t(2), s2 > t(2) | s1 > t(1), s2 > t(1)] P[s1 > t(1), s2 > t(1)] + . . .
           = ∑_{j=1}^{K} P[s1 = t(j), s2 > t(j) | s1 > t(j−1), s2 > t(j−1)] P[s1 > t(j−1), s2 > t(j−1)]
           = ∑_{j=1}^{K} p1j q2j S1(t(j−1)) S2(t(j−1))

Similarly,

1 − P[s1 > s2] = 1 − ∑_{j=1}^{K} (1 − p1j)(1 − q2j) S1(t(j−1)) S2(t(j−1))
               = 1 − ∑_{j=1}^{K} (1 − p1j − q2j + p1j q2j) S1(t(j−1)) S2(t(j−1))
               = 1 − ∑_{j=1}^{K} p1j q2j S1(t(j−1)) S2(t(j−1)) − ∑_{j=1}^{K} (q1j − q2j) S1(t(j−1)) S2(t(j−1))
However, note that these probabilities are not symmetric in this form. Yet, we can arrive at
symmetry as follows.
κs1,s2 = 0.5 (P[s1 < s2] + 1 − P[s1 > s2])
       = 0.5 (1 − ∑_{j=1}^{K} (q1j − q2j) S1(t(j−1)) S2(t(j−1)))
notation slightly: (si , δi ) where δi = 1 for kind 1 events, δi = 2 for kind 2 events, or δi = 0 for
censoring times. We create a single grid of time points for the ordered distinct times based on
either kind of event or censoring: 0 = t(0) < t(1) < · · · < t(K) < ∞. We model the probability
for an event of kind 1, p1 (t(j) , xi ), and an event of kind 2 conditioned on subject i being
alive at time t(j) , p2 (t(j) , xi ). Now, we create event indicators by melding absorbing events
survival analysis with mutually exclusive Multinomial categories where i indexes subjects:
i = 1, . . . , N .
y1ij = I(δi = 1) I(j = ni) where j = 1, . . . , ni
y1ij | p1ij ∼ B(p1ij)
p1ij = Φ(µ1 + f1(t(j), xi)) where f1 ∼ BART a priori
y2ij = I(δi = 2) I(j = ni) where j = 1, . . . , ni − y1ini
y2ij | p2ij ∼ B(p2ij)
p2ij = Φ(µ2 + f2(t(j), xi)) where f2 ∼ BART a priori
The likelihood is: [y | f1, f2] = ∏_{i=1}^{N} ∏_{j=1}^{ni} p1ij^{y1ij} (1 − p1ij)^{1−y1ij} ∏_{j′=1}^{ni − y1ini} p2ij′^{y2ij′} (1 − p2ij′)^{1−y2ij′}.
Now, we can estimate the survival function and the cumulative incidence functions as follows.
S(t, xi) = 1 − F(t, xi) = ∏_{j=1}^{k} (1 − p1ij)(1 − p2ij)  where k = arg max_j {t(j) ≤ t}

F1(t, xi) = ∫_0^t S(u−, xi) λ1(u, xi) du = ∑_{j=1}^{k} S(t(j−1), xi) p1ij

F2(t, xi) = ∫_0^t S(u−, xi) λ2(u, xi) du = ∑_{j=1}^{k} S(t(j−1), xi) (1 − p1ij) p2ij
The returned object of type ‘criskbart’ from crisk.bart or mc.crisk.bart provides the cu-
mulative incidence functions and survival corresponding to x.test as follows: F1 is cif.test,
F2 is cif.test2 and S is surv.test.
Figure 15: Liver transplant competing risks for type O patients estimated by BART and
Aalen-Johansen. This data is from the Mayo Clinic liver transplant waiting list from 1990-
1999. During the study period, the liver transplant organ allocation policy was flawed. Blood
type is an important matching factor to avoid organ rejection. Donor livers from subjects
with blood type O can be used by patients with all blood types; whereas a donor liver from
the other types will only be transplanted to a matching recipient. Therefore, type O subjects
on the waiting list were at a disadvantage since the pool of competitors was larger.
The likelihood is: [y, u | fy, fu] = ∏_{i=1}^{N} ∏_{j=1}^{ni} pij^{yij} (1 − pij)^{1−yij} ∏_{i′: δi′ ≠ 0} πi′^{ui′} (1 − πi′)^{1−ui′}. Now,
we can estimate the survival function and the cumulative incidence functions similar to
the first approach. The returned object is of type ‘crisk2bart’ from crisk2.bart or
mc.crisk2.bart that provides the cumulative incidence functions and survival corresponding
to x.test as follows: F1 is cif.test, F2 is cif.test2 and S is surv.test.
type O subjects on the waiting list were at a disadvantage since the pool of competitors
was larger for type O donor livers. This data is of historical interest and provides a useful
example of competing risks, but it has little relevance to liver transplants today. Current liver
transplant policies have evolved and now depend on each individual patient’s risk/need which
are assessed and updated regularly while a patient is on the waiting list. Nevertheless, there
still remains an acute shortage of donor livers today. The transplant data set is provided by
the BART R package as is this example: demo("liver.crisk.bart", package = "BART").
We compare the nonparametric Aalen-Johansen competing risks estimator with BART for
the transplant event of type O patients which are in general agreement; see Figure 15.
where these pij are currently unspecified and we provide their definition later in Equation 2.
N.B. we follow the recurrent events literature’s favored terminology by using the term “in-
tensity” rather than “hazard”, but they are generally interchangeable.
With absorbing events such as mortality there is no concern about the conditional indepen-
dence of future events because there will never be any. Conversely, with recurrent events,
there is a valid concern. Of course, conditional independence can be satisfied by conditioning
on the entire event history, denoted by Ni (s) where 0 ≤ s < t. However, conditioning on the
entire event history is often impractical. Rather, we condition on both Ni (t−) and vi (t) to
satisfy any concern of conditional independence.
We now write the model for yij as a nonparametric probit regression of yij on (t(j), x̃i(t(j))), tantamount to parametric models of discrete-time intensity (Thompson Jr. 1977; Arjas and Haara 1987; Fahrmeir 2014). Specifically, the temporal data (δi, si, ti, ui, xi(t)) are converted to a sequence of longitudinal binary events as follows: yij = max_k I(tik = t(j)). However, note that the definition of j is currently unspecified. To understand the impetus of the range of j, let us look at an example.
Suppose that we have two subjects with the values encoded in the times, tstop and delta matrices of the recur.pre.bart call below, which creates the grid of times (3, 4, 7, 8, 9, 12). For subject 1 (2), notice that y12 = y13 = 0 (y23 = 0) as it should be since no event occurred at times 4 or 7 (7). However, no events could have occurred then since their first event had not ended yet, i.e., these subjects are not chronologically at risk for an event and, therefore, no corresponding random behavior contributed to the
likelihood. The BART package provides the recur.pre.bart function which you can use
to construct these data sets. Here is a short demonstration of its capabilities adapted from
demo/data.recur.pre.bart.R (re-formatted for display purposes).
R> library("BART")
R> times <- matrix(c(3, 8, 9, 4, 12, 12), nrow = 2, ncol = 3, byrow = TRUE)
R> tstop <- matrix(c(7, 8, 0, 7, 0, 0), nrow = 2, ncol = 3, byrow = TRUE)
R> delta <- matrix(c(1, 1, 0, 1, 0, 0), nrow = 2, ncol = 3, byrow = TRUE)
R> recur.pre.bart(times = times, delta = delta, tstop = tstop)
[11,] 9 5 1
[12,] 12 8 1
Notice that $tx.test is not limited to the same time points as $tx.train, i.e., we often
want/need to estimate f at counter-factual values not observed in the data so each subject
contributes an equal number of evaluations for estimation purposes.
It is now clear that the yij which contribute to the likelihood are those such that j ∈ Ri, which is the risk set for subject i. We formally define the risk set as Ri = { j : j ∈ {1, . . . , ni} and ∩_{k=1}^{Ni} {t(j) ∉ (tik, uik)} }, i.e., the risk set contains j if t(j) is during the observation period for subject i and t(j) is not contained within an already ongoing event for this subject.
Putting it all together, we arrive at the following recurrent events discrete-time model with i
indexing subjects; i = 1, . . . , N .
For computational efficiency, we carry out the probit regression via truncated normal latent
variables (Albert and Chib 1993) (this default can be over-ridden for logit with logistic latents
(Holmes and Held 2006; Gramacy and Polson 2012) by specifying type = "lbart").
With the data prepared as described in the above example, the BART model for binary data
treats the probability of an event within an interval as a nonparametric function of time, t,
and covariates, x̃(t). Conditioned on the data, BART provides samples from the posterior
distribution of f . For any t and x̃(t), we obtain the posterior distribution of p(t, x̃(t)) =
Φ(µ0 + f (t, x̃(t))).
For the purposes of recurrent events survival analysis, we are typically interested in estimating
the cumulative intensity function as presented in Equation 1. With these estimates, one can
accomplish inference from the posterior via means, quantiles or other functions of p(t, x̃i(t)) or Λ(t, x̃(t)) as needed, such as the relative intensity, i.e., RI(t, x̃n(t), x̃d(t)) = p(t, x̃n(t)) / p(t, x̃d(t)), where x̃n(t) and x̃d(t) are two settings we wish to compare, like two treatments.
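A minimal sketch of this calculation follows (it assumes the tx.test settings stack all N subjects under the numerator setting x̃n(t) followed by the same N subjects under the denominator setting x̃d(t), with the K time points nested within each subject, and that the posterior draws of p(t, x̃(t)) for tx.test are stored in a component named prob.test; consult demo("bladder.recur.bart", package = "BART") for the exact component names used there):

K <- post$K
N <- ncol(post$prob.test) / (2 * K)
RI <- matrix(nrow = nrow(post$prob.test), ncol = K)
for (j in 1:K) {
    n.cols <- seq(j, by = K, length.out = N)  # numerator setting columns
    d.cols <- N * K + n.cols                  # denominator setting columns
    RI[, j] <- apply(post$prob.test[, n.cols], 1, mean) /
               apply(post$prob.test[, d.cols], 1, mean)
}
plot(post$times, apply(RI, 2, mean), type = "s", log = "y", xlab = "t", ylab = "RI(t)")
lines(post$times, apply(RI, 2, quantile, probs = 0.025), type = "s", lty = 2)
lines(post$times, apply(RI, 2, quantile, probs = 0.975), type = "s", lty = 2)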
Figure 16: Relative intensity: Thiotepa vs. Placebo. The relative intensity function is as follows: RI(t, x̃T(t), x̃P(t)) = p(t, x̃T(t)) / p(t, x̃P(t)), where T is for Thiotepa and P is for Placebo. The blue lines are the relative intensity functions themselves and the red lines are their 95% credible intervals. The relative intensity is calculated by Friedman's partial dependence function, i.e., aggregated over all other covariates.
Figure 17: Relative intensity: Thiotepa vs. Vitamin B6. The relative intensity function is as follows: RI(t, x̃T(t), x̃B(t)) = p(t, x̃T(t)) / p(t, x̃B(t)), where T is for Thiotepa and B is for Vitamin B6. The blue lines are the relative intensity functions themselves and the red lines are their 95% credible intervals. The relative intensity is calculated by Friedman's partial dependence function, i.e., aggregated over all other covariates.
Figure 18: Relative intensity: Vitamin B6 vs. Placebo. The relative intensity function is as follows: RI(t, x̃B(t), x̃P(t)) = p(t, x̃B(t)) / p(t, x̃P(t)), where B is for Vitamin B6 and P is for Placebo. The blue lines are the relative intensity functions themselves and the red lines are their 95% credible intervals. The relative intensity is calculated by Friedman's partial dependence function, i.e., aggregated over all other covariates.
recurrence time, if any, was measured from the beginning of treatment. There were 118 pa-
tients enrolled but only 116 were followed beyond time zero and contribute information. This
data set is loaded by data("bladder", package = "BART") and the data frame of interest is
bladder1. This data set is analyzed by demo("bladder.recur.bart", package = "BART").
In Figure 16, notice that the relative intensity calculated by Friedman’s partial dependence
function finds thiotepa inferior to placebo from roughly 6 to 18 months and afterward they
are about equal, but the 95% credible intervals are wide throughout. Similarly, the relative
intensity calculated by Friedman’s partial dependence function finds thiotepa inferior to vi-
tamin B6 from roughly 3 to 24 months and afterward they are about equal, but the 95%
credible intervals are wide throughout; see Figure 17. And, finally, vitamin B6 is superior to
placebo throughout, but the 95% credible intervals are wide; see Figure 18.
6. Discussion
The BART R package provides a user-friendly reference implementation of Bayesian additive
regression trees (BART). BART is a Bayesian nonparametric, tree-based ensemble, machine
learning technique with best-of-breed properties. In the spirit of machine learning, BART
learns the relationship between the covariates, x, and the response variable, arriving at f(x), without requiring the user to pre-specify the functional form of f nor the interaction terms
among the covariates. By specifying an optional sparse Dirichlet prior, BART is capable of
variable selection: a form of learning which is especially useful in high-dimensional settings.
The BART package supports the following outcomes:
• continuous outcomes;
• binary outcomes;
• categorical outcomes; and
• time-to-event outcomes, including
  – absorbing events,
  – competing risks, and
  – recurrent events.
In this article, we have provided the user with an overview of much of what the BART package offers, including (but not limited to): details of the BART prior and its arguments, sparse
variable selection, prediction, multi-threading, support for the outcomes listed above and
missing data handling. In addition, this article has provided primers on important BART
topics such as posterior computation, Friedman’s partial dependence function and convergence
diagnostics. With a computational method such as BART, the user needs a reliable, well-
documented software package with a diverse set of examples. With this article, and the
BART package itself, we believe that interested users now have the tools to successfully
employ BART for their rigorous data analysis needs.
References
Agresti A (2003). Categorical Data Analysis. 2nd edition. John Wiley & Sons, Hoboken.
Albert J, Chib S (1993). “Bayesian Analysis of Binary and Polychotomous Response Data.”
Journal of the American Statistical Association, 88, 669–79. doi:10.1080/01621459.
1993.10476321.
Andersen PK, Gill RD (1982). “Cox’s Regression Model for Counting Processes: A Large
Sample Study.” The Annals of Statistics, 10(4), 1100–1120. URL http://www.jstor.org/
stable/2240714.
Arjas E, Haara P (1987). “A Logistic Regression Model for Hazard: Asymptotic Results.”
Scandinavian Journal of Statistics, 14(1), 1–18. URL https://www.jstor.org/stable/
4616044.
Baldi P, Brunak S (2001). Bioinformatics: The Machine Learning Approach. 2nd edition.
MIT Press, Cambridge.
Bleich J, Kapelner A, George EI, Jensen ST (2014). “Variable Selection for BART: An Application to Gene Regulation.” The Annals of Applied Statistics, 8(3), 1750–1781. doi:10.1214/14-AOAS755.
Calcote J (2010). Autotools: A Practitioner’s Guide to GNU Autoconf, Automake, and Libtool.
No Starch Press, San Francisco.
Chipman HA, George EI, McCulloch RE (1998). “Bayesian CART Model Search.” Journal
of the American Statistical Association, 93(443), 935–948. doi:10.1080/01621459.1998.
10473750.
Chipman HA, George EI, McCulloch RE (2010). “BART: Bayesian Additive Regression
Trees.” The Annals of Applied Statistics, 4(1), 266–298. doi:10.1214/09-AOAS285.
Chipman HA, George EI, McCulloch RE (2013). “Bayesian Regression Structure Discovery.”
In P Damien, P Dellaportas, N Polson, D Stephens (eds.), Bayesian Theory and Applica-
tions. Oxford University Press, Oxford. doi:10.1093/acprof:oso/9780199695607.001.
0001.
Cox DR (1972). “Regression Models and Life-Tables.” Journal of the Royal Statistical Society
B, 34(2), 187–220.
Daniels M, Singh A (2018). sbart: Sequential BART for Imputation of Missing Co-
variates. R package version 0.1.1, URL https://CRAN.R-project.org/src/contrib/
Archive/sbart/.
De Waal T, Pannekoek J, Scholtus S (2011). Handbook of Statistical Data Editing and Impu-
tation. John Wiley & Sons, Hoboken. doi:10.1002/9780470904848.
Delany MF, Linda SB, Moore CT (1999). “Diet and Condition of American Alligators in 4
Florida Lakes.” In Proceedings of the Annual Conference of the Southeastern Association
of Fish and Wildlife Agencies, pp. 375–389.
Denison DGT, Mallick BK, Smith AFM (1998). “A Bayesian CART Algorithm.” Biometrika,
85(2), 363–377. doi:10.1093/biomet/85.2.363.
Dorie V (2020). dbarts: Discrete Bayesian Additive Regression Trees Sampler. R package
version 0.9-18, URL https://CRAN.R-project.org/package=dbarts.
Efron B, Hastie T, Johnstone I, Tibshirani R (2004). “Least Angle Regression.” The Annals
of Statistics, 32(2), 407–499. doi:10.1214/009053604000000067.
Fine JP, Gray RJ (1999). “A Proportional Hazards Model for the Subdistribution of a
Competing Risk.” Journal of the American Statistical Association, 94(446), 496–509. doi:
10.1080/01621459.1999.10474144.
Friedman JH (1991). “Multivariate Adaptive Regression Splines (with Discussion and a Re-
joinder by the Author).” The Annals of Statistics, 19, 1–67. URL http://www.jstor.org/
stable/2241837.
Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P,
Barrett B, Lumsdaine A, Castain R, Daniel D, Graham R, Woodall T (2004). “Open MPI:
Goals, Concept, and Design of a Next Generation MPI Implementation.” In European
Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting, pp. 97–104.
Springer-Verlag, Berlin, Heidelberg. doi:10.1007/978-3-540-30218-6_19.
Geman S, Geman D (1984). “Stochastic Relaxation, Gibbs Distributions, and the Bayesian
Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence,
6, 721–741. doi:10.1109/tpami.1984.4767596.
Hahn P, Carvalho C (2015). “Decoupling Shrinkage and Selection in Bayesian Linear Mod-
els: a Posterior Summary Perspective.” Journal of the American Statistical Association,
110(509), 435–448. doi:10.1080/01621459.2014.993077.
Harrison Jr D, Rubinfeld DL (1978). “Hedonic Housing Prices and the Demand for Clean
Air.” Journal of Environmental Economics and Management, 5(1), 81–102. doi:10.1016/
0095-0696(78)90006-2.
Hastings WK (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their
Applications.” Biometrika, 57(1), 97–109. doi:10.1093/biomet/57.1.97.
Holmes C, Held L (2006). “Bayesian Auxiliary Variable Models for Binary and Multinomial
Regression.” Bayesian Analysis, 1(1), 145–168. doi:10.1214/06-ba105.
Imai K, Van Dyk DA (2005). “A Bayesian Analysis of the Multinomial Probit Model Using
Marginal Data Augmentation.” Journal of Econometrics, 124(2), 311–334. doi:10.1016/
j.jeconom.2004.02.002.
Institute of Electrical and Electronics Engineers (2008). IEEE Std 754-2008, chapter IEEE
Standard for Floating-Point Arithmetic, pp. 1–70. IEEE. doi:10.1109/ieeestd.2008.
4610935.
Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM (2014). “Random
Survival Forests for Competing Risks.” Biostatistics, 15(4), 757–773. doi:10.1093/
biostatistics/kxu010.
Johnson NL, Kotz S, Balakrishnan N (1995). Continuous Univariate Distributions, volume 2.
2nd edition. John Wiley & Sons, New York.
Kalbfleisch JD, Prentice RL (1980). The Statistical Analysis of Failure Time Data. 1st edition.
John Wiley & Sons, Hoboken.
Kalbfleisch JD, Prentice RL (2002). The Statistical Analysis of Failure Time Data. 2nd
edition. John Wiley & Sons, Hoboken. doi:10.1002/9781118032985.
Kapelner A, Bleich J (2016). “bartMachine: Machine Learning with Bayesian Additive Re-
gression Trees.” Journal of Statistical Software, 70(4), 1–40. doi:10.18637/jss.v070.i04.
Kindo BP, Wang H, Peña EA (2016). “Multinomial Probit Bayesian Additive Regression
Trees.” Stat, 5(1), 119–131. doi:10.1002/sta4.110.
Klein JP, Moeschberger ML (2003). Survival Analysis: Techniques for Censored and Trun-
cated Data. 2nd edition. Springer-Verlag, New York. doi:10.1007/b97377.
Krogh A, Solich P (1997). “Statistical Mechanics of Ensemble Learning.” Physical Review E,
55(1), 811–825. doi:10.1103/physreve.55.811.
Kuhn M, Johnson K (2013). Applied Predictive Modeling. Springer-Verlag, New York. doi:
10.1007/978-1-4614-6849-3.
Linero A (2018). “Bayesian Regression Trees for High Dimensional Prediction and Variable
Selection.” Journal of the American Statistical Association, 113(522), 626–636. doi:10.
1080/01621459.2016.1264957.
Loprinzi CL, Laurie JA, Wieand HS, Krook JE, Novotny PJ, Kugler JW, Bartel J, Law
M, Bateman M, Klatt NE (1994). “Prospective Evaluation of Prognostic Variables from
Patient-Completed Questionnaires. North Central Cancer Treatment Group.” Journal of
Clinical Oncology, 12(3), 601–607.
Lynch J (1965). “The Burroughs B8500.” Datamation, pp. 49–50.
McCulloch R, Rossi PE (1994). “An Exact Likelihood Analysis of the Multinomial Probit
Model.” Journal of Econometrics, 64(1), 207–240. doi:10.1016/0304-4076(94)90064-7.
McCulloch RE, Carvalho C, Hahn R (2015). “A General Approach to Variable Selection
Using Bayesian Nonparametric Models.” Joint Statistical Meetings, Seattle, 2015-08-09–
2015-08-13.
McCulloch RE, Polson NG, Rossi PE (2000). “A Bayesian Analysis of the Multinomial Probit
Model with Fully Identified Parameters.” Journal of Econometrics, 99(1), 173–193. doi:
10.1016/s0304-4076(00)00034-8.
McCulloch RE, Sparapani RA, Gramacy R, Spanbauer C, Pratola M (2021). BART: Bayesian
Additive Regression Trees. R package version 2.9, URL https://CRAN.R-project.org/
package=BART.
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953). “Equation of
State Calculations by Fast Computing Machines.” The Journal of Chemical Physics, 21(6),
1087–1092. doi:10.1063/1.1699114.
Mueller P (1991). “A Generic Approach to Posterior Integration and Gibbs Sampling.” Tech-
nical Report 91-09, Purdue University, West Lafayette. URL http://www.stat.purdue.
edu/research/technical_reports/pdfs/1991/tr91-09.pdf.
Murray JS (2020). “Log-Linear Bayesian Additive Regression Trees for Multinomial Logistic
and Count Regression Models.” Journal of the American Statistical Association, (ahead
of print), 1–35. doi:10.1080/01621459.2020.1813587.
Nicolaie MA, Van Houwelingen HC, Putter H (2010). “Vertical Modeling: A Pattern Mixture
Approach for Competing Risks Modeling.” Statistics in Medicine, 29(11), 1190–1205. doi:
10.1002/sim.3844.
Plummer M, Best N, Cowles K, Vines K (2006). “coda: Convergence Diagnosis and Output
Analysis for MCMC.” R News, 6(1), 7–11. URL https://CRAN.R-project.org/doc/
Rnews/.
Pratola MT, Chipman HA, Gattiker JR, Higdon DM, McCulloch R, Rust WN (2014). “Paral-
lel Bayesian Additive Regression Trees.” Journal of Computational and Graphical Statistics,
23(3), 830–852. doi:10.1080/10618600.2013.841584.
R Core Team (2017). Mathlib: A C Library of Special Functions. R Foundation for Sta-
tistical Computing, Vienna, Austria. URL https://CRAN.R-project.org/doc/manuals/
r-release/R-admin.html.
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Founda-
tion for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Ripley BD (2007). Pattern Recognition and Neural Networks. Cambridge University Press.
Robert C, Casella G (2013). Monte Carlo Statistical Methods. Springer-Verlag, New York.
Rossini AJ, Tierney L, Li N (2007). “Simple Parallel Statistical Computing in R.” Journal of
Computational and Graphical Statistics, 16(2), 399–420. doi:10.1198/106186007x178979.
Scott SL (2011). “Data Augmentation, Frequentist Estimation, and the Bayesian Anal-
ysis of Multinomial Logit Models.” Statistical Papers, 52(1), 87–109. doi:10.1007/
s00362-009-0205-0.
Silverman BW (1986). Density Estimation for Statistics and Data Analysis. Chapman and
Hall, London.
Sparapani RA, Logan BR, McCulloch RE, Laud PW (2016). “Nonparametric Survival Anal-
ysis Using Bayesian Additive Regression Trees (BART).” Statistics in Medicine, 35(16),
2741–2753. doi:10.1002/sim.6893.
Urbanek S (2020). rJava: Low-Level R to Java Interface. R package version 0.9-13, URL
https://CRAN.R-project.org/package=rJava.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S. 4th edition. Springer-
Verlag, New York.
Walker DW, Dongarra JJ (1996). “MPI: A Standard Message Passing Interface.” Supercom-
puter, 12, 56–68.
Wei LJ, Lin DY, Weissfeld L (1989). “Regression Analysis of Multivariate Incomplete Fail-
ure Time Data by Modeling Marginal Distributions.” Journal of the American Statistical
Association, 84(408), 1065–1073. doi:10.1080/01621459.1989.10478873.
Yu H (2002). “Rmpi: Parallel Statistical Computing in R.” R News, 2(2), 10–14. URL
https://CRAN.R-project.org/doc/Rnews/.
The examples in this article are included in the package. You can run the first example
(described in Section 3) as follows.
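For example, a call along the following lines (the specific demo name, the figures directory and the use of options() shown here are illustrative assumptions):

options(figures = getwd(), mc.cores = 8)   # directory for PDF output and number of threads
demo("boston.wbart", package = "BART")     # assumed name of the first example's demo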
As we shall see, these examples produce R objects containing BART model fits. But these fits are Bayesian nonparametric samples from the posterior and require statistical summarization before they are readily interpretable. Therefore, we often employ graphical summaries (such as the figures in this article) to visualize the BART model fit. Note that the figures option (in the code snippet above) specifies a directory where the Portable Document Format (PDF) graphics files will be produced; if it is not specified, then the graphics will still be generated by R, but no PDF files will be created. Furthermore, some of these BART model fits can take a few minutes, so it is wise to utilize multi-threading when it is available (for a discussion of efficient computation with BART, including multi-threading, see Appendix Section D). Returning to the snippet above, the option mc.cores specifies the number of cores to employ in multi-threading; there are diminishing returns, so 8 cores is often sufficient. And, finally, to run all of the examples in this article (with the options as specified above), run demo("replication", package = "BART").
The nodes of each binary tree are numbered by tier: tier 0 is the root, node 1; tier 1 contains nodes 2 and 3; tier 2 contains nodes 4, 5, 6 and 7; and, in general, tier t contains nodes 2^t, . . . , 2^{t+1} − 1.
The var field is the variable in the branch decision rule which is encoded 0, . . . , P − 1 as a
C/C++ array index (rather than an R index). Similarly, the cut field is the cutpoint of the
variable in the branch decision rule which is encoded 0, . . . , cj − 1 for variable j; note that the
cutpoints are returned in the treedraws$cutpoints list item. The terminal leaf output value
is contained in the field leaf. It is not immediately obvious which nodes are branches vs.
leaves since, at first, it would appear that the leaf field is given for both branches and leaves.
Leaves are always associated with var = 0 and cut = 0; however, note that this is also a
valid branch variable/cutpoint since these are C/C++ indices. The key to discriminating
between branches and leaves is the algebraic relationship between a branch, n, at tree tier t(n) and its left, l = 2n, and right, r = 2n + 1, child nodes at tier t(n) + 1: for each node besides the root, you can determine the branch from which it arose, and those nodes that have no child nodes are necessarily leaves.
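For illustration (hypothetical node numbers rather than the package's own parsing code), the rule can be applied as follows: a stored node n is a branch exactly when its children 2n and 2n + 1 are also stored; otherwise it is a leaf.

nodes <- c(1, 2, 3, 6, 7)                     # hypothetical node numbers for one tree
is.branch <- (2 * nodes) %in% nodes           # a branch has its left child 2n stored
data.frame(node = nodes, tier = floor(log2(nodes)),
           type = ifelse(is.branch, "branch", "leaf"))
## nodes 1 and 3 are branches; nodes 2, 6 and 7 are leaves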
Underlying this methodology is the BART prior. The BART prior specifies a flexible class of
unknown functions, f , from which we can gather randomly generated fits to the given data
via the posterior. N.B. we define f as returning a scalar value, but BART extensions which
return multivariate values are conceivable. Let the function g(x; T , M) assign a value based
on the input x. The binary decision tree T is represented by a set of ordered triples, (n, j, k),
representing branch decision rules: n ∈ B for node n in the set of branches B, j for covariate x_j and k for the cutpoint c_jk. The branch decision rules are of the form x_j < c_jk, meaning branch left, and x_j ≥ c_jk, meaning branch right; traversal stops at a terminal leaf. M represents the leaves and is a set of ordered pairs, (n, µ_n): n ∈ L where L is the set of leaves (L is the complement of B) and µ_n is the outcome value.
The function, f, is a sum of H trees:

f(x_i) = Σ_{h=1}^{H} g(x_i; T_h, M_h)    (3)
with i indexing subjects i = 1, . . . , N. The unknown random function, f, and the error variance, σ², follow the BART prior, expressed notationally as

(f, σ²) ∼ BART(H, µ_0, τ, k, α, γ; ν, λ, q)

where H is the number of trees, µ_0 is a known constant which centers y, and the rest of the parameters will be explained later in this section (for brevity, we will often use the simpler shorthand (f, σ²) ∼ BART). The w_i are known standard deviation weight multiples which you can supply with the argument w; this argument is only available for continuous outcomes, hence the weighted BART name, and the unit weight vector is the default. The centering parameter, µ_0, can be specified via the fmean argument where the default is taken to be ȳ.
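To make the notation concrete, the following sketch (a hypothetical single tree, not the package's internal representation) evaluates g(x; T, M) by following the branch decision rules from the root until a leaf is reached; f(x) is then the sum of such evaluations over the H trees.

branches <- data.frame(n = c(1, 3), j = c(2, 1), cut = c(0.5, 1.3))  # triples (n, j, c_jk)
leaves   <- data.frame(n = c(2, 6, 7), mu = c(-1.1, 0.2, 1.4))       # pairs (n, mu_n)
g <- function(x, branches, leaves) {
  n <- 1                                          # start at the root
  while (!(n %in% leaves$n)) {                    # descend until a leaf is reached
    b <- branches[branches$n == n, ]
    n <- if (x[b$j] < b$cut) 2 * n else 2 * n + 1 # x_j < c_jk: go left; otherwise right
  }
  leaves$mu[leaves$n == n]
}
g(c(2.0, 0.1), branches, leaves)                  # follows 1 -> 2 and returns -1.1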
BART is a Bayesian nonparametric prior. Using the Gelfand-Smith generic bracket notation for the specification of random variable distributions (Gelfand and Smith 1990), we represent the BART prior in terms of the collection of all trees, T; the collection of all leaves, M; and the error variance, σ², as the following product:

[T, M, σ²] = [σ²] [T, M] = [σ²] [T] [M | T] = [σ²] Π_{h=1}^{H} [T_h] [M_h | T_h]

where [T_h] is the prior for the h-th tree and [M_h | T_h] is the collection of leaves for the h-th tree. And, finally, the collection of leaves for the h-th tree are independent: [M_h | T_h] = Π_n [µ_hn | T_h] where n indexes the leaf nodes.
The tree prior: [T_h]. There are three prior components of T_h which govern whether the tree branches grow or are pruned. The first tree prior regularizes the probability of a branch at leaf node n in tree tier t(n) = ⌊log₂ n⌋ as

P[B_n = 1] = α (1 + t(n))^{-γ}    (4)

where B_n = 1 represents a branch while B_n = 0 is a leaf, 0 < α < 1 and γ ≥ 0. You can specify these prior parameters with arguments, but the following defaults are recommended: α is set by the parameter base = 0.95 and γ by power = 2; for a detailed discussion of these parameter settings, see Chipman et al. (1998). Note that this prior penalizes branch growth, i.e., in prior probability, the default number of branches will likely be 1 or 2. Next, there is a prior dictating the choice of the splitting variable j conditional on a branch event B_n; by default, each of the P covariates is chosen with equal probability s_j = 1/P, while the sparse Dirichlet prior discussed below allows these probabilities to be learned from the data.
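For example, with the default settings base = 0.95 (α) and power = 2 (γ), the prior probability of a branch falls off quickly with the tier:

base <- 0.95; power <- 2
tier <- 0:3
round(base * (1 + tier)^(-power), 3)   # P[B_n = 1] at tiers 0 to 3
## 0.950 0.238 0.106 0.059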
[Figure 19 plot: the natural logarithm of the scaled beta prime density, log(f(x, a, b, ρ/P)), e.g., log(f(x, 0.5, 1, 0.5)), plotted against x from 0 to 5.]
Figure 19: The distribution of θ/P and the sparse Dirichlet prior. The key to understand-
ing the inducement of sparsity is the distribution of the arguments to the Dirichlet prior:
θ/P ∼F (a, b, ρ/P ) where F (.) is the beta prime distribution scaled by ρ/P . Here we plot the
natural logarithm of the scaled beta prime density, f (.), at a non-sparse setting and three
sparse settings. The non-sparse setting is (a, b, ρ/P ) = (1, 1, 1) (solid black line). As you can
see in the figure, sparsity is increased by reducing ρ (long dashed red line), reducing a (short
dashed blue line) or reducing both (mixed dashed gray line).
The leaf prior: [M_h | T_h]. The leaf values are shrunk toward the center of the outcome, with the amount of shrinkage controlled by the parameter k. The default value, k = 2, corresponds to µ_i falling within the extrema with approximately 0.95 probability. Alternative choices of k can be supplied via the k argument. We have found that values of k ∈ [1, 3] generally yield good results. Note that k is a potential candidate parameter for choice via cross-validation.
The error variance prior: σ². The prior for σ² is the conjugate scaled inverse chi-square distribution, i.e., νλχ⁻²(ν). We recommend that the degrees of freedom, ν, be from 3 to 10; the default is 3, which can be over-ridden by the argument sigdf. The λ parameter can be specified by the lambda argument, which defaults to NA. If lambda is unspecified, then we determine a reasonable value for λ based on an estimate, σ̂ (which can be specified by the argument sigest and defaults to NA). If sigest is unspecified, the default value of sigest is determined via linear regression or the sample standard deviation: if P < N, then y_i ∼ N(x_i'β̂, σ̂²); otherwise, σ̂ = s_y. Now we solve for λ such that P[σ² ≤ σ̂²] = q. This quantity, q, can be specified by the argument sigquant; the default is 0.9 and we also recommend considering 0.75 and 0.99. Note that the pair (ν, q) are potential candidate parameters for choice via cross-validation.
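A minimal sketch of the λ calculation implied by P[σ² ≤ σ̂²] = q when σ² ∼ νλχ⁻²(ν) (not necessarily the package's exact code):

lambda.from.sigest <- function(sigest, nu = 3, q = 0.9) {
  ## sigma^2 = nu * lambda / X with X ~ chi^2_nu, so
  ## P[sigma^2 <= sigest^2] = P[X >= nu * lambda / sigest^2] = q
  sigest^2 * qchisq(1 - q, df = nu) / nu
}
lambda.from.sigest(sigest = 1, nu = 3, q = 0.9)   # roughly 0.19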
Other important arguments for the BART prior. We fix the number of trees at H, which corresponds to the argument ntree. The default number of trees is 200 for continuous outcomes; as shown by Bleich et al. (2014), 50 is also a reasonable choice and is the default for all other outcomes: cross-validation could be considered. The number of cutpoints is provided by the argument numcut and the default is 100. The default number of cutpoints is typically attained only for continuous covariates. For continuous covariates, the cutpoints are uniformly spaced by default, or generated via uniform quantiles if the argument usequants = TRUE is provided. By default, discrete covariates which have fewer than 100 values will necessarily have fewer cutpoints. However, if you want a single discrete covariate to be represented by a group of binary dummy variables, one for each category, then pass the variable as a factor within a data frame.
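A sketch of supplying these arguments (simulated data; how a factor column is expanded into dummies is governed by the package's model-matrix handling, so consult the documentation):

library("BART")
set.seed(99)
N <- 200
x <- data.frame(age   = rnorm(N, 50, 10),
                group = factor(sample(c("A", "B", "C"), N, replace = TRUE)))
y <- ifelse(x$group == "A", 1, 0) + 0.02 * x$age + rnorm(N)
post <- wbart(x.train = x, y.train = y,
              ntree = 50L,          # 50 trees instead of the default 200
              numcut = 100L,        # at most 100 cutpoints per covariate
              usequants = TRUE)     # cutpoints at uniform quantiles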
After cancellation of terms in the numerator and denominator, we have the likelihood contribution:

P[y_L, y_R | T*] / P[y_LR | T^m] = P[y_L | T*] P[y_R | T*] / P[y_LR | T^m]

where y_L is the partition of the data corresponding to the newborn left leaf node; y_R, the partition for the newborn right leaf node; and y_LR = [y_L; y_R] is their concatenation. N.B. the terms in the ratio are the predictive densities of a normal mean with a known variance and a normal prior for the mean.
Similarly, the terms that the prior contributes to the posterior ratio often cancel since there is only one “place” where the trees differ and the prior draws components independently at different “places” of the tree. Therefore, the prior contribution to P[T*] / P[T^m] is

P[B_n = 1] P[B_l = 0] P[B_r = 0] s_j / P[B_n = 0]
  = α(t(n) + 1)^{-γ} [1 − α(t(n) + 2)^{-γ}]² s_j / (1 − α(t(n) + 1)^{-γ})

where P[B_n] is the branch regularity prior (see Equation 4), s_j is the splitting variable selection probability, n is the chosen leaf node in tree T^m, l = 2n is the newborn left leaf node in tree T* and r = 2n + 1 is the newborn right leaf node in tree T*.
Finally, the ratio P[T^m | T*] / P[T* | T^m] is

P[DEATH | T*] P[n | T*] / (P[BIRTH | T^m] P[n | T^m] s_j)

where P[n | T] is the probability of choosing node n given tree T.
N.B. s_j appears in both the numerator and denominator of the acceptance probability π_BIRTH and therefore cancels, which is mathematically convenient.
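Assembling the three ratios above gives the Metropolis-Hastings acceptance probability in its usual form (a restatement for clarity, not a new result):

π_BIRTH = min{1, (P[y_L | T*] P[y_R | T*] / P[y_LR | T^m]) × (P[T*] / P[T^m]) × (P[T^m | T*] / P[T* | T^m])}.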
Now, let us briefly discuss the posterior computation related to the sparse Dirichlet prior. If a Dirichlet prior is placed on the variable splitting probabilities, s, then its posterior samples are drawn via Gibbs sampling with conjugate Dirichlet draws. The Dirichlet parameter is updated by adding the per-variable branch count over the ensemble, m_j, to the prior setting, θ/P, i.e., s | · ∼ Dirichlet(θ/P + m_1, . . . , θ/P + m_P). In this way, the Dirichlet prior induces a “rich get richer” variable selection strategy. The sparsity parameter, θ, is drawn on a fine grid of values for the analytic posterior (Linero 2018). This draw only depends on [s_1, . . . , s_P].
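A minimal sketch of this conjugate update (hypothetical branch counts; the Dirichlet draw is generated via normalized Gamma variates):

theta <- 1; P <- 10                       # sparsity parameter and number of covariates
m <- c(5, 3, 1, rep(0, 7))                # hypothetical branch counts m_j over the ensemble
s <- rgamma(P, shape = theta / P + m)     # Dirichlet(theta/P + m_1, ..., theta/P + m_P)
s <- s / sum(s)                           # normalize to the simplex
round(s, 3)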
Multi-threading hardware is now widely available, from higher-end servers to mass-market products such as desktops and laptops. For example, the consumer laptop on which BART was developed, purchased in 2016, is capable of 8 threads (and hence many of the examples default to 8 threads).
Multi-threading via forking is provided by the parallel package, but it is only available on POSIX-compliant operating systems such as Linux and macOS (which are all in the UNIX family tree). The BART package uses forking for posterior sampling of the f function, and also for the predict function when OpenMP is not available. Except for predict, all functions that use forking start with mc. Regardless of whether OpenMP or forking is employed, these functions accept the argument mc.cores, which controls the number of threads to be used. The parallel package provides the function detectCores, which returns the number of threads that your hardware can support and, therefore, that the BART package can use.
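For example, on a platform where forking is available (mc.wbart is the forking counterpart of wbart; the simulated data are for illustration only):

library("BART")
library("parallel")
detectCores()                              # threads the hardware supports
set.seed(99)
x <- matrix(rnorm(200 * 5), 200, 5)
y <- x[, 1] + rnorm(200)
post <- mc.wbart(x.train = x, y.train = y, mc.cores = 8, seed = 99)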
This missing data imputation method is sufficient for data sets with relatively few missing values; for more advanced needs, we recommend the sbart package, which utilizes the sequential BART algorithm (Daniels and Singh 2018; Xu, Daniels, and Winterstein 2016) (N.B. sbart is also a descendant of MPI BART).
For example, a burn-in fraction of b = 100/1100 ≈ 0.09 with B = 5 threads results in an elapsed time of only ((1 − b)/B + b) ≈ 0.27 of the single-threaded time, or a ((1 − b)/B + b)⁻¹ ≈ 3.67 fold reduction, which is the gain in efficiency. In Figure 21, we plot the theoretical gains on the y-axis and the number of CPUs on the x-axis for two settings: b ∈ {0.025, 0.1}.
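The gain function is easy to compute directly; for example (reproducing the numbers quoted above):

gain <- function(b, B) 1 / ((1 - b) / B + b)     # Amdahl's Law gain limit
gain(100 / 1100, 5)                              # approximately 3.67, as in Figure 20
round(gain(0.025, c(1, 2, 5, 10, 20, 50)), 1)    # the upper curve of Figure 21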
Table 2: A comparison of BART packages available on CRAN (Chipman and McCulloch 2016; Kapelner and Bleich 2016; McCulloch
et al. 2021).
[Figure 20 plot: proportionate length of chain processing time (y-axis) versus the number of chains (x-axis), with the burn-in fraction b marked.]
Figure 20: The theoretical gain due to multi-threading can be calculated by Amdahl's Law. Let b be the burn-in fraction and B be the number of threads; then the theoretical gain limit is ((1 − b)/B + b)⁻¹. In this diagram, the burn-in fraction, b = 100/1100 ≈ 0.09, and the number of CPUs, B = 5, results in an elapsed time of only ((1 − b)/B + b) ≈ 0.27 or a ((1 − b)/B + b)⁻¹ ≈ 3.67 fold reduction, which is the gain in efficiency.
[Figure 21 plot: gain (y-axis) versus B, the number of CPUs (x-axis), with curves for b = 0.025 and b = 0.1.]
Figure 21: The theoretical gain due to multi-threading can be calculated by Amdahl’s Law.
Let b be the burn-in fraction and B be the number of threads, then the theoretical gain limit
is ((1 − b)/B + b)−1 . In this figure, the theoretical gains are on the y-axis and the number of
CPUs, the x-axis, for two settings: b ∈ {0.025, 0.1}.
If we pass A from R to C/C++ and then transpose it, we will create multiple copies of A consuming 8 × m × n × B bytes, where B is the number of children. If A is a large matrix, then you may stress the system's limits. The simple solution is for the parent to create the transpose before passing A, i.e., A <- t(A), thereby avoiding the multiple copies. This is the philosophy that the BART package follows for the multi-threaded BART functions; see the documentation for the transposed argument.
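A sketch of the idea (assuming x.train and y.train already exist; consult the documentation of the mc. functions for the exact requirements of transposed):

x.train <- t(x.train)                              # parent transposes once, up front
post <- mc.wbart(x.train = x.train, y.train = y.train,
                 transposed = TRUE,                # children skip the transpose and extra copies
                 mc.cores = 8, seed = 99)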
Processes with a higher nice value receive less CPU time than those with normal priority. Generally, processes with nice 19 are only run when the system would otherwise be idle. Therefore, by default, the BART package children have their nice value set to 19.
Affiliation:
Rodney Sparapani
Division of Biostatistics
Institute for Health and Equity
Medical College of Wisconsin, Milwaukee campus
8701 Watertown Plank Road
Milwaukee, WI 53226, United States of America
E-mail: [email protected]