
JSS Journal of Statistical Software

January 2021, Volume 97, Issue 1. doi: 10.18637/jss.v097.i01

Nonparametric Machine Learning and Efficient Computation with Bayesian Additive Regression Trees: The BART R Package

Rodney Sparapani (Medical College of Wisconsin), Charles Spanbauer (University of Minnesota), Robert McCulloch (Arizona State University)

Abstract
In this article, we introduce the BART R package which is an acronym for Bayesian ad-
ditive regression trees. BART is a Bayesian nonparametric, machine learning, ensemble
predictive modeling method for continuous, binary, categorical and time-to-event out-
comes. Furthermore, BART is a tree-based, black-box method which fits the outcome
to an arbitrary random function, f , of the covariates. The BART technique is relatively
computationally efficient as compared to its competitors, but large sample sizes can be
demanding. Therefore, the BART package includes efficient state-of-the-art implemen-
tations for continuous, binary, categorical and time-to-event outcomes that can take ad-
vantage of modern off-the-shelf hardware and software multi-threading technology. The
BART package is written in C++ for both programmer and execution efficiency. The
BART package takes advantage of multi-threading via forking as provided by the parallel
package and OpenMP when available and supported by the platform. The ensemble of
binary trees produced by a BART fit can be stored and re-used later via the R predict
function. In addition to being an R package, the installed BART routines can be called
directly from C++. The BART package provides the tools for your BART toolbox.

Keywords: binary trees, black-box, categorical, competing risks, continuous, ensemble pre-
dictive model, forking, multinomial, multi-threading, OpenMP, recurrent events, survival
analysis.

1. Introduction
Bayesian additive regression trees (BART) arose out of earlier research on Bayesian model
fitting of an outcome to a single tree (Chipman, George, and McCulloch 1998). In this era

from 1996 to 2001, the excellent predictive performance of ensemble models became apparent
(Breiman 1996; Krogh and Solich 1997; Freund and Schapire 1997; Breiman 2001; Friedman
2001; Baldi and Brunak 2001). Instead of making a single prediction from a complex model,
ensemble models make a single prediction which is the summary of the predictions from many
simple models. Generally, ensemble models have desirable properties, e.g., they do not suffer
from over-fitting (Kuhn and Johnson 2013). Like bagging (Breiman 1996), boosting (Freund
and Schapire 1997; Friedman 2001) and random forests (Breiman 2001), BART relies on an
ensemble of trees to predict the outcome; and, although there are similarities, there are also
differences between these approaches.

BART is a Bayesian nonparametric, sum of trees method for continuous, dichotomous, cat-
egorical and time-to-event outcomes. Furthermore, BART is a black-box, machine learn-
ing method which fits the outcome via an arbitrary random function, f , of the covariates.
So-called black-box models generate functions of the covariates which are so complex that
interpreting the internal details of the fitted model is generally abandoned in favor of assess-
ment via evaluations of the fitted function, f , at chosen values of the covariates. As shown
by Chipman, George, and McCulloch (2010), BART’s out-of-sample predictive performance
is generally equivalent to, or exceeds that of, alternatives like the lasso with L1 regularization
(Efron, Hastie, Johnstone, and Tibshirani 2004) or black-box models such as gradient boost-
ing (Freund and Schapire 1997; Friedman 2001), neural nets with one hidden layer (Venables
and Ripley 2002) and random forests (Breiman 2001). Over-fitting is the tendency to overly
fit a model to an in-sample training data set at the expense of poor predictive performance
for unseen out-of-sample data. Typically, BART does not over-fit to the training data due to
the regularization tree-branching penalty of the BART prior, i.e., generally, each tree has few
branches and plays a small part in the overall fit. So, the resulting fit from the ensemble of
trees as a whole is generally a good fit that does not over-fit. Essentially, BART is a Bayesian
nonlinear model with all the advantages of the Bayesian paradigm such as posterior infer-
ence including point and interval estimation. Conveniently, BART naturally scales to large
numbers of covariates and facilitates variable selection; it does not require the covariates to
be rescaled; neither does it require the covariate functional relationship, nor the interactions
considered, to be pre-specified.

In this article, we give an overview of data analysis with BART and the BART R package.
In Section 2, we describe the R functions provided by the BART package for analyzing
continuous outcomes with BART. In Section 3, we demonstrate the typical usage of BART
via the classic example of Boston housing values. In Section 4, we describe how BART can
be used to analyze binary and categorical outcomes. In Section 5, we describe how BART
can be used to analyze time-to-event outcomes with censoring including competing risks and
recurrent events. In Appendix Section A, we describe how to get and install the BART
package. In Appendix Section B, we describe the basis of BART on binary trees along with
the details of the BART prior. In Appendix Section C, we briefly describe the posterior
computations required to use BART. In Appendix Section D, we describe how to perform the
BART computations efficiently by resorting to parallel processing with multi-threading (N.B.
by default, the Microsoft Windows operating system does not provide the multi-threading
interfaces employed by the R environment; without them, the BART package is single-threaded on Windows, yet otherwise completely functional; see Appendix D for more
details).

2. Continuous outcomes with BART


In this section, we document the analysis of continuous outcomes with the BART R package.
We provide two functions for continuous outcomes: (1) wbart named for weighted BART;
and (2) gbart named for generic, or generalized, BART. Both functions have roughly the
same functionality. wbart has a verbose interface while gbart is streamlined. Also, wbart is
for continuous outcomes only whereas gbart also supports binary outcomes.
Typically, when calling the wbart and gbart functions, many of the arguments can be omitted
since the default values are adequate for most purposes. However, there are certain common
arguments which are either always needed or frequently provided. The wbart (mc.wbart) and
gbart (mc.gbart) functions are for serial (parallel) computation; for more details on parallel
computation see the Appendix Section D. The outcome y.train is a vector of numeric values.
The covariates for training (validation, if any) are x.train (x.test) which can be matrices or
data frames containing factors; in the display below, we assume matrices for simplicity. N.B.
throughout we denote integer constants by upper case letters, e.g., in the following display:
M for the number of posterior samples, B for the number of threads (generally, B = 1 for
Windows), N for the number of observations in the training set, and Q for the number of
observations in the test set.

set.seed(99)
post <- wbart(x.train, y.train, x.test, ndpost = M)
post <- mc.wbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
seed = 99)
post <- gbart(x.train, y.train, x.test, ndpost = M)
post <- mc.gbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
seed = 99)

The data inputs, as shown above, are as follows.


 
• x.train is a matrix or data frame of covariates for training with rows x_1, x_2, . . . , x_N, i.e., made up of the x_i as row vectors.
• y.train is a vector of the outcome for training.
• x.test (optional) is a matrix or data frame for testing.

The value returned, post as shown above, is of type ‘wbart’ that is essentially a list contain-
ing named components. Of particular interest are post$yhat.train and post$yhat.test
described as follows.
• post$yhat.train is an M × N matrix of predictions with entries ŷ_im = µ0 + f_m(x_i), where f_m is the m-th posterior draw.
• post$yhat.test is a similar matrix corresponding to x.test if provided.

The columns of post$yhat.train and post$yhat.test represent different covariate settings


and the rows, the M draws from the posterior.
Often it is impractical to provide x.test in the call to wbart/gbart due to the large number
of predictions considered, or all of the settings to be evaluated are not known at that time.
To allow for this common problem, the BART package returns the trees encoded in an ASCII
string, treedraws$trees, and provides a predict function to generate any predictions needed
(more details on trees, and their string representation, can be found in Appendix
Section B). Note that if you need to perform the prediction in some later R instance, then you
can save the ‘wbart’ object returned and reload it when needed, e.g., save with saveRDS(post,
"post.rds") and reload, post <- readRDS("post.rds"). The x.test input can be a matrix
or a data frame; for simplicity, we assume a matrix below.
For serial computation

R> pred <- predict(post, x.test)

For parallel computation

R> pred <- predict(post, x.test, mc.cores = B)

The data inputs, as shown above, are as follows.

• post is an object of type ‘wbart’.


 
• x.test is a matrix or a data frame with rows x_1, x_2, . . . , x_Q, i.e., made up of the x_h as row vectors.

The value returned, pred as shown above, is as follows.

• pred is an M × Q matrix of predictions with entries ŷ_hm = µ0 + f_m(x_h).

2.1. Posterior samples returned


The number of MCMC samples discarded for burn-in is specified by the nskip argument
and the default is 100. The number of MCMC samples returned is specified by the ndpost
argument and the default is 1000. Returning every l-th value, or thinning, can be specified by
the keepevery argument which defaults to 1, i.e., no thinning. Some, but not all, returned
values can be thinned. The following arguments are available with wbart and default to
ndpost, but can be over-ridden as needed (with gbart, ndpost draws are always returned
and can’t be over-ridden).

• nkeeptrain: Number of f draws to return corresponding to x.train.



• nkeeptest: Number of f draws to return corresponding to x.test.

• nkeeptestmean: Number of f draws to use in computing yhat.test.mean.

• nkeeptreedraws: Number of tree ensemble draws to return for use with predict.

Members of the object returned (which is essentially a list) include varprob and varcount which correspond to the variable selection probabilities and the observed counts of each covariate in the ensemble of trees. When sparse = TRUE, varprob is the random variable selection probability, s_j; otherwise, it is the fixed constant s_j = P^{-1} where P is the number of covariates. Besides the posterior samples, the posterior means are also provided as varprob.mean and varcount.mean.
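As a quick illustration (a sketch, not from the package itself), these counts can be converted into per-covariate usage proportions; here post is assumed to be a fit returned by one of the calls above.

R> prop <- post$varcount / rowSums(post$varcount)
R> sort(colMeans(prop), decreasing = TRUE)

The first line computes, for each posterior draw, the proportion of branch decision rules that involve each covariate; the second averages these proportions over the posterior and sorts them, which can be compared with the returned varcount.mean and varprob.mean.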

3. The Boston housing values example


Now, let us examine the classic Boston housing values example (Harrison Jr and Rubinfeld
1978). This data is from the 1970 US Census where each observation represents a Census tract
in the Boston Standard Metropolitan Statistical Area. For each tract, there was a localized air
pollution estimate, the concentration of nitrogen oxides, based on a meteorological model that
was calibrated to monitoring data. Restricted to tracts with owner-occupied homes, there
are 506 observations. We’ll predict the median value of owner-occupied homes (in thousands
of dollars truncated at 50), y = medv, from two covariates: rm and lstat. rm is the number
of rooms defined as the average number of rooms for owner-occupied homes. lstat is the
percent of population that is lower status defined as the average of the proportion of adults
without any high school education and the proportion of male workers classified as laborers.
Below, we present several observations of the data and scatter plots in Figure 1.

R> library("MASS")
R> x <- Boston[, c(6, 13)]
R> y <- Boston$medv
R> head(cbind(x, y))

rm lstat y
1 6.575 4.98 24.0
2 6.421 9.14 21.6
3 7.185 4.03 34.7
4 6.998 2.94 33.4
5 7.147 5.33 36.2
6 6.430 5.21 28.7

R> par(mfrow = c(2, 2))


R> plot(x[, 1], y, xlab = "x1=rm", ylab = "y=medv")
R> plot(x[, 2], y, xlab = "x2=lstat", ylab = "y=medv")
R> plot(x[, 1], x[, 2], xlab = "x1=rm", ylab = "x2=lstat")
R> par(mfrow = c(1, 1))
Figure 1: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
y = medv, the average number of rooms, x1 = rm, and the percent of the population that is
lower status, x2 = lstat. Here, we show scatter plots of the data.

3.1. wbart for continuous outcomes


In this example, we fit the following BART model for continuous outcomes:
y_i = µ0 + f(x_i) + ε_i where ε_i ∼ N(0, σ²)
(f, σ²) ∼ BART prior
with i indexing subjects; i = 1, . . . , N . We use Markov chain Monte Carlo (MCMC) to get
draws from the posterior distribution of the parameter (f, σ 2 ).

R> library("BART")
R> set.seed(99)
R> nd <- 200
R> burn <- 50
R> post <- wbart(x, y, nskip = burn, ndpost = nd)

*****Into main of wbart


*****Data:
data:n,p,np: 506, 2, 0
y1,yn: 1.467194, -10.632806
x1,x[n*p]: 6.575000, 7.880000
*****Number of Trees: 200
*****Number of Cut Points: 100 ... 100
*****burn and ndpost: 50, 200
*****Prior:beta,alpha,tau,nu,lambda: 2.00000,0.95000,0.79549,3.00000,5.97902
*****sigma: 5.540257
*****w (weights): 1.000000 ... 1.000000
*****Dirichlet:sparse,a,b,rho,augment: 0,0.5,1,2,0
*****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 200,200,200,200
*****printevery: 100
*****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1

MCMC
done 0 (out of 250)
done 100 (out of 250)
done 200 (out of 250)
time: 1s
check counts
trcnt,tecnt,temecnt,treedrawscnt: 200,0,0,200

3.2. Results returned from wbart


We returned the results of running wbart in the object post of type ‘wbart’ which is essentially
a list.

R> names(post)

[1] "sigma" "yhat.train.mean" "yhat.train" "yhat.test.mean"


[5] "yhat.test" "varcount" "varprob" "treedraws"
[9] "mu" "varcount.mean" "varprob.mean" "rm.const"

R> length(post$sigma)

[1] 250

R> length(post$yhat.train.mean)

[1] 506

R> dim(post$yhat.train)

[1] 200 506



Remember, the training data has n = 506 observations; we used burn = 50 burn-in draws (which are discarded) and kept nd = M = 200 draws. Let us look at a couple of the key list components.

• $sigma: All 250 draws of σ are returned, i.e., the 50 burn-in draws plus the 200 kept draws; burn-in draws are kept only for this parameter.

• $yhat.train: The m-th row and i-th column is fm (xi ) (the m-th kept MCMC draw
evaluated at the i-th training observation).

• $yhat.train.mean: The posterior estimate of f(x_i), i.e., M^{-1} Σ_m f_m(x_i).
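As a quick sanity check (a sketch), this posterior estimate should match, up to numerical tolerance, the column means of the kept draws:

R> all.equal(post$yhat.train.mean, colMeans(post$yhat.train))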

3.3. Assessing convergence with wbart


As with any high-dimensional MCMC, assessing convergence may be non-trivial. Posterior
convergence diagnostics are recommended for BART especially with large data sets and/or
a large number of covariates. Besides diagnostics, routine counter-measures such as longer
chains, thinning and multiple chains may be warranted. For continuous outcomes, the simplest thing to look at is the draws of σ. See Section 4.5 for a primer on other convergence
diagnostic options for binary and categorical outcomes that are also applicable for continuous
outcomes.
For assessing convergence in this example, note that σ is the only identified parameter in the model and, of course, it is indicative of the size of the errors.

R> plot(post$sigma, type = "l")


R> abline(v = burn, lwd = 2, col = "red")

In Figure 2, you can see that BART burned in very quickly: just one initial draw looks a bit bigger than the rest, and the subsequent variation appears to be legitimate posterior variation. In a more difficult problem, you may see the σ draws initially declining as the MCMC searches for a good fit.
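Beyond the trace plot, the auto-correlation of the post-burn-in σ draws is another simple check along the same lines (a sketch):

R> acf(post$sigma[-(1:burn)])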

3.4. wbart and linear regression compared


Let us look at the in-sample BART fit (yhat.train.mean) and compare it to y = medv fits
from a multiple linear regression.

R> lmf <- lm(y~., data.frame(x, y))


R> fitmat <- cbind(y, post$yhat.train.mean, lmf$fitted.values)
R> colnames(fitmat) <- c("y", "BART", "Linear")
R> cor(fitmat)

y BART Linear
y 1.0000000 0.9051200 0.7991005
BART 0.9051200 1.0000000 0.8978003
Linear 0.7991005 0.8978003 1.0000000

R> pairs(fitmat)

Figure 2: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
y = medv, the average number of rooms, x1 = rm, and the percent of the population that is
lower status, x2 = lstat. With BART, we predict y = medv from rm and lstat. Here, we
show a trace plot of the error standard deviation, σ, which demonstrates convergence for BART rather
quickly, i.e., by 50 iterations or earlier.

In Figure 3, we present scatter plots between medv, the BART fit and the multiple linear
regression. The BART fit is noticeably different from the linear fit.

3.5. Prediction and uncertainty with wbart


In Figure 4, we order the observations by the fitted house value (yhat.train.mean) and then
use boxplots to display the draws of f (x) in each column of yhat.train.

R> i <- order(post$yhat.train.mean)


R> boxplot(post$yhat.train[, i])

There is substantial predictive uncertainty, but you can still be fairly certain that some houses should cost more than others.
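The same draws yield interval estimates directly; for example, a 95% posterior interval for f(x) at each training observation (a sketch):

R> ci <- apply(post$yhat.train, 2, quantile, probs = c(0.025, 0.975))

Each column of ci gives the lower and upper limits for one Census tract.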

Figure 3: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
y = medv, the average number of rooms, x1 = rm, and the percent of the population that
is lower status, x2 = lstat. With BART, we predict y = medv from rm and lstat. Here,
we show scatter plots comparing y = medv, the BART fit (“BART”) and multiple linear
regression (“Linear”).

3.6. Using the predict function with wbart


We can get out-of-sample predictions in two ways. First, we can just ask for them when
we call wbart by supplying a matrix or data frame of test x values. Second, we can call a
predict method. Now, let us split our data into train and test subsets.

R> n <- length(y)


R> set.seed(14)
R> i <- sample(1:n, floor(0.75 * n))
R> x.train <- x[i, ]; y.train <- y[i]
R> x.test <- x[-i, ]; y.test <- y[-i]
R> cat("training sample size = ", length(y.train), "\n")

Figure 4: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
medv, the average number of rooms, rm, and the percent of the population that is lower status,
lstat. With BART, we predict y = medv from rm and lstat. Here, we show boxplots of the
posterior samples of predictions (on the y-axis) ordered by the average predicted home value
per tract (on the x-axis).

R> cat("testing sample size = ", length(y.test), "\n")

training sample size = 379


testing sample size = 127

And now we can run wbart using the training data to learn and predict at x.test. First,
we’ll just pass x.test to the wbart call.

R> set.seed(99)
R> post1 <- wbart(x.train, y.train, x.test)
R> dim(post1$yhat.test)

[1] 1000 127



R> length(post1$yhat.test.mean)

[1] 127

The testing data is handled similarly to the training data.

• $yhat.test: the m-th row and h-th column is fm (xh ) (the m-th kept MCMC draw
evaluated at the h-th testing observation).

• $yhat.test.mean: the posterior estimate of f(x_h), i.e., M^{-1} Σ_m f_m(x_h), the average over the M kept draws.

Alternatively, we could run wbart saving all the MCMC results and then call predict.

R> set.seed(99)
R> post2 <- wbart(x.train, y.train)
R> yhat <- predict(post2, x.test)

*****In main of C++ for bart prediction


tc (threadcount): 1
number of bart draws: 1000
number of trees in bart sum: 200
number of x columns: 2
from x,np,p: 2, 127
***using serial code

R> dim(yhat)

[1] 1000 127

R> summary(as.double(yhat - post1$yhat.test))

Min. 1st Qu. Median Mean 3rd Qu. Max.


-9.091e-09 -1.186e-09 2.484e-11 2.288e-12 1.188e-09 6.790e-09

So yhat and post1$yhat.test are practically identical.

3.7. wbart and thinning


In our simple example of the Boston housing data, wbart runs pretty fast. But with more
data and/or longer runs, you may want to speed things up by saving fewer samples and then
using predict. Let us just keep a thinned subset of 200 tree ensemble draws.

R> set.seed(4)
R> post3 <- wbart(x.train, y.train, nskip = 1000, ndpost = 10000,
+ nkeeptrain = 0, nkeeptest = 0, nkeeptestmean = 0,
+ nkeeptreedraws = 200)
R> yhatthin <- predict(post3, x.test)

*****In main of C++ for bart prediction


tc (threadcount): 1
number of bart draws: 200
number of trees in bart sum: 200
number of x columns: 2
from x,np,p: 2, 127
***using serial code

R> dim(post3$yhat.train)

[1] 0 379

R> dim(yhatthin)

[1] 200 127

Now, there are no kept draws of f (x) for training x, and we have 200 tree ensemble draws to
use with predict. Of course, if we keep 200 out of 10000, then every 50th draw is kept.
The default values are to keep all the draws (e.g., nkeeptrain = ndpost). Now, let us have
a look at the predictions.

R> fmat <- cbind(y.test, post1$yhat.test.mean, apply(yhatthin, 2, mean))


R> colnames(fmat) <- c("y", "yhat", "yhatThin")
R> pairs(fmat)

In Figure 5, we present scatter plots between medv, “yhat” and “yhatThin”. Recall, the
predictions labeled “yhat” are from a BART run with seed = 99 and all default values. The
predictions labeled “yhatThin” are thinned by 50 (after 1000 burnin discarded, 200 kept out
of 10000 draws) with seed = 4. It is very interesting how similar they are!

3.8. wbart and Friedman’s partial dependence function


BART does not directly provide a summary of the effect of a single covariate, or a subset of
covariates, on the outcome. This is also the case for black-box, or nonparametric regression,
models in general that need to deal with this same issue. Developed for such complex models,
Friedman’s partial dependence function (Friedman 2001) can be employed with BART to
summarize the marginal effect due to a subset of the covariates. Friedman’s partial depen-
dence function is a concept that is very flexible. So flexible that we are unable to provide
abstract functional support in the BART package; rather, we provide examples of the many
practical uses in the demo directory.
We use S to denote the indices of the covariates in the subset and the collection itself, i.e., de-
fine the row vector for test setting h as xhS = [xhj ] where j ∈ S. Similarly, we denote the
complement of the subset as C with S ∪C spanning all covariates. The complement row vector
for training observation i is xiC = [xij ] where j ∈ C. The marginal dependence function is
defined by fixing the subset at a test setting while aggregating over the training observations
of the complement covariates: f(x_hS) = N^{-1} Σ_{i=1}^{N} f(x_hS, x_iC). Other marginal functions

Figure 5: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at
50), medv, the average number of rooms, rm, and the percent of the population that is lower
status, lstat. With BART, we predict y = medv by rm and lstat. The predictions labeled
“yhat” are from a BART run with seed = 99 and all default values. The predictions labeled
“yhatThin” are thinned by 50 (after 1000 burnin discarded, 200 kept out of 10000 draws)
with seed = 4. It is very interesting how similar they are!

can be obtained in a similar fashion. Estimates can be derived via functions of the posterior samples such as means, quantiles, etc., e.g., f̂(x_hS) = M^{-1} N^{-1} Σ_{m=1}^{M} Σ_{i=1}^{N} f_m(x_hS, x_iC), where m indexes posterior samples. However, care must be taken in the interpretation of the
marginal effect as estimated by Friedman’s partial dependence function. If there are strong
relationships among the covariates, it may be unrealistic to assume that individual covariates
can be manipulated independently.
For example, suppose that we want to summarize the median home value, medv (variable
14 of the Boston data frame), by the percent of the population with lower status, lstat
(variable 13), while aggregating over the other twelve covariates in the Boston housing data.
In Figure 6, we demonstrate the marginal estimate and its 95% credible interval.

Figure 6: The Boston housing data was compiled from the 1970 US Census where each
observation represents a Census tract in Boston with owner-occupied homes. For each tract,
we have the median value of owner-occupied homes (in thousands of dollars truncated at 50),
medv, and the percent of the population that is lower status, lstat, along with eleven other
covariates. We summarize the marginal effect of lstat on medv while aggregating over the
other covariates with Friedman’s partial dependence function. The marginal estimate and its
95% credible interval are shown.

R> x.train <- as.matrix(Boston[i, -14])


R> set.seed(12)
R> post4 <- wbart(x.train, y.train)
R> H <- length(y.train)
R> L <- 41
R> x <- seq(min(x.train[, 13]), max(x.train[, 13]), length.out = L)
R> x.test <- cbind(x.train[, -13], x[1])
R> for(j in 2:L)
+ x.test <- rbind(x.test, cbind(x.train[, -13], x[j]))
R> pred <- predict(post4, x.test)
R> partial <- matrix(nrow = 1000, ncol = L)
R> for(j in 1:L) {
+     h <- (j - 1) * H + 1:H
+     partial[, j] <- apply(pred[, h], 1, mean)
+ }
R> plot(x, apply(partial, 2, mean), type = "l", ylim = c(10, 50),
+     xlab = "lstat", ylab = "medv")

R> lines(x, apply(partial, 2, quantile, probs = 0.025), lty = 2)


R> lines(x, apply(partial, 2, quantile, probs = 0.975), lty = 2)

Besides the marginal effect, we can define the conditional effect of x1 given x2 as [f(x1 + δ, x2) − f(x1, x2)] / δ. However, BART is not fitting simple linear functions. For example, suppose the data follow a sufficiently complex function such as f(x1, x2) = b1 x1 + b2 x1² + b3 x1 x2. Then the conditional effect that BART is likely to fit is approximately b1 + 2 b2 x1 + b2 δ + b3 x2. This function is not as easy to characterize as the marginal effect since it involves x1, x2 and δ. Nevertheless, these functions can be estimated by BART if these inputs are provided. But these functions have the same limitations as Friedman's partial dependence function and, perhaps, even more so.
See the conditional effect example at the end of demo("boston.R", package = "BART").
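For instance, a conditional effect of lstat can be approximated by shifting that covariate by a hypothetical δ and differencing the predictions; the following is only a sketch (the demo's actual code may differ).

R> delta <- 1
R> x.shift <- x.train
R> x.shift[, "lstat"] <- x.shift[, "lstat"] + delta
R> cond <- (predict(post4, x.shift) - predict(post4, x.train)) / delta

Each element of cond is a posterior draw of [f(x1 + δ, x2) − f(x1, x2)] / δ at one training observation, with x1 = lstat and x2 the remaining covariates.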

4. Binary and categorical outcomes with BART


The BART package supports binary outcomes via probit BART with normal latents and logit
BART with logistic latents. Categorical outcomes are supported with multinomial BART
which defaults to probit for computational efficiency, but logit is available as an option.
Convergence diagnostics and variable selection are provided as well.

4.1. Probit BART for binary outcomes


Probit BART for binary outcomes is provided by the BART package as the pbart and gbart
functions. In this case, the outcome, y.train, is an integer with values of 0 or 1. The model
is as follows with i indexing subjects: i = 1, . . . , N .
y_i | p_i ∼ B(p_i) independently, where B(·) is the Bernoulli distribution
p_i = Φ(µ0 + f(x_i)) where f ∼ BART prior and Φ(·) is the standard normal cdf
This setup leads to the following likelihood: [y | f] = Π_{i=1}^{N} p_i^{y_i} (1 − p_i)^{1−y_i}.

To extend BART to binary outcomes, we employ the technique of Albert and Chib (1993)
that assumes there is an unobserved latent, zi , where yi = I (zi > 0) and i = 1, . . . , n indexes
subjects. Given yi , we generate the truncated normal latents, zi ; these auxiliary latents are
efficiently sampled (Robert 1995) and recast as the outcome for a continuous BART with unit
variance as follows.

z_i | y_i, f ∼ N(µ0 + f(x_i), 1) truncated to (−∞, 0) if y_i = 0 and to (0, ∞) if y_i = 1.

Centering the latent zi around the constant µ0 is analogous to quasi-centering the probabili-
ties, pi , at p0 = Φ(µ0 ), i.e., E [pi ] is approximately equal to p0 which is all that is necessary
for inference to be performed. The default value of µ0 is Φ−1 (ȳ) (which you can over-ride
with the binaryOffset argument).
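In R terms, the default is simply (a one-line sketch of Φ^{-1}(ȳ)):

R> mu0 <- qnorm(mean(y.train))

which can be over-ridden by passing a different value via the binaryOffset argument.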
The pbart (mc.pbart) and gbart (mc.gbart) functions are for serial (parallel) computation.
The outcome y.train is a vector containing zeros and ones. The covariates for training
(validation, if any) are x.train (x.test) which can be matrices or data frames containing
factors; in the display below, we assume matrices for simplicity. Notation: M for the number

of posterior samples, B for the number of threads (generally, B = 1 for Windows), N for the
number of observations in the training set, and Q for the number of observations in the test
set.

set.seed(99)
post <- pbart(x.train, y.train, x.test, ndpost = M)
post <- mc.pbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
seed = 99)
post <- gbart(x.train, y.train, x.test, type = "pbart", ndpost = M)
post <- mc.gbart(x.train, y.train, x.test, type = "pbart", ndpost = M,
seed = 99)

N.B. for pbart, the thinning argument, keepevery defaults to 1 while for gbart with
type = "pbart", keepevery defaults to 10.
The data inputs, as shown above, are as follows.
 
• x.train is a matrix or a data frame of covariates for training with rows x_1, x_2, . . . , x_N, where the x_i are row vectors.

• y.train is a vector of the outcome for training.

• x.test (optional) is a matrix or a data frame of covariates for testing.

The return value, post as shown above, is of type ‘pbart’ that is essentially a list of named items; of particular interest are post$prob.train and post$prob.test. As with a continuous outcome, the columns of post$yhat.train and post$yhat.test represent different covariate settings and the rows, the M draws from the posterior. However, post$prob.train and post$prob.test (when requested) are generally of more interest, along with post$prob.train.mean and post$prob.test.mean which are the means of the posterior sample columns (not shown).

• post$prob.train is an M × N matrix of probabilities with entries p̂_im = Φ(µ0 + f_m(x_i)), where row m corresponds to the m-th posterior draw.

• post$prob.test is a similar matrix of probabilities corresponding to x.test if provided.

Often it is impractical to provide x.test in the call to pbart due to the number of predictions
considered or all the settings to evaluate are simply not known at that time. To allow for
this common problem, the BART package returns the trees encoded in an ASCII string,
treedraws$trees, and provides a predict function to generate any predictions needed.
Note that if you need to perform the prediction in some later R instance, then you can
save the ‘pbart’ object returned and reload it when needed, e.g., save with saveRDS(post,
"post.rds") and reload, post <- readRDS("post.rds") .

R> pred <- predict(post, x.test, mc.cores = B)

The data input, x.test as shown above, is as follows.


 
• x.test is a matrix or data frame of covariates with rows x_1, x_2, . . . , x_Q, where the x_h are row vectors.

The returned value, pred as shown above, is of type ‘pbart’ that is essentially a list with the
following named components.

• pred$yhat.test is an M × Q matrix of predictions with entries ŷ_hm = µ0 + f_m(x_h).

• pred$prob.test is an M × Q matrix of probabilities with entries p̂_hm = Φ(ŷ_hm).

• pred$prob.test.mean is a vector of probabilities [p̂_1, . . . , p̂_Q] where p̂_h = M^{-1} Σ_{m=1}^{M} p̂_hm.
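Since p̂_hm = Φ(ŷ_hm), the two matrices returned by predict are related by the normal cdf; a quick check (a sketch):

R> all.equal(pred$prob.test, pnorm(pred$yhat.test))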

4.2. Probit BART and Friedman’s partial dependence function


For an overview of Friedman’s partial dependence function (including the notation adopted
in this article and its meaning), please see Section 3.8 which discusses continuous outcomes.
For probit BART, the f function is not directly of interest; rather, the probability of an event is more interpretable: p(x_hS) = N^{-1} Σ_{i=1}^{N} Φ(µ0 + f(x_hS, x_iC)).

Probit BART example: Chronic pain and obesity


We want to explore the hypothesis that obesity is a risk factor for chronic lower-back pain
(which includes buttock pain in this definition). A corollary to this hypothesis is that obesity is
not considered to be a risk factor for chronic neck pain. A good source of data for this question
is available in the National Health and Nutrition Examination Survey (NHANES) 2009–2010
Arthritis Questionnaire. 5106 subjects were surveyed. We will use probit BART to analyze
the dichotomous outcomes of chronic lower-back pain and chronic neck pain. We restrict our
attention to the following covariates: age, gender and anthropometric measurements including
weight (kg), height (cm), body mass index (kg/m2 ) and waist circumference (cm). Also, note
that survey sampling weights are available to extrapolate the rates from the survey to the
US population as a whole. We will concentrate on body mass index (BMI) and gender, xhS ,
while utilizing Friedman’s partial dependence function as defined above and incorporating
the survey weights, i.e., p(x_hS) = Σ_{i=1}^{N} w_i Φ(µ0 + f(x_hS, x_iC)) / Σ_{i'=1}^{N} w_{i'}.

Figure 7: NHANES, BMI and the probability of chronic pain: the left panel for lower-back
pain and the right panel for neck pain. The unweighted Friedman’s partial dependence rela-
tionship between chronic pain, BMI and gender are displayed as ascertained from NHANES
data: males (females) are represented by blue (red) lines with the corresponding 95% credible
intervals (dashed lines). We want to explore the hypothesis that obesity is a risk factor for
chronic lower-back pain (which includes buttock pain in this definition). A corollary to this
hypothesis is that obesity is not considered to be a risk factor for chronic neck pain. Although
there is a generous amount of uncertainty, it does not appear that the probability of chronic
lower-back pain increases with BMI for either gender. Conversely, chronic neck pain does
appear to be rising, yet again, the intervals are wide. In both cases, these findings are not
anticipated.

The BART package provides two examples for the relationship between chronic pain and BMI:
demo("nhanes.pbart1", package = "BART"), probabilities; and demo("nhanes.pbart2",
package = "BART"), differences in probabilities. In Figure 7, the left panel for lower-back
pain and the right panel for neck pain, the unweighted relationship between chronic pain,
BMI and gender are displayed: males (females) are represented by blue (red) solid lines
with corresponding 95% credible intervals in dashed lines. Although there is a generous
amount of uncertainty, it does not appear that the probability of chronic lower-back pain

Figure 8: NHANES, BMI and the probability of chronic pain for females only: the left panel
for lower-back pain and the right panel for neck pain. The unweighted Friedman’s partial
dependence relationship between chronic pain and BMI are displayed as ascertained from
NHANES data for females only: lower-back (blue) and neck pain (red) are presented with
the corresponding 95% credible intervals (dashed lines). The difference in probability of
chronic pain from a baseline BMI of 25 (which is the upper limit of normal) is presented,
i.e., p(x) − p(25). We want to explore the hypothesis that obesity is a risk factor for chronic
lower-back pain (which includes buttock pain in this definition). A corollary to this hypothesis
is that obesity is not considered to be a risk factor for chronic neck pain. Although there is a
generous amount of uncertainty, it does not appear that the probability of chronic lower-back
pain increases with BMI. Conversely, chronic neck pain does appear to be rising, yet again,
the intervals are wide. In both cases, these findings are not anticipated.

increases with BMI for either gender. Conversely, chronic neck pain does appear to be rising,
yet again, the intervals are wide. In both cases, these findings are not anticipated given
the original hypotheses. Based on survey weights (not shown), the results are basically the
same. In Figure 8, we display the unweighted relationship for females between BMI and the difference in probability of chronic pain from a baseline BMI of 25 (which is the upper limit of normal) with corresponding 95% credible intervals in dashed lines: the left panel for lower-back pain

(blue solid lines) and the right panel for neck pain (red solid lines). Again, we have roughly
the same impression, i.e., there is no increase of lower-back chronic pain with BMI and it is
possibly dropping while neck pain might be increasing, but the intervals are wide for both.
The results are basically the same for males (not shown).

4.3. Logit BART for binary outcomes


Assuming a normal distribution of the unobserved latent, zi where yi = I (zi > 0), provides
some challenges when estimating very small or very large probabilities, pi , since the normal
distribution has relatively thin tails. This restriction can be relaxed by assuming the latents
follow the logistic distribution which has heavier tails. For logistic latents, we employ a
variant of the Holmes and Held (2006) technique by Gramacy and Polson (2012) to create
what we call logit BART. However, it is important to recognize that logit BART is more
computationally intensive than probit BART.
The outcome, y.train, is provided as an integer with values 0 or 1. Logit BART is provided
by the lbart and gbart functions. Unlike probit BART, where the auxiliary latents, z_i, have unit variance (σ² = 1), with logit BART we sample truncated normal latents, z_i, with a variance σ_i² by the Robert (1995) technique. If σ_i² = 4ψ_i² where ψ_i is sampled from the Kolmogorov-Smirnov distribution, then the z_i follow the logistic distribution. Sampling from the Kolmogorov-Smirnov distribution is described by Devroye (1986). So, the conditionally normal latents, z_i | σ_i², are the outcomes for a continuous BART with a given heteroskedastic variance, σ_i².
The zi are centered around a known constant, µ0 , which is analogous to quasi-centering the
probabilities, pi , around p0 = F (µ0 ) where F is the standard logistic distribution function.
The default value of µ0 is F −1 (ȳ) (which you can over-ride with the binaryOffset argument to
lbart or the offset argument to gbart). Therefore, the probabilities are pi = F (µ0 +f (xi )).
The input and output for lbart is essentially identical to pbart. Also, the predict function
for objects of type ‘lbart’ is analogous. The gbart function performs logit BART when
passed the type = "lbart" argument.
N.B. for lbart, the thinning argument, keepevery defaults to 1 while for gbart with type
= "lbart", keepevery defaults to 10.
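Since the interface mirrors pbart, the corresponding calls are as follows (a sketch following the display in Section 4.1; mc.lbart is assumed to parallel mc.pbart):

set.seed(99)
post <- lbart(x.train, y.train, x.test, ndpost = M)
post <- mc.lbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
    seed = 99)
post <- gbart(x.train, y.train, x.test, type = "lbart", ndpost = M)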

4.4. Multinomial BART for categorical outcomes


Several strategies for analyzing categorical outcomes have been proposed from the Bayesian
perspective (Albert and Chib 1993; McCulloch and Rossi 1994; McCulloch, Polson, and Rossi
2000; Imai and Van Dyk 2005; Frühwirth-Schnatter and Frühwirth 2010; Scott 2011) includ-
ing two BART implementations (Kindo, Wang, and Peña 2016; Murray 2020): our BART
implementations differ from these; although, since we are working on the same problem, there
are some similarities. Generally, the literature has taken a logit approach. Due to the relative
computational efficiency, we prefer probit to logit (although, logit is available as an option).
To extend BART to categorical outcomes, we have created two approaches to what we call
Multinomial BART. The first approach works well when there are relatively few categories
while the second is preferable otherwise.

Multinomial BART and conditional probability: mbart


In the first approach, we fit a novel sequence of binary BART models that bears some resem-
blance to continuation-ratio logits (Agresti 2003). Let us assume that we have K categories
where each are represented by mutually exclusive binary indicators: yi1 , . . . , yiK for subjects
indexed by i = 1, . . . , N . We denote the probability of these outcome indicators via condi-
tional probabilities, pij , where j = 1, . . . , K as follows.

pi1 = P [yi1 = 1]
pi2 = P [yi2 = 1 | yi1 = 0]
pi3 = P [yi3 = 1 | yi1 = yi2 = 0]
..
.
pi,K−1 = P [yi,K−1 = 1 | yi1 = · · · = yi,K−2 = 0]
piK = P [yi,K−1 = 0 | yi1 = · · · = yi,K−2 = 0]

Notice that piK = 1 − pi,K−1 so we can specify the K conditional probabilities via K − 1
parameters. Furthermore, these conditional probabilities are, by construction, defined for
subsets of subjects: let S1 = {1, . . . , N } and Sj = {i : yi1 = · · · = yi,j−1 = 0} where j =
2, . . . , K − 1. Now, the unconditional probability of these outcome indicators, πij , can be
defined in terms of the conditional probabilities and their complements, qij = 1 − pij , for all
subjects.

πi1 = P [yi1 = 1] = pi1


πi2 = P [yi2 = 1] = pi2 qi1
πi3 = P [yi3 = 1] = pi3 qi2 qi1
..
.
πi,K−1 = P [yi,K−1 = 1] = pi,K−1 qi,K−2 · · · qi1
πiK = P [yiK = 1] = qi,K−1 qi,K−2 · · · qi1

N.B. the conditional probability construction of π_ij ensures that Σ_{j=1}^{K} π_ij = 1.
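The mapping from the conditional probabilities p_ij to the unconditional π_ij can be written compactly; here is a small sketch (the helper name is ours, not the package's):

cond2uncond <- function(p) {   # p = (p_1, ..., p_{K-1})
    q <- 1 - p
    c(p, 1) * c(1, cumprod(q)) # returns (pi_1, ..., pi_K), which sums to 1
}

For example, cond2uncond(c(0.5, 0.4)) returns 0.5, 0.2 and 0.3.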
Our modeling of these conditional probabilities based on a vector of covariates xi is what we
call Multinomial BART:

y_ij | p_ij ∼ B(p_ij) where i ∈ S_j and j = 1, . . . , K − 1
p_ij = Φ(µ_j + f_j(x_i))
f_j ∼ BART prior
with i indexing subjects, i = 1, . . . , N; and the default value of µ_j = Φ^{-1}(Σ_i y_ij / Σ_i I(i ∈ S_j)). This formulation yields the Multinomial likelihood: [y | f_1, . . . , f_{K−1}] = Π_{i=1}^{N} Π_{j=1}^{K} π_ij^{y_ij}.

This approach is provided by the BART package as the mbart function. The input for mbart is
essentially identical to gbart, but the output is slightly different. For example, due to the way
the model is estimated, the prediction for x.train is not available; therefore, to request it set
the argument x.test = x.train. By default, probit BART is employed for computational

efficiency, but logit BART can be specified with the argument type = "lbart". Notation:
M for the number of posterior samples, B for the number of threads (generally, B = 1 for
Windows), N for the number of observations in the training set, and Q for the number of
observations in the test set.

set.seed(99)
post <- mbart(x.train, y.train, x.test, ndpost = M)
post <- mc.mbart(x.train, y.train, x.test, ndpost = M, mc.cores = B,
seed = 99)

The data inputs, as shown above, are as follows.

• x.train is a matrix or data frame of covariates for training.

• y.train is a vector of the outcome for training.


 
• x.test (optional) is a matrix or data frame of covariates for testing, made up of row vectors x_1, x_2, . . . , x_Q.

The returned value, post as shown above, is of type ‘mbart’ that is essentially a list with
named components, particularly, post$prob.test.

• post$prob.test is an M × (K · Q) matrix of probabilities with entries π̂_hjm, the m-th posterior draw of the probability of category j at test setting h; row m contains π̂_{11m}, . . . , π̂_{1Km}, . . . , π̂_{Q1m}, . . . , π̂_{QKm}.

The columns of post$prob.test represent different covariate settings crossed with the K
categories. The predict function for objects of type ‘mbart’ is analogous.
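Given the column ordering just described, post$prob.test can be reshaped into an array for per-category summaries (a sketch; M, K and Q as defined above):

R> probs <- array(post$prob.test, dim = c(M, K, Q))
R> apply(probs, c(2, 3), mean)

Here probs[m, j, h] is the m-th draw of the probability of category j at test setting h, and the apply call returns the K × Q matrix of posterior mean probabilities.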

Multinomial BART and the logit transformation: mbart2


The second approach is inspired by the logit transformation and is provided by the mbart2
function which has a similar calling convention to mbart described above. Furthermore, as
we shall see, the computationally friendly probit is even applicable in this instance. Here,
yi is categorical, i.e., yi ∈ {1, . . . , K} (technically, the mbart2 function does not require the
categories to be 1, . . . , K; it only requires that there are K distinct categories). Now, we have
the following framework motivated by the logit transformation.

P[y_i = j] = exp(µ_j + f_j(x_i)) / Σ_{j'=1}^{K} exp(µ_{j'} + f_{j'}(x_i)) = π_ij
where f_j ∼ BART prior, j = 1, . . . , K.

Suppose for the moment, the centering parameters, µj , are defined as in logit BART.

It would appear that this definition lacks identifiability since π_ij = exp(µ_j + f_j(x_i)) / Σ_{j'=1}^{K} exp(µ_{j'} + f_{j'}(x_i)) = exp(µ_j + f_j(x_i) + c) / Σ_{j'=1}^{K} exp(µ_{j'} + f_{j'}(x_i) + c) for any constant c. Identifiability of the f_j could be restored by setting a single BART function to zero, i.e., f_{j'}(x_i) = 0 for some fixed j'. However, this is really unnecessary since π_ij is identified regardless.
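A tiny numerical illustration of this invariance (a sketch):

R> softmax <- function(f) exp(f) / sum(exp(f))
R> f <- c(0.2, -1.0, 0.5)
R> all.equal(softmax(f), softmax(f + 3))

Shifting every f_j by the same constant leaves the probabilities unchanged, which is why no further identification constraint is needed.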
Computationally, this inference can be performed via a series of binary BARTs. This can
be shown by following the work of Holmes and Held (2006): define P [yi = c] ∝ exp fc (xi ).
Consider two cases: P [yi = c] and P [yi = j] where j 6= c. The first case gives us the following
in terms of fc .

P[y_i = c] = exp f_c(x_i) / (exp f_c(x_i) + Σ_{k≠c} exp f_k(x_i))
           = exp f_c(x_i) / (exp f_c(x_i) + exp S),   where S = log Σ_{k≠c} exp f_k(x_i)
           = exp(f_c(x_i) − S) / (exp(f_c(x_i) − S) + 1).

And the second case, where j ≠ c, is as follows in terms of f_c:

P[y_i = j] = exp f_j(x_i) / (exp f_c(x_i) + Σ_{k≠c} exp f_k(x_i))
           ∝ 1 / (exp f_c(x_i) + exp S)
           ∝ 1 / (exp(f_c(x_i) − S) + 1).

Thus, the conditional inference for f_c is equivalent to that for the binary indicator I(y = c). Therefore, mbart2 computes a full series of all K BART functions for binary indicators.
The mbart2 function defaults to type = "lbart", i.e., logistic latents are used to compute
the fj ’s which fits nicely with the logit development of this approach. However, the logistic
latent fitting method can be computationally demanding. Therefore, normal latents can be
specified by type = "pbart". This latter setting would appear to contradict the development
of this approach; but notice that πij is still a probability in this case and, in our experience,
the results produced are often reasonable.

Multinomial BART example: Alligator food preference


We demonstrate the usage of these functions by the American alligator food preference ex-
ample (Delany, Linda, and Moore 1999; Agresti 2003). In 1985, American alligators were
harvested by hunters from August 26 to September 30 in peninsular Florida from lakes Ok-
lawaha (Putnam County), George (Putnam and Volusia counties), Hancock (Polk County)

Figure 9: In 1985, American alligators were harvested by hunters in peninsular Florida from
four lakes. Lake, length and sex were recorded for each alligator. The stomach contents of
219 alligators were classified into five categories based on the primary food preference: bird,
fish, invertebrate, reptile and other. The length of alligators was dichotomized into small,
≤2.3m, vs. large, >2.3m. We estimate the probability of each food preference category for
the marginal effect of size by resorting to Friedman’s partial dependence function (Friedman
2001). The 95% credible intervals are wide, but it appears that large alligators are more likely
to rely on a diet of fish while small alligators are more likely to rely on invertebrates.

and Trafford (Collier County). Lake, length and sex were recorded for each alligator. Stomachs from a sample of alligators 1.09–3.89m long were frozen prior to analysis. After thawing, stomach contents were removed and separated, and food items were identified and tallied. Volumes were determined by water displacement. The stomach contents of 219 alligators were classified into five categories of primary food preference: bird, fish (the most common primary food choice), invertebrate (snails, insects, crayfish, etc.), reptile (turtles, alligators) and other (amphibians, plants, household pets, stones, and other debris). The length of
alligators was dichotomized into small, ≤ 2.3m, vs. large, > 2.3m. We estimate the probabil-
ity of each food preference category for the marginal effect of size by resorting to Friedman’s
partial dependence function (Friedman 2001). We have supplied Figure 9 which summarizes

the BART results generated by the example alligator.R: you can find this demo with the
command demo("alligator", package = "BART"). The mbart function was used since the
number of categories is small. The 95% credible intervals are wide, but it appears that large
alligators are more likely to rely on a diet of fish while small alligators are more likely to rely
on invertebrates. Although the true probabilities are obviously unknown, we compared mbart
to an analysis by a single hidden-layer/feed-forward Neural Network via the nnet R package
(Ripley 2007; Venables and Ripley 2002) and the results were essentially identical (see the
demo for details).

4.5. Convergence diagnostics for binary and categorical outcomes


How do you perform convergence diagnostics for BART? For continuous outcomes, convergence can easily be determined from the trace plots of the error standard deviation, σ. However, for probit and Multinomial BART with normal latents, the error variance is fixed at 1 so this is not an option. Similarly, for logit BART, the σ_i are auxiliary latent variables not suitable for
convergence diagnostics. Therefore, we adapt traditional MCMC diagnostic approaches to
BART. We perform graphical checks via auto-correlation, trace plots and an approach due
to Geweke (1992).
Geweke diagnostics are based on earlier work which characterizes MCMC as a time series
(Hastings 1970). Once this transition is made, auto-regressive, moving-average (ARMA)
process theory is employed (Silverman 1986). Generally, we define our Bayesian estimator as θ̂_M = M^{-1} Σ_{m=1}^{M} θ_m. We represent the asymptotic variance of the estimator by σ²_θ̂ = lim_{M→∞} V[θ̂_M]. If we suppose that θ_m is an ARMA(p, q) process, then the spectral density of the estimator is defined as γ(w) = (2π)^{-1} Σ_{m=−∞}^{∞} V[θ_0, θ_m] e^{imw} where e^{itw} = cos(tw) + i sin(tw). This leads us to an estimator of the asymptotic variance which is σ̂²_θ̂ = γ̂²(0). We divide our chain into two segments, A and B, as follows: m ∈ A = {1, . . . , M_A} where M_A = aM; and m ∈ B = {M − M_B + 1, . . . , M} where M_B = bM. Note that a + b < 1. Geweke suggests a = 0.1, b = 0.5 and recommends the following normal test for convergence.

θ̂_A = M_A^{-1} Σ_{m∈A} θ_m,   θ̂_B = M_B^{-1} Σ_{m∈B} θ_m
σ̂²_θ̂_A = γ̂²_{m∈A}(0),   σ̂²_θ̂_B = γ̂²_{m∈B}(0)
Z_AB = √M (θ̂_A − θ̂_B) / √(a^{-1} σ̂²_θ̂_A + b^{-1} σ̂²_θ̂_B) ∼ N(0, 1)
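To make the statistic concrete, here is a minimal sketch of Z_AB for a single scalar chain, estimating the zero-frequency spectral density with an AR fit; the package's own spectrum0ar and gewekediag helpers, described next, should be preferred in practice.

geweke.z <- function(theta, a = 0.1, b = 0.5) {
    M <- length(theta)
    thetaA <- theta[1:floor(a * M)]
    thetaB <- theta[(M - floor(b * M) + 1):M]
    s0 <- function(x) {               # spectral density at frequency zero via ar()
        fit <- ar(x)
        fit$var.pred / (1 - sum(fit$ar))^2
    }
    (mean(thetaA) - mean(thetaB)) /
        sqrt(s0(thetaA) / length(thetaA) + s0(thetaB) / length(thetaB))
}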

In our BART package, we supply R functions adapted from the coda R package (Plummer,
Best, Cowles, and Vines 2006) to perform Geweke diagnostics: spectrum0ar and gewekediag.
But, how do we apply Geweke’s diagnostic to BART? We can check convergence for any
estimator of the form θ = h(f (x)), but often setting h to the identity function will suffice,
i.e., θ = f (x). However, BART being a Bayesian nonparametric technique means that we
have many potential estimators to check, i.e., essentially one estimator for every possible
choice of x.
We have supplied Figures 10, 11 and 12 generated by the example geweke.pbart2.R:

Figure 10: Geweke convergence diagnostics for probit BART: N = 200. In the upper left
quadrant, we have plotted Friedman’s partial dependence function for f (xi4 ) vs. xi4 for 10
values of xi4 . This is a check that can’t be performed for real data, but it is informative in
this case. Notice that f (xi4 ) vs. xi4 is mainly directly proportional as expected. In the upper
right quadrant, we plot the auto-correlations of f (xi ) for 10 randomly selected xi where i
indexes subjects. Notice that there is very little auto-correlation. In the lower left quadrant,
we display the corresponding trace plots for these same settings. The traces demonstrate
that samples of f (xi ) appear to adequately traverse the sample space. In the lower right
quadrant, we plot the Geweke ZAB statistics for each subject i. Notice that the ZAB exceed
the 95% limits only a handful of times. Based on this figure, we conclude that the chains
have converged.

R> demo("geweke.pbart2", package = "BART")

The data are simulated by Friedman’s five-dimensional test function (Friedman 1991) where
50 covariates are generated as xij ∼ U (0, 1) but only the first 5 covariates have an impact on

Figure 11: Geweke convergence diagnostics for probit BART: N = 1000. In the upper left
quadrant, we have plotted Friedman’s partial dependence function for f (xi4 ) vs. xi4 for 10
values of xi4 . This is a check that can’t be performed for real data, but it is informative
in this case. Notice that f (xi4 ) vs. xi4 is directly proportional as expected. In the upper
right quadrant, we plot the auto-correlations of f (xi ) for 10 randomly selected xi where i
indexes subjects. Notice that there is very little auto-correlation. In the lower left quadrant,
we display the corresponding trace plots for these same settings. The traces demonstrate
that samples of f (xi ) appear to adequately traverse the sample space. In the lower right
quadrant, we plot the Geweke ZAB statistics for each subject i. Notice that there appear
to be a considerable number exceeding the 95% limits. Based on this figure, we conclude
that convergence is questionable. We would suggest that more thinning be employed via the
keepevery argument to pbart; perhaps, keepevery = 50.

the outcome with sample sizes N = 200, 1000, 5000.

\[
f(x_i) = -1.5 + \sin(\pi x_{i1} x_{i2}) + 2 (x_{i3} - 0.5)^2 + x_{i4} + 0.5\, x_{i5} \qquad
z_i \sim N(f(x_i), 1) \qquad y_i = I(z_i > 0)
\]

The convergence for each of these data sets is graphically displayed in Figures 10, 11 and 12
Figure 12: Geweke convergence diagnostics for probit BART: N = 5000. In the upper left
quadrant, we have plotted Friedman’s partial dependence function for f (xi4 ) vs. xi4 for 10
values of xi4 . This is a check that can’t be performed for real data, but it is informative in
this case. Notice that f (xi4 ) vs. xi4 is directly proportional as expected. In the upper right
quadrant, we plot the auto-correlations of f (xi ) for 10 randomly selected xi where i indexes
subjects. Notice that there is some auto-correlation. In the lower left quadrant, we display
the corresponding trace plots for these same settings. The traces demonstrate that samples
of f (xi ) appear to traverse the sample space, but there are some slower oscillations. In the
lower right quadrant, we plot the Geweke ZAB statistics for each subject i. Notice that there
appear to be far too many exceeding the 95% limits. Based on these figures, we conclude
that convergence has not been attained. We would suggest that more thinning be employed
via the keepevery argument to pbart; perhaps, keepevery = 250.

where each figure is broken into four quadrants. In the upper left quadrant, we have plotted
Friedman’s partial dependence function for f (xi4 ) vs. xi4 for 10 values of xi4 . This is a check
that can’t be performed for real data, but it is informative in this case. Notice that f (xi4 ) vs.
xi4 is directly proportional in each figure as expected. In the upper right quadrant, we plot
the auto-correlations of f (xi ) for 10 randomly selected xi where i indexes subjects. Notice
that there is very little auto-correlation for N = 200, 1000, but a more notable amount for
N = 5000. In the lower left quadrant, we display the corresponding trace plots for these
same settings. The traces demonstrate that samples of f (xi ) appear to adequately traverse
the sample space for N = 200, 1000, but less notably for N = 5000. In the lower right
quadrant, we plot the Geweke ZAB statistics for each subject i. Notice that for N = 200,
the ZAB exceed the 95% limits only a handful of times. Although there are only five times as
many comparisons, N = 1000 appears to have considerably more than five times as many values
exceeding the 95% limits. And, for N = 5000, there are dramatically more values exceeding the
95% limits.
Based on these figures, we conclude that the chains have converged for N = 200; for N = 1000,
convergence is questionable; and, for N = 5000, convergence has not been attained. We would
suggest that more thinning be employed for N = 1000, 5000 via the keepevery argument to
pbart; perhaps, keepevery = 50 for N = 1000 and keepevery = 250 for N = 5000.
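As a hedged sketch of that advice, assuming the simulated x.train and y.train for the
N = 1000 case are in the workspace and that the parallel variant mc.pbart is available, one
might re-fit with heavier thinning as follows.

## keep 1000 draws while culling 50 draws between each returned draw
post <- mc.pbart(x.train, y.train, ndpost = 1000, keepevery = 50,
                 mc.cores = 8, seed = 12)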

4.6. BART and variable selection


Bayesian variable selection techniques applicable to BART have been studied by Chipman
et al. (2010); Chipman, George, and McCulloch (2013); Bleich, Kapelner, George, and Jensen
(2014); Hahn and Carvalho (2015); McCulloch, Carvalho, and Hahn (2015); Linero (2018).
The BART package supports the sparse prior of Linero (2018) by specifying sparse = TRUE
(the default is sparse = FALSE). Let us represent the variable selection probabilities by sj
where j = 1, . . ., P . Now, replace the uniform variable selection prior in BART with a Dirichlet
prior. Also, place a beta prior on the θ parameter.
\[
[s_1, \ldots, s_P] \mid \theta \overset{prior}{\sim} \mathrm{Dirichlet}(\theta/P, \ldots, \theta/P)
\qquad
\frac{\theta}{\theta + \rho} \overset{prior}{\sim} \mathrm{Beta}(a, b)
\]

Typical settings are b = 1 and ρ = P (the defaults) which you can over-ride with the b
and rho arguments respectively. The value a = 0.5 (the default) is a sparse setting whereas
an alternative setting a = 1 is not sparse; you can specify this parameter with argument
a. If additional sparsity is desired, then you can set the argument rho to a value smaller
than P : for more details, see Appendix B. Furthermore, Linero discusses two assumptions:
Assumption 2.1 and Assumption 2.2 (see Linero (2018) for more details). Briefly, Assumption 2.2,
which corresponds to the default augment = FALSE, is more friendly to binary/ordinal covariates,
while Assumption 2.1, which corresponds to augment = TRUE, is less so.
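A hedged sketch of these arguments follows, assuming a binary y.train and an x.train matrix
with P columns, and assuming that the fit's varprob component holds the posterior draws of
(s1, . . ., sP).

set.seed(21)
## Dirichlet sparse prior with the default a = 0.5, b = 1 and rho = P
post <- pbart(x.train, y.train, sparse = TRUE)
## posterior mean selection probabilities s_1, ..., s_P
s <- apply(post$varprob, 2, mean)
## covariates selected more often than chance, i.e., s_j > 1/P
which(s > 1 / ncol(x.train))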
Let us return to the simulated probit BART example explored above (in the BART package):
demo("sparse.pbart", package = "BART"). For sample sizes of N = 200, 1000, 5000, there
are P = 100 covariates, but only the first 5 are active. In Figure 13, the 5 (95) active (inac-
tive) covariates are red (black) and circles (dots) are > (≤) 1/P which is chance association
represented by a black line. For N = 200, all five active variables are identified, but notice
that there are 20 false positives. For N = 1000, all five active covariates are identified, but
notice that there are still 14 false positives. For N = 5000, all five active covariates are
identified and notice that there is only one false positive. We are often interested in the
inter-relationship between covariates within our model. We can assess these relationships by
inspecting the binary trees. For example, we can ascertain how often x1 is chosen as a branch
decision rule leading to a branch decision rule with x2 further up the tree or vice versa. In
this case, we call x1 and x2 a concordant pair, denoted by x1 ↔ x2 , which is a symmetric
relationship, i.e., x1 ↔ x2 implies x2 ↔ x1 . If Bh is the number of branches in tree Th , then
Figure 13: Probit BART and variable selection example. For sample sizes of N =
200, 1000, 5000, there are P = 100 covariates, but only the first 5 are active. The 5 (95)
active (inactive) covariates are red (black) and circles (dots) are > (≤) 1/P which is chance
association represented by a black line. For N = 200, all five active variables are identified,
but notice that there are 20 false positives. For N = 1000, all five active covariates are iden-
tified, but notice that there are still 14 false positives. For N = 5000, all five active covariates
are identified and notice that there is only one false positive.

the concordant pair probability is: κij = P [xi ↔ xj ∈ Th | Bh > 1] for i = 1, . . . , P − 1 and
j = i + 1, . . . , P . See an example of calculating these probabilities in demo("trees.pbart",
package = "BART").

5. Time-to-event outcomes with BART


The BART package supports time-to-event outcomes including survival analysis, competing
risks and recurrent events.

5.1. Survival analysis with BART


Survival analysis with BART is provided by the surv.bart function for serial computation
and mc.surv.bart for parallel computation. Survival analysis has been studied by many,
however, most take a proportional hazards approach (Cox 1972; Kalbfleisch and Prentice
1980; Klein and Moeschberger 2003). The complete details of our approach can be found in
Sparapani, Logan, McCulloch, and Laud (2016) and a brief introduction follows. We take
an approach that is tantamount to discrete-time survival analysis (Thompson Jr. 1977; Arjas
and Haara 1987; Fahrmeir 2014). Relying on the capabilities of BART, we do not stipulate a
linear relationship with the covariates nor proportional hazards.
The data is (si , δi , xi ) where i indexes subjects, i = 1, . . . , N ; si is the time of an absorbing
event, δi = 1, or right censoring, δi = 0; and xi is a vector of covariates (which can be time-
dependent, but, for simplicity, we assume that they are known at time zero). We construct
a grid of the ordered distinct event times, 0 = t(0) < · · · < t(K) < ∞, and we consider the
following time intervals: (0, t(1) ], (t(1) , t(2) ], . . .(t(K−1) , t(K) ].
Now, consider event indicators yij for each subject i at each distinct time t(j) up to and
including the subject's last observation time ti = t(ni ) with ni = arg maxj [t(j) ≤ ti ]. This
means yij = 0 if j < ni and yini = δi . Denote the probability of an event at time t(j) ,
conditional on no previous event, by pij . Now, our model for yij is a nonparametric probit
regression of yij on the time t(j) and the covariates xi .
So the model is
\[
\begin{aligned}
y_{ij} &= \delta_i\, I\left(s_i = t_{(j)}\right), \quad j = 1, \ldots, n_i \\
y_{ij} \mid p_{ij} &\sim \mathrm{B}(p_{ij}) \\
p_{ij} &= \Phi(\mu_{ij}), \quad \mu_{ij} = \mu_0 + f(t_{(j)}, x_i) \\
f &\overset{prior}{\sim} \mathrm{BART}
\end{aligned}
\]
where i indexes subjects, i = 1, . . . , N; and Φ(·) is the standard normal cumulative distribution
function. This formulation creates the likelihood
$[y \mid f] = \prod_{i=1}^{N} \prod_{j=1}^{n_i} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}$.

If the event indicators, yij , have already been computed, then you can specify them with the
y.train argument. However, it is likely that the indicators would need to be constructed,
so for convenience, you can specify (si , δi ) by the arguments times and delta respectively.
In either case, the default value of µ0 is Φ−1 (ȳ) (which you can over-ride with the offset
argument). For computational efficiency, probit (Albert and Chib 1993) is the default, but
logit (Holmes and Held 2006; Gramacy and Polson 2012) can be specified as an option via
type = "lbart".
Based on the posterior samples, we construct quantities of interest with BART for survival
analysis. In discrete-time survival analysis, the instantaneous hazard from continuous-time
survival is essentially replaced with the probability of an event in an interval, i.e.,
$p(t_{(j)}, x) = \Phi(\mu_0 + f(t_{(j)}, x))$. Now, the survival function is constructed as
follows: $S(t_{(j)} \mid x) = \Pr(T > t_{(j)} \mid x) = \prod_{l=1}^{j} (1 - p(t_{(l)}, x))$.

Survival data pairs (s, δ) are converted to indicators by the helper function surv.pre.bart
which is called automatically by surv.bart if y.train is not provided. surv.pre.bart
returns a list which contains y.train for the indicators; tx.train for the covariates corre-
sponding to y.train for training f (t, x) (which includes time in the first column, and the rest
of the covariates afterward, if any, i.e., rows of [t, x], hence the name tx.train to distinguish
it from the original x.train); tx.test for the covariates to predict f (t, x) rather than to
train; times which is the grid of ordered distinct time points; and K which is the length of
times. Here is a very simple example of a data set with three observations and no covariates
re-formatted for display (no covariates is an interesting special case but we will discuss the
more common case with covariates further below).

R> times <- c(2.5, 1.5, 3.0)


R> delta <- c(1, 1, 0)
R> surv.pre.bart(times = times, delta = delta)

$y.train   $tx.train   $tx.test   $times    $K
  [1] 0         t          t      [1] 1.5   [1] 3
      1    [1,] 1.5   [1,] 1.5        2.5
      1    [2,] 2.5   [2,] 2.5        3.0
      0    [3,] 1.5   [3,] 3.0
      0    [4,] 1.5
      0    [5,] 2.5
      0    [6,] 3.0

Here is a diagram of the input and output for the surv.pre.bart function. pre is a list that
is generated to contain the matrix pre$tx.train and the vector pre$y.train.

R> pre <- surv.pre.bart(times, delta, x.train, x.test = x.train)

\[
\texttt{tx.train} = \begin{bmatrix}
t_{(1)} & x_1 \\
\vdots & \vdots \\
t_{(n_1)} & x_1 \\
\vdots & \vdots \\
t_{(1)} & x_N \\
\vdots & \vdots \\
t_{(n_N)} & x_N
\end{bmatrix}
\qquad
\texttt{y.train} = \begin{bmatrix}
y_{11} = 0 \\
\vdots \\
y_{1n_1} = \delta_1 \\
\vdots \\
y_{N1} = 0 \\
\vdots \\
y_{Nn_N} = \delta_N
\end{bmatrix}
\]

For pre$tx.test, ni is replaced by K which is very helpful so that each subject contributes
an equal number of settings for programmatic convenience and non-informative estimation,
i.e., if high-risk subjects with earlier events did not appear beyond their event, then estimates
of survival for later times would be biased upward. For other outcomes besides time-to-event,
we provide two matrices of covariates, x.train and x.test, where x.train is for training
and x.test is for validation. However, due to the variable ni for time-to-event outcomes, we
generally provide two arguments as follows: x.train, x.test = x.train where the former
matrix will be expanded by surv.pre.bart to $\sum_{i=1}^{N} n_i$ rows for training f(t, x)
while the latter matrix will be expanded to N × K rows for f(t, x) estimation only. If you
still need to perform validation, then you can make a separate call to the predict function.

N.B. the argument ndpost = M is the length of the chain to be returned and the argument
keepevery is used for thinning, i.e., return M observations where keepevery are culled in be-
tween each returned value. For BART with time-to-event outcomes which is based on gbart,
the default is keepevery = 10 since the grid of time points creates data set observations
of order N × K which have a tendency towards higher auto-correlation, therefore, making
thinning more necessary. To avoid unnecessarily enlarged data sets, it is often prudent to
coarsen the time axis appropriately. Although this might seem drastic, times are often col-
lected orders of magnitude more precisely than necessary for the problem under study. For
example, cancer registries often collect survival times in days while time in months or quarters
would suffice for many typical applications. You can coarsen automatically by supplying the
optional K argument to coarsen the times to a grid of time quantiles: 1/K, 2/K, . . . , K/K (not
to be confused with the k argument which is a prior parameter for the distribution of the leaf
terminal values).
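For example, a hedged sketch of a call that coarsens the time axis and thins more aggressively
than the default keepevery = 10 (assuming survival data x.train, times and delta are in the
workspace) is:

post <- mc.surv.bart(x.train, times = times, delta = delta,
                     x.test = x.train, K = 50, keepevery = 50,
                     mc.cores = 8, seed = 99)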
Here is a diagram of the input and output for the surv.bart function for serial computation
and mc.surv.bart for parallel computation.
Serial call:

set.seed(99)
post <- surv.bart(x.train, times = times, delta = delta,
x.test = x.train, ndpost = M)

Parallel call:

post <- mc.surv.bart(x.train, times = times, delta = delta,


x.test = x.train, ndpost = M, mc.cores = B, seed = 99)

The data inputs, as shown above, are as follows.

• x.train is a matrix or data frame of covariates for training represented as
  $\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$ where the $x_i$ are row vectors.

• times is a vector of times with K distinct values.

• delta is a vector of binary event indicators.

• x.test (optional) is a matrix or data frame of covariates for testing.

The returned value, post as shown above, is of type ‘survbart’ that is essentially a list with
named components, particularly, post$surv.test.

• post$surv.test is a matrix of survival function estimates $\hat{S}_m(t_{(j)}, x_i)$
\[
\begin{bmatrix}
\hat{S}_1(t_{(1)}, x_1) & \ldots & \hat{S}_1(t_{(K)}, x_1) & \ldots & \hat{S}_1(t_{(1)}, x_N) & \ldots & \hat{S}_1(t_{(K)}, x_N) \\
\vdots & & \vdots & & \vdots & & \vdots \\
\hat{S}_M(t_{(1)}, x_1) & \ldots & \hat{S}_M(t_{(K)}, x_1) & \ldots & \hat{S}_M(t_{(1)}, x_N) & \ldots & \hat{S}_M(t_{(K)}, x_N)
\end{bmatrix}
\]

Here is a diagram of the input and output for the predict.survbart function.

R> pred <- predict(post, pre$tx.test, mc.cores = B)

The data inputs, as shown above, are as follows.

• post is an object of type ‘survbart’.

• x.test is a matrix or data frame of covariates for testing represented as
  $\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_Q \end{bmatrix}$ where the $x_i$ are row vectors.

The returned value, pred as shown above, is an object of type ‘survbart’ that is essentially
a list of named components, particularly, pred$surv.test.

• pred$surv.test is a matrix of survival function estimates $\hat{S}_m(t_{(j)}, x_i)$
\[
\begin{bmatrix}
\hat{S}_1(t_{(1)}, x_1) & \ldots & \hat{S}_1(t_{(K)}, x_1) & \ldots & \hat{S}_1(t_{(1)}, x_Q) & \ldots & \hat{S}_1(t_{(K)}, x_Q) \\
\vdots & & \vdots & & \vdots & & \vdots \\
\hat{S}_M(t_{(1)}, x_1) & \ldots & \hat{S}_M(t_{(K)}, x_1) & \ldots & \hat{S}_M(t_{(1)}, x_Q) & \ldots & \hat{S}_M(t_{(K)}, x_Q)
\end{bmatrix}
\]

For an overview of Friedman's partial dependence function (including the notation adopted
in this article and its meaning), please see Section 3.8 which discusses continuous outcomes.
For survival analysis, we use Friedman's partial dependence function (Friedman 2001) with
BART to summarize the marginal effect due to a subset of the covariate settings which,
naturally, includes time, (t(j) , xhS ). For survival analysis, the f function is often not
directly of interest; rather, the survival function is more readily interpretable:
\[
S(t_{(j)}, x_{hS}) = N^{-1} \sum_{i=1}^{N} S(t_{(j)}, x_{hS}, x_{iC}).
\]
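A hedged sketch of this computation follows; it assumes post$surv.test is the M × (N × K)
matrix laid out as above (K consecutive columns per subject) for a test set in which the
covariates of interest have been fixed at a single setting, xhS, for every subject, and that
post$K and post$times hold the size and values of the time grid.

K <- post$K
## average over subjects i for each draw m and time t_(j)
S <- matrix(nrow = nrow(post$surv.test), ncol = K)
for(j in 1:K)
    S[ , j] <- apply(post$surv.test[ , seq(j, ncol(post$surv.test), by = K)],
                     1, mean)
## posterior mean and 95% credible band of S(t_(j), x_hS)
plot(post$times, apply(S, 2, mean), type = "s", ylim = 0:1,
     xlab = "t", ylab = "S(t, x)")
lines(post$times, apply(S, 2, quantile, probs = 0.025), type = "s", lty = 2)
lines(post$times, apply(S, 2, quantile, probs = 0.975), type = "s", lty = 2)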

Survival analysis with BART example: Advanced lung cancer


Here we present an example that is available in the BART package: demo("lung.surv.bart",
package = "BART"). The North Central Cancer Treatment Group surveyed 228 advanced
lung cancer patients (Loprinzi et al. 1994). This data can be found in the lung data set.
The study focused on prognostic variables. Patient responses were paired with a few clinical
variables. We control for age, gender and Karnofsky performance score as rated by their
physician. We compare the survival for males and females with Friedman’s partial depen-
dence function; see Figure 14. We also analyze this data set with logit BART and the results
are quite similar (not shown): demo("lung.surv.lbart", package = "BART"). Further-
more, we perform convergence diagnostics on the chain: demo("geweke.lung.surv.bart",
package = "BART").

5.2. Survival analysis and the concordance probability


The concordance probability (Gönen and Heller 2005) is a measure of the discriminatory
ability of survival analysis analogous to the area under the receiver operating characteristic
curve for binary outcomes. Suppose that we have two event times, t1 and t2 , (let us say
Figure 14: Advanced lung cancer example: Friedman’s partial dependence function with 95%
credible intervals: males (blue) vs. females (red). A cohort of advanced lung cancer patients
was recruited from the North Central Cancer Treatment Group. For survival time, these
patients were followed for nearly 3 years or until lost to follow-up.

each based on a different subject profile), then the concordance probability is defined as
κt1 ,t2 = P [t1 < t2 ]. A simple analytic example with the Exponential distribution is as follows.

\[
\begin{aligned}
t_i \mid \lambda_i &\overset{ind}{\sim} \mathrm{Exp}(\lambda_i) \quad\text{where } i \in \{1, 2\} \\
P[t_1 < t_2 \mid \lambda_1, \lambda_2] &= \int_0^{\infty} \int_0^{t_2}
\lambda_2 e^{-\lambda_2 t_2} \lambda_1 e^{-\lambda_1 t_1} \, dt_1 \, dt_2
= \frac{\lambda_1}{\lambda_1 + \lambda_2} \\
1 - P[t_1 > t_2 \mid \lambda_1, \lambda_2] &= 1 - \frac{\lambda_2}{\lambda_1 + \lambda_2}
= \frac{\lambda_1}{\lambda_1 + \lambda_2} = P[t_1 < t_2 \mid \lambda_1, \lambda_2]
\end{aligned}
\]

Notice that the concordance is symmetric with respect to t1 and t2 .


We can make a similar calculation based on our BART survival analysis model. Suppose that
we have two event times, s1 and s2 , which are conditionally independent, i.e., s1 | (f, x1 ) ⊥
s2 | (f, x2 ). First, we calculate P [s1 < s2 | f, x1 , x2 ] (from here on, we suppress f and xi for
notational convenience).
\[
\begin{aligned}
P[s_1 < s_2] &= P\left[s_1 = t_{(1)}, s_2 > t_{(1)}\right] +
P\left[s_1 = t_{(2)}, s_2 > t_{(2)} \mid s_1 > t_{(1)}, s_2 > t_{(1)}\right]
P\left[s_1 > t_{(1)}, s_2 > t_{(1)}\right] + \ldots \\
&= \sum_{j=1}^{K} P\left[s_1 = t_{(j)}, s_2 > t_{(j)} \mid s_1 > t_{(j-1)}, s_2 > t_{(j-1)}\right]
P\left[s_1 > t_{(j-1)}, s_2 > t_{(j-1)}\right] \\
&= \sum_{j=1}^{K} p_{1j}\, q_{2j}\, S_1(t_{(j-1)}) S_2(t_{(j-1)})
\end{aligned}
\]
where $q_{ij} = 1 - p_{ij}$.

Now, we calculate the mirror image relationship.
\[
\begin{aligned}
1 - P[s_1 > s_2] &= 1 - \sum_{j=1}^{K} q_{1j}\, p_{2j}\, S_1(t_{(j-1)}) S_2(t_{(j-1)}) \\
&= 1 - \sum_{j=1}^{K} (1 - p_{1j})(1 - q_{2j}) S_1(t_{(j-1)}) S_2(t_{(j-1)}) \\
&= 1 - \sum_{j=1}^{K} (1 - p_{1j} - q_{2j} + p_{1j} q_{2j}) S_1(t_{(j-1)}) S_2(t_{(j-1)}) \\
&= 1 - \sum_{j=1}^{K} p_{1j}\, q_{2j}\, S_1(t_{(j-1)}) S_2(t_{(j-1)})
- \sum_{j=1}^{K} (q_{1j} - q_{2j}) S_1(t_{(j-1)}) S_2(t_{(j-1)})
\end{aligned}
\]

However, note that these probabilities are not symmetric in this form. Yet, we can arrive at
symmetry as follows.
\[
\kappa_{s_1, s_2} = 0.5 \left(P[s_1 < s_2] + 1 - P[s_1 > s_2]\right)
= 0.5 \left(1 - \sum_{j=1}^{K} (q_{1j} - q_{2j}) S_1(t_{(j-1)}) S_2(t_{(j-1)})\right)
\]
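A hedged sketch of this calculation follows, assuming prob1, surv1 and prob2, surv2 are
M × K matrices of posterior draws of p(t(j) , x1 ), S(t(j) , x1 ) and p(t(j) , x2 ), S(t(j) , x2 )
for two covariate settings (e.g., extracted from the prob.test and surv.test components of
a survival BART fit, if available).

K <- ncol(prob1)
q1 <- 1 - prob1
q2 <- 1 - prob2
## S_i(t_(j-1)) with the convention S_i(t_(0)) = 1
S1 <- cbind(1, surv1[ , -K])
S2 <- cbind(1, surv2[ , -K])
## posterior draws of the concordance probability and their summary
kappa <- 0.5 * (1 - rowSums((q1 - q2) * S1 * S2))
c(mean = mean(kappa), quantile(kappa, probs = c(0.025, 0.975)))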

See the concordance example at demo("concord.surv.bart", package = "BART").

5.3. Competing risks with BART


Competing risks survival analysis (Kalbfleisch and Prentice 1980; Fine and Gray 1999; Klein
and Moeschberger 2003; Nicolaie, Van Houwelingen, and Putter 2010; Ishwaran, Gerds, Ko-
galur, Moore, Gange, and Lau 2014; Sparapani, Logan, McCulloch, and Laud 2020a) deal
with events which are mutually exclusive, say, death from cardiovascular disease vs. death
from other causes, i.e., a patient experiencing one of the events is incapable of experiencing
another. We take two approaches to support competing risks with BART: both approaches
are extensions of BART survival analysis. We flexibly model the cause-specific hazards and
eschew precarious restrictive assumptions like linearity of covariate effects, proportionality
and/or parametric distributions of the outcomes.

Competing risks with crisk.bart


The first approach is supported by the function crisk.bart for serial computation and
mc.crisk.bart for parallel computation. To accommodate competing risks, we adapt our
notation slightly: (si , δi ) where δi = 1 for kind 1 events, δi = 2 for kind 2 events, or δi = 0 for
censoring times. We create a single grid of time points for the ordered distinct times based on
either kind of event or censoring: 0 = t(0) < t(1) < · · · < t(K) < ∞. We model the probability
for an event of kind 1, p1 (t(j) , xi ), and an event of kind 2 conditioned on subject i being
alive at time t(j) , p2 (t(j) , xi ). Now, we create event indicators by melding absorbing events
survival analysis with mutually exclusive Multinomial categories where i indexes subjects:
i = 1, . . . , N .
\[
\begin{aligned}
y_{1ij} &= I(\delta_i = 1)\, I(j = n_i) \quad\text{where } j = 1, \ldots, n_i \\
y_{1ij} \mid p_{1ij} &\sim \mathrm{B}(p_{1ij}) \\
p_{1ij} &= \Phi(\mu_1 + f_1(t_{(j)}, x_i)) \quad\text{where } f_1 \overset{prior}{\sim} \mathrm{BART} \\
y_{2ij} &= I(\delta_i = 2)\, I(j = n_i) \quad\text{where } j = 1, \ldots, n_i - y_{1in_i} \\
y_{2ij} \mid p_{2ij} &\sim \mathrm{B}(p_{2ij}) \\
p_{2ij} &= \Phi(\mu_2 + f_2(t_{(j)}, x_i)) \quad\text{where } f_2 \overset{prior}{\sim} \mathrm{BART}
\end{aligned}
\]
The likelihood is:
\[
[y \mid f_1, f_2] = \prod_{i=1}^{N} \prod_{j=1}^{n_i}
p_{1ij}^{y_{1ij}} (1 - p_{1ij})^{1 - y_{1ij}}
\prod_{j'=1}^{n_i - y_{1in_i}} p_{2ij'}^{y_{2ij'}} (1 - p_{2ij'})^{1 - y_{2ij'}}.
\]
Now, we can estimate the survival function and the cumulative incidence functions as follows.
\[
\begin{aligned}
S(t, x_i) &= 1 - F(t, x_i) = \prod_{j=1}^{k} (1 - p_{1ij})(1 - p_{2ij})
\quad\text{where } k = \arg\max_j \left[t_{(j)} \le t\right] \\
F_1(t, x_i) &= \int_0^t S(u-, x_i) \lambda_1(u, x_i)\, du = \sum_{j=1}^{k} S(t_{(j-1)}, x_i)\, p_{1ij} \\
F_2(t, x_i) &= \int_0^t S(u-, x_i) \lambda_2(u, x_i)\, du = \sum_{j=1}^{k} S(t_{(j-1)}, x_i)(1 - p_{1ij})\, p_{2ij}
\end{aligned}
\]

The returned object of type ‘criskbart’ from crisk.bart or mc.crisk.bart provides the cu-
mulative incidence functions and survival corresponding to x.test as follows: F1 is cif.test,
F2 is cif.test2 and S is surv.test.
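A hedged sketch of summarizing these components follows, assuming cif.test is an M × (Q × K)
matrix laid out like surv.test above, i.e., with K consecutive columns per test setting, and
that post$K holds the number of grid times.

K <- post$K
## posterior mean of F1(t_(j), x_i) with one row per test setting
F1.mean <- matrix(apply(post$cif.test, 2, mean), ncol = K, byrow = TRUE)
## row i now holds F1(t_(1), x_i), ..., F1(t_(K), x_i)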

Competing risks with crisk2.bart


The second approach is supported by the function crisk2.bart for serial computation and
mc.crisk2.bart for parallel computation. We take a similar approach as Nicolaie et al.
(2010). We model the probability for an event of either kind, pij = p(t(j) , xi ) (this is standard
survival analysis); and, given an event has occurred, the probability of a kind 1 event, πi =
π(ti , xi ). Now, we create the corresponding event indicators yij and ui where i indexes
subjects: i = 1, . . . , N .
\[
\begin{aligned}
y_{ij} &= I(\delta_i \neq 0)\, I(j = n_i) \quad\text{where } j = 1, \ldots, n_i \\
y_{ij} \mid p_{ij} &\sim \mathrm{B}(p_{ij}) \\
p_{ij} &= \Phi(\mu_y + f_y(t_{(j)}, x_i)) \quad\text{where } f_y \overset{prior}{\sim} \mathrm{BART} \\
u_i &= I(\delta_i = 1) \quad\text{where } i \in \{i' : \delta_{i'} \neq 0\} \\
u_i \mid \pi_i &\sim \mathrm{B}(\pi_i) \\
\pi_i &= \Phi(\mu_u + f_u(t_i, x_i)) \quad\text{where } f_u \overset{prior}{\sim} \mathrm{BART}
\end{aligned}
\]
Figure 15: Liver transplant competing risks for type O patients estimated by BART and
Aalen-Johansen. This data is from the Mayo Clinic liver transplant waiting list from 1990-
1999. During the study period, the liver transplant organ allocation policy was flawed. Blood
type is an important matching factor to avoid organ rejection. Donor livers from subjects
with blood type O can be used by patients with all blood types; whereas a donor liver from
the other types will only be transplanted to a matching recipient. Therefore, type O subjects
on the waiting list were at a disadvantage since the pool of competitors was larger.

The likelihood is:
\[
[y, u \mid f_y, f_u] = \prod_{i=1}^{N} \prod_{j=1}^{n_i}
p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}
\prod_{i' : \delta_{i'} \neq 0} \pi_{i'}^{u_{i'}} (1 - \pi_{i'})^{1 - u_{i'}}.
\]
Now, we can estimate the survival function and the cumulative incidence functions similarly
to the first approach. The returned object of type ‘crisk2bart’ from crisk2.bart or
mc.crisk2.bart provides the cumulative incidence functions and survival corresponding to
x.test as follows: F1 is cif.test, F2 is cif.test2 and S is surv.test.

Competing risks with BART example: Liver transplants


Here, we present the Mayo Clinic liver transplant waiting list data from 1990-1999 with
N = 815 patients. During the study period, the liver transplant organ allocation policy was
flawed. Blood type is an important matching factor to avoid organ rejection. Donor livers from
subjects with blood type O can be used by patients with A, B, AB or O blood types; whereas a
donor liver from the other types will only be transplanted to a matching recipient. Therefore,
type O subjects on the waiting list were at a disadvantage since the pool of competitors
was larger for type O donor livers. This data is of historical interest and provides a useful
example of competing risks, but it has little relevance to liver transplants today. Current liver
transplant policies have evolved and now depend on each individual patient’s risk/need which
are assessed and updated regularly while a patient is on the waiting list. Nevertheless, there
still remains an acute shortage of donor livers today. The transplant data set is provided by
the BART R package as is this example: demo("liver.crisk.bart", package = "BART").
We compare the nonparametric Aalen-Johansen competing risks estimator with BART for
the transplant event among type O patients; the two estimates are in general agreement (see
Figure 15).

5.4. Recurrent events analysis with BART


The BART package supports recurrent events (Sparapani, Rein, Tarima, Jackson, and Meurer
2020b) with recur.bart for serial computation and mc.recur.bart for parallel computation.
Survival analysis is generally concerned with absorbing events that a subject can only expe-
rience once like mortality. Recurrent events analysis is concerned with non-absorbing events
that a subject can experience more than once like hospital admissions (Andersen and Gill
1982; Wei, Lin, and Weissfeld 1989; Kalbfleisch and Prentice 2002; Sparapani et al. 2020b).
Recurrent events analysis with BART provides much desired flexibility in modeling the depen-
dence of recurrent events on covariates. Consider data in the form: δi , si , ti , ui , xi (t) where
i = 1, . . . , N indexes subjects; si is the end of the observation period (death, δi = 1, or cen-
soring, δi = 0); Ni is the number of events during the observation period; ti = [ti1 , . . . , tiNi ]
and tik is the event start time of the k-th event (let ti0 = 0); ui = [ui1 , . . . , uiNi ] and uik is
the event end time of the k-th event (let ui0 = 0); and xi (t) is a vector of time-dependent
covariates. Both start and end times of events are necessary to define risk set eligibility for
events of stochastic duration like readmissions since patients currently hospitalized cannot
be readmitted. For instantaneous events (or roughly instantaneous events such as emergency
department visits with time measured in days), the end times can be simply ignored.
We denote the K collectively distinct event start and end times for all subjects by
0 < t(1) < · · · < t(K) < ∞ thus taking t(j) to be the j-th order statistic among distinct
observation times and, for convenience, t(j') = 0 where j' ≤ 0 (note that the t(j) are
constructed from all event start/end times for all subjects, but they may be a censoring time
for any given subject). Now consider binary event indicators yij for each subject i at each
distinct time t(j) up to the subject's last observation time t(ni ) ≤ si with
ni = arg maxj [t(j) ≤ si ], i.e., yi1 , . . . , yini ∈ {0, 1}. We then denote by pij the probability
of an event at time t(j) conditional on (t(j) , x̃i (t(j) )) where
x̃i (t(j) ) = [Ni (t(j−1) ), vi (t(j) ), xi (t(j) )]. Let $N_i(t-) \equiv \lim_{s \uparrow t} N_i(s)$
be the number of events for subject i just prior to time t and we also note that Ni = Ni (si ).
Let vi (t) = t − uNi (t−) be the sojourn time for subject i, i.e., time since last event, if any.
Notice that we can replace Ni (t(j) −) with Ni (t(j−1) ) since, by construction, the state of
information available at time t(j) − is the same as that available at t(j−1) . Assuming a
constant intensity and constant covariates, x̃i (t(j) ), in the interval (t(j−1) , t(j) ], we define
the cumulative intensity process as:
\[
\Lambda(t_{(j)}, \tilde{x}_i(t_{(j)})) = \int_0^{t_{(j)}} d\Lambda(t, \tilde{x}_i(t))
= \sum_{j'=1}^{j} \Pr\left[N_i(t_{(j')}) - N_i(t_{(j'-1)}) = 1 \mid t_{(j')}, \tilde{x}_i(t_{(j')})\right]
= \sum_{j'=1}^{j} p_{ij'} \tag{1}
\]
where these pij are currently unspecified and we provide their definition later in Equation 2.
N.B. we follow the recurrent events literature’s favored terminology by using the term “in-
tensity” rather than “hazard”, but they are generally interchangeable.
With absorbing events such as mortality there is no concern about the conditional indepen-
dence of future events because there will never be any. Conversely, with recurrent events,
there is a valid concern. Of course, conditional independence can be satisfied by conditioning
on the entire event history, denoted by Ni (s) where 0 ≤ s < t. However, conditioning on the
entire event history is often impractical. Rather, we condition on both Ni (t−) and vi (t) to
satisfy any concern of conditional independence.
We now write the model for yij as a nonparametric probit regression of yij on
(t(j) , x̃i (t(j) )) tantamount to parametric models of discrete-time intensity (Thompson Jr. 1977;
Arjas and Haara 1987; Fahrmeir 2014). Specifically, the temporal data are converted from
δi , si , ti , ui , xi (t) to a sequence of longitudinal binary events as follows:
yij = maxk I(tik = t(j) ). However, note that the range of j is currently unspecified. To
understand the impetus of the range of j, let us look at an example.
Suppose that we have two subjects with the following values:

N1 = 2, s1 = 9, t11 = 3, u11 = 7, t12 = 8, u12 = 8 ⇒ y11 = 1, y12 = y13 = 0, y14 = 1, y15 = 0


N2 = 1, s2 = 12, t21 = 4, u21 = 7 ⇒ y21 = 0, y22 = 1, y23 = y24 = y25 = y26 = 0

which creates the grid of times (3, 4, 7, 8, 9, 12). For subject 1 (2), notice that y12 = y13 = 0
(y23 = 0) as it should be since no event occurred at times 4 or 7 (7). However, there could
have been no events at these times since the first event had not yet ended, i.e., these subjects
are not chronologically at risk for an event and, therefore, no corresponding random behavior
contributed to the
likelihood. The BART package provides the recur.pre.bart function which you can use
to construct these data sets. Here is a short demonstration of its capabilities adapted from
demo/data.recur.pre.bart.R (re-formatted for display purposes).

R> library("BART")
R> times <- matrix(c(3, 8, 9, 4, 12, 12), nrow = 2, ncol = 3, byrow = TRUE)
R> tstop <- matrix(c(7, 8, 0, 7, 0, 0), nrow = 2, ncol = 3, byrow = TRUE)
R> delta <- matrix(c(1, 1, 0, 1, 0, 0), nrow = 2, ncol = 3, byrow = TRUE)
R> recur.pre.bart(times = times, delta = delta, tstop = tstop)

$K      $times   $y.train   $tx.train       $tx.test
[1] 6   [1]  3     [1] 1          t v N            t v N
             4         1    [1,]  3 3 0     [1,]  3 3 0
             7         0    [2,]  8 5 1     [2,]  4 1 1
             8         0    [3,]  9 1 2     [3,]  7 4 1
             9         1    [4,]  3 3 0     [4,]  8 5 1
            12         0    [5,]  4 4 0     [5,]  9 1 2
                       0    [6,]  8 4 1     [6,] 12 4 2
                       0    [7,]  9 5 1     [7,]  3 3 0
                            [8,] 12 8 1     [8,]  4 4 0
                                            [9,]  7 3 1
                                           [10,]  8 4 1
                                           [11,]  9 5 1
                                           [12,] 12 8 1

Notice that $tx.test is not limited to the same time points as $tx.train, i.e., we often
want/need to estimate f at counter-factual values not observed in the data so each subject
contributes an equal number of evaluations for estimation purposes.
It is now clear that the yij which contribute to the likelihood are those such that j ∈ Ri which
is the risk set for subject i. We formally define the risk set as
\[
R_i = \left\{ j : j \in \{1, \ldots, n_i\} \text{ and }
\cap_{k=1}^{N_i} \left\{t_{(j)} \notin (t_{ik}, u_{ik})\right\} \right\}
\]

i.e., the risk set contains j if t(j) is during the observation period for subject i and t(j) is not
contained within an already ongoing event for this subject.
Putting it all together, we arrive at the following recurrent events discrete-time model with i
indexing subjects; i = 1, . . . , N .

\[
\begin{aligned}
y_{ij} \mid p_{ij} &\sim \mathrm{B}(p_{ij}) \quad\text{where } j \in R_i \\
p_{ij} &= \Phi(\mu_{ij}), \quad \mu_{ij} = \mu_0 + f(t_{(j)}, \tilde{x}_i(t_{(j)})) \\
f &\overset{prior}{\sim} \mathrm{BART}
\end{aligned} \tag{2}
\]
This produces the following likelihood:
$[y \mid f] = \prod_{i=1}^{N} \prod_{j \in R_i} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}$.
We center the BART function, $f$, by $\mu_0 = \Phi^{-1}(\bar{y})$ where
$\bar{y} = \frac{\sum_i \sum_{j \in R_i} y_{ij}}{\sum_i \sum_{j=1}^{n_i} I(j \in R_i)}$.

For computational efficiency, we carry out the probit regression via truncated normal latent
variables (Albert and Chib 1993) (this default can be over-ridden for logit with logistic latents
(Holmes and Held 2006; Gramacy and Polson 2012) by specifying type = "lbart").
With the data prepared as described in the above example, the BART model for binary data
treats the probability of an event within an interval as a nonparametric function of time, t,
and covariates, x̃(t). Conditioned on the data, BART provides samples from the posterior
distribution of f . For any t and x̃(t), we obtain the posterior distribution of p(t, x̃(t)) =
Φ(µ0 + f (t, x̃(t))).
For the purposes of recurrent events survival analysis, we are typically interested in estimating
the cumulative intensity function as presented in Equation 1. With these estimates, one can
accomplish inference from the posterior via means, quantiles or other functions of p(t, x̃i (t))
or Λ(t, x̃(t)) as needed, such as the relative intensity, i.e.,
RI(t, x̃n (t), x̃d (t)) = p(t, x̃n (t)) / p(t, x̃d (t)), where x̃n (t) and x̃d (t) are two settings
we wish to compare like two treatments.
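A hedged sketch of this inference follows, assuming p.n and p.d are M × K matrices of
posterior draws of p(t(j) , x̃n (t(j) )) and p(t(j) , x̃d (t(j) )) for the two settings being
compared (already aggregated over subjects via Friedman's partial dependence function).

## draw-by-draw relative intensity and its posterior summary
RI <- p.n / p.d
RI.mean  <- apply(RI, 2, mean)
RI.lower <- apply(RI, 2, quantile, probs = 0.025)
RI.upper <- apply(RI, 2, quantile, probs = 0.975)
## draws of the cumulative intensity of Equation 1 for the numerator setting
Lambda.n <- t(apply(p.n, 1, cumsum))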

Recurrent events with BART example: Bladder tumors


An interesting example of recurrent events involves a clinical trial conducted by the Vet-
erans Administration Cooperative Urological Research Group (Byar 1980). In this study,
all patients had superficial bladder tumors when they entered the trial. These tumors were
removed transurethrally and patients were randomly assigned to one of three treatments:
placebo, thiotepa or pyridoxine (vitamin B6). Many patients had multiple recurrences of
tumors during the study and new tumors were removed at each visit. For each patient, their
Figure 16: Relative intensity: Thiotepa vs. Placebo. The relative intensity function is as
follows: RI(t, x̃T (t), x̃P (t)) = p(t, x̃T (t)) / p(t, x̃P (t)) where T is for Thiotepa and P is
for Placebo. The blue
lines are the relative intensity functions themselves and the red lines are their 95% credible
intervals. The relative intensity is calculated by Friedman’s partial dependence function,
i.e., aggregated over all other covariates.
Figure 17: Relative intensity: Thiotepa vs. Vitamin B6. The relative intensity function is
as follows: RI(t, x̃T (t), x̃B (t)) = p(t, x̃T (t)) / p(t, x̃B (t)) where T is for Thiotepa and B
is for Vitamin
B6. The blue lines are the relative intensity functions themselves and the red lines are their
95% credible intervals. The relative intensity is calculated by Friedman’s partial dependence
function, i.e., aggregated over all other covariates.
Figure 18: Relative intensity: Vitamin B6 vs. Placebo. The relative intensity function is
as follows: RI(t, x̃B (t), x̃P (t)) = p(t, x̃B (t)) / p(t, x̃P (t)) where B is for Vitamin B6
and P is for Placebo.
The blue lines are the relative intensity functions themselves and the red lines are their
95% credible intervals. The relative intensity is calculated by Friedman’s partial dependence
function, i.e., aggregated over all other covariates.

recurrence time, if any, was measured from the beginning of treatment. There were 118 pa-
tients enrolled but only 116 were followed beyond time zero and contribute information. This
data set is loaded by data("bladder", package = "BART") and the data frame of interest is
bladder1. This data set is analyzed by demo("bladder.recur.bart", package = "BART").
In Figure 16, notice that the relative intensity calculated by Friedman’s partial dependence
function finds thiotepa inferior to placebo from roughly 6 to 18 months and afterward they
are about equal, but the 95% credible intervals are wide throughout. Similarly, the relative
intensity calculated by Friedman’s partial dependence function finds thiotepa inferior to vi-
tamin B6 from roughly 3 to 24 months and afterward they are about equal, but the 95%
credible intervals are wide throughout; see Figure 17. And, finally, vitamin B6 is superior to
placebo throughout, but the 95% credible intervals are wide; see Figure 18.

6. Discussion
The BART R package provides a user-friendly reference implementation of Bayesian additive
regression trees (BART). BART is a Bayesian nonparametric, tree-based ensemble, machine
learning technique with best-of-breed properties. In the spirit of machine learning, BART
learns the relationship between the covariates, x, and the response variable arriving at f (x)
while not burdening the user to pre-specify the functional form of f nor the interaction terms
among the covariates. By specifying an optional sparse Dirichlet prior, BART is capable of
variable selection: a form of learning which is especially useful in high-dimensional settings.
In the class of ensemble predictive models, BART’s out-of-sample predictive performance is
competitive with other leading members of this class. Due to its membership in the class of
Bayesian nonparametric models, BART not only provides an estimate of f (x), but naturally
generates the uncertainty as well.
There are user-friendly features that are inherent to BART itself which, of course, are available
in this package as well. BART was designed to be very flexible via its prior arguments while
providing the user robust, low information, default settings that will likely produce a good fit
without resorting to computationally demanding cross-validation. BART itself is relatively
computationally efficient, but larger data sets will naturally take more time to estimate.
Therefore, the BART package provides the user with simple and easy to use multi-threading
to keep elapsed time to a minimum. Another important time-saver, the BART package allows
the user to save the trees from a BART model fit so that prediction via the R predict function
can take place at a later time without having to re-fit the model. And these predictions can
also take advantage of multi-threading.
The BART package has been written in C++ for portability, maintainability and efficiency;
this allows BART to be called either from R or from other computer source code written in
many languages. The package supports missing data handling of the covariates and provides
the user with access to BART implementations for several types of responses. The BART
package supports the following:

• continuous outcomes;

• binary outcomes via probit or logit transformation;

• categorical outcomes;

• time-to-event outcomes with right censoring including

– absorbing events,
– competing risks, and
– recurrent events.

In this article, we have provided the user with an overview of much that is described in this
section including (but not limited to): details of the BART prior and its arguments, sparse
variable selection, prediction, multi-threading, support for the outcomes listed above and
missing data handling. In addition, this article has provided primers on important BART
topics such as posterior computation, Friedman’s partial dependence function and convergence
diagnostics. With a computational method such as BART, the user needs a reliable, well-
documented software package with a diverse set of examples. With this article, and the
BART package itself, we believe that interested users now have the tools to successfully
employ BART for their rigorous data analysis needs.

References

Agresti A (2003). Categorical Data Analysis. 2nd edition. John Wiley & Sons, Hoboken.

Albert J, Chib S (1993). “Bayesian Analysis of Binary and Polychotomous Response Data.”
Journal of the American Statistical Association, 88, 669–79. doi:10.1080/01621459.
1993.10476321.

Amdahl G (1967). “Validity of the Single Processor Approach to Achieving Large-Scale
Computing Capabilities.” In AFIPS Conference Proceedings, volume 30, pp. 483–485. doi:
10.1145/1465482.1465560.

Andersen PK, Gill RD (1982). “Cox’s Regression Model for Counting Processes: A Large
Sample Study.” The Annals of Statistics, 10(4), 1100–1120. URL http://www.jstor.org/
stable/2240714.

Anderson JP, Hoffman SA, Shifman J, Williams RJ (1962). “D825 - A Multiple-Computer
System for Command and Control.” In AFIPS Conference Proceedings, volume 24, pp.
86–96. doi:10.1145/1461518.1461527.

Arjas E, Haara P (1987). “A Logistic Regression Model for Hazard: Asymptotic Results.”
Scandinavian Journal of Statistics, 14(1), 1–18. URL https://www.jstor.org/stable/
4616044.

Baldi P, Brunak S (2001). Bioinformatics: The Machine Learning Approach. 2nd edition.
MIT Press, Cambridge.

Bleich J, Kapelner A, George EI, Jensen ST (2014). “Variable Selection for BART: An
Application to Gene Regulation.” The Annals of Applied Statistics, 8(3), 1750–1781. doi:
10.1214/14-AOAS755.

Breiman L (1996). “Bagging Predictors.” Machine Learning, 24, 123–140. doi:10.1023/a:
1018054314350.

Breiman L (2001). “Random Forests.” Machine Learning, 45, 5–32. doi:10.1023/a:
1010933404324.

Byar D (1980). “The Veterans Administration Study of Chemoprophylaxis for Recurrent
Stage I Bladder Tumours: Comparisons of Placebo, Pyridoxine and Topical Thiotepa.” In
Bladder Tumors and Other Topics in Urological Oncology, pp. 363–370. Springer-Verlag,
Boston. doi:10.1007/978-1-4613-3030-1_74.

Calcote J (2010). Autotools: A Practitioner’s Guide to GNU Autoconf, Automake, and Libtool.
No Starch Press, San Francisco.

Chipman H, McCulloch RE (2016). BayesTree: Bayesian Additive Regression Trees. R pack-
age version 0.3-1.4, URL https://CRAN.R-project.org/package=BayesTree.

Chipman HA, George EI, McCulloch RE (1998). “Bayesian CART Model Search.” Journal
of the American Statistical Association, 93(443), 935–948. doi:10.1080/01621459.1998.
10473750.

Chipman HA, George EI, McCulloch RE (2010). “BART: Bayesian Additive Regression
Trees.” The Annals of Applied Statistics, 4(1), 266–298. doi:10.1214/09-AOAS285.

Chipman HA, George EI, McCulloch RE (2013). “Bayesian Regression Structure Discovery.”
In P Damien, P Dellaportas, N Polson, D Stephens (eds.), Bayesian Theory and Applica-
tions. Oxford University Press, Oxford. doi:10.1093/acprof:oso/9780199695607.001.
0001.

Cox DR (1972). “Regression Models and Life-Tables.” Journal of the Royal Statistical Society
B, 34(2), 187–220.

Dagum L, Menon R (1998). “OpenMP: An Industry Standard API for Shared-Memory
Programming.” IEEE Computational Science and Engineering, 5(1), 46–55. doi:10.1109/
99.660313.

Daniels M, Singh A (2018). sbart: Sequential BART for Imputation of Missing Co-
variates. R package version 0.1.1, URL https://CRAN.R-project.org/src/contrib/
Archive/sbart/.

De Waal T, Pannekoek J, Scholtus S (2011). Handbook of Statistical Data Editing and Impu-
tation. John Wiley & Sons, Hoboken. doi:10.1002/9780470904848.

Delany MF, Linda SB, Moore CT (1999). “Diet and Condition of American Alligators in 4
Florida Lakes.” In Proceedings of the Annual Conference of the Southeastern Association
of Fish and Wildlife Agencies, pp. 375–389.

Denison DGT, Mallick BK, Smith AFM (1998). “A Bayesian CART Algorithm.” Biometrika,
85(2), 363–377. doi:10.1093/biomet/85.2.363.

Devroye L (1986). Non-Uniform Random Variate Generation. Springer-Verlag, New York.
doi:10.1007/978-1-4613-8643-8.

Dorie V (2020). dbarts: Discrete Bayesian Additive Regression Trees Sampler. R package
version 0.9-18, URL https://CRAN.R-project.org/package=dbarts.

Eddelbuettel D, Francois R (2011). “Rcpp: Seamless R and C++ Integration.” Journal of
Statistical Software, 40(8), 1–18. doi:10.18637/jss.v040.i08.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). “Least Angle Regression.” The Annals
of Statistics, 32(2), 407–499. doi:10.1214/009053604000000067.

Entezari R, Craiu RV, Rosenthal JS (2018). “Likelihood Inflating Sampling Algorithm.”
Canadian Journal of Statistics, 46(1), 147–175. doi:10.1002/cjs.11343.

Fahrmeir L (2014). “Discrete Survival-Time Models.” Wiley StatsRef: Statistics Reference
Online. doi:10.1002/9781118445112.stat06012.

Fine JP, Gray RJ (1999). “A Proportional Hazards Model for the Subdistribution of a
Competing Risk.” Journal of the American Statistical Association, 94(446), 496–509. doi:
10.1080/01621459.1999.10474144.

Freund Y, Schapire RE (1997). “A Decision-Theoretic Generalization of On-Line Learning and
an Application to Boosting.” Journal of Computer and System Sciences, 55(1), 119–139.
doi:10.1006/jcss.1997.1504.

Friedman JH (1991). “Multivariate Adaptive Regression Splines (with Discussion and a Re-
joinder by the Author).” The Annals of Statistics, 19, 1–67. URL http://www.jstor.org/
stable/2241837.

Friedman JH (2001). “Greedy Function Approximation: A Gradient Boosting Machine.” The
Annals of Statistics, 29(5), 1189–1232. URL http://www.jstor.org/stable/2699986.

Frühwirth-Schnatter S, Frühwirth R (2010). “Data Augmentation and MCMC for Binary
and Multinomial Logit Models.” In Statistical Modelling and Regression Structures, pp.
111–132. Springer-Verlag, New York. doi:10.1007/978-3-7908-2413-1_7.

Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P,
Barrett B, Lumsdaine A, Castain R, Daniel D, Graham R, Woodall T (2004). “Open MPI:
Goals, Concept, and Design of a Next Generation MPI Implementation.” In European
Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting, pp. 97–104.
Springer-Verlag, Berlin, Heidelberg. doi:10.1007/978-3-540-30218-6_19.

Gelfand AE, Smith AF (1990). “Sampling-Based Approaches to Calculating Marginal
Densities.” Journal of the American Statistical Association, 85(410), 398–409. doi:
10.1080/01621459.1990.10476213.

Geman S, Geman D (1984). “Stochastic Relaxation, Gibbs Distributions, and the Bayesian
Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence,
6, 721–741. doi:10.1109/tpami.1984.4767596.

Geweke J (1992). “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation
of Posterior Moments.” In JM Bernado, JO Berger, AP Dawid, AFM Smith (eds.), Bayesian
Statistics 4, pp. 169–193. Oxford University Press, Oxford.

Gönen M, Heller G (2005). “Concordance Probability and Discriminatory Power in Propor-
tional Hazards Regression.” Biometrika, 92(4), 965–970. doi:10.1093/biomet/92.4.965.

Gramacy RB, Polson NG (2012). “Simulation-Based Regularized Logistic Regression.”
Bayesian Analysis, 7(3), 567–590. doi:10.1214/12-ba719.

Hahn P, Carvalho C (2015). “Decoupling Shrinkage and Selection in Bayesian Linear Mod-
els: a Posterior Summary Perspective.” Journal of the American Statistical Association,
110(509), 435–448. doi:10.1080/01621459.2014.993077.

Harrison Jr D, Rubinfeld DL (1978). “Hedonic Housing Prices and the Demand for Clean
Air.” Journal of Environmental Economics and Management, 5(1), 81–102. doi:10.1016/
0095-0696(78)90006-2.

Hastings WK (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their
Applications.” Biometrika, 57(1), 97–109. doi:10.1093/biomet/57.1.97.

Holmes C, Held L (2006). “Bayesian Auxiliary Variable Models for Binary and Multinomial
Regression.” Bayesian Analysis, 1(1), 145–168. doi:10.1214/06-ba105.

Imai K, Van Dyk DA (2005). “A Bayesian Analysis of the Multinomial Probit Model Using
Marginal Data Augmentation.” Journal of Econometrics, 124(2), 311–334. doi:10.1016/
j.jeconom.2004.02.002.

Institute of Electrical and Electronics Engineers (2008). IEEE Std 754-2008, chapter IEEE
Standard for Floating-Point Arithmetic, pp. 1–70. IEEE. doi:10.1109/ieeestd.2008.
4610935.
Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM (2014). “Random
Survival Forests for Competing Risks.” Biostatistics, 15(4), 757–773. doi:10.1093/
biostatistics/kxu010.
Johnson NL, Kotz S, Balakrishnan N (1995). Continuous Univariate Distributions, volume 2.
2nd edition. John Wiley & Sons, New York.
Kalbfleisch JD, Prentice RL (1980). The Statistical Analysis of Failure Time Data. 1st edition.
John Wiley & Sons, Hoboken.
Kalbfleisch JD, Prentice RL (2002). The Statistical Analysis of Failure Time Data. 2nd
edition. John Wiley & Sons, Hoboken. doi:10.1002/9781118032985.
Kapelner A, Bleich J (2016). “bartMachine: Machine Learning with Bayesian Additive Re-
gression Trees.” Journal of Statistical Software, 70(4), 1–40. doi:10.18637/jss.v070.i04.
Kindo BP, Wang H, Peña EA (2016). “Multinomial Probit Bayesian Additive Regression
Trees.” Stat, 5(1), 119–131. doi:10.1002/sta4.110.
Klein JP, Moeschberger ML (2003). Survival Analysis: Techniques for Censored and Trun-
cated Data. 2nd edition. Springer-Verlag, New York. doi:10.1007/b97377.
Krogh A, Solich P (1997). “Statistical Mechanics of Ensemble Learning.” Physical Review E,
55(1), 811–825. doi:10.1103/physreve.55.811.
Kuhn M, Johnson K (2013). Applied Predictive Modeling. Springer-Verlag, New York. doi:
10.1007/978-1-4614-6849-3.
Linero A (2018). “Bayesian Regression Trees for High Dimensional Prediction and Variable
Selection.” Journal of the American Statistical Association, 113(522), 626–636. doi:10.
1080/01621459.2016.1264957.
Loprinzi CL, Laurie JA, Wieand HS, Krook JE, Novotny PJ, Kugler JW, Bartel J, Law
M, Bateman M, Klatt NE (1994). “Prospective Evaluation of Prognostic Variables from
Patient-Completed Questionnaires. North Central Cancer Treatment Group.” Journal of
Clinical Oncology, 12(3), 601–607.
Lynch J (1965). “The Burroughs B8500.” Datamation, pp. 49–50.
McCulloch R, Rossi PE (1994). “An Exact Likelihood Analysis of the Multinomial Probit
Model.” Journal of Econometrics, 64(1), 207–240. doi:10.1016/0304-4076(94)90064-7.
McCulloch RE, Carvalho C, Hahn R (2015). “A General Approach to Variable Selection
Using Bayesian Nonparametric Models.” Joint Statistical Meetings, Seattle, 2015-08-09–
2015-08-13.
McCulloch RE, Polson NG, Rossi PE (2000). “A Bayesian Analysis of the Multinomial Probit
Model with Fully Identified Parameters.” Journal of Econometrics, 99(1), 173–193. doi:
10.1016/s0304-4076(00)00034-8.

McCulloch RE, Sparapani RA, Gramacy R, Spanbauer C, Pratola M (2021). BART: Bayesian
Additive Regression Trees. R package version 2.9, URL https://CRAN.R-project.org/
package=BART.

Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953). “Equation of
State Calculations by Fast Computing Machines.” The Journal of Chemical Physics, 21(6),
1087–1092. doi:10.1063/1.1699114.

Mueller P (1991). “A Generic Approach to Posterior Integration and Gibbs Sampling.” Tech-
nical Report 91-09, Purdue University, West Lafayette. URL http://www.stat.purdue.
edu/research/technical_reports/pdfs/1991/tr91-09.pdf.

Murray JS (2020). “Log-Linear Bayesian Additive Regression Trees for Multinomial Logistic
and Count Regression Models.” Journal of the American Statistical Association, (ahead
of print), 1–35. doi:10.1080/01621459.2020.1813587.

Nicolaie MA, Van Houwelingen HC, Putter H (2010). “Vertical Modeling: A Pattern Mixture
Approach for Competing Risks Modeling.” Statistics in Medicine, 29(11), 1190–1205. doi:
10.1002/sim.3844.

Plummer M, Best N, Cowles K, Vines K (2006). “coda: Convergence Diagnosis and Output
Analysis for MCMC.” R News, 6(1), 7–11. URL https://CRAN.R-project.org/doc/
Rnews/.

Pratola MT (2016). “Efficient Metropolis-Hastings Proposal Mechanisms for Bayesian Re-
gression Tree Models.” Bayesian Analysis, 11(3), 885–911. doi:10.1214/16-ba999.

Pratola MT, Chipman HA, Gattiker JR, Higdon DM, McCulloch R, Rust WN (2014). “Paral-
lel Bayesian Additive Regression Trees.” Journal of Computational and Graphical Statistics,
23(3), 830–852. doi:10.1080/10618600.2013.841584.

R Core Team (2017). Mathlib: A C Library of Special Functions. R Foundation for Sta-
tistical Computing, Vienna, Austria. URL https://CRAN.R-project.org/doc/manuals/
r-release/R-admin.html.

R Core Team (2020). R: A Language and Environment for Statistical Computing. R Founda-
tion for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Ripley BD (2007). Pattern Recognition and Neural Networks. Cambridge University Press.

Robert C, Casella G (2013). Monte Carlo Statistical Methods. Springer-Verlag, New York.

Robert CP (1995). “Simulation of Truncated Normal Variables.” Statistics and Computing,
5(2), 121–125. doi:10.1007/bf00143942.

Rossini AJ, Tierney L, Li N (2007). “Simple Parallel Statistical Computing in R.” Journal of
Computational and Graphical Statistics, 16(2), 399–420. doi:10.1198/106186007x178979.

Scott SL (2011). “Data Augmentation, Frequentist Estimation, and the Bayesian Anal-
ysis of Multinomial Logit Models.” Statistical Papers, 52(1), 87–109. doi:10.1007/
s00362-009-0205-0.

Silverman BW (1986). Density Estimation for Statistics and Data Analysis. Chapman and
Hall, London.

Sparapani R, Logan BR, McCulloch RE, Laud PW (2020a). “Nonparametric Competing
Risks Analysis Using Bayesian Additive Regression Trees (BART).” Statistical Methods in
Medical Research, 29(1), 57–77. doi:10.1177/0962280218822140.

Sparapani R, Rein L, Tarima S, Jackson T, Meurer J (2020b). “Non-Parametric Recurrent
Events Analysis with BART and an Application to the Hospital Admissions of Patients
with Diabetes.” Biostatistics, 21(1), 69–85. doi:10.1093/biostatistics/kxy032.

Sparapani RA, Logan BR, McCulloch RE, Laud PW (2016). “Nonparametric Survival Anal-
ysis Using Bayesian Additive Regression Trees (BART).” Statistics in Medicine, 35(16),
2741–2753. doi:10.1002/sim.6893.

Thompson Jr WA (1977). “On the Treatment of Grouped Observations in Life Studies.”
Biometrics, 33(3), 463–470. doi:10.2307/2529360.

Tierney L, Rossini AJ, Li N, Sevcikova H (2018). snow: Simple Network of Workstations.
R package version 0.4-3, URL https://CRAN.R-project.org/package=snow.

Urbanek S (2020). rJava: Low-Level R to Java Interface. R package version 0.9-13, URL
https://CRAN.R-project.org/package=rJava.

Venables WN, Ripley BD (2002). Modern Applied Statistics with S. 4th edition. Springer-
Verlag, New York.

Walker DW, Dongarra JJ (1996). “MPI: A Standard Message Passing Interface.” Supercom-
puter, 12, 56–68.

Wei LJ, Lin DY, Weissfeld L (1989). “Regression Analysis of Multivariate Incomplete Fail-
ure Time Data by Modeling Marginal Distributions.” Journal of the American Statistical
Association, 84(408), 1065–1073. doi:10.1080/01621459.1989.10478873.

Wu Y, Tjelmeland H, West M (2007). "Bayesian CART: Prior Specification and Posterior
Simulation." Journal of Computational and Graphical Statistics, 16(1), 44–66. doi:10.
1198/106186007x180426.

Xu D, Daniels MJ, Winterstein AG (2016). "Sequential BART for Imputation of Missing
Covariates." Biostatistics, 17(3), 589–602. doi:10.1093/biostatistics/kxw009.

Yu H (2002). “Rmpi: Parallel Statistical Computing in R.” R News, 2(2), 10–14. URL
https://CRAN.R-project.org/doc/Rnews/.

A. Getting and installing the BART R package


The BART package (McCulloch, Sparapani, Gramacy, Spanbauer, and Pratola 2021) is GNU
General Public License (GPL) software, available from the Comprehensive R Archive Network
(CRAN) at https://CRAN.R-project.org/package=BART. You can install it from CRAN
as follows.

R> options(repos = c(CRAN = "https://CRAN.R-project.org"))
R> install.packages("BART", dependencies = TRUE)

The examples in this article are included in the package. You can run the first example
(described in Section 3) as follows.

R> options(figures = ".")


R> if(.Platform$OS.type == "unix") {
R> options(mc.cores = min(8, parallel::detectCores()))
R> } else {
R> options(mc.cores = 1)
R> }
R> demo("boston", package = "BART"))

As we shall see, these examples produce R objects containing BART model fits. But, these fits
are Bayesian nonparametric samples from the posterior and require statistical summarization
before they are readily interpretable. Therefore, we often employ graphical summaries (such
as the figures in this article) to visualize the BART model fit. Note that the figures option
(in the code snippet above) specifies a directory where the Portable Document Format (PDF)
graphics files will be produced; if it is not specified, then the graphics will be generated by
R; however, no PDF files will be created. Furthermore, some of these BART model fits can
take a few minutes so it is wise to utilize multi-threading when it is available (for a discussion
of efficient computation with BART including multi-threading, see Appendix Section D).
Returning to the snippet above, the option mc.cores specifies the number of cores to employ
in multi-threading; there are diminishing returns, so 8 cores is often sufficient. And, finally,
to run all of the examples in this article (with the options as specified above), run
demo("replication", package = "BART").

B. Binary trees and the BART prior


BART relies on an ensemble of H binary trees which are a type of a directed acyclic graph. We
exploit the wooden tree metaphor to its fullest. Each of these trees grows from the ground up
starting out as a root node. The root node is generally a branch decision rule, but it doesn’t
have to be; occasionally there are trees in the ensemble which are only a root terminal node
consisting of a single leaf output value. If the root is a branch decision rule, then it spawns a
left and a right node which each can be either a branch decision rule or a terminal leaf value
and so on. In a binary tree, T, there are C nodes which are made up of B branches and L leaves:
C = B + L. There is an algebraic relationship between the number of branches and leaves
which we express as B = L − 1.
The ensemble of trees is encoded in an ASCII string which is returned in the treedraws$trees
list item. This string can be easily imported by R with the following:

R> write(post$treedraws$trees, "trees.txt")
R> tc <- textConnection(post$treedraws$trees)
R> trees <- read.table(file = tc, fill = TRUE, row.names = NULL,
+    header = FALSE, col.names = c("node", "var", "cut", "leaf"))
R> close(tc)
R> head(trees)
  node var cut          leaf
1 1000 200   1            NA
2    3  NA  NA            NA
3    1   0  66 -0.001032108
4    2   0   0  0.004806880
5    3   0   0  0.035709372
6    3  NA  NA            NA
The string is encoded via the following binary tree notation. The first line is an exception
which has the number of MCMC samples, M , in the field node; the number of trees, H, in
the field var; and the number of variables, P , in the field cut. For the rest of the file, the
field node is used for the number of nodes in the tree when all other fields are NA; or for a
specific node when the other fields are present. The nodes are numbered in relation to the
tree’s tier level, t(n) = ⌊log2 n⌋ or t = floor(log2(node)), as follows.

Tier   Node numbers
  0    1
  1    2   3
  2    4   5   6   7
  ...
  t    2^t   ...   2^(t+1) − 1

Table 1: Schematic diagram of a binary tree.

The var field is the variable in the branch decision rule which is encoded 0, . . . , P − 1 as a
C/C++ array index (rather than an R index). Similarly, the cut field is the cutpoint of the
variable in the branch decision rule which is encoded 0, . . . , cj − 1 for variable j; note that the
cutpoints are returned in the treedraws$cutpoints list item. The terminal leaf output value
is contained in the field leaf. It is not immediately obvious which nodes are branches vs.
leaves since, at first, it would appear that the leaf field is given for both branches and leaves.
Leaves are always associated with var = 0 and cut = 0; however, note that this is also a
valid branch variable/cutpoint since these are C/C++ indices. The key to discriminating
between branches and leaves is the algebraic relationship between a branch, n, at tree tier
t(n) and its left, l = 2n, and right, r = 2n + 1, child nodes at tier t(n) + 1: for each node
besides the root, you can determine the branch from which it arose, and any node from which
no other node descends is necessarily a leaf.
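To make this concrete, the following minimal sketch flags branches versus leaves for a single
tree; the node numbers used here are hypothetical and not taken from a particular fit.

R> ## a node n is a branch when both of its children, 2n and 2n + 1, are present;
R> ## otherwise it is a leaf
R> nodes <- c(1, 2, 3)    ## hypothetical node numbers for one tree
R> branch <- (2 * nodes) %in% nodes & (2 * nodes + 1) %in% nodes
R> data.frame(node = nodes, tier = floor(log2(nodes)), branch = branch)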
Underlying this methodology is the BART prior. The BART prior specifies a flexible class of
unknown functions, f , from which we can gather randomly generated fits to the given data
via the posterior. N.B. we define f as returning a scalar value, but BART extensions which
return multivariate values are conceivable. Let the function g(x; T , M) assign a value based
on the input x. The binary decision tree T is represented by a set of ordered triples, (n, j, k),
representing branch decision rules: n ∈ B for node n in the set of branches B, j for covariate
xj and k for the cutpoint cjk . The branch decision rules are of the form xj < cjk which means
branch left and xj ≥ cjk , branch right; or terminal leaves where it stops. M represents leaves
and is a set of ordered pairs, (n, µn ): n ∈ L where L is the set of leaves (L is the complement
of B) and µn for the outcome value.
The function, f (x), is a sum of H trees:
    f(x) = ∑_{h=1}^{H} g(x; Th, Mh)                                  (3)

where H is “large”, let us say, 50, 100 or 200.


For a continuous outcome, yi, we have the following BART regression on the vector of
covariates, xi:

    yi = µ0 + f(xi) + εi   where the errors are iid:   εi ∼ N(0, wi^2 σ^2)

with i indexing subjects i = 1, . . . , N. The unknown random function, f, and the error
variance, σ^2, follow the BART prior expressed notationally as

    (f, σ^2) ∼ BART(H, µ0, τ, k, α, γ; ν, λ, q)

where H is the number of trees, µ0 is a known constant which centers y and the rest of the
parameters will be explained later in this section (for brevity, we will often use the simpler
shorthand (f, σ^2) ∼ BART). The wi are known standard deviation weight multiples which
you can supply with the argument w that is only available for continuous outcomes, hence,
the weighted BART name; the unit weight vector is the default. The centering parameter,
µ0, can be specified via the fmean argument where the default is taken to be ȳ.
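For illustration, here is a minimal sketch of a weighted BART fit on simulated data; the w
and fmean values shown below are simply the defaults made explicit, and the data are toy
data rather than one of the package examples.

R> library(BART)
R> set.seed(12)
R> N <- 100
R> x <- matrix(runif(N * 3), N, 3)            ## simulated covariates
R> y <- 10 * sin(pi * x[, 1] * x[, 2]) + rnorm(N)
R> post <- wbart(x.train = x, y.train = y,
+    w = rep(1, N),                            ## unit weights (the default)
+    fmean = mean(y))                          ## centering constant, mu0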
BART is a Bayesian nonparametric prior. Using the Gelfand-Smith generic bracket notation
for the specification of random variable distributions (Gelfand and Smith 1990), we represent
the BART prior in terms of the collection of all trees, T; the collection of all leaves, M; and
the error variance, σ^2, as the following product: [T, M, σ^2] = [σ^2] [T, M] = [σ^2] [T] [M | T].
Furthermore, the individual trees themselves are independent: [T, M] = ∏_h [Th] [Mh | Th]
where [Th] is the prior for the h-th tree and [Mh | Th] is the collection of leaves for the h-th
tree. And, finally, the collection of leaves for the h-th tree are independent: [Mh | Th] =
∏_n [µhn | Th] where n indexes the leaf nodes.

The tree prior: [Th ]. There are three prior components of Th which govern whether the tree
branches grow or are pruned. The first tree prior regularizes the probability of a branch at
leaf node n in tree tier t(n) = ⌊log2 n⌋ as

    P[Bn = 1] = α (t(n) + 1)^{-γ}                                    (4)

where Bn = 1 represents a branch while Bn = 0 is a leaf, 0 < α < 1 and γ ≥ 0. You can
specify these prior parameters with arguments, but the following defaults are recommended:
α is set by the parameter base = 0.95 and γ by power = 2; for a detailed discussion of these
parameter settings, see Chipman et al. (1998). Note that this prior penalizes branch growth,
i.e., in prior probability, the default number of branches will likely be 1 or 2. Next, there is
a prior dictating the choice of a splitting variable j conditional on a branch event Bn which
defaults to uniform probability sj = P^{-1} where P is the number of covariates (however, you
can specify a Dirichlet prior which is more appropriate if the number of covariates is large
(Linero 2018); see below). Given a branch event, Bn , and a variable chosen, xj , the last tree
prior selects a cut point, cjk , within the range of observed values for xj ; this prior is uniform.
We can also represent the probability of variable selection via the sparse Dirichlet prior as
[s1, . . . , sP] | θ ∼ Dirichlet(θ/P, . . . , θ/P) which is specified by the argument sparse =
TRUE while the default is sparse = FALSE for uniform sj = P^{-1}. The prior parameter θ can
be fixed or random: supplying a positive number will specify θ fixed at that value while the
default theta = 0 is random and its value will be learned from the data. The random θ
prior is induced via θ/(θ + ρ) ∼ Beta(a, b) where the parameter ρ can be specified by
the argument rho (which defaults to NULL representing the value P; provide a value to over-
ride), the parameter b defaults to 1 (which can be over-ridden by the argument b) and the
parameter a defaults to 0.5 (which can be over-ridden by the argument a). The distribution
of theta controls the sparsity of the model: a = 0.5 induces a sparse posture while a = 1 is
not sparse and similar to the uniform prior with probability sj = P^{-1}. If additional sparsity
is desired, then you can set the argument rho to a value smaller than P.
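As a hedged sketch of these arguments (x.train and y.train are assumed to be an existing
design matrix and outcome, and the varprob element is assumed to hold the posterior draws
of [s1, . . . , sP]):

R> ## sparse Dirichlet variable-selection prior with theta learned from the data
R> post <- wbart(x.train, y.train, sparse = TRUE, a = 0.5, b = 1, rho = NULL)
R> dim(post$varprob)    ## assumed: ndpost draws of the P splitting probabilities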
Here, we take the opportunity to provide some insight into how and why the sparse prior
works as desired. The key to understanding the inducement of sparsity is the distribution of
the arguments to the Dirichlet prior: θ/P . We are unaware of this result appearing elsewhere
in the literature. But, it can be shown that θ/P ∼F (a, b, ρ/P ) where F (.) is the beta prime
distribution scaled by ρ/P (Johnson, Kotz, and Balakrishnan 1995). The non-sparse setting
is (a, b, ρ/P) = (1, 1, 1). As you can see in Figure 19, sparsity is increased by reducing
ρ, reducing a or reducing both. Unlike matrices, data frames can contain categorical factors.
Therefore, factors can be supplied when x.train is a data frame. Factors with multiple levels
are transformed into dummy variables with each level as their own binary indicator; factors
with only two levels are a binary indicator with a single dummy variable.
The leaf prior: [µhn | Th]. Given a tree, Th, there is a prior on its leaf values, µhn | Th, and
we denote the collection of all leaves of Th by Mh = {(n, µhn) : n ∈ Lh}. Suppose that
yi ∈ [ymin, ymax] for all i and denote µ1(i), . . . , µH(i) as the leaf output values from each tree
corresponding to the vector of covariates, xi. If the µh(i) | Th are iid N(0, σµ^2), then the
model estimate for subject i is µi = E[yi | xi] = µ0 + ∑_h µh(i) where µi ∼ N(µ0, H σµ^2).
We choose a value for σµ which is the solution to the equations ymin = µ0 − k√H σµ and
ymax = µ0 + k√H σµ, i.e., σµ = (ymax − ymin) / (2k√H). Therefore, we arrive at
µhn ∼ N(0, [τ / (2k√H)]^2) where τ = ymax − ymin. So, the prior for µhn is informed by the
data, y, but only weakly via the extrema, ymin and ymax. The parameter k calibrates this
prior as follows.

    µi ∼ N(µ0, (τ / 2k)^2)
    P[ymin ≤ µi ≤ ymax] = Φ(k) − Φ(−k)
    since P[µi ≤ ymax] = P[z ≤ 2k (ymax − µ0) / τ] ≈ P[z ≤ k] = Φ(k)
    and, similarly, P[µi ≤ ymin] ≈ Φ(−k)

[Figure 19 appears here: the natural logarithm of the scaled beta prime density,
log f(x; a, b, ρ/P), plotted over x from 0 to 5 for the settings (a, b, ρ/P) = (1, 1, 1),
(1, 1, 0.5), (0.5, 1, 1) and (0.5, 1, 0.5).]

Figure 19: The distribution of θ/P and the sparse Dirichlet prior. The key to understand-
ing the inducement of sparsity is the distribution of the arguments to the Dirichlet prior:
θ/P ∼ F(a, b, ρ/P) where F(.) is the beta prime distribution scaled by ρ/P. Here we plot the
natural logarithm of the scaled beta prime density, f(.), at a non-sparse setting and three
sparse settings. The non-sparse setting is (a, b, ρ/P) = (1, 1, 1) (solid black line). As you can
see in the figure, sparsity is increased by reducing ρ (long dashed red line), reducing a (short
dashed blue line) or reducing both (mixed dashed gray line).

The default value, k = 2, corresponds to µi falling within the extrema with approximately
0.95 probability. Alternative choices of k can be supplied via the k argument. We have found
that values of k ∈ [1, 3] generally yield good results. Note that k is a potential candidate
parameter for choice via cross-validation.
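For example, a quick sketch of the leaf prior scale implied by these choices for a hypothetical
outcome vector y:

R> k <- 2; H <- 200                       ## the continuous-outcome defaults
R> tau <- max(y) - min(y)
R> sigma.mu <- tau / (2 * k * sqrt(H))    ## prior standard deviation of each leaf value
R> c(tau = tau, sigma.mu = sigma.mu)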
The error variance prior: [σ^2]. The prior for σ^2 is the conjugate scaled inverse Chi-square
distribution, i.e., ν λ χ^{-2}(ν). We recommend that the degrees of freedom, ν, be from 3 to
10 and the default is 3 which can be over-ridden by the argument sigdf. The λ parameter
can be specified by the lambda argument which defaults to NA. If lambda is unspecified,
then we determine a reasonable value for λ based on an estimate, σ̂ (which can be specified
by the argument sigest and defaults to NA). If sigest is unspecified, the default value of
sigest is determined via linear regression or the sample standard deviation: if P < N, then
yi ∼ N(xi'β̂, σ̂^2); otherwise, σ̂ = sy. Now we solve for λ such that P[σ^2 ≤ σ̂^2] = q. This
quantity, q, can be specified by the argument sigquant and the default is 0.9 whereas we
also recommend considering 0.75 and 0.99. Note that the pair (ν, q) are potential candidate
parameters for choice via cross-validation.
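To make the calibration concrete, here is a sketch of solving for λ from (ν, q) and an estimate
σ̂ (the value of sigest below is hypothetical), using the fact that σ^2 = νλ/X with X ∼ χ^2(ν):

R> sigdf <- 3; sigquant <- 0.9            ## the defaults for (nu, q)
R> sigest <- 1.5                          ## hypothetical estimate of sigma
R> ## choose lambda so that P(sigma^2 <= sigest^2) = sigquant
R> lambda <- sigest^2 * qchisq(1 - sigquant, sigdf) / sigdf
R> lambda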
Other important arguments for the BART prior. We fix the number of trees at H which cor-
responds to the argument ntree. The default number of trees is 200 for continuous outcomes;
as shown by Bleich et al. (2014), 50 is also a reasonable choice which is the default for all
other outcomes: cross-validation could be considered. The number of cutpoints is provided by
the argument numcut and the default is 100. The default number of cutpoints is achieved for
continuous covariates. For continuous covariates, the cutpoints are uniformly distributed by
default, or generated via uniform quantiles if the argument usequants = TRUE is provided.
By default, discrete covariates which have fewer than 100 values will necessarily have fewer
cutpoints. However, if you want a single discrete covariate to be represented by a group of
binary dummy variables, one for each category, then pass the variable as a factor within a
data frame.
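As a brief sketch (the data frame df and outcome y below are hypothetical), the following
call requests quantile-based cutpoints and lets the multi-level factor be expanded into
per-level dummy indicators:

R> df <- data.frame(age = rnorm(100, 50, 10),
+    region = factor(sample(c("N", "S", "E", "W"), 100, replace = TRUE)))
R> y <- rnorm(100)
R> post <- wbart(x.train = df, y.train = y, ntree = 50, numcut = 100,
+    usequants = TRUE)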

C. Posterior computation for BART


In order to generate samples from the posterior for f , we sample the structure of all the
trees Th , for h = 1, . . . , H; the values of all leaves µhn for n ∈ Lh within tree h; and, when
appropriate, the error variance σ 2 . Additionally, with the sparsity prior, there are samples
of the vector of splitting variable selection probabilities [s1 , . . . , sP ] and, when the sparsity
parameter is random, samples of θ.
The leaf and variance parameters are sampled from the posterior using Gibbs sampling (Ge-
man and Geman 1984; Gelfand and Smith 1990). Since the priors on these parameters are
conjugate, the Gibbs conditionals are specified analytically. For the leaves, each µhn is drawn
from a normal conditional density. The error variance, σ 2 , is drawn from a scaled inverse
Chi-square conditional.
Drawing a tree from the posterior requires a Metropolis-within-Gibbs sampling scheme (Mueller
1991; Robert and Casella 2013), i.e., a Metropolis-Hastings (MH) step (Metropolis, Rosen-
bluth, Rosenbluth, Teller, and Teller 1953; Hastings 1970) within Gibbs sampling. For single-
tree models, four different proposal mechanisms are defined (Chipman et al. 1998) (N.B. other
MCMC tree sampling strategies have been proposed: Denison, Mallick, and Smith (1998);
Wu, Tjelmeland, and West (2007); Pratola (2016)). The complementary BIRTH/DEATH
proposals are essential (the two other proposals are CHANGE and SWAP (Chipman et al.
1998)). For programming simplicity, the BART package only implements the BIRTH and
DEATH proposals each with equal probability. BIRTH selects a leaf and turns it into a
branch, i.e., selects a new variable and cutpoint with two leaves “born” as its descendants.
DEATH selects a branch leading to two terminal leaves and “kills” the branch by replacing
it with a single leaf. To illustrate this discussion, we present the acceptance probability for a
BIRTH proposal. Note that a DEATH proposal is the reversible inverse of a BIRTH proposal.
The algorithm assumes a fixed discrete set of possible split values for each xj . Furthermore,
the leaf values, µhn , are integrated over so that our search in tree space is over a large, but
discrete, set of possibilities. At the m-th MCMC step, let T m denote the current state for the
h-th tree and T ∗ denotes the proposed h-th tree (subscript h is suppressed for convenience).
T* is identical to T^m except that one terminal leaf of T^m is replaced by a branch of T* with
two terminal leaves. The proposed tree is accepted with the following probability:

    πBIRTH = min{ 1, (P[T*] P[T^m | T*]) / (P[T^m] P[T* | T^m]) }
where P [T m ] and P [T ∗ ] are the posterior probabilities of T m and T ∗ respectively. These
are the targets of this sampling, each consisting of a likelihood contribution and prior contri-
bution. Additionally, P [T m | T ∗ ] is the probability of proposing T m given current state T ∗


(a DEATH) and P [T ∗ | T m ] is the probability of proposing T ∗ given current state T m (a
BIRTH).
First, we describe the likelihood contribution to the posterior. Let yn denote the partition
of y corresponding to the leaf node n given the tree T. Because the leaf values are a priori
conditionally independent, we have [y | T] = ∏_n [yn | T]. So, for the ratio P[T*] / P[T^m],
after cancellation of terms in the numerator and denominator, we have the likelihood
contribution:

    P[yL, yR | T*] / P[yLR | T^m] = (P[yL | T*] P[yR | T*]) / P[yLR | T^m]

where yL is the partition corresponding to the newborn left leaf node; yR, the partition for
the newborn right leaf node; and yLR is the combination of yL and yR stacked together. N.B.
the terms in the ratio are the predictive densities of a normal mean with a known variance
and a normal prior for the mean.
Similarly, the terms that the prior contributes to the posterior ratio often cancel since there
is only one “place” where the trees differ and the prior draws components independently at
different “places” of the tree. Therefore, the prior contribution to P[T*] / P[T^m] is

    (P[Bn = 1] P[Bl = 0] P[Br = 0] sj) / P[Bn = 0]
        = (α(t(n) + 1)^{-γ} [1 − α(t(n) + 2)^{-γ}]^2 sj) / (1 − α(t(n) + 1)^{-γ})

where P[Bn] is the branch regularity prior (see Equation 4), sj is the splitting variable
selection probability, n is the chosen leaf node in tree T^m, l = 2n is the newborn left leaf
node in tree T* and r = 2n + 1 is the newborn right leaf node in tree T*.
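As a numeric sketch of this prior ratio at a root-node (tier 0) leaf, with the default base and
power and a uniform splitting prior over a hypothetical P = 10 covariates:

R> alpha <- 0.95; gamma <- 2; P <- 10; t <- 0    ## hypothetical settings
R> s.j <- 1 / P
R> ## prior contribution to the BIRTH acceptance ratio (terms from Equation 4)
R> alpha * (t + 1)^(-gamma) * (1 - alpha * (t + 2)^(-gamma))^2 * s.j /
+    (1 - alpha * (t + 1)^(-gamma))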
Finally, the ratio P[T^m | T*] / P[T* | T^m] is

    (P[DEATH | T*] P[n | T*]) / (P[BIRTH | T^m] P[n | T^m] sj)

where P[n | T] is the probability of choosing node n given tree T. N.B. sj appears in both the
numerator and denominator of the acceptance probability πBIRTH and therefore cancels,
which is mathematically convenient.
Now, let us briefly discuss the posterior computation related to the Dirichlet sparse prior. If
a Dirichlet prior is placed on the variable splitting probabilities, s, then its posterior samples
are drawn via Gibbs sampling with conjugate Dirichlet draws. The Dirichlet parameter is
updated by adding the total variable branch count over the ensemble, mj, to the prior setting,
θ/P, i.e., [θ/P + m1, . . . , θ/P + mP]. In this way, the Dirichlet prior induces a “rich get richer”
variable selection strategy. The sparsity parameter, θ, is drawn on a fine grid of values for
the analytic posterior (Linero 2018). This draw only depends on [s1 , . . . , sP ].
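A minimal sketch of this conjugate update (the branch counts mj below are hypothetical),
drawing the Dirichlet via normalized Gamma variates:

R> theta <- 1; P <- 5
R> m <- c(10, 0, 3, 0, 1)                 ## hypothetical branch counts over the ensemble
R> g <- rgamma(P, shape = theta / P + m)  ## Dirichlet draw via normalized Gammas
R> s <- g / sum(g)
R> round(s, 3)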

D. Efficient computing with BART


If you had the task of creating an efficient implementation for a black-box model such as
BART, which tools would you use? Surprisingly, linear algebra routines which are a traditional
building block of scientific computing will be of little use for a tree-based method such as
BART. So what is needed? Restricting ourselves to widely available off-the-shelf hardware
and open-source software, we believe there are four key technologies necessary for a successful
BART implementation:
• An object-oriented language to facilitate working with trees and matrices.

• A parallel (or distributed) CPU computing framework for faster processing.

• A high-quality parallel random number generator.

• An interpreted shell for high-level data processing and analysis.

In our implementation of BART, we pair the object-oriented languages of R and C++ to
satisfy these requirements. In this section, we give a brief introduction to the concepts and
technologies harnessed for efficient computing by our BART package.

D.1. A brief history of multi-threading


Writing multi-threaded programs is a fairly routine practice today with a high-level language
like R and corresponding user-friendly interfaces such as the parallel R package (R Core
Team 2020). Modern off-the-shelf laptops typically have 4 or 8 CPU cores placing reasonably
priced multi-threaded hardware at your fingertips. Although, BART is often computationally
undemanding, we find it very convenient, with the aid of multi-threading, to run in seconds
that which would otherwise take minutes. To highlight the point that multi-threading is a
mature technology, we now present a brief history of multi-threading. This is not meant to
be exhaustive; rather, we only provide enough detail to explain the capability and popularity
of multi-threading today.
Multi-threading emerged rather early in the digital computer age with pioneers laying the
research groundwork in the 1960s. In 1961, Burroughs released the B5000 which was the
first commercial hardware capable of multi-threading (Lynch 1965). The B5000 performed
asymmetric multiprocessing which is commonly employed in modern hardware like numerical
co-processors and/or graphical processors today. In 1962, Burroughs released the D825 which
was the first commercial hardware capable of symmetric multiprocessing (SMP) with CPUs
(Anderson, Hoffman, Shifman, and Williams 1962). In 1967, Gene Amdahl derived the the-
oretical limits for multi-threading which came to be known as Amdahl’s law (Amdahl 1967).
If B is the number of CPUs and b is the fraction of work that can’t be parallelized, then the
gain due to multi-threading is ((1 − b)/B + b)^{-1}.
Now, fast-forward to the modern era of multi-threading. Hardware and software architectures
in current use both directly, and indirectly, led to the wide availability of pervasive multi-
threading today. In 2000, Advanced Micro Devices (AMD) released the AMD64 specification
that created a new 64-bit x86 instruction set which was capable of co-existing with 16-bit
and 32-bit x86 legacy instructions. This was an important advance since 64-bit math is
capable of addressing vastly more memory than 16-bit or 32-bit (2^64 vs. 2^16 or 2^32) and
multi-threading inherently requires more memory resources. In 2003, version 2.6 of the Linux
kernel incorporated full SMP support; prior Linux kernels had either no support or very
limited/crippled support. From 2005 to 2011, AMD released a series of Opteron chips with
multiple cores for multi-threading: 2 cores in 2005, 4 cores in 2007, 6 cores in 2009, 12
cores in 2010 and 16 cores in 2011. From 2008 to 2010, Intel brought to market Xeon chips
with their hyper-threading technology that allows each core to issue two instructions per
clock cycle: 4 cores (8 threads) in 2008 and 8 cores (16 threads) in 2010. In today’s era,
most off-the-shelf hardware available features 1 to 4 CPUs each of which is capable of multi-
threading. Therefore, in the span of only a few years, multi-threading rapidly trickled down
from higher-end servers to mass-market products such as desktops and laptops. For example,
the consumer laptop that BART is developed on, purchased in 2016, is capable of 8 threads
(and hence many of the examples default to 8 threads).

D.2. Modern multi-threading software frameworks


Up to this point, we have introduced multi-threading with respect to parallelizing a task on
a single system. Here we want to make a distinction between simple multi-threading on a
single system and more complex multi-threading on multiple systems simultaneously which is
often denoted by the term distributed computing. On a single system, various programming
techniques can be used to create multi-threaded software. Basic multi-threading can be
provided by the fork system call which is often termed forking. More advanced multi-
threading is provided by software frameworks such as OpenMP (Dagum and Menon 1998)
and the message passing interface (MPI). Please note that MPI can be employed for both
simple multi-threading and for distributed computing, e.g., MPI software initially written for
a single system could be extended to operate on multiple systems as computational needs
expand. In the following, BART computations with multi-threading are explored where the
term multi-threading is used for a single system and the term distributed computing is used
for multiple systems.
In the late 1990s, MPI (Walker and Dongarra 1996) was introduced which is the dominant
distributed computing framework in use today (Gabriel et al. 2004). MPI support in R is
built upon a fairly consistent interface provided by the parallel package (R Core Team 2020)
which is extended by other CRAN packages such as snow (Tierney, Rossini, Li, and Sevcikova
2018) and Rmpi (Yu 2002). To support MPI, new BART software was created with a C++
object schema that is simple to program and maintain for distributed computing: we call
this the MPI BART code-base (Pratola, Chipman, Gattiker, Higdon, McCulloch, and Rust
2014). The BART package source code is a descendant of MPI BART and its programmer-
friendly objects, although, the multi-threading MPI support is now provided by R packages,
e.g., parallel, snow and Rmpi.
The BART package supports multi-threading in two ways: (1) via parallel and related pack-
ages (which is how MPI is provided); and (2) via the OpenMP standard (Dagum and Menon
1998). OpenMP takes advantage of modern hardware by performing multi-threading on sin-
gle machines which often have multiple CPUs each with multiple cores. Currently, the BART
package only uses OpenMP for parallelizing predict function calculations. The challenge
with OpenMP (besides the C/C++ programming required to support it) is that it is not
available on all platforms. Operating system support can be detected by the GNU auto-
tools (Calcote 2010) which define a C pre-processor macro called _OPENMP if it is available.
There are numerous exceptions for operating systems so it is difficult to make universal state-
ments. But, generally, Microsoft Windows lacks OpenMP detection since the GNU autotools
do not natively exist on this platform. For Apple macOS, the standard Xcode toolkit does
not provide OpenMP; however, the macOS compilers on CRAN do provide OpenMP (see
https://CRAN.R-project.org/bin/macosx/tools). Most Linux and UNIX distributions
provide OpenMP by default. We provide the function mc.cores.openmp which returns 1 if
the predict function is capable of utilizing OpenMP; otherwise, returns 0.
The parallel package provides multi-threading via forking. Forking is available on Unix plat-
forms, but not Windows (we use the term Unix to refer to UNIX, Linux and macOS since they
are all in the UNIX family tree). The BART package uses forking for posterior sampling of
the f function, and also for the predict function when OpenMP is not available. Except for
predict, all functions that use forking start with mc. And, regardless of whether OpenMP or
forking is employed, these functions accept the argument mc.cores which controls the num-
ber of threads to be used. The parallel package provides the function detectCores which
returns the number of threads that your hardware can support and, therefore, the BART
package can use.
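For instance, a short sketch of checking these capabilities and then requesting a multi-threaded
fit; x.train and y.train are assumed to exist, and the mc. functions rely on forking, so this
is intended for Unix platforms:

R> library(BART)
R> parallel::detectCores()    ## threads the hardware supports
R> mc.cores.openmp()          ## 1 if predict() can use OpenMP, 0 otherwise
R> post <- mc.wbart(x.train, y.train, mc.cores = 8, seed = 99)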

D.3. BART implementations on CRAN


Currently, there are four BART implementations on the Comprehensive R Archive Network
(CRAN); see the Appendix for a tabulated comparative summary of their features.
BayesTree was the first released in 2006 (Chipman and McCulloch 2016). Reported bugs
will be fixed, but no future improvements are planned; so, we suggest choosing one of the
newer packages such as BART. The basic interface and work-flow of BayesTree has strongly
influenced the other packages which followed. However, the BayesTree source code is difficult
to maintain and, therefore, improvements were limited leaving it with relatively fewer features
than the other entries.
The second entrant is bartMachine which is written in Java and was first released in 2013
(Kapelner and Bleich 2016). It provides advanced features like multi-threading, variable
selection (Bleich et al. 2014), a predict function, convergence diagnostics and missing data
handling. However, the R to Java interface can be challenging to deal with. R is written in
C and Fortran, consequentially, functions written in Java do not have a natural interface to
R. This interface is provided by the rJava (Urbanek 2020) package which requires the Java
Development Kit (JDK). Therefore, we highly recommend bartMachine for Java users.
The third entrant is dbarts (Dorie 2020) which is written in C++ and was first released
in 2014. It is a clone of the BayesTree interface, but it does not share the source code;
dbarts source has been re-written from scratch for efficiency and maintainability. dbarts is
a drop-in replacement for BayesTree. Although, it lacks multi-threading, the dbarts serial
implementation is the fastest, therefore, it is preferable when multi-threading is unavailable
such as on Windows.
The BART package which is written in C++ was first released in 2017 (McCulloch et al.
2021). It provides advanced features like multi-threading, variable selection (Linero 2018),
a predict function and convergence diagnostics. The source code is a descendant of MPI
BART. Although, R is mainly written in C and Fortran (at the time of this writing, 39.2%
and 26.8% lines of source code respectively), C++ is a natural choice for creating R functions
since they are both object-oriented languages. The C++ interface to R has been seamlessly
provided by the Rcpp package (Eddelbuettel and Francois 2011) which efficiently passes object
references from R to C++ (and vice versa) as well as providing direct access to the R random
number generator. The source code can also be called from C++ alone without an R instance
where the random number generation is provided by either the standalone Rmath library (R
Core Team 2017) or the C++ random Standard Template Library. Furthermore, it is the only
BART package to support categorical and time-to-event outcomes (Sparapani et al. 2016,
2020b,a). For one or more missing covariates, record-level hot-decking imputation (De Waal,
Pannekoek, and Scholtus 2011) is employed that is biased towards the null, i.e., non-missing
values from another record are randomly selected regardless of the outcome. This simple
missing data imputation method is sufficient for data sets with relatively few missing values;
for more advanced needs, we recommend the sbart package which utilizes the Sequential
BART algorithm (Daniels and Singh 2018; Xu, Daniels, and Winterstein 2016) (N.B. sbart
is also a descendant of MPI BART).

D.4. MCMC is embarrassingly parallel


In general, Bayesian Markov chain Monte Carlo (MCMC) posterior sampling is considered to
be embarrassingly parallel (Rossini, Tierney, and Li 2007), i.e., since the chains only share the
data and don’t have to communicate with each other, parallel implementations are considered
to be trivial. BART MCMC also falls into this class.
However, to clarify this point before proceeding, the embarrassingly parallel designation is
in the context of simple multi-threading on single systems. An adaptation of distributed
computing to large data sets exhaustively divides the data into mutually exclusive partitions,
called shards, such that each system only processes a single shard. With sharded distributed
computing, the embarrassingly parallel moniker does not apply. Recently, two advanced tech-
niques have been developed for BART computations with sharding: Monte Carlo consensus
(Pratola et al. 2014) and modified likelihood inflating sampling algorithm, or modified LISA,
(Entezari, Craiu, and Rosenthal 2018). From here on, simple multi-threading is assumed.
Typical practice for Bayesian MCMC is to start in some initial state, perform a limited
number of samples to generate a new random starting position and throw away the preceding
samples which we call burn-in. The amount of burn-in in the BART package is controlled
by the argument nskip: defaults to 100 with the exception of time-to-event outcomes which
default to 250. The total length of the chain returned is controlled by the argument ndpost
which defaults to 1000. The theoretical gain due to multi-threading can be calculated by
what we call the MCMC Corollary to Amdahl’s Law. Let b be the burn-in fraction and B be
the number of threads, then the gain limit is ((1 − b)/B + b)^{-1}. (As an aside, note that we
can derive Amdahl’s Law as follows where the amount of work done is in the numerator and
elapsed time is in the denominator: (1 − b + b) / ((1 − b)/B + b) = 1 / ((1 − b)/B + b).) For
example, see the diagram in Figure 20 where the burn-in fraction, b = 100/1100 = 0.09, and
the number of CPUs, B = 5, results in an elapsed time of only ((1 − b)/B + b) = 0.27, i.e., a
((1 − b)/B + b)^{-1} = 3.67 fold reduction which is the gain in efficiency. In Figure 21, we plot
theoretical gains on the y-axis
and the number of CPUs on the x-axis for two settings: b ∈ {0.025, 0.1}.
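A quick sketch of this calculation in R:

R> gain <- function(b, B) 1 / ((1 - b) / B + b)   ## MCMC corollary to Amdahl's law
R> gain(b = 100 / 1100, B = 5)                    ## approximately 3.67
R> gain(b = 0.025, B = 8)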

D.5. Multi-threading and random access memory


The IEEE standard 754-2008 (Institute of Electrical and Electronics Engineers 2008) specifies
that every double-precision number consumes 8 bytes (64 bits). Therefore, it is quite simple
to estimate the amount of random access memory (RAM) required to store a matrix. If A is
m × n, then the amount of RAM needed is 8 × m × n bytes. Large matrices held in RAM
can present a challenge to system performance. If you consume all of the physical RAM, the
system will “swap” segments out to virtual RAM which are disk files and this can degrade
performance and possibly even crash the system. On Unix, you can monitor memory and
swap usage with the top command-line utility. And, within R, you can determine the size of
an object with the object.size function.
Category                     BayesTree            bartMachine                 dbarts   BART
First release                2006                 2013                        2014     2017
Authors                      Chipman &            Kapelner & Bleich           Dorie    McCulloch, Sparapani,
                             McCulloch                                                 Gramacy, Spanbauer & Pratola
Source code                  C++                  Java                        C++      C++
R package dependencies       None                 rJava, car, randomForest,   None     Rcpp
excluding “Recommended”                           missForest
Tree transition proposals    4                    3                           4        2
Multi-threaded               No                   Yes                         No       Yes
predict function             No                   Yes                         No       Yes
Variable selection           No                   Yes                         No       Yes
Continuous outcomes          Yes                  Yes                         Yes      Yes
Binary outcomes probit       Yes                  Yes                         Yes      Yes
Binary outcomes logit        No                   No                          No       Yes
Categorical outcomes         No                   No                          No       Yes
Time-to-event outcomes       No                   No                          No       Yes
Convergence diagnostics      No                   Yes                         No       Yes
Thinning                     Yes                  No                          Yes      Yes
Missing data handling        No                   Yes                         No       Yes
Cross-validation             No                   Yes                         Yes      No
Partial dependence plots     Yes                  Yes                         Yes      No

Table 2: A comparison of BART packages available on CRAN (Chipman and McCulloch 2016;
Kapelner and Bleich 2016; McCulloch et al. 2021).
[Figure 20 appears here: the proportionate length of chain processing time (y-axis, 0.0 to 1.0)
plotted against the chains (x-axis, 0 to 5), with the burn-in fraction b indicated.]

Figure 20: The theoretical gain due to multi-threading can be calculated by Amdahl’s Law.
Let b be the burn-in fraction and B be the number of threads, then the theoretical gain
limit is ((1 − b)/B + b)^{-1}. In this diagram, the burn-in fraction, b = 100/1100 = 0.09, and
the number of CPUs, B = 5, results in an elapsed time of only ((1 − b)/B + b) = 0.27 or a
((1 − b)/B + b)^{-1} = 3.67 fold reduction which is the gain in efficiency.

Mathematically, a matrix is represented as follows.


 
        [ a11  a12  ...  a1n ]
        [ a21  a22  ...  a2n ]
    A = [  :    :         :  ]
        [ am1  am2  ...  amn ]
R is a column-major language, i.e., matrices are laid out in consecutive memory locations
by traversing the columns: [a11 , a21 , . . ., a12 , a22 , . . .]. R is written in C and Fortran where
Fortran is a column-major language as well. However, C and C++ are row-major lan-
guages, i.e., matrices are laid out in consecutive memory locations by traversing the rows:
[a11 , a12 , . . ., a21 , a22 , . . .]. So, if you have written an R function in C/C++, then you need to
be cognizant of the clash in paradigms (also note that R/Fortran array indexing goes from
1 to m while C/C++ indexing goes from 0 to m − 1). As you might surmise, this is easily
addressed with a transpose, i.e., instead of passing A from R to C/C++, pass its transpose, A⊤.
R is very efficient in passing objects; rather than passing an object (along with all of its
memory consumption) on the stack, it passes objects merely by a pointer referencing the
original memory location. However, R follows copy-on-write memory allocation, i.e., all ob-
jects present in the parent thread can be read by a child thread without a copy, but when an
object is altered/written by the child, then a new copy is created in memory. Therefore, if we
[Figure 21 appears here: the theoretical gain (y-axis, 0 to 30) plotted against B, the number
of CPUs (x-axis, 1 to 50), with one curve for b = 0.025 and one for b = 0.1.]

Figure 21: The theoretical gain due to multi-threading can be calculated by Amdahl’s Law.
Let b be the burn-in fraction and B be the number of threads, then the theoretical gain limit
is ((1 − b)/B + b)^{-1}. In this figure, the theoretical gains are on the y-axis and the number of
CPUs, the x-axis, for two settings: b ∈ {0.025, 0.1}.

pass A from R to C/C++, and then transpose, we will create multiple copies of A consuming
8 × m × n × B bytes, where B is the number of children. If A is a large matrix, then you may
stress the system’s limits. The simple solution is for the parent to create the transpose before
passing A and avoiding the multiple copies, i.e., A <- t(A). And this is the philosophy that
the BART package follows for the multi-threaded BART functions; see the documentation
for the transposed argument.
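The following sketch illustrates both points for a hypothetical large matrix x:

R> x <- matrix(rnorm(1e5 * 20), nrow = 1e5, ncol = 20)
R> object.size(x)    ## roughly 8 * 1e5 * 20 bytes
R> x <- t(x)         ## transpose once in the parent, before forking, so that
R>                   ## the children can read x without copy-on-write duplicates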

D.6. Multi-threading: interactive and batch processing


Interactive jobs must take precedence over batch jobs to prevent the user experience from
suffering high latency. For example, have you ever experienced a system slowdown where,
while you are typing, the display of your keystrokes cannot keep up? This should never happen
and is a sign of something amiss. With large multi-threaded jobs, it is surprisingly easy
to naively degrade system performance. But, this can easily be avoided by operating system
support provided by R. In the tools package (R Core Team 2020), there is the psnice function.
Paraphrased from the ?psnice help page.

Unix has a concept of process priority. Priority is assigned values from 0 to 39


with 20 being the normal priority and (counter-intuitively) larger numeric values
denoting lower priority. Adding to the complexity, there is a “nice” value, the
amount by which the priority exceeds 20. Processes with higher nice values will
receive less CPU time than those with normal priority. Generally, processes with
nice 19 are only run when the system would otherwise be idle.

Therefore, by default, the BART package children have their nice value set to 19.
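For instance, a batch R job could lower its own priority with a call along these lines:

R> ## politely lower the priority of the current R process (Unix)
R> tools::psnice(Sys.getpid(), value = 19)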

D.7. Creating a BART executable


Occasionally, you may need to create a BART executable that you can run without an R
instance. This is especially useful if you need to include BART in another C++ program,
or when you need to debug the BART package C++ source code, which is more difficult
to do when calling the functions from R. Several examples of these are provided
with the BART package. With R, you can find the Makefile and the weighted BART
continuous outcome example with system.file("cxx-ex/Makefile", package = "BART")
and system.file("cxx-ex/wmain.cpp", package = "BART") respectively. Note that these
examples require the installation of the standalone Rmath library (R Core Team 2017) which
is contained in the R source code distribution. Rmath provides common R functions and
random number generation, e.g., pnorm and rnorm. You will likely need to copy the cxx-ex
directory to your workspace. Once done, you can build and run the weighted BART executable
example from the command line shell as follows.
sh% make wmain.out ## to build
sh% ./wmain.out ## to run
By default, these examples are based on the Rmath random number generator. However, you
can specify the C++ Standard Template Library random number generator (contained in the
STL random header file) by uncommenting the following line in the Makefile (by removing
the pound, #, symbols):
## CPPFLAGS = -I. -I/usr/local/include -DMATHLIB_STANDALONE -DRNG_random
(which still requires Rmath for other purposes). These examples were developed on Linux
and macOS, but they should be readily adaptable to UNIX and Windows as well.

Affiliation:
Rodney Sparapani
Division of Biostatistics
Institute for Health and Equity
Medical College of Wisconsin, Milwaukee campus
8701 Watertown Plank Road
Milwaukee, WI 53226, United States of America
E-mail: [email protected]

Journal of Statistical Software http://www.jstatsoft.org/


published by the Foundation for Open Access Statistics http://www.foastat.org/
January 2021, Volume 97, Issue 1 Submitted: 2019-01-23
doi:10.18637/jss.v097.i01 Accepted: 2019-10-17
