Bayesian Data Analysis Using R


Jouni Kerman and Andrew Gelman

Introduction

Bayesian data analysis includes but is not limited to Bayesian inference (Gelman et al., 2003; Kerman, 2006a). Here, we take Bayesian inference to refer to posterior inference (typically, the simulation of random draws from the posterior distribution) given a fixed model and data. Bayesian data analysis takes Bayesian inference as a starting point but also includes fitting a model to different datasets, altering a model, performing inferential and predictive summaries (including prior or posterior predictive checks), and validation of the software used to fit the model.

The most general programs currently available for Bayesian inference are WinBUGS (BUGS Project, 2004) and OpenBUGS, which can be accessed from R using the packages R2WinBUGS (Sturtz et al., 2005) and BRugs. In addition, various R packages exist that directly fit particular Bayesian models (e.g., MCMCpack; Martin and Quinn, 2005). In this note, we describe our own entry in the "inference engine" sweepstakes but, perhaps more importantly, describe the ongoing development of some R packages that perform other aspects of Bayesian data analysis.

Umacs

Umacs (Universal Markov chain sampler) is an R package (to be released in Spring 2006) that facilitates the construction of Gibbs samplers and Metropolis algorithms for Bayesian inference (Kerman, 2006b). The user supplies data, parameter names, updating functions (which can be some mix of Gibbs samplers and Metropolis jumps, with the latter determined by specifying a log-posterior density function), and a procedure for generating starting points. Using these inputs, Umacs writes a customized R sampler function that automatically updates, keeps track of Metropolis acceptances (and uses acceptance probabilities to tune the jumping kernels, following Gelman et al., 1995), monitors convergence (following Gelman and Rubin, 1992), summarizes results graphically, and returns the inferences as random variable objects (see rv, below).

Umacs is modular and can be expanded to include more efficient Gibbs/Metropolis steps. Current features include adaptive Metropolis jumps for vectors and matrices of random variables (which arise, for example, in hierarchical regression models, with a different vector of regression parameters for each group).

Figure 1 illustrates how a simple Bayesian hierarchical model (Gelman et al., 2003, page 451) can be fit using Umacs: y_j ∼ N(θ_j, σ_j²), j = 1, ..., J (J = 8), where the σ_j are fixed and the means θ_j are given the prior t_ν(µ, τ). In our implementation of the Gibbs sampler, θ_j is drawn from a Gaussian distribution with a random variance component V_j. The conditional distributions of θ, µ, V, and τ can be calculated analytically, so we update each of them by a direct (Gibbs) update. The updating functions are specified as R functions (here, theta.update, V.update, mu.update, etc.). The degrees-of-freedom parameter ν is also unknown and is updated using a Metropolis algorithm. To implement this, we only need to supply a function calculating the logarithm of the posterior density; Umacs supplies the rest of the code. There are several Metropolis classes for efficiency; SMetropolis implements the Metropolis update for a scalar parameter. These "updater-generating functions" (Gibbs and SMetropolis) also require an argument specifying a function that returns an initial starting point for the unknown parameter (here, theta.init, mu.init, tau.init, etc.).

s <- Sampler(
  J = 8,
  sigma.y = c(15, 10, 16, 11, 9, 11, 10, 18),
  y = c(28, 8, -3, 7, -1, 1, 18, 12),
  theta = Gibbs(theta.update, theta.init),
  V = Gibbs(V.update, V.init),
  mu = Gibbs(mu.update, mu.init),
  tau = Gibbs(tau.update, tau.init),
  nu = SMetropolis(log.post.nu, nu.init),
  Trace("theta[1]")
)

Figure 1: Invoking the Umacs Sampler function to generate an R Markov chain sampler function s(...). Updating algorithms are associated with the unknown parameters (θ, V, µ, τ, ν). Optionally, the non-modeled constants and data (here J, σ, y) can be localized to the sampler function by defining them as parameters; the function s then encapsulates a complete sampling environment that can even be moved to and run on another computer without worrying about the availability of the data variables. The "virtual updating function" Trace displays a real-time trace plot for the specified scalar variable.

The program is customizable and modular, so users can define custom updating classes and more refined Metropolis implementations.

The function produced by Sampler runs a given number of iterations and a given number of chains; if we are not satisfied with the convergence, we may resume iteration without having to restart the chains. It is also possible to add chains. The length of the burn-in period that is discarded is user-definable, and we may also specify the desired number of simulations to collect, with thinning performed automatically as the sampler runs.
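The updating functions named in Figure 1 (theta.update, V.update, and so on) are ordinary R functions written by the user. As a rough illustration, here is a minimal sketch of the conjugate draw for θ in the model above, assuming the current values of y, sigma.y, mu, and V are visible in the sampler's environment; the exact argument list and calling convention that Umacs expects are not documented in this article, so the zero-argument signature below is only an assumption.

# Minimal sketch (not the actual Umacs interface): conjugate Gibbs draw for theta
# in the model y_j ~ N(theta_j, sigma_j^2), theta_j ~ N(mu, V_j).
theta.update <- function() {
  V.theta <- 1 / (1 / V + 1 / sigma.y^2)         # conditional variances
  m.theta <- V.theta * (mu / V + y / sigma.y^2)  # precision-weighted means
  rnorm(length(y), mean = m.theta, sd = sqrt(V.theta))
}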


Once the pre-specified number of iterations are done, the sampler function returns the simulations wrapped in an object which can be coerced into a plain matrix of simulations or into a list of random variable objects (see rv below), which can then be attached to the search path.

rv

rv is an R package that defines a new simulation-based random variable class in R, along with various mathematical and statistical manipulations (Kerman and Gelman, 2005). The program creates an object class whose instances can be manipulated like numeric vectors and arrays. However, each element in a vector contains a hidden dimension of simulations: the rv objects can thus be thought of as approximations of random variables. That is, a random scalar is stored internally as a vector, a random vector as a matrix, a random matrix as a three-dimensional array, and so forth. The random variable objects are useful when manipulating and summarizing simulations from a Markov chain simulation (for example, those generated by Umacs, see above). They can also be used in simulation studies (Kerman, 2005). The number of simulations stored in a random variable object is user-definable.

The rv objects are a natural extension of numeric objects in R, which are conceptually just "random variables with zero variance"—that is, constants. Arithmetic operations such as + and ^ and elementary functions such as exp and log work with rv objects, producing new rv objects.

These random variable objects work seamlessly with regular numeric vectors: for example, we can impute a random variable z into a regular numeric vector y with a statement like y[is.na(y)] <- z. This converts y automatically into a random vector (rv object) which can be manipulated much like any numeric object; for example, we can write mean(y) to find the distribution of the arithmetic mean of the (random) vector y, or sd(y) to find the distribution of the sample standard deviation statistic.
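To make the imputation example concrete, here is a short self-contained sketch; the rvnorm call and the behaviour of mean and sd follow the descriptions in this section, while the data values are made up purely for illustration.

library(rv)                             # load the rv package
y <- c(12.1, 15.3, NA, 9.8, NA)         # ordinary numeric vector with missing values
z <- rvnorm(n = 2, mean = 10, sd = 3)   # random vector of length 2
y[is.na(y)] <- z                        # y is now an rv object
mean(y)                                 # distribution of the sample mean of y
sd(y)                                   # distribution of the sample standard deviation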
The default print method of a random variable object outputs a summary of the distribution represented by the simulations for each component of the argument vector or array. Figure 2 shows an example of a summary of a random vector z with five random components.

> z
     name mean   sd      Min   2.5%  25%  50%  75% 97.5% Max
[1] Alice 59.0 27.3 ( -28.66   1.66 42.9 59.1 75.6   114 163 )
[2]   Bob 57.0 29.2 ( -74.14  -1.98 38.3 58.2 75.9   110 202 )
[3] Cecil 62.6 24.1 ( -27.10  13.25 48.0 63.4 76.3   112 190 )
[4]  Dave 71.7 18.7 (   2.88  34.32 60.6 71.1 82.9   108 182 )
[5] Ellen 75.0 17.5 (   4.12  38.42 64.1 75.3 86.2   108 162 )

Figure 2: The print method of an rv (random variable) object returns a summary of the mean, standard deviation, and quantiles of the simulations embedded in the vector.

Standard functions to plot graphical summaries of random variable objects are being developed. Figure 3 shows the result of a statement plot(x,y) where x are constants and y is a random vector with 10 constant components (shown as dots) and five random components (shown as intervals).

[Figure 3 (plot titled "Intervals for predicted examination scores"; axes: midterm vs. final) appears here.]

Figure 3: A scatterplot of fifteen points (x,y) where five of the components of y are random, that is, represented by simulations and thus drawn as intervals. Black vertical intervals represent the 50% uncertainty intervals and the gray ones the 95% intervals. (The light grey line is a regression line computed from the ten fixed points.)

Many methods on rv objects have been written; for example, E(y) returns the individual means (expectations) of the components of a random vector y. A statement Pr(z[1] > z[2]) would give an estimate of the probability of the event {z_1 > z_2}.

Random-variable generating functions generate new rv objects by sampling from standard distributions; for example, rvnorm(n=10, mean=0, sd=1) would return a random vector representing 10 draws from the standard normal distribution. What makes these functions interesting is that we can give them parameters that are also random, that is, represented by simulations. If y is modeled as N(µ, σ²) and the random variable objects mu and sigma represent draws from the joint posterior distribution of (µ, σ)—which we can obtain, for example, by fitting the model with Umacs (see above) or BUGS—then a simple statement like rvnorm(mean=mu, sd=sigma) would generate a random variable representing draws from the posterior predictive distribution of y. A single line of code thus performs Monte Carlo integration of the joint density of (y^rep, µ, σ) and draws from the resulting distribution p(y^rep | y) = ∫∫ N(y^rep | µ, σ) p(µ, σ | y) dµ dσ. (We distinguish the observations y and the unobserved random variable y^rep, which has the same conditional distribution as y.)
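Continuing the hierarchical model of Figure 1, a posterior predictive replication can be sketched along the same lines; here theta is assumed to be an rv object holding posterior simulations of θ (for example, as produced by the Umacs sampler above), and the object names are ours, chosen only for illustration.

# Hedged sketch: posterior predictive replications for two schools, assuming
# theta is an rv vector of posterior simulations of theta_1, ..., theta_8.
sigma.y <- c(15, 10, 16, 11, 9, 11, 10, 18)          # known sds from Figure 1
y.rep1 <- rvnorm(mean = theta[1], sd = sigma.y[1])   # replication for school 1
y.rep2 <- rvnorm(mean = theta[2], sd = sigma.y[2])   # replication for school 2
E(y.rep1)            # posterior predictive mean for school 1
Pr(y.rep1 > y.rep2)  # probability that school 1's replication exceeds school 2's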


R&B

The culmination of this research project is an R environment for Bayesian data analysis which would allow inference, model expansion and comparison, model checking, and software validation to be performed easily, using a high-level Bayesian graphical modeling language "B" adapted to R, with functions that operate on R objects that include graphical models, parameters (nodes), and random variables. B exists now only at a conceptual level (Kerman, 2006a), and we plan for its first incarnation in R (called R & B) to be a simple version that demonstrates its possibilities. B is not designed to be tied to any particular inference engine but rather to be a general interface for doing Bayesian data analysis. Figure 4 illustrates a hypothetical interactive session using R & B.

## Model 1: A trivial model:
NewModel(1, "J", "theta", "mu", "sigma", "y")
Model(y) <- Normal(J, theta, sigma)
Observation(y) <- c(28,8,-3,7,-1,1,18,12)
Hypothesis(sigma) <- c(15,10,16,11,9,11,10,18)
Observation(J) <- 8
Fit(1)
# Look at the inferences:
print(theta)
## Model 2: A hierarchical t model
NewModel(2, based.on.model=1, "V", "mu", "tau")
Model(theta) <- Normal(J, mu, V)
Model(V) <- InvChisq(nu, tau)
Fit(2)
# Look at the new inferences:
plot(theta)
# Draw from posterior predictive distribution:
y.rep1 <- Replicate(y, model=1)
y.rep2 <- Replicate(y, model=2)
## Use the same models but
## a new set of observations and hypotheses:
NewSituation()
Hypothesis(sigma) <- NULL # Sigma is now unknown.
Fit()
...

Figure 4: A hypothetical interactive session using the high-level Bayesian language "B" in R (in development). Several models can be kept in memory. Independently of models, several "inferential situations" featuring new sets of observations and hypotheses (hypothesized values for parameters with assumed point-mass distributions) can be defined. Fitting a model launches an inference engine (usually a sampler such as Umacs or BUGS) and stores the inferences as random variable objects. By default, parameters are given noninformative prior distributions.

Conclusion

R is a powerful language for statistical modeling and graphics; however, it is currently limited when it comes to Bayesian data analysis. Some packages are available for fitting models, but it remains awkward to work with the resulting inferences, alter or compare the models, check fit to data, or validate the software used for fitting. This article describes several of our research efforts, which we have made, or plan to make, into R packages. We hope these packages will be useful in their own right and will also motivate future work by others integrating Bayesian modeling and graphical data analysis, so that Bayesian inference can be performed in the iterative data-analytic spirit of R.

Acknowledgements

We thank Tian Zheng, Shouhao Zhao, Yuejing Ding, and Donald Rubin for help with the various programs, and the National Science Foundation for financial support.

Bibliography

BUGS Project. BUGS: Bayesian Inference Using Gibbs Sampling. http://www.mrc-bsu.cam.ac.uk/bugs/, 2004.

A. Gelman and D. Rubin. Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7:457–511, 1992.

A. Gelman, G. Roberts, and W. Gilks. Efficient Metropolis jumping rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 5. Oxford University Press, 1995.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, London, 2nd edition, 2003.

J. Kerman. Using random variable objects to compute probability simulations. Technical report, Department of Statistics, Columbia University, 2005.

J. Kerman. An integrated framework for Bayesian graphic modeling, inference, and prediction. Technical report, Department of Statistics, Columbia University, 2006a.

J. Kerman. Umacs: A Universal Markov Chain Sampler. Technical report, Department of Statistics, Columbia University, 2006b.

J. Kerman and A. Gelman. Manipulating and summarizing posterior simulations using random variable objects. Technical report, Department of Statistics, Columbia University, 2005.

A. D. Martin and K. M. Quinn. MCMCpack 0.6-6. http://mcmcpack.wustl.edu/, 2005.

S. Sturtz, U. Ligges, and A. Gelman. R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software, 12(3):1–16, 2005. ISSN 1548-7660.
