Statistical inference and Monte Carlo algorithms
SUMMARY
This review article looks at a small part of the picture of the interrelationship
between statistical theory and computational algorithms, especially the Gibbs
sampler and the Accept-Reject algorithm. We pay particular attention to how
the methodologies affect and complement each other.
1. INTRODUCTION
Computations and statistics have always been intertwined. In particular,
applied statistics has relied on computing to implement its solutions to
real data problems. Here we look at another part of the relationship
between statistics and computation, and examine how the theories are
not only intertwined but have influenced each other.
With the explosion of Monte Carlo methods, particularly those using
Markov chain algorithms such as the Gibbs sampler, there has been a
blurring of the distinction between the statistical model and the
algorithmic model. This is particularly evident in the examples
2. SYNTHESIS
Given the audience of this presentation, a digression may be in order into
the Bayes/frequentist approaches to statistics. The topic of algorithms,
particularly Monte Carlo algorithms, is a prime example of an area that
is best handled statistically by a mixture of the Bayesian and frequentist
approaches. Moreover, it seems that to completely analyze, understand,
and optimize the relationship between a statistical model, its associated
inference, and the algorithm used for computations, both Bayesian and
frequentist ideas must be used.
The Bayesian approach provides us with a means of constructing an
estimator that, when evaluated according to its global risk performance,
could result in an optimal frequentist estimator. This highlights impor-
tant features of both the Bayesian and frequentist approaches. Although
the Bayesian paradigm is well-suited for the construction of possibly
optimal estimators, it is less well-suited for their global evaluation. The
frequentist paradigm is quite complementary, as it is well-suited for
global evaluations, but is less well-suited for construction.
We look at two examples, taken from Lehmann and Casella (1997).
y = Xβ + Zu + ε,  (2)

f(σ_i² | σ²_{−i}, σ_e², y, u, β) = IG( a_i + q_i/2, (u_i′u_i/2)^{−1} ),  (5)

f(σ_e² | σ², y, u, β) = IG( b + n/2, { (y − (Xβ + Zu))′(y − (Xβ + Zu))/2 }^{−1} ),

f(u | σ², σ_e², y, β) = N_q( (Z′Z + σ_e² D^{−1})^{−1} Z′(y − Xβ), σ_e² (Z′Z + σ_e² D^{−1})^{−1} ).
(i) a_i < 0
(ii) q_i > q − t − 2a_i
256 George Casella
(iii) n + 2 Σ_i a_i + 2b − p > 0.
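As transcribed, conditions (i)-(iii) are simple inequalities on the hyperparameters, so they can be scripted as a quick sanity check. The sketch below is illustrative only: the meaning assigned to each symbol (a_i, q_i, q, t, n, b, p) is assumed from the surrounding text, not taken verbatim from the paper.

```python
def propriety_check(a, q_i, q, t, n, b, p):
    """Sketch: evaluate the transcribed conditions (i)-(iii).
    Argument meanings (shape parameters a = [a_1, ..., a_r], dimensions
    q_i = [q_1, ..., q_r], and scalars q, t, n, b, p) are assumptions
    based on context, not the paper's exact statement."""
    cond_i = all(ai < 0 for ai in a)                  # (i)   a_i < 0
    cond_ii = all(qi > q - t - 2 * ai                 # (ii)  q_i > q - t - 2a_i
                  for ai, qi in zip(a, q_i))
    cond_iii = n + 2 * sum(a) + 2 * b - p > 0         # (iii)
    return cond_i and cond_ii and cond_iii

# hypothetical hyperparameter values, purely for illustration
print(propriety_check(a=[-0.5], q_i=[4], q=4, t=2, n=20, b=0.5, p=2))  # True
```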
When there are more than two variables, the definitions of compatibility
and functional compatibility become more involved, but the idea is the same.
Example 4. The one-way random effects model (1) with a typical set of
priors is

π(σ²) ∝ (σ²)^{−(a+1)}.
one, except for the overall mean, β, which was set to eight. The chain
was first allowed to run for 15,000 iterations; keep in mind that the word
"burn-in" is not appropriate for these initial iterations because the chain
is null and is therefore not converging (in the usual sense). The sole
purpose of these initial iterations was to provide the chain with ample
opportunity to misbehave and alert us that something may be wrong; it
never did. We chose 15,000 because a typical burn-in would probably
be in the hundreds (see Gelfand et al. 1990 and Wang et al. 1993), so that
if our chain did not misbehave during the burn-in stage, neither would
that of an unknowing experimenter.
After the initial 15,000 iterations, the output from the 15,001st
through the 16,000th iteration was collected. Figure 1 is a histogram of the
1,000 effect variances from the null Gibbs chain, that is, σ²(j+15,000),
j = 1, 2, …, 1000, with a Monte Carlo approximation of the supposed
marginal posterior density superimposed. Figure 2 is the analog of Figure 1
for the error variance component. The density approximations in
Figures 1 and 2 were calculated using the usual "average of conditional
densities" approximation. All of these plots appear perfectly reasonable
even though the posterior distribution is improper and the Monte Carlo
density approximations have almost sure pointwise limits of zero or no
limit at all. Clearly, if one were unaware of the impropriety, plots like
these could lead to seriously misleading conclusions.
This particular posterior is improper due to an infinite amount of
mass near σ² = 0. One might suspect that if the starting value of σ²
were near zero, the σ² component of the Gibbs chain would be absorbed
at 0. This is not the case, however. In fact, the σ² component and
the random effects components move towards zero, but eventually they
all return to a reasonable part of the space. For example, we started the
chain with σ² = 10^{−50}, and after 20,000 iterations the σ² component was
approximately 10^{−122} and the largest magnitude of any of the random
effects components was about 10^{−60}. The chain was allowed to run
for a total of one million iterations, after which all of the components
were back in a reasonable part of the parameter space. This Gibbs chain
behaves somewhat like one constructed with the exponential conditionals
of Example 3, in that it leaves the "center" of the space for long periods
of time but eventually returns. Such behavior is consistent with null
recurrence.
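Excursions of this kind are easy to mimic with an exponential-conditionals chain. The sketch below assumes conditionals X | Y ~ Exp(rate Y) and Y | X ~ Exp(rate X) (whose "joint" is improper); one full Gibbs cycle gives X′ = (E₁/E₂)X for independent standard exponentials E₁, E₂, so log X performs a mean-zero random walk that wanders far from the center but keeps returning, consistent with the null-recurrence behavior described above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

# One Gibbs cycle: Y = E2 / X (i.e., Y | X ~ Exp(rate X)),
# then X' = E1 / Y, hence log X' = log X + log(E1 / E2).
e1 = rng.exponential(size=n)
e2 = rng.exponential(size=n)
log_x = np.cumsum(np.log(e1) - np.log(e2))  # log of the X-component path

# the path makes enormous excursions in both directions
print(log_x.min(), log_x.max())
```

Working in log space avoids the numerical overflow that simulating X directly would cause once the excursions become large.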
Figure 1. Histogram of the 1,000 values of the effect variance from the
null Gibbs chain, that is, a histogram of σ²(j+15,000) for j = 1, 2, …, 1000.
Superimposed is the approximate (supposed) marginal posterior density of
σ². An appropriately scaled version of π̂_{σ²|y}(t|y) is on the ordinate,
with t on the abscissa. (Actually, 15 of the 1,000 values of the effect
variance, ranging from 21.0 to 45.1, were not included in the histogram.)
Figure 2. The analog of Figure 1 for the error variance component.
process. In this light, we can ask how to best process the data, and answer
that question by applying statistical principles. In what follows, we apply
one of the simplest principles, that of Rao-Blackwellization, to the output
of an Accept-Reject Algorithm. For more details, including applications
to the Metropolis-Hastings Algorithm, see Casella and Robert (1995,
1996a, 1996b, 1996c).
where we define w_i = f(y_i)/Mg(y_i).
δ_RB = (1/t) E[ Σ_{i=1}^n I(U_i ≤ w_i) h(Y_i) | Y_1, …, Y_n ]
     = (1/t) Σ_{i=1}^n ρ_i h(Y_i),  (12)
we see that the improvement that δ_RB brings over δ_AR is related to
the size of E[var{δ(U, Y) | Y}]. This latter quantity can be interpreted
as measuring the average variance in the estimator that is due to the
auxiliary randomization, that is, the variance due to the uniform
random variables. In some cases this quantity can be substantial.
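To illustrate the variance carried by the uniforms, the sketch below compares the plain Accept-Reject average with a weighted estimator in the spirit of δ_RB, for a Beta(2, 2) target under a uniform envelope (my own example). It is a simplification: the weights used are E[I(U_i ≤ w_i) | Y_i] = w_i, self-normalized, which ignores the conditioning on the stopping rule that the exact ρ_i require, so the numbers are indicative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                      # target density: Beta(2, 2)
    return 6.0 * x * (1.0 - x)

M = 1.5                        # envelope: f(x) <= M * g(x), g = Uniform(0, 1)
n, reps = 100, 2000
h = lambda x: x                # estimate E[X] = 0.5

mse_ar = mse_rb = 0.0
for _ in range(reps):
    y = rng.uniform(size=n)
    u = rng.uniform(size=n)
    w = f(y) / M               # w_i = f(y_i) / (M g(y_i)), with g = 1
    accept = u <= w
    d_ar = h(y[accept]).mean()             # plain Accept-Reject average
    d_rb = np.sum(w * h(y)) / np.sum(w)    # weighted, uses all n draws
    mse_ar += (d_ar - 0.5) ** 2
    mse_rb += (d_rb - 0.5) ** 2

print(mse_ar / reps, mse_rb / reps)        # weighted version has smaller MSE
```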
Acceptance rate .9

Sample    AR Estimate    RB Estimate    AR        Percent Decrease
Size      δ_AR           δ_RB           MSE       in MSE
10        .5002          .5007          .0100     17.02
25        .5001          .4999          .0041     18.64
50        .4996          .4997          .0020     20.81
100       .4996          .4997          .0010     21.45

Acceptance rate .3

Sample    AR Estimate    RB Estimate    AR        Percent Decrease
Size      δ_AR           δ_RB           MSE       in MSE
10        .5005          .5004          .0012     52.85
25        .4997          .5000          .0005     58.62
50        .4998          .5001          .0002     60.49
100       .4995          .5001          .0001     61.60
m(y) = ((t − 1)/(n − 1)) f(y) + ((n − t)/(n − 1)) (g(y) − f(y)/M)/(1 − 1/M),  (15)

δ_TRB = (1/t) ( h(y_n) + Σ_{i=1}^{n−1} E[ I(U_i ≤ w_i) | Y_i ] h(Y_i) )
      = (1/t) ( h(y_n) + Σ_{i=1}^{n−1} b(y_i) h(y_i) ),  (16)

where

b(y_i) = ((t − 1)/(n − 1)) f(y_i)/m(y_i),  i = 1, …, n − 1.  (17)
E[(δ_TRB − τ)²] ≤ E[(δ_AR − τ)²],
where τ = E[h(X)].
Moreover, the size of the improvement brought about by the rescaled
estimator is truly impressive.
δ_IS = (1/n) Σ_{i=1}^n (f(Y_i)/g(Y_i)) h(Y_i),  (19)
(19) is not correct for constants, and will suffer from the same problems
as δ_TRB. We thus want to rescale δ_IS, which results in the rescaled
importance sampling estimator

δ_ISR = Σ_{i=1}^n (f(Y_i)/g(Y_i)) h(Y_i) / Σ_{i=1}^n (f(Y_i)/g(Y_i)).
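The failure of (19) for constant h, and the effect of rescaling, can be seen in a small numerical sketch (my own example, not from the paper): target f = N(0, 1), proposal g = N(0, 4), and h ≡ 1 so that E_f[h] = 1 exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
y = rng.normal(0.0, 2.0, size=n)           # draws from g = N(0, 4)

# importance weights w = f(y) / g(y) for f = N(0, 1), g = N(0, 4)
w = (np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)) / \
    (np.exp(-y**2 / 8) / np.sqrt(8 * np.pi))
h = np.ones_like(y)                        # constant function, E_f[h] = 1

d_is = np.mean(w * h)                      # plain (19): varies around 1
d_isr = np.sum(w * h) / np.sum(w)          # rescaled: exactly 1 for constants
print(d_is, d_isr)
```

The rescaled version returns the exact answer for any constant h and any sample, which is precisely the "correct for constants" property the plain estimator lacks.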
5. OTHER CONSIDERATIONS
In this section we review some recent work that further explores the
structure of Monte Carlo algorithms, particularly the Gibbs sampler.
The goals of these investigations are to understand how to better, or
even optimally, process the output of the algorithm, and also to use
the structure of the algorithm to help construct optimal procedures. It is
interesting to note that both frequentist and Bayesian inferences benefit in
the following examples. Unfortunately, these illustrations are somewhat
less detailed, as some of the work is still in progress.
Σ_{i=1}^m ∫ π(θ | y, A_i) dθ = ∞.
Example 7. Let

X | Y ~ N(ρY, 1 − ρ²)
Y | X ~ N(ρX, 1 − ρ²).

Then

var(δ_0) / var(δ_1) = 1/ρ² > 1.
So, if δ_1 is less than 1/ρ² times more complex than δ_0, then δ_1 should
be used. Since E(X | Y) = ρY, it takes n + 2 floating point operations
(flops) to compute δ_1 = (1/n) Σ_{k=1}^n E(X | Y_k), as compared to n + 1
flops to compute δ_0 = (1/n) Σ_{k=1}^n X_k. Therefore, the cost of compu-
tation, in terms of flops, is essentially the same, but there can be a vast
gain in precision by using δ_1.
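A quick simulation confirms the precision gain. This is a sketch of the Example 7 comparison with ρ = 0.5 and chain length n = 500 assumed; both estimators are computed from the same Gibbs runs, and the Rao-Blackwellized δ_1 has variance close to ρ² times that of δ_0.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n, reps = 0.5, 500, 400
sd = np.sqrt(1 - rho**2)

d0, d1 = [], []
for _ in range(reps):
    x = y = 0.0
    xs = np.empty(n)
    ys = np.empty(n)
    for k in range(n):
        x = rng.normal(rho * y, sd)   # X | Y ~ N(rho*Y, 1 - rho^2)
        y = rng.normal(rho * x, sd)   # Y | X ~ N(rho*X, 1 - rho^2)
        xs[k], ys[k] = x, y
    d0.append(xs.mean())              # delta_0: average of the X draws
    d1.append(rho * ys.mean())        # delta_1: average of E(X | Y_k) = rho*Y_k

print(np.var(d0), np.var(d1))         # var(d1) is roughly rho^2 * var(d0)
```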
practice. For example, the convergence rate is not only typically diffi-
cult to compute and possibly mathematically intractable, but may also
ignore important features of the target distribution necessary for
determining the optimal random scan, as we will see below.
Levine (1996) considers an alternative measure derived from statistical
decision theoretic considerations, which seems to provide an attractive
criterion for choosing an appropriate random scan. Assume a random
d × 1 vector X is generated by a random scan Gibbs sampler which
generates a Markov chain {X(i)}_{i=1}^∞ with stationary distribution π.
Suppose interest lies in estimating μ = E_π(h(X)), where var(h(X)) < ∞.
If we estimate μ with the sample mean μ̂ = (1/n) Σ_{i=1}^n h(X(i)), the
optimal mean squared error scan is the one that minimizes the risk
R^(n)(h) = E[(μ̂ − μ)²] = (1/n) [ var(h(X)) + 2 Σ_{i=1}^{n−1} (1 − i/n) cov( h(X(0)), h(X(i)) ) ],  (23)
where the supremum is over all functions with finite variance. Thus
we see that, when compared to (24), the convergence rate contains less
information about the variance and covariances of the chain. It is in this
sense that we feel that (24) is a better optimality measure.
6. DISCUSSION
× Σ_{(i_1, …, i_{t−1})} Π_{j=1}^{t−1} (w_{i_j} ∧ u_{i_j}) Π_{j=t}^{n−1} (u_{i_j} − w_{i_j})^+ dt_1 ⋯ dt_{n−1},  (25)

where w_i = f(y_i)/Mg(y_i) and the sum is over all subsets of {1, …, n − 1}
of size t − 1.
We next want to get the joint distribution of (Y_i, U_i) | N = n, for any
i = 1, …, n − 1. Since this distribution is the same for each of these
values of i, we can just derive it for (Y_1, U_1). Recall that Y_n ~ f.
If we set y_1 = y, u_1 = u, y_2 = y_3 = ⋯ = y_n = ∞ and u_2 =
u_3 = ⋯ = u_n = 1, we can derive the joint distribution of (N, Y_1, U_1).
Assume, without loss of generality, that lim_{y→∞} f(y)/g(y) = 1. (If this
is not the case, we just have to adjust the constant M in what follows.)
Then, aside from the pair (w_1, u_1), we have (w_{i_j} ∧ u_{i_j}) = 1/M and
(u_{i_j} − w_{i_j})^+ = (1 − 1/M), hence
Σ_{(i_1, …, i_{t−1})} Π_{j=1}^{t−1} (w_{i_j} ∧ u_{i_j}) Π_{j=t}^{n−1} (u_{i_j} − w_{i_j})^+
   = (w_1 ∧ u_1) C(n−2, t−2) (1/M)^{t−2} (1 − 1/M)^{n−t}.  (26)
ACKNOWLEDGEMENT
I would like to thank José-Miguel Bernardo for suggestions and encouraging
this project, and the University of Granada, the Spanish Statistical
Society, and particularly Elías Moreno Bas for their hospitality. Lastly,
thanks go to Jim Hobert and Christian Robert, who did most of the hard
work. This research was supported by NSF Grant No. DMS-9625440.
REFERENCES
Amit, Y. (1996). Convergence properties of the Gibbs sampler for perturbations of
Gaussians. Ann. Statist. (to appear).
Arnold, B. C., and Press, S. J. (1989). Compatible conditional distributions. J. Amer.
Statist. Assoc. 84, 152-156.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems.
J. Roy. Statist. Soc. B 36, 192-236 (with discussion).
Besag, J., Green, P., Higdon, D. and Mengersen, K. (1995). Bayesian computation and
stochastic systems. Statist. Sci. 10, 1-66 (with discussion).
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, Second Edi-
tion. New York: Springer-Verlag.
Casella, G. and Berger, R. L. (1994). Estimation with selected binomial information or,
do you really believe that Dave Winfield is batting .471 ? J. Amer. Statist. Assoc. 89,
1080-1090.
Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. Amer. Statist. 46,
167-174.
Casella, G. and Robert, C. P. (1995). Une implémentation du théorème de Rao-Blackwell
en simulation avec rejet. C. R. Acad. Sci. Paris 322, 571-576.
Casella, G. and Robert, C. P. (1996a). Rao-Blackwellization of sampling schemes.
Biometrika 83, 81-94.
Casella, G. and Robert, C. P.(1996b). Post-processing accept-reject samples: recycling
and rescaling. Tech. Rep., BU-1311-M., Cornell University, and Tech. Rep., 9625,
INSEE, Paris.
Casella, G. and Robert, C. P. (1996c). Recycling rejected values in accept-reject meth-
ods. C. R. Acad. Sci. Paris 321, 1621-1626.
Dey, D. K., Gelfand, A. E. and Peng, F. (1994). Overdispersed generalized linear models.
Tech. Rep., University of Connecticut.
Eberly, L. E. (1997). Constructing Confidence Statements from the Gibbs Sampler.
Ph.D. Thesis, Cornell University.
Gelfand, A. E., Hills, S. E., Racine-Poon, A. and Smith, A. F. M. (1990). Illustration of
Bayesian inference in normal data models using Gibbs sampling. J. Amer. Statist.
Assoc. 85, 972-985.
Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating
marginal densities. J. Amer. Statist. Assoc. 85, 398-409.
Gelman, A. and Speed, T. P. (1993). Characterizing a joint probability distribution by
conditionals. J. Roy. Statist. Soc. B 55, 185-188.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Trans. Patt. Anal Mach. Intelligence 6, 721-
741.
Geyer, C. (1992). Practical Markov chain Monte Carlo. Statist. Sci. 7, 473-483.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their
applications. Biometrika 57, 97-109.
Hill, B. M. (1965). Inference about Variance Components in the One-Way Model.
J. Amer. Statist. Assoc. 60, 806-825.
Hobert, J. P. and Casella, G. (1995). Functional compatibility, Markov chains, and Gibbs
sampling with improper posteriors. Tech. Rep. BU-1280-M, Cornell University.
Hobert, J. P. and Casella, G. (1996). The effect of improper priors on Gibbs sampling
in hierarchical linear mixed models. J. Amer. Statist. Assoc. (to appear).
Ibrahim, J. G. and Laud, P. W. (1991). On Bayesian analysis of generalized linear
models using Jeffreys's prior. J. Amer. Statist. Assoc. 86, 981-986.
Lehmann, E. L. and Casella, G. (1997). Theory of Point Estimation, Second Edition. New York:
Springer-Verlag.
Levine, R. A. (1996). Optimizing convergence rates and variances in Gibbs sampling
schemes. Ph.D. Thesis, Cornell University.
Liu, J. (1994). The collapsed Gibbs sampler in Bayesian computation with application
to a gene regulation problem. J. Amer. Statist. Assoc. 89, 958-966.
Liu, C. and Rubin, D. B. (1994). The ECME algorithm: a simple extension of EM and
ECM with fast monotone convergence. Biometrika 81, 633-648.
Liu, J., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler
with applications to the comparisons of estimators and augmentation schemes.
Biometrika 81, 27-40.
Liu, J., Wong, W. H. and Kong, A. (1995). Correlation structure and convergence rate
of the Gibbs sampler with various scans. J. Roy. Statist. Soc. B 57, 157-169.
Meng, X.-L. (1994). On the rate of convergence of the ECM algorithm. Ann. Statist. 22,
326-339.
Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM
algorithm: a general framework. Biometrika 80, 267-278.
Meng, X.-L. and van Dyk, D. (1996). The EM algorithm - an old folk song
sung to a fast new tune. J. Roy. Statist. Soc. B (to appear).
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953).
Equation of state calculations by fast computing machines. J. Chemical Phys. 21,
1087-1092.
Natarajan, R. and McCulloch, C. E. (1995). A note on the existence of the posterior
distribution for a class of mixed models for binomial responses. Biometrika 82,
639-643.
Raftery, A. E. and Banfield, J. D. (1991). Stopping the Gibbs sampler, the use of
morphology, and other issues in spatial statistics. Ann. Inst. Statist. Math. 43, 32-43.
Robert, C. (1995). Convergence control methods for Markov chain Monte Carlo algo-
rithms. Statist. Sci. 10, 231-253.
Roberts, G. O. (1992). Convergence diagnostics of the Gibbs sampler. Bayesian Statistics 4
(J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.). Oxford:
Oxford University Press, 775-782.
Roberts, G. O. and Sahu, S. K. (1996). Updating schemes, correlation structure, block-
ing and parameterisation for the Gibbs sampler. Tech. Rep., University of Cam-
bridge.
Rosenthal, J. S. (1995). Rates of convergence for Gibbs sampling for variance compo-
nent models. Ann. Statist. 23, 740-761.
Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components. New
York: Wiley.
Smith, A. F. M. and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler
and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. B 55, 3-23.
Tanner, M. A. (1993). Tools for Statistical Inference. New York: Springer-Verlag.
Tanner, M. A. and Wong, W. (1987). The calculation of posterior distributions by data
augmentation. J. Amer. Statist. Assoc. 82, 805-811 (with discussion).
Tierney, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22,
1701-1762.
Wang, C. S., Rutledge, J. J. and Gianola, D. (1993). Marginal inferences about variance
components in a mixed linear model using Gibbs sampling. Genetique, Selection,
Evolution 25, 41-62.
Wang, C. S., Rutledge, J. J. and Gianola, D. (1994). Bayesian analysis of mixed linear
models via Gibbs sampling with an application to litter size of Iberian pigs.
Genetique, Selection, Evolution 26, 1-25.
Zeger, S. L. and Karim, M. R. (1991). Generalized linear models with random effects:
a Gibbs sampling approach. J. Amer. Statist. Assoc. 86, 79-86.
DISCUSSION
JUAN FERRÁNDIZ (Universitat de València, Spain)
First of all I would like to thank Professor Casella for this stimulating
paper. I have enjoyed reading these many good ideas, presented in so clear
a style. I found his main message very important: not only can statistical
practice benefit from Markov chain Monte Carlo (MCMC) methods, but
these MCMC methods can still take advantage of well-known statistical
ideas.
His second message, related to the Bayesian-frequentist controversy,
has been particularly pleasing to me. I strongly agree with Professor
Casella that
"... there are situations and problems in which one or the other approach
is better-suited, or even a combination may be best, so a statistician without
a command of both approaches may be less than complete."
In fact, as I was reading the paper, I was thinking how his suggestions
could apply to a frequentist context: likelihood methods for spatial
models arising from random variables associated with geographical sites
(see e.g. Ferrándiz et al., 1995).
Gibbs distributions are a natural choice in this context. Among
them, the automodels proposed in Besag (1974) are particularly appealing
because the full conditionals determining the joint distributions are
well-known members of the exponential family.
The corresponding density of these models can be expressed as
E(θ | X, Y) = X − (σ²/(σ² + τ² + γ²)) (X − Y)
and this estimate minimizes the Bayes risk and is admissible under weak
regularity conditions. A related frequentist solution to this problem, in
the spirit of the James-Stein shrinkage estimator, has been developed by
Green and Strawderman (1991). In particular, as they showed in their
paper, this estimate can be seen as an empirical Bayes estimate. In
general, sensible shrinkage estimators have a straightforward Bayesian
justification whereas their derivation in terms of frequentist inference is
not so clear. On the other hand, when testing a model without any specific
alternative in mind, that is when we look at our model and data and try to
see if our hypothesis and the observed data are compatible, we need to
have in mind all the samples that might have been observed if the model
where Y is the sample data and (M1, M2, ..., Mk) is a set of possible
models to be considered. However, this formulation has several problems:
(i) sometimes we do not have a set of alternative models and we
just want to see if the one entertained can be considered a reasonable
approximation; (ii) even if we have several models in mind the present
application of Bayes theorem requires that we have a partition of the
model space, that is the models must be incompatible. In general this
is not the case. This is obvious when some models are nested, as when
selecting between a linear or a quadratic regression, but in general if
we are considering two alternative non-nested models they usually have
some degree of overlap. Sometimes we can avoid the overlap by defin-
ing all the possible combinations of cases as in selecting the best set of
explanatory variables or in outliers problems in which the number of
models is 2^n. However, this partitioning of the model space cannot be
carried out in a clear way in many situations in which we need to choose
between several non-nested nonlinear models.
In closing my comments on the first message of the paper, I would
like to stress my full agreement with the final statement of Section 2,
that both approaches provide the statistician with a better understanding
and a more complete approach to statistics. For instance, Samaniego
and Reneau (1994) showed that the method to be recommended in a
particular application depends crucially on the quality of the available
prior information. The conclusion of all this is that both approaches
the allocation takes into account the likelihood that each stratum includes
unidentified outliers. Peña and Tiao (1992) showed, in a related problem,
that if instead of random sampling we use preliminary information
to stratify the observations, we can obtain a better procedure. Finally, I
believe that the use of time series models in the analysis of the output
of a sequential algorithm can lead to substantial improvement in judging
convergence. In particular, the use of multiple time series models in the
analysis of the output of a parallel algorithm seems to be a promising
area of future research.
In summary, I have found this paper very stimulating and full of
insights. It gives me great pleasure to congratulate Professor Casella
for this outstanding contribution to our journal.
Figure 1. Reference prior for the parameter of a Ga(x | α, 2α) model.
Of course, the results may depend on the prior used. Since this is a
one-parameter regular model, the reference prior is also Jeffreys' prior
(Bernardo, 1979), namely
π(α) ∝ ( ψ′(α) − 1/α )^{1/2},

where ψ′(·) is the trigamma, or first derivative of the digamma, function,
and c = (π²/6 − 1)^{1/2} ≈ 0.65, shown in Figure 1. It may be seen
that, in this case, the reference prior is actually close to the naive "positive
parameter" prior π(α) ∝ α^{−1}.
R. A. GARCÍA-LÓPEZ and A. GONZÁLEZ
(Universidad de Granada, Spain)
We should first like to congratulate Professor Casella for his clear and
detailed explanation of all the aspects concerning the interrelationship
between statistical theory and computational algorithms, in particular
the Gibbs sampler and the accept-reject algorithm. His talk has been
highly methodological as far as all aspects of the choice of algorithm
and its subsequent effects on the inference are concerned. What we
consider to be especially important are the conditions for generating
proper posteriors starting from proper conditionals in the Gibbs sampler.
Some of the published results on this subject ought to be treated with a
degree of caution because the compatibility of the proper conditionals
(cf. Theorem 2 in Prof. Casella's paper) has not been adequately
investigated.
Thus, one question we should like to put to Professor Casella refers
directly to a technical aspect of his approach to the application of the
Gibbs sampler. There are at least two widely known methods of gen-
erating the Gibbs sample, the so-called single-path and multiple-path
methods. Let us suppose that we have a random vector
U = (U_1, …, U_k)

U_i | (U_1, …, U_{i−1}, U_{i+1}, …, U_k)
where (j) denotes the j-th replicate. It is clear that the successive cycles
on a particular path, U^(j)_1, U^(j)_2, …, are not independent, but that
cycles from different paths, U^(1), U^(2), …, U^(m), are indeed indepen-
dent.
With the single-path method you have only to generate one path
long enough to obtain the q values r + 1 through r + q, where r is a point
at which the Gibbs sampler converges. These q values then provide the
basis for our estimation, and they all obviously depend upon the starting
values.
It has already been demonstrated (cf. Geman and Geman, 1984,
and Liu, Wong and Kong, 1992a) that, under general conditions, both
methods result in convergence, i.e.,

U_n →_d U
and starts at a reasonable value. This can occur, of course, only if the
chain is unlikely to visit too close to the singularity. From the author's
experience, is this claim reasonable?
Section 4 was quite interesting and had some nice surprises, but I
note that it ends up essentially with the "status quo" being supported.
The common understanding in use of "accept-reject" and "importance
sampling" includes:
(i) Importance sampling gives more accurate estimates for a single
h.
(ii) If one wants to simultaneously compute expectations for many
h, but the same Y, accept-reject will often be computationally
faster, especially if the acceptance rate is low (since then t will
be considerably smaller than n).
(iii) Rescaling by a correlated estimate of one is an important vari-
ance reduction technique.
on ℓ²(π) with inner product ⟨f | g⟩ = Σ_x π(x) f(x) g(x). The equilibrium
expectation of f under π is then ⟨f⟩ = ⟨f | 1⟩ = E_π(f), and we can think
of (P^k f)(x) and (Π f)(x) as operators on ℓ²(π) given by (P^k f)(x) =
Σ_y p(k, x, y) f(y) and (Π f)(x) = Σ_y π(y) f(y). The matrix Π has rows
equal to π and is an orthogonal projector on ℓ²(π) with range the constant
functions. The autocovariance function of Σ_i f(X_i) is

C_f(|i − j|) = cov( f(X_i), f(X_j) ),

which also equals ⟨f | (P^{|i−j|} − Π) f⟩ = ⟨f | (P − Π)^{|i−j|} f⟩ =
⟨f | (I − Π) P^{|i−j|} (I − Π) f⟩. The autocorrelation function is
ρ_f(|t|) = C_f(|t|)/C_f(0).
In Section 5.3 Professor Casella discusses the minimax properties of
the Monte Carlo average estimate of the parameter μ = E_π h(X). The
limiting risk function R^(n)(h) in (23) can be developed further by using
the results of Peskun (1973). The fundamental matrix of Markov chains
(Kemeny and Snell, 1983), Z = (I − (P − Π))^{−1} = I + Σ_{k=1}^∞ (P^k − Π),
arises naturally in this limiting expression. It can be shown that

τ_h = ⟨h | (2Z − I − Π) h⟩ / ( 2 ⟨h | (I − Π) h⟩ ),

which is known as the integrated relaxation time; see Sokal (1989) and Gidas
(1995). There are n/(2τ_h) effectively independent samples in a run of
length n. Note that τ_h = (1/2) Σ_i ρ_h(i).
Professor Casella asserts that the risk function contains more infor-
mation than is contained in the rate of convergence. This assertion can
be seen using the ideas above. In the case where P is self-adjoint on
ℓ²(π), that is, ⟨g | Ph⟩ = ⟨Pg | h⟩, one can relate the rate of con-
vergence of the chain to the limiting risk. Let the ordered eigenvalues
of P be 1 = β_0 > β_1 ≥ β_2 ≥ ⋯ ≥ β_min > −1, where β_min equals the
smallest eigenvalue. Much work (see Diaconis and Stroock, 1991) has
focused on methods for bounding β_1, β_min, and β_* = max(β_1, |β_min|) that
give rise to bounds on the rate of convergence of the chain to its station-
ary distribution. As pointed out in Diaconis and Stroock (1991), there
are advantages to studying I - P instead of P. The spectrum of I - P
consists of the numbers λ_i = 1 − β_i. Using the minimax representation of
eigenvalues,

λ_1 = inf_h ⟨h | (I − P) h⟩ / ⟨h | (I − Π) h⟩ = inf_h [1 − ρ_h(1)],

where the infimum is over all nonconstant functions h ∈ ℓ²(π). This ratio
is called the Rayleigh quotient, and its numerator may be represented as

(1/2) Σ_{i,j} π(i) p(i, j) [h(i) − h(j)]².
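These identities are easy to verify on a small chain. The sketch below uses a hypothetical reversible 3-state transition matrix (my own example) and checks that the Rayleigh-quotient numerator ⟨h | (I − P)h⟩ equals the sum (1/2) Σ π(i) p(i, j) [h(i) − h(j)]².

```python
import numpy as np

# hypothetical reversible 3-state chain with stationary distribution pi
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

# detailed balance: pi_i P_ij = pi_j P_ji
F = pi[:, None] * P
assert np.allclose(F, F.T)

h = np.array([1.0, -0.3, 2.0])          # arbitrary nonconstant test function

# Dirichlet form computed two ways
lhs = pi @ (h * ((np.eye(3) - P) @ h))  # <h | (I - P) h> in l2(pi)
rhs = 0.5 * np.sum(pi[:, None] * P * (h[:, None] - h[None, :])**2)
print(lhs, rhs)                          # identical for reversible P
```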
Hence the randomized approach with λ = 1/2 is never more than a factor
of 2 from the best value of λ.
least some small interval extending right from zero. This suggests that
σ² > 0. Thus the ACD estimator has great potential to be misleading
about the posterior evidence concerning small values of σ².
This aberrant behavior has been illustrated in a very simple model
where the posterior marginal distribution of the variance component can
be obtained analytically. The behavior seems to occur quite generally,
however, whenever a prior density which is positive at zero is specified
for a variance component. The use of such priors seems quite appropriate
in many contexts, even though inverse gamma priors, which vanish at
zero, are much more commonly specified for variance components. The
data cannot rule out the absence of a random effect (σ² = 0), so it seems
overly confident to use a prior which vanishes as σ² goes to zero. In
fact, one might argue that monotone decreasing prior densities should be
specified, in order to favor parsimonious models. The Jeffreys prior for
the simple model discussed above has a monotone decreasing density
which is finite and positive at zero. One disadvantage of not using an
inverse gamma prior is that the "conditional conjugacy" which drives the
Gibbs sampler will be lost. The ACD approach can be extended to deal
with this, however, based on work of Chen (1994). But Chen's density
estimator will still have aberrant behavior near zero.
In one sense it is not surprising that the ACD estimator does not
work well for variance component marginals with prior densities which
are positive at zero. In such problems, the Bayes factor for testing the
absence of random effects can be expressed as the Savage-Dickey density
ratio, which is the ratio of posterior to prior marginal densities for the
variance component, both evaluated at zero. For details see Verdinelli
and Wasserman (1995). If the ACD estimator worked well for estimating
the posterior marginal density at zero, then we would have an easy and
reliable way to estimate the Bayes factor. But invariably Bayes factors
are harder to compute than other posterior quantities. In this regard, we
are not surprised that there is no free lunch via the ACD estimator.
(1/n) Σ_{i=1}^n w_Y(y_i) h(y_i).
≥ ∫ (f(y)/g(y)) h²(y) f(y) dy = E_g{ w²(y) h²(y) }
  ≥ var_g{ w(y) h(y) } = n var(δ_IS).
An effort to compare the two samplers with Metropolized indepen-
dence sampling was made in Liu (1996). Since the advantage of
the rejection method is that exact draws from f can be obtained, it is
sometimes useful to combine the two samplers when one wants to reduce
importance sampling variation (Liu, Chen, and Wong 1996).
In many practical problems, the marginal weight w_Y(y) is difficult
to compute, whereas the conditional expectation E_f{h(X) | Y = y} is
relatively easy to obtain. In such cases, as shown in Kong et al. (1994),
one can use a partial RB-estimate
(1/n) Σ_{i=1}^n E_f{ h(X) | Y = y_i }.
               y = 1     y = 2
π_1(x, y):  x = 1   0.26591   0.02955
            x = 2   0.21136   0.49318

               y = 1     y = 2
π_2(x, y):  x = 1   0.19091   0.10455
            x = 2   0.28636   0.41818
The sampler is, therefore, a combination of two positive recurrent Markov
chains; and depending on how the joint state is defined, the sampler
converges to two different, though very close, distributions. When
running a random-scan Gibbs sampler, however, a proper limiting
distribution, that is, the mixture of the two distributions given above,
exists.
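A tiny numerical illustration of the two limiting distributions (the two-by-two conditionals below are hypothetical, chosen only to be incompatible; they are not the conditionals behind the tables above):

```python
# Hypothetical incompatible pair of conditionals on {1,2} x {1,2}:
#   f1[x][y] = f1(y | x),   f2[y][x] = f2(x | y).
f1 = [[0.7, 0.3], [0.4, 0.6]]
f2 = [[0.2, 0.8], [0.5, 0.5]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Tx(x0, x1) = sum_y f1(y|x0) f2(x1|y);  Ty(y0, y1) = sum_x f2(x|y0) f1(y1|x)
Tx = matmul(f1, f2)
Ty = matmul(f2, f1)

def stationary(T, iters=200):
    # Power iteration on the row vector v -> v T; converges to the
    # unique stationary distribution of a positive recurrent 2-state chain.
    v = [0.5, 0.5]
    for _ in range(iters):
        v = [v[0] * T[0][j] + v[1] * T[1][j] for j in range(2)]
    return v

pi_x = stationary(Tx)
pi_y = stationary(Ty)

# Joints implied by the two scan orders: pi1(x,y) = pi_x(x) f1(y|x) and
# pi2(x,y) = pi_y(y) f2(x|y).  Both are proper, but they differ.
pi1 = [[pi_x[i] * f1[i][j] for j in range(2)] for i in range(2)]
pi2 = [[pi_y[j] * f2[j][i] for j in range(2)] for i in range(2)]
```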
Under some regularity conditions that are satisfied in most practical
situations, T_x(x_0, x_1) = ∫ f_1(y | x_0) f_2(x_1 | y) dy defines a
positive recurrent transition function for the X space, and
T_y(y_0, y_1) = ∫ f_2(x | y_0) f_1(y_1 | x) dx defines that for the Y space.
Hence two limiting distributions π_1(x) and π_2(y), for T_x and T_y,
respectively, are uniquely
determined. In the incompatible case, the two limits do not correspond
to a common joint distribution. But

p(x_1, x_2) = p(x_2 | x_1) / ∫ [p(x_2 | x_1)/p(x_1 | x_2)] dx_2.   (1)
While identity (1) also provides an explicit formula showing how
p(x_1 | x_2) and p(x_2 | x_1) uniquely determine p(x_1, x_2), it seems to be much

p(x_1) ∝ p(x_1 | x'_2) / p(x'_2 | x_1), for any fixed x'_2 ∈ Ω_2,   (3)

p(x_2 | x_1) / p(x_1 | x_2) = p_2(x_2) / p_1(x_1).   (4)
∫ ⋯ ∫ g(x_1, …, x_m) μ_m(dx_m) ⋯ μ_1(dx_1) < ∞.   (7)
Conditions (I) and (II) amount to the conditional compatibility of f_jj
and f_ij conditional on X_{−A_ij}, where A_ij = {k : k > j, k ≠ i}.
Because of (12), these conditions are also sufficient for the compatibility
of {f_{i1}, i = 1, …, m}. In other words, {p(x_i | x_{−i}), i = 1, …, m}
are compatible if and only if (I) and (II) are satisfied for all
m − 1 ≥ j ≥ 1 and i ≥ j + 1.
A matrix representation of {f_ij, i ≥ j, j = 1, …, m} perhaps can
help to visualize the recursive de-conditioning process defined by (10).
Table 1 gives the representation with m = 4, where we use [· | ·] to
denote a conditional density (e.g., [4|3] := p(x_4 | x_3)) and "k" to indicate
the elimination (i.e., "de-conditioning") of x_k from the variables that are
being conditioned on.
where η_i = Σ_{k≠i} x_k. It then follows that (14) holds if and only if
ρ² < 1, under which

p(x_1 | x_{−A}) ∝ p(x_1 | x'_2, x_{−A}) / p(x'_2 | x_1, x_{−A}),   (19)

for any A ⊇ {1, 2} and any fixed x'_2 ∈ Ω_2.
Moreover, if the function h is bounded on the support of f, then the
convergence rate of the bias is O(n^{-1}).
lim_{n→∞} E[(δ_n^R − E_f[h(X)])²] = 0.

The Riemann estimator can also be computed from an instrumental sample,

δ̄_n = Σ_{i=1}^{n−1} (y_{(i+1)} − y_{(i)}) h(y_{(i)}) f(y_{(i)}),

where y_{(1)} ≤ ⋯ ≤ y_{(n)} is an ordered sample of variables with density
g. Note that the density g does not appear explicitly in the expression of
the estimator. A good choice of the instrumental function is a density
proportional to |h| f. This choice is optimal in terms of reduction of the
variance.
Table 1.1. Comparison of the mean squared errors for the estimation of
a gamma mean given by the empirical average and by the Riemann estimators
δ_R^1 and δ_R^2, obtained respectively with the sample simulated from
Ga(a, 2a) and from the density proportional to hf, based on 7500 simulations.

AR sample    δ_E      δ_R^1    δ_R^2    Percent decrease
size (t)     MSE      MSE      MSE      in MSE for δ_R^2
25           .0041    .0060    .0021    48.78
50           .0020    .0026    .0006    70.00
100          .0010    .0009    .0001    90.00
Table 1.2. Comparison of the mean squared errors for the estimation of
a gamma mean given by the Riemann estimators recycling the N values
produced by the accept-reject algorithm, for the sample from Ga(a, 2a)
(δ_R^1) and for the sample from the density proportional to hf (δ_R^2),
based on 7500 simulations.

AR sample    δ_R^1    δ_R^2
size (t)     MSE      MSE
25           .0031    .002
50           .0012    .0002
100          .0004    .0001
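To make the Riemann-sum construction concrete, here is a self-contained sketch with a target of my own choosing (Exp(1) rather than the gamma of the tables): the estimator sums (y_(i+1) − y_(i)) h(y_(i)) f(y_(i)) over the ordered sample.

```python
import random, math

# Toy illustration (my choice): estimate E_f[Y] = 1 for f = Exp(1) with the
# Riemann-sum estimator  sum_i (y_(i+1) - y_(i)) h(y_(i)) f(y_(i)).
random.seed(2)
n = 20_000
ys = sorted(random.expovariate(1.0) for _ in range(n))   # ordered sample

def f(y):
    return math.exp(-y)      # Exp(1) density

def h(y):
    return y                 # estimating the mean

riemann = sum((ys[i + 1] - ys[i]) * h(ys[i]) * f(ys[i]) for i in range(n - 1))
empirical = sum(ys) / n
# Both approximate E_f[h(Y)] = 1; the Riemann sum integrates h*f over the
# sampled range, which is why its variance is typically much smaller.
```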
Note that, when we use the Gibbs sampler algorithm, this estimator is
available. Therefore, we can always get the following generalized form
of the Riemann estimator:

δ_n^{R/RB} = n^{-1} Σ_{t=1}^{n} (x_1^{(t+1)} − x_1^{(t)}) h(x_1^{(t)}) Σ_{k=1}^{n} π(x_1^{(t)} | x_2^{(k)}, …, x_p^{(k)}).   (2.2)
The computational cost of this estimator is higher than for the standard
Riemann estimator but the efficiency is quite similar and it definitely
improves upon the empirical average. Its performance is illustrated
for the auto-exponential model (Besag, 1974).
Example 2. Consider the density

f_1(y_1) ∝ e^{−y_1} / (1 + y_1),
we can compare the Riemann estimators (1.1) and (2.1) with the empirical
average and the Rao-Blackwell estimator. By running a Monte Carlo
experiment 200 times, we build equal-tailed confidence regions C_n such
that, for fixed n,

P(E_f(Z) ∈ C_n) = 1 − α.
Figure 2.1. 95% confidence band for the estimation of E_f(Z) for the auto-
exponential model: the empirical average (plain), the Riemann estimator
(1.1) (dots), the modified Riemann estimator (2.2) (dashes), and the Rao-
Blackwell estimator (long dashes). For n = 5,000, the confidence bands are
[0.6627, 0.6932], [0.6761, 0.6806], [0.6738, 0.6825], and [0.6728, 0.6807]
respectively, and the true value is 0.6768.
Figure 2.1 shows the behavior of the confidence bands for α = 0.05. The
amplitudes of the confidence bands of the Riemann and Rao-Blackwell
estimators are quite similar. All three estimators improve upon the
empirical average.
A major theme of this paper is the interplay between the data model
and the MC simulation method. I prefer to view the MC simulation as an
additional step of data collection, much like a second stage of sampling in
a multistage survey. Let S^(m) denote the output stream from a simulation
run of length m. If computational resources were unlimited, we could
generate S^(∞) and obtain inferences equivalent to those from the actual
posterior distribution P(h(θ) | y). In reality we can generate only S^(m),
so the best inferences attainable will be those based on the reduced
information in the posterior P(h(θ) | S^(m)). Perhaps we should focus
our efforts on approximating P(h(θ) | S^(m)).
m^{-1} Σ_{i=1}^{m} ∫_{a*}^{∞} π(θ | y, λ_i) dθ = α

based on the Gibbs sequence (θ_1, λ_1), (θ_2, λ_2), .... This problem can
be immediately generalized to finding a* such that ∫_{a*}^{∞} g_m(θ) dθ = α,
where
g_m(θ) = m^{-1} Σ_{i=1}^{m} φ(θ_i | λ_i) f(θ, λ_i) / f(θ_i, λ_i)

for any proper conditional density φ(· | ·) having the same support as
π(θ | y, λ), and f(θ, λ) ∝ π(θ, λ | y), the latter being the joint posterior
density of (θ, λ) given y. The function g_m(θ) is the importance weighted
marginal density (IWMD) estimator of Chen (1994), and reduces to
m^{-1} Σ_{i=1}^{m} π(θ | y, λ_i) for φ(θ | λ) = π(θ | y, λ). The ensuing proposal
therefore covers both possibilities. The density estimate g_m(θ) may not
integrate to 1 (cf. Chen, 1994); it is useful to note here that the following
will only require g_m(θ) to integrate to c for some c > 0, and thus no
numerical renormalization of g_m(θ) is necessary.
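To see the estimator in action, here is a self-contained sketch on a toy "posterior" of my own (a bivariate normal for (θ, λ), with f deliberately left unnormalized). Taking φ(θ | λ) to be the exact conditional makes g_m collapse to the Rao-Blackwell mixture, as noted above.

```python
import random, math

# Toy joint "posterior" (my choice): (theta, lambda) bivariate normal with
# correlation rho.  f is the *unnormalized* joint density, as the IWMD
# construction allows; phi is the exact conditional pi(theta | lambda).
rho = 0.5
s2 = 1.0 - rho * rho                       # conditional variance

def f(theta, lam):                         # unnormalized joint density
    return math.exp(-(theta**2 - 2*rho*theta*lam + lam**2) / (2 * s2))

def phi(t, lam):                           # conditional density pi(t | lam)
    return math.exp(-(t - rho*lam)**2 / (2 * s2)) / math.sqrt(2*math.pi*s2)

random.seed(3)
theta, m, draws = 0.0, 20_000, []
for _ in range(m):                         # two-step Gibbs sampler
    lam = random.gauss(rho * theta, math.sqrt(s2))
    theta = random.gauss(rho * lam, math.sqrt(s2))
    draws.append((theta, lam))

def g_m(t):                                # IWMD estimate of the marginal at t
    return sum(phi(th, la) * f(t, la) / f(th, la) for th, la in draws) / m

def rb(t):                                 # Rao-Blackwell mixture, for comparison
    return sum(phi(t, la) for _, la in draws) / m

# With phi equal to the conditional, g_m and rb coincide term by term;
# both approximate the true N(0, 1) marginal of theta.
```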
Given a Gibbs sequence (θ_1, λ_1), (θ_2, λ_2), …, (θ_m, λ_m), we can
easily calculate the corresponding IWMD estimate. Suppose that m is
reasonably large and that π(θ | y) ≈ c^{-1} g_m(θ) is unimodal with θ̂ =
based on 11,000 iterations (the first 1000 of which are treated as "burn-
in"), and then marginal posterior intervals are respectively calculated via
the empirical cdf's of the 10,000 iterates of ρ and θ.
The full conditionals are not "nice" in this problem, and it is
advantageous to use the IWMD estimator. Based on the Gibbs output, I
estimated the marginal densities of ρ and θ as discussed above; φ(· | ·)
was taken to be a Beta density with mean and variance matching the
empirical mean and variance of the parameter whose marginal density
was being computed. To calculate the posterior marginal HPD region
for θ, I generated the IWMD estimate for θ on an equally-spaced grid of
points (mesh = 0.01). Tail probabilities at any given point (away from
the very extreme tail) were then calculated using the tail probability
formula above. This was accomplished by setting θ̂ = argmax_{a_i} g_m(a_i)
and then computing k(θ̂) and k^{(j)}(θ̂), j = 1, 2, the latter via standard
formulas for numerical derivatives. Recalling that H(a; α) =
P{θ > a} − α, the equations defining the 95% marginal HPD limits are
H(θ_U; 0.025) = 0 and H(θ_L; 0.975) = 0. As an approximation to θ_U, I
used θ̂_U = 0.5(a_1 + a_2), where a_1 = argmax_a{H(a; 0.025) > 0} and
a_2 = argmin_a{H(a; 0.025) ≤ 0}; θ_L was determined similarly. The
results are summarized in Table 2.
* based on renormalized IWMD estimate using 32-point Gaussian quadrature
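The bracketing rule for the HPD limits is easy to reproduce on a grid. The sketch below substitutes a simple trapezoid tail probability for the DiCiccio/Martin formula, and a standard normal density stands in for the IWMD estimate; both substitutions are mine.

```python
import math

# Stand-in density on an equally spaced grid (mesh = 0.01), as in the text;
# the N(0,1) density replaces the IWMD estimate for this illustration.
mesh = 0.01
grid = [i * mesh for i in range(-500, 501)]                   # [-5, 5]
dens = [math.exp(-a * a / 2) / math.sqrt(2 * math.pi) for a in grid]

# Tail probabilities P{theta > a} by the trapezoid rule, accumulated once
# right-to-left (this replaces the tail probability formula of the text).
tails = [0.0] * len(grid)
for i in range(len(grid) - 2, -1, -1):
    tails[i] = tails[i + 1] + 0.5 * (dens[i] + dens[i + 1]) * mesh

def H(i, alpha):                     # H(a; alpha) = P{theta > a} - alpha
    return tails[i] - alpha

# theta_U ~ 0.5*(a1 + a2), with a1 the largest grid point where H > 0 and
# a2 the smallest grid point where H <= 0, exactly the bracketing rule.
a1 = max(grid[i] for i in range(len(grid)) if H(i, 0.025) > 0)
a2 = min(grid[i] for i in range(len(grid)) if H(i, 0.025) <= 0)
theta_U = 0.5 * (a1 + a2)
# For N(0,1) this lands near the exact 0.975 quantile, 1.95996.
```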
2. Computational Algorithms
At the very least, I am heartened that some of this work has resulted in
people being sensitized (but not in the sense of Professor Meng) to the
impact of the algorithm on the inference. The concerns of Professor Peña
are well founded, and the guidelines of Professor Ríos Insua are quite
important. As Professor Schafer points out, focusing on the algorithm
may be one step removed from our ultimate purpose, but it is an important
step. As we will see in Section 4.2, problems can appear even with
seemingly reasonable MC estimators. But even more importantly, I
believe that we are all beginning to approach theoretical problems in
a new way, always thinking of the computations, and being concerned
more with algorithms than theorems. Such an approach can only enhance
our thinking and broaden our influence.
3. Posterior Distributions
The power variance priors of model (4) are mainly chosen because (i)
experimenters tend to believe that improper priors reflect impartiality and
(ii) they result in easy-to-simulate conditionals. As Professor Peña notes,
the Jeffreys priors considered by Ibrahim and Laud (1991) indeed give
proper posterior distributions, as will Professor Bernardo's reference
priors, as they both control the tail at zero. Any reanalysis with these
priors will result in coherent inferences, the only drawback being that
the conditional distributions are not as easy to sample from. However,
the inferences are definitely superior.
The popularity of the power prior is an example of the algorithm
overshadowing the statistics. Experimenters were so keen to make the
Gibbs sampler work that they forgot to check the fundamentals of the
model. Moreover, choosing a = b = 0 in (4), which usually is justified
through an invariance argument, is extremely unfortunate as, for exam-
ple, a = b = 1/2 would yield easily obtained conditionals and proper
posterior distributions.
Many discussants had extremely interesting comments and concerns
about this topic. I can loosely group those concerns in the following
subsections.
3.1. Incompatibility. The property of compatibility of densities has
received a lot of comment, and I am heartened that the discussants feel
that this property is as important as Jim Hobert and I do. I should first
π(β | y) = ∫ π(α, β | y) dα.

If so, then it is impossible for π(β | y) to be proper, as

∫ π(β, y) dβ = ∫∫ π(α, β, y) dα dβ = ∞.
Thus there is no meaningful inference about the parameter β that can be
recovered from the full model. (I also suspect that any inference about
β in this model would be incoherent in the sense of Heath and Sudderth
1989.)
So what about the experience of Berger, and the examples of George?
These are instances in which there is reason to abandon the full model.
That is, the transformations of George, and the "identifiability" of Berger,
are procedures for changing the model. In my illustration above, the
parameter α would be somehow eliminated, and only β would be
considered, with a proper π(β | y). So my point is that if a model results
in an improper full posterior, there is no lower dimensional inference
based on the full model that can make sense. However, there may be a
lower dimensional model that makes sense. I have no problem with this
solution, but realize that the model is being changed in a fundamental
way; we are not recovering anything from the improper posterior
distribution. The interesting procedure discussed by Meng, that of recursive
deconditioning, seems to be an excellent candidate for searching for such
lower dimensional models.
3.3. Fixing Impropriety. If the posterior distribution is improper, an
obvious fix is to replace it with a sufficiently "vague" proper prior that
4. Rao-Blackwellization
The technique of Rao-Blackwellization has expanded beyond the original
idea of conditioning on a sufficient statistic. Indeed, in my thinking,
it has expanded to encompass a class of techniques that aim at improving
estimators by taking advantage of the structure of the problem in
whatever manner is available.
I don't believe that we have returned to the status quo, as stated by
Berger. Even in situations where we end up with the same procedures,
we also end up learning a lot (the gains of Rao-Blackwellization can
be huge, and easy to obtain) and have not always returned to the status
quo (the full Rao-Blackwellized estimator is still the only one to achieve
substantial gains while retaining unbiasedness). Although Ferrándiz
rightly points out that the Rao-Blackwellization in the paper only applies
to algorithms with ancillary random variables, the general approach goes
far beyond this case. Perhaps the most important contribution is that we
have stimulated thinking to search for better ways to process the output,
δ_IS = (1/n) Σ_{i=1}^{n} [f(X_i) / f(X_i | Y_i)] h(X_i)

(ignoring the possibility that the marginal f(x) may not be computable).
An interesting fact is that
"correct for constants". This is perhaps more clear when the estimator
is written in the form (9), which can only be done with the knowledge of
the value of t, that is, with knowledge of the stopping rule. The estimator
δ_IS of Liu's discussion, that is,

δ_IS = (1/n) Σ_{i=1}^{n} [f(Y_i) / g(Y_i)] h(Y_i),

= ∫ lim_{t→t_0} π_{θ|u,y}(t | u, y) m(u | y) du.
5. Other Concerns
5.1. Multiple Paths. The question of multiple path Gibbs sampling was
raised by both Bernardo and García-López and González, although in
different contexts. Firstly, the number of paths used in the Gibbs sampler
will not have any impact on propriety or compatibility, as these are
properties of the underlying model, and the manner in which we observe
the model cannot have any bearing. The question of how multiple paths
can affect the variance of our estimate is also an interesting one, and
prompted me to write the following.
where τ_k² = var(δ^{(i)} | θ_i), the variance that is due only to the algorithm,
and not to the model. Now we can see the effect of multiple paths
(m) and increasing the length of the chain (k). As k → ∞, τ_k² → 0,
so increasing the length of the chain will reduce the variation due to the
algorithm and also diminish the effect of Rao-Blackwellization (but, as
we saw in Section 5.2, not erase it). However, increasing m, the number
of paths, has no direct effect on τ_k², but still will reduce var(δ). But
this latter situation is less desirable, as we should strive to eliminate
the variation due solely to the algorithm (which is under our control).
Thus, this naive analysis seems to show that there is less to be gained
in variance reduction, whether the criterion is Bayesian or frequentist,
from running multiple chains.
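The point about where the algorithmic variance goes can be checked by simulation. Below, an AR(1) chain stands in for a Gibbs sampler (true stationary mean 0), with a deliberately bad common starting value; for the same total budget, four short chains pay the transient four times, while one long chain pays it once. All numerical choices are hypothetical.

```python
import random

# AR(1) surrogate for a Gibbs sampler: x_{t+1} = rho*x_t + N(0,1),
# stationary mean 0; every chain starts at the (bad) value 5.
rho = 0.9

def chain_mean(k, x0=5.0):
    x, s = x0, 0.0
    for _ in range(k):
        x = rho * x + random.gauss(0.0, 1.0)
        s += x
    return s / k                    # estimates the stationary mean, 0

def mse(m, k, reps=300):
    # Replicate the whole procedure: m chains of length k, averaged.
    errs = []
    for _ in range(reps):
        est = sum(chain_mean(k) for _ in range(m)) / m
        errs.append(est ** 2)
    return sum(errs) / reps

random.seed(4)
mse_long = mse(m=1, k=400)          # one chain, length 400
mse_multi = mse(m=4, k=100)         # four chains, same total budget
# The four short chains incur the start-up transient (the tau_k^2 term)
# four times, so their MSE is visibly larger: increasing k attacks the
# algorithmic variance, increasing m does not.
```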
Equation (R3) may also answer the concern of Ríos Insua that our
stream of "endless data" eliminates the role of Bayesian statistics.
Indeed, a more careful analysis of (R3), and the effects of changing k and
m would almost certainly need some form of prior input to help balance
the effects of the model and the algorithm.
5.2. Accurate Approximations. Professor Strawderman reminds me of
one of my own lessons, that of not forgetting that we are statisticians with
a large box of tools. He brings the methods of higher-order asymptotics
to bear on the Gibbs sampler, showing that the DiCiccio/Martin tail
probability approximation results in an extremely accurate approximation to
the desired posterior probability in Section 5.1. Bravo. Professors
DiCiccio and Wells also note the place for higher-order asymptotics, and
make an interesting point about recovering a frequentist inference in the
face of the Bayesian "catastrophe". Of course, whether the posterior
distribution is proper has no bearing on the frequentist inference, which
can always be made. However, under such catastrophic priors, such as
a = b = 1, the Gibbs sampler cannot be used to produce reasonable
frequentist inferences. Indeed, conjecturing based on the results of
Natarajan and McCulloch (1996), such catastrophic priors could leave us quite far
from reasonable frequentist inference.
Box, G. E. P. (1980). Sampling and Bayes' inference in scientific modelling and robustness. J. Roy. Statist. Soc. A 143, 383-430.
Caracciolo, S., Pelissetto, A. and Sokal, A. D. (1990). Nonlocal Monte Carlo algorithms
for self-avoiding walks with fixed endpoints. J. Stat. Phys. 60, 7-53.
Chen, M. (1994). Importance-weighted marginal Bayesian posterior density estimation. J. Amer. Statist. Assoc. 89, 818-824.
Christiansen, C. and Morris, C. (1995). Hierarchical Poisson regression modeling.
Tech. Rep., Department of Health Care Policy, Harvard.
Daniels, M. (1996). A prior for the variance in hierarchical models. Tech. Rep., De-
partment of Statistics, Carnegie Mellon University.
Daniels, M. and Gatsonis, C. (1996). Multilevel hierarchical generalized linear models
in health services research. Tech. Rep., Department of Health Care Policy, Harvard.
Dawid, A. P., Stone, M. and Zidek, J. V. (1973). Marginalization paradoxes in Bayesian and structural inference. J. Roy. Statist. Soc. B 35, 189-233 (with discussion).
Diaconis, P. and Stroock, D. (1991). Geometric bounds for eigenvalues of Markov chains. Ann. Appl. Probab. 1, 36-61.
DiCiccio, T. and Martin, M. (1993). Simple modifications for signed roots of likelihood
ratio statistics. J. Roy. Statist. Soc. B 55, 305-316.
DiCiccio, T., Kass, R., Raftery, A. and Wasserman, L. (1996). Computing Bayes factors
by combining simulation and asymptotic approximations. Tech. Rep., Carnegie-
Mellon University, Pittsburgh PA.
DuMouchel, W. (1994). Hierarchical Bayes linear models for meta-analysis. Tech.
Rep. 27, National Institute of Statistical Sciences.
Farewell, V. and Sprott, D. (1988). The use of a mixture model in the analysis of count
data. Biometrics 44, 1191-1194.
Ferrándiz, J., López, A., Llopis, A., Morales, M. and Tejerizo, M. L. (1995). Spatial interaction between neighbouring counties: cancer data in Valencia (Spain). Biometrics 51, 665-678.
Gelman, A. and Rubin, D. B. (1992). A single series from the Gibbs sampler provides a false sense of security. Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.). Oxford: Oxford University Press, 625-631.
Gelfand, A. E. and Smith, A. F. M. (1990). Sampling based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398-409.
Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions and the
Bayesian Restoration of Images. IEEE Trans. Pattern. Anal. Mach. Intelligence 6,
721-741.
Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. J. Roy. Statist. Soc. B 54, 657-699 (with discussion).
Geyer, C. J. and Thompson, E. A. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc. 90, 909-920.
Green, E. J. and Strawderman, W. E. (1991). A James-Stein type estimator for combining unbiased and possibly biased estimators. J. Amer. Statist. Assoc. 86, 1001-1006.
Robert, C. P. (1996). Méthodes de Monte Carlo par Chaînes de Markov. Paris: Economica.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York:
Wiley.
Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. New York: Wiley.
Samaniego, F. J. and Reneau, D. M. (1994). Toward a reconciliation of the Bayesian and frequentist approaches to point estimation. J. Amer. Statist. Assoc. 89, 947-957.
Schafer, J. L. (1996). Analysis of Incomplete Multivariate Data. London: Chapman
and Hall, (in press).
Smith, A. and Gelfand, A. (1992). Bayesian statistics without tears. Amer. Stat. 46,
84-88.
Sokal, A. D. (1989). Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms. Cours de Troisième Cycle de la Physique en Suisse Romande, Lausanne.
Spiegelhalter, D., Thomas, A., Best, N. and Gilks, W. (1996). BUGS: Bayesian Infer-
ence Using Gibbs Sampling, Version 0.50. Cambridge: MRC Biostatistics Unit.
Stein, C. (1959). An example of wide discrepancy between fiducial and confidence
intervals. Ann. Math. Statist. 30, 877-880.
Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate
normal mean. Ann. Math. Statist. 42, 385-388.
Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc. 82, 528-550 (with discussion).
Verdinelli, I. and Wasserman, L. (1995). Computing Bayes factors by using a general-
ization of the Savage-Dickey density ratio. J. Amer. Statist. Assoc. 90, 614-618.
Yakowitz, S., Krimmel, J. E. and Szidarovszky, F. (1978). Weighted Monte Carlo integration. SIAM J. Numer. Anal. 15, 1289-1300.