Bayesian Inference
Christian P. Robert1,3, Jean-Michel Marin2,3 and Judith Rousseau1,3
1 Université Paris-Dauphine, 2 Université de Montpellier 2, and 3 CREST, INSEE, Paris
1 Introduction
This chapter provides an overview of Bayesian inference, mostly emphasising that it is a
universal method for summarising uncertainty and making estimates and predictions using
probability statements conditional on observed data and an assumed model (Gelman 2008).
The Bayesian perspective is thus applicable to all aspects of statistical inference, while being
open to the incorporation of information items resulting from earlier experiments and from
expert opinions.
We provide here the basic elements of Bayesian analysis when considered for standard
models, referring to Marin and Robert (2007) and to Robert (2007) for book-length entries.1
In the following, we refrain from embarking upon philosophical discussions about the nature
of knowledge (see, e.g., Robert 2007, Chapter 10), opting instead for a mathematically sound
presentation of an eminently practical statistical methodology. We indeed believe that the
most convincing arguments for adopting a Bayesian version of data analyses are in the
versatility of this tool and in the large range of existing applications, rather than in those
polemical arguments (for such perspectives, see, e.g., Jaynes 2003 and MacKay 2002).
2.1 Bases
We start this section with some notation about the statistical model that may appear to be
over-mathematical but is nonetheless essential.
Given an independent and identically distributed (iid) sample Dn = (x1 , . . . , xn ) from a
density fθ , with an unknown parameter θ ∈ Θ, like the mean µ of the benchmark normal
distribution, the associated likelihood function is

ℓ(θ|Dn ) = ∏_{i=1}^{n} fθ (xi ) .
1 The chapter borrows heavily from Chapter 2 of Marin and Robert (2007).
This quantity is a fundamental entity for the analysis of the information provided about the
parameter θ by the sample Dn , and Bayesian analysis relies on this function to draw inference
on θ.2 Since all models are approximations of reality, the choice of a sampling model is wide-
open to criticism (see, e.g., Templeton 2008), but those criticisms go far beyond Bayesian
modelling and question the relevance of completely built models for drawing inference or
running predictions. We will therefore address the issue of model assessment later in the
chapter.
The major input of the Bayesian perspective, when compared with a standard likelihood
approach, is that it modifies the likelihood—which is a simple function of θ—into a posterior
distribution on the parameter θ—which is a probability distribution on Θ defined by
π(θ|Dn ) = ℓ(θ|Dn ) π(θ) / ∫_Θ ℓ(θ|Dn ) π(θ) dθ .        (1)
The factor π(θ) in (1) is called the prior (the qualifier density being often omitted) and it
necessarily has to be determined to start the analysis. A primary motivation for introducing
this extra factor is that the prior distribution summarizes the prior information on θ; that is,
the knowledge that is available on θ prior to the observation of the sample Dn . However, the
choice of π(θ) is often decided on practical or computational grounds rather than on strong
subjective beliefs or on overwhelming prior information. As will be discussed later, there
also exist less subjective choices, based on families of so-called noninformative priors.
The radical idea behind Bayesian modelling is thus that the uncertainty on the unknown
parameter θ is more efficiently modelled as randomness and consequently that a probability
distribution π on Θ is needed as a reference measure. In particular, the distribution Pθ of the
sample Dn then takes the meaning of a probability distribution on Dn that is conditional
on [the event that the parameter takes] the value θ, i.e. fθ is the conditional density of x
given θ. The above likelihood offers the dual interpretation of the probability density of Dn
conditional on the parameter θ, with the additional indication that the observations in Dn are
independent given θ. The numerator of (1) is therefore the joint density on the pair (Dn , θ)
and the (standard probability calculus) Bayes theorem provides the conditional (or posterior)
distribution of the parameter θ given the sample Dn as (1), the denominator being called the
marginal (likelihood) m(Dn ).
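Since the integral in the denominator of (1) is rarely available in closed form, a simple numerical rendering of (1) is often useful. The short Python sketch below is our own illustration, with an arbitrary normal model and an arbitrary normal prior: it evaluates the posterior of a location parameter on a grid, the normalising sum playing the role of the marginal likelihood m(Dn ).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.5, scale=1.0, size=20)     # simulated sample D_n (model and prior are arbitrary choices)

    theta = np.linspace(-5.0, 5.0, 2001)            # grid on the parameter space Theta
    dtheta = theta[1] - theta[0]

    # log-likelihood l(theta | D_n) = sum_i log f_theta(x_i), evaluated on the grid
    loglik = np.array([stats.norm.logpdf(x, loc=t, scale=1.0).sum() for t in theta])
    logprior = stats.norm.logpdf(theta, loc=0.0, scale=10.0)   # assumed N(0, 10^2) prior on theta

    log_unnorm = loglik + logprior
    unnorm = np.exp(log_unnorm - log_unnorm.max())  # rescaled for numerical stability
    posterior = unnorm / (unnorm.sum() * dtheta)    # grid approximation of pi(theta | D_n), as in (1)

    post_mean = (theta * posterior).sum() * dtheta
    print(post_mean, x.mean())                      # posterior mean, slightly shrunk towards the prior mean 0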
There are many arguments which make such an approach compelling. When defining
a probability measure on the parameter space Θ, the Bayesian approach endows notions
such as the probability that θ belongs to a specific region with a proper meaning and those
are particularly relevant when designing measures of uncertainty like confidence regions or
when testing hypotheses. Furthermore, the posterior distribution (1) can be interpreted as the
actualisation of the knowledge (uncertainty) on the parameter after observing the data. At this
early stage, we stress that the Bayesian perspective does not state that the model within which
it operates is the “truth”, no more that it believes that the corresponding prior distribution π
2 Resorting to an abuse of notation, we will also call ℓ(θ|Dn ) our statistical model, even though the distribution
with the density fθ is, strictly speaking, the true statistical model.
it requires has a connection with the “true” production of parameters (since there may even
be no parameter at all). It simply provides an inferential machine that has strong optimality
properties under the right model and that can similarly be evaluated under any other well-
defined alternative model. Furthermore, the Bayesian approach includes techniques to check
prior beliefs as well as statistical models (Gelman 2008), so there seems to be little reason for
not using a given model at an earlier stage even when dismissing it as “un-true” later (always
in favour of another model).
Under the quadratic loss, the Bayes estimator is the posterior expectation E^π[θ|Dn ] for a
given sample Dn . For instance, observing a frequency 38/58 of survivals among 58
breast-cancer patients and assuming a binomial B(58, θ) model with a uniform U(0, 1) prior on θ
leads to the Bayes estimate
θ̂ = ∫_0^1 θ \binom{58}{38} θ^{38} (1 − θ)^{20} dθ / ∫_0^1 \binom{58}{38} θ^{38} (1 − θ)^{20} dθ = (38 + 1)/(58 + 2) .
Another common summary is the maximum a posteriori (MAP) estimator, defined as the posterior mode,

θ̂^{MAP} = arg max_θ π(θ|Dn ) = arg max_θ ℓ(θ|Dn ) π(θ) ,        (3)

where the function to maximize is usually provided in closed form. However, numerical
problems often make the optimization involved in finding the MAP far from trivial. Note
also here the similarity of (3) with the maximum likelihood estimator (MLE): the influence
of the prior distribution π(θ) progressively disappears with the number of observations, and
the MAP estimator recovers the asymptotic properties of the MLE. See Schervish (1995) for
more details on the asymptotics of Bayesian estimators.
3 Hence the concept, introduced above, of a complete inferential machine.
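For this example, the computation can be checked in a few lines of Python, relying only on the figures quoted above: with a uniform prior, the posterior on θ is a Beta(39, 21) distribution, whose mean recovers (38 + 1)/(58 + 2) and whose mode coincides with the maximum likelihood estimate 38/58.

    from scipy import stats

    n, x = 58, 38                      # 38 survivals among 58 breast-cancer patients
    a, b = x + 1, n - x + 1            # uniform (Beta(1, 1)) prior updated to a Beta(39, 21) posterior
    posterior = stats.beta(a, b)

    print(posterior.mean())            # (38 + 1) / (58 + 2) = 0.65, the Bayes estimate under quadratic loss
    print((a - 1) / (a + b - 2))       # posterior mode (MAP), here equal to the MLE 38/58
    print(posterior.interval(0.95))    # an equal-tail 95% credible interval for the survival rate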
                              survived
age         malignant     yes       no
under 50    no             77       10
under 50    yes            51       13
50–69       no             51       11
50–69       yes            38       20
above 70    no              7        3
above 70    yes             6        3

Figure 1 (left) Data describing the survival rates of some breast-cancer patients (Bishop et al. 1975)
and (right) representation of two gamma posterior distributions differentiating between malignant
(dashes) versus non-malignant (full) breast cancer survival rates.
π(θi |D3 ) ∝ θi^{x1i + x2i + x3i} exp{−θi (2 + N1i + N2i + N3i )} ,

i.e. a Gamma Γ(x1i + x2i + x3i + 1, 2 + N1i + N2i + N3i ) distribution. The choice of the
(prior) exponential parameter corresponds to a prior estimate of 50% survival probability
over the period.4 In the case of the non-malignant breast cancers, the parameters of the
4 Once again, this is an academic example. The prior survival probability would need to be assessed by a physician.
(posterior) Gamma distribution are a = 136 and b = 161, while, for the malignant cancers,
they are a = 96 and b = 133. Figure 1 shows the difference between both posteriors, the non-
malignant case being stochastically closer to 1, hence indicating a higher survival rate. (Note
that the posterior in this figure gives some weight to values of θ larger than 1. This drawback
can easily be fixed by truncating the exponential prior at 1.)
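The comparison displayed in Figure 1 (right) can be reproduced numerically from the posterior parameters quoted above; the Python sketch below (Monte Carlo sample size is an arbitrary choice) also estimates the posterior probability that the non-malignant rate exceeds the malignant one.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    non_malignant = stats.gamma(a=136, scale=1 / 161)   # Gamma(136, 161) posterior
    malignant = stats.gamma(a=96, scale=1 / 133)        # Gamma(96, 133) posterior
    print(non_malignant.mean(), malignant.mean())       # posterior means, about 0.84 versus 0.72

    draws_nm = non_malignant.rvs(100_000, random_state=rng)
    draws_m = malignant.rvs(100_000, random_state=rng)
    print(np.mean(draws_nm > draws_m))                   # P(theta_non-malignant > theta_malignant | data), roughly 0.9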
For exponential families, where the sampling density can be written as fθ (y) = h(y) exp{θ · R(y) − Ψ(θ)},5
conjugate priors are available as parametric families of distributions π(θ|ξ, λ),
which are parameterized by two quantities, λ > 0 and ξ, λξ being of the same nature as R(y).
These parameterized prior distributions on θ are appealing for the simple computational
reason that the posterior distributions are exactly of the same form as the prior distributions;
that is, they can be written as
π(θ|ξ ′ (Dn ), λ′ (Dn )) , (4)
where (ξ ′ (Dn ), λ′ (Dn )) is defined in terms of the sample of observations Dn (Robert 2007,
Section 3.3.3). Equation (4) simply says that the conjugate prior is such that the prior and
posterior densities belong to the same parametric family of densities but with different
parameters. In this conjugate setting, it is the parameters of the posterior density themselves
that are “updated”, based on the observations, relative to the prior parameters, instead of
changing the whole shape of the distribution. To avoid confusion, the parameters involved in
5 This covers most of the standard statistical distributions, see Lehmann and Casella (1998) or Robert (2007).
the prior distribution on the model parameter are usually called hyperparameters. (They can
themselves be associated with prior distributions, then called hyperpriors.)
The computation of estimators, of confidence regions or of other types of summaries of
interest on the conjugate posterior distribution often becomes straightforward.
As a first illustration, note that a conjugate family of priors for the Poisson model is the
collection of gamma distributions Γ(a, b), since, for a single observation X = x,

π(θ|x) ∝ e^{−θ} θ^x × θ^{a−1} e^{−bθ} ∝ θ^{x+a−1} e^{−(b+1)θ}

leads to the posterior distribution of θ given X = x being the gamma distribution Γ(a +
x, b + 1). (Note that this includes the exponential distribution Exp(2) used on the dataset of
Figure 1. The Bayesian estimator of the average survival rate, associated with the quadratic
loss, is then given by θ̂ = (1 + x1 + x2 + x3 )/(2 + N1 + N2 + N3 ), the posterior mean.)
As a further illustration, consider the case of the normal distribution N (µ, 1), which is
indeed another case of an exponential family, with θ = µ, R(x) = x, and Ψ(µ) = µ2 /2. The
corresponding conjugate prior for the normal mean µ is thus normal,
N (λ^{−1}ξ, λ^{−1}) .
This means that, when choosing a conjugate prior in a normal setting, one has to select both
a mean and a variance a priori. (In some sense, this is the advantage of using a conjugate
prior, namely that one has to select only a few parameters to determine the prior distribution.
Conversely, the drawback of conjugate priors is that the information known a priori on µ
either may be insufficient to determine both parameters or may be incompatible with the
structure imposed by conjugacy.) Once ξ and λ are selected, the posterior distribution on µ
for a single observation x is determined by Bayes’ theorem,
λ−1 1
x+ λ−1 ξ , (5)
1 + λ−1 1 + λ−1
that is, a weighted average of the observation x and the prior mean λ−1 ξ. The smaller λ is,
the closer the posterior mean is to x. The general case of an iid sample Dn = (x1 , . . . , xn )
from the normal distribution N (µ, 1) is processed in exactly the same manner, since x̄n is a
sufficient statistic with normal distribution N (µ, 1/n): the 1’s in (5) are then replaced with
n−1 ’s.
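The weighted-average structure of (5) is easily checked numerically; the sketch below uses arbitrary values of ξ and λ and a simulated N (µ, 1) sample (both our own choices), and verifies that the two ways of writing the posterior mean agree.

    import numpy as np

    rng = np.random.default_rng(2)

    xi, lam = 2.0, 4.0                       # hyperparameters of the N(xi / lam, 1 / lam) prior (arbitrary choices)
    n = 25
    x = rng.normal(1.0, 1.0, size=n)         # iid N(mu, 1) sample with true mu = 1
    xbar = x.mean()

    # posterior mean of mu, first in the weighted-average form of (5) with the 1's replaced by 1/n
    weighted = (xbar / lam + xi / (n * lam)) / (1.0 / n + 1.0 / lam)
    direct = (xi + n * xbar) / (lam + n)     # same quantity after simplification
    post_var = 1.0 / (lam + n)               # posterior variance of mu

    print(weighted, direct, post_var)        # the two expressions of the posterior mean agree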
The general case of an iid sample Dn = (x1 , . . . , xn ) from the normal distribution
N (µ, σ 2 ) with an unknown θ = (µ, σ 2 ) also allows for a conjugate processing. The normal
distribution does indeed remain an exponential family when both parameters are unknown.
It is of the form
(σ²)^{−λσ−3/2} exp{−[λµ (µ − ξ)² + α]/2σ²} ,
since

π((µ, σ²)|Dn ) ∝ (σ²)^{−λσ−3/2} exp{−[λµ (µ − ξ)² + α]/2σ²}
                 × (σ²)^{−n/2} exp{−[n(µ − x̄)² + s²x ]/2σ²}        (6)
               ∝ (σ²)^{−λσ(Dn)−3/2} exp{−[λµ (Dn )(µ − ξ(Dn ))² + α(Dn )]/2σ²} ,

where s²x = Σ_{i=1}^{n} (xi − x̄)². Therefore, the conjugate prior on θ is the product of an inverse
gamma distribution on σ 2 , I G (λσ , α/2), and, conditionally on σ 2 , a normal distribution on
µ, N (ξ, σ 2 /λµ ).
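In this two-parameter case, the conjugate update reduces to transforming the hyperparameters (ξ, λµ , λσ , α) into (ξ(Dn ), λµ (Dn ), λσ (Dn ), α(Dn )). The Python sketch below implements this update for a simulated sample, matching (6) and the closed-form updates reused in Section 4.1; the prior values are arbitrary illustration choices.

    import numpy as np

    rng = np.random.default_rng(3)

    xi, lam_mu, lam_sigma, alpha = 0.0, 1.0, 2.0, 1.0   # prior hyperparameters (arbitrary illustration values)

    x = rng.normal(1.0, 2.0, size=30)
    n, xbar = x.size, x.mean()
    s2 = np.sum((x - xbar) ** 2)                        # s_x^2, the sum of squared deviations

    # conjugate update of the hyperparameters, matching (6)
    xi_n = (lam_mu * xi + n * xbar) / (lam_mu + n)
    lam_mu_n = lam_mu + n
    lam_sigma_n = lam_sigma + n / 2
    alpha_n = alpha + s2 + (n * lam_mu / (lam_mu + n)) * (xbar - xi) ** 2

    # posterior: sigma^2 ~ IG(lam_sigma_n, alpha_n / 2) and mu | sigma^2 ~ N(xi_n, sigma^2 / lam_mu_n)
    print(xi_n, lam_mu_n, lam_sigma_n, alpha_n)
    print((alpha_n / 2) / (lam_sigma_n - 1))            # posterior mean of sigma^2 (finite since lam_sigma_n > 1)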
The apparent simplicity of conjugate priors is however not a reason that makes them
altogether appealing, since there is no further (strong) justification to their use. One of the
difficulties with such families of priors is the influence of the hyperparameter (ξ, λ). If the
prior information is not rich enough to justify a specific value of (ξ, λ), arbitrarily fixing
(ξ, λ) = (ξ0 , λ0 ) is problematic, since it does not take into account the prior uncertainty on
(ξ0 , λ0 ) itself. To improve on this aspect of conjugate priors, a more amenable solution is to
consider a hierarchical prior, i.e. to assume that γ = (ξ, λ) itself is random and to consider
a probability distribution with density q on γ, leading to
θ|γ ∼ π(θ|γ)
γ ∼ q(γ) .
As a general principle, q may also depend on some further hyperparameters η. Higher order
levels in the hierarchy are thus possible, even though the influence of the hyper(-hyper-
)parameter η on the posterior distribution of θ is usually smaller than that of γ. But multiple
levels are nonetheless useful in complex populations such as those found in animal breeding
(Sørensen and Gianola 2002).
Instead of using conjugate priors, even when mixed with hyperpriors, one can opt for
the so-called noninformative (or vague) priors (Robert et al. 2009) in order to attenuate the
impact of the prior choice on the resulting inference. These priors are defined as refinements of the uniform
distribution, which rigorously does not exist on unbounded spaces. A peculiarity of those
vague priors is indeed that their density usually fails to integrate to one since they have
infinite mass, i.e.

∫_Θ π(θ) dθ = +∞,
and they are defined instead as positive measures, the first and foremost example being
the Lebesgue measure on Rp . While this sounds like an invalid extension of the standard
probabilistic framework—leading to their denomination of improper priors—, it is quite
correct to define the corresponding posterior distributions by (1), provided the integral in
the denominator is defined, i.e.
∫_Θ π(θ) ℓ(θ|Dn ) dθ = m(Dn ) < ∞ .
In some cases, this difficulty disappears when the sample size n is large enough. In others like
mixture models (see also Section 3), the impossibility of using a particular improper prior
may remain whatever the sample size is. It is thus strongly advised, when using improper
priors in new settings, to check that the above finiteness condition holds.
The purpose of noninformative priors is to set a prior reference that has very little bearing
on the inference (relative to the information brought by the likelihood function). More
detailed accounts are provided in Robert (2007, Section 1.5) about this possibility of using
σ-finite measures in settings where genuine probability prior distributions are too difficult to
come by or too subjective to be accepted by all.
While a seemingly natural way of constructing noninformative priors would be to fall
back on the uniform (i.e. flat) prior, this solution has many drawbacks, the worst one being
that it is not invariant under a change of parameterisation. To understand this issue, consider
the example of a Binomial model: the observation x is a B(n, p) random variable, with
p ∈ (0, 1) unknown. The uniform prior π(p) = 1 could then sound like the most natural
noninformative choice; however, if, instead of the mean parameterisation by p, one considers
the logistic parameterisation θ = log(p/(1 − p)) then the uniform prior on p is transformed
into the logistic density
π(θ) = e^θ /(1 + e^θ )²
by the Jacobian transform, which is not uniform. There is therefore a lack of invariance under
reparameterisation, which in its turn implies that the choice of the parameterisation associated
with the uniform prior influences the resulting posterior. This is generally considered to
be a drawback. Flat priors are therefore mostly restricted to location models x ∼ p(x − θ),
while scale models
x ∼ p(x/θ)/θ
are associated with the log-transform of a flat prior, that is,
π(θ) = 1/θ .
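The lack of invariance of the flat prior can also be checked by simulation: drawing p uniformly and mapping it to the logistic scale reproduces the density e^θ/(1 + e^θ)² rather than a flat density. A minimal Python sketch (grid and simulation size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(4)

    p = rng.uniform(size=200_000)              # draws from the uniform prior on p
    theta = np.log(p / (1 - p))                # logistic reparameterisation

    edges = np.linspace(-6, 6, 13)
    hist, _ = np.histogram(theta, bins=edges, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    logistic_pdf = np.exp(centres) / (1 + np.exp(centres)) ** 2

    print(np.round(hist, 3))                   # empirical density of theta on the induced scale
    print(np.round(logistic_pdf, 3))           # close to e^theta / (1 + e^theta)^2, clearly not a flat density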
In a more general setting, the (noninformative) prior favoured by most Bayesians is the so-
called Jeffreys' (1939) prior, which is related to Fisher's information I^F(θ) by

π^J(θ) = I^F(θ)^{1/2} .
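For instance, in the binomial model x ∼ B(n, p) discussed above, the Fisher information is I^F(p) = n/{p(1 − p)}, so that Jeffreys' prior is

π^J(p) ∝ p^{−1/2}(1 − p)^{−1/2} ,

that is, a Be(1/2, 1/2) distribution; by construction, this choice is invariant under reparameterisation.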
Figure 2 Two posterior distributions on a normal mean corresponding to the flat prior (plain) and a
conjugate prior (dotted) for a dataset of 90 observations. (Source: Marin and Robert 2007.)
A credible set C(Dn ) at level 1 − α is defined by the coverage constraint

P^π(θ ∈ C(Dn )|Dn ) = 1 − α ,        (7)

where α is either a predetermined level such as 0.05,6 or a value derived from the loss
function (that may depend on the data).
The important difference from a traditional perspective is that the integration here is done
over the parameter space, rather than over the observation space. The quantity 1 − α thus
corresponds to the probability that a random θ belongs to this set C(Dn ), rather than to
the probability that the random set contains the “true” value of θ. Given this drift in the
interpretation of a confidence set (rather called a credible set by Bayesians in order to
stress this major difference with the classical confidence set), the determination of the best7
confidence set turns out to be easier than in the classical sense: It simply corresponds to the
values of θ with the highest posterior values,

C(Dn ) = {θ : π(θ|Dn ) ≥ kα } ,

where kα is determined by the coverage constraint (7). This region is called the highest
posterior density (HPD) region.
When the prior distribution is not conjugate, the posterior distribution is not necessarily so
easily-managed. For instance, if the normal N (µ, 1) distribution is replaced with the Cauchy
6 There is nothing special about 0.05 when compared with, say, 0.87 or 0.12. It is just that the famous 5% level
Figure 3 (left) Posterior distribution of the location parameter µ of a Cauchy sample for a N (0, 10)
prior and corresponding 95% HPD region (Source: Marin and Robert 2007); (right) Representation of
a posterior sample of 10³ values of (θ, σ²) for the normal model, x1 , . . . , x10 ∼ N (θ, σ²) with x̄ = 0,
s2 = 1 and n = 10, under Jeffreys’ prior, along with the pointwise approximation to the 10% HPD
region (in darker hues) (Source: Robert and Wraith 2009).
distribution, there is no conjugate prior available and we can consider a normal prior on µ, say N (0, 10).
The posterior distribution is then proportional to

π̃(µ|Dn ) = exp(−µ²/20) ∏_{i=1}^{n} [1 + (xi − µ)²]^{−1} .
Solving π̃(µ|Dn ) = k is not possible analytically, only numerically, and the derivation of the
bound kα requires some amount of trial-and-error in order to obtain the correct coverage.
Figure 3 gives the posterior distribution of µ for the observations x1 = −4.3 and x2 = 3.2.
For a given value of k, a trapezoidal approximation can be used to compute the approximate
coverage of the HPD region. For α = 0.95, a trial-and-error exploration of a range of values
of k then leads to an approximation of kα = 0.0415 and the corresponding HPD region is
represented in Figure 3 ((left).
As illustrated in the above example, posterior distributions are not necessarily unimodal
and thus the HPD regions may include several disconnected sets. This may sound
counterintuitive from a classical point of view, but it must be interpreted as indicating
indeterminacy, either in the data or in the prior, about the possible values of θ. Note also that
HPD regions are dependent on the choice of the reference measure that defines the volume
(or surface).
The analytic derivation of HPD regions is rarely straightforward but let us stress that, since
the posterior density is most often known only up to a normalising constant, those regions
can be easily derived from posterior simulations. For instance, Figure 3 (right) illustrates
this derivation in the case of a normal N (θ, σ 2 ) model with both parameters unknown
and Jeffreys’ prior, when the sufficient statistics are x = 0 and s2 = 1, based on n = 10
observations.
3 Testing Hypotheses
Deciding about the validity of some restrictions on the parameter θ or on the validity of
a whole model—like whether or not the normal distribution is appropriate for the data at
hand—is a major and maybe the most important component of statistical inference. Because
the outcome of the decision process is clearcut, accept (coded by 1) or reject (coded by 0),
the construction and the evaluation of procedures in this setup are quite crucial. While the
Bayesian solution is formally very close to a likelihood ratio statistic, its numerical values
and hence its conclusions often strongly differ from the classical solutions.
3.1 Decisions
Without loss of generality, and including the setup of model choice, we represent null
hypotheses as restricted parameter spaces, namely θ ∈ Θ0 . For instance, θ > 0 corresponds
to Θ0 = R+ . The evaluation of testing procedures can be formalised via the 0 − 1 loss that
equally penalizes all errors: If we consider the test of H0 : θ ∈ Θ0 versus H1 : θ ∉ Θ0 ,
and denote by d ∈ {0, 1} the decision made by the researcher and by δ the corresponding
decision procedure, the loss
L(θ, d) = 1 − d if θ ∈ Θ0 , and d otherwise,
is associated with the Bayes decision (estimator)
δ^π(x) = 1 if P^π(θ ∈ Θ0 |x) > P^π(θ ∉ Θ0 |x), and δ^π(x) = 0 otherwise.
This estimator is easily justified on an intuitive basis since it chooses the hypothesis with the
largest posterior probability. The Bayesian testing procedure is therefore a direct transform
of the posterior probability of the null hypothesis.
Another standard summary of the evidence is the Bayes factor

B^π_{10} = [P^π(θ ∈ Θ1 |x)/P^π(θ ∈ Θ0 |x)] / [P^π(θ ∈ Θ1 )/P^π(θ ∈ Θ0 )] ,
which corresponds to the classical odds or likelihood ratio, the difference being that the
parameters are integrated rather than maximized under each model. While it is a simple one-
to-one transform of the posterior probability, it can be used for Bayesian testing without
resorting to a specific loss, evaluating the strength of the evidence in favour of or against H0
by the distance of log10 (B^π_{10}) from zero (Jeffreys 1939). This somehow ad-hoc perspective
provides a reference for hypothesis assessment with no need to define the prior probabilities
of H0 and H1 , which is one of the advantages of using the Bayes factor. In general, the Bayes
factor does depend on prior information, but it can be perceived as a Bayesian likelihood
ratio since, if π0 and π1 are the prior distributions under H0 and H1 , respectively, B^π_{10} can
be written as

B^π_{10} = ∫_{Θ1} fθ (x) π1 (θ) dθ / ∫_{Θ0} fθ (x) π0 (θ) dθ = m1 (x)/m0 (x) ,
thus replacing the likelihoods with the marginals under both hypotheses. Thus, by integrating
out the parameters within each hypothesis, the uncertainty on each parameter is taken into
account, which induces a natural penalisation for larger models, as intuited by Jeffreys
(1939). The Bayes factor is connected with the Bayesian information criterion (BIC, see
Robert 2007, Chapter 5), with a penalty term of the form d log n/2, which makes explicit the
penalisation induced by Bayes factors in regular parametric models. In wide generality,
the Bayes factor asymptotically corresponds to a likelihood ratio with a penalty of the form
d∗ log n∗ /2, where d∗ and n∗ can be viewed as the effective dimension of the model and the effective
number of observations, respectively (see Berger et al. 2003; Chambaz and Rousseau 2008).
The Bayes factor therefore offers the major advantage that it does not require computing a
complexity measure (or penalty term)—in other words, defining what d∗ and n∗ are—,
which is often quite complicated and may depend on the true distribution.
For the posterior probabilities of both hypotheses to be well-defined, positive prior mass must be
allocated to both Θ0 and Θ1 , whatever the measures of Θ0 and Θ1 for the original prior, which means that the prior
must be decomposed as

π(θ) = P^π(θ ∈ Θ0 ) × π0 (θ) + P^π(θ ∈ Θ1 ) × π1 (θ) ,

with π0 and π1 the prior distributions restricted to Θ0 and Θ1 , respectively. If the null hypothesis H0 is tested
and accepted, this means that, in most situations, the (reduced) model under H0 will be
used rather than the (full) model considered before. Thus, a prior distribution under the
reduced model must be available for potential later inference. (Formally, the fact that this
later inference depends on the selection of H0 should also be taken into account.)
In the special case Θ0 = {θ0 }, π0 is the Dirac mass at θ0 , which simply means that
P^{π0}(θ = θ0 ) = 1, and we need to introduce a separate prior weight of H0 , namely

ρ = P^π(θ = θ0 )   and   π(θ) = ρ π0 (θ) + (1 − ρ) π1 (θ) .

Then,

π(Θ0 |x) = fθ0 (x) ρ / ∫ fθ (x) π(θ) dθ = fθ0 (x) ρ / [fθ0 (x) ρ + (1 − ρ) m1 (x)] .
In the case when x ∼ N (µ, σ 2 ) and µ ∼ N (ξ, τ 2 ), consider the test of H0 : µ = 0. We
can choose ξ equal to 0 if we do not have additional prior information. Then the Bayes factor
is the ratio of the marginals under both hypotheses, µ = 0 and µ ≠ 0,

B^π_{10} = m1 (x)/f0 (x) = [σ/√(σ² + τ²)] · e^{−x²/[2(σ²+τ²)]} / e^{−x²/2σ²}
and " r #−1
1−ρ σ2 τ 2 x2
π(µ = 0|x) = 1 + exp
ρ σ + τ2
2 2σ (σ 2 + τ 2 )
2
is the posterior probability of H0 . Table 1 gives an indication of the values of the posterior
probability when the normalized quantity x/σ varies. This posterior probability again
depends on the choice of the prior variance τ 2 : The dependence is actually quite severe,
as shown below with the Jeffreys–Lindley paradox.
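The formulas above are straightforward to evaluate; the Python sketch below takes σ = 1 and ρ = 1/2 (our choices) and illustrates the kind of behaviour reported in Table 1 and Figure 4, including the strong dependence on τ that leads to the Jeffreys–Lindley paradox.

    import numpy as np

    def bayes_factor_10(x, sigma=1.0, tau=1.0):
        """B_10 = m_1(x) / f_0(x) for H0: mu = 0 against a N(0, tau^2) prior on mu under H1."""
        return sigma / np.sqrt(sigma ** 2 + tau ** 2) * np.exp(
            x ** 2 * tau ** 2 / (2 * sigma ** 2 * (sigma ** 2 + tau ** 2)))

    def posterior_prob_null(x, sigma=1.0, tau=1.0, rho=0.5):
        """pi(mu = 0 | x) with prior weight rho on the null hypothesis."""
        return 1.0 / (1.0 + (1 - rho) / rho * bayes_factor_10(x, sigma, tau))

    for z in (0.0, 1.0, 1.68, 1.96, 2.58):                 # values of x / sigma
        print(z, [round(posterior_prob_null(z, tau=t), 3) for t in (1.0, 10.0, 100.0)])
    # as tau grows large, pi(mu = 0 | x) tends to 1 whatever x is:
    # this is the Jeffreys-Lindley paradox discussed below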
Figure 4 Range of the Bayes factor B^π_{10} when τ goes from 10^{−4} to 10. (Note: The x-axis is in
logarithmic scale.) (Source: Marin and Robert 2007.)
If a flat (improper) prior is used instead on µ under the alternative hypothesis, with ρ = 1/2
and σ = 1, a first consequence of this choice is that the posterior probability of H0 is bounded from
above by

π(µ = 0|x) ≤ 1/(1 + √(2π)) = 0.285 .
Table 2 provides the evolution of this probability as x goes away from 0. An interesting point
is that the numerical values somehow coincide with the p-values used in classical testing
(Casella and Berger 2001).
For one-sided hypotheses such as H0 : µ ≤ 0, a flat prior can be used directly on µ; the posterior
probability of H0 is then π(µ ≤ 0|x) = Φ(−x/σ) and the answer is now exactly the p-value found in classical statistics.
The difficulty in using an improper prior also relates to what is called the Jeffreys–Lindley
paradox, a phenomenon that shows that limiting arguments are not valid in testing settings.
In contrast with estimation settings, the noninformative prior no longer corresponds to the
limit of conjugate inferences. In fact, for a conjugate prior, the posterior probability
π(θ = 0|x) = [1 + (1 − ρ0 )/ρ0 · √(σ²/(σ² + τ²)) · exp{τ²x²/[2σ²(σ² + τ²)]}]^{−1}

converges to 1 when τ goes to +∞, whatever the value of x. Conversely, if improper priors with
arbitrary normalising constants c0 and c1 are used under the two hypotheses, the resulting posterior
probability of H0 is completely determined by the choice of c0 /c1 . This implies, for instance, that the function
[1 + √(2π) exp(x²/2)]^{−1} obtained earlier has no validity whatsoever.
Since improper priors are an essential part of the Bayesian approach, there have been many
proposals to overcome this ban. Most use a device that transforms the prior into a proper
probability distribution by using a portion of the data Dn and then use the other part of the
data to run the test as in a standard situation. The variety of available solutions is due to the
many possibilities of removing the dependence on the choice of the portion of the data used
in the first step. The resulting procedures are called pseudo-Bayes factors. See Robert (2007,
Chapter 5) for more details.
Consider the standard linear regression model

y | β, σ², X ∼ Nn (Xβ, σ² In ) ,        (8)

where X denotes the (n, p) matrix of regressors—upon which the whole analysis is
conditioned—, y the vector of the n observations, and β the vector of the regression
coefficients. (This is a matrix representation of the repeated observation of the components of y,

yi ∼ N ([Xβ]i , σ²) ,

when i varies from 1 to n.) Variable selection in this setup means removing covariates, that
is, columns of X, that are not significantly contributing to the expectation of y given X. In
other words, this is about testing whether or not a null hypothesis like H0 : β1 = 0 holds.
From a Bayesian perspective, a possible noninformative prior distribution on the generic
regression model (8) is the so-called Zellner's (1986) g-prior, where the conditional8 π(β|σ)
prior density corresponds to a normal distribution,

β | σ, X ∼ Np (0, g σ² (X^T X)^{−1}) ,

with a similar g-prior, based on X−1 , on the reduced coefficient vector β−1 under H0 : β1 = 0,
where X−1 denotes the regression matrix missing the column corresponding to the first
regressor, and σ² ∼ π(σ²) = σ^{−2} . Since σ is a nuisance parameter in this case, we may
use the improper prior on σ² as common to all submodels and thus avoid the indeterminacy
in the normalising factor of the prior when computing the Bayes factor
B01 = ∫∫ f (y|β−1 , σ, X) π(β−1 |σ, X−1 ) dβ−1 σ^{−2} dσ / ∫∫ f (y|β, σ, X) π(β|σ, X) dβ σ^{−2} dσ .
Figure 5 reproduces a computer output from Marin and Robert (2007) that illustrates how
this default prior and the corresponding Bayes factors can be used in the same spirit as
significance levels in a standard regression model, each Bayes factor being associated with
the test of the nullity of the corresponding regression coefficient. For instance, only the
intercept and the coefficients of X1 , X2 , X4 , X5 are significant. This output mimics the
standard lm R function outcome in order to show that the level of information provided by
the Bayesian analysis goes beyond the classical output. (We stress that all items in the table of
Figure 5 are obtained via closed-form formulae.) Obviously, this reproduction of a frequentist
output is not the whole purpose of a Bayesian data analysis, quite the opposite: it simply
reflects the ability of a Bayesian analysis to produce automated summaries, just as in the
classical case, but the inferential abilities of the Bayesian approach are considerably wider.
(For instance, testing simultaneously the nullity of β3 , β6 , . . . , β10 is of identical difficulty,
as detailed in Marin and Robert 2007, Chapter 3.)
8 The fact that the prior distribution depends on the matrix of regressors X is not contradictory with the Bayesian
paradigm in that the whole analysis is conditional on X. The potential randomness of the regressors is not accounted
for in this analysis.
Figure 5 R output of a Bayesian regression analysis on a processionary caterpillar dataset with ten
covariates analysed in Marin and Robert (2007). The Bayes factor on each row corresponds to the test
of the nullity of the corresponding regression coefficient.
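For completeness, here is a compact Python sketch of such a Bayes factor computation under a simplified version of the g-prior above: a zero-centred g-prior on all coefficients with g = n and the common improper prior σ^{−2} on the variance (these calibration choices are ours; Marin and Robert 2007 discuss alternatives). Each submodel's marginal likelihood is then available in closed form, so B01 reduces to a ratio of quadratic forms.

    import numpy as np

    def log_marginal(y, X, g):
        """log m(y | X) under beta | sigma^2 ~ N(0, g sigma^2 (X'X)^{-1}) and pi(sigma^2) = sigma^{-2},
        up to constants common to all submodels fitted to the same y."""
        n, k = X.shape
        fitted = X @ np.linalg.solve(X.T @ X, X.T @ y)   # ordinary least squares fit
        quad = y @ y - g / (g + 1) * y @ fitted
        return -0.5 * k * np.log(g + 1) - 0.5 * n * np.log(quad)

    rng = np.random.default_rng(5)
    n, p = 50, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    beta_true = np.array([1.0, 0.0, 2.0, -1.0])          # the first non-intercept covariate is truly inactive
    y = X @ beta_true + rng.normal(size=n)

    g = n
    for j in range(1, p):                                # H0: beta_j = 0, i.e. drop column j of X
        X0 = np.delete(X, j, axis=1)
        log10_B01 = (log_marginal(y, X0, g) - log_marginal(y, X, g)) / np.log(10)
        print(j, round(log10_B01, 2))                    # log10 B01 > 0 supports removing the covariate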
4 Extensions
The above description of inference is only an introduction and is thus not representative of
the wealth of possible applications resulting from a Bayesian modelling. We consider below
two extensions inspired from Marin and Robert (2007).
4.1 Prediction
When considering a sample Dn = (x1 , . . . , xn ) from a given distribution, there can be
a sequential or dynamic structure in the model that implies that future observations are
expected. While more realistic modeling may involve probabilistic dependence between the
xi ’s, we consider here the simpler setup of predictive distributions in iid settings.
If xn+1 is a future observation from the same distribution fθ (·) as the sample Dn , its
predictive distribution given the current sample is defined as
f^π(xn+1 |Dn ) = ∫ f (xn+1 |θ, Dn ) π(θ|Dn ) dθ = ∫ fθ (xn+1 ) π(θ|Dn ) dθ .
The motivation for defining this distribution is that the information available on the
pair (xn+1 , θ) given the data Dn is summarized in the joint posterior distribution
fθ (xn+1 )π(θ|Dn ) and the predictive distribution above is simply the corresponding marginal
on xn+1 . This is nonetheless coherent with the Bayesian approach, which then considers
xn+1 as an extra unknown.
For the normal N (µ, σ²) setup, using a conjugate prior on (µ, σ²) of the form

(σ²)^{−λσ−3/2} exp{−[λµ (µ − ξ)² + α]/2σ²} ,

the corresponding posterior distribution on (µ, σ²) given Dn is

N ( (λµ ξ + n x̄n )/(λµ + n) , σ²/(λµ + n) ) × I G ( λσ + n/2 , [α + s²x + (n λµ /(λµ + n)) (x̄n − ξ)²]/2 ) ,
denoted by
N ξ(Dn ), σ 2 /λµ (Dn ) × I G (λσ (Dn ), α(Dn )/2) ,
and the predictive on xn+1 is derived as

f^π(xn+1 |Dn ) ∝ ∫ (σ²)^{−λσ−2−n/2} exp{−(xn+1 − µ)²/2σ²} exp{−[λµ (Dn )(µ − ξ(Dn ))² + α(Dn )]/2σ²} d(µ, σ²)
             ∝ ∫ (σ²)^{−λσ−n/2−3/2} exp{−[λµ (Dn )(xn+1 − ξ(Dn ))²/(λµ (Dn ) + 1) + α(Dn )]/2σ²} dσ²
             ∝ [λµ (Dn )(xn+1 − ξ(Dn ))²/(λµ (Dn ) + 1) + α(Dn )]^{−(2λσ +n+1)/2} .
Therefore, the predictive of xn+1 given the sample Dn is a Student’s t distribution with
mean ξ(Dn ) and 2λσ + n degrees of freedom. In the special case of the noninformative
prior, λµ = λσ = α = 0 and the predictive is
f^π(xn+1 |Dn ) ∝ [ s²x + (n/(n + 1)) (xn+1 − x̄n )² ]^{−(n+1)/2} .
This is again a Student's t distribution, with mean x̄n , scale sx /√n, and n degrees of
freedom.
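The Student's t form of the predictive can be verified by simulation: drawing (µ, σ²) from the posterior given above and then xn+1 from N (µ, σ²) reproduces the closed-form density. A Python sketch with an arbitrary simulated sample:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)

    data = rng.normal(2.0, 1.5, size=12)          # an arbitrary observed sample D_n
    n, xbar = data.size, data.mean()
    s2 = np.sum((data - xbar) ** 2)               # s_x^2 as a sum of squared deviations

    # posterior under the noninformative prior: sigma^2 ~ IG(n/2, s2/2), mu | sigma^2 ~ N(xbar, sigma^2/n)
    m = 200_000
    sigma2 = stats.invgamma(a=n / 2, scale=s2 / 2).rvs(m, random_state=rng)
    mu = rng.normal(xbar, np.sqrt(sigma2 / n))
    x_new = rng.normal(mu, np.sqrt(sigma2))       # simulated draws from the predictive of x_{n+1}

    # numerical normalisation of the closed-form kernel displayed above
    grid = np.linspace(xbar - 15, xbar + 15, 4001)
    step = grid[1] - grid[0]
    kernel = (s2 + n / (n + 1) * (grid - xbar) ** 2) ** (-(n + 1) / 2)
    kernel /= kernel.sum() * step

    print(np.quantile(x_new, [0.05, 0.5, 0.95]))                          # Monte Carlo predictive quantiles
    print(np.interp([0.05, 0.5, 0.95], np.cumsum(kernel) * step, grid))   # closed-form quantiles, matching up to Monte Carlo error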
4.2 Outliers
Since normal modeling is often an approximation to the “real thing,” there may be doubts
about its adequacy. As already mentioned above, we will deal later with the problem of
checking that the normal distribution is appropriate for the whole dataset. Here, we consider
the somehow simpler problem of assessing whether or not each point in the dataset is
compatible with normality. There are many different ways of dealing with this problem.
We choose here to take advantage of the derivation of the predictive distribution above: If an
observation xi is unlikely under the predictive distribution based on the other observations,
then we can argue against its distribution being equal to the distribution of the other
observations.
For each xi ∈ Dn , we consider fiπ (x|Dni ) as being the predictive distribution based
on Dni = (x1 , . . . , xi−1 , xi+1 , . . . , xn ). Considering fiπ (xi |Dni ) or the corresponding cdf
Fiπ (xi |Dni ) (in dimension one) gives an indication of the level of compatibility of the
observation with the sample. To quantify this level, we can, for instance, approximate the
distribution of Fiπ (xi |Dni ) as uniform over [0, 1] since Fiπ (·|Dni ) converges to the true cdf of
the model. Simultaneously checking all Fiπ (xi |Dni ) over i may signal outliers.
The detection of outliers must pay attention to the Bonferroni fallacy, which is that extreme
values do occur in large enough samples. This means that, as n increases, we will see smaller
and smaller values of Fiπ (xi |Dni ) even if the whole sample is from the same distribution. The
significance level must therefore be chosen in accordance with this observation, for instance
1 − (1 − a)^n = 1 − α ,
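The Python sketch below illustrates this leave-one-out check on a normal sample containing one artificial outlier, reusing the Student's t predictive of Section 4.1 under the noninformative prior; the per-observation screening level used here (a crude α/n correction) is only one possible choice.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = np.append(rng.normal(0.0, 1.0, size=29), 6.0)   # a normal sample plus one artificial outlier
    n = x.size

    def loo_predictive_cdf(i):
        """F_i(x_i | D_n^i): cdf of the leave-one-out Student predictive evaluated at the held-out point."""
        rest = np.delete(x, i)
        m, xbar = rest.size, rest.mean()
        s2 = np.sum((rest - xbar) ** 2)
        scale = np.sqrt((m + 1) * s2) / m               # scale matching the predictive kernel of Section 4.1
        return stats.t(df=m, loc=xbar, scale=scale).cdf(x[i])

    u = np.array([loo_predictive_cdf(i) for i in range(n)])
    # u should look roughly uniform on [0, 1] when every point is compatible with the rest of the sample;
    # values very close to 0 or 1 point at potential outliers (keeping the multiplicity caveat above in mind)
    level = 0.05 / n                                     # a crude per-observation level, one choice among others
    print(np.round(u, 3))
    print(np.where((u < level) | (u > 1 - level))[0])    # indices flagged as suspicious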
When several models are in competition for the same data,

Mi : x ∼ fi (x|θi ) ,   i ∈ I,

where I can be finite or infinite, the usual Bayesian answer is similar to the Bayesian
tests as described above. The most coherent perspective (from our viewpoint) is actually
to envision the tests of hypotheses as particular cases of model choices, rather than trying
to justify the modification of the prior distribution criticised by Gelman (2008). This also
incorporates within model choice the alternative solution of model averaging, proposed
by Madigan and Raftery (1994), which strives to keep all possible models when drawing
inference.
The idea behind Bayesian model choice is to construct an overall probability on the
collection of models ∪i∈I Mi in the following way: the parameter is θ = (i, θi ), that is, the
model index i and, conditional on the model index being equal to i, the parameter θi in model Mi . The
prior measure on the parameter θ is then expressed as

dπ(θ) = Σ_{i∈I} pi dπi (θi ) ,    Σ_{i∈I} pi = 1.
As a consequence, the Bayesian model selector associated with the 0–1 loss function and
the above prior is the model that maximises the posterior probability
π(Mi |x) = pi ∫_{Θi} fi (x|θi ) πi (θi ) dθi / Σ_{j∈I} pj ∫_{Θj} fj (x|θj ) πj (θj ) dθj
across all models. Contrary to classical plug-in likelihoods, the marginal likelihoods involved
in the above ratio are comparable on the same scale and do not require the models to be nested. As
mentioned in Section 3.5, integrating out the parameters θi in each of the models takes their
uncertainty into account, so that the marginal likelihoods ∫_{Θi} fi (x|θi ) πi (θi ) dθi are naturally
penalised likelihoods. In most parametric setups, when the number of parameters does not
grow to infinity with the number of observations and when those parameters are identifiable,
the Bayesian model selector as defined above is consistent, i.e. with increasing numbers of
observations, the probability of choosing the right model goes to 1.
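As an elementary illustration of this machinery (with toy models, and reusing the survival counts of Section 2 as data; both are our own choices), consider choosing between M1 , a binomial model with θ fixed at 1/2, and M2 , the same binomial model with a uniform prior on θ: both marginal likelihoods are available in closed form and the posterior model probabilities follow directly.

    import numpy as np
    from scipy.special import betaln, gammaln

    def log_binom_coeff(n, x):
        return gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)

    n, x = 58, 38                                     # the survival counts used in Section 2

    # M1: binomial with theta fixed at 1/2;  M2: binomial with a uniform (Beta(1, 1)) prior on theta
    log_m1 = log_binom_coeff(n, x) + n * np.log(0.5)
    log_m2 = log_binom_coeff(n, x) + betaln(x + 1, n - x + 1)

    prior = np.array([0.5, 0.5])                      # prior model weights p_1 = p_2 = 1/2
    log_post = np.log(prior) + np.array([log_m1, log_m2])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                                # posterior model probabilities pi(M_i | x)

    print(np.round(post, 3))                          # here the free-theta model M2 receives the larger probability
    print((log_m2 - log_m1) / np.log(10))             # log10 of the Bayes factor B_21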
References
Berger J, Ghosh J and Mukhopadhyay N 2003 Approximations to the Bayes factor in model selection problems and
consistency issues. J. Statist. Plann. Inference 112, 241–258.
Bishop YMM, Fienberg SE and Holland PW 1975 Discrete Multivariate Analysis: Theory and Practice. MIT Press,
Cambridge, MA.
Casella G and Berger R 2001 Statistical Inference second edn. Wadsworth, Belmont, CA.
Chambaz A and Rousseau J 2008 Bounds for Bayesian order identification with application to mixtures. Ann. Statist.
36, 938–962.
Gelman A 2008 Objections to Bayesian statistics. Bayesian Analysis 3(3), 445–450.
Jaynes E 2003 Probability Theory. Cambridge University Press, Cambridge.
Jeffreys H 1939 Theory of Probability first edn. The Clarendon Press, Oxford.
Lehmann E and Casella G 1998 Theory of Point Estimation (revised edition). Springer-Verlag, New York.
MacKay DJC 2002 Information Theory, Inference & Learning Algorithms. Cambridge University Press, Cambridge,
UK.
Madigan D and Raftery A 1994 Model selection and accounting for model uncertainty in graphical models using
Occam’s window. J. American Statist. Assoc. 89, 1535–1546.
Marin JM and Robert C 2007 Bayesian Core. Springer-Verlag, New York.
Robert C 2007 The Bayesian Choice paperback edn. Springer-Verlag, New York.
Robert C and Wraith D 2009 Computational methods for Bayesian model choice. In MaxEnt 2009 Proceedings,
American Institute of Physics. (To appear.)
Robert C, Chopin N and Rousseau J 2009 Theory of Probability revisited (with discussion). Statist. Science. (to
appear).
Schervish M 1995 Theory of Statistics. Springer-Verlag, New York.
Sørensen D and Gianola D 2002 Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer-
Verlag, New York.
Templeton A 2008 Statistical hypothesis testing in intraspecific phylogeography: nested clade phylogeographical
analysis vs. approximate Bayesian computation. Molecular Ecology 18(2), 319–331.
Zellner A 1986 On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In
Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. North-Holland/Elsevier,
pp. 233–243.