


INSTITUT NATIONAL DE LA STATISTIQUE ET DES ETUDES ECONOMIQUES
Série des Documents de Travail du CREST
(Centre de Recherche en Economie et Statistique)

n° 2010-30

Bayesian Inference

J.-M. MARIN1
C. P. ROBERT2
J. ROUSSEAU3

Les documents de travail ne reflètent pas la position de l'INSEE et n'engagent que leurs auteurs.

Working papers do not reflect the position of INSEE but only the views of the authors.

1 Université de Montpellier 2.
2 Université Paris-Dauphine, CEREMADE and CREST-INSEE, Paris.
3 Université Paris-Dauphine, CEREMADE and CREST-INSEE, Paris.
Bayesian Inference
Christian P. Robert 1,3, Jean-Michel Marin 2,3 and Judith Rousseau 1,3
1 Université Paris-Dauphine, 2 Université de Montpellier 2, and 3 CREST, INSEE, Paris

1 Introduction
This chapter provides an overview of Bayesian inference, mostly emphasising that it is a
universal method for summarising uncertainty and making estimates and predictions using
probability statements conditional on observed data and an assumed model (Gelman 2008).
The Bayesian perspective is thus applicable to all aspects of statistical inference, while being
open to the incorporation of information items resulting from earlier experiments and from
expert opinions.
We provide here the basic elements of Bayesian analysis when considered for standard
models, referring to Marin and Robert (2007) and to Robert (2007) for book-length entries.1
In the following, we refrain from embarking upon philosophical discussions about the nature
of knowledge (see, e.g., Robert 2007, Chapter 10), opting instead for a mathematically sound
presentation of an eminently practical statistical methodology. We indeed believe that the
most convincing arguments for adopting a Bayesian version of data analyses are in the
versatility of this tool and in the large range of existing applications, rather than in those
polemical arguments (for such perspectives, see, e.g., Jaynes 2003 and MacKay 2002).

2 The Bayesian argument

2.1 Bases
We start this section with some notations about the statistical model that may appear to be
over-mathematical but are nonetheless essential.
Given an independent and identically distributed (iid) sample Dn = (x1 , . . . , xn ) from a
density fθ , with an unknown parameter θ ∈ Θ, like the mean µ of the benchmark normal

1 The chapter borrows heavily from Chapter 2 of Marin and Robert (2007).

distribution, the associated likelihood function is

ℓ(θ|Dn) = ∏_{i=1}^{n} fθ(xi) .

This quantity is a fundamental entity for the analysis of the information provided about the
parameter θ by the sample Dn , and Bayesian analysis relies on this function to draw inference
on θ.2 Since all models are approximations of reality, the choice of a sampling model is wide-
open for criticisms (see, e.g., Templeton 2008), but those criticisms go far beyond Bayesian
modelling and question the relevance of completely built models for drawing inference or
running predictions. We will therefore address the issue of model assessment later in the
chapter.
The major input of the Bayesian perspective, when compared with a standard likelihood
approach, is that it modifies the likelihood—which is a simple function of θ—into a posterior
distribution on the parameter θ—which is a probability distribution on Θ defined by
π(θ|Dn) = ℓ(θ|Dn) π(θ) / ∫ ℓ(θ|Dn) π(θ) dθ .    (1)
The factor π(θ) in (1) is called the prior (often omitting the qualifier “density”) and it
necessarily has to be determined to start the analysis. A primary motivation for introducing
this extra-factor is that the prior distribution summarizes the prior information on θ; that is,
the knowledge that is available on θ prior to the observation of the sample Dn . However, the
choice of π(θ) is often decided on practical or computational grounds rather than on strong
subjective beliefs or on overwhelming prior information. As will be discussed later, there
also exist less subjective choices, made of families of so-called noninformative priors.
The radical idea behind Bayesian modelling is thus that the uncertainty on the unknown
parameter θ is more efficiently modelled as randomness and consequently that the probability
distribution π is needed on Θ as a reference measure. In particular, the distribution Pθ of the
sample Dn then takes the meaning of a probability distribution on Dn that is conditional
on [the event that the parameter takes] the value θ, i.e. fθ is the conditional density of x
given θ. The above likelihood offers the dual interpretation of the probability density of Dn
conditional on the parameter θ, with the additional indication that the observations in Dn are
independent given θ. The numerator of (1) is therefore the joint density on the pair (Dn , θ)
and the (standard probability calculus) Bayes theorem provides the conditional (or posterior)
distribution of the parameter θ given the sample Dn as (1), the denominator being called the
marginal (likelihood) m(Dn ).
There are many arguments which make such an approach compelling. When defining
a probability measure on the parameter space Θ, the Bayesian approach endows notions
such as the probability that θ belongs to a specific region with a proper meaning and those
are particularly relevant when designing measures of uncertainty like confidence regions or
when testing hypotheses. Furthermore, the posterior distribution (1) can be interpreted as the
actualisation of the knowledge (uncertainty) on the parameter after observing the data. At this
early stage, we stress that the Bayesian perspective does not state that the model within which
it operates is the “truth”, any more than it believes that the corresponding prior distribution π
2 Resorting to an abuse of notation, we will also call ℓ(θ|Dn) our statistical model, even though the distribution with the density fθ is, strictly speaking, the true statistical model.

it requires has a connection with the “true” production of parameters (since there may even
be no parameter at all). It simply provides an inferential machine that has strong optimality
properties under the right model and that can similarly be evaluated under any other well-
defined alternative model. Furthermore, the Bayesian approach includes techniques to check
prior beliefs as well as statistical models (Gelman 2008), so there seems to be little reason for
not using a given model at an earlier stage even when dismissing it as “un-true” later (always
in favour of another model).

2.2 Bayesian analysis in action


The operating concept that is at the core of Bayesian analysis is that one should provide an
inferential assessment conditional on the realized value of Dn , and Bayesian analysis gives a
proper probabilistic meaning to this conditioning by allocating to θ a (reference) probability
(prior) distribution π. Once the prior distribution is selected, Bayesian inference formally
is “over”; that is, it is completely determined since the estimation, testing, prediction,
evaluation, and any other inferential procedures are automatically provided by the prior and
the associated loss (or penalty) function.3 For instance, if estimates θ̂ of θ are evaluated via
the quadratic loss function

L(θ, θ̂) = ‖θ − θ̂‖² ,

the corresponding Bayes procedure is the expected value of θ under the posterior distribution,

θ̂ = ∫ θ π(θ|Dn) dθ = ∫ θ ℓ(θ|Dn) π(θ) dθ / m(Dn) ,    (2)

for a given sample Dn . For instance, observing a frequency 38/58 of survivals among 58
breast-cancer patients and assuming a binomial B(58, θ) with a uniform U(0, 1) prior on θ
leads to the Bayes estimate
θ̂ = ∫_0^1 θ (58 choose 38) θ^38 (1 − θ)^20 dθ / ∫_0^1 (58 choose 38) θ^38 (1 − θ)^20 dθ = (38 + 1)/(58 + 2) ,

since the posterior distribution is then a beta Be(38 + 1, 20 + 1) distribution.
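
A minimal R sketch of this computation (the posterior quantities follow from the numbers quoted above; the credible interval is an illustrative addition):

## Binomial B(58, theta) likelihood with 38 survivors and a uniform U(0, 1) prior:
## the posterior is Be(39, 21) and the Bayes estimate under quadratic loss is its mean.
x <- 38; n <- 58
a <- x + 1; b <- n - x + 1                 # Be(39, 21) posterior
post_mean <- a / (a + b)                   # (38 + 1) / (58 + 2) = 0.65
cred_95   <- qbeta(c(0.025, 0.975), a, b)  # equal-tail 95% credible interval
round(c(post_mean, cred_95), 3)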


When no specific loss function is available, the estimator (2) is often used as a default
estimator, although alternatives also are available. For instance, the maximum a posteriori
estimator (MAP) is defined as

θ̂ = arg max_θ π(θ|Dn) = arg max_θ π(θ) ℓ(θ|Dn) ,    (3)

where the function to maximize is usually provided in closed form. However, numerical
problems often make the optimization involved in finding the MAP far from trivial. Note
also here the similarity of (3) with the maximum likelihood estimator (MLE): The influence
of the prior distribution π(θ) progressively disappears with the number of observations, and
the MAP estimator recovers the asymptotic properties of the MLE. See Schervish (1995) for
more details on the asymptotics of Bayesian estimators.
3 Hence the concept, introduced above, of a complete inferential machine.
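
As a small illustration of (3), the R sketch below maximises the unnormalised log-posterior numerically for the binomial example above (flat prior, so the answer is also available in closed form):

## Numerical MAP: maximise log prior + log likelihood; the flat prior only adds a constant.
log_post <- function(theta) 38 * log(theta) + 20 * log(1 - theta)
map <- optimize(log_post, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
c(numerical = map, closed_form = 38 / 58)   # both approximately 0.655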

                            survival
  age        malignant     yes    no
  under 50      no          77    10
                yes         51    13
  50-69         no          51    11
                yes         38    20
  above 70      no           7     3
                yes          6     3
Figure 1 (left) Data describing the survival rates of some breast-cancer patients (Bishop et al. 1975)
and (right) representation of two gamma posterior distributions differentiating between malignant
(dashes) versus non-malignant (full) breast cancer survival rates.

As an academic example, consider the contingency table provided in Figure 1 on


survival rate for breast-cancer patients with or without malignant tumours, extracted from
Bishop et al. (1975), the goal being to distinguish between the two types of tumour in terms
of survival probability. We then consider each entry of the table on the number of survivors
(first column of figures in Figure 1) to be independently Poisson distributed P(Nit θi),
where t = 1, 2, 3 denotes the age group, i = 1, 2 the tumour group, distinguishing between
malignant (i = 1) and non-malignant (i = 2), and Nit is the total number of patients in this
age group and for this type of tumour. Therefore, denoting by xit the number of survivors in
age group t and tumour group i, the corresponding density is

fθi(xit|Nit) = e^{−θi Nit} (θi Nit)^{xit} / xit! ,    xit ∈ ℕ.
The corresponding likelihood on θi (i = 1, 2) is thus

L(θi|D3) = ∏_{t=1}^{3} (θi Nit)^{xit} exp{−θi Nit} ,

which, under a θi ∼ Exp(2) prior, leads to the posterior

π(θi|D3) ∝ θi^{xi1+xi2+xi3} exp{−θi (2 + Ni1 + Ni2 + Ni3)} ,

i.e. a Gamma Γ(xi1 + xi2 + xi3 + 1, 2 + Ni1 + Ni2 + Ni3) distribution. The choice of the
(prior) exponential parameter corresponds to a prior estimate of 50% survival probability
over the period.4 In the case of the non-malignant breast cancers, the parameters of the
4 Once again, this is an academic example. The prior survival probability would need to be assessed by a physician in a real life situation.



(posterior) Gamma distribution are a = 136 and b = 161, while, for the malignant cancers,
they are a = 96 and b = 133. Figure 1 shows the difference between both posteriors, the non-
malignant case being stochastically closer to 1, hence indicating a higher survival rate. (Note
that the posterior in this figure gives some weight to values of θ larger than 1. This drawback
can easily be fixed by truncating the exponential prior at 1.)
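
The posterior comparison can be reproduced in a few lines of R; the Monte Carlo probability in the last line is an illustrative addition rather than a quantity reported above:

## Gamma posteriors for the survival rates of Figure 1: Gamma(136, 161) for the
## non-malignant group and Gamma(96, 133) for the malignant group, under Exp(2) priors.
x_non <- c(77, 51, 7);  N_non <- c(87, 62, 10)    # survivors and patient totals
x_mal <- c(51, 38, 6);  N_mal <- c(64, 58, 9)
a_non <- sum(x_non) + 1; b_non <- sum(N_non) + 2  # a = 136, b = 161
a_mal <- sum(x_mal) + 1; b_mal <- sum(N_mal) + 2  # a = 96,  b = 133
c(mean_non = a_non / b_non, mean_mal = a_mal / b_mal)   # posterior means, about 0.84 vs 0.72
set.seed(1)
mean(rgamma(1e5, a_mal, b_mal) < rgamma(1e5, a_non, b_non))  # P(theta_mal < theta_non | data)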

2.3 Prior distributions


The selection of the prior distribution is an important issue in Bayesian modelling. When
prior information is available about the data or the model, it can be used in building the
prior, and we will see some illustrations of this recommendation in the following chapters.
In many situations, however, the selection of the prior distribution is quite delicate in the
absence of reliable prior information, and generic solutions must be chosen instead. Since
the choice of the prior distribution has a considerable influence on the resulting inference,
this choice must be conducted with the utmost care. It is indeed straightforward to come up
with examples where a particular choice of the prior leads to absurd decisions. Hence, for a
Bayesian analysis to be sound the prior distribution needs to be well-justified. Before entering
into a brief description of some existing approaches to constructing prior distributions, note
that, as part of model checking, every Bayesian analysis needs to assess the influence of the
choice of the prior, for instance through a sensitivity analysis. Since the prior distribution
models the knowledge (or uncertainty) prior to the observation of the data, the sparser the
prior information is, the flatter the prior should be. There actually exists a category of priors
whose primary aim is to minimize the impact of the prior selection on the inference: They
are called noninformative priors and we will detail them below.
When the sample model is from an exponential family of distributions5 with densities of
the form
fθ (x) = h(x) exp {θ · R(x) − Ψ(θ)} , θ, R(x) ∈ Rp ,
where θ · R(x) denotes the canonical scalar product in Rp , there exists an associated class of
priors called the class of conjugate priors, of the form

π(θ|ξ, λ) ∝ exp {θ · ξ − λΨ(θ)} ,

which are parameterized by two quantities, λ > 0 and ξ, ξ/λ being of the same nature as R(x).
These parameterized prior distributions on θ are appealing for the simple computational
reason that the posterior distributions are exactly of the same form as the prior distributions;
that is, they can be written as
π(θ|ξ ′ (Dn ), λ′ (Dn )) , (4)
where (ξ ′ (Dn ), λ′ (Dn )) is defined in terms of the sample of observations Dn (Robert 2007,
Section 3.3.3). Equation (4) simply says that the conjugate prior is such that the prior and
posterior densities belong to the same parametric family of densities but with different
parameters. In this conjugate setting, it is the parameters of the posterior density themselves
that are “updated”, based on the observations, relative to the prior parameters, instead of
changing the whole shape of the distribution. To avoid confusion, the parameters involved in
5 This covers most of the standard statistical distributions, see Lehmann and Casella (1998) or Robert (2007).

the prior distribution on the model parameter are usually called hyperparameters. (They can
themselves be associated with prior distributions, then called hyperpriors.)
The computation of estimators, of confidence regions or of other types of summaries of
interest on the conjugate posterior distribution often becomes straightforward.
As a first illustration, note that a conjugate family of priors for the Poisson model is the
collection of gamma distributions Γ(a, b), since

fθ(x) π(θ|a, b) ∝ θ^{a−1+x} e^{−(b+1)θ}

leads to the posterior distribution of θ given X = x being the gamma distribution Ga(a +
x, b + 1). (Note that this includes the exponential distribution Exp(2) used on the dataset of
Figure 1. The Bayesian estimator of the average survival rate, associated with the quadratic
loss, is then given by θ̂ = (1 + x1 + x2 + x3 )/(2 + N1 + N2 + N3 ), the posterior mean.)
As a further illustration, consider the case of the normal distribution N (µ, 1), which is
indeed another case of an exponential family, with θ = µ, R(x) = x, and Ψ(µ) = µ2 /2. The
corresponding conjugate prior for the normal mean µ is thus normal,

N(λ^{−1}ξ, λ^{−1}) .

This means that, when choosing a conjugate prior in a normal setting, one has to select both
a mean and a variance a priori. (In some sense, this is the advantage of using a conjugate
prior, namely that one has to select only a few parameters to determine the prior distribution.
Conversely, the drawback of conjugate priors is that the information known a priori on µ
either may be insufficient to determine both parameters or may be incompatible with the
structure imposed by conjugacy.) Once ξ and λ are selected, the posterior distribution on µ
for a single observation x is determined by Bayes’ theorem,

π(µ|x) ∝ exp(xµ − µ²/2) exp(ξµ − λµ²/2)
       ∝ exp{−(1 + λ)[µ − (1 + λ)^{−1}(x + ξ)]²/2} ,

i.e. a normal distribution with mean (1 + λ)^{−1}(x + ξ) and variance (1 + λ)^{−1}. An
alternative representation of the posterior mean is

[λ^{−1}/(1 + λ^{−1})] x + [1/(1 + λ^{−1})] λ^{−1}ξ ,    (5)

that is, a weighted average of the observation x and the prior mean λ^{−1}ξ. The smaller λ is,
the closer the posterior mean is to x. The general case of an iid sample Dn = (x1, . . . , xn)
from the normal distribution N(µ, 1) is processed in exactly the same manner, since x̄n is a
sufficient statistic with normal distribution N(µ, 1/n): the 1’s in (5) are then replaced with
n^{−1}’s.
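
A short R sketch of this conjugate update (the numerical values of x̄n, n, ξ and λ are made up for the illustration):

## Conjugate normal update for an iid N(mu, 1) sample: the posterior is
## N((n * xbar + xi) / (n + lambda), 1 / (n + lambda)), a shrinkage of xbar towards xi / lambda.
xbar <- 1.2; n <- 10        # sufficient statistic and sample size (illustrative values)
xi <- 0; lambda <- 2        # conjugate prior mu ~ N(xi / lambda, 1 / lambda)
post_mean <- (n * xbar + xi) / (n + lambda)
post_var  <- 1 / (n + lambda)
c(post_mean, sqrt(post_var))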
The general case of an iid sample Dn = (x1 , . . . , xn ) from the normal distribution
N (µ, σ 2 ) with an unknown θ = (µ, σ 2 ) also allows for a conjugate processing. The normal
distribution does indeed remain an exponential family when both parameters are unknown.
It is of the form

(σ²)^{−λσ−3/2} exp{−(λµ(µ − ξ)² + α)/2σ²}

since

π((µ, σ²)|Dn) ∝ (σ²)^{−λσ−3/2} exp{−(λµ(µ − ξ)² + α)/2σ²}
              × (σ²)^{−n/2} exp{−(n(µ − x̄n)² + s²x)/2σ²}    (6)
              ∝ (σ²)^{−λσ(Dn)−3/2} exp{−(λµ(Dn)(µ − ξ(Dn))² + α(Dn))/2σ²} ,

where s²x = Σ_{i=1}^{n} (xi − x̄n)². Therefore, the conjugate prior on θ is the product of an inverse
gamma distribution on σ², IG(λσ, α/2), and, conditionally on σ², a normal distribution on
µ, N(ξ, σ²/λµ).
The apparent simplicity of conjugate priors is however not a reason that makes them
altogether appealing, since there is no further (strong) justification to their use. One of the
difficulties with such families of priors is the influence of the hyperparameter (ξ, λ). If the
prior information is not rich enough to justify a specific value of (ξ, λ), arbitrarily fixing
(ξ, λ) = (ξ0 , λ0 ) is problematic, since it does not take into account the prior uncertainty on
(ξ0 , λ0 ) itself. To improve on this aspect of conjugate priors, a more amenable solution is to
consider a hierarchical prior, i.e. to assume that γ = (ξ, λ) itself is random and to consider
a probability distribution with density q on γ, leading to

θ|γ ∼ π(θ|γ)
γ ∼ q(γ) ,

as a joint prior on (θ, γ). The above is equivalent to considering, as a prior on θ,

π(θ) = ∫_Γ π(θ|γ) q(γ) dγ .

As a general principle, q may also depend on some further hyperparameters η. Higher order
levels in the hierarchy are thus possible, even though the influence of the hyper(-hyper-
)parameter η on the posterior distribution of θ is usually smaller than that of γ. But multiple
levels are nonetheless useful in complex populations such as those found in animal breeding
(Sørensen and Gianola 2002).
Instead of using conjugate priors, even when mixed with hyperpriors, one can opt for
the so-called noninformative (or vague) priors (Robert et al. 2009) in order to attenuate the
impact on the resulting inference. These priors are defined as refinements of the uniform
distribution, which rigorously does not exist on unbounded spaces. A peculiarity of those
vague priors is indeed that their density usually fails to integrate to one since they have
infinite mass, i.e.

∫_Θ π(θ) dθ = +∞,
and they are defined instead as positive measures, the first and foremost example being
the Lebesgue measure on Rp . While this sounds like an invalid extension of the standard
probabilistic framework—leading to their denomination of improper priors—, it is quite
correct to define the corresponding posterior distributions by (1), provided the integral in
the denominator is defined, i.e.

∫_Θ π(θ) ℓ(θ|Dn) dθ = m(Dn) < ∞ .

In some cases, this difficulty disappears when the sample size n is large enough. In others like
mixture models (see also Section 3), the impossibility of using a particular improper prior
may remain whatever the sample size is. It is thus strongly advised, when using improper
priors in new settings, to check that the above finiteness condition holds.
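
As a trivial numerical illustration of this condition (a sketch only: numerical integration can suggest, not prove, finiteness), take a flat prior π(µ) = 1 and a single N(µ, 1) observation:

## m(x) is the integral of the likelihood against the flat prior; here it is finite (equal to 1).
x_obs <- 1.3
m <- integrate(function(mu) dnorm(x_obs, mean = mu, sd = 1), -Inf, Inf)
m$value   # close to 1, so the posterior (1) is well defined in this case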
The purpose of noninformative priors is to set a prior reference that has very little bearing
on the inference (relative to the information brought by the likelihood function). More
detailed accounts are provided in Robert (2007, Section 1.5) about this possibility of using
σ-finite measures in settings where genuine probability prior distributions are too difficult to
come by or too subjective to be accepted by all.
While a seemingly natural way of constructing noninformative priors would be to fall
back on the uniform (i.e. flat) prior, this solution has many drawbacks, the worst one being
that it is not invariant under a change of parameterisation. To understand this issue, consider
the example of a Binomial model: the observation x is a B(n, p) random variable, with
p ∈ (0, 1) unknown. The uniform prior π(p) = 1 could then sound like the most natural
noninformative choice; however, if, instead of the mean parameterisation by p, one considers
the logistic parameterisation θ = log(p/(1 − p)) then the uniform prior on p is transformed
into the logistic density
π(θ) = e^θ/(1 + e^θ)²
by the Jacobian transform, which is not uniform. There is therefore a lack of invariance under
reparameterisation, which in turn implies that the choice of the parameterisation associated
with the uniform prior influences the resulting posterior. This is generally considered to
be a drawback. Flat priors are therefore mostly restricted to location models x ∼ p(x − θ),
while scale models
x ∼ p(x/θ)/θ
are associated with the log-transform of a flat prior, that is,

π(θ) = 1/θ .
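
The lack of invariance discussed above can be visualised with a short simulation; the R sketch below transforms uniform draws on p and compares their histogram with the logistic density:

## A uniform prior on p does not stay uniform under theta = log(p / (1 - p)).
set.seed(1)
p <- runif(1e5)                       # draws from the flat prior on (0, 1)
theta <- log(p / (1 - p))             # logistic reparameterisation
hist(theta, breaks = 100, freq = FALSE, main = "", xlab = "theta")
curve(exp(x) / (1 + exp(x))^2, add = TRUE, lwd = 2)   # induced density, clearly non-flat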

In a more general setting, the (noninformative) prior favoured by most Bayesians is the so-
called Jeffreys’ (1939) prior, which is related to Fisher’s information I^F(θ) by

π^J(θ) = |I^F(θ)|^{1/2} ,

where |I| denotes the determinant of the matrix I.


Since the mean µ of a normal model N (µ, 1) is a location parameter, the standard
choice of noninformative prior is then π(µ) = 1 (or any other constant). Given that this
flat prior formally corresponds to the choice λ = ξ = 0 in the conjugate prior, it is easy to
verify that this noninformative prior is associated with the posterior distribution N(x, 1).
An interesting consequence of this remark is that the posterior density (as a function
of the parameter θ) is then equal to the likelihood function, which shows that Bayesian
analysis subsumes likelihood analysis in this sense. Therefore, the MAP estimator is also
the maximum likelihood estimator in that special case. Figure 2 provides the posterior
distributions associated with both the flat prior on µ and the conjugate N (0, 0.1 σ̂ 2 ) prior
for a crime dataset discussed in Marin and Robert (2007). The difference between both
posteriors is still visible after 90 observations and it illustrates the impact of the choice of
the hyperparameter (ξ, λ) on the resulting inference.


Figure 2 Two posterior distributions on a normal mean corresponding to the flat prior (plain) and a
conjugate prior (dotted) for a dataset of 90 observations. (Source: Marin and Robert 2007.)

2.4 Confidence intervals


As should now be clear, the Bayesian approach is a complete inferential approach. Therefore,
it covers among other things confidence evaluation, testing, prediction, model checking, and
point estimation. Unsurprisingly, the derivation of the confidence intervals (or of confidence
regions in more general settings) is based on the posterior distribution π(θ|Dn ). Since the
Bayesian approach processes θ as a random variable and conditions upon the observables
Dn , a natural definition of a confidence region on θ is to determine C(Dn ) such that

π(θ ∈ C(Dn )|Dn ) = 1 − α (7)

where α is either a predetermined level such as 0.05, or a value derived from the loss
function (that may depend on the data).
The important difference from a traditional perspective is that the integration here is done
over the parameter space, rather than over the observation space. The quantity 1 − α thus
corresponds to the probability that a random θ belongs to this set C(Dn ), rather than to
the probability that the random set contains the “true” value of θ. Given this drift in the
interpretation of a confidence set (rather called a credible set by Bayesians in order to
stress this major difference with the classical confidence set), the determination of the best7
confidence set turns out to be easier than in the classical sense: It simply corresponds to the
values of θ with the highest posterior values,

C(Dn ) = {θ; π(θ|Dn ) ≥ kα } ,

where kα is determined by the coverage constraint (7). This region is called the highest
posterior density (HPD) region.
When the prior distribution is not conjugate, the posterior distribution is not necessarily so
easily-managed. For instance, if the normal N (µ, 1) distribution is replaced with the Cauchy
6 There is nothing special about 0.05 when compared with, say, 0.87 or 0.12. It is just that the famous 5% level is adopted by most as an acceptable level of error.
7 In the sense of offering a given confidence coverage for the smallest possible length/volume.


Figure 3 (left) Posterior distribution of the location parameter µ of a Cauchy sample for a N(0, 10)
prior and corresponding 95% HPD region (Source: Marin and Robert 2007); (right) representation of
a posterior sample of 10³ values of (θ, σ²) for the normal model, x1, . . . , x10 ∼ N(θ, σ²) with x̄ = 0,
s² = 1 and n = 10, under Jeffreys’ prior, along with the pointwise approximation to the 10% HPD
region (in darker hues) (Source: Robert and Wraith 2009).

distribution, C(µ, 1), in the likelihood

ℓ(µ|Dn) = ∏_{i=1}^{n} fµ(xi) = 1 / [π^n ∏_{i=1}^{n} (1 + (xi − µ)²)] ,

there is no conjugate prior available and we can consider a normal prior on µ, say N (0, 10).
The posterior distribution is then proportional to

π̃(µ|Dn) = exp(−µ²/20) / ∏_{i=1}^{n} (1 + (xi − µ)²) .

Solving π̃(µ|Dn ) = k is not possible analytically, only numerically, and the derivation of the
bound kα requires some amount of trial-and-error in order to obtain the correct coverage.
Figure 3 gives the posterior distribution of µ for the observations x1 = −4.3 and x2 = 3.2.
For a given value of k, a trapezoidal approximation can be used to compute the approximate
coverage of the HPD region. For α = 0.95, a trial-and-error exploration of a range of values
of k then leads to an approximation of kα = 0.0415 and the corresponding HPD region is
represented in Figure 3 (left).
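
The trial-and-error step can be automated on a grid, as in the following R sketch (the grid bounds and the root-finding shortcut are arbitrary choices replacing manual exploration):

## Numerical 95% HPD region for the Cauchy location example (x1 = -4.3, x2 = 3.2, N(0, 10) prior).
obs <- c(-4.3, 3.2)
post_unnorm <- function(mu)
  exp(-mu^2 / 20) / ((1 + (obs[1] - mu)^2) * (1 + (obs[2] - mu)^2))
mu   <- seq(-15, 15, length.out = 1e4)
h    <- diff(mu)[1]
dens <- post_unnorm(mu)
dens <- dens / sum(dens * h)                         # normalise the posterior on the grid
coverage <- function(k) sum(dens[dens >= k] * h)     # mass of the region {dens >= k}
k_alpha <- uniroot(function(k) coverage(k) - 0.95, c(1e-8, max(dens)))$root
range(mu[dens >= k_alpha])   # hull of the HPD region (it may be a union of two intervals)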
As illustrated in the above example, posterior distributions are not necessarily unimodal
and thus the HPD regions may include several disconnected sets. This may sound
counterintuitive from a classical point of view, but it must be interpreted as indicating
indeterminacy, either in the data or in the prior, about the possible values of θ. Note also that
HPD regions are dependent on the choice of the reference measure that defines the volume
(or surface).

The analytic derivation of HPD regions is rarely straightforward but let us stress that, due to
the fact that the posterior density is most often known only up to a normalising constant, those regions
can be easily derived from posterior simulations. For instance, Figure 3 (right) illustrates
this derivation in the case of a normal N (θ, σ 2 ) model with both parameters unknown
and Jeffreys’ prior, when the sufficient statistics are x = 0 and s2 = 1, based on n = 10
observations.

3 Testing Hypotheses
Deciding about the validity of some restrictions on the parameter θ or on the validity of
a whole model—like whether or not the normal distribution is appropriate for the data at
hand—is a major and maybe the most important component of statistical inference. Because
the outcome of the decision process is clearcut, accept (coded by 1) or reject (coded by 0),
the construction and the evaluation of procedures in this setup are quite crucial. While the
Bayesian solution is formally very close to a likelihood ratio statistic, its numerical values
and hence its conclusions often strongly differ from the classical solutions.

3.1 Decisions
Without loss of generality, and including the setup of model choice, we represent null
hypotheses as restricted parameter spaces, namely θ ∈ Θ0 . For instance, θ > 0 corresponds
to Θ0 = R+ . The evaluation of testing procedures can be formalised via the 0 − 1 loss that
equally penalizes all errors: If we consider the test of H0 : θ ∈ Θ0 versus H1 : θ 6∈ Θ0 ,
and denote by d ∈ {0, 1} the decision made by the researcher and by δ the corresponding
decision procedure, the loss
L(θ, d) = 1 − d if θ ∈ Θ0, and L(θ, d) = d otherwise,

is associated with the Bayes decision (estimator)

δ^π(x) = 1 if P^π(θ ∈ Θ0|x) > P^π(θ ∉ Θ0|x), and δ^π(x) = 0 otherwise.
This estimator is easily justified on an intuitive basis since it chooses the hypothesis with the
largest posterior probability. The Bayesian testing procedure is therefore a direct transform
of the posterior probability of the null hypothesis.

3.2 The Bayes Factor


A notion central to Bayesian testing is the Bayes factor

B^π_{10} = [P^π(θ ∈ Θ1|x) / P^π(θ ∈ Θ0|x)] / [P^π(θ ∈ Θ1) / P^π(θ ∈ Θ0)] ,
which corresponds to the classical odds or likelihood ratio, the difference being that the
parameters are integrated rather than maximized under each model. While it is a simple one-
to-one transform of the posterior probability, it can be used for Bayesian testing without

resorting to a specific loss, evaluating the strength of the evidence in favour of or against H0
by the distance of log10(B^π_{10}) from zero (Jeffreys 1939). This somehow ad-hoc perspective
provides a reference for hypothesis assessment with no need to define the prior probabilities
of H0 and H1, which is one of the advantages of using the Bayes factor. In general, the Bayes
factor does depend on prior information, but it can be perceived as a Bayesian likelihood
ratio since, if π0 and π1 are the prior distributions under H0 and H1, respectively, B^π_{10} can
be written as

B^π_{10} = ∫_{Θ1} fθ(x) π1(θ) dθ / ∫_{Θ0} fθ(x) π0(θ) dθ = m1(x)/m0(x) ,

thus replacing the likelihoods with the marginals under both hypotheses. Thus, by integrating
out the parameters within each hypothesis, the uncertainty on each parameter is taken into
account, which induces a natural penalisation for larger models, as intuited by Jeffreys
(1939). The Bayes factor is connected with the Bayesian information criterion (BIC, see
Robert 2007, Chapter 5), with a penalty term of the form d log n/2, which makes explicit the
penalisation induced by Bayes factors in regular parametric models. In wide generality,
the Bayes factor asymptotically corresponds to a likelihood ratio with a penalty of the form
d* log n*/2, where d* and n* can be viewed as the effective dimension of the model and
number of observations, respectively (see Berger et al. 2003; Chambaz and Rousseau 2008).
The Bayes factor therefore has the major advantage that it does not require computing a
complexity measure (or penalty term)—in other words, defining what d* is and what n* is—
which is often quite complicated and may depend on the true distribution.

3.3 Point null hypotheses


When the hypothesis to be tested is a point null hypothesis, H0 : θ = θ0, there are difficulties
in the construction of the Bayesian procedure, given that, for an absolutely continuous prior π,
P^π(θ = θ0) = 0.
Rather logically, point null hypotheses can be criticized as being artificial and impossible
to test (how often can one distinguish θ = 0 from θ = 0.0001?!), but they must also
be processed, being part of the everyday requirements of statistical analysis and also a
convenient representation of some model choice problems (which we will discuss later).
Testing point null hypotheses actually requires a modification of the prior distribution so
that, when testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 ,

π(Θ0 ) > 0 and π(Θ1 ) > 0

hold, whatever the measures of Θ0 and Θ1 for the original prior, which means that the prior
must be decomposed as

π(θ) = P π (θ ∈ Θ0 ) × π0 (θ) + P π (θ ∈ Θ1 ) × π1 (θ)

with positive weights on both Θ0 and Θ1 .


Note that this modification makes sense from both informational and operational points of
view. If H0 : θ = θ0 , the fact that the hypothesis is tested implies that θ = θ0 is a possibility
and it brings some additional prior information on the parameter θ. Besides, if H0 is tested

Table 1  Posterior probability of µ = 0 for different values of z = x/σ, ρ = 1/2,
and for τ = σ (top), τ² = 10σ² (bottom).

          z        0      0.68    1.28    1.96
  π(µ = 0|z)     0.586   0.557   0.484   0.351
  π(µ = 0|z)     0.768   0.729   0.612   0.366

Source: Marin and Robert 2007.

and accepted, this means that, in most situations, the (reduced) model under H0 will be
used rather than the (full) model considered before. Thus, a prior distribution under the
reduced model must be available for potential later inference. (Formally, the fact that this
later inference depends on the selection of H0 should also be taken into account.)
In the special case Θ0 = {θ0 }, π0 is the Dirac mass at θ0 , which simply means that
P π0 (θ = θ0 ) = 1, and we need to introduce a separate prior weight of H0 , namely,

ρ = P π (θ = θ0 ) and π(θ) = ρIθ0 (θ) + (1 − ρ)π1 (θ) .

Then,

π(Θ0|x) = fθ0(x) ρ / ∫ fθ(x) π(θ) dθ = fθ0(x) ρ / [fθ0(x) ρ + (1 − ρ) m1(x)] .
In the case when x ∼ N (µ, σ 2 ) and µ ∼ N (ξ, τ 2 ), consider the test of H0 : µ = 0. We
can choose ξ equal to 0 if we do not have additional prior information. Then the Bayes factor
is the ratio of marginals under both hypotheses, µ = 0 and µ ≠ 0,

B^π_{10} = m1(x)/f0(x) = [σ/√(σ² + τ²)] e^{−x²/(2(σ²+τ²))} / e^{−x²/(2σ²)}

and

π(µ = 0|x) = [1 + ((1 − ρ)/ρ) √(σ²/(σ² + τ²)) exp(τ²x²/(2σ²(σ² + τ²)))]^{−1}

is the posterior probability of H0 . Table 1 gives an indication of the values of the posterior
probability when the normalized quantity x/σ varies. This posterior probability again
depends on the choice of the prior variance τ 2 : The dependence is actually quite severe,
as shown below with the Jeffreys–Lindley paradox.
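
The entries of Table 1 follow directly from the last formula; a short R sketch (with σ² set to 1 so that z = x/σ = x):

## Posterior probability of H0: mu = 0 as a function of z = x / sigma, for rho = 1/2.
post_H0 <- function(z, tau2, rho = 0.5)      # sigma^2 is fixed to 1 here
  1 / (1 + (1 - rho) / rho * sqrt(1 / (1 + tau2)) *
         exp(tau2 * z^2 / (2 * (1 + tau2))))
z <- c(0, 0.68, 1.28, 1.96)
round(rbind("tau = sigma"        = post_H0(z, tau2 = 1),
            "tau^2 = 10 sigma^2" = post_H0(z, tau2 = 10)), 3)   # the two rows of Table 1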

3.4 The Ban on Improper Priors


Unfortunately, this decomposition of the prior distribution into two subpriors brings a serious
difficulty related to improper priors, which amounts in practice to banning their use in testing
situations. In fact, when using the representation

π(θ) = P π (θ ∈ Θ0 ) × π0 (θ) + P π (θ ∈ Θ1 ) × π1 (θ) ,

the weights P π (θ ∈ Θ0 ) and P π (θ ∈ Θ1 ) are meaningful only if π0 and π1 are normalized


probability densities. Otherwise, they cannot be interpreted as weights.

Figure 4 Range of the Bayes factor B^π_{10} when τ goes from 10^{−4} to 10. (Note: The x-axis is in
logarithmic scale.) (Source: Marin and Robert 2007.)

Table 2  Posterior probability of H0 : µ = 0 for the Jeffreys prior π1(µ) = 1 under H1.

          x       0.0     1.0     1.65    1.96    2.58
  π(µ = 0|x)    0.285   0.195   0.089   0.055   0.014

Source: Marin and Robert 2007.

In the instance when x ∼ N(µ, 1) and H0 : µ = 0, the improper (Jeffreys) prior is
π1(µ) = 1; if we write

π(µ) = (1/2) I0(µ) + (1/2) · I_{µ≠0} ,

then the posterior probability is

π(µ = 0|x) = e^{−x²/2} / [e^{−x²/2} + ∫_{−∞}^{+∞} e^{−(x−θ)²/2} dθ] = 1 / [1 + √(2π) e^{x²/2}] .

A first consequence of this choice is that the posterior probability of H0 is bounded from
above by

π(µ = 0|x) ≤ 1/(1 + √(2π)) = 0.285 .
Table 2 provides the evolution of this probability as x goes away from 0. An interesting point
is that the numerical values somehow coincide with the p-values used in classical testing
(Casella and Berger 2001).
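
A short R evaluation of the formula above, to be set against Table 2 and against classical p-values:

## Posterior probability of H0: mu = 0 under the improper prior pi1(mu) = 1 and equal weights.
x <- c(0, 1, 1.65, 1.96, 2.58)
round(1 / (1 + sqrt(2 * pi) * exp(x^2 / 2)), 3)   # approx. 0.285 0.195 0.093 0.055 0.014
round(2 * pnorm(-x), 3)                           # two-sided p-values, for comparison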

If we are instead testing H0 : θ ≤ 0 versus H1 : θ > 0, then the posterior probability is

π(θ ≤ 0|x) = (1/√(2π)) ∫_{−∞}^{0} e^{−(x−θ)²/2} dθ = Φ(−x) ,

and the answer is now exactly the p-value found in classical statistics.
The difficulty in using an improper prior also relates to what is called the Jeffreys–Lindley
paradox, a phenomenon that shows that limiting arguments are not valid in testing settings.
In contrast with estimation settings, the noninformative prior no longer corresponds to the
limit of conjugate inferences. In fact, for a conjugate prior, the posterior probability

π(θ = 0|x) = {1 + ((1 − ρ0)/ρ0) √(σ²/(σ² + τ²)) exp(τ²x²/(2σ²(σ² + τ²)))}^{−1}

converges to 1 when τ goes to +∞, for every value of x, as already illustrated by Figure 4.
This limiting procedure thus differs from the noninformative answer [1 + √(2π) exp(x²/2)]^{−1}
obtained above.
The fundamental issue that bars us from using improper priors on one or both of the sets
Θ0 and Θ1 is a normalizing difficulty: If g0 and g1 are measures (rather than probabilities)
on the subspaces Θ0 and Θ1 , the choice of the normalizing constants influences the Bayes
factor. Indeed, when gi is replaced by ci gi (i = 0, 1), where ci is an arbitrary constant, the
Bayes factor is multiplied by c0/c1. Thus, for instance, if the Jeffreys prior is flat and g0 = c0,
g1 = c1, the posterior probability

π(θ ∈ Θ0|x) = ρ0 c0 ∫_{Θ0} fθ(x) dθ / [ρ0 c0 ∫_{Θ0} fθ(x) dθ + (1 − ρ0) c1 ∫_{Θ1} fθ(x) dθ]

is completely determined by the choice of c0/c1. This implies, for instance, that the function
[1 + √(2π) exp(x²/2)]^{−1} obtained earlier has no validity whatsoever.
Since improper priors are an essential part of the Bayesian approach, there have been many
proposals to overcome this ban. Most use a device that transforms the prior into a proper
probability distribution by using a portion of the data Dn and then use the other part of the
data to run the test as in a standard situation. The variety of available solutions is due to the
many possibilities of removing the dependence on the choice of the portion of the data used
in the first step. The resulting procedures are called pseudo-Bayes factors. See Robert (2007,
Chapter 5) for more details.

3.5 The case of nuisance parameters


In some settings, some parameters are shared by both hypotheses (or by both models) that
are under comparison. Since they have the same meaning in both models, the above
ban can be partly lifted and a common improper prior can be used on these parameters, in
both models.
For instance, consider a regression model, represented as

y|X, β, σ ∼ N (Xβ, σ 2 In ) , (8)



where X denotes the (n, p) matrix of regressors—upon which the whole analysis is
conditioned—, y the vector of the n observations, and β is the vector of the regression
coefficients. (This is a matrix representation of the repeated observation of

yi = β1 xi1 + . . . + βp xip + σεi ,    εi ∼ N(0, 1) ,

when i varies from 1 to n.) Variable selection in this setup means removing covariates, that
is, columns of X, that are not significantly contributing to the expectation of y given X. In
other words, this is about testing whether or not a null hypothesis like H0 : β1 = 0 holds.
From a Bayesian perspective, a possible noninformative prior distribution on the generic
regression model (8) is the so-called Zellner (1986) g-prior, where the conditional8 prior density
π(β|σ) corresponds to a normal

N(0, nσ²(X^T X)^{−1})

distribution on β, A^T denoting the transpose of the matrix A, and where a “marginal”
improper prior on σ², π(σ²) = σ^{−2}, is used to complete the joint distribution.
With this default (or reference) prior modelling, and when considering the submodel
corresponding to the null hypothesis H0 : β1 = 0, with parameters β (−1) and σ, we can use
a similar g-prior distribution

β(−1)|σ, X ∼ N(0, nσ²(X_{−1}^T X_{−1})^{−1}) ,

where X−1 denotes the regression matrix missing the column corresponding to the first
regressor, and σ 2 ∼ π(σ 2 ) = σ −2 . Since σ is a nuisance parameter in this case, we may
use the improper prior on σ 2 as common to all submodels and thus avoid the indeterminacy
in the normalising factor of the prior when computing the Bayes factor
B01 = ∫ f(y|β(−1), σ, X) π(β(−1)|σ, X_{−1}) dβ(−1) σ^{−2} dσ / ∫ f(y|β, σ, X) π(β|σ, X) dβ σ^{−2} dσ
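
These integrals are available in closed form. As a hedged illustration (the marginal-likelihood expression used below is the standard result for a zero-mean Zellner g-prior combined with π(σ²) = σ^{−2}, stated here as an assumption rather than a formula spelled out in this chapter, and the data are simulated), the Bayes factor for dropping one covariate can be computed as follows:

## log m(y | X) up to a constant common to all submodels, under the g-prior with g = n:
##   m(y | X) proportional to (g + 1)^(-k/2) * (y'y - g/(g+1) * y' P_X y)^(-n/2),
## where k = ncol(X) and P_X is the orthogonal projection onto the columns of X.
log_marginal <- function(y, X, g = length(y)) {
  n <- length(y); k <- ncol(X)
  P <- X %*% solve(crossprod(X), t(X))
  -k / 2 * log(g + 1) - n / 2 * log(sum(y^2) - g / (g + 1) * drop(t(y) %*% P %*% y))
}
set.seed(1); n <- 50                       # simulated toy data: the third coefficient is zero
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(1, 0.5, 0) + rnorm(n)
B01 <- exp(log_marginal(y, X[, -3]) - log_marginal(y, X))
B01                                        # values above 1 support the submodel beta3 = 0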

Figure 5 reproduces a computer output from Marin and Robert (2007) that illustrates how
this default prior and the corresponding Bayes factors can be used in the same spirit as
significance levels in a standard regression model, each Bayes factor being associated with
the test of the nullity of the corresponding regression coefficient. For instance, only the
intercept and the coefficients of X1 , X2 , X4 , X5 are significant. This output mimics the
standard lm R function outcome in order to show that the level of information provided by
the Bayesian analysis goes beyond the classical output. (We stress that all items in the table of
Figure 5 are obtained via closed-form formulae.) Obviously, this reproduction of a frequentist
output is not the whole purpose of a Bayesian data analysis, quite the opposite: it simply
reflects the ability of a Bayesian analysis to produce automated summaries, just as in the
classical case, but the inferential abilities of the Bayesian approach are considerably wider.
(For instance, testing simultaneously the nullity of β3 , β6 , . . . , β10 is of identical difficulty,
as detailed in Marin and Robert 2007, Chapter 3.)
8 The fact that the prior distribution depends on the matrix of regressors X is not contradictory with the Bayesian paradigm in that the whole analysis is conditional on X. The potential randomness of the regressors is not accounted for in this analysis.

Estimate BF log10(BF)

(Intercept) 9.2714 26.334 1.4205 (***)


X1 -0.0037 7.0839 0.8502 (**)
X2 -0.0454 3.6850 0.5664 (**)
X3 0.0573 0.4356 -0.3609
X4 -1.0905 2.8314 0.4520 (*)
X5 0.1953 2.5157 0.4007 (*)
X6 -0.3008 0.3621 -0.4412
X7 -0.2002 0.3627 -0.4404
X8 0.1526 0.4589 -0.3383
X9 -1.0835 0.9069 -0.0424
X10 -0.3651 0.4132 -0.3838
evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor

Figure 5 R output of a Bayesian regression analysis on a processionary caterpillar dataset with ten
covariates analysed in Marin and Robert (2007). The Bayes factor on each row corresponds to the test
of the nullity of the corresponding regression coefficient.

4 Extensions
The above description of inference is only an introduction and is thus not representative of
the wealth of possible applications resulting from Bayesian modelling. We consider below
two extensions inspired by Marin and Robert (2007).

4.1 Prediction
When considering a sample Dn = (x1 , . . . , xn ) from a given distribution, there can be
a sequential or dynamic structure in the model that implies that future observations are
expected. While more realistic modeling may involve probabilistic dependence between the
xi ’s, we consider here the simpler setup of predictive distributions in iid settings.
If xn+1 is a future observation from the same distribution fθ (·) as the sample Dn , its
predictive distribution given the current sample is defined as

f^π(xn+1|Dn) = ∫ f(xn+1|θ, Dn) π(θ|Dn) dθ = ∫ fθ(xn+1) π(θ|Dn) dθ .

The motivation for defining this distribution is that the information available on the
pair (xn+1 , θ) given the data Dn is summarized in the joint posterior distribution
fθ (xn+1 )π(θ|Dn ) and the predictive distribution above is simply the corresponding marginal
on xn+1 . This is nonetheless coherent with the Bayesian approach, which then considers
xn+1 as an extra unknown.
For the normal N(µ, σ²) setup, using a conjugate prior on (µ, σ²) of the form

(σ²)^{−λσ−3/2} exp{−(λµ(µ − ξ)² + α)/2σ²} ,

the corresponding posterior distribution on (µ, σ²) given Dn is

N((λµξ + n x̄n)/(λµ + n), σ²/(λµ + n)) × IG(λσ + n/2, [α + s²x + nλµ(x̄n − ξ)²/(λµ + n)]/2) ,

denoted by

N(ξ(Dn), σ²/λµ(Dn)) × IG(λσ(Dn), α(Dn)/2) ,
and the predictive on xn+1 is derived as

f^π(xn+1|Dn) ∝ ∫ (σ²)^{−λσ−2−n/2} exp{−(xn+1 − µ)²/2σ²}
               × exp{−(λµ(Dn)(µ − ξ(Dn))² + α(Dn))/2σ²} d(µ, σ²)
             ∝ ∫ (σ²)^{−λσ−n/2−3/2} exp{−([λµ(Dn)/(λµ(Dn) + 1)](xn+1 − ξ(Dn))² + α(Dn))/2σ²} dσ²
             ∝ [α(Dn) + (λµ(Dn)/(λµ(Dn) + 1))(xn+1 − ξ(Dn))²]^{−(2λσ+n+1)/2} .

Therefore, the predictive of xn+1 given the sample Dn is a Student’s t distribution with
mean ξ(Dn) and 2λσ + n degrees of freedom. In the special case of the noninformative
prior, λµ = λσ = α = 0 and the predictive is

f^π(xn+1|Dn) ∝ [s²x + (n/(n + 1))(xn+1 − x̄n)²]^{−(n+1)/2} .

This is again a Student’s t distribution, with mean x̄n, scale sx/√n, and n degrees of
freedom.
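
Rather than working with the closed form, one can also simulate from the predictive by composition; a sketch under the noninformative prior, with simulated data for illustration:

## Predictive simulation: draw sigma^2 from its IG(n/2, s_x^2/2) posterior, then mu, then x_{n+1}.
set.seed(1)
x <- rnorm(10, mean = 2, sd = 1.5)               # illustrative sample D_n
n <- length(x); xbar <- mean(x); s2 <- sum((x - xbar)^2)
sig2 <- 1 / rgamma(1e5, shape = n / 2, rate = s2 / 2)
mu   <- rnorm(1e5, mean = xbar, sd = sqrt(sig2 / n))
xnew <- rnorm(1e5, mean = mu, sd = sqrt(sig2))
quantile(xnew, c(0.025, 0.5, 0.975))             # predictive median and 95% interval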

4.2 Outliers
Since normal modeling is often an approximation to the “real thing,” there may be doubts
about its adequacy. As already mentioned above, we will deal later with the problem of
checking that the normal distribution is appropriate for the whole dataset. Here, we consider
the somehow simpler problem of assessing whether or not each point in the dataset is
compatible with normality. There are many different ways of dealing with this problem.
We choose here to take advantage of the derivation of the predictive distribution above: If an
observation xi is unlikely under the predictive distribution based on the other observations,
then we can argue against its distribution being equal to the distribution of the other
observations.
For each xi ∈ Dn, we consider f^π_i(x|D^i_n) as being the predictive distribution based
on D^i_n = (x1, . . . , xi−1, xi+1, . . . , xn). Considering f^π_i(xi|D^i_n) or the corresponding cdf
F^π_i(xi|D^i_n) (in dimension one) gives an indication of the level of compatibility of the
observation with the sample. To quantify this level, we can, for instance, approximate the
distribution of F^π_i(xi|D^i_n) as uniform over [0, 1] since F^π_i(·|D^i_n) converges to the true cdf of
the model. Simultaneously checking all F^π_i(xi|D^i_n) over i may signal outliers.
The detection of outliers must pay attention to the Bonferroni fallacy, which is that extreme
values do occur in large enough samples. This means that, as n increases, we will see smaller
and smaller values of Fiπ (xi |Dni ) even if the whole sample is from the same distribution. The
significance level must therefore be chosen in accordance with this observation, for instance

using a bound a on F^π_i(xi|D^i_n) such that

1 − (1 − a)^n = 1 − α ,

where α is the nominal level chosen for outlier detection.
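
A possible implementation (a sketch only: the predictive cdf is approximated by simulation as in the previous subsection, and the data are simulated with one planted outlier):

## Leave-one-out predictive check: F_i(x_i | D_n^i) should look roughly uniform for regular points.
set.seed(1)
x <- c(rnorm(20), 4.5)                       # 20 N(0, 1) observations plus a planted outlier
loo_cdf <- function(i, M = 2e4) {
  xi <- x[-i]; m <- length(xi); xb <- mean(xi); s2 <- sum((xi - xb)^2)
  sig2 <- 1 / rgamma(M, shape = m / 2, rate = s2 / 2)
  pred <- rnorm(M, mean = rnorm(M, xb, sqrt(sig2 / m)), sd = sqrt(sig2))
  mean(pred <= x[i])                         # Monte Carlo approximation of F_i(x_i | D_n^i)
}
u <- sapply(seq_along(x), loo_cdf)
which.min(pmin(u, 1 - u))                    # the planted observation (index 21) should stand out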

4.3 Model choice


For model choice, i.e. when several models are under comparison for the same observation

Mi : x ∼ fi (x|θi ) , i ∈ I,

where I can be finite or infinite, the usual Bayesian answer is similar to the Bayesian
tests as described above. The most coherent perspective (from our viewpoint) is actually
to envision the tests of hypotheses as particular cases of model choices, rather than trying
to justify the modification of the prior distribution criticised by Gelman (2008). This also
incorporates within model choice the alternative solution of model averaging, proposed
by Madigan and Raftery (1994), which strives to keep all possible models when drawing
inference.
The idea behind Bayesian model choice is to construct an overall probability on the
collection of models ∪_{i∈I} Mi in the following way: the parameter is θ = (i, θi), i.e. the
model index i and, given that the model index equals i, the parameter θi of model Mi; the
prior measure on the parameter θ is then expressed as

dπ(θ) = Σ_{i∈I} pi dπi(θi) ,    Σ_{i∈I} pi = 1.

As a consequence, the Bayesian model selection associated with the 0–1 loss function and
the above prior is the model that maximises the posterior probability
π(Mi|x) = pi ∫_{Θi} fi(x|θi) πi(θi) dθi / Σ_j pj ∫_{Θj} fj(x|θj) πj(θj) dθj

across all models. Contrary to classical plug-in likelihoods, the marginal likelihoods involved
in the above ratio are comparable on the same scale and do not require the models to be nested.
As mentioned in Section 3.5, integrating out the parameters θi in each of the models takes their
uncertainty into account, so that the marginal likelihoods ∫_{Θi} fi(x|θi) πi(θi) dθi are naturally
penalised likelihoods. In most parametric setups, when the number of parameters does not
grow to infinity with the number of observations and when those parameters are identifiable,
the Bayesian model selector as defined above is consistent, i.e. with increasing numbers of
observations, the probability of choosing the right model goes to 1.
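
A toy illustration of these posterior model probabilities, for two non-nested models with closed-form marginal likelihoods (the data and the conjugate priors are arbitrary choices made for the example):

## Poisson model with a Gamma(1, 1) prior versus geometric model with a Beta(1, 1) prior,
## equal prior weights p_i = 1/2; both marginal likelihoods are available in closed form.
x <- c(0, 1, 1, 2, 3, 0, 1, 2, 0, 1)
n <- length(x); s <- sum(x)
logm_pois <- -sum(lfactorial(x)) + lgamma(1 + s) - (1 + s) * log(1 + n)
logm_geom <- lbeta(1 + n, 1 + s)             # B(n + 1, s + 1), since lbeta(1, 1) = 0
logm <- c(Poisson = logm_pois, Geometric = logm_geom)
exp(logm - max(logm)) / sum(exp(logm - max(logm)))   # posterior probabilities of the two models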

References
Berger J, Ghosh J and Mukhopadhyay N 2003 Approximations to the Bayes factor in model selection problems and
consistency issues. J. Statist. Plann. Inference 112, 241–258.

Bishop YMM, Fienberg SE and Holland PW 1975 Discrete Multivariate Analysis: Theory and Practice. MIT Press,
Cambridge, MA.
Casella G and Berger R 2001 Statistical Inference second edn. Wadsworth, Belmont, CA.
Chambaz A and Rousseau J 2008 Bounds for Bayesian order identification with application to mixtures. Ann. Statist.
36, 938–962.
Gelman A 2008 Objections to Bayesian statistics. Bayesian Analysis 3(3), 445–450.
Jaynes E 2003 Probability Theory. Cambridge University Press, Cambridge.
Jeffreys H 1939 Theory of Probability first edn. The Clarendon Press, Oxford.
Lehmann E and Casella G 1998 Theory of Point Estimation (revised edition). Springer-Verlag, New York.
MacKay DJC 2002 Information Theory, Inference & Learning Algorithms. Cambridge University Press, Cambridge,
UK.
Madigan D and Raftery A 1994 Model selection and accounting for model uncertainty in graphical models using
Occam’s window. J. American Statist. Assoc. 89, 1535–1546.
Marin JM and Robert C 2007 Bayesian Core. Springer-Verlag, New York.
Robert C 2007 The Bayesian Choice paperback edn. Springer-Verlag, New York.
Robert C and Wraith D 2009 Computational methods for Bayesian model choice. In MaxEnt 2009 Proceedings, American Institute of Physics. (To appear.)
Robert C, Chopin N and Rousseau J 2009 Theory of Probability revisited (with discussion). Statist. Science. (to
appear).
Schervish M 1995 Theory of Statistics. Springer-Verlag, New York.
Sørensen D and Gianola D 2002 Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer-Verlag, New York.
Templeton A 2008 Statistical hypothesis testing in intraspecific phylogeography: nested clade phylogeographical
analysis vs. approximate Bayesian computation. Molecular Ecology 18(2), 319–331.
Zellner A 1986 On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. North-Holland/Elsevier, pp. 233–243.
