Bayesian Statistics
B.J.K. Kleijn
University of Amsterdam
Korteweg-de Vries institute for Mathematics
Spring 2009
Contents

Preface

1 Introduction
1.1 Frequentist statistics
1.2 Bayesian statistics
1.4 Exercises

2 Bayesian basics
2.5 Exercises

3.2 Non-informative priors
3.5 Exercises

4 Bayesian asymptotics
4.1 Asymptotic statistics
4.2 Schwarz consistency

5.2 Marginal distributions
5.4 Hierarchical priors

6.2 More

A Measure theory
Preface
These lecture notes were written for the course Bayesian Statistics, taught at the University
of Amsterdam in the spring of 2007. The course was aimed at first-year MSc.-students in
statistics, mathematics and related fields. The aim was for students to understand the basic
properties of Bayesian statistical methods; to be able to apply this knowledge to statistical
questions and to know the extent (and limitations) of conclusions based thereon. Considered
were the basic properties of the procedure, choice of the prior by objective and subjective
criteria, Bayesian inference, model selection and applications. In addition, non-parametric
Bayesian modelling and posterior asymptotic behaviour have received due attention and computational methods were presented.
An attempt has been made to make these lecture notes as self-contained as possible.
Nevertheless the reader is expected to have been exposed to some statistics, preferably from
a mathematical perspective. It is not assumed that the reader is familiar with asymptotic
statistics; these lecture notes provide a general introduction to this topic. Where possible,
definitions, lemmas and theorems have been formulated such that they cover parametric and
nonparametric models alike. An index, references and an extensive bibliography are included.
Since Bayesian statistics is formulated in terms of probability theory, some background in
measure theory is prerequisite to understanding these notes in detail. However, the reader is
not supposed to have all measure-theoretical knowledge handy: appendix A provides an overview
of relevant measure-theoretic material. In the description and handling of nonparametric
statistical models, functional analysis and topology play a role. Of the latter two, however,
only the most basic notions are used and all necessary detail in this respect will be provided
during the course.
The author wishes to thank Aad van der Vaart for his contributions to this course and
these lecture notes, concerning primarily (but not exclusively) the chapter entitled Numerical
methods in Bayesian statistics. For corrections to the notes, the author thanks C. Muris, ...
Bas Kleijn, Amsterdam, January 2007
Chapter 1
Introduction
The goal of inferential statistics is to understand, describe and estimate (aspects of) the randomness of measured data. Quite naturally, this invites the assumption that the data represents a sample from an unknown but fixed probability distribution. Based on that assumption,
one may proceed to estimate this distribution directly, or to give estimates of certain characteristic properties (like its mean, variance, etcetera). It is this straightforward assumption
that underlies frequentist statistics and markedly distinguishes it from the Bayesian approach.
1.1 Frequentist statistics
The samplespace Y is assumed to be a measurable space (Y, B), to enable the consideration of probability measures on it. The data Y is assumed to be distributed according to an unknown, underlying distribution P0:

Y ∼ P0.   (1.1)
Hence from the frequentist perspective, inferential statistics revolves around the central question: What is P0?, which may be considered in parts by questions like What is the mean of P0?, What are the higher moments of P0?, etcetera.
The second ingredient of a statistical procedure is a model, which contains all explanations
under consideration of the randomness in Y .
Definition 1.1.2. A statistical model P is a collection of probability measures P : B → [0, 1]
on the samplespace (Y, B).
The model P contains the candidate distributions for Y that the statistician considers reasonable explanations of the uncertainty he observes (or expects to observe) in Y. As such,
it constitutes a choice of the statistician analyzing the data rather than a given. Often, we
describe the model in terms of probability densities rather than distributions.
Definition 1.1.3. If there exists a σ-finite measure μ : B → [0, ∞] such that P ≪ μ for all P ∈ P, we say that the model P is dominated (by μ).

The Radon-Nikodym theorem (see theorem A.4.2) guarantees that we may represent a dominated model P in terms of probability density functions p = dP/dμ : Y → R. Note that the dominating measure may not be unique and hence, that the representation of P in terms of densities depends on the particular choice of dominating measure μ. A common way
of representing a model is a description in terms of a parameterization.
Definition 1.1.4. A model P is parameterized with parameter space Θ, if there exists a surjective map Θ → P : θ ↦ P_θ, called the parameterization of P.
Surjectivity of the parameterization is imposed so that for every P ∈ P, there exists a θ ∈ Θ such that P_θ = P: unless surjectivity is required, the parameterization may describe P only partially. Also of importance is the following property.
Definition 1.1.5. A parameterization of a statistical model P is said to be identifiable, if the map Θ → P : θ ↦ P_θ is injective.
Injectivity of the parameterization means that for all θ1, θ2 ∈ Θ, θ1 ≠ θ2 implies that P_{θ1} ≠ P_{θ2}. In other words, no two different parameter values θ1 and θ2 give rise to the same distribution. Clearly, in order for Θ to serve as a useful representation for the candidate distributions P_θ, identifiability is a first requirement. Other common conditions on the map θ ↦ P_θ are continuity (with respect to a suitable (often metric) topology on the model), differentiability (which may involve technical subtleties in case Θ is infinite-dimensional) and other smoothness conditions.
Remark 1.1.1. Although strictly speaking ambivalent, it is commonplace to refer to both P and the parameterizing space Θ as the model. This practice is not unreasonable in view of the fact that, in practice, almost all models are parameterized in an identifiable way, so that there exists a bijective correspondence between Θ and P.
A customary assumption in frequentist statistics is that the model is well-specified.
Definition 1.1.6. A model P is said to be well-specified if it contains the true distribution
of the data P0 , i.e.
P0 ∈ P.   (1.2)
The requirement regarding the interior of Θ in definition 1.1.7 ensures that the dimension d really concerns Θ and not just the dimension of the space R^d of which Θ forms a subset.
Example 1.1.1. The normal model for a single, real measurement Y is the collection of all normal distributions on R, i.e.

P = { N(μ, σ²) : (μ, σ) ∈ Θ },

where the parameterizing space Θ equals R × (0, ∞). The map (μ, σ) ↦ N(μ, σ²) is surjective and injective, i.e. the normal model is a two-dimensional, identifiable parametric model. Moreover, the normal model is dominated by the Lebesgue measure on the samplespace R and can hence be described in terms of Lebesgue-densities:

p_{μ,σ}(x) = (1 / (σ √(2π))) e^{−(x−μ)² / (2σ²)}.
Example 1.1.2. For an observation Y taking values in a finite samplespace {y1, . . . , yn}, any candidate distribution P is characterized by the probabilities p(yi) = P({yi}), which satisfy p(yi) ≥ 0 and

∑_{i=1}^{n} p(yi) = 1.

Hence the full, non-parametric model for Y can be identified with the simplex

S_n = { p = (p1, . . . , pn) ∈ R^n : p_i ≥ 0, ∑_{i=1}^{n} p_i = 1 },

which is parameterized identifiably by the first n − 1 probabilities, since p_n = 1 − ∑_{i=1}^{n−1} p_i.
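As an aside, this parameterization is easy to make concrete in code; the following is a minimal sketch in Python (the helper name to_full is ours, chosen for illustration), assuming only NumPy:

```python
import numpy as np

def to_full(p_head):
    """Extend the first n-1 probabilities to a point of the simplex S_n,
    using p_n = 1 - sum of the others (the identifiable parameterization above)."""
    p_head = np.asarray(p_head, dtype=float)
    p_last = 1.0 - p_head.sum()
    assert np.all(p_head >= 0.0) and p_last >= 0.0, "not a point of the simplex"
    return np.append(p_head, p_last)

p = to_full([0.2, 0.5])     # a distribution on a samplespace of n = 3 points
print(p, p.sum())           # [0.2 0.5 0.3] 1.0
```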
Note that a point-estimator is a statistic, i.e. a quantity that depends only on the data
(and possibly on other known information): since a point-estimator must be calculable in
practice, it may depend only on information that is known to the statistician after he has
performed the measurement with outcome Y = y. Also note that a point-estimator is a stochastic quantity: P̂(Y) depends on Y and is hence random, with its own distribution on P (as soon as a σ-algebra on P is established with respect to which P̂ is measurable). Upon measurement of Y resulting in a realisation Y = y, the estimate P̂(y) is a definite point in P.
Remark 1.1.2. Obviously, many other quantities may be estimated as well and the definition of a point-estimator given above is too narrow in that sense. Firstly, if the model is parameterized, one may define a point-estimator θ̂ : Y → Θ for θ0, from which we obtain P̂(Y) = P_{θ̂(Y)} as an estimator for P0. If the model is identifiable, estimation of θ0 in Θ is equivalent to estimation of P0 in P. But if the dimension d of the model is greater than one, we may choose to estimate only one component of θ (called the parameter of interest) and disregard other components (called nuisance parameters). More generally, we may choose to estimate certain properties of P0, for example its expectation, variance or quantiles, rather than P0 itself. As an example, consider a model P consisting of distributions on R with finite expectation and define the linear functional e : P → R by e(P) = PX. Suppose that we are interested in the expectation e0 = e(P0) of the true distribution. Obviously, based on an estimator P̂(Y) for P0 we may define an estimator

ê(Y) = ∫ y d[P̂(Y)](y)   (1.3)

to estimate e0. But in many cases, direct estimation of the property of interest of P0 can be done more efficiently than through P̂.
For instance, assume that X is integrable under P0 and Y = (X1, . . . , Xn) collects the results of an i.i.d. experiment with Xi ∼ P0 marginally (for all 1 ≤ i ≤ n); then the empirical expectation of X, defined simply as the sample-average of X,

P_n X = (1/n) ∑_{i=1}^{n} X_i,

provides an estimator for e0. (Note that the sample-average is also of the form (1.3) if we choose as our point-estimator for P0 the empirical distribution P̂(Y) = P_n and P_n ∈ P.) The law of large numbers guarantees that P_n X converges to e0 almost-surely as n → ∞, and the central limit theorem asserts that this convergence proceeds at rate n^{−1/2} (and that the limit distribution is zero-mean normal with P0(X − P0X)² as its variance) if the variance of X under P0 is finite. (More on the behaviour of estimators in the limit of large sample-size n can be found in chapter 4.) Many parameterizations θ ↦ P_θ are such that parameters coincide with expectations: for instance in the normal model, the parameter μ coincides with the expectation of X, so that it can be estimated by the sample-average,

μ̂(Y) = (1/n) ∑_{i=1}^{n} X_i.
Often, other properties of P0 can also be related to expectations: for example, if X ∈ R, the probabilities F0(s) = P0(X ≤ s) = P0 1{X ≤ s} can be estimated by

F̂(s) = (1/n) ∑_{i=1}^{n} 1{X_i ≤ s},

i.e. as the empirical expectation of the function x ↦ 1{x ≤ s}. This leads to a step-function with n jumps of size 1/n at samplepoints, which estimates the distribution function F0. Generalizing, any property of P0 that can be expressed in terms of an expectation of a P0-integrable function of X, P0(g(X)), is estimable by the corresponding empirical expectation, P_n g(X). (With regard to the estimator F̂, the convergence F̂(s) → F0(s) does not only hold for all s ∈ R but even uniformly in s, i.e. sup_{s∈R} |F̂(s) − F0(s)| → 0, c.f. the Glivenko-Cantelli theorem.)
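As an illustration of these empirical estimators, the following minimal Python sketch (assuming NumPy; all names are illustrative) computes the sample-average P_n X and evaluates the empirical distribution function F̂ for a simulated sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(loc=1.0, scale=2.0, size=n)    # i.i.d. sample with X ~ P0 = N(1, 4)

# Empirical expectation P_n X, an estimator of e0 = P0 X = 1:
print("P_n X =", x.mean())

def ecdf(sample, s):
    """Empirical distribution function: the fraction of sample points <= s."""
    return np.mean(sample <= s)

# F-hat is a step function with jumps of size 1/n at the sample points;
# by Glivenko-Cantelli, sup_s |F-hat(s) - F0(s)| shrinks as n grows:
for s in (-1.0, 1.0, 3.0):
    print(s, ecdf(x, s))
```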
To estimate a probability distribution (or any of its properties or parameters), many
different estimators may exist. Therefore, the use of any particular estimator constitutes
(another) choice made by the statistician analyzing the problem. Whether such a choice is a
good or a bad one depends on optimality criteria, which are either dictated by the particular
nature of the problem (see section 2.4 which extends the purely inferential point of view), or
based on more generically desirable properties of the estimator (note the use of the rather
ambiguous qualification "best guess" in definition 1.1.9).
Example 1.1.3. To illustrate what we mean by desirable properties, note the following.
When estimating P0 one may decide to use an estimator P̂(Y) because it has the property that it is close to the true distribution of Y in total variation (see appendix A, definition A.2.1). To make this statement more specific, the property that makes such an estimator P̂ attractive is that there exists a small constant ε > 0 and a (small) significance level 0 < α < 1, such that for all P ∈ P,

P( ‖P̂(Y) − P‖ < ε ) > 1 − α,

i.e. if Y ∼ P, then P̂(Y) lies close to P with high P-probability. Note that we formulate this property for all P in the model: since P0 ∈ P is unknown, the only way to guarantee that this property holds under P0 is to prove that it holds for all P ∈ P, provided that (1.2) holds.
A popular method of estimation that satisfies common optimality criteria in many (but
certainly not all!) problems is maximum-likelihood estimation.
Definition 1.1.10. Suppose that the model P is dominated by a σ-finite measure μ. The likelihood principle says that one should pick P̂ ∈ P as an estimator for the distribution P0 of Y such that

p̂(Y) = sup_{P ∈ P} p(Y).
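As an illustration, maximum-likelihood estimation in the normal model of example 1.1.1 can be carried out numerically; the following sketch assumes SciPy is available and parameterizes σ = exp(λ) to keep the optimization unconstrained (a choice of ours, not prescribed by the definition):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # data from P0 = N(2, 1.5^2)

def neg_log_lik(param):
    mu, log_sigma = param                      # sigma = exp(log_sigma) > 0
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

opt = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = opt.x[0], np.exp(opt.x[1])

# Compare with the closed-form MLE (sample mean, root mean square deviation):
print(mu_hat, sigma_hat)
print(x.mean(), x.std())
```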
1.2 Bayesian statistics
The above is only a very brief and rather abstract overview of the basic framework of
frequentist statistics, highlighting the central premise that a P0 for Y exists. It makes clear,
however, that frequentist inference concerns itself primarily with the stochastics of the random
variable Y and not with the context in which Y resides. Other than the fact that the model
has to be chosen reasonably based on the nature of Y , frequentist inference does not involve
any information regarding the background of the statistical problem in its procedures unless
one chooses to use such information explicitly (see, for example, remark 2.2.7 on penalized
maximum-likelihood estimation). In Bayesian statistics the use of background information is
an integral part of the procedure unless one chooses to disregard it: by the definition of a prior
measure, the statistician may express that he believes in certain points of the model more
strongly than others. This thought is elaborated on further in section 1.2 (e.g. example 1.2.1).
Similarly, results of estimation procedures are sensitive to the context in which they are
used: two statistical experiments may give rise to the same model formally, but the estimator
used in one experiment may be totally unfit for use in the other experiment.
Example 1.1.4. For example, if we are interested in a statistic that predicts the rise or fall of a certain share-price on the stock market based on its value over the past week, the estimator we use does not have to be a very conservative one: we are interested primarily in its long-term performance and not in the occasional mistaken prediction. However, if we wish to predict the rise or fall of white blood-cell counts in an HIV-patient based on last week's counts, overly optimistic predictions can have disastrous consequences.
Although data and model are very similar in these two statistical problems, the estimator used in the medical application should be much more conservative than
the estimator used in the stock-market problem. The inferential aspects of both questions are
the same, but the context in which such inference is made calls for adaptation. Such considerations form the motivation for statistical decision theory, as explained further in section 2.4.
Example 1.2.1. Consider three experiments: a lady claims a special skill of prediction; an acknowledged expert in his field makes a similar claim; and a friend who is clearly drunk claims that he can predict the outcomes of a fair coin-flip. In each case ten trials are conducted and, denoting by θ0 ∈ [0, 1] the unknown probability of a correct prediction in a single trial, we
test their respective claims posing the hypotheses:
H0 : θ0 = 1/2,    H1 : θ0 > 1/2.
The total number of successes out of ten trials is a sufficient statistic for θ and we use it as our test-statistic, noting that its distribution is binomial with n = 10, θ = θ0 under H0. Given the data Y with realization y of ten correct answers, applicable in all three examples, we reject H0 at p-value 2^{−10} ≈ 0.1%. So there is strong evidence to support the claims made in all three cases. Note that there is no difference in the frequentist analyses: formally, all three cases are treated exactly the same.
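The p-value quoted here is easily verified numerically; a minimal check in plain Python:

```python
from math import comb

n, k, theta0 = 10, 10, 0.5

# Under H0, the number of successes K is Bin(10, 1/2); the p-value of the
# realization k = 10 is P(K >= 10) = 2**-10.
p_value = sum(comb(n, j) * theta0**j * (1 - theta0) ** (n - j) for j in range(k, n + 1))
print(p_value)    # 0.0009765625, i.e. about 0.1%
```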
Yet intuitively (and also in every-day practice), one would be inclined to treat the three claims on a different footing: in the second experiment, we have no reason to doubt the expert's claim, whereas in the third case, the friend's condition makes his claim less than plausible. In the first experiment, the validity of the lady's claim is hard to guess beforehand. The outcome of the experiments would be as expected in the second case and remarkable in the first. In the third case, one would either consider the friend extremely lucky, or begin to doubt the fairness of the coin being flipped.
The above example convincingly makes the point that in our intuitive approach to statistical issues, we include all knowledge we have, even resorting to strongly biased estimators
if the model does not permit a non-biased way to incorporate it. The Bayesian approach to
statistics allows us to choose the prior such as to reflect this subjectivity: from the outset, we
attach more prior mass to parameter-values that we deem more likely, or that we believe in
more strongly. In the above example, we would choose a prior that concentrates more mass
at high values of θ in the second case and at low values in the third case. In the first case, the absence of prior knowledge would lead us to remain objective, attaching equal prior weights to high and low values of θ. Although the frequentist's testing procedure can be adapted to reflect subjectivity, the Bayesian procedure incorporates it rather more naturally through the choice of a prior.
Subjectivist Bayesians view the above as an advantage; objectivist Bayesians and frequentists view it as a disadvantage. Subjectivist Bayesians argue that personal beliefs are an
essential part of statistical reasoning, deserving of an explicit role in the formalism and interpretation of results. Objectivist Bayesians and frequentists reject this thought because scientific
reasoning should be devoid of any personal beliefs or interpretation. So the above freedom in
the choice of the prior is also the Achilles heel of Bayesian statistics: fervent frequentists and
objectivist Bayesians take the point of view that the choice of prior is an undesirable source of
ambiguity, rather than a welcome way to incorporate expert knowledge as in example 1.2.1.
After all, if the subjectivist Bayesian does not like the outcome of his analysis, he can just
go back and change the prior to obtain a different outcome. Similarly, if two subjectivist
Bayesians analyze the same data they may reach completely different conclusions, depending
on the extent to which their respective priors differ.
To a certain extent, such ambiguity is also present in frequentist statistics, since frequentists make a choice for a certain point-estimator. For example, the use of either a maximum-likelihood or penalized maximum-likelihood estimator leads to differences, the size of which depends on the relative sizes of likelihood and penalty. (Indeed, through the maximum-a-posteriori Bayesian point-estimator (see definition 2.2.5), one can demonstrate that the log-prior-density can be viewed as a penalty term in a penalized maximum-likelihood procedure, c.f. remark 2.2.7.) Yet the natural way in which subjectivity is expressed in the Bayesian setting is more explicit. Hence the frequentist or objectivist Bayesian sees in this a clear sign that subjective Bayesian statistics lacks universal value, unless one imposes that the prior should not express any bias (see section 3.2).
A second difference in philosophy between frequentist and Bayesian statisticians arises as a
result of the fact that the Bayesian procedure does not require that we presume the existence
of a true, underlying distribution P0 of Y (compare with (1.1)). The subjectivist Bayesian
views the model with (prior or posterior) distribution as his own, subjective explanation of
the uncertainty in the data. For that reason, subjectivists prefer to talk about their (prior or
posterior) belief concerning parameter values rather than implying objective validity of their
assertions. On the one hand, such a point of view makes intrinsic ambiguities surrounding
statistical procedures explicit; on the other hand, one may wonder about the relevance of
strictly personal belief in a scientific tradition that emphasizes universality of reported results.
The philosophical debate between Bayesians and frequentists has raged with varying intensity for decades, but remains undecided to this date. In practice, the choice for a Bayesian
or frequentist estimation procedure is usually not motivated by philosophical considerations,
but by far more practical issues, such as ease of computation and implementation, common
custom in the relevant field of application, specific expertise of the researcher or other forms
of simple convenience. Recent developments [3] suggest that the philosophical debate will be
put to rest in favour of more practical considerations as well.
Note, however, that the derivation of expression (2.7) (for example) is the result of subjectivist Bayesian assumptions on data and model. Since these assumptions are at odds with the frequentist perspective, we shall take (2.7) as a definition rather than a derived form. This has the consequence that some basic properties, implicit by derivation in the Bayesian framework, have to be imposed as conditions in the hybrid perspective (see remark 2.1.4).
Much of the material covered in these lecture notes does not depend on any particular
philosophical point of view, especially when the subject matter is purely mathematical. Nevertheless, it is important to realize when philosophical issues may come into play and there
will be points where this is the case. In particular when discussing asymptotic properties of
Bayesian procedures (see chapter 4), adoption of assumption (1.1) is instrumental, basically
because discussing convergence requires a limit-point.
1.4 Exercises
Exercise 1.1. Let Y ∈ Y be a random variable with unknown distribution P0. Let P be a model for Y, dominated by a σ-finite measure μ. Assume that the maximum-likelihood estimator P̂(Y) (see definition 1.1.10) is well-defined, P0-almost-surely.
Show that if ν is a σ-finite measure dominating μ and we calculate the likelihood using ν-densities, then the associated MLE is equal to P̂(Y). Conclude that the MLE does not depend on the dominating measure used, c.f. remark 1.1.3.
Exercise 1.2. In the three experiments of example 1.2.1, give the Neyman-Pearson test for hypotheses H0 and H1 at level α ∈ (0, 1). Calculate the p-value of the realization of 10 successes and 0 failures (in 10 Bernoulli trials according to H0).
Chapter 2
Bayesian basics
In this chapter, we consider the basic definitions and properties of Bayesian inferential and
decision-theoretic methods. Naturally the emphasis lies on the posterior distribution, which
we derive from the prior based on the subjectivist perspective. However, we also discuss
the way prior and posterior should be viewed if one assumes the frequentist point of view.
Furthermore, we consider point estimators derived from the posterior, credible sets, testing
of hypotheses and Bayesian decision theory. Throughout the chapter, we consider frequentist
methods side-by-side with the Bayesian procedures, for comparison and reference.
It should be stressed that the material presented here covers only the most basic Bayesian
concepts; further reading is recommended. Various books providing overviews of Bayesian
statistics are recommended, depending on the background and interest of the reader: a highly
theoretical treatment can be found in Le Cam (1986) [63], which develops a general, mathematical framework for statistics and decision theory, dealing with Bayesian methods as an
important area of its application. For a more down-to-earth version of this work, applied only
to smooth parametric models, the interested reader is referred to Le Cam and Yang (1990)
[64]. The book by Van der Vaart (1998) [83] contains a chapter on Bayesian statistics focusing
on the Bernstein-Von Mises theorem (see also section 4.4 in these notes). A general reference
of a more decision-theoretic inclination, focusing on Bayesian statistics, is the book by Berger
(1985) [8]; a more recent reference of a similar nature is Bernardo and Smith (1993) [13]. Both
Berger and Bernardo and Smith devote a great deal of attention to philosophical arguments in
favour of the Bayesian approach to statistics, staying rather terse with regard to mathematical
detail and focusing almost exclusively on parametric models. Also recommendable is Robert's The Bayesian choice (2001) [72], which offers a very useful explanation of computational
aspects of Bayesian statistics. Finally, Ripley (1996) [73] discusses Bayesian methods with a
very pragmatic focus on pattern classification. The latter reference relates all material with
applications in mind but does so based on a firm statistical and decision-theoretic background.
Central in the subjectivist Bayesian formulation is a single probability space on which parameter and observation are defined jointly: we assume given a probability measure

Π : σ(G × B) → [0, 1],   (2.1)

on the product of model and samplespace, which is not a product measure. The probability measure Π provides a joint probability distribution for (Y, ϑ), where Y is the observation and ϑ (the random variable associated with) the parameter of the model.
Implicitly the choice for the measure Π defines the model in the Bayesian context, by the possibility to condition on ϑ = θ for some θ ∈ Θ. The conditional distribution Π_{Y|ϑ} : B × Θ → [0, 1] describes the distribution of the observation Y given the parameter ϑ. (For a discussion of conditional probabilities, see appendix A, e.g. definition A.6.3 and theorem A.6.1.) As such, it defines the elements P_θ of the model P = {P_θ : θ ∈ Θ}, although the role they play in the Bayesian context is slightly different from that in a frequentist setting. The question then arises under which requirements the conditional probability Π_{Y|ϑ} is a so-called regular conditional distribution.
Lemma 2.1.1. Assume that Θ is a Polish space and that the σ-algebra G contains the Borel σ-algebra. Let Π be a probability measure, c.f. (2.1). Then the conditional probability Π_{Y|ϑ} is regular.

Proof. The proof is a direct consequence of theorem A.6.1.
The model P is then recovered, Π-almost-surely, through the identification

P_θ(A) = Π_{Y|ϑ}( A | ϑ = θ ),   (A ∈ B),   (2.2)

and the marginal distribution of ϑ on (Θ, G), i.e. G ↦ Π( G × Y ), is called the prior.
The prior is interpreted in the subjectivist's philosophy as the degree of belief attached to subsets of the model a priori, that is, before any observation has been made or incorporated in the calculation. Central in the Bayesian framework is the conditional distribution for ϑ given Y.
Definition 2.1.2. The conditional distribution

Π_{ϑ|Y} : G × Y → [0, 1],   (2.3)

is called the posterior distribution for ϑ given Y. If the model is dominated, with densities p_θ, the posterior can be expressed in terms of prior and likelihood:

Π( ϑ ∈ G | Y ) = ∫_G p_θ(Y) dΠ(θ) / ∫_Θ p_θ(Y) dΠ(θ),   (2.4)

where G ∈ G is a measurable subset of the model P. Note that when expressed through (2.4), the posterior distribution can be calculated based on a choice for the model (which specifies p_θ), a prior Π and the data Y (or a realisation Y = y thereof).
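For a dominated model with Θ ⊂ R, (2.4) can be approximated by discretizing Θ; the following Python sketch (assuming NumPy and SciPy, with an N(0, 1) prior and a normal location model chosen purely for illustration) does so and compares with the known conjugate answer:

```python
import numpy as np
from scipy.stats import norm

y = 1.3                                       # a single observation Y = y
theta = np.linspace(-5.0, 5.0, 2001)          # grid over a compact piece of Theta
dtheta = theta[1] - theta[0]

prior = norm.pdf(theta, loc=0.0, scale=1.0)   # prior density w.r.t. Lebesgue measure
lik = norm.pdf(y, loc=theta, scale=1.0)       # p_theta(y) in the model N(theta, 1)

post = lik * prior                            # numerator of (2.4) on the grid
post /= post.sum() * dtheta                   # denominator of (2.4): normalization

# For this conjugate pair the posterior is N(y/2, 1/2); compare posterior means:
print((theta * post).sum() * dtheta, y / 2)
```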
Based on the above definitions, two remarks are in order with regard to the notion of
a model in Bayesian statistics. First of all, one may choose a large model P, but if for a subset P1 ⊂ P the prior assigns mass zero, then for all practical purposes P1 does not play a role, since omission of P1 from P does not influence the posterior. As long as the model is parametric, i.e. Θ ⊂ R^d, we can always use priors that dominate the Lebesgue measure, ensuring that P1 corresponds to a small subset of R^d. However, in non-parametric models null-sets of the prior and posterior may be much larger than expected intuitively (for a striking example, see section 4.2, specifically the discussion of Freedman's work).
Example 2.1.1. Taking the above argument to the extreme, consider a normal location model P = {N(θ, 1) : θ ∈ R} with a prior Π = δ_{θ1} (see example A.2.2), for some θ1 ∈ R, defined on the Borel σ-algebra B. Then the posterior takes the form:

Π( ϑ ∈ A | Y ) = ∫_A p_θ(Y) dΠ(θ) / ∫ p_θ(Y) dΠ(θ) = (p_{θ1}(Y) / p_{θ1}(Y)) δ_{θ1}(A) = δ_{θ1}(A) = Π(A),

for any A ∈ B. In other words, the posterior equals the prior, concentrating all its mass in the point θ1. Even though we started out with a model that suggests estimation of location,
effectively the model consists of only one point, θ1, due to the choice of the prior. In subjectivist terms, the prior belief is fully biased towards θ1, leaving no room for amendment by the data when we condition to obtain the posterior.
This example raises the question which part of the model proper P plays a role. In that
respect, it is helpful to make the following definition.
Definition 2.1.3. In addition to (Θ, G, Π) being a probability space, let (Θ, T) be a topological space. Assume that G contains the Borel σ-algebra B corresponding to the topology T. The support supp(Π) of the prior Π is defined as:

supp(Π) = ∩ { G ∈ G : G closed, Π(G) = 1 }.
The viability of the above definition is established in the following lemma.
Lemma 2.1.2. For any topological space Θ with σ-algebra G that contains the Borel σ-algebra B and any (prior) probability measure Π : G → [0, 1], supp(Π) ∈ G and Π(supp(Π)) = 1.

Note that supp(Π) is closed, as it is an intersection of closed sets, so that supp(Π) ∈ B ⊂ G. The proof that the support has measure 1 is left as exercise 2.7.
In example 2.1.1, the model P consists of all normal distributions of the form N(θ, 1), θ ∈ R, but the support of the prior supp(Π) equals the singleton {N(θ1, 1)} ⊂ P.
Note that the support of the prior is defined based on a topology, the Borel σ-algebra of which must belong to the domain of the prior measure. In parametric models this assumption is rarely problematic, but in non-parametric models finding such a prior may be difficult and the support may be an ill-defined concept. Therefore we may choose to take a less precise but more generally applicable perspective: the model is viewed as the support of the prior Π, but only up to Π-null-sets (c.f. the Π-almost-sure nature of the identification (2.2)). That means that we may add to or remove from the model at will, as long as we make sure that the changes have prior measure equal to zero: the model itself is a Π-almost-sure concept. (Since the Bayesian procedure involves only integration of integrable functions with respect to the prior, adding or removing Π-null-sets to/from the domain of integration will not have unforeseen consequences.)
To many who have been introduced to statistics from the frequentist point of view, including the parameter for the model as a random variable seems somewhat unnatural because
the frequentist role of the parameter is entirely different from that of the data. The following
example demonstrates that in certain situations the Bayesian point of view is not unnatural
at all.
Example 2.1.2. In the posthumous publication of An essay towards solving a problem in the
doctrine of chances in 1763 [4], Thomas Bayes included an example of a situation in which
the above, subjectivist perspective arises quite naturally. It involves a number of red balls and
one white ball placed on a table and has become known in the literature as Bayes' billiard.
We consider the following experiment: unseen by the statistician, someone places n red
balls and one white ball on a billiard table of length 1. Calling the distance between the white
ball and the bottom cushion of the table X and the distances between the red balls and the
bottom cushion Yi , (i = 1, . . . , n), it is known to the statistician that their joint distribution
is:
(X, Y1, . . . , Yn) ∼ U[0, 1]^{n+1},   (2.5)
i.e. all balls are placed independently with uniform distribution. The statistician is told the number K of red balls that lie closer to the cushion than the white ball (the data, denoted Y in the rest of this section) and is asked to give a distribution reflecting his beliefs concerning the position of the white ball X (the parameter, denoted θ in the rest of this section) based on K. His prior knowledge concerning X (i.e. without knowing the observed value K = k) offers little information: the best that can be said is that X ∼ U[0, 1], the marginal distribution of X, i.e. the prior. The question is how this distribution for X changes when we incorporate the observation K = k, that is, when we use the observation to arrive at our posterior beliefs based on our prior beliefs.
Since for every i, Yi and X are independent, c.f. (2.5), we have

P( Yi ≤ X | X = x ) = P( Yi ≤ x ) = x.
So for each of the red balls, determining whether it lies closer to the cushion than the white
ball amounts to a Bernoulli experiment with parameter x. Since in addition the positions
Y1 , . . . , Yn are independent, counting the number K of red balls closer to the cushion than
the white ball amounts to counting successes in a sequence of independent Bernoulli experiments. We conclude that K has a binomial distribution Bin(n; x), i.e.
P( K = k | X = x ) = (n! / (k! (n − k)!)) x^k (1 − x)^{n−k}.
It is possible to obtain the density for the distribution of X conditional on K = k from the
above display using Bayes' rule (c.f. lemma A.6.2):

p( x | K = k ) = P( K = k | X = x ) p(x) / P( K = k ),   (2.6)
but in order to use it, we need the two marginal densities p(x) and P (K = k) in the fraction.
From (2.5) it is known that p(x) = 1 and P(K = k) can be obtained by integrating

P( K = k ) = ∫_0^1 P( K = k | X = x ) p(x) dx.

Substituting in (2.6), we find:

p( x | K = k ) = P( K = k | X = x ) p(x) / ∫_0^1 P( K = k | X = x ) p(x) dx = B(n, k) x^k (1 − x)^{n−k},

where B(n, k) is a normalization factor. The x-dependence of the density in the above display reveals that X | K = k is distributed according to a Beta-distribution, B(k + 1, n − k + 1), so that the normalization factor B(n, k) must equal B(n, k) = Γ(n + 2) / (Γ(k + 1) Γ(n − k + 1)).
This provides the statistician with distributions reflecting his beliefs concerning the position
of the white ball for all possible values k for the observation K. Through conditioning on
K = k, the prior distribution of X is changed: if a relatively small number of red balls is
closer to the cushion than the white ball (i.e. in case k is small compared to n), then the white
ball is probably close to the cushion; if k is relatively large, the white ball is probably far from
the cushion (see figure 2.1).
Figure 2.1 Posterior densities for the position X of the white ball, given the number k of red balls closer to the cushion of the billiard (out of a total of n = 6 red balls), for k = 0, 1, . . . , 6. For the lower values of k, the white ball is close to the cushion with high probability, since otherwise more red balls would probably lie closer to the cushion. This is reflected by the posterior density for X|K = 1, for example, by the fact that it concentrates much of its mass close to x = 0.
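The Beta-form of the posterior in Bayes' billiard can also be verified by simulation: generate many tables, keep those with K = k, and compare. A minimal sketch in Python (NumPy only; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, reps = 6, 1, 200_000

# Simulate many tables: X and Y_1, ..., Y_n i.i.d. uniform on [0, 1].
x = rng.uniform(size=reps)
y = rng.uniform(size=(reps, n))
K = (y <= x[:, None]).sum(axis=1)        # number of red balls closer to the cushion

x_given_k = x[K == k]                    # sample of X conditional on K = k
print("simulated mean:", x_given_k.mean())
print("Beta(k+1, n-k+1) mean:", (k + 1) / (n + 2))    # = 0.25 for k = 1, n = 6
```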
In many experiments or observations, the data consists of a sample of n repeated, stochastically independent measurements of the same quantity. To accommodate this situation formally, we choose Y equal to the n-fold product of a sample space X endowed with a σ-algebra A, so that the observation takes the form Y = (X1, X2, . . . , Xn). The additional assumption that the sample is i.i.d. (presently a statement concerning the conditional independence of the observations given ϑ = θ) then reads:

Π_{Y|ϑ}( X1 ∈ A1, . . . , Xn ∈ An | ϑ = θ ) = ∏_{i=1}^{n} Π_{Y|ϑ}( Xi ∈ Ai | ϑ = θ ) = ∏_{i=1}^{n} P_θ(Xi ∈ Ai).
Under this assumption, the posterior (2.4) takes the form:

Π_n( ϑ ∈ G | X1, X2, . . . , Xn ) = ∫_G ∏_{i=1}^{n} p_θ(Xi) dΠ(θ) / ∫_Θ ∏_{i=1}^{n} p_θ(Xi) dΠ(θ),   (2.7)

with Π-density:

(dΠ( · | X1, . . . , Xn ) / dΠ)(θ) = ∏_{i=1}^{n} p_θ(Xi) / ∫_Θ ∏_{i=1}^{n} p_θ(Xi) dΠ(θ),   (P0^n-a.s.).   (2.8)
The latter fact explains why such strong relations exist (e.g. the Bernstein-Von Mises theorem,
theorem 4.4.1) between Bayesian and maximum-likelihood methods. Indeed, the proportionality of the posterior density and the likelihood provides a useful qualitative picture of the
posterior as a measure that concentrates on regions in the model where the likelihood is relatively high. This may serve as a direct motivation for the use of Bayesian methods in a
frequentist context, c.f. section 1.3. Moreover, this picture gives a qualitative explanation of
the asymptotic behaviour of Bayesian methods: under suitable continuity-, differentiability- and tail-conditions, the likelihood remains relatively high in small neighbourhoods of P0 and
drops off steeply outside in the large-sample limit. Hence, if the prior mass in those neighbourhoods is not too small, the posterior concentrates its mass in neighbourhoods of P0 , leading to
the asymptotic behaviour described in chapter 4.
Returning to the distributions that play a role in the subjectivist Bayesian formulation,
there exists also a marginal for the observation Y .
Definition 2.1.4. The distribution P_Π^n : B → [0, 1] defined by

P_Π^n( X1 ∈ A1, . . . , Xn ∈ An ) = ∫_Θ ∏_{i=1}^{n} P_θ(Ai) dΠ(θ)   (2.9)

is called the prior predictive distribution. Replacing the prior by the posterior gives rise, analogously, to the posterior predictive distribution:

P_Π^{n,m}( Xn+1 ∈ An+1, . . . , Xn+m ∈ An+m | X1, . . . , Xn ) = ∫_Θ ∏_{i=1}^{m} P_θ(An+i) dΠ( θ | X1, . . . , Xn ).
The prior predictive distribution is subject to correction by observation through substitution of the prior by the posterior: the resulting posterior predictive distribution is interpreted as the Bayesian's expectation concerning the distribution of the observations Xn+1, Xn+2, . . . , Xn+m given the observations X1, X2, . . . , Xn and the prior Π.
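Predictive distributions are conveniently handled by sampling: first draw θ from the prior (respectively the posterior), then draw the new observations from P_θ. A sketch in Python (NumPy) for the Bernoulli model with uniform prior, where the posterior is the Beta-distribution found in example 2.1.2; the data vector below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Model: X_i | theta ~ Bernoulli(theta); prior: theta ~ U[0, 1] = Beta(1, 1).
data = np.array([1, 1, 0, 1, 1, 0, 1, 1])              # observed X_1, ..., X_n
a, b = 1 + data.sum(), 1 + len(data) - data.sum()      # posterior is Beta(a, b)

# Posterior predictive for the next m observations: draw theta from the
# posterior, then X_{n+1}, ..., X_{n+m} ~ Bernoulli(theta) given theta.
m, reps = 3, 100_000
theta = rng.beta(a, b, size=reps)
x_new = rng.binomial(1, theta[:, None], size=(reps, m))

# E.g. the predictive probability that all m future observations equal 1:
print(np.mean(x_new.sum(axis=1) == m))
```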
Remark 2.1.2. The form of the prior predictive distribution is the subject of de Finetti's theorem (see theorem A.2.2), which says that the distribution of a sequence (X1, . . . , Xn) of random variables is of the form on the r.h.s. of the above display (with uniquely determined prior Π) if and only if the sample (X1, . . . , Xn) is exchangeable, that is, if and only if the joint distribution for (X1, . . . , Xn) equals that of (X_{σ(1)}, . . . , X_{σ(n)}) for all permutations σ of n elements.
Remark 2.1.3. We conclude the discussion of the distributions that play a role in Bayesian
statistics with the following important point: at no stage during the derivation above, was
an underlying distribution of the data used or needed! For comparison we turn to assumption (1.1), which is fundamental in the frequentist approach. More precisely, the assumption
preceding (2.1) (c.f. the subjectivist Bayesian approach) is at odds with (1.1), unless

P0^n = P_Π^n = ∫_Θ P_θ^n dΠ(θ).
Note, however, that the l.h.s. is a product-measure, whereas on the r.h.s. only exchangeability
is guaranteed! (Indeed, the equality in the above display may be used as the starting point
for definition of a goodness-of-fit criterion for the model and prior (see section 3.3). The
discrepancy in the previous display makes the pure (e.g. subjectivist) Bayesian reluctant to
assume the existence of a distribution P0 for the sample.)
The distribution P0 could not play a role in our analysis if we did not choose to adopt assumption (1.1). In many cases we shall assume that Y contains an i.i.d. sample of observations X1, X2, . . . , Xn where Xi ∼ P0 (so that Y ∼ P0^n). Indeed, if we would not make this assumption, asymptotic considerations like those in chapter 4 would be meaningless. However,
adopting (1.1) leaves us with questions concerning the background of the quantities defined
in this section because they originate from the subjectivist Bayesian framework.
Remark 2.1.4. (Bayesian/frequentist hybrid approach) Maintaining the frequentist assumption that Y ∼ P0 for some P0 requires that we revise our approach slightly: throughout the rest of these lecture notes, we shall assume (1.1) and require the model P to be a probability space (P, G, Π) with a probability measure Π referred to as the prior. So the prior is introduced as a measure on the model, rather than emergent as a marginal to a product-space measure. Model and sample space are left in the separate roles they are assigned by the frequentist. We then
proceed to define the posterior by expression (2.7). Regularity of the posterior is imposed (for
a more detailed analysis, see Schervish (1995) [75] and Barron, Schervish and Wasserman
(1999) [7]). In that way, we combine a frequentist perspective on statistics with Bayesian
21
methodology: we make use of Bayesian quantities like prior and posterior, but analyze them
from a frequentist perspective.
Remark 2.1.5. In places throughout these lecture notes, probability measures P are decomposed into a P0-absolutely-continuous part P∥ and a P0-singular part P⊥. Following Le Cam, we use the convention that if P is not dominated by P0, the Radon-Nikodym derivative refers to the P0-absolutely-continuous part only: dP/dP0 = dP∥/dP0. (See theorem A.4.2.) With this in mind, we write the posterior as follows:

Π( ϑ ∈ A | X1, X2, . . . , Xn ) = ∫_A ∏_{i=1}^{n} (dP_θ/dP0)(Xi) dΠ(θ) / ∫_Θ ∏_{i=1}^{n} (dP_θ/dP0)(Xi) dΠ(θ),   (P0^n-a.s.).   (2.10)
Since the data X1 , X2 , . . . are i.i.d.-P0 -distributed, the P0 -almost-sure version of the posterior
in the above display suffices. Alternatively, any σ-finite measure that dominates P0 may be
used instead of P0 in (2.10) while keeping the definition P0n -almost-sure. Such P0 -almost sure
representations are often convenient when deriving proofs.
In cases where the model is not dominated, (2.10) may be used as the definition of the
posterior measure but there is no guarantee that (2.10) leads to sensible results!
Example 2.1.3. Suppose that the samplespace is R and the model P consists of all measures of the form (see example A.2.2):

P = ∑_{j=1}^{m} λ_j δ_{x_j},   (2.11)

for some m ≥ 1, where λ_j ≥ 0, ∑_{j=1}^{m} λ_j = 1 and x1, . . . , xm ∈ R. A suitable prior for this model exists: distributions drawn from the so-called Dirichlet process prior are of the form (2.11) with (prior) probability one. There is no σ-finite dominating measure for this model and hence the model can not be represented by a family of densities, c.f. definition 1.1.3. In addition, if the true distribution P0 for the observation is also a convex
combination of Dirac measures, distributions in the model are singular with respect to P0 unless
they happen to have support-points in common with P0 . Consequently definition (2.10) does
not give sensible results in this case. We have to resort to the subjectivist definition (2.3) in
order to make sense of the posterior distribution.
To summarize, the Bayesian procedure consists of the following steps
(i) Based on the background of the data Y, the statistician chooses a model P, usually with some measurable parameterization Θ → P : θ ↦ P_θ.

(ii) A prior measure Π on P is chosen, based either on subjectivist or objectivist criteria. Usually a measure on Θ is defined, inducing a measure on P.
(iii) Based on (2.3), (2.4) or, in the case of an i.i.d. sample Y = (X1, X2, . . . , Xn), on

dΠ_n( θ | X1, X2, . . . , Xn ) = ∏_{i=1}^{n} p_θ(Xi) dΠ(θ) / ∫_Θ ∏_{i=1}^{n} p_θ(Xi) dΠ(θ),

the statistician calculates the posterior.
In order to derive point estimators from the posterior, a notion of location for distributions is needed. Consider, for example, a probability measure P on (R, B) that divides its mass evenly between two small intervals centred at the points −1 and +1. Reasonable definitions of location, like the mean and the median of P, all assign as the location of P the point 0 ∈ R. Yet small neighbourhoods of 0 do not receive any P-mass, so 0 can hardly be viewed as a point around which P concentrates its mass. The intuition of a distribution's location can be made concrete without complications of the above kind by means of the following notions. The posterior mean is the probability measure P̄ on (Y, B) defined by

P̄(B) = ∫_P P(B) dΠ( P | Y ),   (2.12)

for all B ∈ B; countable additivity of P̄ follows from that of the P ∈ P, since for disjoint B1, B2, . . . ∈ B,

P̄( ∪_{i≥1} Bi ) = ∫_P ∑_{i≥1} P(Bi) dΠ( P | Y ) = ∑_{i≥1} ∫_P P(Bi) dΠ( P | Y ) = ∑_{i≥1} P̄(Bi).

In a parameterized model, the parametric posterior mean is defined as

θ̂1(Y) = ∫_Θ θ dΠ( θ | Y ),   (2.13)

P0^n-almost-surely.
Remark 2.2.4. The distinction between the posterior mean and the parametric posterior
mean, as made above, is non-standard: it is customary in the Bayesian literature to refer to
either as the posterior mean. See, however, example 2.2.1.
Example 2.2.1. Consider the model P = { P_θ : θ ∈ [0, 2π) } of distributions for Y in R² with expectations P_θY = (cos θ, sin θ) on the unit circle, and suppose the posterior Π( · | Y ) is the uniform distribution on [0, 2π). Then the expectation of Y under the posterior mean P̄ equals

P̄Y = ∫ Y dP̄ = ∫_P P Y dΠ(P) = ∫_Θ P_θ Y dΠ(θ) = (1/2π) ∫_0^{2π} (cos θ, sin θ) dθ = (0, 0).

Note that none of the distributions in P has the origin as its expectation. We can also calculate the expectation of Y under P_{θ̂1} in this situation:

θ̂1(Y) = ∫_Θ θ dΠ(θ) = (1/2π) ∫_0^{2π} θ dθ = π,

which leads to P_{θ̂1(Y)} Y = (cos π, sin π) = (−1, 0). Clearly, the posterior mean does not equal the point in the model corresponding to the parametric posterior mean. In fact, we see from the above that P̄ ∉ P.
The fact that the expectations of P̄ and P_{θ̂1} in example 2.2.1 differ makes it clear that, in general,

P̄ ≠ P_{θ̂1},

unless special circumstances apply: if we consider a parameterization θ ↦ P_θ from a (closed, convex) parameterizing space Θ with posterior measure Π(dθ) onto a space of probability measures P (with induced posterior Π(dP)), it makes a difference whether we consider the posterior mean as defined in (2.12), or calculate P_{θ̂1}. The parametric posterior mean P_{θ̂1} lies in the model P; P̄ lies in the closed convex hull co(P) of the model, but not necessarily P̄ ∈ P.
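The computation in example 2.2.1 can be replicated numerically; a sketch in Python (NumPy), with the uniform distribution on [0, 2π) standing in for the posterior, as in the example:

```python
import numpy as np

theta = np.linspace(0.0, 2 * np.pi, 100_000, endpoint=False)
w = np.full(theta.shape, 1.0 / theta.size)     # uniform posterior weights on [0, 2*pi)

# Expectation of Y under the posterior mean: average the model expectations.
expectations = np.column_stack([np.cos(theta), np.sin(theta)])
print((w[:, None] * expectations).sum(axis=0))          # ~ (0, 0)

# Expectation of Y under the model point at the parametric posterior mean:
theta_bar = (w * theta).sum()                           # ~ pi
print(np.cos(theta_bar), np.sin(theta_bar))             # ~ (-1, 0)
```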
Since there are multiple ways of defining the location of a distribution, there are more ways of obtaining point-estimators from the posterior distribution. For example, in a one-dimensional parametric model we can consider the posterior median, defined by

θ̃(Y) = inf{ s ∈ R : Π( θ ≤ s | Y ) ≥ 1/2 },

i.e. the smallest value for θ such that the posterior mass to its left is greater than or equal to 1/2. (Note that this definition simplifies in case the posterior has a continuous, strictly monotone distribution function: in that case the median equals the (unique) point where this distribution function equals 1/2.) More generally, we consider the following class of point-estimators [63].
Definition 2.2.3. Let P be a model with metric d : P × P → R and a prior Π on a σ-algebra G containing the Borel σ-algebra corresponding to the metric topology on P. Let ℓ : R → R be a convex loss-function. The formal Bayes estimator is the minimizer of the function

P → R : P ↦ ∫_P ℓ(d(P, Q)) dΠ( Q | Y ),

over the model P (provided that such a minimizer exists and is unique).
The heuristic idea behind formal Bayes estimators is decision-theoretic (see section 2.4). Ideally, one would like to estimate P0 by a point P in P such that ℓ(d(P, P0)) is minimal; if P0 ∈ P, this would lead to P̂ = P0. However, lacking specific knowledge on P0, we choose to represent it by averaging over P weighted by the posterior, leading to the notion in definition 2.2.3. Another useful point estimator based on the posterior is defined as follows.
Definition 2.2.4. Let the data Y with model P, metric d and prior Π be given. Suppose that the σ-algebra on which Π is defined contains the Borel σ-algebra generated by the metric topology. For given ε > 0, the small-ball estimator is defined to be the maximizer of the function

P ↦ Π( B_d(P, ε) | Y ),   (2.14)

over the model, where B_d(P, ε) is the d-ball in P of radius ε centred on P (provided that such a maximizer exists and is unique).
Remark 2.2.5. Similarly to definition 2.2.4, for a fixed value p such that 1/2 < p < 1, we
may define a Bayesian point estimator as the centre point of the smallest d-ball with posterior
mass greater than or equal to p (if it exists and is unique (see also, exercise 2.6)).
If the posterior is dominated by a σ-finite measure μ, the posterior density with respect to μ can be used as a basis for defining Bayesian point estimators.
Definition 2.2.5. Let P be a model with prior Π. Assume that the posterior is absolutely continuous with respect to a σ-finite measure μ on P. Denote the μ-density of Π( · | Y ) by θ ↦ π(θ|Y). The maximum-a-posteriori estimator (or MAP-estimator, or posterior mode) θ̂2 for θ is defined as the point in the model where the posterior density takes on its maximal value (provided that such a point exists and is unique):

π(θ̂2|Y) = sup_{θ ∈ Θ} π(θ|Y).   (2.15)
Remark 2.2.6. The MAP-estimator has a serious weak point: a different choice of dominating measure leads to a different MAP estimator! A MAP-estimator is therefore unspecified
unless we specify also the dominating measure used to obtain a posterior density. It is with
respect to this dominating measure that we define our estimator, so a motivation for the dominating measure used is inherently necessary (and often conspicuously lacking). Often the
Lebesgue measure is used without further comment, or objective measures (see section 3.2)
are used. Another option is to use the prior measure as the dominating measure, in which
case the MAP estimator equals the maximum-likelihood estimator (see remark 2.2.7).
All Bayesian point estimators defined above as maximizers or minimizers over the model
suffer from the usual existence and uniqueness issues associated with extrema. However, there
are straightforward methods to overcome such issues. We illustrate using the MAP-estimator.
Questions concerning the existence and uniqueness of MAP-estimators should be compared
to those of the existence and uniqueness of M -estimators in frequentist statistics. Although
it is hard to formulate conditions of a general nature to guarantee that the MAP-estimator
exists, often one can use the following lemma to guarantee existence.
Lemma 2.2.1. Consider a parameterized model Θ → P : θ ↦ P_θ. If Θ is compact¹ and the posterior density θ ↦ π(θ|Y) is upper-semi-continuous (P0^n-almost-surely), then the posterior density takes on its maximum in some point in Θ, P0^n-almost-surely.
To prove uniqueness, one has to be aware of various possible problems, among which
are identifiability of the model (see section 1.1, in particular definition 1.1.5). Considerations like this are closely related to matters of consistency of M-estimators, e.g. Wald's
consistency conditions for the maximum-likelihood estimator. The crucial property is called
well-separatedness of the maximum, which says that outside neighbourhoods of the maximum,
the posterior density must be uniformly strictly below the maximum. The interested reader
is referred to chapter 5 of van der Vaart (1998) [83], e.g. theorems 5.7 and 5.8.
Remark 2.2.7. There is an interesting connection between (Bayesian) MAP-estimation and
(frequentist) maximum-likelihood estimation. Referring to formula (2.7) we see that in an
i.i.d. experiment with parametric model, the MAP-estimator maximizes:
Θ → R : θ ↦ ∏_{i=1}^{n} p_θ(Xi) π(θ),
¹Compactness of Θ is a restrictive condition in many statistical problems, especially when the model is non-parametric. However, in a Bayesian context Ulam's theorem (see theorem A.2.3) offers a way to relax this condition.
where it is assumed that the model is dominated and that the prior has a density π with respect to the Lebesgue measure μ. If the prior had been uniform, the last factor would have dropped out and maximization of the posterior density would amount to maximization of the likelihood. Therefore, differences between ML and MAP estimators are entirely due to non-uniformity of the prior. Subjectivist interpretation aside, prior non-uniformity has an interpretation in the frequentist setting as well, through what is called penalized maximum-likelihood estimation (see Van de Geer (2000) [39]): Bayes' rule (see lemma A.6.2) applied to the posterior density π_n(θ|X1, . . . , Xn) gives:

log π_n(θ|X1, . . . , Xn) = log p_n(X1, . . . , Xn|θ) + log π(θ) + D(X1, . . . , Xn),

where D is a (θ-independent, but stochastic) normalization constant. The first term equals the log-likelihood and the logarithm of the prior plays the role of a penalty term when maximizing over θ. Hence, maximizing the posterior density over the model Θ can be identified with maximization of a penalized likelihood over Θ. So defining a penalized MLE θ̂n with the logarithm of the prior density θ ↦ log π(θ) in the role of the penalty, the MAP-estimator coincides with θ̂n. The above offers a direct connection between Bayesian and frequentist methods of point-estimation. As such, it provides a frequentist interpretation of the prior, as a penalty in the ML procedure. The asymptotic behaviour of the MAP-estimator is discussed in chapter 4 (see theorem 4.4.2).
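The correspondence between MAP estimation and penalized ML can be made concrete in the normal location model with an N(0, τ²) prior, where the log-prior penalty is quadratic; a sketch in Python (NumPy; the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=1.5, scale=1.0, size=50)    # X_i | theta ~ N(theta, 1)
tau2 = 0.5                                     # prior: theta ~ N(0, tau2)

# Both log-likelihood and log-prior are quadratic in theta, so the MAP
# estimator (= the penalized MLE) is available in closed form:
n = len(x)
theta_map = x.sum() / (n + 1 / tau2)

# The same number from maximizing log-likelihood + log-prior over a grid:
grid = np.linspace(-1.0, 3.0, 20_001)
pen_loglik = -0.5 * ((x[:, None] - grid) ** 2).sum(axis=0) - grid**2 / (2 * tau2)
print(theta_map, grid[np.argmax(pen_loglik)])
```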
The goal of Neyman-Pearson hypothesis testing is to find out whether or not the data contains enough evidence to reject the null hypothesis as a likely explanation when compared to alternative
evidence to reject the null hypothesis as a likely explanation when compared to alternative
explanations. Sufficiency of evidence is formulated in terms of statistical significance.
For simplicity, we consider a so-called simple null hypothesis (i.e. a hypothesis consisting of only one point in the model, which is assumed to be identifiable): let a certain point θ1 ∈ Θ be given and consider the hypotheses:

H0 : θ0 = θ1,    H1 : θ0 ≠ θ1,
where H0 denotes the null-hypothesis and H1 the alternative. By no means does frequentist
hypothesis testing equate to the corresponding classification problem, in which one would
treat H0 and H1 symmetrically and make a choice for one or the other based on the data (for
more on frequentist and Bayesian classification, see section 2.4).
To assess both hypotheses using the data, the simplest version of the Neyman-Pearson method of hypothesis testing seeks to find a test-statistic T(Y) ∈ R displaying different behaviour depending on whether the data Y is distributed according to (a distribution in) H0 or in H1. To make this distinction more precise, we define a so-called critical region K ⊂ R, such that P_{θ1}(T ∈ K) is small and P_θ(T ∉ K) is small for all θ ≠ θ1. What we mean by small probabilities in this context is a choice for the statistician: a so-called significance level is to be chosen to determine when these probabilities are deemed small. That way, upon realization Y = y, a distribution in the hypothesis H0 makes an outcome T(y) ∈ K improbable compared to H1.
Definition 2.3.1. Let Θ → P : θ ↦ P_θ be a parameterized model for a sample Y. Formulate two hypotheses H0 and H1 by introducing a two-set partition {Θ0, Θ1} of the model Θ:

H0 : θ0 ∈ Θ0,    H1 : θ0 ∈ Θ1.

We say that a test for these hypotheses based on a test-statistic T with critical region K is of level α ∈ (0, 1) if the power function π : Θ → [0, 1], defined by

π(θ) = P_θ( T(Y) ∈ K ),

is uniformly small over Θ0:

sup_{θ ∈ Θ0} π(θ) ≤ α.   (2.16)
From the above definition we arrive at the conclusion that if Y = y and T(y) ∈ K,
hypothesis H0 is improbable enough to be rejected, since H0 forms an unlikely explanation
of observed data (at said significance level). The degree of unlikeliness can be quantified in
terms of the so-called p-value, which is the lowest significance level at which the realised value
of the test statistic T (y) would have led us to reject H0 . Of course there is the possibility
that our decision is wrong and H0 is actually true but T(y) ∈ K nevertheless, so that our
rejection of the null hypothesis is unwarranted. This is called a type-I error ; a type-II error is
made when we do not reject H0 while H0 is not true. The significance level thus represents
a fixed upper-bound for the probability of a type-I error. Having found a collection of tests
displaying the chosen significance level, the Neyman-Pearson approach calls for subsequent
minimization of the type-II error probability, i.e. of all the pairs (T, K) satisfying (2.16), one prefers a pair that minimizes P_θ( T(Y) ∉ K ), ideally uniformly in θ ∈ Θ1. However, generically
such uniformly most-powerful tests do not exist due to the possibility that different (T, K)
pairs are most powerful over distinct subsets of the alternative. The famed Neyman-Pearson
lemma [60] asserts that a most powerful test exists in the case Θ contains only two points and
can be extended to obtain uniformly most powerful tests in certain models.
We consider the Neyman-Pearson approach to testing in some more detail in the following
example while also extending the argument to the asymptotic regime. Here Y is an i.i.d.
sample and the test-statistic and critical region are dependent on the size n of this sample.
We investigate the behaviour of the procedure in the limit n .
Example 2.3.1. Suppose that the data Y forms an i.i.d. sample from a distribution P0 = P_{θ0} and that P_θX = θ for all θ ∈ Θ. Moreover, assume that P_θX² < ∞ for all θ ∈ Θ. Due to the law of large numbers, the sample-average

Tn(X1, . . . , Xn) = P_n X,

converges to θ under P_θ (for all θ ∈ Θ) and seems a suitable candidate for the test-statistic, at least in the regime where the sample-size n is large (i.e. asymptotically). The central limit theorem allows us to analyze matters in greater detail: for all s ∈ R,

P_θ^n( G_n X ≤ s σ(θ) ) → Φ(s),   (n → ∞),   (2.17)

where σ(θ) denotes the standard deviation of X under P_θ. For simplicity, we assume that θ ↦ σ(θ) is a known quantity in this derivation. The limit (2.17) implies that

P_θ^n( Tn(X1, . . . , Xn) ≤ θ + n^{−1/2} σ(θ) s ) → Φ(s),   (n → ∞).

Assuming that H0 holds, i.e. that θ0 = θ1, we then find that, given an asymptotic significance level α ∈ (0, 1) and with the standard-normal quantiles denoted s_α,

P0^n( Tn(X1, . . . , Xn) ≤ θ1 + n^{−1/2} σ(θ1) s_{α/2} ) → 1 − ½α,   (n → ∞).

For significance levels α close to zero, we see that under the null-hypothesis it is improbable to observe Tn > θ1 + n^{−1/2} σ(θ1) s_{α/2}. It is equally improbable to observe Tn < θ1 − n^{−1/2} σ(θ1) s_{α/2}, which means that we can take as our critical region K_{n,α},

K_{n,α} = R \ [ θ1 − n^{−1/2} σ(θ1) s_{α/2}, θ1 + n^{−1/2} σ(θ1) s_{α/2} ].

(Note that this choice for the critical region is not unique unless we impose that it be an interval located symmetrically around θ1.) Then we are in a position to formulate our decision on the null hypothesis: we reject H0 if Tn ∈ K_{n,α} and do not reject it otherwise.
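The asymptotic level of the critical region K_{n,α} is easily checked by simulation; a sketch in Python (NumPy and SciPy), taking a normal model with σ(θ) = 1 purely for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
theta1, n, alpha, reps = 0.0, 100, 0.05, 50_000
s = norm.ppf(1 - alpha / 2)                   # standard-normal quantile s_{alpha/2}

x = rng.normal(loc=theta1, scale=1.0, size=(reps, n))   # samples generated under H0
t = x.mean(axis=1)                                      # test-statistic T_n

reject = np.abs(t - theta1) > s / np.sqrt(n)            # event {T_n in K_{n,alpha}}
print(reject.mean())                                    # ~ alpha = 0.05
```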
Definition 2.3.2. Formulate the hypotheses as a partition of the model P into P0 and P1:

H0 : P0 ∈ P0,    H1 : P0 ∈ P1.

A test sequence (φn)_{n≥1} is a sequence of statistics φn : X^n → [0, 1] (for all n ≥ 1). An asymptotic test is defined as a criterion for the decision to reject H0 or not, based on (a realization of) φn(X1, . . . , Xn), and is studied in the limit n → ∞.
An example of such a criterion is the procedure given in definition 2.3.1 and example 2.3.1,
where test-functions take on the values zero or one depending on the (realized) test-statistic
and the critical region. When we replace indicators by test functions as in definition 2.3.2,
criteria may vary depending on the nature of the test functions used.
Definition 2.3.3. Extending definition 2.3.2, we define the power function sequence of the
test sequence (φₙ) as the sequence of maps πₙ : 𝒫 → [0, 1] on the model, defined by:

πₙ(P) = P φₙ.
Like in definition 2.3.1, the quality of the test depends on the behaviour of the power
sequence on P0 and P1 respectively. If we are interested exclusively in rejection of the null
hypothesis, we could reason like in definition 2.3.1 and set a significance level α ∈ (0, 1) to select only
those test sequences that satisfy

sup_{P ∈ 𝒫₀} πₙ(P) ≤ α.
Subsequently, we prefer test sequences that have high power on the alternative. For example,
if we have two test sequences (φₙ) and (ψₙ) and a point P ∈ 𝒫₁ such that

lim_{n→∞} P φₙ ≥ lim_{n→∞} P ψₙ,   (2.18)

then (φₙ) is preferred over (ψₙ) at P. If both hypotheses carry weight, one requires that the quantity

sup_{P ∈ 𝒫₁} P(1 − φₙ)   (2.19)

(which is sometimes also referred to as the power function) is small in the limit n → ∞,
possibly quantified by introduction of a significance level pertaining to both type-I and type-II
errors simultaneously. In many proofs of Bayesian limit theorems (see chapter 4), a test
sequence (φₙ) is needed such that (2.19) goes to zero, or is bounded by a sequence (aₙ)
decreasing to zero (typically aₙ = e^{−nD} for some D > 0). The existence of such test sequences
forms the subject of section 4.5.
Closely related to hypothesis tests are confidence intervals. Suppose that we pose our inferential
problem differently: our interest now lies in using the data Y ~ P₀ to find a data-dependent
subset C(Y) of the model that contains P₀ with high probability. Again, high
probability requires quantification in terms of a level α ∈ (0, 1), called the confidence level.
Definition 2.3.4. Let Θ → 𝒫 : θ ↦ P_θ be a parametrized model; let Y ~ P_θ₀ for some
θ₀ ∈ Θ. Choose a confidence level α ∈ (0, 1). Let C(Y) be a subset of Θ dependent only on the
data Y. Then C(Y) is a confidence region for θ of confidence level α, if

P_θ( θ ∈ C(Y) ) ≥ 1 − α,   (2.20)

for all θ ∈ Θ.
The dependence of C on the data Y is meant to express that C(Y ) can be calculated once
the data has been observed. The confidence region may also depend on other quantities that
are known to the statistician, so C(Y ) is a statistic (see definition 1.1.9). Note also that the
dependence of C(Y ) on the data Y makes C(Y ) a random subset of the model. Compare this
to point estimation, in which the data-dependent estimator is a random point in the model.
Like hypothesis testing, confidence regions can be considered from an asymptotic point of
view, as demonstrated in the following example.
Example 2.3.2. We consider the experiment of example 2.3.1, i.e. we suppose that the data
Y forms an i.i.d. sample from a distribution P₀ = P_θ₀, θ₀ ∈ Θ ⊂ ℝ, and that P_θ X = θ for all θ ∈ Θ.
Moreover, we assume that for some known constant S > 0, σ²(θ) = Var_θ X ≤ S², for all
θ ∈ Θ. Consider the sample-average Tₙ(X₁, …, Xₙ) = ℙₙX. Choose a confidence level α ∈
(0, 1). The limit (2.17) can be rewritten in the following form:

P_θⁿ( |Tₙ(X₁, …, Xₙ) − θ| ≤ n^{−1/2} σ(θ) s_{α/2} ) → 1 − α,  (n → ∞).  (2.21)

Define Cₙ by

Cₙ(X₁, …, Xₙ) = [ Tₙ(X₁, …, Xₙ) − n^{−1/2} S s_{α/2}, Tₙ(X₁, …, Xₙ) + n^{−1/2} S s_{α/2} ].

Then, for all θ ∈ Θ,

lim_{n→∞} P_θⁿ( θ ∈ Cₙ(X₁, …, Xₙ) ) ≥ 1 − α.

Note that the θ-dependence of σ(θ) would violate the requirement that Cₙ be a statistic: since
the true value θ₀ of θ is unknown, so is σ(θ₀). Substituting the (known) upper-bound S for
σ(θ) enlarges the interval that follows from (2.21) to its maximal extent, eliminating the
θ-dependence. In a realistic situation, one would not use S but substitute σ(θ) by an estimator
σ̂(Y), which amounts to the use of a plug-in version of (2.21). As a result, we would also have
to replace the standard-normal quantiles s_α by the quantiles of the Student t-distribution.
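In code, the plug-in version at the end of this example takes one line per quantity. The Python sketch below is an illustration of our own; the function name and the use of the sample standard deviation as the estimator σ̂(Y) are our choices:

```python
import numpy as np
from scipy.stats import t

def plug_in_interval(x, alpha):
    """Interval T_n -/+ n^{-1/2} sigma_hat q_{alpha/2}, with coverage -> 1 - alpha."""
    n = len(x)
    t_n = x.mean()                            # sample-average T_n
    sigma_hat = x.std(ddof=1)                 # estimator for sigma(theta)
    q = t.ppf(1.0 - alpha / 2.0, df=n - 1)    # Student t-quantile replacing s_{alpha/2}
    hw = sigma_hat * q / np.sqrt(n)
    return t_n - hw, t_n + hw

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=50)
print(plug_in_interval(x, alpha=0.05))
```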
Clearly, confidence regions are not unique, but of course small confidence regions are more
informative than large ones: if, for some confidence level α, two confidence regions C(Y) and
D(Y) are given, both satisfying (2.20) for all θ ∈ Θ, and C(Y) ⊂ D(Y), P_θ-almost-surely for
all θ ∈ Θ, then C(Y) is preferred over D(Y).
The Bayesian analogs of tests and confidence regions are called Bayes factors and credible
regions, both of which are derived from the posterior distribution. We start by considering
credible sets. The rationale behind their definition is exactly the same one that motivated
confidence regions: we look for a subset D of the model that is as small as possible, while
receiving a certain minimal probability. Presently, however, the word probability is understood
in the Bayesian sense, i.e. as probability according to the posterior distribution.
Definition 2.3.5. Let (Θ, 𝒢) be a measurable space parametrizing a model Θ → 𝒫 : θ ↦ P_θ
for data Y, with prior Π : 𝒢 → [0, 1]. Choose a level α ∈ (0, 1). Let D ∈ 𝒢 be a subset of Θ.
Then D is a level-α credible set for θ, if

Π( θ ∈ D | Y ) ≥ 1 − α.   (2.22)
almost-surely with respect to both priors (and P₀). So the quotient of the two posterior densities is a
positive constant K(Y) as a function of θ. Therefore, for all k ≥ 0,

D₁(k) = { θ ∈ Θ : π₁(θ|Y) ≥ k } = { θ ∈ Θ : π₂(θ|Y) ≥ K(Y)⁻¹ k } = D₂( K(Y)⁻¹ k ),

and, hence, for all α ∈ (0, 1),

k_{1,α} = sup{ k ≥ 0 : Π( θ ∈ D₂(K(Y)⁻¹ k) | Y ) ≥ 1 − α } = K(Y) k_{2,α}.

To conclude,

D_{1,α} = D₁(k_{1,α}) = D₂( K(Y)⁻¹ k_{1,α} ) = D₂(k_{2,α}) = D_{2,α}.
The above lemma proves that using the posterior density with respect to the prior leads
to HPD-credible sets that are independent of the choice of prior. This may be interpreted
further, by saying that only the data is of influence on HPD-credible sets based on the posterior
density with respect to the prior. Such a perspective is attractive to the objectivist, but rather
counterintuitive from a subjectivist point of view: a prior chosen according to subjectivist
criteria places high mass in subsets of the model that the statistician attaches high belief
to. Therefore, the density of the posterior with respect to the prior can be expected to
be relatively small in those subsets! As a result, those regions may end up in the HPD-credible
set D_α only for relatively high values of α. However, intuition is to be amended by mathematics in this case:
when we say above that only the data is of influence, this is due entirely to the likelihood
factor in (2.8). Rather than incorporating both prior knowledge and data in HPD credible
sets, the above construction emphasizes the differences between prior and posterior beliefs,
which lie entirely in the data and are represented in the formalism by the likelihood. (We shall
reach a similar conclusion when considering the difference between posterior odds and Bayes
factors later in this section.) To present the same point from a different perspective, HPD
credible regions based on the posterior density with respect to the prior coincide with level-sets
of the likelihood and centre on the ML estimate if the likelihood is smooth enough and has
a well-separated maximum (as a function on the model). We shall see that the coincidence
between confidence regions and credible sets becomes more pronounced in the large-sample
limit when we study the Bernstein-Von Mises theorem (see chapter 4 for more on large-sample
limiting behaviour of the posterior).
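Numerically, a credible set of the HPD type amounts to thresholding the posterior density at the largest level k for which the set {θ : π(θ|Y) ≥ k} carries mass at least 1 − α. The sketch below is a grid-based Python illustration of our own, assuming a unimodal posterior so that the HPD set is an interval:

```python
import numpy as np
from scipy.stats import norm

def hpd_interval(theta, density, alpha):
    """Grid approximation of the HPD set {theta : pi(theta|Y) >= k},
    with k maximal such that the set has posterior mass >= 1 - alpha."""
    d_theta = theta[1] - theta[0]                 # uniform grid spacing
    w = density / (density.sum() * d_theta)       # normalize the density
    order = np.argsort(w)[::-1]                   # highest density values first
    mass = np.cumsum(w[order]) * d_theta
    cutoff = np.searchsorted(mass, 1.0 - alpha)
    selected = np.sort(order[: cutoff + 1])
    return theta[selected[0]], theta[selected[-1]]  # endpoints (unimodal case)

grid = np.linspace(-3.0, 5.0, 4001)
posterior = norm.pdf(grid, loc=1.2, scale=0.5)    # e.g. a normal posterior
print(hpd_interval(grid, posterior, alpha=0.05))  # ~ (0.22, 2.18)
```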
Bayesian hypothesis testing is formulated in a far more straightforward fashion than frequentist
methods based on the Neyman-Pearson approach. The two hypotheses H₀ and H₁
correspond to a two-set partition {Θ₀, Θ₁} of the model Θ and for each of the parts, we have
both posterior and prior probabilities. Based on the proportions between those, we shall
decide which hypothesis is the more likely one. It can therefore be remarked immediately
that in the Bayesian approach, the hypotheses are treated on equal footing, a situation that is
more akin to classification than to Neyman-Pearson hypothesis testing. To introduce Bayesian
hypothesis testing, we make the following definitions.
Definition 2.3.7. Let (Θ, 𝒢) be a measurable space parametrizing a model Θ → 𝒫 : θ ↦ P_θ
for data Y ∈ 𝒴, with prior Π : 𝒢 → [0, 1]. Let {Θ₀, Θ₁} be a partition of Θ such that
Π(Θ₀) > 0 and Π(Θ₁) > 0. The prior and posterior odds ratios are defined by Π(Θ₀)/Π(Θ₁)
and Π(Θ₀|Y)/Π(Θ₁|Y) respectively. The Bayes factor in favour of Θ₀ is defined to be

B = ( Π(Θ₀|Y) / Π(Θ₁|Y) ) · ( Π(Θ₁) / Π(Θ₀) ).
When doing Bayesian hypothesis testing, we have a choice of which ratio to use and that
choice will correspond directly with a choice for subjectivist or objectivist philosophies. In
the subjectivist's view, the posterior odds ratio has a clear interpretation: if

Π(Θ₀|Y) / Π(Θ₁|Y) > 1,

then the posterior probability of θ ∈ Θ₀ is greater than that of θ ∈ Θ₁ and hence, the
subjectivist decides to adopt H₀ rather than H₁. If, on the other hand, the above display is
smaller than 1, the subjectivist decides to adopt H₁ rather than H₀. The objectivist would
object to this, saying that the relative prior weights of Θ₀ and Θ₁ can introduce a heavy bias
in favour of one or the other in this approach (upon which the subjectivist would answer that
that is exactly what he had in mind). Therefore, the objectivist would prefer to use a criterion
that is less dependent on the prior weights of Θ₀ and Θ₁. We look at a very simple example
to illustrate the point.
Example 2.3.3. Let Θ be a dominated model that consists of only two points, θ₀ and θ₁, and
let Θ₀ = {θ₀}, Θ₁ = {θ₁}, corresponding to simple null and alternative hypotheses H₀, H₁.
Denote the prior by Π and assume that both Π({θ₀}) > 0 and Π({θ₁}) > 0. By Bayes' rule,
the posterior weights of Θ₀ and Θ₁ are

Π( θ = θᵢ | Y ) = pᵢ(Y) Π({θᵢ}) / ( p₀(Y) Π({θ₀}) + p₁(Y) Π({θ₁}) ),  (i = 0, 1),

so that the Bayes factor takes the form of a likelihood ratio,

B = p₀(Y) / p₁(Y).

We see that the Bayes factor does not depend on the prior weights assigned to Θ₀ and Θ₁ (in
this simple example), but the posterior odds ratio does. Indeed, suppose we stack the prior
odds heavily in favour of Θ₀, by choosing Π({θ₀}) = 1 − ε and Π({θ₁}) = ε (for some small ε > 0).
Even if the likelihood ratio p₀(Y)/p₁(Y) is much smaller than one (but greater than ε/(1 − ε)),
the subjectivist's criterion favours H₀. In that case, the data clearly advocates hypothesis H₁,
but the prior odds force adoption of H₀. The Bayes factor B equals the likelihood ratio (in
this example), so it does not suffer from the bias imposed on the posterior odds.
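The contrast in this example is easily checked numerically. In the Python sketch below (an illustration of our own; the two unit-variance normal densities and the value ε = 0.01 are arbitrary choices), data generated near θ₁ yields a Bayes factor well below one while the posterior odds still favour H₀:

```python
from scipy.stats import norm

y = 1.8                                  # observation, close to theta_1
p0 = norm.pdf(y, loc=0.0)                # p_0(Y) under theta_0 = 0
p1 = norm.pdf(y, loc=2.0)                # p_1(Y) under theta_1 = 2
eps = 0.01                               # prior odds stacked in favour of H0

bayes_factor = p0 / p1                   # B = p_0(Y)/p_1(Y)
posterior_odds = bayes_factor * (1 - eps) / eps

print(f"B = {bayes_factor:.3f}")                  # ~ 0.20 < 1: data favours H1
print(f"posterior odds = {posterior_odds:.1f}")   # ~ 20 > 1: prior forces H0
```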
The objectivist prefers the Bayes factor to make a choice between two hypotheses: if B > 1
the objectivist adopts H0 rather than H1 ; if, on the other hand, B < 1, then the objectivist
adopts H1 rather than H0 . In example 2.3.3 the Bayes factor is independent of the choice of
the prior. In general, the Bayes factor is not completely independent of the prior, but it does
not depend on the relative prior weights of Θ₀ and Θ₁. We prove this using the following
decomposition of the prior:

Π(A) = Π(A|Θ₀) Π(Θ₀) + Π(A|Θ₁) Π(Θ₁),   (2.24)
for all A ∈ 𝒢 (where it is assumed that Π(Θ₀) > 0 and Π(Θ₁) > 0). In the above display,
Π(·|Θᵢ) can be any probability measure on Θᵢ (i = 0, 1), and since Π(Θ₀) + Π(Θ₁) = 1, Π is
decomposed as a convex combination of two probability measures on Θ₀ and Θ₁ respectively.
The Bayes factor is then rewritten using Bayes' rule (see lemma A.6.1):

B = ( Π(Θ₀|Y) / Π(Θ₁|Y) ) · ( Π(Θ₁) / Π(Θ₀) ) = Π(Y|Θ₀) / Π(Y|Θ₁),

where Π(Y|Θᵢ) = ∫_{Θᵢ} p_θ(Y) dΠ(θ|Θᵢ), (i = 0, 1): the Bayes factor depends on the conditional
priors Π(·|Θ₀), Π(·|Θ₁), but not on the prior weights of Θ₀ and Θ₁.
Up to this point, optimality of statistical procedures has been judged by criteria of a purely
probabilistic nature, like the accuracy of an estimation procedure, coverage probabilities for
confidence intervals or the probabilities of type-I and type-II errors in a testing procedure.
The distinction lies in the nature of the optimality criteria: so far we have practiced
what is called statistical inference, in which optimality is formulated entirely in terms of
the stochastic description of the data. For that reason, it is sometimes said that statistical
inference limits itself to those questions that summarize the data. By contrast, statistical
decision theory formalizes the criteria for optimality by adopting the use of a so-called loss-function to quantify the consequences of wrong decisions in a way prescribed by the context
of the statistical problem.
In statistical decision theory the nomenclature is slightly different from that introduced
earlier. We consider a system that is in an unknown state θ ∈ Θ, where Θ is called the
state-space. The observation Y takes its values in the samplespace 𝒴, a measurable space
with σ-algebra ℬ. The observation is stochastic, its distribution P_θ : ℬ → [0, 1] being
dependent on the state θ of the system. The observation does not reveal the state of the
system completely or with certainty. Based on the outcome Y = y of the observation, we
take a decision a ∈ 𝒜 (or perform an action a, as some prefer to say), where 𝒜 is called
the decision-space. For each state θ of the system there may be an optimal or prescribed
decision, but since observation of Y does not give us the state of the system with certainty,
the decision is stochastic and may be wrong. The goal of statistical decision theory is to arrive
at a rule that decides in the best possible way given only the data Y .
The above does not add anything new to the approach we were already following: aside
from the names, the concepts introduced here are those used in the usual problem of statistically estimating a ∈ 𝒜. Decision theory distinguishes itself through its definition of optimality
in terms of a so-called loss-function.
Definition 2.4.1. Any lower-bounded function L : Θ × 𝒜 → ℝ may serve as a loss-function.
The corresponding utility-function is −L : Θ × 𝒜 → ℝ.
(Although statisticians talk about loss-functions, people in applied fields often prefer to
talk of utility-functions, which is why the above definition is given both in a positive and
a negative version.) The interpretation of the loss-function is the following: if a particular
decision a is taken while the state of the system is θ, then a loss L(θ, a) is incurred, which can be
either positive (loss) or negative (profit). To illustrate, in systems where observation of
the state is direct (i.e. Y = θ) and non-stochastic, the optimal decision a(θ) given the state θ
is the value of a that minimizes the loss L(θ, a). However, the problem we have set is more
complicated because the state θ is unknown and cannot be measured directly. All we have is
the observation Y .
Definition 2.4.2. Let 𝒜 be a measurable space with σ-algebra ℋ. A measurable map δ : 𝒴 → 𝒜
is called a decision rule.
Definition 2.4.3. The risk-function associated with the loss L and a decision rule δ is defined as

R(θ, δ) = ∫ L(θ, δ(Y)) dP_θ.   (2.25)
Of interest to the frequentist is only the expected loss under the true distribution Y ~ P_θ₀.
But since θ₀ is unknown, we are forced to consider all values of θ ∈ Θ, i.e. look at the risk-function
θ ↦ R(θ, δ) for each decision rule δ.
Definition 2.4.4. Let the state-space Θ, states P_θ, (θ ∈ Θ), decision space 𝒜 and loss L be
given. Choose δ₁, δ₂ ∈ Δ. The decision rule δ₁ is R-better than δ₂, if

for all θ ∈ Θ : R(θ, δ₁) ≤ R(θ, δ₂).   (2.26)

A decision rule δ is admissible if there exists no δ′ that is R-better than δ (and inadmissible
if such a δ′ does exist).
It is clear that the definition of R-better decision-rules is intended to order decision rules:
if the risk-function associated with a decision-rule is relatively small, then that decision rule
is preferable. Note, however, that the ordering we impose by definition 2.4.4 may be partial
rather than complete: pairs δ₁, δ₂ of decision rules may exist such that neither δ₁ nor δ₂ is
R-better than the other. This is due to the fact that δ₁ may perform better (in the sense that
R(θ, δ₁) ≤ R(θ, δ₂)) for values of θ in some Θ₁ ⊂ Θ, while δ₂ performs better in Θ₂ = Θ \ Θ₁,
resulting in a situation where (2.26) is true for neither. For that reason, it is important to find
a way to compare risks (and thereby decision rules) in a θ-independent way and thus arrive
at a complete ordering among decision rules. This motivates the following definition.
Definition 2.4.5. (Minimax decision principle) Let the state-space Θ, states P_θ, (θ ∈ Θ),
decision space 𝒜 and loss L be given. The function

δ ↦ sup_{θ∈Θ} R(θ, δ)

is called the minimax risk. Let δ₁, δ₂ ∈ Δ be given. The decision rule δ₁ is minimax-preferred
to δ₂, if

sup_{θ∈Θ} R(θ, δ₁) < sup_{θ∈Θ} R(θ, δ₂).
A decision rule δᴹ is called minimax if it minimizes the minimax risk:

sup_{θ∈Θ} R(θ, δᴹ) = inf_{δ∈Δ} sup_{θ∈Θ} R(θ, δ).   (2.27)

The classical Minimax theorem asserts that

inf_{δ∈Δ} sup_{θ∈Θ} R(θ, δ) = sup_{θ∈Θ} inf_{δ∈Δ} R(θ, δ),

under the conditions that R is convex on Δ, concave on Θ and that the topology on Δ is
such that Δ is compact and δ ↦ R(θ, δ) is continuous for all θ. Since many loss-functions used in
practice satisfy the convexity requirements, the Minimax theorem has broad applicability in
statistical decision theory and many other fields. In some cases, use of the minimax theorem
requires that we extend the class Δ to contain more general decision rules. Particularly, it is
often necessary to consider the class of all so-called randomized decision rules. Randomized
decision rules are not only stochastic in the sense that they depend on the data, but also
through a further stochastic influence: concretely, this means that after realisation Y = y
of the data, uncertainty in the decision remains. To give a formal definition, consider a
measurable space (Ω, ℱ) with data Y : Ω → 𝒴 and a decision rule δ : Ω → 𝒜. The decision
rule δ is a randomized decision rule whenever σ(δ) is not a subset of σ(Y), i.e. δ is not a
function of Y. An example of such a situation is that in which we entertain the possibility
of using one of two different non-randomized decision rules δ₁, δ₂ : 𝒴 → 𝒜. After the data
is realised as Y = y, δ₁ and δ₂ give rise to two decisions δ₁(y), δ₂(y), which may differ. In
that case, we flip a coin with outcome C ∈ {0, 1} to decide which decision to use. The extra
stochastic element introduced by the coin-flip has then randomized our decision rule. The
product space Ω = 𝒴 × {0, 1}, endowed with the product σ-algebra, may serve as the measurable
space (Ω, ℱ), with δ : Ω → 𝒜 defined by

δ(y, c) = c δ₁(y) + (1 − c) δ₂(y),

for all y ∈ 𝒴 and c ∈ {0, 1}. Perhaps a bit counterintuitively (but certainly in accordance
with the fact that minimization over a larger set produces a lower infimum), in some decision
problems the minimax risk associated with such randomized decision rules lies strictly below
the minimax risks of both non-randomized decision rules. We return to the Minimax theorem
in section 4.3.
Example 2.4.1. (Decision-theoretic L²-estimation) The decision-theoretic approach can also
be used to formulate estimation problems in a generalized way, if we choose the decision space
𝒜 equal to the state-space Θ = ℝ. Let Y ~ N(θ₀, 1) for some unknown θ₀ ∈ Θ. Choose
L : Θ × 𝒜 → ℝ equal to the quadratic difference

L(θ, a) = (θ − a)²,

and consider the class Δ = { δ_c : c ≥ 0 } of decision rules of the form δ_c(Y) = c Y.
Note that Δ plays the role of a family of estimators for θ₀ here. The risk-function takes the
form:

R(θ, δ_c) = ∫ L(θ, δ_c(Y)) dP_θ = ∫ (θ − c y)² dN(θ, 1)(y)
= ∫ ( c(θ − y) + (1 − c)θ )² dN(θ, 1)(y)
= ∫ ( c²(y − θ)² + 2c(1 − c)θ(θ − y) + (1 − c)²θ² ) dN(θ, 1)(y)
= c² + (1 − c)²θ².
It follows that δ₁ is R-better than all δ_c for c > 1, so that every δ_c with c > 1 is inadmissible. If
we had restricted c to be greater than or equal to 1, δ₁ would have been admissible. However,
since c may lie in [0, 1) as well, admissibility in the uniform sense of (2.26) does not apply to
any δ_c. To see this, note that R(θ, δ₁) = 1 for all θ, whereas for c < 1 and any θ with θ² > (1 + c)/(1 − c),
R(0, δ_c) < 1 < R(θ, δ_c). Therefore, there is no admissible decision rule in Δ.
The minimax criterion does give rise to a preference. However, in order to guarantee its
existence, we need to bound (or rather, compactify) the parameter space: let M > 0 be given
and assume that Θ = [−M, M]. The minimax risk for δ_c is given by

sup_{θ∈Θ} R(θ, δ_c) = c² + (1 − c)² M²,

which is minimal iff c = M²/(1 + M²), i.e. the (unique) minimax decision rule for this
problem (or, since we are using decision theory to estimate a parameter in this case, the
minimax estimator with respect to L²-loss) is therefore

δᴹ(Y) = ( M² / (1 + M²) ) Y.
Note that if we let M → ∞, this estimator for θ converges to the MLE for said problem.
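The risk function of this example can also be reproduced by simulation. The following Python sketch (an illustration of our own) estimates R(θ, δ_c) by Monte Carlo and compares the sup-risk of δ₁ with that of the minimax rule δᴹ over Θ = [−M, M]:

```python
import numpy as np

def risk(theta, c, n_mc=200_000, seed=0):
    """Monte-Carlo estimate of R(theta, delta_c) = E_theta (theta - c Y)^2;
    the exact value is c^2 + (1 - c)^2 theta^2."""
    rng = np.random.default_rng(seed)
    y = rng.normal(theta, 1.0, size=n_mc)
    return np.mean((theta - c * y) ** 2)

M = 2.0
thetas = np.linspace(-M, M, 41)
for c in (1.0, M**2 / (1 + M**2)):       # delta_1 versus the minimax rule
    sup_risk = max(risk(th, c) for th in thetas)
    print(f"c = {c:.2f}: sup-risk ~ {sup_risk:.3f}")   # ~ 1.0 versus ~ 0.8
```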
As demonstrated in the above example, uniform admissibility of a decision rule (c.f. (2.26))
is hard to achieve, but in many such cases a minimax decision rule does exist. One important
remark concerning the use of the minimax decision principle remains: considering (2.27), we see
that the minimax principle chooses the decision rule that minimizes the maximum of the risk
R(θ, δ) over Θ. As such, the minimax criterion takes into account only the worst-case scenario
and prefers decision rules that perform well under those conditions. In practical problems,
that means that the minimax principle tends to take a rather pessimistic perspective on
decision problems.
Bayesian decision theory presents a more balanced perspective: instead of maximizing the risk function over Θ, the Bayesian uses the prior to integrate it over Θ. Optimization
of the resulting integral takes into account more than just the worst case, so that the resulting
decision rule is based on a less pessimistic perspective than the minimax decision rule.
Definition 2.4.6. Let the state-space Θ, states P_θ, (θ ∈ Θ), decision space 𝒜 and loss
L be given. In addition, assume that Θ is a measurable space with σ-algebra 𝒢 and prior
Π : 𝒢 → [0, 1]. The function

r(Π, δ) = ∫_Θ R(θ, δ) dΠ(θ),   (2.28)

is called the Bayesian risk function. Let δ₁, δ₂ ∈ Δ be given. The decision rule δ₁ is Bayes-preferred
to δ₂, if

r(Π, δ₁) < r(Π, δ₂).

If δ* ∈ Δ minimizes δ ↦ r(Π, δ), i.e.

r(Π, δ*) = inf_{δ∈Δ} r(Π, δ),   (2.29)

then δ* is called a Bayes rule. The quantity r(Π, δ*) is called the Bayes risk.
Lemma 2.4.1. Let Y ∈ 𝒴 denote data in a decision-theoretic problem with state-space Θ,
decision space 𝒜 and loss L : Θ × 𝒜 → ℝ. For any prior Π and all decision rules δ : 𝒴 → 𝒜,

r(Π, δ) ≤ sup_{θ∈Θ} R(θ, δ),

i.e. the Bayesian risk is always upper-bounded by the minimax risk.
The proof of this lemma follows from the fact that the minimax risk is an upper bound
for the integrand in the Bayesian risk function.
Example 2.4.2. (Continuation of example 2.4.1) Let Θ = ℝ and Y ~ N(θ₀, 1) for some
unknown θ₀ ∈ Θ. Choose the loss-function L : Θ × 𝒜 → ℝ and the decision space 𝒜 as in
example 2.4.1. We choose a prior Π = N(0, τ²) (for some τ > 0) on Θ. Then the Bayesian
risk function is given by:

r(Π, δ_c) = ∫ R(θ, δ_c) dΠ(θ) = ∫_ℝ ( c² + (1 − c)²θ² ) dN(0, τ²)(θ) = c² + (1 − c)²τ²,

which is minimal iff c = τ²/(1 + τ²). The (unique) Bayes rule for this problem and the corresponding
Bayes risk are therefore

δ*(Y) = ( τ² / (1 + τ²) ) Y,    r(Π, δ*) = τ² / (1 + τ²).

In the Bayesian case, there is no need for a compact parameter space Θ, since we do not
maximize but integrate over Θ.
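Numerically, the minimizing constant is recovered immediately; a small Python sketch of our own, with τ = 1:

```python
import numpy as np

tau = 1.0
c = np.linspace(0.0, 1.0, 1001)
bayes_risk = c**2 + (1 - c)**2 * tau**2           # r(Pi, delta_c)
i = np.argmin(bayes_risk)
print(c[i], tau**2 / (1 + tau**2))                # both ~ 0.5
print(bayes_risk[i], tau**2 / (1 + tau**2))       # Bayes risk ~ 0.5
```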
In the above example, we could find the Bayes rule by straightforward optimization of the
Bayesian risk function, because the class Δ was rather restricted. If we extend the class Δ to
contain all non-randomized decision rules, the problem of finding the Bayes rule seems to be
far more complicated at first glance. However, as we shall see in theorem 2.4.1, the following
definition turns out to be the solution to this question.
Definition 2.4.7. (The conditional Bayes decision principle) Let the state-space Θ, states P_θ,
(θ ∈ Θ), decision space 𝒜 and loss L be given. In addition, assume that Θ is a measurable
space with σ-algebra 𝒢 and prior Π : 𝒢 → [0, 1]. We define the decision rule δ* : 𝒴 → 𝒜 to be
such that for all y ∈ 𝒴,

∫ L(θ, δ*(y)) dΠ(θ|Y = y) = inf_{a∈𝒜} ∫ L(θ, a) dΠ(θ|Y = y),   (2.30)

i.e. point-wise for every y, the decision rule δ*(y) minimizes the posterior expected loss.
The above defines the decision rule δ* implicitly as a point-wise minimizer, which raises
the usual questions concerning existence and uniqueness, of which little can be said in any
generality. However, if the existence of δ* is established, it is optimal.
Theorem 2.4.1. Let the state-space Θ, states P_θ, (θ ∈ Θ), decision space 𝒜 and loss L be
given. In addition, assume that Θ is a measurable space with σ-algebra 𝒢 and prior Π : 𝒢 → [0, 1].
Assume that there exists a σ-finite measure μ : ℬ → [0, ∞] such that P_θ ≪ μ for all θ ∈ Θ. If the
decision rule δ* : 𝒴 → 𝒜 is well-defined, then δ* is a Bayes rule.
Proof Denote the class of all decision rules for this problem by Δ throughout the proof. We
start by rewriting the Bayesian risk function for a decision rule δ : 𝒴 → 𝒜:

r(Π, δ) = ∫_Θ R(θ, δ) dΠ(θ) = ∫_Θ ∫_𝒴 L(θ, δ(y)) dP_θ(y) dΠ(θ)
= ∫_𝒴 ∫_Θ L(θ, δ(y)) p_θ(y) dΠ(θ) dμ(y)
= ∫_𝒴 ( ∫_Θ p_θ(y) dΠ(θ) ) ∫_Θ L(θ, δ(y)) dΠ(θ|Y = y) dμ(y),

where we use definitions (2.28) and (2.25), the Radon-Nikodym theorem (see theorem A.4.2),
Fubini's theorem (see theorem A.4.1) and the definition of the posterior, c.f. (2.7). Using the
prior predictive distribution (2.9), we rewrite the Bayesian risk function further:

r(Π, δ) = ∫_𝒴 ∫_Θ L(θ, δ(y)) dΠ(θ|Y = y) dP^Π(y).   (2.31)
By assumption, the conditional Bayes decision rule δ* exists. Since δ* satisfies (2.30) point-wise
for all y ∈ 𝒴, we have

r(Π, δ*) = ∫_𝒴 ∫_Θ L(θ, δ*(y)) dΠ(θ|Y = y) dP^Π(y)
≤ inf_{δ∈Δ} ∫_𝒴 ∫_Θ L(θ, δ(y)) dΠ(θ|Y = y) dP^Π(y)
= inf_{δ∈Δ} r(Π, δ).
To conclude, it is noted that randomization of the decision is not needed when optimizing
with respect to the Bayes risk. The conditional Bayes decision rule is non-randomized and
optimal.
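Theorem 2.4.1 also dictates how Bayes rules are computed in practice: form the posterior, then minimize the posterior expected loss over a ∈ 𝒜 for the observed y. Below is a generic grid-based Python sketch of our own; the distributions and the quadratic loss are arbitrary illustrative choices:

```python
import numpy as np

def conditional_bayes_rule(y, thetas, actions, prior, likelihood, loss):
    """delta*(y) = argmin over a of  int L(theta, a) dPi(theta | Y = y),  on a grid."""
    post = np.array([likelihood(y, th) for th in thetas]) * prior
    post /= post.sum()                            # posterior weights Pi(theta | Y = y)
    exp_loss = [np.dot(post, [loss(th, a) for th in thetas]) for a in actions]
    return actions[int(np.argmin(exp_loss))]

thetas = np.linspace(-3.0, 3.0, 121)
prior = np.exp(-thetas**2 / 2.0)                  # discretized N(0,1) prior
lik = lambda y, th: np.exp(-(y - th)**2 / 2.0)    # Y ~ N(theta, 1)
loss = lambda th, a: (th - a)**2                  # L2-loss: Bayes rule = posterior mean
print(conditional_bayes_rule(1.0, thetas, list(thetas), prior, lik, loss))  # ~ 0.5
```

With quadratic loss the minimizer is the posterior mean, which for this prior-likelihood pair equals y/2; the grid computation reproduces this.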
Example 2.4.3. (Classification and Bayesian classifiers) Many decision-theoretic questions
take the form of a classification problem: under consideration is a population of objects
that each belong to one of a finite number of classes 𝒜 = {1, 2, . . . , L}. The class K of the
object is the unknown quantity of interest. Observing a vector Y of features of the object,
the goal is to classify the object, i.e. estimate which class it belongs to. We formalize the
problem in decision-theoretic terms: the population is a probability space (Ω, ℱ, P); both the
feature vector and the class of the object are random variables, Y : Ω → 𝒴 and K : Ω → 𝒜
respectively. The state-space in a classification problem equals the decision space 𝒜: the class
can be viewed as a state, in the sense that the distribution P_{Y|K=k} of Y given the class
K = k depends on k. Based on the feature vector Y, we decide to classify the object in class δ(Y),
i.e. the decision rule (or classifier, as it is usually referred to in the context of classification
problems) maps features to classes by means of a map δ : 𝒴 → 𝒜. A classifier can be
viewed equivalently as a finite partition of the feature-space 𝒴: for every k ∈ 𝒜, we define

𝒴_k = { y ∈ 𝒴 : δ(y) = k },

and note that if k ≠ l, then 𝒴_k ∩ 𝒴_l = ∅ and 𝒴₁ ∪ 𝒴₂ ∪ … ∪ 𝒴_L = 𝒴. The partition of the
feature space is such that if Y = y ∈ 𝒴_k for certain k ∈ 𝒜, then we classify the object in class
k.
Depending on the context of the classification problem, a loss-function L : 𝒜 × 𝒜 → ℝ
is defined (see the examples in the introduction to this section, e.g. the example on medical
diagnosis). Without context, the loss-function in a classification problem can be chosen as
follows:

L(k, l) = 1{k ≠ l},

i.e. we incur a loss equal to one for each misclassification.
Using the minimax decision principle, we look for a classifier δᴹ : 𝒴 → 𝒜 that minimizes

δ ↦ sup_{k∈𝒜} ∫ L(k, δ(y)) dP(y|K = k) = sup_{k∈𝒜} P( δ(Y) ≠ k | K = k ),
i.e. the minimax decision principle prescribes that we minimize the probability of misclassification uniformly over all classes.
In a Bayesian context, we need a prior on the state-space, which equals 𝒜 in classification
problems. Note that if known (or estimable), the marginal probability distribution for K is to
be used as the prior for the state k, in accordance with definition 2.1.1. In practical problems,
frequencies of occurrence for the classes {1, . . . , L} in Ω are often available or easily estimable;
in the absence of information on the marginal distribution of K, equal prior weights can be
assigned. Here, we assume that the probabilities P(K = k) are known and use them to define
the prior density π with respect to the counting measure on the (finite) space 𝒜:

π(k) = P(K = k).
The Bayes rule δ* : 𝒴 → 𝒜 for this classification problem is defined as the minimizer of

δ ↦ Σ_{k=1}^{L} L(k, δ(y)) Π( K = k | Y = y ) = Π( δ(y) ≠ K | Y = y ),
for every y ∈ 𝒴, i.e. δ*(y) maximizes the posterior probability k ↦ Π(K = k | Y = y). According
to theorem 2.4.1, the classifier δ* minimizes the Bayes risk, which in this situation is given by:

r(π, δ) = Σ_{k∈𝒜} R(k, δ) π(k) = Σ_{k∈𝒜} ∫ L(k, δ(y)) dP(y|K = k) π(k)
= Σ_{k∈𝒜} P( K ≠ δ(Y) | K = k ) P(K = k) = P( K ≠ δ(Y) ).
Summarizing, the Bayes rule δ* minimizes the overall probability of misclassification, i.e.
without referring to the class of the object. (Compare this with the minimax classifier.)
Readers interested in the statistics of classification and its applications are encouraged to
read B. Ripley's Pattern recognition and neural networks (1996) [73].
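Under the 0-1 loss above, the Bayes classifier reduces to picking the class with maximal π(k) p(y|K = k). The Python sketch below is an illustration of our own, with Gaussian class-conditional distributions; the prior-free variant shown for contrast simply drops the weights π(k) (it is not the exact minimax classifier, which would equalize the per-class error probabilities):

```python
from scipy.stats import norm

priors = {1: 0.7, 2: 0.2, 3: 0.1}            # pi(k) = P(K = k)
means = {1: -1.0, 2: 0.0, 3: 2.0}            # P_{Y|K=k} = N(means[k], 1)

def bayes_classifier(y):
    """delta*(y) = argmax_k pi(k) p(y|K=k); minimizes P(K != delta(Y))."""
    return max(priors, key=lambda k: priors[k] * norm.pdf(y, loc=means[k]))

def prior_free_classifier(y):
    """For contrast: maximum-likelihood assignment, ignoring class frequencies."""
    return max(means, key=lambda k: norm.pdf(y, loc=means[k]))

for y in (-1.5, 0.3, 1.2):
    print(y, bayes_classifier(y), prior_free_classifier(y))
```

At y = 0.3, for instance, the Bayes classifier assigns class 1 (its large prior weight dominates) while the prior-free rule assigns class 2.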
To close the chapter, the following remark is in order: when we started our comparison of
frequentist and Bayesian methods, we highlighted the conflict in philosophy. However, now
that we have seen some of the differences in more detail by considering estimation, testing
and decision theory in both schools, we can be far more specific. Statistical problems can be
solved in both schools; whether one chooses for a Bayesian or frequentist solution is usually not
determined by adamant belief in either philosophy, but by much more practical considerations.
Perhaps example 2.4.3 illustrates this point most clearly: if one is concerned about correct
classification for objects in the most difficult class, one should opt for the minimax decision
rule. If, on the other hand, one wants to minimize the overall misclassification probability
(disregarding misclassification per class), one should choose to adopt the conditional Bayes
decision rule. In other words, depending on the risk to be minimized (minimax risk and Bayes
risk are different!) one arrives at different classifiers. Some formulations are more natural in
frequentist context and others belong in the Bayesian realm. Similarly, practicality may form
an argument in favour of imposing a (possibly subjective) bias (see example 1.2.1). Bayesian
methods are a natural choice in such cases, due to the intrinsic bias priors express. For
example, forensic statistics is usually performed using Bayesian methods, in order to leave
room for common-sense bias. Another reason to use one or the other may be computational
advantages or useful theoretical results that exist for one school but have no analog in the
other.
Philosophical preference should not play a role in the choice of a statistical procedure;
practicality should (and usually does).
2.5 Exercises
Exercise 2.1. Calibration
A physicist prepares for repeated measurement of a physical quantity Z in his laboratory.
To that end, he installs a measurement apparatus that will give him outcomes of the form
Y = Z + e where e is a measurement error due to the inaccuracy of the apparatus, assumed
to be stochastically independent of Z. Note that if the expectation of e equals zero, long-run
sample averages converge to the expectation of Z; if Pe ≠ 0, on the other hand, averaging
does not cancel out the resulting bias.
The manufacturer of the apparatus says that e is normally distributed with known variance
σ² > 0. The mean μ of this normal distribution depends on the way the apparatus is installed
and thus requires calibration. The following questions pertain to the calibration procedure.
The physicist decides to conduct the following steps to calibrate his measurement: if he
makes certain that the apparatus receives no input signal, Z = 0. A sample of n independent
measurements of Y then amounts to an i.i.d. sample from the distribution of e, which can be
used to estimate the unknown mean μ. The physicist expects that μ = Ee lies close to zero.
a. Explain why, from a subjectivist point of view, the choice μ ~ N(0, τ²) forms a suitable
prior in this situation. Explain the role of the parameter τ² > 0.
b. With the choice of prior as in part a., calculate the posterior density for μ.
c. Interpret the influence of τ² on the posterior, taking into account your answer under
part a. (Hint: take limits τ² → 0 and τ² → ∞ in the expression you have found under b.)
d. What is the influence of the sample-size n? Show that the particular choice of the constant
τ² becomes irrelevant in the large-sample limit n → ∞.
Exercise 2.2. Let X₁, …, Xₙ be an i.i.d. sample from the normal distribution N(0, σ²), with
unknown variance σ² > 0. As a prior for σ², let 1/σ² ~ Γ(1, 2). Calculate the posterior
distribution for σ² with respect to the Lebesgue measure on (0, ∞).
Exercise 2.3. Let X₁, …, Xₙ be an i.i.d. sample from the Poisson distribution Poisson(λ),
with unknown parameter λ > 0. As a prior for λ, let λ ~ Γ(2, 1). Calculate the posterior
density for λ with respect to the Lebesgue measure on (0, ∞).
Exercise 2.4. Let the measurement Y ~ P₀ be given. Assume that the model P = {P_θ :
θ ∈ Θ} is dominated but possibly misspecified. Let Π denote a prior distribution on Θ. Show that
the posterior distribution is P₀-almost-surely equal to the prior distribution iff the likelihood
is P₀-almost-surely constant (as a function of (θ, y) ∈ Θ × 𝒴). Explain the result of
example 2.1.1 in this context.
Exercise 2.5. Consider the following questions in the context of exercise 2.3.
a. Calculate the maximum-likelihood estimator and the maximum-a-posteriori estimator
for λ ∈ (0, ∞).
b. Let n → ∞ in both the MLE and the MAP estimator and conclude that the difference
vanishes in the limit.
c. Following remark 2.2.7, explain the difference between ML and MAP estimators exclusively in terms of the prior.
d. Consider and discuss the choice of prior λ ~ Γ(2, 1) twice, once in a qualitative, subjectivist Bayesian fashion, and once following the frequentist interpretation of the log-prior-density.
Exercise 2.6. Let Y ~ P₀ denote the data. The following questions pertain to the small-ball
estimator defined in remark 2.2.5 for certain fixed p ∈ (1/2, 1), which we shall denote by
P̂(Y). Assume that the model P is compact in the topology induced by the metric d.
a. Show that for any two measurable model subsets A, B ⊂ P,

Π( A | Y ) + Π( B | Y ) = Π( A ∪ B | Y ) + Π( A ∩ B | Y ),

P₀-almost-surely.
b. Prove that the map (ε, P) ↦ Π( B_d(P, ε) | Y ) is continuous, P₀-almost-surely.
c. Show that P̂(Y) exists, P₀-almost-surely.
d. Suppose that ε > 0 denotes the smallest radius for which there exists a ball B_d(P, ε) ⊂ P
of posterior probability greater than or equal to p. Show that, if both P̂₁(Y) and P̂₂(Y)
are centre points of such balls, then d(P̂₁(Y), P̂₂(Y)) < 2ε, P₀-almost-surely.
Exercise 2.7. Complete the proof of lemma 2.1.2. (Hint: denote S = supp(Π); assume that
Π(S) = α < 1; show that Π(Sᶜ ∩ C) > 0 for any closed C such that Π(C) = 1; then use
that intersections of closed sets are closed.)
Exercise 2.8. Let Y be normally distributed with known variance σ² > 0 and unknown
location θ. As a prior for θ, choose Π = N(0, τ²). Let α ∈ (0, 1) be given. Using the posterior
density with respect to the Lebesgue measure, express the level-α HPD-credible set in terms of
Y, σ², τ² and quantiles of the standard normal distribution. Consider the limit τ² → ∞ and
compare with level-α confidence intervals centred on the ML estimate for θ.
Exercise 2.9. Let Y ~ Bin(n; p) for known n ≥ 1 and unknown p ∈ (0, 1). As a prior for
p, choose Π = Beta(1/2, 1/2). Calculate the posterior distribution for the parameter p. Using
the Lebesgue measure on (0, 1) to define the posterior density, give the level-α HPD-credible
interval for p in terms of Y, n and the quantiles of beta-distributions.
Exercise 2.10. Consider a dominated model P = {P_θ : θ ∈ Θ} for data Y, where Θ ⊂ ℝ is
an interval. For certain θ₀ ∈ Θ, consider the simple null-hypothesis and alternative:

H₀ : θ = θ₀,    H₁ : θ ≠ θ₀.

Show that if the prior Π is absolutely continuous with respect to the Lebesgue measure on Θ,
then the Bayes factor B for the hypotheses H₀ versus H₁ satisfies B = 0.
Interpret this fact as follows: calculation of Bayes factors (and posterior/prior odds ratios)
makes sense only if both hypotheses receive non-zero prior mass. Otherwise, the statistical
question we ask is rendered invalid ex ante by our beliefs concerning θ, as formulated through
the choice of the prior.
Exercise 2.11. Prisoner's dilemma
Two men have been arrested on the suspicion of burglary and are held in separate cells awaiting
interrogation. The prisoners have been told that burglary carries a maximum sentence of x
years. However, if they confess, their prison terms are reduced to y years (where 0 < y < x).
If one of them confesses and the other does not, the first receives a sentence of y years while
the other is sentenced to x years.
Guilty of the crime he is accused of, our prisoner contemplates whether to confess to
receive a lower sentence, or to deny involvement in the hope of escaping justice altogether.
He cannot confess without implicating the other prisoner. If he keeps his mouth shut and so
does his partner in crime, they will both walk away free. If he keeps his mouth shut but his
partner talks, he gets the maximum sentence. If he talks, he will always receive a sentence of
y years and the other prisoner receives y or x years depending on whether he confessed or not
himself. To talk or not to talk, that is the question.
There is no data in this problem, so we set θ equal to 1 or 0, depending on whether the
other prisoner talks or not. Our prisoner can decide to talk (t = 1) or not (t = 0). The loss
function L(θ, t) equals the prison term for our prisoner. In the absence of data, risk and loss
are equal.
a. Calculate the minimax risk for both t = 0 and t = 1. Argue that the minimax-optimal
decision for our prisoner is to confess.
As argued in section 2.4, the minimax decision can be overly pessimistic. In the above, it
assumes that the other prisoner will talk and chooses t accordingly.
The Bayesian perspective balances matters depending on the chance that the other prisoner
will confess when interrogated. This chance finds its way into the formalism as a prior for the
trustworthiness of the other prisoner. Let p ∈ [0, 1] be the probability that the other prisoner
confesses, i.e. Π(θ = 1) = p and Π(θ = 0) = 1 − p.
b. Calculate the Bayes risks for t = 0 and t = 1 in terms of x, y and p. Argue that the
Bayes decision rule for our prisoner is as follows: if y/x > p then our prisoner does
not confess, if y/x < p, the prisoner confesses. If y/x = p, the Bayes decision criterion
does not have a preference.
So, depending on the degree to which our prisoner trusts his associate and the ratio of prison
terms, the Bayesian draws his conclusion. The latter is certainly more sophisticated and
perhaps more realistic, but it requires that our prisoner quantifies his trust in his partner in
the form of a prior Bernoulli(p) distribution.
Chapter 3

Choice of the prior
Bayesian procedures have been the object of much criticism, often focusing on the choice of
the prior as an undesirable source of ambiguity. The answer of the subjectivist, that the prior
represents the belief of the statistician or expert knowledge pertaining to the measurement, elevates this ambiguity to a matter of principle, thus setting the stage for a heated
debate between pure Bayesians and pure frequentists concerning the philosophical merits
of either school within statistics. As said, the issue is complicated further by the fact that
the Bayesian procedure does not refer to the true distribution P0 for the observation (see
section 2.1), providing another point of fundamental philosophical disagreement for the fanatically pure to lock horns over. Leaving the philosophical argumentation to others, we shall
try to discuss the choice of a prior at a more conventional, practical level.
In this chapter, we look at the choice of the prior from various points of view: in section 3.1, we consider priors that emphasize the subjectivist's prior belief. In section 3.2
we construct priors with the express purpose not to emphasize any part of the model, as
advocated by objectivist Bayesians. Because it is often desirable to control properties of the
posterior distribution and be able to compare it to the prior, conjugate priors are considered
in section 3.3. As will become clear in the course of the chapter, the choice of a good prior
is also highly dependent on the model under consideration.
Since the Bayesian school has taken up an interest in non-parametric statistics only relatively recently, most (if not all) of the material presented in the first three sections of this
chapter applies only to parametric models. To find a suitable prior for a non-parametric
model can be surprisingly complicated. Not only does the formulation involve topological
aspects that do not play a role in parametric models, but also the properties of the posterior may be surprisingly different from those encountered in parametric models! Priors on
infinite-dimensional models are considered in section 3.4.
3.1 Subjective priors

In a subjectivist setting, the motivation for the choice of a certain prior (and not any
other) is part of the analysis rather than an external consideration.
Suppose, to begin with, that a reasonable subjective prior F can be found for the first component θ₁,
given the values of the other components, i.e. F is the conditional distribution of θ₁, given θ₂, …, θ_d:

F = Π_{θ₁ | θ₂, …, θ_d}.
Suppose, furthermore, that a reasonable subjective prior G for the second component θ₂ may be
found, independent of θ₁, but given θ₃, …, θ_d. Then,

G = Π_{θ₂ | θ₃, …, θ_d}.
If we continue like this, eventually defining the marginal prior for the last component θ_d, we
have found a prior for the full parameter θ = (θ₁, …, θ_d), because for all A₁, …, A_d ∈ ℬ,

Π(θ₁ ∈ A₁, …, θ_d ∈ A_d) = Π(θ₁ ∈ A₁ | θ₂ ∈ A₂, …, θ_d ∈ A_d) Π(θ₂ ∈ A₂ | θ₃ ∈ A₃, …, θ_d ∈ A_d)
⋯ Π(θ_{d−1} ∈ A_{d−1} | θ_d ∈ A_d) Π(θ_d ∈ A_d).
Because prior beliefs may be more easily expressed when imagining a situation where the other
parameters have fixed values, one eventually succeeds in defining the prior for the high-dimensional
model. The construction indicated here is that of a so-called hyperprior, which we
shall revisit in section 3.3. Note that when doing this, it is important to choose the parametrization
of the model such that one may assume (with some plausibility) that θᵢ is independent
of (θ₁, …, θ_{i−1}), given (θ_{i+1}, …, θ_d), for all i ≥ 1.
In certain situations, the subjectivist has more factual information at his disposal when
defining the prior for his analysis. In particular, if a probability distribution on the model
reflecting the subjectivist's beliefs can be found by other statistical means, it can be used
as a prior. Suppose the statistician is planning to measure a quantity Y and infer on a model
P; suppose also that this experiment repeats or extends an earlier analysis. From the earlier
analysis, the statistician may have obtained a posterior distribution on P. For the new
experiment, this posterior may serve as a prior.
Example 3.1.2. Let Θ → 𝒫 : θ ↦ P_θ be a parametrized model for an i.i.d. sample
X₁, X₂, …, Xₙ with prior measure Π₁ : 𝒢 → [0, 1]. Let the model be dominated (see definition 1.1.3),
so that the posterior Π₁(·|X₁, …, Xₙ) satisfies (2.8). Suppose that this experiment has been
conducted, with the sample realised as (X₁, X₂, …, Xₙ) = (x₁, x₂, …, xₙ).
Next, consider a new, independent experiment in which a quantity Xₙ₊₁ is measured (with
the same model). As a prior Π₂ for the new experiment, we use the (realised) posterior of the
earlier experiment, i.e. for all G ∈ 𝒢,

Π₂(G) = Π₁( θ ∈ G | X₁ = x₁, …, Xₙ = xₙ ).
The posterior for the second experiment then satisfies:

dΠ₂(θ|Xₙ₊₁) = p_θ(Xₙ₊₁) dΠ₂(θ) / ∫_Θ p_θ(Xₙ₊₁) dΠ₂(θ)
= p_θ(Xₙ₊₁) ∏_{i=1}^{n} p_θ(xᵢ) dΠ₁(θ) / ∫_Θ p_θ(Xₙ₊₁) ∏_{j=1}^{n} p_θ(xⱼ) dΠ₁(θ).   (3.1)

The latter form is comparable to the posterior that would have been obtained if we had conducted
a single experiment with an i.i.d. sample X₁, X₂, …, X_{n+1} of size n + 1 and prior Π₁.
In that case, the posterior would have been of the form:

dΠ(θ|X₁, …, X_{n+1}) = ∏_{i=1}^{n+1} p_θ(Xᵢ) dΠ₁(θ) / ∫_Θ ∏_{j=1}^{n+1} p_θ(Xⱼ) dΠ₁(θ),   (3.2)

i.e. the only difference is the fact that in (3.1) the first n observations are realised, X₁ = x₁, …, Xₙ = xₙ.
As such, we may interpret independent, consecutive experiments as a single, interrupted experiment
and the posterior Π₁(·|X₁, …, Xₙ) can be viewed as an intermediate result.
Remark 3.1.3. Note that it is necessary to assume that the second experiment is stochastically
independent of the first, in order to enable comparison between (3.1) and (3.2).
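The coincidence of (3.1) and (3.2) means that the posterior may be computed one observation at a time. The Python sketch below (an illustration of our own, on a discretized Bernoulli model) verifies that sequential updating and a single batch update give the same posterior:

```python
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)               # grid on Theta = (0, 1)
prior = np.full_like(thetas, 1.0 / len(thetas))    # Pi_1: uniform prior weights

def lik(x):
    """Bernoulli(theta) likelihood of a single observation x in {0, 1}."""
    return thetas**x * (1.0 - thetas)**(1 - x)

sample = [1, 0, 1, 1, 0, 1]

seq = prior.copy()                                 # posterior as intermediate result
for x in sample:
    seq = seq * lik(x)
    seq /= seq.sum()

batch = prior * np.prod([lik(x) for x in sample], axis=0)
batch /= batch.sum()

print(np.allclose(seq, batch))                     # True: (3.1) and (3.2) agree
```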
Clearly, there are other ways to obtain a distribution on the model that can be used as
an informative prior. One example is the distribution that is obtained when a previously
obtained frequentist estimator for θ is subjected to a procedure called the bootstrap. Although
the bootstrap gives rise to a distribution that is interpreted (in the frequentist sense) as the
distribution of the estimator rather than of θ itself, a subjectivist may reason that the estimator
provides him with the expert knowledge on θ that he needs to define a prior on Θ. (For
more on bootstrap methods, see Efron and Tibshirani (1993) [32].)
3.2 Non-informative priors

Loosely speaking, a prior is called non-informative if it is uniform over the parameter space: if we are
inferring on a parameter θ ∈ Θ = [−1, 1] and we do not want to favour any part of the model over
any other, we would choose a prior of the form (for A ∈ ℬ),

Π(A) = ½ μ(A),   (3.3)

where μ denotes the Lebesgue measure on [−1, 1]. Attempts to minimize the amount of
subjectivity introduced by the prior therefore focus on uniformity (argumentation that departs
from the Shannon entropy in discrete probability spaces reaches the same conclusion (see, for
example, Ghosh and Ramamoorthi (2003) [42], p. 47)). The original references on Bayesian
methods (e.g. Bayes (1763) [4], Laplace (1774) [57]) use uniform priors as well. But there
are several problems with this approach: first of all, one must wonder how to extend such
reasoning when Θ = ℝ (or any other unbounded subset of ℝ). In that case, μ(Θ) = ∞ and
we cannot normalize μ to be a probability measure! Any attempt to extend μ to such
unbounded models as a probability measure (or even as a finite measure) would eventually
lead to inhomogeneity, i.e. go at the expense of the unbiasedness of the procedure.
The compromise some objectivists are willing to make, is to relinquish the interpretation
that subjectivists give to the prior: they do not express any prior degree of belief in A ∈ 𝒢
through the subjectivist statement that the (prior) probability of finding θ ∈ A equals Π(A).
Although they maintain the Bayesian interpretation of the posterior, they view the prior as
a mathematical definition rather than a philosophical concept. Then, the following definition
can be made without further reservations.
Definition 3.2.1. Given a model (Θ, 𝒢), a prior measure Π : 𝒢 → [0, ∞] such that Π(Θ) = ∞
is called an improper prior.
Note that the normalization factor ½ in (3.3) is irrelevant for the calculation of the posterior,
c.f. (2.4): any finite multiple of a (finite) prior is equivalent to the original prior as far as
the posterior is concerned. However, this argument does not extend to the improper case:
integrability problems or other infinities may ruin the procedure, even to the point where the
posterior measure becomes infinite or ill-defined. So not just the philosophical foundation
of the Bayesian approach is lost, mathematical integrity of the procedure can no longer be
guaranteed either! When confronted with an improper prior, the entire procedure must be
checked for potential problems. In particular, one must verify that the posterior is a well-defined probability measure.
Remark 3.2.1. Throughout these notes, whenever we refer to a prior measure, it is implied
that this measure is a probability measure unless stated otherwise.
But even if one is willing to accept that objectivity of the prior requires that we restrict
attention to models on which uniform probability measures exist (e.g. with Θ a bounded
subset of Rd ), a more fundamental problem exists: the very notion of uniformity is dependent
on the parametrization of the model! To see this we look at a model that can be parametrized
in two ways and we consider the way in which uniformity as seen in one parametrization
manifests itself in the other parametrization. Suppose that we have a d-dimensional parametric
model P with two different parametrizations, on Θ₁ ⊂ ℝᵈ and Θ₂ ⊂ ℝᵈ respectively:

φ₁ : Θ₁ → P,    φ₂ : Θ₂ → P,   (3.4)

both of which are bijective. Assume that P has a topology and is endowed with the corresponding
Borel σ-algebra 𝒢; let φ₁ and φ₂ be continuous and assume that their inverses φ₁⁻¹
and φ₂⁻¹ are continuous as well. Assuming that Θ₁ is bounded, we consider the uniform prior
Π₁ on Θ₁, i.e. the normalized Lebesgue measure on Θ₁: for all A ∈ ℬ₁,

Π₁(A) = μ(Θ₁)⁻¹ μ(A).

This induces a prior Π₁′ on P: for all B ∈ 𝒢,

Π₁′(B) = (Π₁ ∘ φ₁⁻¹)(B).   (3.5)
Example 3.2.1. As an example, consider the model P of all normal distributions centred on the
origin with variance between 0 and 1. We can parametrize this model by the variance τ ∈ (0, 1),

φ₁ : (0, 1) → P : τ ↦ N(0, τ),

or by the standard deviation σ ∈ (0, 1),

φ₂ : (0, 1) → P : σ ↦ N(0, σ²).   (3.6)

Although used more commonly than φ₁, parametrization φ₂ is not special in any sense: both
parametrizations describe exactly the same model. Now, suppose that we choose to endow the
first parametrization with a uniform prior Π₁, equal to the Lebesgue measure on (0, 1). By
(3.5), this induces a prior on P. Let us now see what this prior looks like if we consider P
parametrized by σ: for any constant C ∈ (0, 1), the point N(0, C²) of P is the image of τ = C²
under φ₁ and of σ = C under φ₂, so the prior Π₁′′ induced on (0, 1) in the σ-parametrization
has density

π₁′′(σ) = π₁(τ(σ)) (dτ/dσ)(σ).

Since Π₁ equals the Lebesgue measure and τ(σ) = σ², we find that the density of Π₁′′ with respect to the
Lebesgue measure equals:

(dΠ₁′′/dσ)(σ) = 2σ.

This density is non-constant and we see that Π₁′′ is non-uniform. In a subjectivist sense, the
prior Π₁′′ places higher prior belief on values of σ close to 1 than on values close to 0.
Definition 3.2.2. (Jeffreys prior) Let Θ ⊂ ℝᵈ parametrize a dominated model θ ↦ P_θ in
such a way that the Fisher information matrix I_θ exists for all θ ∈ Θ. Jeffreys prior is the
(possibly improper) prior with Lebesgue density

π(θ) ∝ √(det I_θ).   (3.7)

Example 3.2.2. Consider again the normal model of example 3.2.1 in the parametrization φ₂,
i.e. in terms of the standard deviation σ. The score function for σ is given by:

ℓ̇_σ(X) = (1/σ) ( X²/σ² − 1 ).
The Fisher information (which is a 1 × 1-matrix in this case) is then given by:

I_σ = P_σ( ℓ̇_σ ℓ̇_σᵀ ) = (1/σ²) P_σ( X²/σ² − 1 )² = 2/σ²,

so that (3.7) prescribes the prior

dΠ(σ) = (√2/σ) dσ,

for all σ ∈ Θ₂ = (0, 1). A similar calculation using the parametrization φ₁ shows that, in
terms of the parameter τ, Jeffreys prior takes the form:

dΠ(τ) = ( 1/(√2 τ) ) dτ,

for all τ ∈ Θ₁ = (0, 1).
for all 1 = (0, 1). That both densities give rise to the same measure on P is the
assertion of the following lemma.
Lemma 3.2.1. (Parametrization-independence of Jeffreys prior)
Consider the situation of (3.4) and assume that the parametrizations φ₁ and φ₂ satisfy the
conditions of definition 3.2.2. In addition, we require that the map φ₁⁻¹ ∘ φ₂ : Θ₂ → Θ₁ is
differentiable. Then the densities (3.7), calculated in the coordinates θ₁ and θ₂, induce the same
measure on P, Jeffreys prior.
Proof Since the Fisher information can be written as:

I_{θ₁} = P_{θ₁}( ℓ̇_{θ₁} ℓ̇_{θ₁}ᵀ ),

and the score ℓ̇_{θ₁}(X) is defined as the derivative of θ₁ ↦ log p_{θ₁}(X) with respect to θ₁, a
change of parametrization θ₁(θ₂) = (φ₁⁻¹ ∘ φ₂)(θ₂) induces a transformation of the form

I_{θ₂} = S₁,₂(θ₂) I_{θ₁(θ₂)} S₁,₂(θ₂)ᵀ

on the Fisher information matrix, where S₁,₂(θ₂) is the total-derivative matrix of θ₂ ↦ θ₁(θ₂)
in the point θ₂ of the model. Therefore,

√(det I_{θ₂}) dθ₂ = √(det( S₁,₂(θ₂) I_{θ₁(θ₂)} S₁,₂(θ₂)ᵀ )) dθ₂ = √( det(S₁,₂(θ₂))² det I_{θ₁(θ₂)} ) dθ₂
= √(det I_{θ₁(θ₂)}) |det S₁,₂(θ₂)| dθ₂ = √(det I_{θ₁}) dθ₁,

i.e. the form of the density is such that reparametrization leads exactly to the Jacobian for
the transformation of dθ₂ to dθ₁.
Ultimately, the above construction derives from the fact that the Fisher information I (or
in fact, any other positive-definite symmetric matrix-valued function on the model, e.g. the
Hessian of a twice-differentiable, convex function) can be viewed as a Riemann metric on the
manifold P. The construction of a measure with Lebesgue density (3.7) is then a standard
construction in differential geometry.
Example 3.2.3. To continue with the normal model of examples 3.2.1 and 3.2.2, we note that,
with τ = σ² and hence dτ = 2σ dσ,

√(det I_σ) dσ = (√2/σ) dσ = ( 1/(√2 σ²) ) · 2σ dσ = ( 1/(√2 τ) ) dτ = √(det I_τ) dτ,

which verifies the assertion of lemma 3.2.1 explicitly.
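The identity in this example can also be verified numerically: the sketch below (a Python illustration of our own) integrates the two unnormalized Jeffreys densities over corresponding σ- and τ-intervals and finds equal mass:

```python
import numpy as np
from scipy.integrate import quad

jeffreys_sigma = lambda s: np.sqrt(2.0) / s          # sqrt(det I_sigma)
jeffreys_tau = lambda t: 1.0 / (np.sqrt(2.0) * t)    # sqrt(det I_tau)

a, b = 0.2, 0.7                                      # sigma-interval
m_sigma, _ = quad(jeffreys_sigma, a, b)
m_tau, _ = quad(jeffreys_tau, a**2, b**2)            # same model subset: tau = sigma^2
print(m_sigma, m_tau)                                # equal (= sqrt(2) log(b/a))
```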
Other constructions and criteria for the construction of non-informative priors exist: currently very popular is the use of so-called reference priors, as introduced in Lindley (1956)
[65] and rediscovered in Bernardo (1979) [12] (see also Berger and Bernardo (1992) [9]). By
defining principle, a reference prior is required to maximize the Kullback-Leibler divergence
between prior and posterior. To motivate this condition, we have to look at information theory, from which the Kullback-Leibler divergence has emerged as one (popular but by no means
unique) way to quantify the notion of the amount of information contained in a probability
distribution. Sometimes called the Shannon entropy, the Kullback-Leibler divergence of a
distribution P with respect to the counting measure in discrete probability spaces,

S(P) = Σ_ω p(ω) log p(ω),

can be presented as such convincingly (see Boltzmann (1895, 1898) [22], Shannon (1948) [78]).
For lack of a default dominating measure, the argument does not extend formally to continuous
probability spaces, but is generalized nevertheless. A reference prior Π on a dominated,
parametrized model Θ → 𝒫 : θ ↦ P_θ for an observation Y is to be chosen such that the
Lindley entropy,

S_L = ∫ ∫ log( dΠ(θ|Y = y) / dΠ(θ) ) dΠ(θ|Y = y) dP^Π(y),
is maximized. Note that this definition does not depend on the specific parametrization, since
the defining property is parametrization independent. Usually, the derivation of a reference
prior [12] is performed in the limit where the posterior becomes asymptotically normal, c.f.
theorem 4.4.1. Jeffreys prior emerges as a special case of a reference prior.
For an overview of various objective methods of constructing priors, the reader is referred
to Kass and Wasserman (1995) [49]. When using non-informative priors, however, the following general warning should be heeded:
Remark 3.2.2. In many models, non-informative priors, including Jeffreys prior and reference priors, are improper.
3.3 Conjugate priors, hierarchical and empirical Bayes

Consider again estimation of a normal mean: we observe Y ~ N(θ, σ²) with known variance σ² > 0
and unknown θ ∈ Θ = ℝ. Choosing a normal prior Π = N(0, τ²), for some choice of τ² > 0, we easily
calculate that the posterior distribution is a normal distribution,

Π( θ ∈ A | Y ) = N( τ²Y/(σ² + τ²), σ²τ²/(σ² + τ²) )(A),

for every A ∈ ℬ. The posterior mean, a point-estimator for θ, is then given by,

θ̂(Y) = ( τ²/(σ² + τ²) ) Y.
The frequentist's criticism of Bayesian statistics focusses on the parameter τ²: the choice that
a subjectivist makes for τ² may be motivated by expert knowledge or belief, but remains the
statistician's personal touch in a context where the frequentist would prefer an answer of a
more universal nature. As long as some form of expert knowledge is available, the subjectivist's
argument constitutes a tenable point of view (or may even be compelling, see examples 1.2.1
and 2.1.2). However, in situations where no prior belief or information on the parameter is
available, or if the parameter itself does not have a clear interpretation, the subjectivist has
no answer. Yet a choice for τ² is required! Enter the objectivist's approach: if we have no
prior information on θ, why not express our prior ignorance by choosing a uniform prior
for θ? As we have seen in section 3.2, uniformity is parametrization-dependent (and, as such,
still dependent on the statistician's personal choice for one parametrization and not another).
Moreover, uniform priors are improper if Θ is unbounded in ℝᵏ. In the above example of
estimation of a normal mean, where Θ = ℝ is unbounded, insistence on uniformity leads to an
improper prior as well. Perhaps more true to the original interpretation of the prior, we might
express ignorance about τ² (and eliminate τ² from the point-estimator θ̂(Y)) by considering
more and more homogeneous (but still normal) priors by means of the limit τ → ∞, in which
case we recover the maximum-likelihood estimate: lim_{τ→∞} θ̂(Y) = Y.
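In code, the posterior computed above and the limit τ² → ∞ look as follows (a minimal Python sketch of our own):

```python
def normal_posterior(y, sigma2, tau2):
    """Posterior N(m, v) for theta given Y = y ~ N(theta, sigma2), prior N(0, tau2):
    m = tau2 y / (sigma2 + tau2),  v = sigma2 tau2 / (sigma2 + tau2)."""
    m = tau2 * y / (sigma2 + tau2)
    v = sigma2 * tau2 / (sigma2 + tau2)
    return m, v

y, sigma2 = 1.7, 1.0
for tau2 in (0.1, 1.0, 100.0, 1e6):
    print(tau2, normal_posterior(y, sigma2, tau2))   # posterior mean -> y = 1.7
```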
Remark 3.3.1. From a statistical perspective, however, there exists a better answer to the
question regarding τ²: if τ² is not known, why not estimate its value from the data!
In this section, we consider this solution both from the Bayesian and from the frequentist's
perspective, giving rise to procedures known as hierarchical Bayesian modelling and empirical
Bayesian estimation respectively.
Beforehand, we consider another type of choice of prior, which is motivated primarily by
mathematical convenience. Taking another look at the normal example with which we began
this section, we note that both the prior and the posterior are normal distributions. Since the
calculation of the posterior is tractable, any choice for the location and variance of the normal
prior can immediately be updated to values for location and variance of the normal posterior
upon observation of Y = y. Not only does this signify ease of manipulation in calculations
with the posterior, it also reduces the computational burden dramatically since simulation of
(or, sampling from) the posterior is no longer necessary.
Definition 3.3.1. Let (Θ, 𝒢) be a measurable space parametrizing a model for data Y ∈ 𝒴. A
family M of priors on (Θ, 𝒢) is called a conjugate family for the model, if the posterior based on
any prior Π ∈ M again lies in M:

Π( · | Y = y ) ∈ M,   (3.8)

for all y ∈ 𝒴.
This structure was first proposed by Raiffa and Schlaifer (1961) [70]. Their method for the
prior choice is usually classified as objectivist because it does not rely on subjectivist notions
and is motivated without reference to outside factors.
Remark 3.3.2. Often in the literature, a prior Π is referred to as a conjugate prior if the
posterior is of the same form. This practice is somewhat misleading, since it is the family M
that is closed under conditioning on the data Y, a property that depends on the model and on M,
but not on the particular Π ∈ M.
Example 3.3.1. Consider an experiment in which we observe n independent Bernoulli trials
and consider the total number of successes, Y Bin(n, p) with unknown parameter p [0, 1],
n k
Pp (Y = k) =
p (1 p)nk .
k
For the parameter p we choose a prior p Beta(, ) from the Beta-family, for some , > 0,
d(p) = B(, ) p1 (1 p)1 dp,
where B(, ) = ( + )/(() ()) normalizes . Then the posterior density with respect
to the Lebesgue measure on [0, 1] is proportional to:
d(p|Y ) pY (1 p)nY p1 (1 p)1 dp = p+Y 1 (1 p)+nY 1 dp,
We conclude that the posterior again lies in the Beta-family, with parameters equal to a dataamended version of those of the prior, as follows:
( |Y ) = Beta( + Y, + n Y ).
So the family of Beta-distributions is a conjugate family for the binomial model. Depending
on the available amount of prior information on p, the prior's parameters may be chosen on
subjective grounds (see figure 2.1 for graphs of the densities of Beta-distributions for various
parameter values). However, in the absence thereof, the parameters α, β suffer from the same
ambiguity that plagues the parameter τ² featuring in the example with which we opened this
section.
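To make the conjugate update of example 3.3.1 concrete, the following sketch (in Python; all parameter values are merely illustrative and not part of the example) computes the Beta posterior for simulated binomial data:

    from scipy.stats import beta, binom

    # Example 3.3.1 in numbers: Beta prior and binomial data.
    n, p_true = 20, 0.3            # number of Bernoulli trials; a 'true' p (arbitrary)
    a, b = 2.0, 2.0                # prior parameters alpha, beta (arbitrary)

    y = binom(n, p_true).rvs(random_state=1)   # observed number of successes

    # Conjugate update: posterior is Beta(alpha + Y, beta + n - Y).
    posterior = beta(a + y, b + n - y)
    print("observed Y =", y)
    print("posterior mean of p:", posterior.mean())
    print("95% credible set for p:", posterior.interval(0.95))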
Example 3.3.1 indicates a strategy to find conjugate families for a given parametrized,
dominated model P = {P_θ : θ ∈ Θ}. We view the densities y ↦ p_θ(y) as functions of the outcome
Y = y foremost, but they are functions of the parameter as well, and this dependence
θ ↦ p_θ(y) determines which prior densities θ ↦ π(θ) preserve their functional form when
multiplied by the likelihood p_θ(Y) to yield the posterior density.
Although we shall encounter an example of a conjugate family for a non-parametric model
in the next section, conjugate families are, by and large, part of parametric statistics. Many
of these families are so-called exponential families, for which conjugate families of priors can
be found readily.
Definition 3.3.2. A dominated collection of probability measures P = {P_θ : θ ∈ Θ} is called
a k-parameter exponential family if there exists a k ≥ 1 such that for all θ ∈ Θ,

    p_θ(x) = exp( Σ_{i=1}^{k} η_i(θ) T_i(x) − B(θ) ) h(x),    (3.9)

for certain statistics T_1, ..., T_k, real-valued functions η_1, ..., η_k, B on Θ and a non-negative
function h. Since only the exponential factor in (3.9) depends on the parameter, we may absorb
the functions η_i into the parameter itself: writing η = (η_1(θ), ..., η_k(θ)) and normalizing,
every exponential family can be rewritten in the form

    p_η(x) = exp( Σ_{i=1}^{k} η_i T_i(x) − A(η) ) h(x),    (3.10)

with η ranging over the natural parameter space H = { η ∈ ℝᵏ : ∫ exp(Σ_{i=1}^{k} η_i T_i(x)) h(x) dμ(x) < ∞ }.
Definition 3.3.3. An exponential family written in the form (3.10) is said to be in its canonical
representation. In addition, P is said to be of full rank if the interior of H ⊂ ℝᵏ is non-void,
i.e. H̊ ≠ ∅.
Although parametric, exponential families are both versatile modelling tools and mathematically
tractable; many common models, like the Bernoulli-, normal-, binomial-, Gamma- and
Poisson-models, can be rewritten in the form (3.9). One class of models that can
immediately be disqualified as possible exponential families is that of all models in which the
support depends on the parameter, like the family of all uniform distributions on ℝ, or the
Pareto-model. The statistical practicality of exponential families stems primarily from the fact
that for an exponential family of full rank, the statistics T_i, i = 1, ..., k, are sufficient and
complete, enabling the use of the Lehmann-Scheffé theorem for minimal-variance unbiased
estimation (see, for instance, Lehmann and Casella (1998) [59]). Their versatility can be
understood in many ways, e.g. by the Pitman-Koopman-Darmois theorem (see Jeffreys (1961) [47]),
which says that a family of distributions whose support does not depend on the parameter is
exponential if and only if, in the models describing its i.i.d. samples, there exist sufficient
statistics whose dimension remains bounded asymptotically (i.e. as we let the sample size
diverge to infinity).
Presently, however, our interest lies in the following theorem, which says that if a model
P constitutes an exponential family, there exists a conjugate family of priors for P.
Theorem 3.3.1. Let P be a model that can be written as an exponential family, cf. definition 3.3.2.
Then there exists a parametrization of P of the form (3.10), and the family of
distributions Π_{μ,τ}, defined by the Lebesgue probability densities

    π_{μ,τ}(η) = K(μ, τ) exp( Σ_{i=1}^{k} μ_i η_i − τ A(η) ),    (3.11)

(where μ ∈ ℝᵏ and τ ∈ ℝ are such that 0 < K(μ, τ) < ∞), is a conjugate family for P.
Proof It follows from the argument preceding definition 3.3.3 that P can be parametrized
as in (3.10). Choosing a prior on H of the form (3.11), we find that the posterior again takes
the form (3.11),

    π(η|X) ∝ exp( Σ_{i=1}^{k} η_i (μ_i + T_i(X)) − (τ + 1) A(η) )

(the factor h(X) arises both in numerator and denominator of (2.4) and is η-independent, so
that it cancels). The data-amended versions of the parameters μ and τ that emerge from the
posterior are therefore given by:

    ( μ + T(X), τ + 1 ),

and we conclude that the distributions Π_{μ,τ} form a conjugate family for P.
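As a numerical sanity check of theorem 3.3.1, consider the Poisson model in canonical form: p_η(x) = exp(ηx − e^η)/x!, so that T(x) = x and A(η) = e^η. The sketch below (the values of μ, τ and the observation are arbitrary choices, not taken from the text) verifies on a grid that multiplying the prior (3.11) by the likelihood reproduces the density with parameters (μ + T(x), τ + 1):

    import numpy as np

    mu, tau, x = 3.0, 2.0, 5            # prior parameters and one observation (arbitrary)
    eta = np.linspace(-5.0, 5.0, 2001)  # grid on the natural parameter
    d = eta[1] - eta[0]

    log_prior = mu * eta - tau * np.exp(eta)   # log pi_{mu,tau}, cf. (3.11)
    log_lik = eta * x - np.exp(eta)            # log-likelihood; h(x) drops out

    post = np.exp(log_prior + log_lik)
    post /= post.sum() * d                     # normalize numerically

    claim = np.exp((mu + x) * eta - (tau + 1) * np.exp(eta))
    claim /= claim.sum() * d                   # pi with parameters (mu + T(x), tau + 1)

    print("max abs difference:", np.abs(post - claim).max())  # ~ 0, up to rounding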
Remark 3.3.3. From a frequentist perspective, it is worth noting the import of the factorization
theorem, which says that the parameter-dependent factor in the likelihood is a function
of the data only through the sufficient statistic. Since the posterior is a function of the
likelihood, in which data-dependent factors that do not depend on the parameter cancel
between numerator and denominator, the posterior is a function of the data X only through
the sufficient statistic T(X). Therefore, if the exponential family P is of full rank (so that
T(X) is also complete for P), any point-estimator we derive from this posterior (e.g. the
posterior mean, see definition 2.2.1) that is unbiased and quadratically integrable is optimal
in the sense of Rao-Blackwell, cf. the theorem of Lehmann-Scheffé (see Lehmann and Casella
(1998) [59] for explanation of the Rao-Blackwell and Lehmann-Scheffé theorems).
Next, we turn to the Bayesian answer to remark 3.3.1, which said that parameters of the
prior (e.g. τ²) are to be estimated themselves. Recall that the Bayesian views a parameter
to be estimated as just another random variable in the probability model. In case we want
to estimate the parameter for a family of priors, then that parameter is to be included in
the probability space from the start. Going back to the example with which we started
this section, this means that we still use normal distributions P_θ = N(θ, σ²) to model the
uncertainty in the data Y, supply θ ∈ ℝ with a prior Π₁ = N(0, τ²) and then proceed to
choose another prior Π₂ for τ² ∈ (0, ∞):

    Y | θ, τ² = Y | θ ~ P_θ = N(θ, σ²),
    θ | τ² ~ Π₁ = N(0, τ²),
    τ² ~ Π₂.
Note that the parameter τ² has no direct bearing on the model distributions: conditional on θ,
Y is independent of τ². In a sense, the hierarchical Bayesian approach to prior choice combines
subjective and objective philosophies: whereas the subjectivist will make a definite, informed
choice for τ² and the objectivist will keep himself as uncommitted as possible by striving for
uniformity, the choice for a hierarchical prior expresses uncertainty about the value of τ² to be
used in the form of a probability distribution Π₂. As such, the hierarchical Bayesian approach
allows for intermediate prior choices: if Π₂ is chosen highly concentrated around one point in
the model, resembling a degenerate measure, the procedure will be close to subjective; if Π₂
is spread widely and is far from degenerate, the procedure will be less biased and closer to
objective. Besides interpolating between objective and subjective prior choices, the flexibility
gained through the introduction of Π₂ offers a much wider freedom of modelling. In particular, we
may add several levels of modelled parameter uncertainty to build up a hierarchy of priors for
parameters of priors. Such structures are used to express detailed subjectivist beliefs, much
in the way graphical models are used to build intricate dependency structures for observed
data (for a recent text on graphical models, see chapter 8 of Bishop (2006) [20]). The origins
of the hierarchical approach go back, at least, to Lindley and Smith (1972) [66].
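The interpolation between subjective and objective choices can be illustrated by simulation. In the sketch below (which assumes, purely for illustration, a log-normal hyperprior Π₂ with spread parameter s) a concentrated Π₂ reproduces an almost-fixed τ², while a widely spread Π₂ produces a correspondingly diffuse prior for θ:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_theta(s, size=100_000):
        # Hierarchy: tau^2 ~ Pi_2 (log-normal, spread s), theta | tau^2 ~ N(0, tau^2).
        tau2 = np.exp(rng.normal(0.0, s, size))
        return rng.normal(0.0, np.sqrt(tau2))

    for s in (0.01, 1.0, 3.0):
        theta = sample_theta(s)
        print(f"s = {s}: sd of the marginal prior for theta = {theta.std():.2f}")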
Definition 3.3.4. Let the data Y be random in (𝒴, ℬ). A hierarchical Bayesian model for Y
consists of a collection of probability measures P = {P_θ : θ ∈ Θ₀}, with (Θ₀, 𝒢₀) measurable
and endowed with a prior Π : 𝒢₀ → [0, 1] built up in the following way: for some k ≥ 1, we
introduce measurable spaces (Θᵢ, 𝒢ᵢ), i = 1, 2, ..., k, and conditional priors

    𝒢ᵢ × Θᵢ₊₁ → [0, 1] : (G, θᵢ₊₁) ↦ Πᵢ(G | θᵢ₊₁),

for i = 1, ..., k − 1, and a marginal Πₖ : 𝒢ₖ → [0, 1] on Θₖ. The prior for the original
parameter θ is then defined by,

    Π(θ ∈ G) = ∫_{Θ₁ × ... × Θₖ} Π₀(θ ∈ G | θ₁) dΠ(θ₁|θ₂) ... dΠ(θ_{k−1}|θₖ) dΠₖ(θₖ),    (3.12)

for all G ∈ 𝒢₀. The parameters θ₁, ..., θₖ are called hyperparameters and the priors Π₁, ..., Πₖ
their hyperpriors.
This definition elicits several remarks immediately.
Remark 3.3.4. Definition 3.3.4 of a hierarchical Bayesian model does not constitute a generalization
of the Bayesian procedure in any formal sense: after specification of the hyperpriors,
one may proceed to calculate the prior Π, cf. (3.12), and use it to infer on θ in the ways
indicated in chapter 2, without ever having to revisit the hierarchical background of Π. As such,
the significance of the definition lies entirely in its conceptual, subjective interpretation.
Remark 3.3.5. Definition 3.3.4 is very close to the general Bayesian model that incorporates
all parameters (θ, θ₁, ..., θₖ) as modelling parameters. What distinguishes hierarchical modelling
from the general situation is the dependence structure imposed on the parameters. The
parameter θ is distinct from the hyperparameters by the fact that, conditional on θ, the data
Y is independent of θ₁, ..., θₖ. This distinction is repeated at higher levels in the hierarchy,
i.e. levels are separated from one another through the conditional independence of θᵢ | θᵢ₊₁ from
θᵢ₊₂, ..., θₖ.
Remark 3.3.6. The hierarchy indicated in definition 3.3.4 inherently loses interpretability
as we ascend in level. One may be able to give a viable interpretation to the parameter θ
and to the hyperparameter θ₁, but higher-level parameters θ₂, θ₃, ... become harder and harder
to understand heuristically. Since the interpretation of the hierarchy requires a subjective
motivation of the hyperpriors, interpretability at each level is imperative; where it is lacking,
the corresponding hyperprior is left as a non-informative choice. In practice, Bayesian
hierarchical models are rarely more than two levels deep (k = 2) and the last hyperprior Πₖ
is often chosen by objective criteria.
Example 3.3.2. We observe the number of surviving offspring from a bird's litter and aim to
estimate the number of eggs the bird laid: the bird lays N ≥ 0 eggs, distributed according to a
Poisson distribution with parameter λ > 0. For the particular species of bird in question, the
Poisson rate λ is not known exactly: the uncertainty in λ can be modelled in many ways; here
we choose to model it by a Gamma-distribution Γ(α, β), where α and β are chosen to reflect
our imprecise knowledge of λ as well as possible. Each of the eggs then comes out, producing
a viable chick with known probability p ∈ [0, 1], independently. Hence, the total number Y of
surviving chicks from the litter is distributed according to a binomial distribution, conditional
on N,

    Y | N ~ Bin(N, p),
    N | λ ~ Poisson(λ),
    λ ~ Γ(α, β).
Conditional on N = n, the observation satisfies

    P(Y = k | N = n) = (n choose k) p^k (1 − p)^{n−k},

and the denominator Σ_{n≥0} P(N = n) P(Y = k | N = n) (or normalization factor for the posterior
given Y = k) can be read off once we have the expression for the numerator. We therefore
concentrate on the marginal for N = n, (n ≥ 0):

    P(N = n) = ∫₀^∞ P(N = n | λ) p_{α,β}(λ) dλ = 1/(Γ(α) β^α) ∫₀^∞ e^{−λ} (λⁿ/n!) λ^{α−1} e^{−λ/β} dλ.
The integral is solved using the normalization constant of the Γ(α + n, β/(β + 1))-distribution:

    ∫₀^∞ e^{−λ (β+1)/β} λ^{α+n−1} dλ = Γ(α + n) ( β/(β + 1) )^{α+n}.
Substituting, we find:

    P(N = n) = ( Γ(α + n) / (Γ(α) n!) ) (1/β^α) ( β/(β + 1) )^{α+n}
             = (1/n!) ( β/(β + 1) )ⁿ ( 1/(β + 1) )^α ∏_{l=1}^{n} (α + l − 1).    (3.13)
Although not in keeping with the subjective argumentation we insist on in the introduction to
this example, for simplicity we consider α = β = 1 and find that in that case,

    P(N = n) = (1/2)^{n+1}.
The posterior for N = n given Y = k then takes the form:

    P(N = n | Y = k) = (1/2)^{n+1} (n choose k) p^k (1 − p)^{n−k} / Σ_{m≥0} (1/2)^{m+1} (m choose k) p^k (1 − p)^{m−k}.
The eventual form of the posterior illustrates remark 3.3.4: in case we choose α = β = 1, the
posterior we find from the hierarchical Bayesian model does not differ from the posterior we
would have found had we started from the non-hierarchical model with a geometric prior,

    Y | N ~ Bin(N, p),
    N ~ Geo(1/2).

Indeed, even if we leave α and β free, the marginal distribution for N that we found in (3.13) is
none other than the prior (3.12) for this problem.
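The posterior above is easily evaluated numerically; the sketch below (with arbitrary values for p and the observed Y = k, and the sum over m truncated at a point where its terms are negligible) illustrates the computation:

    import numpy as np
    from scipy.stats import binom

    p, k = 0.7, 3                       # survival probability; observed chicks (arbitrary)
    n = np.arange(k, 200)               # the binomial coefficient vanishes for n < k

    numerator = 0.5 ** (n + 1) * binom.pmf(k, n, p)   # prior times likelihood
    posterior = numerator / numerator.sum()           # normalize over n

    print("posterior mode of N:", n[posterior.argmax()])
    print("posterior mean of N:", (n * posterior).sum())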
The conclusion one should draw from remark 3.3.4 and example 3.3.2 is that the hierarchical
Bayesian approach adds nothing new to the formal Bayesian procedure: eventually, it
amounts to a choice for the prior, just like in chapter 2. However, in a subjectivist sense, the
hierarchical approach allows for greater freedom and a more solid foundation on which to motivate
the choice for a certain prior over other possibilities. This point is all the more significant in light
of remark 3.1.1: the motivation of a subjectivist choice for the prior is part of the statistical
analysis rather than an external aspect of the procedure. Hierarchical Bayesian modelling
helps to refine and justify motivations for subjectivist priors.
But the subjectivist answer is not the only one relevant to the statistical perspective
of remark 3.3.1 on the initial question of this section. The objectivist Bayesian may argue
that any hyperprior should be chosen in a non-informative fashion, either as a matter of
principle, or to reflect lack of interpretability of, or prior information on, the parameter τ². Such
a strategy amounts to the hierarchical Bayesian approach with one or more levels of objective
hyperpriors, a point of view that retains only the modelling freedom gained through the
hierarchical approach.
More unexpected is the frequentist perspective on remark 3.3.1: if τ² is an unknown, point-estimate
it first and then perform the Bayesian analysis with this point-estimate as a plug-in
for the unknown τ². Critical notes can be placed with the philosophical foundations for this
practice, since it appears to combine the methods of two contradictory schools of statistics.
Be that as it may, the method is used routinely based on its practicality: eventually, the
justification comes from the subjectivist, who does not reject frequentist methods to obtain
expert knowledge on his parameters, as required in his own paradigm.
Remark 3.3.7. Good statistical practice dictates that one may not peek at the data to
decide which statistical method to use for the analysis of that same data. The rationale behind
this dictum is that pre-knowledge of the data could bias the analysis. If we take this point
strictly, the choice for a prior (read, the point-estimate for τ²) should not be made on the
basis of the same data Y that is used later to derive the posterior for θ. If one has
two independent realisations of the data, one can be used to choose the prior (here, by a
point-estimate for τ²) and the other to condition on in the posterior.
Yet the above rule cannot be taken too strictly: any statistician (and common sense)
will tell you that it is crucial for the statistical analysis to first obtain a certain feeling
for the statistical problem by inspection of the data, before making decisions on how to analyse
it (to see this point driven to the extreme, read, e.g., Tukey (1977) [82]). Ideally, one would
make those decisions based on a sample of the data that is independent of the data used in
the analysis proper. This precaution is often omitted, however: for example, it is common
practice to use plug-in parameters based on the sample Y whenever the need arises, possibly
leading to a bias in the subsequent analysis of the same data Y (unless the plug-in estimator
is independent of all other estimators used, of course).
There are many different ways in which the idea of a prior chosen by frequentist methods
is applied, all of which go under the name empirical Bayes. Following Berger [8], we note two
types of statistical questions that are especially well suited for its application. When we analyse
data pertaining to an individual from a larger population and it is reasonable to assume that
the prior can be inferred from the population, then one may estimate parameters like τ² above
from population data and use the estimates in the prior for the individual.
Another situation where empirical Bayes is often used is in model selection: suppose
that there are several models P₁, P₂, ... with priors Π₁, Π₂, ..., each of which may serve as
a reasonable explanation of the data, depending on an unknown parameter K ∈ {1, 2, ...}.
The choice to use the model-prior pair (P_k, Π_k) in the determination of the posterior can only
be made after observation (or estimation) of K. If K is estimated by frequentist methods, the
resulting procedure belongs to the realm of the empirical Bayes methods.
Example 3.3.3. Consider the situation where we are provided with a specimen from a population
that is divided into an unknown number of classes. Assume that all we know about
the classes is that they occur with equal probabilities in the population. The particular class
of our specimen remains unobserved. We perform a real-valued measurement Y on the specimen,
which is normally distributed with known variance (equal to one, say) and an unknown mean
μ_k ∈ ℝ that depends on the class k. Then Y is distributed according to a discrete mixture of
normal distributions of the form

    Y ~ P_{K; μ₁,...,μ_K} = (1/K) Σ_{k=1}^{K} N(μ_k, 1).
If one imagines the situation where the number of observations is of the same order as the
number of classes, this should come as no surprise.
A less ambitious application of empirical Bayesian methods is the estimation of hyperparameters
by maximum-likelihood estimation through the prior predictive distribution (see
definition 2.1.4). Recall that the marginal distribution of the data in the subjectivist Bayesian
formulation (cf. section 2.1) predicts how the data is distributed. This prediction may be
reversed to decide which value of the hyperparameter leads to the best explanation of the
observed data, where our notion of 'best' is based on the likelihood principle.
Denote the data by Y and assume that it takes its values in a measurable space (𝒴, ℬ).
Denote the model by P = {P_θ : θ ∈ Θ}. Consider a family of priors parametrized by a
hyperparameter η ∈ H, {Π_η : η ∈ H}. For every η, the prior predictive distribution P_η is
given by:

    P_η(A) = ∫_Θ P_θ(A) dΠ_η(θ),

for all A ∈ ℬ, i.e. we obtain a new model for the observation Y, given by P′ = {P_η : η ∈ H},
contained in the convex hull co(P) of the original model. Note that this new model is
parametrized by the hyperparameter; hence, if we close our eyes to the rest of the problem
and follow the maximum-likelihood procedure for estimation of η in this new model, we
find the value of the hyperparameter that best explains the observation Y. Assuming that
the model P′ is dominated, with densities {p_η : η ∈ H}, the maximum-likelihood estimate is
found as the point η̂(Y) ∈ H such that

    p_{η̂(Y)}(Y) = sup_{η ∈ H} p_η(Y).
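In the normal example of this section the ML-II estimate is explicit: with Y | θ ~ N(θ, 1) and Π_η : θ ~ N(0, η²) (taking data variance 1 purely for illustration), the prior predictive distribution is P_η = N(0, 1 + η²), and maximizing its likelihood yields η̂²(Y) = max(Y² − 1, 0). The sketch below (the observed value of Y is arbitrary) confirms this by a grid search:

    import numpy as np

    y = 2.4                                      # a single observation (arbitrary)
    eta2 = np.linspace(0.0, 25.0, 100_001)       # grid for the hyperparameter

    # log prior predictive density of N(0, 1 + eta^2), evaluated at Y = y
    log_pred = -0.5 * np.log(2 * np.pi * (1 + eta2)) - y**2 / (2 * (1 + eta2))

    print("grid maximizer of eta^2:", eta2[np.argmax(log_pred)])
    print("closed form max(Y^2 - 1, 0):", max(y**2 - 1, 0.0))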
3.4 Non-parametric priors
To prepare for the non-parametric case, we first consider distributions on a finite sample space:
let X be a finite pointset, X = {1, 2, ..., k}. Any probability distribution P on X is characterized
by its density p = (p₁, ..., p_k) with respect to the counting measure, through P(A) = Σ_{l∈A} p_l
for all A ⊂ X. Hence, the collection of all probability measures on X,

    M(X) = { P : 2^X → [0, 1] : Σ_{i=1}^{k} p_i = 1, p_i ≥ 0, (1 ≤ i ≤ k) },

is in bijective correspondence with the simplex in ℝᵏ. For reasons to be discussed shortly,
we consider the following family of distributions on M(X).
Definition 3.4.1. (Finite-dimensional Dirichlet distribution)
Let α = (α₁, ..., α_k) with αᵢ > 0 for all 1 ≤ i ≤ k. A stochastic vector p = (p₁, ..., p_k) is said
to have Dirichlet distribution D_α with parameter α, if the density for p satisfies:

    π(p) = ( Γ(Σ_{i=1}^{k} αᵢ) / (Γ(α₁) ... Γ(α_k)) ) p₁^{α₁−1} p₂^{α₂−1} ... p_{k−1}^{α_{k−1}−1} ( 1 − Σ_{i=1}^{k−1} pᵢ )^{α_k−1}.

For example, in the case k = 2, the density for p₁ takes the form

    π(p₁) = ( Γ(α₁ + α₂) / (Γ(α₁) Γ(α₂)) ) p₁^{α₁−1} (1 − p₁)^{α₂−1},

i.e. p₁ has a Beta distribution B(α₁, α₂). (Examples of graphs of Beta densities with α₁ = k + 1,
α₂ = n − k + 1 for various integer values of k are depicted in figure 2.1.) We also note the
following two well-known facts on the Dirichlet distribution (proofs can be found in [42]).
Lemma 3.4.1. (Gamma-representation of D_α)
If Z₁, ..., Z_k are independent and each marginally distributed according to a Γ-distribution
with parameter αᵢ, i.e.

    Zᵢ ~ Γ(αᵢ),

for all 1 ≤ i ≤ k, then the normalized vector

    ( Z₁/S, ..., Z_k/S ) ~ D_α,    (3.14)

with S = Σ_{i=1}^{k} Zᵢ.
Lemma 3.4.1 shows that we may think of a D_α-distributed vector as being composed
of k independent, Γ-distributed components, normalized to form a probability distribution
through division by S in (3.14). This division should be viewed as an L₁-projection from
the positive cone in ℝᵏ onto the (k − 1)-dimensional simplex. The following property can also
be viewed as a statement on the effect of a projection on a distribution, this time from the
simplex in ℝᵏ to lower-dimensional simplices. It is this property (related to a property called
infinite divisibility of the Dirichlet distribution) that motivates the choice for the Dirichlet
distribution made in definition 3.4.1.
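Lemma 3.4.1 also provides a practical recipe for sampling from D_α. A minimal simulation sketch (with an arbitrary parameter α) that checks the implied mean α/Σᵢαᵢ:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([2.0, 5.0, 1.0])               # Dirichlet parameter (arbitrary)

    Z = rng.gamma(shape=alpha, size=(100_000, 3))   # independent Gamma(alpha_i) draws
    p = Z / Z.sum(axis=1, keepdims=True)            # normalization onto the simplex

    print("empirical mean:  ", p.mean(axis=0))
    print("theoretical mean:", alpha / alpha.sum())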
Lemma 3.4.2. Let X be a finite pointset. If the density p : X → [0, 1] of a distribution P
is itself distributed according to a Dirichlet distribution with parameter α, p ~ D_α, then for
any partition {A₁, ..., A_m} of X, the vector of probabilities (P(A₁), P(A₂), ..., P(A_m)) has
a Dirichlet distribution again,

    ( P(A₁), P(A₂), ..., P(A_m) ) ~ D_{α′},

where the parameter α′ is given by:

    ( α′₁, ..., α′_m ) = ( Σ_{l∈A₁} α_l, ..., Σ_{l∈A_m} α_l ).    (3.15)
The identification (3.15) in lemma 3.4.2 suggests that we adopt a slightly different perspective
on the definition of the Dirichlet distribution: we view α as a finite measure on X,
so that P ~ D_α, if and only if, for every partition (A₁, ..., A_m),

    ( P(A₁), ..., P(A_m) ) ~ D_{(α(A₁),...,α(A_m))}.    (3.16)

Property (3.16) serves as the point of departure for the generalization to the non-parametric
model, because it does not depend on the finite nature of X.
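The merging property of lemma 3.4.2 can likewise be checked by simulation; in the sketch below X = {1, 2, 3, 4} and the partition consists of A₁ = {1, 2} and A₂ = {3, 4} (all choices arbitrary), so that P(A₁) should be Beta(α₁ + α₂, α₃ + α₄)-distributed:

    import numpy as np

    rng = np.random.default_rng(1)
    alpha = np.array([1.0, 2.0, 3.0, 4.0])

    p = rng.dirichlet(alpha, size=200_000)       # p ~ D_alpha on the 3-simplex
    q = p[:, :2].sum(axis=1)                     # P(A_1), aggregated over A_1 = {1, 2}

    a1, a2 = alpha[:2].sum(), alpha[2:].sum()    # parameters prescribed by (3.15)
    mean = a1 / (a1 + a2)
    var = a1 * a2 / ((a1 + a2) ** 2 * (a1 + a2 + 1))

    print("empirical mean, var:", q.mean(), q.var())
    print("Beta(3, 7) mean, var:", mean, var)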
Definition 3.4.2. Let X be a finite pointset; denote the collection of all probability measures
on X by M(X). The Dirichlet family D(X) is defined to be the collection of all Dirichlet
distributions on M(X), i.e. D(X) consists of all D_α with α a finite measure on X.
The following property of the Dirichlet distribution describes two independent Dirichlet-distributed
quantities in convex combination, which form a new Dirichlet-distributed quantity
if mixed by means of an (independent) Beta-distributed parameter.
Lemma 3.4.3. Let X be a finite pointset and let α₁, α₂ be two finite measures on (X, 2^X). Let
(P₁, P₂) be independent and marginally distributed as

    P₁ ~ D_{α₁},    P₂ ~ D_{α₂},

and let λ ~ B(α₁(X), α₂(X)) be independent of (P₁, P₂). Then λ P₁ + (1 − λ) P₂ ~ D_{α₁+α₂}.
Next, we consider the posterior based on a Dirichlet prior. Let X₁, ..., Xₙ be an i.i.d. sample
from a distribution P on X with density p. The likelihood takes the form

    ∏_{i=1}^{n} p(Xᵢ) = ∏_{l=1}^{k} p_l^{n_l},
where n_l = #{Xᵢ = l : 1 ≤ i ≤ n}. Multiplying by the prior density for Π = D_α, we find that
the posterior density is proportional to,

    π(p₁, ..., p_k | X₁, ..., Xₙ) ∝ π(p₁, ..., p_k) ∏_{i=1}^{n} p_{Xᵢ}
        ∝ ∏_{l=1}^{k} p_l^{n_l} ∏_{l=1}^{k−1} p_l^{α_l−1} ( 1 − Σ_{i=1}^{k−1} pᵢ )^{α_k−1}
        = ∏_{l=1}^{k−1} p_l^{α_l+n_l−1} ( 1 − Σ_{i=1}^{k−1} pᵢ )^{α_k+n_k−1},
which is again a Dirichlet density, but with changed base measure. Since the posterior is a
probability distribution, we know that the normalization factor follows suit. Noting that we
may view n_l as the density of the measure nPₙ (where Pₙ denotes the empirical measure), since

    n_l = Σ_{i=1}^{n} 1{Xᵢ = l} = n Pₙ({l}),

we see that the posterior is the Dirichlet distribution D_{α+nPₙ}: on a finite sample space, the
Dirichlet family is a conjugate family for the full model.
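In computational terms, the conjugate update on a finite sample space is a matter of adding observed counts to the prior parameter, as in the following sketch (prior parameter and data arbitrary):

    import numpy as np

    alpha = np.array([1.0, 1.0, 1.0])        # prior base measure on X = {1, 2, 3}
    data = np.array([0, 2, 2, 1, 2, 0, 2])   # i.i.d. sample, coded 0, 1, 2

    counts = np.bincount(data, minlength=alpha.size)
    alpha_post = alpha + counts              # posterior parameter: alpha + n P_n

    print("posterior parameter:", alpha_post)
    print("posterior mean density:", alpha_post / alpha_post.sum())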
Next we consider the Dirichlet process prior, a probability measure on the full non-parametric
model for a measurable space (X, ℬ). For the sake of simplicity, we assume that
X = ℝ and that ℬ is the Borel σ-algebra on ℝ. We denote the collection of all probability measures
on (ℝ, ℬ) by M(ℝ, ℬ). We consider the collection of random quantities {P(A) : A ∈ ℬ} and
impose two straightforward conditions on its finite-dimensional marginals. The Kolmogorov
existence theorem (see theorem A.5.1) then guarantees existence of a stochastic process with
finitely additive sample path P : ℬ → [0, 1]. Said straightforward conditions are satisfied
if we choose the finite-dimensional marginal distributions to be (finite-dimensional) Dirichlet
distributions (3.16). Also by this choice, σ-additivity of P can be guaranteed. The resulting
process on the space of all probability measures on (X, ℬ) is called the Dirichlet
process and the associated probability measure is called the Dirichlet process prior.
Theorem 3.4.2. (Existence of the Dirichlet process)
Given a finite measure α on (ℝ, ℬ), there exists a probability measure D_α on M(ℝ, ℬ) (called
the Dirichlet process prior with parameter α) such that for P ~ D_α and every ℬ-measurable
partition (B₁, ..., B_k) of ℝ,

    ( P(B₁), ..., P(B_k) ) ~ D_{(α(B₁),...,α(B_k))}.    (3.17)

Proof Let k ≥ 1 and A₁, ..., A_k ∈ ℬ be given. Through the indicators 1_{Aᵢ} for these sets,
we define 2ᵏ new sets B_{ε₁...ε_k} by

    1_{B_{ε₁...ε_k}} = ∏_{i=1}^{k} ( εᵢ 1_{Aᵢ} + (1 − εᵢ)(1 − 1_{Aᵢ}) ),

where ε₁, ..., ε_k ∈ {0, 1}. Then the collection {B_{ε₁...ε_k} : εᵢ ∈ {0, 1}, 1 ≤ i ≤ k} forms a
partition of ℝ. For the P-probabilities corresponding to this partition, we assume finite-dimensional
marginals

    ( P(B_{ε₁...ε_k}) : εᵢ ∈ {0, 1}, 1 ≤ i ≤ k ) ~ Π_{B_{ε₁...ε_k} : εᵢ ∈ {0,1}, 1 ≤ i ≤ k}.

The distribution of the vector (P(A₁), ..., P(A_k)) then follows from the definition:

    P(Aᵢ) = Σ_{{ε : εᵢ = 1}} P(B_{ε₁...ε_k}),

for all 1 ≤ i ≤ k. This defines marginal distributions for all finite subsets of ℬ, as needed
in theorem A.5.1. To define the underlying probability space (Ω, F, Π) we now impose two
conditions.
(F1) With Π-probability one, the empty set has P-measure zero:

    Π( P(∅) = 0 ) = 1.

(F2) If a partition (B′₁, ..., B′_{r_k}) refines (B₁, ..., B_k), in the sense that

    B₁ = ∪_{i=1}^{r₁} B′ᵢ,  ...,  B_k = ∪_{i=r_{k−1}+1}^{r_k} B′ᵢ,

(for certain r₁ < ... < r_{k−1} < r_k), then we have the following equality in distribution:

    L( Σ_{i=1}^{r₁} P(B′ᵢ), ..., Σ_{i=r_{k−1}+1}^{r_k} P(B′ᵢ) ) = L( P(B₁), ..., P(B_k) ).

Condition (F1) ensures that if (A₁, ..., A_k) is itself a partition of ℝ, the above construction
does not lead to a contradiction. Condition (F2) ensures finite additivity of P with prior
probability one, i.e. for any A, B, C ∈ ℬ such that A ∩ B = ∅ and A ∪ B = C,

    Π( P(A) + P(B) = P(C) ) = 1.    (3.18)
Ferguson (1973, 1974) [34, 35] has shown that conditions (F1) and (F2) imply that Kolmogorov's
consistency conditions (K1) and (K2) (see section A.5) are satisfied. As we have
seen in the first part of this section, if we impose the Dirichlet distribution,

    ( P(B_{ε₁...ε_k}) : εᵢ ∈ {0, 1}, 1 ≤ i ≤ k ) ~ D_{(α(B_{ε₁...ε_k}) : εᵢ ∈ {0,1}, 1 ≤ i ≤ k)},    (3.19)

and α is a measure on ℬ, condition (F2) is satisfied. Combining all of this, we conclude that
there exists a probability space (Ω, F, Π) on which the stochastic process {P(A) : A ∈ ℬ}
can be represented with finite-dimensional marginals as in (3.19). Lemma 3.4.4 shows that
Π(P ∈ M(ℝ, ℬ)) = 1, completing the proof.
The last line in the above proof may require some further explanation: P is merely the
sample-path of our stochastic process. The notation P(A) suggests that P is a probability
measure, but all we have shown up to that point is that (F1) and (F2) imply that P is a
finitely additive set-function such that

    P(B) ∈ [0, 1],

with Π-probability equal to one. What remains to be demonstrated is Π-almost-sure σ-additivity
of P.
Lemma 3.4.4. If Π is a Dirichlet process prior D_α on M(X, ℬ), then

    Π( P is σ-additive ) = 1.
Proof Let (Aₙ)ₙ≥₁ be a sequence in ℬ that decreases to ∅. Since α is σ-additive, α(Aₙ) →
α(∅) = 0. Therefore, there exists a subsequence (A_{n_j})_{j≥1} such that Σ_j α(A_{n_j}) < ∞. For
fixed ε > 0, using Markov's inequality first,

    Σ_{j≥1} Π( P(A_{n_j}) > ε ) ≤ Σ_{j≥1} (1/ε) ∫ P(A_{n_j}) dΠ(P) = (1/ε) Σ_{j≥1} α(A_{n_j})/α(ℝ) < ∞,

according to lemma 3.4.5. From the Borel-Cantelli lemma (see lemma A.2.1), we see that

    Π( lim sup_j { P(A_{n_j}) > ε } ) = Π( ∩_{J≥1} ∪_{j≥J} { P(A_{n_j}) > ε } ) = 0,

which shows that lim_j P(A_{n_j}) = 0, Π-almost-surely. Since, by Π-almost-sure finite additivity
of P,

    Π( P(Aₙ) ≥ P(A_{n+1}) ≥ ... ) = 1,

we conclude that limₙ P(Aₙ) = 0, Π-almost-surely. By the continuity theorem for measures
(see theorem A.2.1 and the proof in [52], theorem 3.2), P is σ-additive, Π-almost-surely.
The proof of lemma 3.4.4 makes use of the following lemma, which establishes the basic
properties of the Dirichlet process prior.
Lemma 3.4.5. Let α be a finite measure on (ℝ, ℬ) and let {P(A) : A ∈ ℬ} be the associated
Dirichlet process with distribution D_α. Let B ∈ ℬ be given.
(i) If α(B) = 0, then P(B) = 0, D_α-a.s.
(ii) If α(B) > 0, then P(B) > 0, D_α-a.s.
(iii) The expectation of P(B) under D_α is given by

    ∫ P(B) dD_α(P) = α(B)/α(ℝ).

Proof Let B ∈ ℬ be given. Consider the partition (B₁, B₂) of ℝ, where B₁ = B, B₂ = ℝ \ B.
According to (3.17),

    ( P(B₁), P(B₂) ) ~ D_{(α(B), α(ℝ)−α(B))},

so that P(B) ~ B(α(B), α(ℝ) − α(B)). The stated properties then follow from the properties of
the Beta-distribution.
This concludes the proof of the existence of Dirichlet processes and the associated priors.
One may then wonder about the nature of the prior we have constructed. As it turns out,
the Dirichlet process prior has some remarkable properties.
Lemma 3.4.6. (Support of the Dirichlet process prior)
Consider M(ℝ, ℬ), endowed with the topology of weak convergence. Let α be a finite measure
on (ℝ, ℬ). The support of D_α is given by

    supp( D_α ) = { P ∈ M(ℝ, ℬ) : supp(P) ⊂ supp(α) }.
In fact, we can be more precise, as shown in the following lemma.
Lemma 3.4.7. Let α be a finite measure on (ℝ, ℬ) and let {P(A) : A ∈ ℬ} be the associated
Dirichlet process with distribution D_α. Let Q ∈ M(ℝ, ℬ) be such that Q ≪ α. Then, for any
m ≥ 1, A₁, ..., A_m ∈ ℬ and ε > 0,

    D_α( P ∈ M(ℝ, ℬ) : |P(Aᵢ) − Q(Aᵢ)| < ε, 1 ≤ i ≤ m ) > 0.

Proof The proof of this lemma can be found in [42], theorem 3.2.4.
So if we endow M(ℝ, ℬ) with the (slightly stronger) topology of pointwise convergence (see
definition A.7.2), the support of D_α remains large, consisting of all P ∈ M(ℝ, ℬ) that are
dominated by α.
The following property reveals a most remarkable characterization of Dirichlet process
priors: the subset D(ℝ, ℬ) of all finite convex combinations of Dirac measures (see example A.2.2)
receives prior mass equal to one.
Lemma 3.4.8. Let α be a finite measure on (ℝ, ℬ) and let {P(A) : A ∈ ℬ} be the associated
Dirichlet process with distribution D_α. Then,

    D_α( P ∈ D(ℝ, ℬ) ) = 1.

Proof The proof of this lemma can be found in [42], theorem 3.2.3.
The above phenomenon leads to problems with support or convergence in stronger topologies
(like the total-variation or Hellinger topologies) and with regard to the Kullback-Leibler
criteria mentioned in the asymptotic theorems of chapter 4. Generalizing this statement somewhat,
we may infer from the above that the Dirichlet process prior is not suited to (direct)
estimation of densities. Although clearly dense enough in M(ℝ, ℬ) in the topology of weak
convergence, the set D(ℝ, ℬ) may be rather sparse in stronger topologies! (Notwithstanding
the fact that mixture models with a Dirichlet process prior for the mixing distribution can be
(minimax) optimal for the estimation of mixture densities [41].)
Lemma 3.4.9. Let α be a finite measure on (ℝ, ℬ) and let {P(A) : A ∈ ℬ} be the associated
Dirichlet process with distribution D_α. Let g : ℝ → ℝ be non-negative and Borel-measurable.
If ∫ g(x) dα(x) < ∞, then

    ∫ g(x) dP(x) < ∞,    (D_α-a.s.).
Perhaps the most important result of this section is the fact that the family of Dirichlet
process priors on M(ℝ, ℬ) is a conjugate family for the full, non-parametric model on (ℝ, ℬ):
given an i.i.d. sample X₁, ..., Xₙ from P with prior P ~ D_α, the posterior is again a Dirichlet
process,

    Π( · | X₁, ..., Xₙ ) = D_{α+nPₙ},

where Pₙ = n⁻¹ Σᵢ δ_{Xᵢ} denotes the empirical measure. By lemma 3.4.5(iii), the prior mean P̄
is the normalized base measure,

    P̄(B) = ∫ P(B) dD_α(P) = α(B)/α(ℝ),    (3.20)

and the posterior mean follows from the same lemma, applied to the posterior:

    P̃ₙ(B) = ∫ P(B) dΠ( P | X₁, ..., Xₙ ) = ∫ P(B) dD_{α+nPₙ}(P) = (α + nPₙ)(B) / (α + nPₙ)(ℝ)
          = ( α(ℝ)/(α(ℝ) + n) ) P̄(B) + ( n/(α(ℝ) + n) ) Pₙ(B),

P₀ⁿ-almost-surely. Defining λₙ = α(ℝ)/(α(ℝ) + n), we see that the posterior mean P̃ₙ can be
viewed as a convex combination of the prior mean distribution and the empirical distribution,

    P̃ₙ = λₙ P̄ + (1 − λₙ) Pₙ,

P₀ⁿ-almost-surely. As a result, we see that

    ‖P̃ₙ − Pₙ‖_{TV} = λₙ ‖P̄ − Pₙ‖_{TV} ≤ λₙ,

P₀ⁿ-almost-surely. Since λₙ → 0 as n → ∞, the difference between the sequence of posterior
means (P̃ₙ)ₙ≥₁ and the sequence of empirical measures (Pₙ)ₙ≥₁ converges to zero in total
variation as we let the sample size grow to infinity. Generalizing likelihood methods to non-dominated
models, Dvoretzky, Kiefer and Wolfowitz (1956) [30] have shown that the empirical
distribution Pₙ can be viewed as the non-parametric maximum-likelihood estimator (usually
abbreviated NPMLE). This establishes (an almost-sure form of) consistency for the posterior
mean, in the sense that it has the same point of convergence as the NPMLE. In chapter 4,
convergence of the posterior distribution (and in particular its mean) to the MLE will manifest
itself as a central connection between frequentist and Bayesian statistics.
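The convex-combination form of the posterior mean translates directly into code. The sketch below (with a standard normal base measure of total mass M and a sample from an arbitrary P₀, both illustrative choices) evaluates P̃ₙ on a grid of half-lines:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    M, n = 5.0, 50                            # total mass alpha(R); sample size (arbitrary)
    X = rng.normal(1.0, 1.0, size=n)          # i.i.d. sample from some P_0

    lam = M / (M + n)                         # lambda_n = alpha(R)/(alpha(R) + n)
    t = np.linspace(-3.0, 4.0, 8)
    F_base = norm.cdf(t)                      # prior mean distribution function
    F_emp = (X[:, None] <= t).mean(axis=0)    # empirical distribution function
    F_post = lam * F_base + (1 - lam) * F_emp # posterior mean distribution function

    print(np.round(F_post, 3))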
Remark 3.4.1. The above example provides the subjectivist with a guideline for the choice
of the base measure α. More particularly, equality (3.20) says that the prior predictive distribution
equals the (normalized) base measure α. In view of the fact that subjectivists should
choose the prior to reflect their prior beliefs, α should therefore be chosen such that it assigns
relatively high mass to sets B ∈ ℬ that are believed to be probable.
3.5 Exercises
Exercise 3.1. A proper Jeffreys prior
Let X be a random variable, distributed Bin(n; p) for known n and unknown p ∈ (0, 1).
Calculate Jeffreys prior for this model, identify the standard family of probability distributions
it belongs to and conclude that this Jeffreys prior is proper.
Exercise 3.2. Jeffreys and uniform priors
Let P be a model parametrized according to some mapping Θ → P : θ ↦ P_θ. Assuming
differentiability of this map, Jeffreys prior takes the form (3.7). In other parametrizations,
the form of this expression remains the same, but the actual dependence on the parameter
changes. This makes it possible that there exists another parametrization of P such that
Jeffreys prior is equal to the uniform prior. We shall explore this possibility in the three cases
below.
For each of the following models in their standard parametrizations θ ↦ P_θ, find a parameter
η ∈ H, η = η(θ), such that the Fisher information I_η, expressed in terms of η, is constant.
a. Find η for P the model of all Poisson distributions.
b. In the cases α = 1, 2, 3, find η for the model P consisting of all Γ(α, θ)-distributions,
with θ ∈ (0, ∞).
c. Find η for the model P of all Bin(n; θ) distributions, where n is known and θ ∈ (0, 1).
Note that if the Fisher information I_η is constant, Jeffreys prior is uniform. Therefore,
if H is unbounded, Jeffreys prior is improper.
Exercise 3.3. Optimality of unbiased Bayesian point-estimators
Let P be a dominated, parametric model, parametrized identifiably by Θ → P : θ ↦ P_θ, for
some Θ ⊂ ℝᵏ. Assume that (X₁, ..., Xₙ) ∈ 𝒳ⁿ form an i.i.d. sample from a distribution
P₀ = P_{θ₀} ∈ P, for some θ₀ ∈ Θ. Let Π be a prior on Θ and denote the posterior by
Π( · |X₁, ..., Xₙ). Assume that T : 𝒳ⁿ → ℝᵐ is a sufficient statistic for the model P.
a. Use the factorization theorem to show that the posterior depends on the data only through
the sufficient statistic T(X₁, ..., Xₙ).
b. Let θ̂ₙ : 𝒳ⁿ → Θ denote a point-estimator derived from the posterior. Use a. above to
argue that there exists a function φₙ : ℝᵐ → Θ, such that,

    θ̂ₙ(X₁, ..., Xₙ) = φₙ(T(X₁, ..., Xₙ)).
Bayesian point-estimators share this property with other point-estimators that are derived from
the likelihood function, like the maximum-likelihood estimator and penalized versions thereof.
Next, assume that P₀ⁿ (θ̂ₙ − θ₀)² < ∞ and that θ̂ₙ is unbiased, i.e. P₀ⁿ θ̂ₙ = θ₀.
c. Apply the Lehmann-Scheffé theorem to prove that, for any other unbiased estimator
θ̂′ₙ : 𝒳ⁿ → Θ with P₀ⁿ (θ̂′ₙ − θ₀)² < ∞,

    P₀ⁿ (θ̂ₙ − θ₀)² ≤ P₀ⁿ (θ̂′ₙ − θ₀)².
Exercise 3.7. Let X₁, ..., Xₙ form an i.i.d. sample from a binomial distribution Bin(n; p),
given p ∈ [0, 1]. For the parameter p we impose a prior p ~ B(α, β) with hyperparameters
α, β > 0.
Show that the family of B-distributions (Beta-distributions) is conjugate for binomial data.
Using (standard expressions for) the expectation and variance of B-distributions, give the
posterior mean and variance in terms of the original α and β chosen for the prior and the data.
Calculate the prior predictive distribution and give frequentist estimates for α and β. Substitute
the result in the posterior mean and comment on the (asymptotic) data dependence of the
eventual point-estimator.
Appendix A
Measure theory
In this appendix we collect some important notions from measure theory. The goal is not
to present a self-contained presentation, but rather to establish the basic definitions and
theorems from the theory for reference in the main text. As such, the presentation omits
certain existence theorems and many of the proofs of other theorems (although references are
given). The focus is strongly on finite (e.g. probability-) measures, in places at the expense
of generality. Some background in elementary set-theory and analysis is required. As a
comprehensive reference, we note Kingman and Taylor (1966) [52], alternatives being Dudley
(1989) [29] and Billingsley (1986) [15].
A.2 Measures
Rough setup: set-functions, (signed) measures, probability measures, sigma-additivity, sigma-finiteness.
Theorem A.2.1. Let (Ω, F) be a measurable space with measure μ : F → [0, ∞]. Then,
(i) for any monotone decreasing sequence (Fₙ)ₙ≥₁ in F such that μ(Fₙ) < ∞ for some n,

    lim_{n→∞} μ(Fₙ) = μ( ∩_{n=1}^{∞} Fₙ ),    (A.1)

(ii) for any monotone increasing sequence (Gₙ)ₙ≥₁ in F,

    lim_{n→∞} μ(Gₙ) = μ( ∪_{n=1}^{∞} Gₙ ).    (A.2)
Theorem A.2.1 is sometimes referred to as the continuity theorem for measures, because
if we view ∩ₙ Fₙ as the monotone limit lim Fₙ, (A.1) can be read as limₙ μ(Fₙ) = μ(limₙ Fₙ),
expressing continuity from above. Similarly, (A.2) expresses continuity from below. Note that
theorem A.2.1 does not guarantee continuity for arbitrary sequences in F. It should also be
noted that theorem A.2.1 is presented here in simplified form: the full theorem states that
continuity from below is equivalent to σ-additivity of μ (for a more comprehensive formulation
and a proof of theorem A.2.1, see [52], theorem 3.2).
Example A.2.1. Let Ω be a discrete set and let F be the powerset 2^Ω of Ω, i.e. F is the
collection of all subsets of Ω. The counting measure n : F → [0, ∞] on (Ω, F) is defined
simply to count the number n(F) of points in F ∈ F. If Ω contains a finite number of points,
n is a finite measure; if Ω contains a (countably) infinite number of points, n is σ-finite. The
counting measure is σ-additive.
Example A.2.2. We consider ℝ with any σ-algebra F, let x ∈ ℝ be given and define the
measure δₓ : F → [0, 1] by

    δₓ(A) = 1{x ∈ A},

for any A ∈ F. The probability measure δₓ is called the Dirac measure (or delta measure, or
atomic measure) degenerate at x and it concentrates all its mass in the point x. Clearly, δₓ
is finite and σ-additive. Convex combinations of Dirac measures, i.e. measures of the form

    P = Σ_{j=1}^{m} λⱼ δ_{xⱼ},

with λⱼ ≥ 0 and Σ_{j=1}^{m} λⱼ = 1, form a model for an observation X that takes values in a
discrete (but unknown) subset {x₁, ..., x_m} of ℝ. The resulting model (which we denote D(ℝ, ℬ)
for reference) is not dominated.
Often, one has a sequence of events (Aₙ)ₙ≥₁ and one is interested in the probability of a
limiting event A, for example the event that Aₙ occurs infinitely often. The following lemmas
pertain to this situation.
Lemma A.2.1. (First Borel-Cantelli lemma)
Let (Ω, F, P) be a probability space, let (Aₙ)ₙ≥₁ ⊂ F be given and denote A = lim sup Aₙ.
If

    Σ_{n≥1} P(Aₙ) < ∞,

then P(A) = 0.
In the above lemma, the sequence (Aₙ)ₙ≥₁ is general. To draw the converse conclusion,
the sequence needs to consist of independent events.
Lemma A.2.2. (Second Borel-Cantelli lemma)
Let (Ω, F, P) be a probability space, let (Aₙ)ₙ≥₁ ⊂ F be independent and denote A =
lim sup Aₙ. If

    Σ_{n≥1} P(Aₙ) = ∞,

then P(A) = 1.
Together, the Borel-Cantelli lemmas assert that for a sequence of independent events
(Aₙ)ₙ≥₁, P(A) equals zero or one, according as Σₙ P(Aₙ) converges or diverges. As such,
this corollary is known as a zero-one law, of which there are many in probability theory.
Exchangeability, De Finetti's theorem.
Theorem A.2.2. (De Finetti's theorem) State De Finetti's theorem.
Theorem A.2.3. (Ulam's theorem) State Ulam's theorem.
Definition A.2.1. Let (𝒴, ℬ) be a measurable space. Given a set-function μ : ℬ → [0, ∞],
the total-variation norm of μ is defined:

    ‖μ‖_{TV} = sup_{B∈ℬ} |μ(B)|.    (A.3)
Lemma A.2.3. Let (Y , B) be a measurable space. The collection of all signed measures on
Y forms a linear space and total variation is a norm on this space.
A.4 Integration
Rough setup: the definition of the integral, its basic properties, limit-theorems (Fatou, dominated
convergence) and Lᵖ-spaces.
Definition A.4.1. Let (Ω, F, μ) be a measure space. A real-valued measurable function
f : Ω → ℝ is said to be μ-integrable if

    ∫_Ω |f| dμ < ∞.    (A.4)
Remark A.4.1. If f is a stochastic vector taking values in ℝᵈ, the above definition of integrability
is extended naturally by imposing (A.4) on each of the component functions. This
extension is more problematic in infinite-dimensional spaces. However, various generalizations
can be found in an approach motivated by functional analysis (see Megginson (1998)
[67] for an introduction to functional analysis): suppose that f : Ω → X takes its values in
an infinite-dimensional space X. If (X, ‖ · ‖) is a normed space, one can impose that

    ∫ ‖f‖ dμ < ∞,

but this definition may be too strong, in the sense that too few functions f satisfy it. If X has
a dual X*, one may impose that for all x* ∈ X*,

    ∫ x*(f) dμ < ∞,

which is often a weaker condition than the one in the previous display. In case X is itself (a
subset of) the dual of a space X′, then X′ ⊂ X* and we may impose that for all x ∈ X′,

    ∫ f(x) dμ < ∞.

This last form of integrability is the one relevant for random probability measures P, viewed as
elements of the dual of a space X of bounded functions f, where it reads

    ∫ P f dΠ(P) < ∞,

for all f ∈ X. Then, suitable integrability is not an issue in the definition of the posterior
mean (2.2.1), since P|f| ≤ sup_{ℝⁿ} |f| = ‖f‖_∞ < ∞ for all f ∈ X and the posterior is a
probability measure.
Theorem A.4.1. (Fubini's theorem) State Fubini's theorem.
Theorem A.4.2. (Radon-Nikodym theorem) Let (Ω, F) be a measurable space and let μ, ν :
F → [0, ∞] be two σ-finite measures on (Ω, F). There exists a unique decomposition

    ν = ν_∥ + ν_⊥,

such that ν_∥ ≪ μ and ν_⊥ and μ are mutually singular. Furthermore, there exists a
finite-valued, F-measurable function f : Ω → ℝ such that for all F ∈ F,

    ν_∥(F) = ∫_F f dμ.    (A.5)
In particular, if ν ≪ μ with Radon-Nikodym derivative f = dν/dμ, then for any ν-integrable
random variable X,

    ∫ X dν = ∫ X f dμ.

Remark A.4.2. Integrability is not a necessary condition here, but the statement of the
corollary becomes rather less transparent if we indulge in generalization.
A.5 Stochastic processes
Theorem A.5.1. (Kolmogorov's existence theorem) Suppose that for every finite subset
{t₁, ..., tₙ} ⊂ T, finite-dimensional marginal distributions

    P_{t₁,...,tₙ}    (A.6)

are defined and satisfy conditions (K1) and (K2). Then there exists a probability space
(Ω, F, P) and a stochastic process { Xₜ : Ω → X : t ∈ T } such that all distributions of
the form (A.6) are marginal to P.
Kolmogorov's approach to the definition and characterization of stochastic processes in
terms of finite-dimensional marginals turns out to be of great practical value: it allows one to
restrict attention to finite-dimensional marginal distributions when characterizing the process.
The drawback of the construction becomes apparent only upon closer inspection of the σ-algebra F:
F is the σ-algebra generated by the cylinder sets, which implies that measurability
of events restricting an uncountable number of the Xₜ simultaneously cannot be guaranteed!
For instance, if T = [0, ∞) and X = ℝ, the probability that sample-paths of the process are
continuous,

    P( ω ∈ Ω : t ↦ Xₜ(ω) is continuous ),

may be ill-defined because it involves an uncountable number of t's. This is the ever-recurring
trade-off between generality and strength of a mathematical result: Kolmogorov's existence
theorem always works, but it does not give rise to a comfortably large domain for the resulting
P : F → [0, 1].
A.6 Conditional distributions
Definition A.6.1. Let (Ω, F, P) be a probability space and let A, B ∈ F with P(B) > 0 be
given. The conditional probability of A given B is defined as:

    P(A|B) = P(A ∩ B)/P(B).    (A.7)
Conditional probability given B describes a set-function on F and one easily checks that
this set-function is a measure. The conditional probability measure P( · |B) : F → [0, 1] can
be viewed as the restriction of P to F-measurable subsets of B, normalized to be a probability
measure. Definition (A.7) gives rise to a relation between P(A|B) and P(B|A) (in case both
P(A) > 0 and P(B) > 0, of course), which is called Bayes' Rule.
Lemma A.6.1. (Bayes' Rule)
Let (Ω, F, P) be a probability space and let A, B ∈ F be such that P(A) > 0, P(B) > 0.
Then

    P(A|B) P(B) = P(B|A) P(A).
However, being able to condition only on events B of non-zero probability is too restrictive.
Furthermore, B above is a definite event; it is desirable also to be able to discuss probabilities
conditional on events that have not been measured yet, i.e. to condition on a σ-algebra.
Definition A.6.2. Let (Ω, F, P) be a probability space, let C be a sub-σ-algebra of F and
let X be a P-integrable random variable. The conditional expectation of X given C, denoted
E[X|C], is a C-measurable random variable such that for all C ∈ C,

    ∫_C X dP = ∫_C E[X|C] dP.

The condition that X be P-integrable is sufficient for the existence of E[X|C]; E[X|C]
is unique P-almost-surely (see theorem 10.1.1 in Dudley (1989)). Often, the σ-algebra C is
the σ-algebra σ(Z) generated by another random variable Z. In that case we denote the
conditional expectation by E[X|Z]. Note that conditional expectations are random themselves:
realisation occurs only when we impose Z = z.
Definition A.6.3. Let (𝒴, ℬ) be a measurable space, let (Ω, F, P) be a probability space and
let C be a sub-σ-algebra of F. Furthermore, let Y : Ω → 𝒴 be a random variable taking
values in 𝒴. The conditional distribution of Y given C is P-almost-surely defined as follows:

    P_{Y|C}(A, ω) = E[ 1{Y ∈ A} | C ](ω).    (A.8)
Although seemingly innocuous, the fact that conditional expectations are defined only
P-almost-surely poses a rather subtle problem: for every A ∈ ℬ there exists an A-dependent
null-set on which P_{Y|C}(A, · ) is not defined. This is not a problem if we are interested only
in A (or in a countable number of sets). But usually, we wish to view P_{Y|C} as a probability
measure, that is to say, it must be well-defined as a map on the σ-algebra ℬ almost-surely.
Since most σ-algebras are uncountable, there is no guarantee that the corresponding union
of exceptional null-sets has measure zero as well. This means that definition (A.8) is not
sufficient for our purposes: the property that the conditional distribution is well-defined as a
map is called regularity.
Definition A.6.4. Under the conditions of definition A.6.3, we say that the conditional
distribution P_{Y|C} is regular if there exists a set E ∈ F such that P(E) = 0 and, for all
ω ∈ Ω \ E, P_{Y|C}( · , ω) satisfies (A.8) for all A ∈ ℬ.
Definition A.6.5. A topological space (S, T ) is said to be a Polish space if T is metrizable
with metric d and (S, d) is complete and separable.
Polish spaces appear in many subjects in probability theory, most notably in a theorem
that guarantees that conditional distributions are regular.
Theorem A.6.1. (Regular conditional distributions) Let 𝒴 be a Polish space and denote its
Borel σ-algebra by ℬ. Furthermore, let (Ω, F, P) be a probability space and Y : Ω → 𝒴 a
random variable taking values in 𝒴. Let C be a sub-σ-algebra of F. Then a conditional
distribution [MORE MORE]
Proof For a proof of this theorem, the reader is referred to Dudley (1989) [29], theorem 10.2.2.
Lemma A.7.3. When endowed with the topology of total variation, the space M (R, B) is a
Polish subspace of the Banach space of all signed measures on (R, B).
Bibliography
[1] M. Alpert and H. Raiffa, A progress report on the training of probability assessors, in: Judgement under uncertainty: heuristics and biases, eds. D. Kahneman, P. Slovic and A. Tversky, Cambridge University Press, Cambridge (1982).
[2] S. Amari, Differential-geometrical methods in statistics, Lecture Notes in Statistics No. 28, Springer Verlag, Berlin (1990).
[3] M. Bayarri and J. Berger, The interplay of Bayesian and frequentist analysis, preprint (2004).
[4] T. Bayes, An essay towards solving a problem in the doctrine of chances, Phil. Trans. Roy. Soc. 53 (1763), 370-418.
[5] A. Barron, L. Birgé and P. Massart, Risk bounds for model selection via penalization, Probability Theory and Related Fields 113 (1999), 301-413.
[6] S. Bernstein, Theory of probability (in Russian), Moscow (1917).
[7] A. Barron, M. Schervish and L. Wasserman, The consistency of posterior distributions in nonparametric problems, Ann. Statist. 27 (1999), 536-561.
[8] J. Berger, Statistical decision theory and Bayesian analysis, Springer, New York (1985).
[9] J. Berger and J. Bernardo, On the development of reference priors, Bayesian Statistics 4 (1992), 35-60.
[10] R. Berk, Consistency of a posteriori, Ann. Math. Statist. 41 (1970), 894-906.
[11] R. Berk and I. Savage, Dirichlet processes produce discrete measures: an elementary proof, Contributions to statistics, Reidel, Dordrecht (1979), 25-31.
[12] J. Bernardo, Reference posterior distributions for Bayesian inference, J. Roy. Statist. Soc. B41 (1979), 113-147.
[13] J. Bernardo and A. Smith, Bayesian theory, John Wiley & Sons, Chichester (1993).
[14] P. Bickel and J. Yahav, Some contributions to the asymptotic theory of Bayes solutions, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 11 (1969), 257-276.
[15] P. Billingsley, Probability and Measure (2nd edition), John Wiley & Sons, Chichester (1986).
[16] L. Birgé, Approximation dans les espaces métriques et théorie de l'estimation, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 65 (1983), 181-238.
[17] L. Birgé, Sur un théorème de minimax et son application aux tests, Probability and Mathematical Statistics 3 (1984), 259-282.
[18] L. Birgé and P. Massart, From model selection to adaptive estimation, Festschrift for Lucien Le Cam, Springer, New York (1997), 55-87.
[19] L. Birgé and P. Massart, Gaussian model selection, J. Eur. Math. Soc. 3 (2001), 203-268.
[20] C. Bishop, Pattern Recognition and Machine Learning, Springer, New York (2006).
[21] D. Blackwell and L. Dubins, Merging of opinions with increasing information, Ann. Math. Statist. 33 (1962), 882-886.
[22] L. Boltzmann, Vorlesungen über Gastheorie (2 volumes), Leipzig (1895, 1898).
[23] H. Cramér, Mathematical methods of statistics, Princeton University Press, Princeton (1946).
[24] A. Dawid, On the limiting normality of posterior distribution, Proc. Canad. Phil. Soc. B67 (1970), 625-633.
[25] P. Diaconis and D. Freedman, On the consistency of Bayes estimates, Ann. Statist. 14 (1986), 1-26.
[26] P. Diaconis and D. Freedman, On inconsistent Bayes estimates of location, Ann. Statist. 14 (1986), 68-87.
[27] P. Diaconis and D. Freedman, Consistency of Bayes estimates for nonparametric regression: Normal theory, Bernoulli 4 (1998), 411-444.
[28] J. Doob, Applications of the theory of martingales, Le calcul des Probabilités et ses Applications, Colloques Internationales du CNRS, Paris (1948), 22-28.
[29] R. Dudley, Real analysis and probability, Wadsworth & Brooks-Cole, Belmont (1989).
[30] A. Dvoretzky, J. Kiefer and J. Wolfowitz, Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator, Ann. Math. Statist. 27 (1956), 642-669.
[31] B. Efron, Defining curvature on a statistical model, Ann. Statist. 3 (1975), 1189-1242.
[32] B. Efron and R. Tibshirani, An introduction to the Bootstrap, Chapman and Hall, London (1993).
[33] M. Escobar and M. West, Bayesian density estimation and inference with mixtures, Journal of the American Statistical Association 90 (1995), 577-588.
[34] T. Ferguson, A Bayesian analysis of some non-parametric problems, Ann. Statist. 1 (1973), 209-230.
[35] T. Ferguson, Prior distribution on the spaces of probability measures, Ann. Statist. 2 (1974), 615-629.
[36] D. Freedman, On the asymptotic behavior of Bayes estimates in the discrete case I, Ann. Math. Statist. 34 (1963), 1386-1403.
[37] D. Freedman, On the asymptotic behavior of Bayes estimates in the discrete case II, Ann. Math. Statist. 36 (1965), 454-456.
[38] D. Freedman, On the Bernstein-von Mises theorem with infinite dimensional parameters, Ann. Statist. 27 (1999), 1119-1140.
[39] S. van de Geer, Empirical Processes in M-Estimation, Cambridge University Press, Cambridge (2000).
[40] S. Ghosal, J. Ghosh and R. Ramamoorthi, Non-informative priors via sieves and packing numbers, Advances in Statistical Decision Theory and Applications (eds. S. Panchapakeshan, N. Balakrishnan), Birkhäuser, Boston (1997).
[41] S. Ghosal, J. Ghosh and A. van der Vaart, Convergence rates of posterior distributions, Ann. Statist. 28 (2000), 500-531.
[42] J. Ghosh and R. Ramamoorthi, Bayesian Nonparametrics, Springer Verlag, Berlin (2003).
[43] P. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (1995), 711-732.
[44] T.-M. Huang, Convergence rates for posterior distributions and adaptive estimation, Carnegie Mellon University, preprint (accepted for publication in Ann. Statist.).
[45] I. Ibragimov and R. Hasminskii, Statistical estimation: asymptotic theory, Springer, New York (1981).
[46] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proc. Roy. Soc. London A186 (1946), 453-461.
[47] H. Jeffreys, Theory of probability (3rd edition), Oxford University Press, Oxford (1961).
[48] R. Kass and A. Raftery, Bayes factors, Journal of the American Statistical Association 90 (1995), 773-795.
[49] R. Kass and L. Wasserman, A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion, Journal of the American Statistical Association 90 (1995), 928-934.
[50] Yongdai Kim and Jaeyong Lee, The Bernstein-von Mises theorem of survival models (accepted for publication in Ann. Statist.).
[51] Yongdai Kim and Jaeyong Lee, The Bernstein-von Mises theorem of semiparametric Bayesian models for survival data (accepted for publication in Ann. Statist.).
[52] J. Kingman and S. Taylor, Introduction to measure and probability, Cambridge University Press, Cambridge (1966).
[53] B. Kleijn and A. van der Vaart, Misspecification in Infinite-Dimensional Bayesian Statistics, Ann. Statist. 34 (2006), 837-877.
[54] B. Kleijn and A. van der Vaart, The Bernstein-Von-Mises theorem under misspecification (submitted for publication in the Annals of Statistics).
[55] B. Kleijn and A. van der Vaart, A Bayesian analysis of errors-in-variables regression (submitted for publication in the Annals of Statistics).
[56] A. Kolmogorov and V. Tikhomirov, Epsilon-entropy and epsilon-capacity of sets in function spaces, American Mathematical Society Translations (series 2) 17 (1961), 277-364.
[57] P. Laplace, Mémoire sur la probabilité des causes par les événements, Mém. Acad. R. Sci. Présentés par Divers Savans 6 (1774), 621-656. (Translated in Statist. Sci. 1, 359-378.)
[58] P. Laplace, Théorie Analytique des Probabilités (3rd edition), Courcier, Paris (1820).
[59] E. Lehmann and G. Casella, Theory of point-estimation (2nd edition), Springer, New York (1998).
[60] E. Lehmann and J. Romano, Testing statistical hypothesis, Springer, New York (2005).
[61] L. Le Cam, On some asymptotic properties of maximum-likelihood estimates and related Bayes estimates, University of California Publications in Statistics 1 (1953), 277-330.
[62] L. Le Cam, On the assumptions used to prove asymptotic normality of maximum likelihood estimators, Ann. Math. Statist. 41 (1970), 802-828.
[63] L. Le Cam, Asymptotic methods in statistical decision theory, Springer, New York (1986).
[64] L. Le Cam and G. Yang, Asymptotics in Statistics: some basic concepts, Springer, New York (1990).
[65] D. Lindley, A measure of the information provided by an experiment, Ann. Math. Statist. 27 (1956), 986-1005.
[66] D. Lindley and A. Smith, Bayes estimates for the linear model, J. Roy. Statist. Soc. B43 (1972), 1-41.
[67] R. Megginson, An introduction to Banach Space Theory, Springer, New York (1998).
[68] R. von Mises, Wahrscheinlichkeitsrechnung, Springer Verlag, Berlin (1931).
[69] J. Munkres, Topology (2nd edition), Prentice Hall, Upper Saddle River (2000).
[70] H. Raiffa and R. Schlaifer, Decision analysis: introductory lectures on choices under uncertainty, Addison-Wesley, Reading (1961).
[71] C. Rao, Information and the accuracy attainable in the estimation of statistical parameters, Bull. Calcutta Math. Soc. 37 (1945), 81-91.
[72] C. Robert, The Bayesian choice: from decision-theoretic foundations to computational implementation, Springer, New York (2001).
[73] B. Ripley, Pattern recognition and neural networks, Cambridge University Press, Cambridge (1996).
[74] L. Savage, The subjective basis of statistical practice, Technical report, Dept. Statistics, University of Michigan (1961).
[75] M. Schervish, Theory of statistics, Springer, New York (1995).
[76] L. Schwartz, On Bayes procedures, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 4 (1965), 10-26.
[77] G. Schwarz, Estimating the dimension of a model, Ann. Statist. 6 (1978), 461-464.
[78] C. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal 27 (1948), 379-423, 623-656.
[79] X. Shen and L. Wasserman, Rates of convergence of posterior distributions, Ann. Statist. 29 (2001), 687-714.
[80] X. Shen, Asymptotic normality of semiparametric and nonparametric posterior distributions, Journal of the American Statistical Association 97 (2002), 222-235.
[81] H. Strasser, Mathematical Theory of Statistics, de Gruyter, Amsterdam (1985).
[82] J. Tukey, Exploratory data analysis, Addison-Wesley, Reading (1977).
[83] A. van der Vaart, Asymptotic Statistics, Cambridge University Press, Cambridge (1998).
[84] A. Walker, On the asymptotic behaviour of posterior distributions, J. Roy. Statist. Soc. B31 (1969), 80-88.
[85] L. Wasserman, Bayesian model selection and model averaging, J. Math. Psych. 44 (2000), 92-107.
[86] Y. Yang and A. Barron, An asymptotic property of model selection criteria, IEEE Transactions on Information Theory 44 (1998), 95-116.
Index
action, 38
admissible, 39
Bayes billiard, 17
Bayes factor, 35
belief, 10
bootstrap, 53
classification, 28, 43
classifier, 44
conditional distribution
  regular, 14, 98
conditional expectation, 97
conditional independence, 18
conditional probability, 96
confidence region, 32
conjugate family, 59, 75
consistency conditions, 95
continuity theorem, 92
counting measure, 4, 92
credible set, 33
  HPD, 34
critical region, 28
data, 1
  categorical, 1
  i.i.d., 1
  interval, 1
  nominal, 1
  ordinal, 1
  ranked, 1
  ratio, 1
decision, 38
decision principle
  minimax, 39
decision rule, 38
  Bayes, 42
  minimax, 40
  randomized, 40
decision-space, 38
Dirichlet distribution, 69
Dirichlet family, 70
Dirichlet process, 71
distribution
  unimodal, 23
empirical Bayes, 66
empirical expectation, 11
empirical process, 11
entropy
  Lindley, 58
  Shannon, 58
estimator, 4
  M-, 26
  MAP, 25
  maximum-a-posteriori, 25
  maximum-likelihood, 6, 11
  minimax, 41
  ML-II, 68
  non-parametric ML, 76
  penalized maximum-likelihood, 27
  small-ball, 25
exchangeability, 20
expectation
  empirical, 5
exponential family, 61
  canonical representation, 61
  of full rank, 61
feature vector, 44
hyperparameter, 63
hyperprior, 63
hypothesis, 27
  null, 27
  simple, 28
identifiability, 2
inadmissible, 39
inference, 38
infinite divisibility, 70
integrability, 93
lemma
  First Borel-Cantelli, 92
  Second Borel-Cantelli, 93
level, 28, 33
  confidence, 32
  significance, 28
likelihood, 7
likelihood principle, 6
limit distribution, 5
location, 22
loss-function, 25, 38
  L2-, 41
measure
  atomic, 92
  delta, 92
  Dirac, 92
misclassification, 44
model, 2
  dimension, 3
  dominated, 2
  full non-parametric, 4
  hierarchical Bayes, 63
  identifiable, 2
  mis-specified, 3
  non-parametric, 4
  normal, 3
  parametrized, 2
  parametric, 3
  well-specified, 3
norm
  total-variation, 6, 93
odds ratio
  posterior, 35
  prior, 35
optimality criteria, 6
over-fitting, 67
p-value, 28, 30
parameter space, 2
pointwise convergence, 99
Polish space, 98
Portmanteau lemma, 98
posterior, 8, 15
posterior expectation, 23
posterior mean, 23
  parametric, 23
posterior median, 25
posterior mode, 25
power function, 28
  sequence, 31
power-set, 4
powerset, 92
predictive distribution
  posterior, 19
  prior, 19, 75
preferred
  Bayes, 42
  minimax, 39
prior, 8, 20
  conjugate, 60
  Dirichlet process, 21
  improper, 54
  informative, 50
  Jeffreys, 56
  non-informative, 53
  objective, 53
  reference, 58
  subjective, 50
  subjectivist, 15
probability density, 95
R-better, 39
Radon-Nikodym derivative, 95
rate of convergence, 5
risk
  Bayes, 42
  minimax, 39
risk function
  Bayesian, 42
sample-average, 5, 11
sample-size, 5
samplespace, 1, 38
significance level, see level
  asymptotic, 29
simplex, 4
state, 38
state-space, 38
statistic, 5, 32
statistics
  inferential, 38
stochastic process, 95
support, 16
test
  asymptotic, 30
  more powerful, 31
test sequence, 30
test-statistic, 28
theorem
  central limit, 5
  De Finetti's, 93
  Fubini's, 94
  Glivenko-Cantelli, 6
  Minimax, 40
  Radon-Nikodym, 94
  Ulam's, 93
type-I error, 28
type-II error, 28
utility, see utility-function
utility-function, 38
weak convergence, 98
zero-one law, 93
Cover Illustration
The figure on the front cover originates from Bayes (1763), An essay towards solving a problem in the doctrine of
chances (see [4] in the bibliography), and depicts what is nowadays known as Bayes' billiard. To demonstrate
the uses of conditional probabilities and Bayes' Rule, Bayes came up with the following example: one white ball
and n red balls are placed on a billiard table of length normalized to 1, at independent, uniformly distributed
positions. Conditional on the distance X of the white ball to one end of the table, the probability of finding
exactly k of the n red balls closer to that end is easily seen to be:

    P( k | X = x ) = ( n! / (k! (n − k)!) ) x^k (1 − x)^{n−k}.

One finds the probability that k red balls are closer than the white, by integrating with respect to the position
of the white ball:

    P(k) = ∫₀¹ P( k | X = x ) dx = 1/(n + 1).

Application of Bayes' Rule then gives rise to a Beta-distribution B(k + 1, n − k + 1) for the position of the
white ball conditional on the number k of red balls that are closer. The density,

    β_{k+1,n−k+1}(x) = ( (n + 1)! / (k! (n − k)!) ) x^k (1 − x)^{n−k},

for this Beta-distribution is the curve drawn at the bottom of the billiard in the illustration. (See example 2.1.2.)