
SB2.1
Foundations of Statistical Inference
Michaelmas Term 2022

George Deligiannidis
Contents

0 Notations

1 Exponential Families
1.1 Definition and examples
1.1.1 Support and counterexamples
1.2 Parsimonious parametrisation
1.3 The parameter space
1.4 Curved exponential families

2 Sufficiency and Minimality
2.1 Sufficiency
2.2 Minimality
2.3 Minimal sufficiency in exponential families

3 Fisher Information
3.1 The one-dimensional case
3.2 The multivariate case

4 Point estimation
4.1 The method of moments
4.2 Maximum likelihood estimators
4.2.1 Finding the MLE
4.3 Variance and mean squared error

5 MVUEs and the Cramer-Rao Lower Bound
5.1 The CRLB in the one-dimensional case
5.2 Efficiency
5.3 The multivariate case
5.4 MLEs and MVUEs

6 The Rao-Blackwell and Lehmann-Scheffé theorems

7 Bayesian Inference: Conjugacy and Improper Priors
7.1 Recap of fundamentals
7.2 Conjugate priors
7.3 Improper priors
7.4 Predictive Distributions

8 Non-Informative Priors
8.1 Uniform priors
8.2 Jeffrey’s prior
8.2.1 Jeffrey’s prior in higher dimensions
8.3 Maximum entropy prior

9 Hierarchical Models
9.1 Example
9.2 Definition
9.3 Exchangeability
9.4 Gaussian data example

10 Decision Theory
10.1 Basic framework and risk function
10.2 Admissibility
10.3 Minimax rules and Bayes rules
10.4 Bayes rule and posterior risk
10.5 Point estimation
10.6 Finite decision problems
10.6.1 The case k = 2

11 The James-Stein Estimator

12 Empirical Bayes Methods
12.1 Basic setup
12.2 Choice of point estimate
12.3 James-Stein and empirical Bayes
12.4 Non-parametric empirical Bayes

13 Hypothesis Tests
13.1 Recap from part A
13.1.1 General setup
13.1.2 Neyman-Pearson Theorem
13.1.3 Uniformly most powerful tests
13.2 Bayes factors
13.2.1 Bayes factors for simple hypotheses
13.2.2 Bayes factors for composite hypotheses
13.3 Hypothesis testing in the context of decision theory
13.3.1 Bayes tests for simple-simple hypotheses
13.3.2 The case of the 0–1 loss function
13.4 Exponential families
13.5 Two-sided hypothesis tests
13.5.1 UMPU tests for one-parameter exponential families

This set of notes is largely based on notes prepared by Damian Falck in MT2020 based on slides and
lectures given by Julien Berestycki. The material goes back to material prepared by Judith Rousseau.

This is still a work in progress and may contain typos or even errors. I would very much appreciate your
help in improving them, so if you spot anything please let me know at [email protected].

Chapter 0

Notations

The situations of interest to us in this course start in general with having observed some data x, where
x is a point in X .

Example. Consider a large field of soybean plants. During 7 weeks, each Monday 5 plants are
randomly chosen and the average height recorded.

The data are x = {5, 13, 16, 13, 23, 33, 40}. Here X = R^7_+.

We will consider x as the realisation of a random variable X, taking values in some measurable space
(X , F), where the distribution of X is (at least partly) unknown. Statistical inference is about using
x to gain information on the distribution of X.

We write P(X ) for the collection of all probability measures on (X , F). We will usually be interested in
a smaller class or family of possible distributions P ⊂ P(X ) parametrised by some parameter θ:

Definition 0.1. A set P = {Pθ : θ ∈ Θ}, where the Pθ are probability distributions on X , is called
a statistical model . Here Θ is the parameter space.

If Pθ is absolutely continuous w.r.t. some reference measure µ (for our purposes the Lebesgue measure),
we write f (x, θ) or pθ (x) for its probability density function, whereas if Pθ is discrete we write f (x, θ)
for its probability mass function. We will often identify a distribution Pθ with its density. We write
Eθ [·] and Pθ [·] to mean expectations/probabilities under Pθ ; so in Eθ [ϕ(X)], for example, we take X to
have distribution Pθ .

Other possible notations for the same mass/density include pθ (x), p(x, θ), p(x | θ), f (x | θ), Pθ (X = x)
(in the discrete case), and L(θ; x).
Remark (Remark on use of notation). Throughout this course we will freely drift between different
notations for the same objects. This is somewhat intentional.

Chapter 1

Exponential Families

1.1 Definition and examples


There is one particular class of statistical models that will come up time and time again in our journey,
and to which many of the common distributions belong:

Definition 1.1. A family P = {Pθ : θ ∈ Θ} of probabilities (pmf or pdf) on some set X, indexed by
θ, is called an exponential family if there exist k ∈ N, functions η₁, ..., η_k, B : Θ → R, statistics
T₁, ..., T_k : X → R and a non-negative real-valued function h on X such that the pdf/pmfs p(x; θ)
of Pθ have the form

(1.1)    p(x; θ) = exp( Σ_{i=1}^k η_i(θ) T_i(x) − B(θ) ) h(x).

Remark. By changing the set X if necessary, we can always focus on the case where 0 < h(x) < ∞, for
all x ∈ X without changing the exponential family.
Remark. The above should be implicitly understood as saying that

dPθ(x) = p(x; θ) dµ(x)

for some reference measure µ. When x ∈ R^d, µ will typically be Lebesgue measure, and for discrete sets
it will be the counting measure.

The ηi are called the natural or canonical parameters, and the Ti (x) are called the natural or
canonical observations.

Since for all θ ∈ Θ

1 = ∫_X p(x; θ) dx = exp(−B(θ)) ∫_X h(x) exp( Σ_{i=1}^k η_i(θ) T_i(x) ) dx,

we can think of exp(−B(θ)) as a normalisation. Observe that B only depends on θ through η(θ).
(In the above expression the integral should be a sum if p(x; θ) is a pmf; alternatively we can think of
the integral as being with respect to the reference measure µ to encompass both cases.)

Often, it is useful to use the η_i as the parameters and to write the model in its canonical form,

p(x; η) = exp( Σ_{i=1}^k η_i T_i(x) − B(η) ) h(x).


(Note this is possible even if θ ↦ η(θ) is not one-to-one. Note also that there is a slight abuse of notation,
as (x, θ) ↦ p(x; θ) and (x, η) ↦ p(x; η) are not the same function.)

In general θ and x can be multidimensional.

Examples (Common 1-parameter exponential families).


• Poisson distribution. For the Poi(θ) distribution, the mass function f(x; θ) = e^{−θ} θ^x / x!
(x = 0, 1, 2, ...) can be written as

(1.2) f(x; θ) = (1/x!) e^{−θ + x log θ}
(1.3)         = h(x) exp( η(θ) x − B(θ) )

with h(x) = 1/x!, η(θ) = log θ, B(θ) = θ and T(x) = x. The natural parameter is log θ.
• Binomial distribution with known number of trials. For the Bin(n, p) distribution,
considering n to be known and p to be the parameter, the mass function may be written as

(1.4) f(x; p) = C(n, x) p^x (1 − p)^{n−x}
(1.5)         = C(n, x) exp( x ( log p − log(1 − p) ) + n log(1 − p) )

(for x = 0, 1, ..., n), where C(n, x) denotes the binomial coefficient. So h(x) = C(n, x), T(x) = x,
η(p) = log( p/(1 − p) ) and B(p) = −n log(1 − p).

• Gaussian distribution with known variance. For the N(µ, 1) distribution (for example),
the density may be written as

f(x; µ) = (1/√(2π)) exp( −(x − µ)²/2 ) = ( exp(−x²/2)/√(2π) ) exp( µx − µ²/2 ),

so h(x) = exp(−x²/2)/√(2π), η(µ) = µ, T(x) = x and B(µ) = µ²/2.

Examples (Common 2-parameter exponential families).

• Gamma distribution. For the Gamma(α, β) distribution, with θ = (α, β), we have density

(1.6) f(x; θ) = ( β^α x^{α−1} e^{−βx} / Γ(α) ) 1_{x⩾0}
(1.7)         = exp( (α − 1) log x − β x − ( log Γ(α) − α log β ) ) 1_{x⩾0},

with η₁(θ) = α − 1, T₁(x) = log x, η₂(θ) = −β, T₂(x) = x, B(θ) = log Γ(α) − α log β and
h(x) = 1_{x⩾0}.

• Gaussian distribution. For the N(µ, σ²) distribution, with θ = (µ, σ²), we have density

(1.8) f(x; θ) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) )
(1.9)         = exp( −(1/(2σ²)) x² + (µ/σ²) x − ( µ²/(2σ²) + (1/2) log(2πσ²) ) ),

with η₁(θ) = −1/(2σ²), T₁(x) = x², η₂(θ) = µ/σ², T₂(x) = x and
B(θ) = µ²/(2σ²) + (1/2) log(2πσ²).
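The following short numerical check (a sketch added here, not part of the original notes) verifies the Poisson factorisation above: the exponential-family form with h(x) = 1/x!, η(θ) = log θ, B(θ) = θ and T(x) = x reproduces the pmf exactly.

```python
# Sanity-check sketch: compare the Poisson pmf computed directly with the
# exponential-family factorisation p(x; theta) = h(x) exp(eta(theta) T(x) - B(theta)).
import math

def poisson_pmf_direct(x, theta):
    return math.exp(-theta) * theta**x / math.factorial(x)

def poisson_pmf_expfam(x, theta):
    h = 1.0 / math.factorial(x)   # h(x) = 1/x!
    eta = math.log(theta)         # natural parameter eta(theta) = log(theta)
    B = theta                     # B(theta) = theta
    T = x                         # natural observation T(x) = x
    return h * math.exp(eta * T - B)

for x in range(20):
    assert abs(poisson_pmf_direct(x, 3.5) - poisson_pmf_expfam(x, 3.5)) < 1e-12
print("the two factorisations agree")
```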


1.1.1 Support and counterexamples


One important property of exponential families is that all distributions in such a family are equivalent,
as stated in the following proposition.

Proposition 1.2. Two probability measures P and Q are said to be equivalent if, for every measurable
set N, P(N) = 0 iff Q(N) = 0. If P = {p(x; θ) : θ ∈ Θ} is an exponential family, then all p(·; θ) are equivalent.

Proof. Take θ₁ ≠ θ₂ ∈ Θ and suppose P_{θ₁}(N) = 0. Write 1_N for the indicator function of N. Then

(1.10) P_{θ₁}(N) = e^{−B(θ₁)} ∫_X exp( Σ_j η_j(θ₁) T_j(x) ) h(x) 1_N(x) dx = 0.

This implies that h(x) 1_N(x) = 0 for Lebesgue-almost all x, and therefore that

(1.11) P_θ(N) = e^{−B(θ)} ∫_X exp( Σ_j η_j(θ) T_j(x) ) h(x) 1_N(x) dx = 0

for arbitrary θ ∈ Θ.

Definition 1.3. For a distribution P with density f , the support of P, equivalently of f , is defined
as the set
supp(P) = {x : f (x) > 0}.

Corollary 1.4. In an exponential family P = {f (x; θ), θ ∈ Θ} the support of f (x; θ) does not
depend on θ. We will write A for the common support of the f (x; θ).

In fact, in the general case where h is allowed to vanish on a subset of X , A = {x : h(x) > 0}. It can be
easily seen that Pθ (A) = 1 for all θ.

Example. The family given by f(x; θ) = e^{θ−x} 1_{x>θ}, θ ∈ R, is not an exponential family: its support (θ, ∞) depends on θ.

Example. Another example of a family which is not exponential is the Cauchy family with location
parameter µ:

f(x; µ) = 1 / ( π(1 + (x − µ)²) ).

1.2 Parsimonious parametrisation


Exponential families P generally have multiple representations. Although η = (η₁, η₂, ..., η_k),
T = (T₁, ..., T_k) and k are not uniquely determined, we call (1.1) a k-dimensional family. We will see that
optimal statistical procedures will only depend on the k-dimensional statistic T, and it therefore makes
sense to choose k as small as possible. An exponential family with a minimal number of summands
is called a strictly k-parameter exponential family.

Definition 1.5. A class of probability measures P = {p(x; θ) : θ ∈ Θ} which is an exponential
family is said to be strictly k-parameter when k is minimal, that is, if P = {g(x; θ) : θ ∈ Θ} with

g(x; θ) = h̃(x) exp( Σ_{i=1}^l η̃_i(θ) T̃_i(x) − B̃(θ) ),

then l ≥ k.

Definition 1.6. A representation of an exponential family P of the form (1.1) is called minimal
if k is minimal; that is for any other representation of P in terms of {η̃j : j ≤ l}, {T̃j : j ≤ l} we
have l ≥ k.

This means that one cannot find s < k, such that p(x; θ) can be written in the form (1.1) with s replacing
k and some new statistics T1′ , . . . , Ts′ , new functions η1′ , . . . , ηs′ and B ′ on Θ and a new function h′ .

It can be easily seen that if the {T_i}_{i=1}^k satisfy some affine relationship of the form Σ_i c_i T_i(x) = c₀
for all x ∈ A, then we can rewrite one of the T_i in terms of the rest and a constant, reducing k by 1.
Similarly if the {η_i}_{i=1}^k satisfy an affine relationship. In fact we can keep reducing k until {η_i}_{i=1}^k and
{T_i}_{i=1}^k become affinely independent.

Definition 1.7. The functions T₁, ..., T_n are called affinely independent (P-affine independent
in [LZ16]) if for any c₀, ..., c_n ∈ R,

( Σ_{j=1}^n c_j T_j(x) = c₀  µ-almost everywhere )  =⇒  ( c_j = 0 for j = 0, ..., n ).

The functions η₁, ..., η_n are affinely independent if

( Σ_{j=1}^n c_j η_j(θ) = c₀  ∀θ ∈ Θ )  =⇒  ( c_j = 0 for j = 0, ..., n ).

(Recall from Corollary 1.4 that A denotes the common support of the exponential family.)

Remark. In the case of the functions T₁, ..., T_n it may help to intuitively understand affine independence
as saying that for any c₀, ..., c_n ∈ R,

( Σ_{j=1}^n c_j T_j(x) = c₀  ∀x ∈ A )  =⇒  ( c_j = 0 for j = 0, ..., n ),

where A was defined in Corollary 1.4 as the common support of the exponential family—notice that
P_θ(A) = 1 for all θ.
Remark. If the functions {η_j(·)} are not affinely independent, then their image {η(θ) : θ ∈ Θ} is contained
in a (k − 1)-dimensional hyperplane. Similarly for {T_j(·)}, ignoring a set of measure zero.
Remark. It is easy to see that affine independence is stronger than linear independence. For example,
the functions {x, x + 1} are linearly independent viewed as functions R → R, but not affinely independent.
Affine independence of {f₁, ..., f_k} means that {f₁, ..., f_k, 1} are linearly independent, where 1 denotes
the constant function.

We argued earlier that we can iteratively exploit any affine relationships, reducing the dimension of the
representation, until we arrive at an affinely independent representation. We will now establish that
every affinely independent representation has the same dimension.

Proposition 1.8. For every exponential family P = {p(x; θ) : θ ∈ Θ}, with p(x; θ) of the form
(1.1), there exists a k ′ ≤ k such that (1.1) has a k ′ -parameter, affinely independent representation.
Any affinely independent representation has dimension k ′ .

Proof. First of all, we can always arrive at an affinely independent representation by iteratively
exploiting affine relationships until none exist. Therefore we may assume that P is given by (1.1),
with (1.1) affinely independent. Define

H := { (η₁(θ), ..., η_k(θ)) : θ ∈ Θ } ⊂ R^k.

By definition, since the {η_i} are affinely independent, H is not contained in any (k − 1)-dimensional
hyperplane. Consider the collection R of log-likelihood ratios

R := { x ↦ log p(x; θ) − log p(x; θ′) : θ, θ′ ∈ Θ },

where

log( p(x; θ)/p(x; θ′) ) = Σ_{i=1}^k ( η_i(θ) − η_i(θ′) ) T_i(x) + B(θ′) − B(θ).

Then let V be the vector space spanned by R and the constant function 1_X,

V := { β₀ + Σ_{i=1}^m β_i f_i : m ∈ N, β_i ∈ R, f_i ∈ R }.

Clearly V ⊂ W where W := span(1_X, T₁, ..., T_k), and by affine independence of {T_i}, W is (k + 1)-
dimensional. Since H is not contained in any (k − 1)-dimensional hyperplane, by affine independence
again the collection

{ x ↦ Σ_{i=1}^k β_i T_i(x) : β = (β_i) ∈ H }

does not lie in any (k − 1)-dimensional hyperplane of W, and therefore V does not lie in any
k-dimensional hyperplane of W. Therefore V = W. Finally notice that the likelihood ratios
are independent of the particular representation chosen in (1.1), in particular independent of the
reference measure and of h. Therefore so is the vector space V. Since V is also independent of the
choice of T, k is determined uniquely.

Proposition 1.9. Suppose that P is given by (1.1). Then P is strictly k-parameter if and only if
in (1.1) the functions {ηi : i ≤ k} and the statistics {Ti : i ≤ k} are affinely independent.

Proof. One direction is obvious: if the functions {η_i(·)} and the statistics {T_i(·)} in a k-parameter
family are not affinely independent, then we can find a (k − 1)-dimensional representation
and therefore the family is not strictly k-parameter.

For the other direction, we assume that the exponential family P is given by P = {f(x; θ) : θ ∈ Θ},
where

(1.12) f(x; θ) = h(x) exp( Σ_{i=1}^k η_i(θ) T_i(x) − B(θ) ),  θ ∈ Θ,

where the functions {η_i(·)} and the statistics {T_i(·)} are affinely independent. We need to prove
that P is strictly k-parameter.

We can argue in two ways.

Way 1: Notice that any minimal representation must be affinely independent, since otherwise
we could always reduce the dimension, violating minimality. We also know from Proposition 1.8
that any affinely independent representation has the same dimension k. Recall from the proof of
Proposition 1.8 that the vector space V spanned by the collection of the likelihood ratios {x ↦
log p(x; θ)/p(x; θ′) : θ, θ′ ∈ Θ} is (k + 1)-dimensional and is independent of the representation.
Therefore any representation must have dimension at least k.

Way 2: Suppose that P also has another representation

P = {f(x; θ) : θ ∈ Θ} = {g(x; θ) : θ ∈ Θ},

where

(1.13) g(x; θ) = h̃(x) exp( Σ_{i=1}^l η̃_i(θ) T̃_i(x) − B̃(θ) ),  θ ∈ Θ.

Next notice that likelihood ratios are independent of the parameterisation. Indeed,

(1.14) f(x; θ) f(x₀; θ₀) / ( f(x; θ₀) f(x₀; θ) )
       = [ h(x) exp( η(θ)·T(x) − B(θ) ) · h(x₀) exp( η(θ₀)·T(x₀) − B(θ₀) ) ]
         / [ h(x) exp( η(θ₀)·T(x) − B(θ₀) ) · h(x₀) exp( η(θ)·T(x₀) − B(θ) ) ]
(1.15) = exp( [η(θ) − η(θ₀)] · [T(x) − T(x₀)] ).

Fix x₀ ∈ X and a P₀ ∈ P corresponding to θ₀. For P ∈ P, given equivalently by (1.12) and (1.13)
with parameter θ, we then have

(1.16) ( η(θ) − η(θ₀) ) · ( T(x) − T(x₀) ) = ( η̃(θ) − η̃(θ₀) ) · ( T̃(x) − T̃(x₀) ).

Since by assumption the functions {η_i(·) : i = 1, ..., k} are affinely independent, they are not
contained in any (k − 1)-dimensional hyperplane. Therefore we can find θ₀, ..., θ_k such that the
vectors η(θ₀), ..., η(θ_k) are not contained in any (k − 1)-dimensional hyperplane, and therefore the
vectors η(θ₁) − η(θ₀), ..., η(θ_k) − η(θ₀) span R^k. Applying (1.16) with θ = θ_i, i = 1, ..., k, we get k
equations, which in matrix notation can be written as

(1.17) M ( T(x) − T(x₀) ) = M̃ ( T̃(x) − T̃(x₀) ),

where M is the k × k matrix whose i-th row is η(θ_i) − η(θ₀), M̃ is the k × l matrix whose i-th row
is η̃(θ_i) − η̃(θ₀), and T, T̃ are column vectors.

Since the η(θ_i) − η(θ₀), i = 1, ..., k, span R^k, we can invert M and obtain an equation
of the form T(x) = A T̃(x) + b, where A is a constant (in x) k × l matrix and b a constant vector.
Notice that since {T(x) : x ∈ X} is not contained in any (k − 1)-dimensional hyperplane, the image
of A must have dimension k; therefore rank(A) = k and thus l ≥ k.

Using the same reasoning, and the fact that {T_i(·) : i = 1, ..., k} are affinely independent, we can
find x₀, ..., x_k so that T(x_i) − T(x₀), i = 1, ..., k, span R^k, and working in a similar way as before
we can obtain an equation of the form η(θ) = C η̃(θ) + d with C a constant k × l matrix and d a
constant vector.

This proves that if in (1.12) the {η_i(·) : i = 1, ..., k} and {T_i(·) : i = 1, ..., k} are affinely independent,
then any other representation must have dimension l ≥ k, and therefore P is strictly k-parameter.
Remark. We have now seen that a representation is minimal if and only if it is affinely independent. In
fact, often in the literature, a representation is called minimal if it is affinely independent.

If X ∼ f(x; θ), then T = (T₁(X), ..., T_k(X)) is a random vector. Let Cov_θ(T) be its covariance matrix
under f(x; θ). The following gives a condition for the statistics {T_i} to be affinely independent.

Proposition 1.10. The functions Ti are P-affinely independent if and only if for some θ, and thus
for all θ, Covθ (T ) is positive definite.


Proof. As before let X ∼ f(x; θ). First of all notice that the following are all equivalent:

(i) Cov_θ(T) is positive definite;

(ii) for all η ≠ 0,

Σ_{ij} η_i Cov_θ(T)_{ij} η_j = Var_θ( Σ_i η_i T_i(X) ) > 0;

(iii) for all η ≠ 0 the mapping x ↦ Σ_j η_j T_j(x) is not P_θ-a.s. constant;

(iv) for all η ≠ 0 we have P_θ^{⊗2}(X_η) > 0, where

(1.18) X_η := { (x, x′) ∈ X² : Σ_j η_j ( T_j(x) − T_j(x′) ) ≠ 0 },

and P_θ^{⊗2} = P_θ ⊗ P_θ is the product measure.

Since, for all θ′, P_{θ′} is equivalent to P_θ, it easily follows that P_{θ′}^{⊗2} is equivalent to P_θ^{⊗2}; it suffices to
write down the Radon-Nikodym derivative. Therefore if P_θ^{⊗2}(X_η) > 0 for some θ, then P_{θ′}^{⊗2}(X_η) > 0
for all θ′, and therefore by the above equivalences Cov_{θ′}(T) is positive definite for all θ′.

Suppose first that the T_j are affinely independent. Then for any η ≠ 0, the map x ↦ Σ_j η_j T_j(x)
cannot be constant µ-, and thus also P_θ-, almost everywhere for any θ; equivalently, for all η ≠ 0 and
θ we have P_θ^{⊗2}(X_η) > 0 with X_η as in (1.18). Therefore for any η ≠ 0 and any θ, letting X, X′ be
i.i.d. from P_θ, we have

(1.19) 0 < E[ ( Σ_j η_j T_j(X) − Σ_j η_j T_j(X′) )² ] = 2 Var_θ( Σ_j η_j T_j(X) ) = 2 Σ_{ij} η_i Cov_θ(T)_{ij} η_j.

This proves that Cov_θ(T) is positive definite for all θ.

For the other direction, suppose that the T_j are not affinely independent: there exists an η ≠ 0 such
that x ↦ Σ_j η_j T_j(x) is µ-a.e. constant, and therefore P_θ^{⊗2}(X_η) = 0 for all θ. Therefore

(1.20) 0 = E[ ( Σ_j η_j T_j(X) − Σ_j η_j T_j(X′) )² ] = 2 Var_θ( Σ_j η_j T_j(X) ) = 2 Σ_{ij} η_i Cov_θ(T)_{ij} η_j,

and thus we see that Cov_θ(T) is not positive definite.


Remark. Above we used the fact that, with Z, Z′ i.i.d., E[(Z − Z′)²] = 2 Var(Z), and that if Z = Z′ a.s.
with Z, Z′ i.i.d., then Z, Z′ are almost surely constant, by the equality case of Jensen’s inequality, since
E[Z²] = E[Z · Z′] = E[Z]².

Example. Suppose X takes values in {1, 2, 3} with P(X = i) = p_i for i = 1, 2, 3, so that θ =
(p₁, p₂, p₃). Writing I_i(x) := 1_{x=i},

(1.21) p(x; θ) = p₁^{I₁(x)} p₂^{I₂(x)} p₃^{I₃(x)}
(1.22)         = exp( I₁(x) log(p₁) + I₂(x) log(p₂) + I₃(x) log(p₃) ),

so X belongs to a 3-parameter exponential family; but I₁(x) + I₂(x) + I₃(x) = 1, so it is not strictly
3-parameter. Indeed,

p(x; θ) = exp( I₁(x) log( p₁/(1 − (p₁ + p₂)) ) + I₂(x) log( p₂/(1 − (p₁ + p₂)) ) + log(p₃) ),

so it is a strictly 2-parameter exponential family.
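As a numerical aside (a sketch added here, not part of the original notes), one can see Proposition 1.10 at work in this example: with T = (I₁, I₂, I₃) the covariance matrix is singular, reflecting the affine relation I₁ + I₂ + I₃ = 1, while dropping one coordinate gives a positive definite covariance.

```python
# Covariance of T = (I_1, I_2, I_3) for X in {1, 2, 3}: singular because
# I_1 + I_2 + I_3 = 1; the (I_1, I_2) block is positive definite.
import numpy as np

p = np.array([0.2, 0.5, 0.3])                   # (p_1, p_2, p_3)
T = np.eye(3)                                   # row x is T(x) = (I_1(x), I_2(x), I_3(x))
mean = p @ T                                    # E[T(X)]
cov3 = (T - mean).T @ np.diag(p) @ (T - mean)   # Cov(T), a 3x3 matrix

print(np.linalg.eigvalsh(cov3))                 # smallest eigenvalue is 0
print(np.linalg.eigvalsh(cov3[:2, :2]))         # both eigenvalues > 0
```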


1.3 The parameter space


Definition 1.11. The parameter space is defined to be

Θ := { θ : ∫ h(x) exp( Σ_{i=1}^k η_i(θ) T_i(x) ) dx < ∞ },

i.e. the set of θ for which the integrand can be normalised to become a probability density function.

Definition 1.12. The natural parameter space is defined to be

Ξ := { η = (η₁, ..., η_k) : ∫ h(x) exp( Σ_{i=1}^k η_i T_i(x) ) dx < ∞ },

i.e. the set of η for which we can define B(η) := log ∫ h(x) exp( Σ_{i=1}^k η_i T_i(x) ) dx so that

f̃(x; η) = e^{−B(η)} h(x) exp( Σ_{i=1}^k η_i T_i(x) )

is a pdf/pmf on X.

Observe that we always have η(Θ) ⊂ Ξ, although it may be the case that η(Θ) ≠ Ξ.

Theorem 1.13. The natural parameter space Ξ of a strictly k-parameter exponential family is
convex and contains a non-empty k-dimensional ball.

Proof. Take η, η′ ∈ Ξ and let α ∈ (0, 1). Define B(η) = log ∫ exp( Σ_i η_i T_i(x) ) h(x) dx. Then

(1.23) B(αη + (1 − α)η′) = log ∫ exp( α Σ_i η_i T_i(x) + (1 − α) Σ_i η′_i T_i(x) ) h(x) dx
(1.24) = log ∫ ( exp( Σ_i η_i T_i(x) ) h(x) )^α ( exp( Σ_i η′_i T_i(x) ) h(x) )^{1−α} dx
       (using h = h^α h^{1−α})
(1.26) ⩽ log[ ( ∫ exp( Σ_i η_i T_i(x) ) h(x) dx )^α ( ∫ exp( Σ_i η′_i T_i(x) ) h(x) dx )^{1−α} ]
       (by Hölder’s inequality)
(1.28) = α B(η) + (1 − α) B(η′) < ∞,

so αη + (1 − α)η′ ∈ Ξ and Ξ is convex.

Notice that if Ξ were contained in a (k − 1)-dimensional hyperplane, then so would be η(Θ) ⊂ Ξ, which is
impossible since {θ ↦ η_i(θ) : i = 1, ..., k} are affinely independent. Therefore Ξ is not contained
in any (k − 1)-dimensional hyperplane and must therefore contain k + 1 points, say ξ₁, ..., ξ_{k+1}, not
all lying in the same (k − 1)-dimensional hyperplane; since Ξ is convex it must also contain the convex hull of
these points; and since they do not all lie in the same (k − 1)-dimensional hyperplane, Ξ contains a non-empty,
open k-dimensional ball.

Definition 1.14. If the image η(Θ) ⊆ Ξ of the parameter space for a strictly k-parameter expo-
nential family contains a k-dimensional open set, then the family is called full rank.

Full rank is also sometimes referred to as regular . See the discussion of curved exponential families
below for examples of families which are not full rank.
Remark. In the one dimensional case, being full-rank is easily checked: it suffices that η(Θ) contains an
interval.


Theorem 1.15. Let P be a strictly k-parameter exponential family with natural parameter space
Ξ. Then for all η ∈ Int(Ξ):

(a) all moments of T (with respect to f(x; η)) exist;

(b) E_η[T_i(X)] = ∂B(η)/∂η_i for all i; and

(c) Cov_η(T_i, T_j) = ∂²B(η)/∂η_i∂η_j for all i, j.

Proof. Recall that

exp(B(η)) = ∫ exp( Σ_{i=1}^k η_i T_i(x) ) h(x) dx.

Then for s = (s₁, ..., s_k) consider the moment generating function of T(X), when X ∼ P_η:

(1.29) M_{T(X)}(s) = E_η[ exp( Σ_{i=1}^k s_i T_i(X) ) ]
(1.30) = ∫ exp( Σ_{i=1}^k (η_i + s_i) T_i(x) − B(η) ) h(x) dx
(1.31) = exp( B(η + s) − B(η) ).

Since by assumption η ∈ Int(Ξ), there exists a δ > 0 such that for all y ∈ B(η, δ) we have y ∈ Ξ,
and therefore for all |s| < δ we have

M_{T(X)}(s) = exp( −B(η) + B(η + s) ) < ∞.

This implies all moments are finite.

For the other two parts, start by differentiating exp(B(η)) to get

(1.32) ∂ exp(B(η))/∂η₁ = lim_{s→0} (1/s) ∫ exp( Σ_{i=1}^k η_i T_i(x) ) ( exp(s T₁(x)) − 1 ) h(x) dx
(1.33) ( ∂B(η)/∂η₁ ) exp(B(η)) = ∫ exp( Σ_{i=1}^k η_i T_i(x) ) T₁(x) h(x) dx
(1.34) ∂B(η)/∂η₁ = E_η[T₁(X)],

where the differentiation under the integral sign may be justified by dominated convergence. The
last statement follows by differentiating once more.
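As an illustration (a numerical sketch added here, not part of the original notes), consider the Poisson family in natural parametrisation: B(η) = e^η with T(x) = x, so parts (b) and (c) say that B′(η) is the mean and B″(η) the variance. The check below confirms this with finite differences.

```python
# Check Theorem 1.15 (b)-(c) for Poisson: B(eta) = exp(eta), theta = exp(eta).
import math

eta = 0.7
theta = math.exp(eta)

# mean and variance of T(X) = X directly from the pmf (truncated far in the tail)
pmf = [math.exp(-theta) * theta**x / math.factorial(x) for x in range(200)]
mean = sum(x * q for x, q in enumerate(pmf))
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))

# central finite differences of B(eta) = exp(eta)
h = 1e-5
B = math.exp
dB = (B(eta + h) - B(eta - h)) / (2 * h)
d2B = (B(eta + h) - 2 * B(eta) + B(eta - h)) / h**2

print(mean, dB)   # both close to exp(0.7)
print(var, d2B)   # both close to exp(0.7) as well
```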

1.4 Curved exponential families


Definition 1.16. A family P = {p(x; θ) : θ ∈ Θ} of probabilities (pmf or pdf) indexed by θ is
called a curved exponential family if there exist q < k ∈ N, real-valued functions η₁, ..., η_k and
B on Θ ⊆ R^q, real-valued statistics T₁, ..., T_k and a non-negative real-valued function h on X such
that

1. ∃θ : Cov_θ(T) is positive definite;

2. the pdf/pmfs p(x; θ) have the form

(1.35) p(x; θ) = exp( Σ_{i=1}^k η_i(θ) T_i(x) − B(θ) ) h(x).

Positive-definiteness of the covariance matrix guarantees that k is not arbitrarily large. In fact
it guarantees that the exponential family is strictly k-parameter, although the parameter space forms a
lower-dimensional, non-linear submanifold of the natural parameter space.

Example (Normal distribution). Let the statistical model be the class of all normal distributions
N(µ, µ²) where µ is unknown and µ ≠ 0, with parameter θ = µ ∈ R* = R \ {0}. Then

(1.36) p(x; θ) = (1/(√(2π)|µ|)) exp( −(x − µ)²/(2µ²) ) = (1/(√(2π)|µ|)) exp( −x²/(2µ²) + x/µ − 1/2 ).

We thus have

T₁(x) = x,  T₂(x) = x²,  η₁(θ) = µ^{−1},  η₂(θ) = −µ^{−2}/2.

This example satisfies the definition of a curved exponential family because the parameter θ is
one-dimensional but results in a 2-parameter exponential family. The covariance matrix can be
calculated from the known moments:

Cov_θ(T) = [ µ²   2µ³
             2µ³  6µ⁴ ].

The covariance matrix is positive definite for all θ ∈ Θ (the determinant is 2µ⁶ > 0 and the diagonal
entries are positive). We see that T₁ and T₂ are P-affinely independent and the family is strictly
2-parameter (the η_i are constrained, but not linearly).

Example. Suppose X₁ ∼ N(θ, 1) and X₂ ∼ N(1/θ, 1) are independent. Their joint distribution has
log-density

(1.37) log f(x; θ) = −(x₁ − θ)²/2 − (x₂ − 1/θ)²/2 + constant
(1.38) = x₁ θ + x₂/θ − θ²/2 − θ^{−2}/2 + terms in (x₁, x₂) alone,

so that η₁ = θ, η₂ = 1/θ, T₁ = x₁ and T₂ = x₂. This is a (2, 1)-curved family.

Observe that η(Θ) = {(θ, 1/θ) ∈ R² : θ ∈ R \ {0}} is a one-dimensional manifold.


Finally, let us formulate the following important statement about the distribution of a sample of
independent r.v.s distributed according to a distribution from an exponential family:

Theorem 1.17. (a) If X₁, ..., X_n is a sample of independent r.v.s with distributions belonging
to an exponential family, then the joint distribution of the vector X = (X₁, ..., X_n) is an
element of an exponential family.

(b) If X = (X₁, ..., X_n) are i.i.d. samples from a k-parameter exponential family of the form
(1.1) with functions η = (η₁, ..., η_k) and T = (T₁, ..., T_k), then the distribution of X belongs
to a k-parameter exponential family with natural observation T_{(n)}(x) := Σ_{i=1}^n T(x_i).

Remark. The fact that X = (X₁, X₂, ..., X_n) also belongs to a k-parameter exponential family,
independently of n, is quite important.

Proof. Left as an exercise.

Chapter 2

Sufficiency and Minimality

2.1 Sufficiency
We may often be interested in summarising a set of data without losing any information about the
parameter we’re trying to estimate. A statistic that does this is said to be sufficient:

Definition 2.1. Suppose X ∼ f (x; θ) for some parameter θ.

A statistic T (X) is a function of the data which does not depend on θ.

A statistic T (X) is said to be sufficient for θ if the conditional distribution of X given T does not
depend on θ. That is,
f (x | t, θ) = f (x | t).

Remark. In particular, this means that for any function g the map θ 7→ Eθ [g(X) | T = t] is constant.

We can think of a sufficient statistic as ‘wrapping up’ all the information there is about θ somehow.

Example. Let X₁, ..., X_n be independent Ber(p) random variables, so that P(X_i = 1) = p and
P(X_i = 0) = 1 − p, and let T = Σ_{i=1}^n X_i, so that T ∼ Bin(n, p). Then, writing X = (X₁, ..., X_n),
for any x ∈ {0, 1}^n and t ∈ {0, ..., n} we have

(2.1) f(x | t, p) = P(X = x | T = t, p) = P(X = x, T = t | p) / P(T = t | p)
(2.2) = ( Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i} ) / ( C(n, t) p^t (1 − p)^{n−t} ) · 1_{Σ x_i = t}
(2.3) = ( p^t (1 − p)^{n−t} ) / ( C(n, t) p^t (1 − p)^{n−t} ) · 1_{Σ x_i = t} = C(n, t)^{−1} 1_{Σ x_i = t},

which has no dependence on p. So T is sufficient for p.

The intuitive meaning of this is that only the number of successes matters for estimating p; the
order in which successes arrive shouldn’t change your guess for p.
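A quick simulation (a sketch added here, not part of the original notes) makes the same point: conditionally on T = t, every 0/1 sequence with t ones is equally likely, whatever the value of p.

```python
# Empirical conditional distribution of X given T = t for Bernoulli(p) data:
# it should be uniform over the sequences with t ones, for any p.
import random
from collections import Counter

def conditional_dist(p, n=4, t=2, reps=200_000, seed=1):
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(reps):
        x = tuple(int(rng.random() < p) for _ in range(n))
        if sum(x) == t:
            counts[x] += 1
    total = sum(counts.values())
    return {x: round(c / total, 3) for x, c in sorted(counts.items())}

print(conditional_dist(0.3))   # each of the 6 sequences has probability ~ 1/6
print(conditional_dist(0.8))   # same, despite the very different p
```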

Theorem 2.2 (The Factorisation Criterion). Suppose X ∼ f (x; θ). Then a statistic T (X) is
sufficient for θ if and only if f can be written as

f (x; θ) = g(T (x), θ)h(x)


for some non-negative functions g, h.

Proof for the discrete case. Suppose T is sufficient and write t = T(x). So

f(x; θ) = P_θ(X = x) = P_θ(X = x, T = t) = P_θ(X = x | T = t) P_θ(T = t).

Then just note that as T is sufficient, P_θ(X = x | T = t) =: h(x) is independent of θ, and
P_θ(T = t) =: g(t, θ) only depends on t and θ.

Conversely, suppose f(x; θ) = g(t, θ) h(x) for some non-negative functions g, h. So

P_θ(T = t) = Σ_{x: T(x)=t} P_θ(X = x) = Σ_{x: T(x)=t} f(x; θ) = g(t, θ) Σ_{x: T(x)=t} h(x).

Thus

P_θ(X = x | T = t) = P_θ(X = x, T = t) / P_θ(T = t) = P_θ(X = x) / P_θ(T = t) = h(x) / Σ_{y: T(y)=t} h(y),

which has no dependence on θ! So T is sufficient for θ.

The general case is much more complicated to prove; see [CP97, Example 6] for a proof using
disintegrations.

The following corollary is obvious.

Corollary 2.3. Let P = {f(·; θ) : θ ∈ Θ} be a k-parameter exponential family, where

f(x; θ) = exp( Σ_{i=1}^k η_i(θ) T_i(x) − B(θ) ) h(x).

Then (T₁(x), ..., T_k(x)) is sufficient for θ.

Remark. An important case of the above corollary is when X = (X₁, ..., X_n) are i.i.d. samples
from an exponential family P = {f(·; θ) : θ ∈ Θ}. In that case the corollary implies that
T_{(n)}(x) = ( Σ_{i=1}^n T₁(x_i), ..., Σ_{i=1}^n T_k(x_i) ) is sufficient.
The next natural question to ask is to what extent we can summarise a set of data — by how much
we can reduce it — without losing information about θ. This brings us to the concept of minimal
sufficiency .

Example. Let X₁, X₂, X₃ be independent Ber(p) random variables modelling three coin tosses (so
0 means heads and 1 means tails). Consider the following four statistics:

1. T₁(X) = (X₁, X₂, X₃),
2. T₂(X) = (X₁, Σ_{i=1}^3 X_i),
3. T₃(X) = Σ_{i=1}^3 X_i,
4. T₄(X) = 1_{T₃(X)=0}.

Which of these are sufficient for p?

2.2 Minimality
Suppose X takes values in X. A statistic T : X → T induces a partition of the sample space X,

X = ∪_{t∈T} {x ∈ X : T(x) = t};

the partition is equivalently generated by the equivalence relation x ∼ y ⟺ T(x) = T(y).


Example (continued). [Diagrams omitted.] The diagrams in the original notes show the partitions
of the eight outcomes {HHH, HHT, ..., TTT} induced by the statistics T₁, ..., T₄: T₁ separates all
eight outcomes into singleton classes; T₂ groups outcomes by the pair (X₁, Σ_i X_i); T₃ groups them
by the total number of tails; and T₄ splits the space into {HHH} and its complement. In each case
T is constant within each class.

Summarising our data through a statistic can be thought of as keeping track only of the equivalence
class which contains our sample. Therefore a statistic T (equivalently a partition of the sample space)
is sufficient if the conditional distribution of X given the equivalence class it belongs to is independent
of the parameter θ, i.e. the same for all distributions in our model.

A finer partition corresponds to keeping “more information” about our sample. We want to be as
economical as possible, that is, keep as little information about the sample as we can without losing any
information about the parameter θ. In other words, among all sufficient statistics, we want to choose
the one generating the coarsest partition.

Consider the case where X is a finite set, and let Π, Π′ be two partitions, such that Π′ is finer than Π,
that is each element of Π can be written as a union of elements of Π′ . Equivalently, we can express
this as a function sending multiple elements of Π′ to one element of Π. This leads us to the following
definition.

Definition 2.4. A statistic is minimal sufficient if it can be expressed as a function of any other
sufficient statistic.

The following result gives a way of checking minimal sufficiency.

Theorem 2.5. A statistic T is minimal sufficient if and only if

T(x) = T(y)  ⟺  f(y; θ)/f(x; θ) is independent of θ.

Remark. It is useful to think about the above result in terms of partitions: it says that in order to define
a partition corresponding to a minimal sufficient statistic, the likelihood ratio for any two elements in
the same class must be independent of θ.

Example (continued). In the coin-tossing example, first consider T₂. With x = TTH and y =
HTT, we have f(x; p) = f(y; p) = p²(1 − p), so that f(y; p)/f(x; p) = 1, but clearly T₂(x) ≠ T₂(y), so T₂ is
not minimal sufficient.

Considering T₄ instead, take x = HTH and y = TTT. So clearly T₄(x) = T₄(y), but f(y; p)/f(x; p) =
p³/(p(1 − p)²) = p²/(1 − p)², which does depend on p. So T₄ is also not minimal sufficient.

Proof of theorem. ( ⇐= ) Suppose T is a statistic such that T(x) = T(y) if and only if f(y; θ)/f(x; θ) is
equal to some k(x, y) independent of θ.

Sufficiency. In the discrete case,

(2.4) f(x | t, θ) = P_θ(X = x | T = t) = P_θ(X = x)/P_θ(T = t) = f(x; θ) / Σ_{y: T(y)=t} f(y; θ)
(2.5) = f(x; θ) / Σ_{y: T(y)=t} f(x; θ) k(x, y)
(2.6) = ( Σ_{y: T(y)=t} k(x, y) )^{−1},

which is independent of θ, so T is sufficient. For the continuous case, replace the sum with an
integral.

Minimality. Now suppose U : X → U is another sufficient statistic and that U(x) = U(y) for some
x, y. Since U is sufficient, by the factorisation criterion we have

f(y; θ)/f(x; θ) = g(U(y), θ) h(y) / ( g(U(x), θ) h(x) ) = h(y)/h(x),

which is independent of θ. So by hypothesis, T(x) = T(y), and hence Π_U is finer than Π_T, where
Π_U, Π_T are the partitions induced by U, T respectively. We now show that T must be a function of
U. Without loss of generality we assume that

U = ∪_{A∈Π_U} {U(x) : x ∈ A},  T = ∪_{B∈Π_T} {T(x) : x ∈ B}.

We define a function ϕ : U → T as follows: for each u ∈ U, there exists an A ∈ Π_U such that
u = U(x) for all x ∈ A, and a t ∈ T such that for all x ∈ A, T(x) = t; let ϕ(u) = t. In this way
T(x) = ϕ(U(x)). Hence T is minimal sufficient.

( =⇒ ) Conversely, suppose T is minimal sufficient. Take x, y such that T(x) = T(y). Then by the
factorisation criterion,

f(y; θ)/f(x; θ) = g(T(y), θ) h(y) / ( g(T(x), θ) h(x) ) = h(y)/h(x),

which does not depend on θ. (Note this only used the sufficiency of T.)

For the other direction, we need to show that if f(x; θ)/f(y; θ) is independent of θ then T(x) = T(y).
Start by writing x ∼ y whenever f(x; θ) = k(x, y) f(y; θ) for all θ (for some function k(x, y)).
It is easy to check that this is an equivalence relation. For each equivalence class [x] choose a
representative x̄ and define G to be the representative function (i.e. G(y) = x̄ for all y ∈ [x]). So G is
a statistic constant on the equivalence classes. It is also sufficient, by the factorisation criterion, since
f(x; θ) = k(x, x̄) f(x̄; θ) = k(x, G(x)) f(G(x); θ) for all x. So T is a function of G (by minimality)
and hence is also constant on the equivalence classes, meaning x ∼ y =⇒ T(x) = T(y).

2.3 Minimal sufficiency in exponential families


Let us turn to the case of exponential families.

Theorem 2.6. Suppose the functions f(x; θ) = exp( Σ_{j=1}^k η_j(θ) T_j(x) − B(θ) ) h(x) form a strictly
k-parameter exponential family. Let X = (X₁, ..., X_n), where X₁, ..., X_n ∼ f(x, θ) are i.i.d. Then:
T_{(n)} = ( Σ_{i=1}^n T₁(X_i), ..., Σ_{i=1}^n T_k(X_i) ) is minimal sufficient.

Remark. Since the vector X = (X₁, ..., X_n) is itself strictly k-parameter exponential, we could just say
T(X) = (T₁(X), ..., T_k(X)) is minimal sufficient.


Proof of theorem. Just note that

f((x₁, ..., x_n); θ) / f((y₁, ..., y_n); θ) = ( Π_{i=1}^n h(x_i) / Π_{i=1}^n h(y_i) ) exp( Σ_{j=1}^k η_j(θ) ( Σ_{i=1}^n T_j(x_i) − Σ_{i=1}^n T_j(y_i) ) ),

which is independent of θ if and only if Σ_{i=1}^n T_j(x_i) = Σ_{i=1}^n T_j(y_i) for all j = 1, ..., k.

Examples.

1. Bernoulli. Let X₁, ..., X_n be i.i.d. Bernoulli trials with parameter p, and let T(x) = Σ_{i=1}^n x_i
be the number of successes. Then

f((x₁, ..., x_n); p) / f((y₁, ..., y_n); p) = p^{T(x)} (1 − p)^{n−T(x)} / ( p^{T(y)} (1 − p)^{n−T(y)} ) = p^{T(x)−T(y)} (1 − p)^{T(y)−T(x)},

which is independent of p if and only if T(x) = T(y). So T is minimal sufficient.

2. Uniform. Let X₁, ..., X_n be i.i.d. random variables with X_i ∼ U[a, b], taking the unknown
parameter to be θ = (a, b). Then

f((x₁, ..., x_n); θ) = Π_{i=1}^n ( 1/(b − a) ) 1_{[a,b]}(x_i) = (b − a)^{−n} 1_{min x_i ⩾ a} 1_{max x_i ⩽ b},

so by the factorisation criterion T(x) = (min x_i, max x_i) is sufficient.

Exercise: is it minimal sufficient?

3. Normal. Let X = (X₁, ..., X_n) be a sample of i.i.d. N(µ, σ²)-distributed random variables.
For the parameter θ = (µ, σ²) ∈ R × R₊, we have

(2.7) f(x; θ)/f(y; θ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (x_i − µ)² ) / ( (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (y_i − µ)² ) )
(2.8) = exp( −(1/(2σ²)) [ Σ_{i=1}^n x_i² − Σ_{i=1}^n y_i² − 2µ ( Σ_{i=1}^n x_i − Σ_{i=1}^n y_i ) ] ).

This ratio is independent of θ if and only if Σ x_i = Σ y_i and Σ x_i² = Σ y_i².
Thus T(X) = ( Σ X_i, Σ X_i² ) is minimal sufficient.

Note that x̄ = (1/n) Σ x_i = T₁(x)/n and S² = (1/(n−1)) Σ (x_i − x̄)² = (1/(n−1)) ( Σ x_i² − n x̄² ) =
(1/(n−1)) ( T₂(x) − T₁(x)²/n ) are in one-to-one correspondence with T(x), and hence (X̄, S²) is also minimal
sufficient for θ.

In closing our discussion on sufficiency, it is worth mentioning that although sufficiency is a nice property
that allows us to reduce the data to a statistic of bounded dimension, it is not all that common. In fact, only
exponential families and some families with bounded support admit finite-dimensional sufficient statistics
for all sample sizes; this is the Pitman-Koopman-Darmois theorem.

Chapter 3

Fisher Information

Much of the material in this chapter can also be found (often with a nicer exposition) in your part A
course.

We turn to the question now of whether there is some nice way to measure ‘how much’ information a
given dataset contains about a particular parameter.

Let f (x, θ) be a parametric family of densities.

Definition 3.1. For each x ∈ X , the likelihood function L(·, x) : Θ → R+ is defined by L(θ, x) =
f (x, θ).

The log-likelihood is often written ℓ(θ, x) := log L(θ, x).

To simplify our analysis, we will need some regularity assumptions about our model. These will, pri-
marily, allow us to use partial derivatives and to interchange them with sums/integrals without worrying
too much (as we’ll see).

Reg 1. The distributions {f (·, θ) : θ ∈ Θ} have common support, so that A = {x : f (x, θ) > 0} is
independent of θ.

Remark. Distributions belonging to an exponential family satisfy Reg 1.

To proceed, we’ll start by just looking at the one-dimensional case.

3.1 The one-dimensional case

Reg 2. Θ ⊆ R is an open interval (finite or infinite).

Reg 3. For all x ∈ A and for all θ ∈ Θ, the derivative ∂θ f (x, θ) exists and is finite. Furthermore,
for each θ ∈ Θ, there exists a neighborhood I ⊂ Θ and an integrable function gI (x) such that

|∂θ f (x, θ)| ⩽ gI (x), ∀θ ∈ I, ∀x ∈ A.

See Part A integration.

The following will be a useful tool to work with:


Definition 3.2. When Regs 1–3 are satisfied, for x ∈ A we define the score function

S(θ, x) = ℓ′(θ, x) = ∂ log L(θ, x)/∂θ.

Now note the following handy fact (which is what motivates the regularity assumptions):

Lemma 3.3. Under Regs 1–3, for continuous distributions

(∂/∂θ) ∫_A f(x, θ) dx = ∫_A (∂/∂θ) f(x, θ) dx,

and for discrete distributions

(∂/∂θ) Σ_{x∈A} f(x, θ) = Σ_{x∈A} (∂/∂θ) f(x, θ).

Proof. By the Leibniz integral rule.

This allows us to see the following:

Theorem 3.4. Under Regs 1–3,

E_θ[S(θ, X)] = 0  ∀θ ∈ Θ.

Proof. In the continuous case,

E_θ[S(θ, X)] = ∫_A ℓ′(θ, x) f(x, θ) dx = ∫_A ( ∂_θ f(x, θ)/f(x, θ) ) f(x, θ) dx = ∫_A ∂_θ f(x, θ) dx = (∂/∂θ) 1 = 0.

The discrete case is similar.
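For concreteness, here is a small simulation (a sketch added here, not part of the original notes) of Theorem 3.4 for the N(µ, 1) model, where the score is S(µ, x) = x − µ.

```python
# Empirical mean of the score S(mu, X) = X - mu under X ~ N(mu, 1): close to 0.
import random

mu = 1.3
rng = random.Random(0)
draws = [rng.gauss(mu, 1.0) for _ in range(100_000)]
print(sum(x - mu for x in draws) / len(draws))   # approximately 0
```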

Definition 3.5. When Regs 1–3 are satisfied, we define the Fisher information to be

I_X(θ) = Var_θ[S(θ, X)] = E_θ[ (ℓ′(θ, X))² ].

Let us introduce one more regularity assumption now:

Reg 4. The log-likelihood ℓ is twice-differentiable for all x ∈ A, θ ∈ Θ, and

(∂²/∂θ²) ∫_A f(x, θ) dx = ∫_A (∂²/∂θ²) f(x, θ) dx  (for continuous distributions)

or

(∂²/∂θ²) Σ_{x∈A} f(x, θ) = Σ_{x∈A} (∂²/∂θ²) f(x, θ)  (for discrete distributions)

for all θ ∈ Θ.

Definition 3.6 (Observed information). When Regs 1–4 are satisfied, for an observation X = x,
we define the observed information to be

J(θ; x) = −ℓ′′ (θ, x).

This allows us to derive an alternative form for the Fisher information which will be much more commonly
used:


Theorem 3.7. Under Regs 1–4,

IX (θ) = − Eθ [ℓ′′ (θ, X)] = Eθ [J(θ; X)].

Proof. In the continuous case,

ℓ″(θ, x) = (∂²/∂θ²) log f(x, θ) = (∂/∂θ)( ∂_θ f(x, θ)/f(x, θ) ) = ( f ∂²_θ f − (∂_θ f)² ) / f² = ∂²_θ f / f − ( ∂_θ f / f )².

By Reg 4,

E_θ[ ∂²_θ f / f ] = ∫_A ( ∂²_θ f / f ) f dx = ∫_A ∂²_θ f dx = (∂²/∂θ²) ∫_A f dx = 0,

and thus

−E_θ[ℓ″(θ, X)] = E_θ[ ( ∂_θ f(X, θ)/f(X, θ) )² ] = E_θ[ (ℓ′(θ, X))² ].

The discrete case is similar.

Proposition 3.8 (Properties of the Fisher information).

1. (Information grows with sample size.) If X and Y are independent random variables,
then

I_{(X,Y)}(θ) = I_X(θ) + I_Y(θ).

In particular, if Z = (X₁, ..., X_n) where the X_i are i.i.d. copies of X, then I_Z(θ) = n I_X(θ).

2. (Reparametrisation.) If θ = h(ξ) where h is differentiable, then the Fisher information of
X about ξ is

I*_X(ξ) = I_X(h(ξ)) [h′(ξ)]².

Proof (for the second statement). The log-likelihood w.r.t. P_{h(ξ)} is ℓ*(ξ) = log p(x; h(ξ)), thus the
score function is

S*(ξ; x) = (∂/∂ξ) log p(x; h(ξ)) = (∂/∂θ) log p(x; θ)|_{θ=h(ξ)} · h′(ξ),

and so

Var_ξ( S*(ξ, X) ) = Var_ξ( S(h(ξ), X) h′(ξ) ) = I_X(h(ξ)) [h′(ξ)]².

Example. Consider X ∼ N(0, σ²). The parameter of interest is σ. Let us first compute I_X(σ²).
Let us write θ = σ². Then

ℓ(θ; x) = −(1/2) log(2π) − (1/2) log(θ) − x²/(2θ).

Thus the score function is S(θ; X) = ℓ′(θ; X) = −1/(2θ) + X²/(2θ²). Since Var_θ(X²) = E_θ(X⁴) − θ² = 2θ²,
we get

I_X(θ) = (1/(4θ⁴)) Var_θ(X²) = 1/(2θ²).

Defining h(σ) = σ² we have h′(σ) = 2σ, and by the above proposition

I*_X(σ) = I_X(σ²) [h′(σ)]² = (1/(2σ⁴)) · 4σ² = 2/σ².



The direct way to compute I*_X(σ) is

ℓ*(σ; x) = −(1/2) log(2π) − log(σ) − x²/(2σ²),

so that

S*(σ; x) = −1/σ + x²/σ³,

and

I*_X(σ) = Var_σ( X²/σ³ ) = (1/σ⁶) · 2σ⁴ = 2/σ².
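The answer 2/σ² is easy to confirm by simulation (a sketch added here, not part of the original notes): the empirical variance of the score S*(σ; X) over draws from N(0, σ²) matches it.

```python
# Empirical Fisher information for sigma in N(0, sigma^2):
# Var of S*(sigma; x) = -1/sigma + x^2/sigma^3 should be about 2/sigma^2.
import random

sigma = 1.7
rng = random.Random(42)
scores = [-1.0 / sigma + rng.gauss(0.0, sigma) ** 2 / sigma**3
          for _ in range(200_000)]
m = sum(scores) / len(scores)
var = sum((s - m) ** 2 for s in scores) / len(scores)
print(var, 2 / sigma**2)   # both approximately 0.692
```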

3.2 The multivariate case


Let’s extend this all to the multivariate case now — i.e. the case where θ ∈ Rk . Reg 1 remains unaltered
but we have to adapt the other regularity assumptions:

Reg 2′ . Θ ⊆ Rk is an open set.

Reg 3′ . For all x ∈ A and for all θ ∈ Θ, the partial derivatives of L(θ, x) exist and are finite.

Reg 4′ . All second order partial derivatives of log-likelihood ℓ exist and they can all be commuted
with summation/integration over A.

We can now generalise our definitions:

Definition 3.9. When Regs 1, 2′, 3′ are satisfied, we define the score function to be

S(θ, x) = ∇_θ ℓ(θ, x) = ( ∂ℓ(θ, x)/∂θ₁, ..., ∂ℓ(θ, x)/∂θ_k )ᵀ.

Definition 3.10. When Regs 1, 2′, 3′ are satisfied, we define the Fisher information matrix to
be

I_X(θ) = Cov_θ( S(θ, X) ),

so that

I_X(θ)_{jr} = E_θ[ ( ∂ℓ(θ, X)/∂θ_j )( ∂ℓ(θ, X)/∂θ_r ) ].

Note the last line above used that the multi-dimensional score function also has zero expectation, which
can be shown much like in the one-dimensional case.

Theorem 3.11. Supposing Regs 1, 2′, 3′, 4′ hold, define the observed Fisher information
matrix J by

J(θ, x)_{jr} = −∂²ℓ(θ, x)/∂θ_j∂θ_r,  j, r = 1, ..., k.

Then I_X(θ) = E_θ[J(θ, X)].

Proof. Exercise (a generalisation of the one-dimensional case).

Chapter 4

Point estimation

See Garthwaite, Joliffe and Jones p40, Liero and Zwanzig p71.

Definition 4.1. For any function g : Θ → Γ (for some set Γ), an estimator of γ = g(θ) is a
function T : X → Γ.

The value T (X) is called the estimate of g(θ).

Definition 4.2. The bias of an estimator T for γ = g(θ) is

bias(T, θ) = Eθ [T ] − g(θ).

T is called unbiased for g(θ) if Eθ [T ] = g(θ) ∀θ ∈ Θ.

Example. Suppose X = (X₁, ..., X_n) is a sample of i.i.d. N(µ, σ²) random variables. Then
µ̂ = (1/n) Σ_{i=1}^n X_i is an unbiased estimator for µ, and S² = (1/(n−1)) Σ_{i=1}^n (X_i − µ̂)² is an unbiased
estimator for σ².

(Exercise: prove this.)

4.1 The method of moments


A very simple approach for estimating functions of moments of a random variable is to replace all of the
moments by their empirical values.

Formally, suppose (X1 , . . . , Xn ) is a sample of i.i.d. Pθ -distributed random variables, where θ ∈ Θ is the
parameter. In general if X ∼ Pθ , then the moments mr = Eθ [X r ] for r = 1, 2, . . . depend on θ.

Assume there exists a function h such that γ = h(m1 , . . . , mr ).

Definition 4.3. For each k = 1, ..., r let m̂_k = (1/n) Σ_{i=1}^n X_i^k. Then the moment estimator for γ
is defined as

γ̂_MME = h(m̂₁, ..., m̂_r).

Example. Suppose X₁, ..., X_n are i.i.d. Poisson with parameter λ > 0. Since m₁ = E[X₁] = λ, we
can use the sample mean m̂₁ = (1/n) Σ_{i=1}^n X_i, so that

λ̂_MME = m̂₁ = (1/n) Σ_{i=1}^n X_i.

On the other hand, Var(X_i) = λ as well, so writing Var(X_i) = m₂ − m₁² we can also use the estimator

λ̂_MME = m̂₂ − m̂₁² = (1/n) Σ_{i=1}^n (x_i − x̄)².

Which estimator is “better”?
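One way to answer this (a simulation sketch added here, not part of the original notes) is to compare the two estimators' mean squared errors over repeated samples; the sample mean typically wins.

```python
# Compare the two method-of-moments estimators of a Poisson rate by MSE.
import math, random

def poisson_sample(rng, lam):
    # Knuth's multiplication method for Poisson sampling
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

lam, n, reps = 3.0, 50, 4000
rng = random.Random(0)
mse_mean = mse_var = 0.0
for _ in range(reps):
    xs = [poisson_sample(rng, lam) for _ in range(n)]
    xbar = sum(xs) / n
    svar = sum((x - xbar) ** 2 for x in xs) / n   # hat m2 - hat m1^2
    mse_mean += (xbar - lam) ** 2
    mse_var += (svar - lam) ** 2

print(mse_mean / reps)   # about lam/n = 0.06
print(mse_var / reps)    # noticeably larger
```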

4.2 Maximum likelihood estimators

Definition 4.4. An estimator T is called a maximum likelihood estimator (MLE) for θ if

L(T(x), x) = max_{θ∈Θ} L(θ, x)  ∀x ∈ X,

and is denoted by θ̂_MLE.

Theorem 4.5 (The Invariance Property). If γ = g(θ) and g is bijective, then θ̂ is an MLE for
θ if and only if γ̂ = g(θ̂) is an MLE for γ.

Proof. Part A statistics.

In the case above, if g is not bijective, we define γ̂_MLE = g(θ̂_MLE).

4.2.1 Finding the MLE


1. In many standard cases the function L is differentiable. In these cases, one way to maximise L
is to differentiate it and set the (partial) derivatives to zero. If L achieves its maximum at an
interior point of Θ then θ̂_MLE must be a solution of

∂L(θ; x)/∂θ_j = 0,  j = 1, ..., k.

Note that it is sometimes easier to work with the log-likelihood ℓ instead of L, and that maximising L
is equivalent to maximising ℓ.

2. One then has to show that the solution to this system of equations is a local maximum (by
considering the Hessian matrix), and if the solution is not unique, to show that it is a global maximum.

3. Notice that in some cases the maximum may occur at a boundary point, in which case the partial
derivatives may not vanish.

Lemma 4.6 (MLE for exponential families). Consider a k-parameter exponential family P =
{f(·; η) : η ∈ Ξ} in natural parameterisation,

(4.1) L(η; x) = exp( Σ_{j=1}^k η_j T_j(x) − B(η) ) h(x).

Then any MLE of η satisfies

(4.2) T_j(x) = ( ∂B/∂η_j )(η̂_MLE),  j = 1, ..., k.

If P is strictly k-parameter and (4.2) admits a solution, then it is unique.

Proof. The first statement follows by setting the gradient of the log-likelihood to zero. For the
second statement, note that by Theorem 1.15,

Cov_η(T) = ( ∂²B(η)/∂η_i∂η_j )_{ij}.

By Propositions 1.9 and 1.10, Cov_η(T) is strictly positive definite. Therefore the log-likelihood
is strictly concave, and therefore any local maximum is also a global maximum.
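As a concrete illustration (a sketch added here, not part of the original notes), the stationarity equation (4.2) can be solved numerically. For n i.i.d. Bernoulli observations in natural parametrisation, η = log(p/(1−p)), T(x) = Σ x_i and B(η) = n log(1 + e^η), and Newton's method on B′(η) = T recovers η̂ = logit(x̄).

```python
# Solve T = B'(eta) by Newton's method for the Bernoulli natural parameter.
import math

def mle_natural_bernoulli(xs, iters=50):
    n, T = len(xs), sum(xs)
    dB = lambda eta: n * math.exp(eta) / (1 + math.exp(eta))         # B'(eta)
    d2B = lambda eta: n * math.exp(eta) / (1 + math.exp(eta)) ** 2   # B''(eta) = Var(T) > 0
    eta = 0.0
    for _ in range(iters):
        eta -= (dB(eta) - T) / d2B(eta)
    return eta

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]     # sample mean 0.7
print(mle_natural_bernoulli(xs))        # log(0.7/0.3) ~ 0.8473
```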

4.3 Variance and mean squared error

Definition 4.7. The mean squared error (MSE) of an estimator T for g(θ) is defined as

MSEθ (T ) = Eθ [(T − g(θ))2 ].

(This is also often called the quadratic loss function.)

Proposition 4.8. In general, for an estimator T for g(θ),

MSE_θ(T) = Var_θ(T) + ( E_θ[T] − g(θ) )²,

where the second term is the squared bias. In particular, if T is unbiased, MSE_θ(T) = Var_θ(T).

Proof. Exercise.

Example. Let X = (X₁, ..., X_n) be a sample of i.i.d. U(0, θ) random variables. Then θ̂_MLE =
X_max = max{X_i : i = 1, ..., n}.

It’s easy to check that E_θ(X_max) = nθ/(n+1) and Var_θ(X_max) = nθ²/((n+1)²(n+2)), so that

MSE_θ(X_max) = 2θ²/((n+1)(n+2)).

However, the estimator θ̂ = ((n+1)/n) X_max is unbiased, and indeed

MSE_θ(θ̂) = θ²/(n(n+2)) < MSE_θ(θ̂_MLE).
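A quick simulation (a sketch added here, not part of the original notes) reproduces this comparison.

```python
# Empirical MSEs of X_max and (n+1)/n * X_max for U(0, theta) samples.
import random

theta, n, reps = 2.0, 10, 100_000
rng = random.Random(7)
mse_mle = mse_unbiased = 0.0
for _ in range(reps):
    xmax = max(rng.uniform(0, theta) for _ in range(n))
    mse_mle += (xmax - theta) ** 2
    mse_unbiased += ((n + 1) / n * xmax - theta) ** 2

print(mse_mle / reps)        # ~ 2 theta^2/((n+1)(n+2)) = 0.0606...
print(mse_unbiased / reps)   # ~ theta^2/(n(n+2))       = 0.0333...
```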

Typically, although we write θ̂ = T(X), with X ∼ f(·; θ), for an estimator, the notation is overloaded
when considering X₁, ..., X_n i.i.d. from f(·; θ) with n varying. In this case we understand the estimator and the
statistic as a family of estimators {θ̂_n}_n with θ̂_n := T_n(X₁, ..., X_n) and T_n : X^n → R^k for some k.

For most statistics, it is quite easy to see the dependence on n; e.g. for the sample mean T_n(x) =
(x₁ + ··· + x_n)/n, or T_n(x) = min_{i≤n} x_i, etc.

As the sample size n → ∞ we expect a good family of estimators θ̂_n = T_n(X_{1:n}) to eventually recover
the value of the parameter we are estimating exactly; that is, when X₁, ..., X_n are i.i.d. from f(·; θ) we expect
T_n(X_{1:n}) → θ in some sense, e.g. in probability, almost surely, or in MSE. This is the concept of
consistency.

iid
Definition 4.9 (Consistency). Let X1 , . . . , Xn ∼ f (·; θ) and consider the family of estimators
θbn = Tn (X1:n ). We say that Tn (X1:n ) is

• consistent in probability if for all ε > 0

lim Pθ ∥Tn (X1:n ) − θ∥ > ε → 0;


 
n→∞

• consistent in MSE if
lim MSEθ (Tn ; θ) → 0;
n→∞

• strongly consistent if

lim Tn (X1:n ) → θ), Pθ -almost surely.


n→∞

Any mode of convergence can be used to define a corresponding notion of consistency.
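To make the definition concrete (a simulation sketch added here, not part of the original notes), the sample mean of N(θ, 1) observations is consistent in MSE: its empirical MSE decays like 1/n.

```python
# Empirical MSE of the sample mean for N(theta, 1) data, shrinking like 1/n.
import random

theta = 2.0
rng = random.Random(3)
for n in [10, 100, 1000, 10_000]:
    reps, mse = 400, 0.0
    for _ in range(reps):
        xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        mse += (xbar - theta) ** 2
    print(n, mse / reps)   # roughly 1/n
```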

Chapter 5

MVUEs and the Cramer-Rao Lower Bound

Now that we have developed a few techniques for estimating a parameter, we need to evaluate how well
various estimators actually work.

Suppose X = (X1 , . . . , Xn ) is a random sample from the distribution Pθ . What is a ‘good’ estimator of
θ?

A fairly natural pathway would be to try and minimise the MSE:

Definition 5.1. We say T1 is a uniformly better estimator than T2 (or better in quadratic
mean) if for all θ ∈ Θ,
MSEθ (T1 ) ⩽ MSEθ (T2 ).

Remark. If θ̂ ≡ θ₀ (a constant estimator), then MSE_{θ₀}(θ̂) = 0. Hence no other estimator can be uniformly better!

5.1 The CRLB in the one-dimensional case


Definition 5.2. T = T(X₁, ..., X_n) is the minimum variance unbiased estimator (MVUE)
for θ (resp. for g(θ)) if

• T is unbiased, and
• for all unbiased estimators T̃, Var_θ(T̃) ⩾ Var_θ(T) ∀θ ∈ Θ.

The estimator T is furthermore said to be regular if

∫_A T(x) (∂/∂θ) L(θ; x) dx = (∂/∂θ) ∫_A T(x) L(θ; x) dx = (∂/∂θ) E_θ[T(X)].

Theorem 5.3 (Cramer-Rao Lower Bound (CRLB) in 1 dimension). Suppose Regs 2–4 hold
and that 0 < IX (θ) < ∞. Let γ = g(θ) where g is a continuously differentiable real-valued function
with g ′ ̸= 0.

Let T be a regular unbiased estimator of γ. Then

    Varθ(T) ⩾ |g′(θ)|² / IX(θ),   ∀θ ∈ Θ,


with equality if and only if

    T(x) − g(θ) = g′(θ) S(θ, x) / IX(θ)   ∀x ∈ A, ∀θ ∈ Θ.

Remark. If T attains the CRLB,

    Varθ(T) = |g′(θ)|² / IX(θ),

then it is clearly a MVUE. There is no guarantee that there exists an estimator which attains the bound.

Remark. In the case g(θ) = θ the CRLB is

    Varθ(T) ⩾ 1 / IX(θ)

and T attains the CRLB if and only if S(θ, x) = IX(θ)(T(x) − θ) ∀x ∈ A, ∀θ ∈ Θ, or equivalently
T(x) = θ + S(θ, x)/IX(θ).

Proof of theorem. Note that

(5.1)  Covθ(T, S(θ, X)) = Eθ[T S(θ, X)]                 since Eθ(S(θ, X)) = 0
(5.2)                   = ∫_X T(x) (∂ log p(x, θ)/∂θ) p(x, θ) dx
(5.3)                   = ∫_X T(x) (∂p(x, θ)/∂θ) dx
(5.4)                   = (∂/∂θ) ∫_X T(x) p(x, θ) dx    (note this is where we need T regular)
(5.5)                   = (∂/∂θ) Eθ[T] = g′(θ).

Now set c(θ) := g′(θ)/IX(θ). Then

(5.6)  0 ⩽ Varθ(T − c(θ)S(θ, X)) = Varθ(T) + c²(θ) Varθ(S(θ, X)) − 2c(θ) Covθ(T, S(θ, X))
(5.7)                            = Varθ(T) + c²(θ)IX(θ) − 2c(θ)g′(θ)
(5.8)                            = Varθ(T) − |g′(θ)|²/IX(θ),

which is the CRLB. We have equality if and only if T − c(θ)S(θ, X) is almost surely constant, and
in that case it must be equal to its expectation g(θ):

    T(x) − c(θ)S(θ, x) = g(θ)  ⇐⇒  T(x) − g(θ) = S(θ, x)g′(θ)/IX(θ).

Is it common that there exists an estimator attaining the CRLB?

Corollary 5.4. Suppose that X ∼ f(·; θ) for θ ∈ Θ. Under regularity conditions, if some estimator
γ̂ of γ = g(θ) attains the CRLB, it follows that {f(·; θ) : θ} is in some exponential family.

Proof. If γ̂ = T(x) attains the CRLB then we know that

    T(x) = g(θ) + (g′(θ)/IX(θ)) (∂/∂θ) log f(x; θ).

Rearranging we have

    (∂/∂θ) log f(x; θ) = (IX(θ)/g′(θ)) (T(x) − g(θ)).

Integrating in θ (by the fundamental theorem of calculus) we have that

    log f(x; θ) = C(x) + η(θ)T(x) − B(θ),

for some functions C, η, B, and the conclusion follows.

Corollary 5.5. Suppose that Eθ[T(X)] = θ + b(θ) (so that b(θ) is the bias of T) and that T is
regular. Then

    Varθ(T(X)) ⩾ |1 + b′(θ)|² / IX(θ).

Example. Suppose X ∼ Bin(n, θ), where n is known. Our parameter of interest will be γ = θ(1−θ)
(so g′(θ) = 1 − 2θ). Hence

    ℓ(θ, x) = log C(n, x) + (n − x) log(1 − θ) + x log θ,

and therefore

    S(θ, x) = −(n − x)/(1 − θ) + x/θ,

so

    (∂/∂θ) S(θ, x) = −(n − x)/(1 − θ)² − x/θ².

Thus the Fisher information is

(5.9)   IX(θ) = −Eθ[(∂/∂θ) S(θ, X)]
(5.10)        = (n − Eθ[X])/(1 − θ)² + Eθ[X]/θ² = n/((1 − θ)θ).

Observe that T(x) = (x/(n−1))(1 − x/n) is unbiased for γ (check this as an exercise) and

    Varθ(T) = θ/n − [θ²(5n−7) − 4θ³(2n−3) + θ⁴(4n−6)] / (n(n−1)),

which is larger than the CRLB of (1−2θ)²θ(1−θ)/n.
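The gap between Varθ(T) and the CRLB is easy to confirm numerically; the following Python sketch (n, θ arbitrary) estimates Varθ(T) by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 20, 0.3, 500_000

x = rng.binomial(n, theta, size=reps)
t = x * (n - x) / (n * (n - 1))        # unbiased estimator of theta*(1 - theta)

crlb = (1 - 2 * theta) ** 2 * theta * (1 - theta) / n
print(t.mean(), theta * (1 - theta))   # unbiasedness check
print(t.var(), crlb)                   # empirical Var(T) strictly exceeds the CRLB
```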

5.2 Efficiency

Definition 5.6. The efficiency of an unbiased estimator T of g(θ) is the ratio of the CRLB to its
variance, that is

    e(T, θ) = [g′(θ)]² / (IX(θ) Varθ(T)).

An unbiased estimator which attains the CRLB is called efficient.

The following theorem is valid for exponential families.

Theorem 5.7. Suppose that the distribution of X = (X1, . . . , Xn) belongs to a one-parameter
exponential family in ζ and T. Then the sufficient statistic T is an efficient estimator for the
parameter γ = g(θ) = Eθ[T].

Proof. The probability function of the sample is given by

    p(x; θ) = exp(T(x)ζ(θ) − B(θ)) h(x)

and we have for the score function

    S(θ; x) = (∂/∂θ) ln p(x; θ) = −B′(θ) + ζ′(θ)T(x).

This means that S(θ; x) is a linear function of T(x), hence

(5.11)    Corθ(S(θ; X), T(X))² = 1.

From the proof of the CRLB we know that for an unbiased estimator T of g(θ):

    Covθ(S(θ; X), T(X)) = g′(θ).

Thus by (5.11)

    [g′(θ)]² / ( Varθ(S(θ; X)) Varθ(T(X)) ) = 1,

but since Varθ(S(θ; X)) = IX(θ) we have that

    Varθ(T(X)) = [g′(θ)]² / IX(θ),

which is the Cramer-Rao bound.

5.3 The multivariate case


We turn now to the multivariate case. Suppose that γ = g(θ) ∈ Rm .

We will compare matrices using the Loewner order:

Definition 5.8. Let T, T∗ be two unbiased estimators for γ. We say that T∗ has a smaller
covariance matrix than T at θ ∈ Θ if

    uᵗ(Covθ T∗ − Covθ T)u ⩽ 0   ∀u ∈ R^m,

and we write Covθ T∗ ⪯ Covθ T.

Theorem 5.9 (Cramer-Rao Lower Bound in m dimensions). Suppose Regs 1, 2′, 3′, 4′ hold
and that IX(θ) is not singular. Then the CRLB is

    Covθ T ⪰ (Dθ g)(θ) IX(θ)⁻¹ (Dθ g)(θ)ᵗ   ∀θ ∈ Θ,

where Dθ g is the Jacobian matrix, so (Dθ g)(θ)ij = ∂gi(θ)/∂θj.

Proof. The idea is similar to the one-dimensional case; start by computing

    Covθ[ T(X) − Dθ g(θ) IX(θ)⁻¹ S(X, θ) ] ⪰ 0.

The rest is an exercise in matrix multiplications.

Example. Let X = (X1, . . . , Xn) be a random sample of N(µ, σ²) random variables, where our
parameter of interest is θ = (µ, σ²). Recall from Part A Statistics that

    IX(θ) = diag( n/σ², n/(2σ⁴) ).

The estimators X̄ and S² are independent, with Var(X̄) = σ²/n and Var(S²) = 2σ⁴/(n−1). The
CRLB for the second component is 2σ⁴/n, so we can see that the CRLB is not attained.

5.4 MLEs and MVUEs


Note too the following, which shows that MLEs line up with MVUEs when the CRLB is attained:

Theorem 5.10. Under Regs 1, 2′, 3′, 4′, if θ̂MLE is the MLE for θ and if there exists θ̃ which is
unbiased and attains the CRLB, then θ̃ = θ̂MLE almost surely.

Proof. For simplicity we prove this statement in the case of a real-valued parameter. Suppose that
θ̃ is an efficient estimator of θ. Then we know that

    θ̃(x) − θ = S(θ; x)/IX(θ)

for all θ ∈ Θ and x ∈ A. So it holds in particular for θ = θ̂MLE(x):

    θ̃(x) − θ̂MLE(x) = S(θ̂MLE(x); x)/IX(θ̂MLE(x)).

Since θ̂MLE(x) maximizes ℓ(·; x) we have S(θ̂MLE(x); x) = 0 and thus

    θ̃(x) − θ̂MLE(x) = 0.

Remark. Note that θ̂ may be unbiased and MVUE without attaining the CRLB, in which case the
conclusion above may be false.

Chapter 6

The Rao-Blackwell and Lehmann-Scheffé theorems

Of course, even when the CRLB is not achievable, we still want to be able to find an MVUE.

Theorem 6.1 (Rao-Blackwell Theorem). Let X ∼ Pθ and let T be a sufficient statistic. Let γ̂
be an unbiased estimator for γ = g(θ) ∈ R^k.

Define γ̂T = Eθ[γ̂ | T]. Then:

1. γ̂T is a function of T alone and does not depend on θ,

2. Eθ[γ̂T] = γ ∀θ ∈ Θ,

3. Covθ(γ̂T) ⪯ Covθ(γ̂) (which reduces to Varθ(γ̂T) ⩽ Varθ(γ̂) in the case k = 1).

If tr(Covθ(γ̂)) < ∞ then Covθ(γ̂T) = Covθ(γ̂) if and only if γ̂T = γ̂ almost surely.

Intuitively, this says that any unbiased estimator can always be improved by conditioning on a sufficient
statistic. The last statement says that if some unbiased estimator cannot be improved by conditioning on
a sufficient statistic, then the estimator is essentially a function of the statistic.

Remark. Notice that the Rao-Blackwell theorem implies that an unbiased estimator which is a function
of a minimal sufficient statistic cannot be improved by conditioning on any sufficient statistic. Why is
that? A minimal sufficient statistic T can be written as a function of any other sufficient statistic T′;
therefore if γ̂ is an unbiased estimator, γ̂T is a function of T and therefore a function of T′. Therefore
E[γ̂T | T′] = γ̂T and we are in the equality case of the Rao-Blackwell theorem. (On its own this does not
make γ̂T an MVUE; for that we also need completeness, as in the Lehmann-Scheffé theorem below.)

Proof of theorem. We prove the three parts in order:

1. Since T is sufficient, f(x | t, θ) is independent of θ, so

    γ̂T = Eθ[γ̂ | T = t] = ∫_X γ̂(x) f(x | t, θ) dx = ∫_X γ̂(x) f(x | t) dx,

which does not depend on θ.

2. By the unbiasedness of γ̂ and the tower property of expectations,

    Eθ[γ̂T] = Eθ[Eθ[γ̂ | T]] = Eθ[γ̂] = γ.


3. For k = 1, the result is fairly straightforward:

(6.1)  Varθ(γ̂) = Eθ[(γ̂ − γ)²] = Eθ[(γ̂ − γ̂T + γ̂T − γ)²]
(6.2)           = Eθ[Eθ[(γ̂ − γ̂T + γ̂T − γ)² | T]]
(6.3)           = Eθ[Eθ[(γ̂ − γ̂T)² | T] + 2 Eθ[(γ̂ − γ̂T)(γ̂T − γ) | T] + Eθ[(γ̂T − γ)² | T]]
(6.4)           = Eθ[Eθ[(γ̂ − γ̂T)² | T]] + 0 + Eθ[Eθ[(γ̂T − γ)² | T]]
(6.5)           = Eθ[Varθ(γ̂ | T)] + Varθ(γ̂T)
(6.6)           ⩾ Varθ(γ̂T).

(The cross term vanishes since, conditionally on T, γ̂T − γ is constant and Eθ[γ̂ − γ̂T | T] = 0.)

For k > 1, we can instead do:

(6.7)  Covθ[γ̂] = Eθ[(γ̂ − γ)(γ̂ − γ)ᵗ]
(6.8)           = Eθ[(γ̂ − γ̂T)(γ̂ − γ̂T)ᵗ] + Eθ[(γ̂T − γ)(γ̂T − γ)ᵗ]
                   + Eθ[(γ̂ − γ̂T)(γ̂T − γ)ᵗ] + Eθ[(γ̂T − γ)(γ̂ − γ̂T)ᵗ]
(6.9)           = Eθ[(γ̂ − γ̂T)(γ̂ − γ̂T)ᵗ] + Covθ(γ̂T) + cross terms.

The first term here is clearly positive semi-definite, and it isn't too hard to see (conditioning on T
as above) that the cross terms are equal to zero. The result follows.

The proof of the equality case is left as an exercise.

Example. Let X1, . . . , Xn be i.i.d. Ber(θ) random variables. Note that θ̂ = X1 is unbiased for θ,
and that T = Σ_{i=1}^n Xi is sufficient for θ.

In this case,

(6.10)  θ̂T = Eθ[X1 | T = t] = Pθ(X1 = 1 | T = t) = Pθ(X1 = 1, T = t) / Pθ(T = t)
(6.11)      = Pθ(X1 = 1, Σ_{i=2}^n Xi = t − 1) / ( C(n, t) θᵗ(1 − θ)^{n−t} )
(6.12)      = θ C(n−1, t−1) θ^{t−1}(1 − θ)^{n−t} / ( C(n, t) θᵗ(1 − θ)^{n−t} ) = t/n,

so θ̂T = T/n.
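The variance reduction promised by the theorem is easy to see by simulation; a short Python sketch (θ, n arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.4, 25, 100_000

x = rng.binomial(1, theta, size=(reps, n))
crude = x[:, 0]          # the unbiased but noisy estimator X_1
rb = x.mean(axis=1)      # its Rao-Blackwellisation E[X_1 | T] = T/n

print(crude.mean(), rb.mean())   # both ≈ theta: unbiasedness is preserved
print(crude.var(), rb.var())     # variance falls from theta(1-theta) to theta(1-theta)/n
```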

Definition 6.2. A statistical model {Pθ : θ ∈ Θ} is called complete if for any h : X → R,

Eθ [h(X)] = 0 ∀θ ∈ Θ =⇒ Pθ (h(X) = 0) = 1 ∀θ ∈ Θ.
A statistic T is called complete if the model {PθT : θ ∈ Θ} is complete, i.e.

Eθ [h(T )] = 0 ∀θ ∈ Θ =⇒ Pθ (h(T ) = 0) = 1 ∀θ ∈ Θ.

Examples.

1. Suppose the statistical model consists only of the two distributions N(1, 2) and N(0, 1). This
model is not complete: take h(x) = (x − 1)² − 2. For both distributions, E[h(X)] = 0, but
h(x) ≠ 0 for all x ≠ 1 ± √2.

2. The statistical model {U(0, θ), θ ∈ R⁺} is complete. Indeed, suppose 0 = Eθ[h(X)] =
∫₀^θ (1/θ) h(x) dx for all θ > 0. Then

    ∫₀^θ h(x) dx = 0   ∀θ > 0.

But (∂/∂θ) ∫₀^θ h(x) dx = h(θ) almost everywhere (by the Lebesgue differentiation theorem), so we
conclude that h(x) = 0 almost everywhere. Since the variable X ∼ Pθ has a density we conclude
that Pθ(h(X) = 0) = 1.

3. If X1, . . . , Xn are i.i.d. U(0, θ) then Xmax is a complete statistic. Indeed, the density of Xmax
is

    fθ(t) = (n t^{n−1}/θⁿ) 1_{t∈[0,θ]}.

Then if 0 = Eθ[h(Xmax)] = ∫_{−∞}^{∞} h(t)fθ(t) dt = (n/θⁿ) ∫₀^θ h(t)t^{n−1} dt for all θ ∈ Θ, then
0 = ∫₀^θ h(t)t^{n−1} dt for all θ ∈ Θ. Thus, again using the Lebesgue differentiation theorem (which
says that, as a function of θ, this integral is differentiable with derivative h(θ)θ^{n−1} almost everywhere),
we have that h(t)t^{n−1} = 0 for almost every t, so h = 0 almost everywhere and we conclude as above.

The following fact will be useful:

Lemma 6.3. Let

    p(x; θ) = exp( Σ_{i=1}^k ηi(θ)Ti(x) − B(θ) ) h(x),   θ ∈ Θ,

be a strictly k-parameter exponential family. The joint distribution of the natural observation vector
T(X) = (T1(X), . . . , Tk(X)), with X ∼ p(·; θ), belongs to a strictly k-parameter exponential family
with natural parameters η1(θ), . . . , ηk(θ).

Proof. We only deal with the discrete case. Fix some vector y ∈ R^k and let T_y = {x : T(x) = y}.
Then

(6.13)  Pθ(T = y) = Σ_{x∈T_y} Pθ(X = x)
(6.14)            = Σ_{x∈T_y} h(x) exp( Σ_{i=1}^k ηi(θ)yi − B(θ) )
(6.15)            = exp( Σ_{i=1}^k ηi(θ)yi − B(θ) ) Σ_{x∈T_y} h(x)
(6.16)            = exp( Σ_{i=1}^k ηi(θ)yi − B(θ) ) h₀(y),

where h₀(y) := Σ_{x∈T_y} h(x).

Theorem 6.4 (Completeness for exponential families). If P is a full-rank strictly k-parameter
exponential family then the natural observation T(x) = (T1(x), . . . , Tk(x)) is sufficient and complete.

Proof (sketch). The idea is to use uniqueness of Laplace transforms, or, closer to home, uniqueness
of the Moment Generating Function.

Suppose that the model is

    p(x, θ) = h(x) exp( Σ_{i=1}^k ηi(θ)Ti(x) − B(θ) ).

Hence assuming Eθ[ϕ(T)] = 0 for all θ means that

(6.17)    ∫_{R^k} ϕ(t) h₀(t) exp( Σ_{i=1}^k ηi(θ)ti ) dt = 0,   ∀θ ∈ Θ.

Since the model is full rank we know that η(Θ) contains a k-dimensional interval. Notice that
by redefining h₀ we can shift the parameter vector η(θ) by any fixed vector without changing the
exponential family; thus we can assume w.l.o.g. that for some a > 0 we have [−a, a]^k ⊂ η(Θ). In
particular 0 ∈ η(Θ) and therefore ∫ ϕ(t)h₀(t) dt = 0. We decompose ϕ = ϕ⁺ − ϕ⁻ into its positive
and negative parts, that is ϕ⁺ = max{ϕ, 0} and ϕ⁻ = −min{ϕ, 0}. This decomposes the range of
the statistic T into two disjoint sets, T⁺ = {t : ϕ(t) ≥ 0} and T⁻ = {t : ϕ(t) < 0}, with ϕ⁺
supported on T⁺ and ϕ⁻ on T⁻. Then we have that

    ∫ ϕ⁺(t)h₀(t) dt = ∫ ϕ⁻(t)h₀(t) dt =: w.

Notice w < ∞ since otherwise the integral would not be defined.

Suppose that w > 0. Then we can define the probability measures

    λ⁺(dt) = (ϕ⁺(t)h₀(t)/w) 1_{T⁺}(t) dt,    λ⁻(dt) = (ϕ⁻(t)h₀(t)/w) 1_{T⁻}(t) dt.

By (6.17) we thus have that the MGFs of λ⁺, λ⁻ are equal for all η ∈ [−a, a]^k, i.e. that

    E_{T∼λ⁺} exp[η · T] = E_{T∼λ⁻} exp[η · T]

for all η ∈ [−a, a]^k, which by uniqueness of MGFs implies that λ⁺ = λ⁻. But this is impossible
since λ⁺(T⁺) = 1 whereas λ⁻(T⁺) = 0. Therefore it must be that w = 0, in which case we conclude
that ϕ⁺, ϕ⁻ ≡ 0 on the support of the exponential family and we are done.

We may then ask: does a complete sufficient statistic always exist? The answer is no, as the
following example shows.

Example. Suppose X ∼ U(θ, θ + 1), θ ∈ R. Then T(X) = X is clearly a sufficient statistic. By
considering the likelihood ratios, it is not difficult to establish that T is also minimal.

Now suppose that T̃ is a sufficient statistic. By minimality of T we know that there exists a function
g such that x = T(x) = g ∘ T̃(x). Consider then the function t ↦ sin(2πg(t)). We have

    Eθ[sin(2πg(T̃))] = ∫_θ^{θ+1} sin(2πg ∘ T̃(x)) dx = ∫_θ^{θ+1} sin(2πx) dx = 0

for all θ. However

    Pθ( sin(2πg(T̃)) = 0 ) = Pθ( sin(2πX) = 0 ) = 0,

and thus T̃ cannot be complete. Since T̃ was an arbitrary sufficient statistic, we conclude that the
model admits no complete sufficient statistic.


The previous example shows that we are not always guaranteed to find a complete sufficient statistic.
When we do have access to one, however, the following theorem says we can use it to construct MVUEs.

Theorem 6.5 (Lehmann-Scheffé Theorem). Let T be a sufficient and complete statistic for
the statistical model P and let γ̂ be an unbiased estimator for γ = g(θ) ∈ R^k.

Then γ̂T = Eθ[γ̂ | T] is an MVUE for γ.

Proof. Let γ̃ be another unbiased estimator of γ. Then γ̃T = Eθ[γ̃ | T] is also an unbiased estimator.
By sufficiency, γ̃T and γ̂T are both functions of T not depending on θ, so we can define
f(T) := γ̃T − γ̂T. Since both are unbiased estimators of γ we have that Eθ[f(T)] = 0 for all θ, and
since T is complete we have that f(T) = 0 Pθ-almost surely for all θ. This proves that γ̃T = γ̂T a.s.
and therefore that for all θ

    Covθ(γ̂T) = Covθ(γ̃T) ⪯ Covθ(γ̃),

where the inequality follows from the Rao-Blackwell theorem. Since γ̃ was arbitrary the result
follows.

Examples.

1. Uniform. Let X1, . . . , Xn be i.i.d. U[0, θ] random variables. Recall that Eθ[Xmax] = nθ/(n+1).
We have seen that Xmax is complete and sufficient; hence θ̂ = ((n+1)/n) Xmax is the MVUE. Note
the CRLB does not apply as the distribution is not regular enough; e.g. Reg 1 is violated as
the support depends on the parameter.

2. Normal. Let X1, . . . , Xn be i.i.d. N(µ, σ²) random variables. We know this is a strictly
2-parameter exponential family, so T = (Σ_{i=1}^n Xi, Σ_{i=1}^n Xi²) is complete and sufficient. As
(X̄, S²) is unbiased and a function of T, it is the MVUE. (Here X̄ := (1/n) Σ_{i=1}^n Xi and
S² := (1/(n−1)) Σ_{i=1}^n (Xi − X̄)².)

Remember that for S² the Cramer-Rao bound is not attained.

3. Poisson 1. Let X = (X1, . . . , Xn) be a sample of i.i.d. Po(λ) random variables. Recall that

    λ̂MME = m̂1 = (1/n) Σ_{i=1}^n Xi   and   λ̃MME = m̂2 − m̂1² = (1/n) Σ_{i=1}^n (Xi − X̄)²

are two moment estimators for λ.

The Poisson family is a strictly 1-parameter exponential family with canonical observation
T(X) = X̄ (for the joint distribution). Thus X̄ is a sufficient and complete statistic.
Hence the Lehmann-Scheffé Theorem tells us that λ̂MME is the MVUE.

What is the Cramer-Rao bound? For a single observation, S(λ; x) = x/λ − 1 and iX(λ) = λ⁻¹.
Thus the CR lower bound is λ/n. Since we can also check that Var(X̄) = λ/n, we confirm
that λ̂MME = X̄ is efficient (it achieves the CRLB).

4. Poisson 2. What about the other estimator above, λ̃MME? Doing a little calculation
reveals that X1 | {Σ_{j=1}^n Xj = k} ∼ Bin(k, 1/n). So, using Rao-Blackwell to ‘improve’ the
unbiased estimator S² = (n/(n−1)) λ̃MME by the sufficient statistic X̄, we get

(6.18)  Eλ[S² | Σ_{j=1}^n Xj = k] = (n/(n−1)) { Eλ[X1² | Σ_{j=1}^n Xj = k] − k²/n² }
(6.19)                            = (n/(n−1)) { (k/n)(1 − 1/n) + k²/n² − k²/n² }
(6.20)                            = k/n.

So starting from S² as an unbiased estimator for λ we arrive at X̄ by Rao-Blackwell using Σ Xi.
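The identity (6.20) can also be checked by simulation; the Python sketch below conditions on one value of Σ Xj (λ, n arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 3.0, 5, 400_000

x = rng.poisson(lam, size=(reps, n))
s2 = x.var(axis=1, ddof=1)      # the unbiased estimator S^2 of lambda
t = x.sum(axis=1)               # the sufficient statistic T = sum of the X_j

k = int(n * lam)                # condition on one value T = k
print(s2[t == k].mean(), k / n) # E[S^2 | T = k] ≈ k/n, as in (6.20)
```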

The Lehmann-Scheffé theorem allows us to use complete sufficient statistics to construct MVUEs. We
have also seen in Example 6 that there are situations in which no sufficient, complete statistic exists. The
question remains then, whether an MVUE always exists. The following theorem says that a necessary and
sufficient condition for an estimator to be MVUE is that it is uncorrelated with all unbiased estimators
of 0.

Theorem 6.6 (NASC for MVUE). Let P = {Pθ : θ ∈ Θ} be a family of distributions
on X and let U be the set of unbiased estimators of 0 with finite variance, that is

    U := { h : X → R : Eθ[h(X)] = 0, Eθ[h²(X)] < ∞ ∀θ ∈ Θ }.

Let γ̂ be an unbiased estimator of γ = g(θ) with finite variance. Then γ̂ is an MVUE of γ if and
only if Eθ[γ̂U] = 0 for all U ∈ U and all θ ∈ Θ.

Proof. “⇒” Suppose γ̂ is MVUE. For U ∈ U and c ∈ R define γ̂c = γ̂ + cU. Notice that γ̂c is also
unbiased and therefore

    Varθ[γ̂c] ≥ Varθ[γ̂],

or equivalently

    c² Varθ[U] + 2c Covθ[γ̂, U] ≥ 0.

Viewing the LHS above as a quadratic function in c, the inequality implies that the discriminant is
non-positive, i.e. that

    4 Covθ[γ̂, U]² ≤ 0,

which implies that Covθ[γ̂, U] = 0.

“⇐” Let γ̃ be another unbiased estimator of γ with finite variance. Then U := γ̂ − γ̃ ∈ U. Therefore
by assumption we have that

    0 = Eθ[γ̂U] = Eθ[γ̂²] − Eθ[γ̂γ̃].

Since Eθ[γ̂] = Eθ[γ̃], the above implies that

    Varθ[γ̂]² = Covθ[γ̂, γ̃]² ≤ Varθ[γ̂] Varθ[γ̃],

whence we conclude that Varθ[γ̂] ≤ Varθ[γ̃].

Example (Example 6 continued.). The above theorem allows us to establish that in Example 6
there exists no MVUE. Suppose that γ̂ = T(X) is an MVUE of γ = g(θ). Let H ∈ U
be any unbiased estimator of 0. Then we have

    ∫_θ^{θ+1} H(x) dx = 0,

and differentiating both sides

    H(θ + 1) = H(θ),  a.e.

By Theorem 6.6 we have that Eθ[γ̂H] = 0 and therefore that

    ∫_θ^{θ+1} T(x)H(x) dx = 0,

and differentiating both sides

    T(θ + 1)H(θ + 1) = T(θ)H(θ),  a.e.,

which along with the fact that H(θ + 1) = H(θ) a.e. allows us to conclude that T(θ + 1) = T(θ) for
a.e. θ.

Finally since γ̂ is unbiased we have

    ∫_θ^{θ+1} T(x) dx = g(θ),

and differentiating both sides

    T(θ + 1) − T(θ) = g′(θ).

Combining everything, the only functions g for which an MVUE may exist are those with g′ = 0.

Chapter 7

Bayesian Inference: Conjugacy and Improper Priors

We turn now, in this second half of the course, to the Bayesian view of statistical inference, and look at
how we may develop further the theory from Part A.

7.1 Recap of fundamentals


Recall that in Bayesian statistics, parameters are treated as random variables too (rather than having
an unknown true value, as in frequentist statistics). At the core of this approach is of course Bayes’
Theorem, which you have met several times over the last two years. In our setting it most commonly
reads as follows:

Theorem 7.1 (Bayes’ Theorem). Given a likelihood L(θ, x) and a prior π(θ) for θ, the
posterior distribution for θ (the conditional distribution of θ given the data X) is given by

    π(θ | x) = L(θ, x)π(θ) / ∫ L(θ′, x)π(θ′) dθ′.

(If π is a mass function replace the integral with a sum.)

We will often simply write

    π(θ | x) ∝ L(θ, x)π(θ),

i.e. posterior ∝ likelihood · prior. The quantity p(x) = ∫ L(θ′, x)π(θ′) dθ′ is called the marginal
distribution of X.

Proof. Prelims/Part A probability and statistics.


Remark. The denominator p(x) = ∫ L(θ′, x)π(θ′) dθ′ in the theorem above is called the marginal
likelihood in this context.

Example. Suppose X ∼ Bin(n, θ), and that our prior distribution for θ is Beta(a, b), i.e.

    π(θ) = θ^{a−1}(1 − θ)^{b−1} / B(a, b),   0 < θ < 1.

The likelihood function is L(θ, x) = C(n, x) θ^x (1 − θ)^{n−x} for x = 0, . . . , n. So by Bayes’ Theorem
the posterior distribution is

(7.1)  π(θ | x) ∝ likelihood · prior
(7.2)           ∝ θ^x (1 − θ)^{n−x} · θ^{a−1}(1 − θ)^{b−1}
(7.3)           = θ^{a+x−1}(1 − θ)^{n−x+b−1}.

This is again (up to normalisation) a Beta distribution, with updated parameters a + x, b + n − x.
This is an example of conjugacy, which we will meet next.

Suppose we choose a, b here such that E[θ] = 0.7 and Var(θ) = 0.01, i.e. a = 14, b = 6. Suppose we
then observe:

• X = 3 for a number of trials n = 10; or alternatively

• X = 30 for a number of trials n = 100.

In the first case our posterior will have a mean of about 0.5 to 0.6, and in the second case our
posterior will have a mean of less than 0.4.

As n increases, the likelihood increasingly overwhelms the prior. This captures the intuition that
the second observation is much stronger evidence than the first that θ is in fact near 0.3.
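These posterior means are immediate to compute via the conjugate update above; a short Python sketch (with a = 14, b = 6 as chosen here):

```python
from scipy.stats import beta

a, b = 14.0, 6.0                      # Beta prior with mean 0.7 and variance 0.01

for x, n in [(3, 10), (30, 100)]:
    post = beta(a + x, b + n - x)     # posterior is Beta(a + x, b + n - x)
    print(f"n = {n}: posterior mean = {post.mean():.3f}")
# n = 10:  0.567  (the prior still carries a lot of weight)
# n = 100: 0.367  (the data dominate, pulling the posterior towards 0.3)
```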

Remark. This example illustrates the general effect at play in Bayesian inference: as we gather more
(relevant) data, effectively the information we have about the unknown parameter increases and we revise
our beliefs accordingly.

7.2 Conjugate priors


We start off now by introducing the notion of conjugacy .

Definition 7.2. Consider a model (L(θ, x))_{θ∈Θ, x∈X}. We say that a family of prior distributions
(πγ)_{γ∈Γ} is conjugate if for all γ ∈ Γ and x ∈ X, there exists γ(x) ∈ Γ such that πγ(· | x) = π_{γ(x)}(·).

We say the prior and the posterior are conjugate distributions, and the prior is a conjugate
prior for the likelihood L.

In other words, a conjugate prior is a prior which, when combined with the likelihood, produces a
posterior distribution in the same family as the prior.

Examples. (Normal conjugacy) Consider the model Xi ∼ N(µ, σ²) i.i.d. and let τ = σ⁻², θ =
(τ, µ). We are going to assume a prior of the following form for θ: τ ∼ Gamma(α, β) and, condi-
tionally on τ, µ | τ ∼ N(ν, (kτ)⁻¹) where k > 0, ν ∈ R are given parameters. In other words

(7.4)  π(τ, µ) = π(τ)π(µ | τ) = (β^α/Γ(α)) τ^{α−1} e^{−βτ} · (1/√(2π)) √(kτ) exp( −(kτ/2)(µ − ν)² )
(7.5)          ∝ τ^{α−1/2} exp( −τ [ β + (k/2)(µ − ν)² ] ).

Since L(θ, x) = (2π)^{−n/2} τ^{n/2} exp( −(τ/2) Σ_{i=1}^n (xi − µ)² ) we have that

(7.6)  π(θ | x) ∝ τ^{α+(n−1)/2} exp( −τ [ β + (k/2)(µ − ν)² + (1/2) Σ_{i=1}^n (xi − µ)² ] ).

The expression in the square bracket is quadratic in µ, so that

    β + (k/2)(µ − ν)² + (1/2) Σ_{i=1}^n (xi − µ)² = β′ + (k′/2)(µ − ν′)²,

where β′, k′, ν′ depend on n, k, ν, x but not on µ or τ.

By completing the square first in µ and then in ν it is a good exercise to check that

    β + (k/2)(µ − ν)² + (1/2) Σ_{i=1}^n (xi − µ)²
        = ((k+n)/2) ( µ − (kν + nx̄)/(k+n) )² + (1/2)(nk/(n+k))(x̄ − ν)² + (1/2) Σ_{i=1}^n (xi − x̄)² + β,

where each term can be easily identified. This shows that the prior is conjugate for this likelihood
since

    π(τ, µ | x) ∝ τ^{α′−1/2} exp( −τ [ β′ + (k′/2)(ν′ − µ)² ] ),

where α′ = α + n/2. Thus, this tells us that τ | x ∼ Gamma(α′, β′) and that µ | τ, x ∼ N(ν′, (k′τ)⁻¹).

It turns out that conjugacy is a general phenomenon for exponential families!

Proposition 7.3 (Conjugate priors for exponential families). Suppose

    L(θ, x) = h(x) exp( Σ_{i=1}^k ηi(θ)Ti(x) − B(θ) )

defines a k-parameter exponential family. Then the distributions of the form

    πγ(θ) ∝ exp( γ₀B(θ) + Σ_{i=1}^k γi ηi(θ) ),

for parameters γ = (γ₀, γ₁, . . . , γk), are a conjugate prior family.

Proof. Exercise.

Example. Let X = (X1 , . . . , Xn ) be a sample of i.i.d. Poi(θ) random variables, so the (joint)
likelihood is
L(θ, x) ∝ exp(−nθ + T (x) log θ)
where T(x) = Σ_{i=1}^n xi. So the natural conjugate prior is of the form

π(θ) ∝ exp(γ0 θ + γ1 log θ).

(Note this is normalisable iff γ0 < 0 and γ1 > −1.)

Writing β = −γ0 and α = γ1 +1, we have π(θ) ∝ θα−1 e−βθ which is the pdf of a Γ(α, β) distribution.

We can easily see that the posterior distribution is Γ(α + T (x), β + n). So indeed the Gamma
distribution is a conjugate prior (for the Poisson likelihood).

Example (Multinomial Distribution and Dirichlet Prior). Consider a multinomial distribu-
tion with N trials and K levels with likelihood

    p(x1:K | θ) = θ1^{x¹} θ2^{x²} · · · θK^{x^K}.

Conjugate priors take the form

    p(θ | α) ∝ θ1^{α1} · · · θK^{αK},    Σ θi = 1, θi ≥ 0.

The above defines a proper prior when α1, . . . , αK > −1, so it's more natural to parameterise as

    p(θ | α) ∝ θ1^{α1−1} · · · θK^{αK−1},    Σ θi = 1, θi ≥ 0,

with αi > 0. This is called the Dirichlet distribution, denoted Dirichlet(α1, . . . , αK).

If (X1, . . . , Xn) are i.i.d. samples from p(· | θ), where for i = 1, . . . , n, xi = (xi¹, . . . , xi^K),

    p(x1:n | θ) = θ1^{Σj xj¹} θ2^{Σj xj²} · · · θK^{Σj xj^K},

it can be easily seen that the posterior is also Dirichlet:

    θ | x1:n, α ∼ Dirichlet( Σj xj¹ + α1, . . . , Σj xj^K + αK ).

7.3 Improper priors


So far both the prior and the posterior functions have been probability densities (or mass functions).
This is natural given the origin in Bayes’ Theorem, but in fact we do not require that the prior be a
‘real’ probability distribution for the posterior to exist and be well-defined.

Definition 7.4. We say that a pdf/pmf π is an improper prior if it has infinite mass:

    ∫_Θ π(θ) dθ = ∞,   π(θ) ⩾ 0 ∀θ ∈ Θ

(as usual replacing integrals with sums if necessary).

A posterior distribution π(θ | x) can be defined as usual as long as

    ∫_Θ f(x, θ)π(θ) dθ < ∞.

Examples.

1. Likelihood X | µ ∼ N(µ, 1) and prior π(µ) = 1 ∀µ ∈ R. In this case log π(µ | x) =
−(x − µ)²/2 + constant, i.e. the posterior distribution is N(x, 1).

2. Likelihood X | p ∼ Bin(n, p) and prior π(p) = [p(1 − p)]⁻¹ (this is the Haldane prior). The
posterior is π(p | x) ∝ p^{x−1}(1 − p)^{n−x−1}, which is improper iff x = 0 or x = n; so the posterior
is not always well-defined.

Exercise. If X is discrete and can take only finitely many values, say {z1 , . . . , zN } = X , show that
we can’t use an improper prior.

Hint: try proving that the marginal likelihood cannot be finite for all i = 1, . . . , N .

Does this argument work for X countably infinite? (Try X ∼ Po(λ), π(λ) = λ−1 .)

7.4 Predictive Distributions


Let us briefly touch on how we can make predictions for new datapoints.


Definition 7.5. If X1, . . . , Xn, Xn+1 are i.i.d. observations from the distribution f(x, θ), with prior
π(θ), then the posterior predictive distribution is

    f(xn+1 | x) = ∫_Θ f(xn+1, θ)π(θ | x) dθ,

where here x = (x1, . . . , xn).

Thus the predictive distribution describes the distribution of a new observation given all the observations
we’ve already made.

Examples.

1. Poisson likelihood, Gamma prior. Suppose Y ∼ Po(θ) and that our prior for θ is a
Γ(α, β) distribution.

The marginal likelihood for this model is

    m(y) = ∫₀^∞ e^{−λ} (λ^y/y!) (β^α/Γ(α)) λ^{α−1} e^{−βλ} dλ.

On the other hand, we can use that π(θ | y) = f(y, θ)π(θ)/m(y), so m(y) = f(y, θ)π(θ)/π(θ | y).
We have seen previously that in this setting the posterior is θ | y ∼ Γ(α + y, β + 1). Hence

(7.7)  m(y) = [ (e^{−θ}θ^y/y!) (β^α e^{−βθ} θ^{α−1}/Γ(α)) ] / [ (β+1)^{α+y} θ^{α+y−1} e^{−(β+1)θ}/Γ(α+y) ]
(7.8)       = (Γ(α + y)/(Γ(α) y!)) (β/(β+1))^α (1/(β+1))^y,

which is the pmf of a NegBin(α, β) distribution.

Thus we have shown that the densities/masses of the Poisson, Gamma and negative binomial
distributions are related by

    pNegBin(y; α, β) = ∫₀^∞ pPo(y; θ) · pΓ(θ; α, β) dθ.

Hence the predictive distribution has pmf

    π(yn+1 | y) = ∫₀^∞ pPo(yn+1; θ) pΓ(θ; α + Σyi, β + n) dθ = pNegBin(yn+1; α + Σyi, β + n),

so is a negative binomial distribution with parameters α + Σ_{i=1}^n yi and β + n.
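This Poisson-Gamma mixture identity is easy to verify by Monte Carlo; in the Python sketch below, nbinom.pmf(k, α, β/(β+1)) is scipy's parameterisation of the NegBin(α, β) pmf derived above (α, β arbitrary):

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(5)
alpha, beta_, reps = 2.0, 1.0, 500_000

theta = rng.gamma(shape=alpha, scale=1 / beta_, size=reps)  # theta ~ Gamma(alpha, beta)
y = rng.poisson(theta)                                      # Y | theta ~ Po(theta)

# The marginal of Y should match the NegBin pmf with p = beta/(beta + 1).
for k in range(4):
    print(k, np.mean(y == k), nbinom.pmf(k, alpha, beta_ / (beta_ + 1)))
```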

2. Gaussian with known variance. Suppose now that X1, . . . , Xn+1 are i.i.d. N(θ, σ²) random
variables, where σ² is known, and that our prior distribution for the mean is θ ∼ N(µ₀, σ₀²).
We want to predict Xn+1, having seen X1, . . . , Xn.

The posterior after the first n observations is

(7.9)   π(θ | x) ∝ π(θ)p(x | θ) ∝ exp( −(θ − µ₀)²/(2σ₀²) ) Π_{i=1}^n exp( −(xi − θ)²/(2σ²) )
(7.10)           ∝ exp( −(1/2)[ (θ − µ₀)²/σ₀² + (1/σ²) Σ_{i=1}^n (xi − θ)² ] )
(7.11)           ∝ exp( −(θ − µn)²/(2σn²) ),

where, by completing the square, we find that

    µn = ( σ₀⁻²µ₀ + σ⁻² Σ_{i=1}^n xi ) / ( σ₀⁻² + nσ⁻² )   and   σn⁻² = σ₀⁻² + nσ⁻².

(Observe that if σ² = σ₀² then the prior has the same weight as that of a single extra observa-
tion.)

So θ | X ∼ N(µn, σn²) and Xn+1 | θ ∼ N(θ, σ²). We can rewrite these two facts as

    θ = µn + σnZ,    Xn+1 = θ + σY

for some independent Y, Z ∼ N(0, 1), and so Xn+1 = µn + σnZ + σY. Thus Xn+1 | X ∼
N(µn, σ² + σn²).

(We could also have arrived at this last result by directly integrating the densities; our method
was just an equivalent and simpler approach in this case.)
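A quick numerical confirmation of the predictive mean and variance (a Python sketch; the data values are made up):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
x = np.array([1.2, 0.7, 1.9, 1.1])            # hypothetical observations
n = len(x)

prec = 1 / sigma0**2 + n / sigma**2           # posterior precision sigma_n^{-2}
mu_n = (mu0 / sigma0**2 + x.sum() / sigma**2) / prec
var_n = 1 / prec

theta = rng.normal(mu_n, np.sqrt(var_n), size=500_000)  # theta ~ posterior
x_new = rng.normal(theta, sigma)                        # X_{n+1} | theta
print(x_new.mean(), mu_n)                     # predictive mean = mu_n
print(x_new.var(), sigma**2 + var_n)          # predictive var  = sigma^2 + sigma_n^2
```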

Chapter 8

Non-Informative Priors

We’ve just seen that priors don’t always have to be probability distributions. When may we want to
make use of this?

We’re used to the notion of a subjective prior , a distribution representing our prior knowledge about
the parameter before any data is collected. With this approach, we may try different priors representing
different ‘points of view’.

This is in contrast to the concept of an objective prior (a non-informative prior ) which we’ll explore
in this chapter. This is a prior which is somehow ‘automatic’, reflecting the lack of any initial knowledge
about the parameter — and crucially may have no probabilistic interpretation, and so doesn’t have to be
a valid probability distribution. Non-informative priors can be used when little or no reliable information
is available.

There are several approaches for defining a non-informative prior, three of which we’ll mention here.

8.1 Uniform priors

Definition 8.1. The uniform prior or flat prior is the prior π(θ) ∝ 1.

This is the obvious, naive representation of lack of information: every value is equally likely. Under
this prior, the posterior is

    π(θ | x) = L(θ, x) / ∫_Θ L(θ, x) dθ,

which is well defined as long as ∫_Θ L(θ, x) dθ < ∞.

Example. Let X ∼ Exp(θ) and π(θ) = 1. The marginal likelihood is ∫₀^∞ θ e^{−θx} dθ, which is finite
for all x > 0, so the posterior is well-defined. But does it have nice properties?

Let η = log θ. Then the prior for η is

    π̃(η) = π(θ(η)) |dθ/dη| = e^η ≠ 1.

We see that after reparameterisation the prior is not flat anymore; in fact, as a prior in η, π̃ is very
informative (large values are much more likely than small ones).

8.2 Jeffrey’s prior


The last example motivates the construction of a prior that does not depend on the parameterisation.


Definition 8.2. In the one-dimensional case Jeffrey's prior is given by

    π(θ) ∝ √Iθ,

where Iθ = Eθ[ −(∂²/∂θ²) ℓ(θ, x) ] is the Fisher information.

Remark. Why does this work? If θ = g(ψ) for some one-to-one differentiable function g then the
reparameterised prior is

    π̃(ψ) ∝ π(g(ψ))|g′(ψ)| = √Iθ |g′(ψ)|.

Recall that Iψ = (g′(ψ))² Iθ, so √Iψ = √Iθ |g′(ψ)|. Hence π̃(ψ) ∝ √Iψ.

So indeed Jeffrey’s prior is invariant under reparametrisation.

8.2.1 Jeffrey’s prior in higher dimensions


This definition generalises naturally to higher dimensions:

Definition 8.3. The k-dimensional Jeffrey's prior is given by

    π(θ) ∝ |Iθ|^{1/2},

where |Iθ| = det Iθ and Iθ is the Fisher information matrix, so under the standard regularity
assumptions (Iθ)ij = −Eθ[ (∂²/∂θi∂θj) ℓ(θ, x) ].

It is easy to check that this is indeed invariant under one-to-one reparametrisation.

Example. Suppose X ∼ Po(λ), so that f(x, λ) = e^{−λ}λ^x/x! for x = 0, 1, 2, . . ..

Then Jeffrey's prior is

(8.1)  π(λ) ∝ √IX(λ) = √( E[(ℓ′(λ, X))²] )
(8.2)        = √( E[ (X/λ − 1)² ] )
(8.3)        = √( Σ_{x=0}^∞ ((x − λ)/λ)² f(x, λ) )
(8.4)        = √( e^{−λ} Σ_{x=0}^∞ (λ^x/x!) ( x²/λ² − 2x/λ + 1 ) )
(8.5)        = √( (1/λ²) E[(X − λ)²] )
(8.6)        = λ^{−1/2}.

Note this is an improper prior.

8.3 Maximum entropy prior


Another possible approach for constructing a non-informative prior is inspired by information theory.


Definition 8.4. The entropy of a pdf/pmf π is defined as

    Ent[π] = −∫_Θ π(θ) log π(θ) dθ.

As always, replace the integral with a sum if π is a pmf.

Remark. In the continuous case, entropy is often referred to as the differential entropy .

A maximum entropy probability distribution has entropy that is at least as great as that of all other
members of a specified class of probability distributions. According to the principle of maximum entropy,
if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms
of specified properties or measures), then the distribution with the largest entropy should be chosen as
the least-informative default.

Since maximizing entropy minimizes the amount of prior information built into the distribution it makes
sense to pick the prior that maximises the entropy subject to any relevant constraints (e.g. a fixed mean).

Example. Suppose we wish to find the distribution π which maximises Ent[π] on Θ = R subject
to the constraints

    ∫_{−∞}^{∞} π(θ) dθ = 1,   ∫_{−∞}^{∞} θπ(θ) dθ = µ   and   ∫_{−∞}^{∞} (θ − µ)²π(θ) dθ = σ²

for fixed µ, σ².

The solution is π(θ) = (1/√(2πσ²)) e^{−(θ−µ)²/(2σ²)}. This can be shown using variational calculus or using
information-theoretic techniques (a proof is seen on a problem sheet in the Information Theory
course).
course).

Thus the Gaussian distribution is the maximum-entropy distribution for the real line when we fix
the mean and variance.

Remark. The maximum entropy distribution does not always exist (for example the class of distributions
may have unbounded entropy).

The previous example leads us to a more general theorem:

Theorem 8.5. Let

    π(θ) = exp( Σ_{i=1}^p λiTi(θ) − B(λ) ),   ∀θ ∈ Θ,

be a probability density function and suppose that

(8.7)    ∫ Ti(θ)π(θ) dθ = ti,   i = 1, . . . , p.

Then π uniquely maximises the entropy among all densities satisfying the constraint (8.7).

Proof. Let Π be the class of distributions satisfying the constraints.

Recall that for two distributions τ1 ≪ τ2 the Kullback-Leibler divergence KL(τ1∥τ2) is defined
through

    KL(τ1∥τ2) = ∫ τ1(dx) log( dτ1/dτ2 ),

where dτ1/dτ2 is the Radon-Nikodym derivative. If τ1 is not absolutely continuous w.r.t. τ2 we set
KL(τ1∥τ2) = +∞. It is a simple application of Jensen's inequality to check that KL(τ1∥τ2) ≥ 0.

We now have the tools to prove the result. Let π′ be any element of Π. Then

    Ent[π′] = −∫ π′(x) log π′(x) dx
            = −∫ π′(x) log( π′(x)/π(x) ) dx − ∫ π′(x) log π(x) dx
            = −KL(π′∥π) − ∫ π′(x) ( Σ_{i=1}^p λiTi(x) ) dx + B(λ)

and since π[Ti] = π′[Ti] for all i

            = −KL(π′∥π) + B(λ) − ∫ π(x) ( Σ_{i=1}^p λiTi(x) ) dx
            = −KL(π′∥π) + Ent[π].

Rearranging we obtain

    Ent[π] − Ent[π′] = KL(π′∥π) ≥ 0.

Example (continued). In the example above, our two constraints were E[T1 (θ)] = µ and E[T2 (θ)] =
σ 2 , where T1 (θ) = θ and T2 (θ) = (θ − µ)2 .

The above theorem then gives that the maximum-entropy prior is of the form π(θ) ∝ exp(λ1θ +
λ2(θ − µ)²). The two constraints then imply that λ1 = 0 and λ2 = −1/(2σ²), thus giving the Gaussian
distribution we just saw.

Example. Suppose a0 ⩽ a1 ⩽ · · · ⩽ ap and θ ∈ (a0 , ap ).

Consider the constraints π(θ ∈ (a_{j−1}, a_j]) = ϕ_j for j = 1, . . . , p. This is equivalent to requiring
E[T_j(θ)] = ϕ_j for j = 1, . . . , p, where T_j(θ) = 1_{a_{j−1}<θ⩽a_j}.

Under these conditions the maximum-entropy distribution is of the form

    π(θ) ∝ exp( Σ_{j=1}^p λ_j 1_{a_{j−1}<θ⩽a_j} ),   a₀ ⩽ θ ⩽ a_p.

Hence π is piecewise constant on the intervals (a_{j−1}, a_j].

Chapter 9

Hierarchical Models

References: Gelman, Carlin et al., Bayesian Data Analysis, Chapter 5, p. 101; Garthwaite, Jolliffe
and Jones, Statistical Inference, Section 7.7, p. 170; Lehmann and Casella, Theory of Point Estimation, p. 253.

9.1 Example
In certain situations, the data we are modelling has a natural hierarchical structure. We illustrate this
first with an extended example.

Example (Study of cardiac treatment across different hospitals). Consider the dataset in
fig. 9.1 consisting of mortality rates in infant cardiac surgery across I = 12 hospitals. Each hospital
i conducts ni surgeries, Yi of which result in death. We use the natural model for the number of
deaths at each hospital as Yi ∼ Bin(ni , θi ), where θi is an unknown parameter.

How do we model the mean mortality rates θ = (θ1 , . . . , θ12 )?

Three broad approaches come to mind:

• Identical parameters. We assume all the θi are identical. This ignores the structure of the
problem and pools all the data. In this case this means we’re assuming the surgery success
rate doesn’t depend on which hospital conducts the surgery.
• Independent parameters. We assume all the θi are independent, i.e. entirely unrelated.
The results from each unit can be analysed independently. In this case this means we’re
assuming there is nothing similar about the surgery at different hospitals, and the failure rates
at different hospitals don’t depend on each other in any way.
• Exchangeable parameters. We assume the θi are similar; no one hospital is a priori any
better than another. More on this later.

Let’s see how the first two approaches can work in this situation, where relevant examining our
estimates for hospitals A and H in particular:

• All θi equal (frequentist approach). The model is Yi ∼ Bin(ni, θ) for each i, so Σ Yi ∼
Bin(Σ ni, θ). Thus the MLE for θ is θ̂ = Σ yi / Σ ni = 0.0739.

• Independent θi (frequentist approach). The model is Yi ∼ Bin(ni, θi) independently for
each i. The MLE for each θi is θ̂i = yi/ni. So in particular θ̂A = 0 and θ̂H = 0.1442.

• All θi equal (Bayesian approach). The model is Yi | θ ∼ Bin(ni, θ) for each i, and we'll
use the prior θ ∼ Beta(a, b) with a = 4 and b = 46. (We choose the Beta distribution since
it's a conjugate prior for the binomial distribution; the choice of parameters a, b will be
discussed later.) The posterior mean of θ is then (Σ yi + a)/(Σ ni + a + b) = 0.0740.

• Independent θi (Bayesian approach). The model is Yi | θi ∼ Bin(ni, θi) independently
for each i, with i.i.d. priors θi ∼ Beta(a, b). The posterior mean for each θi is then
(yi + a)/(ni + a + b), which takes value 0.0412 for hospital A and 0.1321 for hospital H.

Figure 9.1: Number of infant cardiac surgeries and number of mortalities across 12 hospitals.

The first method (frequentist, equal parameters) gives some pretty unlikely results (e.g. the observed
death rate for hospital H is very unlikely given our estimated θ), and the second method (frequentist,
independent parameters) totally ignores data from other hospitals when estimating θi for a particular
hospital; but this is the same medical procedure, so this is unnatural.

The third method (Bayesian, equal parameters) has the same problem as in the frequentist setting,
but the last method (Bayesian, independent parameters drawn from the same distribution) seems to
address these issues; the parameters are different for each hospital, but are all drawn from the same
distribution, whose parameters can be inferred from the entire dataset.

This is what we mean by a natural hierarchical structure.

How can we estimate the parameters, then, of the shared prior distribution?

Example (continued). In the example above, the approach we settled on models the θi as drawn
independently from a Beta(a, b) distribution. How do we estimate the parameters (a, b)?

• Approximate empirical Bayes approach. The most obvious way to estimate (a, b) is to
use a standard frequentist technique: the method of moments. In this context, this means we
pick (a, b) so that the prior distribution has the same mean and variance as the sample mean
and sample variance of the observed maximum likelihood estimates for the parameters θi.
Specifically, we calculate ri = yi/ni for each hospital (this is the observed mortality rate; the
MLE for θi) and we calculate the sample mean and sample variance of the set {r1, . . . , r12};
and then solve for â, b̂ such that Beta(â, b̂) has the same mean and variance.
(Then we use Beta(â, b̂) as our shared prior for the θi, to obtain the posterior distribution
π(θi | â, b̂, yi) for each θi as described above.)

This approach is reasonable, but we have the problem that we are using the same data twice —
once to pick (â, b̂) and once to find the individual posteriors for the θi. This leads to overconfidence
in the posterior distributions! Moreover, we're making a fixed choice of (â, b̂) and working with that
choice, so the posterior distributions we derive will not reflect the inherent uncertainty in the values
of the parameters (a, b).

This motivates a more subtle approach that is Bayesian throughout.

• Hierarchical Bayesian model. We may instead assume a joint probability model for (θ, a, b).
In other words θ, a and b are all treated as random variables.
As before (except now treating these explicitly as conditional distributions) we say θi | (a, b) ∼
Beta(a, b) independently for each i, and we now also model the marginal distribution of (a, b)
as (a, b) ∼ p(a, b). This is effectively a prior distribution for (a, b); we call it the hyperprior.
In summary, our hierarchical model has three layers:
– Level 1: Yi | θi ∼ Bin(ni, θi) independently for each i;
– Level 2: θi | (a, b) ∼ Beta(a, b) independently for each i;
– Level 3: (a, b) ∼ p(a, b) for some hyperprior distribution p(a, b).
Note that the θi are now not independent, but they are conditionally independent given a, b.

9.2 Definition
The empirical Bayes approach will be discussed in more detail later in the course; the hierarchical Bayes
approach can be defined in generality as follows:

Definition 9.1. The building blocks of a hierarchical Bayesian model for the observations
Y1 , . . . , Yn with parameters θ1 , . . . , θn and hyperparameter ϕ are

• P = {Pθ, θ ∈ Θ}, a family of probability distributions on A. We write p(y|θ) for the pmf/pdf
of Pθ.
• {πϕ, ϕ ∈ Φ}, a family of probability distributions on Θ (the parametrized priors). We write
p(θ|ϕ) for the pdf/pmf of πϕ.
• P, a distribution on Φ (the hyperprior distribution). We write p(ϕ) for its pdf/pmf.

Then the corresponding hierarchical model is the following joint distribution of the Yj, θj and ϕ:

I: yj |θj , ϕ ∼ p(yj |θj ) independently for each j, (note this does not depend on ϕ)
II: θj |ϕ ∼ p(θj |ϕ)
III: ϕ ∼ p(ϕ)

The joint prior distribution is p(θ, ϕ) = p(θ|ϕ)p(ϕ) and the joint posterior distribution is
p(θ, ϕ|y) ∝ p(y|θ, ϕ)p(θ, ϕ) = p(y|θ)p(θ|ϕ)p(ϕ).

Example (continued). In the case of the hospital data, recall that our model is Yi | θi ∼ Bin(ni, θi)
independently for each i, with i.i.d. priors θi ∼ Beta(a, b) and a hyperprior p(a, b). Since condition-
ally on (a, b) the (yi, θi) are independent, the joint posterior distribution is

(9.1)  p(θ, a, b | y) ∝ p(y | θ)p(θ | a, b)p(a, b)
(9.2)                 = ( Π_{i=1}^I p(yi | θi) ) ( Π_{i=1}^I p(θi | a, b) ) p(a, b)
(9.3)                 ∝ ( Π_{i=1}^I θi^{yi}(1 − θi)^{ni−yi} ) ( Π_{i=1}^I (Γ(a+b)/(Γ(a)Γ(b))) θi^{a−1}(1 − θi)^{b−1} ) p(a, b).

Thus we have

    p(θ | a, b, y) ∝ Π_{i=1}^I θi^{a+yi−1}(1 − θi)^{b+ni−yi−1} ∝ Π_{i=1}^I p(θi | a, b, yi)

(what the ∝ symbol means is that we only keep the terms that involve θ). This shows that, given
a, b, y, the θi have independent Beta posteriors.

On the other hand, the posterior for (a, b) is p(a, b | y) ∝ p(a, b)p(y | a, b). Observe first that
p(y | a, b) = Π_i p(yi | a, b) by conditional independence given a, b. Let us write
G(a, b) = Γ(a+b)/(Γ(a)Γ(b)) for the normalisation constant of the Beta(a, b) distribution. Then

(9.4)  p(yi | a, b) = ∫ p(yi | θi)p(θi | a, b) dθi
(9.5)               = ∫ C(ni, yi) θ^{yi}(1 − θ)^{ni−yi} G(a, b) θ^{a−1}(1 − θ)^{b−1} dθ
(9.6)               ∝ G(a, b) ∫ θ^{yi+a−1}(1 − θ)^{ni−yi+b−1} dθ
(9.7)               = G(a, b) / G(yi + a, ni − yi + b)
(9.8)               = (Γ(a+b)/(Γ(a)Γ(b))) · Γ(b + ni − yi)Γ(a + yi) / Γ(a + b + ni).

Thus

    p(a, b | y) ∝ p(a, b) Π_{i=1}^I (Γ(a+b)/(Γ(a)Γ(b))) · Γ(b + ni − yi)Γ(a + yi) / Γ(a + b + ni).
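To make this concrete, the Python sketch below evaluates log p(a, b | y) on a grid, taking (as an assumption for illustration) a flat hyperprior p(a, b) ∝ 1 over the grid. The counts used reproduce all the summary figures quoted in Section 9.1 (pooled MLE 208/2814 ≈ 0.0739, θ̂A = 0, θ̂H = 31/215 ≈ 0.1442), so they should correspond to the Figure 9.1 data:

```python
import numpy as np
from scipy.special import gammaln

# Surgeries and deaths per hospital, consistent with the MLEs quoted in Section 9.1.
n = np.array([47, 148, 119, 810, 211, 196, 148, 215, 207, 97, 256, 360])
y = np.array([0, 18, 8, 46, 8, 13, 9, 31, 14, 8, 29, 24])

def log_post(a, b):
    """log p(a, b | y) up to a constant, assuming a flat hyperprior p(a, b)."""
    logG = lambda p, q: gammaln(p + q) - gammaln(p) - gammaln(q)
    return np.sum(logG(a, b) - logG(a + y, b + n - y))

# Locate the posterior mode of (a, b) on a grid.
A, B = np.meshgrid(np.linspace(0.5, 20, 200), np.linspace(5, 250, 200))
L = np.vectorize(log_post)(A, B)
i = np.unravel_index(np.argmax(L), L.shape)
print(A[i], B[i])
```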

Remark. How can we draw from the joint posterior p(θ, ϕ|y) in general?

1. Draw ϕ ∼ p(ϕ|y).

2. Draw θ ∼ p(θ|ϕ, y).


3. If needed draw predictive values ỹ from p(y|θ).

9.3 Exchangeability
In the model we’ve seen, the parameters θi were conditionally independent given the hyperparameter
vector ϕ.

This is a special case of a property that is in general desirable:

Definition 9.2. The distribution of a random vector θ = (θ1, . . . , θI) is symmetric, or exchange-
able, if

    (θ1, . . . , θI) =ᵈ (θσ(1), . . . , θσ(I))

for any permutation σ.

Intuitively, this says that ‘no one parameter is a priori to be treated differently from any of the other
parameters’.

Let us see that conditional independence indeed satisfies this property:

Proposition 9.3. If θ = (θ1, . . . , θI) has (prior) distribution

    p(θ) = ∫ ( Π_{i=1}^I π(θi | ψ) ) g(ψ) dψ

for some ψ with distribution g(ψ), i.e. the θi are conditionally independent given ψ, then the
distribution of θ is exchangeable (symmetric).

Proof. Exercise.

In fact, a converse also holds:


Theorem 9.4 (De Finetti). All exchangeable sequences are of the above form in the large sample
limit.

Proof. Omitted.

9.4 Gaussian data example


This section is devoted to a single example of a conjugate normal hierarchy. Similar examples can be found
on p. 113 of Bayesian Data Analysis by Gelman, Carlin et al., and on p. 171 of Garthwaite, Jolliffe and
Jones.

Let us start by describing our model:

• For each j = 1, . . . , J, the Xi,j , i = 1, . . . , nj are i.i.d. N (θj , σ 2 ). The Xi,j are also independent
across different j’s.
• The θj are i.i.d. N(µ, τ²) (conditionally on ϕ = (µ, τ²)).

• The hyperparameter ϕ has the improper prior distribution p(ϕ) ∝ constant.

We are going to use the notation x = {xi,j} for the whole data set, xj = {xi,j, i = 1, . . . , nj} for the
observations in group j, and

    x̄·,j = (1/nj) Σ_{i=1}^{nj} xi,j

for the mean of xj.

We start by observing that

(9.9)   p(xj | θj) ∝ exp( −(1/(2σ²)) Σ_{i=1}^{nj} (xi,j − θj)² )
(9.10)             = exp( −(1/(2σ²)) nj(θj² − 2θj x̄·,j) ) exp( −(1/(2σ²)) Σ_{i=1}^{nj} xi,j² )
(9.11)             = exp( −(nj/(2σ²))(θj − x̄·,j)² ) exp( (nj/(2σ²)) x̄·,j² ) exp( −(1/(2σ²)) Σ_{i=1}^{nj} xi,j² )
(9.12)             ∝ exp( −(nj/(2σ²))(θj − x̄·,j)² ) ∝ N(x̄·,j | θj, σj²),

where σj² = σ²/nj is the variance of x̄·,j. (This is just a fancy way of saying that x̄·,j is sufficient for θj.)

Let us start by writing the joint posterior distribution. Using the usual posterior ∝ prior × likelihood
formula we have

(9.13)  p(θ, ϕ | x) ∝ p(ϕ)p(θ | ϕ)p(x | θ)
(9.14)              ∝ p(ϕ) Π_{j=1}^J N(θj | µ, τ²) Π_{j=1}^J p(xj | θj)
(9.15)              ∝ p(ϕ) Π_{j=1}^J N(θj | µ, τ²) Π_{j=1}^J N(x̄·,j | θj, σj²)
(9.16)              ∝ p(ϕ) ( Π_{j=1}^J exp( −(x̄·,j − θj)²/(2σj²) ) ) ( Π_{j=1}^J τ⁻¹ exp( −(θj − µ)²/(2τ²) ) ).


Let us now determine the conditional posterior distribution of θ given ϕ. In a hierarchical model,
once the hyperparameter is given, the parameters θj are independent. Thus

    p(θ | ϕ, x) = Π_{j=1}^J p(θj | ϕ, xj).

Conditionally on ϕ, we simply have J independent unknown normal means, each with a normal prior
distribution. For each j we thus have a simple Gaussian conjugate model: the observations xj are i.i.d.
N(θj, σ²), σ² known, and θj ∼ N(µ, τ²). It can easily be checked that the posterior is still Gaussian,

    θj | µ, τ², xj ∼ N(θ̂j, Vj),

where

    θ̂j = ( x̄·,j/σj² + µ/τ² ) / ( 1/σj² + 1/τ² )   and   Vj = 1 / ( 1/σj² + 1/τ² ).

Observe that the posterior mean is a weighted average of the prior mean µ and of the sample mean x̄·,j
of group j.

We can now move on to the marginal posterior distribution of the hyperparameter ϕ. Once we
have obtained the joint posterior p(θ, ϕ | x) we can obtain the marginal posterior of the hyperparameter
p(ϕ | x) by integrating out the parameter θ. But for the hierarchical normal model we can simply
consider directly

    p(ϕ | x) ∝ p(ϕ)p(x | ϕ).

Usually, this decomposition is no help because the marginal likelihood term p(x | ϕ) cannot generally
be written in closed form. But in the present case there is a particularly simple form to this marginal
likelihood. The key observation is that, conditionally on ϕ = (µ, τ²), the x̄·,j are independent and

(9.17)    x̄·,j | ϕ ∼ N(µ, σj² + τ²)

(take the time to think about it).

Thus we can write

    p(ϕ | x) ∝ p(ϕ) Π_{j=1}^J N(x̄·,j | µ, σj² + τ²).

Now we are going to make some assumptions about the hyperparameter distribution p(ϕ). We are going
to assume a non-informative flat prior for µ given τ²:

    p(µ, τ²) = p(µ | τ²)p(τ²) ∝ p(τ²).

We can decompose the posterior into

    p(ϕ | x) = p(µ | x, τ²)p(τ² | x)

(this is just the usual Bayes formula for pmf/pdfs for the variables µ and τ² under their conditional
distribution given x).

Thus we see that

    p(µ | τ², x) = p(ϕ | x)/p(τ² | x) ∝ p(ϕ | x)

(the ∝ here is as a function of µ).

Plugging our particular form of hyperprior into the expression for p(ϕ | x) above, we see that taking
a log gives us a quadratic expression in µ:

    −log p(µ, τ² | x) = Σ_{j=1}^J (µ − x̄·,j)²/(2(σj² + τ²)) + constant not depending on µ.


Thus µ | τ², x ∼ N(µ̂, Vµ), where we only need to find the mean µ̂ and variance Vµ. It can be checked by
“completing the squares” that

    Vµ⁻¹ = Σ_{j=1}^J 1/(σj² + τ²)   and   µ̂ = Vµ Σ_{j=1}^J x̄·,j/(σj² + τ²).

We thus have a proper posterior for µ given τ 2 . Finally we want the posterior distribution of τ 2 .

(9.18)  p(τ² | x) = p(ϕ | x) / p(µ | τ², x)
(9.19)            ∝ p(τ²) Π_{j=1}^J N(x̄·,j | µ, σj² + τ²) / N(µ | µ̂, Vµ).

Observe that the left-hand side does not depend on µ, so the right-hand side cannot depend on µ
either. We can thus choose to evaluate it at µ = µ̂ for simplicity and get

(9.20)  p(τ² | x) ∝ p(τ²) Π_{j=1}^J N(x̄·,j | µ̂, σj² + τ²) / N(µ̂ | µ̂, Vµ)
(9.21)            ∝ p(τ²) Vµ^{1/2} Π_{j=1}^J (σj² + τ²)^{−1/2} exp( −(x̄·,j − µ̂)²/(2(σj² + τ²)) ).

Both µ̂ and Vµ also depend on τ², so this is a complicated function of τ².
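As a numerical illustration, one can evaluate (9.21) on a grid; the Python sketch below uses made-up group means and standard errors (patterned on the classic eight-schools example in Gelman et al.) and a flat p(τ²):

```python
import numpy as np

# Hypothetical group means and their (known) variances sigma_j^2.
xbar = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sig2 = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]) ** 2

def log_ptau2(tau2):
    """log p(tau^2 | x) up to a constant, following (9.21) with flat p(tau^2)."""
    w = 1.0 / (sig2 + tau2)            # precisions 1/(sigma_j^2 + tau^2)
    v_mu = 1.0 / w.sum()               # V_mu
    mu_hat = v_mu * (w * xbar).sum()   # mu_hat
    return 0.5 * np.log(v_mu) + np.sum(0.5 * np.log(w) - 0.5 * w * (xbar - mu_hat) ** 2)

tau = np.linspace(0.1, 30, 300)
lp = np.array([log_ptau2(t ** 2) for t in tau])
print(tau[np.argmax(lp)])              # grid location of the maximiser (in tau)
```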

Chapter 10

Decision Theory

- Garthwaite, Jolliffe, and Jones, Statistical Inference, Chapter 6, p. 114
- Lehmann and Casella, Theory of Point Estimation, Chapter 4, p. 225 and Chapter 5, p. 309
- Young and Smith, Essentials of Statistical Inference, Chapter 2, p. 4

Throughout this course we have been exploring ways of estimating parameters, predicting new values,
or inferring probability distributions. In the past we have come across hypothesis testing (which we’ll
explore again at the end of this course). All of these are examples of making decisions based on data. In
this section we develop this into a formal theory.

10.1 Basic framework and risk function


As usual, we will assume a data model X | θ ∼ f (x; θ) from some parametric family {f (x, θ) : θ ∈ Θ},
where Θ is our parameter space.

We need the following objects.

• An action (or decision) space A. Typical examples include A = {0, 1} for selecting a hypoth-
esis, or A = g(Θ) for estimating a function g(θ) of a parameter.

• A loss function L : Θ × A → R⁺. Given an action a ∈ A, if the true parameter is θ ∈ Θ we incur
loss L(θ, a).
• A set of decision rules D ⊆ {∆ : X → A}. A decision rule ∆ specifies which action we take
given observation x ∈ X .

With these in mind, we define our first measure of ‘how bad’ a decision rule is:

Definition 10.1. For a given rule ∆ ∈ D and parameter θ ∈ Θ, the (frequentist) risk is

    R(θ, ∆) = Eθ[L(θ, ∆(X))] = ∫_X L(θ, ∆(x)) f(x, θ) dx.

This is the expected loss assuming the true parameter is θ.

For a given rule ∆ we think of the frequentist risk as a profile of risk across the different values of θ.

We also want to consider randomised decision rules, that is a rule that selects among a collection of
decision rules according to some probability distribution.


Definition 10.2 (Randomised decision rule). Suppose that the action space A is equipped
with a σ-algebra A such that (A, A) is a measurable space and let P(A) be the space of probability
measures on A.

A randomised decision rule d is a mapping x ↦ dx ∈ P(A) such that for each A ∈ A,
the mapping x ↦ dx(A) is measurable.

The frequentist risk of the randomised decision rule d is given by

    R(θ, d) := ∫_X ∫_A L(θ, a) dx(da) f(x; θ) dx.

Example. A (deterministic) decision rule ∆ ∈ D can be thought of as a randomised decision rule


by defining dx = δ∆(x) , i.e. the probability measure with a single atom located at ∆(x).

Given a collection {∆1, . . . , ∆k} and a probability vector (p1, . . . , pk) we can define the randomised
rule dx = Σ_{i=1}^k pi δ_{∆i(x)}, which for each x selects the action ∆i(x) with probability pi.

Examples.

• Estimation: ∆(x) is an estimator of θ ∈ R^k and L(θ, a) = ∥a − θ∥², so that R(θ, ∆) =
Eθ[∥∆(X) − θ∥²].

• Testing: we test θ ∈ H0 against θ ∈ H1. In this case A = {0, 1} and

    L(θ, a) = 1 if θ ∈ H0, a = 1;   L(θ, a) = 1 if θ ∈ H1, a = 0;   L(θ, a) = 0 otherwise.

The risk is then just the probability of the wrong decision:

    R(θ, ∆) = Pθ(∆(X) = 1) if θ ∈ H0,   and   R(θ, ∆) = Pθ(∆(X) = 0) if θ ∈ H1.

These are the Type I/II error probabilities respectively.

10.2 Admissibility
Let’s see how we might compare decision rules.

Definition 10.3. We say that ∆2 strictly dominates ∆1 if

R(θ, ∆1 ) ⩾ R(θ, ∆2 ) ∀θ ∈ Θ

and R(θ, ∆1 ) > R(θ, ∆2 ) for at least some θ.

A procedure ∆1 is inadmissible if there exists ∆2 such that ∆2 strictly dominates ∆1 .

We define admissible to simply mean not inadmissible.

Example. Suppose X ∼ U[0, θ]. Let D = {estimators of the form θ̂(x) = ax} (so this is a family
indexed by a).


Using the quadratic loss, the risk will in general be

    R(θ, θ̂) = ∫₀^θ (ax − θ)² (1/θ) dx = ( a²/3 − a + 1 ) θ²,

which is minimised at a = 3/2. Thus θ̂(x) = ax is inadmissible for all a ≠ 3/2.

So a = 3/2 is a necessary condition for θ̂ to be admissible for quadratic loss. Note that we have
shown that θ̂(x) = (3/2)x is admissible in D, but not among the set of all possible estimators of the
form θ̂(x) = f(x) for some function f.
b

Remark. Note that being admissible is a fairly weak requirement. We will later see that some natural
estimators are in fact inadmissible (see chapter 11).

10.3 Minimax rules and Bayes rules


We now explore notions of ‘best possible’ decision rules.

Definition 10.4. A rule ∆ is a minimax rule if

    sup_θ R(θ, ∆) ⩽ sup_θ R(θ, ∆′)   ∀∆′ ∈ D.

It minimises the maximum risk:

    ∆∗ = argmin_{∆∈D} sup_{θ∈Θ} R(θ, ∆).

Intuitively, a minimax rule does best in the worst case scenario. This can often still mean poor perfor-
mance on average.

Given a prior belief π about the parameter θ, it is also natural to consider the average risk of a rule.

Definition 10.5. The Bayes integrated risk (or simply Bayes risk) for a decision rule ∆ and
a prior π(θ) is

    r(π, ∆) := ∫_Θ R(θ, ∆)π(θ) dθ.

A decision rule ∆ is said to be a Bayes rule w.r.t. π if it minimises the Bayes risk:

    r(π, ∆) = inf_{∆′∈D} r(π, ∆′) =: rπ.

In case Θ is discrete the integral should be replaced by a sum.


Remark. Do not confuse Bayes rules with randomised decision rules.

We will see that Bayes rules (or estimators) provide a tool to solve minimax problems. To make this
idea precise we need the notion of least favorable prior . Recall that rπ is the Bayes risk of the Bayes
estimator ∆Bayes associated to π (when one exists).

Definition 10.6. A prior distribution π is least favorable if rπ ⩾ rπ′ for all prior distributions π ′ .

The following Theorem provides a simple condition for a Bayes estimator ∆Bayes to be minimax.


Theorem 10.7. Suppose that π is a prior distribution on Θ and that ∆_Bayes is the Bayes estimator for π, with

r(π, ∆_Bayes) = r_π.

If the rule ∆0 satisfies

sup_θ R(θ, ∆0) ⩽ r_π,

then ∆0 is minimax; furthermore, if ∆_Bayes is the unique Bayes estimator for π then ∆0 is the unique minimax procedure.

Proof. Let ∆ be any other rule. Then

sup_θ R(θ, ∆) ⩾ ∫ R(θ, ∆)π(θ) dθ ⩾ ∫ R(θ, ∆_Bayes)π(θ) dθ = r_π ⩾ sup_θ R(θ, ∆0).

If ∆_Bayes is the unique Bayes estimator and ∆ ≠ ∆_Bayes, the second inequality is strict; this gives the uniqueness claim.
Remark. It is interesting to note that in the Theorem above one must have R(θ, ∆0) = r_π for π-almost all θ. Indeed, otherwise we would have

∫ R(θ, ∆0)π(θ) dθ < r_π,

which contradicts the definition of r_π.

Theorem 10.8. Let ∆Bayes be the Bayes estimator for some prior π. If

R(θ, ∆Bayes ) ⩽ rπ for all θ

then ∆Bayes is minimax and π is a least favorable prior.

Proof. The first part is simply an application of Theorem 10.7.

Let π′ be some other prior distribution. Then, writing ∆′_Bayes for the Bayes estimator with respect to π′, we have

r_{π′} = ∫ R(θ, ∆′_Bayes)π′(θ) dθ ⩽ ∫ R(θ, ∆_Bayes)π′(θ) dθ ⩽ sup_θ R(θ, ∆_Bayes) ⩽ r_π.

The following Corollary is often very useful.

Corollary 10.9. If a Bayes rule ∆_Bayes has constant risk, then it is minimax.

In fact the risk only needs to be constant almost everywhere.

Corollary 10.10. Let ω_π ⊂ Θ be the set of θ at which the risk function of ∆_Bayes achieves its maximum, i.e.

ω_π = {θ : R(θ, ∆_Bayes) = sup_{θ′} R(θ′, ∆_Bayes)}.

If π(ω_π) = 1, then ∆_Bayes is minimax.

Remark. Being Bayes with almost surely constant risk is sufficient for minimaxity but not necessary. The result is stated wrongly in Lehmann and Casella as an if and only if statement.


Example. Suppose that X ∼ Bin(n, p) and we wish to estimate p with the squared error loss function L(p, p̂) = (p − p̂)². We choose p̂ = X/n. The risk function is

R(p, p̂) = E_p[(p − p̂)²] = p(1 − p)/n.

It has a unique maximiser at p = 1/2. So to apply the Corollary above we would need π(1/2) = 1, for which the corresponding Bayes estimator is ∆ = 1/2, not X/n. In fact it can be checked that X/n is not minimax.

To determine a minimax estimator by the method suggested by Theorem 10.7, let us try a Beta(a, b) prior distribution. In that case the Bayes estimator is the posterior mean (this is proved in Proposition 10.15, but you can try to prove this for yourself here!), i.e. ∆(x) = (a + x)/(a + b + n), and the risk function is

R(p, ∆) = (1/(a + b + n)²) { np(1 − p) + [a(1 − p) − bp]² }.

Can we find values of a, b such that this risk function is constant? Setting the coefficients of p² and p to 0 shows that R(p, ∆) is constant in p iff

(a + b)² = n and 2a(a + b) = n,

and since a, b are positive we find a = b = √n/2. It follows that the estimator

∆ = (X + √n/2)/(n + √n) = (X/n) · (√n/(1 + √n)) + (1/2) · (1/(1 + √n))

is constant-risk Bayes and hence minimax. Because of the uniqueness of the Bayes estimator we see that this is the unique minimax estimator.
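A short numerical sketch (Python; an illustration, not part of the notes) comparing the two risk functions over a grid of p: the Bayes rule with a = b = √n/2 has constant risk 1/(4(√n + 1)²), which lies below the maximum risk 1/(4n) of X/n.

```python
# Risk of the constant-risk Bayes estimator vs. the MLE X/n, for n = 25.
import numpy as np

n = 25
a = b = np.sqrt(n) / 2
p = np.linspace(0.001, 0.999, 999)

risk_bayes = (n * p * (1 - p) + (a * (1 - p) - b * p) ** 2) / (a + b + n) ** 2
risk_mle = p * (1 - p) / n

print(risk_bayes.min(), risk_bayes.max())  # constant: 1/(4(sqrt(n)+1)^2) ~ 0.00694
print(risk_mle.max())                      # 1/(4n) = 0.01, larger than the above
```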

10.4 Bayes rule and posterior risk


Definition 10.11. The expected posterior loss of a rule ∆ w.r.t. a prior π is

Λ(x, ∆) = E[L(θ, ∆(x)) | X = x] = ∫_Θ L(θ, ∆(x))π(θ | x) dθ.

The following result says that a Bayes rule minimises the expected posterior loss.

Theorem 10.12. Suppose that X | θ ∼ P_θ and that θ ∼ π. Suppose in addition that the following hypotheses hold for the problem of estimating g(θ) with non-negative loss function L(θ, d):

(a) There exists an estimator (a rule) ∆0 with finite risk.

(b) For almost all x, there exists a value c(x) which minimizes y ↦ Λ(x, y).

Then ∆(x) = c(x) is a Bayes estimator.

Proof. Write h(x) for the marginal density of X, so that π(θ | x) = f(x; θ)π(θ)/h(x). Then for any rule ∆′,

r(π, ∆′) = ∫ R(θ, ∆′)π(θ) dθ = ∫∫ L(θ, ∆′(x))f(x; θ)π(θ) dx dθ
= ∫∫ L(θ, ∆′(x))π(θ | x)h(x) dx dθ
= ∫ h(x)Λ(x, ∆′(x)) dx
⩾ ∫ h(x)Λ(x, c(x)) dx = r(π, ∆),

since c(x) minimises y ↦ Λ(x, y) for almost all x. Hence ∆(x) = c(x) is a Bayes estimator.

Proposition 10.13 (Bayes rules and admissibility). Let ∆π be a Bayes rule w.r.t. π with
finite Bayes risk. Then

1. If ∆π is unique then it is admissible.


2. If θ 7→ R(θ, ∆) is continuous for all ∆ and π has a positive density w.r.t. the Lebesgue
measure, then ∆π is admissible.

Proof.

1. If ∆π is not admissible then there is some ∆ such that R(θ, ∆) ⩽ R(θ, ∆π) for all θ ∈ Θ and R(θ, ∆) < R(θ, ∆π) for some θ. This implies r(π, ∆) ⩽ r(π, ∆π), so ∆ must also be Bayes, and by uniqueness ∆ = ∆π, contradicting the strict inequality. So ∆π is admissible.

2. As above, if ∆π is not admissible then there is some ∆ such that R(θ, ∆) ⩽ R(θ, ∆π) for all θ ∈ Θ and A_∆ ≠ ∅, where A_∆ := {θ : R(θ, ∆) < R(θ, ∆π)}. Since θ ↦ R(θ, ∆) − R(θ, ∆π) is continuous, A_∆ must contain an open set, so π(A_∆) > 0 and hence r(π, ∆) < r(π, ∆π), contradicting the fact that ∆π is Bayes.

10.5 Point estimation


In the setting of point estimation (coming up with a best guess for a parameter, as we’ve been doing a
lot in this course) there are three common loss functions:

Definition 10.14. The zero-one loss is of the form

L(θ, θ̂) = a if |θ − θ̂| > b, and 0 otherwise,

where a, b are positive constants.

The absolute error loss is of the form L(θ, θ̂) = k|θ̂ − θ| where k is a positive constant.

The quadratic loss is of the form L(θ, θ̂) = k(θ̂ − θ)² where k is a positive constant.

Let us see what the Bayes estimate (Bayes rule) is for each of these losses, by minimising the expected
posterior loss.

Proposition 10.15. The Bayes estimate under the:

1. zero-one loss with interval radius b tends to the posterior mode as b → 0 (assuming, say, a continuous posterior density);

2. absolute error loss is the posterior median;


3. quadratic loss is the posterior mean.

Proof.

1. The expected posterior loss is

Λ(x) = ∫ π(θ | x)L(θ, θ̂) dθ = a ∫_{θ̂+b}^{∞} π(θ | x) dθ + a ∫_{−∞}^{θ̂−b} π(θ | x) dθ ∝ 1 − ∫_{θ̂−b}^{θ̂+b} π(θ | x) dθ.

To find the Bayes rule one has to minimise the above, or equivalently maximise

∫_{θ̂−b}^{θ̂+b} π(θ | x) dθ.

Differentiating w.r.t. θ̂ and setting the derivative equal to 0, we need to find θ̂ such that

(10.12) π(θ̂ + b | x) − π(θ̂ − b | x) = 0.

If the map θ ↦ π(θ | x) is continuous, the above is guaranteed to have a solution for b small enough. To see why, let θ̃ be a mode of π(θ | x); then, for b small enough,

π(θ̃ | x) − π(θ̃ − 2b | x) ≥ 0 at θ̂ = θ̃ − b, and π(θ̃ + 2b | x) − π(θ̃ | x) ≤ 0 at θ̂ = θ̃ + b,

since θ̃ is a local maximum. Therefore by the intermediate value theorem there must be a solution θ̂ of (10.12) in the interval [θ̃ − b, θ̃ + b]. To verify this is indeed a maximum, one may also do a second derivative test if θ ↦ π(θ | x) is differentiable, obtaining

π′(θ̂ + b | x) − π′(θ̂ − b | x) ≤ 0,

again by the fact that θ̂ lies near the local maximum, where the derivative changes sign. So the Bayes rule is to choose θ̂(x) so that (10.12) is satisfied, which can be achieved by some θ̂ ∈ [θ̃ − b, θ̃ + b]. So as b → 0, θ̂ tends towards the posterior mode.

2. The expected posterior loss is

Λ(x) = ∫ |θ̂ − θ|π(θ | x) dθ = ∫_{−∞}^{θ̂} (θ̂ − θ)π(θ | x) dθ + ∫_{θ̂}^{∞} (θ − θ̂)π(θ | x) dθ,

so that

∂Λ(x)/∂θ̂ = ∫_{−∞}^{θ̂} π(θ | x) dθ − ∫_{θ̂}^{∞} π(θ | x) dθ = 2 ∫_{−∞}^{θ̂} π(θ | x) dθ − 1.

Setting this to zero, Λ is minimised (indeed the second derivative, 2π(θ̂ | x), is non-negative) when

∫_{−∞}^{θ̂} π(θ | x) dθ = ∫_{θ̂}^{∞} π(θ | x) dθ,

i.e. θ̂ is the median of π(θ | x).


3. The expected posterior loss is, writing µ_x for the posterior mean,

Λ(x) = E[(θ̂ − θ)² | X = x]
= E[(θ̂ − µ_x + µ_x − θ)² | X = x]
= (θ̂ − µ_x)² + 2(θ̂ − µ_x) E[µ_x − θ | X = x] + E[(θ − µ_x)² | X = x]
= (θ̂ − µ_x)² + Var(θ | X = x),

since E[µ_x − θ | X = x] = 0. So Λ is minimised when θ̂ = µ_x, the posterior mean.

Example. (Example 4.1.5 in Lehmann and Casella, p. 230) Suppose X ∼ Bin(n, p) with a Beta(a, b) prior for p. As we have seen this is a conjugate prior and the posterior density of p is proportional to p^{x+a−1}(1 − p)^{n−x+b−1}. Therefore, under the quadratic loss function, the Bayes estimator p̂_Bayes of p is

p̂_Bayes = E[p | x] = (a + x)/(a + b + n).

It is interesting to compare this with the MLE (or MVUE), which is just X/n. Before taking any observation, the estimator from the Bayesian approach would be the mean of the prior, a/(a + b). Once X has been observed, the standard non-Bayesian estimator is X/n. The Bayes estimator p̂_Bayes lies between the two; in fact

p̂_Bayes = ((a + b)/(a + b + n)) · (a/(a + b)) + (n/(a + b + n)) · (X/n),

a weighted average of the two.

It is instructive to examine the cases (see the sketch below):

1. n → ∞ with a, b fixed;
2. n fixed and a, b → ∞ with a/b fixed.
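A minimal sketch of the two limits (Python; the numerical values are illustrative choices, not from the notes):

```python
# The Bayes estimator (a + x)/(a + b + n) written as a weighted average of the
# prior mean a/(a+b) and the MLE x/n.
def p_bayes(x, n, a, b):
    w = n / (a + b + n)                       # weight given to the data
    return (1 - w) * a / (a + b) + w * x / n

# Case 1: n large with a, b fixed -- the estimator approaches the MLE x/n.
print(p_bayes(x=600, n=1000, a=2, b=2))       # ~0.5996, close to 0.6

# Case 2: a, b large with a/b fixed, n fixed -- it approaches the prior mean.
print(p_bayes(x=6, n=10, a=2000, b=2000))     # ~0.5002, close to 1/2
```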

10.6 Finite decision problems

Definition 10.16. A decision problem is said to be finite when Θ is finite. We write Θ = {θ_1, . . . , θ_k}.

In the case of a finite decision problem, the notions of admissibility, minimax and Bayes rules can be
given geometric interpretations.

We now assume that the set of decision rules is also finite: it contains l decision rules ∆_1, . . . , ∆_l together with the randomised decision rules that can be formed by their convex combinations, that is, the decision set is the convex hull

D := { Σ_{i=1}^l p_i ∆_i : p_i ≥ 0, Σ_i p_i = 1 }.


Definition 10.17. The risk set S ⊆ Rk is the set of points {(R(θ1 , ∆), . . . , R(θk , ∆)) : ∆ ∈ D}.

It may also be the case that D is the convex hull of an infinite collection (or continuum) of non-randomized
decision rules, see e.g. Figure 10.2.

Lemma 10.18. S is a convex set.

Proof. Let ∆1, ∆2 ∈ D be two rules and take α ∈ (0, 1). Define a randomized rule as follows:

∆′(x) = ∆1(x) with probability α, ∆2(x) with probability 1 − α.

Then R(θ, ∆′) = αR(θ, ∆1) + (1 − α)R(θ, ∆2), so any convex combination of points of S is again the risk point of a valid decision rule.

10.6.1 The case k = 2


The two-dimensional case (i.e. there are two possible parameters) is particularly interesting.

In Figure 10.1 we can see an example of a risk set when Θ = {θ1 , θ2 }. The extreme points of the risk set
are the deterministic rules.

The thick line at the bottom defines the set of admissible rules. To see why, recall that a rule ∆ is
admissible if there is no other rule ∆′ such that R(θ, ∆′ ) ≤ R(θ, ∆) for all θ with strict inequality for at
least one θ. In our scenario this means that a rule ∆ is admissible if no other rule ∆′ achieves a risk in
the interior of the box {x ⩽ R(θ1, ∆), y ⩽ R(θ2, ∆)}.
Note also that the minimax rule lies on the line R1 = R2 ; since the risk set intersects the line R1 = R2
this must be the case. Suppose that an admissible rule has constant risk, that is (R1 , R2 ) with R1 = R2 .
Let (R1′ , R2′ ) be the risk of any other admissible rule; it must satisfy R1′ ≥ R1 and R2′ ≤ R2 or R1′ ≤ R1
and R2′ ≥ R2 . In either case we have that max{R1 , R2 } ≤ max{R1′ , R2′ }. The situation is similar in
Figure 10.2.

Figure 10.1: Risk set the convex hull of five non-randomized rules. Minimax is randomized.

Figure 10.2: Risk set the convex hull of a continuum of non-randomized rules. Minimax non-randomized.

However in Figure 10.2 notice that the set of admissible rules consists of deterministic rules (in this case
D cannot be expressed as the convex hull of a finite number of deterministic rules).

For Bayes rules, suppose that (π1, π2) is the prior. Then the lines π1R1 + π2R2 = c represent decision rules with the same Bayes risk c. We can see this in Figures 10.3 and 10.4. The slope of the lines is determined by the prior. We increase c, looking for the line that just touches the risk set: this gives the Bayes rule.

In Figure 10.3 the Bayes rule is not unique; the minimax is a randomized Bayes rule, and there are also non-randomized Bayes rules. In Figure 10.4 the Bayes rule is unique and non-randomized. The minimax is actually a randomized rule distinct from the Bayes rule.

Figure 10.3: The minimax is a randomized Bayes rule; the Bayes rule is not unique. In fact there are non-randomized Bayes rules.

Figure 10.4: The minimax is no longer a Bayes rule. The minimax is still randomized although the Bayes rule is non-randomized.

Chapter 11

The James-Stein Estimator

This chapter explores some rather counter-intuitive situations that can occur when one simultaneously
estimates several parameters.

Assume that Xi ∼ N (µi , 1) are mutually independent unit-variance Gaussian random variables, and
write X = (X1 , . . . , Xp ) and µ = (µ1 , . . . , µp ). The goal is to estimate µ from a single observation X.

We know the maximum likelihood estimate is µ̂_MLE = X, and we have seen that this is the MVUE.

Is this estimate admissible (for, say, quadratic loss)? For p ⩾ 3, the answer is no!

Theorem 11.1 (Stein's Paradox). The James-Stein estimator

µ̂_JSE := (1 − (p − 2)/Σ_{i=1}^p X_i²) X

strictly dominates µ̂_MLE for quadratic loss.

(We will prove this shortly.)

Corollary 11.2. If p ⩾ 3, µ̂_MLE is inadmissible for quadratic loss.

Remark. This is very surprising! For instance, suppose you take measurements to estimate:

1. The average weight K of a kiwi at Tesco;

2. The average height G of a blade of grass in University Parks;


3. The average speed S of a bike going down Cornmarket Street.

These are totally unrelated quantities; but Stein’s paradox tells us that we get better estimates (on
average) for the vector (K, G, S) by simultaneously using the three measurements!¹

¹Or does it? Stein's paradox is about normal random variables. This can be justified by using the CLT.

Let’s see how to prove this.

Lemma 11.3 (Stein's Lemma). Let X = (X1, . . . , Xp) be independent Gaussian random variables with Xi ∼ N(µi, 1) for each i. Then for each i and for any bounded differentiable function h,

E[(Xi − µi)h(X)] = E[∂h(X)/∂Xi].

Proof. By the Tower Law,

E[(Xi − µi)h(X)] = E[ E[(Xi − µi)h(X) | {Xj : j ≠ i}] ].

Writing the conditional expectation as an integral against the N(µi, 1) density and using integration by parts,

E[(Xi − µi)h(X) | {Xj : j ≠ i}] = (2π)^{−1/2} ∫_{−∞}^{∞} (xi − µi)h(x)e^{−(xi−µi)²/2} dxi
= (2π)^{−1/2} [−e^{−(xi−µi)²/2}h(x)]_{xi=−∞}^{xi=∞} + (2π)^{−1/2} ∫_{−∞}^{∞} (∂h(x)/∂xi) e^{−(xi−µi)²/2} dxi
= 0 + E[∂h(X)/∂Xi | Xj : j ≠ i],

since h is bounded. Applying the tower property of conditional expectations again gives the result.

 
Proof of Stein's Paradox. Consider the family of estimators µ̂ = (1 − a/Σ_i X_i²)X indexed by the parameter a. These are called the James-Stein estimators.

Recalling that µ̂_MLE = X, we get

R(µ, µ̂_MLE) = Σ_{i=1}^p E[(µ_i − X_i)²] = p

(since Var(X_i) = 1).


 
On the other hand, writing µ̂_i := (1 − a/Σ_j X_j²)X_i,

R(µ, µ̂_JSE) = Σ_{i=1}^p E[(µ_i − µ̂_i)²]
= Σ_{i=1}^p ( E[(µ_i − X_i)²] − 2a E[(X_i − µ_i)X_i / Σ_j X_j²] + a² E[X_i²/(Σ_j X_j²)²] ).

Now the first term is just 1, since Var(X_i) = 1, and by Stein's Lemma,

E[(X_i − µ_i)X_i / Σ_j X_j²] = E[∂/∂X_i (X_i / Σ_j X_j²)] = E[(Σ_j X_j² − 2X_i²)/(Σ_j X_j²)²] = E[1/Σ_j X_j²] − 2 E[X_i²/(Σ_j X_j²)²].


Putting this all together, we get

R(µ, µ̂_JSE) = p − (2ap − 4a) E[1/Σ_j X_j²] + a² E[1/Σ_j X_j²]
= p − (2a(p − 2) − a²) E[1/Σ_j X_j²].

This is minimised at a = p − 2, where 2a(p − 2) − a² = (p − 2)² > 0; since E[1/Σ_j X_j²] is finite and positive for p ⩾ 3, the risk is then strictly less than p. This concludes the proof.
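A Monte Carlo sketch of the paradox (Python; the mean vector below is an arbitrary illustrative choice, not from the notes): averaging the squared error over many replications, the James-Stein estimator with a = p − 2 beats the MLE.

```python
# Compare E||X - mu||^2 (= p) with E||mu_JSE - mu||^2 for p = 10.
import numpy as np

rng = np.random.default_rng(1)
p, reps = 10, 20_000
mu = np.linspace(-1.0, 1.0, p)                 # any fixed mean vector works

X = rng.normal(mu, 1.0, size=(reps, p))
shrink = 1 - (p - 2) / np.sum(X**2, axis=1, keepdims=True)
jse = shrink * X

print(np.mean(np.sum((X - mu) ** 2, axis=1)))    # ~ p = 10
print(np.mean(np.sum((jse - mu) ** 2, axis=1)))  # strictly smaller
```

Trying other mean vectors shows the improvement is largest when ∥µ∥ is small and shrinks (but never vanishes) as ∥µ∥ grows.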
Remark. The James-Stein estimator shrinks each component of X towards the origin. However, there is of course nothing special about the origin; a similar estimator

µ̂^{(µ0)}_JSE = µ0 + (1 − (p − 2)/∥X − µ0∥²)(X − µ0)

can be defined which shrinks X towards an arbitrary point µ0, and it can easily be shown that this also strictly dominates µ̂_MLE. (See the handwritten notes for the details.)

 
Exercise. Show that for some a the estimator

X̄1_p + (1 − a/∥X − X̄1_p∥²)(X − X̄1_p)

strictly dominates µ̂_JSE, where 1_p = (1, . . . , 1).

Remark. Observe that when ∥X − µ0∥² < p − 2, the shrinkage factor becomes negative. To avoid this problem, we can define

µ̂^{(µ0)}_JSE+ = µ0 + (1 − (p − 2)/∥X − µ0∥²)^+ (X − µ0)

(where x^+ denotes the positive part), which strictly dominates µ̂^{(µ0)}_JSE.

It is worth noting that neither µ̂^{(µ0)}_JSE nor µ̂^{(µ0)}_JSE+ is admissible.

Example (Baseball example). Consider the dataset in fig. 11.1, taken from Young and Smith.
It shows statistics from the 1998 baseball pre-season in the US for 17 top players. Our interest is in
predicting the home run strike rate of each player in the full season.

For each player i, Yi is the number of home runs out of ni times at bat in the pre-season. We
assume that home runs occur according to a binomial distribution, so that player i has probability
pi of hitting a home run each time at bat, independently of other at bats and other players. Thus
Yi ∼ Bin(ni , pi ).

Here pi is the true full-season strike rate (and Yi/ni is the strike rate in the pre-season); the actual values of pi, as well as the actual number ABi of at bats of each player (in the full season) and the actual number of home runs HRi, are shown in the figure.

So, how might we estimate pi given just the pre-season statistics Yi and ni for each player? Obviously
the naïve estimate is the MLE p̂i = Yi/ni. These give rise to the estimated numbers of home runs p̂i · ABi (assuming we know the actual number of at bats, which of course at the time we wouldn't have). These values are shown in the figure.

The Stein paradox tells us we may be able to do better.

First transform the data, setting Xi = f_{ni}(Yi/ni) where f_n(y) := n^{1/2} sin^{−1}(2y − 1). Then, approximately, Xi ∼ N(µi, 1) for each i, with µi = f_{ni}(pi).

We can then use the James-Stein estimator to estimate the means µi. Using the 'improved version' we just encountered, we set

JS_i := X̄ + (1 − (p − 3)/V)(Xi − X̄)

for each i, where X̄ = Σ_i Xi/p and V = Σ_i (Xi − X̄)² (here p = 17).
P P


Figure 11.1: Data for 17 players in the 1998 baseball pre-season and full season taken from Young and
Smith.

These estimates of the µi are shown in the figure, and transforming back gives us estimates of the number of home runs of each player, which are also shown.

We see that the James-Stein approach gives much better estimates on average! More precisely, the James-Stein estimator achieves a lower aggregate risk than the naïve estimator, but allows increased risk in the estimation of individual components.

Chapter 12

Empirical Bayes Methods

We return now to our discussion of Bayes estimators (Bayes rules). While Bayes estimators have desirable
properties (the posterior mean, the Bayes estimator under quadratic loss, is often admissible), they can
be hard to calculate, in particular for the hierarchical models met in chapter 9.

This motivates the empirical Bayes approach.

12.1 Basic setup


Recall that a hierarchical Bayesian model consists of three ‘layers’: the likelihood X ∼ f (x, θ) parametrised
by θ, the prior θ ∼ π(θ, ψ) parametrised by ψ, and the hyperprior ψ ∼ g(ψ).

Definition 12.1. Empirical Bayes methods adapt the hierarchical Bayesian model by replacing the hyperparameter vector ψ with a point estimate ψ̂ derived from the data.

So we now just have the likelihood X ∼ f(x, θ) and the prior θ ∼ π̂(θ) := π(θ, ψ̂).

Remark. Empirical Bayes methods can be viewed as an approximation of a full hierarchical Bayes model
that allows us to avoid doing ψ-integrals. One layer of the hierarchy has been ‘chopped off’.

Recall that we met this idea briefly in chapter 9 before hierarchical models were introduced.

The reduced model has posterior

π̂(θ | x) ∝ L(θ, x)π(θ, ψ̂),

and a Bayes estimator θ̂_EB can be calculated using π̂(θ | x). So for quadratic loss, we have θ̂_EB = ∫ θ π̂(θ | x) dθ, the posterior mean.
Remark. In this setting, the Bayes estimator is called an empirical Bayes estimator , or an EB
estimator .

12.2 Choice of point estimate


How can we choose our point estimate ψ̂ of the hyperparameter? We have all the classical frequentist techniques at our disposal. The two most obvious ways are:

• Use the MLE ψ̂ = argmax_ψ p(x | ψ), where

p(x | ψ) = ∫ L(θ, x)π(θ, ψ) dθ

is the marginal likelihood.


Figure 12.1: Data on tumor incidence in historical control groups and current group of rats, from Tarone
1982. The table displays the values yj /nj : (number of rats with tumors)/(total number of rats).

• Use the method of moments: choose ψ̂ such that π(θ, ψ̂) has the same mean and variance as the sample mean and sample variance of the MLEs of the θi.

Example (Meta-analysis of studies of tumors in rodents). The data in fig. 12.1 shows the
number of rats with tumors, Yi , and the total number of rats ni in each of a number of previous
experiments on tumor growth, as well as the results of a new experiment which we are interested in
analysing.

As usual we’ll assume each Yi ∼ Bin(ni , θi ) independently, for parameters θi which we want to
estimate. As our prior distribution we assume that θi ∼ Beta(α, β) independently for each i,
where α, β are hyperparameters. This choice of prior is natural as it is conjugate for the binomial
distribution: the posterior distribution, after observing the new experiment (14 rats, 4 with tumors)
will be π(θ | y) = Beta(α + 4, β + 10).

Using an empirical Bayes approach with the method of moments goes as follows:

1. Compute the MLEs Yi /ni for the previous experiments i = 1, . . . , 70.


2. Compute the sample mean and variance of these MLEs: m = 0.136 and v = 0.0106.
3. Pick α̂, β̂ such that Beta(α̂, β̂) has 'matched moments', i.e.

α̂/(α̂ + β̂) = m,  α̂β̂/((α̂ + β̂)²(α̂ + β̂ + 1)) = v.

This solves to α̂ ≈ 1.4, β̂ ≈ 8.6.
4. Calculate the Bayes estimate, which for the quadratic loss is the posterior mean. In this case the posterior is π̂(θ | y) = Beta(5.4, 18.6), so the posterior mean is 0.225.

This estimate is less than the maximum-likelihood estimate θ̂_MLE = 4/14 ≈ 0.286 we'd get based solely on the current experiment, not taking into account past experiments.
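A sketch of the moment-matching step in Python (the closed-form inversion of the Beta moments used below is standard; the numbers reproduce those above up to rounding):

```python
# Invert m = alpha/(alpha+beta) and v = m(1-m)/(alpha+beta+1) for (alpha, beta).
m, v = 0.136, 0.0106

s = m * (1 - m) / v - 1          # = alpha + beta
alpha_hat = m * s
beta_hat = (1 - m) * s
print(alpha_hat, beta_hat)       # ~1.37, ~8.71 (1.4 and 8.6 in the text)

# Posterior mean after the new experiment (4 tumors out of 14 rats):
print((alpha_hat + 4) / (alpha_hat + beta_hat + 14))   # ~0.22 < 4/14 ~ 0.286
```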

12.3 James-Stein and empirical Bayes


Suppose we have Xi ∼ N(θi, 1) independently for i = 1, . . . , p, as in the setup for the James-Stein estimator. Given one observation xi per parameter θi, we wish to estimate the parameters θi.


Proposition 12.2. The James-Stein estimator can be interpreted as an empirical Bayes estimator.

(Specifically, for a = p it is the EB estimator for quadratic loss when using a mean-zero Gaussian prior whose variance is estimated by maximum likelihood.)

Proof. We wish to construct an EB estimator for quadratic loss. There is some freedom in the choice of prior, but we will assume as our prior that the θi are drawn independently from a N(0, τ²) distribution.

Given τ, then, we have θi | (xi, τ²) ∼ N(xi τ²/(1 + τ²), τ²/(1 + τ²)). This can be calculated by completing the square.

To estimate τ, we can compute the marginal likelihood of Xi given τ:

Xi | τ² ∼ N(0, τ² + 1) independently for each i.

This is maximised by τ̂² = (1/p) Σ_{j=1}^p (X_j² − 1). (This is the standard result for the MLE of the variance of a Gaussian distribution.)

So the estimated posterior distribution is θi | xi ∼ N(xi τ̂²/(1 + τ̂²), τ̂²/(1 + τ̂²)). Thus the Bayes estimator for quadratic loss, i.e. the posterior mean, is

θ̂_EB,i = Xi τ̂²/(1 + τ̂²) = Xi · ((1/p) Σ_j (X_j² − 1)) / ((1/p) Σ_j X_j²) = Xi (1 − p/Σ_j X_j²).

This is the James-Stein estimator with a = p.

Remark. This is not the risk-minimising James-Stein estimator (with a = p − 2), but it does strictly dominate the MLE for all θ. The James-Stein estimator with a = p − 2 can be recovered by using moment estimators (see Young and Smith, section 3.5).
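A short numerical sketch (Python, illustrative) of the identity just proved: estimating τ² by marginal maximum likelihood and then shrinking gives exactly the James-Stein estimator with a = p.

```python
# EB shrinkage with tau^2 estimated by marginal ML coincides with JS (a = p).
import numpy as np

rng = np.random.default_rng(2)
p = 50
theta = rng.normal(0.0, 2.0, size=p)     # tau = 2, unknown to the procedure
x = rng.normal(theta, 1.0)

tau2_hat = np.mean(x**2) - 1             # MLE of tau^2 (assumed positive here)
eb = x * tau2_hat / (1 + tau2_hat)
js = x * (1 - p / np.sum(x**2))          # James-Stein with a = p

print(np.allclose(eb, js))               # True
```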

Example. Suppose that Xi ∼ Po(θi ) independently for i = 1, . . . , p.

The maximum-likelihood estimate for each θi would be simply xi . Let’s follow roughly the same
empirical Bayes approach as above to find a better estimator (similar to the James-Stein estimator).

As a prior we assume that θi are i.i.d. Exp(λ), so that π(θi | λ) = λe−λθi for each i and λ is a
hyperparameter to be estimated.

The marginal likelihood for λ is, for a single data point i,

p(xi | λ) = ∫_0^∞ (e^{−θi} θi^{xi} / xi!) λ e^{−λθi} dθi = (λ/(1 + λ)) (1/(1 + λ))^{xi},

so given λ the Xi are marginally i.i.d. Geom(λ/(1 + λ)), with mean λ^{−1}.

So the maximum marginal likelihood estimator is λ̂ = 1/x̄, where x̄ = (1/p) Σ_i x_i.

Hence our empirical Bayes approximation gives marginal posterior

π̂(θ | x) ∝ L(θ, x)π(θ, λ̂) = Π_{i=1}^p e^{−θ_i} θ_i^{x_i} λ̂ e^{−λ̂θ_i}.

We recognise from this expression that θi | xi ∼ Γ(xi + 1, λ̂ + 1) for each i. So the EB estimator is the approximated posterior mean,


θ̂_EB,i = α/β = (xi + 1)/(λ̂ + 1) = (xi + 1)/(1/x̄ + 1) = (x̄/(x̄ + 1)) xi + x̄/(x̄ + 1).

This has the effect of shrinking the MLE estimates towards the mean x̄.
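A tiny sketch of this shrinkage in Python (the counts below are synthetic, purely for illustration):

```python
# Poisson empirical Bayes: each MLE x_i is pulled towards the overall mean.
import numpy as np

x = np.array([0, 1, 1, 2, 3, 5, 9])
xbar = x.mean()                          # here 3.0, so lambda_hat = 1/3

theta_eb = (x + 1) * xbar / (xbar + 1)
print(theta_eb)                          # [0.75 1.5 1.5 2.25 3. 4.5 7.5]
```

Note that x̄ itself is a fixed point: an observation equal to the mean is not moved.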

Remark. We see that the empirical Bayes approach tends to pull the estimates towards the common
mean. This is true in general for models with exchangeable parameters.

Note also that, as mentioned in chapter 9, one drawback of the empirical Bayes approach is that we’re
potentially using the same data twice, leading to overfitting.

12.4 Non-parametric empirical Bayes


So far we have estimated a hyperprior distribution by finding a point estimate for the hyperparameter.
We could instead estimate the hyperprior (or marginal) distribution directly from the data. This is
known as non-parametric empirical Bayes. One such method is illustrated below.

Example. Suppose Yi ∼ Po(θi ) independently. Assume that the parameters θi are drawn indepen-
dently from some distribution π whose form we do not know.

The posterior mean is

θ̂_i = E[θi | Yi] = ∫ θ π(θ | Yi) dθ
= [∫ (θ^{Yi+1} e^{−θ}/Yi!) π(θ) dθ] / [∫ (θ^{Yi} e^{−θ}/Yi!) π(θ) dθ]  (by Bayes' Theorem)
= (Yi + 1) p(Yi + 1) / p(Yi),

where p(y) is the marginal pmf.

Robbins' method is then to approximate the marginal pmf p(y) by the empirical frequency of observed datapoints equal to y. So in this case

θ̂_i = (yi + 1) p̂(yi + 1) / p̂(yi) = (yi + 1) · |{j : yj = yi + 1}| / |{j : yj = yi}|.
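A sketch of Robbins' method on synthetic data (Python; the exponential mixing distribution is an arbitrary illustrative choice):

```python
# Robbins' estimator: theta_hat(y) = (y+1) #{j : y_j = y+1} / #{j : y_j = y}.
import numpy as np

rng = np.random.default_rng(3)
theta = rng.exponential(2.0, size=5000)  # unknown mixing distribution pi
y = rng.poisson(theta)

counts = np.bincount(y, minlength=y.max() + 2)

for v in range(5):
    robbins = (v + 1) * counts[v + 1] / counts[v]
    target = theta[y == v].mean()        # approximates E[theta | Y = v]
    print(v, round(robbins, 2), round(target, 2))
```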

Chapter 13

Hypothesis Tests

13.1 Recap from part A


13.1.1 General setup
Let X1, . . . , Xn be a random sample from f(x; θ) where θ ∈ Θ is a scalar or vector parameter. Suppose we are interested in testing

• the null hypothesis H0 : θ ∈ Θ0
• against the alternative H1 : θ ∈ Θ1.

Unless specified otherwise we assume that Θ0 ∩ Θ1 = ∅.

If a hypothesis consists of a single point in Θ, so that Θ0 = {θ0} say, we say that it is a simple hypothesis. Otherwise it is called a composite hypothesis.

In general a test consists of a critical region C such that we reject H0 if and only if X ∈ C. We reformulate this slightly by introducing the concept of the test function ϕ : X → {0, 1},

ϕ(x) = 1 if x ∈ C, and 0 if x ∉ C.

We will sometimes simply say the test ϕ. We will also sometimes need the notion of a randomized test. Suppose that X = C1 ∪ C0 ∪ C= where C1, C0, C= are pairwise disjoint, and fix γ ∈ [0, 1]. Then we generalize the notion of test function by setting

ϕ(x) = 1 if x ∈ C1; γ if x ∈ C=; 0 if x ∈ C0.

This is the test where we reject H0 when x ∈ C1, accept H0 when x ∈ C0, and reject H0 with probability γ if x ∈ C= (by flipping a coin). Such a test ϕ is called a randomized test.

Definition 13.1.

• The power function of a test is defined to be

w(θ) = Pθ(reject H0) = Eθ[ϕ(X)].

• The size of a test is often denoted α and is defined to be

α := sup_{θ∈Θ0} w(θ).

The idea is this: a good test has a small size, so that α ⩽ α0 for some specified value α0, and makes w(θ) as large as possible on Θ1. Within this framework we can consider various classes of problems:

1. Simple H0 vs simple H1: here there is an elegant and complete theory which tells us exactly how to construct the best test, given by the Neyman-Pearson Theorem.
2. Simple H0 vs composite H1: in this case the obvious approach is to pick θ1 ∈ Θ1 and construct the Neyman-Pearson test of H0 against H1. In some cases, the critical region one obtains is the same for all θ1. When that happens the test is said to be uniformly most powerful (or UMP). But there are many situations in which UMP tests do not exist, and then the problem is harder.
3. Composite H0 vs composite H1: in this case the problem is harder again.

13.1.2 Neyman-Pearson Theorem


Consider a test of a simple null hypothesis H0 : θ = θ0 against a simple alternative H1 : θ = θ1. Define the likelihood ratio

Λ(x) = f(x, θ1)/f(x, θ0).

Theorem 13.2. Define the critical region

C = {x : Λ(x) ⩾ k}

and suppose that the constants k and α are such that Pθ0 (X ∈ C) = α. Then among all tests of
H0 against H1 of size α, the test with critical region C has maximum power.

Tests with critical regions such as C are called Neyman-Pearson tests or likelihood ratio tests (LRT).

13.1.3 Uniformly most powerful tests


Definition 13.3. A uniformly most powerful test or UMP test of size α is a test function ϕ0
such that

1. Eθ (ϕ0 (X)) ⩽ α for all θ ∈ Θ0 ,

2. Given any other test ϕ for which Eθ (ϕ(X)) ⩽ α for all θ ∈ Θ0 , we have Eθ (ϕ0 (X)) ⩾ Eθ (ϕ(X))
for all θ ∈ Θ1 .

Note that a UMP test does not necessarily exist. However, for one-sided testing problems involving a
single parameter there is a wide class of parametric families that have a UMP test.

Definition 13.4. A family of densities {f(x, θ), θ ∈ Θ ⊆ R} with real scalar parameter θ is said to be of monotone likelihood ratio (MLR for short) if there exists a function t(x) such that the likelihood ratio

x ↦ f(x, θ2)/f(x, θ1)

is a non-decreasing function of t(x) whenever θ1 ⩽ θ2.


Theorem 13.5. Suppose that X has a distribution from a family which is MLR with respect to a
statistic t(X) and that we wish to test H0 : θ ⩽ θ0 against H1 : θ > θ0 . Suppose that the distribution
of t(X) is continuous. Then

1. The test with critical region C = {x : t(x) > t0} is UMP among all tests of size at most Pθ0(X ∈ C).
2. Given α, there exists some t0 such that the test above has size α.

Proof. For any θ1 > θ0 the Neyman-Pearson test of H0 : θ = θ0 against H1 : θ = θ1 has a critical
region of the form C = {x : t(x) > t0 } for some t0 which is chosen so that Pθ0 (T (X) > t0 ) = α.
Note that t0 does not depend on θ1 and so the critical region C is the same for all values of θ1 .
Thus, we see that this test is UMP for testing H0 : θ = θ0 against H1 : θ > θ0 .

Next, we claim that for any critical region of the form C = {x : t(x) > t0} the map

θ ↦ Pθ(X ∈ C)

is non-decreasing. This can be seen using an argument involving randomized test procedures and the optimality of the LRT (see Young and Smith, p. 72).

It follows that if Pθ0(X ∈ C) = α then sup_{θ⩽θ0} Pθ(X ∈ C) ⩽ α. Suppose that C′ is another critical region such that sup_{θ⩽θ0} Pθ(X ∈ C′) ⩽ α as well. This implies trivially that Pθ0(X ∈ C′) ⩽ α and thus, by optimality of the LRT, that for all θ1 > θ0 we have

Pθ1(X ∈ C′) ⩽ Pθ1(X ∈ C).

This shows that C is UMP among all tests of its size.

The second statement in the Theorem is clear by continuity.

Example. Suppose that X1, . . . , Xn are i.i.d. from an exponential distribution with mean θ: f(x, θ) = θ^{−1}e^{−x/θ}, x > 0, where θ ∈ (0, ∞). Let us say that we first want to test H0 : θ = θ0 against H1 : θ > θ0 (this is a simple-composite test).

Consider θ1 > θ0. We have

Λ(x) = f(x, θ1)/f(x, θ0) = (θ0/θ1)^n exp{ (1/θ0 − 1/θ1) Σ_i xi }.

Since 1/θ0 − 1/θ1 > 0, we see that Λ(x) is an increasing function of t(x) = Σ_i xi. So the Neyman-Pearson test will reject H0 if t(x) > k_α, where k_α is chosen so that Pθ0(t(X) > k_α) = α. Since t(X) ∼ Gamma(n, 1/θ), we can determine k_α (from tables of Gamma cdfs or by numerical computation; see the sketch following this example), and k_α does not depend on θ1. In other words the test is UMP for all θ ∈ Θ1.

Suppose now that we want to test H0* : θ ⩽ θ0 against H1 : θ > θ0. Note that

Pθ(Σ_i Xi > k) = Pθ(Σ_i Xi/θ > k/θ) = P(Y > k/θ),

where Y is a Gamma(n, 1) r.v. This is a non-decreasing function of θ. Therefore the test with critical region C = {x : t(x) > k_α} with size α for H0 also has size α for H0*. Now let ϕ(X) be any other test of size α under H0*. Since H0 is a smaller hypothesis than H0*, the test ϕ also has size ⩽ α under H0. But then by the Neyman-Pearson Theorem, Eθ1 ϕ(X) ⩽ Eθ1 ϕ0(X) for all θ1 > θ0, where ϕ0 is the test with critical region C. Thus ϕ0 is UMP.
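The sketch below (Python with scipy; the values of n, θ0, α are arbitrary) computes k_α from the Gamma quantile and illustrates that the power is increasing in θ1:

```python
# k_alpha for the one-sided exponential test; t(X) = sum X_i ~ Gamma(n, scale=theta).
from scipy import stats

n, theta0, alpha = 10, 1.0, 0.05
k_alpha = stats.gamma.ppf(1 - alpha, a=n, scale=theta0)
print(k_alpha)                             # reject H0 when sum(x) > k_alpha

for theta1 in (1.0, 1.5, 2.0):             # power is non-decreasing in theta
    print(theta1, stats.gamma.sf(k_alpha, a=n, scale=theta1))
```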

Example. Suppose that X1, . . . , Xn are i.i.d. from the (one-dimensional exponential family) density

f(x, θ) = h(x)e^{θT(x)−B(θ)}.

Write t(x) = Σ_i T(xi). Then

f(x, θ2)/f(x, θ1) = e^{n(B(θ1)−B(θ2))} exp{ (θ2 − θ1)t(x) }.

For θ1 ⩽ θ2 this is non-decreasing in t(x), and so the family is MLR.

13.2 Bayes factors


13.2.1 Bayes factors for simple hypotheses
Consider first the case H0 : θ = θ0 against H1 : θ = θ1 (simple vs simple). We need to specify a prior probability for each hypothesis: let's call π0 the prior probability of H0 and π1 that of H1, with π0 + π1 = 1.

Bayes' rule tells us that (writing fi for the density of X under Hi)

P(H0 is true | X = x) = π0 f0(x) / (π0 f0(x) + π1 f1(x)),

which can also be written as

P(H0 is true | X = x) / P(H1 is true | X = x) = (π0/π1) · (f0(x)/f1(x)).

In words this is often expressed as


posterior odds = prior odds × Bayes factor.

Definition 13.6. We call π0/π1 the prior odds in favor of H0, and B = f0(x)/f1(x) is the Bayes factor.

The first person to use Bayes factors extensively was Jeffreys, in his book Theory of probability (first
edition 1939). This can be considered to be Jeffreys’ main contribution to the theory of statistics.
Following Jeffreys, however, there were few methodological developments until the 1980s. This is an active field of research today.

From the point of view of decision theory, any Bayes rule will take the form C = {B < k} for some value k and is therefore an LRT. The class of Bayes rules is exactly the class of Neyman-Pearson rules.
Remark. A rough guide to interpreting Bayes factors, given by Adrian Raftery, is as follows:

P(H0 | x)       B0/1        2 log(B0/1)   evidence for H0
< 0.5           < 1         < 0           negative (supports H1)
0.5 to 0.75     1 to 3      0 to 2        barely worth mentioning
0.75 to 0.92    3 to 12     2 to 5        positive
0.92 to 0.99    12 to 150   5 to 10       strong
> 0.99          > 150       > 10          very strong


The value 2 log(B0/1 ) is sometimes reported because it’s on the same scale as the familiar deviance and
likelihood ratio test statistic.

13.2.2 Bayes factors for composite hypothesis


Suppose now that the hypotheses H0 and/or H1 are composite. Then it is not enough to know the prior probabilities π0 and π1 that H0 and H1 are correct; we also need to know the full prior distribution of θ ∈ Θ0 conditionally on H0 being true, and of θ ∈ Θ1 conditionally on H1 being true. Let's call those priors g0 and g1 respectively.

The hypothesis Hi is not just θ ∈ Θi but rather the full prior model

θ | Hi ∼ gi(θ), θ ∈ Θi.

Observe in particular that there is absolutely no need to suppose that the Θi are disjoint.

Definition 13.7. The Bayes factor in the composite-composite case is defined to be

B = ∫_{Θ0} f(x, θ)g0(θ) dθ / ∫_{Θ1} f(x, θ)g1(θ) dθ.

The Bayes factor in the simple-composite case is defined to be

B = f(x, θ0) / ∫_{Θ1} f(x, θ)g1(θ) dθ.

More generally, there is nothing here that requires the same parametrization under the two hypotheses. Suppose that we have two candidate parametric models M1 and M2 for data X, and the two models have respective parameter vectors θ1 and θ2. Under prior densities π1(θ1) and π2(θ2), the marginal distribution for X under each model is

p(x | Mi) = ∫ f(x, θi, Mi)πi(θi) dθi,

and the Bayes factor is just their ratio

B = p(x | M1)/p(x | M2).

Note that from this point of view, what we have is really a hierarchical Bayesian model where the model corresponds to the hyperparameter.

Example. Suppose that X1, . . . , Xn are i.i.d. N(θ, σ²) with σ² known. Consider H0 : θ = 0 against H1 : θ ≠ 0. Also suppose that the prior g1 under H1 is N(µ, τ²). We have

B = p1/p2,

where

p1 = (2πσ²)^{−n/2} exp{ −(1/2σ²) Σ Xi² }

and

p2 = (2πσ²)^{−n/2} ∫ exp{ −(1/2σ²) Σ (Xi − θ)² } (2πτ²)^{−1/2} exp{ −(1/2τ²)(θ − µ)² } dθ.

Completing the square, we see that for an arbitrary k we have

Σ_{i=1}^n (xi − θ)² + k(θ − µ)² = (n + k)(θ − θ̂)² + (nk/(n + k))(x̄ − µ)² + Σ (xi − x̄)²,

where θ̂ = (nx̄ + kµ)/(n + k). Thus (taking k = σ²/τ²):

(1/σ²) Σ_{i=1}^n (xi − θ)² + (1/τ²)(θ − µ)² = ((nτ² + σ²)/(σ²τ²))(θ − θ̂)² + (n/(nτ² + σ²))(x̄ − µ)² + (1/σ²) Σ (xi − x̄)².

Using that

∫ exp{ −((nτ² + σ²)/(2σ²τ²))(θ − θ̂)² } dθ = (2πσ²τ²/(nτ² + σ²))^{1/2},

we see that

p2 = (2πσ²)^{−n/2} (σ²/(nτ² + σ²))^{1/2} exp{ −(1/2)[ (n/(nτ² + σ²))(x̄ − µ)² + (1/σ²) Σ (xi − x̄)² ] }.

Hence, the Bayes factor is

B = (1 + nτ²/σ²)^{1/2} exp{ −(1/2)[ nx̄²/σ² − n(x̄ − µ)²/(nτ² + σ²) ] }.

Writing t = √n x̄/σ, η = −µ/τ, ρ = σ/(τ√n), we can rewrite this as

B = (1 + 1/ρ²)^{1/2} exp{ −(1/2)[ (t − ρη)²/(1 + ρ²) − η² ] }.

This illustrates a difficulty with the Bayes factor approach. In general, many Bayesian solutions to point and interval estimation problems are close to the classical solutions when the prior is diffuse. However, here when we let τ² → ∞ we see that ρ → 0 and thus B → ∞. In other words, in the limit where the prior under H1 is diffuse (infinite variance), we have overwhelming support for H0 no matter the observed data. This is an instance of Lindley's paradox. One must therefore choose η, ρ to represent some reasonable judgement of where θ is likely to be when H0 is false, and there is no way to escape this by using some non-informative prior!
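A numerical sketch of the paradox (Python; the values of t, n, σ are arbitrary): holding the data fixed at a 'significant' value of t, the Bayes factor in favour of H0 grows without bound as the prior spread τ increases.

```python
# Lindley's paradox: B -> infinity as tau -> infinity (rho -> 0), for fixed t.
import numpy as np

def bayes_factor(t, rho, eta=0.0):        # eta = -mu/tau; here mu = 0
    return np.sqrt(1 + 1 / rho**2) * np.exp(
        -0.5 * ((t - rho * eta) ** 2 / (1 + rho**2) - eta**2))

t, n, sigma = 2.5, 100, 1.0               # t = sqrt(n) * xbar / sigma
for tau in (0.5, 1.0, 10.0, 100.0, 1000.0):
    rho = sigma / (tau * np.sqrt(n))
    print(tau, bayes_factor(t, rho))      # B increases without bound
```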

Example (Psychokinesis example). In 1987 Schmidt, Jahn and Radin ran an experiment where
a subject with alleged psychokinetic ability tried to ‘influence’ a stream of quantum particles arriving
at a quantum gate. Each particle would upon arrival at the gate either trigger a red light or a green
light; the laws of quantum mechanics suggest a 50/50 ratio, and the subject tried to influence the
particles to go to red.

Let X be the number of particles observed to go to red out of a total of n. We use the model X ∼ Bin(n, θ) where θ is unknown. In the experiment, n = 104,490,000 and the observed value of X was x = 52,263,471.

Has the subject influenced the particles?

Framing this as a hypothesis test, the natural choice of hypotheses is

H0 : θ = 1/2, H1 : θ ̸= 1/2.

The frequentist p-value is Pθ=1/2 (X ⩾ x) = 0.0003. This suggests very strong evidence of paranor-
mal ability?

Let’s reframe this as a Bayesian test to see what’s going on. Choose the mixed prior with π0 = 1/2
and g1 = 1[0,1] corresponding to a flat prior on [0, 1]. Under this prior, the posterior probability of


H0 is

π(H0 | x) = π0 f(x, 1/2) / [ π0 f(x, 1/2) + π1 ∫_0^1 f(x, θ) dθ ]
= (n choose x) 2^{−n} / [ (n choose x) 2^{−n} + 1/(n + 1) ]   (since π0 = π1, the priors cancel)
≈ 0.92

in our case, corresponding to a Bayes factor of B ≈ 11.5.

This gives a very different conclusion from the one based on the p-value.

This reflects that we are reasonably sure before conducting the experiment that θ = 1/2 is a more
likely value than any other.
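The computation can be reproduced with scipy (a sketch, not part of the notes; it recovers the posterior probability ≈ 0.92):

```python
# p-value and posterior probability of H0 in the psychokinesis example.
import numpy as np
from scipy import stats

n, x = 104_490_000, 52_263_471

print(stats.binom.sf(x - 1, n, 0.5))     # P(X >= x | theta=1/2) ~ 1.5e-4
                                         # (~0.0003 if doubled to two sides)

log_m0 = stats.binom.logpmf(x, n, 0.5)   # log of C(n,x) 2^{-n}
log_m1 = -np.log(n + 1.0)                # log of integral_0^1 f(x, theta) dtheta
B = np.exp(log_m0 - log_m1)
print(B, B / (1 + B))                    # Bayes factor ~ 12, posterior ~ 0.92
```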

13.3 Hypothesis testing in the context of decision theory


13.3.1 Bayes tests for simple-simple hypothesis
Suppose we wish to test the hypothesis H0 : θ = θ0 against the alternative H1 : θ = θ1, and consider the (non-random) test ϕ with critical region C,

ϕ(x) = 1 if x ∈ C, and 0 if x ∉ C.

A generic loss function can be written

L(θ, ϕ(x)) = aϕ(x) if θ = θ0; b(1 − ϕ(x)) if θ = θ1.

Lemma 13.8. The rule ϕ has risk R(θ0, ϕ) = aα and R(θ1, ϕ) = bβ, where β = 1 − w(θ1).

Proof. We have

R(θ0, ϕ) = Eθ0[aϕ(X)] = aα,
R(θ1, ϕ) = Eθ1[b(1 − ϕ(X))] = b(1 − w(θ1)).

To calculate the Bayes risk we need a prior π. Let π(θ0 ) = p0 and π(θ1 ) = p1 be the prior probabilities
that H0 and H1 hold, respectively.

Lemma 13.9. The Bayes risk for ϕ under the prior π is

r(π, ∆C ) = p0 aα(C) + p1 bβ(C).

Proof. Trivial, by calculating the expected risk.


Remark. Note here that we write α = α(C), β = β(C) to emphasise that α, β depend on (and only on)
our choice of critical region, whereas the other quantities are independent of it.


Definition 13.10. The Bayes test is the rule δC with the critical region C chosen to minimise
the Bayes risk (under the loss function defined above).

Theorem 13.11 (Bayes test for simple hypotheses). The critical region for the Bayes test with prior π and loss L is

C = { x : f(x, θ1)/f(x, θ0) ⩾ A },

where A = p0a/(p1b).

Proof. The Bayes test minimises the Bayes risk

p0aα + p1bβ = p0a P(X ∈ C | H0) + p1b P(X ∈ Cᶜ | H1)
= p0a ∫_C f(x, θ0) dx + p1b ∫_{Cᶜ} f(x, θ1) dx
= p0a ∫_C f(x, θ0) dx + p1b (1 − ∫_C f(x, θ1) dx)
= p1b + ∫_C [ p0a f(x, θ0) − p1b f(x, θ1) ] dx.

So choose C such that x ∈ C iff p0a f(x, θ0) − p1b f(x, θ1) ⩽ 0, i.e.

C = { x : f(x, θ1)/f(x, θ0) ⩾ p0a/(p1b) }.

Corollary 13.12. The Bayes test is a likelihood ratio test with A = p0a/(p1b).

Corollary 13.13. Every likelihood ratio test is a Bayes test for some prior probabilities p0, p1.

Example. Suppose X1, . . . , Xn are i.i.d. N(µ, σ²) with σ² known, and we want to test H0 : µ = µ0 against H1 : µ = µ1, with µ1 > µ0.

The critical region for a likelihood ratio test becomes

C = { x ∈ Rⁿ : f(x, µ1)/f(x, µ0) ⩾ A } = { x ∈ Rⁿ : x̄ ⩾ σ² log(A)/(n(µ1 − µ0)) + (µ0 + µ1)/2 }.

For the Bayes test we need A = p0a/(p1b), so we simply substitute into the above to find the critical region.

As an example, take µ0 = 0, µ1 = 1, σ² = 1, n = 4, a = 2, b = 1, p0 = 1/4, p1 = 3/4. Then A = 2/3 and

C = { x ∈ Rⁿ : x̄ ⩾ (1/4) log(2/3) + 1/2 } = { x ∈ Rⁿ : x̄ ⩾ 0.3986 }.


Using that X̄ ∼ N(µ, 1/4), this gives Type I/II error probabilities

α = P(X̄ ⩾ 0.3986 | µ = 0, σ²/n = 1/4) ≈ 0.213

and

β = P(X̄ < 0.3986 | µ = 1, σ²/n = 1/4) ≈ 0.115.

The frequentist approach, fixing α = 0.05, would give β ≈ 0.361 (easy to check), so we see that in the Bayes test α is increased and β decreased relative to the frequentist test.
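These numbers are easy to reproduce (a Python sketch, not part of the notes):

```python
# Type I/II error probabilities of the Bayes test in the Gaussian example.
import numpy as np
from scipy import stats

mu0, mu1, sigma2, n = 0.0, 1.0, 1.0, 4
a, b, p0, p1 = 2.0, 1.0, 0.25, 0.75

A = p0 * a / (p1 * b)                     # = 2/3
c = sigma2 * np.log(A) / (n * (mu1 - mu0)) + (mu0 + mu1) / 2
sd = np.sqrt(sigma2 / n)

alpha = stats.norm.sf(c, loc=mu0, scale=sd)
beta = stats.norm.cdf(c, loc=mu1, scale=sd)
print(c, alpha, beta)                     # ~0.3986, ~0.213, ~0.115
```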

13.3.2 The case of the 0–1 loss function


In the case that L is the 0-1 loss, so a = b = 1 and

L(θ, δC(x)) = 1 if θ = θ0 and x ∈ C; 1 if θ = θ1 and x ∉ C; 0 otherwise,

the Bayes test takes a particularly intuitive form.

Definition 13.14. The maximum a posteriori (MAP) test chooses the hypothesis with the
highest posterior probability P(Hi | X = x).

Theorem 13.15. The MAP test is the Bayes test under the 0–1 loss.

Proof. Exercise.

Let mi(x) be the marginal likelihood of x under hypothesis Hi. Thus if Hi is simple, Hi : θ = θi, we have mi(x) = f(x, θi). If Hi is composite, Hi : θ ∈ Θi, we have

mi(x) = ∫_{Θi} f(x, θ)gi(θ) dθ.

Let π0 , π1 be the prior probabilities of H0 , H1 .

Proposition 13.16. The Bayes test for the 0-1 loss (i.e. the MAP test) rejects H0 iff

m0(x)/m1(x) < π1/π0.

Proof. This is just an application of Theorem 13.11 with a = b = 1.

Nevertheless, let us just check that this is indeed the MAP test in the case where H0 is simple and H1 is composite. The marginal distribution for X under this (hierarchical) prior is

m(x) = π1 ∫_{Θ1} f(x, θ)g1(θ) dθ + π0 f(x, θ0).

Thus the posterior probability of H0 is

π(H0 | x) = π0 f(x, θ0) / [ π1 ∫_{Θ1} f(x, θ)g1(θ) dθ + π0 f(x, θ0) ].

The Bayes test for the 0-1 loss, i.e. the MAP test, rejects H0 iff π(H0 | x) < π(H1 | x), i.e. iff π(H0 | x) < 1/2. This occurs iff

2π0 f(x, θ0) < π1 ∫_{Θ1} f(x, θ)g1(θ) dθ + π0 f(x, θ0)
⟺ π0 f(x, θ0) < π1 ∫_{Θ1} f(x, θ)g1(θ) dθ
⟺ f(x, θ0) / ∫_{Θ1} f(x, θ)g1(θ) dθ < π1/π0,

giving the result.

Example. In a quality inspection program components are selected at random from a batch and
tested. Let θ denote the failure probability. Suppose that we want to test the hypotheses

H0 : θ ⩽ 0.2, H1 : θ > 0.2.


We use the following prior for θ: let g(θ) = 30θ(1 − θ)⁴ for 0 < θ < 1 (a Beta(2, 5) density). We also set π0 = ∫_0^{0.2} g(θ) dθ ≈ 0.345 and π1 = 1 − π0 ≈ 0.655, with g0(θ) = g(θ | θ ⩽ 0.2) = g(θ)1_{[0,0.2]}(θ)/π0 and g1(θ) = g(θ | θ > 0.2).

Suppose n components are selected for independent testing. Modelling the number of failures X as X ∼ Bin(n, θ), the marginal likelihood for H0 is

m0(x) = ∫_{Θ0} f(x, θ)g0(θ) dθ = (n choose x) ∫_0^{0.2} θ^x (1 − θ)^{n−x} · (30θ(1 − θ)⁴/π0) dθ.

For one batch of size n = 5, the value X = x = 0 is observed. So

m0(x) = (5 choose 0) ∫_0^{0.2} (30θ(1 − θ)⁹/π0) dθ ≈ 0.185/0.345 ≈ 0.536.

Similarly m1(x) = (5 choose 0) ∫_{0.2}^1 (30θ(1 − θ)⁹/π1) dθ ≈ 0.134.

So the Bayes factor is B0/1 = m0(x)/m1(x) = 0.536/0.134 = 4 > π1/π0 = 1.89, so the Bayes test does not reject H0.

Indeed, the overall marginal likelihood is m(x) = m0(x)π0 + m1(x)(1 − π0) ≈ 0.273, so the posterior probabilities for the hypotheses are π(H0 | x) = m0(x)π0/m(x) ≈ 0.185/0.273 ≈ 0.678 and π(H1 | x) ≈ 0.322; we see that H0 indeed maximises the posterior.
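A sketch verifying the integrals numerically (Python with scipy; not part of the notes):

```python
# Marginal likelihoods in the quality inspection example by numerical integration.
from scipy import stats
from scipy.integrate import quad

n, x = 5, 0
g = lambda t: 30 * t * (1 - t) ** 4                 # the Beta(2, 5) density
lik = lambda t: stats.binom.pmf(x, n, t)            # here (1 - t)^5

pi0, _ = quad(g, 0, 0.2)
m0, _ = quad(lambda t: lik(t) * g(t) / pi0, 0, 0.2)
m1, _ = quad(lambda t: lik(t) * g(t) / (1 - pi0), 0.2, 1)

print(pi0, m0, m1, m0 / m1)    # ~0.345, ~0.536, ~0.134, B ~ 4
```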

13.4 Exponential families


Good material here: Washington.edu

13.5 Two sided hypothesis tests


We now consider in more detail situations in which H0 : θ ∈ Θ0 with either Θ0 = [θ1, θ2] or Θ0 = {θ0}, and Θ1 = R \ Θ0. In this situation we cannot expect to find a UMP test, even for nice families such as exponential families or families with MLR. The reason is obvious: if we construct a Neyman-Pearson test of, say, θ = θ0 against θ = θ1 for some θ1 ≠ θ0, the test takes quite a different form when θ1 > θ0 from when θ1 < θ0. We simply cannot expect one test to be most powerful in both cases simultaneously. However, if we have an exponential family with natural statistic T = t(X), or a family with MLR with respect to t(X), we


Figure 13.1: Power functions for one- and two-sided tests (from Young and Smith).

might still expect tests of the form

ϕ(x) = 1 if t(x) ∉ [t1, t2]; γ(x) if t(x) ∈ {t1, t2}; 0 if t(x) ∈ (t1, t2),

where t1 < t2, to have good properties. Such tests are called two-sided tests based on T.

Definition 13.17. A test of H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 is called unbiased of size α if

Pθ(X ∈ C) ⩽ α for all θ ∈ Θ0, but Pθ(X ∈ C) ⩾ α for all θ ∈ Θ1.

A test which is uniformly most powerful amongst the class of all unbiased tests is called uniformly most powerful unbiased, abbreviated UMPU.

The idea is illustrated by Figure 13.1 for the case H0 : θ = θ0 against H1 : θ ≠ θ0 (in the figure, θ0 = 0). The optimal UMP tests for the alternatives H1 : θ > θ0 and H1 : θ < θ0 each fail miserably to be unbiased, but there is a two-sided test whose power function is given by the dotted curve, and we may hope that such a test will be UMPU.

13.5.1 UMPU tests for one-parameter exponential families


Consider an exponential family of the form

f(x, θ) = h(x) exp{θt(x) − B(θ)}

with θ ∈ R. Let T = t(X) be the natural observation.

Remember that T itself also belongs to an exponential family, with density of the form

fT(t, θ) = hT(t) exp{θt − B(θ)}.


We shall assume that T is a continuous random variable with hT > 0 on the open set that defines the range of T. This avoids the need for randomised tests and makes our proofs less technical, at the cost of very little loss of generality.

Theorem 13.18. For any α there exists a UMPU test of size α which is of the two-sided form in
T.

We do not include a full proof of this result here. However, we mention that it starts with the following generalisation of the Neyman-Pearson Theorem:

Lemma 13.19. Let f0, f1, . . . , fm be m + 1 probability densities, and let α1, . . . , αm be constants such that the class

C = { ϕ : ∫ ϕ(x)fi(x) dx = αi, i = 1, . . . , m }

is non-empty. Then

1. There is one member of C that maximizes ∫ f0(x)ϕ(x) dx.

2. A necessary and sufficient condition for ϕ* ∈ C to be a maximizer is that there exist constants k1, . . . , km such that

(13.19) ϕ*(x) = 1 if f0(x) > Σ_{i=1}^m ki fi(x), and ϕ*(x) = 0 if f0(x) < Σ_{i=1}^m ki fi(x).

3. If ϕ ∈ C satisfies (13.19) with k1, . . . , km ⩾ 0, then it maximises ∫ f0(x)ϕ(x) dx among all functions satisfying

∫ ϕ(x)fi(x) dx ⩽ αi, i = 1, . . . , m.