Foundations of Statistical Inference
Michaelmas Term 2022
George Deligiannidis
Contents
0 Notations 4
1 Exponential Families 5
1.1 Definition and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Support and counterexamples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Parsimonious parametrisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 The parameter space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Curved exponential families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Fisher Information 21
3.1 The one-dimensional case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 The multivariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 Point estimation 25
4.1 The method of moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Maximum likelihood estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Finding the MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Variance and mean squared error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
8 Non-Informative Priors 47
8.1 Uniform priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.2 Jeffrey’s prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.2.1 Jeffrey’s prior in higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.3 Maximum entropy prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9 Hierarchical Models 51
9.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.3 Exchangeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Page 1 of 88
Foundations of Statistical Inference CONTENTS
10 Decision Theory 58
10.1 Basic framework and risk function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
10.2 Admissibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
10.3 Minimax rules and Bayes rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
10.4 Bayes rule and posterior risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
10.5 Point estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.6 Finite decision problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
10.6.1 The case k = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
13 Hypothesis Tests 76
13.1 Recap from part A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
13.1.1 General setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
13.1.2 Neyman-Pearson Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
13.1.3 Uniformly most powerful tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
13.2 Bayes factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
13.2.1 Bayes factors for simple hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
13.2.2 Bayes factors for composite hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . 80
13.3 Hypothesis testing in the context of decision theory . . . . . . . . . . . . . . . . . . . . . . 82
13.3.1 Bayes tests for simple-simple hypothesis . . . . . . . . . . . . . . . . . . . . . . . . 82
13.3.2 The case of the 0–1 loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
13.4 Exponential families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
13.5 Two sided hypothesis tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
13.5.1 UMPU tests for one-parameter exponential families . . . . . . . . . . . . . . . . . 86
This set of notes is largely based on notes prepared by Damian Falck in MT2020, based on slides and lectures given by Julien Berestycki. The material goes back to material prepared by Judith Rousseau.
This is still a work in progress and may contain typos or even errors. I would very much appreciate your help in improving these notes, so if you spot anything please let me know at [email protected].
Chapter 0
Notations
The situations of interest to us in this course generally begin with some observed data x, where x is a point in a sample space X.
Example. Consider a large field of soybean plants. For 7 weeks, 5 plants are randomly chosen each Monday and their average height is recorded.
The data are x = {5, 13, 16, 13, 23, 33, 40}. Here X = R7+ .
We will consider x as the realisation of a random variable X, taking values in some measurable space
(X , F), where the distribution of X is (at least partly) unknown. Statistical inference is about using
x to gain information on the distribution of X.
We write P(X ) for the collection of all probability measures on (X , F). We will usually be interested in
a smaller class or family of possible distributions P ⊂ P(X ) parametrised by some parameter θ:
Definition 0.1. A set P = {Pθ : θ ∈ Θ}, where the Pθ are probability distributions on X , is called
a statistical model . Here Θ is the parameter space.
If Pθ is absolutely continuous w.r.t. some reference measure µ (for our purposes the Lebesgue measure), we write f(x, θ) or pθ(x) for its probability density function, whereas if Pθ is discrete we write f(x, θ) for its probability mass function. We will often identify a distribution Pθ with its density. We write
Eθ [·] and Pθ [·] to mean expectations/probabilities under Pθ ; so in Eθ [ϕ(X)], for example, we take X to
have distribution Pθ .
Other possible notations for the same mass/density include pθ (x), p(x, θ), p(x | θ), f (x | θ), Pθ (X = x)
(in the discrete case), and L(θ; x).
Remark (Remark on use of notation). Throughout this course we will freely drift between different
notations for the same objects. This is somewhat intentional.
Chapter 1
Exponential Families
Definition 1.1. A family P = {Pθ : θ ∈ Θ} of probabilities (pmf or pdf) on some set X, indexed by θ, is called an exponential family if there exist k ∈ N, functions η₁, …, η_k, B : Θ → R, statistics T₁, …, T_k : X → R and a non-negative real-valued function h on X such that the pdf/pmfs p(x; θ) of Pθ have the form

(1.1)  $p(x;\theta) = \exp\Big(\sum_{i=1}^{k} \eta_i(\theta) T_i(x) - B(\theta)\Big)\, h(x).$
Remark. By changing the set X if necessary, we can always focus on the case where 0 < h(x) < ∞, for
all x ∈ X without changing the exponential family.
Remark. The above should be implicitly understood as saying that

$P_\theta(B) = \int_B p(x;\theta)\,\mu(dx), \qquad B \in \mathcal{F},$

for some reference measure µ. When x ∈ Rᵈ, µ will typically be the Lebesgue measure, and for discrete sets it will be the counting measure. (In the above expression the integral should be a sum if p(x; θ) is a pmf; the integral against the reference measure µ encompasses both cases.)
The ηᵢ are called the natural or canonical parameters, and the Tᵢ(x) are called the natural or canonical observations.
We can think of exp(−B(θ)) as a normalisation. Observe that B only depends on θ through η(θ).
Often, it is useful to use the ηi as the parameters and to write the model in its canonical form,
$p(x;\eta) = \exp\Big(\sum_{i=1}^{k} \eta_i T_i(x) - B(\eta)\Big)\, h(x).$
(Note this is possible even if θ 7→ η is not one-to-one. Note also that there is a slight abuse of notation
as (x, θ) 7→ p(x; θ) and (x, η) 7→ p(x; η) are not the same function).
• Poisson distribution. For the Poisson(θ) distribution, the pmf may be written as $f(x;\theta) = \theta^x e^{-\theta}/x! = \exp(x \log\theta - \theta)\,\frac{1}{x!}$, with h(x) = 1/x!, η(θ) = log θ, B(θ) = θ and T(x) = x. The natural parameter is log θ.
• Binomial distribution with known number of trials. For the Bin(n, p) distribution,
considering n to be known and p to be the parameter, the mass function may be written as
(1.4)  $f(x;p) = \binom{n}{x} p^x (1-p)^{n-x}$
(1.5)  $\phantom{f(x;p)} = \binom{n}{x} \exp\big( x(\log p - \log(1-p)) + n\log(1-p) \big)$

(for x = 0, 1, …, n). So $h(x) = \binom{n}{x}$, T(x) = x, $\eta(p) = \log\frac{p}{1-p}$, and B(p) = −n log(1 − p).
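As a quick numerical sanity check (an added sketch, not part of the original notes), we can confirm that the exponential-family form above reproduces the standard Binomial pmf:

```python
from math import comb, exp, log

def binom_pmf(x, n, p):
    # Standard form of the Binomial pmf.
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_expfam(x, n, p):
    # Exponential-family form: h(x) exp(eta(p) T(x) - B(p)).
    h = comb(n, x)            # h(x) = C(n, x)
    eta = log(p / (1 - p))    # natural parameter
    T = x                     # natural observation
    B = -n * log(1 - p)       # normalisation B(p)
    return h * exp(eta * T - B)

n, p = 10, 0.3
assert all(abs(binom_pmf(x, n, p) - binom_expfam(x, n, p)) < 1e-12
           for x in range(n + 1))
```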
• Gaussian distribution with known variance. For the N (µ, 1) distribution (for example),
the density may be written as
$f(x;\mu) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{(x-\mu)^2}{2}\Big) = \frac{e^{-x^2/2}}{\sqrt{2\pi}} \exp\Big(\mu x - \frac{\mu^2}{2}\Big),$

so $h(x) = e^{-x^2/2}/\sqrt{2\pi}$, η(µ) = µ, T(x) = x and B(µ) = µ²/2.
• Gamma distribution. For the Gamma(α, β) distribution, with θ = (α, β), we have density

(1.6)  $f(x;\theta) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)} \mathbf{1}_{x\ge 0}$
(1.7)  $\phantom{f(x;\theta)} = \exp\Big( \underbrace{(\alpha-1)}_{\eta_1(\theta)} \underbrace{\log x}_{T_1(x)} \underbrace{-\beta}_{\eta_2(\theta)}\, \underbrace{x}_{T_2(x)} - \underbrace{(\log\Gamma(\alpha) - \alpha\log\beta)}_{B(\theta)} \Big)\, \underbrace{\mathbf{1}_{x\ge 0}}_{h(x)}.$
• Gaussian distribution. For the N(µ, σ²) distribution, with θ = (µ, σ²), we have density

(1.8)  $f(x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)$
(1.9)  $\phantom{f(x;\theta)} = \exp\Big( \underbrace{-\tfrac{1}{2\sigma^2}}_{\eta_1(\theta)} \underbrace{x^2}_{T_1(x)} + \underbrace{\tfrac{\mu}{\sigma^2}}_{\eta_2(\theta)} \underbrace{x}_{T_2(x)} - \underbrace{\Big(\tfrac{\mu^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2)\Big)}_{B(\theta)} \Big).$
Proposition 1.2. Two probability measures P and Q are said to be equivalent if we have P(N ) = 0
iff Q(N ) = 0. If P = {p(x; θ) : θ ∈ Θ} is an exponential family, then all p(·; θ) are equivalent.
Proof. Take θ₁ ∈ Θ and suppose Pθ₁(N) = 0. Write 1_N for the indicator function of N. Then

(1.10)  $0 = P_{\theta_1}(N) = e^{-B(\theta_1)} \int_{\mathcal{X}} \exp\Big(\sum_j \eta_j(\theta_1) T_j(x)\Big) h(x)\, \mathbf{1}_N(x)\, dx.$

This implies that h(x)1_N(x) = 0 for Lebesgue almost all x and therefore that

(1.11)  $P_{\theta}(N) = e^{-B(\theta)} \int_{\mathcal{X}} \exp\Big(\sum_j \eta_j(\theta) T_j(x)\Big) h(x)\, \mathbf{1}_N(x)\, dx = 0$

for arbitrary θ ∈ Θ.
Definition 1.3. For a distribution P with density f , the support of P, equivalently of f , is defined
as the set
supp(P) = {x : f (x) > 0}.
Corollary 1.4. In an exponential family P = {f (x; θ), θ ∈ Θ} the support of f (x; θ) does not
depend on θ. We will write A for the common support of the f (x; θ).
In fact, in the general case where h is allowed to vanish on a subset of X , A = {x : h(x) > 0}. It can be
easily seen that Pθ (A) = 1 for all θ.
Example. Another example of a family which is not exponential is the Cauchy family with location
parameter µ:
$f(x;\mu) = \frac{1}{\pi(1+(x-\mu)^2)}.$
Definition 1.6. A representation of an exponential family P of the form (1.1) is called minimal
if k is minimal; that is for any other representation of P in terms of {η̃j : j ≤ l}, {T̃j : j ≤ l} we
have l ≥ k.
This means that one cannot find s < k, such that p(x; θ) can be written in the form (1.1) with s replacing
k and some new statistics T1′ , . . . , Ts′ , new functions η1′ , . . . , ηs′ and B ′ on Θ and a new function h′ .
It can be easily seen that if the {Tᵢ}ᵢ₌₁ᵏ satisfy some affine relationship of the form $\sum_i c_i T_i(x) = c_0$ for all x ∈ A, then we can rewrite one of the Tᵢ in terms of the rest and a constant, reducing k by 1. Similarly if the {ηᵢ}ᵢ₌₁ᵏ satisfy an affine relationship. In fact we can keep reducing k until {ηᵢ}ᵢ₌₁ᵏ and {Tᵢ}ᵢ₌₁ᵏ become affinely independent.
Definition 1.7. The functions T₁, …, T_n are called affinely independent (P-affine independent in [LZ16]) if for any¹ c₀, …, c_n ∈ R,

$\Big( \sum_{j=1}^{n} c_j T_j(x) = c_0 \;\; \mu\text{-almost everywhere} \Big) \implies \big( c_j = 0 \text{ for } j = 0, \dots, n \big).$

¹ Recall from Corollary 1.4 that A denotes the common support of the exponential family.
Remark. In the case of the functions T₁, …, T_n it may help to intuitively understand affine independence as saying that for c₀, …, c_n ∈ R,

$\Big( \sum_{j=1}^{n} c_j T_j(x) = c_0 \;\; \forall x \in A \Big) \implies \big( c_j = 0 \text{ for } j = 0, \dots, n \big),$

where A was defined in Corollary 1.4 as the common support of the exponential family; notice that Pθ(A) = 1 for all θ.
Remark. If the functions {ηⱼ(·)} are not affinely independent then they are contained in a (k − 1)-dimensional hyperplane. Similarly for {Tⱼ(·)}, ignoring a set of measure zero.
Remark. It is easy to see that affine independence is stronger than linear independence. For example the functions {x, x + 1} are linearly independent viewed as functions R → R, but not affinely independent. Affine independence of {f₁, …, f_k} means that {f₁, …, f_k, 1} are linearly independent, where 1 denotes the constant function.
We argued earlier that we can iteratively exploit any affine relationships, reducing the dimension of the representation, until we arrive at an affinely independent representation. We will now establish that every affinely independent representation has the same dimension.
Proposition 1.8. For every exponential family P = {p(x; θ) : θ ∈ Θ}, with p(x; θ) of the form
(1.1), there exists a k ′ ≤ k such that (1.1) has a k ′ -parameter, affinely independent representation.
Any affinely independent representation has dimension k ′ .
Proof. First of all we can always arrive at an affinely independent representation by iteratively
exploiting affine relationships until none exist. Therefore we may assume that P is given by (1.1),
with {ηᵢ} and {Tᵢ} affinely independent. Define

$H := \{ (\eta_i(\theta))_{i=1}^{k} : \theta \in \Theta \} \subset \mathbb{R}^k.$
By definition, since {ηi }i are affinely independent, H is not contained in any (k − 1)-dimensional
hyperplane. Consider the collection R of log-likelihood ratios
where

$\log \frac{p(x;\theta)}{p(x;\theta')} = \sum_{i=1}^{k} (\eta_i(\theta) - \eta_i(\theta')) T_i(x) + B(\theta') - B(\theta).$
Then let V be the vector space spanned by R and the constant function 1X ,
$V := \Big\{ \beta_0 + \sum_{i=1}^{m} \beta_i f_i : m \in \mathbb{N},\ \beta_0, \beta_i \in \mathbb{R},\ f_i \in \mathcal{R} \Big\}.$
Let W be the (k + 1)-dimensional vector space spanned by the constant function 1_X and T₁, …, T_k; clearly V ⊆ W. Since the {Tᵢ} are affinely independent, {T(x) : x ∈ A} does not lie in any (k − 1)-dimensional hyperplane, and therefore V does not lie in any k-dimensional hyperplane of W. Therefore V = W. Finally notice that the likelihood ratios
are independent of the particular representation chosen in (1.1), in particular independent of the
reference measure and of h. Therefore so is the vector space V . Since V is also independent of the
choice of T , k is determined uniquely.
Proposition 1.9. Suppose that P is given by (1.1). Then P is strictly k-parameter if and only if
in (1.1) the functions {ηi : i ≤ k} and the statistics {Ti : i ≤ k} are affinely independent.
Proof. One direction is obvious, since if the functions {ηi (·)} and the statistics {Ti (·)}i in a k-
parameter family are not affinely independent then we can find a (k − 1)-dimensional representation
and therefore the family is not strictly k-parameter.
For the other direction, we assume that the exponential family P is given by P = {f (x; θ) : θ ∈ Θ},
where

(1.12)  $f(x;\theta) = h(x) \exp\Big(\sum_{i=1}^{k} \eta_i(\theta) T_i(x) - B(\theta)\Big), \qquad \theta \in \Theta,$
where the functions {ηi (·)} and the statistics {Ti (·)}i are affinely independent. We need to prove
that P is strictly k-parameter.
Way 1: Notice that any minimal representation must be affinely independent, since otherwise we could reduce the dimension, violating minimality. We also know from Proposition 1.8 that any affinely independent representation has the same dimension k. Recall from the proof of Proposition 1.8 that the vector space V spanned by the collection of the likelihood ratios {x ↦ log p(x; θ)/p(x; θ′) : θ, θ′ ∈ Θ} is (k + 1)-dimensional and is independent of the representation. Therefore any representation must have dimension at least k.
Way 2: Suppose that P admits a second representation,

P = {f(x; θ) : θ ∈ Θ} = {g(x; θ) : θ ∈ Θ},

where

(1.13)  $g(x;\theta) = \tilde h(x) \exp\Big(\sum_{i=1}^{l} \tilde\eta_i(\theta) \tilde T_i(x) - \tilde B(\theta)\Big), \qquad \theta \in \Theta.$
Next notice that likelihood ratios are independent of the parameterisation. Then we have

(1.14)  $\frac{f(\theta,x)\, f(\theta_0,x_0)}{f(\theta_0,x)\, f(\theta,x_0)} = \frac{h(x)\exp(\eta(\theta)\cdot T(x) - B(\theta)) \times h(x_0)\exp(\eta(\theta_0)\cdot T(x_0) - B(\theta_0))}{h(x)\exp(\eta(\theta_0)\cdot T(x) - B(\theta_0)) \times h(x_0)\exp(\eta(\theta)\cdot T(x_0) - B(\theta))}$
(1.15)  $\phantom{\frac{f(\theta,x)}{f(\theta_0,x)}} = \exp\big( [\eta(\theta) - \eta(\theta_0)] \cdot [T(x) - T(x_0)] \big).$
Fix a P₀ ∈ P corresponding to θ₀ in both parametrisations. For P ∈ P, given equivalently by (1.12) and (1.13), we then have that

(1.16)  $\big[\eta(\theta) - \eta(\theta_0)\big] \cdot \big[T(x) - T(x_0)\big] = \big[\tilde\eta(\theta) - \tilde\eta(\theta_0)\big] \cdot \big[\tilde T(x) - \tilde T(x_0)\big].$
Since by assumption the functions {ηi (·) : i = 1, . . . , k} are affinely independent, they are not
contained in any (k − 1)-dimensional hyperplane. Therefore we can find θ0 , . . . , θk such that the
vectors η(θ0 ), . . . , η(θk ) are not contained in any (k − 1)-dimensional hyperplane and therefore the
vectors η(θ₁) − η(θ₀), …, η(θ_k) − η(θ₀) span Rᵏ. Applying (1.16) with θ = θᵢ, i = 1, …, k, we get k equations which in matrix notation can be written as

(1.17)  $\begin{pmatrix} \eta(\theta_1) - \eta(\theta_0) \\ \vdots \\ \eta(\theta_k) - \eta(\theta_0) \end{pmatrix} \big[T(x) - T(x_0)\big] = \begin{pmatrix} \tilde\eta(\theta_1) - \tilde\eta(\theta_0) \\ \vdots \\ \tilde\eta(\theta_k) - \tilde\eta(\theta_0) \end{pmatrix} \big[\tilde T(x) - \tilde T(x_0)\big].$
Since η(θi ) − η(θ0 ), i = 1, . . . , k span Rk we can invert the left-most matrix and obtain an equation
of the form T (x) = AT̃ (x) + b, where A is a constant (in x) k × l matrix and b a constant vector.
Notice that since {T (x) : x ∈ X } is not contained in any (k − 1)-dimensional hyperplane, the image
of A must have dimension k; therefore rank(A) = k and thus l ≥ k.
Using the same reasoning, and the fact that {Ti (·) : i = 1, . . . k} are affinely independent, we can
find x0 , . . . , xk so that T (xi ) − T (x0 ), i = 1, . . . , k span Rk and working in a similar way as before
we can obtain an equation of the form η(θ) = C η̃(θ) + d with C a constant k × l matrix and d a
constant vector.
This proves that if in (1.12) {ηi (·) : i = 1, . . . , k} and {Ti (·) : i = 1, . . . , k} are affinely independent,
then any other representation must have dimension l ≥ k and therefore P is strictly k-parameter.
Remark. We have now seen that a representation is minimal if and only if it is affinely independent. In
fact, often in the literature, a representation is called minimal if it is affinely independent.
If X ∼ f(x; θ), then T = (T₁(X), …, T_k(X)) is a random vector. Let Covθ(T) be its covariance matrix under f(x; θ). The following gives a condition for the statistics {Tᵢ} to be affinely independent.
Proposition 1.10. The functions Ti are P-affinely independent if and only if for some θ, and thus
for all θ, Covθ (T ) is positive definite.
Proof. As before let X ∼ f(x; θ). First of all notice that the following are all equivalent:
so X belongs to a 3-parameter exponential family, but I1 (x) + I2 (x) + I3 (x) = 1 so it is not strictly
3-dimensional. Indeed,
$p(x;\theta) = \exp\Big( I_1(x) \log\frac{p_1}{1-(p_1+p_2)} + I_2(x) \log\frac{p_2}{1-(p_1+p_2)} + \log(p_3) \Big).$
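A small numerical illustration (an added sketch, not from the notes): for a single trial with three categories and the probabilities p₁, p₂, p₃ chosen arbitrarily, the indicators satisfy I₁ + I₂ + I₃ = 1, so their covariance matrix is singular, while that of (I₁, I₂) alone is positive definite, matching Proposition 1.10.

```python
import numpy as np

# One multinomial trial with category probabilities p1, p2, p3.
p = np.array([0.2, 0.3, 0.5])

# Exact covariance of the indicators (I1, I2, I3): diag(p) - p p^T.
cov3 = np.diag(p) - np.outer(p, p)

# I1 + I2 + I3 = 1 is an affine relation, so Cov(I1, I2, I3) is singular...
assert abs(np.linalg.det(cov3)) < 1e-12

# ...while dropping I3 leaves an affinely independent pair with a
# positive definite covariance matrix (Proposition 1.10).
cov2 = cov3[:2, :2]
assert np.all(np.linalg.eigvalsh(cov2) > 0)
```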
i.e. the set of θ for which the integrand can be normalized to become a probability density function.
i.e. the set of η for which we can define $B(\eta) := \log \int h(x) \exp\big( \sum_{i=1}^{k} \eta_i T_i(x) \big)\, dx < \infty$, so that

$\tilde f(x;\eta) = e^{-B(\eta)}\, h(x) \exp\Big( \sum_{i=1}^{k} \eta_i T_i(x) \Big)$

is a pdf/pmf on X.
Observe that we always have η(Θ) ⊆ Ξ, although it may be the case that η(Θ) ≠ Ξ.
Theorem 1.13. The natural parameter space Ξ of a strictly k-parameter exponential family is
convex and contains a non-empty k-dimensional ball.
Proof. Take η, η′ ∈ Ξ and let α ∈ (0, 1). Recall $B(\eta) = \log \int \exp\big(\sum_i \eta_i T_i(x)\big) h(x)\, dx$. Then

(1.23)  $B(\alpha\eta + (1-\alpha)\eta') = \log \int \exp\Big( \alpha \sum_i \eta_i T_i(x) + (1-\alpha) \sum_i \eta'_i T_i(x) \Big) h(x)\, dx$
(1.24)  $\phantom{B(\alpha\eta + (1-\alpha)\eta')} = \log \int \Big( \exp\big(\textstyle\sum_i \eta_i T_i(x)\big) h(x) \Big)^{\alpha} \Big( \exp\big(\textstyle\sum_i \eta'_i T_i(x)\big) h(x) \Big)^{1-\alpha} dx$
$\phantom{B(\alpha\eta + (1-\alpha)\eta')} \le \alpha B(\eta) + (1-\alpha) B(\eta') < \infty,$

where the inequality is Hölder's inequality with exponents 1/α and 1/(1 − α). Hence αη + (1 − α)η′ ∈ Ξ, so Ξ is convex.
Definition 1.14. If the image of the parameter space η(Θ) ⊆ Ξ for a strictly k-parameter exponential family contains a k-dimensional open set, then the family is called full rank.
Full rank is also sometimes referred to as regular . See the discussion of curved exponential families
below for examples of families which are not full rank.
Remark. In the one dimensional case, being full-rank is easily checked: it suffices that η(Θ) contains an
interval.
Theorem 1.15. Let P be a strictly k-parameter exponential family with natural parameter space Ξ. Then for all η ∈ Int(Ξ), the moment generating function of T(X) = (T₁(X), …, T_k(X)) is finite in a neighbourhood of 0; in particular T(X) has moments of all orders, with

$E_\eta[T_i(X)] = \frac{\partial B}{\partial \eta_i}(\eta), \qquad \mathrm{Cov}_\eta\big(T_i(X), T_j(X)\big) = \frac{\partial^2 B}{\partial \eta_i \partial \eta_j}(\eta).$
Proof. For s = (s₁, …, s_k) consider the moment generating function of T(X), when X ∼ Pη:

(1.29)  $M_{T(X)}(s) = E_\eta\Big[ \exp\Big( \sum_{i=1}^{k} s_i T_i(X) \Big) \Big]$
(1.30)  $\phantom{M_{T(X)}(s)} = \int \exp\Big( \sum_{i=1}^{k} (\eta_i + s_i) T_i(x) - B(\eta) \Big) h(x)\, dx$
(1.31)  $\phantom{M_{T(X)}(s)} = \exp\big( B(\eta + s) - B(\eta) \big).$
Since by assumption η ∈ Int(Ξ), there exists a δ > 0, such that for all y ∈ B(η, δ) we have y ∈ Ξ
and therefore for all |s| < δ we have

$M_{T(X)}(s) = \exp\big( B(\eta + s) - B(\eta) \big) < \infty.$
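As an illustration (an added sketch, not part of the notes): for the Poisson family in natural form we have B(η) = e^η, so the formula M_T(s) = exp(B(η + s) − B(η)) recovers the familiar Poisson MGF exp(λ(eˢ − 1)) with λ = e^η. We can confirm this by direct summation of E[e^{sX}]:

```python
from math import exp, factorial, log

lam = 2.5
eta = log(lam)        # natural parameter of the Poisson family
B = lambda e: exp(e)  # B(eta) = e^eta for the Poisson family
s = 0.7

# M_T(s) via the theorem: exp(B(eta + s) - B(eta)).
mgf_formula = exp(B(eta + s) - B(eta))

# M_T(s) by direct summation of E[e^{sX}] (here T(x) = x).
mgf_direct = sum(exp(s * x) * lam**x * exp(-lam) / factorial(x)
                 for x in range(60))

assert abs(mgf_formula - mgf_direct) < 1e-9
```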
Positive-definiteness of the covariance matrix guarantees that k is not arbitrarily large. In fact it guarantees that the exponential family is strictly k-parameter, although the parameter space may form a lower-dimensional, non-linear submanifold of the natural parameter space.
Example (Normal distribution). Let the statistical model be the class of normal distributions N(µ, µ²), where µ ≠ 0 is unknown, with parameter θ = µ ∈ R \ {0}.

(1.36)  $p(x;\theta) = \frac{1}{\sqrt{2\pi}\,\mu} \exp\Big( -\frac{(x-\mu)^2}{2\mu^2} \Big) = \frac{1}{\sqrt{2\pi}\,\mu} \exp\Big( \frac{-x^2 + 2x\mu - \mu^2}{2\mu^2} \Big) = \frac{1}{\sqrt{2\pi}\,\mu} \exp\Big( -\frac{x^2}{2\mu^2} + \frac{x}{\mu} - \frac12 \Big).$
We thus have
T1 (x) = x, T2 (x) = x2 , η1 (θ) = µ−1 , η2 (θ) = −µ−2 /2.
This example satisfies the definition of curved exponential families because the parameter θ is one-dimensional but results in a 2-parameter exponential family. The covariance matrix can be calculated with the known moments:

$\mathrm{Cov}_\theta\, T = \begin{pmatrix} \mu^2 & 2\mu^3 \\ 2\mu^3 & 6\mu^4 \end{pmatrix}.$

The covariance matrix is positive definite for all θ ∈ Θ (the determinant is 2µ⁶ > 0). We see that T₁ and T₂ are P-affinely independent and the family is strictly 2-parameter (the ηᵢ are constrained, but not linearly).
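The entries of Cov_θ(T) above follow from the standard normal moment formulas E[X³] = µ³ + 3µσ² and E[X⁴] = µ⁴ + 6µ²σ² + 3σ⁴ with σ² = µ². A quick check (added here, for a sample value of µ):

```python
mu = 1.7
s2 = mu**2  # sigma^2 = mu^2 in this curved family

# Raw moments of N(mu, sigma^2).
EX  = mu
EX2 = mu**2 + s2
EX3 = mu**3 + 3*mu*s2
EX4 = mu**4 + 6*mu**2*s2 + 3*s2**2

# Covariance matrix of T = (X, X^2), entry by entry.
var_T1  = EX2 - EX**2   # should equal mu^2
cov_T12 = EX3 - EX*EX2  # should equal 2 mu^3
var_T2  = EX4 - EX2**2  # should equal 6 mu^4

assert abs(var_T1 - mu**2) < 1e-9
assert abs(cov_T12 - 2*mu**3) < 1e-9
assert abs(var_T2 - 6*mu**4) < 1e-9
# Determinant 2 mu^6 > 0, so Cov_theta(T) is positive definite for mu != 0.
assert abs(var_T1*var_T2 - cov_T12**2 - 2*mu**6) < 1e-9
```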
Example. Suppose X1 ∼ N (θ, 1) and X2 ∼ N ( θ1 , 1) are independent. Their joint distribution has
log-density
Finally, let us formulate the following important statement about the distribution of a sample of inde-
pendent r.v.s distributed according to a distribution from an exponential family:
(b) If X = (X₁, …, X_n) are i.i.d. samples from a k-parameter exponential family of the form (1.1) with functions η = (η₁, …, η_k) and T = (T₁, …, T_k), then the distribution of X belongs to a k-parameter exponential family with natural observation $T_{(n)}(x) := \sum_{i=1}^{n} T(x_i)$.
Remark. The fact that X = (X₁, X₂, …, X_n) also belongs to a k-parameter exponential family, independently of n, is quite important.
Chapter 2
Sufficiency and Minimality
2.1 Sufficiency
We may often be interested in summarising a set of data without losing any information about the
parameter we’re trying to estimate. A statistic that does this is said to be sufficient:
Definition 2.1. A statistic T(X) is said to be sufficient for θ if the conditional distribution of X given T does not depend on θ. That is,

f(x | t, θ) = f(x | t).
Remark. In particular, this means that for any function g the map θ 7→ Eθ [g(X) | T = t] is constant.
We can think of a sufficient statistic as ‘wrapping up’ all the information there is about θ somehow.
Example. Let X₁, …, X_n be i.i.d. Ber(p) and let $T(X) = \sum_{i=1}^{n} X_i$. Then

(2.1)  $f(x \mid t, p) = P(X = x \mid T = t, p) = \frac{P(X = x, T = t \mid p)}{P(T = t \mid p)}$
(2.2)  $\phantom{f(x \mid t, p)} = \frac{\prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}}{\binom{n}{t} p^t (1-p)^{n-t}}\, \mathbf{1}_{\sum x_i = t}$
(2.3)  $\phantom{f(x \mid t, p)} = \frac{p^t (1-p)^{n-t}}{\binom{n}{t} p^t (1-p)^{n-t}}\, \mathbf{1}_{\sum x_i = t} = \binom{n}{t}^{-1} \mathbf{1}_{\sum x_i = t},$

which does not depend on p, so T is sufficient.
The intuitive meaning of this is that only the number of successes matters for estimating p; the
order in which successes arrive shouldn’t change your guess for p.
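This can be checked by brute-force enumeration (an added sketch): for fixed t, the conditional distribution of the sequence given T = t is uniform over the C(n, t) orderings, whatever p is.

```python
from itertools import product

def conditional_dist(n, t, p):
    # P(X = x | T = t) over all binary sequences x with sum t.
    seqs = [x for x in product((0, 1), repeat=n) if sum(x) == t]
    probs = [p**t * (1 - p)**(n - t) for _ in seqs]  # each ordering equally likely
    total = sum(probs)
    return [q / total for q in probs]

# Same conditional distribution for very different values of p:
d1 = conditional_dist(4, 2, 0.1)
d2 = conditional_dist(4, 2, 0.9)
assert max(abs(a - b) for a, b in zip(d1, d2)) < 1e-12
assert all(abs(q - 1/6) < 1e-12 for q in d1)  # C(4,2) = 6 equally likely orderings
```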
Theorem 2.2 (The Factorisation Criterion). Suppose X ∼ f(x; θ). Then a statistic T(X) is sufficient for θ if and only if f can be written as

f(x; θ) = g(T(x), θ) h(x)

for some functions g and h.
Proof for the discrete case. Suppose first that T is sufficient and write t = T(x). Then

f(x; θ) = Pθ(X = x) = Pθ(X = x, T = t) = Pθ(X = x | T = t) Pθ(T = t),

which is of the required form with h(x) = P(X = x | T = t) (independent of θ by sufficiency) and g(t, θ) = Pθ(T = t).
Conversely, suppose f(x; θ) = g(T(x), θ) h(x). Then

$P_\theta(T = t) = \sum_{x : T(x) = t} P_\theta(X = x) = \sum_{x : T(x) = t} f(x;\theta) = g(t,\theta) \sum_{x : T(x) = t} h(x),$

so that $P_\theta(X = x \mid T = t) = f(x;\theta)/P_\theta(T = t) = h(x)/\sum_{y : T(y) = t} h(y)$, which does not depend on θ.
The general case is much more complicated to prove; see [CP97, Example 6] for a proof using
disintegrations.
Corollary 2.3. Suppose P = {f(·; θ) : θ ∈ Θ} is an exponential family of the form (1.1). Then (T₁(x), …, T_k(x)) is sufficient for θ.
Remark. An important case of the above corollary is when X = (X₁, …, X_n) are i.i.d. samples from an exponential family P = {f(·; θ) : θ ∈ Θ}. In that case the above corollary implies that $T_{(n)}(x) = \big( \sum_{i=1}^{n} T_1(x_i), \dots, \sum_{i=1}^{n} T_k(x_i) \big)$ is sufficient.
The next natural question to ask is to what extent we can summarise a set of data — by how much
we can reduce it — without losing information about θ. This brings us to the concept of minimal
sufficiency .
Example. Let X1 , X2 , X3 be independent Ber(p) random variables modelling three coin tosses (so
0 means heads and 1 means tails). Consider the following four statistics:
1. T₁(X) = (X₁, X₂, X₃),
2. T₂(X) = (X₁, $\sum_{i=1}^{3} X_i$),
3. T₃(X) = $\sum_{i=1}^{3} X_i$,
4. T₄(X) = $\mathbf{1}_{T_3(X) = 0}$.
Which of these are sufficient for p?
2.2 Minimality
Suppose X takes values in X. A statistic T : X → 𝒯 induces a partition of the sample space X:

$\mathcal{X} = \bigcup_{t \in \mathcal{T}} \{ x \in \mathcal{X} : T(x) = t \};$
Example (continued). The following diagrams show the partitions induced by the statistics
T1 , . . . , T4 :
Summarising our data through a statistic can be thought of as keeping track only of the equivalence
class which contains our sample. Therefore a statistic T (equivalently a partition of the sample space)
is sufficient, if the conditional distribution of X given the equivalence class it belongs to is independent
of the parameter θ, i.e. the same for all distributions in our model.
A finer partition corresponds to keeping “more information” about our sample. We want to be as economical as possible, that is, to keep as little information about the sample as we can without losing any information about the parameter θ. In other words, among all sufficient statistics, we want to choose the one generating the coarsest partition.
Consider the case where X is a finite set, and let Π, Π′ be two partitions, such that Π′ is finer than Π,
that is each element of Π can be written as a union of elements of Π′ . Equivalently, we can express
this as a function sending multiple elements of Π′ to one element of Π. This leads us to the following
definition.
Definition 2.4. A statistic is minimal sufficient if it can be expressed as a function of any other
sufficient statistic.
Theorem 2.5. Suppose X ∼ f(x; θ). A statistic T is minimal sufficient if

$T(x) = T(y) \iff \frac{f(y;\theta)}{f(x;\theta)} \text{ is independent of } \theta.$
Remark. It is useful to think about the above result in terms of partitions: it says that in order to define
a partition corresponding to a minimal sufficient statistic, the likelihood ratio for any two elements in
the same class must be independent of θ.
Example (continued). In the coin-tossing example, first consider T₂. With x = TTH and y = HTT, we have f(x; p) = f(y; p) = p²(1 − p), so that f(y; p)/f(x; p) = 1, but clearly T₂(x) ≠ T₂(y), so T₂ is not minimal sufficient.
Considering T₄ instead, take x = HTH and y = TTT. Clearly T₄(x) = T₄(y), but f(y; p)/f(x; p) = p³/(p(1 − p)²) = p²/(1 − p)², which does depend on p. So T₄ is also not minimal sufficient.
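These checks can be mechanised (an added sketch, with 0/1 coding of the tosses as in the example): enumerate all pairs of outcomes and test whether the likelihood ratio is constant in p exactly when the statistic agrees.

```python
from itertools import product

def f(x, p):
    # Likelihood of a binary sequence x under Ber(p) tosses.
    t = sum(x)
    return p**t * (1 - p)**(len(x) - t)

def ratio_indep_of_p(x, y, ps=(0.1, 0.3, 0.7)):
    # Crude check that f(y;p)/f(x;p) does not vary over a grid of p values.
    ratios = [f(y, p) / f(x, p) for p in ps]
    return max(ratios) - min(ratios) < 1e-12

T3 = lambda x: sum(x)           # total number of 1s
T2 = lambda x: (x[0], sum(x))   # first toss and the total
T4 = lambda x: int(sum(x) == 0) # indicator that all tosses are 0

outcomes = list(product((0, 1), repeat=3))
pairs = [(x, y) for x in outcomes for y in outcomes]

# T3 induces exactly the likelihood-ratio partition: minimal sufficient.
assert all((T3(x) == T3(y)) == ratio_indep_of_p(x, y) for x, y in pairs)
# T2 is sufficient but too fine: some ratio-equivalent pairs are split.
assert any(ratio_indep_of_p(x, y) and T2(x) != T2(y) for x, y in pairs)
# T4 is too coarse: it merges pairs whose ratio depends on p.
assert any(T4(x) == T4(y) and not ratio_indep_of_p(x, y) for x, y in pairs)
```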
Proof of theorem. (⇐=) Suppose T is a statistic such that T(x) = T(y) if and only if f(y; θ)/f(x; θ) is equal to some k(x, y) independent of θ.
Sufficiency. In the discrete case,

(2.4)  $f(x \mid t, \theta) = P_\theta(X = x \mid T = t) = \frac{P_\theta(X = x)}{P_\theta(T = t)} = \frac{f(x;\theta)}{\sum_{y : T(y) = t} f(y;\theta)}$
(2.5)  $\phantom{f(x \mid t, \theta)} = \frac{f(x;\theta)}{\sum_{y : T(y) = t} f(x;\theta)\, k(x,y)}$
(2.6)  $\phantom{f(x \mid t, \theta)} = \Big( \sum_{y : T(y) = t} k(x,y) \Big)^{-1},$
which is independent of θ, so T is sufficient. For the continuous case, replace the sum with an
integral.
Minimality. Now suppose U : X → 𝒰 is another sufficient statistic and that U(x) = U(y) for some x, y. Since U is sufficient, by the factorisation criterion we have

$\frac{f(y;\theta)}{f(x;\theta)} = \frac{\tilde g(U(y), \theta)\, \tilde h(y)}{\tilde g(U(x), \theta)\, \tilde h(x)} = \frac{\tilde h(y)}{\tilde h(x)},$

which is independent of θ. So by hypothesis, T(x) = T(y), and hence Π_U is finer than Π_T, where Π_U, Π_T are the partitions induced by U, T respectively. It follows that the value U(x) determines the class of x in Π_U, hence the class of x in Π_T, and hence the value T(x); so T is a function of U, proving minimality.
( =⇒ ) Conversely, suppose T is minimal sufficient. Take x, y such that T (x) = T (y). Then by the
factorisation criterion,
$\frac{f(y;\theta)}{f(x;\theta)} = \frac{g(T(y), \theta)\, h(y)}{g(T(x), \theta)\, h(x)} = \frac{h(y)}{h(x)},$

which does not depend on θ. (Note this only used the sufficiency of T.)
For the other direction, we need to show that if f (x; θ)/f (y; θ) is independent of θ then T (x) = T (y).
Start by writing x ∼ y whenever f (x; θ) = k(x, y)f (y; θ) for all θ (for some function k(x, y)).
It is easy to check that this is an equivalence relation. For each equivalence class [x] choose a representative x̄ and define G to be the representative function (i.e. G(y) = x̄ for all y ∈ [x]). So G is a statistic constant on the equivalence classes. It is also sufficient, by the factorisation criterion, since f(x; θ) = k(x, x̄) f(x̄; θ) = k(x, G(x)) f(G(x); θ) for all x. So T is a function of G (by minimality) and hence is also constant on the equivalence classes, meaning x ∼ y ⟹ T(x) = T(y).
Theorem 2.6. Suppose the functions $f(x;\theta) = \exp\big( \sum_{j=1}^{k} \eta_j(\theta) T_j(x) - B(\theta) \big) h(x)$ form a strictly k-parameter exponential family. Let X = (X₁, …, X_n), where X₁, …, X_n ∼ f(x, θ) are i.i.d. Then

$T_{(n)} = \Big( \sum_{i=1}^{n} T_1(X_i), \dots, \sum_{i=1}^{n} T_k(X_i) \Big)$

is minimal sufficient.
Remark. Since the vector X = (X1 , . . . , Xn ) is strictly k-parameter exponential, we could just say
T (X) = (T1 (X), . . . , Tk (X)) is minimal sufficient.
Proof. For x = (x₁, …, x_n) and y = (y₁, …, y_n), the likelihood ratio is

$\frac{f(x;\theta)}{f(y;\theta)} = \exp\Big( \sum_{j=1}^{k} \eta_j(\theta) \Big[ \sum_{i=1}^{n} T_j(x_i) - \sum_{i=1}^{n} T_j(y_i) \Big] \Big) \prod_{i=1}^{n} \frac{h(x_i)}{h(y_i)},$

which is independent of θ if and only if $\sum_{i=1}^{n} T_j(x_i) = \sum_{i=1}^{n} T_j(y_i)$ for all j = 1, …, k.
Examples.
1. Bernoulli. Let X₁, …, X_n be i.i.d. Bernoulli trials with parameter p, and let $T(X) = \sum_{i=1}^{n} X_i$ be the number of successes. Then

$\frac{f((x_1,\dots,x_n); p)}{f((y_1,\dots,y_n); p)} = \frac{p^{T(x)} (1-p)^{n-T(x)}}{p^{T(y)} (1-p)^{n-T(y)}} = p^{T(x) - T(y)} (1-p)^{T(y) - T(x)},$

which is independent of p if and only if T(x) = T(y). Thus T(X) is minimal sufficient.
2. Gaussian. For X₁, …, X_n i.i.d. N(µ, σ²) with θ = (µ, σ²), the likelihood ratio is

$\frac{f(x;\theta)}{f(y;\theta)} = \exp\Big( \frac{\mu}{\sigma^2} \Big[ \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i \Big] - \frac{1}{2\sigma^2} \Big[ \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} y_i^2 \Big] \Big).$

This ratio is independent of θ if and only if $\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2$. Thus $T(X) = \big( \sum X_i, \sum X_i^2 \big)$ is minimal sufficient.
In closing our discussion of sufficiency, it is worth mentioning that although sufficiency is a nice property that allows us to reduce the data to a statistic of bounded dimension, it is not all that common. In fact, only exponential families and some families with bounded support admit finite-dimensional sufficient statistics for all sample sizes; this is the Pitman-Koopman-Darmois theorem.
Chapter 3
Fisher Information
Much of the material in this chapter can also be found (often with a nicer exposition) in your part A
course.
We turn to the question now of whether there is some nice way to measure ‘how much’ information a
given dataset contains about a particular parameter.
Definition 3.1. For each x ∈ X , the likelihood function L(·, x) : Θ → R+ is defined by L(θ, x) =
f (x, θ).
To simplify our analysis, we will need some regularity assumptions about our model. These will, pri-
marily, allow us to use partial derivatives and to interchange them with sums/integrals without worrying
too much (as we’ll see).
Reg 1. The distributions {f (·, θ) : θ ∈ Θ} have common support, so that A = {x : f (x, θ) > 0} is
independent of θ.
Reg 3. For all x ∈ A and for all θ ∈ Θ, the derivative ∂θ f(x, θ) exists and is finite. Furthermore, for each θ ∈ Θ, there exists a neighbourhood I ⊂ Θ and an integrable function g_I(x) such that |∂θ f(x, θ′)| ≤ g_I(x) for all θ′ ∈ I and x ∈ A.
Definition 3.2. When Regs 1–3 are satisfied, for x ∈ A we define the score function
$S(\theta, x) = \ell'(\theta, x) = \frac{\partial \log L(\theta, x)}{\partial \theta}.$
Now note the following handy fact (which is what motivates the regularity assumptions):
Definition 3.5. When Regs 1–3 are satisfied, we define the Fisher information to be

$I_X(\theta) = E_\theta\big[ (\ell'(\theta, X))^2 \big] = E_\theta\big[ S(\theta, X)^2 \big].$

Reg 4. Differentiation and integration can be interchanged twice:

$\frac{\partial^2}{\partial \theta^2} \int_A f(x, \theta)\, dx = \int_A \frac{\partial^2}{\partial \theta^2} f(x, \theta)\, dx \quad \text{(for continuous distributions)}$

or

$\frac{\partial^2}{\partial \theta^2} \sum_{x \in A} f(x, \theta) = \sum_{x \in A} \frac{\partial^2}{\partial \theta^2} f(x, \theta) \quad \text{(for discrete distributions)}$

for all θ ∈ Θ.
Definition 3.6 (Observed information). When Regs 1–4 are satisfied, for an observation X = x,
we define the observed information to be $J(\theta, x) = -\ell''(\theta, x)$.
This allows us to derive an alternative form for the Fisher information which will be much more commonly
used:
By Reg 4,
$$E_\theta\!\left[\frac{\partial_\theta^2 f(X,\theta)}{f(X,\theta)}\right] = \int_A \frac{\partial_\theta^2 f(x,\theta)}{f(x,\theta)}\, f(x,\theta)\, dx = \int_A \frac{\partial^2}{\partial\theta^2} f(x,\theta)\, dx = \frac{\partial^2}{\partial\theta^2}\int_A f(x,\theta)\, dx = 0,$$
and thus
$$-E_\theta[\ell''(\theta, X)] = E_\theta\!\left[\left(\frac{\partial_\theta f(X,\theta)}{f(X,\theta)}\right)^{\!2}\right] = E_\theta[(\ell'(\theta, X))^2].$$
1. (Information grows with sample size.) If X and Y are independent random variables,
then
I(X,Y ) (θ) = IX (θ) + IY (θ).
In particular, if Z = (X1 , . . . , Xn ) where the Xi are i.i.d. copies of X, then IZ (θ) = nIX (θ).
2. (Reparametrisation.) If θ = h(ξ) where h is differentiable, then the Fisher information of
X about ξ is
$$I_X^*(\xi) = I_X(h(\xi))\,[h'(\xi)]^2.$$
Proof. (for the second statement) The log-likelihood w.r.t. $P_{h(\xi)}$ is $\ell^*(\xi) = \ln p(x; h(\xi))$, thus the
score function is
$$S^*(\xi; x) = \frac{\partial}{\partial\xi} \ln p(x; h(\xi)) = \left.\frac{\partial}{\partial\theta} \ln p(x;\theta)\right|_{\theta=h(\xi)} h'(\xi),$$
and so
$$\operatorname{Var}_\xi(S^*(\xi, X)) = \operatorname{Var}_\xi(S(h(\xi), X))\,[h'(\xi)]^2 = I_X(h(\xi))[h'(\xi)]^2.$$
Example. Let X ∼ N(0, σ²) and consider the reparametrisation θ = h(σ) = σ², so that h′(σ) = 2σ.
Since $I_X(\sigma^2) = \frac{1}{2\sigma^4}$,
$$I_X^*(\sigma) = I_X(\sigma^2)[h'(\sigma)]^2 = \frac{1}{2\sigma^4}\cdot 4\sigma^2 = \frac{2}{\sigma^2}.$$
The direct way to compute $I_X^*(\sigma)$ is:
$$\ell^*(\sigma; x) = -\tfrac{1}{2}\log(2\pi) - \log(\sigma) - \frac{1}{2\sigma^2}x^2,$$
so that
$$S^*(\sigma; x) = -\frac{1}{\sigma} + \frac{1}{\sigma^3}x^2,$$
and
$$I_X^*(\sigma) = \operatorname{Var}_\sigma\!\left(\frac{1}{\sigma^3}X^2\right) = \frac{1}{\sigma^6}\operatorname{Var}_\sigma(X^2) = \frac{1}{\sigma^6}\cdot 2\sigma^4 = \frac{2}{\sigma^2}.$$
Reg 3′ . For all x ∈ A and for all θ ∈ Θ, the partial derivatives of L(θ, x) exist and are finite.
Reg 4′ . All second order partial derivatives of log-likelihood ℓ exist and they can all be commuted
with summation/integration over A.
Definition 3.9. When Regs 1, 2′, 3′ are satisfied, we define the score function to be
$$S(\theta, x) = \nabla_\theta \ell(\theta, x) = \left(\frac{\partial}{\partial\theta_1}\ell(\theta,x), \dots, \frac{\partial}{\partial\theta_k}\ell(\theta,x)\right)^{\!T}.$$
Definition 3.10. When Regs 1, 2′, 3′ are satisfied, we define the Fisher information matrix to
be
$$I_X(\theta) = \operatorname{Cov}_\theta(S(\theta, X)),$$
so that
$$I_X(\theta)_{jr} = E_\theta\!\left[\frac{\partial}{\partial\theta_j}\ell(\theta, X)\,\frac{\partial}{\partial\theta_r}\ell(\theta, X)\right].$$
Note the last line above used that the multi-dimensional score function also has zero expectation, which
can be shown much like in the one-dimensional case.
Theorem 3.11. Supposing Regs 1, 2′, 3′, 4′ hold, define the observed Fisher information
matrix J by
$$J(\theta, x)_{jr} = -\frac{\partial^2 \ell(\theta, x)}{\partial\theta_j \partial\theta_r} \quad \text{for } j, r = 1, \dots, k.$$
Then $I_X(\theta) = E_\theta[J(\theta, X)]$.
Chapter 4
Point estimation
See Garthwaite, Jolliffe and Jones p. 40; Liero and Zwanzig p. 71.
Definition 4.1. For any function g : Θ → Γ (for some set Γ), an estimator of γ = g(θ) is a
function T : X → Γ.
$$\operatorname{bias}(T, \theta) = E_\theta[T] - g(\theta).$$
Example. Suppose X = (X₁, …, Xₙ) is a sample of i.i.d. N(µ, σ²) random variables. Then
$\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$ is an unbiased estimator for µ, and $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \hat\mu)^2$ is an unbiased
estimator for σ².
Formally, suppose (X1 , . . . , Xn ) is a sample of i.i.d. Pθ -distributed random variables, where θ ∈ Θ is the
parameter. In general if X ∼ Pθ , then the moments mr = Eθ [X r ] for r = 1, 2, . . . depend on θ.
Definition 4.3. For each k = 1, …, r let $\hat m_k = \frac{1}{n}\sum_{i=1}^n X_i^k$. Then the moment estimator for γ
is defined as
$$\hat\gamma_{\text{MME}} = h(\hat m_1, \dots, \hat m_r).$$
Example. Suppose X₁, …, Xₙ are i.i.d. Poisson with parameter λ > 0. Since m₁ = E[X₁] = λ, we
can use the sample mean $\hat m_1 = \frac{1}{n}\sum_{i=1}^n X_i$, so that
$$\hat\lambda_{\text{MME}} = \hat m_1 = \frac{1}{n}\sum_{i=1}^n X_i.$$
On the other hand, Var(Xᵢ) = λ as well, so writing Var(Xᵢ) = m₂ − m₁² we can also use the estimator
$$\hat\lambda_{\text{MME}} = \hat m_2 - \hat m_1^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2.$$
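A quick numerical illustration of the two moment estimators for the Poisson rate (the sampler, sample size and seed are arbitrary choices; both estimators should land near the true λ):

```python
import random
import statistics

def poisson_mme(xs):
    # Two method-of-moments estimators for the Poisson rate lambda:
    # via the first moment (m1 = lambda) and via the variance (m2 - m1^2 = lambda).
    m1 = statistics.fmean(xs)
    m2 = statistics.fmean(x * x for x in xs)
    return m1, m2 - m1 * m1

def sample_poisson(lam, rng):
    # Knuth-style Poisson sampler (fine for small lam).
    L, k, p = pow(2.718281828459045, -lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(7)
lam = 3.0
xs = [sample_poisson(lam, rng) for _ in range(100_000)]
est_mean, est_var = poisson_mme(xs)
```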
Theorem 4.5 (The Invariance Property). If γ = g(θ) and g is bijective, then $\hat\theta$ is an MLE for
θ if and only if $\hat\gamma = g(\hat\theta)$ is an MLE for γ.
$$\frac{\partial L(\theta; x)}{\partial\theta_j} = 0, \quad j = 1, \dots, k.$$
Note that it is sometimes easier to work with the log-likelihood ℓ instead of L, and that maximising L
is equivalent to maximising ℓ.
2. One then has to show that the solution to this system of equations is a local maximum (by considering
the Hessian matrix) and, if the solution is not unique, to show that it is a global maximum.
3. Notice that in some cases the maximum may occur at a boundary point, in which case the partial
derivatives need not vanish there.
Lemma 4.6 (MLE for exponential families). Consider a k-parameter exponential family P =
{f (·; η) : η} in natural parameterisation
$$L(\eta; x) = \exp\left(\sum_{j=1}^k \eta_j T_j(x) - B(\eta)\right) h(x). \tag{4.1}$$
Proof. The first statement follows trivially from Theorem 1.15. The second statement also follows
from Theorem 1.15: note that
$$\operatorname{Cov}_\eta(T) = \left(\frac{\partial^2 B(\eta)}{\partial\eta_i \partial\eta_j}\right)_{ij}.$$
By Propositions 1.9, 1.10, $\operatorname{Cov}_\eta(T)$ is strictly positive definite. Therefore the log-likelihood
is strictly concave, and so any local maximum is also a global maximum.
Definition 4.7. The mean squared error (MSE) of an estimator T for g(θ) is defined as
Proof. Exercise.
Example. Let X = (X₁, …, Xₙ) be a sample of i.i.d. U(0, θ) random variables. Then $\hat\theta_{\text{MLE}} =
X_{\max} = \max\{X_i : i = 1, \dots, n\}$, and
$$\operatorname{MSE}_\theta(X_{\max}) = \frac{2\theta^2}{(n+1)(n+2)}.$$
However, the estimator $\hat\theta = \frac{n+1}{n} X_{\max}$ is unbiased, and indeed
$$\operatorname{MSE}_\theta(\hat\theta) = \frac{\theta^2}{n(n+2)} < \operatorname{MSE}_\theta(\hat\theta_{\text{MLE}}).$$
Although we typically write $\hat\theta = T(X)$, with X ∼ f(·; θ), for an estimator, the notation is overloaded
when considering $X_1, \dots, X_n \overset{\text{iid}}{\sim} f(\cdot;\theta)$ with n varying. In this case we understand the estimator and the
statistic as a family of estimators $\{\hat\theta_n\}_n$ with $\hat\theta_n := T_n(X_1, \dots, X_n)$ and $T_n : \mathcal{X}^n \to \mathbb{R}^k$ for some k.
For most statistics, it is quite easy to see the dependence on n; e.g. for the sample mean Tₙ(x) =
(x₁ + ⋯ + xₙ)/n, or Tₙ(x) = minᵢ≤ₙ xᵢ, etc.
As the sample size n → ∞ we expect a good family of estimators $\hat\theta_n = T_n(X_{1:n})$ to eventually recover
the value of the parameter we are estimating exactly; that is, when $X_1, \dots, X_n \overset{\text{iid}}{\sim} f(\cdot;\theta)$ we expect
$T_n(X_{1:n}) \to \theta$ in some sense, e.g. in probability, almost surely or in MSE. This is the concept of
consistency.

Definition 4.9 (Consistency). Let $X_1, \dots, X_n \overset{\text{iid}}{\sim} f(\cdot;\theta)$ and consider the family of estimators
$\hat\theta_n = T_n(X_{1:n})$. We say that $T_n(X_{1:n})$ is
• consistent in MSE if
$$\lim_{n\to\infty} \operatorname{MSE}_\theta(T_n) = 0;$$
• strongly consistent if $T_n(X_{1:n}) \to \theta$ $P_\theta$-almost surely as n → ∞.
Chapter 5
MVUEs and the Cramer-Rao Lower Bound
Now that we have developed a few techniques for estimating a parameter, we need to evaluate how well
various estimators actually work.
Suppose X = (X1 , . . . , Xn ) is a random sample from the distribution Pθ . What is a ‘good’ estimator of
θ?
Definition 5.1. We say T₁ is a uniformly better estimator than T₂ (or better in quadratic
mean) if for all θ ∈ Θ,
$$\operatorname{MSE}_\theta(T_1) \leq \operatorname{MSE}_\theta(T_2).$$
We say T is a minimum variance unbiased estimator (MVUE) for g(θ) if
• T is unbiased, and
• for all unbiased estimators T̃, Var_θ(T̃) ⩾ Var_θ(T) ∀θ ∈ Θ.
Theorem 5.3 (Cramer-Rao Lower Bound (CRLB) in 1 dimension). Suppose Regs 2–4 hold
and that 0 < I_X(θ) < ∞. Let γ = g(θ) where g is a continuously differentiable real-valued function
with g′ ≠ 0. Then for any regular unbiased estimator T of γ,
$$\operatorname{Var}_\theta(T) \geq \frac{|g'(\theta)|^2}{I_X(\theta)}, \quad \forall\theta\in\Theta,$$
with equality if and only if
$$T(x) - g(\theta) = \frac{g'(\theta)\,S(\theta, x)}{I_X(\theta)} \quad \forall x \in A,\ \forall\theta\in\Theta.$$
In particular, for γ = θ the estimator T attains the CRLB if and only if S(θ, x) = I_X(θ)(T(x) − θ)
for all x ∈ A and θ ∈ Θ, or equivalently $T(x) = \theta + \frac{S(\theta,x)}{I_X(\theta)}$.
\begin{align}
\operatorname{Cov}_\theta(T, S(\theta,X)) &= E_\theta[T\,S(\theta,X)] \quad\text{since } E_\theta[S(\theta,X)] = 0 \tag{5.1}\\
&= \int_{\mathcal X} T(x)\,\frac{\partial \log p(x,\theta)}{\partial\theta}\, p(x,\theta)\, dx \tag{5.2}\\
&= \int_{\mathcal X} T(x)\,\frac{\partial p(x,\theta)}{\partial\theta}\, dx \tag{5.3}\\
&= \frac{\partial}{\partial\theta}\int_{\mathcal X} T(x)\,p(x,\theta)\, dx \quad\text{(note this is where we need $T$ regular)} \tag{5.4}\\
&= \frac{\partial}{\partial\theta} E_\theta[T] = g'(\theta). \tag{5.5}
\end{align}
Now set c(θ) := g′(θ)/I_X(θ). Then
\begin{align}
0 \leq \operatorname{Var}_\theta(T - c(\theta)S(\theta,X)) &= \operatorname{Var}_\theta(T) + c^2(\theta)\operatorname{Var}_\theta(S(\theta,X)) - 2c(\theta)\operatorname{Cov}_\theta(T, S(\theta,X)) \tag{5.6}\\
&= \operatorname{Var}_\theta(T) + c^2(\theta)I_X(\theta) - 2c(\theta)g'(\theta) \tag{5.7}\\
&= \operatorname{Var}_\theta(T) - \frac{|g'(\theta)|^2}{I_X(\theta)}, \tag{5.8}
\end{align}
which is the CRLB. We have equality if and only if T − c(θ)S(θ, X) is almost surely constant, and
in that case it must be equal to its expectation g(θ).
Corollary 5.4. Suppose that X ∼ f(·; θ) for θ ∈ Θ. Under regularity conditions, if some estimator
$\hat\gamma$ of γ = g(θ) attains the CRLB, it follows that {f(·; θ) : θ} is in some exponential family.

Proof. If $\hat\gamma = T(x)$ attains the CRLB then we know that
$$T(x) = g(\theta) + \frac{g'(\theta)}{I_X(\theta)}\,\frac{\partial}{\partial\theta}\log f(x;\theta).$$
Rearranging we have
$$\frac{\partial}{\partial\theta}\log f(x;\theta) = (T(x) - g(\theta))\,\frac{I_X(\theta)}{g'(\theta)}.$$
By the fundamental theorem of calculus we have that
Corollary 5.5. Suppose that E_θ[T(X)] = θ + b(θ) (so that b(θ) is the bias of T) and that T is
regular. Then
$$\operatorname{Var}_\theta(T(X)) \geq \frac{|1 + b'(\theta)|^2}{I_X(\theta)}.$$
Example. Suppose X ∼ Bin(n, θ), where n is known. Our parameter of interest will be γ = θ(1−θ)
(so g ′ (θ) = 1 − 2θ). Hence
$$\ell(\theta, x) = \log\binom{n}{x} + (n-x)\log(1-\theta) + x\log\theta,$$
and therefore
$$S(\theta, x) = -\frac{n-x}{1-\theta} + \frac{x}{\theta},$$
so
$$\frac{\partial}{\partial\theta}S(\theta, x) = -\frac{n-x}{(1-\theta)^2} - \frac{x}{\theta^2}.$$
Thus the Fisher information is
\begin{align}
I_X(\theta) &= -E_\theta\!\left[\frac{\partial}{\partial\theta}S(\theta, X)\right] \tag{5.9}\\
&= \frac{n - E_\theta[X]}{(1-\theta)^2} + \frac{E_\theta[X]}{\theta^2} = \frac{n}{(1-\theta)\theta}. \tag{5.10}
\end{align}
Observe that $T(x) = \frac{1}{n-1}\,x\left(1 - \frac{x}{n}\right)$ is unbiased for γ (check this as an exercise) and
$$\operatorname{Var}_\theta(T) = \frac{\theta}{n} - \frac{\theta^2(5n-7) - 4\theta^3(2n-3) + \theta^4(4n-6)}{n(n-1)},$$
which is larger than the CRLB of $\frac{(1-2\theta)^2\theta(1-\theta)}{n}$.
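Since X takes only the values 0, …, n, both the variance of T and the CRLB can be computed exactly by enumerating the binomial pmf; the following sketch checks the closed-form expressions above (n = 10 and θ = 0.3 are arbitrary):

```python
from math import comb

def binom_pmf(n, theta, x):
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

def mean_var_T(n, theta):
    # Exact mean and variance of T(x) = x(1 - x/n)/(n-1) under Bin(n, theta),
    # by enumerating the pmf over x = 0, ..., n.
    pairs = [(x * (1 - x / n) / (n - 1), binom_pmf(n, theta, x)) for x in range(n + 1)]
    mean = sum(t * p for t, p in pairs)
    var = sum((t - mean) ** 2 * p for t, p in pairs)
    return mean, var

n, theta = 10, 0.3
mean_T, var_T = mean_var_T(n, theta)
closed_form = theta / n - (theta**2 * (5 * n - 7) - 4 * theta**3 * (2 * n - 3)
                           + theta**4 * (4 * n - 6)) / (n * (n - 1))
crlb = (1 - 2 * theta) ** 2 * theta * (1 - theta) / n
```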
5.2 Efficiency
Definition 5.6. The efficiency of an unbiased estimator T of g(θ) is the ratio of the CRLB to its
variance, that is
$$e(T, \theta) = \frac{[g'(\theta)]^2}{I_X(\theta)\operatorname{Var}_\theta(T)}.$$
An unbiased estimator which attains the CRLB is called efficient.
From the proof of the CRLB we know that for an unbiased estimator T of g(θ):
Thus by (5.11),
$$\frac{[g'(\theta)]^2}{\operatorname{Var}_\theta(S(\theta; X))\operatorname{Var}_\theta(T(X))} = 1,$$
but since $\operatorname{Var}_\theta(S(\theta; X)) = I_X(\theta)$ we have that
$$\operatorname{Var}_\theta(T(X)) = \frac{[g'(\theta)]^2}{I_X(\theta)}.$$
Definition 5.8. Let T, T ∗ be two unbiased estimators for γ. We say that T ∗ has a smaller
covariance matrix than T at θ ∈ Θ if
$$u^T(\operatorname{Cov}_\theta T^* - \operatorname{Cov}_\theta T)u \leq 0 \quad \forall u \in \mathbb{R}^m,$$
where $D_\theta g$ is the Jacobian matrix, so $(D_\theta g)(\theta)_{ij} = \dfrac{\partial g_i(\theta)}{\partial\theta_j}$.
Example. Let X = (X1 , . . . , Xn ) be a random sample of N (µ, σ 2 ) random variables, where our
parameter of interest is θ = (µ, σ 2 ). Recall from Part A Statistics that
$$I_X(\theta) = \begin{pmatrix} n/\sigma^2 & 0 \\ 0 & n/2\sigma^4 \end{pmatrix}.$$
The estimators $\bar X$ and $S^2$ are independent, with $\operatorname{Var}(\bar X) = \frac{\sigma^2}{n}$ and $\operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1}$. We can see that
the CRLB is not attained.
Theorem 5.10. Under Regs 1, 2′ , 3′ , 4′ , if θbMLE is the MLE for θ and if there exists θ̃ which is
unbiased and attains the CRLB, then θ̃ = θbMLE almost surely.
Proof. For simplicity we prove this statement in the case of a real-valued parameter. Suppose that
θ̃ is an efficient estimator of θ. Then we know that
Since θbMLE (x) maximizes ℓ(·; x) we have S(θbMLE (x); x) = 0 and thus
Remark. Note that θb may be unbiased and MVUE without attaining the CRLB, in which case the
conclusion above may be false.
Chapter 6
The Rao-Blackwell and Lehmann-Scheffé theorems
Of course, even when the CRLB is not achievable, we still want to be able to find a MVUE.
Theorem 6.1 (Rao-Blackwell Theorem). Let X ∼ P_θ and let T be a sufficient statistic. Let $\hat\gamma$
be an unbiased estimator for γ = g(θ) ∈ R^k. Define $\hat\gamma_T = E_\theta[\hat\gamma \mid T]$. Then:
1. $\hat\gamma_T$ is a function of T alone and does not depend on θ,
2. $E_\theta[\hat\gamma_T] = \gamma$ ∀θ ∈ Θ,
3. $\operatorname{Cov}_\theta(\hat\gamma_T) \preceq \operatorname{Cov}_\theta(\hat\gamma)$ (which reduces to $\operatorname{Var}_\theta(\hat\gamma_T) \leq \operatorname{Var}_\theta(\hat\gamma)$ in the case k = 1).
Intuitively, this says that any unbiased estimator can always be improved by conditioning on a sufficient
statistic. The last statement says that if some unbiased estimator cannot be improved by conditioning
on a sufficient statistic, then the estimator is essentially a function of the statistic.
Remark. Notice that the Rao-Blackwell theorem implies that if an unbiased estimator is a function of a
minimal sufficient statistic, then it cannot be improved by conditioning on any sufficient statistic. Why
is that? Notice that a minimal sufficient statistic T can be written as a function of any other sufficient
statistic T′; therefore if $\hat\gamma$ is an unbiased estimator, we have that $\hat\gamma_T$ is a function of T and therefore a
function of T′. Therefore $E[\hat\gamma_T \mid T'] = \hat\gamma_T$ and we are in the equality case of the Rao-Blackwell theorem.
\begin{align}
\operatorname{Var}_\theta(\hat\gamma) = E_\theta[(\hat\gamma - \gamma)^2] &= E_\theta[(\hat\gamma - \hat\gamma_T + \hat\gamma_T - \gamma)^2] \tag{6.1}\\
&= E_\theta[E_\theta[(\hat\gamma - \hat\gamma_T + \hat\gamma_T - \gamma)^2 \mid T]] \tag{6.2}\\
&= E_\theta[E_\theta[(\hat\gamma - \hat\gamma_T)^2 \mid T] + 2E_\theta[(\hat\gamma - \hat\gamma_T)(\hat\gamma_T - \gamma) \mid T] + E_\theta[(\hat\gamma_T - \gamma)^2 \mid T]] \tag{6.3}\\
&= E_\theta[E_\theta[(\hat\gamma - \hat\gamma_T)^2 \mid T]] + 0 + E_\theta[E_\theta[(\hat\gamma_T - \gamma)^2 \mid T]] \tag{6.4}\\
&= E_\theta[\operatorname{Var}_\theta(\hat\gamma \mid T)] + \operatorname{Var}_\theta(\hat\gamma_T) \tag{6.5}\\
&\geq \operatorname{Var}_\theta(\hat\gamma_T). \tag{6.6}
\end{align}
Similarly, in the multivariate case,
\begin{align}
\operatorname{Cov}_\theta(\hat\gamma) &= E_\theta[(\hat\gamma - \gamma)(\hat\gamma - \gamma)^t] \tag{6.7}\\
&= E_\theta[(\hat\gamma - \hat\gamma_T)(\hat\gamma - \hat\gamma_T)^t] + E_\theta[(\hat\gamma_T - \gamma)(\hat\gamma_T - \gamma)^t] + E_\theta[(\hat\gamma - \hat\gamma_T)(\hat\gamma_T - \gamma)^t] + E_\theta[(\hat\gamma_T - \gamma)(\hat\gamma - \hat\gamma_T)^t] \tag{6.8}\\
&= E_\theta[(\hat\gamma - \hat\gamma_T)(\hat\gamma - \hat\gamma_T)^t] + \operatorname{Cov}_\theta(\hat\gamma_T) + E_\theta[(\hat\gamma - \hat\gamma_T)(\hat\gamma_T - \gamma)^t] + E_\theta[(\hat\gamma_T - \gamma)(\hat\gamma - \hat\gamma_T)^t]. \tag{6.9}
\end{align}
The first term here is clearly positive semi-definite, and conditioning on T shows, as in the one-dimensional
case, that the cross terms are equal to zero. The result follows.
Example. Let X₁, …, Xₙ be i.i.d. Ber(θ) random variables. Note that $\hat\theta = X_1$ is unbiased for θ,
and that $T = \sum_{i=1}^n X_i$ is sufficient for θ. In this case,
\begin{align}
\hat\theta_T = E_\theta[X_1 \mid T = t] = P_\theta(X_1 = 1 \mid T = t) &= \frac{P_\theta(X_1 = 1, T = t)}{P_\theta(T = t)} \tag{6.10}\\
&= \frac{P_\theta\!\left(X_1 = 1, \sum_{i=2}^n X_i = t-1\right)}{\binom{n}{t}\theta^t(1-\theta)^{n-t}} \tag{6.11}\\
&= \frac{\theta\binom{n-1}{t-1}\theta^{t-1}(1-\theta)^{n-t}}{\binom{n}{t}\theta^t(1-\theta)^{n-t}} = \frac{t}{n}, \tag{6.12}
\end{align}
so $\hat\theta_T = T/n$.
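The conditional expectation $E_\theta[X_1 \mid T = t] = t/n$ can be verified by brute force for small n, enumerating all $2^n$ binary outcomes (n = 4 and θ = 0.3 are arbitrary; the answer should not depend on θ):

```python
from itertools import product

def cond_exp_x1(n, theta, t):
    # E_theta[X1 | sum_i X_i = t], by enumerating all 2^n Bernoulli outcomes.
    num = den = 0.0
    for xs in product((0, 1), repeat=n):
        if sum(xs) == t:
            p = theta ** sum(xs) * (1 - theta) ** (n - sum(xs))
            num += xs[0] * p
            den += p
    return num / den

n, theta = 4, 0.3
vals = [cond_exp_x1(n, theta, t) for t in range(n + 1)]
```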
Definition (Completeness). A statistical model {P_θ : θ ∈ Θ} is called complete if
$$E_\theta[h(X)] = 0\ \forall\theta\in\Theta \implies P_\theta(h(X) = 0) = 1\ \forall\theta\in\Theta.$$
A statistic T is called complete if the model $\{P_\theta^T : \theta \in \Theta\}$ is complete, i.e.
$$E_\theta[h(T)] = 0\ \forall\theta\in\Theta \implies P_\theta(h(T) = 0) = 1\ \forall\theta\in\Theta.$$
Examples.
1. Suppose the statistical model consists only of the two distributions N(1, 2) and N(0, 1). This
model is not complete: take h(x) = (x − 1)² − 2. For both distributions, E[h(X)] = 0, but
$h(x) \neq 0$ for all $x \neq 1 \pm \sqrt{2}$.
2. The statistical model {U(0, θ) : θ ∈ R₊} is complete. Indeed, suppose $0 = E_\theta[h(X)] =
\frac{1}{\theta}\int_0^\theta h(x)\, dx$ for all θ > 0. Then
$$\frac{\partial}{\partial\theta}\int_0^\theta h(x)\, dx = 0 \quad \forall\theta > 0.$$
But $\frac{\partial}{\partial\theta}\int_0^\theta h(x)\, dx = h(\theta)$ almost everywhere (by the Lebesgue differentiation theorem), so we
conclude that h(x) = 0 almost everywhere. Since the variable X ∼ P_θ has a density we conclude
that P_θ(h(X) = 0) = 1.
3. If X₁, …, Xₙ are i.i.d. U(0, θ) then X_max is a complete statistic. Indeed, the density of X_max
is
$$f_\theta(t) = \frac{nt^{n-1}}{\theta^n}\,\mathbb{1}_{t\in[0,\theta]}.$$
Then if $0 = E_\theta[h(X_{\max})] = \int_{-\infty}^\infty h(t)f_\theta(t)\, dt = \frac{n}{\theta^n}\int_0^\theta h(t)t^{n-1}\, dt$ for all θ ∈ Θ, we get
$0 = \int_0^\theta h(t)t^{n-1}\, dt$ for all θ ∈ Θ. Thus, again using the Lebesgue differentiation theorem (which
says that as a function of θ this integral is differentiable with derivative $h(\theta)\theta^{n-1}$ almost everywhere),
we have that $h(t)t^{n-1} = 0$ for almost every t, so h = 0 almost everywhere, and we conclude as above.
Theorem. Let P = {p(·; θ) : θ ∈ Θ} be a strictly k-parameter exponential family. The joint
distribution of the natural observation vector T(X) = (T₁(X), …, T_k(X)), with X ∼ p(·; θ), belongs
to a strictly k-parameter exponential family with natural parameters η₁(θ), …, η_k(θ).
Proof. We only deal with the discrete case. Fix some vector y ∈ R^k and let T_y = {x : T(x) = y}.
Then
\begin{align}
P_\theta(T = y) &= \sum_{x\in T_y} P_\theta(X = x) \tag{6.13}\\
&= \sum_{x\in T_y} h(x)\exp\left(\sum_{i=1}^k \eta_i(\theta)y_i - B(\theta)\right) \tag{6.14}\\
&= \exp\left(\sum_{i=1}^k \eta_i(\theta)y_i - B(\theta)\right)\sum_{x\in T_y} h(x) \tag{6.15}\\
&= \exp\left(\sum_{i=1}^k \eta_i(\theta)y_i - B(\theta)\right)h_0(y), \tag{6.16}
\end{align}
where $h_0(y) := \sum_{x\in T_y} h(x)$.
(Sketch). The idea is to use uniqueness of Laplace transforms or, closer to home, uniqueness
of the moment generating function.
Since the model is full rank we know that η(Θ) contains a k-dimensional interval. Notice that
by redefining h₀ we can shift the parameter vector η(θ) by any fixed vector without changing the
exponential family; thus we can assume w.l.o.g. that for some a > 0 we have [−a, a]^k ⊂ η(Θ). In
particular 0 ∈ η(Θ) and therefore $\int \phi(t)h_0(t)\, dt = 0$. We decompose ϕ = ϕ⁺ − ϕ⁻ into its positive
and negative parts, that is ϕ⁺ = max{ϕ, 0} and ϕ⁻ = −min{ϕ, 0}. Notice that this decomposes
the range of the statistic T into two disjoint sets, T⁺ = {t : ϕ(t) ≥ 0} and T⁻ = {t : ϕ(t) < 0},
with ϕ⁺ supported on T⁺ and ϕ⁻ on T⁻. Then we have that
$$\int \phi^+(t)h_0(t)\, dt = \int \phi^-(t)h_0(t)\, dt = w.$$
If w > 0, the measures $\lambda^\pm(dt) := w^{-1}\phi^\pm(t)h_0(t)\, dt$ are probability measures satisfying
$$\int e^{\langle \eta, t\rangle}\,\lambda^+(dt) = \int e^{\langle \eta, t\rangle}\,\lambda^-(dt)$$
for all η ∈ [−a, a]^k, which by uniqueness of MGFs implies that λ⁺ = λ⁻. But this is impossible
since λ⁺(T⁺) = 1 whereas λ⁻(T⁺) = 0. Therefore it must be that w = 0, in which case we conclude
that ϕ⁺, ϕ⁻ ≡ 0 on the support of the exponential family and we are done.
We may then ask the question: does a complete sufficient statistic always exist? The answer is
no as the following example shows.
Now suppose that T̃ is a sufficient statistic. By minimality of T we know that there exists a function
g such that $x = T(x) = g \circ \tilde T(x)$. Consider then the function $t \mapsto \sin(2\pi g(t))$. We have
$$E_\theta\!\left[\sin(2\pi g(\tilde T))\right] = \int_\theta^{\theta+1} \sin(2\pi g \circ \tilde T(x))\, dx = \int_\theta^{\theta+1} \sin(2\pi x)\, dx = 0,$$
and thus T̃ cannot be complete. Since T̃ was an arbitrary sufficient statistic, we conclude that the
model admits no sufficient complete statistic.
The previous example shows that we are not always guaranteed to find sufficient complete statistics.
When we do have access to one however, the following theorem says we can use it to construct MVUEs.
Theorem 6.5 (Lehmann-Scheffé Theorem). Let T be a sufficient and complete statistic for
the statistical model P and let $\hat\gamma$ be an unbiased estimator for γ = g(θ) ∈ R^k.
Then $\hat\gamma_T = E_\theta[\hat\gamma \mid T]$ is an MVUE for γ.

Proof. Let γ̃ be another unbiased estimator of γ. Then $\tilde\gamma_T = E_\theta[\tilde\gamma \mid T]$ is also an unbiased estimator.
By definition $\tilde\gamma_T$ and $\hat\gamma_T$ are both functions of T, independent of θ by sufficiency, so we can define
$f(T) := \tilde\gamma_T - \hat\gamma_T$. Since both are unbiased estimators of γ we have that $E_\theta[f(T)] = 0$ for all θ, and
since T is complete we have that f(T) = 0 $P_\theta$-almost surely for all θ. This proves that $\tilde\gamma_T = \hat\gamma_T$ a.s.
and therefore that for all θ
$$\operatorname{Cov}_\theta(\hat\gamma_T) = \operatorname{Cov}_\theta(\tilde\gamma_T) \preceq \operatorname{Cov}_\theta(\tilde\gamma),$$
where the inequality follows from the Rao-Blackwell theorem. Since γ̃ was arbitrary the result
follows.
Examples.
4. (Poisson 2.) What about the other estimator above, $\tilde\lambda_{\text{MME}}$? Well, doing a little calculation
reveals that $X_i \mid \{\sum_{j=1}^n X_j = k\} \sim \operatorname{Bin}(k, 1/n)$. So, using Rao-Blackwell to 'improve' the
unbiased estimator $S^2 = \frac{n}{n-1}\tilde\lambda_{\text{MME}}$ by the sufficient statistic $\bar X$, we get
\begin{align}
E_\lambda\Big[S^2 \,\Big|\, \textstyle\sum_{j=1}^n X_j = k\Big] &= \frac{n}{n-1}\left(E_\lambda\Big[X_1^2 \,\Big|\, \textstyle\sum_{j=1}^n X_j = k\Big] - \frac{k^2}{n^2}\right) \tag{6.18}\\
&= \frac{n}{n-1}\left\{\frac{k}{n}\left(1 - \frac{1}{n}\right) + \frac{k^2}{n^2} - \frac{k^2}{n^2}\right\} \tag{6.19}\\
&= \frac{k}{n}. \tag{6.20}
\end{align}
So starting from S² as an unbiased estimator for λ we arrive at $\bar X$ by Rao-Blackwellising with
$\sum X_i$.
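The identity $E_\lambda[S^2 \mid \sum_j X_j = k] = k/n$ can be checked exactly for small n, k: conditionally on the total, the counts are multinomial with equal cell probabilities, so we can enumerate all compositions of k (the values of n and k below are arbitrary):

```python
from itertools import product
from math import factorial

def cond_exp_s2(n, k):
    # E[S^2 | sum_j X_j = k]: conditionally on the total, the counts are
    # Multinomial(k, (1/n, ..., 1/n)), so enumerate all compositions of k.
    total = 0.0
    xbar = k / n
    for xs in product(range(k + 1), repeat=n):
        if sum(xs) != k:
            continue
        p = factorial(k) / n**k  # multinomial probability with equal cells
        for x in xs:
            p /= factorial(x)
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        total += p * s2
    return total
```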
The Lehmann-Scheffé theorem allows us to use complete sufficient statistics to construct MVUEs. We
have also seen in Example 6 that there are situations in which no sufficient, complete statistic exists. The
question remains then, whether an MVUE always exists. The following theorem says that a necessary and
sufficient condition for an estimator to be MVUE is that it is uncorrelated with all unbiased estimators
of 0.
Theorem 6.6 (NASC for MVUE). Suppose that P = {P_θ : θ ∈ Θ} is a family of distributions
on X and let U be the set of unbiased estimators of 0 with finite variance, that is
$$\mathcal{U} := \left\{h : \mathcal{X} \to \mathbb{R} : E_\theta[h(X)] = 0,\ E_\theta[h^2(X)] < \infty\ \forall\theta\in\Theta\right\}.$$
Then an unbiased estimator $\hat\gamma$ of γ with finite variance is an MVUE if and only if $E_\theta[\hat\gamma\, h(X)] = 0$
for all h ∈ U and all θ ∈ Θ.
“⇐” Let γ̃ be another unbiased estimator of γ with finite variance. Then $h := \hat\gamma - \tilde\gamma \in \mathcal{U}$. Therefore
by assumption we have that
$$0 = E_\theta[\hat\gamma\, h] = E_\theta[\hat\gamma^2] - E_\theta[\hat\gamma\tilde\gamma].$$
Since $E_\theta[\hat\gamma] = E_\theta[\tilde\gamma]$, the above implies (using the Cauchy-Schwarz inequality) that
$$\operatorname{Var}_\theta[\hat\gamma]^2 = \operatorname{Cov}_\theta[\hat\gamma, \tilde\gamma]^2 \leq \operatorname{Var}_\theta[\hat\gamma]\operatorname{Var}_\theta[\tilde\gamma],$$
whence we conclude that $\operatorname{Var}_\theta[\hat\gamma] \leq \operatorname{Var}_\theta[\tilde\gamma]$.
Example (Example 6 continued). The above theorem allows us to establish that in Example 6
there exists no MVUE. Suppose that $\hat\gamma = T(X)$ is an unbiased estimator of γ = g(θ). Let H ∈ U
be any unbiased estimator of 0. Then we have
$$\int_\theta^{\theta+1} H(x)\, dx = 0,$$
which along with the fact that H(θ + 1) = H(θ) a.e. allows us to conclude that T(θ + 1) = T(θ) for
a.e. θ. Finally, since $\hat\gamma$ is unbiased we have
$$\int_\theta^{\theta+1} T(x)\, dx = g(\theta),$$
and differentiating in θ gives
$$T(\theta + 1) - T(\theta) = g'(\theta).$$
Combining everything, the only functions g for which an MVUE may exist are those that have
g′ = 0.
Chapter 7
Bayesian Inference: Conjugacy and Improper Priors
We turn now, in this second half of the course, to the Bayesian view of statistical inference, and look at
how we may develop further the theory from Part A.
Theorem 7.1 (Bayes’ Theorem). Given a likelihood L(θ, x) and a prior π(θ) for θ, the
posterior distribution for θ (the conditional distribution of θ given the data X) is given by
$$\pi(\theta \mid x) = \frac{L(\theta, x)\,\pi(\theta)}{\int L(\theta', x)\,\pi(\theta')\, d\theta'}.$$
The denominator is the marginal distribution of X.
Example. Suppose X ∼ Bin(n, θ), and that our prior distribution for θ is Beta(a, b), i.e.
$$\pi(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}, \quad 0 < \theta < 1.$$
The posterior distribution is then Beta(a + x, b + n − x).
Suppose we choose a, b here such that E[θ] = 0.7 and Var(θ) = 0.1. Suppose we then observe:
In the first case our posterior will have a mean of about 0.5 to 0.6, and in the second case our
posterior will have a mean of less than 0.4.
As n increases, the likelihood increasingly overwhelms the prior. This captures the intuition that
the second observation seems to be much stronger evidence than the first case that θ is in fact near
to 0.3.
Remark. This example illustrates the general effect at play in Bayesian inference: as we gather more
(relevant) data, effectively the information we have about the unknown parameter increases and we revise
our beliefs accordingly.
Definition 7.2. Consider a model (L(θ, x))_{θ∈Θ, x∈X}. We say that a family of prior distributions
(π_γ)_{γ∈Γ} is conjugate if for all γ ∈ Γ and x ∈ X, there exists γ(x) ∈ Γ such that π_γ(· | x) = π_{γ(x)}(·).
We say the prior and the posterior are conjugate distributions, and the prior is a conjugate
prior for the likelihood L.
In other words, a conjugate prior is a prior which, when combined with the likelihood, produces a
posterior distribution in the same family as the prior.
Examples. (Normal conjugacy) Consider the model Xᵢ ∼ N(µ, σ²) i.i.d. and let τ = σ⁻², θ =
(τ, µ). We are going to assume a prior of the following form for θ: τ ∼ Gamma(α, β) and, condi-
tionally on τ, we have µ | τ ∼ N(ν, (kτ)⁻¹), where k > 0 and ν ∈ R are given parameters. In other
words
\begin{align}
\pi(\tau, \mu) = \pi(\tau)\pi(\mu \mid \tau) &= \frac{\beta^\alpha}{\Gamma(\alpha)}\tau^{\alpha-1}e^{-\beta\tau}\cdot\frac{1}{\sqrt{2\pi}}\sqrt{k\tau}\exp\left(-\frac{k\tau}{2}(\mu - \nu)^2\right) \tag{7.4}\\
&\propto \tau^{\alpha-1/2}\exp\left(-\tau\left\{\beta + \frac{k}{2}(\mu - \nu)^2\right\}\right). \tag{7.5}
\end{align}
Since $L(\theta, x) = (2\pi)^{-n/2}\tau^{n/2}\exp\left(-\frac{\tau}{2}\sum_{i=1}^n(x_i - \mu)^2\right)$ we have that
$$\pi(\theta \mid x) \propto \tau^{\alpha + \frac{n-1}{2}}\exp\left(-\tau\left\{\beta + \frac{k}{2}(\mu - \nu)^2 + \frac{1}{2}\sum_{i=1}^n(x_i - \mu)^2\right\}\right). \tag{7.6}$$
By completing the square first in µ and then in ν it is a good exercise to check that
$$\beta + \frac{k}{2}(\mu-\nu)^2 + \frac{1}{2}\sum_{i=1}^n(x_i-\mu)^2 = \frac{k+n}{2}\left(\mu - \frac{k\nu + n\bar x}{k+n}\right)^2 + \frac{1}{2}\frac{nk}{n+k}(\bar x - \nu)^2 + \frac{1}{2}\sum_{i=1}^n(x_i - \bar x)^2 + \beta,$$
where each term can be easily identified. This shows that the prior is conjugate for this likelihood,
since
$$\pi(\tau, \mu \mid x) \propto \tau^{\alpha' - \frac{1}{2}}\exp\left(-\tau\left\{\beta' + \frac{k'}{2}(\mu - \nu')^2\right\}\right).$$
Proof. Exercise.
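Reading the hyperparameter updates (α′, β′, k′, ν′) off the completed square gives a simple update rule. The sketch below (hyperparameter names follow the text; the numeric values are arbitrary) exploits the fact that a conjugate update must give the same posterior whether the data are processed in one batch or one point at a time:

```python
def normal_gamma_update(alpha, beta, k, nu, xs):
    # Posterior hyperparameters for the Normal-Gamma prior
    # tau ~ Gamma(alpha, beta), mu | tau ~ N(nu, 1/(k*tau)),
    # read off from the completed square in the text.
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    alpha_new = alpha + n / 2
    k_new = k + n
    nu_new = (k * nu + n * xbar) / (k + n)
    beta_new = beta + 0.5 * ss + 0.5 * (n * k / (n + k)) * (xbar - nu) ** 2
    return alpha_new, beta_new, k_new, nu_new

# Conjugacy means batch and sequential updates must agree.
data = [1.2, -0.3, 0.7, 2.1]
batch = normal_gamma_update(2.0, 1.5, 1.0, 0.0, data)
seq = (2.0, 1.5, 1.0, 0.0)
for x in data:
    a, b, kk, v = seq
    seq = normal_gamma_update(a, b, kk, v, [x])
```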
Example. Let X = (X₁, …, Xₙ) be a sample of i.i.d. Poi(θ) random variables, so the (joint)
likelihood is
$$L(\theta, x) \propto \exp(-n\theta + T(x)\log\theta)$$
where $T(x) = \sum_{i=1}^n x_i$. So the natural conjugate prior is of the form
$$\pi(\theta) \propto \exp(\gamma_0\theta + \gamma_1\log\theta).$$
Writing β = −γ0 and α = γ1 +1, we have π(θ) ∝ θα−1 e−βθ which is the pdf of a Γ(α, β) distribution.
We can easily see that the posterior distribution is Γ(α + T (x), β + n). So indeed the Gamma
distribution is a conjugate prior (for the Poisson likelihood).
The above defines a proper prior when α₁, …, α_K > −1, so it is more natural to parameterise it as
$$p(\theta \mid \alpha) \propto \theta_1^{\alpha_1 - 1}\cdots\theta_K^{\alpha_K - 1}, \quad \sum_i \theta_i = 1,\ \theta_i \geq 0,$$
with likelihood
$$p(x_{1:N} \mid \theta) = \theta_1^{\sum_j x_{1j}}\,\theta_2^{\sum_j x_{2j}}\cdots\theta_K^{\sum_j x_{Kj}}.$$
Definition 7.4. We say that a pdf/pmf π is an improper prior if it has infinite mass:
$$\int_\Theta \pi(\theta)\, d\theta = \infty, \quad \pi(\theta) \geq 0\ \forall\theta\in\Theta.$$
Examples.
Exercise. If X is discrete and can take only finitely many values, say {z1 , . . . , zN } = X , show that
we can’t use an improper prior.
Hint: try proving that the marginal likelihood cannot be finite for all i = 1, . . . , N .
Does this argument work for X countably infinite? (Try X ∼ Po(λ), π(λ) = λ−1 .)
Definition 7.5. If X₁, …, Xₙ, Xₙ₊₁ are i.i.d. observations from the distribution f(x, θ), with prior
π(θ), then the posterior predictive distribution is
$$f(x_{n+1} \mid x) = \int_\Theta f(x_{n+1}, \theta)\,\pi(\theta \mid x)\, d\theta.$$
Thus the predictive distribution describes the distribution of a new observation given all the observations
we’ve already made.
Examples.
1. Poisson likelihood, Gamma prior. Suppose Y ∼ Poi(θ) and that our prior for θ is a
Γ(α, β) distribution.
The marginal likelihood for this model is
$$m(y) = \int_0^\infty e^{-\lambda}\frac{\lambda^y}{y!}\,\frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}\, d\lambda.$$
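The λ-integral above is a Gamma integral, so m(y) can also be written in closed form as $m(y) = \frac{\Gamma(\alpha+y)}{\Gamma(\alpha)\,y!}\frac{\beta^\alpha}{(\beta+1)^{\alpha+y}}$, a negative-binomial pmf in y. The sketch below checks this against direct numerical integration (grid size and parameter values are arbitrary choices):

```python
from math import exp, lgamma, log

def marginal_quadrature(y, alpha, beta, upper=60.0, steps=300_000):
    # Midpoint-rule approximation of
    # m(y) = int_0^inf Poi(y; lam) * Gamma(lam; alpha, beta) dlam.
    const = -lgamma(y + 1) + alpha * log(beta) - lgamma(alpha)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += exp(const - lam + (y + alpha - 1) * log(lam) - beta * lam)
    return total * h

def marginal_closed_form(y, alpha, beta):
    # Negative-binomial form obtained by recognising a Gamma integral.
    return exp(lgamma(alpha + y) - lgamma(alpha) - lgamma(y + 1)
               + alpha * log(beta) - (alpha + y) * log(beta + 1))

y, alpha, beta = 3, 2.0, 1.5
m_num = marginal_quadrature(y, alpha, beta)
m_exact = marginal_closed_form(y, alpha, beta)
```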
2. Gaussian with known variance. Suppose now that X1 , . . . , Xn+1 are i.i.d. N (θ, σ 2 ) random
variables, where σ 2 is known, and that our prior distribution for the mean is θ ∼ N (µ0 , σ02 ).
We want to predict Xn+1 , having seen X1 , . . . , Xn .
The posterior after the first n observations is
\begin{align}
\pi(\theta \mid x) \propto \pi(\theta)p(x \mid \theta) &\propto \exp\left[-\frac{1}{2\sigma_0^2}(\theta - \mu_0)^2\right]\prod_{i=1}^n\exp\left[-\frac{1}{2\sigma^2}(x_i - \theta)^2\right] \tag{7.9}\\
&\propto \exp\left[-\frac{1}{2}\left(\frac{1}{\sigma_0^2}(\theta - \mu_0)^2 + \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \theta)^2\right)\right] \tag{7.10}\\
&\propto \exp\left[-\frac{1}{2\sigma_n^2}(\theta - \mu_n)^2\right], \tag{7.11}
\end{align}
where, by completing the square, we find that
$$\mu_n = \frac{\sigma_0^{-2}\mu_0 + \sigma^{-2}\sum_{i=1}^n x_i}{\sigma_0^{-2} + n\sigma^{-2}} \quad\text{and}\quad \sigma_n^{-2} = \sigma_0^{-2} + n\sigma^{-2}.$$
(Observe that if σ² = σ₀² then the prior has the same weight as that of a single extra observa-
tion.)
So θ | X ∼ N(µₙ, σₙ²) and Xₙ₊₁ | θ ∼ N(θ, σ²). We can rewrite these two facts as
$$\theta = \mu_n + \sigma_n Z, \qquad X_{n+1} = \theta + \sigma Y,$$
with Z, Y independent standard normals, so that $X_{n+1} \mid x \sim N(\mu_n, \sigma^2 + \sigma_n^2)$.
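The update formulas for µₙ and σₙ² translate directly into code; the sketch below (function and variable names are ours) also checks the remark that for σ² = σ₀² the prior carries the weight of one extra observation:

```python
def posterior_params(mu0, s0sq, ssq, xs):
    # mu_n and sigma_n^2 for the N(theta, sigma^2) model with N(mu0, sigma0^2) prior:
    # sigma_n^-2 = sigma_0^-2 + n*sigma^-2,
    # mu_n = (sigma_0^-2 mu_0 + sigma^-2 sum x_i) / (sigma_0^-2 + n sigma^-2).
    n = len(xs)
    prec = 1.0 / s0sq + n / ssq
    mu_n = (mu0 / s0sq + sum(xs) / ssq) / prec
    return mu_n, 1.0 / prec

def predictive_params(mu0, s0sq, ssq, xs):
    # X_{n+1} | x ~ N(mu_n, sigma^2 + sigma_n^2)
    mu_n, sn_sq = posterior_params(mu0, s0sq, ssq, xs)
    return mu_n, ssq + sn_sq

xs = [0.4, 1.1, -0.2]
mu_n, sn_sq = posterior_params(0.0, 1.0, 1.0, xs)
pred_mean, pred_var = predictive_params(0.0, 1.0, 1.0, xs)
```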
Chapter 8
Non-Informative Priors
We’ve just seen that priors don’t always have to be probability distributions. When may we want to
make use of this?
We’re used to the notion of a subjective prior , a distribution representing our prior knowledge about
the parameter before any data is collected. With this approach, we may try different priors representing
different ‘points of view’.
This is in contrast to the concept of an objective prior (a non-informative prior ) which we’ll explore
in this chapter. This is a prior which is somehow ‘automatic’, reflecting the lack of any initial knowledge
about the parameter — and crucially may have no probabilistic interpretation, and so doesn’t have to be
a valid probability distribution. Non-informative priors can be used when little or no reliable information
is available.
There are several approaches for defining a non-informative prior, three of which we’ll mention here.
Definition 8.1. The uniform prior or flat prior is the prior π(θ) ∝ 1.
This is the obvious, naive representation of lack of information; every value being equally likely. Under
this prior, the posterior is
$$\pi(\theta \mid x) = \frac{L(\theta, x)}{\int_\Theta L(\theta, x)\, d\theta},$$
which is well defined as long as $\int_\Theta L(\theta, x)\, d\theta < \infty$.
Example. Let X ∼ Exp(θ) and π(θ) = 1. The marginal likelihood is $\int_0^\infty e^{-\theta x}\theta\, d\theta$, which is finite
for all x > 0, so the posterior is well-defined. But does it have nice properties?
Remark. Why does this work? If θ = g(ψ) for some one-to-one differentiable function g then the
reparameterised prior is
$$\tilde\pi(\psi) \propto \pi(g(\psi))\,|g'(\psi)| = \sqrt{I_\theta}\,|g'(\psi)|.$$
Recall that $I_\psi = (g'(\psi))^2 I_\theta$, so $\sqrt{I_\psi} = \sqrt{I_\theta}\,|g'(\psi)|$. Hence $\tilde\pi(\psi) \propto \sqrt{I_\psi}$.
Example. Suppose X ∼ Po(λ), so that $f(x, \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$ for x = 0, 1, 2, ….
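For the Poisson model the score is $\partial_\lambda \log f(x, \lambda) = x/\lambda - 1$, so $I(\lambda) = \operatorname{Var}(X)/\lambda^2 = 1/\lambda$ and the Jeffreys prior is $\pi(\lambda) \propto \lambda^{-1/2}$. A quick numerical check of the information, truncating the pmf sum (the cutoff is an arbitrary choice):

```python
from math import exp, lgamma, log

def poisson_fisher_info(lam, cutoff=200):
    # I(lam) = E[(d/dlam log f(X, lam))^2] with score x/lam - 1,
    # computed by truncating the Poisson pmf sum.
    total = 0.0
    for x in range(cutoff):
        p = exp(-lam + x * log(lam) - lgamma(x + 1))
        total += p * (x / lam - 1.0) ** 2
    return total
```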
Remark. In the continuous case, entropy is often referred to as the differential entropy .
A maximum entropy probability distribution has entropy that is at least as great as that of all other
members of a specified class of probability distributions. According to the principle of maximum entropy,
if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms
of specified properties or measures), then the distribution with the largest entropy should be chosen as
the least-informative default.
Since maximizing entropy minimizes the amount of prior information built into the distribution it makes
sense to pick the prior that maximises the entropy subject to any relevant constraints (e.g. a fixed mean).
Example. Suppose we wish to find the distribution π which maximises Ent[π] on Θ = R subject
to the constraints
$$\int_{-\infty}^\infty \pi(\theta)\, d\theta = 1, \quad \int_{-\infty}^\infty \theta\pi(\theta)\, d\theta = \mu \quad\text{and}\quad \int_{-\infty}^\infty (\theta - \mu)^2\pi(\theta)\, d\theta = \sigma^2$$
for fixed µ, σ².
The solution is $\pi(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(\theta-\mu)^2/2\sigma^2}$. This can be shown using variational calculus or using
information-theoretic techniques (a proof is seen on a problem sheet in the Information Theory
course).
Thus the Gaussian distribution is the maximum-entropy distribution for the real line when we fix
the mean and variance.
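As a sanity check of the maximum-entropy property, one can compare the differential entropy of the Gaussian against other common distributions matched to the same variance; the closed-form entropies used below are standard, and the variance value is an arbitrary choice:

```python
from math import log, pi, e, sqrt

def entropy_gaussian(var):
    # Differential entropy of N(mu, var): 0.5 * log(2*pi*e*var).
    return 0.5 * log(2 * pi * e * var)

def entropy_uniform_with_var(var):
    # U(-w/2, w/2) has variance w^2/12 and entropy log(w).
    return log(sqrt(12 * var))

def entropy_laplace_with_var(var):
    # Laplace(b) has variance 2*b^2 and entropy 1 + log(2*b).
    b = sqrt(var / 2)
    return 1 + log(2 * b)

var = 2.7
```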
Remark. The maximum entropy distribution does not always exist (for example the class of distributions
may have unbounded entropy).
The previous example leads us to a more general theorem, which we shall not prove:
Then π uniquely maximises Ent[π] among all densities satisfying the constraint (8.7).
Recall that for two distributions τ₁ ≪ τ₂ the Kullback-Leibler divergence KL(τ₁∥τ₂) is defined
through
$$\operatorname{KL}(\tau_1 \| \tau_2) = \int \tau_1(dx)\log\frac{d\tau_1}{d\tau_2},$$
where dτ₁/dτ₂ is the Radon-Nikodym derivative. If τ₁ is not absolutely continuous w.r.t. τ₂ we set
KL(τ1 ∥τ2 ) = +∞. It is a simple application of Jensen’s inequality to check that KL(τ1 ∥τ2 ) ≥ 0.
We now have the tools to prove the result. Let π′ be any element of Π. Then
\begin{align}
\operatorname{Ent}[\pi'] &= -\int \pi'(x)\log\pi'(x)\, dx\\
&= -\int \pi'(x)\log\frac{\pi'(x)}{\pi(x)}\, dx - \int \pi'(x)\log\pi(x)\, dx\\
&= -\operatorname{KL}(\pi'\|\pi) - \int \pi'(x)\sum_{i=1}^p \lambda_i T_i(x)\, dx + B(\lambda).
\end{align}
Since π′ satisfies the same constraints as π, the last two terms are unchanged when π′ is replaced
by π, and so they equal Ent[π]. Rearranging we obtain
$$\operatorname{Ent}[\pi] - \operatorname{Ent}[\pi'] = \operatorname{KL}(\pi'\|\pi) \geq 0.$$
Example (continued). In the example above, our two constraints were E[T₁(θ)] = µ and E[T₂(θ)] =
σ², where T₁(θ) = θ and T₂(θ) = (θ − µ)².
The above theorem then gives that the maximum-entropy prior is of the form π(θ) ∝ exp(λ₁θ +
λ₂(θ − µ)²). The two constraints then imply that λ₁ = 0 and $\lambda_2 = -\frac{1}{2\sigma^2}$, thus giving the Gaussian
distribution we just saw.
Chapter 9
Hierarchical Models
Chapter 5, p. 101 in Gelman, Carlin et al., Bayesian Data Analysis. Section 7.7, p. 170 in Garthwaite, Jolliffe
and Jones, Statistical Inference. p. 253 in Lehmann and Casella, Theory of Point Estimation.
9.1 Example
In certain situations, the data we are modelling has a natural hierarchical structure. We illustrate this
first with an extended example.
Example (Study of cardiac treatment across different hospitals). Consider the dataset in
fig. 9.1 consisting of mortality rates in infant cardiac surgery across I = 12 hospitals. Each hospital
i conducts ni surgeries, Yi of which result in death. We use the natural model for the number of
deaths at each hospital as Yi ∼ Bin(ni , θi ), where θi is an unknown parameter.
• Identical parameters. We assume all the θi are identical. This ignores the structure of the
problem and pools all the data. In this case this means we’re assuming the surgery success
rate doesn’t depend on which hospital conducts the surgery.
• Independent parameters. We assume all the θi are independent, i.e. entirely unrelated.
The results from each unit can be analysed independently. In this case this means we’re
assuming there is nothing similar about the surgery at different hospitals, and the failure rates
at different hospitals don’t depend on each other in any way.
• Exchangeable parameters. We assume the θi are similar; no one hospital is a priori any
better than another. More on this later.
Let’s see how the first two approaches can work in this situation, where relevant examining our
estimates for hospitals A and H in particular:
Figure 9.1: Number of infant cardiac surgeries and number of mortalities across 12 hospitals.
it’s a conjugate prior for the binomial distribution;Pand the choice of parameters a, b will be
discussed later.) The posterior mean of θ is then P niy+α+β
i +α
= 0.0740.
which takes value 0.0412 for hospital A and 0.1321 for hospital H.
The first method (frequentist, equal parameters) gives some pretty unlikely results (e.g. the observed
death rate for hospital H is very unlikely given our estimated θ), and the second method (frequentist,
independent parameters) totally ignores data from other hospitals when estimating θi for a particular
hospital; but this is the same medical procedure, so this is unnatural.
The third method (Bayesian, equal parameters) has the same problem as in the frequentist setting,
but the last method (Bayesian, independent parameters drawn from the same distribution) seems to
address these issues; the parameters are different for each hospital, but are all drawn from the same
distribution, whose parameters can be inferred from the entire dataset.
How can we estimate the parameters, then, of the shared prior distribution?
Example (continued). In the example above, the approach we settled on models the θi as drawn
independently from a Beta(a, b) distribution. How do we estimate the parameters (a, b)?
• Approximate empirical Bayes approach. The most obvious way to estimate (a, b) is to
use a standard frequentist technique; the method of moments. In this context, this means we
pick (a, b) so that the prior distribution has the same mean and variance as the sample mean
and sample variance of the observed maximum likelihood estimates for the parameters θi .
Specifically, we calculate ri = yi /ni for each hospital (this is the observed mortality rate; the
MLE for θi ) and we calculate the sample mean and sample variance of the set {r1 , . . . , r12 };
and then solve for â, b̂ such that Beta(â, b̂) has the same mean and variance.
(Then we use Beta(â, b̂) as our shared prior for the θi, to obtain the posterior distribution
π(θi | â, b̂, yi) for each θi as described above.)
This approach is reasonable, but we have the problem that we are using the same data twice —
once to pick (â, b̂) and once to find the individual posteriors for the θi. This leads to overconfidence
in the posterior distributions! Moreover, we're making a fixed choice of (â, b̂) and working with that
choice, so the posterior distributions we derive will not reflect the inherent uncertainty in the values
of the parameters (a, b).
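The moment-matching step can be sketched in a few lines of Python; the rates below are hypothetical stand-ins for the observed ri, not the figure's data.

```python
import statistics

def beta_moment_match(rates):
    """Return (a, b) so that Beta(a, b) has the sample mean and variance of `rates`."""
    m = statistics.mean(rates)
    v = statistics.variance(rates)   # sample variance of the MLEs r_i
    c = m * (1 - m) / v - 1          # solves the two moment equations
    return m * c, (1 - m) * c

rates = [0.04, 0.06, 0.05, 0.08, 0.07, 0.09, 0.05, 0.13, 0.06, 0.07, 0.08, 0.06]
a_hat, b_hat = beta_moment_match(rates)
fit_mean = a_hat / (a_hat + b_hat)
fit_var = a_hat * b_hat / ((a_hat + b_hat) ** 2 * (a_hat + b_hat + 1))
```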
• Hierarchical Bayesian model. We may instead assume a joint probability model for (θ, a, b).
In other words θ, a and b are all treated as random variables.
As before (except now treating these explicitly as conditional distributions) we say θi | (a, b) ∼
Beta(a, b) independently for each i, and we now also model the marginal distribution of (a, b)
as (a, b) ∼ p(a, b). This is effectively a prior distribution for (a, b); we call it the hyperprior.
In summary, our hierarchical model has three layers:
– Level 1: Yi | θi ∼ Bin(ni , θi ) independently for each i;
– Level 2: θi | (a, b) ∼ Beta(a, b) independently for each i;
– Level 3: (a, b) ∼ p(a, b) for some hyperprior distribution p(a, b).
Note that the θi are now not independent, but they are conditionally independent given a, b.
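The three levels can be simulated top-down, which is often the quickest way to understand a hierarchical model. A minimal sketch; the fixed (a0, b0) stands in for a draw from the hyperprior and is purely illustrative.

```python
import random

random.seed(0)

def draw_hierarchy(n_list, a0=2.0, b0=20.0):
    # Level 3: (a, b) ~ p(a, b); here a fixed placeholder value
    a, b = a0, b0
    # Level 2: theta_i | (a, b) ~ Beta(a, b), independently
    thetas = [random.betavariate(a, b) for _ in n_list]
    # Level 1: Y_i | theta_i ~ Bin(n_i, theta_i), independently
    ys = [sum(random.random() < th for _ in range(n))
          for n, th in zip(n_list, thetas)]
    return thetas, ys

thetas, ys = draw_hierarchy([50, 100, 200])
```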
9.2 Definition
The empirical Bayes approach will be discussed in more detail later in the course; the hierarchical Bayes
approach can be defined in generality as follows:
Definition 9.1. The building blocks of a hierarchical Bayesian model for the observations
Y1 , . . . , Yn with parameters θ1 , . . . , θn and hyperparameter ϕ are the following three levels;
the corresponding hierarchical model is the joint distribution of the Yj, θj and ϕ they determine.
I: yj |θj , ϕ ∼ p(yj |θj ) independently for each j, (note this does not depend on ϕ)
II: θj |ϕ ∼ p(θj |ϕ)
III: ϕ ∼ p(ϕ)
The joint prior distribution is p(θ, ϕ) = p(θ|ϕ)p(ϕ) and the joint posterior distribution is
p(θ, ϕ|y) ∝ p(y|θ, ϕ)p(θ, ϕ) = p(y|θ)p(θ|ϕ)p(ϕ).
Example (continued). In the case of the hospital data, recall that our model is Yi | θi ∼ Bin(ni , θi )
independently for each i, with i.i.d. priors θi ∼ Beta(a, b) and a hyper-prior p(a, b). Since condition-
ally on (a, b) the (yi , θi ) are independent, the joint posterior distribution is
Thus we have

p(θ | a, b, y) ∝ ∏_{i=1}^I p(θi | a, b, yi) ∝ ∏_{i=1}^I θi^{a+yi−1} (1 − θi)^{b+ni−yi−1}
(what the ∝ symbol means is that we only keep the terms that involve θ). This shows that, given
a, b, y, the θi have independent beta posteriors.
On the other hand, the posterior for (a, b) is p(a, b | y) ∝ p(a, b)p(y | a, b). Observe first that
p(y | a, b) = ∏i p(yi | a, b) by conditional independence given a, b. Let us call G(a, b) = Γ(a + b)/(Γ(a)Γ(b)) the
normalisation constant of the Beta(a, b) distribution. Then

(9.4) p(yi | a, b) = ∫ p(yi | θ) p(θ | a, b) dθ

(9.5) = ∫ (ni choose yi) θ^{yi} (1 − θ)^{ni−yi} G(a, b) θ^{a−1} (1 − θ)^{b−1} dθ

(9.6) ∝ G(a, b) ∫ θ^{yi+a−1} (1 − θ)^{ni−yi+b−1} dθ = G(a, b) · Γ(a + yi)Γ(b + ni − yi)/Γ(a + b + ni).
Thus

p(a, b | y) ∝ p(a, b) ∏_{i=1}^I [Γ(a + b)/(Γ(a)Γ(b))] · [Γ(a + yi)Γ(b + ni − yi)/Γ(a + b + ni)].
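This marginal posterior is easy to evaluate numerically via log-Gamma functions. A sketch with a flat hyperprior and hypothetical data (not the hospital figures):

```python
from math import lgamma

def log_post_ab(a, b, y, n):
    """Unnormalised log p(a, b | y) for the beta-binomial hierarchy (flat hyperprior)."""
    lp = 0.0
    for yi, ni in zip(y, n):
        lp += (lgamma(a + b) - lgamma(a) - lgamma(b)
               + lgamma(a + yi) + lgamma(b + ni - yi) - lgamma(a + b + ni))
    return lp

y, n = [3, 1, 4], [40, 30, 50]   # hypothetical deaths / surgeries
val = log_post_ab(2.0, 20.0, y, n)
```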
Remark. How can we draw from the joint posterior p(θ, ϕ | y) in general?
1. Draw ϕ ∼ p(ϕ | y).
2. Draw θ ∼ p(θ | ϕ, y).
9.3 Exchangeability
In the model we’ve seen, the parameters θi were conditionally independent given the hyperparameter
vector ϕ.
Intuitively, this says that ‘no one parameter is a priori to be treated differently from any of the other
parameters’.
Theorem. If p(θ) = ∫ ∏i p(θi | ψ) g(ψ) dψ for some ψ with distribution g(ψ), i.e. the θi are conditionally independent given ψ, then the
distribution of θ is exchangeable (symmetric).
Proof. Exercise.
Theorem 9.4 (De Finetti). All exchangeable sequences are of the above form in the large sample
limit.
Proof. Omitted.
• For each j = 1, . . . , J, the Xi,j , i = 1, . . . , nj are i.i.d. N (θj , σ 2 ). The Xi,j are also independent
across different j’s.
• The θj are i.i.d. N (µ, τ²) (conditionally on ϕ = (µ, τ²)).
We are going to use the notation x = {xi,j } for the whole data set, xj = {xi,j , i = 1, . . . nj } for the
observations in group j, and
x̄·,j = (1/nj) Σ_{i=1}^{nj} xi,j

for the mean of xj.
Conditionally on θj we have x̄·,j ∼ N(θj, σj²), where σj² = σ²/nj is the variance of x̄·,j (this is just a fancy way of saying that x̄·,j is sufficient for θj).
Let us start by writing the joint posterior distribution. Using the usual posterior ∝ prior × likelihood
formula we have
Let us now determine the conditional posterior distribution of θ given ϕ. In a hierarchical model,
once the hyperparameter is given, the parameters θj are independent. Thus
p(θ | ϕ, x) = ∏_{j=1}^J p(θj | ϕ, xj).
Conditionally on ϕ, we simply have J independent unknown normal means given a normal prior dis-
tribution. For each j, we thus have a simple Gaussian conjugate model: the observations xj are i.i.d
N (θj , σ 2 ) , σ 2 known and θj ∼ N (µ, τ 2 ). It can easily be checked that the posterior is still Gaussian
θj | µ, τ², xj ∼ N(θ̂j, Vj)

where

θ̂j = (x̄·,j/σj² + µ/τ²) / (1/σj² + 1/τ²)   and   Vj = 1 / (1/σj² + 1/τ²).
Observe that the posterior mean is a weighted average of the prior mean µ and of the sample mean x̄·,j
of group j.
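The update is a precision-weighted average, which the following sketch makes concrete: as τ² grows the data dominate, as τ² shrinks the prior does.

```python
def posterior_theta_j(xbar_j, sigma2_j, mu, tau2):
    """Posterior mean and variance of theta_j in the conjugate Gaussian model."""
    V_j = 1.0 / (1.0 / sigma2_j + 1.0 / tau2)
    theta_hat = V_j * (xbar_j / sigma2_j + mu / tau2)
    return theta_hat, V_j

th, V = posterior_theta_j(1.0, 0.5, 0.0, 2.0)          # weights 1/0.5 and 1/2
th_data, _ = posterior_theta_j(1.0, 0.5, 0.0, 1e12)    # diffuse prior -> xbar
th_prior, _ = posterior_theta_j(1.0, 0.5, 0.0, 1e-12)  # tight prior -> mu
```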
We can now move on to the marginal posterior distribution of the hyperparameter ϕ. Once we
have obtained the joint posterior p(θ, ϕ | x) we can obtain the marginal posterior of the hyperparameter
p(ϕ | x) by integrating out the parameter θ. But for the hierarchical normal model we can simply
consider directly
p(ϕ | x) ∝ p(ϕ) p(x | ϕ).
Usually, this decomposition is no help because the marginal likelihood term p(x | ϕ) cannot generally
be written in closed form. But in the present case there is a particularly simple form to this marginal
likelihood. The key observation is that, conditionally on ϕ = (µ, τ²), the x̄·,j are independent with x̄·,j ∼ N(µ, σj² + τ²).
Now we are going to make some assumptions about the hyperparameter distribution p(ϕ). We are going
to assume a non-informative flat prior for µ given τ 2 :
(this is just the usual Bayes formula for pmf/pdf’s for the variables µ and τ 2 under their conditional
distribution given x).
Plugging in our particular form of hyperprior into (9.17), we see that taking a log will give us a quadratic
expression in µ:
− log p(µ, τ² | x) = Σ_{j=1}^J (µ − x̄·,j)² / (2(σj² + τ²)) + const. not depending on µ.
Thus µ | τ², x ∼ N(µ̂, Vµ), where we only need to find the mean µ̂ and variance Vµ. It can be checked by
“completing the square” that

Vµ⁻¹ = Σ_{j=1}^J 1/(σj² + τ²)   and   µ̂ = Vµ Σ_{j=1}^J x̄·,j/(σj² + τ²).
We thus have a proper posterior for µ given τ 2 . Finally we want the posterior distribution of τ 2 .
(9.18) p(τ² | x) = p(ϕ | x) / p(µ | τ², x)

(9.19) ∝ p(τ²) ∏_{j=1}^J N(x̄·,j | µ, σj² + τ²) / N(µ | µ̂, Vµ).
Observe that the left-hand side does not depend on µ, so the right-hand side cannot depend on µ
either. We can thus choose to evaluate it at µ = µ̂ for simplicity and get

(9.20) p(τ² | x) ∝ p(τ²) ∏_{j=1}^J N(x̄·,j | µ̂, σj² + τ²) / N(µ̂ | µ̂, Vµ)

(9.21) ∝ p(τ²) Vµ^{1/2} ∏_{j=1}^J (σj² + τ²)^{−1/2} exp( −(x̄·,j − µ̂)² / (2(σj² + τ²)) ).
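Expression (9.21) is one-dimensional, so it can simply be evaluated on a grid; a sketch with flat priors on µ and τ² and hypothetical group means and variances.

```python
import math

def log_post_tau2(tau2, xbar, sigma2):
    """Unnormalised log p(tau^2 | x) as in (9.21), flat priors on mu and tau^2."""
    V_mu = 1.0 / sum(1.0 / (s + tau2) for s in sigma2)
    mu_hat = V_mu * sum(xb / (s + tau2) for xb, s in zip(xbar, sigma2))
    lp = 0.5 * math.log(V_mu)
    for xb, s in zip(xbar, sigma2):
        lp += -0.5 * math.log(s + tau2) - (xb - mu_hat) ** 2 / (2 * (s + tau2))
    return lp

xbar = [2.1, -0.3, 1.0, 0.4]           # hypothetical group means
sigma2 = [0.5, 0.8, 0.4, 1.0]          # hypothetical sigma_j^2 = sigma^2 / n_j
grid = [0.1 * k for k in range(1, 50)]
best = max(grid, key=lambda t: log_post_tau2(t, xbar, sigma2))
```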
Chapter 10
Decision Theory
Throughout this course we have been exploring ways of estimating parameters, predicting new values,
or inferring probability distributions. In the past we have come across hypothesis testing (which we’ll
explore again at the end of this course). All of these are examples of making decisions based on data. In
this section we develop this into a formal theory.
• An action (or decision) space A. Typical examples include A = {0, 1} for selecting a hypoth-
esis, or A = g(Θ) for estimating a function g(θ) of a parameter.
With these in mind, we define our first measure of ‘how bad’ a decision rule is:
Definition 10.1. For a given rule ∆ ∈ D and parameter θ ∈ Θ, the (frequentist) risk is
Z
R(θ, ∆) = Eθ [L(θ, ∆(X))] = L(θ, ∆(x))f (x, θ) dx.
X
For a given rule ∆ we think of the frequentist risk as a profile of risk across the different values of θ.
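The risk profile of a rule can be approximated by Monte Carlo; a sketch for the hypothetical model X ∼ N(θ, 1) with squared-error loss, where the rule ∆(x) = x has constant risk 1.

```python
import random

random.seed(1)

def mc_risk(theta, rule, n_rep=20000):
    """Monte Carlo estimate of R(theta, rule) = E_theta[L(theta, rule(X))]
    for X ~ N(theta, 1) and squared-error loss."""
    total = 0.0
    for _ in range(n_rep):
        x = random.gauss(theta, 1.0)
        total += (rule(x) - theta) ** 2
    return total / n_rep

# risk profile of Delta(x) = x across several theta values
risks = [mc_risk(th, lambda x: x) for th in (-2.0, 0.0, 3.0)]
```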
We also want to consider randomised decision rules, that is a rule that selects among a collection of
decision rules according to some probability distribution.
Definition 10.2 (Randomised decision rule). Suppose that the action space A is equipped
with a σ-algebra A such that (A, A) is a measurable space and let P(A) be the space of probability
measures on A.
A randomised decision rule d is a mapping from d : X 7→ dx ∈ P(A) such that for each A ∈ A,
the mapping x 7→ dx (A) is measurable.
Given a collection {∆1 , . . . , ∆k } and a probability vector (p1 , . . . , pk ) we can define the randomised
rule d = Σ_{i=1}^k pi δ_{∆i}, which for each x selects the rule ∆i (x) with probability pi .
Examples. For testing H0 against H1 with 0–1 loss,

R(θ, ∆) = Pθ(∆(X) = 1) if θ ∈ H0,   and   R(θ, ∆) = Pθ(∆(X) = 0) if θ ∈ H1.

These are the Type I/II error probabilities respectively.
10.2 Admissibility
Let’s see how we might compare decision rules.
R(θ, ∆1 ) ⩾ R(θ, ∆2 ) ∀θ ∈ Θ
So a = 3/2 is a necessary condition for θ̂ to be admissible for quadratic loss. Note that we have
shown that θ̂(x) = (3/2)x is admissible in D but not among the set of all possible estimators of the
form θ̂(x) = f(x) for some function f.
Remark. Note that being admissible is a fairly weak requirement. We will later see that some natural
estimators are in fact inadmissible (see chapter 11).
Intuitively, a minimax rule does best in the worst case scenario. This can often still mean poor perfor-
mance on average.
Given a prior belief π about the parameter θ, it is also natural to consider the average risk of a rule.
Definition 10.5. The Bayes integrated risk (or simply Bayes risk) for a decision rule ∆ and
a prior π(θ) is

r(π, ∆) := ∫_Θ R(θ, ∆) π(θ) dθ.
A decision rule ∆ is said to be a Bayes rule w.r.t. π if it minimises the Bayes risk:
r(π, ∆) = inf_{∆′∈D} r(π, ∆′) =: rπ.
We will see that Bayes rules (or estimators) provide a tool to solve minimax problems. To make this
idea precise we need the notion of a least favorable prior. Recall that rπ is the Bayes risk of the Bayes
estimator ∆Bayes associated to π (when one exists).
Definition 10.6. A prior distribution π is least favorable if rπ ⩾ rπ′ for all prior distributions π ′ .
The following Theorem provides a simple condition for a Bayes estimator ∆Bayes to be minimax.
Theorem 10.7. Suppose that π is a prior distribution on Θ and that ∆Bayes is the Bayes estimator
for π with
r(π, ∆Bayes ) = rπ .
If the rule ∆0 satisfies
sup R(θ, ∆0 ) ⩽ rπ
θ
then ∆0 is minimax, and, furthermore, if ∆Bayes is the unique Bayes estimator for π then ∆0 is
the unique minimax procedure.
The second inequality is strict if there is a unique Bayes estimator which gives the second point.
Remark. It is interesting to note that in the Theorem above one must have R(θ, ∆0) = rπ for
π-almost all θ. Indeed, otherwise we would have

∫ R(θ, ∆0) π(θ) dθ < rπ,

contradicting the fact that rπ is the minimal Bayes risk over all rules.
Theorem 10.8. Let ∆Bayes be the Bayes estimator for some prior π. If sup_θ R(θ, ∆Bayes) = rπ,
then π is least favorable.

Proof. Let π′ be some other distribution. Then, writing ∆′Bayes for the Bayes estimator with respect to π′,
we have

rπ′ = ∫ R(θ, ∆′Bayes) π′(θ) dθ ⩽ ∫ R(θ, ∆Bayes) π′(θ) dθ ⩽ sup_θ R(θ, ∆Bayes) = rπ.
Corollary 10.9. If a Bayes rule ∆Bayes has constant risk, then it is minimax.
Corollary 10.10. Let ωπ ⊂ Θ be the set of θ at which the risk function of ∆Bayes achieves its
maximum, i.e.
ωπ = {θ : R(θ, ∆Bayes) = sup_{θ′} R(θ′, ∆Bayes)}.
If π(ωπ) = 1, then ∆Bayes is minimax.
Remark. Bayes and almost surely constant risk is sufficient for minimax but not necessary. The result
is stated wrongly in Lehmann and Casella as an if and only if statement.
Example. Suppose that X ∼ Bin(n, p) and we wish to estimate p with the squared error loss function
L(p, p̂) = (p − p̂)². We choose p̂ = X/n. The risk function is

R(p, p̂) = E[(p̂ − p)²] = p(1 − p)/n.

It has a unique maximizer at p = 1/2. So to apply the Corollary above we would need π(1/2) = 1,
for which the corresponding Bayes estimator is ∆ = 1/2, not X/n. In fact it can be checked that
X/n is not minimax.
To determine a minimax estimator by the method suggested by Theorem 10.7, let us try a Beta(a, b)
prior distribution. In that case we will see that the Bayes estimator is the posterior mean (this is
proved in Proposition 10.15, but you can try to prove this for yourself here!), i.e. ∆(x) = (a + x)/(a + b + n),
and the risk function is

R(p, ∆) = (1/(a + b + n)²) { np(1 − p) + [a(1 − p) − bp]² }.

Can we find values of a, b such that this risk function is constant? Setting the coefficients of p² and
p to 0 shows that R(p, ∆) is constant in p iff a = b = √n/2. The resulting estimator

∆(x) = (x + √n/2)/(n + √n)

is constant-risk Bayes and hence minimax. Because of the uniqueness of the Bayes estimator we see
that this is the unique minimax estimator.
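Since the binomial risk is a finite sum, the constant-risk property of the choice a = b = √n/2 can be verified exactly; a sketch:

```python
from math import sqrt, comb

def risk(n, estimator, p):
    """Exact risk E_p[(estimator(X) - p)^2] for X ~ Bin(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) * (estimator(x, n) - p) ** 2
               for x in range(n + 1))

n = 20
mle = lambda x, n: x / n
minimax = lambda x, n: (x + sqrt(n) / 2) / (n + sqrt(n))   # a = b = sqrt(n)/2

ps = (0.05, 0.3, 0.5, 0.9)
risks_mm = [risk(n, minimax, p) for p in ps]
risks_mle = [risk(n, mle, p) for p in ps]
```

Note that the minimax estimator's constant risk is below the MLE's worst-case risk at p = 1/2, but above the MLE's risk near the boundary of the parameter space.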
The following result says that a Bayes rule minimises the expected posterior loss.
Theorem 10.12. Suppose that X | θ ∼ Pθ and that θ ∼ π, and consider the problem of estimating
g(θ) with non-negative loss function L(θ, d). Define the expected posterior loss

Λ(x, y) := ∫ L(θ, y) π(θ | x) dθ.

If for (almost) every x there exists a value ∆(x) minimising y ↦ Λ(x, y), then ∆ is a Bayes rule.
Proof. The Bayes risk is (using π(θ | x) = f (θ, x)π(θ)/h(x) where h(x) is the marginal distribution
of X)
Z Z Z
(10.4) r(π, ∆) = R(θ, ∆)π(θ) dθ = L(θ, ∆(x))f (θ, x)π(θ) dx dθ
Z Z
(10.5) = L(θ, ∆(x))π(θ | x)h(x) dx dθ
Z Z
(10.6) = h(x) L(θ, ∆(x))π(θ | x) dθ dx
Z
(10.7) = h(x)Λ(x, ∆(x)) dx
Z
(10.8) ≤ h(x)Λ(x, ∆′ (x)) dx
Proposition 10.13 (Bayes rules and admissibility). Let ∆π be a Bayes rule w.r.t. π with
finite Bayes risk. Then
1. if ∆π is the unique Bayes rule for π, it is admissible;
2. if the risk functions θ ↦ R(θ, ∆) are continuous for all rules ∆, and π gives positive mass to every open subset of Θ, then ∆π is admissible.
Proof.
1. If ∆π is not admissible then there is some ∆ such that R(θ, ∆) ⩽ R(θ, ∆π ) ∀θ ∈ Θ and
R(θ, ∆) < R(θ, ∆π ) for some θ. This implies r(π, ∆) ⩽ r(π, ∆π ), so ∆ must also be Bayes, so
by uniqueness ∆ = ∆π , contradicting the definition of ∆. So ∆π is admissible.
2. As above, if ∆π is not admissible then there is some ∆ such that R(θ, ∆) ⩽ R(θ, ∆π ) ∀θ ∈ Θ
and A∆ ̸= ∅, where A∆ := {θ : R(θ, ∆) < R(θ, ∆π )}.
Since θ 7→ R(θ, ∆) − R(θ, ∆π ) is continuous, A∆ must contain an open set. So π(A∆ ) > 0
arriving at a contradiction.
Definition 10.14. The zero-one loss is of the form

L(θ, θ̂) = a if |θ̂ − θ| > b, and 0 otherwise,

where a, b are positive constants.
The absolute error loss is of the form L(θ, θ̂) = k|θ̂ − θ|, where k is a positive constant.
Let us see what the Bayes estimate (Bayes rule) is for each of these losses, by minimising the expected
posterior loss.
1. The zero-one loss with interval radius b gives an estimate tending to the posterior mode as b → 0 (assuming, say, that the posterior density is continuous).
Proof.
To find the Bayes rule one has to minimise the above, or equivalently maximise

∫_{θ̂−b}^{θ̂+b} π(θ | x) dθ.

Differentiating w.r.t. θ̂ and setting the derivative equal to 0, we need to find θ̂ such that

(10.12) π(θ̂ + b | x) − π(θ̂ − b | x) = 0.

If the map θ ↦ π(θ | x) is continuous, the above is guaranteed to have a solution for b small
enough. To see why, let θ̃ be a mode of π(θ | x); then for b small enough

π(θ̃ | x) − π(θ̃ − 2b | x) ≥ 0 at θ̂ = θ̃ − b,
π(θ̃ + 2b | x) − π(θ̃ | x) ≤ 0 at θ̂ = θ̃ + b,

since θ̃ is a local maximum. Therefore by the intermediate value theorem
there must be a solution θ̂ in the interval [θ̃ − b, θ̃ + b]. To verify this is indeed a maximum,
one may also do a second derivative test if θ ↦ π(θ | x) is differentiable, again using
the fact that θ̂ is a local maximum and thus the derivative must change sign. So
the Bayes rule is to choose θ̂(x) so that (10.12) is satisfied, which can be achieved by some
θ̂ ∈ [θ̃ − b, θ̃ + b]. So as b → 0, θ̂ tends towards the posterior mode.
2. For the absolute error loss, Λ(x, θ̂) = k ∫ |θ − θ̂| π(θ | x) dθ, so that

∂/∂θ̂ Λ(x, θ̂) = ∫_{−∞}^{θ̂} π(θ | x) dθ − ∫_{θ̂}^{∞} π(θ | x) dθ = 2 ∫_{−∞}^{θ̂} π(θ | x) dθ − 1.

Setting this to zero shows that the Bayes rule is the posterior median.
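So the three losses lead to the posterior mode, median and mean respectively. A numerical sketch for an illustrative Beta(3, 7) posterior, using a simple grid approximation:

```python
# Grid approximation of the posterior mode, median and mean of Beta(3, 7):
# the Bayes rules for (limiting) zero-one, absolute-error and quadratic loss.
N = 200001
h = 1.0 / (N - 1)
xs = [i * h for i in range(N)]
dens = [x ** 2 * (1 - x) ** 6 for x in xs]   # unnormalised Beta(3, 7) density
Z = sum(dens) * h                            # normalising constant

mode = xs[max(range(N), key=lambda i: dens[i])]

cum, median = 0.0, None
for x, d in zip(xs, dens):
    cum += d * h / Z
    if cum >= 0.5:
        median = x
        break

mean = sum(x * d for x, d in zip(xs, dens)) * h / Z
```

For this right-skewed posterior the three Bayes estimates are ordered mode < median < mean.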
Example. (Example 4.1.5 in Lehmann Casella p230) Suppose X ∼ Bin(n, p) with a Beta(a, b) prior
for p. As we have seen this is a conjugate prior and the posterior density of p is proportional to
p^{x+a−1} (1 − p)^{n−x+b−1}. Therefore, under the quadratic loss function, the Bayes estimator p̂Bayes of
p is

p̂Bayes = E[p | x] = (a + x)/(a + b + n).

It is interesting to compare this with the MLE (or UMVUE), which is just X/n. In fact, before
taking any observation, the estimator from the Bayesian approach would be the mean of the prior,
a/(a + b). Once X has been observed, the standard non-Bayesian estimator is X/n. The Bayes
estimator p̂Bayes lies between the two and in fact

p̂Bayes = (a + b)/(a + b + n) · a/(a + b) + n/(a + b + n) · X/n.

Consider the two limiting regimes:
1. n → ∞ while a, b fixed
2. n fixed and a, b → ∞ with a/b fixed.
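Both the weighted-average identity and the two limiting regimes can be checked numerically; the particular values of a, b, n, x below are arbitrary.

```python
def p_bayes(x, n, a, b):
    """Posterior mean (a + x) / (a + b + n) for Bin(n, p) with a Beta(a, b) prior."""
    return (a + x) / (a + b + n)

a, b, n, x = 3.0, 5.0, 40, 14
prior_mean, mle = a / (a + b), x / n
w = (a + b) / (a + b + n)

direct = p_bayes(x, n, a, b)
mixed = w * prior_mean + (1 - w) * mle       # the weighted-average identity

big_n = p_bayes(x * 1000, n * 1000, a, b)    # regime 1: data dominate
big_ab = p_bayes(x, n, a * 1e6, b * 1e6)     # regime 2: prior dominates
```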
In the case of a finite decision problem, the notions of admissibility, minimax and Bayes rules can be
given geometric interpretations.
We now assume that the set of decision rules is also finite and contains l decision rules ∆1 , . . . , ∆l and
randomised decision rules that can be formed by their convex combinations, that is the decision set is
the convex hull of {∆1 , . . . , ∆l }.
D := { Σ_{i=1}^l pi ∆i : pi ≥ 0, Σi pi = 1 }.
Definition 10.17. The risk set S ⊆ Rk is the set of points {(R(θ1 , ∆), . . . , R(θk , ∆)) : ∆ ∈ D}.
It may also be the case that D is the convex hull of an infinite collection (or continuum) of non-randomized
decision rules, see e.g. Figure 10.2.
Proof. Let ∆1 , ∆2 ∈ D be two rules. Take α ∈ (0, 1). Then define a randomized rule as follows:
(
′ ∆1 (x) with prob α,
∆ (x) =
∆2 (x) with prob 1 − α.
Then R(θ, ∆′ ) = αR(θ, ∆1 )+(1−α)R(θ, ∆2 ). So the convex combination is a valid decision rule.
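With two points in the risk set, this lemma says the whole segment between them is attainable; the equal-risk mixture then minimises the maximum risk on that segment. The risk values here are illustrative only.

```python
R1 = (0.2, 0.9)   # (R(theta_1, Delta_1), R(theta_2, Delta_1)), illustrative
R2 = (0.8, 0.1)   # (R(theta_1, Delta_2), R(theta_2, Delta_2)), illustrative

def mix(alpha):
    """Risk point of the randomised rule choosing Delta_1 with probability alpha."""
    return tuple(alpha * u + (1 - alpha) * v for u, v in zip(R1, R2))

# Solve alpha*R1[0] + (1-alpha)*R2[0] = alpha*R1[1] + (1-alpha)*R2[1]
alpha_star = (R2[1] - R2[0]) / ((R1[0] - R2[0]) - (R1[1] - R2[1]))
r = mix(alpha_star)   # the equal-risk point on the segment
```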
In Figure 10.1 we can see an example of a risk set when Θ = {θ1 , θ2 }. The extreme points of the risk set
are the deterministic rules.
The thick line at the bottom defines the set of admissible rules. To see why, recall that a rule ∆ is
admissible if there is no other rule ∆′ such that R(θ, ∆′ ) ≤ R(θ, ∆) for all θ with strict inequality for at
least one θ. In our scenario this means that a rule ∆ is admissible if no other rule ∆′ achieves a risk in
the interior of the box {(x, y) : x ⩽ R(θ1, ∆), y ⩽ R(θ2, ∆)}.
Note also that the minimax rule lies on the line R1 = R2 ; since the risk set intersects the line R1 = R2
this must be the case. Suppose that an admissible rule has constant risk, that is (R1 , R2 ) with R1 = R2 .
Let (R1′ , R2′ ) be the risk of any other admissible rule; it must satisfy R1′ ≥ R1 and R2′ ≤ R2 or R1′ ≤ R1
and R2′ ≥ R2 . In either case we have that max{R1 , R2 } ≤ max{R1′ , R2′ }. The situation is similar in
Figure 10.2.
However in Figure 10.2 notice that the set of admissible rules consists of deterministic rules (in this case
D cannot be expressed as the convex hull of a finite number of deterministic rules).
For Bayes rules, suppose that (π1 , π2 ) is the prior. Then the lines π1 R1 + π2 R2 = c represent decision
rules with the same Bayes risk c. We can see this in Figures 10.3, 10.4. The slope of the lines is
determined by the prior. We increase c, looking for the line that just touches the risk set; this gives the
Bayes rule.
In Figure 10.3 the Bayes rule is not unique, and the minimax is actually a non-randomized Bayes rule.
In Figure 10.4 the Bayes rule is unique, and non-randomized. The minimax is actually a randomized
rule distinct from the Bayes rule.
Figure 10.3: The minimax is a randomized Bayes rule; Bayes rule not unique. In fact there are non-randomized Bayes rules.
Figure 10.4: The minimax is no longer a Bayes rule. Minimax is still randomized although the Bayes rule is non-randomized.
Chapter 11
The James-Stein Estimator
This chapter explores some rather counter-intuitive situations that can occur when one simultaneously
estimates several parameters.
Assume that Xi ∼ N (µi , 1) are mutually independent unit-variance Gaussian random variables, and
write X = (X1 , . . . , Xp ) and µ = (µ1 , . . . , µp ). The goal is to estimate µ from a single observation X.
Is this estimate admissible (for, say, quadratic loss)? For p ⩾ 3, the answer is no!

Theorem 11.1 (Stein's paradox). For p ⩾ 3 there is an estimator, constructed explicitly below, that
strictly dominates µ̂MLE for quadratic loss.

Corollary 11.2. If p ⩾ 3, µ̂MLE is inadmissible for quadratic loss.
Remark. This is very surprising! For instance, suppose you take measurements to estimate:
These are totally unrelated quantities; but Stein’s paradox tells us that we get better estimates (on
average) for the vector (K, G, S) by simultaneously using the three measurements!1
Lemma 11.3 (Stein's Lemma). For independent Gaussian random variables X = (X1 , . . . , Xp )
with Xi ∼ N(µi , 1) for each i, and for any bounded differentiable function h,

E[(Xi − µi) h(X)] = E[ ∂h(X)/∂Xi ].
Proof. Conditioning on {Xj : j ≠ i} and writing the conditional expectation as an integral (up to the
normalising constant), integration by parts gives

(11.1) E[(Xi − µi) h(X) | {Xj : j ≠ i}] = ∫_{−∞}^{∞} (xi − µi) h(x) e^{−(xi−µi)²/2} dxi

(11.2) = [ −e^{−(xi−µi)²/2} h(x) ]_{xi=−∞}^{xi=∞} + ∫_{−∞}^{∞} (∂h(x)/∂xi) e^{−(xi−µi)²/2} dxi

(11.3) = 0 + E[ ∂h(X)/∂Xi | Xj : j ≠ i ],

since h is bounded. Applying the tower property of conditional expectations again gives the result.
Proof of Stein's Paradox. Consider the family of estimators

µ̂JSE = (1 − a/Σi Xi²) X

indexed by the parameter a. These are called the James-Stein estimators.
Recalling that µ̂MLE = X, we get

R(µ, µ̂MLE) = Σ_{i=1}^p E[(µi − Xi)²] = p

(11.4) R(µ, µ̂JSE) = Σ_{i=1}^p E[(µi − µ̂i)²]

(11.5) = Σ_{i=1}^p { E[(µi − Xi)²] − 2a E[(Xi − µi)Xi / Σj Xj²] + a² E[Xi² / (Σj Xj²)²] }.

Now the first term is just 1, since Var(Xi) = 1, and by Stein's Lemma,

E[(Xi − µi)Xi / Σj Xj²] = E[ ∂/∂Xi (Xi / Σj Xj²) ] = E[ (Σj Xj² − 2Xi²) / (Σj Xj²)² ] = E[1/Σj Xj²] − 2 E[Xi² / (Σj Xj²)²].

Summing over i (and noting that Σi Xi² / (Σj Xj²)² = 1/Σj Xj²), we obtain

R(µ, µ̂JSE) = p − (2a(p − 2) − a²) E[1/Σj Xj²].
This is minimised at a = p − 2, and is less than p for this value; this concludes the proof.
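The dominance result can be checked by Monte Carlo; a sketch with an arbitrary mean vector (here p = 10, so the gap below the MLE's risk of p should be clear).

```python
import random

random.seed(2)

def james_stein(x, a=None):
    """JS estimator (1 - a / ||x||^2) x, with a = p - 2 by default."""
    p = len(x)
    a = p - 2 if a is None else a
    s = sum(xi * xi for xi in x)
    return [(1 - a / s) * xi for xi in x]

p, mu, n_rep = 10, [0.5] * 10, 5000
risk_mle = risk_js = 0.0
for _ in range(n_rep):
    x = [random.gauss(m, 1.0) for m in mu]
    risk_mle += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    risk_js += sum((ei - mi) ** 2 for ei, mi in zip(james_stein(x), mu))
risk_mle /= n_rep
risk_js /= n_rep
```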
Remark. The James-Stein estimator shrinks each component of X towards the origin. However, there
is of course nothing special about the origin; a similar estimator

µ̂JSE^(µ0) = µ0 + (1 − (p − 2)/||X − µ0||²)(X − µ0)

can be defined which shrinks X towards an arbitrary point µ0, and it can easily be shown that this also
strictly dominates µ̂MLE. (See the handwritten notes for the details.)
Exercise. Show that for some a the estimator

X̄1p + (1 − a/||X − X̄1p||²)(X − X̄1p)

strictly dominates µ̂JSE, where 1p = (1, . . . , 1).
Remark. Observe that when ||X − µ0||² < p − 2, the shrinkage factor becomes negative. To avoid this
problem, we can define

µ̂JSE+^(µ0) = µ0 + (1 − (p − 2)/||X − µ0||²)₊ (X − µ0)

(where x₊ denotes the positive part), which strictly dominates µ̂JSE^(µ0).
It is worth noting that neither µ̂JSE^(µ0) nor µ̂JSE+^(µ0) are admissible.
Example (Baseball example). Consider the dataset in fig. 11.1, taken from Young and Smith.
It shows statistics from the 1998 baseball pre-season in the US for 17 top players. Our interest is in
predicting the home run strike rate of each player in the full season.
For each player i, Yi is the number of home runs out of ni times at bat in the pre-season. We
assume that home runs occur according to a binomial distribution, so that player i has probability
pi of hitting a home run each time at bat, independently of other at bats and other players. Thus
Yi ∼ Bin(ni , pi ).
Here pi is the true full-season strike rate (and Yi /ni is the strike rate in the pre-season); the actual
values of pi as well as the actual number ABi of at bats of each player (in the full season) and the
actual number of home runs HRi are shown in the figure.
So, how might we estimate pi given just the pre-season statistics Yi and ni for each player? Obviously
the naïve estimate is the MLE p̂i = Yi /ni . These give rise to the estimated number of home runs
ĤRi = p̂i · ABi (assuming we know the actual number of at bats, which of course at the time we
wouldn't have). These values are shown in the figure.
First transform the data, setting Xi = f_{ni}(Yi /ni) where fn(y) := n^{1/2} sin⁻¹(2y − 1). Then, approximately,
Xi ∼ N(µi , 1) for each i, with µi = f_{ni}(pi).
We can then use the James-Stein estimator to estimate the means µi . Using the ‘improved version’
we just encountered, we set
JSi := X̄ + (1 − (p − 3)/V)(Xi − X̄)

for each i, where X̄ = Σi Xi /p and V = Σi (Xi − X̄)² (here p = 17).
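The transform-then-shrink recipe can be sketched as follows; the pre-season rates below are hypothetical placeholders, not the figure's data.

```python
import math

def f(n, rate):
    """Variance-stabilising transform: X = sqrt(n) * arcsin(2 * y/n - 1)."""
    return math.sqrt(n) * math.asin(2 * rate - 1)

def js_towards_mean(xs):
    """Shrink towards the grand mean with factor 1 - (p - 3) / V."""
    p = len(xs)
    xbar = sum(xs) / p
    V = sum((x - xbar) ** 2 for x in xs)
    shrink = 1 - (p - 3) / V
    return [xbar + shrink * (x - xbar) for x in xs]

# hypothetical (pre-season rate, at-bats) pairs
data = [(0.10, 45), (0.05, 60), (0.18, 38), (0.08, 52), (0.12, 41),
        (0.02, 55), (0.15, 47), (0.07, 63), (0.11, 44), (0.09, 50)]
xs = [f(n, r) for r, n in data]
js = js_towards_mean(xs)
```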
Figure 11.1: Data for 17 players in the 1998 baseball pre-season and full season taken from Young and
Smith.
These estimates of the µi are shown in the figure, and transforming back will give us James-Stein
estimates of the number of home runs of each player, which are also shown.
We see that the James-Stein approach gives much better estimates on average! More precisely,
the James-Stein estimator achieves a lower aggregate risk than the naïve estimator, but allows
increased risk in estimation of individual components.
Chapter 12
Empirical Bayes Methods
We return now to our discussion of Bayes estimators (Bayes rules). While Bayes estimators have desirable
properties (the posterior mean, the Bayes estimator under quadratic loss, is often admissible), they can
be hard to calculate, in particular for the hierarchical models met in chapter 9.
Definition 12.1. Empirical Bayes methods adapt the hierarchical Bayesian model by replacing
the hyperparameter vector ψ with a point-estimate ψ̂ derived from the data.
So we now just have the likelihood X ∼ f(x, θ) and the prior θ ∼ π(θ, ψ̂).
Remark. Empirical Bayes methods can be viewed as an approximation of a full hierarchical Bayes model
that allows us to avoid doing ψ-integrals. One layer of the hierarchy has been ‘chopped off’.
Recall that we met this idea briefly in chapter 9 before hierarchical models were introduced.
A Bayes estimator θ̂EB can be calculated using π̂(θ | x). So for quadratic loss, we have

θ̂EB = ∫ θ π̂(θ | x) dθ,

the posterior mean.
Remark. In this setting, the Bayes estimator is called an empirical Bayes estimator, or an EB
estimator.
Figure 12.1: Data on tumor incidence in historical control groups and current group of rats, from Tarone
1982. The table displays the values yj /nj : (number of rats with tumors)/(total number of rats).
Example (Meta-analysis of studies of tumors in rodents). The data in fig. 12.1 shows the
number of rats with tumors, Yi , and the total number of rats ni in each of a number of previous
experiments on tumor growth, as well as the results of a new experiment which we are interested in
analysing.
As usual we’ll assume each Yi ∼ Bin(ni , θi ) independently, for parameters θi which we want to
estimate. As our prior distribution we assume that θi ∼ Beta(α, β) independently for each i,
where α, β are hyperparameters. This choice of prior is natural as it is conjugate for the binomial
distribution: the posterior distribution, after observing the new experiment (14 rats, 4 with tumors)
will be π(θ | y) = Beta(α + 4, β + 10).
Using an empirical Bayes approach with the method of moments goes as follows: compute the observed
tumor rates yj /nj for the historical experiments, let m and v denote their sample mean and variance,
and solve

α̂/(α̂ + β̂) = m,   α̂β̂ / ((α̂ + β̂)²(α̂ + β̂ + 1)) = v.

This solves to α̂ = 1.4, β̂ = 8.6.
Finally, calculate the Bayes estimate, which for the quadratic loss is the posterior mean. In this case
the posterior is π̂(θ | y) = Beta(5.4, 18.6), so the posterior mean is 0.225.
This estimate is less than the maximum-likelihood estimate of θbMLE = 4/14 we’d get based solely
on the current experiment, not taking into account past experiments.
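The final update can be checked in a couple of lines:

```python
a_hat, b_hat = 1.4, 8.6            # moment-matched hyperparameters from the text
y_new, n_new = 4, 14               # new experiment: 4 tumors out of 14 rats

post_a, post_b = a_hat + y_new, b_hat + (n_new - y_new)   # conjugate Beta update
post_mean = post_a / (post_a + post_b)
mle = y_new / n_new
```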
Proposition 12.2. The James-Stein estimator can be interpreted as an empirical Bayes estimator.
(Specifically, for a = p it’s the EB estimator for quadratic loss when using a mean-zero Gaussian
prior whose variance is estimated using maximum likelihood.)
Proof. We wish to construct an EB estimator for quadratic loss. There is some freedom of choice of
prior, but we will assume as our prior that the θi are drawn independently from a N(0, τ²) distribution.
Given τ², then, we have

θi | (xi , τ²) ∼ N( xi τ²/(1 + τ²), τ²/(1 + τ²) ).

This can be calculated by completing the square.
Remark. This is not the risk-minimising James-Stein estimator (the one with a = p − 2), but it does strictly dominate
the MLE for all θ. The James-Stein estimator with a = p − 2 can be recovered by using moment
estimators (see Young and Smith section 3.5).
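The marginal-MLE step can be sketched directly: marginally Xi ∼ N(0, 1 + τ²), so the MLE of 1 + τ² is ||x||²/p (ignoring the constraint τ² ≥ 0 for simplicity), and the resulting shrinkage factor is exactly the a = p James-Stein form.

```python
def eb_estimator(x):
    """EB posterior mean with N(0, tau^2) prior and tau^2 from marginal ML."""
    p = len(x)
    s = sum(xi * xi for xi in x)       # marginal MLE of 1 + tau^2 is s / p
    shrink = 1 - p / s                 # equals tau2_hat / (1 + tau2_hat)
    return [shrink * xi for xi in x]

x = [1.2, -0.7, 2.5, 0.3, -1.8]
est = eb_estimator(x)
# identical to the James-Stein estimator with a = p
s = sum(xi * xi for xi in x)
js_a_p = [(1 - len(x) / s) * xi for xi in x]
```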
The maximum-likelihood estimate for each θi would be simply xi . Let’s follow roughly the same
empirical Bayes approach as above to find a better estimator (similar to the James-Stein estimator).
As a prior we assume that θi are i.i.d. Exp(λ), so that π(θi | λ) = λe−λθi for each i and λ is a
hyperparameter to be estimated.
So the maximum marginal likelihood estimator is λ̂ = 1/x̄, where x̄ = n⁻¹ Σ_{i=1}^n xi.
This has the effect of shrinking the MLE estimates towards the mean x̄.
Remark. We see that the empirical Bayes approach tends to pull the estimates towards the common
mean. This is true in general for models with exchangeable parameters.
Note also that, as mentioned in chapter 9, one drawback of the empirical Bayes approach is that we’re
potentially using the same data twice, leading to overfitting.
Example. Suppose Yi ∼ Po(θi ) independently. Assume that the parameters θi are drawn indepen-
dently from some distribution π whose form we do not know.
Robbin’s method is then to approximate the marginal pmf p(y) by the actual number of observed
datapoints equal to y. So in this case
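A sketch of Robbins' estimator on a tiny synthetic sample (the data are made up):

```python
from collections import Counter

def robbins(ys):
    """Robbins' estimate of E[theta | Y = y]: (y + 1) * count(y + 1) / count(y),
    i.e. (y + 1) p(y + 1) / p(y) with the marginal pmf replaced by frequencies."""
    counts = Counter(ys)
    return {y: (y + 1) * counts.get(y + 1, 0) / counts[y] for y in counts}

ys = [0, 0, 0, 1, 1, 2, 0, 1, 3, 2]
est = robbins(ys)
```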
Chapter 13
Hypothesis Tests
If a hypothesis consists of a single point in Θ, so that Θ0 = {θ0 } say, we say that it is a simple hypothesis.
Otherwise it is called a composite hypothesis.
In general a test consists of a critical region C such that we reject H0 if and only if X ∈ C. We
reformulate this slightly by introducing the concept of the test function ϕ : X → {0, 1} given by
ϕ(x) = 1 if x ∈ C, and ϕ(x) = 0 if x ̸∈ C.
We will sometimes simply say the test ϕ. We will also sometimes need the notion of a randomized
test. Suppose that X = C1 ∪ C0 ∪ C= where C1 , C0 , C= are pairwise disjoint, and fix γ ∈ [0, 1]. Then
we generalize the notion of test function by saying that
ϕ(x) = 1 if x ∈ C1 , ϕ(x) = γ if x ∈ C= , ϕ(x) = 0 if x ∈ C0
is the test where we reject H0 when x ∈ C1 , accept H0 when x ∈ C0 , and reject H0 with probability
γ if x ∈ C= (by flipping a coin). Such a test ϕ is called a randomized test.
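For discrete data the nominal size typically cannot be attained exactly by a non-randomized test, which is where randomization earns its keep. A sketch for X ∼ Bin(10, 1/2), testing against p > 1/2 at α = 0.05 (illustrative numbers):

```python
from math import comb

n, p0, alpha = 10, 0.5, 0.05

def pmf(k):
    return comb(n, k) * p0**k * (1 - p0)**(n - k)

# Smallest cutoff whose tail probability under H0 does not exceed alpha:
k = min(x for x in range(n + 1) if sum(pmf(j) for j in range(x, n + 1)) <= alpha)
tail = sum(pmf(j) for j in range(k, n + 1))
gamma = (alpha - tail) / pmf(k - 1)    # randomise on the boundary point x = k - 1

size = tail + gamma * pmf(k - 1)       # equals alpha exactly
print(k, gamma, size)
```

The non-randomized test that rejects for x ⩾ 9 has size ≈ 0.011; rejecting with probability γ when x = 8 brings the size up to exactly α.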
α := sup{w(θ) : θ ∈ Θ0 }.
The idea is this: a good test has a small size so that α ⩽ α0 for some specified value α0 and makes w(θ)
as large as possible on Θ1 . Within this framework we can consider various classes of problems:
1. Simple H0 vs simple H1 : here there is an elegant and complete theory which tells us exactly how
to construct the best test, given by the Neyman-Pearson Theorem.
2. Simple H0 vs composite H1 : in this case the obvious approach is to pick θ1 ∈ Θ1 and construct
the Neyman-Pearson test of H0 against the simple alternative θ1 . In some cases, the critical region
one obtains is the same for all θ1 . When that happens the test is said to be uniformly most powerful
(or UMP). But there are many situations in which UMP tests do not exist, and then the problem is harder.
3. Composite H0 vs composite H1 : In this case the problem is harder again.
C = {x : Λ(x) ⩾ k}
and suppose that the constants k and α are such that Pθ0 (X ∈ C) = α. Then among all tests of
H0 against H1 of size α, the test with critical region C has maximum power.
Tests with critical regions such as C are called Neyman-Pearson tests or likelihood ratio tests
(LRT).
2. Given any other test ϕ for which Eθ (ϕ(X)) ⩽ α for all θ ∈ Θ0 , we have Eθ (ϕ0 (X)) ⩾ Eθ (ϕ(X))
for all θ ∈ Θ1 .
Note that a UMP test does not necessarily exist. However, for one-sided testing problems involving a
single parameter there is a wide class of parametric families that have a UMP test.
Definition 13.4. A family of densities {f (x, θ), θ ∈ Θ ⊆ R} with real scalar parameter θ is said to
be of monotone likelihood ratio or MLR for short if there exists a function t(x) such that the
likelihood ratio
x ↦ f (x, θ2 )/f (x, θ1 )
is a non-decreasing function of t(x) whenever θ1 ⩽ θ2 .
Theorem 13.5. Suppose that X has a distribution from a family which is MLR with respect to a
statistic t(X) and that we wish to test H0 : θ ⩽ θ0 against H1 : θ > θ0 . Suppose that the distribution
of t(X) is continuous. Then the test with critical region C = {x : t(x) > t0 }, where t0 is chosen so
that Pθ0 (t(X) > t0 ) = α, is UMP among all tests of size α.
Proof. For any θ1 > θ0 the Neyman-Pearson test of H0 : θ = θ0 against H1 : θ = θ1 has a critical
region of the form C = {x : t(x) > t0 } for some t0 which is chosen so that Pθ0 (t(X) > t0 ) = α.
Note that t0 does not depend on θ1 and so the critical region C is the same for all values of θ1 .
Thus, we see that this test is UMP for testing H0 : θ = θ0 against H1 : θ > θ0 .
Next, we claim that for any critical region of the form C = {x : t(x) > t0 } the map
θ ↦ Pθ (X ∈ C)
is non-decreasing. This can be seen using an argument involving randomized test procedures and the
optimality of the LRT (see Young and Smith, p. 72). In particular C has size α for the composite null
H0 : θ ⩽ θ0 , and for any other critical region C ′ of size at most α,
Pθ1 (X ∈ C ′ ) ⩽ Pθ1 (X ∈ C) for all θ1 > θ0 .
This shows that C is UMP among all tests of its size.
Example. Suppose that X1 , . . . , Xn are i.i.d. from an exponential distribution with mean θ :
f (x, θ) = θ−1 e−x/θ , x > 0, where θ ∈ (0, ∞). Let us say that we first want to test H0 : θ = θ0
against H1 : θ > θ0 (this is a simple vs composite test).
Suppose now that we want to test H0∗ : θ ⩽ θ0 against H1 : θ > θ0 . Note that
Pθ (∑i Xi > k) = Pθ (∑i Xi /θ > k/θ) (13.1)
= P(Y > k/θ), (13.2)
where Y is a Gamma(n, 1) r.v. This is a non-decreasing function of θ. Therefore the test with
critical region C = {x : t(x) > kα } with size α for H0 also has size α for H0∗ . Now let ϕ(X) be
any other test of size α under H0∗ . Since H0 is a smaller hypothesis than H0∗ , the test ϕ also
has size ⩽ α under H0 . But then by the Neyman-Pearson Theorem Eθ1 ϕ(X) ⩽ Eθ1 ϕ0 (X) for all
θ1 > θ0 . Thus ϕ0 is UMP.
Example. Suppose that X1 , . . . , Xn are i.i.d. from a one-dimensional exponential family density
f (x, θ) = h(x) exp{θt(x) − B(θ)}. Writing, with a slight abuse of notation, t(x) = ∑i t(xi ) for the
natural statistic of the sample, the likelihood ratio is
f (x, θ2 )/f (x, θ1 ) = en(B(θ1 )−B(θ2 )) exp{(θ2 − θ1 )t(x)},
which is non-decreasing in t(x) whenever θ1 ⩽ θ2 , so the family is MLR with respect to t.
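As a numerical sanity check of the MLR property, the sketch below evaluates this ratio for i.i.d. Exp(mean θ) samples (an exponential family with natural statistic t(x) = ∑ xi ; the grid of t-values and the parameter values are arbitrary) and confirms it is non-decreasing in t:

```python
import math

def lratio(t, theta1, theta2, n=5):
    # Joint-density ratio for n i.i.d. Exp(mean theta) observations;
    # it depends on the data only through t = sum of the x_i.
    return (theta1 / theta2) ** n * math.exp(t * (1 / theta1 - 1 / theta2))

ts = [0.5 * i for i in range(1, 40)]
ratios = [lratio(t, theta1=1.0, theta2=2.0) for t in ts]
assert all(a <= b for a, b in zip(ratios, ratios[1:]))   # non-decreasing in t
print(ratios[0], ratios[-1])
```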
Definition 13.6. We call π0 /π1 the prior odds in favor of H0 , and B = f0 (x)/f1 (x) is the Bayes factor .
The first person to use Bayes factors extensively was Jeffreys, in his book Theory of Probability (first
edition 1939). This can be considered to be Jeffreys’ main contribution to the theory of statistics.
Following Jeffreys, however, there were few methodological developments until the 1980s. This is an
active field of research today.
From the point of view of decision theory, any Bayes rule will take the form C = {B < k} for some
value k and is therefore an LRT. The class of Bayes rules is exactly the class of Neyman-Pearson
rules.
Remark. A rough guide to interpreting Bayes factors given by Adrian Raftery is as follows:
The value 2 log(B0/1 ) is sometimes reported because it’s on the same scale as the familiar deviance and
likelihood ratio test statistic.
The hypothesis Hi is not just θ ∈ Θi but rather the full prior model that
θ | Hi ∼ gi (θ), θ ∈ Θi .
Observe in particular that there is absolutely no need to suppose that the Θi are disjoint.
B = f (x, θ0 ) / ∫Θ1 f (x, θ)g1 (θ) dθ.
More generally, there is nothing here that requires the same parametrization under the two hypotheses.
Suppose that we have two candidate parametric models M1 and M2 for data X, and the two models
have respective parameter vectors θ1 and θ2 . Under prior densities π1 (θ1 ) and π2 (θ2 ), the marginal
distribution for X under each model is found as
p(x | Mi ) = ∫ f (x, θi , Mi )πi (θi ) dθi .
Note that from this point of view, what we have is really a hierarchical Bayesian model where the
model index plays the role of the hyperparameter.
Example. Suppose that X1 , . . . , Xn are i.i.d. N (θ, σ 2 ) with σ 2 known. Consider H0 : θ = 0 against
H1 : θ ̸= 0. Also suppose that the prior g1 under H1 is N (µ, τ 2 ). We have B = p1 /p2 where
p1 = (2πσ 2 )−n/2 exp{− ∑i Xi2 /(2σ 2 )},
and
p2 = (2πσ 2 )−n/2 ∫R exp{− ∑i (Xi − θ)2 /(2σ 2 )} (2πτ 2 )−1/2 exp{−(θ − µ)2 /(2τ 2 )} dθ.
Using that
∫ exp{−(nτ 2 + σ 2 )(θ − θ̂)2 /(2σ 2 τ 2 )} dθ = (2πσ 2 τ 2 /(nτ 2 + σ 2 ))1/2 ,
we see that
p2 = (2πσ 2 )−n/2 (σ 2 /(nτ 2 + σ 2 ))1/2 exp{−(1/2)[n(x̄ − µ)2 /(nτ 2 + σ 2 ) + ∑i (xi − x̄)2 /σ 2 ]}.
Writing t = √n x̄/σ, η = −µ/τ and ρ = σ/(τ √n), we can rewrite this as
B = (1 + 1/ρ2 )1/2 exp{−(1/2)[(t − ρη)2 /(1 + ρ2 ) − η 2 ]}.
This illustrates a difficulty with the Bayes factor approach. In general, many Bayesian solutions to
point and interval estimation problems are close to the classical solutions when the prior is diffuse.
However, here when we let τ 2 → ∞ we see that ρ → 0 and thus B → ∞. In other words, in the
limit where the prior under H1 is diffuse (infinite variance), we have overwhelming support for
H0 no matter the observed data. This is an instance of Lindley’s paradox . One must therefore
choose η, ρ to represent some reasonable judgement of where θ is likely to be when H0 is false;
there is no way to escape this by using some non-informative prior!
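The paradox is easy to reproduce numerically from the closed form for B (a sketch; the data, sample size, and grid of τ -values are invented for illustration):

```python
import math

def bayes_factor(xbar, n, sigma, mu, tau):
    # B in favour of H0: theta = 0, using the closed form derived above,
    # with t = sqrt(n)*xbar/sigma, eta = -mu/tau, rho = sigma/(tau*sqrt(n)).
    t = math.sqrt(n) * xbar / sigma
    eta = -mu / tau
    rho = sigma / (tau * math.sqrt(n))
    return math.sqrt(1 + 1 / rho**2) * math.exp(
        -0.5 * ((t - rho * eta) ** 2 / (1 + rho**2) - eta**2))

# xbar sits 2.5 standard errors away from 0, yet B grows without bound with tau:
bfs = [bayes_factor(xbar=0.5, n=25, sigma=1.0, mu=0.0, tau=tau)
       for tau in [1.0, 10.0, 100.0, 1000.0]]
print([round(b, 2) for b in bfs])
```

Even though the data are mildly incompatible with H0, the Bayes factor in favour of H0 increases without limit as the prior variance τ 2 grows.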
Example (Psychokinesis example). In 1987 Schmidt, Jahn and Radin ran an experiment where
a subject with alleged psychokinetic ability tried to ‘influence’ a stream of quantum particles arriving
at a quantum gate. Each particle would upon arrival at the gate either trigger a red light or a green
light; the laws of quantum mechanics suggest a 50/50 ratio, and the subject tried to influence the
particles to go to red.
Let X be the number of particles observed to go to red out of a total of n. We use the model
X ∼ Bin(n, θ) where θ is unknown. In the experiment, n = 104,490,000 and the observed value of
X was x = 52,263,471.
H0 : θ = 1/2, H1 : θ ̸= 1/2.
The frequentist p-value is Pθ=1/2 (X ⩾ x) = 0.0003. This suggests very strong evidence of paranor-
mal ability?
Let’s reframe this as a Bayesian test to see what’s going on. Choose the mixed prior with π0 = 1/2
and g1 = 1[0,1] corresponding to a flat prior on [0, 1]. Under this prior, the posterior probability of
H0 is, writing C(n, x) for the binomial coefficient,
π(H0 | x) = π0 f (x, 1/2) / (π0 f (x, 1/2) + π1 ∫01 f (x, θ) dθ) (13.3)
= C(n, x) 2−n / (C(n, x) 2−n + 1/(n + 1)) (13.4)
≈ 0.92, (13.5)
corresponding to a Bayes factor B ≈ 11.5.
This gives a very different conclusion from the one based on the p-value.
This reflects that we are reasonably sure before conducting the experiment that θ = 1/2 is a more
likely value than any other.
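The posterior probability can be checked numerically using log-gamma functions to avoid overflow (a sketch; small discrepancies with the rounded values above come down to rounding conventions):

```python
import math

n, x = 104_490_000, 52_263_471

# log of the binomial coefficient C(n, x), via log-gamma to avoid overflow
log_binom = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

# B = f(x, 1/2) / m1(x), where m1(x) = 1/(n + 1) under the flat prior on theta
log_B = log_binom - n * math.log(2) + math.log(n + 1)
B = math.exp(log_B)
posterior_H0 = B / (1 + B)     # since pi0 = pi1 = 1/2
print(round(posterior_H0, 2))
```

With equal prior weights the posterior odds equal the Bayes factor, and the posterior probability of H0 comes out at about 0.92, in sharp contrast with the tiny p-value.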
Lemma 13.8. The rule ϕ has risk R(θ0 , ϕ) = aα and R(θ1 , ϕ) = bβ where β = 1 − w(θ1 ).
Proof. We have R(θ0 , ϕ) = a Pθ0 (X ∈ C) = aα and R(θ1 , ϕ) = b Pθ1 (X ̸∈ C) = bβ.
To calculate the Bayes risk we need a prior π. Let π(θ0 ) = p0 and π(θ1 ) = p1 be the prior probabilities
that H0 and H1 hold, respectively.
Definition 13.10. The Bayes test is the rule δC with the critical region C chosen to minimise
the Bayes risk (under the loss function defined above).
Theorem 13.11 (Bayes test for simple hypotheses). The critical region for the Bayes test
with prior π and loss L is
C = {x : f (x, θ1 )/f (x, θ0 ) ⩾ A}, where A = p0 a/(p1 b).
Corollary 13.12. The Bayes test is a likelihood ratio test with A = p0 a/(p1 b).
Corollary 13.13. Every likelihood ratio test is a Bayes test for some prior probabilities p0 , p1 .
Example. Suppose X1 , . . . , Xn are i.i.d. N (µ, σ 2 ) with σ 2 known, and we want to test H0 : µ = µ0
against H1 : µ = µ1 , with µ1 > µ0 .
Using that X̄ ∼ N (µ, 1/4), this gives Type I/II error probabilities
α = P(X̄ ⩾ 0.3999 | µ = 0, σ 2 /n = 1/4) = 0.212
and
β = P(X̄ < 0.3999 | µ = 1, σ 2 /n = 1/4) = 0.115.
The frequentist approach, fixing α = 0.05, would give β = 0.363 (easy to check), so we see that in
the Bayes test α is increased and β decreased relative to the frequentist test.
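The error probabilities quoted above can be verified directly from the normal CDF (a sketch; the cutoff 0.3999 is taken from the example and z0.95 ≈ 1.6449 is the standard normal quantile):

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

sd = 0.5                  # X-bar ~ N(mu, 1/4)
c_bayes = 0.3999          # Bayes-test cutoff from the example

alpha_bayes = 1 - Phi((c_bayes - 0) / sd)
beta_bayes = Phi((c_bayes - 1) / sd)

# frequentist test fixing alpha = 0.05 rejects when X-bar >= sd * z_{0.95}
c_freq = sd * 1.6449
beta_freq = Phi((c_freq - 1) / sd)
print(round(alpha_bayes, 3), round(beta_bayes, 3), round(beta_freq, 3))
```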
Definition 13.14. The maximum a posteriori (MAP) test chooses the hypothesis with the
highest posterior probability P(Hi | X = x).
Theorem 13.15. The MAP test is the Bayes test under the 0–1 loss.
Proof. Exercise.
Proposition 13.16. The Bayes test for the 0–1 loss (i.e. the MAP test) rejects H0 iff
m0 (x)/m1 (x) < π1 /π0 .
Nevertheless let us just check that this is indeed the MAP test in the case where H0 is simple and
H1 is composite. The marginal distribution for X under this (hierarchical) prior is
m(x) = π1 ∫Θ1 f (x, θ)g1 (θ) dθ + π0 f (x, θ0 ),
and the posterior probability of H0 is
π(H0 | x) = π0 f (x, θ0 ) / (π1 ∫Θ1 f (x, θ)g1 (θ) dθ + π0 f (x, θ0 )).
The Bayes test for the 0–1 loss, i.e. the MAP test, rejects H0 iff π(H0 | x) < π(H1 | x), i.e. iff
Example. In a quality inspection program components are selected at random from a batch and
tested. Let θ denote the failure probability. Suppose that we want to test the hypotheses H0 : θ ⩽ 0.2
against H1 : θ > 0.2.
Suppose n components are selected for independent testing. Modelling the number of failures X as
X ∼ Bin(n, θ), the marginal likelihood for H0 is
m0 (x) = ∫Θ0 f (x, θ)g0 (θ) dθ (13.17)
= ∫[0,0.2] C(n, x) θx (1 − θ)n−x (30θ(1 − θ)4 /π0 ) dθ, (13.18)
where C(n, x) is the binomial coefficient.
So the Bayes factor is B0/1 = m0 (x)/m1 (x) = 0.536/0.134 = 4 > π1 /π0 = 1.89, so the Bayes test
does not reject H0 . Indeed, the overall marginal likelihood is m(x) = m0 (x)π0 + m1 (x)(1 − π0 ) ≈
0.273, so the posterior probabilities for the hypotheses are π(H0 | x) = π(x | H0 )π0 /m(x) ≈
0.185/0.273 = 0.678 and π(H1 | x) ≈ 0.322; we see that H0 indeed maximises the posterior.
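The arithmetic in this example can be verified as follows (a sketch; π0 is not restated in this excerpt, so backing it out from the quoted prior odds of 1.89 is an assumption):

```python
# Reported values from the example; pi0 is backed out from the stated
# prior odds pi1/pi0 = 1.89 (an assumption, consistent with the quoted numbers).
m0, m1 = 0.536, 0.134
pi0 = 1 / (1 + 1.89)
pi1 = 1 - pi0

B = m0 / m1                    # Bayes factor B_{0/1}
m = m0 * pi0 + m1 * pi1        # overall marginal likelihood
post0 = m0 * pi0 / m           # posterior probability of H0
print(round(B, 2), round(m, 3), round(post0, 3))
```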
Figure 13.1: Power functions for one- and two-sided tests (from Young and Smith).
We expect tests with critical regions of the form C = {x : t(x) < t1 or t(x) > t2 }, where t1 < t2 ,
to have good properties. Such tests are called two-sided tests based on T .
A test with critical region C is called unbiased of size α if
Pθ (X ∈ C) ⩽ α ∀θ ∈ Θ0 but Pθ (X ∈ C) ⩾ α ∀θ ∈ Θ1 .
A test which is uniformly most powerful amongst the class of all unbiased tests is called uniformly
most powerful unbiased , abbreviated UMPU.
The idea is illustrated by Figure 13.1, for the case H0 : θ = θ0 against H1 : θ ̸= θ0 (in the figure,
θ0 = 0). The UMP tests for the one-sided alternatives H1 : θ > θ0 and H1 : θ < θ0 each fail miserably
to be unbiased, but there is a two-sided test whose power function is given by the dotted curve, and
we may hope that such a test will be UMPU.
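The bias phenomenon in Figure 13.1 can be reproduced with a simple normal location model (a sketch; a single observation X ∼ N (θ, 1) and standard normal quantiles are illustrative choices):

```python
import math

def Phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

alpha = 0.05                      # common size of the tests below
z1, z2 = 1.6449, 1.9600           # z_{0.95} and z_{0.975}

def power_one_sided(theta):       # reject when X > z1, with X ~ N(theta, 1)
    return 1 - Phi(z1 - theta)

def power_two_sided(theta):       # reject when |X| > z2
    return (1 - Phi(z2 - theta)) + Phi(-z2 - theta)

# At theta = -1 the one-sided test has power far below alpha (it is biased
# against the two-sided alternative), while the two-sided power exceeds alpha:
print(round(power_one_sided(-1.0), 4), round(power_two_sided(-1.0), 4))
```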
Remember that T itself also belongs to an exponential family, with density of the form
fT (t, θ) = hT (t) exp{θt − B(θ)}.
We shall assume that T is a continuous random variable with hT > 0 on the open set that defines the
range of T . This avoids the need for randomised tests, and makes our proofs less technical at the
cost of very little loss of generality.
Theorem 13.18. For any α there exists a UMPU test of size α which is of the two-sided form in
T.
We do not include a full proof of this result here. However, we mention that it starts with the following
generalisation of the Neyman-Pearson Theorem:
Suppose that f0 , f1 , . . . , fm are integrable functions, that α1 , . . . , αm are given constants, and that
the class C of test functions ϕ satisfying ∫ ϕ(x)fi (x) dx = αi for i = 1, . . . , m is non-empty. Then
1. There is one member of C that maximizes ∫ f0 (x)ϕ(x) dx.
2. A necessary and sufficient condition for ϕ∗ ∈ C to be a maximizer is that there exist constants
k1 , . . . , km such that
ϕ∗ (x) = 1 if f0 (x) > ∑i ki fi (x), and ϕ∗ (x) = 0 if f0 (x) < ∑i ki fi (x). (13.19)
3. If ϕ ∈ C satisfies (13.19) with k1 , . . . , km ⩾ 0, then it maximises ∫ f0 (x)ϕ(x) dx among all
functions satisfying ∫ ϕ(x)fi (x) dx ⩽ αi , i = 1, . . . , m.
Bibliography
Chang, Joseph T. and David Pollard. “Conditioning as disintegration”. In: Statistica Neerlandica 51.3
(1997), pp. 287–317.
Liero, Hannelore and Silvelyn Zwanzig. Introduction to the Theory of Statistical Inference. CRC
Press, 2016.
Young, G. Alastair and Richard L. Smith. Essentials of Statistical Inference. Cambridge University
Press, 2005.