
PRINCIPLES OF STATISTICS

(Based on lectures for Cambridge Mathematics Tripos, Part II)

© 2013 A. P. Dawid

<[email protected]>
<http://www.statslab.cam.ac.uk/~apd/>
Preamble

Probability and Statistics

PROBABILITY THEORY is an entirely mathematical subject. Starting with the description of a problem, we apply the axioms and rules of the theory to solve well-specified problems. Our starting point is a Probability Model, i.e. a fully specified distribution P for a given variable (or variables, or process...).
STATISTICS (or Statistical Inference) can be regarded as the inverse of Probability:

STATISTICS = (PROBABILITY)⁻¹.

We do not start with a known probability distribution and explore its mathematically determined properties. Instead, we start with DATA, regarded as having been generated from some unknown probability distribution P, and want to say something sensible about P (perhaps so as to be able to say something sensible about future variables also generated from P).

PROBABILITY: known P ⇒ properties of (as yet unobserved) observables X

STATISTICS: observed values x ⇒ properties of unknown generating distribution P

Although the top route is entirely mathematical, the bottom route is not! Indeed, how to say something sensible about the future on the basis of past data constitutes the essentially insoluble philosophical problem of INDUCTION.
So... Statistics is impossible! And this is reflected in the fact that there are numerous different warring schools of thought (frequentist, Bayesian, likelihood, ...) as to how to conduct statistical inference.

Chapter I

Statistical Models

1 Statistical experiment

Although we may not know the distribution (probability model) P that generated our data, we may be willing to confine it to some specified statistical model: P ∈ P. For most of this course we take this model to be parametric, i.e. of the form

P := {Pθ : θ ∈ Ω}

where Ω is a finite or countable set, or a Euclidean space. More general forms of P are non-parametric.
We will thus preface any statistical analysis with our specification of the relevant (parametric) statistical experiment:

Definition 1 Statistical experiment

A statistical experiment, E = (X, X, Ω, Θ, P), comprises the following ingredients:

• A measurable¹ sample space X
• A (typically observable) random variable X taking values in X
• A parameter space Ω
• A (typically unobservable) parameter variable² Θ with possible values in Ω
• For each θ ∈ Ω, a fully specified distribution Pθ over X, describing uncertainty about X, given Θ = θ.³ (We shall also use the notation P(A | Θ = θ) or P(A | θ), interchangeably with Pθ(A).)

¹ We shall not fuss about the associated σ-field of measurable sets. In applications, X will almost always be either discrete, with all sets measurable, or a subset of a Euclidean space, equipped with its Borel σ-algebra.
² The use of different symbols for the name (Θ) and a possible value (θ) of the parameter is non-standard, but marks a useful distinction. Note that the symbol Θ is commonly used in place of our Ω to denote the parameter space.
³ To avoid silly complications, we shall assume that θ ≠ θ′ ⇒ Pθ ≠ Pθ′ — in this case the parameter Θ is said to be identifiable.

The labelled family of distributions P := {Pθ : θ ∈ Ω} constitutes the associated statistical model. Our task is to "draw inference" about Θ from observation of X.
In this course we shall focus almost exclusively on data that can be considered as having arisen as independent and identically distributed observations from a common probability distribution. Although cases where these assumptions cannot be made are both practically and theoretically important, we shall have quite enough to handle in the independent and identically distributed case. So henceforth we assume independent and identically distributed models except where otherwise noted.
In our applications, the complete observable will thus comprise n independent and identically distributed observables from a common distribution. If E = (X, X, Ω, Θ, P) describes the experiment relating to a single observation, that relating to n independent and identically distributed observations X = (X1, . . . , Xn) will be Eⁿ := (Xⁿ, X, Ω, Θ, Pⁿ); here Pⁿ := {Pθⁿ : θ ∈ Ω}, where Pθⁿ denotes the joint distribution of X when the Xi are independent and identically distributed as Pθ. In this case it is obviously enough to specify the single-observation experiment.
A note on densities: We don't need much measure theory as such, but the language of measure theory provides an elegant and unified way of describing things. In all the cases we shall consider, there will be a dominating measure µ on X such that each Pθ is absolutely continuous with respect to µ (i.e., µ(A) = 0 ⇒ Pθ(A) = 0). Then by the Radon–Nikodym theorem there will, for each θ, exist a density function dPθ/dµ = p(x | θ) (or pθ(x)), unique p.p. [µ], such that, for measurable A,

$$P_\theta(X \in A) = \int_A p(x \mid \theta)\, d\mu(x).$$

For all our examples and applications, we shall have either X ⊆ Rᵏ, or X discrete (finite or countable). Unless otherwise noted, in the former case we shall assume µ = Lebesgue measure, so using the standard definition of probability density function; and, in the latter, µ = counting measure, when the "density" becomes the probability mass function. However, even in these cases it is sometimes helpful to consider other dominating measures.
The density function p(x | θ) is only defined up to a set of µ-measure 0. It will typically be possible to choose unique versions of these functions that are continuous in both arguments. This will henceforth be assumed unless otherwise noted.

Likelihood

An important ingredient of many inference methods is the likelihood function⁴ generated by an observation.

Definition 2 Given a statistical experiment E = (X, X, Ω, Θ, P), suppose we observe X = x. The associated likelihood function for Θ is the function⁵ L : Ω → R⁺ defined by:

L(θ) ∝ p(x | θ). □

When L(θ) > 0 on Ω, we often work with the log-likelihood function l(θ) ≃ log L(θ) (where ≃ denotes that the two sides may differ by an additive constant).
The value θ̂ (if it exists) maximising L(θ) (or equivalently l(θ)) is the maximum likelihood estimate (MLE) of Θ (for data X = x).

⁴ "Likelihood" was introduced by Ronald Fisher, who introduced this new term to make a clear distinction between the likelihood function p(x | ·), for given x a function on the parameter space Ω (transforming as a scalar), and the probability density p(· | θ), for given θ a (non-scalar) function on the sample space X. Unfortunately this careful distinction is in danger of being lost, with the term "likelihood" increasingly employed as if it were synonymous with "statistical model".
⁵ More properly, "the" likelihood function is not a unique function, but an equivalence class of functions, regarded as equivalent if they are related by multiplication by a positive scaling factor. To be well-defined, any procedure based on "the" likelihood function must be unaffected by such scalar multiplication: this will be the case for all those we shall consider.
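As a concrete numerical sketch (my own illustration, not part of the original notes), the MLE can be found by direct maximisation of the log-likelihood; the binomial-type likelihood and all names below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Log-likelihood for X ~ Binomial(n, theta), observed X = x.
# The log binomial coefficient is dropped: the likelihood is only defined
# up to a positive multiplicative (additive on the log scale) constant.
def log_lik(theta, x, n):
    return x * np.log(theta) + (n - x) * np.log(1 - theta)

x, n = 3, 10

# Maximise numerically over (0, 1); minimize_scalar minimises, so negate.
res = minimize_scalar(lambda th: -log_lik(th, x, n),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x / n)   # numerical MLE agrees with the closed form x/n
```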

2 Some parametric statistical models

Useful statistical models include:

NORMAL MODEL
X = R.
Θ = (M, V) ∈ Ω = R × R⁺.

When (M, V) = (µ, v), X ∼ N(µ, v), with density

$$p(x \mid \mu, v) = (2\pi v)^{-1/2} \exp\left\{-\frac{(x-\mu)^2}{2v}\right\}.$$

We have E(X | µ, v) = µ, var(X | µ, v) = v.
We are also often interested in the submodels:
NORMAL MEAN MODEL: M is unknown but V = v₀ is known.
NORMAL VARIANCE MODEL: V is unknown but M = µ₀ is known.
BINOMIAL MODEL
X = {0, 1, 2, . . . , n}
Ω = [0, 1]
Under Pθ, X ∼ B(n; θ), with density (probability mass function)

$$p(x \mid \theta) = \frac{n!}{x!\,(n-x)!}\, \theta^x (1-\theta)^{n-x}.$$

We have E(X | θ) = nθ, var(X | θ) = nθ(1 − θ).


NEGATIVE BINOMIAL MODEL
X = {0, 1, 2, . . .}
Ω = [0, 1]
Under Pθ, X ∼ NB(k; θ), with density

$$p(x \mid \theta) = \frac{(x+k-1)!}{(k-1)!\, x!}\, \theta^x (1-\theta)^k.$$

We have E(X | θ) = kθ/(1 − θ), var(X | θ) = kθ/(1 − θ)².
POISSON MODEL
X = {0, 1, . . .}
Ω = R⁺
Under Pθ, X ∼ P(θ), with density

p(x | θ) = e⁻ᶿ θˣ/x!

We have E(X | θ) = θ, var(X | θ) = θ.


GAMMA MODEL
X = R⁺
Θ = (A, B) ∈ Ω = R⁺ × R⁺
Given (A, B) = (a, b), X ∼ Γ(a, b), with density

$$p(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}$$

where Γ(·) is the Gamma function:⁶

$$\Gamma(a) := \int_0^\infty x^{a-1} e^{-x}\, dx.$$

[For integer a, Γ(a) = (a − 1)!; more generally Γ(a + 1) = a Γ(a).]
We have E(X | a, b) = a/b, var(X | a, b) = a/b².
We could also entertain the submodels obtained by fixing A or B. For the special case that A is fixed at 1, we obtain the EXPONENTIAL MODEL: X ∼ E(b). For B fixed at ½, and reparametrising by N = 2A, we obtain the CHI-SQUARED MODEL: $\chi^2_\nu = \Gamma(\tfrac{1}{2}\nu, \tfrac{1}{2})$.

⁶ Do not confuse the Gamma function and the Gamma distribution.

BETA MODEL
X = (0, 1)
Θ = (A, B) ∈ Ω = R⁺ × R⁺
Given (A, B) = (a, b), X ∼ β(a, b), with density

$$p(x \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1} (1-x)^{b-1}.$$

We have E(X | a, b) = a/(a + b), var(X | a, b) = ab/{(a + b)²(a + b + 1)}.

UNIFORM MODEL
X = R
Θ = (A, B)
Ω = {(a, b) ∈ R × R : a < b}
Given (A, B) = (a, b), X ∼ U(a, b), with density

p(x | a, b) = (b − a)⁻¹   (a ≤ x ≤ b).

Note: For this model it is not possible to choose the density function to be a continuous function of all its arguments.

CAUCHY LOCATION MODEL
X = R
Ω = R
Under Pθ, X ∼ C(θ, 1), with density

$$p(x \mid \theta) = \frac{1}{\pi}\, \frac{1}{1 + (x-\theta)^2}.$$

The mean and variance of this distribution do not exist.



3 Exponential families

A particularly versatile and tractable general form for a statistical model is the exponential family (EF).⁷

Definition 3 Let P = {Pθ : θ ∈ Ω} be a family of probability distributions on the measure space (X, A, m), all dominated by m, with densities p(x | θ) = dPθ/dm(x) defined for x ∈ X, θ ∈ Ω.
We call P a (p-parameter) exponential family if we can express

p(x | θ) = exp{a(x) + b(θ) + u(θ)ᵀ t(x)}    (1)

for some functions a : X → R, b : Ω → R, u : Ω → Rᵖ, t : X → Rᵖ. □

If we introduce the reference measure µ defined by µ(A) = ∫_A exp{a(x)} dm(x), then the density with respect to µ will be exp{b(θ) + u(θ)ᵀ t(x)}, i.e. we can take a(x) ≡ 0.
The transformed parameter Φ := u(Θ) is termed the natural or canonical parameter. By the Fisher–Neyman factorisation criterion, the statistic T := t(X) is sufficient⁸: it is termed the natural sufficient or canonical statistic. Both Φ and T take values in Rᵖ.
Re-expressed in terms of Φ, we can write

p(x | φ) = exp{a(x) − k(φ) + φᵀ t(x)}    (2)

where necessarily

$$e^{k(\phi)} = \int \exp\{a(x) + \phi^T t(x)\}\, dm(x). \qquad (3)$$

It can be shown⁹ that (with respect to an appropriate underlying measure ν on Rᵖ) the marginal distribution, Q_φ say, of T = t(X), when Φ = φ, will have density q_φ(·) of the form:

q_φ(t) = exp{α(t) − k(φ) + φᵀ t}    (4)

— although calculation of the function α may not be straightforward. In particular, the family Q = {Q_φ} of distributions of T is itself an EF.
⁷ Take care not to confuse this with the exponential distribution, or with the exponential model (a family of exponential distributions).
⁸ Indeed, T is minimal sufficient (see § 4 of Chapter II below) whenever u(Ω) contains an open set.
⁹ See Corollary 3 of Chapter II below.

In order for (2) (and hence (4)) to define a density, it is necessary and sufficient that k(φ) given by (3) be finite. The set Ω ⊆ Rᵖ for which this holds is the natural (or canonical) parameter space, and the EF is full when Φ can take any value in Ω.

Lemma 1 The natural parameter space Ω is convex. The function k is strictly convex on Ω.

We shall always suppose that Ω contains an open set in Rᵖ. This does not lose any generality.

For, if not, let q < p be the dimension of the affine span of Ω. We can re-express Φ ∈ Ω as an affine function AΨ + c of a new parameter Ψ ∈ R^q, and thence (2) in terms of Ψ as natural parameter, with new natural statistic S = AᵀT ∈ R^q. Convexity and full affine dimension of the new natural parameter space imply that it contains an open set in R^q.

The function k is very important, containing essentially all the information about the family Q. Nevertheless it does not appear to have a well-established name.
The moment-generating function (MGF) M_φ : Rᵖ → R⁺ ∪ {∞} of Q_φ is given by:

$$M_\phi(s) := E_\phi\left(e^{s^T T}\right) = \exp\{k(\phi + s) - k(\phi)\}.$$

So long as φ ∈ Ω°, the interior of Ω — which we now assume — this will be finite for s in a neighbourhood of 0.
The cumulant generating function (CGF) of Q_φ is thus

$$\kappa_\phi(s) := \log M_\phi(s) = k(\phi + s) - k(\phi).$$

In view of this, we shall call k(·) the cumulant function of the EF.
From general theory, knowledge of its (finite) MGF, or equivalently of its CGF, in a neighbourhood of 0 completely determines a distribution. So for an EF, knowledge of k in a neighbourhood of φ ∈ Ω° completely determines the distribution Q_φ of T when Φ = φ.
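A quick numerical check of this identity (my own sketch, anticipating Example 2 below, where the binomial model has cumulant function k(φ) = n log(1 + e^φ); the numbers n, φ, s are arbitrary):

```python
import numpy as np
from scipy.stats import binom

# Binomial EF (Example 2 below): natural statistic T = X, cumulant
# function k(phi) = n*log(1 + e^phi), success probability
# pi = e^phi / (1 + e^phi).
n, phi, s = 10, 0.4, 0.25
k = lambda f: n * np.log1p(np.exp(f))

pi = np.exp(phi) / (1 + np.exp(phi))
x = np.arange(n + 1)

mgf_direct = np.sum(np.exp(s * x) * binom.pmf(x, n, pi))  # E[e^{sT}] directly
mgf_formula = np.exp(k(phi + s) - k(phi))                 # exp{k(phi+s) - k(phi)}
print(mgf_direct, mgf_formula)                            # agree to rounding error
```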
Further, the first two derivatives of κ_φ at 0 — or, equivalently, of k at φ — give the mean and variance (dispersion) of the associated distribution Q_φ. We can also show these properties directly. Differentiate (3) with respect to φⱼ (and assume that this can be passed through the integral, which is indeed the case for φ ∈ Ω°). We obtain:

$$\frac{\partial k(\phi)}{\partial \phi_j}\, e^{k(\phi)} = \int t_j(x) \exp\{a(x) + \phi^T t(x)\}\, dm(x). \qquad (5)$$

Dividing this by (3) yields:

$$\frac{\partial k(\phi)}{\partial \phi_j} = \int t_j(x)\, p(x \mid \phi)\, dm(x) = E_\phi(T_j). \qquad (6)$$

The mean value parameter H := E_Φ(T) is an important transformation of Φ. From (6), the values η and φ of H and Φ are related by

η = ∇k(φ).    (7)

This transformation of Φ into H is 1-to-1. The inverse transformation g, expressing the natural parameter Φ as a function g(H) of the mean-value parameter H, is known as the (canonical) link function of the EF.
A further differentiation yields: for φ ∈ Ω°,

$$\frac{\partial^2 k(\phi)}{\partial \phi_j\, \partial \phi_k} = \mathrm{cov}_\phi(T_j, T_k). \qquad (8)$$

In particular, var_φ(Tⱼ) = ∂²k(φ)/∂φⱼ². For p = 1 we thus have k″(φ) = var_φ(T) > 0, in accordance with Lemma 1.

3.1 Examples

Example 1 Normal location model N(M, v₀)

Suppose that the variance V of the normal distribution is known to have value v₀, the unknown parameter thus being just M. With X = R, we have density (with respect to Lebesgue measure):

$$p(x \mid \mu) = (2\pi v_0)^{-1/2} \exp\left\{-\frac{1}{2}\frac{(x-\mu)^2}{v_0}\right\}. \qquad (9)$$

Equation (9) can be written in the EF form (1) (for p = 1), with:

$$a(x) = -\frac{1}{2}\left\{\log(2\pi v_0) + \frac{x^2}{v_0}\right\},\quad b(\mu) = -\frac{\mu^2}{2 v_0},\quad u(\mu) = \frac{\mu}{v_0},\quad t(x) = x.$$

In particular the natural sufficient statistic is T ≡ X (whence the mean-value parameter H is E(X | M) = M), and the natural parameter is Φ = M/v₀. The cumulant function k is given by k(φ) = ½v₀φ², and we readily check k′(φ) = v₀φ = µ = E(X), k″(φ) = v₀ = var(X). □

Example 2 Binomial model B(n; Π)

Now X = {0, 1, . . . , n}. Our parameter is Π ∈ [0, 1]. With respect to counting measure, the density is

$$p(x \mid \pi) = \frac{n!}{x!(n-x)!}\, \pi^x (1-\pi)^{n-x}. \qquad (10)$$

This can be written in the EF form (1), with:

$$a(x) = \log\left\{\frac{n!}{x!(n-x)!}\right\},\quad b(\pi) = n \log(1-\pi),\quad u(\pi) = \log\left(\frac{\pi}{1-\pi}\right),\quad t(x) = x.$$

Again the natural sufficient statistic is T ≡ X, and so the mean-value parameter is H = nΠ. The natural parameter is Φ ≡ log{Π/(1 − Π)} (whence H = ne^Φ/(1 + e^Φ)). We have k(φ) = n log(1 + e^φ), and check: k′(φ) = ne^φ/(1 + e^φ) = E(X), k″(φ) = ne^φ/(1 + e^φ)² = nπ(1 − π) = var(X). □

3.2 Repeated observations

Suppose (Xi) are independent and identically distributed from the EF with density (1). Then the joint density of X = (X1, . . . , Xn) (with respect to the product measure mⁿ on Xⁿ) is

$$p(\mathbf{x} \mid \phi) = \prod_{i=1}^n p(x_i \mid \phi) = \exp\{a_n(\mathbf{x}) - k_n(\phi) + \phi^T t_n(\mathbf{x})\} \qquad (11)$$

with $a_n(\mathbf{x}) = \sum_{i=1}^n a(x_i)$, $t_n(\mathbf{x}) = \sum_{i=1}^n t(x_i)$, and $k_n(\phi) = n k(\phi)$. Hence, as φ varies in Ω, the joint distributions of X again constitute an exponential family. The natural parameter Φ is unchanged, the natural statistic is $T = \sum_{i=1}^n t(X_i)$, and the mean-value parameter and the function k are both multiplied by n.

We note that, under repeated sampling, the dimensionality of the sufficient statistic remains bounded (by p) as the sample-size n increases. Conversely, under certain conditions¹⁰ this property will hold only for an EF.

¹⁰ Of which the most important is that the set of possible values for X does not depend on the value of Θ.

3.3 Exponential family likelihood

Suppose, for the EF of (1), that we observe X = x. The associated likelihood function on Ω is

L_Θ(θ) ∝ exp{b(θ) + u(θ)ᵀ t(x)}.    (12)

In particular, in terms of the natural parameter Φ, the log-likelihood function is

l_Φ(φ) ≃ −k(φ) + φᵀ t(x).    (13)

The maximum likelihood estimate φ̂ of Φ (if it exists) is the maximiser of l_Φ. Since k is convex, l_Φ is concave, and thus has at most one finite maximum on the convex set Ω. If φ̂ exists and lies in the interior Ω° (which will often but not always be the case), differentiation of (13) shows that it will satisfy the likelihood equation:

$$\frac{\partial k}{\partial \phi_j} = t_j(x) \quad (j = 1, \ldots, p). \qquad (14)$$
Conversely, when (14) has a solution in Ω, this will be the MLE.
From (6) we see that equation (14) is formed by equating the observed and the expected value of T. This directly delivers the MLE η̂ = t(x) of the mean-value parameter H; for any other parameter function, Θ = g(H) say, we have θ̂ = g{t(x)}. Whenever this yields a solution within the parameter space, that will be the MLE. For a random sample of size n, we similarly obtain the MLE by equating the population and sample means of T.
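As a concrete sketch (my own illustration, not from the notes): for the Poisson model viewed as an EF with t(x) = x, natural parameter φ = log θ and cumulant function k(φ) = e^φ, solving the likelihood equation numerically reproduces the sample mean:

```python
import numpy as np
from scipy.optimize import brentq

# Poisson model as a 1-parameter EF: t(x) = x, natural parameter
# phi = log(theta), cumulant function k(phi) = e^phi.
# For an i.i.d. sample the likelihood equation (14) reads
#   n * k'(phi_hat) = sum_i t(x_i).
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.7, size=200)   # illustrative data, true theta = 3.7

n, t_sum = len(x), x.sum()
phi_hat = brentq(lambda phi: n * np.exp(phi) - t_sum, -10, 10)

print(np.exp(phi_hat))   # MLE of theta ...
print(x.mean())          # ... equals the sample mean, as the theory predicts
```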
One measure of the precision of estimation is the curvature of the log-likelihood function at its maximum, as given by its negative second derivative. For the parameter Φ, this is just the second derivative of k, evaluated at φ̂. Since k″(φ) = (d/dφ)E(T | φ), this is large (and so the log-likelihood function is spiky) when E(T | φ) is changing rapidly in the neighbourhood of φ̂, which intuitively suggests that T is very informative about φ.
For Φ ∈ Rᵖ the p-dimensional generalisation of k″(φ̂) is the matrix $\left(\partial^2 k(\phi)/\partial\phi_i\,\partial\phi_j\right)\big|_{\phi=\hat\phi}$. This is a special case of the Fisher information matrix

$$\int -\left\{\partial^2 \log p(x \mid \theta)/\partial\theta_i\,\partial\theta_j\right\} p(x \mid \theta)\, d\mu(x)$$

(more of which later) evaluated at the MLE.
Chapter II

Principles of Inference

Consider a statistical experiment E = E^X = (X, X, Ω, Θ, P).

1 Sufficient statistic

Definition 1 Let T be a measurable space, and t : X → T a known measurable function. The variable T := t(X) is a statistic. □

If, instead of learning X, I learn only T, I will generally have lost information. But sometimes such reduction of our data can be effected without loss of useful information about Θ.

Definition 2 Sufficient statistic

A statistic T is (Fisher) sufficient for the statistical model P (or for its parameter Θ) if there exists a specification of the conditional distributions for the full data X, given T, that serves for each P ∈ P. □

Example 1 Suppose, given Λ = λ, the (Xi) are independent and identically distributed as P(λ), with density (probability mass function) p(x | λ) = e^{−λ}λˣ/x! (x = 0, 1, . . .). Then T := Σᵢ₌₁ⁿ Xi ∼ P(nλ). Thus, when each xi = 0, 1, . . . and t = Σᵢ₌₁ⁿ xi:

$$\mathrm{Prob}(\mathbf{X} = \mathbf{x} \mid T = t, \Lambda = \lambda) = \frac{\prod_{i=1}^n \mathrm{Prob}(X_i = x_i \mid \Lambda = \lambda)}{\mathrm{Prob}(T = t \mid \Lambda = \lambda)} = \frac{t!}{\prod_{i=1}^n x_i!}\, n^{-t}.$$

Since this expression does not depend on λ, the statistic T is sufficient for Λ. □
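The sufficiency here can also be seen by simulation. A minimal sketch (my own illustration, not from the notes; the sample size, conditioning value and target outcome are arbitrary small choices):

```python
import numpy as np

# Conditionally on T = sum(X) = t, a Poisson sample is Multinomial(t, 1/n
# each), whatever the value of lambda. Check that the conditional frequency
# of one particular outcome is the same at two different lambdas.
rng = np.random.default_rng(1)
n, t, target = 3, 4, (2, 1, 1)

def cond_freq(lam, reps=200_000):
    xs = rng.poisson(lam, size=(reps, n))
    keep = xs[xs.sum(axis=1) == t]           # condition on T = t
    hits = np.all(keep == target, axis=1)
    return hits.mean(), len(keep)

for lam in (0.8, 2.5):
    freq, m = cond_freq(lam)
    print(lam, round(freq, 4), m)
# Both conditional frequencies agree with the multinomial probability
# 4!/(2!1!1!) / 3^4 ~ 0.148, independently of lambda.
```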


Example 2 Suppose X = (X1, . . . , Xn), where the (Xi) are independent and identically distributed with an arbitrary continuous distribution over X = R. Consider the function t : Rⁿ → Rⁿ which is the order statistic of the data, obtained by arranging the values (which, by continuity, can be assumed distinct) in increasing order — thus losing the information about their original order. It is intuitive, and can readily be shown, that the conditional distribution of X, given t(X) = t, is the discrete uniform distribution over the set of all n! possible orderings of the given values t. Since this does not depend on the underlying distribution, the order statistic is sufficient for the statistical model comprising all continuous, independent and identically distributed probability models on Rⁿ. □

2 Sufficiency principles

The motivation for introducing sufficiency by means of Definition 2 is as follows. Let T be a statistic. We can consider the experiment E^X generating the full observable X (probabilistically determined by the value of Θ), as performed in two stages. In the first stage, we generate T from its marginal distribution (which will typically depend on the value of Θ), thus performing the marginal experiment E^T. Secondly, having obtained say T = t, we generate X from its conditional distribution, given T = t (and Θ): thus performing the conditional experiment E^{X|T=t}.
If now T is sufficient, the probabilistic structure of the conditional experiment E^{X|T=t} is fully determined, independently of whatever value Θ may take. It can thus be treated as generating "pure noise", and so entirely uninformative about Θ: indeed, we could convincingly mimic it ourselves using a randomising device such as a roulette wheel. This argument suggests that we can safely ignore the conditional experiment E^{X|T=t}, for the purpose of inferences about Θ, and base these entirely on the marginal experiment E^T. This in turn would mean that the sufficient statistic T must contain all the useful information about Θ that is available from conducting the experiment E^X.
This intuition is formalised in:

STRONG SUFFICIENCY PRINCIPLE [SSP]
If T ≡ t(X) is a sufficient statistic, then the inference to be drawn from observing X = x in E^X should be the same as that to be drawn from observing T = t(x) in the marginal experiment E^T. □

More precisely, let I be any proposed inference method, intended to be applied across a variety of experiments E: I(E^X, x) is the inference I produces when the outcome of experiment E^X is X = x. Then we can say I respects SSP if, whenever T ≡ t(X) is a sufficient statistic in E^X, I(E^X, x) = I(E^T, t(x)).
A weaker variant, referring only to the overall experiment E = E^X, is:

WEAK SUFFICIENCY PRINCIPLE [WSP]
If T ≡ t(X) is a sufficient statistic, and t(x) = t(x′), then the inference to be drawn from observing X = x (in E) should be the same as that to be drawn from observing X = x′ (in the same experiment E). □

That is to say, an inference method I respects WSP so long as I(E, x) = I(E, x′) whenever t(x) = t(x′) for any statistic T that is sufficient in E. Clearly

SSP ⇒ WSP: Any inference method respecting the strong sufficiency principle will automatically respect the weak sufficiency principle.

We shall see later how these sufficiency principles apply to various specific inference methods.

Example 3 Applied to Example 2, WSP requires that we should ignore the ordering of the data in our inferences: whatever inference we may make from some data-sequence (x1, . . . , xn), we should make the identical inference from a different data-sequence that differed from this only by a permutation of the elements. For example, use of the sample mean X̄ₙ, or of the sample median, as an estimator would respect this principle. However, if we were to use X_{n/2} we would be violating WSP. □

3 Sufficiency and likelihood

For further analysis we will assume that there is an underlying measure µ on X such that each Pθ has a density, say p(x | θ), with respect to µ.¹

¹ Note however that this does not hold for Example 2 if we allow — as we can — continuous distributions that are not absolutely continuous.

Theorem 1 (Fisher–Neyman factorisation criterion) The statistic T ≡ t(X) is sufficient for Θ if and only if we can express the density p(x | θ) in the form

p(x | θ) = h(x) g{t(x), θ}.    (1)

Proof. A simple proof for a discrete sample space was given in STATISTICS IB. The following is a more general argument. For simplicity, we suppose that, for some θ₀ ∈ Ω,

p(x | θ₀) > 0 for all x ∈ X.    (2)

Let T := t(X) be the set of possible values for T. We note the following:

(i). Assuming (2), the factorisation (1) will hold if and only if, for all θ ∈ Ω, we can express

p(x | θ) = p(x | θ₀) a{t(x), θ}.    (3)

(ii). p(x | θ) = p(x | θ₀) b(x, θ) if and only if, for any function f : X → R, Eθ{f(X)} = Eθ₀{f(X) b(X, θ)}.

(iii). (Kolmogorov's general definition of conditional expectation.) Suppose E(Y) exists (i.e. E(|Y|) < ∞). Then E(Y | T) = κ(T) if and only if, for any function φ(T) of T, E{φ(T) κ(T)} = E{φ(T) Y}.

(In both (ii) and (iii), the equality is to be interpreted in the sense that whenever either side exists, so does the other, and they are equal.)

Suppose first that T is sufficient, with density say q(t | θ) (with respect to some measure on T that we need not specify). Then (2) implies q(t | θ₀) > 0, all t ∈ T. Define a(t, θ) := q(t | θ)/q(t | θ₀).
For arbitrary k(X) with finite expectation, let κ(T) := Eθ{k(X) | T}, chosen not to depend on θ, as is possible by sufficiency. Then

Eθ{k(X)} = Eθ{κ(T)}             since κ(T) = Eθ{k(X) | T}
         = Eθ₀{κ(T) a(T, θ)}    by (ii) applied to the distributions of T
         = Eθ₀{k(X) a(T, θ)}    on applying (iii), since κ(T) = Eθ₀{k(X) | T}.

On using (ii) and (i), we see that (1) holds.
Conversely, suppose we have a factorisation as in (1), or equivalently (3) holds. Consider arbitrary functions k(X), f(T), and let κ(T) := Eθ₀{k(X) | T}. Then, using (ii),

Eθ{f(T) κ(T)} = Eθ₀{f(T) κ(T) a(T, θ)}
             = Eθ₀{f(T) k(X) a(T, θ)}
             = Eθ{f(T) k(X)}.

So, by (iii), κ(T) = Eθ{k(X) | T}. Since Eθ{k(X) | T} is thus the same for all θ, for any function k(X),² T is sufficient for Θ. □

² It is enough to consider all indicator functions of sets.

It readily follows that the natural sufficient statistic in an exponential family is indeed sufficient.

Corollary 2 Suppose T ≡ t(X) is sufficient for Θ, and x1, x2 ∈ X are such that t(x1) = t(x2). Then, as functions of θ, p(x1 | θ) ∝ p(x2 | θ).

Thus Corollary 2 says that, whenever two possible outcomes of the same experiment yield the same value for some sufficient statistic, the likelihood functions (see Definition 2 of Chapter I) that they generate for Θ must be proportional.

Corollary 3 Suppose T ≡ t(X) is sufficient for Θ, with density q(t | θ) (with respect to some measure on T). Then, for any x ∈ X, as functions of θ, q{t(x) | θ} ∝ p(x | θ).

Proof. Examining the proof of Theorem 1, we see that q{t(x) | θ}/q{t(x) | θ₀} = a{t(x), θ} = p(x | θ)/p(x | θ₀). □

That is to say, we will always obtain proportional likelihood functions, whether we observe the full data X or a sufficient reduction T.

4 Minimal sufficiency

In general there are many non-equivalent sufficient statistics. Thus suppose Xi ∼ N(Θ, 1) (i = 1, 2, 3). Then the following statistics are sufficient: T₀ := X1 + X2 + X3; T₁ := (X1, X2 + X3); T₂ := (X1 + X2, X3); T₃ := the order statistic of (X1, X2, X3). Note that T₀ is a function of each of the others, and intuitively it does not seem possible to find a non-trivial function of T₀ that will be sufficient — i.e. T₀ is (apparently) a minimal sufficient statistic.
In general, for functions s, t on X, write s ≼ t if S := s(X) is a function of T := t(X); that is, t(x) = t(x′) ⇒ s(x) = s(x′). We also write S ≼ T.

Lemma 4 Suppose T is sufficient, and T ≼ S. Then S is sufficient.

Proof. For compatible values x, s, t of X, S, T, p(x | s, θ) = p(x | s, t, θ), which, being obtained by further conditioning p(x | t, θ) on S, can be chosen to be independent of the value θ of Θ. [Alternatively, using the factorisation criterion: if p(x | θ) = h(x) g{t(x), θ}, this can also be written as p(x | θ) = h(x) γ{s(x), θ}.] □

Intuitively, if all the usable information about Θ is contained in T, it will be retained if we add further data to T. But T constitutes the more effective reduction.

Definition 3 A statistic T is called minimal sufficient if it is sufficient and, for any sufficient statistic S, T ≼ S. □

If it exists, a minimal sufficient statistic (MSS) T is the most compact reduction


of the original observable X that does not discard any useful information about
Θ. It is essentially unique: if both T and T ′ are minimal sufficient, they are
related by an invertible transformation, and so embody the same information.
It is not obvious that a MSS need exist. However the following constructive
argument shows that it does.

Theorem 5 A minimal sufficient statistic T exists.³

Proof. For x, x′ ∈ X, write x ∼ x′ if the likelihood functions based on observing, respectively, X = x and X = x′, are proportional (as functions on Ω). That is, for some c > 0,

p(x | θ) = c p(x′ | θ)    (4)

for all θ ∈ Ω. Note that c will usually depend on x and x′: c = c(x, x′).
It is easy to see that this is an equivalence relation on X. Within each equivalence class, select one representative point, let T be the set of all such representative points, and define t : X → T by: t(x) is the representative of the equivalence class containing x. Then t(x) = t(x′) ⟺ x ∼ x′.
By Theorem 1, if S = s(X) is sufficient, and s(x) = s(x′), then x ∼ x′, so t(x) = t(x′). Hence T ≼ S. Moreover, since x ∼ t(x), p(x | θ) = c(x, t(x)) p{t(x) | θ}. Hence, again by Theorem 1, T is itself sufficient. □

In applications we would not normally use this particular form of minimal sufficient statistic: any statistic T ≡ t(X) such that t(x) = t(x′) if and only if x and x′ yield proportional likelihoods will serve.
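A minimal numerical sketch of this criterion (my own illustration, not from the notes): for an i.i.d. N(θ, 1) sample, the log-likelihood difference between two data sets is constant in θ exactly when their totals agree, so the sample total serves as a minimal sufficient statistic.

```python
import numpy as np

# l(x, theta) - l(x', theta) is constant in theta iff sum(x) == sum(x').
def log_lik(x, theta):
    # theta may be an array of candidate values; broadcast over it.
    return -0.5 * np.sum((x[:, None] - theta) ** 2, axis=0)

thetas = np.linspace(-2, 2, 5)
x1 = np.array([0.3, 1.1, -0.4])
x2 = np.array([2.0, -1.5, 0.5])   # same sum as x1 (= 1.0): equivalent
x3 = np.array([0.0, 0.0, 0.0])    # different sum: not equivalent

print(log_lik(x1, thetas) - log_lik(x2, thetas))  # constant across thetas
print(log_lik(x1, thetas) - log_lik(x3, thetas))  # varies with theta
```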

Corollary 6 In order to respect WSP in an experiment E^X, it is necessary and sufficient that we draw the identical inference from observations x and x′ ∈ X whenever x and x′ yield proportional likelihood functions for Θ (i.e., (4) holds).

³ Our argument is not entirely rigorous for a non-discrete sample space, since we do not fuss about measurability or sets of measure 0; but for all but pathological counter-examples these gaps can be filled by suitable technical elaborations.

5 Completeness

The following technical definition, while not easy to motivate intuitively, proves
to be of importance for a variety of theoretical developments.

Definition 4 Given a statistical model P = {Pθ : θ ∈ Ω}, a statistic T (taking values in T say) is called complete if it admits no non-trivial unbiased estimator of 0: that is, whenever h : T → R satisfies

Eθ{h(T)} ≡ 0 (all θ ∈ Ω)    (5)

then h(T) ≡ 0 (almost surely for any Pθ).
A slightly weaker property is bounded completeness, in which the above is only required under the additional condition that the function h be bounded. □

If T is [boundedly] complete and also sufficient for P, it is called (unsurprisingly) [boundedly] complete sufficient.

Example 4 Exponential model

Given Λ = λ ∈ R⁺, Xi ∼ E(λ), independent and identically distributed. Then T := Σᵢ₌₁ⁿ Xi is sufficient for Λ. The distribution of T, given Λ = λ, is Γ(n, λ), with density $\frac{\lambda^n}{\Gamma(n)} t^{n-1} e^{-\lambda t}$. To show that T is complete, let h : R⁺ → R be such that Eλ h(T) = 0 for all λ > 0. That is,

$$\int_0^\infty h(t)\, t^{n-1} e^{-\lambda t}\, dt = 0,$$

all λ > 0. Let h₊(t) := max{h(t), 0}, h₋(t) := max{−h(t), 0} be the positive and negative parts of h. Then

$$\int_0^\infty h_+(t)\, t^{n-1} e^{-\lambda t}\, dt = \int_0^\infty h_-(t)\, t^{n-1} e^{-\lambda t}\, dt,$$

all λ > 0. By the uniqueness of Laplace transforms, we must have h₊(t) tⁿ⁻¹ = h₋(t) tⁿ⁻¹ (almost everywhere), i.e. h(t) = 0 (almost everywhere). □

The above argument can be generalized, to show:

Theorem 7 If T is the natural sufficient statistic of a p-parameter exponential family, and the parameter space Ω contains an open set in Rᵖ, then T is complete.

Lemma 8 If T is boundedly complete sufficient, it is minimal sufficient.

Proof. Let S be minimal sufficient; it is thus a function of T. Consider a bounded real function h(T) of T, and define k(S) := Eθ{h(T) | S} (defined, as is possible by sufficiency of S, to be independent of the underlying parameter-value θ). Then, for any θ ∈ Ω, Eθ{h(T)} = Eθ[Eθ{h(T) | S}] = Eθ{k(S)}, so h(T) − k(S) is an unbiased estimator of 0 which is a (bounded) function of T. By bounded completeness, we deduce that h(T) = k(S), so that h(T) is a function of S. Since this holds for every bounded real function h(T), T is a function of S, and hence minimal sufficient. □

In particular, under the conditions of Theorem 7, the natural sufficient statistic of an exponential family will be minimal sufficient.

6 The likelihood principle

Here is another putative "Principle of Statistics":

LIKELIHOOD PRINCIPLE (LP)
Let E = (X, X, Ω, Θ, P), E′ = (X′, X′, Ω, Θ, P′) be two possibly different experiments governed by the same parameter Θ. Denote the sampling densities by p(· | θ) for E, and p′(· | θ) for E′.
Suppose that, for a specific pair of values x ∈ X and x′ ∈ X′, the corresponding likelihood functions are proportional:

p′(x′ | θ) ∝ p(x | θ)    (6)

where the constant of proportionality may depend on x and x′, but not on θ. Then we should make the same inference about Θ, whether we observe X = x in E, or X′ = x′ in E′. □

That is to say, a general inference method I respects LP if I(E, x) = I(E′, x′) whenever (6) holds.
By virtue of Corollary 6, WSP is equivalent to the Restricted Likelihood Principle (RLP), in which LP is only applied to the case E = E′, X ≡ X′. Similarly SSP is equivalent to the special case in which E = E^X and E′ = E^T, the marginal experiment of E^X for a sufficient statistic T. In particular, we see that

LP ⇒ SSP: Any inference method respecting the likelihood principle will automatically respect the strong (and hence also the weak) sufficiency principle.

How much more broadly should we regard LP as appropriate? Do you agree with its application in the following case?

Example 5 Suppose E specifies N ∼ P(t₀Θ), while E′ specifies T ∼ Γ(n₀, Θ) (t₀, n₀ fixed). Note that the sample spaces are quite different, being discrete for E and continuous for E′. The likelihood function based on observing N = n₀ in E is proportional to e^{−t₀θ} θ^{n₀}, as also is that based on observing T = t₀ in E′. Hence LP tells us to make identical inference about Θ in both these cases. □

6.1 Optional stopping

Suppose that variables X1, X2, . . ., with a joint distribution depending on Θ, arrive in sequence. At any time t, after having observed the values of Xᵗ := (X1, . . . , Xt), you can choose either to stop observing, or to continue. This decision may depend on the data-values observed so far, and may even involve some randomisation; but we assume that it is not further influenced by the value of Θ — such a stopping rule is termed non-informative.⁴ Let s(xᵗ) be the probability you will stop observing, after having observed Xᵗ = (x1, . . . , xt), and c(xᵗ) := 1 − s(xᵗ) the probability of continuing. Each choice for the function s specifies a different statistical experiment, Eˢ say, but all are governed by the same parameter Θ (and hence, necessarily, share the same value θ of Θ).
In Eˢ, the probability-density of observing exactly X1 = x1, . . . , Xt = xt, when Θ = θ, is

$$p_s(x^t \mid \theta) = c(\emptyset)\, p(x_1 \mid \theta)\, c(x^1)\, p(x_2 \mid x^1, \theta) \cdots c(x^{t-1})\, p(x_t \mid x^{t-1}, \theta)\, s(x^t) \propto \prod_{i=1}^t p(x_i \mid x^{i-1}, \theta) = p(x^t \mid \theta).$$

Since this holds for any non-informative stopping rule s, any inference method that respects LP must make the same inference about Θ from the observation Xᵗ = xᵗ, no matter which stopping rule gave rise to it.

Example 6 (See also Question 3(b) of Example Sheet 1.) Suppose that we are told that there were 9 heads in 12 tosses of a coin, but not which (non-informative) stopping rule was applied. It might have been: "Toss exactly 12 times" (a binomial experiment), or "Toss until the 3rd tail appears" (a negative binomial experiment). Or perhaps "Toss 10 times, and then stop after the first head". All of these different experiments have different sample spaces, and would support different unbiased estimators for Θ, so the method of unbiased estimation violates LP. In fact, pretty well every frequentist method violates LP in this setting. However, according to LP we should not care which experiment was actually performed, but make the identical inference in all these cases. □
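The proportionality in Example 6 is easy to check numerically. A minimal sketch (my own illustration; the two stopping rules are the binomial and negative binomial ones named above):

```python
import numpy as np
from scipy.special import comb

# 9 heads in 12 tosses, under two stopping rules.
thetas = np.linspace(0.05, 0.95, 7)

# "Toss exactly 12 times": Binomial(12, theta) at x = 9.
lik_binom = comb(12, 9) * thetas**9 * (1 - thetas)**3
# "Toss until the 3rd tail": last toss is the 3rd tail, preceded by
# 9 heads and 2 tails in some order among the first 11 tosses.
lik_negbin = comb(11, 9) * thetas**9 * (1 - thetas)**3

print(lik_binom / lik_negbin)   # constant (= 220/55 = 4) for every theta
```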

Example 5 can be regarded as a continuous-time variant of the above set-up, where we observe a Poisson process of rate Θ, either for a given time t₀, or until we have seen a given number n₀ of events.

⁴ For an example of informative stopping, consider an experimenter who tossed the coin 20 times, but chooses to reveal only the first (say) 14 outcomes, because that maximises the proportion of heads.

7 Ancillarity

A statistic S ≡ s(X) is called ancillary for Θ if the distribution of S given Θ = θ is the same for all θ ∈ Ω.

Example 7 A fair coin is tossed. If it lands H, X is generated from N(Θ, 10²); if T, then from N(Θ, 50²). The outcome of the coin-toss is ancillary. □

Example 8 X1, X2, . . . are independent with a joint distribution depending on Θ. An integer N with known distribution is generated, e.g. by a randomising device. If N = n, we observe X1, . . . , Xn. Then the sample size N is ancillary. □

Example 9 X1, X2 are independent and identically distributed with the Cauchy distribution C(Θ, 1). Then S := X2 − X1 is ancillary. □

Again, we can consider the experiment E generating the data X (probabilistically determined by the value of Θ), as comprising two stages: In the first stage, we conduct experiment E^S, generating S from its marginal distribution — which by ancillarity will not depend on the value of Θ. Then, having obtained S = s, we perform experiment E^{X|S=s}, generating X from its conditional distribution given S = s and Θ. This time we can argue that it is the first stage that is entirely uninformative about Θ, and thus that our inferences should be based entirely on the second stage, the conditional experiment.
This intuition is formalised in the

ANCILLARITY PRINCIPLE [AP]
If S ≡ s(X) is an ancillary statistic, then the inference to be drawn from observing X = x in E should be the same as that to be drawn from observing X = x in the conditional experiment E^{X|S=s(x)}. □

More precisely, we can say that a general inference method I respects AP if I(E, x) = I(E^{X|S=s(x)}, x), for any ancillary statistic S ≡ s(X) in E.
When S is ancillary, we have the density factorisation p(x | θ) = p(s) p(x | s, θ) (s = s(x)). In particular, p(x | θ) ∝ p(x | s, θ), so that we obtain essentially the same likelihood function, based on data X = x, whether we form it from the overall experiment E or the conditional experiment E^{X|S=s}. Consequently:

LP ⇒ AP: Any inference method respecting the likelihood principle (including Bayesian inference) will automatically respect the ancillarity principle.

What other ways are there of respecting AP?

Theorem 9 (Basu) Suppose that, in an experiment E, T is boundedly complete sufficient for Θ, and S is ancillary for Θ. Then S ⊥⊥ T | Θ.

Proof. Let k(S) be a bounded function of S, and g(T) := E{k(S) | T} (independent of the value θ of Θ, by sufficiency of T). Then Eθ{g(T)} = E{k(S)}, the latter being independent of θ by ancillarity. By bounded completeness, almost surely for any θ we have g(T) = E{k(S)}, i.e. Eθ{k(S) | T} = Eθ{k(S)}. Since k was arbitrary, the result follows. □
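Basu's theorem can be illustrated by simulation. A sketch (my own, using the standard textbook setting of an i.i.d. N(θ, 1) sample, where X̄ is boundedly complete sufficient and S = X₁ − X̄ is ancillary):

```python
import numpy as np

# By Basu's theorem, Xbar and S = X1 - Xbar should be independent.
rng = np.random.default_rng(2)
n, reps, theta = 5, 100_000, 1.3

x = rng.normal(theta, 1.0, size=(reps, n))
xbar = x.mean(axis=1)
s = x[:, 0] - xbar

print(np.corrcoef(xbar, s)[0, 1])   # ~0
# A stronger check than zero correlation: the distribution of S looks the
# same on either side of the median of Xbar.
lo, hi = s[xbar < np.median(xbar)], s[xbar >= np.median(xbar)]
print(lo.mean(), hi.mean(), lo.std(), hi.std())
```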

Corollary 10 Suppose that, in every experiment E under consideration, there exists a boundedly complete sufficient statistic; and let I be an inference method that, in any such E, depends only on the distribution and value of that statistic. Then I respects AP.

Unfortunately the hypothesis of Corollary 10 is very restrictive, so that it does not provide a useful way of ensuring AP is respected.
Another approach — dual to basing inference on a minimal sufficient statistic — is to look for a maximal ancillary statistic S*, with the property that S ≼ S* for any other ancillary statistic S. In the presence of such S*, we could operate reduction by ancillarity, which, starting from a given inference method I, creates a new method I* by applying I to the model conditioned on S*. Then I* would satisfy AP. However, in contrast to the situation with sufficiency, a maximal ancillary typically does not exist.

Example 10 Let (X, Y) be bivariate normal with means 0, variances 1, and unknown correlation Θ. Then each of X, Y is separately ancillary, so the only possible choice of a maximal ancillary would be (X, Y) — which is clearly not ancillary! □

8 Birnbaum’s theorem

We have seen that the likelihood principle implies both the (strong or weak) sufficiency principle and the ancillarity principle. Conversely, Birnbaum showed that the only general way to respect both WSP and AP is to respect LP. In fact we do not even need the full force of AP, but only the weaker:

CONDITIONALITY PRINCIPLE [CP]
Let E¹ = (X₁, X₁, Ω, Θ, P₁), E² = (X₂, X₂, Ω, Θ, P₂) be two experiments governed by the same parameter Θ, and let the compound experiment E be conducted by first flipping a fair coin, with outcomes labelled 1 and 2, and then doing Eⁱ if the outcome is i. Then the inference drawn from observing outcome (i, xᵢ) in the overall experiment E should be the same as that drawn from observing outcome xᵢ in the component experiment Eⁱ. □

More precisely, a general inference method I respects CP if I(E, (i, xᵢ)) = I(Eⁱ, xᵢ).

Theorem 11 (Birnbaum)

WSP & CP ⇒ LP

That is to say, if I is a general inference method respecting both WSP and CP, then it must respect LP.

Proof. Let pᵢ(· | θ) be the density for Xᵢ in Eⁱ when Θ = θ. Then in E the outcome (i, xᵢ) has probability/density p(i, xᵢ | θ) = ½ pᵢ(xᵢ | θ).
Suppose p₁(x₁ | θ) ∝ p₂(x₂ | θ). Then in E the outcomes (1, x₁) and (2, x₂) have proportional likelihoods, and so have the same value for the minimal sufficient statistic. Hence, applying WSP, I(E, (1, x₁)) = I(E, (2, x₂)). Also, by CP, I(E, (i, xᵢ)) = I(Eⁱ, xᵢ). We deduce I(E¹, x₁) = I(E², x₂); so I respects LP. □

This mathematically trivial theorem caused a great deal of consternation when first announced, since while most statisticians believed in WSP and CP, they could not accept LP. It is fair to say that its full force has still not been properly appreciated.
Chapter III

Frequentist Estimation Theory

The frequentist approach to inference is indirect: it focuses on constructing and evaluating inference procedures of various kinds, constructed before any data are gathered and intended to be applied to whatever data are subsequently observed. It replaces statistical inference from the observed data by probabilistic properties of procedures. These are assessed and compared by means of various more-or-less appealing criteria for the "goodness" of a procedure. Once a favoured procedure has been identified, it can be applied to the data observed to produce the relevant inference — in the hope that any good properties it may have as a procedure are somehow retained in its application to the specific data to hand.

1 Frequentist evaluation of estimators

We suppose specified a statistical experiment Eₙ = Eⁿ, with observable X ∈ Xⁿ.¹
One common task of statistical inference is to come up with a point estimate of the unknown parameter Θ, or some chosen function of it (the estimand), in the light of the observed data x. In addition, given that we do not expect to hit the nail exactly on the head, we would like some idea as to how close we might be. A purely likelihood-based approach might suggest using (say) the MLE and the curvature of the observed log-likelihood function there. The frequentist approach needs to take a step back, and consider what might be done ahead of gathering the data.

¹ Much of the description and analysis applies just as well to more general experiments.

Definition 1 Estimator

An estimator is a function θ̃ : Xⁿ → Ω. We also use the term to refer to the random variable Θ̃ := θ̃(X). □

Once we have decided to use this estimator, our estimate, for observed data x, is θ̃(x).
The estimator Θ̃ is a function of X, and so, for each possible value θ of Θ, has a distribution that can in principle be calculated from the distribution Pθⁿ of X. A frequentist bases her evaluation of the proposed estimator on this family of distributions for Θ̃ given θ. But to do this she needs to apply specific criteria for such an evaluation. There is plenty of scope for developing and analysing (and criticising) "reasonable" evaluation criteria.
Here is one possible criterion, for the case that Ω ⊆ R:²

² This readily extends to Ω ⊆ Rᵏ, or any other vector space.

Definition 2 Unbiasedness

The bias function of the estimator Θ̃ of Θ is the function b : Ω → R given by:

b(θ) = E(Θ̃ | θ) − θ.

Θ̃ is an unbiased estimator of Θ if its bias is identically 0, i.e., for any θ ∈ Ω,

E(Θ̃ | θ) = θ.

When we need to distinguish this bias function from that of other estimators, we might write b_Θ̃(·), or b̃(·), . . .
When unbiasedness holds, the strong law of large numbers implies that, if we were to apply the estimator Θ̃ repeatedly to independent repetitions of the experiment Eⁿ,³ all with the same parameter-value, then the overall average of our estimates would converge to the true parameter value, with probability 1, no matter what that true value might be. This is the "repeated sampling" interpretation of the unbiasedness criterion.

³ Note that there are two nested levels of repetition involved here.
While unbiasedness may seem like a natural and appealing property, it has its problems.

Example 1 Exponential model

We have density p(x | λ) = λe^{−λx} (x > 0). We might be interested in the expectation M = Λ⁻¹ of X. Consider the estimator M̃ of M taken to be the sample mean X̄ := n⁻¹ Σᵢ Xᵢ. The distribution of X̄, when Λ = λ, is Γ(n, nλ), with expectation n/(nλ) = µ. So M̃ is an unbiased estimator of M.
Suppose instead our estimand were Λ. It seems sensible to take Λ̂ := M̃⁻¹ as an estimator of Λ = M⁻¹. However, E(M̃⁻¹ | λ) = nλ/(n − 1) ≠ λ, so that this is not an unbiased estimator of Λ. (The modified transformed estimator ((n − 1)/n) Λ̂ would be unbiased for Λ.) □

Example 2 Likelihood principle

Consider two experiments: E = {X, X, Ω, Θ, P} and E′ = {X′, X′, Ω, Θ, P′}. In E, given Θ = θ, X ∼ B(10, θ), with

$$\mathrm{Prob}(X = x \mid \theta) = \frac{10!}{x!\,(10-x)!}\, \theta^x (1-\theta)^{10-x} \quad (x = 0, \ldots, 10).$$

In E′, given Θ = θ, X′ ∼ NB(7, θ), with

$$\mathrm{Prob}(X' = x' \mid \theta) = \frac{(x'+6)!}{6!\, x'!}\, \theta^{x'} (1-\theta)^7 \quad (x' = 0, 1, \ldots).$$

The (unique) unbiased estimator of Θ in E is T := X/10; while the (unique) unbiased estimator of Θ in E′ is T′ := X′/(X′ + 6).
Consider now the observations: X = 3 in E, and X′ = 3 in E′. Unbiased estimation yields different answers, respectively 3/10 and 3/9. But both observations yield essentially the same likelihood function, proportional to θ³(1 − θ)⁷ in each case. So the method of unbiased estimation does not satisfy the likelihood principle. □

A more basic criticism is that the property of unbiasedness says nothing about
how close the estimator might be to its estimand. In Example 1, one simple
unbiased estimator of M is just the first observation, X1 ; but this seems a poor
choice.
To address this issue we might proceed as follows.

Definition 3 Mean squared error

The mean squared error (MSE) function of an estimator Θ̃ of Θ is the function mse : Ω → R given by:

mse(θ) = E{(Θ̃ − θ)² | θ}. □
When Θ̃ is unbiased, mse(θ) is just the (sampling) variance v(θ) := var(Θ̃ | θ) of Θ̃. More generally we have

mse(θ) = v(θ) + b(θ)².

If we have two suggested estimators, Θ̃ and Θ*, of the same parameter Θ, such that $\widetilde{\mathrm{mse}}(\theta) \le \mathrm{mse}^*(\theta)$ for all θ, then we might choose to use Θ̃ in preference to Θ*.

Example 3 In Example 1, the variance function of M̃ = X̄ is ṽ(µ) = µ²/n. If instead we used M* = X₁, we would get v*(µ) = µ². Since both estimators are unbiased for M, these are also the mean squared error functions. Assuming n > 1, $\widetilde{\mathrm{mse}}(\mu) < \mathrm{mse}^*(\mu)$ for all µ, so on this criterion we would prefer the estimator M̃ to M*.
Consider now a third (biased) estimator of M: M† := nX̄/(n + 1). This has bias function b†(µ) = −µ/(n + 1), and variance function v†(µ) = nµ²/(n + 1)², hence mse†(µ) = µ²/(n + 1), which is everywhere less than $\widetilde{\mathrm{mse}}(\mu) = \mu^2/n$. So this biased estimator beats the more obvious unbiased estimator M̃ on the MSE criterion. □
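A Monte Carlo sketch of this comparison (my own illustration; the parameter values are arbitrary):

```python
import numpy as np

# Compare the MSE of three estimators of the exponential mean M = mu.
rng = np.random.default_rng(3)
mu, n, reps = 2.0, 10, 200_000

x = rng.exponential(mu, size=(reps, n))
xbar = x.mean(axis=1)

estimators = {
    "Xbar (unbiased)":         xbar,
    "X1 (unbiased, wasteful)": x[:, 0],
    "n*Xbar/(n+1) (biased)":   n * xbar / (n + 1),
}
for name, est in estimators.items():
    print(name, np.mean((est - mu) ** 2))
# Theory: mu^2/n = 0.4,  mu^2 = 4.0,  mu^2/(n+1) ~= 0.364
```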

Yet another (crazy??) estimator of M is M° ≡ 3, i.e. we ignore the data entirely and always estimate M as 3. Although heavily biased (b°(µ) = 3 − µ), it does at least have the advantage of small variance, v°(µ) = 0! And indeed it is NOT beaten by (say) M̃, since when µ = 3 we have mse° = 0, whereas $\widetilde{\mathrm{mse}} > 0$. Of course, for µ far from 3, mse°(µ) = b°(µ)² will be much larger than $\widetilde{\mathrm{mse}}(\mu)$. But since the inequality in mse(µ) goes in both directions as µ varies, the MSE criterion cannot select either of these estimators as being superior to the other — they are simply incomparable.
In an attempt to evade such incomparabilities it is common to restrict attention to unbiased estimators only, which can then be compared in terms of their variance functions — though this is pretty arbitrary, and has the undesirable side-effect of excluding good biased estimators such as M† above.

Definition 4 Minimum variance unbiased estimator

An estimator Θ̃ of Θ is a minimum variance unbiased estimator (MVUE) if:

• Θ̃ is unbiased for Θ
• for all θ ∈ Ω, ṽ(θ) ≤ v*(θ) for every other unbiased estimator Θ* of Θ.

Theorem 1 If a MVUE of Θ exists, it is unique.

Proof. We use the readily verified property

$$\mathrm{var}\left\{\tfrac{1}{2}(T_1 + T_2)\right\} + \mathrm{var}\left\{\tfrac{1}{2}(T_1 - T_2)\right\} = \tfrac{1}{2}\left\{\mathrm{var}(T_1) + \mathrm{var}(T_2)\right\}. \qquad (1)$$

Suppose T₁ and T₂ are both MVUEs of Θ. It immediately follows that they have the same variance function, v(θ) say. Also, ½(T₁ + T₂) is an unbiased estimator of Θ, so varθ{½(T₁ + T₂)} ≥ v(θ). Substituting in (1) we find varθ{½(T₁ − T₂)} ≤ 0, hence varθ(T₁ − T₂) = 0. Also, Eθ(T₁ − T₂) = θ − θ = 0. Thus T₁ = T₂ with probability 1 under any Pθ. □

2 Towards a variance bound

How do we identify a MVUE (if one exists), or check that a proposed estimator is in fact MVUE? One way is to show the existence of a lower bound $\underline{v}(\theta)$ for the variance function v(θ) of any unbiased estimator. Then if we find $v(\theta) = \underline{v}(\theta)$ we shall know that we have a MVUE.
We now develop some theory working towards identifying such a bound. We suppose a statistical experiment Eₙ = Eⁿ with Ω ⊆ R. We further suppose p(x | θ) > 0 on X, and introduce, for x ∈ Xⁿ, θ ∈ Ω:

lₙ(x, θ) := log p(x | θ).

In manipulating functions such as lₙ(x, θ) we shall be taking integrals with respect to x (using the dominating measure µⁿ over Xⁿ), and taking derivatives with respect to θ ∈ Ω (which we indicate with a dash: ′). We shall suppose that the functions treated are sufficiently regular to allow these operations, as well as interchange of the order in which they are applied.
The random variable Uₙ(θ) := lₙ′(X, θ) (both the form and the distribution of which depend on the value θ of the parameter) is called the (Fisher) score function or variable.

Lemma 2 Under conditions⁴ allowing the interchange of integration over X and differentiation on Ω,

Eθ{lₙ′(X, θ)} = 0.    (2)

Eθ{−lₙ″(X, θ)} = Eθ[{lₙ′(X, θ)}²].    (3)

Proof. Property (2) is readily proved by differentiating the normalisation property

$$\int_{X^n} p(x \mid \theta)\, d\mu(x) = 1 \qquad (4)$$

with respect to θ, and switching the order of integration and differentiation (which will be valid under suitable regularity conditions).
For (3), we have

$$E_\theta\{-l_n''(X, \theta)\} = E_\theta\left[-\frac{p''(X \mid \theta)}{p(X \mid \theta)} + \left\{\frac{p'(X \mid \theta)}{p(X \mid \theta)}\right\}^2\right].$$

The first term has expectation 0, by an argument similar to that used to show (2). The second term is the right-hand side of (3). □

⁴ The most important is that the range of x-values for which the integrand is positive should be the same for all θ ∈ Ω.
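A quick Monte Carlo check of (2) and (3) (my own sketch, using a single exponential observation as the model; the parameter value is arbitrary):

```python
import numpy as np

# For one exponential observation, p(x | theta) = theta * exp(-theta*x):
#   l(x, theta)  = log(theta) - theta*x
#   l'(x, theta) = 1/theta - x,    -l''(x, theta) = 1/theta^2.
rng = np.random.default_rng(4)
theta, reps = 1.7, 1_000_000

x = rng.exponential(1 / theta, size=reps)   # numpy parametrises by the mean
score = 1 / theta - x

print(score.mean())          # ~0, as in (2)
print((score ** 2).mean())   # ~1/theta^2 = E{-l''}, as in (3)
print(1 / theta ** 2)
```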

3 Fisher information

Definition 5 The (Fisher) information function for estimation of θ from X in the given model is:

Iₙ(θ) := Eθ{−lₙ″(X, θ)}.

From (2) and (3) we also have

Iₙ(θ) = varθ{lₙ′(X, θ)}.    (5)

Thus, under Pθ, the score variable Uₙ(θ) has expectation 0 and variance Iₙ(θ).

3.1 Repeated sampling

For this course we are assuming the Xi independent and identically distributed according to Pθ. Then, with

l(x, θ) := log p(x | θ),

we have

$$\log p(\mathbf{X} \mid \theta) = \sum_{i=1}^n \log p(X_i \mid \theta) = \sum_{i=1}^n l(X_i, \theta),$$

whence

$$U_n(\theta) = \sum_{i=1}^n u_i(\theta)$$

where uᵢ(θ) is the score variable based on the single outcome variable Xᵢ. Thus, under Pθ, Uₙ(θ) is a sum of n independent and identically distributed components, each with mean 0 and variance i(θ) := Eθ{−l″(X, θ)}, the Fisher information based on a single observation. In particular, in this independent and identically distributed case, Iₙ(θ) = n i(θ). By the central limit theorem, the asymptotic distribution under Pθ of the standardised score variable, Sₙ(θ) := {Iₙ(θ)}^{−1/2} Uₙ(θ), as n → ∞, is standard normal.
In particular, for large n we can test a hypothesis Θ = θ₀ by referring Sₙ(θ₀) to standard normal tables; essentially equivalently, we can refer Sₙ(θ₀)² (which is called the score statistic⁵) to tables of χ²₁. This yields the score test.
Similarly, we can form an approximate 95% (say) confidence interval for Θ as:

$$\{\theta : |\{I_n(\theta)\}^{-1/2} U_n(\theta)| \le 1.96\}. \qquad (6)$$

⁵ Not to be confused with the score variable!
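As a concrete sketch of the interval (6) (my own illustration, using i.i.d. Poisson(θ) data, for which Uₙ(θ) = (t − nθ)/θ with t = Σxᵢ, and Iₙ(θ) = n/θ, so the standardised score is (t − nθ)/√(nθ)):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
x = rng.poisson(4.0, size=50)     # illustrative data
n, t = len(x), x.sum()

std_score = lambda th: (t - n * th) / np.sqrt(n * th)

# The standardised score decreases in theta; find where it crosses +/-1.96.
lo = brentq(lambda th: std_score(th) - 1.96, 1e-6, t / n + 10)
hi = brentq(lambda th: std_score(th) + 1.96, 1e-6, t / n + 10)
print(lo, hi)    # approximate 95% score confidence interval for theta
```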

3.2 Change of variable

Transformation of data Suppose Y is a smooth invertible function of X: Y =


f (X). If we always use Lebesgue as the dominating measure, the densities
are related by: pY (y | θ) = pX (f −1 (y) | θ) γ(y), where γ(y) is the Jacobian
of the transformation f −1 at y. A similar relationship holds more generally,
for any choices of the dominating measures, with some adjustment factor
5
not to be confused with the score variable!
36 CHAPTER III. FREQUENTIST ESTIMATION THEORY

′ ′
γ(y) that does not depend on the parameter. Hence lY (Y, θ) = lX (X, θ):
i.e. the score random variable is the same, whether based on X or Y. In
particular, the Fisher information, which is the variance of the score, is
unaffected by a data transformation.
Transformation of parameter Now consider a smooth invertible parameter transformation: Φ = g(Θ). Then pΦ (x | φ) = pΘ (x | g⁻¹(φ)), whence ∂lΦ (x, φ)/∂φ = {g′(θ)}⁻¹ ∂lΘ (x, θ)/∂θ (where φ = g(θ)). We deduce the following relationships between the score variables and information functions before and after a parameter transformation:

UΦ (φ) = {g′(θ)}⁻¹ UΘ (θ)    (7)
IΦ (φ) = {g′(θ)}⁻² IΘ (θ).    (8)

These can be expressed more symmetrically and mnemonically as:

UΦ (φ) dφ = UΘ (θ) dθ    (9)
IΦ (φ) (dφ)² = IΘ (θ) (dθ)².    (10)
In particular, IΦ (φ)^{−1/2} UΦ (φ) = IΘ (θ)^{−1/2} UΘ (θ): the standardised score variable (and hence also the score statistic) is unaffected by an invertible transformation of either the data or the parameter.
3.3 Fisher information for the exponential family

Consider a 1-parameter EF, with natural parameter Φ and cumulant function k(φ). So
l(x, φ) = −k(φ) + φ t(x)
l′(x, φ) = −k′(φ) + t(x)
UΦ (φ) = l′(X, φ) = T − k′(φ)
−l′′(x, φ) = k′′(φ)
iΦ (φ) = Eφ {−l′′(X, φ)} = k′′(φ)
IΦ (φ) = n k′′(φ).

The general theory confirms our earlier results that Eφ (T ) = k′(φ), varφ (T ) = k′′(φ).
Consider now the mean-value parameter H = k′(Φ). Using (10) we find iH (η) = {k′′(φ)}⁻¹ (where η = k′(φ)), and so IH (η) = n{k′′(φ)}⁻¹.
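To see §3.3 in action, here is a quick numerical check — our own illustration, not from the notes — for the Poisson family, where φ = log λ and k(φ) = e^φ, so that iΦ (φ) = λ, while the mean-value parameter is H = λ with iH (η) = 1/η; we verify iH (η) = varη {l′(X, η)} by simulation:

    import numpy as np

    # Poisson family: natural parameter phi = log(lam), cumulant k(phi) = exp(phi).
    # Then i_Phi(phi) = k''(phi) = lam, and for the mean-value parameter
    # H = k'(Phi) = lam we should get i_H(eta) = {k''(phi)}^(-1) = 1/eta.

    rng = np.random.default_rng(1)
    eta = 3.0
    x = rng.poisson(lam=eta, size=200_000)

    score_eta = x / eta - 1.0          # d/d(eta) of log p(x | eta)
    print("empirical var of score:", score_eta.var())   # should be near 1/eta
    print("theoretical i_H(eta)  :", 1.0 / eta)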
4 The Cramér-Rao lower bound
Here—at last—is a reason to care about the Fisher information.

Theorem 3 (Cramér-Rao inequality) Let Θ e be an unbiased estimator of Θ.


Then (assuming it is legitimate to interchange integration over X and differenti-
ation on —see Footnote 4) its variance function ve(θ) satisfies:
For all θ ∈ ,
v (θ) ≥ I(θ)−1 .
e (11)
Proof. Recall the Cauchy-Schwarz inequality:

{cov(Y, Z)}² ≤ var(Y ) var(Z),    (12)
which holds for any variables Y, Z, having any joint distribution (so long as the
variances exist); moreover, we get equality in (12) if and only if there is a linear
relationship between Y and Z.
Fixing an arbitrary value of θ, we will apply (12) to the variables Y = Θ̃ and Z = U(θ), using the distribution Pθⁿ.
Using (2),

covθ {Θ̃, U(θ)} = Eθ {θ̃(X) ln′ (X, θ)}
= ∫_X θ̃(x) {p′(x | θ)/p(x | θ)} p(x | θ) dµ(x)
= ∫_X θ̃(x) p′(x | θ) dµ(x)
= (d/dθ) ∫_X θ̃(x) p(x | θ) dµ(x)
= (d/dθ) Eθ (Θ̃)
= 1

(where we have used the unbiasedness of Θ̃: Eθ {Θ̃} ≡ θ). Inserting this, and (5), into (12), the result follows. □

Corollary 4 Suppose that Θ e is an unbiased estimator of Θ whose variance func-


tion e −1
v satisfies ve(θ) = I(θ) . Then Θ e is the MVUE of Θ.
Example 4 Exponential model
Consider the exponential model parametrised by M = E(X | Λ) = Λ⁻¹, so that p(x | µ) = µ⁻¹ exp{−x/µ}, and so

l(x, µ) = −log µ − x/µ
−l′′(x, µ) = −µ⁻² + 2xµ⁻³
i(µ) = µ⁻².

Hence any unbiased estimator of M must have variance at least I(µ)⁻¹ = µ²/n. But var(X | µ) = µ², so the variance of the unbiased estimator X̄ of M is µ²/n. It follows that this is the MVUE of M. □
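A quick simulation — ours, purely illustrative — confirms that the variance of the sample mean matches the bound µ²/n:

    import numpy as np

    # Exponential model with mean mu: the Cramér-Rao bound for unbiased
    # estimation of M is mu^2/n, and the sample mean attains it exactly.

    rng = np.random.default_rng(2)
    mu, n, reps = 2.0, 50, 100_000

    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    print("var of sample mean:", xbar.var())   # Monte Carlo estimate
    print("CRLB mu^2/n       :", mu**2 / n)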
4.1 Achieving the bound

An estimator whose variance function achieves the Cramér-Rao lower bound is
necessarily a MVUE. However the converse is false in general, since the bound is
not necessarily achieved. When does this happen?
Suppose then that an unbiased estimator Θ̂ = θ̂(X) of some parameter Θ has variance function achieving the Cramér-Rao lower bound. The linearity condition following (12) states that equality in (11) holds if and only if there exist coefficients B and W, possibly depending on θ, such that (with probability 1 under Pθⁿ), ln′ (X, θ) = B + W Θ̂. Thus ln′ (x, θ) must have the form B(θ) + W(θ) θ̂(x), and so p(x | θ) = exp ln (x, θ) must have the form

p(x | θ) = exp{a(x) + b(θ) + w(θ) θ̂(x)}.
We thus see that, for equality in (11), we must be dealing with an exponential
family of distributions for X. In the independent and identically distributed case
we confine ourselves to, this in turn requires that the basic single-observation
model p(x | θ) be an EF.
Moreover, our estimator Θ̂ must be the natural statistic T. Since we are requiring that Θ̂ be an unbiased estimator of Θ, that in turn implies that Θ must be the mean-value parameter H of the EF. So we can achieve equality in (11) if and only if we are estimating the mean-value parameter in an exponential family; and to do so we must be using the natural sufficient statistic.
Example 5 Suppose X1 ∼ P(Θ), X2 ∼ P(Θ²), independently. Then

l(x, θ) ≃ −θ − θ² + (x1 + 2x2) log θ
U(θ) = l′(X, θ) = −1 − 2θ + (X1 + 2X2)/θ
−l′′(X, θ) = 2 + (X1 + 2X2)/θ².

This model forms an exponential family, with natural parameter Φ = log Θ, natural statistic T = X1 + 2X2, and mean-value parameter H = Θ + 2Θ².
Note that Eθ {U(θ)} = 0. Also IΘ (θ) = Eθ {−l′′(X, θ)} = 4 + θ⁻¹, so the Cramér-Rao lower bound for an unbiased estimator of Θ is θ/(1 + 4θ). The estimator Θ̃ := X1 is clearly unbiased for Θ, with variance function ṽ(θ) = θ. This is close to the CRLB when θ is small.
Now consider the parameter Λ = Θ². Using (8), or directly, we find the CRLB for unbiased estimation of Λ is IΛ (λ)⁻¹ = 4θ³/(1 + 4θ). The unbiased estimator X2 of Λ comes close to achieving this for large θ.
Finally, if we just happened to be interested in the mean-value parameter H = Θ + 2Θ², we would find the unbiased estimator T = X1 + 2X2 exactly achieves the variance bound IH (η)⁻¹ = θ + 4θ². □
5 Fisher information: multiparameter case
Suppose now the parameter Θ = (Θ1, . . . , Θp)ᵀ ∈ Θ ⊆ Rᵖ. For a function f : Θ → R, we define the first derivative, f ′(θ), of f as the (p × 1) vector of partial derivatives with respect to each element of θ:

f ′(θ) ≡ (∂f(θ)/∂θ1, . . . , ∂f(θ)/∂θp)ᵀ.
In particular, the score variable, Un (θ) = ln′ (X, θ), is now a (p × 1) random vector.
We define the second derivative, f ′′(θ), as the symmetric (p × p) matrix of mixed second partial derivatives of f with respect to two elements of θ:

f ′′(θ) ≡ (∂²f(θ)/∂θj ∂θk)_{j,k = 1, . . . , p}.

In particular, the information function In (θ) = −Eθ {ln′′ (X, θ)} is now a symmetric (p × p) matrix.
In parallel with the single-parameter results, we can show that the score vector Un (θ) has (under Pθ) expectation 0 and dispersion matrix In (θ). Letting i(θ)^{1/2} denote any (non-random) matrix square root of i(θ) (i.e. a non-random matrix A(θ) such that A(θ)A(θ)ᵀ = i(θ)), and In (θ)^{1/2} = n^{1/2} i(θ)^{1/2}, the asymptotic distribution of In (θ)^{−1/2} Un (θ) is that of p independent standard normal variables. In particular, the asymptotic distribution of Un (θ)ᵀ In (θ)⁻¹ Un (θ) (which is invariant under transformations of either data or parameter) is χ²p.
Now suppose Θ̃1 is an unbiased estimator of Θ1. Similarly to before, we find covθ {Θ̃1, Uj (θ)} = 1 for j = 1, and 0 otherwise. It follows that, for any non-random vector c (which may however depend on θ), we have

covθ {Θ̃1, cᵀ U(θ)} = c1.
Applying Cauchy-Schwarz, we get

varθ (Θ̃1) ≥ c1² / cᵀ I(θ) c.    (13)

The right-hand side of (13) is maximised when c ∝ I(θ)⁻¹ δ1, where δ1 is the (p × 1) vector whose first entry is 1 and all others are 0. The maximised value is then I¹¹(θ) = δ1ᵀ I(θ)⁻¹ δ1, the (1, 1) entry of I(θ)⁻¹.⁶ Hence any unbiased estimator of Θ1 must have variance at least I¹¹(θ).

⁶ Beware: This is not the same as the inverse of I11(θ), the (1, 1) entry of I(θ); that would be the C-R bound for the case that the values of all the other parameters were fixed and known. In fact I¹¹(θ) ≥ I11(θ)⁻¹: we lose precision by not knowing the other parameters.
6 Sufficiency and optimal estimation
If S is any statistic, and T is sufficient, then we can form the conditional ex-
pectation of S given T , S ∗ := E(S | T ), which will be a function of T and, by
sufficiency, will not depend on which value of θ is used in its construction. It is
thus another statistic.
The following result from STATISTICS IB enables us to improve (in MSE terms)
on any estimator that is not already a function of a sufficient statistic:
Theorem 5 (Rao-Blackwell) Let T be sufficient, and let S∗ := E(S | T ). Then:
(i). Eθ (S ∗) = Eθ (S).
(ii). varθ (S ∗ ) ≤ varθ (S), with equality if and only if S is a function of T (so that
S ∗ = S).
Proof. The following formulae hold for any variables S, T with any joint distribution:

E(S) = E{E(S | T )}    (14)
var(S) = E{var(S | T )} + var{E(S | T )}.    (15)
Apply these to the case in hand, under distribution Pθ , using Eθ (S | T ) = S ∗ . Then (i) follows
directly from (14). Also, since the first term on the right-hand side of (15) is non-negative, we
obtain (ii). For equality in (ii) we must have Eθ {varθ (S | T )} = 0, which implies varθ (S | T ) = 0,
i.e. S is fully determined by T. □
Corollary 6 If a MVUE of Θ exists, it is a function of any sufficient statistic.
It follows in particular that a MVUE (if it exists) must be a function of the minimal sufficient statistic. Conversely, if there is a unique unbiased estimator based on the minimal sufficient statistic, this must be the MVUE.
Example 6 In Example 5 we had X1 ∼ P(Θ), X2 ∼ P(Θ²), forming an exponential family with sufficient statistic T = X1 + 2X2.⁷ "Rao-Blackwellisation" of the unbiased estimator X2 of Θ² produces the improved (and in fact MVUE—see below) estimator k(T ) = E(X2 | T ), where, explicitly,

k(t) = Σx {(x − 1)! (t − 2x)!}⁻¹ / Σx {x! (t − 2x)!}⁻¹,

with the sums being over integer x between 0 and t/2 inclusive (the x = 0 term of the numerator vanishing, on the convention 1/(−1)! = 0). The MVUE of Θ is E(X1 | T ) = T − 2k(T ).
In neither case does the variance of the estimator based on T achieve the Cramér-Rao lower bound, since we are not estimating the mean-value parameter. □

⁷ Note that the marginal distribution of T is not Poisson.
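As an unofficial check — our own sketch, with made-up values — the following computes k(t) and compares it with a Monte Carlo estimate of E(X2 | T = t) (which, by sufficiency, does not depend on θ):

    import numpy as np
    from math import factorial

    # Rao-Blackwellised estimator k(t) = E(X2 | T = t) for X1 ~ P(theta),
    # X2 ~ P(theta^2), with T = X1 + 2*X2.

    def k(t):
        xs = range(t // 2 + 1)
        num = sum(1.0 / (factorial(x - 1) * factorial(t - 2 * x)) for x in xs if x >= 1)
        den = sum(1.0 / (factorial(x) * factorial(t - 2 * x)) for x in xs)
        return num / den

    rng = np.random.default_rng(3)
    theta = 1.5
    x1 = rng.poisson(theta, size=1_000_000)
    x2 = rng.poisson(theta**2, size=1_000_000)
    mask = (x1 + 2 * x2) == 6
    print("k(6)         =", k(6))
    print("MC E(X2|T=6) =", x2[mask].mean())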
6.1 Completeness

The following result follows easily from the definition of completeness (Definition 4 of Chapter II) and the Rao-Blackwell theorem.

Theorem 7 (Lehmann-Scheffé) Suppose T is a complete sufficient statistic. If an unbiased estimator of Θ exists, there is a unique such estimator based on T, and this is the MVUE.
Example 7 Exponential model
We showed in Example 4 that T := Σ_{i=1}^n Xi is complete sufficient. We also saw in Example 1 that (for n > 1) Λ∗ := (n − 1)/T is an unbiased estimator of Λ—which we now see depends only on T. By completeness, Λ∗ is the unique unbiased estimator of Λ based on T. If now Λ◦ is any unbiased estimator of Λ, then we must have E(Λ◦ | T ) = Λ∗. Then, by Rao-Blackwell, var(Λ◦) ≥ var(Λ∗), and so Λ∗ is the MVUE of Λ. □

Note that for n = 1 there is no unbiased estimator, and hence no MVUE, of Λ.
Example 8 Let X ∼ P(Λ), and consider the parameter Θ = e⁻²Λ. It is easily checked that the statistic T = (−1)^X is an unbiased estimator⁸ of Θ. Moreover X itself is complete, so T is the MVUE of Θ. As an estimator T thus has all the "desirable" frequentist properties we have so far introduced. Nevertheless we would probably never want to use the estimate it supplies, for any data x. □

⁸ Strictly speaking, T is not a proper estimator, since it takes values outside the parameter-space. If we exclude it from consideration, then there is no unbiased estimator of Θ.
Chapter IV
Asymptotics
Often we cannot analyse exactly the behaviour of some statistical procedure, but we can say something about its asymptotic behaviour as the sample size n → ∞—with luck, this will be a useful guide to what happens for finite n. In order for this to make sense, we must conceive of our procedure as being defined for arbitrary sample size—so we really have a sequence of experiments, (E n), and a sequence of procedures, one for each experiment. Thus we might e.g. consider the behaviour of an estimator sequence (Θ̃n) as n → ∞, where Θ̃n : X^n → Θ. Here we restrict our attention to such estimation problems.
There are in fact two possible set-ups for such asymptotics:
(i). We consider a sequence (E n) of experiments, governed by the same parameter Θ, E n having sample space X^n. There is no required relationship between values generated in different experiments.
(ii). We conceive of an infinite sequence of variables, (X1, X2, . . .), with a joint distribution governed by Θ (in our applications, the (Xi) are independent and identically distributed). Experiment E n consists in the observation of the initial subsequence (X1, . . . , Xn). In this case the actual value of (say) X1 would necessarily be the same in all the experiments.
For simplicity, unless otherwise stated we restrict attention to the case of a real parameter: Θ ⊆ R.
1 Forms of asymptotics
Since we are dealing with random variables, not numbers, we have to take care
over what we mean by asymptotic behaviour. In particular, we distinguish the
following ways in which a sequence (Zn) of real¹ random variables can "converge" to another real random variable Z:
Convergence in distribution
We say Zn tends to Z in distribution, and write Zn →d Z, if, for any bounded continuous function h : R → R, E{h(Zn)} → E{h(Z)}.
A useful equivalent condition relates to the distribution functions Fn (z) := Prob(Zn ≤ z), F (z) := Prob(Z ≤ z): we require Fn (z) → F (z) whenever Prob(Z = z) = 0.
Note that for this definition to be meaningful, we only need to specify the marginal distributions of the Zn and Z—they do not even need to be defined on the same probability space. In particular, there need be no convergence between the values of Zn and Z.
Weak convergence We say Zn tends to Z in probability, or weakly, and write Zn →p Z, if Zn − Z →d 0 (where 0 denotes the "random" variable which is always equal to 0).
An equivalent condition is: for each ε > 0, Prob(|Zn − Z| > ε) → 0.
For this definition to be meaningful, Zn and Z must be defined on the same probability space.
Strong convergence We say Zn tends to Z almost surely ("a.s."), or strongly, and write Zn →a.s. Z, if Prob(Zn → Z) = 1. This again requires all the variables to be defined on the same probability space.
It can be shown that strong convergence ⇒ weak convergence ⇒ convergence in distribution.
Lemma 1 (Slutsky) Suppose Yn →d Y and Zn →p c ∈ R. Let g : R × R → R be continuous. Then g(Yn, Zn) →d g(Y, c).
¹ Extensions to variables taking values in a vector space are straightforward.
2 Consistency
Definition 1 An estimator sequence (Tn) is (asymptotically) consistent for a parameter-function Φ = φ(Θ) if, for each θ ∈ Θ, Tn → φ(θ) under Pθ. We speak of weak [resp., strong] consistency if this convergence is weak [resp., strong] (the latter only making sense under the asymptotic set-up (ii)). □
If Tn is consistent for Φ, then, for a (bounded continuous) f on Φ, f(Tn) is consistent for f(Φ). Since² it is possible to find a consistent estimator sequence for the full parameter Θ—and hence for any parameter-function Φ—it would be unwise to use an inconsistent estimator sequence, since for large sample size this would not have the achievable property of being close to the estimand Φ.
Example 1 Let Φ = E(X | Θ) (assumed to exist), Tn = X̄n := n⁻¹ Σ_{i=1}^n Xi. The weak law of large numbers shows that, under Pθ, Tn → E(X | θ) in probability. Thus (Tn) is weakly consistent for Φ.
In the asymptotic set-up (ii), we can similarly call on the strong law of large numbers to show that (Tn) is strongly consistent for Φ.
Note that these results would apply equally if, in E n, we used (say) Tn = X̄n/2, so ignoring half the data. □
Example 2 Let Φ be the median of the distribution of X, so that Prob(X ≤ φ) = 1/2.³ We suppose that the distribution function F of X is continuous and strictly increasing, but is otherwise arbitrary. For simplicity suppose n odd, and let Tn = X̃n := X((n+1)/2) be the sample median. Now let Yi := F(Xi). Then the (Yi) are independent and identically distributed U[0, 1] variables, and so Ỹn ∼ β{(n + 1)/2, (n + 1)/2} (see Example Sheet 2, Question 9): in particular, Ỹn →p 1/2. But X̃n = F⁻¹(Ỹn), and so Tn →p F⁻¹(1/2) = Φ, i.e. Tn is weakly consistent for Φ. □
Example 3 Let Xi ∼ E(Λ), and consider Tn := n min{Xi : i = 1, . . . , n}. Then Tn ∼ E(Λ) for all n. In particular Tn is unbiased for M = 1/Λ for each n, but the sequence (Tn) is not consistent for M. □
² At any rate (for the independent and identically distributed case) whenever Θ is identified, in the sense that different values for Θ always give rise to different distributions for X.
³ For simplicity we omit explicit mention of the underlying parameter-value θ.
2.1 Consistency of the MLE
Under very broad conditions (essentially, in the independent and identically dis-
tributed case, that the parameter Θ be identified), the maximum likelihood esti-
mator is strongly consistent. The key property to note is:
Lemma 2 For all θ′ ≠ θ ∈ Θ,

Eθ {log p(X | θ′)} < Eθ {log p(X | θ)}.
Proof. Jensen’s inequality states that, for any convex function g, and any
random variable Y such that the expectations exist,
E{g(Y )} ≥ g{E(Y )};    (1)
For by convexity there exist a, b such that g(y) ≥ a + by, all y, with equality at
y = E(Y ). Now take expectations of both sides. If, moreover, g is strictly convex,
we will have equality in (1) if and only if Y is almost surely constant.
Apply (1) with g ≡ − log, and Y = p(X | θ′)/p(X | θ) if the denominator is
positive, else 0. Take expectations under Pθ . Since we assume the parameter Θ
is identifiable, the functions p(· | θ) and p(· | θ′) are not identically equal, so Y is
non-constant, so the inequality is strict. Thus
Eθ {log p(X | θ′) − log p(X | θ)} = Eθ {log(Y )} < log Eθ (Y ).

But, with Xθ := {x : p(x | θ) > 0},

Eθ (Y ) = ∫_{Xθ} {p(x | θ′)/p(x | θ)} p(x | θ) dx = ∫_{Xθ} p(x | θ′) dx ≤ ∫_X p(x | θ′) dx = 1.

□
Corollary 3 For any θ′ ≠ θ, with probability 1 under Pθ there exists N such that, for all n > N, ln (θ′) < ln (θ).

Proof. Since, for any θ′, ln (θ′) = Σ_{i=1}^n log p(Xi | θ′), and the summands are independent and identically distributed, the strong law of large numbers ensures that, with Pθ-probability 1,

n⁻¹ {ln (θ′) − ln (θ)} → Eθ {log p(X | θ′) − log p(X | θ)} < 0.

□
In particular, for any δ > 0, with probability 1 under Pθ there exists N such that,
for all n > N, both ln (θ − δ) < ln (θ) and ln (θ + δ) < ln (θ); and when this occurs
there must be a local maximum of the function ln (·) in the interval (θ − δ, θ + δ).
Corollary 4 For any θ ∈ Θ, there exists a local maximiser Θ̂n of the likelihood function such that

Θ̂n →a.s. θ under Pθ.    (2)

For cases (for example, in an EF) when we know that ln (θ) has just one maximum Θ̂n, Corollary 4 ensures that the sequence (Θ̂n) will be strongly consistent. Otherwise, this property will hold so long as we make a suitable choice of local maximum.⁴ In the sequel Θ̂n will refer to such a choice.
3 Asymptotic normality
We showed in § 3.1 of Chapter III that, when Θ = θ, as n → ∞, the distribution of the standardised score variable

Sn (θ) := {In (θ)}^{−1/2} Un (θ)    (3)

(which, we recall, does not depend on how the data or the parameter are expressed) is asymptotically N(0, 1). The score test of H0 : Θ = θ0 is executed by calculating the observed value of Sn (θ0)² = {In (θ0)}⁻¹ {Un (θ0)}² and referring this value to tables of the χ²₁ distribution.
From this, other asymptotically equivalent results can be derived from the property that, for large n, the log-likelihood function ln (θ) is approximately quadratic in a region of size O(n^{−1/2}) around the MLE θ̂n, and that values outside such a region are essentially ignorable. To start, we work at a non-rigorous level.
Let

L̂n := Ln (θ̂n) = sup_{θ∈Θ} Ln (θ)
L̄n (θ) := Ln (θ)/L̂n
l̄n := log L̄n.

(In particular, sup_Θ L̄n (θ) = 1, sup_Θ l̄n (θ) = 0.)
⁴ — though it may not be obvious which choice is "suitable", nor does Corollary 4 guarantee that the same choice will work for all θ. We shall ignore these difficulties!
Lemma 5 Suppose y = −½ k x². Then dy/dx = −kx, y = −½ k⁻¹ (dy/dx)².

Corollary 6 k^{−1/2} (dy/dx) = −k^{1/2} x = (−2y)^{1/2} × sgn(−x).
Now take x = θ − θ̂n, y = l̄n (θ) (so dy/dx = Un (θ)), and approximate l̄n (θ) by a quadratic y = −½ kn x², with kn ≈ −ln′′ (θ̃) for some θ̃ in the region where this approximation is adequate. In particular we can take kn to be any of:
(i). Jn := −ln′′ (θ)
(ii). ĵn := −ln′′ (Θ̂n)
(iii). In (θ)
(iv). în := In (Θ̂n)
where (iii) and (iv) are motivated by −ln′′ (θ)/In (θ) ≈ 1 for large n. Note that In depends only on the parameter, în and ĵn only on the data, while Jn depends on both.
We deduce the approximate equivalence of:
(a). kn^{−1/2} Un (θ)
(b). kn^{1/2} (Θ̂n − θ)
(c). {−2 l̄n (θ)}^{1/2} × sgn(Θ̂n − θ)

for suitable kn as above, and any of these quantities will be asymptotically N(0, 1). Asymptotically, inference about Θ can be based on any of these properties. A test of H0 : Θ = θ0 based on the approximate standard normal distribution of kn^{1/2} (Θ̂n − θ0) (or the approximate χ²₁ distribution of kn (Θ̂n − θ0)²) under H0 is called a Wald test: typically, though not invariably, this uses kn = In (θ0). The likelihood ratio, or Wilks, test of H0 refers −2 l̄n (θ0) to the χ²₁ distribution.
Example 4 Let X ∼ B(n; Θ). Then Θ̂n = X/n, Un (θ) = (X − nθ)/θ(1 − θ), Jn = X/θ² + (n − X)/(1 − θ)², In (θ) = n/θ(1 − θ), ĵn = în = n/Θ̂n (1 − Θ̂n). Under Pθ, we have an asymptotic standard normal distribution for In (θ)^{−1/2} Un (θ) = In (θ)^{1/2} (Θ̂n − θ) = (X − nθ)/√{nθ(1 − θ)}, and for ĵn^{1/2} (Θ̂n − θ) = în^{1/2} (Θ̂n − θ) = (X − nθ)/√{nΘ̂n (1 − Θ̂n)}. With data X = x, an approximate 95% confidence interval could be taken as θ̂n ± 1.96 √{θ̂n (1 − θ̂n)/n}. A test of Θ = θ0 could be based on referring any of (x − nθ0)²/nθ0 (1 − θ0), (x − nθ0)²/nθ̂n (1 − θ̂n), or −2 l̄n (θ0) = 2[x log(x/nθ0) + (n − x) log{(n − x)/n(1 − θ0)}] to tables of χ²₁.
With o (observed value) being, in turn, x and n − x (for each of the two categories), and e (expected value) correspondingly nθ0 and n(1 − θ0), we can write these test statistics as Σ (o − e)²/e, Σ (o − e)²/o, and 2 Σ o log(o/e). The first of these is "Pearson's chi-squared statistic", sometimes denoted by X²; the last is sometimes denoted by Y². □
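For concreteness, a small Python sketch — ours, with hypothetical data — computing the three test statistics of Example 4 and their χ²₁ p-values:

    import numpy as np
    from scipy.stats import chi2

    # Score (Pearson), Wald, and Wilks statistics for H0: Theta = theta0
    # in the binomial model X ~ B(n; Theta), as in Example 4.

    n, x, theta0 = 100, 62, 0.5
    theta_hat = x / n

    score = (x - n * theta0)**2 / (n * theta0 * (1 - theta0))
    wald  = (x - n * theta0)**2 / (n * theta_hat * (1 - theta_hat))
    wilks = 2 * (x * np.log(x / (n * theta0))
                 + (n - x) * np.log((n - x) / (n * (1 - theta0))))

    for name, stat in [("score", score), ("Wald", wald), ("Wilks", wilks)]:
        print(f"{name:5s}: stat = {stat:6.3f}, p = {chi2.sf(stat, df=1):.4f}")

All three are asymptotically equivalent, but their finite-sample values differ, as running this confirms.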
3.1 Distribution of MLE
Here we consider in more detail the asymptotic property (b): kn^{1/2} (Θ̂n − θ) ∼ N(0, 1), for the various choices for kn. We start with (iii).
Theorem 7 Suppose l(x, φ) is three times differentiable in φ ∈ Θ. Fix θ ∈ Θ, and suppose there exists a random variable Z on X, with finite expectation under Pθ, such that we have the locally uniform bound: |l′′′(X, φ)| ≤ Z for all φ in some neighbourhood N of θ. Then the distribution of In (θ)^{1/2} (Θ̂n − θ) under Pθ is asymptotically N(0, 1).
Proof. In the following all probability calculations are under Pθ.
By the mean-value theorem,

0 = Un (Θ̂n) = Un (θ) + (Θ̂n − θ) Un′ (Θ∗n)    (4)

for some Θ∗n between θ and Θ̂n. Thus

In (θ)^{1/2} (Θ̂n − θ) = Sn (θ) {−In (θ)⁻¹ Un′ (Θ∗n)}⁻¹.    (5)

Now Un′ (θ) = Σ_{i=1}^n ui′ (θ) is the sum of n independent and identically distributed components, with Eθ {−ui′ (θ)} = i(θ). Also, In (θ) = n i(θ). So, by the weak law of large numbers, −In (θ)⁻¹ Un′ (θ) →p 1.
Furthermore, with probability 1, Θ̂n → θ, so that eventually Θ∗n ∈ N, and then

|In (θ)⁻¹ {Un′ (Θ∗n) − Un′ (θ)}| = i(θ)⁻¹ |n⁻¹ Σ_{i=1}^n {l′′(Xi, Θ∗n) − l′′(Xi, θ)}|
≤ i(θ)⁻¹ n⁻¹ Σ_{i=1}^n |l′′(Xi, Θ∗n) − l′′(Xi, θ)|
≤ i(θ)⁻¹ |Θ∗n − θ| Z̄n,

where Z̄n := n⁻¹ Σ_{i=1}^n Zi →p E(Z | θ), by the weak law of large numbers, while |Θ∗n − θ| →p 0 by strong, and hence weak, consistency of Θ̂n. We deduce

−In (θ)⁻¹ Un′ (Θ∗n) →p 1.    (6)

The result now follows on applying Slutsky's lemma to (5), using Sn (θ) →d N(0, 1) and (6). □
The above result is often expressed as: Θ̂n →d N{θ, In (θ)⁻¹} (although the right-hand side itself varies with n). This can be read as: "Θ̂n is asymptotically normal and asymptotically unbiased, with asymptotic variance In (θ)⁻¹". Note that this asymptotic variance is the Cramér-Rao lower bound on the variance of any unbiased estimator. Thus Θ̂n is "asymptotically efficient".
For (iv), use Θ̂n →p θ to see that (assuming the function i(θ) is continuous) I(Θ̂n)/I(θ) = i(Θ̂n)/i(θ) →p 1. The result now follows by Slutsky's lemma. Similarly the result for (i) follows on using the weak law of large numbers to show −ln′′ (θ)/I(θ) →p 1. The argument for (ii) is similar.
There is both theoretical and experimental evidence that the data-dependent scalings (especially (ii)) yield closer approximations to the normal distribution than use of (iii). Moreover, as in Example 4, using the fully data-based form (ii), we can readily form an approximate 95% (say) confidence interval for Θ as θ̂n ± 1.96 ĵn^{−1/2} (alternatively, using (iv), we obtain θ̂n ± 1.96 în^{−1/2}—this form is more commonly seen, but may be less accurate). In contrast, direct use of (iii) or (i) would involve solving a typically non-linear equation for θ—and would generally give a poorer approximation to boot. (Note however that none of these approximate intervals transforms properly under a non-linear change of parameter.)
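In practice the data-based interval θ̂n ± 1.96 ĵn^{−1/2} is easy to automate. Below is a sketch of ours — the model and all names are illustrative, not from the notes — that locates the MLE numerically and approximates ĵn by a finite difference, for a Cauchy location model where the MLE has no closed form:

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Wald interval theta_hat +/- 1.96 * jhat^(-1/2), with jhat = -l''(theta_hat)
    # approximated by a central finite difference; Cauchy location model.

    rng = np.random.default_rng(4)
    x = 1.0 + rng.standard_cauchy(500)       # data with true location 1.0

    def loglik(theta):
        return -np.log1p((x - theta)**2).sum()

    res = minimize_scalar(lambda th: -loglik(th), bounds=(-10, 10), method="bounded")
    theta_hat = res.x

    h = 1e-4
    jhat = -(loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2

    half = 1.96 / np.sqrt(jhat)
    print("MLE %.4f, 95%% Wald interval (%.4f, %.4f)"
          % (theta_hat, theta_hat - half, theta_hat + half))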
4 Hypothesis tests
Suppose Θ is an interval of the real line, and we wish to test the null hypothesis

H0 : Θ = θ0

for some specified θ0 ∈ Θ, against the alternative H1 that leaves Θ totally unspecified.
Definition 2 The (maximum) likelihood ratio (MLR) statistic for testing H0 against H1 is⁵

Λ = Λ(θ0) := L̄n (θ0).    (7)

The Wilks statistic is W = W (θ0) := −2 log Λ(θ0) = −2 l̄n (θ0). □
The following important result now follows from the asymptotic equivalence,
demonstrated informally in § 3, of (c) and Sn (θ); a rigorous argument can be
developed along the same lines as Theorem 7.
⁵ You may sometimes see the definition given with numerator and denominator interchanged.
Theorem 8 (Wilks) Under Pθ0, the asymptotic distribution of W (θ0) is χ²₁.
Notes: Suppose independent and identically distributed sampling from a continuous model.
(i). Use of the approximate χ²₁ distribution function for W is typically accurate with a relative error of order O(n⁻¹).
(ii). The signed root likelihood ratio statistic ((c) of § 3) is r(θ0) := sgn(θ̂ − θ0) {W (θ0)}^{1/2}; this will be asymptotically standard normal under Pθ0. However this approximation is typically accurate only to order O(n^{−1/2}).
(iii). In fact Eθ0 {W (θ0)} has the form 1 + b(θ0)/n + O(n⁻²), where we can in principle calculate the function b(θ). If we replace W by W′ := W/{1 + b(θ)/n}—a move known as Bartlett adjustment—we find that not only the mean but the whole distribution is now χ²₁ with a relative error of only O(n⁻²). For small samples this can make a dramatic improvement.
It follows from Theorem 8 that we can calculate an approximate significance level, for testing H0 : Θ = θ0 against H1, as

SL(θ0) := Prob(χ²₁ ≥ −2 log λ),    (8)

where λ is the value of Λ for the observed data. One would reject H0 at level 5% (say) if SL(θ0) < 0.05, or equivalently −log λ > 1.92, where 1.92 = ½ × 3.84, 3.84 = (1.96)² being the upper 5% point of the χ²₁ distribution. This will be asymptotically equivalent to the score test (with which it shares the property of being invariant under invertible recodings of the observable X⁶ or the parameter Θ), and to Wald tests based on kn (Θ̂ − θ0)² (which are however not invariant). (Note that the score test is generally simpler to implement, since it only involves quantities defined locally to the null value θ0: in particular, it does not require calculation of the MLE Θ̂.)
Similarly we can construct a 95% (say) confidence interval for Θ as

I := {θ ∈ Θ : SL(θ) ≥ .05} = {θ ∈ Θ : l̄n (θ) ≥ −1.92}.    (9)

(Note that this inference method respects the Likelihood Principle.) It follows that, under any Pθ ∈ P, in large samples there will be an approximately 95% chance⁷ that this construction will produce an interval that contains θ.
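The interval (9) is easily computed by scanning the relative log-likelihood; a sketch of ours, for the binomial model of Example 4:

    import numpy as np

    # Likelihood-based 95% interval (9): {theta : lbar_n(theta) >= -1.92},
    # for X ~ B(n; Theta) with observed x.

    n, x = 100, 62
    theta_hat = x / n

    def lbar(theta):
        # log L(theta) - log L(theta_hat)
        return (x * np.log(theta / theta_hat)
                + (n - x) * np.log((1 - theta) / (1 - theta_hat)))

    grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
    inside = grid[lbar(grid) >= -1.92]
    print("Wilks 95%% interval: (%.4f, %.4f)" % (inside[0], inside[-1]))

Unlike the Wald interval, this construction is invariant under reparametrisation.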
4.1 Multiparameter extensions

The above results (and their proof techniques) extend in a pretty straightforward manner to a p-dimensional parameter space. We then have asymptotic equivalence of the random vectors An⁻¹ Un (θ) and Anᵀ (Θ̂n − θ), where An is a matrix
⁶ or, indeed, by replacement of X by a sufficient reduction.
⁷ Ideally one would like this to hold, simultaneously, for all θ ∈ Θ. This would require conditions and arguments allowing the extension of the relevant asymptotic sampling results (such as Theorem 7) to hold uniformly over Θ.
square root⁸ of Kn, i.e. An Anᵀ = Kn, the matrix Kn being any of In (θ), în, Jn or ĵn (with their obvious multiparameter definitions). Any of these quantities will have an asymptotic N(0, Ip) distribution under Pθ; in particular, asymptotically Un (θ) ∼ N(0, In (θ)) and (Θ̂n − θ) ∼ N(0, In (θ)⁻¹). The asymptotic distribution of Un (θ)ᵀ Kn⁻¹ Un (θ) and of (Θ̂n − θ)ᵀ Kn (Θ̂n − θ) will be χ²p, as will be that of the asymptotically equivalent Wilks statistic W (θ) := −2 l̄n (θ).
4.2 Composite null hypothesis
Suppose we wish to test a composite null hypothesis:

H0 : Θ ∈ M

where M is a smooth submanifold of the parameter space Θ. One approach involves generalising Definition 2 as follows:
Definition 3 The (maximum) likelihood ratio (MLR) statistic for testing H0 : Θ ∈ M against H1 : Θ ∈ Θ \ M is

Λ := sup_{θ∈M} L̄n (θ).    (10)

The Wilks statistic is again W := −2 log Λ. □
Theorem 9 (Wilks, general version) Let dim(Θ) = p, dim(M) = q, r = p − q. Then the asymptotic distribution of W, given Θ = θ0, is χ²r for any θ0 ∈ M°.
Proof. (This proof emphasises algebraic and probabilistic aspects. A fully rigorous account would require additional technical conditions and analysis, along the lines of Theorem 7 and its proof.)
In a neighbourhood of θ0, we can regard Θ as an open subset of Rᵖ, and parametrise M as {h(β) : β ∈ B}, where B is an open subset of Rq and h : B → Θ is (1, 1) and differentiable. Thus under M, we can take as our parameter B ∈ B, and then Θ = h(B). Let B = β0 ∈ B correspond to the specific null value Θ = θ0 ∈ M, i.e. θ0 = h(β0). Let U, I be the p × 1 score vector and p × p information matrix for Θ, evaluated at θ0; and let V, J be the q × 1 score vector and q × q information matrix for B, evaluated at β0. Also let H be the p × q matrix with (i, j) entry hij = ∂hi(β)/∂βj, again evaluated at β0. Then

V = Hᵀ U
J = Hᵀ I H.

Let ln⁰ denote the log-likelihood of the data under the null value Θ = θ0, or equivalently B = β0. Under the null we have asymptotic equivalence of −2{ln⁰ − sup_{θ∈Θ} ln (θ)} and Uᵀ I⁻¹ U; and, similarly, of −2{ln⁰ − sup_{θ∈M} ln (θ)} and Vᵀ J⁻¹ V. Hence W, their difference, is asymptotically equivalent to

W∗ := Uᵀ {I⁻¹ − H(Hᵀ I H)⁻¹ Hᵀ} U
= Zᵀ {Ip − K(Kᵀ K)⁻¹ Kᵀ} Z
= Zᵀ Π Z

say, where Z := A⁻¹ U, K := Aᵀ H, with A a matrix square root of I. We note that Π is an orthogonal projection matrix (Π = Πᵀ = Π²) of rank r. So we can write Π = QQᵀ for a p × r matrix Q with Qᵀ Q = Ir (take the columns of Q to be an orthonormal basis of the range of Π). Then W∗ = Yᵀ Y with Y := Qᵀ Z. But asymptotically Z ∼ N(0, Ip), whence Y ∼ N(0, Qᵀ Ip Q) = N(0, Ir). The result follows. □
Note particularly that the asymptotic distribution of W is the same, for any
value of Θ satisfying the null model. We can test H0 against H1 by referring the
observed value w of W to tables of the χ²r distribution.
Although we have proved Wilks’s theorem only for the independent and identi-
cally distributed case, it holds much more generally.
Example 5 Generalised linear model Let P = {Pφ : φ ∈ Φ} be an exponential family with its natural parameter. The "saturated" model has observable X = (X1, . . . , Xn) and parameter Φ = (Φ1, . . . , Φn) ∈ Φⁿ, with Xi ∼ PΦi, independently (i = 1, . . . , n). We consider a GLM M of the form M : Φ = An Θ (Θ ∈ Θ ⊆ Rᵖ), for specified n × p design matrix An; and a hypothesis (sub-model of M), H : C Θ = 0 (C being r × p).
Standard statistical software will output, for any data x and any GLM M, the deviance D(M) := 2{sup_{φ∈Φⁿ} l(φ) − sup_{φ∈M} l(φ)}. Then the Wilks statistic for testing H within M has value w = D(H) − D(M). Assuming An and C are of full rank (and An is "well-behaved" as n → ∞), an approximate test of H within M is conducted by referring w to tables of χ²r. □
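For example — our own sketch, assuming the Python statsmodels package is available, with simulated data — one can fit nested Poisson GLMs and refer the deviance difference to χ²r:

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    # Wilks test of a sub-model H within a GLM M via the deviance difference.
    # Illustrative Poisson regression; H drops the second covariate (r = 1).

    rng = np.random.default_rng(5)
    n = 200
    z1, z2 = rng.normal(size=n), rng.normal(size=n)
    y = rng.poisson(np.exp(0.3 + 0.5 * z1))          # data generated under H

    X_M = sm.add_constant(np.column_stack([z1, z2])) # model M: intercept, z1, z2
    X_H = sm.add_constant(z1)                        # sub-model H: intercept, z1

    fit_M = sm.GLM(y, X_M, family=sm.families.Poisson()).fit()
    fit_H = sm.GLM(y, X_H, family=sm.families.Poisson()).fit()

    w = fit_H.deviance - fit_M.deviance              # Wilks statistic
    print("w = %.3f, p = %.4f" % (w, chi2.sf(w, df=1)))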
4.3 Profile likelihood
Let φ : Θ → Φ, and Φ := φ(Θ).

Definition 4 The profile likelihood function LΦ : Φ → R is given by

LΦ (φ) = sup{L(θ) : θ ∈ Θ, φ(θ) = φ}.    (11)    □
Clearly the MLE φ̂ := φ(θ̂) of Φ is the maximiser of LΦ, with LΦ (φ̂) = sup_{φ∈Φ} LΦ (φ) = L̂, and the scaled profile likelihood L̄Φ is obtained by applying (11) directly to L̄.
Note that −2 l̄Φ (φ) is the Wilks statistic for testing Φ = φ. In particular, when dim Φ = 1, we typically have, for any φ ∈ Φ, dim{θ : φ(θ) = φ} = p − 1. Then, by Theorem 9, l̄Φ (φ) ≥ −1.92 with asymptotic probability 95%, under Pθ, for any θ with φ(θ) = φ; and so the likelihood-based interval {φ : l̄Φ (φ) ≥ −1.92} is asymptotically a 95% confidence interval for Φ.
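As a numerical illustration — ours, not from the notes: for independent and identically distributed N(M, V) data with both parameters unknown, profiling out V gives l̄Φ for Φ = M, and we can scan for {µ : l̄Φ(µ) ≥ −1.92}:

    import numpy as np

    # Profile likelihood for the mean M of N(M, V) data, V profiled out:
    # lbar(mu) = -(n/2) * log( mean((x-mu)^2) / mean((x-xbar)^2) ).

    rng = np.random.default_rng(6)
    x = rng.normal(loc=1.0, scale=2.0, size=40)
    n, xbar = len(x), x.mean()

    grid = np.linspace(xbar - 3, xbar + 3, 20_000)
    s2 = ((x[:, None] - grid[None, :])**2).mean(axis=0)
    lbar = -0.5 * n * np.log(s2 / ((x - xbar)**2).mean())

    inside = grid[lbar >= -1.92]
    print("profile 95%% interval for M: (%.4f, %.4f)" % (inside[0], inside[-1]))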
Chapter V
Nonparametric methods
1 Nonparametric estimation of a distribution on R

In nonparametric analysis the family P of possible distributions for the observable X is no longer describable by a discrete or finite-dimensional parameter. Here we will restrict ourselves to the case X = X^n := (X1, X2, . . . , Xn), where each Xi
is real-valued, and P is the family of all distributions under which the (Xi ) are
independent and identically distributed, from some arbitrary distribution P on
R. We observe X^n = x^n := (x1, . . . , xn), and wish to make inferences about P.
We can describe P in various ways, especially by means of its (cumulative) dis-
tribution function (cdf ) F : R → [0, 1], where:
F (x) = P (−∞, x] = Prob(X ≤ x), where X ∼ P.
A cdf F has the following properties:
(i). F is non-decreasing
(ii). F (−∞) = 0, F (+∞) = 1
(iii). F is right-continuous: F (x) = F (x+) := lim_{y↓x} F (y).
Conversely, any function with these properties can serve as a cdf. We then have
F (x−) := limy↑x F (y) = P (−∞, x) = Prob(X < x).
Definition 1 The empirical distribution based on data x^n is the discrete distribution

P̂n = P̂n (x^n) := n⁻¹ Σ_{i=1}^n δ_{xi}    (1)

where δx attaches probability 1 to the point x. □
Thus P̂n (x^n) attaches probability mass 1/n to each observed data-point (with suitable accounting for any repetitions). Before we have conducted the experiment, we have random X^n, and correspondingly random P̂n = n⁻¹ Σ_{i=1}^n δ_{Xi}.
The cdf of P̂n is the (random) empirical distribution function F̂n (·), given by

F̂n (x) = n⁻¹ #{i ≤ n : Xi ≤ x}.    (2)
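Computationally, F̂n is a right-continuous step function; a minimal implementation of ours is:

    import numpy as np

    # Empirical distribution function (2): Fhat_n(x) = #{i : X_i <= x} / n,
    # evaluated by binary search on the sorted sample (right-continuous).

    def ecdf(sample):
        xs = np.sort(sample)
        n = len(xs)
        return lambda t: np.searchsorted(xs, t, side="right") / n

    x = np.array([3.1, 0.2, 1.7, 0.2, 2.9])
    F = ecdf(x)
    print(F(0.2), F(1.0), F(10.0))   # 0.4, 0.4, 1.0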
Intuitively, for large n P̂n (resp., F̂n) should approximate P (resp., F). Here we shall try and refine this intuition, and assess the uncertainty involved.
The following consistency result is an immediate consequence of the strong law of large numbers:

Lemma 1 For any measurable set A ⊆ R, with probability 1,

P̂n (A) → P (A) as n → ∞.    (3)

Corollary 2 For any x ∈ R, with probability 1,

F̂n (x) → F (x)    (4)
F̂n (x−) → F (x−).    (5)
2 Uniformly consistent estimation of the distribution function

The limiting behaviour in (3) does not hold uniformly across all measurable A. Indeed we typically have, with probability 1,

lim inf_{n→∞} sup_A |P̂n (A) − P (A)| > 0.

Thus suppose that the discrete part (if any) of P has total mass d < 1. Taking An to be the (discrete) support of P̂n, for all n we will have P̂n (An) = 1, P (An) ≤ d, so that sup_A |P̂n (A) − P (A)| ≥ 1 − d > 0.
If however we restrict A to some subcollection A of sets, we might be able to obtain uniformity. The Glivenko-Cantelli theorem shows this for the case that A consists of all semi-infinite intervals.

Theorem 3 (Glivenko-Cantelli) Let X1, X2, . . . be independent and identically distributed real random variables, with distribution function F. Let F̂n be the empirical distribution function of {X1, . . . , Xn}. Then with probability 1,

sup_{t∈R} |F̂n (t) − F (t)| → 0.
Proof. Let ε > 0. There exists a finite sequence −∞ = t0 < t1 < . . . < tk = ∞ such that F (ti−) − F (ti−1) = Prob(ti−1 < X < ti) ≤ ε for all i. Then for ti−1 < t < ti,

F̂n (t) − F (t) ≤ F̂n (ti−) − F (ti−1)
≤ F̂n (ti−) − F (ti−) + ε
≤ max_i {F̂n (ti−) − F (ti−)} + ε.

Similarly

F (t) − F̂n (t) ≤ max_i {F (ti−1) − F̂n (ti−1)} + ε.

Hence for all t ∈ R,

|F̂n (t) − F (t)| ≤ max_i max{|F̂n (ti) − F (ti)|, |F̂n (ti−) − F (ti−)|} + ε

(this holding trivially if t is one of the {ti}). But by Corollary 2, for each i, Prob(|F̂n (ti) − F (ti)| → 0) = 1 and Prob(|F̂n (ti−) − F (ti−)| → 0) = 1. Since the conjunction of a finite number of almost sure events is almost sure, with probability 1

max_i max{|F̂n (ti) − F (ti)|, |F̂n (ti−) − F (ti−)|} → 0,

whence lim sup_{n→∞} sup_{t∈R} |F̂n (t) − F (t)| ≤ ε with probability 1. Since ε > 0 was arbitrary, the result follows. □
Corollary 4 With probability 1, sup_{t∈R} |F̂n (t−) − F (t−)| → 0.
3 Inference for the distribution function
Suppose X has continuous real cdf F, and consider the variable U := F (X).

Lemma 5 U ∼ U(0, 1).
Proof. Let u ∈ (0, 1). Since F is continuous and non-decreasing, A := {y : F (y) ≥ u} is a closed semi-infinite interval [x, ∞) with F (x) = u, and F (y) < u ⇔ y < x. So Prob(U < u) = Prob{F (X) < u} = Prob(X < x) = F (x−) = F (x) = u. □
3.1 Hypothesis test
In particular, suppose we wish to test the hypothesis H0 that the (Xi ) are in-
dependent and identically distributed with specified continuous cdf F . This is
equivalent to the hypothesis H0∗ that, with Ui := F (Xi ), the (Ui ) are independent
and identically distributed U(0, 1), i.e. have cdf F ∗ (u) ≡ u (0 < u < 1). This
reduces the problem of testing a quite general continuous F to the special case
F = F ∗.
4 Kolmogorov-Smirnov test
Any reasonable test of uniformity of the Ui will be based on their empirical distribution function F̂n∗, which from Theorem 3 we know will, almost surely, converge uniformly to the uniform cdf F∗ when the (Ui) are indeed uniformly distributed.

Definition 2 The Kolmogorov-Smirnov statistic for testing uniformity is

Kn := sup{|dn (u)| : 0 < u < 1},    (6)

where dn (u) := F̂n∗ (u) − u. □
It is readily seen that (for continuous F) dn (u) = F̂n (x) − F (x) with u = F (x), and so

Kn = sup_{x∈R} |F̂n (x) − F (x)|.    (7)
This is the form of the Kolmogorov-Smirnov statistic for testing the null hypothesis H0 that the common distribution function of the (Xi) is F. The distribution of Kn when H0 holds will not depend on F,¹ though it will depend on n.
Tables of the distribution of Kn are available: see e.g. Birnbaum, Z. W. (1952), Journal of the American Statistical Association, Vol. 47, pp. 425–441. For n > 35 the asymptotic distribution (see § 4.1 below) can be used.
It simplifies computation of Kn to note that the supremum in (7) must be attained at a data-point. To perform a size-α test of H0, we look up dα(n), the 1 − α quantile of the distribution of Kn, and reject if the observed value of Kn exceeds dα(n). We can also form a level-γ "confidence band" Fγ, consisting of all those continuous distribution functions F which would not be rejected by the test of size α = 1 − γ. That is, F ∈ Fγ if, for all x, F (x) lies within the constant-width band F̂n (x) ± d1−γ(n) about the empirical distribution function.²
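Since the supremum in (7) is attained at a data-point, Kn can be computed exactly from the order statistics; the sketch below (ours) does this, cross-checking against scipy.stats.kstest:

    import numpy as np
    from scipy.stats import kstest, norm

    # Exact K-S statistic (7): the supremum is attained at a data point, so
    # K_n = max_i max( i/n - u_(i), u_(i) - (i-1)/n ), with u_(i) = F(x_(i)).

    def ks_stat(x, cdf):
        n = len(x)
        u = cdf(np.sort(x))
        i = np.arange(1, n + 1)
        return np.max(np.maximum(i / n - u, u - (i - 1) / n))

    rng = np.random.default_rng(7)
    x = rng.normal(size=50)

    print("our K_n   :", ks_stat(x, norm.cdf))
    print("scipy K_n :", kstest(x, norm.cdf).statistic)   # should agree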
4.1 Asymptotics

For any fixed v ∈ (0, 1), nF̂n∗ (v) = #{i ≤ n : Ui ≤ v} has the binomial distribution B(n; v) with mean nv and variance nv(1 − v). By the central limit theorem, n^{1/2} dn (v) →d N{0, v(1 − v)}. Similarly, for 0 < v < u < 1, the distribution of n{F̂n∗ (v), F̂n∗ (u) − F̂n∗ (v), 1 − F̂n∗ (u)} is trinomial with probabilities (v, u − v, 1 − u), whence we find cov{n^{1/2} dn (v), n^{1/2} dn (u)} = v(1 − u).
It can be shown that, considered as random functions from (0, 1) to R, the distribution of n^{1/2} dn (·) converges (in a suitable sense) to that of the 0-mean Gaussian process B(·) determined by the above asymptotic covariance structure: cov{B(v), B(u)} = v(1 − u) (v ≤ u) (this is known as the Brownian bridge); and that the limiting distribution of the functional n^{1/2} Kn = sup{n^{1/2} |dn (u)| : 0 < u < 1} of n^{1/2} dn (·) is given by the distribution of the same functional of the limiting process B(·), viz. K := sup{|B(u)| : 0 < u < 1}. Derivation of this distribution (the Kolmogorov distribution) is beyond our scope, so we merely assert: for x > 0,

Prob(K ≤ x) = 1 − 2 Σ_{i=1}^∞ (−1)^{i−1} e^{−2i²x²} = (√(2π)/x) Σ_{i=1}^∞ e^{−(2i−1)²π²/(8x²)}.    (8)
This function is tabulated in G. R. Shorack and J. A. Wellner, Empirical Processes: With Applications to Statistics (Wiley, 1986), p. 143.

¹ Caution: This works only when F is fully specified. It would fail if for example we wanted to test that the distribution is normal but with unspecified mean and variance, and used for F the normal distribution function with sample-based estimates of its parameters.
² If d1−γ(n) is small enough this band will be empty!
For large n the size-α Kolmogorov-Smirnov test will reject H0 for Kn > n^{−1/2} Kα, where Kα is chosen so that Prob(K > Kα) = α when K has distribution (8). It follows from the Glivenko-Cantelli theorem that the Kolmogorov-Smirnov test is consistent, i.e. if F is not the true distribution function then the probability of rejection will tend to 1 as n → ∞.
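For reference, a small routine of ours that evaluates the series (8) and solves Prob(K > Kα) = α numerically:

    import numpy as np
    from scipy.optimize import brentq

    # Kolmogorov distribution (8): P(K <= x) = 1 - 2 sum_{i>=1} (-1)^(i-1) exp(-2 i^2 x^2).

    def kolmogorov_cdf(x, terms=100):
        i = np.arange(1, terms + 1)
        return 1 - 2 * np.sum((-1.0)**(i - 1) * np.exp(-2 * i**2 * x**2))

    def k_alpha(alpha):
        # Upper alpha point: solve P(K > k) = alpha.
        return brentq(lambda k: 1 - kolmogorov_cdf(k) - alpha, 0.2, 3.0)

    print("K_0.05 =", k_alpha(0.05))   # approx. 1.358, the familiar 5% K-S constant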
4.2 Two-sample test

Suppose now that (X1, . . . , Xn) are independent and identically distributed with continuous distribution function FX, independently of (Y1, . . . , Ym), which are independent and identically distributed with continuous distribution function FY. We wish to test the null hypothesis H0 : FX ≡ FY, the common value, F say, being unspecified.
Let F̂X, F̂Y be the empirical distribution functions of the X's and Y's, respectively. The two-sample Kolmogorov-Smirnov statistic is

Dn,m := sup_{x∈R} |F̂X (x) − F̂Y (x)|.    (9)

If we replace Xi by F (Xi) and Yj by F (Yj),³ the problem is transformed to one in which (under H0) F is replaced by the uniform U(0, 1) cdf, while the value of Dn,m is unaltered. Consequently the null distribution of Dn,m is the same, no matter what may be the common distribution function F.
Take n = Np, m = Nq (p ∈ (0, 1), q = 1 − p) and let N → ∞. Then N^{1/2} p^{1/2} {F̂X (·) − FX (·)} →d BX (·), and N^{1/2} q^{1/2} {F̂Y (·) − FY (·)} →d BY (·), where BX, BY are independent Brownian bridge processes. So, under H0, N^{1/2} {F̂X (·) − F̂Y (·)} →d p^{−1/2} BX (·) − q^{−1/2} BY (·). Consideration of the covariance function of this Gaussian limiting process shows it to be of the form (p⁻¹ + q⁻¹)^{1/2} B∗ (·), where B∗ is a Brownian bridge. We deduce that the asymptotic null distribution of D̃n,m := {nm/(n + m)}^{1/2} Dn,m is the Kolmogorov distribution (8). Hence we can conduct an asymptotically size-α test of the (composite, nonparametric) null hypothesis H0 : FX ≡ FY by rejecting H0 when D̃n,m > Kα. Again, this test is consistent, i.e. rejects with probability tending to 1 if FX ≢ FY.

³ Note that we cannot actually compute these values in absence of knowledge of F.
Chapter VI
The bootstrap
1 Pivotal inference
Definition 1 A pivotal quantity, or pivot, in an experiment E = (X, X, Θ, Θ, P) is a function of X and Θ that is distributed independently of Θ. □

Another common synonym for pivot is root.
Example 1 Suppose Xi ∼ N(M, v), independent and identically distributed, i = 1, . . . , n, where the variance v is known. Let X̄ be the sample mean. Then the quantity K := n^{1/2}(X̄ − M) is a pivot¹ having known distribution N(0, v), given any value for M. □

Example 2 Now suppose Xi ∼ N(M, V), independent and identically distributed, i = 1, . . . , n, with both parameters unknown. Let S² be the usual unbiased estimator of V. Then a pivot is T := (X̄ − M)/(S/√n), which has a Student tn−1 distribution (and is asymptotically standard normal). □
Pivots are useful for constructing confidence intervals and hypothesis tests. Thus in the known variance case, whatever be the value µ of M, the probability is 95% that K lies in the interval ±1.96 v^{1/2}, and consequently this is also the probability that the interval X̄ ± 1.96 (v/n)^{1/2} contains µ. With both parameters unknown, the familiar t-test of H0 : M = µ0 is based on the fact that, if H0 is true, the pivot T is equal to (X̄ − µ0)/(S/√n), which can be calculated from the data; and if that value does not appear consistent with the tn−1-distribution the hypothesis is put in doubt. Similarly, we can base confidence intervals for M on the pivot T.

¹ The scaling by n^{1/2}, while not essential for immediate purposes, yields a distribution that is insensitive to sample size. For asymptotic analysis it is helpful to consider pivots with non-trivial limiting distributions.
2 The parametric bootstrap
For the normal case, the actual sampling distributions of the above pivots are available from general theory, and readily manipulated. Let us suppose for a moment that one or other of these conditions were false. What could we do? For simplicity we just consider the normal location model of Example 1.
Suppose we have data (x1, . . . , xn). We start by estimating the underlying data distribution: an obvious estimate is N(x̄, v). Although this of course differs from the true distribution N(M, v), it belongs to the same normal location family. Consequently, both these distributions will have the identical distribution, D say, for the pivot K. So consider a new sample, (X1∗, . . . , Xn∗), the same size n as the original data-set, but now generated from this estimated distribution. Then K∗ := n^{1/2}(X̄∗ − x̄) has the same distribution as K = n^{1/2}(X̄ − M). Even if we couldn't actually compute this distribution, we can readily approximate it by simulation. From the estimated distribution N(x̄, v) we generate many random samples ("resamples") of size n, and for each compute the associated value k∗ of the pivot K∗: thus for a specific resample (x1∗, . . . , xn∗), we obtain k∗ = n^{1/2}(x̄∗ − x̄). If we do this for M independent resamples, we will obtain a random sample of size M from the distribution D of K∗—which is the same as that of K. Let D∗ denote the empirical distribution of these values, i.e. the discrete distribution putting probability 1/M on each value appearing (with obvious adjustment in the case of repeated values). Then D∗ is an approximation to the desired pivotal distribution D. Using D∗, we can now conduct approximate pivotal inference for M, based on the original sample mean x̄. If the number M of resamples is sufficiently large the approximation will be good, with high probability.
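A compact sketch (ours; all numbers are illustrative) of this parametric bootstrap for the pivot K:

    import numpy as np

    # Parametric bootstrap for the pivot K = sqrt(n) * (Xbar - M) in the
    # normal location model with known variance v.

    rng = np.random.default_rng(8)
    v, n, M = 4.0, 30, 10_000            # M = number of resamples

    x = rng.normal(loc=1.0, scale=np.sqrt(v), size=n)   # "observed" data
    xbar = x.mean()

    # Resample from the estimated distribution N(xbar, v) and collect k*.
    resamples = rng.normal(loc=xbar, scale=np.sqrt(v), size=(M, n))
    k_star = np.sqrt(n) * (resamples.mean(axis=1) - xbar)

    # Approximate 95% pivotal interval for M from the quantiles of D*.
    lo, hi = np.quantile(k_star, [0.025, 0.975])
    print("interval for M: (%.3f, %.3f)"
          % (xbar - hi / np.sqrt(n), xbar - lo / np.sqrt(n)))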
2.1 Other parameters
Staying with the normal location model, suppose (purely for illustration) we are interested in the upper quartile Qu of the distribution. We could just use the known normal-theory relationship Qu = M + 0.6745 v^{1/2} to make an appropriate adjustment to our inference for M. Alternatively (if less efficiently) we might base inference directly on the pivot Ku := n^{1/2}(Xu − Qu), where Xu is the upper quartile of the sample (X1, . . . , Xn). Again, for a given dataset (x1, . . . , xn) we can use resampling from the estimated distribution N(x̄, v) to approximate the distribution of Ku. In particular, we can estimate b := E(Ku), and so obtain the approximately unbiased estimate xu − n^{−1/2} b of Qu.
3 Nonparametric bootstrap
A still further level of approximation ignores the fact that the underlying distribution is known to be normal. For our estimated distribution we now use, not N(x̄, v), but a non-parametric estimate based on the original sample (x1, . . . , xn). Various forms are possible, but the most usual is also the simplest: just the empirical distribution of the data. The corresponding estimate of Qu is now the sample upper quartile, xu.
A resample is now simply a random sample with replacement from the values in the original data-set (in particular, this will usually contain some repeats). Let (x1∗, . . . , xn∗) be a generic resample. Then the associated estimate of the pivot is ku∗ := n^{1/2}(xu∗ − xu). If n is large then the empirical distribution of the data will be close to the parametric estimate N(x̄, v), and then an empirical distribution for ku∗, computed from a large number M of resamples, will approximate the true distribution of Ku = n^{1/2}(Xu − Qu), and can be used, together with the estimate xu from the original data, as a basis for approximate pivotal inference about Qu.
3.1 Unknown distribution
The nonparametric bootstrap made no use of the assumption that the underlying
distribution was normal with variance v. Consequently it could be applied, un-
modified, to any location family, to make inference about (say) its upper quartile
Qu . So long as n is large enough, the shape of the estimated distribution, used
for resampling, should be close to that generating the original data—i.e. they
should, to a good approximation, differ only in their location. Then (so long
as M is large enough) the bootstrap distribution of the pivot should be close to
its true distribution, and hence usable to make inference about Qu . And this
holds for any underlying distribution—which can thus be completely unspecified.
In particular, even the assumption of an underlying location model is no longer
needed.
3.2 Interval estimation

There are various ways of using the bootstrap to form an approximate 95% (say) confidence interval for a parameter such as (say) Qu. These include (see the sketch after this list):
Crude bootstrap interval Approximate the distribution of Ku by a normal distribution with mean b and variance σ² estimated from the bootstrap resamples, and find lower and upper confidence limits by equating n^{1/2}(xu − Qu) to (l+, l−) := b ± 1.96σ.

Basic bootstrap interval As above, but take l− [resp., l+] to be the lower [resp., upper] 2.5%-point of the nonparametric bootstrap estimate of the distribution of Ku.

Percentile interval Take the lower [resp., upper] confidence limit for Qu to be the lower [resp., upper] 2.5%-point of the bootstrap distribution of Xu∗. (This works if there is a monotonic function g such that the pivot g(Xu) − g(Qu) has a symmetric distribution about 0, with a shape well approximated by that of its bootstrap distribution.)
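A minimal sketch of ours of the basic and percentile intervals for the upper quartile, under illustrative data:

    import numpy as np

    # Nonparametric bootstrap for the upper quartile Q_u: basic and percentile
    # 95% intervals, resampling with replacement from the data.

    rng = np.random.default_rng(9)
    x = rng.normal(size=200)             # illustrative data
    n, M = len(x), 10_000
    xu = np.quantile(x, 0.75)            # sample upper quartile

    stars = np.quantile(rng.choice(x, size=(M, n), replace=True), 0.75, axis=1)
    ku = np.sqrt(n) * (stars - xu)       # bootstrap pivot values k_u*

    l_minus, l_plus = np.quantile(ku, [0.025, 0.975])
    print("basic     : (%.3f, %.3f)"
          % (xu - l_plus / np.sqrt(n), xu - l_minus / np.sqrt(n)))
    print("percentile: (%.3f, %.3f)" % tuple(np.quantile(stars, [0.025, 0.975])))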
3.3 Other pivots

The same essential logic can be applied using other pivots. For example, if we had started with a location-scale model, we might have used a "Student-t"-type pivot, such as n^{1/2}(Qu − Xu)/S, where S is some location-invariant estimator of scale, such as the interquartile range of the data (X1, . . . , Xn). We can again approximate its distribution by resampling from the empirical distribution of the data. And this might be more "robust", and more quickly convergent, than just using n^{1/2}(Qu − Xu) as our pivot, since all we now require is that the empirical distribution of the data and the true data-generating distribution should agree (approximately) up to a location-scale transformation (i.e., we need no longer require them to have approximately the same scale). Such "studentised pivots" are generally recommended over simple unscaled pivots.
4 Further considerations

The basic idea of bootstrap inference, and its basic implementation (though highly computer-intensive), are extremely simple and appealing—almost too good to be true. But it really does work.
Of course there are many loose ends to tie up. Theoretically, we would want to be able to show that, under suitable conditions, the convergence of the sample empirical distribution to that of the data-generating distribution is fast enough that, for large n, substituting the former for the latter is asymptotically negligible, when we focus attention on their associated distributions for the chosen pivot. We would also like this convergence to hold uniformly, over a suitable class of distributions. Pragmatically, we would like guidance on how big should n and M be for bootstrap inference to be reliable. There is also the question of what is the best pivot to use, among many variant possibilities, both from the point of view of getting good sampling distribution approximations out of the bootstrap, and from more principled points of view, such as using the data most appropriately and efficiently (e.g. through a sufficient statistic, and/or with appropriate conditioning).
Proper attention to such points can involve very sophisticated analysis, seemingly far from the "back-of-the-envelope" nature of the basic idea, and has turned bootstrap inference into one of the most active areas of current statistical research.
Chapter VII
Bayesian Inference
Bayesian inference is an over-arching and self-consistent approach to the enterprise of Statistics. For any problem, we use Probability to model both "physical" uncertainty about potential observables, given the relevant parameters, and "epistemological" uncertainty¹ about those unknown parameters. We then have a full probabilistic joint distribution for all unknowns, and statistical inference reduces to calculating the conditional distribution for unknown quantities of interest (which might include future observables, as well as unknown parameters), given the observed data. Anything else — e.g., summarisation and display of this distribution, calculation of its features, or using it to address problems of decision-making under uncertainty — is just an optional add-on. So once we have passed the (important!) stage of specifying our initial ingredients, Bayesian inference just reduces to the mathematics of probability theory. That is, Bayesian inference replaces the inverse relationship

STATISTICS = (PROBABILITY)⁻¹

by the direct relationship

STATISTICS = PROBABILITY.

If only everyone agreed that the Bayesian approach was the right one, then Statistics would be set on a firm logical and mathematical basis!
This logical coherence, avoiding the need for a new trick for every new problem, means — in principle at least — that Bayesian methods can be applied to an enormous range of extremely complicated and realistic models. In practice, however, this raises new challenges: both of specifying appropriate prior distributions over high-dimensional parameter spaces, and of computing, and computing with, the resulting posterior distributions. Much of modern Bayesian research is aimed at addressing these twin challenges.

¹ There is a large body of theory, based for example on principles of rational economic behaviour, justifying the use of Probability Theory for this purpose.
1 Prior and posterior
A truly fundamentalist² Bayesian approach does not mess with parametric models
at all, but directly models joint epistemological uncertainty about a collection
of observables including both those to be observed and those to be predicted;
it focuses on directly finding the conditional distribution of the latter, given
observations on the former. However for current purposes we restrict attention
to Bayesian analysis of a parametric statistical experiment E = (X, X, Θ, Θ, P).
The statistical model P provides the conditional density p(x | θ) (with respect to
some fixed dominating measure µ on X) of the observable X, given Θ. It is up
to the statistician/decision maker to specify, in addition, a marginal distribution
Π (interpreted as representing epistemological — typically “subjective” or “per-
sonal” — uncertainty) for the parameter Θ, which now has the mathematical
status of a random variable. We denote the density of Π (with respect to a suit-
able measure ν on the parameter-space Θ) by π(·). The distribution Π describes
perceived uncertainty about Θ before taking any account of the observed value
of X, and should ideally be specified at this stage. It is thus termed the prior
distribution of Θ.
There is no necessary connexion whatsoever between the two building blocks: the
conditional distributions for X given Θ, as specified by the statistical model, and
the marginal distribution for Θ, as described by the prior. But once they have
been separately specified, their conjunction fully determines the joint distribution
for (X, Θ). We can manipulate this in any way we please, using the standard
probability calculus, to address any queries we may have.
The joint density of (X, Θ) (with respect to µ × ν over X × ) is
p(x, θ) = p(x | θ) π(θ). (1)

The marginal density for X alone is thus

    p(x) = ∫_Ω p(x | θ) π(θ) dν(θ).    (2)

Although this is a distribution for the observable X, it is not purely “objective”,


but rather an odd amalgam of the “objective” model density p(x | θ) and the
“subjective” prior density π(θ). It represents your uncertainty about how X will
turn out, lacking knowledge of the value of Θ. It is termed the (prior) predictive
density of X.
2 See e.g. Bruno de Finetti’s highly original two-volume work Theory of Probability.
Using (1) and (2), the conditional density π(θ | x) for Θ, given X = x, is
π(θ | x) = p(x, θ)/p(x)
∝ p(x | θ) π(θ) (3)
where the proportionality sign indicates the omission of a factor (viz., 1/p(x))
that does not depend on the argument θ: after observing data X = x, this
factor is simply a constant. It is enough to know π(θ | x) up to such an unspecified
factor, which can always be recovered from the normalisation condition:
∫_Ω π(θ | x) dν(θ) = 1.
The distribution Πx , whose density is π(θ | x), represents the relevant episte-
mological uncertainty about Θ after having observed X = x. Obtaining and
describing this posterior distribution is typically the main, and often the sole,
aim of a Bayesian analysis. We can apply any probabilistic manipulations to this
end, but normally it will be accomplished by the application of formula (3). This
is Bayes’s theorem. Since, for the given data X = x, p(x | θ) is proportional to
the likelihood function L(θ), we can also write this as
π(θ | x) ∝ π(θ) L(θ). (4)
In words:

POSTERIOR ∝ PRIOR × LIKELIHOOD.

In particular, the only feature of the statistical model needed to perform a prior-
to-posterior analysis is the likelihood function (up to proportionality) for the
specific data at hand. Thus so long as the identical prior distribution is used
for a quantity Θ in any experiment in which it appears as a parameter (a not
unreasonable requirement), Bayesian inference will respect LP.
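As a purely computational aside (not part of the original notes): when no closed form is
available, the prior-to-posterior recipe (4) can be carried out numerically on a grid over the
parameter space. A minimal Python sketch, assuming a Bernoulli model and an arbitrary
prior shape; all names and numbers are illustrative:

    import numpy as np

    # POSTERIOR ∝ PRIOR × LIKELIHOOD, evaluated on a grid over (0, 1).
    theta = np.linspace(0.001, 0.999, 999)
    prior = 1.0 - np.abs(theta - 0.5)             # any positive prior shape will do
    prior /= prior.sum()

    x = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # Bernoulli data
    lik = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())   # L(theta)

    post = prior * lik                            # unnormalised posterior
    post /= post.sum()                            # normalisation fixes the constant
    print("posterior mean:", (theta * post).sum())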

1.1 Sufficiency

Suppose that T ≡ t(X) is a sufficient statistic for Θ, based on data X. Since


LP ⇒ SSP ⇒ WSP, it follows that the posterior of Θ depends on the data only
through the value of T , and can be computed starting from the marginal exper-
iment E T , which observes only T . This property can streamline computations,
since instead of working with the full data we can use the model for, and observed
value of, a sufficient statistic. In particular, and most usefully, this holds when
T is minimal sufficient.
Another way of understanding this property is as follows. Sufficiency of T can be written as the
conditional independence property:
X ⊥⊥ Θ | T,
expressing the fact that the conditional distribution of X, given both Θ and T , can be chosen
to depend on T alone. This interpretation of conditional independence is meaningful even if Θ
is not random. However, as soon as we take a Bayesian stance, so that Θ is just another random
variable, we can use the symmetry of probabilistic independence to deduce:

Θ ⊥⊥ X | T.    (5)

And we can now interpret (5) as asserting that the conditional (= posterior) distribution of Θ,
given X and (redundantly) T , in fact depends on T alone.

1.2 Comparison of hypotheses

A special case of prior-to-posterior analysis arises when Ω is a discrete set, I say,
of hypotheses about the distribution of X. Given Hi : Θ = i, X has density
pi (x), and π(i) is now the prior probability Prob(Hi ) of hypothesis Hi . From (4),
the posterior probabilities {Prob(Hi | x)} then satisfy

    Prob(Hi | x) / Prob(Hj | x) = {Prob(Hi) / Prob(Hj)} × {pi(x) / pj(x)}.    (6)

The term Prob(Hi )/Prob(Hj ) is the prior odds on Hi as against Hj , and similarly
Prob(Hi | x)/Prob(Hj | x) is the posterior odds. The term pi (x)/pj (x) = Li /Lj is
the corresponding likelihood ratio. So

POSTERIOR ODDS = PRIOR ODDS × LIKELIHOOD RATIO

Knowing the (prior or posterior) odds between all pairs of hypotheses, we know
the probabilities up to a scale factor, and hence, by normalisation, we can recover
them completely. This is particularly useful in the case of just two hypotheses H0
and H1 . In this case the odds ω in favour of H1 as against H0 is just a recoding
of the probability π = Prob(H1 ) (and hence of the distribution over (H0 , H1 )):
ω = π/(1 − π), π = ω/(1 + ω). The likelihood ratio term is p1 (x)/p0 (x).
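A tiny numerical illustration of this updating (ours; the hypotheses and datum are invented):

    import numpy as np

    # H0: X ~ N(0, 1) versus H1: X ~ N(1, 1), with prior odds π1/π0 = 1.
    # For these two densities the likelihood ratio is p1(x)/p0(x) = exp(x − 1/2).
    x = 0.8
    prior_odds = 1.0
    lr = np.exp(x - 0.5)                      # likelihood ratio
    post_odds = prior_odds * lr               # POSTERIOR ODDS
    prob_H1 = post_odds / (1.0 + post_odds)   # recover probability: ω/(1 + ω)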

1.3 Summarisation

Once the posterior distribution has been obtained it can be summarised in a


variety of ways, according to our further purposes and interests. For example,
we could calculate the posterior mean, or posterior mode, of Θ, which can be
regarded as point estimates. For any chosen form of summary, the procedure of
always calculating that summary, for any data, yields an estimator ; this can be of
interest even to a confirmed frequentist — who would, however, judge it in terms


of his own frequentist criteria,3 which are of no interest to the pure Bayesian.
We could similarly summarise remaining uncertainty about Θ, in the light of the
data, by means of, say, the variance, or the entropy, of the posterior distribution.
Another useful summary is a (posterior) credible region: a subset of Ω having
some given posterior probability (e.g. 50% or 95%). There are of course many
such regions; all are equally valid, but some, such as 1-sided, or equal-tailed
central regions, may be more useful. A region of the form {θ ∈ Ω : π(θ | x) ≥ c},
bounded by a contour of the posterior density, is a highest posterior density
(HPD) region; the constant c being chosen to yield the desired probability. Such
a region has minimum measure for its chosen posterior probability. Note that a
HPD region is not invariant under transformation of the parameter. An invariant
“snug” region can be constructed using, instead, the contours of the likelihood
function: {θ ∈ Ω : L(θ) ≥ c}. Such a region has minimum prior probability for
the given posterior probability.
For a real parameter, a Bayesian analogue of a classical 1-sided significance level,
for testing the hypothesis Θ = θ0 , is the posterior probability Πx (Θ ≥ θ0 ). But
there is not really any useful analogue of a two-sided significance level.

Bayes factor

Suppose we seriously entertain the possibility that a hypothesis H0 : Θ = θ0


could be exactly true. Then we should use a prior distribution that gives this
a positive probability, say Prob(H0 ) = π0 ; and our inference about whether or
not H0 holds, on the basis of data X = x, would be captured by the posterior
probability Prob(H0 | x).
Under the alternative hypothesis H1 : Θ ≠ θ0 (which has prior probability π1 =
1 − π0 ), the value of Θ remains unspecified. Consequently the Bayesian must
assign a conditional distribution to Θ, given H1 . Let this have density π(θ | H1 ).
By Bayes’s theorem,
    Prob(H1 | x) / Prob(H0 | x) = (π1/π0) × {p(x | H1) / p(x | H0)}.    (7)

Moreover, p(x | H0) = p(x | θ0), while

    p(x | H1) = ∫_Ω p(x | θ) π(θ | H1) dν(θ),

the predictive density of X, given H1 .


3 Such Bayesian-derived estimators do typically behave very well in frequentist terms.

Note how the prior probabilities (π0 , π1 ) only enter into the first factor in (7),
while the prior distributions of Θ conditional on the hypotheses only enter into
the second factor. This second factor is termed the Bayes factor (in favour of H1
as against H0). It is essentially a likelihood ratio (in favour of H1 as against H0),
but now the relevant “likelihood” of H1 is not purely objective, since it involves
the conditional prior density π(θ | H1).
The posterior probability of H0 and the Bayes factor will depend on the data
only through the value of a sufficient statistic.

1.4 Prediction

Bayesians tend to be more interested in predicting as yet unobserved observables,


rather than in inference about forever unobservable parameters.
Suppose X represents the data in an experiment, and Y some future observable
to be predicted. We suppose given a joint statistical model for (X, Y ), given Θ,
with densities p(x, y | θ); and a prior density π(θ) for Θ. The rules of probability
give:

    p(y | x) = ∫_Ω p(y | x, θ) π(θ | x) dν(θ),    (8)

expressing the (posterior) predictive density of Y , given the data X = x, as a


mixture of the model-based conditional densities, where the mixing distribution
is the posterior distribution of Θ.
Often, X and Y will be independent, given Θ. Then p(y | x, θ) = p(y | θ), and we
get:

    p(y | x) = ∫_Ω p(y | θ) π(θ | x) dν(θ),    (9)

a posterior mixture over the unconditional statistical model for Y .

2 Asymptotic posterior distribution

We deal with this very informally.


For large n, in the vanishingly small interval around the MLE θ̂n where the log-
likelihood ln(θ) is significant, it will be approximately quadratic, with slope 0 and
curvature ĵn at θ̂n. Hence

    Ln(θ) ≈ exp{−½ ĵn (θ − θ̂n)²}.    (10)
So long as the prior density π(θ) is continuous and positive, it will be approx-
imately constant in this region, so that the posterior density will be essentially
proportional to the likelihood (10) in the region where it is significant, and neg-
ligible elsewhere. We deduce that asymptotically
    Θ ∼ N(θ̂n, ĵn⁻¹).    (11)

Importantly, this applies for any (continuous and positive) prior density: given
enough data, a priori divergent opinions will be brought into posterior agreement.
We note the similarity of the conclusion (11), expressed as ĵn^{1/2}(θ̂n − Θ) ∼ N(0, 1), to
the asymptotic sampling property ĵn^{1/2}(Θ̂n − θ) ∼ N(0, 1). Thus with enough data
both approaches will give numerically similar inferences (though with different
interpretations).

3 Conjugate families

Although there is no logical relationship between the statistical model and the
prior distribution, certain forms of prior can be more tractable than others. Even
though these usually will not be good approximations to real prior uncertainty,
they can be used as building blocks in more complex and realistic specifications,
thus simplifying their analysis. Such “artificial” priors are therefore worth atten-
tion.

3.1 Closure under sampling

Definition 1 For a given statistical experiment, a family F of distributions over
Ω is closed under sampling if, for any prior distribution Π ∈ F for Θ, and any
data x, the posterior distribution Πx of Θ given X = x is also in F. □

This property is particularly valuable when our statistical model is an exponential


family:
p(x | θ) = exp{a(x) + b(θ) + u(θ)T t(x)}. (12)
In this case, consider the family F of all prior distributions whose density (with
respect to a given underlying measure ν on Ω) is of the form:
π(θ) ∝ π0 (θ) exp{n0 b(θ) + u(θ)T t0 }, (13)
a (p+1)-parameter exponential family with hyperparameters n0 ∈ R+ and t0 ∈ Rp
— this is termed a conjugate family for the given EF. If the prior is given by
(13), and we observe data (x1, . . . , xn) from the model (12), then the posterior
will evidently have the same form, with the hyperparameters n0 and t0 replaced,
respectively, by n1 = n0 + n and t1 = t0 + t, where t = ∑_{i=1}^n t(xi). Consequently a
conjugate family is closed under sampling. Since it is itself a (p + 1)-parameter
family, it will usually be easy to describe and manipulate its generic member,
thus solving at one fell swoop all possible inference problems that could arise for
any data.
Note that, since π0 (and indeed the underlying measure ν) is arbitrary, any prior
is a member of some conjugate family, so it does not really make sense to talk of
“a conjugate prior”. However, such a description is frequently used for the case
where Θ is the natural parameter Φ (so u(φ) ≡ φ), ν is Lebesgue and π0 ≡ 1.
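The updating rule is mechanical enough to code once and for all; a sketch (ours, with
illustrative names), taking the natural-statistic map t as an argument:

    import numpy as np

    def conjugate_update(n0, t0, xs, t):
        # Posterior hyperparameters under (12)-(13):
        # n0 -> n0 + n,  t0 -> t0 + sum_i t(x_i).
        ts = np.array([t(x) for x in xs])
        return n0 + len(xs), t0 + ts.sum(axis=0)

    # Example: a model with natural statistic t(x) = x.
    n1, t1 = conjugate_update(n0=1.0, t0=0.0, xs=[0.2, -0.5, 1.3], t=lambda x: x)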

3.2 Mixture priors


Using a mixture of conjugate priors is more flexible than using just one.
Suppose the prior density can be expressed in the form π(θ) = ∑_j wj πj(θ), for some given
set of densities (πj ) and probability weights (wj ). We can represent this in terms of an auxiliary
discrete parameter J, with Prob(J = j) = wj and π(θ | J = j) = πj (θ).
In the joint posterior for (J, Θ) given X = x, we then have π(θ | J = j, x) = πj (θ | x). Also
wj∗ := Prob(J = j | x) ∝ wj pj(x), where pj is the prior predictive density based on prior πj,
and the constant of proportionality is recoverable from the condition ∑_j wj∗ = 1. Consequently
π(θ | x) = ∑_j wj∗ πj(θ | x), again in mixture form but with revised weights.
When each of the πj is of conjugate form, it is typically easy to compute both the component
posterior πj (θ | x) and the predictive density pj (x), hence the posterior weights (wj∗ ).

4 Normal models

In this Section, all densities are with respect to Lebesgue measure.

4.1 Normal location model

Consider the normal location model , parametrised by its mean M with the vari-
ance v known. Let h := v −1 , the sampling precision. Thus

Xi ∼ N (M, h−1 ), (14)

The likelihood, for data x = (x1 , . . . , xn ), is

    Ln(µ) ∝ exp{−½ nh(x̄ − µ)²}.    (15)
This is the exponential of a quadratic form in µ, so conjugacy suggests using a


prior of similar functional form:

    π(µ) ∝ exp{−½ h0(µ − m0)²}    (16)
i.e.
    M ∼ N(m0, h0⁻¹).    (17)

The posterior density π(µ | x) ∝ π(µ) Ln (µ) is again the exponential of a quadratic
form in µ, so that
    π(µ | x) ∝ exp{−½ hn(µ − mn)²},
i.e.
    M | x ∼ N(mn, hn⁻¹).    (18)

Equating linear and quadratic coefficients yields:

    hn = h0 + nh    (19)
    mn = (h0 m0 + nh x̄) / (h0 + nh).    (20)

In particular, the family of normal distributions for M is closed under sampling


from N (M, h−1 ).
Note that the same posterior (18) would be obtained if we had only observed the
sufficient statistic X̄, having distribution N{M, (nh)⁻¹}: the analysis is essen-
tially identical, after first replacing n by 1 and h by nh.
Equation (19) exhibits the posterior precision hn as the sum of the prior precision
h0 and the total sample precision nh; while (20) gives the posterior mean mn of
M as the weighted average of the prior mean m0 and the sample estimate x̄, each
weighted with its associated precision.

Summaries

Point estimate We might take the posterior mean mn as in (20) as a “Bayesian


point estimate” of M, while the remaining uncertainty about M is captured in
the posterior precision hn as in (19).

Interval estimate    A 95% credible interval for M would be mn ± 1.96/√hn.
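In code, the update (19)-(20) and the interval above are immediate; this Python sketch
(ours; all numbers illustrative) assumes the sufficient statistics are already computed:

    import numpy as np

    def normal_location_posterior(m0, h0, xbar, n, h):
        # (19): precisions add; (20): precision-weighted average of m0 and xbar.
        hn = h0 + n * h
        mn = (h0 * m0 + n * h * xbar) / hn
        return mn, hn

    mn, hn = normal_location_posterior(m0=0.0, h0=0.5, xbar=1.2, n=20, h=1.0)
    interval = (mn - 1.96 / np.sqrt(hn), mn + 1.96 / np.sqrt(hn))  # 95% credible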
Hypothesis test If there were some value µ0 for M of special interest — corre-
sponding to some “null hypothesis” H0 : M = µ0 of no effect — we might assess
the tenability of the hypothesis H0 in the light of the data by assessing how far
out in a tail of the posterior distribution is the value µ0 : this could be quantified
by the posterior probability Prob(M ≥ µ0 | x), with values close to either 0 or 1
making H0 look untenable. Note however that since our prior probability that
H0 holds exactly was 0, so will be its posterior probability.

Predictive distribution

The prior predictive density of a single observation X is given by


    p(x) = ∫_{−∞}^{∞} p(x | µ) π(µ) dµ    (21)

which is readily computed from (14) and (16). We obtain

    X ∼ N(m0, h0⁻¹ + h⁻¹).    (22)

Alternatively we can argue directly as follows. Given M, the distribution of Z := X − M is


N (0, h−1 ). Since this does not depend on M, Z is independent of M, with this distribution.
Also, M ∼ N(m0, h0⁻¹). So X = M + Z, where the summands are independent normal. Their
sum is thus normal with the means and variances summed, yielding (22).

A similar argument shows that the prior predictive distribution of the sample
mean X̄ of n future observations is

    X̄ ∼ N(m0, h0⁻¹ + (nh)⁻¹).    (23)

The posterior predictive density of a new observation Xn+1 , given data X = x, is


    p(xn+1 | x) = ∫_{−∞}^{∞} p(xn+1 | µ) π(µ | x) dµ

where (importantly) no further conditioning on x is necessary in p(xn+1 | µ), since


Xn+1 ⊥⊥ X | M. This is of the same form as (21) except that the prior hyperpa-
rameters are replaced by the posterior hyperparameters. Thus we have

    Xn+1 | x ∼ N(mn, h⁻¹ + hn⁻¹).    (24)
Bayes factor

In the set-up of § 1.3, suppose we want to test H0 : M = 0 against H1 : M ≠ 0.
For simplicity we take the conditional prior distribution, given the alternative
hypothesis, as M | H1 ∼ N(0, h0⁻¹). Then Z := (nh)^{1/2} X̄ is sufficient for M, with
Z | H0 ∼ N(0, 1); while, using (23), Z | H1 ∼ N(0, hn/h0). So the Bayes factor in
favour of H1 is

    B10 = p(z | H1) / p(z | H0) = (h0/hn)^{1/2} exp{½ (nh/hn) z²}.    (25)

Large sample behaviour

As n → ∞, the sample mean X̄ remains bounded almost surely. We can thus


argue as follows.
The exact posterior N(mn, hn⁻¹) depends on the prior hyperparameters (m0, h0),
but for large n their influence is swamped by the data and from (19) and (20) we
have, approximately,

    hn ≈ nh,    mn ≈ x̄,

i.e.
    M | x ∼ N{x̄, (nh)⁻¹}.    (26)
This may be compared with the sampling property:

    X̄ | µ ∼ N{µ, (nh)⁻¹}.

In either case we have X̄ − M ∼ N{0, (nh)⁻¹}, but for the sampling distribution
this holds conditionally on the parameter M, whereas for the Bayesian posterior
it holds conditionally on the data X.

If we compare the posterior means mn∗ and mn† based on two different priors, N{m0∗, (h0∗)⁻¹}
and N{m0†, (h0†)⁻¹}, we find they approach each other at rate n⁻¹, as against the slower rate
n^{−1/2} at which each approaches the true value of M. Using a suitable measure of distance between
distributions, the full posterior distributions approach each other at rate n^{−1/2} — a result that
continues to hold even for non-normal priors (and extends to other models). So the posterior
distribution M | x ∼ N{x̄, (nh)⁻¹} is “asymptotically objective”, in that it is insensitive to
initial opinions about M.
As for the posterior predictive distribution (24), given extensive data this will be approxi-
mately N(x̄, h⁻¹); and the resulting distributions for different priors now approach each other
at rate n⁻¹ (a result that again extends to other models).
From (25), the Bayes factor B10 behaves asymptotically as (h0/h)^{1/2} n^{−1/2} exp{½ nh x̄²}. Note
that the dependence on the prior precision h0 does not disappear in the limit. For fixed non-zero
x̄, B10 → ∞, corresponding to increasingly strong evidence against H0.
It is interesting to contrast the Bayes factor approach to testing H0 with the classical
approach. The latter would reject H0 when |Z| = (nh)^{1/2} |X̄| exceeds some critical point k (e.g.,
1.96), irrespective of the value of n. The approximate Bayes factor at this critical point is
(h0/h)^{1/2} n^{−1/2} exp{½ k²}, which tends to 0 as n → ∞, thus eventually discrediting H1 in favour of
the null hypothesis H0. So the same data which would lead to rejection of H0 from a classical
perspective count strongly in its favour from the Bayesian perspective.

Improper prior distribution

With each observation made, the precision parameter hn of M increases. Hence


we could argue that, to represent “great prior uncertainty”, that precision should
be taken as low as possible. But for a proper prior we require h0 > 0. Formally, we
consider h0 ↓ 0 (for fixed data, while m0 stays fixed, or at least bounded). Then
hn → nh, mn → x. Hence even for finite samples we might regard the formal
posterior (26) as “objective”, inasmuch as based on minimal prior information.
We could alternatively try taking the limit as h0 ↓ 0 in the prior. However the
prior density
    π(µ) = (h0/2π)^{1/2} exp{−½ h0(µ − m0)²}
then converges pointwise to 0, which does not define a probability distribution.
If instead we ignore the normalising constant and work from
    π(µ) ∝ exp{−½ h0(µ − m0)²}
we obtain the (pointwise) “formal limit”

π(µ) ∝ 1. (27)

Again (27) does not describe a genuine probability distribution, since there is no
way of scaling it to integrate to 1: it is an “improper” prior density. Nevertheless,
if we insert it into Bayes’s theorem, expressed as π(µ | x) ∝ π(µ) Ln (µ) we obtain
the perfectly well-behaved “objective posterior” (26).
Fully subjective Bayesians frown upon such “objective” priors and posteriors,
and indeed the purely formal manipulations involved can not be put on a firm
logical basis, and can generate paradox and inconsistency. Nevertheless they
have a strong intuitive appeal. They are particularly prized as ways of gener-
ating essentially the same result as a frequentist analysis, but with a Bayesian
interpretation. For example, the standard frequentist 95%-confidence interval
x̄ ± 1.96 (nh)^{−1/2} has posterior probability exactly 95% under the formal posterior
(26). Similarly, the P-value P = Prob[N{0, (nh)⁻¹} > x̄] for testing the hy-
pothesis H0 : M = 0 against H1 : M > 0 can alternatively be interpreted as the
posterior probability Prob(M < 0 | x) under (26).

Predictive distribution Combining the formal posterior distribution (26) for


M | x with the sampling distribution N(M, h⁻¹) for Xn+1, the posterior predictive
distribution for Xn+1 is Xn+1 | x ∼ N{x̄, (1 + n⁻¹)h⁻¹}. A 95% “prediction
interval” for Xn+1 is thus x̄ ± 1.96 (1 + n⁻¹)^{1/2} h^{−1/2}. This agrees with the frequentist
prediction interval, constructed using the property that, conditional on any value
for M, Xn+1 − X̄ ∼ N{0, (1 + n⁻¹)h⁻¹}.

Bayes factor If (for finite n) we try to express great prior uncertainty by h0 ↓ 0,


we find that, for any fixed data, the Bayes factor B10 of (25) tends to 0. Such
behaviour is clearly unsatisfactory. There is no non-controversial way of using
improper priors for the comparison of hypotheses.

4.2 Normal scale model

Now suppose the mean is known, M = µ say, but the precision H is unknown.
The likelihood function for H is
    Ln(h) ∝ h^{n/2} e^{−½ n s*² h}    (28)

with s*² the observed value of S*² := n⁻¹ ∑_{i=1}^n (Xi − µ)², the mean squared
deviation of the observations from their known population mean µ; this is an
unbiased estimator of the variance V = H −1 . The likelihood form (28) shows
that we have an exponential family; it has natural parameter H, natural sufficient
statistic ns∗2 , and mean-value parameter nV .
Conjugacy suggests taking a prior for H of similar form to (28). This is achieved
if we take

    H ∼ Γ(½ν0, ½ν0τ0²)    (29)

(or, equivalently, H ∼ χ²_{ν0}/ν0τ0²), since this has density

    π(h) ∝ h^{ν0/2 − 1} e^{−½ ν0τ0² h}    (30)

(where the proportionality sign hides a purely numerical factor, not depending on
h). This has expectation 1/τ0², so that the hyperparameter τ0² is a “prior guess”
at V = H⁻¹; and coefficient of variation (2/ν0)^{1/2}, so that the “degrees of freedom”
hyperparameter ν0 is a measure of the precision of this prior.
Forming the posterior density by multiplying (30) and (28) and simplifying, we
find

    H | x ∼ Γ(½νn, ½νnτn²)    (31)

with

    νn = ν0 + n
    τn² = (ν0τ0² + n s*²) / (ν0 + n).
In particular, the Gamma family of distributions for H is closed under sampling
from N (µ, H −1). We observe that the posterior precision hyperparameter is the
sum of the prior precision hyperparameter and the sample size, while the posterior
“guess” at V is a weighted average of the prior guess and the data-based estimate,
each weighted with its degrees of freedom.
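In code the update is again a one-liner; an illustrative sketch (names are ours):

    import numpy as np

    def normal_scale_posterior(nu0, tau0_sq, xs, mu):
        # Prior H ~ Gamma(nu0/2, nu0*tau0_sq/2); data from N(mu, 1/H).
        xs = np.asarray(xs, dtype=float)
        n = len(xs)
        s_star_sq = np.mean((xs - mu) ** 2)                 # s*^2
        nu_n = nu0 + n                                      # degrees of freedom add
        tau_n_sq = (nu0 * tau0_sq + n * s_star_sq) / nu_n   # weighted guess at V
        return nu_n, tau_n_sq   # H | x ~ Gamma(nu_n/2, nu_n*tau_n_sq/2)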

5 Computational methods

Although the logic of Bayesian analysis is clearcut, as soon as we move away


from simple conjugate analysis its computational implementation can be far from
straightforward. We illustrate some useful techniques with a fairly simple exam-
ple.

Non-conjugate normal analysis.    Consider the two-parameter model Xi ∼
N(M, H⁻¹), independently (i = 1, . . . , n). A minimal sufficient statistic is (X̄, S²),
where S² := ∑_{i=1}^n (Xi − X̄)²/(n − 1). We assume a joint prior distribution of the
form:
    H ∼ χ²_{ν0}/ν0τ0²
    M ∼ N(m0, h0⁻¹),

independently. This family of joint distributions is not conjugate: the joint pos-
terior is not of the same form.
Using results for the normal scale and normal location models separately, we see
that, in the posterior,
 
    M | x, h ∼ N{(h0m0 + nhx̄)/(h0 + nh), (h0 + nh)⁻¹}    (32)
    H | x, µ ∼ χ²_{ν0+n} / {ν0τ0² + (n − 1)s² + n(x̄ − µ)²}    (33)
However the marginal posterior distributions are non-standard.


We can use (33) and (32) to compute the (joint) mode (µ∗ , h∗ ) of the posterior
distribution, since each component must be the mode of the conditional density,
evaluated conditional on the other component. We obtain (for ν0 + n > 2):
    µ∗ = (h0m0 + nh∗x̄)/(h0 + nh∗)    (34)
    h∗ = (ν0 + n − 2) / {ν0τ0² + (n − 1)s² + n(x̄ − µ∗)²}.    (35)
These combine to produce a cubic equation that can be solved for µ∗ .
Alternatively, starting from a trial value we can cycle between (34) and (35) to
update each estimate in turn: this is the method of iterative conditional modes.
It will converge to a local mode of the joint posterior.
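A minimal Python sketch of this cycling (ours; the starting value and the fixed iteration
count are arbitrary choices, and ν0 + n > 2 is assumed):

    def icm(xbar, s_sq, n, m0, h0, nu0, tau0_sq, iters=100):
        # Iterative conditional modes: cycle between (34) and (35).
        h = 1.0 / tau0_sq                    # trial value for h*
        for _ in range(iters):
            mu = (h0 * m0 + n * h * xbar) / (h0 + n * h)              # (34)
            h = (nu0 + n - 2) / (nu0 * tau0_sq + (n - 1) * s_sq
                                 + n * (xbar - mu) ** 2)              # (35)
        return mu, h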

5.1 Gibbs sampling

Suppose (M, H) follow their correct joint posterior distribution Π. Given H = h,


generate a new random variable M′ from the conditional posterior distribution
(32); then clearly (M′ , H) ∼ Π. Alternatively, given M = µ, generate H ′ from
the conditional posterior distribution (33): now (M, H ′) ∼ Π. Thus the joint
posterior distribution is invariant under regeneration of either M or H in this
way.
This suggests a way of simulating from the joint distribution Π. First generate
M1 somehow. Then, if M1 = µ, generate H1 from distribution (33). Next, if
H1 = h, generate M2 from (32); and so on. This procedure defines a Markov
transition kernel from (Mi , Hi ) to (Mi+1 , Hi+1 ), and the desired joint distribution
Π is a stationary distribution of the Markov chain. If the density π of Π is
everywhere positive (and also under weaker conditions) then the Markov chain
will be ergodic, Π will be its unique stationary distribution, and the limiting
distribution of (Mn , Hn ) will be Π, no matter how the chain is started off. Hence
by running the chain for long enough we can ensure we have sampled from the
desired joint distribution Π. There are many delicate implementation issues, such
as how long to run the chain to ensure a close approximation to the stationary
distribution, and how to collect a large sample from that distribution — e.g.,
by running many parallel independent chains and taking their final values, or by
running just one and taking suitable well-separated terms in the sequence.
Once we have such a random sample, say (µ1, h1), . . . , (µN, hN), we might esti-
mate E{g(M)}, say, by N⁻¹ ∑_{i=1}^N g(µi). However, we can do better by utilis-
ing (32), which (with luck) allows us to calculate k(H) := E{g(M) | H}. Since
E{k(H)} = E{g(M)}, we can alternatively estimate E{g(M)} by N⁻¹ ∑_{i=1}^N k(hi).
Since var{g(M)} = var{k(H)}+E[var{g(M) | H}], we have var{k(H)} ≤ var{g(M)},
and we should get a more precise estimate by this procedure — which could
in principle be iterated. Because of an obvious parallel, this is termed “Rao-
Blackwellisation”.
If we wanted an estimate of the marginal distribution of M, we might use the
“empirical distribution” of the (µi), putting probability 1/N at each µi (assumed
distinct), so that Prob(M ∈ A) is estimated by the proportion of the (µi) falling in
the set A. However this discrete estimate of a typically continuous distribution
is unsatisfying. A genuine (and better) density estimate is obtained by Rao-
Blackwellisation:

    π̂(µ) = N⁻¹ ∑_{i=1}^N π(µ | hi).
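To make the scheme concrete, here is a minimal Python sketch (ours, not part of the
original notes; the data and tuning constants are arbitrary) that alternates draws from
the full conditionals (33) and (32):

    import numpy as np

    rng = np.random.default_rng(0)

    def gibbs(xbar, s_sq, n, m0, h0, nu0, tau0_sq, N=5000, burn=500):
        # Alternate H | x, mu (eq. (33)) and M | x, h (eq. (32)).
        mu, draws = xbar, []
        for i in range(N + burn):
            shape = 0.5 * (nu0 + n)
            rate = 0.5 * (nu0 * tau0_sq + (n - 1) * s_sq + n * (xbar - mu) ** 2)
            h = rng.gamma(shape, 1.0 / rate)      # chi^2_{nu0+n}/c = Gamma(shape, rate c/2... here rate)
            hn = h0 + n * h
            mu = rng.normal((h0 * m0 + n * h * xbar) / hn, 1.0 / np.sqrt(hn))
            if i >= burn:
                draws.append((mu, h))
        return np.array(draws)

    draws = gibbs(xbar=1.0, s_sq=2.0, n=30, m0=0.0, h0=0.1, nu0=1.0, tau0_sq=1.0)

A Rao-Blackwellised density estimate for M then averages the normal densities (32) over
the sampled precisions, rather than smoothing the µ-draws directly.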

This technique, Gibbs sampling, is just one of a number of Markov Chain Monte
Carlo (MCMC) algorithms for sampling from a posterior distribution by simulat-
ing from a Markov chain that has that as its equilibrium distribution. Such com-
putationally intensive methods allow us to perform otherwise intractable Bayesian
inferences, and their development and application forms a major strand of mod-
ern Bayesian statistics.
Chapter VIII

Decision Theory

Suppose we need to choose an action a ∈ A, the consequence of which depends


on the value θ of Θ, this being measured by a loss function L(θ, a). We suppose
L bounded below (we often take it to be non-negative): then its expectation over
any randomness introduced into its arguments will be well-defined (possibly +∞)
and can be calculated as a repeated expectation in any order.

1 Bayes act

Suppose my current uncertainty about Θ is expressed by a Bayesian distribu-


tion Π over . If I take act a, my (subjective) expected loss is L(Π, a) :=
EΘ∼ΠL(Θ, a). I can use this to rank the available acts. Any minimiser aΠ of
L(Π, a) over A is called a Bayes act (with respect to Π); the minimised value
H(Π) := inf a∈A L(Π, a) = L(Π, aΠ ) is the Bayes loss under Π.

1.1 Bayes estimate

For the following examples we assume Ω, A ⊆ R. We regard quoting an estimate


of Θ as an act a.

Squared error loss    Suppose L(θ, a) = (θ − a)². Then the expected loss,
for estimate a, is var(Θ) + {a − E(Θ)}². Hence the Bayes estimate is just the
expectation of Θ, and the Bayes loss is the variance of Θ.


Absolute error loss Now consider L(θ, a) = |a − θ|. Then the Bayes estimate
is the median m of the distribution of Θ, i.e. that value (for simplicity assumed
existent and unique) such that Prob(Θ < m) = Prob(Θ > m) = ½. For by
considering all possible orderings of θ, m, a, we see that
L(θ, a) − L(θ, m) ≥ (m − a) sign(θ − m)
and so, since E{sign(Θ − m)} = 0, E{L(Θ, a)} ≥ E{L(Θ, m)} for any choice of a.
As a variation, suppose L(θ, a) = c1 (a − θ) if θ ≤ a, and c2 (θ − a) otherwise
(c1 , c2 > 0). Let q := c2 /(c1 + c2 ). Then the Bayes estimate is the q-quantile of
the distribution of Θ, i.e. that value θq such that Prob(Θ ≤ θq ) = q.

0–1 loss Suppose now L(θ, a) = 0 if |θ − a| ≤ k, else 1. The Bayes estimate is


that value of a maximising the probability that Θ lies in the interval a ± k. For
small k this will be, essentially, the mode of the density π(θ).
For the case of discrete , with loss L(θ, a) = 0 if a = θ, else 1, the Bayes estimate
is the discrete mode, i.e. the value of i maximising Prob(Θ = i).
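Given a (possibly simulated) sample from Π, these three Bayes acts can be approximated
numerically; a minimal sketch (ours), with an arbitrary stand-in distribution for Π:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = rng.normal(2.0, 1.0, size=10_000)      # stand-in sample from Pi

    bayes_sq = theta.mean()                        # squared error -> posterior mean
    bayes_abs = np.median(theta)                   # absolute error -> posterior median
    hist, edges = np.histogram(theta, bins=100)    # 0-1 loss -> (approximate) mode
    bayes_01 = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])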

2 Decision rules

Suppose now that, before having to choose an act, we can conduct an experiment
E, observing X, with distribution governed by Θ, and so learning more about
Θ. A Bayesian would naturally apply the procedure above using the posterior
distribution given the observed value x of X. However, a frequentist must proceed
differently: by choosing between procedures for using any data that might be
observed.

Definition 1 A (non-randomised) decision rule is a map d : X → A. Its risk
function is the function R(θ, d) = Eθ[L{θ, d(X)}] of θ ∈ Ω. □

For example, with A = Ω = R we might use squared error loss: L(θ, a) = (θ − a)².
Then a non-randomised decision rule d : X → R is an estimator, and its risk
function R(θ, d) is just its mean-squared error function mse(θ), as considered in
detail in Chapter III.
In the case of finite Ω, with points labelled θ1, . . . , θn, we can think of the risk
function of a decision rule d as a point (R(θ1 , d), . . . , R(θn , d))T ∈ Rn .

A randomised decision rule d̃ is formed by randomly choosing a decision rule


according to a known probability distribution over the space of such rules (inde-
pendently of both X and Θ). Its risk function is calculated by taking a further
expectation over this distribution. The set of all possibly randomised decision
rules will be denoted by D, with D0 its subset of non-randomised rules.
An apparently more general concept is that of a behavioural decision rule. Such a
rule δ associates with each x ∈ X a probability distribution over A (independent
of Θ)—again, the risk is defined by an appropriate further expectation. How-
ever, unlike the case for a randomised decision rule, the dependence between the
outcomes of the randomisations for different x-values remains unspecified. But
given any randomised decision rule d̃ we can define a behavioural rule δ, which
associates with each x the distribution of d̃(x) (where d̃ is random); and δ and d̃
will have the same risk function.
Once a preferred decision rule d has been selected, and X = x observed, the
corresponding (non-random or randomised) action a = d(x) is taken.
The risk set is R := {R(·, d) : d ∈ D} (a subset of Rn for finite Ω). The technical
reason for allowing randomised decision rules is that R is then convex: i.e., if
x, y ∈ R and 0 ≤ p ≤ 1, then z := px + (1 − p)y ∈ R. For if x = R(·, d1), y =
R(·, d2), we have z = R(·, d̃), where d̃ is the randomised decision rule that chooses
d1 with probability p, d2 with probability 1 − p.
In the frequentist approach we compare decision rules in terms of their risk func-
tions. We consider d1 at least as good as d2, and write d1 ≼ d2, if R(θ, d1) ≤
R(θ, d2) for all θ ∈ Ω. In this case we may say that d1 weakly dominates d2. If
moreover we have strict inequality for some θ, we write d1 ≺ d2, and say d1 dom-
inates d2. But ≼ is only a partial order: typically R(θ, d1) < R(θ, d2) for some
θ ∈ Ω and R(θ′, d1) > R(θ′, d2) for some other θ′ ∈ Ω, and then d1 and d2 are simply
incomparable.
If d is dominated we call it inadmissible, else admissible.

3 Convex loss

We say that the decision problem is convex if the action space A can be repre-
sented as a real interval1 and, for each θ ∈ Ω, the loss function L(θ, a) is a convex
function of a on A.
For example, the squared-error estimation problem, with Ω = A = R and L(θ, a) =
(θ − a)², is convex. Any decision problem with only a finite number of available
actions is non-convex.

1 Extension to a convex subset of a general vector space is straightforward.
Theorem 1 In a convex decision problem, any behavioural decision rule δ is


weakly dominated by a non-randomised decision rule.

Proof. For each x, let d(x) := E{δ(x)}, where the expectation is under the
distribution δ(x) over A. Then d(x) ∈ A, so d is a non-randomised decision rule.
Recall Jensen’s inequality (see Chapter IV, § 2.1): for any convex function g of
a real random variable Y , and any distribution P for Y , g{E(Y )} ≤ E{g(Y )}.
For each x and θ, apply this to g(·) = L(θ, ·), under the distribution over A
determined by δ(x), to obtain
L{θ, d(x)} ≤ EL{θ, δ(x)}. (1)
Taking a further expectation over X ∼ Pθ yields R(θ, d) ≤ R(θ, δ), so that d
weakly dominates δ. □

Corollary 2 Under the above conditions, any randomised decision rule is weakly
dominated by a non-randomised decision rule.

Thus, in a convex decision problem, there is never any point in randomising.

3.1 Sufficiency

Let the statistic T , with values in T, be sufficient for Θ. Given any (say non-
randomised)2 decision rule d, let δ be the behavioural decision rule, based on
T alone, that assigns, to t ∈ T, the distribution over A of d(X) given T = t;
this being determined by the distribution of X given T = t under Pθ , which, by
sufficiency, is in fact known, independently of the value θ. Then, conditional on
T = t and Θ = θ, both d and δ determine the identical distribution over the
action space A; hence they have identical risk functions. Effectively, δ mimics,
by external randomisation, the known probabilistic way in which X arises fol-
lowing observation of T . So any decision rule is weakly dominated by a (possibly
randomised) decision rule based on T alone.
The following result, generalizing Theorem 5 of Chapter III, now follows from
Theorem 1:

Theorem 3 (Generalized Rao-Blackwell Theorem) In a convex decision prob-


lem, any decision rule is weakly dominated by a non-randomised decision rule
based on the minimal sufficient statistic alone.
2 The argument is easily extended to handle the case that d is a behavioural or randomised
decision rule.
4 Selection criteria

It would appear unwise to use an inadmissible rule. But typically there will be
many admissible rules, all necessarily mutually incomparable. Other criteria are
then needed to select between them. Here we consider two: Bayes and minimax.

4.1 Bayes rules

Definition 2 Let Π be a distribution for Θ over Ω. The Bayes risk of a decision
rule d is

    r(Π, d) := EΘ∼Π{R(Θ, d)}.

A decision rule d∗ is Bayes (with respect to Π) if it minimises r(Π, d) over d ∈ D. □

Bayesians do not need the added generality of randomised decision rules:

Lemma 4 If there exists a Bayes rule, there exists a non-randomised Bayes rule.

Proof. Suppose d∗ is a randomised Bayes rule. It arises by generating a
random D from some distribution ∆ over D0, independently of Θ ∼ Π. Then
r(Π, d∗) = EΘ∼Π R(Θ, d∗) = EΘ∼Π ED∼∆ R(Θ, D) = ED∼∆ EΘ∼Π R(Θ, D) = ED∼∆ r(Π, D).
But, for all d, r(Π, d∗) ≤ r(Π, d). If we had r(Π, d∗) < r(Π, d) for all d ∈ D0 we
would have a contradiction. So {d ∈ D0 : r(Π, d) = r(Π, d∗)} is non-empty, and
this is just the set of non-randomised Bayes rules. (In fact the distribution ∆
defining d∗ must pick a member of this set with probability 1.) □

Given a prior distribution Π for Θ, we now have two Bayesian approaches to


decision theory:

Extensive form    For any given data X = x, choose a Bayes act ax in the pos-
terior distribution:

    ax = arg min_{a∈A} E{L(Θ, a) | X = x}.    (2)

Normal form    First choose a (non-randomised) Bayes rule dΠ:

    dΠ = arg min_{d∈D0} EΘ∼Π R(Θ, d).    (3)

Then apply this to the observed data x.


Theorem 5 (Fundamental theorem of Bayesian decision theory)


Normal and extensive form Bayesian analyses deliver the same result.

Proof. Starting from an extensive form approach, we can construct the associ-
ated non-randomised decision rule δ, such that δ(x) = ax . Then for any decision
rule d ∈ D0
r(Π, d) − r(Π, δ) = E[E{L(Θ, d(X)) − L(Θ, δ(X)) | Θ}]
= E[E{L(Θ, d(X)) − L(Θ, δ(X)) | X}]
≥ 0 (4)
since E{L(Θ, d(x)) − L(Θ, δ(x)) | X = x} ≥ 0 for all x. So the decision rule δ con-
structed from an extensive form analysis is a normal form Bayes rule. Conversely,
if dΠ is normal form Bayes we have

    E[E{L(Θ, dΠ(X)) − L(Θ, δ(X)) | X}] ≤ 0

and since, by definition of δ, the integrand is non-negative, it must be 0 (with
probability 1). So dΠ(x) is a Bayes act in the posterior distribution given X = x. □

Corollary 6 The minimised Bayes risk r(Π, dΠ ) is E{H(ΠX )}, where H(Πx ) is
the Bayes loss for Θ ∼ Πx , the posterior distribution given X = x.

Since extensive form analysis requires minimisation only over the action space
A, rather than over the much bigger space D0 of non-randomised decision rules
as required for normal form analysis, it gives a much easier way to construct a
Bayes rule.
Any rule that dominates a Bayes rule must be Bayes for the same prior. Conse-
quently if a Bayes rule with respect to Π is unique, it is admissible.
More generally, a Bayes rule will be admissible under some additional conditions.
Here we note the following weaker result:

Theorem 7 Let dΠ be a Bayes rule with respect to Π, and d a decision rule that
dominates dΠ. Define A := {θ ∈ Ω : R(θ, d) < R(θ, dΠ)}. Then Π(Θ ∈ A) = 0.

Proof. Follows easily from R(θ, d) − R(θ, dΠ) ≤ 0 and EΠ{R(Θ, d) − R(Θ, dΠ)} ≥
0. □

Corollary 8 Suppose Ω is discrete, and Π(Θ = θ) > 0 for each θ ∈ Ω. Then the
Bayes rule dΠ is admissible.
4.2 Minimax rules

For those who don’t like priors, an “objective” (but arbitrary) criterion for se-
lecting a decision rule is “minimax”:

Definition 3 A decision rule d∗ is said to be minimax if it minimises the worst-
case risk:

    d∗ = arg inf_{d∈D} sup_{θ∈Ω} R(θ, d).    □

The minimax concept is fundamental to the theory of games against an intelligent


and malicious opponent. In a statistical context, where the opponent is “Nature”,
who chooses the value θ of Θ, it might be considered unduly pessimistic.
In general a minimax rule may need to be randomised. And it need not be
admissible, although any rule that dominates it must have the same worst-case
risk.

5 Inadmissibility of the multivariate normal mean

Here is one of the more surprising results in decision theory: the obvious (minimax,
maximum likelihood, best unbiased, best invariant, . . . ) estimator of a collection
of 3 or more normal means is inadmissible!

Theorem 9 (James-Stein) Suppose X = Ω = Rp, and the loss function is
quadratic: L(θ, a) = ‖θ − a‖². Our model for X = (Xi) has independent normal
components: Xi ∼ N(Θi, 1). Then the MLE of Θ is Θ̂ = X. However, for p > 2,
Θ̂ is dominated by

    Θ̃α = (1 − α‖X‖⁻²)X    (5)

for 0 < α < 2(p − 2).

Proof. We easily compute R(θ, Θ̂) ≡ p. Also

    R(θ, Θ̃α) = Eθ ‖X − θ − αX/‖X‖²‖²
             = Eθ{‖X − θ‖²} − 2α Eθ{XT(X − θ)/‖X‖²} + α² Eθ{1/‖X‖²}
             = p − 2α ∑_i Eθ{Xi(Xi − θi)/‖X‖²} + α² Eθ{1/‖X‖²}.

Now for any reasonable h, using integration by parts,

    Eθ{(Xi − θi) h(X)} = Eθ{h[i](X)}

where h[i](x) := ∂h(x)/∂xi. (This is Stein's lemma.) Take h(x) ≡ xi/‖x‖², so
h[i](x) ≡ ‖x‖⁻² − 2xi²/‖x‖⁴. Then

    ∑_i Eθ{Xi(Xi − θi)/‖X‖²} = ∑_i Eθ{‖X‖⁻² − 2Xi²/‖X‖⁴} = (p − 2) Eθ{‖X‖⁻²}.

Hence

    R(θ, Θ̃α) = p − α{2(p − 2) − α} Eθ{‖X‖⁻²}.    □

Each Θ̃α (0 < α < 2(p − 2)) dominates the minimax estimator Θ̂, so is itself
minimax (with worst-case loss p). For any θ, R(θ, Θ̃α) is minimised at α = p − 2,
so Θ̃p−2 dominates all other estimators in this class, which are thus inadmissible —
including, for α = 0, the MLE Θ̂ = X. But even Θ̃p−2 is inadmissible: it is
dominated by {1 − (p − 2)/‖X‖²}⁺ X, where z⁺ := max{z, 0} (and even this
estimator is inadmissible. . . ).
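The domination is easy to exhibit by simulation; the following Python sketch (ours; the
choices of p, θ and replication count are arbitrary) compares Monte Carlo risks of the MLE
X and of Θ̃p−2 under quadratic loss:

    import numpy as np

    rng = np.random.default_rng(2)
    p, reps = 10, 100_000
    theta = np.ones(p)                              # any fixed theta will do
    X = theta + rng.standard_normal((reps, p))      # X_i ~ N(theta_i, 1)

    mle_loss = ((X - theta) ** 2).sum(axis=1)
    shrink = 1.0 - (p - 2) / (X ** 2).sum(axis=1, keepdims=True)
    js_loss = ((shrink * X - theta) ** 2).sum(axis=1)
    print(mle_loss.mean(), js_loss.mean())          # ~p versus something smaller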

5.1 Hierarchical Bayes and Empirical Bayes3 analysis

“Shrinkage estimators” similar to Θ̃α of (5) arise in Bayesian analysis.
Suppose that, in that example, the prior distribution for Θ is specified in terms of an auxiliary
“hyperparameter” V. Given V = v, Θ ∼ N(0, vIp). We also need to assign some prior
distribution to V: this yields a simple hierarchical Bayesian model, which allows us to learn
from the data about the hyperparameter V of the first-stage prior distribution.
The Bayes estimator is E(Θ | X) = E{E(Θ | V, X) | X} = E{V/(1 + V) | X} X. We thus
need the posterior expectation of W := V/(1 + V). But conditional on V alone, we have
X ∼ N{0, (1 + V)Ip} = N{0, (1 − W)⁻¹Ip}. A sufficient statistic for W in these distributions
is ‖X‖², with distribution (1 − W)⁻¹χ²p. We could now find the posterior distribution of W
given ‖x‖² by combining this model with whatever prior is used for W.
In practice this “compleat Bayesian” analysis can be demanding. A short-cut (“empirical
Bayes”) is to replace the posterior expectation of W given ‖x‖² with some point estimate of W
based on ‖X‖². Since E(‖X‖² | W) = p(1 − W)⁻¹, one possible estimate is w̃ := (1 − p/‖x‖²)⁺.
Using this produces the estimator Θ̃p. Alternatively, note that, for p > 2, E(1/‖X‖² | W) =
(1 − W)/(p − 2), suggesting the estimate ŵ := {1 − (p − 2)/‖x‖²}⁺: this yields the positive-part
James-Stein estimator Θ̃⁺p−2.

3 The original meaning of this term, introduced by Herbert Robbins, has now been largely discarded in favour
of the usage described here.
6 Two simple hypotheses

The simplest decision problem is when both parameter- and action-space have
just two elements: say Ω = {θ0, θ1}, A = {a0, a1}. The loss function is L(θi, aj) =
lij , where we suppose l00 < l01 , l11 < l10 , so that the optimal act when Θ is
known to take value θi is ai (i = 0, 1). Although this formulation and most of
the analysis below treats θ0 and θ1 entirely symmetrically, in some approaches we
distinguish between the null hypothesis H0 : Θ = θ0 and the alternative hypothesis
H1 : Θ = θ1 , and treat these differently. Then a1 is interpreted as “reject H0 ”
and a0 as “accept H0 ” (or, at any rate, “do not reject H0 ”).
We do not restrict the sample-space X. The density (with respect to a suitable
underlying measure µ) of X given Θ = θi is denoted by pi (·). The likelihood ratio
is Λ := λ(X) := p1 (X)/p0 (X).
A behavioral decision rule is defined by a function φ : X → [0, 1], where φ(x) is
the probability of taking action a1 after observing X = x. Its risk function is
given by
    R(θ0, φ) = l00(1 − α) + l01α = l00 + (l01 − l00)α
    R(θ1, φ) = l10β + l11(1 − β) = l11 + (l10 − l11)β    (6)
where α := E{φ(X) | θ0} is the overall probability of taking action a1 given Θ =
θ0 , the type I error of φ; and β := 1 − E{φ(X) | θ1} is the overall probability of
taking action a0 given Θ = θ1 , the type II error .
We also term α the size of the test φ, and γ := 1 − β its power. Thus φ ≼ φ′ ⇐⇒
α ≤ α′ and β ≤ β′.
For a non-randomised rule φ takes values 0 and 1 only. It is equivalently deter-
mined by its critical region C = {x : φ(x) = 1}. Then α = Prob(X ∈ C | Θ =
θ0), β = Prob(X ∈ C̄ | Θ = θ1) (where C̄ denotes the complement X \ C of C in
X).

6.1 The risk set

The risk set is {(R(θ0 , φ), R(θ1 , φ)) : φ ∈ D}. From equations (6), this is a simple
transformation of the set S of error-pairs {(α, β) : φ ∈ D}, a subset of the unit
square. So without any real loss of generality, we may assume lij = 0 if i = j, 1
otherwise, when the risk-set is just S.
Since we are allowing randomisation, S is convex; further, if (α, β) ∈ S, corre-
sponding to φ, then (1 − α, 1 − β) ∈ S, corresponding to 1 − φ. Moreover S is
closed. Any set S ⊆ [0, 1] × [0, 1] with these properties can serve as a risk-set for
some experiment. An example is shown in Figure 1.

[Figure 1: Risk set for testing simple hypotheses]

The admissible rules are those corresponding to points on the lower boundary of
S.

6.2 Bayes rules

Let now Θ have prior distribution Π, where Π(Θ = θi ) = πi (i = 0, 1). The Bayes
risk of φ is then
r(Π, φ) = π0 α + π1 β. (7)
Finding a Bayes rule is thus equivalent to finding the rule that minimises a
linear combination of the two error rates with non-negative coefficients. From
Corollary 8 of Chapter VIII, when neither coefficient is 0 such a rule will be
admissible.4 In terms of the risk set, it will correspond to a point found by
sliding down a line of slope −π0/π1 until it is just about to leave S: see Figure 2.
Its intersection with S is then non-empty and completely contained within the
lower boundary of S; any rule corresponding to a point in this intersection will
be a Bayes rule with respect to Π.
4 Admissibility need not hold if we allow zero coefficients. Thus the rule φ ≡ 0, corresponding
to the point (0, 1) ∈ S, is Bayes for prior π1 = 1, π0 = 0; but it will be dominated if there exists
any other, non-trivial, rule with α = 0.

[Figure 2: Bayes rules are admissible]

The figure suggests that any point on the lower boundary can be characterised
in this way and is thus a Bayes rule: this is indeed the case, and can be proved
using the supporting hyperplane theorem for convex sets. We end up with a (1, 1)
correspondence between admissible rules and Bayes rules so long as both error
probabilities are positive.

Finding Bayes rules

Finding a Bayes rule—equivalently, minimising a linear combination of the error
rates—is really quite easy.


Theorem 10 Let k ≥ 0. A rule φ minimises kα + β if and only if:

    p1(x) > k p0(x) ⇒ φ(x) = 1
    p1(x) < k p0(x) ⇒ φ(x) = 0.    (8)

With k interpreted as π0 /π1 , this result is just a special case of Theorem 5,


since we can interpret the required minimiser as solving a normal form Bayes
analysis, and (8) as the extensive form solution to that problem. (Note that this
corresponds to choosing the act with the higher posterior probability.)
We call a test of the form (8) a likelihood ratio test, with cut-off k, LRT(k).
When p0 (x) > 0 on X, (8) is equivalent to:
    λ(x) > k ⇒ φ(x) = 1
    λ(x) < k ⇒ φ(x) = 0.    (9)

When {x : p1 (x) = k p0 (x)} has measure 0, an LRT(k) is essentially unique.


Note that, since (0, 1) ∈ S (it corresponds to φ ≡ 0), a LRT(0) must have β = 0.
We can similarly allow k = ∞: a LRT(∞) has α = 0.

Corollary 11 Any LRT(k) with 0 < k < ∞ is admissible.

6.3 Neyman-Pearson approach

The following simple Corollary (which was proved directly in STATISTICS IB)
to the Bayesian result of Theorem 10 is the basis for the Neyman-Pearson (N-P)
frequentist approach to hypothesis testing.

Theorem 12 (Neyman-Pearson lemma) Let φ be a likelihood ratio test, with


associated error probabilities (α, β) and power γ = 1 − β; and let φ′ be any test.
Then
α′ ≤ α ⇒ γ ′ ≤ γ.

This result can be expressed as: Any LRT is most powerful for its size. Note that
a most powerful test need not be admissible: we might have α′ < α and γ ′ = γ.
This can happen for a LRT(0) (with γ = 1).
The N-P approach is to preselect the desired size α (e.g. at 5%), and adjust the
cut-off k to yield a LRT of this size—which will then have maximum power γ
subject to size ≤ α. Note that this introduces an asymmetry into the treatment
of the two hypotheses.
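By way of illustration (our sketch; the hypotheses, size and simulation counts are arbitrary),
the N-P recipe can be carried out by simulation when the null distribution of Λ is awkward
analytically:

    import numpy as np

    rng = np.random.default_rng(3)
    alpha = 0.05

    # H0: X ~ N(0, 1) versus H1: X ~ N(1, 1); here lambda(x) = exp(x - 1/2).
    lam = lambda x: np.exp(x - 0.5)

    x0 = rng.normal(0.0, 1.0, 1_000_000)        # X under H0
    k = np.quantile(lam(x0), 1.0 - alpha)       # cut-off giving size alpha
    x1 = rng.normal(1.0, 1.0, 1_000_000)        # X under H1
    power = np.mean(lam(x1) > k)                # gamma = 1 - beta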
Chapter IX

Multivariate Analysis

1 Linear multivariate analysis

Multivariate analysis deals with relationships between several variables, consid-


ered on an equal footing.
Let V1 , . . . , Vp be the basic variables of interest — for example, V1 = H :=
“log(height in meters)”, V2 = W := “log(weight in kg)”, . . . . (In this course
we are only going to deal in linear operations, so if we also wanted to include
e.g. untransformed height we would need to introduce it as a new variable). The
string of basic variables (V1 , . . . , Vp ) will be regarded as a (1 × p) row-vector V .
We can construct new variables by linear operations, e.g. B := W − 2H is the
variable “log(body mass index)”. We shall consider all such linearly constructed
variables, of the form W = V a, where a is a (p × 1) column vector, as essentially
on a par with the basic ones. We obtain a p-dimensional vector space V of
variables, the basic variables V1 , . . . , Vp forming a basis of V.
An individual will have a value Xj — considered as random — for any basic
variable Vj . These will be strung into a (1 × p) random row-vector X, with some
multivariate distribution P , say. The value for any variable W = V a ∈ V is thus
XW = Xa, which will have an induced univariate distribution PW .

Lemma 1 The joint distribution PV is determined by its collection of univariate


marginals {PW : W ∈ V}.1

Proof. P is determined by its multivariate characteristic function φP : Rp → C;
but φP(a) = EP(exp iXa) is determined by the distribution PW of XW = Xa. □
1 Note: It is not enough to consider only the basic variables (Vj).

Let µj = E(Xj ). The expectation of X is the vector µ = (µ1 , . . . , µp ). Then for


W = V a, µa is the expectation of XW ; and clearly µ is fully determined by all
such univariate expectations.
Let σjk = cov(Xj , Xk ) = E{(Xj − µj )(Xk − µk )}. In particular, σjj = var(Xj ).
The (p × p) matrix Σ = (σjk ) is the dispersion or covariance matrix of X. Thus
Σ = cov(X T , X) := E{(X − µ)T (X − µ)}. For W = V a, var(XW ) = aT Σa.
Since this must be non-negative for any a ∈ Rp , Σ must be (symmetric and)
non-negative definite. It is positive definite if and only if, whenever a ≠ 0, the
univariate marginal PW of W = V a is non-trivial (i.e., not a 1-point distribution).
In this case K := Σ−1 is termed the precision matrix (or concentration matrix )
of X.
In accordance with Lemma 1, we can recover Σ from the univariate marginals:
4σjk = var(Xj + Xk ) − var(Xj − Xk ).
Let Y = XA, Z = XB, with A (p × q), B (p × r), be the values for new sets of
variables V A, V B. Then cov(Y T , Z) = cov(AT X T , XB) = AT ΣB. In particular
the dispersion matrix of Y = XA is AT ΣA.
The correlation between Xj and Xk is

    ρjk = corr(Xj, Xk) := cov(Xj, Xk) / √{var(Xj) var(Xk)} = σjk / √(σjj σkk).

By Cauchy-Schwarz, any correlation lies in [−1, 1].

2 Multivariate Normal Distribution

Definition 1 We say that X has a multivariate normal (MVN) distribution if,


for any constructed variable W = V a, its measurement XW = Xa has a univari-
ate normal distribution. 2

It immediately follows that, if X (1 × p) is MVN, then so is any linear transfor-


mation Y = XA, for fixed A (p × q).
Let the mean-vector and dispersion matrix of X be µ and Σ. Then the distribu-
tion of XW must be the univariate normal distribution N (µa, aT Σa). We thus
write
X ∼ Np (µ, Σ)
if, for any a ∈ Rp ,


Xa ∼ N (µa, aT Σa). (1)
The following is then immediate:

Lemma 2 If X ∼ Np (µ, Σ), and Y = XA for fixed A (p × q), then Y ∼


Nq (µA, AT ΣA).

By Lemma 1 there can be at most one distribution satisfying (1) for all a ∈ Rp.
Indeed, because the characteristic function of N(m, v) is given by t ↦ exp(imt −
½vt²), that of N(µ, Σ) (if such a distribution exists) must be given by

    φ(a) = exp(iµa − ½ aT Σa).    (2)

We note that, since a linear combination of independent normal variables is nor-


mal, if Z = (Z1 , . . . , Zp ) with the (Zj ) independent and identically distributed
N (0, 1), then Z ∼ Np (0, Ip ). Thus Np (0, Ip ) exists. Now let µ ∈ Rp , and let Σ
be an arbitrary symmetric non-negative definite (p × p) matrix. We can write
Σ = AT A for some A (p × p). Defining X = µ + ZA, where Z ∼ Np (0, Ip ), it is
readily seen that X ∼ Np (µ, Σ). Thus for any (1 × p) mean vector µ and (p × p)
dispersion matrix Σ, a p-variate MVN distribution with these attributes exists,
and is fully determined by them.

2.1 Density

Because it has independent and identically distributed N(0, 1) entries, the joint
density for Z ∼ Np(0, Ip) is readily computed:

    pZ(z) = (2π)^{−p/2} exp{−½ z zT}.    (3)

Now let X ∼ Np(µ, Σ), with Σ positive definite. We can write Σ = AT A with
A (p × p) non-singular; then X = µ + ZA, where Z = (X − µ)A⁻¹ ∼ Np(0, Ip).
The Jacobian of the transformation x → z is (dxj/dzk) = AT, with determinant
(det Σ)^{1/2}. Applying the multivariate change-of-variables formula to (3), we obtain
the density of Np(µ, Σ):

    pX(x) = (2π)^{−p/2} (det Σ)^{−1/2} exp{−½ (x − µ)Σ⁻¹(x − µ)T}
          = (2π)^{−p/2} (det Σ)^{−1/2} exp{−½ tr Σ⁻¹(x − µ)T(x − µ)}.    (4)
What if Σ is only positive semi-definite (and hence singular)? In that case there exists a ≠ 0
with Σa = 0, and so var(Xa) = 0. So the distribution of X is confined to the affine subspace
{x : (x − µ)a = 0}, and so can not have a density with respect to Lebesgue measure on
Rp . Nevertheless, we can still have a perfectly satisfactory MVN distribution according to our
definition.

2.2 Bivariate case


When p = 2 we will usually write X = (X, Y), µ = (µX, µY), σ11 = σXX = σX²,
σ22 = σYY = σY², and ρ12 = ρ.
We can consider the joint distribution as made up from the marginal distribution
of X, and the conditional distribution of Y , given X:

Lemma 3 Let (X, Y ) ∼ N2 (µ, Σ). Then


X ∼ N (µX , σXX ) (5)
Y | X ∼ N (α + βX, σY Y ·X ) (6)
where

    β = σXY/σXX    (7)
    α = µY − βµX    (8)
    σYY·X = σYY − σXY²/σXX.    (9)

Proof. Property (5) is immediate from the definition of multivariate (here


bivariate) normality.
Let R = Y − α − βX. We readily compute E(R) = 0, var(R) = σYY·X,
cov(X, R) = 0. In particular, R ⊥⊥ X so that, given X, R ∼ N(0, σYY·X). Prop-
erty (6) follows. □

It can be checked that 1/σYY·X = kYY, the corresponding entry of the precision
matrix K = Σ⁻¹ (when this exists).
The function (of x) E(Y | X = x) = α + βx is the regression (line) of Y on X;
β is the regression coefficient. The variable2 Y ∗ X := E(Y | X) = α + βX is
the linear predictor of Y based on X. The difference2 Y · X := Y − Y ∗ X
(= R) is the residual , and its variance σY Y ·X is the residual variance, of Y (after
allowing/adjusting for X).
2 This notation is non-standard!
More generally, for vectors X and Y, we have linear predictor

    Y ∗ X = E(Y | X) = µY + (X − µX) ΣXX⁻¹ ΣXY,

with ΣXY = cov(XT, Y), etc.; and residual vector Y · X = Y − Y ∗ X.

2.3 Partial covariance and correlation

Now suppose p = 3, X = (X, Y, Z) etc. We can adjust both Y and Z for


X, as above, yielding residuals Y · X, Z · X. We compute cov(Y · X, Z · X) =
σY Z·X := σY Z − σXY σXZ /σXX , the residual or partial covariance between Y
and Z, after allowing for X. Correspondingly the residual/partial correlation is

    ρYZ·X = σYZ·X / √(σYY·X σZZ·X).
The vector extension (when ΣXX⁻¹ exists) is

    ΣYZ·X = cov{(Y · X)T, Z · X} = ΣYZ − ΣYX ΣXX⁻¹ ΣXZ.

In particular, the residual/partial dispersion matrix of Y, allowing for X, is
ΣYY·X = ΣYY − ΣYX ΣXX⁻¹ ΣXY. Moreover, if Σ is the dispersion matrix of (X, Y),
and K = Σ⁻¹ exists, then ΣYY·X = KYY⁻¹.

3 Data

In applications we will have measured the value of each variable Vj on each of


a collection of individuals ξ1 , . . . , ξn . Let xij be the measured value for variable
Vj on individual ξi . The (n × p) data matrix is X = (xij ).3 Its ith row xi =
(xi1 , . . . , xip ) contains the measurements, for all variables, on individual ξi ; while
its jth column, vj , contains all the data on variable Vj , across individuals; more
generally, all the data on variable W = V a are contained in the (n × 1) vector
w = Xa.
The average data string, across individuals, is x̄ := n⁻¹ Σᵢ₌₁ⁿ xi = n⁻¹1ᵀX, where
1 is the (n × 1) vector of 1s. Correspondingly the average value for variable
W = Va is w̄ = x̄a = n⁻¹1ᵀXa.
Let H := n⁻¹J, with J = 11ᵀ the (n × n) matrix of 1s. Then every row of X̄ :=
HX is x̄. The (n × n) matrix H is a (rank 1) projection matrix: H = Hᵀ = H².
³We have so far tried to use upper-case symbols for variables, and lower-case symbols for
their values. In dealing with matrices, however, it is common to use upper case for a matrix,
and lower case for its entries. A notational conflict arises when we deal with random matrices;
we will not be able to resolve it consistently.

Let Π := I − H. Again, Π is an (n × n) orthogonal projection matrix, this time of
rank n − 1; and HΠ = ΠH = 0.
The (mean-)corrected data matrix is Xᶜ := ΠX, and has ith row xi − x̄. The
mean-corrected data on any variable W = Va are contained in the (n × 1) vector
wᶜ = Xᶜa = ΠXa. We see that linear operations on variables commute with
linear operations on individuals.
The (p × p) (corrected) sum-of-squares-and-products (SSP) matrix is

    Sᶜ := (Xᶜ)ᵀXᶜ = XᵀΠX.    (10)

Then aᵀSᶜa is the corrected sum of squares Σᵢ₌₁ⁿ (wi − w̄)² for variable W = Va.
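In code, Sᶜ is obtained simply by centring the columns of the data matrix; forming ΠX explicitly is unnecessary. A Python/NumPy sketch (illustrative, not from the original notes):

    import numpy as np

    def mean_and_ssp(X):
        """Return xbar = n^{-1} 1^T X and the corrected SSP matrix S^c of (10)."""
        xbar = X.mean(axis=0)
        Xc = X - xbar                    # the corrected data matrix Pi X
        return xbar, Xc.T @ Xc

Dividing Sᶜ by n − 1 then gives the unbiased dispersion estimator of § 3.1 below.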

3.1 Random sampling

If the individuals ξ1 , . . . , ξn can be regarded as randomly sampled from some


population, then we can treat the n rows of X as independent and identically
distributed random vectors X 1 , . . . , X n from some p-variate distribution P . Cor-
respondingly each entry Xij of X is random, with independence across different
i (but not necessarily across j). If the mean-vector and dispersion matrix of P
are µ and Σ, then E(Xij | µ, Σ) = µj , cov(Xij , Xi′ j ′ | µ, Σ) = δii′ σjj ′ .

Consider any variable W = Va. Its vector of values (W1, . . . , Wn)ᵀ = Xa are
independent and identically distributed, each with expectation µW = µa and
variance σW² = aᵀΣa. From univariate theory, we know that unbiased estimators
of µW and σW² are W̄ = X̄a and (n − 1)⁻¹ Σᵢ₌₁ⁿ (Wi − W̄)² = (n − 1)⁻¹ aᵀSᶜa. It
follows that we have multivariate unbiased estimators:

    E(X̄) = µ    (11)
    E{(n − 1)⁻¹ Sᶜ} = Σ.    (12)

If µ is known, without loss of generality we take µ = 0: then an unbiased estimator
of Σ is n⁻¹S, where S := Σᵢ xiᵀxi = XᵀX is the uncorrected sum-of-squares-and-
products matrix.

4 Normal dispersion model

Suppose now that the rows (X i : i = 1, . . . , n) of the data matrix X are inde-
pendent and identically distributed from the normal distribution N (0, Σ) with

known mean-vector (taken as 0 without loss of generality), so that the only un-
known parameter is the dispersion matrix Σ. In particular, the entries (Xij ) are
jointly normal, with E(Xij ) = 0, cov(Xij , Xi′ j ′ ) = δii′ σjj ′ .
We shall suppose Σ non-singular. From (4), the overall joint density is

    p(X | Σ) = (2π)^{−np/2} (det Σ)^{−n/2} exp{−½ tr Σ⁻¹S}    (13)
             = (2π)^{−np/2} (det Σ)^{−n/2} exp{−½ (Σᵢ kii sii + 2 Σᵢ<ⱼ kij sij)}    (14)

where K := Σ⁻¹ is the precision matrix.


We see that (14) is of exponential family form. We can take S (more precisely, its
upper triangle) as the natural sufficient statistic; then the mean-value parameter
is nΣ, and the natural parameter is essentially4 K. The likelihood equation
equates the natural statistic and the mean-value parameter, yielding MLE Σ b =
n−1 S, which is unbiased.

4.1 Wishart distribution

The distribution of S = XᵀX in the above model is called the Wishart distribu-
tion. It depends on Σ and n, and we write S ∼ Wp(n; Σ). The parameter Σ is
the scale parameter, and n is the degrees of freedom. When Σ = Ip we obtain
the standard Wishart distribution on n degrees of freedom.
From univariate theory we have W1(n; φ) ≡ φ χ²n.
The following result is immediate from Lemma 2:

Lemma 4 Let S ∼ Wp(n; Σ), and A (p × q) be a fixed matrix. Then AᵀSA ∼
Wq(n; AᵀΣA).

Corollary 5 Suppose S ∼ Wp(n; Σ). Then for fixed a ∈ Rp,

    aᵀSa / aᵀΣa ∼ χ²n.
⁴To be precise: the values {−½ kii : 1 ≤ i ≤ p}, {−kij : 1 ≤ i < j ≤ p}. Alternatively we can
take K itself as the natural parameter; the associated natural sufficient statistic then requires
compensating adjustments to the entries of S.

Corollary 6 Sii/σii ∼ χ²n.
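The definition S = XᵀX gives an immediate simulation recipe, which can be used to check results such as Corollary 6 numerically. A Python/NumPy sketch (illustrative only; the parameter values are arbitrary):

    import numpy as np

    def wishart_sample(n, Sigma, rng):
        """Draw S ~ W_p(n; Sigma) as X^T X, rows of X i.i.d. N_p(0, Sigma)."""
        L = np.linalg.cholesky(Sigma)
        X = rng.standard_normal((n, Sigma.shape[0])) @ L.T
        return X.T @ X

    rng = np.random.default_rng(0)
    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
    draws = [wishart_sample(10, Sigma, rng)[0, 0] / Sigma[0, 0] for _ in range(5000)]
    print(np.mean(draws))   # close to n = 10, since S_11/sigma_11 ~ chi^2_n has mean n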

Lemma 7 Let S be (p + q) × (p + q), with the first p rows/columns corresponding
to variables Z, and the last q rows/columns corresponding to variables Y. If
S ∼ Wp+q(n; Σ) (n ≥ p), then SYY·Z := SYY − SYZ SZZ⁻¹ SZY ∼ Wq(n − p; ΣYY·Z).

(Proof relegated to Example Sheet.)

Corollary 8 Suppose S ∼ Wp(n; Σ). Then for fixed a ∈ Rp,

    aᵀΣ⁻¹a / aᵀS⁻¹a ∼ χ²n−p+1.

Proof. Consider first the case a = e := (0, . . . , 0, 1)ᵀ. Then aᵀS⁻¹a is the
(p, p) entry of S⁻¹, and its inverse is SYY·Z, where Z corresponds to the first
p − 1 variables and Y to the pth. Similarly for Σ. From Lemma 7 the result holds
in this case.
For the general case, let S∗ := HᵀSH where H is non-singular with Hᵀa = e.
Then S∗ ∼ Wp(n; Σ∗) with Σ∗ = HΣHᵀ, and the (p, p) entry of (S∗)⁻¹ is aᵀS⁻¹a
(and similarly for Σ). 2

Wishart density

When we speak of the density of S, this is to be interpreted as the joint density


of its p(p + 1)/2 mathematically independent entries (sij : i ≤ j), with respect to
Lebesgue measure over the set of values for which S is non-negative definite.
Let S ∼ Wp (n; Σ). When n < p, S is singular, having rank n with probability 1,
so does not have a density.
For n ≥ p, S is almost surely of full rank p, and its density then exists and is
given by:

    pS(S | Σ) = C(p, n) (det Σ)^{−n/2} (det S)^{(n−p−1)/2} exp{−½ tr Σ⁻¹S}.    (15)

The normalisation constant is given by

    C(p, n)⁻¹ = 2^{np/2} π^{p(p−1)/4} ∏ᵢ₌₁ᵖ Γ{½(n + 1 − i)}.    (16)

In particular, the density of the standard Wishart distribution W(n; Ip) is

    pS(S | Ip) = C(p, n) (det S)^{(n−p−1)/2} exp{−½ tr S}.    (17)
2

Reduction to standard form

The following argument shows how, if we can establish (17), we can easily deduce
(15).
Since S is a sufficient statistic for the family of densities (13), for any data
the likelihood based on S is proportional to that based on the full data X (see
Corollary 3 of Chapter II). So
    pS(S | Σ)/pS(S | Ip) = pX(X | Σ)/pX(X | Ip)    (18)
                         = (det Σ)^{−n/2} exp{−½ (tr Σ⁻¹S − tr S)}.    (19)

Substituting (17) for the left-hand side denominator, we obtain (15).

Bivariate case
We here derive (17) for the case p = 2. The ith row of the data matrix X will be written as
(Xi, Yi).
We have U := SXX ∼ χ²n by Corollary 6. We now argue conditionally on the X's (so these
can be regarded as fixed), and consider the zero-intercept sample regression of Y on X, estimated
from the data pairs (X1, Y1), . . . , (Xn, Yn). From standard regression theory, the estimated
regression coefficient is β̂ := SXY/SXX, and the unbiased estimator of the residual variance of
Y, about its regression line, is s² := SYY·X/(n − 1), with SYY·X := SYY − SXY²/SXX. Since,
with Σ = I2, the population counterparts of these quantities are 0 and 1, regression distribution
theory tells us that (still conditional on the X's) β̂ and s² are independent, with respective
distributions N(0, SXX⁻¹) and χ²n−1/(n − 1). Hence if we define V := SXX^{1/2} β̂ = SXY/SXX^{1/2},
W := SYY·X, we have V ∼ N(0, 1), W ∼ χ²n−1, independently. Moreover, since this joint
distribution of (V, W) holds conditionally on the X's, but does not in fact depend on their
values, while U := SXX is entirely determined by those values, we must have (V, W) ⊥⊥ U; and
we know U ∼ χ²n. The joint density of (U, V, W) is thus:

    p(U,V,W)(u, v, w) = {Γ(½n) 2^{n/2}}⁻¹ e^{−u/2} u^{n/2−1} · (2π)^{−1/2} e^{−v²/2}
                        · {Γ(½n − ½) 2^{(n−1)/2}}⁻¹ e^{−w/2} w^{(n−1)/2−1}.    (20)

Now change variables to SXX = U, SXY = U^{1/2}V, SYY = W + V². The Jacobian of the
transformation is J := det{∂(sXX, sXY, sYY)/∂(u, v, w)} = u^{1/2}, so the transformed density is

    pS(S) := pS(sXX, sXY, sYY) = p(U,V,W)(u, v, w)/J
           = C (det S)^{(n−3)/2} exp{−½ tr S}    (21)

where

    C⁻¹ = 2ⁿ √π Γ(½n) Γ(½n − ½).    (22)

4.2 Sample correlation


The sample correlation coefficient based on S is rij := Sij/√(Sii Sjj). When S ∼
Wp(n; Σ), r12 depends only on the leading (2 × 2) submatrix S2 of S, which
has distribution W(n; Σ2), Σ2 being the leading (2 × 2) submatrix of Σ. So to
investigate the distribution of a sample correlation coefficient r we can restrict
to the case p = 2, r = r12. Let the ith row of X be (Xi, Yi). Then
r = SXY/√(SXX SYY).
Note that the value, and hence the distribution, of r is unchanged if we first trans-
form to Xi∗ := Xi/√σXX, Yi∗ := Yi/√σYY. This has the effect of transforming Σ to

    Σ∗ = [ 1  ρ ]
         [ ρ  1 ]    (23)

where ρ is the population correlation between X and Y. Hence the distribution
of r depends only on ρ (and n); and to investigate it we may as well take Σ to
have the form in (23).
Derivation of the distribution of r is in principle straightforward, starting from
the joint density (15) for (sXX, sXY, sYY) (for n ≥ 2), making a multivariate
change of variables sX = sXX^{1/2}, sY = sYY^{1/2}, r = sXY/(sX sY), and integrating
out sX and sY. But the integral cannot be done in closed form. The final result can
be expressed in various ways, one of which is

    p(r | n; ρ) = {(n − 1)/π} (1 − ρ²)^{n/2} (1 − r²)^{(n−3)/2} ∫₀^∞ (cosh u − ρr)⁻ⁿ du.    (24)

We shall derive (24) for the special case ρ = 0, when we can take Σ = I2, so
that the Xs and Ys are independent normal with unit variance. (In this case the
otherwise problematic integral term in (24) is just a constant.)
Introduce

    t := (n − 1)^{1/2} r/√(1 − r²).    (25)
Then

    t = SXX^{1/2} (SXY/SXX) / √{SYY·X/(n − 1)}
      = SXX^{1/2} β̂ / s

where β̂ = SXY/SXX is the sample (zero-intercept) regression coefficient of Y on
X, β = 0 is the corresponding population value, and s² is the sample residual
mean square (on n − 1 degrees of freedom) for Y about its sample regression
on X. We recognise t as Student's t-statistic for testing the (true) hypothesis
β = 0. Consequently, from standard regression theory, conditionally on the X's
and hence also unconditionally,

    t ∼ tn−1.    (26)

Using the known form of the density for tn−1,

    p(t) = {(n − 1)π}^{−1/2} {Γ(½n)/Γ(½n − ½)} {1 + t²/(n − 1)}^{−n/2},    (27)

and applying the change-of-variables formula with (25), delivers the density of r
on (−1, 1):

    p(r | n; 0) = {Γ(½n)/(Γ(½n − ½)√π)} (1 − r²)^{(n−3)/2}.    (28)

But to test the null hypothesis ρ = 0 it is more straightforward first to trans-
form the observed value of r to t using (25), and refer this to tables of the tn−1
distribution.
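A small Python sketch of this recipe, for the known-zero-mean model of this chapter (illustrative only; the function name is ours). With mean-corrected data one would instead use the corrected sums of squares and, as noted in § 5.1 below, decrement the degrees of freedom by one:

    import numpy as np
    from scipy import stats

    def corr_test_zero_mean(x, y):
        """Test rho = 0: transform r to t via (25), refer to t_{n-1}."""
        n = len(x)
        r = np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))
        t = np.sqrt(n - 1) * r / np.sqrt(1.0 - r**2)
        p_value = 2 * stats.t.sf(abs(t), df=n - 1)
        return r, t, p_value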

It would be nice if an argument similar to that based on (18) could be used to transfer the "null"
sampling distribution (28) for r to the case ρ ≠ 0, but unfortunately this does not work, because
r is not sufficient for the model with Σ of the form (23).

Asymptotic distribution

It can be shown (see e.g. Anderson, 2003, §4.2.3) that, as n → ∞,

    √n (r − ρ) →d N(0, (1 − ρ²)²).    (29)

Now consider z := tanh⁻¹ r = ½ log{(1 + r)/(1 − r)}, ζ := tanh⁻¹ ρ. Since, for
large n, r is close to ρ, and d(tanh⁻¹ r)/dr = 1/(1 − r²), we can approximate
z − ζ ≈ {1/(1 − ρ²)}(r − ρ), so that z has asymptotic expectation ζ and constant
asymptotic variance 1/n, and we have

    √n (z − ζ) →d N(0, 1).    (30)

Formula (30) yields a better approximation than (29), and can be used as the basis
of approximate inference about ζ, and thereby about ρ = (e^ζ − e^{−ζ})/(e^ζ + e^{−ζ}) = tanh ζ.
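For example, (30) gives an approximate confidence interval for ρ by inverting the transform. A Python sketch under the assumptions above (illustrative; some texts use variance 1/(n − 3) in the mean-corrected case, but we follow the 1/n form of (30)):

    import numpy as np
    from scipy import stats

    def rho_confint(r, n, level=0.95):
        """Approximate CI for rho: z = atanh(r) is roughly N(atanh(rho), 1/n)."""
        z = np.arctanh(r)
        half = stats.norm.ppf(0.5 + level / 2) / np.sqrt(n)
        return np.tanh(z - half), np.tanh(z + half)

    print(rho_confint(r=0.6, n=50))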

Sample partial correlation

Sample partial correlations can be constructed from S in the same way as popu-
lation partial correlations are constructed from Σ. On account of Lemma 7, the
distribution theory (both exact and asymptotic) is exactly as above, but with the
degrees of freedom decremented by p, the number of variables adjusted for.

5 Normal location-dispersion model

More realistically, we now consider the case of unknown mean-vector M as well
as unknown dispersion matrix Σ (assumed non-singular). From (4), the overall
joint density is then

    p(X | µ, Σ) = (2π)^{−np/2} (det Σ)^{−n/2} exp{−½ Σᵢ₌₁ⁿ (xi − µ)Σ⁻¹(xi − µ)ᵀ}
                = (2π)^{−np/2} (det Σ)^{−n/2} exp{−½ nµΣ⁻¹µᵀ} exp{nµΣ⁻¹x̄ᵀ − ½ tr Σ⁻¹S}.    (31)

Again (31) is of exponential family form. The natural sufficient statistic can
be taken as (X̄, S). Since E(Xi) = µ and E(XiᵀXi) = Σ + µᵀµ, the mean-
value parameter is (µ, n(Σ + µᵀµ)). Solving the likelihood equations, obtained
by equating the natural sufficient statistic with the mean-value parameter, yields
maximum likelihood estimates:

    µ̂ = x̄    (32)
    Σ̂ = n⁻¹Sᶜ.    (33)

However, the biased estimator n⁻¹Sᶜ is usually replaced with the unbiased esti-
mator (n − 1)⁻¹Sᶜ.

5.1 Sampling distribution of sufficient statistic

The statistic (X̄, Sᶜ) is a one-to-one function of the natural sufficient statistic
(X̄, S), hence minimal sufficient (although not itself a natural sufficient statistic).

Lemma 9 X̄ and Sᶜ are independent.

Proof. The n × n matrix of covariances between the jth and kth columns of X
is σjk In. Hence the vector of covariances between the jth column of Xᶜ = ΠX
and the kth entry of X̄ = n⁻¹1ᵀX is n⁻¹Π σjk In 1 = 0, since Π1 = 0. Thus
every entry of Xᶜ is uncorrelated with every entry of X̄. Since the entries of X
are jointly normally distributed, and we are dealing with linear transformations,
zero correlation implies independence: X̄ ⊥⊥ Xᶜ. Hence, since Sᶜ = (Xᶜ)ᵀXᶜ,
X̄ ⊥⊥ Sᶜ. 2

The distribution of X̄ is readily found:

    X̄ ∼ Np(µ, Σ/n).    (34)

To find the distribution of Sᶜ we argue as follows.
Construct an orthonormal basis a1, . . . , an of Rn with a1 = n^{−1/2}1. Let A (n ×
(n − 1)) have columns a2, . . . , an. It is easily checked that AᵀA = In−1, while
AAᵀ = Π. Let X∗ := AᵀX, so that X∗ is (n − 1) × p and Sᶜ = (X∗)ᵀX∗. Because
E(X) = 1µ and Aᵀ1 = 0, X∗ has expectation 0. Moreover it can be checked
that the n − 1 rows of X∗ all have dispersion Σ, and are uncorrelated — hence
(with normality) independent. It now follows from the definition of the Wishart
distribution that

    Sᶜ ∼ Wp(ν; Σ)    (35)

with ν = n − 1.
Thus if we base inference about Σ on Sᶜ, we can use all the results found for the
normal dispersion model, after first replacing n by the new degrees of freedom ν
throughout.

5.2 Inference for mean-vector

In the univariate case we can base inference for the single mean M on the Student's
t pivot:

    t := n^{1/2} (X̄ − M)/S̄XX^{1/2},    (36)

where S̄XX := Σᵢ (Xi − X̄)²/(n − 1) is the usual unbiased estimator of the variance
of X, on ν = n − 1 degrees of freedom. Then t has the tν distribution, defined as
that of Z/√(V/ν) when Z ∼ N(0, 1) and V ∼ χ²ν, independently. Alternatively we
can use F := t² = n(X̄ − M)²/S̄XX, which has the F distribution F¹ν, where Fᵖν is
defined as the distribution of (U/p)/(V/ν) when U ∼ χ²p, V ∼ χ²ν independently.
The generalisation of t² to the multivariate case is Hotelling's T²:

Hotelling's T²

Lemma 10 Suppose X ∼ Np(0, Ip) and S ∼ Wp(ν; Ip) (ν > p), independently.
Define S̄ := S/ν, T² := X S̄⁻¹Xᵀ. Then {(ν − p + 1)/(νp)} T² ∼ Fᵖν−p+1.

Proof. Let U := X Xᵀ, V := νU/T². Then U ∼ χ²p. Also, by Corollary 8,
conditionally on X, V ∼ χ²ν−p+1. Since this conditional distribution does not
depend on X, in fact V is independent of X, and therefore of U. We thus have

    T² = {νp/(ν − p + 1)} (U/p)/{V/(ν − p + 1)}

with U ∼ χ²p, V ∼ χ²ν−p+1 independently; and so, from the definition of the F
distribution, (U/p)/{V/(ν − p + 1)} ∼ Fᵖν−p+1. 2

The (scaled F) distribution of T² is termed Hotelling's T²-distribution with pa-
rameters p and ν, T²(p, ν).

Corollary 11 Let X ∼ Np(0, Σ), S ∼ Wp(ν; Σ) (ν > p), independently, where
Σ is positive definite. Then X S̄⁻¹Xᵀ ∼ T²(p, ν).

Proof. Let A be non-singular with AAᵀ = Σ⁻¹. Define X∗ := XA, S∗ :=
AᵀSA. Then X∗ ∼ Np(0, Ip), S∗ ∼ Wp(ν; Ip), and X S̄⁻¹Xᵀ = X∗(S̄∗)⁻¹X∗ᵀ.
2

Returning to our location-dispersion model, define

    T² := n(X̄ − M)(S̄ᶜ)⁻¹(X̄ − M)ᵀ    (37)

where S̄ᶜ := Sᶜ/(n − 1). This is the one-sample Hotelling's T².
Since, given the parameters (M, Σ), n^{1/2}(X̄ − M) ∼ Np(0, Σ) and is independent
of Sᶜ ∼ Wp(n − 1; Σ), we immediately deduce that T² ∼ T²(p; n − 1). Since this
distribution does not depend on the values of (M, Σ), T² is a pivotal quantity.
We can thus test the hypothesis M = µ (with Σ unspecified) by referring n(x̄ −
µ)(S̄ᶜ)⁻¹(x̄ − µ)ᵀ to tables of the T²(p; n − 1) distribution; equivalently, we refer
{(n − p)/p} n(x̄ − µ)(Sᶜ)⁻¹(x̄ − µ)ᵀ to tables of Fᵖn−p. A (1 − α)-level confidence
region for M can be formed as

    {µ : {(n − p)/p} n(x̄ − µ)(Sᶜ)⁻¹(x̄ − µ)ᵀ ≤ Fᵖn−p(α)}

where Fᵖn−p(α) is the upper α-point of the Fᵖn−p distribution (i.e., Prob{Fᵖn−p >
Fᵖn−p(α)} = α). This will be an ellipsoid centred on x̄, with principal axes aligned
with those of Sᶜ.
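As a computational illustration, a Python sketch of the one-sample test (not part of the original notes; the function name is ours):

    import numpy as np
    from scipy import stats

    def hotelling_one_sample(X, mu0):
        """One-sample Hotelling T^2 test of M = mu0, via (37)."""
        n, p = X.shape
        d = X.mean(axis=0) - mu0
        Xc = X - X.mean(axis=0)
        Sbar_c = Xc.T @ Xc / (n - 1)
        T2 = n * d @ np.linalg.solve(Sbar_c, d)
        F = (n - p) / ((n - 1) * p) * T2     # ~ F with (p, n - p) d.f. under M = mu0
        return T2, F, stats.f.sf(F, p, n - p)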

Two-sample test
Suppose we have independent (1 × p) data-vectors Xi ∼ N(MX, Σ) (i = 1, . . . , nX), Yi ∼
N(MY, Σ) (i = 1, . . . , nY), from two populations with the same dispersion matrix but possibly
different mean-vectors. The sufficient statistic is (X̄, Ȳ, Sw) with Sw := SᶜX + SᶜY, which are mu-
tually independent with respective distributions N(MX, Σ/nX), N(MY, Σ/nY), and Wp(ν; Σ)
where ν = nX + nY − 2. Let ∆ := MY − MX. With S̄w := Sw/(nX + nY − 2), inference for ∆
can be based on the pivot (Hotelling's two-sample T²):

    T² := (nX⁻¹ + nY⁻¹)⁻¹ (Ȳ − X̄ − ∆) S̄w⁻¹ (Ȳ − X̄ − ∆)ᵀ    (38)

with distribution T²(p; ν). In particular, a test of the hypothesis MY = MX can be conducted
by referring {(ν − p + 1)/p} (nX⁻¹ + nY⁻¹)⁻¹ (ȳ − x̄) Sw⁻¹ (ȳ − x̄)ᵀ to tables of Fᵖν−p+1.

6 Classification

Suppose the random (1 × p) vector X of observations on variables V is known
to come from one of two populations, Π1 or Π2, having distribution N(µk, Σ) if
from Πk, with density

    pk(x) = (2π)^{−p/2} (det Σ)^{−1/2} exp{−½ (x − µk)Σ⁻¹(x − µk)ᵀ}.    (39)

For the moment we suppose all parameters known, and µ1 ≠ µ2.
We observe X = x and have to assign the observation to one of the two popula-
tions. The likelihood ratio statistic is Λ = λ(X) where

    λ(x) = p2(x)/p1(x) = exp{(x − µ̄)a}    (40)

where

    µ̄ := ½(µ1 + µ2)    (41)
    a := Σ⁻¹(µ2 − µ1)ᵀ.    (42)

A likelihood ratio test (LRT) will thus be of the form: “assign to Π2 if Xa > k”.
The variable W := V a is the linear discriminant function (LD) for this problem.
When k = µ̄W := µ̄a, this test is equivalent to assigning to Π2 if Λ > 1. By
general properties of LRTs, this rule will minimise the sum α + β of the two
types of error. By symmetry it will have α = β, and will be minimax. If we
have non-equal prior probabilities for the two hypotheses, or differing losses for
the two types of error, a different cut-off will be optimal.
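For concreteness, a minimal Python sketch of this rule with known parameters (illustrative only; the function names are ours):

    import numpy as np

    def linear_discriminant(mu1, mu2, Sigma):
        """Coefficients a = Sigma^{-1}(mu2 - mu1)^T of (42) and cut-off k = mu_bar a."""
        a = np.linalg.solve(Sigma, mu2 - mu1)
        k = 0.5 * (mu1 + mu2) @ a
        return a, k

    def assign(x, a, k):
        """Assign x to population 2 iff x a > k (equal priors and losses)."""
        return 2 if x @ a > k else 1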

6.1 Another interpretation

Suppose we decide to collapse our data to the observation on just one variable,
say U = Vc. Then, in population Πk, XU := Xc will have the univariate normal
distribution N(µk c, cᵀΣc). The separation between the two distributions can
reasonably be measured by the distance between their means, measured in units
of their (common) standard deviation; or, essentially equivalently, by its square:

    d² := {(µ2 − µ1)c}² / (cᵀΣc).    (43)

We might now try to choose c to maximise d2 . Since d2 is unchanged on multiply-


ing c by a scalar, we may as well impose the normalisation constraint cT Σc = 1,
and subject to this choose c to maximise {(µ2 − µ1 )c}2 . Using a Lagrange
multiplier γ, we have to maximise cT {(µ2 − µ1 )T (µ2 − µ1 ) − γΣ}c. This gives
{(µ2 − µ1 )T (µ2 − µ1 ) − γΣ}c = 0, so Σc = γ −1 {(µ2 − µ1 )c}(µ2 − µ1 )T , which is
a scalar multiple of (µ2 − µ1 )T . Hence c ∝ Σ−1 (µ2 − µ1 )T = a. That is, choosing
the variable U to be (proportional to) the linear discriminant LD will achieve
maximum separation between its two univariate distributions.
This maximised separation is

    D² := (µ2 − µ1)Σ⁻¹(µ2 − µ1)ᵀ.    (44)

It is termed the Mahalanobis distance between the two multivariate distributions.
Note that it is invariant under a non-singular transformation of the variables.

6.2 Data

What if we do not know the parameters µ1, µ2, Σ? If we have data from each
population, as envisaged in § 5.2, we can simply replace these by their sample
estimates (x̄1, x̄2, S̄w) and proceed as before.
The sample counterpart of the Mahalanobis distance is (using notation similar
to that of § 5.2) (x̄2 − x̄1) S̄w⁻¹ (x̄2 − x̄1)ᵀ, a multiple of Hotelling's two-sample
statistic for testing the hypothesis M1 = M2. Since this hypothesis is false, the
distribution of T² is somewhat complex,⁵ although it depends only on D². Note
that, because the sample separation is optimised for the specific data at hand, it
will be biased upwards as an estimate of the population separation D², seriously
so unless ν ≫ p.

⁵It is a non-central T²-distribution, which is a scaled version of the non-central F-
distribution. See Anderson (2003), § 5.2.2.

7 Principal components

Let X, with non-singular dispersion matrix Σ, be the observation on a set of p


variables V . By an abuse of language we may also term Σ the dispersion matrix
of V . Under a non-singular change of variables, V ∗ = V A has dispersion matrix
Σ∗ = AT ΣA. If we know Σ, it can be helpful to make a change of variables such
that Σ∗ is the identity, i.e. the different variables are uncorrelated (independent
if jointly normal) with unit variance. This can be achieved by using any matrix A
which is a square-root of Σ−1 , i.e. AAT = Σ−1 . There are many such square roots:
for example, Gram-Schmidt orthogonalisation is equivalent to using the unique
upper-triangular square root. Or we can reorder the variables before constructing
this.
One useful transformation to uncorrelated variables (though not necessarily with
unit variance, which could however easily be arranged by further scaling of each
individual variable) is based on the spectral decomposition of the non-negative
definite symmetric matrix Σ, in terms of its eigenvalues and eigenvectors. General
matrix theory tells us that there exists a diagonal matrix Λ = diag(λ1 , . . . , λp )
with λ1 ≥ . . . ≥ λp > 0, and an orthogonal matrix A, such that we can write

    AᵀΣA = Λ.    (45)

Then λj is an eigenvalue of Σ, satisfying det(Σ − λj Ip) = 0, and the jth column
of A, aj, is a corresponding unit-length eigenvector (unique, up to a sign change,
if the multiplicity of λj is unity), satisfying Σaj = λj aj.
The transformed variables (V1∗, . . . , Vp∗) comprising V∗ := VA are the principal
components.⁶ They are uncorrelated, with var(Vj∗) = λj.

⁶If a certain eigenvalue is repeated, with multiplicity m > 1, the associated principal com-
ponents are not uniquely defined, but the m-dimensional space they span will be.
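In computation the spectral decomposition (45) comes straight from a symmetric eigensolver. A Python/NumPy sketch (illustrative only):

    import numpy as np

    def principal_components(Sigma):
        """Return eigenvalues of Sigma in decreasing order, and the orthogonal
        matrix A of (45) whose columns are the corresponding unit eigenvectors."""
        lam, A = np.linalg.eigh(Sigma)       # eigh returns ascending eigenvalues
        order = np.argsort(lam)[::-1]
        return lam[order], A[:, order]

Applied to data, the principal-component scores are then the columns of Xᶜ A, i.e. the centred data matrix rotated by A.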

Definition 2 A variable W = Va is termed normalised if Σᵢ₌₁ᵖ ai² = 1. 2

Note: This definition is dependent on the choice of basic variables V .

Theorem 12 Among all normalised variables, the choice W = V1∗, the first
principal component, has maximum variance. Among all normalised variables
uncorrelated with (V1∗, . . . , Vk∗) (1 ≤ k < p), the choice W = V∗k+1 has maximum
variance.

Proof. We have Va = V∗b with b = Aᵀa. Since A is orthogonal, aᵀa = bᵀb.
So W is normalised if and only if it has the form Σᵢ biVi∗ with Σᵢ bi² = 1.

(In particular, each Vi∗ is normalised.) Because the (Vi∗) are uncorrelated with
variances (λi), the variance of W is Σᵢ bi²λi. This is a weighted average of the
(λi) with weights (bi²) summing to 1. It is clear that it will be maximised when
all the weight is on the largest value, which will hold for b1 = ±1, bi = 0 (i > 1)
(essentially uniquely so if the first eigenvalue is of multiplicity 1): i.e., for W = V1∗.
A variable W = V∗b will be uncorrelated with (V1∗, . . . , Vk∗) if and only if bi = 0
for i = 1, . . . , k. Its variance will then be Σᵢ₌ₖ₊₁ᵖ bi²λi, and the same argument as
above shows that this will be maximised, for W normalised, when W = V∗k+1. 2

The first principal component V1∗ is often interpreted as the most important con-
structed variable, since (subject to normalisation) it varies the most. For certain
purposes (e.g. predicting some other variable) we might consider discarding all
other variables and only using V1∗ . Similarly for k < p the first k-principal com-
ponents might be regarded as the most important set of k variables, and only
these retained — thus achieving dimension reduction. However whether these
aspirations are achieved is quite another matter. There is also the question of
the choice of k: a common criterion is to fix π ∈ (0, 1) close to 1 and choose
the minimum k for which Σᵢ₌₁ᵏ λi / Σᵢ₌₁ᵖ λi ≥ π. (This quantity is sometimes,
though not very appropriately, called the proportion of total variance explained
by the first k principal components.)
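This criterion is a one-liner in code. A Python sketch (illustrative; it assumes the eigenvalues are already sorted in decreasing order):

    import numpy as np

    def n_components(lam, pi=0.9):
        """Smallest k with (lam_1 + ... + lam_k)/(lam_1 + ... + lam_p) >= pi."""
        frac = np.cumsum(lam) / np.sum(lam)
        return int(np.searchsorted(frac, pi) + 1)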
We have noted that the above construction is relative to a specific choice of basic
variables V . Starting with another set (even if they are scalar multiples of the
originals) will change Σ, and typically produce different principal components (in
the context of Theorem 12 it will involve a different understanding of what it
means for a variable to be normalised). It is common first to scale each of the
initial basic variables by its standard deviation, so that Σ becomes the correlation
matrix , and to construct principal components from that.

7.1 Data

When we do not know Σ, we might replace it by a sample-based estimate such
as S̄ᶜ. With Sᶜ ∼ Wp(ν; Σ), the exact joint sampling distribution of the sample
eigenvalues (L1, . . . , Lp) and eigenvectors of Sᶜ can be found, but is (unsurprisingly)
complex. Asymptotically, log(Li/λi) ∼ N(0, 2/ν), independently.
To investigate the possibility of reducing the dimension to k < p one might think
of testing the hypothesis λk+1 = . . . = λp = 0. However, if this is true then Σ is
singular, X is confined with probability 1 to a k-dimensional subspace, and the
sample eigenvalues Lk+1, . . . , Lp will all be zero. Hence if we observe non-singular
S we know the hypothesis is false, and no test is called for.


In summary, principal component analysis is best viewed as a descriptive, rather
than an analytic, technique.
