Principles of Statistics
© 2013 A. P. Dawid
<[email protected]>
<http://www.statslab.cam.ac.uk/~apd/>
Preamble
We do not start with a known probability distribution and explore its mathemat-
ically determined properties. Instead, we start with DATA, regarded as having
been generated from some unknown probability distribution P , and want to say
something sensible about P (perhaps so as to be able to say something sensible
about future variables also generated from P ).
Although the top route (from a given probability model to data: pure Probability) is entirely mathematical, the bottom route (from data back to the unknown P: Statistics) is not! Indeed, how to say something sensible about the future on the basis of past data constitutes the essentially insoluble philosophical problem of INDUCTION.
So. . . Statistics is impossible! And this is reflected in the fact that there are nu-
merous different warring schools of thought (frequentist, Bayesian, likelihood,. . . )
as to how to conduct statistical inference.
Chapter I
Statistical Models
1 Statistical experiment
Although we may not know the distribution (probability model ) P that generated
our data, we may be willing to confine it to some specified statistical model :
P ∈ P. For most of this course we take this model to be parametric, i.e. of the
form
P := {P_θ : θ ∈ Ω}
where the parameter space Ω is a finite or countable set, or (a subset of) a Euclidean space. More general forms of P are non-parametric.
We will thus preface any statistical analysis with our specification of the relevant
(parametric) statistical experiment:
typically be possible to choose unique versions of these functions that are con-
tinuous in both arguments. This will henceforth be assumed unless otherwise
noted.
Likelihood
When L(θ) > 0 on Ω, we often work with the log-likelihood function l(θ) ≃ log L(θ) (where ≃ denotes that the two sides may differ by an additive constant).
The value θ̂ (if it exists) maximising L(θ) (or equivalently l(θ)) is the maximum likelihood estimate (MLE) of Θ (for data X = x).
NORMAL MODEL
X = R.
Θ = (M, V) ∈ Ω = R^+ × R... more precisely, Θ = (M, V) ∈ Ω = R × R^+.
Given (M, V) = (µ, v), X ∼ N(µ, v), with density
p(x | µ, v) = (2πv)^{−1/2} exp{−(x − µ)²/2v}.
⁴ "Likelihood" was introduced by Ronald Fisher, who coined the term to make a clear distinction between the likelihood function p(x | ·), for given x a function on the parameter space (transforming as a scalar), and the probability density p(· | θ), for given θ a (non-scalar) function on the sample space X. Unfortunately this careful distinction is in danger of being lost, with the term "likelihood" increasingly employed as if it were synonymous with "statistical model".
⁵ More properly, "the" likelihood function is not a unique function, but an equivalence class of functions, regarded as equivalent if they are related by multiplication by a positive scaling factor. To be well-defined, any procedure based on "the" likelihood function must be unaffected by such scalar multiplication: this will be the case for all those we shall consider.
BETA MODEL
X = (0, 1)
Θ = (A, B) ∈ Ω = R^+ × R^+
Given (A, B) = (a, b), X ∼ β(a, b), with density
p(x | a, b) = {Γ(a + b)/Γ(a)Γ(b)} x^{a−1} (1 − x)^{b−1}.
UNIFORM MODEL
X = R
Θ = (A, B) ∈ Ω = {(a, b) ∈ R × R : a < b}
Given (A, B) = (a, b), X ∼ U(a, b), with density
p(x | a, b) = (b − a)^{−1}      (a ≤ x ≤ b)
Note: For this model it is not possible to choose the density function to be
a continuous function of all its arguments.
CAUCHY MODEL
X = R, Θ ∈ Ω = R. Given Θ = θ, X has the Cauchy density
p(x | θ) = 1/[π{1 + (x − θ)²}].
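For the Cauchy model the MLE has no closed form, so it must be located numerically, directly from the definition above. The following is a minimal Python sketch (illustrative only: the data, seed and true value θ = 3 are invented, and the bounded search over the data range is a heuristic, since the Cauchy likelihood can be multimodal):

import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_lik(theta, x):
    # -l(theta) = sum_i log[ pi {1 + (x_i - theta)^2} ]
    return np.sum(np.log(np.pi * (1.0 + (x - theta) ** 2)))

rng = np.random.default_rng(0)
x = rng.standard_cauchy(50) + 3.0          # data from Cauchy(theta = 3)
res = minimize_scalar(neg_log_lik, args=(x,),
                      bounds=(x.min(), x.max()), method="bounded")
print("MLE theta-hat:", res.x)             # should be near 3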
3 Exponential families
A particularly versatile and tractable general form for a statistical model is the
exponential family (EF).
Recall that an EF has density, with respect to a measure m on X, of the form
p(x | θ) = exp{a(x) + b(θ) + u(θ)^T t(x)}.      (1)
If we introduce the reference measure µ defined by µ(A) = ∫_A exp{a(x)} dm(x), then the density with respect to µ will be exp{b(θ) + u(θ)^T t(x)}, i.e. we can take a(x) ≡ 0.
The transformed parameter Φ := u(Θ) is termed the natural or canonical parameter. By the Fisher-Neyman factorisation criterion, the statistic T := t(X) is sufficient: it is termed the natural sufficient or canonical statistic. Both Φ and T take values in R^p.
Re-expressed in terms of Φ, we can write
p(x | φ) = exp{a(x) − k(φ) + φ^T t(x)},      (2)
where necessarily
e^{k(φ)} = ∫ exp{a(x) + φ^T t(x)} dm(x).      (3)
In order for (2) (and hence (4)) to define a density, it is necessary and sufficient that k(φ) given by (3) be finite. The set Ω ⊆ R^p for which this holds is the natural (or canonical) parameter space, and the EF is full when Φ can take any value in Ω.
We shall always suppose that Ω contains an open set in R^p. This does not lose any generality.
For, if not, let q < p be the dimension of the affine span of Ω. We can re-express Φ ∈ Ω as an affine function AΨ + c of a new parameter Ψ ∈ R^q, and thence (2) in terms of Ψ as natural parameter, and new natural statistic S = A^T T ∈ R^q. Convexity and full affine dimension of the new natural parameter space imply that it contains an open set in R^q.
The function k is very important, containing essentially all the information about
the family Q. Nevertheless it does not appear to have a well-established name.
The moment-generating function (MGF) M_φ : R^p → R^+ ∪ {∞} of Q_φ is given by:
M_φ(s) := E_φ(e^{s^T T}) = exp{k(φ + s) − k(φ)}.
In view of this, we shall call k(·) the cumulant function of the EF.
From general theory, knowledge of its (finite) MGF, or equivalently of its cumulant-generating function (CGF) κ_φ := log M_φ, in a neighbourhood of 0 completely determines a distribution. So for an EF, knowledge of k in a neighbourhood of φ ∈ Ω° completely determines the distribution Q_φ of T when Φ = φ.
Further, the first two derivatives of κ_φ at 0 — or, equivalently, of k at φ — give the mean and variance (dispersion) of the associated distribution Q_φ. We can also show these properties directly. Differentiate (3) with respect to φ_j (and
assume that this can be passed through the integral, which is indeed the case for φ ∈ Ω°). We obtain:
{∂k(φ)/∂φ_j} e^{k(φ)} = ∫ t_j(x) exp{a(x) + φ^T t(x)} dm(x).      (5)
Dividing this by (3) yields:
∂k(φ)/∂φ_j = ∫ t_j(x) p(x | φ) dm(x) = E_φ(T_j).      (6)
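Identity (6) can be checked numerically. The sketch below (an illustration, not part of the development: it assumes the Poisson family, for which φ = log λ, t(x) = x and k(φ) = e^φ) compares a finite-difference value of ∂k/∂φ with a Monte Carlo estimate of E_φ(T):

import numpy as np

def k(phi):
    return np.exp(phi)        # cumulant function of the Poisson EF

phi, h = 0.7, 1e-5
deriv = (k(phi + h) - k(phi - h)) / (2 * h)      # finite-difference dk/dphi

rng = np.random.default_rng(1)
lam = np.exp(phi)
mean_T = rng.poisson(lam, size=200_000).mean()   # Monte Carlo E_phi(T)
print(deriv, mean_T)                             # both close to lambda = e^phi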
3.1 Examples
For the binomial model, X ∼ B(n; π), we have
p(x | π) = {n!/(x!(n − x)!)} π^x (1 − π)^{n−x}.      (10)
Suppose (X_i) are independent and identically distributed from the EF with density (1). Then the joint density of X = (X_1, …, X_n) (with respect to the product measure m^n on X^n) is
p(x | φ) = ∏_{i=1}^n p(x_i | φ) = exp{a_n(x) − k_n(φ) + φ^T t_n(x)}      (11)
with a_n(x) = Σ_{i=1}^n a(x_i), t_n(x) = Σ_{i=1}^n t(x_i), and k_n(φ) = nk(φ). Hence, as φ varies in Ω, the joint distributions of X again constitute an exponential family. The natural parameter Φ is unchanged, the natural statistic is T = Σ_{i=1}^n t(X_i), and the mean-value parameter and the function k are both multiplied by n.
We note that, under repeated sampling, the dimensionality of the sufficient statis-
tic remains bounded (by p) as the sample-size n increases. Conversely, under
certain conditions¹⁰ this property will hold only for an EF.
¹⁰ — of which the most important is that the set of possible values for X does not depend on the value of Θ.
Chapter II
Principles of Inference
1 Sufficient statistic
If, instead of learning X, I learn only T , I will generally have lost information.
But sometimes such reduction of our data can be effected without loss of useful
information about Θ.
2 Sufficiency principles
We shall see later how these sufficiency principles apply to various specific infer-
ence methods.
¹ Note however that this does not hold for Example 2 if we allow — as we can — continuous distributions that are not absolutely continuous.
Proof. A simple proof for a discrete sample space was given in STATISTICS IB. The following is a more general argument. For simplicity, we suppose that, for some θ_0 ∈ Ω,
p(x | θ_0) > 0 for all x ∈ X.      (2)
Let 𝒯 := t(X) be the set of possible values for T. We note the following:
(i). Assuming (2), the factorisation (1) will hold if and only if, for all θ ∈ Ω, we can express p(x | θ) = a{t(x), θ} p(x | θ_0) for some function a.
(ii). p(x | θ) = p(x | θ0 ) b(x, θ) if and only if, for any function f : X → R, Eθ {f (X)} =
Eθ0 {f (X) b(X, θ)}.
(iii). (Kolmogorov’s general definition of conditional expectation). Suppose E(Y ) exists (i.e.
E(|Y |) < ∞). Then E(Y | T ) = κ(T ) if and only if, for any function φ(T ) of T ,
E{φ(T ) κ(T )} = E{φ(T ) Y }.
(In both (ii) and (iii), the equality is to be interpreted in the sense that whenever either side
exists, so does the other, and they are equal).
Suppose first that T is sufficient, with density say q(t | θ) (with respect to some measure on 𝒯 that we need not specify). Then (2) implies q(t | θ_0) > 0, all t ∈ 𝒯. Define a(t, θ) := q(t | θ)/q(t | θ_0).
For arbitrary k(X) with finite expectation, let κ(T) := E_θ{k(X) | T}, chosen not to depend on θ, as is possible by sufficiency. Then
E_θ{k(X)} = E_θ{κ(T)} = E_{θ_0}{κ(T) a(T, θ)} = E_{θ_0}{k(X) a(T, θ)},
so that, by (ii), p(x | θ) = p(x | θ_0) a{t(x), θ}, which by (i) gives the factorisation (1).
Conversely, suppose (1) holds, so that p(x | θ) = p(x | θ_0) a{t(x), θ}. For arbitrary k(X) with finite expectation, let κ(T) := E_{θ_0}{k(X) | T}. Then, for any function φ(T), using (ii) and (iii),
E_θ{φ(T) κ(T)} = E_{θ_0}{φ(T) κ(T) a(T, θ)} = E_{θ_0}{φ(T) k(X) a(T, θ)} = E_θ{φ(T) k(X)}.
So, by (iii), κ(T) = E_θ{k(X) | T}. Since E_θ{k(X) | T} is thus the same for all θ, for any function k(X),² T is sufficient for Θ. □
² It is enough to consider all indicator functions of sets.
Thus Corollary 2 says that, whenever two possible outcomes of the same exper-
iment yield the same value for some sufficient statistic, the likelihood functions
(see Definition 2 of Chapter I) that they generate for Θ must be proportional.
Corollary 3 Suppose T ≡ t(X) is sufficient for Θ, with density q(t | θ) (with re-
spect to some measure on T). Then, for any x ∈ X, as functions of θ, q{t(x) | θ} ∝
p(x | θ).
Proof. Examining the proof of Theorem 1, we see that q{t(x) | θ}/q{t(x) | θ_0} = a{t(x), θ} = p(x | θ)/p(x | θ_0). □
4 Minimal sufficiency
In applications we would not normally use this particular form of minimal sufficient statistic: any statistic T ≡ t(X) such that t(x) = t(x′) if and only if x and x′ yield proportional likelihoods will serve.
5 Completeness
The following technical definition, while not easy to motivate intuitively, proves
to be of importance for a variety of theoretical developments.
Suppose, for example, that ∫_0^∞ h(t) t^{n−1} e^{−λt} dt = 0, all λ > 0. Let h^+(t) := max{h(t), 0}, h^−(t) := max{−h(t), 0} be the positive and negative parts of h. Then
∫_0^∞ h^+(t) t^{n−1} e^{−λt} dt = ∫_0^∞ h^−(t) t^{n−1} e^{−λt} dt,
all λ > 0. By the uniqueness of Laplace transforms, we must have h^+(t) t^{n−1} = h^−(t) t^{n−1} (almost everywhere), i.e. h(t) = 0 (almost everywhere). □
How much more broadly should we regard LP as appropriate? Do you agree with
its application in the following case?
the density of the data actually observed is, as a function of θ, proportional to p(x_t | θ).
Since this holds for any non-informative stopping rule s, any inference method that respects LP
must make the same inference about Θ from the observation X t = xt , no matter which stopping
rule gave rise to it.
Example 6 (See also Question 3(b) of Example Sheet 1.) Suppose that we are told that there
were 9 heads in 12 tosses of a coin, but not which (non-informative) stopping rule was applied.
It might have been: “Toss exactly 12 times” (a binomial experiment), or “Toss until the 3rd tail
appears” (a negative binomial experiment). Or perhaps “Toss 10 times, and then stop after the
first head”. All of these different experiments have different sample spaces, and would support
different unbiased estimators for Θ – so the method of unbiased estimation violates LP. In fact,
pretty well every frequentist method violates LP in this setting. However, according to LP we should not care which experiment was actually performed, but make the identical inference in all these cases. □
Example 5 can be regarded as a continuous-time variant of the above set-up, where we observe
a Poisson process of rate Θ, either for a given time t0 , or until we have seen a given number n0
of events.
⁴ For an example of informative stopping, consider an experimenter who tossed the coin 20 times, but chose to reveal only the first (say) 14 outcomes, because that maximised the proportion of heads.
7 Ancillarity
Proof. Let k(S) be a bounded function of S, and g(T) := E{k(S) | T} (independent of the value θ of Θ, by sufficiency of T). Then E_θ{g(T)} = E{k(S)}, the latter being independent of θ by ancillarity. By bounded completeness, almost surely under any P_θ we have g(T) = E{k(S)}, i.e. E_θ{k(S) | T} = E_θ{k(S)}. Since k was arbitrary, the result follows. □
8 Birnbaum’s theorem
We have seen that the likelihood principle implies both the (strong or weak) sufficiency principle and the ancillarity principle. Conversely, Birnbaum showed
that the only general way to respect both WSP and AP is to respect LP. In fact
we do not even need the full force of AP, but only the weaker:
CONDITIONALITY PRINCIPLE [CP]
Let E_1 = (X_1, X_1, Ω, Θ, P_1), E_2 = (X_2, X_2, Ω, Θ, P_2) be two experiments governed by the same parameter Θ, and let the compound experiment E be conducted by first flipping a fair coin, with outcomes labelled 1 and 2; and then doing E_i if the outcome is i. Then the inference drawn from observing outcome (i, x_i) in the overall experiment E should be the same as that drawn from observing outcome x_i in the component experiment E_i. □
Theorem 11 (Birnbaum)
WSP & CP ⇒ LP
That is to say, if I is a general inference method respecting both WSP and CP,
then it must respect LP.
Chapter III
Frequentist Estimation Theory
Definition 1 Estimator
An estimator is a function θ̃ : X^n → Ω.¹ We also use the term to refer to the random variable Θ̃ := θ̃(X). □
¹ Much of the description and analysis applies just as well to more general experiments.
Once we have decided to use this estimator, our estimate, for observed data x, is θ̃(x).
The estimator Θ̃ is a function of X, and so, for each possible value θ of Θ, has a distribution that can in principle be calculated from the distribution P_θ^n of X. A frequentist bases her evaluation of the proposed estimator on this family of distributions for Θ̃ given θ. But to do this she needs to apply specific criteria for such an evaluation. There is plenty of scope for developing and analysing (and criticising) "reasonable" evaluation criteria.
Here is one possible criterion, for the case that Ω ⊆ R:
Definition 2 Unbiasedness
The bias function of the estimator Θ̃ of Θ is the function b : Ω → R given by:
b(θ) = E(Θ̃ | θ) − θ.
The estimator Θ̃ is unbiased if b(θ) ≡ 0, i.e.
E(Θ̃ | θ) = θ.
When we need to distinguish this bias function from that of other estimators, we might write b_Θ̃(·), or b̃(·), …
When unbiasedness holds, the strong law of large numbers implies that, if we were to apply the estimator Θ̃ repeatedly to independent repetitions of the experiment E^n, all with the same parameter-value, then the overall average of our estimates would converge to the true parameter value, with probability 1, no matter what that true value might be. This is the "repeated sampling" interpretation of the unbiasedness criterion.
While unbiasedness may seem like a natural and appealing property, it has its
problems.
Consider two experiments: in E we observe X, the number of heads in 10 tosses of a coin with probability θ of heads; in E′ we observe X′, the number of heads obtained before the 7th tail. Then
Prob(X = x | θ) = {10!/(x!(10 − x)!)} θ^x (1 − θ)^{10−x}      (x = 0, …, 10).
Prob(X′ = x′ | θ) = {(x′ + 6)!/(6! x′!)} θ^{x′} (1 − θ)^7      (x′ = 0, 1, …).
The (unique) unbiased estimator of Θ in E is T := X/10; while the (unique) unbiased estimator of Θ in E′ is T′ := X′/(X′ + 6).
Consider now the observations: X = 3 in E, and X′ = 3 in E′. Unbiased estimation yields different answers, respectively 3/10 and 3/9. But both observations yield essentially the same likelihood function, proportional to θ³(1 − θ)⁷ in each case. So the method of unbiased estimation does not satisfy the likelihood principle. □
A more basic criticism is that the property of unbiasedness says nothing about
how close the estimator might be to its estimand. In Example 1, one simple
unbiased estimator of M is just the first observation, X1 ; but this seems a poor
choice.
To address this issue we might proceed as follows.
Θ̃ is a minimum-variance unbiased estimator (MVUE) of Θ if:
• Θ̃ is unbiased for Θ
• for all θ ∈ Ω, ṽ(θ) ≤ v*(θ) for every other unbiased estimator Θ* of Θ, where ṽ and v* denote the respective variance functions.
Suppose T_1 and T_2 are both MVUEs of Θ. It immediately follows that they have the same variance function, v(θ) say. Also, ½(T_1 + T_2) is an unbiased estimator of Θ, so var_θ{½(T_1 + T_2)} ≥ v(θ). Substituting in (1) we find var_θ{½(T_1 − T_2)} ≤ 0, hence var_θ(T_1 − T_2) = 0. Also, E_θ(T_1 − T_2) = θ − θ = 0. Thus T_1 = T_2 with probability 1 under any P_θ. □
How do we identify a MVUE (if one exists), or check that a proposed estimator is in fact MVUE? One way is to show the existence of a lower bound v_*(θ) for the variance function v(θ) of any unbiased estimator. Then if we find v(θ) = v_*(θ) we shall know that we have a MVUE.
We now develop some theory working towards identifying such a bound. We suppose a statistical experiment E_n = E^n with Ω ⊆ R. We further suppose p(x | θ) > 0 on X, and introduce, for x ∈ X^n, θ ∈ Ω, the log-likelihood l_n(x, θ) := log p(x | θ) and the score u_n(x, θ) := l_n′(x, θ) = ∂l_n(x, θ)/∂θ, with score variable U_n(θ) := l_n′(X, θ).
The first term has expectation 0, by an argument similar to that used to show (2). The second term is the right-hand side of (3). □
3 Fisher information
Thus, under Pθ , the score variable Un (θ) has expectation 0 and variance In (θ).
⁴ The most important is that the range of x-values for which the integrand is positive should be the same for all θ ∈ Ω.
For this course we are assuming the X_i independent and identically distributed according to P_θ. Then, with l(x, θ) := log p(x | θ), we have
log p(X | θ) = Σ_{i=1}^n log p(X_i | θ) = Σ_{i=1}^n l(X_i, θ)
whence
U_n(θ) = Σ_{i=1}^n u_i(θ)
where u_i(θ) is the score variable based on the single outcome variable X_i. Thus, under P_θ, U_n(θ) is a sum of n independent and identically distributed components, each with mean 0 and variance i(θ) := E_θ{−l″(X, θ)}, the Fisher information based on a single observation. In particular, in this independent and identically distributed case, I_n(θ) = n i(θ). By the central limit theorem, the asymptotic distribution under P_θ of the standardised score variable, S_n(θ) := {I_n(θ)}^{−1/2} U_n(θ), as n → ∞, is standard normal.
In particular, for large n we can test a hypothesis Θ = θ_0 by referring S_n(θ_0) to standard normal tables; essentially equivalently, we can refer S_n(θ_0)² (which is called the score statistic) to tables of χ²_1. This yields the score test.
Similarly, we can form an approximate 95% (say) confidence interval for Θ as:
{θ : |{I_n(θ)}^{−1/2} U_n(θ)| ≤ 1.96}.      (6)
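As an illustration of the score test and the interval (6), here is a Python sketch for an assumed Poisson model, where U_n(θ) = Σx_i/θ − n and I_n(θ) = n/θ (the model and all numerical values are invented for the example):

import numpy as np

rng = np.random.default_rng(2)
n, theta_true = 100, 2.5
x = rng.poisson(theta_true, n)

def S_n(theta):
    # U_n(theta) = sum(x)/theta - n;  I_n(theta) = n/theta  (Poisson)
    return (x.sum() / theta - n) / np.sqrt(n / theta)

theta0 = 2.0
print("score statistic S_n(theta0)^2:", S_n(theta0) ** 2)   # refer to chi^2_1

# Score confidence interval: scan a grid for {theta : |S_n(theta)| <= 1.96}
grid = np.linspace(0.5, 5.0, 2001)
inside = grid[np.abs(S_n(grid)) <= 1.96]
print("approx 95% score interval:", inside.min(), inside.max())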
Suppose we transform the data by an invertible mapping Y = g(X); the log-densities then differ by a Jacobian term γ(y) that does not depend on the parameter. Hence l_Y′(Y, θ) = l_X′(X, θ): i.e. the score random variable is the same, whether based on X or Y. In particular, the Fisher information, which is the variance of the score, is unaffected by a data transformation.
In particular, I_Φ(φ)^{−1/2} U_Φ(φ) = I_Θ(θ)^{−1/2} U_Θ(θ): the standardised score variable (and hence also the score statistic) is unaffected by an invertible transformation of either the data or the parameter.
The general theory confirms our earlier results that Eφ (T ) = k ′ (φ), varφ (T ) =
k ′′ (φ).
Consider now the mean-value parameter H = k′(Φ). Using (10) we find i_H(η) = {k″(φ)}^{−1} (where η = k′(φ)), and so I_H(η) = n{k″(φ)}^{−1}.
cov(Y, Z)² ≤ var(Y) var(Z),      (12)
which holds for any variables Y, Z, having any joint distribution (so long as the variances exist); moreover, we get equality in (12) if and only if there is a linear relationship between Y and Z.
Fixing an arbitrary value of θ, we will apply (12) to the variables Y = Θ̃ and Z = U(θ), using the distribution P_θ^n.
Using (2),
cov_θ{Θ̃, U(θ)} = E_θ{θ̃(X) l_n′(X, θ)}
= ∫_X θ̃(x) {p′(x | θ)/p(x | θ)} p(x | θ) dµ(x)
= ∫_X θ̃(x) p′(x | θ) dµ(x)
= (d/dθ) ∫_X θ̃(x) p(x | θ) dµ(x)
= (d/dθ) E_θ(Θ̃)
= 1
(where we have used the unbiasedness of Θ̃: E_θ{Θ̃} ≡ θ). Inserting this, and (5), into (12), the result follows. □
Consider, for example, the exponential model with density p(x | µ) = µ^{−1} e^{−x/µ} (x > 0), for which
i(µ) = µ^{−2}.
Hence any unbiased estimator of M must have variance at least I(µ)^{−1} = µ²/n. But var(X | µ) = µ², so the variance of the unbiased estimator X̄ of M is µ²/n. It follows that this is the MVUE of M. □
p(x | θ) = exp{a(x) + b(θ) + w(θ) θ̂(x)}.
We thus see that, for equality in (11), we must be dealing with an exponential
family of distributions for X. In the independent and identically distributed case
we confine ourselves to, this in turn requires that the basic single-observation
model p(x | θ) be an EF.
Moreover, our estimator Θ̂ must be the natural statistic T. Since we are requiring that Θ̂ be an unbiased estimator of Θ, that in turn implies that Θ must be the mean-value parameter H of the EF. So we can achieve equality in (11) if and only if we are estimating the mean-value parameter in an exponential family; and to do so we must be using the natural sufficient statistic.
Suppose X_1 ∼ Poisson(θ) and X_2 ∼ Poisson(θ²) are independent. Then
U(θ) = l′(X, θ) = −1 − 2θ + (X_1 + 2X_2)/θ
−l″(X, θ) = 2 + (X_1 + 2X_2)/θ².
This model forms an exponential family, with natural parameter Φ = log Θ, natural statistic T = X_1 + 2X_2, and mean-value parameter H = Θ + 2Θ².
Note that E_θ{U(θ)} = 0. Also I_Θ(θ) = E_θ{−l″(X, θ)} = 4 + θ^{−1}, so the Cramér-Rao lower bound for an unbiased estimator of Θ is θ/(1 + 4θ). The estimator Θ̃ := X_1 is clearly unbiased for Θ, with variance function ṽ(θ) = θ. This is close to the CRLB when θ is small.
Now consider the parameter Λ = Θ². Using (8), or directly, we find the CRLB for unbiased estimation of Λ is I_Λ(λ)^{−1} = 4θ³/(1 + 4θ). The unbiased estimator X_2 of Λ comes close to achieving this for large θ.
Finally, if we just happened to be interested in the mean-value parameter H = Θ + 2Θ², we would find the unbiased estimator T = X_1 + 2X_2 exactly achieves the variance bound I_H(η)^{−1} = θ + 4θ². □
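A quick Monte Carlo check of the last claim (a sketch under the model just described, X_1 ∼ Poisson(θ) and X_2 ∼ Poisson(θ²); the parameter value and replication count are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
theta, reps = 1.5, 500_000
x1 = rng.poisson(theta, reps)
x2 = rng.poisson(theta ** 2, reps)
t = x1 + 2 * x2                       # unbiased for H = theta + 2 theta^2

print("empirical var(T):", t.var())                 # ~ theta + 4 theta^2
print("CRLB for H      :", theta + 4 * theta ** 2)
print("var(X1):", x1.var(), "vs CRLB for Theta:", theta / (1 + 4 * theta))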
f″(θ) ≡ [∂²f(θ)/∂θ_i ∂θ_j]_{i,j=1}^p, the p × p matrix with (i, j) entry ∂²f(θ)/∂θ_i ∂θ_j.
In particular, the information function I_n(θ) = −E_θ{l_n″(X, θ)} is now a symmetric (p × p) matrix.
In parallel with the single-parameter results, we can show that the score vector U_n(θ) has (under P_θ) expectation 0 and dispersion matrix I_n(θ). Letting i(θ)^{1/2} denote any (non-random) matrix square root of i(θ) (i.e. a non-random matrix A(θ) such that A(θ)A(θ)^T = i(θ)), and I_n(θ)^{1/2} = n^{1/2} i(θ)^{1/2}, the asymptotic distribution of I_n(θ)^{−1/2} U_n(θ) is that of p independent standard normal variables. In particular, the asymptotic distribution of U_n(θ)^T I_n(θ)^{−1} U_n(θ) (which is invariant under transformations of either data or parameter) is χ²_p.
Now suppose Θ̃_1 is an unbiased estimator of Θ_1. Similarly to before, we find cov_θ{Θ̃_1, U_j(θ)} = 1 for j = 1, and 0 otherwise. It follows that, for any non-random vector c (which may however depend on θ) we have
cov_θ{Θ̃_1, c^T U(θ)} = c_1.
The right-hand side of (13) is maximised when c ∝ I(θ)^{−1} δ_1, where δ_1 is the (p × 1) vector whose first entry is 1 and all others are 0. The maximised value is then I^{11}(θ) = δ_1^T I(θ)^{−1} δ_1, the (1, 1) entry of I(θ)^{−1}. Hence any unbiased estimator of Θ_1 must have variance at least I^{11}(θ).
If S is any statistic, and T is sufficient, then we can form the conditional ex-
pectation of S given T , S ∗ := E(S | T ), which will be a function of T and, by
sufficiency, will not depend on which value of θ is used in its construction. It is
thus another statistic.
The following result from STATISTICS IB enables us to improve (in MSE terms) on any estimator that is not already a function of a sufficient statistic:
Theorem (Rao-Blackwell)
(i). E_θ(S*) = E_θ(S).
(ii). var_θ(S*) ≤ var_θ(S), with equality if and only if S is a function of T (so that S* = S).
Proof. The following formulae hold for any variables S, T with any joint distribution:
E(S) = E{E(S | T)}      (14)
var(S) = E{var(S | T)} + var{E(S | T)}.      (15)
Apply these to the case in hand, under distribution P_θ, using E_θ(S | T) = S*. Then (i) follows directly from (14). Also, since the first term on the right-hand side of (15) is non-negative, we obtain (ii). For equality in (ii) we must have E_θ{var_θ(S | T)} = 0, which implies var_θ(S | T) = 0, i.e. S is fully determined by T. □
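The effect of the theorem is easily seen by simulation. The sketch below uses an invented Bernoulli example: S = X_1 is unbiased for θ, T = ΣX_i is sufficient, and S* = E(S | T) = T/n.

import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 20, 200_000
x = rng.binomial(1, theta, size=(reps, n))

s = x[:, 0]                # S = X1: unbiased but wasteful
s_star = x.mean(axis=1)    # S* = E(S | T) = T/n

print("means of S, S*:", s.mean(), s_star.mean())   # both ~ theta (unbiased)
print("vars  of S, S*:", s.var(), s_star.var())     # theta(1-theta) vs same /n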
6.1 Completeness
The following result follows easily from the definition of completeness (Definition 4
of Chapter II) and the Rao-Blackwell theorem.
Chapter IV
Asymptotics
Often we cannot analyse exactly the behaviour of some statistical procedure,
but we can say something about its asymptotic behaviour as the sample size
n → ∞—with luck, this will be a useful guide to what happens for finite n. In
order for this to make sense, we must conceive of our procedure as being defined
for arbitrary sample size—so we really have a sequence of experiments, (E n ), and
a sequence of procedures, one for each experiment. Thus we might e.g. consider
the behaviour of an estimator sequence (Θ̃_n) as n → ∞, where Θ̃_n : X^n → Ω.
Here we restrict our attention to such estimation problems.
There are in fact two possible set-ups for such asymptotics: (i) for each n a separate experiment E^n, where only the distribution of the procedure under each P_θ^n matters; or (ii) all experiments defined on a common probability space, with E^n observing the first n terms of a single infinite sequence X_1, X_2, … of variables.
For simplicity, unless otherwise stated we restrict attention to the case of a real parameter: Ω ⊆ R.
1 Forms of asymptotics
Since we are dealing with random variables, not numbers, we have to take care
over what we mean by asymptotic behaviour. In particular, we distinguish the
following ways in which a sequence (Zn ) of real1 random variables can “converge”
to another real random variable Z:
Convergence in distribution
We say Z_n tends to Z in distribution, and write Z_n →ᵈ Z, if, for any bounded continuous function h : R → R, E{h(Z_n)} → E{h(Z)}.
A useful equivalent condition relates to the distribution functions Fn (z) :=
Prob(Zn ≤ z), F (z) := Prob(Z ≤ z): we require Fn (z) → F (z) whenever
Prob(Z = z) = 0.
Note that for this definition to be meaningful, we only need to specify the
marginal distributions of the Zn and Z—they do not even need to be defined
on the same probability space. In particular, there need be no convergence
between the values of Zn and Z.
Slutsky's lemma: suppose Y_n →ᵈ Y, that Z_n →ᵖ c for a constant c, and that g(y, z) is continuous. Then g(Y_n, Z_n) →ᵈ g(Y, c).
¹ Extensions to variables taking values in a vector space are straightforward.
2 Consistency
Example 1 Let Φ = E(X | Θ) (assumed to exist), and T_n = X̄_n := n^{−1} Σ_{i=1}^n X_i. The weak law of large numbers shows that, under P_θ, T_n → E(X | θ) in probability. Thus (T_n) is weakly consistent for Φ.
In the asymptotic set-up (ii), we can similarly call on the strong law of large numbers to show that (T_n) is strongly consistent for Φ.
Note that these results would apply equally if, in E^n, we used (say) T_n = X̄_{n/2}, so ignoring half the data. □
² At any rate (for the independent and identically distributed case) whenever Θ is identified, in the sense that different values for Θ always give rise to different distributions for X.
³ For simplicity we omit explicit mention of the underlying parameter-value θ.
Under very broad conditions (essentially, in the independent and identically dis-
tributed case, that the parameter Θ be identified), the maximum likelihood esti-
mator is strongly consistent. The key property to note is:
Proof. Jensen’s inequality states that, for any convex function g, and any
random variable Y such that the expectations exist,
For by convexity there exist a, b such that g(y) ≥ a + by, all y, with equality at
y = E(Y ). Now take expectations of both sides. If, moreover, g is strictly convex,
we will have equality in (1) if and only if Y is almost surely constant.
Apply (1) with g ≡ − log, and Y = p(X | θ′)/p(X | θ) if the denominator is positive, else 0. Take expectations under P_θ. Since we assume the parameter Θ is identifiable, the functions p(· | θ) and p(· | θ′) are not identically equal, so Y is non-constant, so the inequality is strict. Thus
E_θ{log p(X | θ′) − log p(X | θ)} < 0      (θ′ ≠ θ).
Proof. Since, for any θ′, l_n(θ′) = Σ_{i=1}^n log p(X_i | θ′), and the summands are independent and identically distributed, the strong law of large numbers ensures that, with P_θ-probability 1,
n^{−1} {l_n(θ′) − l_n(θ)} → E_θ{log p(X | θ′) − log p(X | θ)} < 0.      □
In particular, for any δ > 0, with probability 1 under Pθ there exists N such that,
for all n > N, both ln (θ − δ) < ln (θ) and ln (θ + δ) < ln (θ); and when this occurs
there must be a local maximum of the function ln (·) in the interval (θ − δ, θ + δ).
For cases (for example, in an EF) when we know that l_n(θ) has just one maximum Θ̂_n, Corollary 4 ensures that the sequence (Θ̂_n) will be strongly consistent. Otherwise, this property will hold so long as we make a suitable choice of local maximum. In the sequel Θ̂_n will refer to such a choice.
3 Asymptotic normality
We have seen that, under P_θ, the standardised score S_n(θ) = {I_n(θ)}^{−1/2} U_n(θ) (which, we recall, does not depend on how the data or the parameter are expressed) is asymptotically N(0, 1). The score test of H_0 : Θ = θ_0 is executed by calculating the observed value of S_n(θ_0)² = {I_n(θ_0)}^{−1} {U_n(θ_0)}² and referring this value to tables of the χ²_1 distribution.
From this, other asymptotically equivalent results can be derived from the property that, for large n, the log-likelihood function l_n(θ) is approximately quadratic in a region of size O(n^{−1/2}) around the MLE θ̂_n, and that values outside such a region can be neglected. Define the normalised likelihood and log-likelihood
L̄_n(θ) := L_n(θ)/L_n(θ̂_n)
l̄_n := log L̄_n.
Corollary 6 k^{−1/2} (dy/dx) = −k^{1/2} x = (−2y)^{1/2} × sgn(−x).
Now take x = θ − θ̂_n, y = l̄_n(θ) (so dy/dx = U_n(θ)), and approximate l̄_n(θ) by a quadratic y = −½ k_n x², with k_n ≈ −l_n″(θ̃) for some θ̃ in the region where this approximation is adequate. In particular we can take k_n to be any of:
(i). J_n := −l_n″(θ)
(ii). ĵ_n := −l_n″(θ̂_n)
(iii). I_n(θ)
(iv). ı̂_n := I_n(Θ̂_n)
where (iii) and (iv) are motivated by −l_n″(θ)/I_n(θ) ≈ 1 for large n. Note that I_n depends only on the parameter, ı̂_n and ĵ_n only on the data, while J_n depends on both.
We deduce the approximate equivalence of:
(a). k_n^{−1/2} U_n(θ)
(b). k_n^{1/2} (Θ̂_n − θ)
(c). {−2l̄_n(θ)}^{1/2} × sgn(Θ̂_n − θ)
0 = U_n(Θ̂_n) = U_n(θ) + (Θ̂_n − θ) U_n′(Θ*_n)      (4)
for some Θ*_n between θ and Θ̂_n. Thus
I_n(θ)^{1/2} (Θ̂_n − θ) = S_n(θ) {−I_n(θ)^{−1} U_n′(Θ*_n)}^{−1}.      (5)
Now U_n′(θ) = Σ_{i=1}^n u_i′(θ) is the sum of n independent and identically distributed components, with E_θ{−u_i′(θ)} = i(θ). Also, I_n(θ) = n i(θ). So, by the weak law of large numbers, −I_n(θ)^{−1} U_n′(θ) →ᵖ 1.
Furthermore, with probability 1 Θ̂_n → θ, so that eventually Θ*_n ∈ N, and then
|I_n(θ)^{−1} {U_n′(Θ*_n) − U_n′(θ)}| = i(θ)^{−1} |n^{−1} Σ_{i=1}^n {l″(X_i, Θ*_n) − l″(X_i, θ)}|
≤ i(θ)^{−1} n^{−1} Σ_{i=1}^n |l″(X_i, Θ*_n) − l″(X_i, θ)|
≤ i(θ)^{−1} |Θ*_n − θ| Z̄_n,
where Z̄_n := n^{−1} Σ_{i=1}^n Z_i →ᵖ E(Z | θ), by the weak law of large numbers, while |Θ*_n − θ| →ᵖ 0 by strong, and hence weak, consistency of Θ̂_n. We deduce
−I_n(θ)^{−1} U_n′(Θ*_n) →ᵖ 1.      (6)
The result now follows on applying Slutsky's lemma to (5), using S_n(θ) →ᵈ N(0, 1) and (6). □
The above result is often expressed as: Θ̂_n →ᵈ N{θ, I_n(θ)^{−1}} (although the right-hand side itself varies with n). This can be read as: "Θ̂_n is asymptotically normal and asymptotically unbiased, with asymptotic variance I_n(θ)^{−1}". Note that this asymptotic variance is the Cramér-Rao lower bound on the variance of any unbiased estimator. Thus Θ̂_n is "asymptotically efficient".
For (iv), use Θ̂_n →ᵖ θ to see that (assuming the function i(θ) is continuous) I_n(Θ̂_n)/I_n(θ) = i(Θ̂_n)/i(θ) →ᵖ 1. The result now follows by Slutsky's lemma. Similarly the result for (i) follows on using the weak law of large numbers to show −l_n″(θ)/I_n(θ) →ᵖ 1. The argument for (ii) is similar.
There is both theoretical and experimental evidence that the data-dependent scalings (especially (ii)) yield closer approximations to the normal distribution than use of (iii). Moreover, as in Example 4, using the fully data-based form (ii), we can readily form an approximate 95% (say) confidence interval for Θ as θ̂_n ± 1.96 ĵ_n^{−1/2} (alternatively, using (iv), we obtain θ̂_n ± 1.96 ı̂_n^{−1/2} — this form is more commonly seen, but may be less accurate). In contrast, direct use of (iii) or (i) would involve solving a typically non-linear equation for θ — and would generally give a poorer approximation to boot. (Note however that none of these approximate intervals transforms properly under a non-linear change of parameter).
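For concreteness, here is a sketch of the fully data-based interval using form (ii), computed for an assumed exponential model with rate θ, for which θ̂_n = 1/x̄ and ĵ_n = n/θ̂_n² are available in closed form (the model and all numbers are invented for the example):

import numpy as np

rng = np.random.default_rng(5)
n, theta_true = 200, 1.7
x = rng.exponential(1 / theta_true, n)

theta_hat = 1 / x.mean()           # MLE of the rate
jhat = n / theta_hat ** 2          # observed information -l''_n at the MLE
half_width = 1.96 / np.sqrt(jhat)
print("approx 95% CI:", theta_hat - half_width, theta_hat + half_width)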
4 Hypothesis tests
Suppose Ω is an interval of the real line, and we wish to test the null hypothesis
H0 : Θ = θ0
for some specified θ0 ∈ , against the alternative H1 that leaves Θ totally un-
specified.
The following important result now follows from the asymptotic equivalence,
demonstrated informally in § 3, of (c) and Sn (θ); a rigorous argument can be
developed along the same lines as Theorem 7.
⁵ You may sometimes see the definition given with numerator and denominator interchanged.
Notes: Suppose independent and identically distributed sampling from a continuous model.
(i). Use of the approximate χ² distribution function for W is typically accurate with a relative error of order O(n^{−1}).
(ii). The signed root likelihood ratio statistic ((c) of § 3) is r(θ_0) := sgn(θ̂ − θ_0) √{W(θ_0)}; this will be asymptotically standard normal under P_{θ_0}. However this approximation is typically accurate only to order O(n^{−1/2}).
(iii). In fact E_{θ_0}{W(θ_0)} has the form 1 + b(θ_0)/n + O(n^{−2}), where we can in principle calculate the function b(θ). If we replace W by W′ := W/{1 + b(θ)/n} — a move known as Bartlett adjustment — we find that not only the mean but the whole distribution is now χ² with a relative error of only O(n^{−2}). For small samples this can make a dramatic improvement.
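The following sketch computes the Wilks statistic W(θ_0) = −2{l_n(θ_0) − l_n(θ̂_n)} and its approximate χ²_1 p-value, again for an assumed exponential model (purely illustrative; all numbers invented):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, theta_true, theta0 = 200, 1.7, 1.5
x = rng.exponential(1 / theta_true, n)

def loglik(theta):
    # exponential rate model: l_n(theta) = n log(theta) - theta * sum(x)
    return n * np.log(theta) - theta * x.sum()

theta_hat = 1 / x.mean()
W = -2 * (loglik(theta0) - loglik(theta_hat))
print("W =", W, " p-value =", chi2.sf(W, df=1))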
The above results (and their proof techniques) extend in a pretty straightforward manner to a p-dimensional parameter space. We then have asymptotic equivalence of the random vectors A_n^{−1} U_n(θ) and A_n^T (Θ̂_n − θ), where A_n is a matrix
⁶ or, indeed, by replacement of X by a sufficient reduction.
⁷ Ideally one would like this to hold, simultaneously, for all θ ∈ Ω. This would require conditions and arguments allowing the extension of the relevant asymptotic sampling results (such as Theorem 7) to hold uniformly over Ω.
square root⁸ of K_n, i.e. A_n A_n^T = K_n, the matrix K_n being any of I_n(θ), ı̂_n, J_n or ĵ_n (with their obvious multiparameter definitions). Any of these quantities will have an asymptotic N(0, I_p) distribution under P_θ; in particular, asymptotically U_n(θ) ∼ N(0, I_n(θ)) and (Θ̂_n − θ) ∼ N(0, I_n(θ)^{−1}). The asymptotic distribution of U_n(θ)^T K_n^{−1} U_n(θ) and of (Θ̂_n − θ)^T K_n (Θ̂_n − θ) will be χ²_p, as will be that of the asymptotically equivalent Wilks statistic W(θ) := −2l̄_n(θ).
H0 : Θ ∈ M
Proof. (This proof emphasises algebraic and probabilistic aspects. A fully rigorous account
would require additional technical conditions and analysis, along the lines of Theorem 7 and its
proof.)
In a neighbourhood of θ_0, we can regard Ω as an open subset of R^p, and parametrise M as {h(β) : β ∈ B}, where B is an open subset of R^q and h : B → Ω is (1, 1) and differentiable. Thus under M, we can take as our parameter B ∈ B, and then Θ = h(B). Let B = β_0 ∈ B correspond to the specific null value Θ = θ_0 ∈ M, i.e. θ_0 = h(β_0). Let U, I be the p × 1 score vector and p × p information matrix for Θ, evaluated at θ_0; and let V, J be the q × 1 score vector and q × q information matrix for B, evaluated at β_0. Also let H be the p × q matrix with (i, j) entry h_{ij} = ∂h_i(β)/∂β_j, again evaluated at β_0. Then
V = H^T U
J = H^T I H.
Let l_n^0 denote the log-likelihood of the data under the null value Θ = θ_0, or equivalently B = β_0. Under the null we have asymptotic equivalence of −2{l_n^0 − sup_{θ∈Ω} l_n(θ)} and U^T I^{−1} U; and, similarly, of −2{l_n^0 − sup_{β∈B} l_n(h(β))} and V^T J^{−1} V.
⁸ We could use e.g. the symmetric or lower triangular square root; but should not let the way we form this square root depend on the data.
W* := U^T {I^{−1} − H(H^T I H)^{−1} H^T} U
= Z^T {I_p − K(K^T K)^{−1} K^T} Z
= Z^T Π Z
where Z := I^{−1/2} U, K := I^{1/2} H, and Π is the matrix of the orthogonal projection onto the orthogonal complement of the column space of K, of rank r = p − q.
Note particularly that the asymptotic distribution of W is the same, for any value of Θ satisfying the null model. We can test H_0 against H_1 by referring the observed value w of W to tables of the χ²_r distribution.
Although we have proved Wilks's theorem only for the independent and identically distributed case, it holds much more generally. □
Chapter V
Nonparametric methods
The cumulative distribution function (cdf) of a distribution P on R is
F(x) = P(−∞, x] = Prob(X ≤ x) where X ∼ P.
It has the properties:
(i). F is non-decreasing
(ii). F is right-continuous
(iii). F(x) → 0 as x → −∞, and F(x) → 1 as x → +∞.
Conversely, any function with these properties can serve as a cdf. We then have F(x−) := lim_{y↑x} F(y) = P(−∞, x) = Prob(X < x).
Thus P̂_n(x^n) attaches probability mass 1/n to each observed data-point (with suitable accounting for any repetitions). Before we have conducted the experiment, we have random X^n, and correspondingly random P̂_n = n^{−1} Σ_{i=1}^n δ_{X_i}.
The cdf of P̂_n is the (random) empirical distribution function F̂_n(·), given by
F̂_n(x) = n^{−1} #{i ≤ n : X_i ≤ x}.      (2)
Intuitively, for large n, P̂_n (resp., F̂_n) should approximate P (resp., F). Here we shall try to refine this intuition, and assess the uncertainty involved.
The following consistency result is an immediate consequence of the strong law
of large numbers:
The limiting behaviour in (3) does not hold uniformly across all measurable A. Indeed we typically have, with probability 1,
lim inf_{n→∞} sup_A |P̂_n(A) − P(A)| > 0.
Thus suppose that the discrete part (if any) of P has total mass d < 1. Taking A_n to be the (discrete) support of P̂_n, for all n we will have P̂_n(A_n) = 1, P(A_n) ≤ d, so that sup_A |P̂_n(A) − P(A)| ≥ 1 − d > 0.
2 Uniformly consistent estimation of the distribution function
Proof. Let ε > 0. There exists a finite sequence −∞ = t_0 < t_1 < … < t_k = ∞ such that F(t_i−) − F(t_{i−1}) = Prob(t_{i−1} < X < t_i) ≤ ε for all i. Then for t_{i−1} < t < t_i,
F̂_n(t) − F(t) ≤ max_i {F̂_n(t_i−) − F(t_i−)} + ε.
Similarly
F(t) − F̂_n(t) ≤ max_i {F(t_{i−1}) − F̂_n(t_{i−1})} + ε
(this holding trivially if t is one of the {t_i}). But by Corollary 2, for each i, Prob(|F̂_n(t_i) − F(t_i)| → 0) = 1 and Prob(|F̂_n(t_i−) − F(t_i−)| → 0) = 1. Since the conjunction of a finite number of almost sure events is almost sure, with probability 1
max_i max{|F̂_n(t_i) − F(t_i)|, |F̂_n(t_i−) − F(t_i−)|} → 0,
whence lim sup_{n→∞} sup_{t∈R} |F̂_n(t) − F(t)| ≤ ε with probability 1. Since ε > 0 was arbitrary, the result follows. □
Suppose X has continuous real cdf F, and consider the variable U := F(X). Then U ∼ U(0, 1): the probability integral transformation.
In particular, suppose we wish to test the hypothesis H0 that the (Xi ) are in-
dependent and identically distributed with specified continuous cdf F . This is
equivalent to the hypothesis H0∗ that, with Ui := F (Xi ), the (Ui ) are independent
and identically distributed U(0, 1), i.e. have cdf F ∗ (u) ≡ u (0 < u < 1). This
reduces the problem of testing a quite general continuous F to the special case
F = F ∗.
4 Kolmogorov-Smirnov test
It is readily seen that (for continuous F) d_n(u) = F̂_n(x) − F(x) with u = F(x), and so
K_n = sup_{x∈R} |F̂_n(x) − F(x)|.      (7)
This is a form of the Kolmogorov-Smirnov statistic for testing the null hypothesis H_0 that the common distribution function of the (X_i) is F. The distribution of K_n when H_0 holds will not depend on F, though it will depend on n.
Tables of the distribution of Kn are available: see e.g. Birnbaum, Z. W., (1952),
Journal of the American Statistical Association, Vol. 47, pp. 425–441. For n > 35
the asymptotic distribution (see § 4.1 below) can be used.
It simplifies computation of K_n to note that the supremum in (7) must be attained at a data-point. To perform a size-α test of H_0, we look up d_α(n), the (1 − α)-quantile of the distribution of K_n, and reject if the observed value of K_n exceeds d_α(n). We can also form a level-γ "confidence band" F_γ, consisting of all those continuous distribution functions F which would not be rejected by the test of size α = 1 − γ. That is, F ∈ F_γ if, for all x, F(x) lies within the constant-width band F̂_n(x) ± d_{1−γ}(n) about the empirical distribution function.
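In code, the computation of K_n at the data-points, and a comparison with a library routine, might look as follows (a sketch: the data and the hypothesised N(0, 1) null are invented; scipy.stats.kstest serves only as a cross-check):

import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(7)
x = np.sort(rng.normal(0.2, 1.0, 50))     # data; the H0 below is slightly false

F = norm.cdf(x)                           # hypothesised N(0,1) cdf at data-points
n = len(x)
i = np.arange(1, n + 1)
K_n = np.maximum(i / n - F, F - (i - 1) / n).max()   # sup attained at a data-point

print("K_n by hand:", K_n)
print("scipy      :", kstest(x, norm.cdf))   # same statistic, plus a p-value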
4.1 Asymptotics
For any fixed v ∈ (0, 1), nF̂*_n(v) = #{i ≤ n : U_i ≤ v} has the binomial distribution B(n; v) with mean nv and variance nv(1 − v). By the central limit theorem, n^{1/2} d_n(v) →ᵈ N{0, v(1 − v)}. Similarly, for 0 < v < u < 1, the distribution of n{F̂*_n(v), F̂*_n(u) − F̂*_n(v), 1 − F̂*_n(u)} is trinomial with probabilities (v, u − v, 1 − u), whence we find cov{n^{1/2} d_n(v), n^{1/2} d_n(u)} = v(1 − u).
It can be shown that, considered as random functions from (0, 1) to R, the distribution of n^{1/2} d_n(·) converges (in a suitable sense) to that of the 0-mean Gaussian process B(·) determined by the above asymptotic covariance structure: cov{B(v), B(u)} = v(1 − u) (v ≤ u) (this is known as the Brownian bridge); and that the limiting distribution of the functional n^{1/2} K_n = sup{n^{1/2} |d_n(u)| : 0 < u < 1} of n^{1/2} d_n(·) is given by the distribution of the same functional of the limiting process B(·), viz. K := sup{|B(u)| : 0 < u < 1}. Derivation of this distribution (the Kolmogorov distribution) is beyond our scope, so we merely assert: for x > 0,
Prob(K ≤ x) = 1 − 2 Σ_{i=1}^∞ (−1)^{i−1} e^{−2i²x²} = {√(2π)/x} Σ_{i=1}^∞ e^{−(2i−1)²π²/(8x²)}.      (8)
For large n the size-α Kolmogorov-Smirnov test will reject H_0 for K_n > n^{−1/2} K_α, where K_α is chosen so that Prob(K > K_α) = α when K has distribution (8). It follows from the Glivenko-Cantelli theorem that the Kolmogorov-Smirnov test is consistent, i.e. if F is not the true distribution function then the probability of rejection will tend to 1 as n → ∞.
If we replace X_i by F(X_i) and Y_j by F(Y_j),³ the problem is transformed to one in which (under H_0) F is replaced by the uniform U(0, 1) cdf, while the value of D_{n,m} is unaltered. Consequently the null distribution of D_{n,m} is the same, no matter what may be the common distribution function F.
Take n = Np, m = Nq (p ∈ (0, 1), q = 1 − p) and let N → ∞. Then N^{1/2} p^{1/2} {F̂_X(·) − F_X(·)} →ᵈ B_X(·), and N^{1/2} q^{1/2} {F̂_Y(·) − F_Y(·)} →ᵈ B_Y(·), where B_X, B_Y are independent Brownian bridge processes. So, under H_0, N^{1/2} {F̂_X(·) − F̂_Y(·)} →ᵈ p^{−1/2} B_X(·) − q^{−1/2} B_Y(·). Consideration of the covariance function of this Gaussian limiting process shows it to be of the form (p^{−1} + q^{−1})^{1/2} B*(·), where B* is a Brownian bridge. We deduce that the asymptotic null distribution of D*_{n,m} := {nm/(n + m)}^{1/2} D_{n,m} is the Kolmogorov distribution (8). Hence we can conduct an asymptotically size-α test of the (composite, nonparametric) null hypothesis H_0 : F_X ≡ F_Y by rejecting H_0 when D*_{n,m} > K_α. Again, this test is consistent, i.e. rejects with probability tending to 1 if F_X ≢ F_Y.
³ Note that we cannot actually compute these values in absence of knowledge of F.
Chapter VI
The bootstrap
1 Pivotal inference
Pivots are useful for constructing confidence intervals and hypothesis tests. Thus in the known variance case, whatever be the value µ of M, the probability is 95% that K lies in the interval ±1.96 v^{1/2}, and consequently this is also the probability that the interval X̄ ± 1.96 (v/n)^{1/2} contains µ. With both parameters unknown, the familiar t-test of H_0 : M = µ_0 is based on the fact that, if H_0 is true, the pivot
¹ The scaling by n^{1/2}, while not essential for immediate purposes, yields a distribution that is insensitive to sample size. For asymptotic analysis it is helpful to consider pivots with non-trivial limiting distributions.
T is equal to (X̄ − µ_0)/(S/√n), which can be calculated from the data; and if that value does not appear consistent with the t_{n−1}-distribution the hypothesis is put in doubt. Similarly, we can base confidence intervals for M on the pivot T.
For the normal case, the actual sampling distributions of the above pivots are
available from general theory, and readily manipulated. Let us suppose for a
moment that one or other of these conditions were false. What could we do? For
simplicity we just consider the normal location model of Example 1.
Suppose we have data (x_1, …, x_n). We start by estimating the underlying data distribution: an obvious estimate is N(x̄, v). Although this of course differs from the true distribution N(M, v), it belongs to the same normal location family. Consequently, both these distributions will have the identical distribution, D say, for the pivot K. So consider a new sample, (X*_1, …, X*_n), the same size n as the original data-set, but now generated from this estimated distribution. Then K* := n^{1/2} (X̄* − x̄) has the same distribution as K = n^{1/2} (X̄ − M). Even if we couldn't actually compute this distribution, we can readily approximate it by simulation. From the estimated distribution N(x̄, v) we generate many random samples ("resamples") of size n, and for each compute the associated value k* of the pivot K*: thus for a specific resample (x*_1, …, x*_n), we obtain k* = n^{1/2} (x̄* − x̄).
If we do this for B independent resamples, we will obtain a random sample of size B from the distribution D of K* — which is the same as that of K. Let D* denote the empirical distribution of these values, i.e. the discrete distribution putting probability 1/B on each value appearing (with obvious adjustment in the case of repeated values). Then D* is an approximation to the desired pivotal distribution D. Using D*, we can now conduct approximate pivotal inference for M, based on the original sample mean x̄. If B is sufficiently large the approximation will be good, with high probability.
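The scheme just described is easily coded. In the sketch below (illustrative throughout: v, n and the seed are invented, and B is the number of resamples), the quantiles of the simulated pivot values are inverted to give an interval for M:

import numpy as np

rng = np.random.default_rng(8)
v, n = 4.0, 30
x = rng.normal(10.0, np.sqrt(v), n)        # the original data (true M = 10)
xbar = x.mean()

B = 10_000
resamples = rng.normal(xbar, np.sqrt(v), size=(B, n))
k_star = np.sqrt(n) * (resamples.mean(axis=1) - xbar)   # simulated pivot values

# D* = empirical distribution of k_star; invert K = sqrt(n)(Xbar - M):
lo, hi = np.quantile(k_star, [0.025, 0.975])
print("95% interval for M:", xbar - hi / np.sqrt(n), xbar - lo / np.sqrt(n))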
Staying with the normal location model, suppose (purely for illustration) we are interested in the upper quartile Q_u of the distribution. We could just use the known normal-theory relationship Q_u = M + 0.6745 v^{1/2} to make an appropriate adjustment to our inference for M. Alternatively (if less efficiently) we might base inference directly on the pivot K_u := n^{1/2} (X_u − Q_u), where X_u is the upper quartile of the sample (X_1, …, X_n). Again, for a given dataset (x_1, …, x_n) we can use resampling from the estimated distribution N(x̄, v) to approximate the distribution of K_u, and hence make approximate pivotal inference about Q_u.
3 Nonparametric bootstrap
A still further level of approximation ignores the fact that the underlying distribution is known to be normal. For our estimated distribution we now use, not N(x̄, v), but a non-parametric estimate based on the original sample (x_1, …, x_n). Various forms are possible, but the most usual is also the simplest: just the empirical distribution of the data. The corresponding estimate of Q_u is now the sample upper quartile, x_u.
A resample is now simply a random sample with replacement from the values in the original data-set (in particular, this will usually contain some repeats). Let (x*_1, …, x*_n) be a generic resample. Then the associated estimate of the pivot is k*_u := n^{1/2} (x*_u − x_u). If n is large then the empirical distribution of the data will be close to the parametric estimate N(x̄, v), and then an empirical distribution for k*_u, computed from a large number B of resamples, will approximate the true distribution of K_u = n^{1/2} (X_u − Q_u), and can be used, together with the estimate x_u from the original data, as a basis for approximate pivotal inference about Q_u.
The nonparametric bootstrap made no use of the assumption that the underlying
distribution was normal with variance v. Consequently it could be applied, un-
modified, to any location family, to make inference about (say) its upper quartile
Qu . So long as n is large enough, the shape of the estimated distribution, used
for resampling, should be close to that generating the original data—i.e. they
should, to a good approximation, differ only in their location. Then (so long
as M is large enough) the bootstrap distribution of the pivot should be close to
its true distribution, and hence usable to make inference about Qu . And this
holds for any underlying distribution—which can thus be completely unspecified.
In particular, even the assumption of an underlying location model is no longer
needed.
There are various ways of using the bootstrap to form an approximate 95% (say)
confidence interval for a parameter such as (say) Qu . These include:
Basic bootstrap interval As above, but take l_− [resp., l_+] to be the lower [resp., upper] 2.5%-point of the nonparametric bootstrap estimate of the distribution of K_u.
Percentile interval Take the lower [resp., upper] confidence limit for Q_u to be the lower [resp., upper] 2.5%-point of the bootstrap distribution of X*_u. (This works if there is a monotonic function g such that the pivot g(X_u) − g(Q_u) has a symmetric distribution about 0, with a shape well approximated by that of its bootstrap distribution.)
Both constructions are sketched in code below.
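A sketch of both intervals for the upper quartile (illustrative only: exponential data are invented to play the role of the unknown F, and B = 10,000 resamples are used):

import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(1.0, 200)              # data from an "unknown" F
n = len(x)
xu = np.quantile(x, 0.75)                  # sample upper quartile

B = 10_000
idx = rng.integers(0, n, size=(B, n))      # resample with replacement
xu_star = np.quantile(x[idx], 0.75, axis=1)

# Basic interval: bootstrap the pivot K_u = n^{1/2}(X_u - Q_u)
k_star = np.sqrt(n) * (xu_star - xu)
l_minus, l_plus = np.quantile(k_star, [0.025, 0.975])
print("basic     :", xu - l_plus / np.sqrt(n), xu - l_minus / np.sqrt(n))

# Percentile interval: quantiles of X_u* directly
print("percentile:", np.quantile(xu_star, 0.025), np.quantile(xu_star, 0.975))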
The same essential logic can be applied using other pivots. For example, if we had started with a location-scale model, we might have used a "Student-t"-type pivot, such as n^{1/2} (Q_u − X_u)/S, where S is some location-invariant estimator of scale, such as the interquartile range of the data (X_1, …, X_n). We can again approximate its distribution, again by resampling from the empirical distribution of the data. And this might be more "robust", and more quickly convergent, than just using n^{1/2} (Q_u − X_u) as our pivot, since all we now require is that the empirical distribution of the data and the true data-generating distribution should agree (approximately) up to a location-scale transformation (i.e., we need no longer require them to have approximately the same scale). Such "studentised pivots" are generally recommended over simple unscaled pivots.
4 Further considerations
The basic idea of bootstrap inference, and its basic implementation (though highly computer-intensive), are extremely simple and appealing — almost too good to be true. But it really does work.
Of course there are many loose ends to tie up. Theoretically, we would want to
be able to show that, under suitable conditions, the convergence of the sample
empirical distribution to that of the data-generating distribution is fast enough
that, for large n, substituting the former for the latter is asymptotically negli-
gible, when we focus attention on their associated distributions for the chosen
pivot. We would also like this convergence to hold uniformly, over a suitable
Chapter VII
Bayesian Inference
STATISTICS = (PROBABILITY)^{−1}
versus, on the Bayesian view,
STATISTICS = PROBABILITY.
If only everyone agreed that the Bayesian approach was the right one, then Statistics would be set on a firm logical and mathematical basis!
This logical coherence, avoiding the need for a new trick for every new problem,
means — in principle at least — that Bayesian methods can be applied to an
enormous range of extremely complicated and realistic models. In practice, how-
ever, this raises new challenges: both of specifying appropriate prior distributions
¹ There is a large body of theory, based for example on principles of rational economic behaviour, justifying the use of Probability Theory for this purpose.
A truly fundamentalist Bayesian approach does not mess with parametric models at all, but directly models joint epistemological uncertainty about a collection of observables including both those to be observed and those to be predicted; it focuses on directly finding the conditional distribution of the latter, given observations on the former. However for current purposes we restrict attention to Bayesian analysis of a parametric statistical experiment E = (X, X, Ω, Θ, P).
The statistical model P provides the conditional density p(x | θ) (with respect to
some fixed dominating measure µ on X) of the observable X, given Θ. It is up
to the statistician/decision maker to specify, in addition, a marginal distribution
Π (interpreted as representing epistemological — typically “subjective” or “per-
sonal” — uncertainty) for the parameter Θ, which now has the mathematical
status of a random variable. We denote the density of Π (with respect to a suitable measure ν on the parameter-space Ω) by π(·). The distribution Π describes
perceived uncertainty about Θ before taking any account of the observed value
of X, and should ideally be specified at this stage. It is thus termed the prior
distribution of Θ.
There is no necessary connexion whatsoever between the two building blocks: the
conditional distributions for X given Θ, as specified by the statistical model, and
the marginal distribution for Θ, as described by the prior. But once they have
been separately specified, their conjunction fully determines the joint distribution
for (X, Θ). We can manipulate this in any way we please, using the standard
probability calculus, to address any queries we may have.
The joint density of (X, Θ) (with respect to µ × ν over X × Ω) is
p(x, θ) = p(x | θ) π(θ).      (1)
Marginally, X then has density
p(x) = ∫ p(x | θ) π(θ) dν(θ),      (2)
a mixture of the model densities weighted by the "subjective" prior density π(θ). It represents your uncertainty about how X will turn out, lacking knowledge of the value of Θ. It is termed the (prior) predictive density of X.
Using (1) and (2), the conditional density π(θ | x) for Θ, given X = x, is
π(θ | x) = p(x, θ)/p(x) ∝ p(x | θ) π(θ)      (3)
where the proportionality sign indicates the omission of a factor (viz., 1/p(x)) that does not depend on the argument θ: after observing data X = x, this factor is simply a constant. It is enough to know π(θ | x) up to such an unspecified factor, which can always be recovered from the normalisation condition: ∫ π(θ | x) dν(θ) = 1.
The distribution Πx , whose density is π(θ | x), represents the relevant episte-
mological uncertainty about Θ after having observed X = x. Obtaining and
describing this posterior distribution is typically the main, and often the sole,
aim of a Bayesian analysis. We can apply any probabilistic manipulations to this
end, but normally it will be accomplished by the application of formula (3). This
is Bayes’s theorem. Since, for the given data X = x, p(x | θ) is proportional to
the likelihood function L(θ), we can also write this as
π(θ | x) ∝ π(θ) L(θ). (4)
In words:
POSTERIOR ∝ PRIOR × LIKELIHOOD.
In particular, the only feature of the statistical model needed to perform a prior-
to-posterior analysis is the likelihood function (up to proportionality) for the
specific data at hand. Thus so long as the identical prior distribution is used
for a quantity Θ in any experiment in which it appears as a parameter (a not
unreasonable requirement), Bayesian inference will respect LP.
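Formula (4) can be applied quite mechanically on a grid. The sketch below is an illustration only: it assumes the 9-heads-in-12-tosses likelihood of Example 6 of Chapter II with a uniform prior, so the exact posterior is Beta(10, 4); the normalising factor is recovered numerically at the end.

import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter space
dt = theta[1] - theta[0]
prior = np.ones_like(theta)              # uniform prior density on (0, 1)
lik = theta ** 9 * (1 - theta) ** 3      # likelihood: 9 heads in 12 tosses
post = prior * lik                       # posterior ∝ prior × likelihood
post /= post.sum() * dt                  # normalise to integrate to 1

print("posterior mean:", (theta * post).sum() * dt)   # exact: 10/14 ≈ 0.714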
1.1 Sufficiency
Another way of understanding this property is as follows. Sufficiency of T can be written as the conditional independence property:
X ⊥⊥ Θ | T,
expressing the fact that the conditional distribution of X, given both Θ and T, can be chosen to depend on T alone. This interpretation of conditional independence is meaningful even if Θ is not random. However, as soon as we take a Bayesian stance, so that Θ is just another random variable, we can use the symmetry of probabilistic independence to deduce:
Θ ⊥⊥ X | T.      (5)
And we can now interpret (5) as asserting that the conditional (= posterior) distribution of Θ, given X and (redundantly) T, in fact depends on T alone.
The term Prob(H_i)/Prob(H_j) is the prior odds on H_i as against H_j, and similarly Prob(H_i | x)/Prob(H_j | x) is the posterior odds. The term p_i(x)/p_j(x) = L_i/L_j is the corresponding likelihood ratio. So
POSTERIOR ODDS = PRIOR ODDS × LIKELIHOOD RATIO.
Knowing the (prior or posterior) odds between all pairs of hypotheses, we know the probabilities up to a scale factor, and hence, by normalisation, we can recover them completely. This is particularly useful in the case of just two hypotheses H_0 and H_1. In this case the odds ω in favour of H_1 as against H_0 is just a recoding of the probability π = Prob(H_1) (and hence of the distribution over (H_0, H_1)): ω = π/(1 − π), π = ω/(1 + ω). The likelihood ratio term is p_1(x)/p_0(x).
1.3 Summarisation
Bayes factor
Note how the prior probabilities (π_0, π_1) only enter into the first factor in (7), while the prior distributions of Θ conditional on the hypotheses only enter into the second factor. This second factor is termed the Bayes factor (in favour of H_1 as against H_0). It is essentially a likelihood ratio in favour of H_1 as against H_0, but now the relevant "likelihood" of H_1 is not purely objective, since it involves the conditional prior density π(θ | H_1).
The posterior probability of H_0 and the Bayes factor will depend on the data only through the value of a sufficient statistic.
1.4 Prediction
2 Asymptotics
For large n, the likelihood concentrates around the MLE, where it is approximately
L_n(θ) ≈ exp{−½ ĵ_n (θ − θ̂_n)²}.      (10)
So long as the prior density π(θ) is continuous and positive, it will be approx-
imately constant in this region, so that the posterior density will be essentially
proportional to the likelihood (10) in the region where it is significant, and neg-
ligible elsewhere. We deduce that asymptotically
Θ ∼ N(θ̂_n, ĵ_n^{−1}).      (11)
Importantly, this applies for any (continuous and positive) prior density: given
enough data, a priori divergent opinions will be brought into posterior agreement.
We note the similarity of the conclusion (11), expressed as ĵ_n^{1/2} (θ̂_n − Θ) ∼ N(0, 1), to the asymptotic sampling property ĵ_n^{1/2} (Θ̂_n − θ) ∼ N(0, 1). Thus with enough data both approaches will give numerically similar inferences (though with different interpretations).
3 Conjugate families
Although there is no logical relationship between the statistical model and the
prior distribution, certain forms of prior can be more tractable than others. Even
though these usually will not be good approximations to real prior uncertainty,
they can be used as building blocks in more complex and realistic specifications,
thus simplifying their analysis. Such “artificial” priors are therefore worth atten-
tion.
(13), and we observe data (x1, . . . , xn) from the model (12), then the posterior
will evidently have the same form, with the hyperparameters n0 and t0 replaced,
respectively, by n1 = n0 + n and t1 = t0 + t (where t = Σ_{i=1}^n t(xi)). Consequently a
conjugate family is closed under sampling. Since it is itself a (p + 1)-parameter
family, it will usually be easy to describe and manipulate its generic member,
thus solving at one fell swoop all possible inference problems that could arise for
any data.
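As a sketch of this closure property (the choice t(x) = x below, and the data, are illustrative assumptions), the hyperparameter update can be coded directly:

    import numpy as np

    # Sketch: conjugate update (n0, t0) -> (n0 + n, t0 + sum_i t(x_i)).
    def conjugate_update(n0, t0, x, t=lambda x: x):
        x = np.asarray(x, dtype=float)
        return n0 + x.size, t0 + t(x).sum()

    n1, t1 = conjugate_update(n0=2.0, t0=1.0, x=[0.4, 1.3, 0.7])
    print(n1, t1)    # (5.0, 3.4): same family, updated hyperparameters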
Note that, since π0 (and indeed the underlying measure ν) is arbitrary, any prior
is a member of some conjugate family, so it does not really make sense to talk of
“a conjugate prior”. However, such a description is frequently used for the case
where Θ is the natural parameter Φ (so u(φ) ≡ φ), ν is Lebesgue and π0 ≡ 1.
4 Normal models
Consider the normal location model, parametrised by its mean M with the variance v known. Let h := v⁻¹, the sampling precision. Thus

Ln(µ) ∝ exp{−½ nh (x̄ − µ)²}.   (15)
Conjugacy suggests taking a prior of the same (normal) form:

π(µ) ∝ exp{−½ h0 (µ − m0)²}   (16)

i.e.

M ∼ N(m0, h0⁻¹).   (17)
The posterior density π(µ | x) ∝ π(µ) Ln(µ) is again the exponential of a quadratic form in µ, so that

π(µ | x) ∝ exp{−½ hn (µ − mn)²},

i.e.

M | x ∼ N(mn, hn⁻¹)   (18)

with

hn = h0 + nh   (19)
mn = (h0 m0 + nh x̄)/(h0 + nh).   (20)
Summaries
Interval estimate   A 95% credible interval for M would be mn ± 1.96 hn^{−1/2}.
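A minimal sketch of the update (19)-(20) and of this interval, with made-up inputs:

    import numpy as np

    # Sketch of the normal location posterior, eqs (19)-(20).
    def normal_posterior(m0, h0, xbar, n, h):
        hn = h0 + n * h                       # posterior precision (19)
        mn = (h0 * m0 + n * h * xbar) / hn    # posterior mean (20)
        return mn, hn

    mn, hn = normal_posterior(m0=0.0, h0=1.0, xbar=1.8, n=25, h=2.0)
    half = 1.96 / np.sqrt(hn)
    print(mn - half, mn + half)               # 95% credible interval for M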
Hypothesis test If there were some value µ0 for M of special interest — corre-
sponding to some “null hypothesis” H0 : M = µ0 of no effect — we might assess
the tenability of the hypothesis H0 in the light of the data by assessing how far
out in a tail of the posterior distribution is the value µ0 : this could be quantified
by the posterior probability Prob(M ≥ µ0 | x), with values close to either 0 or 1
making H0 look untenable. Note however that since our prior probability that
H0 holds exactly was 0, so will be its posterior probability.
Predictive distribution
X ∼ N(m0, h0⁻¹ + h⁻¹).   (22)

A similar argument shows that the prior predictive distribution of the sample
mean X̄ of n future observations is

X̄ ∼ N(m0, h0⁻¹ + (nh)⁻¹).   (23)
Bayes factor
Z | H0 ∼ N(0, 1); while, using (23), Z | H1 ∼ N(0, hn/h0). So the Bayes factor in favour of H1 is

B10 = p(z | H1)/p(z | H0) = (h0/hn)^{1/2} exp{½ (nh/hn) z²}.   (25)
For large n we have hn ≈ nh and mn ≈ x̄, i.e.

M | x ∼ N{x̄, (nh)⁻¹}.   (26)

This may be compared with the sampling property:

X̄ | µ ∼ N{µ, (nh)⁻¹}.

In either case we have X̄ − M ∼ N{0, (nh)⁻¹}, but for the sampling distribution this holds conditionally on the parameter M, whereas for the Bayesian posterior it holds conditionally on the data X.
If we compare the posterior means m*n and m†n based on two different priors, N{m*0, (h*0)⁻¹} and N{m†0, (h†0)⁻¹}, we find they approach each other at rate n⁻¹, as against the slower rate n^{−1/2} at which each approaches the true value of M. Using a suitable measure of distance between distributions, the full posterior distributions approach each other at rate n^{−1/2}, a result that continues to hold even for non-normal priors (and extends to other models). So the posterior distribution M | x ∼ N{x̄, (nh)⁻¹} is "asymptotically objective", in that it is insensitive to initial opinions about M.

As for the posterior predictive distribution (24), given extensive data this will be approximately N(x̄, h⁻¹); and the resulting distributions for different priors now approach each other at rate n⁻¹ (a result that again extends to other models).
From (25), the Bayes factor B10 behaves asymptotically as (h0/h)^{1/2} n^{−1/2} exp{½ nhx̄²}. Note that the dependence on the prior precision h0 does not disappear in the limit. For fixed non-zero x̄, B10 → ∞, corresponding to increasingly strong evidence against H0.

It is interesting to contrast the Bayes factor approach to testing H0 with the classical approach. The latter would reject H0 when |Z| = (nh)^{1/2}|X̄| exceeds some critical point k (e.g., 1.96), irrespective of the value of n. The approximate Bayes factor at this critical point is (h0/h)^{1/2} n^{−1/2} exp{½ k²}, which tends to 0 as n → ∞, thus eventually discrediting H1 in favour of the null hypothesis H0. So the same data that would lead to rejection of H0 from a classical perspective can count strongly in its favour from the Bayesian perspective.
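This contrast is easy to see numerically. The following sketch evaluates (25) at the classical cut-off z = 1.96 for growing n (the precisions h = h0 = 1 are arbitrary assumptions):

    import numpy as np

    # Sketch: the Bayes factor (25) at a fixed classical critical value.
    def bayes_factor(z, n, h=1.0, h0=1.0):
        hn = h0 + n * h
        return np.sqrt(h0 / hn) * np.exp(0.5 * (n * h / hn) * z ** 2)

    for n in (10, 100, 1000, 10000):
        print(n, bayes_factor(1.96, n))   # tends to 0: evidence favours H0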
Formally letting h0 → 0 in (16) suggests the improper "flat" prior

π(µ) ∝ 1.   (27)
Again (27) does not describe a genuine probability distribution, since there is no
way of scaling it to integrate to 1: it is an “improper” prior density. Nevertheless,
if we insert it into Bayes’s theorem, expressed as π(µ | x) ∝ π(µ) Ln (µ) we obtain
the perfectly well-behaved “objective posterior” (26).
Fully subjective Bayesians frown upon such “objective” priors and posteriors,
and indeed the purely formal manipulations involved can not be put on a firm
logical basis, and can generate paradox and inconsistency. Nevertheless they
have a strong intuitive appeal. They are particularly prized as ways of gener-
ating essentially the same result as a frequentist analysis, but with a Bayesian
interpretation. For example, the standard frequentist 95%-confidence interval
x̄ ± 1.96 (nh)^{−1/2} has posterior probability exactly 95% under the formal posterior (26). Similarly, the P-value P = Prob[N{0, (nh)⁻¹} > x̄] for testing the hypothesis H0 : M = 0 against H1 : M > 0 can alternatively be interpreted as the posterior probability Prob(M < 0 | x) under (26).
Now suppose the mean is known, M = µ say, but the precision H is unknown.
The likelihood function for H is
Ln(h) ∝ h^{½n} e^{−½ n s*² h}   (28)

with s*² the observed value of S*² := n⁻¹ Σ_{i=1}^n (Xi − µ)², the mean squared deviation of the observations from their known population mean µ; this is an unbiased estimator of the variance V = H⁻¹. The likelihood form (28) shows that we have an exponential family; it has natural parameter H, natural sufficient statistic nS*², and mean-value parameter nV.
Conjugacy suggests taking a prior for H of similar form to (28). This is achieved
if we take
H ∼ Γ(½ν0, ½ν0τ0²)   (29)

(or, equivalently, H ∼ χ²_{ν0}/ν0τ0²), since this has density

π(h) ∝ h^{½ν0 − 1} e^{−½ν0τ0² h}   (30)
(where the proportionality sign hides a purely numerical factor, not depending on h). This has expectation 1/τ0², so that the hyperparameter τ0² is a "prior guess" at V = H⁻¹; and coefficient of variation (2/ν0)^{1/2}, so that the "degrees of freedom" hyperparameter ν0 is a measure of the precision of this prior.
Forming the posterior density by multiplying (30) and (28) and simplifying, we
find
H | x ∼ Γ(½νn, ½νnτn²)   (31)

with

νn = ν0 + n
τn² = (ν0τ0² + n s*²)/(ν0 + n).
In particular, the Gamma family of distributions for H is closed under sampling
from N (µ, H −1). We observe that the posterior precision hyperparameter is the
sum of the prior precision hyperparameter and the sample size, while the posterior
“guess” at V is a weighted average of the prior guess and the data-based estimate,
each weighted with its degrees of freedom.
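A sketch of this update in code (the inputs are hypothetical):

    # Sketch of the posterior (31) for the precision H.
    def gamma_precision_posterior(nu0, tau0_sq, n, s_star_sq):
        nu_n = nu0 + n
        tau_n_sq = (nu0 * tau0_sq + n * s_star_sq) / nu_n
        return nu_n, tau_n_sq   # H | x ~ Gamma(nu_n/2, nu_n*tau_n_sq/2)

    print(gamma_precision_posterior(nu0=4, tau0_sq=2.0, n=20, s_star_sq=1.5))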
5 Computational methods
independently. This family of joint distributions is not conjugate: the joint pos-
terior is not of the same form.
Using results for the normal scale and normal location models separately, we see
that, in the posterior,
M | x, h ∼ N{(h0m0 + nhx̄)/(h0 + nh), 1/(h0 + nh)}.   (32)

H | x, µ ∼ χ²_{ν0+n} / {ν0τ0² + (n − 1)s² + n(x̄ − µ)²}.   (33)
This technique, Gibbs sampling, is just one of a number of Markov Chain Monte
Carlo (MCMC) algorithms for sampling from a posterior distribution by simulat-
ing from a Markov chain that has that as its equilibrium distribution. Such com-
putationally intensive methods allow us to perform otherwise intractable Bayesian
inferences, and their development and application forms a major strand of mod-
ern Bayesian statistics.
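A minimal Gibbs sampler alternating the full conditionals (32) and (33) might look as follows (the simulated data and the diffuse hyperparameters are assumptions of the sketch):

    import numpy as np

    # Sketch: Gibbs sampling for (M, H) via the conditionals (32) and (33).
    rng = np.random.default_rng(0)
    x = rng.normal(1.0, 2.0, size=50)            # simulated data
    n, xbar, s_sq = x.size, x.mean(), x.var(ddof=1)
    m0, h0, nu0, tau0_sq = 0.0, 0.01, 1.0, 1.0   # diffuse-ish prior

    mu, h, mu_draws = xbar, 1.0 / x.var(), []
    for sweep in range(5000):
        hn = h0 + n * h                          # (32): M | x, h is normal
        mu = rng.normal((h0 * m0 + n * h * xbar) / hn, np.sqrt(1.0 / hn))
        scale = nu0 * tau0_sq + (n - 1) * s_sq + n * (xbar - mu) ** 2
        h = rng.chisquare(nu0 + n) / scale       # (33): scaled chi-squared
        mu_draws.append(mu)

    mu_draws = np.array(mu_draws[500:])          # discard burn-in
    print(mu_draws.mean(), np.quantile(mu_draws, [0.025, 0.975]))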
Chapter VIII
Decision Theory
1 Bayes act
Squared error loss   Suppose L(θ, a) = (θ − a)². Then the expected loss, for estimate a, is var(Θ) + {a − E(Θ)}². Hence the Bayes estimate is just the expectation of Θ, and the Bayes loss is the variance of Θ.
Absolute error loss Now consider L(θ, a) = |a − θ|. Then the Bayes estimate
is the median m of the distribution of Θ, i.e. that value (for simplicity assumed existent and unique) such that Prob(Θ < m) = Prob(Θ > m) = 1/2. For by
considering all possible orderings of θ, m, a, we see that
L(θ, a) − L(θ, m) ≥ (m − a) sign(θ − m)
and so, since E{sign(Θ − m)} = 0, E{L(Θ, a)} ≥ E{L(Θ, m)} for any choice of a.
As a variation, suppose L(θ, a) = c1 (a − θ) if θ ≤ a, and c2 (θ − a) otherwise
(c1 , c2 > 0). Let q := c2 /(c1 + c2 ). Then the Bayes estimate is the q-quantile of
the distribution of Θ, i.e. that value θq such that Prob(Θ ≤ θq ) = q.
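A quick Monte Carlo check of this claim (the N(0, 1) "posterior" and the costs c1 = 1, c2 = 3 are arbitrary choices for the sketch):

    import numpy as np

    # Sketch: the q-quantile minimises the asymmetric ("pinball") loss.
    rng = np.random.default_rng(1)
    theta = rng.normal(0.0, 1.0, size=100_000)   # posterior draws (illustrative)
    c1, c2 = 1.0, 3.0
    q = c2 / (c1 + c2)

    def expected_loss(a):
        return np.mean(np.where(theta <= a, c1 * (a - theta), c2 * (theta - a)))

    grid = np.linspace(-1.0, 2.0, 301)
    a_best = grid[np.argmin([expected_loss(a) for a in grid])]
    print(a_best, np.quantile(theta, q))         # both near the 0.75-quantile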
2 Decision rules
Suppose now that, before having to choose an act, we can conduct an experiment
E, observing X, with distribution governed by Θ, and so learning more about
Θ. A Bayesian would naturally apply the procedure above using the posterior
distribution given the observed value x of X. However, a frequentist must proceed
differently: by choosing between procedures for using any data that might be
observed.
For example, with action space A and parameter space both equal to R, we might use squared error loss: L(θ, a) = (θ − a)².
Then a non-randomised decision rule d : X → R is an estimator, and its risk
function R(θ, d) is just its mean-squared error function mse(θ), as considered in
detail in Chapter III.
In the case of a finite parameter space, with points labelled θ1, . . . , θn, we can think of the risk function of a decision rule d as a point (R(θ1, d), . . . , R(θn, d))ᵀ ∈ Rⁿ.
expectation over this distribution. The set of all possibly randomised decision
rules will be denoted by D, with D0 its subset of non-randomised rules.
An apparently more general concept is that of a behavioural decision rule. Such a
rule δ associates with each x ∈ X a probability distribution over A (independent
of Θ)—again, the risk is defined by an appropriate further expectation. How-
ever, unlike the case for a randomised decision rule, the dependence between the
outcomes of the randomisations for different x-values remains unspecified. But given any randomised decision rule d̃ we can define a behavioural rule δ, which associates with each x the distribution of d̃(x) (where d̃ is random); and δ and d̃ will have the same risk function.
Once a preferred decision rule d has been selected, and X = x observed, the
corresponding (non-random or randomised) action a = d(x) is taken.
The risk set is R := {R(·, d) : d ∈ D} (a subset of Rⁿ for a finite parameter space). The technical reason for allowing randomised decision rules is that R is then convex: i.e., if x, y ∈ R and 0 ≤ p ≤ 1, then z := px + (1 − p)y ∈ R. For if x = R(·, d1), y = R(·, d2), we have z = R(·, d̃), where d̃ is the randomised decision rule that chooses d1 with probability p, d2 with probability 1 − p.
In the frequentist approach we compare decision rules in terms of their risk functions. We consider d1 at least as good as d2, and write d1 ≼ d2, if R(θ, d1) ≤ R(θ, d2) for all θ. In this case we may say that d1 weakly dominates d2. If moreover we have strict inequality for some θ, we write d1 ≺ d2, and say d1 dominates d2. But ≼ is only a partial order: typically R(θ, d1) < R(θ, d2) for some θ, and R(θ′, d1) > R(θ′, d2) for some other θ′, and then d1 and d2 are simply incomparable.
If d is dominated we call it inadmissible, else admissible.
3 Convex loss
We say that the decision problem is convex if the action space A can be repre-
sented as a real interval1 and, for each θ ∈ Θ, the loss function L(θ, a) is a convex
function of a on A.
For example, the squared-error estimation problem, with parameter and action space both R and L(θ, a) = (θ − a)², is convex. Any decision problem with only a finite number of available
actions is non-convex.
¹ Extension to a convex subset of a general vector space is straightforward.
Theorem 1 In a convex decision problem, any behavioural decision rule δ is weakly dominated by some non-randomised decision rule.

Proof. For each x, let d(x) := E{δ(x)}, where the expectation is under the
distribution δ(x) over A. Then d(x) ∈ A, so d is a non-randomised decision rule.
Recall Jensen’s inequality (see Chapter IV, § 2.1): for any convex function g of
a real random variable Y , and any distribution P for Y , g{E(Y )} ≤ E{g(Y )}.
For each x and θ, apply this to g(·) = L(θ, ·), under the distribution over A
determined by δ(x), to obtain
L{θ, d(x)} ≤ EL{θ, δ(x)}. (1)
Taking a further expectation over X ∼ Pθ yields R(θ, d) ≤ R(θ, δ), so that d
weakly dominates δ. □
Corollary 2 Under the above conditions, any randomised decision rule is weakly
dominated by a non-randomised decision rule.
3.1 Sufficiency
Let the statistic T , with values in T, be sufficient for Θ. Given any (say non-
randomised)2 decision rule d, let δ be the behavioural decision rule, based on
T alone, that assigns, to t ∈ T, the distribution over A of d(X) given T = t;
this being determined by the distribution of X given T = t under Pθ , which, by
sufficiency, is in fact known, independently of the value θ. Then, conditional on
T = t and Θ = θ, both d and δ determine the identical distribution over the
action space A; hence they have identical risk functions. Effectively, δ mimics,
by external randomisation, the known probabilistic way in which X arises fol-
lowing observation of T . So any decision rule is weakly dominated by a (possibly
randomised) decision rule based on T alone.
The following result, generalizing Theorem 5 of Chapter III, now follows from
Theorem 1:
4 Selection criteria
It would appear unwise to use an inadmissible rule. But typically there will be
many admissible rules, all necessarily mutually incomparable. Other criteria are
then needed to select between them. Here we consider two: Bayes and minimax.
Lemma 4 If there exists a Bayes rule, there exists a non-randomised Bayes rule.
Extensive form   For any given data X = x, choose a Bayes act ax in the posterior distribution Πx.
Proof. Starting from an extensive form approach, we can construct the associ-
ated non-randomised decision rule δ, such that δ(x) = ax . Then for any decision
rule d ∈ D0
r(Π, d) − r(Π, δ) = E[E{L(Θ, d(X)) − L(Θ, δ(X)) | Θ}]
                  = E[E{L(Θ, d(X)) − L(Θ, δ(X)) | X}]
                  ≥ 0   (4)

since E{L(Θ, d(x)) − L(Θ, δ(x)) | X = x} ≥ 0 for all x. So the decision rule δ constructed from an extensive form analysis is a normal form Bayes rule. Conversely,
if dΠ is normal form Bayes we have
E[E{L(Θ, dΠ(X)) − L(Θ, δ(X)) | X}] ≤ 0
and since, by definition of δ, the integrand is non-negative, it must be 0 (with
probability 1). So dΠ(x) is a Bayes act in the posterior distribution given X = x. □
Corollary 6 The minimised Bayes risk r(Π, dΠ ) is E{H(ΠX )}, where H(Πx ) is
the Bayes loss for Θ ∼ Πx , the posterior distribution given X = x.
Since extensive form analysis requires minimisation only over the action space
A, rather than over the much bigger space D0 of non-randomised decision rules
as required for normal form analysis, it gives a much easier way to construct a
Bayes rule.
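For a finite problem the extensive form analysis is a one-liner per data value. A sketch with hypothetical losses, prior and model:

    import numpy as np

    # Sketch: for each x, the Bayes act minimises posterior expected loss.
    L = np.array([[0.0, 2.0],       # L(theta0, a0), L(theta0, a1)
                  [3.0, 0.0]])      # L(theta1, a0), L(theta1, a1)
    prior = np.array([0.7, 0.3])
    px = np.array([[0.8, 0.2],      # p(x | theta0), x in {0, 1}
                   [0.3, 0.7]])     # p(x | theta1)

    for x in (0, 1):
        post = prior * px[:, x]
        post /= post.sum()                    # posterior over theta given x
        exp_loss = post @ L                   # posterior expected loss per act
        print(x, exp_loss, exp_loss.argmin()) # the Bayes act a_x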
Any rule that dominates a Bayes rule must be Bayes for the same prior. Conse-
quently if a Bayes rule with respect to Π is unique, it is admissible.
More generally, a Bayes rule will be admissible under some additional conditions.
Here we note the following weaker result:
Theorem 7 Let dΠ be a Bayes rule with respect to Π, and d a decision rule that dominates dΠ. Define A := {θ : R(θ, d) < R(θ, dΠ)}. Then Π(Θ ∈ A) = 0.
Proof. Follows easily from R(θ, d) − R(θ, dΠ) ≤ 0 for all θ and EΠ{R(Θ, d) − R(Θ, dΠ)} ≥ 0. □
Corollary 8 Suppose the parameter space is discrete, and Π(Θ = θ) > 0 for each θ. Then the Bayes rule dΠ is admissible.
For those who don’t like priors, an “objective” (but arbitrary) criterion for se-
lecting a decision rule is “minimax”:
Here is one of the more surprising results in decision theory: the obvious (minimax,
maximum likelihood, best unbiased, best invariant,. . . ) estimator of a collection
of 3 or more normal means is inadmissible!
Write Θ̂ := X for the obvious estimator, and Θ̃α := {1 − α/‖X‖²} X.

Proof. We easily compute R(θ, Θ̂) ≡ p. Also

R(θ, Θ̃α) = Eθ{ ‖X − θ − αX/‖X‖²‖² }
          = Eθ{‖X − θ‖²} − 2α Eθ{Xᵀ(X − θ)/‖X‖²} + α² Eθ{1/‖X‖²}
          = p − 2α Σi Eθ{Xi(Xi − θi)/‖X‖²} + α² Eθ{‖X‖⁻²}.

Now for any reasonable h, using integration by parts,

Eθ{(Xi − θi) h(X)} = Eθ{h[i](X)}

where h[i](x) := ∂h(x)/∂xi. (This is Stein's lemma.) Take h(x) ≡ xi/‖x‖², so h[i](x) ≡ ‖x‖⁻² − 2xi²/‖x‖⁴. Then

Σi Eθ{Xi(Xi − θi)/‖X‖²} = Σi Eθ{‖X‖⁻² − 2Xi²/‖X‖⁴} = (p − 2) Eθ{‖X‖⁻²}.

Hence

R(θ, Θ̃α) = p − α{2(p − 2) − α} Eθ{‖X‖⁻²}.   □
Each Θ̃α (0 < α < 2(p − 2)) dominates the minimax estimator Θ̂, so is itself minimax (with worst-case risk p). For any θ, R(θ, Θ̃α) is minimised at α = p − 2, so Θ̃p−2 dominates all other estimators in this class, which are thus inadmissible; this includes, for α = 0, the MLE Θ̂ = X. But even Θ̃p−2 is inadmissible: it is dominated by {1 − (p − 2)/‖X‖²}⁺ X, where z⁺ := max{z, 0} (and even this estimator is inadmissible. . . ).
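A Monte Carlo sketch of this domination (the true mean θ below is an arbitrary choice):

    import numpy as np

    # Sketch: risk of the MLE X versus the James-Stein estimator
    # (1 - (p-2)/||X||^2) X, for X ~ N_p(theta, I_p), p >= 3.
    rng = np.random.default_rng(2)
    p, reps = 10, 20_000
    theta = np.full(p, 0.5)                      # arbitrary true mean
    X = rng.normal(theta, 1.0, size=(reps, p))
    shrink = 1.0 - (p - 2) / np.sum(X ** 2, axis=1, keepdims=True)
    js = shrink * X                              # alpha = p - 2
    print(np.mean(np.sum((X - theta) ** 2, axis=1)),    # ~ p = 10
          np.mean(np.sum((js - theta) ** 2, axis=1)))   # strictly smaller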
³ The original meaning of this term, introduced by Herbert Robbins, has now been largely discarded in favour of the usage described here.
The simplest decision problem is when both parameter space and action space have just two elements: say {θ0, θ1} and A = {a0, a1}. The loss function is L(θi, aj) = lij, where we suppose l00 < l01, l11 < l10, so that the optimal act when Θ is
known to take value θi is ai (i = 0, 1). Although this formulation and most of
the analysis below treats θ0 and θ1 entirely symmetrically, in some approaches we
distinguish between the null hypothesis H0 : Θ = θ0 and the alternative hypothesis
H1 : Θ = θ1 , and treat these differently. Then a1 is interpreted as “reject H0 ”
and a0 as “accept H0 ” (or, at any rate, “do not reject H0 ”).
We do not restrict the sample-space X. The density (with respect to a suitable
underlying measure µ) of X given Θ = θi is denoted by pi (·). The likelihood ratio
is Λ := λ(X) := p1 (X)/p0 (X).
A behavioural decision rule is defined by a function φ : X → [0, 1], where φ(x) is
the probability of taking action a1 after observing X = x. Its risk function is
given by
R(θ0 , φ) = l00 (1 − α) + l01 α
= l00 + (l01 − l00 )α
(6)
R(θ1 , φ) = l10 β + l11 (1 − β)
= l11 + (l10 − l11 )β
where α := E{φ(X) | θ0} is the overall probability of taking action a1 given Θ =
θ0 , the type I error of φ; and β := 1 − E{φ(X) | θ1} is the overall probability of
taking action a0 given Θ = θ1 , the type II error .
We also term α the size of the test φ, and γ := 1 − β its power. Thus φ ≼ φ′ ⇐⇒ α ≤ α′ and β ≤ β′.
A non-randomised rule φ takes values 0 and 1 only. It is equivalently determined by its critical region C = {x : φ(x) = 1}. Then α = Prob(X ∈ C | Θ = θ0), β = Prob(X ∈ C̄ | Θ = θ1) (where C̄ denotes the complement X \ C of C in X).
The risk set is {(R(θ0 , φ), R(θ1 , φ)) : φ ∈ D}. From equations (6), this is a simple
transformation of the set S of error-pairs {(α, β) : φ ∈ D}, a subset of the unit
square. So without any real loss of generality, we may assume lij = 0 if i = j, 1
otherwise, when the risk-set is just S.
Since we are allowing randomisation, S is convex; further, if (α, β) ∈ S, correspondingly (1 − α, 1 − β) ∈ S (obtained by replacing φ with 1 − φ).

[Figure 1: the risk set S of error-pairs (α, β), a convex subset of the unit square.]
The admissible rules are those corresponding to points on the lower boundary of
S.
Let now Θ have prior distribution Π, where Π(Θ = θi ) = πi (i = 0, 1). The Bayes
risk of φ is then
r(Π, φ) = π0 α + π1 β. (7)
Finding a Bayes rule is thus equivalent to finding the rule that minimises a
linear combination of the two error rates with non-negative coefficients. From
Corollary 8 of Chapter VIII, when neither coefficient is 0 such a rule will be
admissible.4 In terms of the risk set, it will correspond to a point found by
⁴ Admissibility need not hold if we allow zero coefficients. Thus the rule φ ≡ 0, corresponding to the point (0, 1) ∈ S, is Bayes for prior π1 = 1, π0 = 0; but it will be dominated if there exists a rule with α = 0 and β < 1.
sliding down a line of slope −π0/π1 until it is just about to leave S: see Figure 2. Its intersection with S is then non-empty and completely contained within the lower boundary of S; any rule corresponding to a point in this intersection will be a Bayes rule with respect to Π.

[Figure 2: a line of slope −π0/π1 slid down to touch the lower boundary of S; the point(s) of contact correspond to Bayes rules.]
The figure suggests that any point on the lower boundary can be characterised
in this way and is thus a Bayes rule: this is indeed the case, and can be proved
using the supporting hyperplane theorem for convex sets. We end up with a (1, 1)
correspondence between admissible rules and Bayes rules so long as both error
probabilities are positive.
The following simple Corollary (which was proved directly in STATISTICS IB)
to the Bayesian result of Theorem 10 is the basis for the Neyman-Pearson (N-P)
frequentist approach to hypothesis testing.
This result can be expressed as: any LRT is most powerful for its size. Note that a most powerful test need not be admissible: we might have α′ < α and γ′ = γ. This can happen for an LRT(0) (with γ = 1).
The N-P approach is to preselect the desired size α (e.g. at 5%), and adjust the
cut-off k to yield a LRT of this size—which will then have maximum power γ
subject to size ≤ α. Note that this introduces an asymmetry into the treatment
of the two hypotheses.
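For two simple normal hypotheses the whole recipe is explicit. A sketch (the hypotheses N(0, 1) and N(1, 1) are illustrative assumptions):

    from scipy.stats import norm

    # Sketch: N-P test of H0: X ~ N(0,1) vs H1: X ~ N(1,1), one observation.
    # The LRT "reject if lambda(x) > k" reduces here to "reject if x > c".
    alpha = 0.05
    c = norm.ppf(1 - alpha)            # cut-off giving size alpha under H0
    beta = norm.cdf(c, loc=1)          # type II error under H1
    print(c, beta, 1 - beta)           # c ~ 1.645, beta ~ 0.74, power ~ 0.26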
Chapter IX
Multivariate Analysis
By Lemma 1 there can be at most one distribution satisfying (1) for all a ∈ Rp .
Indeed, because the characteristic function of N(m, v) is given by t ↦ exp(imt − ½vt²), that of N(µ, Σ) (if such a distribution exists) must be given by

φ(a) = exp{iµa − ½ aᵀΣa}.   (2)
2
2.1 Density
Because it has independent and identically distributed N(0, 1) entries, the joint density for Z ∼ Np(0, Ip) is readily computed:

pZ(z) = (2π)^{−½p} exp{−½ z zᵀ}.   (3)
Now let X ∼ Np(µ, Σ), with Σ positive definite. We can write Σ = AᵀA with A (p × p) non-singular; then X = µ + ZA, where Z = (X − µ)A⁻¹ ∼ Np(0, Ip). The Jacobian of the transformation x → z is (dxj/dzk) = Aᵀ, with determinant (det Σ)^{1/2}. Applying the multivariate change-of-variables formula to (3), we obtain the density of Np(µ, Σ):

pX(x) = (2π)^{−½p} (det Σ)^{−½} exp{−½ (x − µ)Σ⁻¹(x − µ)ᵀ}
      = (2π)^{−½p} (det Σ)^{−½} exp{−½ tr Σ⁻¹(x − µ)ᵀ(x − µ)}.   (4)
What if Σ is only positive semi-definite (and hence singular)? In that case there exists a
with Σa = 0, and so var(Xa) = 0. So the distribution of X is confined to the affine subspace
{x : (x − µ)a = 0}, and so can not have a density with respect to Lebesgue measure on
Rp . Nevertheless, we can still have a perfectly satisfactory MVN distribution according to our
definition.
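The construction X = µ + ZA above is also the standard recipe for simulating from Np(µ, Σ); a sketch with arbitrary µ and Σ:

    import numpy as np

    # Sketch: sampling N_p(mu, Sigma) via Sigma = A^T A (rows as observations).
    rng = np.random.default_rng(3)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    A = np.linalg.cholesky(Sigma).T       # Sigma = A^T A, A non-singular
    Z = rng.normal(size=(100_000, 2))     # i.i.d. N(0, 1) entries
    X = mu + Z @ A
    print(X.mean(axis=0))                 # ~ mu
    print(np.cov(X, rowvar=False))        # ~ Sigma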
Y ∗ X = E(Y | X) = µY + (X − µX) ΣXX⁻¹ ΣXY.
3 Data
S^c := (X^c)ᵀ X^c = Xᵀ Π X.   (10)

Then aᵀS^c a is the corrected sum of squares Σ_{i=1}^n (wi − w̄)² for the variable W = V a.
E(X̄) = µ   (11)
E{(n − 1)⁻¹ S^c} = Σ.   (12)
Suppose now that the rows (X i : i = 1, . . . , n) of the data matrix X are inde-
pendent and identically distributed from the normal distribution N (0, Σ) with
known mean-vector (taken as 0 without loss of generality), so that the only un-
known parameter is the dispersion matrix Σ. In particular, the entries (Xij ) are
jointly normal, with E(Xij ) = 0, cov(Xij , Xi′ j ′ ) = δii′ σjj ′ .
We shall suppose Σ non-singular. From (4), the overall joint density is

p(X | Σ) = (2π)^{−½np} (det Σ)^{−½n} exp{−½ tr Σ⁻¹S}   (13)
         = (2π)^{−½np} (det Σ)^{−½n} exp[−½{Σi kii sii + 2 Σ_{i<j} kij sij}]   (14)
In particular, for any fixed a ∈ Rp,

(aᵀSa)/(aᵀΣa) ∼ χ²n.
⁴ To be precise: the values {−½kii : 1 ≤ i ≤ p} and {−kij : 1 ≤ i < j ≤ p}. Alternatively we can take K itself as the natural parameter; the associated natural sufficient statistic then requires compensating adjustments to the entries of S.
Wishart density
The following argument shows how, if we can establish (17), we can easily deduce
(15).
Since S is a sufficient statistic for the family of densities (13), for any data the likelihood based on S is proportional to that based on the full data X (see Corollary 3 of Chapter II). So

pS(S | Σ)/pS(S | Ip) = pX(X | Σ)/pX(X | Ip)   (18)
                     = (det Σ)^{−½n} exp{−½ (tr Σ⁻¹S − tr S)}.   (19)
Bivariate case
We here derive (17) for the case p = 2. The ith row of the data matrix X will be written as (Xi, Yi).

We have U := SXX ∼ χ²n by Corollary 6. We now argue conditionally on the X's (so these can be regarded as fixed), and consider the zero-intercept sample regression of Y on X, estimated from the data pairs (X1, Y1), . . . , (Xn, Yn). From standard regression theory, the estimated regression coefficient is β̂ := SXY/SXX, and the unbiased estimator of the residual variance of Y, about its regression line, is s² := SYY·X/(n − 1), with SYY·X := SYY − S²XY/SXX. Since, with Σ = I2, the population counterparts of these quantities are 0 and 1, regression distribution theory tells us that (still conditional on the X's) β̂ and s² are independent, with respective distributions N(0, SXX⁻¹) and χ²_{n−1}/(n − 1). Hence if we define V := SXX^{1/2} β̂ = SXY/SXX^{1/2} and W := SYY·X, we have V ∼ N(0, 1), W ∼ χ²_{n−1}, independently. Moreover, since this joint distribution of (V, W) holds conditionally on the X's, but does not in fact depend on their values, while U := SXX is entirely determined by those values, we must have (V, W) ⊥⊥ U; and we know U ∼ χ²n. The joint density of (U, V, W) is thus:

p(U,V,W)(u, v, w) = {2^{½n} Γ(½n)}⁻¹ e^{−u/2} u^{½n−1} · (2π)^{−½} e^{−v²/2} · {2^{½(n−1)} Γ(½(n−1))}⁻¹ e^{−w/2} w^{½(n−1)−1}.   (20)
1
Now change variables to SXX = U , SXY = U 2 V , SY Y = W + V 2 . The Jacobian of the
1
transformation is J := det{∂(sXX , sXY , sY Y )/∂(u, v, w)} = u 2 , so the transformed density is
where
√ 1 1 1
C −1 = 2n πΓ n Γ n− . (22)
2 2 2
We shall derive (24) for the special case ρ = 0, when we can take Σ = I2 , so
that the Xs and Y s are independent normal with unit variance. (In this case the
otherwise problematic integral term in (24) is just a constant.)
Introduce

t := (n − 1)^{1/2} r / √(1 − r²).   (25)
Then

t = SXX^{1/2} (SXY/SXX) / √{SYY·X/(n − 1)} = SXX^{1/2} β̂ / s

where β̂ = SXY/SXX is the sample (zero-intercept) regression coefficient of Y on X, β = 0 is the corresponding population value, and s² is the sample residual mean square (on n − 1 degrees of freedom) for Y about its sample regression on X. We recognise t as Student's t-statistic for testing the (true) hypothesis β = 0. Consequently, from standard regression theory, conditionally on the X's and hence also unconditionally,

t ∼ t_{n−1}.   (26)
and applying the change-of-variables formula with (25) delivers the density of r on (−1, 1):

p(r | n; 0) = [Γ(½n) / {Γ(½n − ½) √π}] (1 − r²)^{(n−3)/2}.   (28)
It would be nice if an argument similar to that based on (18) could be used to transfer the "null" sampling distribution (28) for r to the case ρ ≠ 0, but unfortunately this does not work, because r is not sufficient for the model with Σ of the form (23).
Asymptotic distribution
Formula (30) yields a better approximation than (29), and can be used as the basis
of approximate inference about ζ, and thereby about ρ = (eζ − e−ζ )/(eζ + e−ζ ).
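As a sketch of the resulting interval for ρ (the values r = 0.45, n = 50 are made up, and the variance 1/(n − 3) is the familiar Fisher approximation, which may differ in detail from (30) above):

    import numpy as np

    # Sketch: approximate 95% interval for rho via zeta = artanh(rho).
    r, n = 0.45, 50
    zeta_hat = np.arctanh(r)
    half = 1.96 / np.sqrt(n - 3)          # assumed var(zeta_hat) ~ 1/(n - 3)
    print(np.tanh(zeta_hat - half), np.tanh(zeta_hat + half))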
Sample partial correlations can be constructed from S in the same way as popu-
lation partial correlations are constructed from Σ. On account of Lemma 7, the
distribution theory (both exact and asymptotic) is exactly as above, but with the
degrees of freedom decremented by p, the number of variables adjusted for.
Again (31) is of exponential family form. The natural sufficient statistic can be taken as (X̄, S). Since E(Xi) = µ and E(XiᵀXi) = Σ + µᵀµ, the mean-value parameter is (µ, n(Σ + µᵀµ)). Solving the likelihood equations, obtained by equating the natural sufficient statistic with the mean-value parameter, yields maximum likelihood estimates:

µ̂ = x̄   (32)
Σ̂ = n⁻¹ S^c.   (33)
However, the biased estimator n−1 S c is usually replaced with the unbiased esti-
mator (n − 1)−1 S c .
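A sketch contrasting the two estimators on hypothetical data (note that numpy's covariance routine uses the unbiased divisor by default):

    import numpy as np

    # Sketch: MLE n^{-1} S^c versus the unbiased (n - 1)^{-1} S^c.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(30, 3))              # hypothetical data matrix
    Xc = X - X.mean(axis=0)                   # centred data X^c
    Sc = Xc.T @ Xc                            # corrected sum-of-products S^c
    Sigma_mle = Sc / X.shape[0]               # eq. (33)
    Sigma_unb = Sc / (X.shape[0] - 1)
    print(np.allclose(Sigma_unb, np.cov(X, rowvar=False)))   # True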
The statistic (X̄, S^c) is a (1,1) function of the natural sufficient statistic (X̄, S), hence minimal sufficient (although not itself a natural sufficient statistic).
Proof. The n × n matrix of covariances between the jth and kth columns of X is σjk In. Hence the vector of covariances between the jth column of X^c = ΠX and the kth entry of X̄ = n⁻¹1ᵀX is n⁻¹ Π σjk In 1 = 0, since Π1 = 0. Thus every entry of X^c is uncorrelated with every entry of X̄. Since the entries of X are jointly normally distributed, and we are dealing with linear transformations, zero correlation implies independence: X̄ ⊥⊥ X^c. Hence, since S^c = (X^c)ᵀX^c, X̄ ⊥⊥ S^c. □
In the univariate case we can base inference for the single mean M on the Student's t pivot:

t := n^{1/2} (X̄ − M)/S̄XX^{1/2},   (36)

where S̄XX := Σi (Xi − X̄)²/(n − 1) is the usual unbiased estimator of the variance of X, on ν = n − 1 degrees of freedom. Then t has the tν distribution, defined as that of Z/√(V/ν) when Z ∼ N(0, 1) and V ∼ χ²ν, independently. Alternatively we can use F := t² = n(X̄ − M)²/S̄XX, which has the F distribution F^1_ν, where F^p_ν is defined as the distribution of (U/p)/(V/ν) when U ∼ χ²p and V ∼ χ²ν independently.
The generalisation of t² to the multivariate case is Hotelling's T²:

Hotelling's T²
Two-sample test

Suppose we have independent (p × 1) data-vectors Xi ∼ N(MX, Σ) (i = 1, . . . , nX) and Yi ∼ N(MY, Σ) (i = 1, . . . , nY), from two populations with the same dispersion matrix but possibly different mean-vectors. With ∆ := MY − MX, we can form

T² := (nX⁻¹ + nY⁻¹)⁻¹ (Ȳ − X̄ − ∆) Sw⁻¹ (Ȳ − X̄ − ∆)ᵀ   (38)

with distribution T²(p; ν). In particular, a test of the hypothesis MY = MX can be conducted by referring {(ν − p + 1)/p} (nX⁻¹ + nY⁻¹)⁻¹ (ȳ − x̄) Sw⁻¹ (ȳ − x̄)ᵀ to tables of F^p_{ν−p+1}.
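A sketch of the two-sample computation on simulated data (Sw is taken here as the pooled covariance matrix on ν degrees of freedom, the usual convention; the notes' normalisation of Sw may differ by a factor of ν):

    import numpy as np
    from scipy.stats import f

    # Sketch: two-sample Hotelling T^2 test of M_Y = M_X (Delta = 0).
    rng = np.random.default_rng(5)
    X = rng.normal(0.0, 1.0, size=(40, 3))
    Y = rng.normal(0.2, 1.0, size=(35, 3))
    nX, nY, p = X.shape[0], Y.shape[0], X.shape[1]
    nu = nX + nY - 2
    Sw = ((nX - 1) * np.cov(X, rowvar=False)
          + (nY - 1) * np.cov(Y, rowvar=False)) / nu    # pooled covariance
    d = Y.mean(axis=0) - X.mean(axis=0)
    T2 = d @ np.linalg.solve(Sw, d) / (1.0 / nX + 1.0 / nY)
    F_stat = (nu - p + 1) / (p * nu) * T2
    print(T2, 1 - f.cdf(F_stat, p, nu - p + 1))         # statistic, p-value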
6 Classification
A likelihood ratio test (LRT) will thus be of the form: "assign to Π2 if Xa > k". The variable W := V a is the linear discriminant function (LD) for this problem. When k = µ̄W := µ̄a, this test is equivalent to assigning to Π2 if Λ > 1. By general properties of LRTs, this rule will minimise the sum α + β of the two types of error. By symmetry it will have α = β, and will be minimax. If we have non-equal prior probabilities for the two hypotheses, or differing losses for the two types of error, a different cut-off will be optimal.
Suppose we decide to collapse our data to the observation on just one variable, say U = V c. Then, in population Πk, XU := Xc will have the univariate normal distribution N(µkc, cᵀΣc). The separation between the two distributions can reasonably be measured by the distance between their means, measured in units of their (common) standard deviation; or, essentially equivalently, by its square:

d² := {(µ2 − µ1)c}² / (cᵀΣc).   (43)
6.2 Data
⁵ It is a non-central T²-distribution, which is a scaled version of the non-central F-distribution. See Anderson (2003), § 5.2.2.
7 Principal components
AᵀΣA = Λ.   (45)

Definition 2 A variable W = V a is termed normalised if Σ_{i=1}^p ai² = 1. □
Theorem 12 Among all normalised variables, the choice W = V1*, the first principal component, has maximum variance. Among all normalised variables uncorrelated with (V1*, . . . , Vk*) (1 ≤ k < p), the choice W = V*_{k+1} has maximum variance.
The first principal component V1* is often interpreted as the most important constructed variable, since (subject to normalisation) it varies the most. For certain purposes (e.g. predicting some other variable) we might consider discarding all other variables and only using V1*. Similarly for k < p the first k principal components might be regarded as the most important set of k variables, and only these retained, thus achieving dimension reduction. However, whether these aspirations are achieved is quite another matter. There is also the question of the choice of k: a common criterion is to fix π ∈ (0, 1) close to 1 and choose the minimum k for which Σ_{i=1}^k λi / Σ_{i=1}^p λi ≥ π. (This quantity is sometimes, though not very appropriately, called the proportion of total variance explained by the first k principal components.)
We have noted that the above construction is relative to a specific choice of basic
variables V . Starting with another set (even if they are scalar multiples of the
originals) will change Σ, and typically produce different principal components (in
the context of Theorem 12 it will involve a different understanding of what it
means for a variable to be normalised). It is common first to scale each of the
initial basic variables by its standard deviation, so that Σ becomes the correlation
matrix, and to construct principal components from that.
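A sketch of the whole procedure on hypothetical data, using the correlation matrix as just suggested:

    import numpy as np

    # Sketch: principal components via the spectral decomposition (45).
    rng = np.random.default_rng(6)
    data = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated vars
    R = np.corrcoef(data, rowvar=False)     # Sigma replaced by correlation matrix
    lam, A = np.linalg.eigh(R)              # columns of A: normalised vectors a
    lam, A = lam[::-1], A[:, ::-1]          # eigh sorts ascending; reverse
    Z = (data - data.mean(axis=0)) / data.std(axis=0)
    scores = Z @ A                          # the principal components V*_1..V*_p
    print(np.cumsum(lam) / lam.sum())       # "proportion of variance explained"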
7.1 Data