LECTURE NOTES
1. INTRODUCTION
The primary subject of statistical inference is drawing conclusions about some aspect of a population
of persons or objects based on a set of quantitative observations randomly gathered from that
population, or equivalently, drawing conclusions about the generating process of certain quantities
based on a set of randomly generated observed outcomes from that process. More specifically,
we will be interested in estimating or testing some numerical characteristic(s) of a population or
generating process based on a set of random observations from that population or process and then
assigning some level of confidence to our estimates or conclusions. These notions will no doubt be
familiar concepts from any introductory unit in statistics. Our focus here will be on more fully
developing the underlying theory and philosophy upon which the techniques learned in earlier units
are based and then using these principles to extend our understanding of statistical concepts to a
wider range of situations. As such, a basic knowledge of introductory mathematics and statistics
will be assumed throughout. In particular, we shall assume that the reader is familiar with the
following concepts and areas:
• Single and Multi-variable Differentiation and Integration;
• Maximisation and Minimisation of Functions;
• Taylor-Series Expansions;
• Basic Probability and Random Variables;
• Joint and Marginal Distributions and Independence;
• Moments of Random Variables and Moment Generating Functions;
• The Change of Variable Formula for Probability Densities; and,
• Basic Conditional Distributions and Conditional Expectations.
We note that the reader is only assumed to be familiar with the above (and other related) topics
and not necessarily expert. Indeed, it is not the intention of these notes to provide a rigorous
mathematical development of the theory of statistical inference. Nonetheless, any reasonable un-
derstanding of the development and properties of statistical inference and estimation procedures
must be based to some degree on a firm mathematical foundation. We shall strive, therefore, to use
mathematics as a tool rather than as an end in itself and thus, while completely rigorous proofs will
rarely be provided, basic mathematical explanations and justifications will certainly be presented.
In order to more formally define our task, we shall focus on examining the properties of so-
called probability models. Loosely speaking, a probability model is simply a collection, or family,
of related probability distributions, one of which is believed to fully characterise the population
or process from which a set of observed data values arose. Typically, these models will be termed
parametric when each member of the family of distributions in question is uniquely associated
with (or indexed by) a vector of numerical values, called parameters. To give a specific example,
we might assume that the values of a numerical characteristic of interest among the elements in
a particular population are well described by a normal (or bell-shaped or Gaussian) distribution
with some unspecified expectation (or mean or centre), generally designated by µ, and variance (or
spread), generally designated by σ². In this case, the probability model (i.e., the family of normal
distributions) is indexed (i.e., each member is uniquely identified) by the two values µ and σ². Our
task is thus reduced to estimating or testing hypotheses regarding the true (but unknown) values
of these parameters.
In Section 2 of these notes, we shall start by examining the relatively simple task of estimating
the value of a parameter from a chosen probability model or family. In particular, we shall develop
and discuss theory regarding the construction of estimates and the determination and comparison
of the properties of these estimation procedures. Of course, since estimates by their nature must
be based on random information, they will inevitably contain error (i.e., the observed value of
an estimator will not exactly equal the value of the parameter it is intended to estimate except
in the most special of circumstances). Thus, in addition to providing estimates, we should also
endeavour to provide some measure of how strongly we believe (or how confident we are) in the
precision of our estimated value. This attachment of confidence to an estimator is the subject of
interval estimation which we discuss in Section 3 of these notes. As an alternative to estimating
the values of parameters, we may have specific hypotheses about their true values, the plausibility
of which we can test using the observed data. Such hypothesis testing will be the subject of Section
4, the last section of these notes. Before proceeding on to the details of parametric estimation
and testing, we note that the vast majority of our results will be based on the assumption that the
chosen probability model is indeed correct (i.e., that the population or process under study is indeed
characterised by one member of the family of distributions which comprise the model). Oftentimes
this assumption is either not overly critical or is demonstrably true. Other times, however, there
is some non-negligible doubt associated with the choice of probability model, and methods which
are less constrained to specific parametric families are desirable. Throughout these notes, then, we
shall begin to explore the area of so-called non-parametric procedures, which are a first attempt at
widening the class of probability models available to include those which are not easily indexed by
a finite collection of parameters (e.g., we might wish to use the family of all symmetric distributions
instead of the family of normal distributions, and this new family is not possible to index by just
its expectation and variance, nor indeed by any finite collection of numerical parameters).
Some simple algebra shows that the solution to this system, θ̂ = (µ̂, σ̂²), is given by

µ̂ = m₁(x₁, ..., xₙ) = (1/n) Σ_{i=1}^n x_i = x̄

σ̂² = m₂(x₁, ..., xₙ) − {m₁(x₁, ..., xₙ)}²
   = (1/n) Σ_{i=1}^n x_i² − {(1/n) Σ_{i=1}^n x_i}²
   = (1/n) Σ_{i=1}^n (x_i − x̄)².

Note that the method of moments estimator of σ² is not the usual unbiased estimate, s² = {1/(n−1)} Σ_{i=1}^n (x_i − x̄)² = {n/(n−1)} σ̂².
Now, suppose that we wanted to estimate σ instead of σ². One convenient method is to define the function τ(θ) = τ(µ, σ²) = √(σ²), so that σ = τ(µ, σ²). Thus, a method of moments estimate for σ is given by τ(µ̂, σ̂²) = √(σ̂²) = √{(1/n) Σ_{i=1}^n (x_i − x̄)²}. Alternatively, we might choose to reparameterise our probability model (i.e., index the family by a different set of parameters), and set θ = (µ, σ) and then solve the method of moments equations for µ̂ and σ̂ directly. Generally, either approach will provide the same result (though there are some special, and generally unimportant, cases in which these two approaches lead to different answers).
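As a concrete illustration of these formulas, the following minimal sketch (Python with numpy; the data vector is hypothetical) computes µ̂, σ̂² and the derived estimate σ̂:

```python
import numpy as np

# Hypothetical sample data; any array of observations would do.
x = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5])
n = len(x)

m1 = np.sum(x) / n       # first sample (raw) moment
m2 = np.sum(x**2) / n    # second sample (raw) moment

mu_hat = m1                      # method of moments estimate of mu
sigma2_hat = m2 - m1**2          # equals (1/n) * sum((x - xbar)**2)
sigma_hat = np.sqrt(sigma2_hat)  # derived estimate via tau(mu, sigma^2) = sqrt(sigma^2)

print(mu_hat, sigma2_hat, sigma_hat)
```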
Before moving on to the next estimation procedure, we note that the use of raw moments in the
standard method of moments procedure is by no means required. Indeed, generalisations of the
method of moments procedures which employ matching various other corresponding population and
sample quantities are possible. For instance, we might employ a pre-specified collection of sample
and population percentiles rather than moments, which yields the so-called method of percentiles
estimators (e.g., for the normal distribution we might solve a system of equations based on equating
the theoretical quartiles to the observed sample quartiles). The most common generalisation,
however, is based on replacing the raw and sample moments with the so-called central moments
µ_r = E_θ[{X − E_θ(X)}^r] and m_r = (1/n) Σ_{i=1}^n (x_i − x̄)^r, and derives an estimate by solving the
system of k equations E_θ(X) = x̄ and µ_r = m_r for r = 2, ..., k (note that the first equation
does not involve central moments, since both µ₁ and m₁ are always equal to zero). Another
common generalisation of the method of moments is to employ any k (central) moments for the
k defining equations rather than simply the first k (central) moments. Finally, another common
generalisation, often referred to as the generalised method of moments, is to use the first moment
of k functions gi (·), i = 1, . . . , k in the defining equations. In other words, the generalised method
of moments estimator of θ is the solution to the k equations:
E_θ{g₁(X)} = (1/n) Σ_{i=1}^n g₁(x_i)
      ⋮
E_θ{g_k(X)} = (1/n) Σ_{i=1}^n g_k(x_i).
If the g_i(·)'s are set to g_i(x) = x^i, then we recover the standard method of moments equations.
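As an illustration of these equations, here is a minimal numerical sketch (Python; numpy and scipy assumed) which takes g₁(x) = x and g₂(x) = x² under the normal model, so that E_θ{g₁(X)} = µ and E_θ{g₂(X)} = σ² + µ², and solves the resulting system with a general-purpose root finder:

```python
import numpy as np
from scipy.optimize import fsolve

x = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5])  # hypothetical data

def moment_equations(theta):
    mu, sigma2 = theta
    # E(X) = mu and E(X^2) = sigma^2 + mu^2 under the normal model.
    return [mu - np.mean(x),
            (sigma2 + mu**2) - np.mean(x**2)]

mu_hat, sigma2_hat = fsolve(moment_equations, x0=[0.0, 1.0])
print(mu_hat, sigma2_hat)  # reproduces the standard method of moments estimates
```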
2.1.2. Maximum Likelihood: Perhaps the most flexible, important and intuitively appealing of
all estimation procedures is that of maximum likelihood. Before formally describing this estimation
method, we start with a simple example to demonstrate the concept behind maximum likelihood.
Example 2.2: Suppose that a particular population contains individuals of two types, A and
B. Moreover, suppose that we are told that there are three times more of one type of individual
than the other, but we do not know which of the two types of individuals is the more prevalent.
We would like to know whether it is the type A or type B individuals who are predominant,
and to answer this question we plan to randomly sample 3 individuals. Letting X denote the
number of type A individuals in the sample, it should be clear that X has a binomial distribution
with a number of trials equal to three and a success probability p which is either 0.25 if type B
individuals are the most prevalent or 0.75 if type A individuals are the most prevalent. Based
on this fact, we can determine the probability of X taking on any of its four possible values
(0, 1, 2, or 3) under each of the two possible success probability options using the binomial
probability mass function:
Pr_p(X = x) = {3!/(x!(3−x)!)} p^x (1−p)^{3−x},   x = 0, 1, 2, 3,

which yields the following table of probabilities:

x                    0       1       2       3
Pr_{p=0.25}(X = x)   27/64   27/64   9/64    1/64
Pr_{p=0.75}(X = x)   1/64    9/64    27/64   27/64
Based on this table of probabilities, we can now devise a reasonable estimator for the true
population value of p, based on the notion of the “preponderance of evidence” or the likelihood.
The idea is to select the value of our estimator for p as either 0.25 or 0.75, whichever gives
a larger probability to the event which we actually observed, X = x. In other words, if we
observe zero type A individuals in our sample, we would estimate p as 0.25 since the probability
of observing this sample result when p = 0.25 is much larger than the probability of the observed
sample result under the other alternative, p = 0.75. Formally, we define our estimator as:
p̂ = p̂(x) = { 1/4 if x = 0, 1;  3/4 if x = 2, 3 } = argmax_{p ∈ {1/4, 3/4}} Pr_p(X = x),
In this way, we see that our estimator is that value in the possible parameter set for p which
maximises the probability mass function for the random variable X. The probability mass
function Pr_p(X = x), when treated as a function of the parameter p for a fixed value of x (instead
of the more usual interpretation which treats it as a function of x with a fixed parameter value
p), will be referred to as the likelihood function, and we will write L(p) = Pr_p(X = x), the
notation highlighting the fact that it is a function of p and not x. In this way, we can redefine
our estimator p̂(x) as the value of p within the range of its allowable values (generally referred
to as the parameter space) which maximises L(p). Note that an alternative common sense
estimator might be defined as x/3, but this is clearly less desirable in the present problem since
it will never give the correct answer, its only possible values being 0, 1/3, 2/3 and 1. Of course,
this is due to the description of our problem, which required that p be one of two specific values.
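The discrete maximisation in Example 2.2 is easy to reproduce numerically; the sketch below (Python; scipy assumed) tabulates Pr_p(X = x) for both candidate values of p and selects the maximiser for each possible observed value x:

```python
from scipy.stats import binom

candidates = [0.25, 0.75]
for x in range(4):  # possible observed counts of type A individuals
    likelihoods = {p: binom.pmf(x, 3, p) for p in candidates}
    p_hat = max(likelihoods, key=likelihoods.get)  # argmax over {0.25, 0.75}
    print(x, likelihoods, p_hat)
# prints p_hat = 0.25 for x = 0, 1 and p_hat = 0.75 for x = 2, 3
```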
In the previous example, the choice between two specific values of p made the problem rather
special. However, the notion of maximising a likelihood is easily extended to more general cases. In
particular, if we are not told that p, the proportion of type A individuals in the population, must
be either 0.25 or 0.75, then we can define a general maximum likelihood estimator for p by simply
maximising the likelihood function L(p) over the full range of possibilities; namely the interval
[0, 1]. Of course, doing so now requires simple calculus techniques as opposed to the examination
of a table. Specifically, setting the derivative of the likelihood function equal to zero, yields the
defining equation of the maximum likelihood estimator as:
(d/dp) L(p) |_{p=p̂} = L′(p̂) = {3!/(x!(3−x)!)} {x p̂^{x−1} (1−p̂)^{3−x} − (3−x) p̂^x (1−p̂)^{2−x}} = 0,
which is equivalent to x(1−p̂) − (3−x)p̂ = 0, and hence to p̂ = x/3, which now does equal the
“common sense” estimator of the population proportion of type A individuals.
In general, then, we can define the maximum likelihood estimator of a (vector) parameter θ indexing
a parametric model family having densities fX (x; θ) as follows:
i. The likelihood function for a parameter θ based on a sample of n random variables X1 , . . . , Xn
is defined to be the joint probability density function of the n random variables considered as
a function of the parameter θ:
L(θ) = L(θ; x₁, ..., xₙ) = f_{X₁,...,Xₙ}(x₁, ..., xₙ; θ).
(Throughout these notes, we will interpret the word “density” to mean a probability mass
function if the random variables in question are discrete). Note that if the Xi ’s are indepen-
dent and identically distributed with probability density function fX (x; θ), then the likelihood
function can be written as
L(θ) = Π_{i=1}^n f_X(x_i; θ).
ii. The maximum likelihood estimator (MLE) of a parameter θ is defined to be the value, θ̂ =
θ̂(x₁, ..., xₙ), which maximises the likelihood function L(θ; x₁, ..., xₙ) over the chosen set
of allowable parameter values or parameter space, usually denoted Θ [NOTE: the notation
θ̂(x₁, ..., xₙ) is used to remind us that the MLE, like any other estimator, is a function of
the observed data values]. Typically, the MLE will be the solution to the system of equations
determined by setting the (partial) derivative(s) of the likelihood function equal to zero. In
other words, θ̂ is the solution (in θ) to the (vector) equation ∂L(θ)/∂θ = 0. Of course, in the
event that the solution to these equations does not lie in the specified parameter space Θ, we
must then choose some other method of finding the appropriate restricted maximum.
iii. The form of most common probability densities usually means that the likelihood function
itself can be quite complicated to maximise directly. However, since the natural logarithm is
a monotonically increasing function, it is clear that the value of θ which maximises L(θ) is the
same as the value which maximises the log-likelihood function l(θ) = ln{L(θ)}. Typically, the
log-likelihood function will be much easier to deal with, and indeed, in the case of independent
and identically distributed observations the log-likelihood transforms the product structure of
the likelihood into the much more tractable summation structure
l(θ) = Σ_{i=1}^n ln{f_X(x_i; θ)}.
Using the log-likelihood, we can then define the MLE as the solution to the score equations:
∂l(θ)/∂θ₁ = 0, ..., ∂l(θ)/∂θ_k = 0,
provided the solution exists and is an element of Θ (NOTE: if the solution is not in Θ, then we
must find the MLE by examining the boundary of the set Θ to determine which parameter
value within the parameter space makes the log-likelihood the largest).
We now present some examples of the implementation of the maximum likelihood estimation pro-
cedure:
Example 2.3: Suppose that X1 , . . . , Xn are independent random variables each having a nor-
mal distribution with zero mean and variance σ². In this case, the appropriate density function is:
f_X(x; σ²) = (2πσ²)^{−1/2} exp{−x²/(2σ²)},
which leads to a log-likelihood function of:
l(σ²) = Σ_{i=1}^n ln[(2πσ²)^{−1/2} exp{−x_i²/(2σ²)}] = −(n/2) ln(2πσ²) − {1/(2σ²)} Σ_{i=1}^n x_i².
Differentiating this function with respect to σ² and setting equal to zero yields the MLE of σ² as:
(d/dσ²) l(σ²) = −n/(2σ²) + {1/(2(σ²)²)} Σ_{i=1}^n x_i²
⟹ −n/(2σ̂²) + {1/(2(σ̂²)²)} Σ_{i=1}^n x_i² = 0
⟹ σ̂² = (1/n) Σ_{i=1}^n x_i².
Example 2.4: Suppose that we observe n random vectors X₁ = (X₁₁, X₁₂), ..., Xₙ = (Xₙ₁, Xₙ₂)
each having a bivariate normal distribution with zero mean and variance-covariance matrix
V = [ τ₁+τ₂   τ₂−τ₁ ]
    [ τ₂−τ₁   τ₁+τ₂ ],
with 0 < τ1 ≤ τ2 . [NOTE: This example may seem somewhat contrived, but in fact, with some
minor algebraic modifications, it forms the basis for an extremely important class of statistical
techniques known as mixed linear models or random effects ANOVA models. However, a full
discussion of these models is beyond the scope of these notes.] In this case, the appropriate
density function for the random vectors X_i is:
f_{X_i}(x_{i1}, x_{i2}; τ₁, τ₂) = {4π√(τ₁τ₂)}⁻¹ exp[−{1/(8τ₁τ₂)}{(τ₁+τ₂)(x_{i1}² + x_{i2}²) + 2(τ₁−τ₂) x_{i1} x_{i2}}],
which leads to the log-likelihood function
l(τ₁, τ₂) = Σ_{i=1}^n ln{f_{X_i}(x_{i1}, x_{i2}; τ₁, τ₂)}
         = −(n/2) ln(τ₁τ₂) − {(τ₁+τ₂)/(8τ₁τ₂)} Σ_{i=1}^n (x_{i1}² + x_{i2}²) − {(τ₁−τ₂)/(4τ₁τ₂)} Σ_{i=1}^n x_{i1} x_{i2}.
[NOTE: Technically, there should be an additional term in the log-likelihood of the form
−n ln(4π), but it is common practice to omit any additive term in the log-likelihood which
is completely unrelated to the parameters. The reason for this is that such terms are irrelevant
for the purposes of determining the MLE, as can be seen from the fact that these terms will
disappear upon differentiation with respect to the parameter values performed in deriving the
score equation.] Differentiating this function with respect to τ₁ and τ₂ yields:
∂l(τ₁, τ₂)/∂τ₁ = −n/(2τ₁) + {1/(8τ₁²)} Σ_{i=1}^n (x_{i1}² + x_{i2}²) − {1/(4τ₁²)} Σ_{i=1}^n x_{i1} x_{i2}
              = −n/(2τ₁) + {1/(8τ₁²)} Σ_{i=1}^n (x_{i1} − x_{i2})²
∂l(τ₁, τ₂)/∂τ₂ = −n/(2τ₂) + {1/(8τ₂²)} Σ_{i=1}^n (x_{i1}² + x_{i2}²) + {1/(4τ₂²)} Σ_{i=1}^n x_{i1} x_{i2}
              = −n/(2τ₂) + {1/(8τ₂²)} Σ_{i=1}^n (x_{i1} + x_{i2})².
Setting these derivatives equal to zero and solving yields the MLEs as τ̂₁ = {1/(4n)} Σ_{i=1}^n (x_{i1} − x_{i2})²
and τ̂₂ = {1/(4n)} Σ_{i=1}^n (x_{i1} + x_{i2})², provided that τ̂₁ ≤ τ̂₂. If τ̂₁ > τ̂₂, then the solutions to the score
equations are not in the allowable parameter space, and we must find the MLEs by examining
the boundary of the parameter space. In this case, that means that we must maximise the
likelihood subject to the boundary condition τ₁ = τ₂. Making this substitution into the log-
likelihood function we have
l(τ₁, τ₁) = −n ln(τ₁) − {1/(4τ₁)} Σ_{i=1}^n (x_{i1}² + x_{i2}²).
Differentiating this function and setting equal to zero yields the solution τ̂₁ = τ̂₂ = {1/(4n)} Σ_{i=1}^n (x_{i1}² + x_{i2}²). Therefore, the MLEs for this problem are

τ̂₁ = {1/(4n)} Σ_{i=1}^n (x_{i1} − x_{i2})²   if Σ_{i=1}^n (x_{i1} − x_{i2})² ≤ Σ_{i=1}^n (x_{i1} + x_{i2})², and
τ̂₁ = {1/(4n)} Σ_{i=1}^n (x_{i1}² + x_{i2}²)   otherwise;

τ̂₂ = {1/(4n)} Σ_{i=1}^n (x_{i1} + x_{i2})²   if Σ_{i=1}^n (x_{i1} − x_{i2})² ≤ Σ_{i=1}^n (x_{i1} + x_{i2})², and
τ̂₂ = {1/(4n)} Σ_{i=1}^n (x_{i1}² + x_{i2}²)   otherwise.
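The case analysis in Example 2.4 translates directly into code. The sketch below (Python; the data are simulated via the assumed construction U_i = x_{i1} − x_{i2} ~ N(0, 4τ₁) and V_i = x_{i1} + x_{i2} ~ N(0, 4τ₂), independent, which reproduces the covariance matrix V) computes the score-equation solutions and falls back to the boundary solution when the constraint τ̂₁ ≤ τ̂₂ fails:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau1, tau2 = 200, 0.5, 1.5
# Simulate via U = x1 - x2 ~ N(0, 4*tau1) and V = x1 + x2 ~ N(0, 4*tau2).
u = rng.normal(0.0, np.sqrt(4 * tau1), size=n)
v = rng.normal(0.0, np.sqrt(4 * tau2), size=n)
x1, x2 = (v + u) / 2, (v - u) / 2

sum_diff = np.sum((x1 - x2) ** 2)  # = sum of U_i^2
sum_sum = np.sum((x1 + x2) ** 2)   # = sum of V_i^2

if sum_diff <= sum_sum:            # interior solution satisfies tau1 <= tau2
    tau1_hat, tau2_hat = sum_diff / (4 * n), sum_sum / (4 * n)
else:                              # boundary solution with tau1 = tau2
    tau1_hat = tau2_hat = np.sum(x1**2 + x2**2) / (4 * n)
print(tau1_hat, tau2_hat)
```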
Before we move on to a brief discussion of some other general estimation methods, we note that our
discussion of maximum likelihood estimation so far has only enabled us to estimate θ, the parameter
(vector) itself. Recall, however, that we are more generally interested in estimation of functions of
our parameters, τ = τ (θ). If τ (·) is a one-to-one vector function of θ, then we can “reparameterise”
our family of distributions, using the new parameter τ = τ (θ) and then implement our maximum
likelihood procedure on the newly indexed family. Essentially, this amounts to “renaming” each
member of the family, which in turn reduces to employing the chain rule for differentiation on the
score equations to arrive at new objective functions for deriving the MLE τ̂. Fortunately, none
of this is explicitly necessary, since some simple calculus and algebraic computations demonstrate
that for any function τ = τ(θ), the MLE of τ is given by τ̂ = τ(θ̂). This property is known as
functional equivariance of the MLE, and is formally stated and proved in the following theorem:
Theorem 2.1: Let x₁, ..., xₙ be an iid sample from a distribution having likelihood function
L(θ; x₁, ..., xₙ). Also, let θ̂ = θ̂(x₁, ..., xₙ) be the MLE of θ based on this likelihood function.
For any function τ = τ(θ), we can define the likelihood function induced by τ(·) as
M(τ; x₁, ..., xₙ) = sup_{θ: τ(θ)=τ} L(θ; x₁, ..., xₙ)
and τ̂, the MLE of τ, is then defined as the value which maximises this induced likelihood. In
such circumstances, τ̂ = τ(θ̂).
Proof: To show that τ̂ = τ (θ̂), we need to demonstrate that τ (θ̂) maximises the induced
likelihood M (τ ; x1 , . . . , xn ). In other words, we need to show that
M{τ(θ̂); x₁, ..., xₙ} ≥ M(τ; x₁, ..., xₙ) for every value of τ. To this end, note that
M(τ; x₁, ..., xₙ) = sup_{θ: τ(θ)=τ} L(θ; x₁, ..., xₙ)
                 ≤ sup_{θ∈Θ} L(θ; x₁, ..., xₙ)
                 = L(θ̂; x₁, ..., xₙ)
                 = sup_{θ: τ(θ)=τ(θ̂)} L(θ; x₁, ..., xₙ)
                 = M{τ(θ̂); x₁, ..., xₙ},
where the first inequality follows from the fact that the range over which the supremum is being
taken has been enlarged, the second equality follows from the definition of the M LE θ̂, the third
equality follows from the fact that the point θ = θ̂ remains in the range over which the supremum
is being taken, and the final equality follows from the definition of the induced likelihood. Thus,
we have demonstrated that M{τ(θ̂); x₁, ..., xₙ} ≥ M(τ; x₁, ..., xₙ) for all values of τ, which
proves that τ(θ̂) is the value which maximises the induced likelihood M(τ; x₁, ..., xₙ). In other
words, the MLE of τ is τ̂ = τ(θ̂), as was required.
2.1.3. Other Estimation Methods: There are many other estimation procedures which have
been developed, and we will study one of them in more detail in Section 2.5; namely, Bayesian
estimation. However, we here only briefly mention some of the general aspects of a few other
estimation procedures. The most common type of estimation procedure which we have not covered
so far is generally constructed by finding a value for an estimator which minimises some measure of
“distance” between the observed data and the distribution family of the chosen probability model.
Three of the most common choices for measuring this distance are least-squares, minimum chi-
square and minimum Kolmogorov distance. We now briefly describe these methods in the case
where we have observed the realisations x1 , . . . , xn of the random variables X1 , . . . , Xn assumed to
have come from a distribution belonging to a probability model indexed by the parameter θ and
having CDFs F_X(x; θ) and pdfs f_X(x; θ):
• Least-Squares - Choose θ̂, the estimate of θ, to be the value which minimises the distance
function:
d(θ) = Σ_{i=1}^n {x_i − E_θ(X_i)}²,
and then estimate τ by τ̂ = τ(θ̂).
• Minimum Chi-square - Group the observed data into k classes, let n_j denote the number of
observations falling in the j-th class, and let p_j(θ) denote the probability, under the model with
parameter θ, of an observation falling in the j-th class. Choose θ̂, the estimate of θ, to be the
value which minimises the distance function:
d(θ) = Σ_{j=1}^k {n_j − n p_j(θ)}²/{n p_j(θ)}.
Again, estimate τ by τ̂ = τ (θ̂). We note that the distance function d(θ) defined here is closely
related to the Kullback-Leibler distance and the entropy measure, which have the general form:
e(θ) = Σ_{j=1}^k n_j ln{n_j/(n p_j(θ))}.
• Minimum Kolmogorov distance - First, define the empirical distribution function, F̂n (x), by
F̂ₙ(x) = (1/n) Σ_{i=1}^n I(x_i ≤ x).
Note that F̂n (x) represents the proportion of data points less than or equal to the specified
value x (i.e., it is the CDF of the distribution with probability n⁻¹ on each of the observed
values x_i). Choose θ̂, the estimate of θ, to be the value which minimises the distance function:
d(θ) = sup_x |F̂ₙ(x) − F_X(x; θ)|.
In other words, we choose the value of θ which minimises the maximum vertical distance
between the chosen family of CDF s and the observed CDF of the data values. As before,
estimate τ by τ̂ = τ (θ̂).
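A minimal sketch of minimum Kolmogorov distance estimation (Python; numpy and scipy assumed; the model family is taken, purely for illustration, to be normal with unknown mean and known unit variance). Note that the supremum over x is attained at, or just before, one of the jumps of F̂ₙ:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = np.sort(rng.normal(loc=3.0, scale=1.0, size=100))
n = len(x)

def kolmogorov_distance(mu):
    # Model CDF (normal, unknown mean, known unit variance) at the data points.
    F = norm.cdf(x, loc=mu, scale=1.0)
    # The empirical CDF jumps from (i-1)/n to i/n at the i-th order statistic.
    upper = np.arange(1, n + 1) / n
    lower = np.arange(0, n) / n
    return max(np.max(np.abs(upper - F)), np.max(np.abs(lower - F)))

opt = minimize_scalar(kolmogorov_distance, bounds=(0.0, 6.0), method="bounded")
print(opt.x)  # minimum Kolmogorov distance estimate of mu (close to 3)
```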
In closing, we note that the reason that these estimation procedures are not covered in more
detail is that they generally are extremely difficult to implement in practice, and as such are
not commonly employed in real estimation problems. Nonetheless, they do demonstrate a very
intuitively appealing idea in the approach to estimation; namely, the idea of minimising some
measure of distance between the observed data and the theoretical model chosen to describe the
population from which the data arose.
2.2. Properties of Estimators
In the preceding sections we introduced a variety of estimators, generally justified on reasonably
intuitive grounds. We now wish to establish some criteria on which we can base comparisons of our
estimators. In particular, we would like to decide which estimator is “best” for a given problem.
Before we introduce these criteria and discuss the associated properties of the estimators we have
introduced, we need to make a distinction between two general types of comparison criteria. The
two major classes of criteria are distinguished by their relationship to the size of the sample on which
the estimator is based. Specifically, properties based on the estimation procedure as it pertains to
any fixed sample size are referred to as small-sample properties. Alternatively, properties which
pertain to the behaviour of an estimation procedure as the sample size increases without bound
are referred to as large-sample or asymptotic properties.
2.2.1. Bias and Mean Squared Error: The most common measure of how “close” to its target
an estimator tends to be is the mean-squared error or MSE. For any estimator T = t(X₁, ..., Xₙ)
of the quantity τ = τ(θ), the MSE is defined as:
MSE_t(θ) = E_θ[{T − τ(θ)}²],
where the notation MSE_t(θ) is used to indicate the dependence of the mean-squared error on both
the estimator in question and the value of the underlying parameter θ.
The MSE can be partitioned into two important components, based on the relationship:
MSE_t(θ) = E_θ{(T − τ)²}
         = E_θ[{T − E_θ(T)} + {E_θ(T) − τ}]²
         = E_θ[{T − E_θ(T)}²] + 2E_θ[{T − E_θ(T)}{E_θ(T) − τ}] + {E_θ(T) − τ}²
         = Var_θ(T) + 2{E_θ(T) − τ} E_θ{T − E_θ(T)} + {E_θ(T) − τ}²
         = Var_θ(T) + {Bias_θ(T)}²,
where the final equality follows from the fact that E_θ{T − E_θ(T)} = E_θ(T) − E_θ(T) = 0 and we
have defined Bias_θ(T) = E_θ(T) − τ to be the bias of the estimator T (i.e., the difference between
the expectation of the estimator and the quantity which it is being used to estimate). Using the
MSE, we can now compare estimation procedures:
Example 2.1 (cont’d): We have seen that the standard method of moments estimator (and
indeed the MLE as well) of the parameter σ², based on X₁, ..., Xₙ, a sample of size n from a
normal distribution with mean µ and variance σ², is σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄)². Alternatively, we
know that the standard unbiased estimator of σ² is the usual sample variance, s² =
{1/(n−1)} Σ_{i=1}^n (X_i − X̄)². It is a simple (though tedious) calculation to show that:
Var_{µ,σ²}(s²) = 2σ⁴/(n−1),
and the demonstration of this fact is left as an exercise. Since we know that s² is unbiased, it
is clear that MSE_{s²}(µ, σ²) = Var_{µ,σ²}(s²). Now, we can write σ̂² = n⁻¹(n−1)s², so that:
E_{µ,σ²}(σ̂²) = E_{µ,σ²}{(n−1)s²/n} = {(n−1)/n} E_{µ,σ²}(s²) = (n−1)σ²/n,
and
Bias_{µ,σ²}(σ̂²) = E_{µ,σ²}(σ̂²) − σ² = (n−1)σ²/n − σ² = −σ²/n,
Var_{µ,σ²}(σ̂²) = Var_{µ,σ²}{(n−1)s²/n} = {(n−1)/n}² Var_{µ,σ²}(s²) = 2(n−1)σ⁴/n².
The difference between the two mean-squared errors is therefore
MSE_{s²}(µ, σ²) − MSE_{σ̂²}(µ, σ²) = 2σ⁴/(n−1) − {2(n−1)σ⁴/n² + σ⁴/n²} = (3n−1)σ⁴/{n²(n−1)},
which is clearly positive for any sample size n ≥ 2. In other words, despite the fact that s²
is unbiased, σ̂² has smaller MSE. Moreover, suppose we define another estimator as σ̂²_c = cs²
for some constant c. In this case, we can again easily calculate:
Bias_{µ,σ²}(σ̂²_c) = E_{µ,σ²}(σ̂²_c) − σ² = cσ² − σ² = (c − 1)σ²
and
Var_{µ,σ²}(σ̂²_c) = Var_{µ,σ²}(cs²) = c² Var_{µ,σ²}(s²) = 2c²σ⁴/(n − 1).
Therefore, the MSE of this new estimator is given by
MSE_{σ̂²_c}(µ, σ²) = Var_{µ,σ²}(σ̂²_c) + {Bias_{µ,σ²}(σ̂²_c)}² = 2c²σ⁴/(n − 1) + (c − 1)²σ⁴.
Differentiating this expression with respect to c and equating to zero shows that:
4cσ⁴/(n − 1) + 2(c − 1)σ⁴ = 0  ⟹  4c + 2(c − 1)(n − 1) = 0
                              ⟹  {4 + 2(n − 1)}c = 2(n − 1)
                              ⟹  c = (n − 1)/(n + 1).
It is straightforward to verify that this value of c yields a minimum, and thus, among all
estimators of the form cs², the one with the minimum MSE is {(n−1)/(n+1)}s² = {1/(n+1)} Σ_{i=1}^n (X_i − X̄)²,
which is neither the MLE, the method of moments estimator nor the usual unbiased estimator.
[NOTE: We have not shown that this new estimator has the smallest possible MSE of any
estimator, only that it has the smallest MSE among those having the form cs² for some constant c.]
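The ordering of the three MSEs derived in this example is easy to verify by Monte Carlo simulation; a minimal sketch (Python; the values of n, σ² and the number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, reps = 10, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

for name, est in [("s^2 (unbiased)", ss / (n - 1)),
                  ("MLE sigma^2-hat", ss / n),
                  ("minimum-MSE cs^2", ss / (n + 1))]:
    print(name, np.mean((est - sigma2) ** 2))  # Monte Carlo MSE
# the estimated MSEs decrease down the list, as the theory predicts
```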
Ideally, we would like to find an estimator T = t(X₁, ..., Xₙ) which has minimal MSE, so that
for any other estimator T₁ = t₁(X₁, ..., Xₙ) we have MSE_t(θ) ≤ MSE_{t₁}(θ) for all values of
θ ∈ Θ. Unfortunately, it is easy to see that such an estimator cannot exist (except in the most
unusual of circumstances). To demonstrate this, we define the estimator T₀ = t₀(X₁, ..., Xₙ) ≡
τ(θ₀) = τ₀ (i.e., T₀ is the estimator which always yields an estimate equal to some pre-specified
value τ₀ regardless of the observed data values) and note that MSE_{t₀}(θ) = {Bias_{t₀}(θ)}² = {τ₀ − τ(θ)}²,
so that MSE_{t₀}(θ₀) = 0. Thus, since MSEs are clearly non-negative, no estimator will have smaller
MSE than T₀ when θ = θ₀. Of course, for other values of θ, T₀ is an extremely silly estimator, but
this example demonstrates the difficulty of finding the “best” estimator uniformly over all possible
values of θ. Indeed, if we imagine T₀-type estimators for each possible parameter value in Θ, then
the following theorem shows that if an estimator T = t(X₁, ..., Xₙ) is to have smaller MSE than
all of these estimators over the entire range of Θ, then it must have MSE_t(θ) ≡ 0.
Theorem 2.2: Suppose that X1 , . . . , Xn are an iid sample from a distribution with density
function fX (x; θ) belonging to a family indexed by the parameter θ ∈ Θ. If T = t(X1 , . . . , Xn )
is an estimator of τ = τ(θ) satisfying MSE_t(θ) ≤ MSE_{t′}(θ) for all θ ∈ Θ and any other
estimator T′ = t′(X₁, ..., Xₙ) [i.e., T has uniformly minimal MSE], then MSE_t(θ) = 0 for all
θ ∈ Θ.
Proof: Pick any value θ₀ ∈ Θ and define the estimator T₀ = t₀(X₁, ..., Xₙ) ≡ τ(θ₀). Clearly,
MSE_{t₀}(θ₀) = 0. Therefore, since we have assumed that T has uniformly minimal MSE, we
must have MSE_t(θ₀) ≤ MSE_{t₀}(θ₀) = 0. Since MSEs are non-negative quantities, it must be
the case that MSE_t(θ₀) = 0. Finally, since the original choice of θ₀ was arbitrary, the preceding
argument is valid for any choice of θ₀, meaning that MSE_t(θ) = 0 for any value of θ ∈ Θ.
In other words, the only possible estimator with minimal MSE over the full range of the parameter
space is one with an MSE which is uniformly zero, and generally speaking such estimators do not
exist since they must be both unbiased and have no variance (i.e., they must be exactly correct for
any sample values x1 , . . . , xn ).
One reason for being unable to find an estimator with uniformly smallest MSE over all values of
θ ∈ Θ is that there are simply too many possible estimators (as the silly estimators in the preceding
discussion demonstrate). One solution to this problem is to restrict the class of allowable estimators
t(·), for instance by requiring the allowable estimators to be unbiased, so that Biast (θ) = 0 for all
θ ∈ Θ. We will further investigate this possibility in later sections.
2.2.2. Location and Scale Equivariance: At the end of the previous subsection, we noted that
we might restrict attention to unbiased estimators in an effort to reduce the class of allowable
estimators enough so that an “optimal” estimator, in terms of minimal MSE, might be found.
In this section, we investigate alternative “common sense” properties which might be used for the
same purpose in certain settings.
First, suppose that we are estimating a scalar quantity τ = τ (θ) which can be interpreted as
the “centre” or “location” of the underlying distribution family. Such quantities τ are referred to
as location parameters and are formally defined as follows:
Definition 2.1: Let {fX (x; θ), θ ∈ Θ} be a family of distributions with density functions
fX (x; θ). Suppose that there is a function h(·) such that fX (x; θ) = h{x − τ (θ)}. If such a
function exists, then τ = τ (θ) is a location parameter. Equivalently, it is not difficult to show
that the preceding description implies that τ = τ (θ) is a location parameter for the family of
densities if and only if the density function of the new random variable Y = X − τ (θ) does not
depend on θ.
An obvious (and easily demonstrated) property of location parameters is that if X has density
fX (x; θ) = h{x − τ (θ)} then W = X + c has density h{(w − c) − τ (θ)} = h[w − {τ (θ) + c}]. In
other words, if τ = τ (θ) is a location parameter for the distribution family associated with an iid
sample of X’s, then τ + c is a location parameter for the distribution family associated with the
corresponding W ’s. The idea here is that “shifting” all of the observed data by a fixed amount
has the effect of shifting its location by the same amount. As such, it seems reasonable that any
estimator we choose for τ should have the corresponding “shift” property. That is, we would like our
estimation procedure to produce an estimate based on the shifted data which is just the estimate
based on the original data shifted by the appropriate amount. Estimators with this property are
said to be location equivariant. Formally, an estimator T = t(X1 , . . . , Xn ) is location equivariant
if it satisfies:
t(X1 + c, . . . , Xn + c) = t(X1 , . . . , Xn ) + c,
for any constant value c.
We note that most of the usual estimators of location are indeed location equivariant. For
example, clearly median(X1 + c, . . . , Xn + c) = median(X1 , . . . , Xn ) + c, so the median is a location
equivariant estimator. Similarly, the sample mean is location equivariant, since
t(X₁ + c, ..., Xₙ + c) = (1/n) Σ_{i=1}^n (X_i + c) = (1/n) Σ_{i=1}^n X_i + c = t(X₁, ..., Xₙ) + c.
It can also be shown that, among all location equivariant estimators, the estimator
t(X₁, ..., Xₙ) = ∫ u Π_{i=1}^n h(X_i − u) du / ∫ Π_{i=1}^n h(X_i − u) du
has uniformly minimum MSE and is known as the Pitman estimator of location (estimators which
have uniformly minimal MSE among the class of location equivariant estimators are sometimes
referred to as MRE or minimum risk equivariant estimators). While this estimator seems quite
complicated, it can be shown that it reduces dramatically for many of the common distribution
families. In particular, if fX (x; θ) is the normal density with mean θ and known variance, then
τ = τ (θ) = θ is the location parameter and the Pitman estimator of location reduces to the sample
average (i.e., for a normal population mean, the sample average has uniformly minimal MSE
among all location equivariant estimators).
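A quick numerical check of location equivariance for two familiar location estimators (Python; the data and the shift c are arbitrary):

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0, 7.0, 4.0])  # hypothetical data
c = 10.0                                 # arbitrary shift

# Both estimators satisfy t(x + c) = t(x) + c.
print(np.mean(x + c), np.mean(x) + c)      # equal
print(np.median(x + c), np.median(x) + c)  # equal
```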
Alternatively, suppose that we are interested in estimating a scalar quantity τ = τ (θ) which
can be interpreted as the “spread” or “scale” of the underlying distribution family. Such quantities
τ are referred to as scale parameters and are formally defined as follows:
Definition 2.2: Let {fX (x; θ), θ ∈ Θ} be a family of distributions with density functions
f_X(x; θ). Suppose that there is a function h(·) such that f_X(x; θ) = {τ(θ)}⁻¹ h(x{τ(θ)}⁻¹). If
such a function exists, then τ = τ(θ) is a scale parameter (NB: note that this definition requires
τ(θ) > 0 for all θ ∈ Θ, since density functions must be non-negative). Equivalently, it can be
shown that the preceding description implies that τ = τ (θ) is a scale parameter for the family
of densities if and only if the density function of the new random variable Y = X/τ (θ) does not
depend on θ.
An important property of scale parameters is that if X has density f_X(x; θ) = {τ(θ)}⁻¹ h(x{τ(θ)}⁻¹)
then W = cX has density {cτ(θ)}⁻¹ h(w{cτ(θ)}⁻¹) when c > 0 and density
{|c|τ(θ)}⁻¹ h(−w{|c|τ(θ)}⁻¹) = {|c|τ(θ)}⁻¹ h₁(w{|c|τ(θ)}⁻¹)
when c < 0 and the function h1 is defined by the relationship h1 (x) = h(−x). In either case, we see
that if τ = τ (θ) is a scale parameter for the distribution family associated with an iid sample of X’s,
then |c|τ is a scale parameter for the distribution family associated with the corresponding W ’s.
The idea here is that “shrinking” or “expanding” all of the observed data by a fixed amount has the
effect of changing its scale by the same amount. As such, it seems reasonable that any estimator
we choose for τ should have the corresponding property. That is, we would like our estimation
procedure to produce an estimate based on the scaled data which is just the estimate based on the
original data multiplied by the appropriate scale factor. Estimators with this property are said to
be scale equivariant. Formally, an estimator T = t(X₁, ..., Xₙ) is scale equivariant if it satisfies:
t(cX₁, ..., cXₙ) = |c| t(X₁, ..., Xₙ),
for any constant value c. For example, the sample standard deviation satisfies
s(cX₁, ..., cXₙ) = √[{1/(n−1)} Σ_{i=1}^n {cX_i − (1/n) Σ_{j=1}^n cX_j}²] = |c| √[{1/(n−1)} Σ_{i=1}^n (X_i − X̄)²] = |c| s(X₁, ..., Xₙ),
so that the sample standard deviation is also seen to be a scale equivariant estimator. [NOTE:
The preceding calculation actually uses the fact that the sample mean is also a scale equivariant
estimator (which is easily seen from a quick algebraic calculation), even though it is not normally
thought of as a scale estimator.] Finally, we note that in addition to scale equivariance, another
desirable property of scale estimators is that they do not change if a fixed constant is added to
each of the observed data values (since such a transformation would not change the scale of the
values only their location). Estimators which have such a property are called location invariant.
Formally, an estimator T = t(X1 , . . . , Xn ) is location invariant if it satisfies:
t(X1 + c, . . . , Xn + c) = t(X1 , . . . , Xn ),
for any constant value c. Most of the usual estimators of scale are not only scale equivariant but
location invariant as well (e.g., the IQR and the sample standard deviation are location invariant
as well as scale equivariant).
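Both scale equivariance and location invariance are easy to check numerically for the sample standard deviation and the IQR (Python; a negative multiplier c is used deliberately, to exercise the |c| in the definition):

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0, 7.0, 4.0])
c, shift = -3.0, 10.0

sd = lambda v: np.std(v, ddof=1)                         # sample standard deviation
iqr = lambda v: np.percentile(v, 75) - np.percentile(v, 25)

for t in (sd, iqr):
    print(t(c * x), abs(c) * t(x))  # scale equivariance: t(c*x) = |c| * t(x)
    print(t(x + shift), t(x))       # location invariance: t(x + shift) = t(x)
```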
2.2.3. Consistency and Asymptotic Efficiency: The previous sections have defined properties
of estimators for a fixed sample X1 , . . . , Xn of size n. In other words, these were small sample
properties. We now turn our attention to two new properties of estimators which are defined
asymptotically; that is, as the sample size grows without bound. Recall that such properties are
termed “large sample”. In such situations, we will generally denote the estimator based on a given
sample size n by Tn = tn (X1 , . . . , Xn ) and then examine the limiting properties of the sequence of
estimators {Tn }n=1,2,... as n tends towards infinity.
The first large sample property we will discuss deals with the notion of an estimation proce-
dure eventually yielding an essentially exactly correct result given sufficiently large samples. The
formalisation of this notion is termed consistency and can be defined as follows:
Definition 2.3: Let T1 , T2 , . . . be a sequence of estimators of τ (θ), where Tn = tn (X1 , . . . , Xn ).
The sequence {Tₙ}_{n=1,2,...} is weakly consistent if for every ε > 0,
lim_{n→∞} Pr_θ{τ(θ) − ε < Tₙ < τ(θ) + ε} = 1.
In other words, a sequence of estimators is weakly consistent as long as the probability that it is
eventually within any small interval around the true value τ (θ) tends towards one. This idea can be
seen as the formalisation of the notion that, as the amount of information increases, our estimation
procedure should give better and better estimates with larger and larger probability.
We note, however, that just because a sequence of estimators is weakly consistent does not
necessarily imply that it has any nice small sample properties. For instance, it is possible for a
sequence of estimators to be weakly consistent even though each member of the sequence is biased;
that is, E_θ(Tₙ) ≠ τ(θ) for any n. Indeed, it need not even be the case that the bias vanishes
in the limit; that is, that lim_{n→∞} E_θ(Tₙ) = τ(θ). Now, at the least, it seems reasonable to ask that a
sequence of estimators have this last property, generally referred to as the estimator sequence being
asymptotically unbiased. It turns out that we can ensure this behaviour if we define a stronger
kind of consistency:
Definition 2.4: Let T1 , T2 , . . . be a sequence of estimators of τ (θ), where Tn = tn (X1 , . . . , Xn ).
The sequence {Tₙ}_{n=1,2,...} is mean-square consistent if and only if
lim_{n→∞} E_θ[{Tₙ − τ(θ)}²] = 0.
It can be shown that if a sequence of estimators is mean-square consistent then it must be asymp-
totically unbiased (a fact which follows directly from the relationship between the MSE and the
variance and bias of the estimator Tn ). Moreover, if an estimator is mean-square consistent it must
also be weakly consistent (of course, as noted earlier, the reverse implication is not true). The
demonstration of this fact relies on the so-called Chebychev inequality, which states that for any
random variable Z and any constants a > 0 and c it must be the case that
Pr(|Z − c| ≥ a) ≤ E{(Z − c)²}/a².
To see this, suppose that Z has density function fZ (z), and note that
E{(Z − c)²} = ∫_{−∞}^{∞} (z − c)² f_Z(z) dz
           = ∫_{|z−c|<a} (z − c)² f_Z(z) dz + ∫_{|z−c|≥a} (z − c)² f_Z(z) dz
           ≥ ∫_{|z−c|≥a} (z − c)² f_Z(z) dz
           ≥ ∫_{|z−c|≥a} a² f_Z(z) dz
           = a² ∫_{|z−c|≥a} f_Z(z) dz
           = a² Pr(|Z − c| ≥ a),
which provides the desired result after some simple algebraic rearrangement. Now, using this result
we note that
Pr_θ{τ(θ) − ε < Tₙ < τ(θ) + ε} = Pr_θ{|Tₙ − τ(θ)| < ε}
                               = 1 − Pr_θ{|Tₙ − τ(θ)| ≥ ε}
                               ≥ 1 − E_θ[{Tₙ − τ(θ)}²]/ε².
Thus, if the sequence {Tₙ}_{n=1,2,...} is mean-square consistent, so that lim_{n→∞} E_θ[{Tₙ − τ(θ)}²] = 0,
we see that
lim_{n→∞} Pr_θ{τ(θ) − ε < Tₙ < τ(θ) + ε} ≥ 1.
Of course, since probabilities cannot exceed unity, this inequality must be an equality, which is
precisely the defining equation for weak consistency.
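The defining limit of weak consistency can be illustrated by simulation; the sketch below (Python; normal data and ε = 0.1 are arbitrary choices) estimates Pr_θ{|X̄ₙ − µ| < ε} for increasing n:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, eps, reps = 1.0, 0.1, 5000

for n in (10, 100, 1000):
    xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) < eps))
# the estimated probability increases towards one as n grows
```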
We close this section with the second of our large sample properties for estimators. This
property is generally referred to as asymptotic relative efficiency and to define it, we must first
define the notion of asymptotic normality. Of course, all standard introductions to statistical
inference teach the Central Limit Theorem, and thus we are familiar with the concept of a random
variable having a normal distribution “in the limit” as the sample size increases, but this notion
is rarely defined more precisely in introductory units. Here we will start to give a more formal
definition of what it means for something to have a normal distribution “in the limit”:
Definition 2.5: Let Z1 , Z2 , . . . be a sequence of random variables with cumulative distribution
functions F1 (z), F2 (z), . . .. The sequence {Zn }n=1,2,... is said to be asymptotically normal if:
i. lim_{n→∞} E(Zₙ) = µ for some value µ;
ii. lim_{n→∞} Var(Zₙ) = σ² > 0 for some positive value σ²; and,
iii. lim_{n→∞} Fₙ(z) = Φ{(z − µ)/σ} for all z ∈ (−∞, ∞), where Φ(·) is the CDF of the standard
normal distribution.
[NOTE: While this definition provides an explanation of what it means for the distribution of a
sequence of random variables to converge to a normal distribution (and, indeed, the above definition
is an example of a more general concept known as “convergence in distribution”), it is rarely very
practical to demonstrate that a sequence of random variables is asymptotically normal by examining
the limit of their CDFs. Generally, it is easier (and turns out to be equivalent) to show that the
associated moment generating functions of the Zn ’s converge to the moment generating function
of a normal distribution with mean µ and variance σ 2 .]
Once we have a formal notion of what it means for a sequence of random variables to be
asymptotically normal, we can then define asymptotic relative efficiency as follows:
Definition 2.6: Let T1 , T2 , . . . and U1 , U2 , . . . be two weakly consistent sequences of estimators
of τ(θ), and define the new random variables Zₙ = √n{Tₙ − τ(θ)} and Wₙ = √n{Uₙ − τ(θ)}.
Further, assume that the sequences {Zₙ}_{n=1,2,...} and {Wₙ}_{n=1,2,...} are asymptotically normal
with means µ_Z = µ_W = 0 and variances σ²_Z = σ²_Z(θ) and σ²_W = σ²_W(θ), where, as the notation
suggests, the limiting variances of the Zₙ's and the Wₙ's depend on the true underlying value
of the parameter θ. The asymptotic relative efficiency of the sequence {Tₙ}_{n=1,2,...} with respect
to the sequence {Uₙ}_{n=1,2,...} is defined as e_{T,U} = σ²_W/σ²_Z.
As a simple example of this concept, suppose that X₁, ..., Xₙ are a sample from a normal population
with mean µ and variance σ². The usual sequence of estimators for µ, X̄ₙ = (1/n) Σ_{i=1}^n X_i, is well
known to be weakly consistent (indeed, it is mean-square consistent, which follows from the Law
of Large Numbers), and the sequence of random variables Zₙ = √n(X̄ₙ − µ) is well known to
be asymptotically normal with mean zero and variance σ² (by the Central Limit Theorem). It
can be shown (though it is rather difficult and thus omitted here) that the sequence of estimators
X̃ₙ = median(X₁, ..., Xₙ) is also weakly consistent and that the sequence of random variables Wₙ =
√n(X̃ₙ − µ) is asymptotically normal with mean zero and variance σ²/{2φ(0)}², where φ(·) is the
density function of the standard normal distribution. Now, a simple exercise shows that φ(0) =
1/√(2π), and thus the asymptotic relative efficiency of the sample average with respect to the sample
median (in the case of normal data) is e_{X̄,X̃} = π/2. Since this value is larger than one, we see
that the sample average is more efficient than the sample median when the data are truly from
a normal population. Since asymptotic efficiencies are based on asymptotic variances, and these
variances are used in assessing the accuracy of estimators (which the reader will recall from their
introductory unit in statistics and which we will deal with in more detail in Section 3), one useful
interpretation of the relative efficiency is “the amount of extra data required for one estimation
procedure to be as accurate as another”. For our example of the sample mean and sample median,
then, we can see that in order for the sample median to be as accurate as the sample mean, we
must have a sample which has π/2 ≈ 1.57 times as many observations. [Provided, of course, we
believe the normality assumption, and indeed if the data are not normally distributed then it is
possible for the median to be more efficient than the mean.]
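The value π/2 can be checked by simulation; the sketch below (Python) compares the Monte Carlo variances of √n(X̄ₙ − µ) and √n(X̃ₙ − µ) for standard normal data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 1000, 10_000

data = rng.normal(0.0, 1.0, size=(reps, n))  # mu = 0, sigma = 1
z = np.sqrt(n) * data.mean(axis=1)           # sqrt(n) * (mean - mu)
w = np.sqrt(n) * np.median(data, axis=1)     # sqrt(n) * (median - mu)

print(np.var(z), np.var(w), np.var(w) / np.var(z))
# roughly 1, pi/2 = 1.571 and pi/2, respectively
```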
Finally, once we have the notion of relative efficiency, we might ask whether we can find
best asymptotically normal (BAN) estimator sequences, which are essentially those for which the
relative efficiency with respect to any other sequence is always larger than or equal to one. In other
words, a weakly consistent sequence of estimators {Tn }n=1,2,... is BAN for τ (θ) if:
i. the sequence of random variables Zₙ = √n{Tₙ − τ(θ)} is asymptotically normal with mean
µ = 0 and variance σ² = σ²(θ); and,
ii. any other weakly consistent sequence of estimators {T′ₙ}_{n=1,2,...}, for which the sequence of
random variables Z′ₙ = √n{T′ₙ − τ(θ)} is asymptotically normal with mean µ′ = 0 and
variance σ′² = σ′²(θ), has σ′²(θ) ≥ σ²(θ) for all θ ∈ Θ.
Of course, it is generally very difficult to prove that a sequence is BAN from this definition, since
we must be able to verify the minimality of the asymptotic variance over all other consistent,
asymptotically normal estimator sequences. However, it can be shown that many of the common
estimators are indeed best asymptotically normal. For instance, the sample mean is a BAN esti-
mator for the mean µ of a normal population. Unfortunately, the limiting nature of the definition
of relative efficiency means that BAN estimators are rarely unique. For instance, the sequence of
estimators Tₙ = {1/(n+1)} Σ_{i=1}^n X_i is also BAN for µ from a normal population, since its asymptotic
variance is clearly the same as that of the usual sample average, the additional one in the divisor
becoming essentially negligible as the sample size increases towards infinity.
2.2.4. Loss Functions and Minimax Estimation: In this section, we examine the notion behind
the M SE and extend its defining concept. If we consider the problem of estimating τ (θ) from the
perspective of making a choice or decision among the possible values of τ (θ), then an estimator
T = t(X1 , . . . , Xn ) is sometimes referred to as a decision function or a decision rule. Obviously,
the random nature of the observations means that the actual estimate t = t(x1 , . . . , xn ) based
on the particular observed values x1 , . . . , xn will inevitably be in error. However, it is generally
the case that some errors are more severe than others, and we can quantify this idea by defining
an appropriate loss function, ℓ(t; θ). There are many ways of measuring the loss associated with
estimating τ(θ) to be the value t, and the three most common ones are:
i. Squared-Error: ℓ(t; θ) = {t − τ(θ)}²;
ii. Absolute-Error: ℓ(t; θ) = |t − τ(θ)|; and,
iii. Constant-Error: ℓ(t; θ) = A·I{|t − τ(θ)| > ε}.
The first two of these functions measure the loss as an increasing function of the discrepancy between
the true value of τ(θ) and the estimated value t. The third function assigns a loss of some
fixed value A if the estimate differs from the true value τ(θ) by more than some pre-specified value
ε, and a loss of zero otherwise (i.e., as long as the estimate is within ε of the true value there is no
loss). Of course, there are many other potential measures of loss, and the context of any particular
problem may suggest which loss function is the most sensible in the circumstances (in particular,
the three loss functions discussed here are all symmetric, so that errors below and errors above of
the same size incur equal losses; however, there are situations in which the direction of the error
will affect the loss, and in such situations asymmetric loss functions are necessary).
Suppose, however, that we have been able to determine the most sensible loss function for a
given problem (which is a quite large supposition, of course). Obviously, we would like to pick
a decision function (i.e., an estimator) which has a small associated loss. Of course, since the
estimators are based on random observations, we cannot hope to find a decision rule which can
guarantee small loss for every possible outcome of the random observations. As such, we must
lower our sights somewhat, and instead we will try and minimise the average loss over the possible
outcomes of the observations. Doing so leads to the definition of the so-called risk function, R_t(θ) =
E_θ{ℓ(T; θ)}. The risk function allows us to compare competing decision rules. In particular, suppose
that we have two competing decision functions t1 (X1 , . . . , Xn ) and t2 (X1 , . . . , Xn ), then we can say
that t1 is a better estimator than t2 if Rt1 (θ) ≤ Rt2 (θ) for all θ ∈ Θ, and Rt1 (θ) < Rt2 (θ) for at
least one value of θ in the parameter space Θ. As a final piece of nomenclature, we shall say that
an estimator is admissible if there is no better estimator (i.e., if there is no estimator with smaller
or equal risk for all possible parameter values).
Given these ideas, we can then attempt to determine a decision rule (i.e., an estimation pro-
cedure) which has minimal risk among the admissible estimators. However, we quickly see that if
we choose the squared-error loss function, then the risk function simply becomes our now familiar
MSE_t(θ), for which we know that no uniformly minimal estimator generally exists. Indeed, for
almost any loss function we choose (and certainly the three common loss functions defined previ-
ously), there will not be a general estimator which has uniformly minimal risk over the entire range
of possible values for the parameter θ. The problem, as we have seen, is that the risk function
depends on θ. Earlier, we suggested reducing the class of estimators to overcome this problem, and
we will investigate the idea further in subsequent sections. However, an alternate approach might
be to find an estimator which has the smallest “overall” risk over all possible values of θ. Of course,
we must more formally specify what we mean by an “overall” risk. This idea will be more fully
discussed in Section 2.5. For now, though, we discuss a simple definition of overall risk; namely,
the maximal risk, supθ∈Θ Rt (θ).
Definition 2.7: Suppose that T = t(X1 , . . . , Xn ) is an estimation procedure (or decision rule)
for the quantity τ (θ). Also, suppose that the chosen loss function for the estimation problem
is given by ℓ(t; θ), so that the risk function for T is given by R_t(θ) = E_θ{ℓ(T; θ)}. If, for any
other estimation procedure T′ = t′(X₁, ..., Xₙ) with risk function R_{t′}(θ) = E_θ{ℓ(T′; θ)}, the
risk function of T satisfies
sup_{θ∈Θ} {R_t(θ)} ≤ sup_{θ∈Θ} {R_{t′}(θ)},
then T is said to be a minimax estimator of τ(θ), since it minimises the maximal (i.e., worst-case) risk.
2.3. Sufficiency
One of the most important uses of statistical methods is to effect data reduction and summari-
sation. In particular, in our present parametric estimation setting, we would like to distill the
information regarding the parameter θ from our sample of random observations. Clearly, not all
of the information in these observations will be relevant to θ (indeed, some part of the observed
values are simply based on random chance). As such, we will want to reduce or summarise our
observations by ignoring extraneous information. Of course, we will not want to reduce our data
to the extent that we start to lose information which is relevant to the parameter θ. Reduction of
data takes place through the construction of statistics (or estimators), and a statistic which retains
all the information relevant to the parameter θ which was contained in the original data values is
termed sufficient for θ. The general notion here is to replace the actual observations by the value
of a sufficient statistic which removes as much extraneous information (presumably caused by the
underlying randomness in the data) as possible and still maintains all of the relevant information
in the data. As such, decisions made on the basis of sufficient statistics instead of the full set of
observations can be seen to be equally valid and useful.
More formally, suppose that X1 , . . . , Xn is a random sample from a distribution family having
densities fX (x; θ). Let X represent the sample space of the random vector (X1 , . . . , Xn ), then a
statistic T = t(X1 , . . . , Xn ) can be viewed as a partitioning of X . In other words, if we define T
to be the sample space of T and define the sets Xt = {(x1 , . . . , xn ) ∈ X : t(x1 , . . . , xn ) = t} for
each t ∈ T , then the collection {Xt }t∈T forms a partition of X . The usefulness of a statistic in
terms of its data reduction properties can then be judged by how effective this partitioning is in
both reducing the number of “possible” values to be considered as well as the degree to which all
relevant information regarding the parameter θ is retained. With regard to the partitioning induced
by a statistic, we can see that if decisions are based on the value of a statistic instead of the actual
observed data, then clearly the decision will be the same for any dataset within the same partition
of the sample space, Xt . As such, in order for a statistic to be sufficient (i.e., retain all relevant
information regarding the parameter θ) the information which distinguishes the individual elements
of each Xt should have no bearing on the value of θ (i.e., if the observed sample is known to be in a
given Xt , the probability of the sample taking any of the values within this member of the sample
space partition should not depend on the value of θ). We shall give a formal characterisation of
when we can expect this to happen, but first we examine a simple example which illustrates the
ideas behind sufficiency:
Example 2.5: Let X1 , X2 , X3 be a sample of size n = 3 from a Bernoulli distribution with
parameter p [i.e., Pr_p(X_i = 1) = p and Pr_p(X_i = 0) = 1 − p]. In this case, the sample space
for (X1 , X2 , X3 ) consists of the 8 values:
X = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)}.
Consider the two statistics T₁ = t₁(X₁, X₂, X₃) = X₁X₂ + X₃ and T₂ = t₂(X₁, X₂, X₃) =
X₁ + X₂ + X₃. The possible values T₁ = 0, 1, 2 and T₂ = 0, 1, 2, 3 partition X into the sets
X_{0,1} = {(0,0,0), (0,1,0), (1,0,0)}, X_{1,1} = {(0,0,1), (0,1,1), (1,0,1), (1,1,0)}, X_{2,1} = {(1,1,1)}
and
X_{0,2} = {(0,0,0)}, X_{1,2} = {(0,0,1), (0,1,0), (1,0,0)}, X_{2,2} = {(0,1,1), (1,0,1), (1,1,0)}, X_{3,2} = {(1,1,1)},
respectively. We now examine the distribution of the sample space values within each element of
these two partitions. First, suppose that we are told that T1 = 0, so that the possible values for
our original sample are the set X0,1 = {(0, 0, 0), (0, 1, 0), (1, 0, 0)}. We can then easily calculate
the chance that the actual dataset was all zeroes as:
Prp(X1 = 0, X2 = 0, X3 = 0 | T1 = 0) = Prp(X1 = 0, X2 = 0, X3 = 0, T1 = 0)/Prp(T1 = 0)
= Prp(X1 = 0, X2 = 0, X3 = 0)/Prp(X1 = 0, X2 = 0, X3 = 0 or X1 = 0, X2 = 1, X3 = 0 or X1 = 1, X2 = 0, X3 = 0)
= (1 − p)³/{(1 − p)³ + 2p(1 − p)²}
= (1 − p)/(1 + p).
From this calculation, we can see that the statistic T1 is not sufficient, since it does not induce
an appropriate partition. In particular, if we were to base any decision or estimate on the value
of T1 = 0, it would have to be the same regardless of whether the actual sample had been
the vector (0, 0, 0) or the vector (0, 1, 0). However, these two samples clearly contain different
information about the parameter p. By contrast, suppose that we are told that T2 = 1, so that
the possible values for our original sample are the set X1,2 = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}. We
can then easily calculate the chance that the actual dataset was (0, 1, 0) as:
Prp(X1 = 0, X2 = 1, X3 = 0 | T2 = 1) = Prp(X1 = 0, X2 = 1, X3 = 0, T2 = 1)/Prp(T2 = 1)
= Prp(X1 = 0, X2 = 1, X3 = 0)/Prp(X1 = 0, X2 = 0, X3 = 1 or X1 = 0, X2 = 1, X3 = 0 or X1 = 1, X2 = 0, X3 = 0)
= p(1 − p)²/{3p(1 − p)²}
= 1/3.
Indeed, similar calculations show that for any value T2 = t, the chance that the actual dataset
was one of the possible elements of Xt,2 does not depend on p. Thus, T2 is indeed a sufficient
statistic, since basing estimates on its value retains all of the relevant information in the sample
(X1 , X2 , X3 ) regarding the parameter p, the remaining distinctions being determined entirely
by underlying random chance.
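The conditional-probability calculations in this example are easy to check numerically. The following short Python sketch (purely illustrative; the function name and structure are our own) enumerates the sample space and confirms that the conditional distribution of the sample given the sufficient statistic T2 is free of p:

```python
# Numerical check of Example 2.5: the conditional distribution of
# (X1, X2, X3) given T2 = X1 + X2 + X3 does not depend on p.
from itertools import product
from fractions import Fraction

def conditional_dist(p, t):
    """Exact conditional distribution of the sample given T2 = t."""
    samples = [x for x in product([0, 1], repeat=3) if sum(x) == t]
    probs = [p ** sum(x) * (1 - p) ** (3 - sum(x)) for x in samples]
    total = sum(probs)
    return {x: pr / total for x, pr in zip(samples, probs)}

# The same conditional probabilities (each 1/3 when t = 1) arise for every p:
for p in [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]:
    print(p, conditional_dist(p, 1))
```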
Based on this example, we can now formally define a sufficient statistic:
Definition 2.8: Let X1 , . . . , Xn be a random sample from a distribution family with density
function fX (x; θ), where θ is a parameter (vector). A (vector-valued) statistic S = s(X1 , . . . , Xn )
is sufficient for θ if and only if the conditional distribution of X1 , . . . , Xn given S does not depend
on θ. If S is vector valued so that S = (S1 , . . . , Sk ) we generally refer to the individual scalar
components S1 , . . . , Sk as jointly sufficient statistics.
From this definition, we can easily see that the sample itself X = (X1 , . . . , Xn ) is a sufficient
statistic, as is the collection of order statistics Y = (Y1 , . . . , Yn ) = sort(X1 , . . . , Xn ) [i.e., Y1 is the
smallest of the Xi ’s, Y2 the second smallest and so on up to Yn , the largest of the Xi ’s] since the
conditional distribution of X given Y is simply the one which puts equal probability on each of the
n! permutations of the elements of Y . Moreover, if we recall that the central notion of a statistic
is that it sets up a partition of the sample space X , then it is clear that if S = s(X1 , . . . , Xn ) is a
sufficient statistic and h(·) is an invertible function then h(S) is also a sufficient statistic, since h(S)
will create the same sample space partition (due to the one-to-one nature of invertible functions)
as S [i.e., for any value s, we have
Xh(s) = {(x1, . . . , xn) ∈ X : h{s(x1, . . . , xn)} = h(s)}
= {(x1, . . . , xn) ∈ X : h^{−1}[h{s(x1, . . . , xn)}] = h^{−1}{h(s)}}
= {(x1, . . . , xn) ∈ X : s(x1, . . . , xn) = s}
= Xs,
since we have assumed that h(·) is invertible]. However, neither this last result nor the definition
itself is very useful for directly determining whether a statistic is sufficient (since finding the con-
ditional distribution of X given S is usually extremely difficult). Fortunately, there is an easier
method of finding sufficient statistics which we introduce in the next section.
2.3.1. Factorisation Criterion: We now present an extremely important theorem which can be
used to determine whether or not a statistic is sufficient:
Theorem 2.3: Let X1 , . . . , Xn be a random sample from a distribution family having density
function fX (x; θ) for some parameter vector θ. A statistic S = s(X1 , . . . , Xn ) is sufficient if and
only if the joint density function of the Xi ’s factors as:
fX1,...,Xn(x1, . . . , xn; θ) = ∏_{i=1}^n fX(xi; θ) = h1{s(x1, . . . , xn); θ} h2(x1, . . . , xn),
for some non-negative function h1 (·; θ) which depends on the xi ’s only through the value
s(x1 , . . . , xn ) and some non-negative function h2 (·) which does not depend on θ.
Proof: The proof is tedious and not very enlightening and is thus omitted from these notes.
We note that Theorem 2.3 provides a way to determine whether a certain statistic is sufficient,
however, just because we are unable to find an appropriate factorisation for some statistic does not
necessarily imply that no such factorisation exists. Thus, the theorem is rarely useful in determining
whether a statistic is not sufficient. Of course, to determine that a statistic T is not sufficient we
merely need to show that the distribution of the observations X1 , . . . , Xn given T = t depends on θ
for some value of t. In fact, the main usefulness of Theorem 2.3 is in discovering sufficient statistics,
as the following examples demonstrate:
Example 2.6: Let X1 , . . . , Xn be a random sample from the uniform distribution on the interval
[θ1, θ2], so that the density function is given by fX(x; θ1, θ2) = (θ2 − θ1)^{−1} I(θ1 ≤ x ≤ θ2) for θ1 < θ2.
The joint density of the Xi ’s can then be written as:
fX1,...,Xn(x1, . . . , xn; θ1, θ2) = ∏_{i=1}^n (θ2 − θ1)^{−1} I(θ1 ≤ xi ≤ θ2)
= (θ2 − θ1)^{−n} ∏_{i=1}^n I(θ1 ≤ xi ≤ θ2)
= (θ2 − θ1)^{−n} I{(θ1 ≤ x1 ≤ θ2) ∩ · · · ∩ (θ1 ≤ xn ≤ θ2)}
= (θ2 − θ1)^{−n} I[{θ1 ≤ min(x1, . . . , xn)} ∩ {max(x1, . . . , xn) ≤ θ2}]
= (θ2 − θ1)^{−n} I{θ1 ≤ min(x1, . . . , xn)} I{max(x1, . . . , xn) ≤ θ2}.
Thus, if we set h1 (y1 , yn ; θ1 , θ2 ) = (θ2 −θ1 )−n I(θ1 ≤y1 ) I(yn ≤θ2 ) and h2 (x1 , . . . , xn ) = 1, we see that
Y1 = min(X1 , . . . , Xn ) and Yn = max(X1 , . . . , Xn ) are jointly sufficient statistics. Alternatively,
if we assume that we know θ1 = 0, then the joint density of the sample can be written as:
fX1,...,Xn(x1, . . . , xn; θ2) = θ2^{−n} I{max(x1, . . . , xn) ≤ θ2} I{0 ≤ min(x1, . . . , xn)},
and we can then define h1(yn; θ2) = θ2^{−n} I(yn ≤ θ2) and h2(x1, . . . , xn) = I{0 ≤ min(x1, . . . , xn)} to see
that Yn = max(X1 , . . . , Xn ) is now a sufficient statistic.
Example 2.7: Let X1 , . . . , Xn be a random sample from a normal distribution family with
density function
φµ,σ²(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},
for parameters µ and σ 2 > 0. The joint density of the Xi ’s can then be written as:
fX1,...,Xn(x1, . . . , xn; µ, σ²) = ∏_{i=1}^n φµ,σ²(xi)
= (2πσ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²}
= (2πσ²)^{−n/2} exp[−(1/(2σ²)){Σ_{i=1}^n xi² − 2µ Σ_{i=1}^n xi + nµ²}].
Thus, we see that the joint density itself can be written as a function of the two quantities
S1 = Σ_{i=1}^n Xi and S2 = Σ_{i=1}^n Xi², which means that we can define h1(s1, s2; µ, σ²) to be the
joint density itself and h2(x1, . . . , xn) = 1, and thus S1 and S2 are jointly sufficient. Moreover,
it is relatively easy to see that the vector-valued function h(S1, S2) = {n^{−1}S1, (n − 1)^{−1}(S2 −
n^{−1}S1²)} = (X̄, s²) is invertible (since it is one-to-one), and therefore the average, X̄, and the
usual sample variance, s², are also jointly sufficient.
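The practical content of this example is that the normal likelihood "sees" the data only through S1 and S2. As a quick illustration (the datasets and function names below are our own, chosen for the sketch), two samples sharing the same values of S1 = Σxi and S2 = Σxi² have identical likelihoods at every (µ, σ²):

```python
# Sketch for Example 2.7: two datasets with equal S1 = sum(x) and
# S2 = sum(x**2) yield identical normal log-likelihoods for all (mu, sigma2).
import math

def normal_loglik(xs, mu, sigma2):
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

a = [1.0, 2.0, 3.0, 4.0]                      # S1 = 10, S2 = 30
d, e = math.sqrt(2.0), math.sqrt(0.5)
b = [2.5 - d, 2.5 - e, 2.5 + e, 2.5 + d]      # also S1 = 10, S2 = 30
for mu, s2 in [(0.0, 1.0), (2.5, 0.5), (-1.0, 4.0)]:
    assert abs(normal_loglik(a, mu, s2) - normal_loglik(b, mu, s2)) < 1e-9
print("identical likelihoods at every parameter value tested")
```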
The result of Theorem 2.3 is intuitively evident when we consider that if the joint density factors
as indicated then the log-likelihood function is essentially equal to ln{h1 (s1 , . . . , sk ; θ)} [where we
have written s = s(x1 , . . . , xn ) = (s1 , . . . , sk ) when s(·, . . . , ·) is a vector-valued function with k
components and we have used the standard reduction of eliminating additive terms from the log-
likelihood which do not depend on the parameter θ]. In other words, all the information about
θ contained in the likelihood is contained in the vector-valued statistic S, which is precisely the
notion behind sufficiency. Indeed, this argument forms the basis of the following important result:
Theorem 2.4: Let X1 , . . . , Xn be a random sample from a distribution family with density
function fX(x; θ). Also, let S = s(X1, . . . , Xn) be a sufficient statistic for θ. Then, the MLE of
θ depends on the sample observations only through the sufficient statistic. In other words, the
MLE is a function of the sufficient statistic S.
Proof: Since S is sufficient, we know that the likelihood function (which is the same as the
joint density function) can be written in the form:
L(θ; x1, . . . , xn) = ∏_{i=1}^n fX(xi; θ) = h1{s(x1, . . . , xn); θ} h2(x1, . . . , xn).
Clearly, L(θ; x1 , . . . , xn ) is maximised in θ at the same place that h1 {s(x1 , . . . , xn ); θ} is, since
the factor h2 (x1 , . . . , xn ) does not depend on θ. Moreover, the value of θ which maximises
h1{s(x1, . . . , xn); θ} = h1(s; θ) can clearly only depend on s. Formally, we have θ̂MLE =
argmax_{θ∈Θ} h1(s; θ), and thus θ̂MLE must be a function of s only.
As an example of Theorem 2.4, we note that the MLEs of µ and σ² for the normal family are
µ̂ = X̄ = n^{−1} Σ_{i=1}^n Xi and σ̂² = n^{−1} Σ_{i=1}^n (Xi − X̄)² = n^{−1} Σ_{i=1}^n Xi² − (n^{−1} Σ_{i=1}^n Xi)², which
are clearly functions of the sufficient statistics found in Example 2.7; namely, S1 = Σ_{i=1}^n Xi
and S2 = Σ_{i=1}^n Xi². We note, however, that it is possible for method of moments or method of
percentiles estimators not to be functions of sufficient statistics.
Example 2.6 (cont’d): If X1 , . . . , Xn are uniformly distributed on the interval [0, θ], then we
saw that Yn = max(X1, . . . , Xn) was a sufficient statistic. Moreover, we can write the log-
likelihood for θ based on the sample as:
l(θ; x1, . . . , xn) = −n ln(θ) + ln[I{max(x1, . . . , xn) ≤ θ}],
where the term ln[I{0 ≤ min(x1, . . . , xn)}] has been left out since it does not depend on θ. Now,
−n ln(θ) is a decreasing function of θ, so to maximise the log-likelihood we must choose θ as
small as possible; however, since ln(0) = −∞, the only possible range for θ on which the log-
likelihood is not negatively infinite is θ ≥ max(x1, . . . , xn). These two facts together show that
the MLE of θ is given by Yn = max(X1, . . . , Xn), which is clearly a function of a sufficient
statistic. On the other hand, the expected value of any Xi is θ/2. Therefore, the method of
moments estimator of θ is easily calculated as θ̂MOM = 2X̄. The method of moments estimator
is clearly not a function of Yn , and indeed it can be shown that it is not a function of any
sufficient statistic (though the demonstration is somewhat technical and so we will omit it).
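Although θ̂MOM = 2X̄ is not a function of a sufficient statistic while the MLE is, it is instructive to compare the two by simulation. The following sketch (the values θ = 1 and n = 20 are our own arbitrary choices) estimates the mean squared error of each:

```python
# Simulation sketch: MLE max(X) versus method of moments 2*mean(X)
# for a Uniform[0, theta] sample.
import random

random.seed(1)
theta, n, reps = 1.0, 20, 10000
mle, mom = [], []
for _ in range(reps):
    xs = [random.uniform(0, theta) for _ in range(n)]
    mle.append(max(xs))
    mom.append(2 * sum(xs) / n)

def mse(est):
    return sum((e - theta) ** 2 for e in est) / reps

print("MSE of MLE:", mse(mle))   # about 2*theta^2/((n+1)(n+2)) ~ 0.0043
print("MSE of MOM:", mse(mom))   # about theta^2/(3n) ~ 0.0167
```

The MLE is typically far more accurate here, reflecting the information lost when an estimator ignores the sufficient statistic Yn.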
We close this section by discussing our original objective in introducing sufficient statistics, which
was data reduction. Recall that the idea behind sufficient statistics is that they contain all the
relevant information regarding the parameter θ while removing (some) extraneous information. In
particular, if we have a sample of size n, X1 , . . . , Xn from a distribution family with densities
fX (x; θ) and a sufficient statistic S = (S1 , . . . , Sk ), then we can effectively reduce the number of
relevant pieces of information regarding θ from n down to k. Recall, also, that we could conceive
of this reduction in terms of a partitioning of the sample space X into the subsets Xs for each
s in the range of S. Effectively, then, we have reduced the number of possible outcomes which
need to be considered from the size of X (the individual elements of which can be considered as a
partition induced by the sample itself X1 , . . . , Xn ) down to the number of elements in the range of
S. However, we have seen that there is not simply a unique sufficient statistic, and the question
then arises as to whether a particular sufficient statistic has effected the greatest possible reduction
in the data. If a particular sufficient statistic does indeed effect the maximal reduction, we shall
refer to it as a minimal sufficient statistic (the adjective “minimal” here referring to the fact that
such statistics will have the smallest number of components, k, possible). Equivalently, we can
view minimal sufficient statistics as those for which the induced partition of the sample space has
the fewest members (i.e., subsets Xs ). Generically, then, a sufficient statistic is termed minimal if
no other sufficient statistic condenses the data to a greater extent. Formally, we have the following
definition:
Definition 2.9: A sufficient statistic S is termed minimal sufficient if and only if for any other
sufficient statistic S′ there exists a function h(·) such that S = h(S′).
Unfortunately, this definition is rarely useful in identifying minimal sufficient statistics. Indeed, in
general it is quite difficult to determine minimal sufficient statistics. There is, however, a particular
class of distribution families for which minimal sufficient statistics can be determined, and we focus
on these families in the next section.
2.3.2. Exponential Families: We now introduce a class of distribution families which have very
convenient mathematical properties and which include most of the standard probability models
which are commonly dealt with in statistical applications. The class of distributions are known as
exponential families and are defined as follows:
Definition 2.10: A distribution family which has density functions of the form:
fX(x; θ) = exp{ Σ_{i=1}^k ci(θ) di(x) − b(θ) − a(x) },
for a k-dimensional parameter θ = (θ1 , . . . , θk ) and suitable choices of the functions a(·), b(·)
ci (·) and di (·) (for i = 1, . . . , k) is termed a k-parameter exponential family.
Note that it is important that the number of ci (·) and di (·) functions is the same as the dimension
of the parameter vector. We recall, also, that in the case of discrete distribution families we should
interpret the density function fX (x; θ) as a probability mass function (pmf). Before presenting a few
examples of exponential families, we note that if we define the reparameterisation η = (η1 , . . . , ηk ) =
c(θ) = {c1 (θ), . . . , ck (θ)} then η is referred to as the canonical parameter for the exponential family
and the density function can be written in the form:
fX(x; η) = exp{ Σ_{i=1}^k ηi di(x) − B(η) − a(x) }.
Moreover, in this parameterisation we have B(η) = b{c−1 (η)} [where c−1 (η) is the inverse function
of the reparameterisation function η = c(θ), which must exist for the reparameterisation to be valid
and which can be guaranteed to exist in the case of exponential families], and based on this function
we can define KD (t) = B(η + t) − B(η), which is the so-called joint cumulant generating function
of the random variable D = {d1 (X), . . . , dk (X)}, so-called because its derivatives evaluated at
t = 0 yield the cumulants of D, the first cumulant being the mean, the second cumulant being the
variance and the third cumulant being the skewness. In other words, some simple vector calculus
shows:
E(D) = (∂/∂t) KD(t)|_{t=0} ⟹ E(Di) = ∂B(η)/∂ηi ;
Var(D) = (∂²/∂t ∂tᵀ) KD(t)|_{t=0} ⟹ Cov(Di, Dj) = ∂²B(η)/∂ηi∂ηj ;
Skew(Di) = (∂³/∂ti³) KD(t)|_{t=0} ⟹ Skew(Di) = ∂³B(η)/∂ηi³.
Finally, we note that it is reasonably straightforward to show that KD (t) = ln{mD (t)} where mD (t)
is the joint moment generating function of the random vector D.
Example 2.8: If X has a Poisson distribution with rate parameter λ, then we can see that the
pmf can be written as:
fX(x; λ) = λ^x e^{−λ}/x! = exp{x ln(λ) − λ − ln(x!)}, x = 0, 1, 2, . . . .
Thus, the Poisson family is a one-dimensional exponential family with functions a(x) = ln(x!),
b(λ) = λ, c1(λ) = ln(λ) and d1(x) = x. Moreover, we see that the canonical parameter is
η = ln(λ), leading to the inverse relationship λ = e^η and
B(η) = b(e^η) = e^η ⟹ KD(t) = e^{η+t} − e^η = λ(e^t − 1) ⟹ mD(t) = exp{λ(e^t − 1)},
which yields the form of the mgf for a Poisson random variable with which we are familiar, since
D = X in this case.
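Since B(η) = e^η here, the cumulant-generating machinery above can be checked with simple finite differences; the sketch below (the step size h is an arbitrary choice of ours) recovers the familiar fact that the Poisson mean and variance both equal λ:

```python
# Finite-difference check that derivatives of B(eta) = exp(eta) yield the
# Poisson cumulants: mean = variance = lambda.
import math

lam = 3.0
eta = math.log(lam)           # canonical parameter
B = math.exp                  # B(eta) for the Poisson family
h = 1e-5
mean = (B(eta + h) - B(eta - h)) / (2 * h)            # first derivative
var = (B(eta + h) - 2 * B(eta) + B(eta - h)) / h**2   # second derivative
print(mean, var)              # both approximately lam = 3.0
```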
Example 2.9: If X has a Normal distribution with mean µ and variance σ 2 , then we can see
that the pdf can be written as:
φµ,σ²(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)} = exp{(µ/σ²)x − (1/(2σ²))x² − µ²/(2σ²) − ½ ln(σ²) − ½ ln(2π)}.
Thus, the Normal family is a two-dimensional exponential family with functions a(x) = ½ ln(2π),
b(µ, σ²) = µ²/(2σ²) + ½ ln(σ²), c1(µ, σ²) = µ/σ², c2(µ, σ²) = −1/(2σ²), d1(x) = x and d2(x) = x². Moreover,
we see that the canonical parameters are η1 = µ/σ² and η2 = −1/(2σ²), leading to the inverse
relationship µ = −η1(2η2)^{−1}, σ² = −(2η2)^{−1} and
B(η) = b{−η1/(2η2), −(2η2)^{−1}} = −η1²/(4η2) − ½ ln(−2η2),
so that:
E(D1) = ∂B(η)/∂η1 = −η1/(2η2) = µ = E(X);
E(D2) = ∂B(η)/∂η2 = η1²/(4η2²) − 1/(2η2) = µ² + σ² = E(X²);
Var(D1) = ∂²B(η)/∂η1² = −1/(2η2) = σ² = Var(X);
Skew(D1) = ∂³B(η)/∂η1³ = 0 = Skew(X).
Alternatively, if we assume that σ² is a known constant rather than a parameter, the density
then has the form of a one-parameter exponential family with functions a(x) = ½ ln(2π) +
½ ln(σ²) + x²/(2σ²), b(µ) = µ²/(2σ²), c1(µ) = µ/σ² and d1(x) = x. Therefore, we see that the canonical
parameter is η = µ/σ², leading to the inverse relationship µ = ησ² and
B(η) = b(ησ²) = η²σ²/2 ⟹ KD(t) = (η + t)²σ²/2 − η²σ²/2 = (σ²/2)(2ηt + t²) = µt + ½σ²t²
⟹ mD(t) = exp{µt + ½σ²t²},
which is the familiar moment generating function for the normal distribution since D = X in
this case.
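The derivative calculations in this example are mechanical and can be verified symbolically; the following sketch uses the sympy library (a convenience choice on our part) to confirm each identity:

```python
# Symbolic verification of the B(eta) derivatives in Example 2.9.
import sympy as sp

eta1, eta2 = sp.symbols("eta1 eta2")
B = -eta1**2 / (4 * eta2) - sp.Rational(1, 2) * sp.log(-2 * eta2)
mu = -eta1 / (2 * eta2)
sigma2 = -1 / (2 * eta2)

assert sp.simplify(sp.diff(B, eta1) - mu) == 0                # E(D1) = mu
assert sp.simplify(sp.diff(B, eta2) - (mu**2 + sigma2)) == 0  # E(D2) = mu^2 + sigma^2
assert sp.simplify(sp.diff(B, eta1, 2) - sigma2) == 0         # Var(D1) = sigma^2
assert sp.diff(B, eta1, 3) == 0                               # Skew(D1) = 0
print("all identities verified")
```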
The main reason that we focus on exponential families is that the form of the densities makes
application of Theorem 2.3 straightforward. In particular, for an iid sample X1 , . . . , Xn from an
exponential family it is straightforward to see that Σ_{i=1}^n Di = (Σ_{i=1}^n d1(Xi), . . . , Σ_{i=1}^n dk(Xi)) is
a sufficient statistic. Moreover, it can be shown (though we will not provide a proof since it is rather
technical) that this is a minimal sufficient statistic. In fact, it turns out that Σ_{i=1}^n Di is not only
minimal sufficient, but is also complete, a concept which we will discuss briefly in the next section.
Finally, before proceeding to the next section, we note that while most of the common distributions
which arise in statistical applications are of exponential class, not all are. In particular, one simple
example of a family which is not of exponential form is the family of uniform distributions on the
interval [θ1 , θ2 ].
2.4. Minimum-Variance Unbiased Estimation
Definition 2.11: If X1, . . . , Xn are a random sample from a distribution having density function
fX(x; θ) for some parameter value θ ∈ Θ and T = t(X1, . . . , Xn) is an unbiased estimator of
τ(θ), so that Eθ(T) = τ(θ), then T is called a uniformly minimum-variance unbiased (UMVU)
estimator if and only if Varθ(T) ≤ Varθ(T′) for all values of θ ∈ Θ and any other unbiased
estimator T′ = t′(X1, . . . , Xn) [i.e., for any other estimator satisfying Eθ(T′) = τ(θ)].
In the following sections, we will investigate when UMVU estimators exist, what their variance is
and how to find them.
2.4.1. Variance Bound for Unbiased Estimators: Before finding UMVU estimators, it is helpful
to investigate the general properties of the variance of unbiased estimators. In particular, we will
be able to determine a lower bound below which the variance of an unbiased estimator cannot fall.
Thus, if we find an estimator which achieves this bound uniformly for all values of the parameter
θ, we can conclude that we have a UMVU estimator. Before we state and prove the lower bound,
we need to make some assumptions (generally referred to as regularity conditions) to ensure that
we exclude strange cases for which the lower bound does not hold (rest assured, however, that
the following assumptions are true for almost all distributions and situations of practical interest).
Let X1 , . . . , Xn be a random sample from a distribution having density function fX (x; θ) with θ
assumed to be scalar, let T = t(X1 , . . . , Xn ) be an unbiased estimator of τ (θ) and assume:
i. (∂/∂θ) ln{fX(x; θ)} exists for all x and θ;
ii. interchange of integration and differentiation is permissible insofar as
(∂/∂θ) ∫_{−∞}^∞ · · · ∫_{−∞}^∞ ∏_{i=1}^n fX(xi; θ) dx1 · · · dxn = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ (∂/∂θ) ∏_{i=1}^n fX(xi; θ) dx1 · · · dxn
and
(∂/∂θ) ∫_{−∞}^∞ · · · ∫_{−∞}^∞ t(x1, . . . , xn) ∏_{i=1}^n fX(xi; θ) dx1 · · · dxn
= ∫_{−∞}^∞ · · · ∫_{−∞}^∞ t(x1, . . . , xn) (∂/∂θ) ∏_{i=1}^n fX(xi; θ) dx1 · · · dxn ;
iii. the expectation i(θ) = Eθ[{(∂/∂θ) ln fX(X; θ)}²], where X is a generic random variable
having distribution with density fX(x; θ), is finite for all θ ∈ Θ.
Under these assumptions, we can formally state the Information Inequality which is also known as
the Cramér-Rao Inequality:
Theorem 2.5: Let X1 , . . . , Xn be a random sample from a distribution family with density
function fX (x; θ) where θ is a scalar parameter. Also, let T = t(X1 , . . . , Xn ) be an unbiased
estimator for τ (θ). Then, assuming conditions (i), (ii) and (iii) above hold,
Varθ(T) ≥ {τ′(θ)}²/{n i(θ)},
where τ′(θ) = (d/dθ) τ(θ). Further, equality occurs if and only if there exists a function K(θ, n),
not depending on the xi's, such that:
Σ_{i=1}^n (∂/∂θ) ln{fX(xi; θ)} = K(θ, n){t(x1, . . . , xn) − τ(θ)}.
Proof: The proof relies on the Cauchy-Schwarz Inequality which, in one of its simpler forms,
states that:
{E(XY)}² ≤ E(X²)E(Y²),
with equality only if X = cY for some constant c (i.e., a quantity not involving X or Y). A
demonstration of the Cauchy-Schwarz inequality is left as an exercise, while a fully rigorous
proof of the current inequality is omitted since it is not overly enlightening. However, a basic
argument demonstrating the validity of the result proceeds as follows. Clearly, the assumption
that T is unbiased for τ(θ) implies that:
0 = Eθ{T − τ(θ)} = ∫ · · · ∫ {t(x1, . . . , xn) − τ(θ)} ∏_{i=1}^n fX(xi; θ) dx1 . . . dxn.
Thus, the expected Fisher information is I(θ) = n i(θ) = nθ^{−2}. Alternatively, if we had
used the characterisation −Eθ{l″(θ)} for the expected Fisher information, we see that l(θ) =
−n ln(θ) − θ^{−1} Σ_{i=1}^n Xi, so that l″(θ) = nθ^{−2} − 2θ^{−3} Σ_{i=1}^n Xi, which leads to the same result for
I(θ). [NOTE: Calculating the expected Fisher information from the characterisation Eθ[{l′(θ)}²]
would have been made somewhat complicated due to the squaring operation performed on the
summation of the Xi's.] Thus, the lower bound for the variance of any unbiased estimator
T = t(X1, . . . , Xn) of θ is given by:
Varθ(T) ≥ 1/(nθ^{−2}) = θ²/n.
Finally, we note that the sample average, X̄ = n^{−1} Σ_{i=1}^n Xi, is clearly unbiased and
Varθ(X̄) = Varθ(X)/n = θ²/n.
Thus, since the variance of X̄ achieves the Cramér-Rao lower bound, X̄ must be a UMVU
estimator. Indeed, we can see that
Σ_{i=1}^n (∂/∂θ) ln{fX(xi; θ)} = (1/θ²) Σ_{i=1}^n (xi − θ) = (n/θ²)(x̄ − θ),
and thus, setting K(θ, n) = nθ−2 , we see that the sample average satisfies the conditions for
equality in Theorem 2.5.
We note that Theorem 2.5 is also true for discrete distributions, as long as the conditions required
for the density function in the continuous case are satisfied by the pmf in the discrete case (with
integrals replaced by summations, of course).
Example 2.8 (cont’d): If X has a Poisson distribution with rate parameter θ, so that pX(x; θ) =
θ^x e^{−θ}/x! for x = 0, 1, 2, . . ., and τ(θ) = θ, then we have τ′(θ) = 1 and (d/dθ) ln{pX(x; θ)} =
(d/dθ){x ln(θ) − θ − ln(x!)} = xθ^{−1} − 1, so that
i(θ) = Eθ[{(d/dθ) ln pX(X; θ)}²] = (1/θ²) E{(X − θ)²} = (1/θ²) Varθ(X) = 1/θ.
Thus, the expected Fisher information is I(θ) = n i(θ) = nθ^{−1} and the lower bound for the
variance of any unbiased estimator T = t(X1, . . . , Xn) of θ is given by:
Varθ(T) ≥ 1/(nθ^{−1}) = θ/n.
As in Example 2.10, the lower bound is achieved by the estimator X̄, since Varθ(X̄) =
n^{−1} Varθ(X) = n^{−1}θ. Thus, X̄ is a UMVU estimator of θ. Alternatively, suppose that
τ(θ) = e^{−θ} = Prθ(X = 0). In this case, we have τ′(θ) = −e^{−θ}, and we see that the lower
bound for the variance of unbiased estimators of τ(θ) is given by n^{−1}θ(−e^{−θ})² = n^{−1}θe^{−2θ}. It
is easy to verify that the estimator T = n^{−1} Σ_{i=1}^n I(Xi=0) is unbiased for e^{−θ}. Moreover, it is
easy to see that nT has a binomial distribution with parameters n and p = e^{−θ}. Therefore,
Varθ(T) = n^{−1}p(1 − p) = n^{−1}e^{−θ}(1 − e^{−θ}). It is not difficult to show that e^θ ≥ 1 + θ for any
θ, and this fact then easily implies that
Varθ(T) = n^{−1}e^{−θ}(1 − e^{−θ}) ≥ n^{−1}θe^{−2θ},
as should be the case according to Theorem 2.5. In fact, it can be shown that equality only
occurs when θ = 0. Thus the variance of T does not achieve the Cramér-Rao lower bound. Of
course, it could still be a UMVU estimator of e^{−θ} if no estimator achieves the Cramér-Rao lower
bound. However, it turns out that T is not a UMVU estimator, and we shall find a UMVU estimator for
this quantity in the next section.
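A small Monte Carlo experiment makes the two comparisons above concrete. In the sketch below (sample size, θ and seed are arbitrary choices of ours), the variance of X̄ essentially sits on the bound θ/n, while the variance of T stays strictly above its bound θe^{−2θ}/n:

```python
# Monte Carlo sketch of the Poisson Cramer-Rao comparisons.
import random, math

def pois(lam):
    # Knuth's multiplication method for Poisson variates (fine for small lam).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

random.seed(2)
theta, n, reps = 2.0, 25, 20000
xbar, T = [], []
for _ in range(reps):
    xs = [pois(theta) for _ in range(n)]
    xbar.append(sum(xs) / n)
    T.append(sum(1 for x in xs if x == 0) / n)

print("Var(xbar):", var(xbar), "bound:", theta / n)                   # ~ equal
print("Var(T):", var(T), "bound:", theta * math.exp(-2 * theta) / n)  # above
```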
We close this section with several remarks regarding Cramér-Rao bounds. These results are some-
what more advanced and technical, and detailed discussions are beyond the scope of these notes.
i. If θ is a vector parameter of dimension k, then there is an analogue to the Cramér-Rao variance
lower bound which states that if T is an unbiased estimator of τ(θ) then
Varθ(T) ≥ ∇τ(θ)ᵀ I^{−1}(θ) ∇τ(θ),
where ∇τ(θ) = (∂τ(θ)/∂θ1, . . . , ∂τ(θ)/∂θk)ᵀ is the gradient vector (written as a column) of τ(θ) and
I^{−1}(θ) is the matrix inverse of the expected Fisher information matrix I(θ) defined to have
(i, j)th component
Iij(θ) = Eθ{(∂l(θ)/∂θi)(∂l(θ)/∂θj)} = −Eθ{∂²l(θ)/∂θi∂θj}.
In other words, I(θ) is the variance-covariance matrix of the score function (i.e., the gradient
of the log-likelihood).
ii. In general, the Cramér-Rao lower bound is not sharp. In other words, in many cases there is no
estimator with a variance equal to the lower bound value. This does not, however, necessarily
mean that there is no U M V U estimator in such cases. We shall see an example of this in the
next section.
iii. If the MLE of θ, θ̂MLE, is a solution to the score equation, l′(θ) = 0, (as opposed to being
a boundary value of the parameter space Θ) and T = t(X1, . . . , Xn) is an unbiased estimator
of τ(θ) the variance of which achieves the Cramér-Rao lower bound, then it must be the case
that T = τ(θ̂MLE). In other words, if there is an unbiased estimator of τ(θ) the variance of
which achieves the Cramér-Rao lower bound, it must be the MLE of τ. Again, we note that
there may be UMVU estimators the variances of which do not achieve the Cramér-Rao lower
bound and in these cases, the estimators need not be the MLEs.
iv. Finally, as a follow-up to the previous remark, we note that it can be shown that estima-
tors whose variance achieves the Cramér-Rao lower bound exist only in the case where the
probability model is an exponential family (which adds another piece of evidence as to why
these families are so special and important). In fact, it can be further shown that even within
exponential families, only a very limited collection of functions of the parameters, τ (θ), have
unbiased estimators for which the variance achieves the Cramér-Rao lower bound. At first,
this may seem to indicate that seeking UMVU estimators, even in exponential families, is
essentially fruitless. Recall, however, that UMVU estimators need not have variances which
achieve the Cramér-Rao lower bound [see remark (ii) above]. As such, the remark here merely
indicates that the Cramér-Rao inequality is not the most fruitful method of finding UMVU es-
timators. Indeed, the next section presents an alternative, and more useful, method of finding
UMVU estimators.
2.4.2. The Rao-Blackwell Theorem and Completeness: In the previous section we saw that
unbiased estimators could not have variances which fell below a specific bound. As such, if we
could find an unbiased estimator the variance of which achieved this bound, then clearly such an
estimator would be a uniformly minimum-variance unbiased (UMVU) estimator. Unfortunately,
it is rarely possible to find an unbiased estimator with a variance equal to the Cramér-Rao lower
bound. So, we now present some results which provide an alternative approach to finding UMVU
estimators.
It should seem reasonable that an estimator based on a sufficient statistic would be less variable
than one which is not so based, since the idea of sufficiency was the removal of irrelevant information
(which by its nature would tend to increase variability). Indeed, suppose that T = t(X1 , . . . , Xn ) is
an unbiased estimator of the parameter τ = τ (θ) and suppose that S = s(X1 , . . . , Xn ) is a (possibly
vector-valued) sufficient statistic. The following theorem, known as the Rao-Blackwell Theorem,
shows that we can construct an unbiased estimator from T and S which has smaller variance than
T . Specifically, we have:
Theorem 2.6: Let X1 , . . . , Xn be a random sample from a distribution family with density
function fX (x; θ) for some parameter θ ∈ Θ, and let S = s(X1 , . . . , Xn ) be a sufficient statistic
[NOTE: S may be vector-valued, in which case we write S = (S1 , . . . , Sk )]. Further, let T =
t(X1 , . . . , Xn ) be an unbiased estimator of τ = τ (θ). If we define the new quantity T1 = Eθ (T |S)
then:
i. T1 is a statistic (i.e., it does not depend on θ) and is a function of the sufficient statistic,
T1 = t1 (S) = t1 (S1 , . . . , Sk );
ii. T1 is an unbiased estimator of τ (θ); and,
iii. Varθ(T1) ≤ Varθ(T) for all θ ∈ Θ, and Varθ(T1) < Varθ(T) for some θ ∈ Θ unless T1 = T.
Proof: (i.) Since S is a sufficient statistic, we know that the distribution of (X1 , . . . , Xn ) given
S cannot depend on θ from Definition 2.8. Clearly, then, the distribution of any function of
(X1 , . . . , Xn ) given S cannot depend on θ either. Thus, T1 does not depend on θ; in other words,
T1 is a statistic, since it is a function of only the data. Also, from the definition of conditional
expectations, it is clear that T1 depends on the Xi ’s only through the value of S; in other words,
T1 is a function of S.
(ii.) Using the law of the iterated expectation, we know that E{E(Y|Z)} = E(Y) for any
random variables Y and Z. In particular, then, we have:
Eθ(T1) = Eθ{Eθ(T|S)} = Eθ(T) = τ(θ),
so that T1 is unbiased for τ(θ).
(iii.) Now, since T1 is simply a function of the sufficient statistic S, we can further see that:
Eθ[(T − T1){T1 − τ(θ)}] = Eθ(Eθ[(T − T1){T1 − τ(θ)}|S]) = Eθ({T1 − τ(θ)} Eθ(T − T1|S)) = 0,
since Eθ(T − T1|S) = Eθ(T|S) − T1 = 0. Therefore, we see that Varθ(T) = Eθ{(T − T1)²} + Varθ(T1) ≥ Varθ(T1), and the inequality
is strict unless T = T1. [NOTE: An alternate derivation of this result is based on the extension
of the law of the iterated expectation to the case of variances:
Varθ(T) = Eθ{Varθ(T|S)} + Varθ{Eθ(T|S)} = Eθ{Varθ(T|S)} + Varθ(T1) ≥ Varθ(T1),
where we have used the obvious fact that Eθ{Varθ(T|S)} ≥ 0 since it is the expected value of
a conditional variance which clearly cannot be negative.]
So, Theorem 2.6 provides a way of finding an unbiased estimator with “low” variance (i.e., at
least as low as the variance of any other given unbiased estimator). Whether or not the resultant
estimator is a UMVU estimator will be taken up shortly. Before discussing this important issue,
we present an example:
Example 2.8 (cont’d): If X has a Poisson distribution with rate parameter θ, we saw that
T = n^{−1} Σ_{i=1}^n I(Xi=0) is an unbiased estimator for e^{−θ} and we determined its variance as
Varθ(T) = n^{−1}e^{−θ}(1 − e^{−θ}). Furthermore, since the Poisson family of distributions was seen
to be an exponential family with D = d1(X) = X, we know that S = Σ_{i=1}^n Di = Σ_{i=1}^n Xi is a
sufficient statistic (and indeed, a minimal sufficient statistic). So, according to Theorem 2.6, if
we define T1 = Eθ(T|S) we should get an unbiased estimator for e^{−θ} which has lower variance
than T. First, to determine the explicit form of the estimator, we note that:
Eθ(T|S = s) = Eθ(n^{−1} Σ_{i=1}^n I(Xi=0) | Σ_{i=1}^n Xi = s) = n^{−1} Σ_{i=1}^n Eθ(I(Xi=0) | Σ_{i=1}^n Xi = s)
= Eθ(I(X1=0) | Σ_{i=1}^n Xi = s) = Prθ(X1 = 0 | Σ_{i=1}^n Xi = s)
= Prθ(X1 = 0, Σ_{i=1}^n Xi = s)/Prθ(Σ_{i=1}^n Xi = s) = Prθ(X1 = 0, Σ_{i=2}^n Xi = s)/Prθ(Σ_{i=1}^n Xi = s)
= Prθ(X1 = 0) Prθ(Σ_{i=2}^n Xi = s)/Prθ(Σ_{i=1}^n Xi = s) = [e^{−θ} {(n − 1)θ}^s e^{−(n−1)θ}/s!]/[(nθ)^s e^{−nθ}/s!]
= ((n − 1)/n)^s.
Thus, T1 = ((n − 1)/n)^S is the new estimator. To verify directly that T1 is unbiased and has lower
variance than T, we note that S has a Poisson distribution with rate parameter nθ, so that:
Eθ(T1) = Σ_{s=0}^∞ ((n − 1)/n)^s (nθ)^s e^{−nθ}/s! = e^{−nθ} Σ_{s=0}^∞ {(n − 1)θ}^s/s! = e^{−nθ} e^{(n−1)θ} = e^{−θ},
which yields Varθ(T1) = e^{θ(n^{−1}−2)} − (e^{−θ})² = e^{−2θ}(e^{θ/n} − 1). To see that this variance is smaller
than Varθ(T) = n^{−1}e^{−θ}(1 − e^{−θ}) = n^{−1}e^{−2θ}(e^θ − 1), we note that:
(1/n)(e^θ − 1) = (1/n) Σ_{m=1}^∞ θ^m/m! = Σ_{m=1}^∞ θ^m/{n(m!)} > Σ_{m=1}^∞ (θ/n)^m/m! = e^{θ/n} − 1.
Alternatively, we know that the Cramér-Rao lower bound on the variance of unbiased estimators
in this case is given by n^{−1}θe^{−2θ}, and since we know that e^y − 1 > y for any y ≠ 0, we have:
e^{−2θ}(e^{θ/n} − 1) > e^{−2θ} θ/n,
so that the variance of T1 does not achieve the Cramér-Rao lower bound. Nonetheless, we shall
see that T1 turns out to be a UMVU estimator.
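The variance reduction achieved by conditioning on S can also be seen directly by simulation; the sketch below (the parameter values are our own choices) estimates both variances:

```python
# Simulation sketch comparing T = (1/n) * #{Xi = 0} with its
# Rao-Blackwellised version T1 = ((n - 1)/n)**S, where S = sum(Xi).
import random, math

def pois(lam):
    # Knuth's multiplication method for Poisson variates.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

random.seed(3)
theta, n, reps = 1.5, 10, 20000
T_vals, T1_vals = [], []
for _ in range(reps):
    xs = [pois(theta) for _ in range(n)]
    T_vals.append(sum(1 for x in xs if x == 0) / n)
    T1_vals.append(((n - 1) / n) ** sum(xs))

print("target e^-theta:", math.exp(-theta))               # ~ 0.223
print("means:", sum(T_vals) / reps, sum(T1_vals) / reps)  # both ~ 0.223
print("variances:", var(T_vals), var(T1_vals))            # Var(T1) smaller
```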
Recall that sufficient statistics are not unique; that is, there may be two different (possibly vector-
valued) statistics S1 and S2 both of which are sufficient. In this case, we can define multiple new
estimators from an original unbiased estimator T as
i. T1 = Eθ (T |S1 );
ii. T2 = Eθ (T |S2 );
iii. T3 = Eθ (T1 |S2 ); and
iv. T4 = Eθ (T2 |S1 ).
[NOTE: Since T1 is a function of S1 , we see that Eθ (T1 |S1 ) = T1 , so re-conditioning on the
same sufficient statistic does not aid in arriving at unbiased estimators with reduced variance].
Now, Theorem 2.6 indicates that Varθ(T) ≥ Varθ(T1) ≥ Varθ(T3) and Varθ(T) ≥ Varθ(T2) ≥
Varθ(T4). However, Theorem 2.6 does not give us any indication as to whether T3 or T4 will have
the smaller variance; indeed, there may be no clear cut winner, as Varθ(T3) may be less than
Varθ(T4) for some values of θ while the reverse is true for other values of θ. This problem is
generally alleviated by choosing to condition on a minimal sufficient statistic, since if S1 is minimal
sufficient we know that for any other sufficient statistic S2 there exists a function h(·) such that
S1 = h(S2 ), in which case
T3 = Eθ(T1|S2) = Eθ{Eθ(T|S1)|S2} = Eθ[Eθ{T|h(S2)}|S2] = Eθ{T|h(S2)} = Eθ(T|S1) = T1,
where the fourth equality follows from the fact that Eθ{T|h(S2)} is, by definition, a function of S2.
In other words, conditioning on a minimal sufficient statistic implies that any further conditioning
will not result in any further variance reduction (indeed, it will not even result in a new unbiased
estimator).
Moreover, if we have another unbiased estimator T′, then Theorem 2.6 indicates that T1′ =
Eθ(T′|S1) has smaller variance than T′, but it does not indicate whether T1 or T1′ has the lower
variance. So, while Theorem 2.6 gives us a method for deriving estimators with reduced variances,
it does not necessarily give us a method of deriving UMVU estimators. We shall see, however,
that there are conditions under which the result of Theorem 2.6 does yield a UMVU estimator.
Unfortunately, these conditions are rather technical and we only present a basic introduction.
We start by defining the concept of completeness of a statistic or estimator T . The general
idea is that a statistic is complete if no function of it has expectation zero for all values of θ unless
the function is the zero function, z(x) ≡ 0 for all x. In particular, this means that if g(T ) is an
unbiased estimator for some parameter τ = τ (θ), then there is no other function of T which is also
an unbiased estimator of τ . To see this, note that if h(T ) was another unbiased estimator of τ
then z(T ) = g(T ) − h(T ) would be a non-zero function of T (since the two functions g and h are
assumed to be distinct) for which Eθ {z(T )} = Eθ {g(T )} − Eθ {h(T )} = τ − τ = 0, contradicting
the assumption of completeness for T . Thus, complete statistics have at most one form in which
they can be used to estimate a parameter in an unbiased fashion. Formally, we have the following
definition:
Definition 2.12: If X1 , . . . , Xn are a random sample from a distribution having density function
fX (x; θ) with parameter θ ∈ Θ , then a statistic T = t(X1 , . . . , Xn ) is termed complete if and
only if
Eθ{z(T)} = 0 ⟹ Prθ{z(T) = 0} = 1,
for all θ ∈ Θ.
To illustrate, let X1, . . . , Xn again be a random sample from the uniform distribution on the
interval [0, θ] (θ > 0), and consider the statistics T1 = (Y1, Yn) and T2 = Yn, where Y1 and Yn
denote the sample minimum and maximum. It is a simple exercise to show that Eθ(Y1) = θ/(n + 1)
and Eθ(Yn) = nθ/(n + 1), so that the function z(T1) = Yn − nY1 satisfies Eθ{z(T1)} = 0 for all θ > 0;
but clearly Prθ{z(T1) = 0} = Prθ(Yn = nY1) ≠ 1 for any n > 1 (in fact, it can be shown that
this probability actually equals zero as long as n > 1). Thus, T1 is not a complete statistic
(although it is sufficient in this case, since we saw that Yn on its own is sufficient and thus any
vector-valued statistic which includes Yn as a component must be sufficient as well, though of
course it will not be minimal sufficient in such cases). Alternatively, suppose that z2(t2) is such
that Eθ{z2(T2)} = Eθ{z2(Yn)} = 0 for all θ > 0. This means that
∫_0^θ z2(y) fYn(y; θ) dy = 0
for all θ > 0. It is again a simple exercise (left to the reader) to show that the density function
associated with the distribution of Yn is given by fYn(y; θ) = nθ^{−n}y^{n−1}, so that z2(Yn) having
zero expectation implies that:
(n/θ^n) ∫_0^θ z2(y) y^{n−1} dy = 0 ⟹ ∫_0^θ z2(y) y^{n−1} dy = 0,
for all θ > 0.
for all θ > 0. Differentiating the second equation above with respect to θ, shows that z2 (Yn )
having zero expectation implies z2 (θ)θn−1 = 0 for all θ > 0. This equation, in turn, implies that
z2 (θ) = 0 for all θ > 0. In other words, z2 (·) is the zero function, so that P rθ {z2 (T2 ) = 0} =
P rθ (0 = 0) = 1. Thus, T2 = Yn is seen to be a complete (as well as sufficient) statistic.
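The contrast between T1 and T2 here can be illustrated by simulation: the function z(T1) = Yn − nY1 averages to zero for every θ and yet is essentially never itself zero. The sketch below (θ and n are arbitrary choices of ours) shows both facts:

```python
# Sketch illustrating why T1 = (Y1, Yn) is not complete for Uniform[0, theta]:
# z(T1) = Yn - n*Y1 has expectation zero but is (almost surely) nonzero.
import random

random.seed(4)
theta, n, reps = 2.0, 5, 100000
zs = []
for _ in range(reps):
    xs = sorted(random.uniform(0, theta) for _ in range(n))
    zs.append(xs[-1] - n * xs[0])     # Yn - n*Y1

print("average of z(T1):", sum(zs) / reps)                       # ~ 0
print("fraction exactly zero:", sum(z == 0 for z in zs) / reps)  # 0.0
```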
In general, demonstrating completeness for a given statistic can be quite complicated. Fortunately,
it turns out that completeness can be demonstrated for specific statistics in exponential families. In
particular, the (minimal) sufficient statistic Σ_{i=1}^n Di, where Di = {d1(Xi), . . . , dk(Xi)}, is complete
(the proof of this fact is rather technical and is omitted). The true importance of complete, sufficient
statistics is demonstrated in the following theorem:
statistics is demonstrated in the following theorem:
Theorem 2.7: Let X1 , . . . , Xn be a random sample from a distribution with density function
fX (x; θ) for some parameter θ ∈ Θ. If S = s(X1 , . . . , Xn ) is a complete and sufficient statistic,
and T = t(S) is an unbiased estimator of τ = τ (θ), then T is a UMVU estimator.
Proof: Let T′ = t′(S) be any unbiased estimator of τ which is a function of the complete,
sufficient statistic (we have assumed that T is one such estimator, but there may be others).
Then we have Eθ(T − T′) = 0 for all θ ∈ Θ. However, since T and T′ are functions of S, we
can define T − T′ = z(S) = t(S) − t′(S). Since S is assumed complete, it must be the case that
Prθ{z(S) = 0} = Prθ(T = T′) = 1. In other words, there can be only one unbiased estimator
of τ which is a function of S. Now, let T1 be any unbiased estimator of τ (not necessarily a
function of S). Since Eθ(T1|S) is unbiased and a function of S (by Theorem 2.6), it must be
the case that Eθ(T1|S) = T, regardless of the initial unbiased estimator T1. Now, Theorem 2.6
also states that Varθ{Eθ(T1|S)} = Varθ(T) ≤ Varθ(T1) for all θ ∈ Θ. Since T1 was an arbitrary
unbiased estimator of τ, we see that this final implication means that T has smaller variance
than any other unbiased estimator; in other words, T is a UMVU estimator.
Theorem 2.7 is often referred to as the Lehmann-Scheffé Theorem. The implication of the theorem
is extremely important. If there is a complete, sufficient statistic S (which we know exists in the
case of an exponential family) and there is some unbiased estimator of τ , say T1 then there is
a UMVU estimator of τ which can be arrived at by combining Theorems 2.6 and 2.7; that is,
by taking the conditional expectation of the unbiased estimator given the complete and sufficient
statistic, T = Eθ (T1 |S), since this estimator will be unbiased and will be a function of the complete,
sufficient statistic. Moreover, if we happen to have (or can easily determine) an unbiased estimator
which is a function of a complete, sufficient statistic we know that it must be a UMVU estimator
without any further modification.
Example 2.8 (cont’d): Since the Poisson distributions form an exponential family with d1(Xi) =
Xi, we know that S = Σ_{i=1}^n Xi is a complete and sufficient statistic. Furthermore, we have
seen that the statistic
T = ((n − 1)/n)^S
is an unbiased estimator of τ = τ(θ) = e^{−θ} = Prθ(Xi = 0). Thus, we have an unbiased estimator
which is a function of a complete, sufficient statistic, which implies that T must be a UMVU
estimator (even though, as we saw previously, its variance does not achieve the Cramér-Rao
lower bound).
As a final remark, we note that it is possible in certain situations for some functions of the parameter,
τ = τ (θ), to have no unbiased estimators, though the situations in which this occurs are rare and
usually not of much practical importance. Also, it is possible for unbiased estimators to exist, but
for there to be no UMVU estimator; in other words, there is no unbiased estimator whose variance
is minimal for all values of θ ∈ Θ.
2.5. Bayes Estimation
In the previous sections, our estimators have been functions of the data; in other words, they have
been based solely on the observed information, which certainly seems sensible. However, as we have
noted, the randomness in the observations means error in the estimates is inevitable. In particular,
occasionally there will be observed data which yields an estimated value for the parameter of interest
which may be “unbelievable”. In such situations, we may be tempted to conclude that our chosen
probability model is wrong. To address this concern, we may choose a new probability model, or
use so-called non-parametric methods which are less dependent on the choice of probability models
(and we shall briefly investigate this approach in Section 2.6). Suppose, however, that we believe
our chosen probability model is correct. This creates somewhat of a quandary, since we must
seemingly choose between our belief in the model and our belief that the resultant estimate of the
parameters is highly errant. The resolution to this dilemma comes from asking a simple question:
Why do we feel that the resultant estimate based on the data is so “unbelievable”? Clearly, we
must have some prior knowledge of what a “reasonable” estimate of the parameter is in order to
make such a judgement. If so, we should try to incorporate the information contained in our prior
knowledge of the specific problem under study into the estimation procedure (i.e., we should base
our estimator not only on the observed data, but also on some quantification of our prior ideas
about the likely values of the parameters being estimated).
Formally, suppose that we can model our prior belief about the “likelihood” that the parameter
of interest, θ, takes on any specified value in the parameter space, Θ, with the density function, π(θ),
referred to as the prior distribution of θ. The function π(θ) contains our beliefs about the relative
likelihood that a particular value of θ in Θ is the “true” value of the parameter (i.e., that it is the
actual value of the parameter which indexes the distribution used to characterise the population
that gave rise to the observed data). Since we are still assuming that the chosen probability model
is correct, some value of θ must indeed be the correct one, and thus the integral of π(θ) over the
full range of the parameter space, Θ, must be unity, which is why we choose π(θ) to be a density
function (or a pmf if the parameter space is discrete).
The question now arises as to how to incorporate this prior distribution into the estimation
procedure. To do this, we note that our attachment of a prior distribution to the parameter θ is
equivalent to considering it as a random variable itself. Moreover, with this interpretation of θ, we
see that the density function for the observed random variables, fX (x; θ), can be thought of as the
conditional density of the Xi ’s given θ. To combine the information regarding our prior belief and
our observed data, we focus on the “change” to our prior belief brought about by the data. In other
words, we want to examine the “likelihood” of values for the parameter θ given the new observed
data information. Formally, then, we define the posterior distribution of θ, π(θ|X1 , . . . , Xn ), using
Bayes’ Rule (which is where the name Bayesian estimation derives) as:
π(θ|X1, . . . , Xn) = L(θ; X1, . . . , Xn) π(θ) / ∫_Θ L(t; X1, . . . , Xn) π(t) dt.
[NOTE: Recall that the likelihood function of the data, L(θ; X1 , . . . , Xn ), is equivalent to the joint
density of the Xi ’s. In fact, it is the joint conditional density of the Xi ’s given θ in this case, since
θ is now assumed to follow a random distribution. Also, note that the denominator in the above
definition is just the unconditional, or marginal, density function of the Xi ’s. As such, it does
not depend on θ and, from the perspective of the posterior distribution of θ, is therefore just a
normalising constant which ensures that the posterior distribution integrates to unity. Heuristically,
then, we see that the definition of the posterior distribution can be thought of as:
Pr(θ|X1, . . . , Xn) = Pr(X1, . . . , Xn|θ) Pr(θ) / Pr(X1, . . . , Xn),
which is precisely the standard form of Bayes’ Rule.]
The posterior distribution incorporates both forms of information that we have about the
parameter; namely, our prior beliefs and the observed data. Of course, as it is a distribution
function, it does not directly give us a point estimate for the parameter of interest. Using the
posterior distribution to arrive at point estimates is the subject of the rest of this section. Before
proceeding to this discussion, however, we close with an important comment. For the remainder
of this section, we will assume that we have been given (or have made a choice of) an appropriate
prior distribution (i.e., one which accurately reflects our prior knowledge regarding the parameter
θ). Of course, in practice, the proper choice of a prior distribution is extremely difficult, and is
generally quite crucial to the end result of the estimation procedure. Unfortunately, a full discussion
regarding the proper choice of prior distributions is complex and beyond the scope of these notes.
Here, we only note that priors are often chosen for reasons of mathematical simplicity (which is
rarely a strong practical justification for the use of a specific prior).
2.5.1. Posterior Bayes Estimators: We noted previously that the posterior distribution incor-
porates all the available information regarding the parameter in our new Bayesian framework, in
much the same way that the likelihood function itself does for the specified probability model. As
such, we might consider estimating θ by using the value which maximises the posterior distribu-
tion; that is, we might use the posterior mode. Alternatively, since the posterior distribution is
indeed a distribution for θ (recall that the likelihood function is a distribution for the Xi ’s but not
necessarily for θ), we might use its mean or median as an estimator as well. Primarily for reasons
of mathematical simplicity (though we shall see there are other good reasons), we shall focus on
the posterior mean, or posterior Bayes estimator, of any parameter of interest τ = τ (θ):
τ̂π = E{τ(θ)|X1, . . . , Xn} = ∫_Θ τ(θ) π(θ|X1, . . . , Xn) dθ,
where we interpret the farthest right-hand expression as a multiple integral if θ is a vector, and
we replace integrals by appropriate sums if θ is discrete. Also, we note that the chosen notation is
designed to indicate the dependence of the estimator on the chosen prior distribution π(θ). Using
the definition of the posterior distribution, and the fact that the likelihood function is just the joint
(conditional) density of the data, we can write
τ̂π = E{τ(θ)|X1, . . . , Xn} = ∫_Θ τ(θ) π(θ|X1, . . . , Xn) dθ = ∫_Θ τ(θ) [L(θ; X1, . . . , Xn) π(θ) / ∫_Θ L(t; X1, . . . , Xn) π(t) dt] dθ
= ∫_Θ τ(θ) {∏_{i=1}^n fX(xi; θ)} π(θ) dθ / ∫_Θ {∏_{i=1}^n fX(xi; θ)} π(θ) dθ,
provided the observed Xi ’s are independent and identically distributed [NOTE: in the denominator
of final expression, we have switched the integration variable from t to θ, since once this integral
is factored outside the integral in the numerator, there is no longer any possibility of ambiguity].
Note the similarity between this estimator and the Pitman estimator of location defined in Section
2.2.2.
Example 2.5 (cont’d): Let X1 , . . . , Xn be a sample from the Bernoulli distribution with pa-
rameter θ, so that fX(x; θ) = θ^x(1 − θ)^{1−x} for x = 0, 1. Suppose that we choose a uniform
distribution over the range Θ = (0, 1) to represent our prior belief regarding θ, so that π(θ) = 1
for 0 ≤ θ ≤ 1 (note that the uniform prior indicates that we believe each value is as likely
as any other, so that this prior may serve to indicate the general notion of “no prior belief”
regarding the value of θ). So, to estimate τ (θ) = θ, the parameter itself, using the posterior
Bayes estimator, we have:
θ̂π = [∫_0^1 θ ∏_{i=1}^n θ^{xi}(1 − θ)^{1−xi} π(θ) dθ] / [∫_0^1 ∏_{i=1}^n θ^{xi}(1 − θ)^{1−xi} π(θ) dθ]
= [∫_0^1 θ^{1+Σ_{i=1}^n xi} (1 − θ)^{n−Σ_{i=1}^n xi} dθ] / [∫_0^1 θ^{Σ_{i=1}^n xi} (1 − θ)^{n−Σ_{i=1}^n xi} dθ].
Now, it is not difficult to show (and is left as an exercise) that the Beta integral can be calculated
as:
∫_0^1 θ^{a−1}(1 − θ)^{b−1} dθ = Γ(a)Γ(b)/Γ(a + b),
for any positive constants a and b [and, of course, Γ(k) = ∫_0^∞ x^{k−1}e^{−x} dx is the usual Gamma
function, which satisfies the simple relationship Γ(k + 1) = kΓ(k), a fact which is easily demon-
strated using integration by parts]. Thus, we see that the posterior Bayes estimator for θ is
given by:
θ̂π = [Γ(2 + Σ_{i=1}^n xi) Γ(n + 1 − Σ_{i=1}^n xi)/Γ(n + 3)] × [Γ(n + 2)/{Γ(1 + Σ_{i=1}^n xi) Γ(n + 1 − Σ_{i=1}^n xi)}]
= {Γ(2 + Σ_{i=1}^n xi) Γ(n + 2)}/{Γ(1 + Σ_{i=1}^n xi) Γ(n + 3)} = (1 + Σ_{i=1}^n xi)/(n + 2).
Alternatively, suppose that we choose a Beta distribution as our prior, so that π(θ) = πa,b(θ) =
{Γ(a + b)/(Γ(a)Γ(b))} θ^{a−1}(1 − θ)^{b−1} for some chosen positive values of the constants a and b. In this case,
nearly identical calculations to those performed above (and based on the fact that this prior
leads to readily tractable mathematics, which is precisely why it was chosen) show that:
θ̂πa,b = (a + Σ_{i=1}^n xi)/(n + a + b).
[NOTE: The case a = b = 1 reduces to the case of a uniform prior, and yields the appropriate
result.] Finally, we note that the above estimator can be written as
θ̂πa,b = {n/(n + a + b)} x̄ + {(a + b)/(n + a + b)} · {a/(a + b)},
where x̄ = n^{−1} Σ_{i=1}^n xi is the observed sample average (which in this case is also the observed
proportion of data values which were equal to 1). It is a simple exercise to show that the expec-
tation of a random variable with a distribution having density πa,b (θ) (i.e., a Beta distribution
with parameters a and b) is given by a/(a + b). So, the new form of the estimator shows that in
this case the posterior Bayes estimator can be seen as the weighted average between the maxi-
mum likelihood estimator (i.e., the estimator we would commonly use when we were not trying
to incorporate prior information, but rather basing our estimate solely on the data) and the
“pure prior” estimator (i.e., the mean of the prior distribution, which is what the posterior
Bayes estimator reduces to if we have no observed data). In closing this example, however, we
note that it is not always possible to write a posterior Bayes estimator in such a form (i.e., as
a weighted average of the “pure prior” estimate and the MLE).
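The weighted-average identity is easy to confirm numerically; the sketch below (the data and prior constants are our own illustrative choices) computes the estimate both ways:

```python
# Sketch checking the weighted-average form of the Beta-prior Bayes estimator.
xs = [1, 0, 1, 1, 0, 1, 0, 1]        # 5 successes in n = 8 Bernoulli trials
a, b = 2.0, 3.0                      # Beta prior parameters
n, s = len(xs), sum(xs)
xbar = s / n

direct = (a + s) / (n + a + b)       # posterior Bayes estimate
weighted = (n / (n + a + b)) * xbar + ((a + b) / (n + a + b)) * (a / (a + b))
assert abs(direct - weighted) < 1e-12
print(direct)    # 7/13 ~ 0.538, between xbar = 0.625 and the prior mean 0.4
```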
We note that the (conditional) expectation of the estimator in the preceding example is given by:
E(θ̂πa,b|θ) = (nθ + a)/(n + a + b) ≠ θ,
unless a = b = 0, which is not allowed (as the parameters a and b must be positive). As such, the
posterior Bayes estimator in this instance is not (conditionally) unbiased. Indeed, this turns out to
be a general phenomenon, as the following theorem shows:
Theorem 2.8: Let τ̂π be the posterior Bayes estimator of τ = τ(θ) with respect to the prior
distribution π(θ). If both τ̂π and τ(θ) have finite variances, then either Pr{τ̂π = τ(θ)|θ} = 1
or else E(τ̂π|θ) ≠ τ(θ). In other words, the only way for a posterior Bayes estimator to be
(conditionally) unbiased is if it always yields exactly the correct value of τ(θ).
Proof: We start by supposing that τ̂π is (conditionally) unbiased, so that E(τ̂π|θ) = τ(θ).
Then, we have:
Var(τ̂π) = E{Var(τ̂π|θ)} + Var{E(τ̂π|θ)} = E{Var(τ̂π|θ)} + Var{τ(θ)}.
On the other hand, since τ̂π = E{τ(θ)|X1, . . . , Xn}, the same variance decomposition applied
in the other direction gives Var{τ(θ)} = E{Var(τ(θ)|X1, . . . , Xn)} + Var(τ̂π). Substituting this
expression into the first equality and cancelling Var(τ̂π) shows that:
0 = E{Var(τ̂π|θ)} + E{Var(τ(θ)|X1, . . . , Xn)},
and since both of the quantities on the right-hand side of this equality are non-negative (since
they are expectations of conditional variances, which cannot be negative), both of the quantities
must be zero. In particular, we see that E{Var(τ̂π|θ)} = 0, which implies Var(τ̂π|θ) = 0, since
again Var(τ̂π|θ) cannot be negative, and therefore the only way it can have zero expectation
is for it to always be zero. Finally, we note that the only way a random variable can have
(conditional) variance of zero is if it is always equal to its (conditional) expectation, and thus
we see that if τ̂π is assumed unbiased, we must have Pr{τ̂π = τ(θ)|θ} = 1. Thus, we have shown
that there are only two possibilities: either τ̂π is not unbiased, or else it is always equal to τ(θ),
as was required.
Finally, we note that the uniform prior chosen in Example 2.5 was seen to represent the notion
of “no prior information” regarding the parameter θ, since it gave equal likelihood to all possible
values. Such a prior distribution is often termed non-informative. It is sometimes argued that such
priors are the most sensible ones to choose in most situations. A full discussion of such ideas is
again beyond the scope of these notes; however, we note that it is not always possible to define
such non-informative priors. Moreover, even if we can define a non-informative prior distribution
for a particular parameter θ, if we reparameterise our probability model using the new parameter
η = η(θ), it is rarely the case that the non-informative prior for θ will transform into a corresponding
non-informative prior for η. In other words, we know that if θ has a distribution with density π(θ),
then any one-to-one function (which a reparameterisation must be) η = g(θ) has density function:
πη(η) = π{g^{−1}(η)} |d g^{−1}(η)/dη|,
where g −1 (η) is the inverse function of g(θ) (which again must exist since a reparameterisation is
a one-to-one function). Clearly, then, if π(θ) is the density of a uniform distribution, then it will
rarely be the case that πη (η) will also be a uniform distribution. Thus, assuming no information
on a particular parameter scale, generally means that we are assuming we do have information
on some other parameter scale. This lack of invariance for the property of non-informativeness in
prior distributions makes their use somewhat suspect. At the very least, we must be reasonably
sure about the appropriate scale on which to choose to represent our “lack of prior knowledge”
about the problem at hand. This is, of course, just another piece of evidence demonstrating
the difficulties involved in choosing an appropriate prior distribution. [NOTE: For those who are
interested, another popular choice of prior distribution, designed to represent the notion of a lack of
any prior information, is the so-called vague or Jeffreys prior, which is based on the square-root of
the expected Fisher information and does have the above noted invariance property. Alternatively,
the method of empirical Bayes estimation attempts to use the data itself to choose, at least in part,
the appropriate prior distribution.]
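The lack-of-invariance point is easily seen in a concrete case: a uniform prior for θ on (0, 1) is far from uniform on the scale η = ln(θ). The sketch below applies the change-of-variable formula above (the choice g(θ) = ln θ is our own illustration):

```python
# Sketch: a "non-informative" uniform prior on theta is informative on the
# log scale eta = ln(theta), by the change-of-variable formula.
import math

def prior_theta(theta):
    return 1.0 if 0.0 < theta < 1.0 else 0.0    # uniform prior on (0, 1)

def prior_eta(eta):
    # g(theta) = ln(theta), so g^{-1}(eta) = e^eta and |d g^{-1}/d eta| = e^eta.
    return prior_theta(math.exp(eta)) * math.exp(eta)

for eta in [-3.0, -2.0, -1.0, -0.5]:
    print(eta, prior_eta(eta))   # density e^eta: clearly not constant in eta
```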
2.5.2. Bayes Risk and Minimax Estimators: In Section 2.2.4, we introduced the concept of
loss functions, to measure the relative cost of making various errors in our estimation process. In
this section, we discuss how the use of a prior distribution can be combined with a selected loss
function to arrive at optimal estimators. Recall, however, that we have the same issues regarding
appropriate choice of a loss function that we do for prior distributions, and we will again simply
assume that an appropriate choice of prior and loss function have been made without delving into
the complex (and sometimes non-statistical) issues involved in this selection.
Formally, let X1 , . . . , Xn be a random sample from a distribution with density function fX (x; θ)
for some parameter θ ∈ Θ. We will assume that θ is a random variable with some (known) prior
distribution π(θ). Using this prior information as well as the sample observations, we wish to
estimate the parameter τ = τ (θ). In addition, we assume that the loss function (t; θ) has been
specified and determines the relative cost of estimating τ as t when θ is the true value of the
Statistical Inference (STAT3013/8027) Lecture Notes - Page 41
parameter (i.e., the particular outcome from the chosen prior distribution). For any estimator,
T = t(X1 , . . . , Xn ) (which may depend on the prior distribution as well), we defined the risk function
as Rt(θ) = Eθ{ℓ(T; θ)}, which we now will write as Rt(θ) = E{ℓ(T; θ)|θ} since θ is considered as
a random variable in our present context. Our original goal was to choose an estimator T which
had uniformly minimal risk over the entire range of θ values. Of course, in general, we saw that
no such estimator existed, the difficulty arising from the fact that the risk function depends on
θ, and for any pair of estimators one will generally be better for some possible values of θ and
worse for others. In the present situation, we have assumed that θ is a random variable; in other
words, we have an idea of which values of θ are the most likely. As such, we might try to choose
an estimator which minimises the risk appropriately averaged over the possible θ values; that is,
choose an estimator which does “best” for the most “likely” values of θ. Formally, we define the
Bayes risk of an estimator as follows:
Definition 2.13: Let X1 , . . . , Xn be a random sample from a distribution having density
function fX (x; θ) for some parameter θ ∈ Θ, θ being a random variable with prior distribution
π(θ). For estimating τ = τ(θ) using the loss function ℓ(t; θ) and an estimator T = t(X1, . . . , Xn),
the risk function was defined as Rt(θ) = E{ℓ(T; θ)|θ}. The Bayes risk of the estimator T with
respect to the chosen loss function and prior distribution is then defined as:
$$r(t) = r_{\ell,\pi}(t) = \int_\Theta R_t(\theta)\,\pi(\theta)\,d\theta = E_\pi\{R_t(\theta)\},$$
where the notation Eπ indicates expectation taken with respect to the prior distribution.
Note that the Bayes risk of an estimator is a weighted average of its risk function, Rt (θ), where
the weights represent the likelihood that the risk at any given value of θ is the pertinent one; that
is, the weights represent the likelihood of any θ value based on our prior information. Since the
Bayes risk is now a single number, rather than a function of θ as the risk function itself was, we
can easily define the “best” estimator in this context as the one which minimises the Bayes risk:
Definition 2.14: Under the structure determined in Definition 2.13, the Bayes estimator of
τ(θ) with respect to a chosen loss function and prior distribution is that estimator T = T_{ℓ,π} =
t_{ℓ,π}(X1, . . . , Xn) with the smallest Bayes risk. In other words, T_{ℓ,π} is a Bayes estimator if
$$r_{\ell,\pi}(t_{\ell,\pi}) \le r_{\ell,\pi}(t)$$
for any other estimator T = t(X1, . . . , Xn).
So, we now see that the posterior Bayes estimator introduced in Section 2.5.1 is indeed a Bayes
estimator with respect to squared-error loss. Furthermore, nearly identical calculations, combined
with the fact that the function h(a) = E(|Z − a|) for any random variable Z is minimised at a =
median(Z), show that the Bayes estimator of a scalar parameter θ under absolute-error loss is given
by the median of the posterior distribution, π(θ|X1 = x1 , . . . , Xn = xn ). [NOTE: Similarly, the
Bayes estimator under absolute-error loss of τ (θ) is given by the median of the posterior distribution
of τ (θ). Of course, to find the posterior distribution of τ (θ) we must use the change-of-variable
formula on the posterior distribution of the parameter itself, π(θ|X1 = x1 , . . . , Xn = xn ).] Finally,
we note that choosing the constant-error loss function with window-width ε, ℓ(t; θ) = A·I{|t−τ(θ)|>ε},
deriving the associated Bayes estimator and then letting ε tend to zero, yields the mode of the
posterior distribution of τ (θ) (again, requiring the use of the change of variable formula to arrive
at the appropriate posterior distribution for the parameter τ ). In other words, while the posterior
mode is not (necessarily) directly a Bayes estimator, it is the limit of a sequence of Bayes estimators
(of course, in some circumstances the posterior mode may be the Bayes estimator for some other
choice of loss function). The demonstration of this fact follows along the lines of the demonstration
for the posterior mean and posterior median Bayes estimators, however, it is rather technical and
unenlightening, and is thus omitted from these notes.
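Before turning to a worked example, we note that all three posterior summaries are easy to obtain numerically from an unnormalised posterior evaluated on a grid. The following Python sketch (assuming numpy; the Bernoulli likelihood, Beta(2, 2) prior and data counts are purely illustrative choices of ours) demonstrates the idea:

    import numpy as np

    def posterior_summaries(log_post, grid):
        # Normalise an unnormalised log-posterior evaluated on an equally
        # spaced grid, then return the posterior mean, median and mode.
        w = np.exp(log_post - log_post.max())
        w /= np.trapz(w, grid)
        cdf = np.cumsum(w) * (grid[1] - grid[0])
        mean = np.trapz(grid * w, grid)           # Bayes estimate: squared-error loss
        median = grid[np.searchsorted(cdf, 0.5)]  # Bayes estimate: absolute-error loss
        mode = grid[np.argmax(w)]                 # limit of constant-error loss
        return mean, median, mode

    # Illustration: 7 successes in n = 10 Bernoulli trials with a Beta(2, 2)
    # prior, so the posterior is Beta(9, 5) with kernel theta^8 (1-theta)^4.
    theta = np.linspace(1e-4, 1 - 1e-4, 100_000)
    log_post = 8 * np.log(theta) + 4 * np.log(1 - theta)
    print(posterior_summaries(log_post, theta))
    # mean = 9/14 = 0.643, mode = 8/12 = 0.667; the median lies between.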
Example 2.11: Suppose that X1 , . . . , Xn are independent random variables each having a
normal distribution with zero mean and variance (2θ)−1 . The joint conditional distribution of
the Xi ’s given θ (which is also the joint conditional likelihood function) is then:
$$L(\theta; x_1, \ldots, x_n) = \pi^{-n/2}\,\theta^{n/2}\,e^{-\theta\sum_{i=1}^n x_i^2}.$$
Further, suppose that we select a Gamma prior distribution for θ with shape parameter α and
scale parameter 1/β, so that
$$\pi(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\theta^{\alpha-1}\,e^{-\beta\theta}.$$
Thus, the posterior distribution for θ is:
$$\pi(\theta|x_1, \ldots, x_n) = C(x_1, \ldots, x_n)\,\theta^{n/2+\alpha-1}\,e^{-(\beta+y)\theta},$$
where $y = \sum_{i=1}^n x_i^2$ and C(x1, . . . , xn) is the appropriate normalising constant. Since this
clearly has the form of a Gamma density with shape parameter n/2 + α and scale parameter
(β + y)⁻¹, we can conclude that
$$C(x_1, \ldots, x_n) = \frac{(\beta+y)^{n/2+\alpha}}{\Gamma(n/2+\alpha)} = \frac{\left(\beta+\sum_{i=1}^n x_i^2\right)^{n/2+\alpha}}{\Gamma(n/2+\alpha)}.$$
If we select squared-error loss, then we know that the Bayes estimator for θ is given by
E(θ|x1 , . . . , xn ), the mean of the posterior distribution. In this case, the posterior distribu-
tion is a Gamma distribution which has mean (n/2 + α)/(β + y). Also, note that the vari-
ance of the Xi ’s is σ 2 = (2θ)−1 , which means that the Bayes estimate (under squared-error
loss) is E{(2θ)−1 |x1 , . . . , xn }. Now, it is a simple exercise (left to the reader) to show that
if Z has a Gamma distribution with shape parameter a > 1 and scale parameter b, then
E(1/Z) = {b(a − 1)}⁻¹. Therefore, the Bayes estimator of σ² is given by:
$$E\{(2\theta)^{-1}|x_1, \ldots, x_n\} = \frac{\beta+y}{2(n/2+\alpha-1)} = \frac{\beta+y}{n+2\alpha-2}.$$
The posterior distribution of σ² = (2θ)⁻¹ itself has the form of an inverse Gamma distribution
(the demonstration of this fact derives from a straightforward implementation of the change-of-variable
formula for probability densities and is left as an exercise). So, if we use absolute-error
loss, the Bayes estimator is the median of this inverse Gamma posterior distribution which,
unfortunately, does not admit a closed-form expression. Finally, if we take the limit of the Bayes
estimators associated with the constant-error loss function with window-width ε, we arrive at the
mode of the posterior distribution for σ² = (2θ)⁻¹ as our estimator, which is easily calculated as:
$$\text{mode}\{\pi_1(\sigma^2|x_1, \ldots, x_n)\} = \frac{\beta+\sum_{i=1}^n x_i^2}{n+2\alpha+2} = \frac{\beta+y}{n+2\alpha+2}.$$
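These closed-form results are easily verified numerically. The sketch below (assuming numpy and scipy; the values of n, α, β and the generating value of θ are arbitrary illustrations) uses the fact noted above that σ² = (2θ)⁻¹ has an inverse Gamma posterior:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha, beta = 3.0, 2.0                      # illustrative prior parameters
    n = 50
    x = rng.normal(0.0, np.sqrt(0.5), size=n)   # data generated with sigma^2 = 0.5
    y = np.sum(x**2)

    # theta | x ~ Gamma(n/2 + alpha, scale 1/(beta + y)), so sigma^2 = 1/(2 theta)
    # has an inverse Gamma posterior with shape n/2 + alpha, scale (beta + y)/2.
    post = stats.invgamma(n / 2 + alpha, scale=(beta + y) / 2)

    print((beta + y) / (n + 2 * alpha - 2), post.mean())  # mean: squared-error loss
    print(post.median())                                  # median: absolute-error loss
    print((beta + y) / (n + 2 * alpha + 2))               # mode: closed form above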
The Bayes estimators derived in the preceding example are seen to be functions of $Y = \sum_{i=1}^n X_i^2$,
which we have seen is the minimal sufficient statistic in this case. In fact, it can be shown quite
generally that Bayes estimators will be functions of the minimal sufficient statistics, as well as BAN
(best asymptotically normal),
for any choice of prior. So, even if we are unsure about our particular choice of prior distribution,
we can at least be sure that our Bayes estimator has some desirable properties regardless of our
choice of prior. In this vein, we close with a theorem which relates Bayes estimators to the minimax
estimators defined in Section 2.2.4. Recall that T = t(X1 , . . . , Xn ) is a minimax estimator of τ (θ)
for the specified loss function ℓ(t; θ) if the maximum value of its risk function, Rt(θ) = Eθ{ℓ(T; θ)},
over the parameter space, Θ, is smaller than the maximum value of the risk function for any other
estimator; in other words, T is minimax if
$$\sup_{\theta\in\Theta}\{R_t(\theta)\} \le \sup_{\theta\in\Theta}\{R_{t'}(\theta)\}$$
for any other estimator T′ = t′(X1, . . . , Xn) (see Definition 2.7). The idea behind minimax esti-
mators is a desire to be “conservative” or “risk averse”, as minimax estimators seek to minimise
the impact of the worst possible estimation outcome. Unfortunately, as we noted in Section 2.2.4,
finding minimax estimators is generally quite difficult. However, as the next theorem shows, we
can sometimes arrive at minimax estimators through a Bayesian estimation procedure:
Theorem 2.9: If T = t(X1, . . . , Xn) is the Bayes estimator for the parameter τ = τ(θ) under
the loss function ℓ(t; θ) and the prior distribution π(θ), and the risk function for T is constant
[i.e., Rt(θ) ≡ c for some value c which does not depend on θ], then T is a minimax estimator.
Proof: Since T is the Bayes estimator under the given loss function and prior distribution, we
know that it has smaller Bayes risk than any other estimator T′ = t′(X1, . . . , Xn). In other
words, we know that
$$r_{\ell,\pi}(t) = \int_\Theta R_t(\theta)\,\pi(\theta)\,d\theta \le \int_\Theta R_{t'}(\theta)\,\pi(\theta)\,d\theta = r_{\ell,\pi}(t'),$$
where R_{t′}(θ) is the risk function for the arbitrary new estimator T′. Therefore, since we have
assumed Rt(θ) ≡ c, we have:
$$\sup_{\theta\in\Theta}\{R_t(\theta)\} = c = \int_\Theta c\,\pi(\theta)\,d\theta = \int_\Theta R_t(\theta)\,\pi(\theta)\,d\theta \le \int_\Theta R_{t'}(\theta)\,\pi(\theta)\,d\theta \le \sup_{\theta\in\Theta}\{R_{t'}(\theta)\},$$
for any estimator T′ [NOTE: the final inequality follows from the fact that $\int_\Theta R_{t'}(\theta)\pi(\theta)\,d\theta = E_\pi\{R_{t'}(\theta)\}$,
and the expectation of a random variable clearly cannot be larger than the supremum
of the random variable over its sample space]. Thus, T must be a minimax estimator.
Example 2.5 (cont’d): We saw that for the parameter in a Bernoulli distribution, θ, the Bayes
estimator using squared-error loss and a Beta distribution prior with parameters a and b was
given by
$$\hat\theta_{\pi_{a,b}} = \frac{a+\sum_{i=1}^n X_i}{n+a+b}.$$
Now, writing A = (n + a + b)⁻¹, the risk function for θ̂_{π_{a,b}} under squared-error loss is given by:
$$R_{\hat\theta_{\pi_{a,b}}}(\theta) = E_\theta\left\{\left(\frac{a+\sum_{i=1}^n X_i}{n+a+b} - \theta\right)^2\right\} = E_\theta\left\{\left(A\sum_{i=1}^n X_i + aA - \theta\right)^2\right\}$$
$$= A^2\,E_\theta\left\{\left(\sum_{i=1}^n X_i\right)^2\right\} + 2A(aA-\theta)\,E_\theta\left(\sum_{i=1}^n X_i\right) + (aA-\theta)^2$$
$$= A^2\{n\theta(1-\theta) + n^2\theta^2\} + 2nA\theta(aA-\theta) + (aA-\theta)^2.$$
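This risk function is straightforward to explore numerically. In light of Theorem 2.9, choices of a and b for which the risk is constant in θ are of special interest; the sketch below (assuming numpy, with an arbitrary illustrative sample size) evaluates the expression just derived over a grid of θ values:

    import numpy as np

    def risk(theta, n, a, b):
        # Risk of the Beta(a, b)-prior Bayes estimator under squared-error
        # loss, using the expansion derived above, with A = 1/(n + a + b).
        A = 1.0 / (n + a + b)
        return (A**2 * (n * theta * (1 - theta) + n**2 * theta**2)
                + 2 * n * A * theta * (a * A - theta) + (a * A - theta)**2)

    theta = np.linspace(0.0, 1.0, 11)
    n = 25
    print(risk(theta, n, a=1.0, b=1.0))  # varies noticeably with theta
    print(risk(theta, n, a=np.sqrt(n) / 2, b=np.sqrt(n) / 2))
    # With a = b = sqrt(n)/2 the printed risk is constant in theta (here
    # 1/144), so by Theorem 2.9 that Bayes estimator is also minimax.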
2.6.1. The Empirical Distribution Function: For observed data values x1, . . . , xn, the empirical
distribution function is defined as:
$$\hat F(x) = \frac{1}{n}\sum_{i=1}^n I_{(x_i\le x)} = \frac{n_x}{n},$$
where nx is defined as the number of observed data values which are less than or equal to the
value x. Essentially, the empirical distribution function F̂ is the CDF of a new discrete random
variable, say X*, defined to take a value chosen at random from the collection of observed data
values X = {x1, . . . , xn}. In this way, the relationship between F̂ and X* mimics the relationship
between F and the original random variables representing the data values, X1, . . . , Xn (of course,
X* is by its nature discrete whereas the Xi's may be either discrete or continuous). We shall take
advantage of this relationship in more detail later, but for now it suffices to note that the obvious
analogy between the pairs (F, X) and (F̂, X*) means that it is reasonable to assume that studying
(F̂, X*) will likely yield information about (F, X). In particular, we note that, for any given value
x, F̂ (x) is an unbiased estimate of F (x), since
$$E_F\{\hat F(x)\} = E_F\left\{\frac{1}{n}\sum_{i=1}^n I_{(X_i\le x)}\right\} = \frac{1}{n}\sum_{i=1}^n E_F\{I_{(X_i\le x)}\} = \frac{1}{n}\sum_{i=1}^n Pr(X_i\le x) = \frac{1}{n}\sum_{i=1}^n F(x) = F(x),$$
where the notation EF is used to indicate expectation under the true distribution determined by
the CDF F (in just the same way that the previous notation Eθ indicated expectation under the
distribution indexed by the parameter value θ). Of course, this result also follows directly upon the
recognition that the random variable nx (the number of observed data values less than or equal to x)
is clearly binomially distributed with n trials and a “success” probability of p = P r(Xi ≤ x) = F (x).
Thus, we can see that EF (nx /n) = EF (nx )/n = nF (x)/n = F (x). This characterisation shows
further that:
$$Var_F\{\hat F(x)\} = Var_F\left(\frac{n_x}{n}\right) = \frac{1}{n^2}\,Var_F(n_x) = \frac{1}{n^2}\{np(1-p)\} = \frac{1}{n}\,F(x)\{1-F(x)\}.$$
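These two properties of F̂ are easily checked by simulation. The sketch below (assuming numpy and scipy; the small dataset, sample size, evaluation point and standard normal population are illustrative choices) does so directly:

    import numpy as np
    from scipy.stats import norm

    def edf(data):
        # Return the empirical distribution function of the observed data.
        data = np.sort(np.asarray(data, dtype=float))
        return lambda x: np.searchsorted(data, x, side="right") / data.size

    F_hat = edf([2.1, 0.3, 1.7, 3.4, 0.9])
    print(F_hat(1.7))  # n_x/n = 3/5, since 0.3, 0.9 and 1.7 are <= 1.7

    # Simulation check of E{F_hat(x)} = F(x) and Var{F_hat(x)} = F(1-F)/n:
    rng = np.random.default_rng(0)
    n, x0 = 20, 0.5
    sims = np.array([edf(rng.standard_normal(n))(x0) for _ in range(20_000)])
    print(sims.mean(), norm.cdf(x0))                          # both ~ 0.691
    print(sims.var(), norm.cdf(x0) * (1 - norm.cdf(x0)) / n)  # both ~ 0.0107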
As noted earlier, there are other methods of estimating F , but none are quite as simple and intuitive
as the empirical distribution function F̂ (indeed, in some sense, F̂ can be viewed as a MLE of F ).
Of course, as we noted in the introduction to this section, we are usually not interested in
estimating F directly, but rather some functional of it, θ(F ). The obvious estimator of this quantity
then becomes θ̂ = θ(F̂ ). Indeed, such an approach will lead us directly to our “common-sense”
estimators for many of the commonly used functionals of interest. In particular, suppose that θ(F )
represents the expectation of a random variable, X, having distribution F , so that θ(F ) = EF (X).
In this case, the estimator we arrive at for the expected value of F is given by
$$\hat\theta = \theta(\hat F) = E_{\hat F}(X^*) = \sum_{x\in\mathcal{X}} x\,p_{\hat F}(x) = \frac{1}{n}\sum_{i=1}^n x_i = \bar x,$$
since the (discrete) random variable X*, having a distribution with CDF F̂, was defined to have sample
space X = {x1 , . . . , xn } and pmf pF̂ (x) = n−1 for all x ∈ X . In this case, we can further see
that θ(F̂ ) is an unbiased estimator of θ(F ) (since the sample average is always unbiased for the
population expectation, regardless of the population distribution). Unfortunately, it will not always
be the case that θ(F̂ ) will be unbiased for θ(F ) when the functional θ(·) is a more complicated one,
despite the fact that we have seen that F̂ itself is always unbiased for F .
In the following sections, we investigate ways of assessing and correcting the bias of θ̂ = θ(F̂ ),
as well as estimating its variance, V arF {θ(F̂ )}. Before proceeding, however, we note that there are
alternative “non-parametric” estimation procedures, the most common ones based on the ranked
data. We shall discuss such procedures a little later, but for now we simply note that some of
the most elementary estimators such as the median and the inter-quartile range are “rank-based”
estimators, since their construction is based on examination of the sorted data values. Of course,
the median can also be viewed as an estimator based on F̂ , since defining θ(F ) to be the median
of the distribution characterised by the CDF F clearly implies that θ(F̂ ) is equal to the median
of the observed data (the distinction between this approach and that of “rank-based” methods is
that in the latter case we may wish to use the median as an estimator for the population mean as
opposed to the population median).
2.6.2. The Jackknife, Bias Correction and Variance Estimation: We now turn our attention to
assessing the properties of the estimator θ(F̂ ). In particular, we will be interested in investigating its
bias and variance. Moreover, our investigation of bias will generally have as its aim the subsequent
modification of our estimator so as to reduce the bias. In other words, we will want to construct a
new estimator of the form θ̃ = θ(F̂) − B̂ = θ̂ − B̂, where B̂ is an estimate of the bias of θ̂,
Bias_F{θ(F̂)} = E_F{θ(F̂)} − θ(F). The Jackknife approach to this problem is based on the
"leave-one-out" estimates
$$\hat\theta_i = \theta(\hat F_i),$$
where F̂i is the empirical distribution function based on the observations x1, . . . , xi−1, xi+1, . . . , xn;
that is, F̂i is the empirical distribution function based on the observed data after the iᵗʰ value has
been deleted. The idea behind this approach is that these θ̂i values can be seen as estimates of
θ̂, and the degree to which their average $\hat\theta_\bullet = \frac{1}{n}\sum_{i=1}^n \hat\theta_i$ differs from θ̂ (i.e., the degree to which
the θ̂i's are biased as estimators of θ̂) is a reasonable reflection of the level of bias in θ̂ itself as an
estimator of θ(F). Specifically, we will define
$$\hat B_J = (n-1)(\hat\theta_\bullet - \hat\theta),$$
and then define the Jackknife bias-corrected estimator of θ(F) to be θ̃J = θ̂ − B̂J.
The justification of this procedure is somewhat technical, but we can give a reasonable heuristic
explanation. Suppose that the bias of θ(F̂ ) decreases as the sample size increases in such a way
that
$$E_F\{\theta(\hat F)\} = E(\hat\theta) \approx \theta(F) + \frac{a(F)}{n},$$
for some (often unknown) constant a(F) depending on F. It turns out that this is quite generally
true for most of the commonly used functionals θ(·) of interest. As such, we see that
$$E_F\{\theta(\hat F_i)\} = E(\hat\theta_i) \approx \theta(F) + \frac{a(F)}{n-1},$$
since F̂i is just an empirical distribution function based on n − 1 observations rather than n.
Consequently,
$$E(\hat B_J) = (n-1)\{E(\hat\theta_\bullet) - E(\hat\theta)\} \approx (n-1)\left\{\frac{a(F)}{n-1} - \frac{a(F)}{n}\right\} = \frac{a(F)}{n},$$
which is precisely the approximate bias of θ̂, and thus E(θ̃J) = E(θ̂) − E(B̂J) ≈ θ(F).
In other words, θ̃J is approximately unbiased [and indeed, is exactly unbiased if the expected value
of θ(F̂) is exactly equal to θ(F) + (a/n), as the following example shows].
Example 2.12: Suppose that θ(F) is the variance functional; that is, θ(F) = σF² = E_F[{X −
E_F(X)}²]. In this case,
$$\hat\theta = \theta(\hat F) = E_{\hat F}[\{X^* - E_{\hat F}(X^*)\}^2] = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2.$$
[NOTE: The divisor here is n rather than n − 1 since, under F̂, the mean of the random variable
X* is "known" to be x̄. In other words, we are calculating the "population" variance for a random
variable with CDF F̂.] Clearly, this estimate is biased, and indeed it is easy to show that:
$$E_F(\hat\theta) = \frac{n-1}{n}\,\theta(F) = \sigma_F^2 - \frac{\sigma_F^2}{n}.$$
As such, we have seen that the Jackknife bias-corrected estimator will be exactly unbiased in
this case. Indeed, we see that in this case
$$\hat\theta_i = E_{\hat F_i}[\{X^* - E_{\hat F_i}(X^*)\}^2] = \frac{1}{n-1}\sum_{j\ne i}(x_j - \bar x_i)^2,$$
where $E_{\hat F_i}(X^*) = (n-1)^{-1}\sum_{j\ne i} x_j = \bar x_i$, since F̂i is the CDF of the discrete random variable
with sample space Xi = {x1, . . . , xi−1, xi+1, . . . , xn} and pmf p_{F̂i}(x) = (n − 1)⁻¹ for x ∈ Xi.
Some further straightforward (though rather tedious) algebraic manipulation (left as an exercise
for the reader) then shows that:
$$\hat\theta_\bullet = \frac{n-2}{(n-1)^2}\sum_{i=1}^n (x_i - \bar x)^2.$$
[NOTE: This calculation is made simpler upon noting that $\bar x_i = \frac{n}{n-1}\bar x - \frac{1}{n-1}x_i$.] Therefore, we
can calculate the Jackknife bias estimate as:
$$\hat B_J = (n-1)(\hat\theta_\bullet - \hat\theta) = (n-1)\left\{\frac{n-2}{(n-1)^2} - \frac{1}{n}\right\}\sum_{i=1}^n(x_i-\bar x)^2 = -\frac{1}{n(n-1)}\sum_{i=1}^n(x_i-\bar x)^2,$$
and thus
$$\tilde\theta_J = \hat\theta - \hat B_J = \frac{1}{n}\sum_{i=1}^n(x_i-\bar x)^2 + \frac{1}{n(n-1)}\sum_{i=1}^n(x_i-\bar x)^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)^2,$$
which is exactly unbiased, being the usual sample variance.
An alternative formulation of the Jackknife procedure is based on defining the quantities
$$\tilde\theta_i = \hat\theta + (n-1)(\hat\theta - \hat\theta_i),$$
which are generally referred to as the pseudo-values; the Jackknife bias-corrected estimate
of θ(F) is then given by $\tilde\theta_J = n^{-1}\sum_{i=1}^n \tilde\theta_i$. It is reasonably straightforward to extend these ideas to
develop an estimate of variance as well:
$$\widehat{Var}_J(\hat\theta) = \frac{1}{n(n-1)}\sum_{i=1}^n (\tilde\theta_i - \tilde\theta_J)^2 = \frac{1}{n}\,\tilde s^2,$$
where s̃2 is just the sample variance of the θ̃i ’s. This estimator has obvious intuitive appeal, with
the pseudo-values, θ̃i , used to find an unbiased estimate of θ(F ) or the variance of the estimator
θ(F̂ ) in direct analogy to how the observed data values themselves are used to find an unbiased
estimator of the population mean (i.e., the sample average) or the variance of the mean (i.e., the
usual sample variance divided by the sample size). Indeed, it can easily be shown that when
θ(F ) = EF (X), we have
$$\tilde\theta_i = \theta(\hat F) + (n-1)\{\theta(\hat F) - \theta(\hat F_i)\} = \bar x + (n-1)(\bar x - \bar x_i) = n\bar x - (n-1)\bar x_i = \sum_{j=1}^n x_j - \sum_{j\ne i} x_j = x_i;$$
that is, the pseudo-values are just the observed data values themselves. In this case, the Jackknife
bias estimate is clearly seen to be zero (as it should be, since θ̂ = x is unbiased in this case) and
the Jackknife estimate of variance is just s²/n, where $s^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar x)^2$ is the usual
sample variance. These values are precisely the usual estimates of mean and its standard error.
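The pseudo-value formulation translates directly into code. The following generic sketch (our own helper, assuming numpy, with the variance functional as an illustrative test case) returns the Jackknife bias-corrected estimate together with the bias and variance estimates for an arbitrary functional:

    import numpy as np

    def jackknife(data, stat):
        # Jackknife estimates via the pseudo-values
        # theta~_i = theta^ + (n - 1)(theta^ - theta^_i).
        data = np.asarray(data, dtype=float)
        n = data.size
        theta_hat = stat(data)
        loo = np.array([stat(np.delete(data, i)) for i in range(n)])
        pseudo = theta_hat + (n - 1) * (theta_hat - loo)
        corrected = pseudo.mean()                # bias-corrected estimate
        bias = theta_hat - corrected             # B^_J
        var = np.sum((pseudo - corrected)**2) / (n * (n - 1))
        return corrected, bias, var

    rng = np.random.default_rng(0)
    x = rng.normal(10.0, 2.0, size=30)
    pop_var = lambda d: np.mean((d - d.mean())**2)   # theta(F^), divisor n
    corrected, bias, var = jackknife(x, pop_var)
    print(corrected, np.var(x, ddof=1))      # correction recovers s^2 exactly
    print(bias, -pop_var(x) / (x.size - 1))  # B^_J = -theta^/(n-1), as derived above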
Example 2.12 (cont'd): When θ(F) is the variance functional, so that θ(F) = σF² = E_F[{X −
E_F(X)}²], we can see that the pseudo-values are:
$$\tilde\theta_i = \theta(\hat F) + (n-1)\{\theta(\hat F) - \theta(\hat F_i)\} = n\theta(\hat F) - (n-1)\theta(\hat F_i) = \sum_{j=1}^n (x_j - \bar x)^2 - \sum_{j\ne i}(x_j - \bar x_i)^2.$$
Writing $y_i = x_i - \bar x$ and using the relationship between $\bar x_i$ and $\bar x$ noted above, this expression
simplifies to $\tilde\theta_i = \frac{n}{n-1}\,y_i^2$, so that
$$\tilde\theta_J = \frac{1}{n}\sum_{i=1}^n \tilde\theta_i = \frac{1}{n}\sum_{i=1}^n \frac{n}{n-1}\,y_i^2 = \frac{1}{n-1}\sum_{i=1}^n y_i^2,$$
and therefore:
$$\widehat{Var}_J(\hat\theta) = \frac{1}{n}\,\tilde s^2 = \frac{1}{n(n-1)}\sum_{i=1}^n\left(\tilde\theta_i - \frac{1}{n-1}\sum_{j=1}^n y_j^2\right)^2$$
$$= \frac{1}{n(n-1)}\left\{\frac{n^2}{(n-1)^2}\sum_{i=1}^n y_i^4 - n\left(\frac{1}{n-1}\sum_{i=1}^n y_i^2\right)^2\right\}$$
$$= \frac{n^2}{(n-1)^3}\cdot\frac{1}{n}\sum_{i=1}^n y_i^4 - \frac{n^2}{(n-1)^3}\left(\frac{1}{n}\sum_{i=1}^n y_i^2\right)^2$$
$$= \frac{n^2}{(n-1)^3}\left\{\frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^4 - \left(\frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2\right)^2\right\}.$$
By way of comparison, we note that the exact variance of the estimator $n^{-1}\sum_{i=1}^n (x_i - \bar x)^2$ is
given by:
$$Var_F\{\theta(\hat F)\} = \frac{(n-1)^2}{n^3}\,\mu_{4,F} - \frac{(n-1)(n-3)}{n^3}\,(\sigma_F^2)^2 = \frac{(n-1)^2}{n^3}\left\{\mu_{4,F} - \frac{n-3}{n-1}\,(\sigma_F^2)^2\right\},$$
where µ4,F = EF [{X − EF (X)}4 ] is the fourth central moment of the distribution with CDF
F. Note that for sufficiently large values of n, we have:
$$\frac{n^2}{(n-1)^3} \approx \frac{1}{n} \approx \frac{(n-1)^2}{n^3} \qquad\text{and}\qquad \frac{n-3}{n-1} \approx 1,$$
so that, for large n, the Jackknife variance estimate closely matches this exact variance, with the
sample moments simply replacing the population moments µ4,F and (σF²)².
[NOTE: This calculation follows along identical lines to those used in calculating the Jackknife
variance of $\theta(\hat F) = n^{-1}\sum_{i=1}^n (x_i - \bar x)^2$ above, and is left as an exercise for the reader.]
Unfortunately, there are drawbacks to the Jackknife variance estimate. It turns out that
the Jackknife estimate of variance is not always an accurate (or even consistent) estimate of the
true variance V arF {θ(F̂ )}. In particular, if θ(F ) is defined to be the median of the distribution
with CDF F , the Jackknife estimate of variance is not a valid estimate of the true variance of
the sample median θ(F̂ ). The reasons for this breakdown in the Jackknife variance estimator are
rather technical, and we will not discuss them here. However, the ideas behind the Jackknife lead us
directly to the methods of the next section, wherein we will arrive at a more generally accurate and
valid non-parametric estimate of variance for θ(F̂ ) [as well as for other estimators δ(X1 , . . . , Xn )].
Before we proceed to this new approach, though, we give a brief development of a method
of variance estimation which has close ties to the ideas in the Jackknife (but which actually pre-
dates the Jackknife) called the δ-method. The general idea is based upon simple first-order Taylor
expansion. In particular, suppose that Y = (Y1 , . . . , Yn ) is a random vector with known mean
vector µ = (µ1, . . . , µn)ᵀ and known variance-covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{pmatrix},$$
and let Z = g(Y ) for some differentiable function g(·). In order to estimate the variance of Z in
terms of the mean and variance of Y , we first note that first-order Taylor expansion of g(Y ) about
the point Y = µ yields:
$$Z = g(Y) \approx g(\mu) + \sum_{i=1}^n g_i(\mu)(Y_i - \mu_i) = g(\mu) + \nabla g(\mu)^T (Y - \mu),$$
where $g_i(Y) = \frac{\partial}{\partial Y_i} g(Y)$ and $\nabla g(\mu) = \{g_1(\mu), \ldots, g_n(\mu)\}^T$. From this approximation, we can
directly estimate the variance of Z as
$$Var(Z) \approx Var\{g(\mu) + \nabla g(\mu)^T (Y - \mu)\} = Var\{\nabla g(\mu)^T (Y - \mu)\} = \nabla g(\mu)^T\,\Sigma\,\nabla g(\mu).$$
[Recall that for any constant vector a and any random vector W, we have Var(aᵀW) = aᵀVar(W)a.]
This approximation is generally known as the δ-method estimate of variance. [NOTE: The approx-
imation is based on “linearising” the function g(·), and thus the accuracy of the estimate is closely
tied to how good this linear approximation to g(·) is, particularly near the mean vector µ.] Now,
if we assume that the Xi's are an iid sample from some distribution with known mean µX and
variance σX², then µ = (µX, . . . , µX)ᵀ and Σ is an n × n diagonal matrix with each diagonal element
equal to σX². Thus, we can calculate the δ-method estimate of the variance of an estimator
θ̂t = t(X1, . . . , Xn) as:
$$Var(\hat\theta_t) \approx \nabla t(\mu)^T\,\Sigma\,\nabla t(\mu) = (t_{1,\mu}\ \cdots\ t_{n,\mu})\begin{pmatrix}\sigma_X^2 & 0 & \cdots & 0\\ 0 & \sigma_X^2 & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & \sigma_X^2\end{pmatrix}\begin{pmatrix}t_{1,\mu}\\ \vdots\\ t_{n,\mu}\end{pmatrix} = \sigma_X^2\sum_{i=1}^n t_{i,\mu}^2,$$
where $t_{i,\mu} = t_i(\mu)$ and $t_i(X) = \frac{\partial}{\partial X_i}\,t(X_1, \ldots, X_n)$. Finally, we note that if we do not know
the true mean vector µ and the true variance σX², then we can just substitute any convenient
estimates for them; typically, the most sensible choice would be to use the data vector X itself to
estimate µ and the sample variance s² to estimate σX². For example, taking t(X1, . . . , Xn) = s² =
$(n-1)^{-1}\sum_{j=1}^n (X_j - \bar X)^2$, we have $s_i(X) = \frac{\partial s^2}{\partial X_i} = \frac{2}{n-1}(X_i - \bar X)$, and the δ-method estimate of
the variance of the sample variance becomes:
$$Var(s^2) \approx \sigma_X^2\sum_{i=1}^n (s_{i,\mu})^2 \approx s^2\sum_{i=1}^n (s_{i,X})^2 = s^2\sum_{i=1}^n\left\{\frac{2}{n-1}(X_i - \bar X)\right\}^2 = \frac{4}{n-1}\,s^4.$$
Alternatively, we can view s² as a function of just two sample moments, writing
$s^2 = t(Y) = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2\right)$ with $Y = \left(\bar X,\ \frac{1}{n}\sum_{i=1}^n X_i^2\right)^T$. This vector has mean
µ = (µX, σX² + µX²)ᵀ and variance-covariance matrix
$$\Sigma = \frac{1}{n}\begin{pmatrix}\sigma_X^2 & \mu_{3,X}-\mu_X(\sigma_X^2+\mu_X^2)\\ \mu_{3,X}-\mu_X(\sigma_X^2+\mu_X^2) & \mu_{4,X}-(\sigma_X^2+\mu_X^2)^2\end{pmatrix},$$
where µ3,X = E(X³) and µ4,X = E(X⁴). Furthermore, we can see that $\nabla t(\mu) = \frac{n}{n-1}(-2\mu_X,\ 1)^T$.
Therefore, the δ-method variance estimate for the sample variance in this form is readily calculated
(using matrix multiplication and some straightforward algebra) to be:
$$Var(s^2) \approx \nabla t(\mu)^T\,\Sigma\,\nabla t(\mu) = \frac{n}{(n-1)^2}\left(\mu_{4,X} - 4\mu_X\mu_{3,X} + 6\mu_X^2\sigma_X^2 + 3\mu_X^4 - \sigma_X^4\right) = \frac{n}{(n-1)^2}\left(\tilde\mu_{4,X} - \sigma_X^4\right),$$
where $\tilde\mu_{4,X} = E[\{X - \mu_X\}^4]$ denotes the fourth central moment.
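The two δ-method expressions for Var(s²) can be compared by simulation. The sketch below (our own illustration, assuming numpy, with normal data and an arbitrary n) plugs sample quantities into both forms and contrasts them with the Monte Carlo variance of s²:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 40

    def delta_quick(x):
        # First delta-method form: Var(s^2) ~ 4 s^4 / (n - 1).
        s2 = np.var(x, ddof=1)
        return 4.0 * s2**2 / (x.size - 1)

    def delta_moments(x):
        # Second form: Var(s^2) ~ n (mu_4 - sigma^4) / (n - 1)^2, with the
        # central moments replaced by their sample counterparts.
        mu4 = np.mean((x - x.mean())**4)
        s2 = np.var(x, ddof=1)
        return x.size * (mu4 - s2**2) / (x.size - 1)**2

    x = rng.normal(5.0, 2.0, size=n)
    print(delta_quick(x), delta_moments(x))

    # Monte Carlo "truth": for normal data, Var(s^2) = 2 sigma^4/(n-1) ~ 0.82.
    reps = np.array([np.var(rng.normal(5.0, 2.0, size=n), ddof=1)
                     for _ in range(50_000)])
    print(reps.var())

For normal data the moment-based form lies close to the exact value 2σ⁴/(n−1), while the first, cruder form roughly doubles it; the discrepancy illustrates that the accuracy of the δ-method depends heavily on the quality of the underlying linear approximation.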
2.6.3. The Bootstrap Method: The notion behind the Jackknife pseudo-values, θ̃i, is a reasonable
one. We can "mimic" the behaviour of the random variables Xi, and therefore of the estimator
θ(F̂), under the true distribution F by using the θ̃i values, which are essentially estimates of θ(F̂)
based on "re-samples" (drawn under the distribution F̂) of specified sub-collections of n − 1 of
the original data points. The behaviour of the θ̃i's is then mapped back to estimate the true
behaviour of θ̂ = θ(F̂). However, if we are truly to create a proper analogy for the behaviour of θ̂
under F, it makes more sense to examine the behaviour of the quantity θ̂* = θ(F̂*), where the
distribution F̂* is the empirical distribution associated with the random variables X1*, . . . , Xn* having
distribution F̂. In other words, to examine the behaviour of the quantity θ̂ under the population
distribution F, we simply imagine that our observed data forms its own "population" from which
we randomly sample according to the "true" distribution F̂ and construct an estimate of the "true"
population parameter θ(F̂) using the "re-sampled" data, arriving at θ(F̂*) as our estimator of θ(F̂).
The advantage of this approach is that we "know the truth" regarding θ(F̂), since we know the
"true" distribution F̂. Thus, we can determine exactly (assuming we are willing to conduct the
appropriate algebraic calculations) the bias and variance of θ(F̂*). If the analogy holds, and it will
in most cases, we can then use the bias and variance of θ(F̂*) under F̂ as estimators of the bias
and variance of θ(F̂) under F. This approach is generally referred to as the bootstrap, since we
are using the data itself to estimate its behaviour under F, effectively "pulling ourselves up by our
own bootstraps".
Formally, then, we will define the bootstrap estimators of bias and variance as:
$$\hat B_B = E_{\hat F}\{\theta(\hat F^*)\} - \theta(\hat F); \qquad \widehat{Var}_B\{\theta(\hat F)\} = Var_{\hat F}\{\theta(\hat F^*)\},$$
where F̂* is the empirical distribution function of a random sample X1*, . . . , Xn*, each drawn at
random from the collection X = {X1, . . . , Xn} (i.e., each Xi* has CDF F̂). We note that these
formulae are seen to be directly derived by writing the expressions for the bias and variance of θ(F̂):
$$Bias_F\{\theta(\hat F)\} = E_F\{\theta(\hat F)\} - \theta(F); \qquad Var_F\{\theta(\hat F)\} = E_F\{\theta(\hat F)^2\} - [E_F\{\theta(\hat F)\}]^2,$$
and replacing each instance of F by F̂, and each instance of F̂ by F̂*. Of course, we are now
in the position of having to calculate expectations and variances of θ(F̂*). These calculations are
occasionally possible exactly in the case of simple functionals, θ(·), but generally the necessary
quantities will need to be estimated. Fortunately, and this is the real strength of the bootstrap
method, this can be accomplished in a very computationally straightforward way. First, we note
that since we have the observed values x1, . . . , xn in our possession, we can easily create realisations
of the random sample X1*, . . . , Xn* by simply randomly drawing n values from the collection X =
{x1, . . . , xn} with replacement. Suppose that we repeat this re-sampling exercise a large number
of times, say B, leading to the re-sampled datasets:
$$\{X^*_{1,1},\ldots,X^*_{n,1}\},\ \ldots,\ \{X^*_{1,b},\ldots,X^*_{n,b}\},\ \ldots,\ \{X^*_{1,B},\ldots,X^*_{n,B}\}.$$
In turn, these B "bootstrap" datasets can be used to construct the estimates θ̂*b = θ(F̂*b), where F̂*b
is the empirical distribution function derived from the re-sampled dataset {X*_{1,b}, . . . , X*_{n,b}}. Using
these θ̂*b values we can approximate the bootstrap bias and variance as:
$$\hat B_B = E_{\hat F}\{\theta(\hat F^*)\} - \theta(\hat F) \approx \frac{1}{B}\sum_{b=1}^B \hat\theta^*_b - \hat\theta$$
$$\widehat{Var}_B\{\theta(\hat F)\} = Var_{\hat F}\{\theta(\hat F^*)\} \approx \frac{1}{B-1}\sum_{b=1}^B\left(\hat\theta^*_b - \frac{1}{B}\sum_{c=1}^B \hat\theta^*_c\right)^2.$$
Note that we have simply estimated the expected value and the variance of θ(F̂*) by the sample
average and sample variance of the θ̂*b's, respectively. As such, as long as B is large enough, we
can be certain that these estimates are reasonably accurate (in fact, it can be shown that the
variance of these Monte Carlo approximations decreases linearly in B⁻¹, so that the associated
simulation error is approximately of size B^{−1/2}).
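In code, the Monte Carlo approximation just described takes only a few lines. The following generic sketch (our own helper, assuming numpy; the exponential data and the choices of B and seed are illustrative) computes the bootstrap bias and variance estimates for any functional of the empirical distribution:

    import numpy as np

    def bootstrap_bias_var(data, stat, B=2000, seed=0):
        # Approximate the bootstrap bias and variance estimates using B
        # re-samples of size n drawn with replacement from the data.
        rng = np.random.default_rng(seed)
        data = np.asarray(data)
        n = data.shape[0]
        reps = np.array([stat(data[rng.integers(0, n, size=n)])
                         for _ in range(B)])
        bias = reps.mean() - stat(data)   # ~ E_{F^}{theta(F^*)} - theta(F^)
        var = reps.var(ddof=1)            # ~ Var_{F^}{theta(F^*)}
        return bias, var

    rng = np.random.default_rng(1)
    x = rng.exponential(2.0, size=50)
    pop_var = lambda d: np.mean((d - d.mean())**2)   # divisor-n variance
    print(bootstrap_bias_var(x, pop_var, B=5000))
    print(-pop_var(x) / x.size)  # known bias -theta(F^)/n, for comparison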
We further note that, just as for the Jackknife, the notion of the bootstrap can be extended to
estimators other than θ(F̂ ). In particular, if θ̂δ = δ(X1 , . . . , Xn ) is any estimator of θ(F ), we can
use the bootstrap to estimate the bias and variance of this estimator as:
$$\hat B_B = E_{\hat F}\{\delta(X^*_1,\ldots,X^*_n)\} - \theta(\hat F) \approx \frac{1}{B}\sum_{b=1}^B \delta(X^*_{1,b},\ldots,X^*_{n,b}) - \theta(\hat F)$$
$$\widehat{Var}_B(\hat\theta_\delta) = Var_{\hat F}\{\delta(X^*_1,\ldots,X^*_n)\} \approx \frac{1}{B-1}\sum_{b=1}^B\left(\delta(X^*_{1,b},\ldots,X^*_{n,b}) - \frac{1}{B}\sum_{c=1}^B \delta(X^*_{1,c},\ldots,X^*_{n,c})\right)^2.$$
Note that the bootstrap notion of replacing F by F̂ and F̂ by F̂* has simply been augmented to
include the replacement of Xi by Xi*.
Example 2.13: Suppose that we have observed the following data pairs, which represent the
average LSAT (Law School Admission Test, a common entrance exam for prospective law
school students in the United States) and GPA (grade point average) scores for the 1973 entering
class at a random sample of 15 U.S. law schools (the data are also plotted below):
LSAT GPA ρ̂i − ρ̂ LSAT GPA ρ̂i − ρ̂ LSAT GPA ρ̂i − ρ̂
576 3.39 0.1166 635 3.30 −0.0127 558 2.81 −0.0214
578 3.03 −0.0003 666 3.44 −0.0451 580 3.07 0.0036
555 3.00 0.0082 661 3.43 −0.0402 651 3.36 −0.0246
605 3.13 −0.0003 653 3.12 0.0417 575 2.74 0.0093
545 2.76 −0.0360 572 2.88 −0.0093 594 2.96 0.0035
Suppose that we are interested in estimating the correlation between LSAT scores (Yi ’s) and
GPAs (Zi ’s), so that our functional of interest is
$$\theta(F) = \rho_F = \frac{Cov_F(Y, Z)}{\sqrt{Var_F(Y)\,Var_F(Z)}},$$
where F represents the joint distribution of the pairs Xi = (Yi , Zi ). Further, suppose that we
use the usual correlation estimator
$$\hat\rho = \frac{\sum_{i=1}^n (y_i - \bar y)(z_i - \bar z)}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2\,\sum_{i=1}^n (z_i - \bar z)^2}}.$$
The sample correlation coefficient for these 15 pairs is easily calculated as ρ̂ = 0.7764. Moreover,
the table provides values for ρ̂i − ρ̂ (where ρ̂i is the sample correlation calculated without the ith
data value), which can then be used to create Jackknife pseudo-values, ρ̃i = ρ̂ + (n − 1)(ρ̂ − ρ̂i ).
These pseudo-values can then be used to estimate the bias and variance of ρ̂ as:
$$\hat B_J = \hat\rho - \frac{1}{n}\sum_{i=1}^n \tilde\rho_i = -0.007; \qquad \widehat{Var}_J(\hat\rho) = \frac{1}{n(n-1)}\sum_{i=1}^n\left(\tilde\rho_i - \frac{1}{n}\sum_{j=1}^n \tilde\rho_j\right)^2 = 0.0203.$$
Alternatively, we can select B re-samples from the 15 observed pairs and create bootstrap
replicates of the correlation estimate, ρ̂*b (b = 1, . . . , B). For example, one re-sample might be:
X*1 = X7 = (555, 3.00), X*2 = X15 = (594, 2.96), X*3 = X14 = (572, 2.88),
X*4 = X3 = (558, 2.81), X*5 = X7 = (555, 3.00), X*6 = X14 = (572, 2.88),
X*7 = X7 = (555, 3.00), X*8 = X7 = (555, 3.00), X*9 = X12 = (575, 2.74),
X*10 = X3 = (558, 2.81), X*11 = X6 = (580, 3.07), X*12 = X6 = (580, 3.07),
X*13 = X1 = (576, 3.39), X*14 = X10 = (605, 3.13), X*15 = X12 = (575, 2.74).
For this particular re-sample, we can see that ρ̂* = 0.2585 (note how different this value is
from ρ̂ = 0.7764, indicating that the correlation estimator in this case may be quite variable).
Table 2.2 shows the bootstrap bias and variance estimates based on various values of B (each
replicated three times):
Table 2.2: Bootstrap Bias and Variance Estimates for the Correlation Coefficient
Note that for B = 10, the bootstrap estimates of bias and variance are quite variable (which
was foreshadowed by the fact that the single re-sample we examined earlier yielded a value of
ρ̂* quite different from ρ̂), but by the time B = 10,000 there is essentially no variability in the
estimates. As such, we must be careful when implementing the bootstrap to ensure that we
have chosen a large enough value of B (of course, we do not want to choose an overly large value
as this will incur excessive computational costs and thus make our estimation procedure overly
time consuming). It is generally accepted that bootstrap bias and standard deviation estimates
typically require a few thousand re-samples to ensure that the variability due to the random
selection of re-samples (generally referred to as simulation error) is sufficiently small.
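For this example, the entire bootstrap computation can be scripted in a few lines. The sketch below (assuming numpy; B = 10,000 mirrors the discussion above, while the random seed is an arbitrary choice) re-samples the 15 (LSAT, GPA) pairs with replacement:

    import numpy as np

    lsat = np.array([576, 635, 558, 578, 666, 580, 555, 661, 651,
                     605, 653, 575, 545, 572, 594], dtype=float)
    gpa = np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36,
                    3.13, 3.12, 2.74, 2.76, 2.88, 2.96])

    corr = lambda y, z: np.corrcoef(y, z)[0, 1]
    rho_hat = corr(lsat, gpa)                 # 0.776

    rng = np.random.default_rng(0)
    B, n = 10_000, lsat.size
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)      # re-sample pairs, with replacement
        reps[b] = corr(lsat[idx], gpa[idx])

    print(reps.mean() - rho_hat)              # bootstrap bias estimate
    print(reps.var(ddof=1))                   # bootstrap variance estimate
    # A histogram of reps reproduces the skewed bootstrap histogram of the
    # 10,000 replicates discussed below.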
Finally, we present a plot of the data and a “bootstrap histogram” on the top of the following
page (i.e., a histogram of 10,000 values of ρ̂ calculated on randomly re-sampled datasets). The
plot of the data indicates a reasonably linear relationship (the least-squares linear regression
line is superimposed on the plot), which confirms that the use of a correlation coefficient as a
measure of relationship is reasonable. Moreover, the plot uncovers a potential outlier (which
happens to be the first data point corresponding to an LSAT value of 576 and a GPA value
of 3.39). The presence of this outlier has an adverse effect on the variability of the bootstrap
estimators, which is why we required B = 10, 000 re-samples before the bootstrap estimators
stopped varying noticeably from trial to trial. Of course, what should be done regarding this
outlier is an important subject, but is beyond the scope of these notes. The histogram of the
10,000 bootstrap values also indicates the inherent variability in the ρ̂* values.
[Figure: (left) scatterplot of the average LSAT and GPA scores for the entering classes at 15 U.S.
law schools, with the least-squares regression line superimposed; (right) histogram of 10,000
bootstrap replicates of the correlation coefficient.]
Moreover, the
histogram provides another interesting piece of information; namely, the distribution of the ρ̂*
values is quite skewed. Indeed, the bootstrap histogram yields information regarding the actual
distribution of ρ̂* under F̂, and this information may be used to infer the behaviour (following
the standard bootstrap paradigm) of ρ̂ under F. For comparison purposes, the theoretical
distribution of ρ̂ under the assumption that the Xi's follow a bivariate normal distribution
with true correlation of ρ = 0.7764 is superimposed on the histogram. We will further investigate
the use of this distributional information (in the pursuit of confidence intervals) in subsequent
sections.
The idea behind the bootstrap is powerful and extremely intuitively appealing. Moreover, the
implementation is reasonably easy (though computationally intensive). Why, then, has the boot-
strap not replaced parametric approaches? One drawback is that, as implemented, the bootstrap
method yields a different answer every time (of course, the differences will be very small if B is
large). Another drawback is that if θ(·) is complicated to calculate (perhaps because it is implicitly
defined as the solution to an equation, just as the MLE was) then computing its value for each of
B re-sampled datasets is computationally quite expensive and time consuming. Moreover, as we
have discussed in the previous sections, if we truly believe the parametric structure we have set up,
then the parametric estimators have nice optimal properties. Still, the bootstrap is a very flexible
and widely applicable approach which deserves more attention than it currently gets among statis-
tical practitioners (particularly given the speed with which modern computers can implement its
requirements). Indeed, the bootstrap can even be extended to circumstances beyond the iid setting
on which we have focussed here. Finally, however, a word of warning. We must be somewhat careful
since we cannot always guarantee that applying the bootstrap paradigm (i.e., estimating bias and
variance using quantities derived by replacing F by F̂ and F̂ by F̂* in the defining expressions for
the true bias and variance) will yield valid estimates in more complicated settings (particularly if
the observed data points are not independent of one another).