LECTURE NOTES
1. INTRODUCTION
The primary subject of statistical inference is drawing conclusions about some aspect of a population
of persons or objects based on a set of quantitative observations randomly gathered from that
population, or equivalently, drawing conclusions about the generating process of certain quantities
based on a set of randomly generated observed outcomes from that process. More specifically,
we will be interested in estimating or testing some numerical characteristic(s) of a population or
generating process based on a set of random observations from that population or process and then
assigning some level of confidence to our estimates or conclusions. These notions will no doubt be
familiar concepts from any introductory unit in statistics. Our focus here will be on more fully
developing the underlying theory and philosophy upon which the techniques learned in earlier units
are based and then using these principles to extend our understanding of statistical concepts to a
wider range of situations. As such, a basic knowledge of introductory mathematics and statistics
will be assumed throughout. In particular, we shall assume that the reader is familiar with the
following concepts and areas:
• Single and Multi-variable Differentiation and Integration;
• Maximisation and Minimisation of Functions;
• Taylor-Series Expansions;
• Basic Probability and Random Variables;
• Joint and Marginal Distributions and Independence;
• Moments of Random Variables and Moment Generating Functions;
• The Change of Variable Formula for Probability Densities; and,
• Basic Conditional Distributions and Conditional Expectations.
We note that the reader is only assumed to be familiar with the above (and other related) topics
and not necessarily expert. Indeed, it is not the intention of these notes to provide a rigorous
mathematical development of the theory of statistical inference. Nonetheless, any reasonable un-
derstanding of the development and properties of statistical inference and estimation procedures
must be based to some degree on a firm mathematical foundation. We shall strive, therefore, to use
mathematics as a tool rather than as an end in itself and thus, while completely rigorous proofs will
rarely be provided, basic mathematical explanations and justifications will certainly be presented.
In order to more formally define our task, we shall focus on examining the properties of so-
called probability models. Loosely speaking, a probability model is simply a collection, or family,
of related probability distributions, one of which is believed to fully characterise the population
or process from which a set of observed data values arose. Typically, these models will be termed
parametric when each member of the family of distributions in question is uniquely associated
with (or indexed by) a vector of numerical values, called parameters. To give a specific example,
we might assume that the values of a numerical characteristic of interest among the elements in
a particular population are well described by a normal (or bell-shaped or Gaussian) distribution
with some unspecified expectation (or mean or centre), generally designated by µ, and variance (or
spread), generally designated by σ². In this case, the probability model (i.e., the family of normal
distributions) is indexed (i.e., each member is uniquely identified) by the two values µ and σ². Our
task is thus reduced to estimating or testing hypotheses regarding the true (but unknown) values
of these parameters.
In Section 2 of these notes, we shall start by examining the relatively simple task of estimating
the value of a parameter from a chosen probability model or family. In particular, we shall develop
and discuss theory regarding the construction of estimates and the determination and comparison
of the properties of these estimation procedures. Of course, since estimates by their nature must
be based on random information, they will inevitably contain error (i.e., the observed value of
an estimator will not exactly equal the value of the parameter it is intended to estimate except
in the most special of circumstances). Thus, in addition to providing estimates, we should also
endeavour to provide some measure of how strongly we believe (or how confident we are) in the
precision of our estimated value. This attachment of confidence to an estimator is the subject of
interval estimation which we discuss in Section 3 of these notes. As an alternative to estimating
the values of parameters, we may have specific hypotheses about their true values, the plausibility
of which we can test using the observed data. Such hypothesis testing will be the subject of Section
4, the last section of these notes. Before proceeding on to the details of parametric estimation
and testing, we note that the vast majority of our results will be based on the assumption that the
chosen probability model is indeed correct (i.e., that the population or process under study is indeed
characterised by one member of the family of distributions which comprise the model). Oftentimes
this assumption is either not overly critical or is demonstrably true. Other times, however, there
is some non-negligible doubt associated with the choice of probability model, and methods which
are less constrained to specific parametric families are desirable. Throughout these notes, then, we
shall begin to explore the area of so-called non-parametric procedures, which are a first attempt at
widening the class of probability models available to include those which are not easily indexed by
a finite collection of parameters (e.g., we might wish to use the family of all symmetric distributions
instead of the family of normal distributions, and this new family is not possible to index by just
its expectation and variance, nor indeed by any finite collection of numerical parameters).
Some simple algebra shows that the solution to this system, θ̂ = (µ̂, σ̂²), is given by

µ̂ = m₁(x₁, ..., xₙ) = (1/n) Σ_{i=1}^n x_i = x̄

σ̂² = m₂(x₁, ..., xₙ) − {m₁(x₁, ..., xₙ)}²
   = (1/n) Σ_{i=1}^n x_i² − {(1/n) Σ_{i=1}^n x_i}²
   = (1/n) Σ_{i=1}^n (x_i − x̄)².

Note that the method of moments estimator of σ² is not the usual unbiased estimate, s² = {1/(n−1)} Σ_{i=1}^n (x_i − x̄)² = {n/(n−1)} σ̂².
Now, suppose that we wanted to estimate σ instead of σ². One convenient method is to define the function τ(θ) = τ(µ, σ²) = √(σ²), so that σ = τ(µ, σ²). Thus, a method of moments estimate for σ is given by τ(µ̂, σ̂²) = √(σ̂²) = √{(1/n) Σ_{i=1}^n (x_i − x̄)²}. Alternatively, we might choose to reparameterise our probability model (i.e., index the family by a different set of parameters), and set θ = (µ, σ) and then solve the method of moments equations for µ̂ and σ̂ directly. Generally, either approach will provide the same result (though there are some special, and generally unimportant, cases in which these two approaches lead to different answers).
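As a concrete illustration of these formulas, the following minimal sketch (Python with numpy; the data vector is hypothetical) computes µ̂, σ̂² and the derived estimate σ̂:

```python
import numpy as np

# Hypothetical sample data; any array of observations would do.
x = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5])
n = len(x)

m1 = np.sum(x) / n       # first sample (raw) moment
m2 = np.sum(x**2) / n    # second sample (raw) moment

mu_hat = m1                      # method of moments estimate of mu
sigma2_hat = m2 - m1**2          # equals (1/n) * sum((x - xbar)**2)
sigma_hat = np.sqrt(sigma2_hat)  # derived estimate via tau(mu, sigma^2) = sqrt(sigma^2)

print(mu_hat, sigma2_hat, sigma_hat)
```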
Before moving on to the next estimation procedure, we note that the use of raw moments in the
standard method of moments procedure is by no means required. Indeed, generalisations of the
method of moments procedures which employ matching various other corresponding population and
sample quantities are possible. For instance, we might employ a pre-specified collection of sample
and population percentiles rather than moments, which yields the so-called method of percentiles
estimators (e.g., for the normal distribution we might solve a system of equations based on equating
the theoretical quartiles to the observed sample quartiles). The most common generalisation,
however, is based on replacing the raw and sample moments with the so-called central moments
µ_r = E_θ[{X − E_θ(X)}^r] and m_r = (1/n) Σ_{i=1}^n (x_i − x̄)^r, and derives an estimate by solving the
system of k equations E_θ(X) = x̄ and µ_r = m_r for r = 2, ..., k (note that the first equation
does not involve central moments, since both µ₁ and m₁ are always equal to zero). Another
common generalisation of the method of moments is to employ any k (central) moments for the
k defining equations rather than simply the first k (central) moments. Finally, another common
generalisation, often referred to as the generalised method of moments, is to use the first moment
of k functions gi (·), i = 1, . . . , k in the defining equations. In other words, the generalised method
of moments estimator of θ is the solution to the k equations:
E_θ{g₁(X)} = (1/n) Σ_{i=1}^n g₁(x_i)
      ⋮
E_θ{g_k(X)} = (1/n) Σ_{i=1}^n g_k(x_i).
If the g_i(·)'s are set to g_i(x) = x^i, then we recover the standard method of moments equations.
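As an illustration of these equations, here is a minimal numerical sketch (Python; numpy and scipy assumed) which takes g₁(x) = x and g₂(x) = x² under the normal model, so that E_θ{g₁(X)} = µ and E_θ{g₂(X)} = σ² + µ², and solves the resulting system with a general-purpose root finder:

```python
import numpy as np
from scipy.optimize import fsolve

x = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5])  # hypothetical data

def moment_equations(theta):
    mu, sigma2 = theta
    # E(X) = mu and E(X^2) = sigma^2 + mu^2 under the normal model.
    return [mu - np.mean(x),
            (sigma2 + mu**2) - np.mean(x**2)]

mu_hat, sigma2_hat = fsolve(moment_equations, x0=[0.0, 1.0])
print(mu_hat, sigma2_hat)  # reproduces the standard method of moments estimates
```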
2.1.2. Maximum Likelihood: Perhaps the most flexible, important and intuitively appealing of
all estimation procedures is that of maximum likelihood. Before formally describing this estimation
method, we start with a simple example to demonstrate the concept behind maximum likelihood.
Example 2.2: Suppose that a particular population contains individuals of two types, A and
B. Moreover, suppose that we are told that there are three times more of one type of individual
than the other, but we do not know which of the two types of individuals is the more prevalent.
We would like to know whether it is the type A or type B individuals who are predominant,
and to answer this question we plan to randomly sample 3 individuals. Letting X denote the
number of type A individuals in the sample, it should be clear that X has a binomial distribution
with a number of trials equal to three and a success probability p which is either 0.25 if type B
individuals are the most prevalent or 0.75 if type A individuals are the most prevalent. Based
on this fact, we can determine the probability of X taking on any of its four possible values
(0, 1, 2, or 3) under each of the two possible success probability options using the binomial
probability mass function:
Pr_p(X = x) = {3!/(x!(3−x)!)} p^x (1−p)^{3−x},   x = 0, 1, 2, 3,

which yields the following table of probabilities:

x                    0       1       2       3
Pr_{p=0.25}(X = x)   27/64   27/64   9/64    1/64
Pr_{p=0.75}(X = x)   1/64    9/64    27/64   27/64
Based on this table of probabilities, we can now devise a reasonable estimator for the true
population value of p, based on the notion of the “preponderance of evidence” or the likelihood.
The idea is to select the value of our estimator for p as either 0.25 or 0.75, whichever gives
a larger probability to the event which we actually observed, X = x. In other words, if we
observe zero type A individuals in our sample, we would estimate p as 0.25 since the probability
of observing this sample result when p = 0.25 is much larger than the probability of the observed
sample result under the other alternative, p = 0.75. Formally, we define our estimator as:
p̂ = p̂(x) = { 1/4 if x = 0, 1;  3/4 if x = 2, 3 } = argmax_{p ∈ {1/4, 3/4}} Pr_p(X = x),
In this way, we see that our estimator is that value in the possible parameter set for p which
maximises the probability mass function for the random variable X. The probability mass
function Pr_p(X = x), when treated as a function of the parameter p for a fixed value of x (instead
of the more usual interpretation which treats it as a function of x with a fixed parameter value
p), will be referred to as the likelihood function, and we will write L(p) = Pr_p(X = x), the
notation highlighting the fact that it is a function of p and not x. In this way, we can redefine
our estimator p̂(x) as the value of p within the range of its allowable values (generally referred
to as the parameter space) which maximises L(p). Note that an alternative common sense
estimator might be defined as x/3, but this is clearly less desirable in the present problem since
it will never give the correct answer, its only possible values being 0, 1/3, 2/3 and 1. Of course,
this is due to the description of our problem, which required that p be one of two specific values.
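The discrete maximisation in Example 2.2 is easy to reproduce numerically; the sketch below (Python; scipy assumed) tabulates Pr_p(X = x) for both candidate values of p and selects the maximiser for each possible observed value x:

```python
from scipy.stats import binom

candidates = [0.25, 0.75]
for x in range(4):  # possible observed counts of type A individuals
    likelihoods = {p: binom.pmf(x, 3, p) for p in candidates}
    p_hat = max(likelihoods, key=likelihoods.get)  # argmax over {0.25, 0.75}
    print(x, likelihoods, p_hat)
# prints p_hat = 0.25 for x = 0, 1 and p_hat = 0.75 for x = 2, 3
```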
In the previous example, the choice between two specific values of p made the problem rather
special. However, the notion of maximising a likelihood is easily extended to more general cases. In
particular, if we are not told that p, the proportion of type A individuals in the population, must
be either 0.25 or 0.75, then we can define a general maximum likelihood estimator for p by simply
maximising the likelihood function L(p) over the full range of possibilities; namely the interval
[0, 1]. Of course, doing so now requires simple calculus techniques as opposed to the examination
of a table. Specifically, setting the derivative of the likelihood function equal to zero, yields the
defining equation of the maximum likelihood estimator as:
(d/dp) L(p) |_{p=p̂} = L′(p̂) = {3!/(x!(3−x)!)} {x p̂^{x−1} (1−p̂)^{3−x} − (3−x) p̂^x (1−p̂)^{2−x}} = 0,
which is equivalent to x(1−p̂) − (3−x)p̂ = 0, and hence to p̂ = x/3, which now does equal the
“common sense” estimator of the population proportion of type A individuals.
In general, then, we can define the maximum likelihood estimator of a (vector) parameter θ indexing
a parametric model family having densities fX (x; θ) as follows:
i. The likelihood function for a parameter θ based on a sample of n random variables X1 , . . . , Xn
is defined to be the joint probability density function of the n random variables considered as
a function of the parameter θ:
L(θ) = L(θ; x₁, ..., xₙ) = f_{X₁,...,Xₙ}(x₁, ..., xₙ; θ).
(Throughout these notes, we will interpret the word “density” to mean a probability mass
function if the random variables in question are discrete). Note that if the Xi ’s are indepen-
dent and identically distributed with probability density function fX (x; θ), then the likelihood
function can be written as
L(θ) = Π_{i=1}^n f_X(x_i; θ).
ii. The maximum likelihood estimator (MLE) of a parameter θ is defined to be the value, θ̂ =
θ̂(x₁, ..., xₙ), which maximises the likelihood function L(θ; x₁, ..., xₙ) over the chosen set
of allowable parameter values or parameter space, usually denoted Θ [NOTE: the notation
θ̂(x₁, ..., xₙ) is used to remind us that the MLE, like any other estimator, is a function of
the observed data values]. Typically, the MLE will be the solution to the system of equations
determined by setting the (partial) derivative(s) of the likelihood function equal to zero. In
other words, θ̂ is the solution (in θ) to the (vector) equation ∂L(θ)/∂θ = 0. Of course, in the
event that the solution to these equations does not lie in the specified parameter space Θ, we
must then choose some other method of finding the appropriate restricted maximum.
iii. The form of most common probability densities usually means that the likelihood function
itself can be quite complicated to maximise directly. However, since the natural logarithm is
a monotonically increasing function, it is clear that the value of θ which maximises L(θ) is the
same as the value which maximises the log-likelihood function l(θ) = ln{L(θ)}. Typically, the
log-likelihood function will be much easier to deal with, and indeed, in the case of independent
and identically distributed observations the log-likelihood transforms the product structure of
the likelihood into the much more tractable summation structure
l(θ) = Σ_{i=1}^n ln{f_X(x_i; θ)}.
Using the log-likelihood, we can then define the MLE as the solution to the score equations:
∂l(θ)/∂θ₁ = 0, ..., ∂l(θ)/∂θ_k = 0,
provided the solution exists and is an element of Θ (NOTE: if the solution is not in Θ, then we
must find the MLE by examining the boundary of the set Θ to determine which parameter
value within the parameter space makes the log-likelihood the largest).
We now present some examples of the implementation of the maximum likelihood estimation pro-
cedure:
Example 2.3: Suppose that X1 , . . . , Xn are independent random variables each having a nor-
mal distribution with zero mean and variance σ². In this case, the appropriate density function is:
f_X(x; σ²) = (2πσ²)^{−1/2} exp{−x²/(2σ²)},
which leads to a log-likelihood function of:
l(σ²) = Σ_{i=1}^n ln[(2πσ²)^{−1/2} exp{−x_i²/(2σ²)}] = −(n/2) ln(2πσ²) − {1/(2σ²)} Σ_{i=1}^n x_i².
Differentiating this function with respect to σ² and setting equal to zero yields the MLE of σ² as:
(d/dσ²) l(σ²) = −n/(2σ²) + {1/(2(σ²)²)} Σ_{i=1}^n x_i²
⟹ −n/(2σ̂²) + {1/(2(σ̂²)²)} Σ_{i=1}^n x_i² = 0
⟹ σ̂² = (1/n) Σ_{i=1}^n x_i².
Example 2.4: Suppose that we observe n random vectors X₁ = (X₁₁, X₁₂), ..., Xₙ = (Xₙ₁, Xₙ₂)
each having a bivariate normal distribution with zero mean and variance-covariance matrix
V = [ τ₁+τ₂   τ₂−τ₁ ]
    [ τ₂−τ₁   τ₁+τ₂ ],
with 0 < τ1 ≤ τ2 . [NOTE: This example may seem somewhat contrived, but in fact, with some
minor algebraic modifications, it forms the basis for an extremely important class of statistical
techniques known as mixed linear models or random effects ANOVA models. However, a full
discussion of these models is beyond the scope of these notes.] In this case, the appropriate
density function for the random vectors X_i is:
f_{X_i}(x_{i1}, x_{i2}; τ₁, τ₂) = {4π√(τ₁τ₂)}⁻¹ exp[−{1/(8τ₁τ₂)}{(τ₁+τ₂)(x_{i1}² + x_{i2}²) + 2(τ₁−τ₂) x_{i1} x_{i2}}],
which leads to the log-likelihood function
l(τ₁, τ₂) = Σ_{i=1}^n ln{f_{X_i}(x_{i1}, x_{i2}; τ₁, τ₂)}
         = −(n/2) ln(τ₁τ₂) − {(τ₁+τ₂)/(8τ₁τ₂)} Σ_{i=1}^n (x_{i1}² + x_{i2}²) − {(τ₁−τ₂)/(4τ₁τ₂)} Σ_{i=1}^n x_{i1} x_{i2}.
[NOTE: Technically, there should be an additional term in the log-likelihood of the form
−n ln(4π), but it is common practice to omit any additive term in the log-likelihood which
is completely unrelated to the parameters. The reason for this is that such terms are irrelevant
for the purposes of determining the MLE, as can be seen from the fact that these terms will
disappear upon differentiation with respect to the parameter values performed in deriving the
score equation.] Differentiating this function with respect to τ₁ and τ₂ yields:
∂l(τ₁, τ₂)/∂τ₁ = −n/(2τ₁) + {1/(8τ₁²)} Σ_{i=1}^n (x_{i1}² + x_{i2}²) − {1/(4τ₁²)} Σ_{i=1}^n x_{i1} x_{i2}
              = −n/(2τ₁) + {1/(8τ₁²)} Σ_{i=1}^n (x_{i1} − x_{i2})²
∂l(τ₁, τ₂)/∂τ₂ = −n/(2τ₂) + {1/(8τ₂²)} Σ_{i=1}^n (x_{i1}² + x_{i2}²) + {1/(4τ₂²)} Σ_{i=1}^n x_{i1} x_{i2}
              = −n/(2τ₂) + {1/(8τ₂²)} Σ_{i=1}^n (x_{i1} + x_{i2})².
Setting these derivatives equal to zero and solving yields the MLEs as τ̂₁ = {1/(4n)} Σ_{i=1}^n (x_{i1} − x_{i2})²
and τ̂₂ = {1/(4n)} Σ_{i=1}^n (x_{i1} + x_{i2})², provided that τ̂₁ ≤ τ̂₂. If τ̂₁ > τ̂₂, then the solutions to the score
equations are not in the allowable parameter space, and we must find the MLEs by examining
the boundary of the parameter space. In this case, that means that we must maximise the
likelihood subject to the boundary condition τ₁ = τ₂. Making this substitution into the log-
likelihood function we have
l(τ₁, τ₁) = −n ln(τ₁) − {1/(4τ₁)} Σ_{i=1}^n (x_{i1}² + x_{i2}²).
Differentiating this function and setting equal to zero yields the solution τ̂₁ = τ̂₂ = {1/(4n)} Σ_{i=1}^n (x_{i1}² + x_{i2}²). Therefore, the MLEs for this problem are

τ̂₁ = {1/(4n)} Σ_{i=1}^n (x_{i1} − x_{i2})²   if Σ_{i=1}^n (x_{i1} − x_{i2})² ≤ Σ_{i=1}^n (x_{i1} + x_{i2})², and
τ̂₁ = {1/(4n)} Σ_{i=1}^n (x_{i1}² + x_{i2}²)   otherwise;

τ̂₂ = {1/(4n)} Σ_{i=1}^n (x_{i1} + x_{i2})²   if Σ_{i=1}^n (x_{i1} − x_{i2})² ≤ Σ_{i=1}^n (x_{i1} + x_{i2})², and
τ̂₂ = {1/(4n)} Σ_{i=1}^n (x_{i1}² + x_{i2}²)   otherwise.
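The case analysis in Example 2.4 translates directly into code. The sketch below (Python; the data are simulated via the assumed construction U_i = x_{i1} − x_{i2} ~ N(0, 4τ₁) and V_i = x_{i1} + x_{i2} ~ N(0, 4τ₂), independent, which reproduces the covariance matrix V) computes the score-equation solutions and falls back to the boundary solution when the constraint τ̂₁ ≤ τ̂₂ fails:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau1, tau2 = 200, 0.5, 1.5
# Simulate via U = x1 - x2 ~ N(0, 4*tau1) and V = x1 + x2 ~ N(0, 4*tau2).
u = rng.normal(0.0, np.sqrt(4 * tau1), size=n)
v = rng.normal(0.0, np.sqrt(4 * tau2), size=n)
x1, x2 = (v + u) / 2, (v - u) / 2

sum_diff = np.sum((x1 - x2) ** 2)  # = sum of U_i^2
sum_sum = np.sum((x1 + x2) ** 2)   # = sum of V_i^2

if sum_diff <= sum_sum:            # interior solution satisfies tau1 <= tau2
    tau1_hat, tau2_hat = sum_diff / (4 * n), sum_sum / (4 * n)
else:                              # boundary solution with tau1 = tau2
    tau1_hat = tau2_hat = np.sum(x1**2 + x2**2) / (4 * n)
print(tau1_hat, tau2_hat)
```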
Before we move on to a brief discussion of some other general estimation methods, we note that our
discussion of maximum likelihood estimation so far has only enabled us to estimate θ, the parameter
(vector) itself. Recall, however, that we are more generally interested in estimation of functions of
our parameters, τ = τ (θ). If τ (·) is a one-to-one vector function of θ, then we can “reparameterise”
our family of distributions, using the new parameter τ = τ (θ) and then implement our maximum
likelihood procedure on the newly indexed family. Essentially, this amounts to “renaming” each
member of the family, which in turn reduces to employing the chain rule for differentiation on the
score equations to arrive at new objective functions for deriving the MLE τ̂. Fortunately, none
of this is explicitly necessary, since some simple calculus and algebraic computations demonstrate
that for any function τ = τ(θ), the MLE of τ is given by τ̂ = τ(θ̂). This property is known as
functional equivariance of the MLE, and is formally stated and proved in the following theorem:
Theorem 2.1: Let x₁, ..., xₙ be an iid sample from a distribution having likelihood function
L(θ; x₁, ..., xₙ). Also, let θ̂ = θ̂(x₁, ..., xₙ) be the MLE of θ based on this likelihood function.
For any function τ = τ(θ), we can define the likelihood function induced by τ(·) as
M(τ; x₁, ..., xₙ) = sup_{θ: τ(θ)=τ} L(θ; x₁, ..., xₙ)
and τ̂, the MLE of τ, is then defined as the value which maximises this induced likelihood. In
such circumstances, τ̂ = τ(θ̂).
Proof: To show that τ̂ = τ (θ̂), we need to demonstrate that τ (θ̂) maximises the induced
likelihood M (τ ; x1 , . . . , xn ). In other words, we need to show that
M{τ(θ̂); x₁, ..., xₙ} ≥ M(τ; x₁, ..., xₙ) for every value of τ. To this end, note that
M(τ; x₁, ..., xₙ) = sup_{θ: τ(θ)=τ} L(θ; x₁, ..., xₙ)
                 ≤ sup_{θ∈Θ} L(θ; x₁, ..., xₙ)
                 = L(θ̂; x₁, ..., xₙ)
                 = sup_{θ: τ(θ)=τ(θ̂)} L(θ; x₁, ..., xₙ)
                 = M{τ(θ̂); x₁, ..., xₙ},
where the first inequality follows from the fact that the range over which the supremum is being
taken has been enlarged, the second equality follows from the definition of the M LE θ̂, the third
equality follows from the fact that the point θ = θ̂ remains in the range over which the supremum
is being taken, and the final equality follows from the definition of the induced likelihood. Thus,
we have demonstrated that M{τ(θ̂); x₁, ..., xₙ} ≥ M(τ; x₁, ..., xₙ) for all values of τ, which
proves that τ(θ̂) is the value which maximises the induced likelihood M(τ; x₁, ..., xₙ). In other
words, the MLE of τ is τ̂ = τ(θ̂), as was required.
2.1.3. Other Estimation Methods: There are many other estimation procedures which have
been developed, and we will study one of them in more detail in Section 2.5; namely, Bayesian
estimation. However, we here only briefly mention some of the general aspects of a few other
estimation procedures. The most common type of estimation procedure which we have not covered
so far is generally constructed by finding a value for an estimator which minimises some measure of
“distance” between the observed data and the distribution family of the chosen probability model.
Three of the most common choices for measuring this distance are least-squares, minimum chi-
square and minimum Kolmogorov distance. We now briefly describe these methods in the case
where we have observed the realisations x1 , . . . , xn of the random variables X1 , . . . , Xn assumed to
have come from a distribution belonging to a probability model indexed by the parameter θ and
having CDFs F_X(x; θ) and pdfs f_X(x; θ):
• Least-Squares - Choose θ̂, the estimate of θ, to be the value which minimises the distance
function:
d(θ) = Σ_{i=1}^n {x_i − E_θ(X_i)}²,
and then estimate τ by τ̂ = τ(θ̂).
• Minimum Chi-square - Group the observed data into k classes, let n_j denote the number of
observations falling in the j-th class, and let p_j(θ) denote the probability, under the model with
parameter θ, of an observation falling in the j-th class. Choose θ̂, the estimate of θ, to be the
value which minimises the distance function:
d(θ) = Σ_{j=1}^k {n_j − n p_j(θ)}²/{n p_j(θ)}.
Again, estimate τ by τ̂ = τ (θ̂). We note that the distance function d(θ) defined here is closely
related to the Kullback-Leibler distance and the entropy measure, which have the general form:
e(θ) = Σ_{j=1}^k n_j ln{n_j/(n p_j(θ))}.
• Minimum Kolmogorov distance - First, define the empirical distribution function, F̂n (x), by
F̂ₙ(x) = (1/n) Σ_{i=1}^n I(x_i ≤ x).
Note that F̂n (x) represents the proportion of data points less than or equal to the specified
value x (i.e., it is the CDF of the distribution with probability n⁻¹ on each of the observed
values x_i). Choose θ̂, the estimate of θ, to be the value which minimises the distance function:
d(θ) = sup_x |F̂ₙ(x) − F_X(x; θ)|.
In other words, we choose the value of θ which minimises the maximum vertical distance
between the chosen family of CDF s and the observed CDF of the data values. As before,
estimate τ by τ̂ = τ (θ̂).
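A minimal sketch of minimum Kolmogorov distance estimation (Python; numpy and scipy assumed; the model family is taken, purely for illustration, to be normal with unknown mean and known unit variance). Note that the supremum over x is attained at, or just before, one of the jumps of F̂ₙ:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = np.sort(rng.normal(loc=3.0, scale=1.0, size=100))
n = len(x)

def kolmogorov_distance(mu):
    # Model CDF (normal, unknown mean, known unit variance) at the data points.
    F = norm.cdf(x, loc=mu, scale=1.0)
    # The empirical CDF jumps from (i-1)/n to i/n at the i-th order statistic.
    upper = np.arange(1, n + 1) / n
    lower = np.arange(0, n) / n
    return max(np.max(np.abs(upper - F)), np.max(np.abs(lower - F)))

opt = minimize_scalar(kolmogorov_distance, bounds=(0.0, 6.0), method="bounded")
print(opt.x)  # minimum Kolmogorov distance estimate of mu (close to 3)
```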
In closing, we note that the reason that these estimation procedures are not covered in more
detail is that they generally are extremely difficult to implement in practice, and as such are
not commonly employed in real estimation problems. Nonetheless, they do demonstrate a very
intuitively appealing idea in the approach to estimation; namely, the idea of minimising some
measure of distance between the observed data and the theoretical model chosen to describe the
population from which the data arose.
2.2. Properties of Estimators
In the preceding sections we introduced a variety of estimators, generally justified on reasonably
intuitive grounds. We now wish to establish some criteria on which we can base comparisons of our
estimators. In particular, we would like to decide which estimator is “best” for a given problem.
Before we introduce these criteria and discuss the associated properties of the estimators we have
introduced, we need to make a distinction between two general types of comparison criteria. The
two major classes of criteria are distinguished by their relationship to the size of the sample on which
the estimator is based. Specifically, properties based on the estimation procedure as it pertains to
any fixed sample size are referred to as small-sample properties. Alternatively, properties which
pertain to the behaviour of an estimation procedure as the sample size increases without bound
are referred to as large-sample or asymptotic properties.
2.2.1. Bias and Mean Squared Error: The most common measure of how “close” to its target
an estimator tends to be is the mean-squared error or MSE. For any estimator T = t(X₁, ..., Xₙ)
of the quantity τ = τ(θ), the MSE is defined as:
MSE_t(θ) = E_θ[{T − τ(θ)}²],
where the notation MSE_t(θ) is used to indicate the dependence of the mean-squared error on both
the estimator in question and the value of the underlying parameter θ.
The MSE can be partitioned into two important components, based on the relationship:
MSE_t(θ) = E_θ{(T − τ)²}
         = E_θ[{T − E_θ(T)} + {E_θ(T) − τ}]²
         = E_θ[{T − E_θ(T)}²] + 2E_θ[{T − E_θ(T)}{E_θ(T) − τ}] + {E_θ(T) − τ}²
         = Var_θ(T) + 2{E_θ(T) − τ} E_θ{T − E_θ(T)} + {E_θ(T) − τ}²
         = Var_θ(T) + {Bias_θ(T)}²,
where the final equality follows from the fact that E_θ{T − E_θ(T)} = E_θ(T) − E_θ(T) = 0 and we
have defined Bias_θ(T) = E_θ(T) − τ to be the bias of the estimator T (i.e., the difference between
the expectation of the estimator and the quantity which it is being used to estimate). Using the
MSE, we can now compare estimation procedures:
Example 2.1 (cont’d): We have seen that the standard method of moments estimator (and
indeed the MLE as well) of the parameter σ², based on X₁, ..., Xₙ, a sample of size n from a
normal distribution with mean µ and variance σ², is σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄)². Alternatively, we
know that the standard unbiased estimator of σ² is the usual sample variance, s² =
{1/(n−1)} Σ_{i=1}^n (X_i − X̄)². It is a simple (though tedious) calculation to show that:
Var_{µ,σ²}(s²) = 2σ⁴/(n−1),
and the demonstration of this fact is left as an exercise. Since we know that s² is unbiased, it
is clear that MSE_{s²}(µ, σ²) = Var_{µ,σ²}(s²). Now, we can write σ̂² = n⁻¹(n−1)s², so that:
E_{µ,σ²}(σ̂²) = E_{µ,σ²}{(n−1)s²/n} = {(n−1)/n} E_{µ,σ²}(s²) = (n−1)σ²/n,
and
Bias_{µ,σ²}(σ̂²) = E_{µ,σ²}(σ̂²) − σ² = (n−1)σ²/n − σ² = −σ²/n,
Var_{µ,σ²}(σ̂²) = Var_{µ,σ²}{(n−1)s²/n} = {(n−1)/n}² Var_{µ,σ²}(s²) = 2(n−1)σ⁴/n².
The difference between the two mean-squared errors is therefore
MSE_{s²}(µ, σ²) − MSE_{σ̂²}(µ, σ²) = 2σ⁴/(n−1) − {2(n−1)σ⁴/n² + σ⁴/n²} = (3n−1)σ⁴/{n²(n−1)},
which is clearly positive for any sample size n ≥ 2. In other words, despite the fact that s²
is unbiased, σ̂² has smaller MSE. Moreover, suppose we define another estimator as σ̂²_c = cs²
for some constant c. In this case, we can again easily calculate:
Bias_{µ,σ²}(σ̂²_c) = E_{µ,σ²}(σ̂²_c) − σ² = cσ² − σ² = (c − 1)σ²
and
Var_{µ,σ²}(σ̂²_c) = Var_{µ,σ²}(cs²) = c² Var_{µ,σ²}(s²) = 2c²σ⁴/(n − 1).
Therefore, the MSE of this new estimator is given by
MSE_{σ̂²_c}(µ, σ²) = Var_{µ,σ²}(σ̂²_c) + {Bias_{µ,σ²}(σ̂²_c)}² = 2c²σ⁴/(n − 1) + (c − 1)²σ⁴.
Differentiating this expression with respect to c and equating to zero shows that:
4cσ⁴/(n − 1) + 2(c − 1)σ⁴ = 0  ⟹  4c + 2(c − 1)(n − 1) = 0
                              ⟹  {4 + 2(n − 1)}c = 2(n − 1)
                              ⟹  c = (n − 1)/(n + 1).
It is straightforward to verify that this value of c yields a minimum, and thus, among all
estimators of the form cs², the one with the minimum MSE is {(n−1)/(n+1)}s² = {1/(n+1)} Σ_{i=1}^n (X_i − X̄)²,
which is neither the MLE, the method of moments estimator nor the usual unbiased estimator.
[NOTE: We have not shown that this new estimator has the smallest possible MSE of any
estimator, only that it has the smallest MSE among those having the form cs² for some constant c.]
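The ordering of the three MSEs derived in this example is easy to verify by Monte Carlo simulation; a minimal sketch (Python; the values of n, σ² and the number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, reps = 10, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

for name, est in [("s^2 (unbiased)", ss / (n - 1)),
                  ("MLE sigma^2-hat", ss / n),
                  ("minimum-MSE cs^2", ss / (n + 1))]:
    print(name, np.mean((est - sigma2) ** 2))  # Monte Carlo MSE
# the estimated MSEs decrease down the list, as the theory predicts
```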
Ideally, we would like to find an estimator T = t(X₁, ..., Xₙ) which has minimal MSE, so that
for any other estimator T₁ = t₁(X₁, ..., Xₙ) we have MSE_t(θ) ≤ MSE_{t₁}(θ) for all values of
θ ∈ Θ. Unfortunately, it is easy to see that such an estimator cannot exist (except in the most
unusual of circumstances). To demonstrate this, we define the estimator T₀ = t₀(X₁, ..., Xₙ) ≡
τ(θ₀) = τ₀ (i.e., T₀ is the estimator which always yields an estimate equal to some pre-specified
value τ₀ regardless of the observed data values) and note that MSE_{t₀}(θ) = {Bias_{t₀}(θ)}² = {τ₀ − τ(θ)}²,
so that MSE_{t₀}(θ₀) = 0. Thus, since MSEs are clearly non-negative, no estimator will have smaller
MSE than T₀ when θ = θ₀. Of course, for other values of θ, T₀ is an extremely silly estimator, but
this example demonstrates the difficulty of finding the “best” estimator uniformly over all possible
values of θ. Indeed, if we imagine T₀-type estimators for each possible parameter value in Θ, then
the following theorem shows that if an estimator T = t(X₁, ..., Xₙ) is to have smaller MSE than
all of these estimators over the entire range of Θ, then it must have MSE_t(θ) ≡ 0.
Theorem 2.2: Suppose that X1 , . . . , Xn are an iid sample from a distribution with density
function fX (x; θ) belonging to a family indexed by the parameter θ ∈ Θ. If T = t(X1 , . . . , Xn )
is an estimator of τ = τ(θ) satisfying MSE_t(θ) ≤ MSE_{t′}(θ) for all θ ∈ Θ and any other
estimator T′ = t′(X₁, ..., Xₙ) [i.e., T has uniformly minimal MSE], then MSE_t(θ) = 0 for all
θ ∈ Θ.
Proof: Pick any value θ₀ ∈ Θ and define the estimator T₀ = t₀(X₁, ..., Xₙ) ≡ τ(θ₀). Clearly,
MSE_{t₀}(θ₀) = 0. Therefore, since we have assumed that T has uniformly minimal MSE, we
must have MSE_t(θ₀) ≤ MSE_{t₀}(θ₀) = 0. Since MSEs are non-negative quantities, it must be
the case that MSE_t(θ₀) = 0. Finally, since the original choice of θ₀ was arbitrary, the preceding
argument is valid for any choice of θ₀, meaning that MSE_t(θ) = 0 for any value of θ ∈ Θ.
In other words, the only possible estimator with minimal MSE over the full range of the parameter
space is one with an MSE which is uniformly zero, and generally speaking such estimators do not
exist since they must be both unbiased and have no variance (i.e., they must be exactly correct for
any sample values x1 , . . . , xn ).
One reason for being unable to find an estimator with uniformly smallest MSE over all values of
θ ∈ Θ is that there are simply too many possible estimators (as the silly estimators in the preceding
discussion demonstrate). One solution to this problem is to restrict the class of allowable estimators
t(·), for instance by requiring the allowable estimators to be unbiased, so that Biast (θ) = 0 for all
θ ∈ Θ. We will further investigate this possibility in later sections.
2.2.2. Location and Scale Equivariance: At the end of the previous subsection, we noted that
we might restrict attention to unbiased estimators in an effort to reduce the class of allowable
estimators enough so that an “optimal” estimator, in terms of minimal MSE, might be found.
In this section, we investigate alternative “common sense” properties which might be used for the
same purpose in certain settings.
First, suppose that we are estimating a scalar quantity τ = τ (θ) which can be interpreted as
the “centre” or “location” of the underlying distribution family. Such quantities τ are referred to
as location parameters and are formally defined as follows:
Definition 2.1: Let {fX (x; θ), θ ∈ Θ} be a family of distributions with density functions
fX (x; θ). Suppose that there is a function h(·) such that fX (x; θ) = h{x − τ (θ)}. If such a
function exists, then τ = τ (θ) is a location parameter. Equivalently, it is not difficult to show
that the preceding description implies that τ = τ (θ) is a location parameter for the family of
densities if and only if the density function of the new random variable Y = X − τ (θ) does not
depend on θ.
An obvious (and easily demonstrated) property of location parameters is that if X has density
fX (x; θ) = h{x − τ (θ)} then W = X + c has density h{(w − c) − τ (θ)} = h[w − {τ (θ) + c}]. In
other words, if τ = τ (θ) is a location parameter for the distribution family associated with an iid
sample of X’s, then τ + c is a location parameter for the distribution family associated with the
corresponding W ’s. The idea here is that “shifting” all of the observed data by a fixed amount
has the effect of shifting its location by the same amount. As such, it seems reasonable that any
estimator we choose for τ should have the corresponding “shift” property. That is, we would like our
estimation procedure to produce an estimate based on the shifted data which is just the estimate
based on the original data shifted by the appropriate amount. Estimators with this property are
said to be location equivariant. Formally, an estimator T = t(X1 , . . . , Xn ) is location equivariant
if it satisfies:
t(X1 + c, . . . , Xn + c) = t(X1 , . . . , Xn ) + c,
for any constant value c.
We note that most of the usual estimators of location are indeed location equivariant. For
example, clearly median(X1 + c, . . . , Xn + c) = median(X1 , . . . , Xn ) + c, so the median is a location
equivariant estimator. Similarly, the sample mean is location equivariant, since
t(X₁ + c, ..., Xₙ + c) = (1/n) Σ_{i=1}^n (X_i + c) = (1/n) Σ_{i=1}^n X_i + c = t(X₁, ..., Xₙ) + c.
It can also be shown that, among all location equivariant estimators, the estimator
t(X₁, ..., Xₙ) = ∫ u Π_{i=1}^n h(X_i − u) du / ∫ Π_{i=1}^n h(X_i − u) du
has uniformly minimum MSE and is known as the Pitman estimator of location (estimators which
have uniformly minimal MSE among the class of location equivariant estimators are sometimes
referred to as MRE or minimum risk equivariant estimators). While this estimator seems quite
complicated, it can be shown that it reduces dramatically for many of the common distribution
families. In particular, if fX (x; θ) is the normal density with mean θ and known variance, then
τ = τ (θ) = θ is the location parameter and the Pitman estimator of location reduces to the sample
average (i.e., for a normal population mean, the sample average has uniformly minimal MSE
among all location equivariant estimators).
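A quick numerical check of location equivariance for two familiar location estimators (Python; the data and the shift c are arbitrary):

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0, 7.0, 4.0])  # hypothetical data
c = 10.0                                 # arbitrary shift

# Both estimators satisfy t(x + c) = t(x) + c.
print(np.mean(x + c), np.mean(x) + c)      # equal
print(np.median(x + c), np.median(x) + c)  # equal
```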
Alternatively, suppose that we are interested in estimating a scalar quantity τ = τ (θ) which
can be interpreted as the “spread” or “scale” of the underlying distribution family. Such quantities
τ are referred to as scale parameters and are formally defined as follows:
Definition 2.2: Let {fX (x; θ), θ ∈ Θ} be a family of distributions with density functions
f_X(x; θ). Suppose that there is a function h(·) such that f_X(x; θ) = {τ(θ)}⁻¹ h(x{τ(θ)}⁻¹). If
such a function exists, then τ = τ(θ) is a scale parameter (NB: note that this definition requires
τ(θ) > 0 for all θ ∈ Θ, since density functions must be non-negative). Equivalently, it can be
shown that the preceding description implies that τ = τ (θ) is a scale parameter for the family
of densities if and only if the density function of the new random variable Y = X/τ (θ) does not
depend on θ.
An important property of scale parameters is that if X has density f_X(x; θ) = {τ(θ)}⁻¹ h(x{τ(θ)}⁻¹)
then W = cX has density {cτ(θ)}⁻¹ h(w{cτ(θ)}⁻¹) when c > 0 and density
{|c|τ(θ)}⁻¹ h(−w{|c|τ(θ)}⁻¹) = {|c|τ(θ)}⁻¹ h₁(w{|c|τ(θ)}⁻¹)
when c < 0 and the function h1 is defined by the relationship h1 (x) = h(−x). In either case, we see
that if τ = τ (θ) is a scale parameter for the distribution family associated with an iid sample of X’s,
then |c|τ is a scale parameter for the distribution family associated with the corresponding W ’s.
The idea here is that “shrinking” or “expanding” all of the observed data by a fixed amount has the
effect of changing its scale by the same amount. As such, it seems reasonable that any estimator
we choose for τ should have the corresponding property. That is, we would like our estimation
procedure to produce an estimate based on the scaled data which is just the estimate based on the
original data multiplied by the appropriate scale factor. Estimators with this property are said to
be scale equivariant. Formally, an estimator T = t(X₁, ..., Xₙ) is scale equivariant if it satisfies:
t(cX₁, ..., cXₙ) = |c| t(X₁, ..., Xₙ),
for any constant value c. For example, the sample standard deviation satisfies
s(cX₁, ..., cXₙ) = √[{1/(n−1)} Σ_{i=1}^n {cX_i − (1/n) Σ_{j=1}^n cX_j}²] = |c| √[{1/(n−1)} Σ_{i=1}^n (X_i − X̄)²] = |c| s(X₁, ..., Xₙ),
so that the sample standard deviation is also seen to be a scale equivariant estimator. [NOTE:
The preceding calculation actually uses the fact that the sample mean is also a scale equivariant
estimator (which is easily seen from a quick algebraic calculation), even though it is not normally
thought of as a scale estimator.] Finally, we note that in addition to scale equivariance, another
desirable property of scale estimators is that they do not change if a fixed constant is added to
each of the observed data values (since such a transformation would not change the scale of the
values only their location). Estimators which have such a property are called location invariant.
Formally, an estimator T = t(X1 , . . . , Xn ) is location invariant if it satisfies:
t(X1 + c, . . . , Xn + c) = t(X1 , . . . , Xn ),
for any constant value c. Most of the usual estimators of scale are not only scale equivariant but
location invariant as well (e.g., the IQR and the sample standard deviation are location invariant
as well as scale equivariant).
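Both scale equivariance and location invariance are easy to check numerically for the sample standard deviation and the IQR (Python; a negative multiplier c is used deliberately, to exercise the |c| in the definition):

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0, 7.0, 4.0])
c, shift = -3.0, 10.0

sd = lambda v: np.std(v, ddof=1)                         # sample standard deviation
iqr = lambda v: np.percentile(v, 75) - np.percentile(v, 25)

for t in (sd, iqr):
    print(t(c * x), abs(c) * t(x))  # scale equivariance: t(c*x) = |c| * t(x)
    print(t(x + shift), t(x))       # location invariance: t(x + shift) = t(x)
```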
2.2.3. Consistency and Asymptotic Efficiency: The previous sections have defined properties
of estimators for a fixed sample X1 , . . . , Xn of size n. In other words, these were small sample
properties. We now turn our attention to two new properties of estimators which are defined
asymptotically; that is, as the sample size grows without bound. Recall that such properties are
termed “large sample”. In such situations, we will generally denote the estimator based on a given
sample size n by Tn = tn (X1 , . . . , Xn ) and then examine the limiting properties of the sequence of
estimators {Tn }n=1,2,... as n tends towards infinity.
The first large sample property we will discuss deals with the notion of an estimation proce-
dure eventually yielding an essentially exactly correct result given sufficiently large samples. The
formalisation of this notion is termed consistency and can be defined as follows:
Definition 2.3: Let T1 , T2 , . . . be a sequence of estimators of τ (θ), where Tn = tn (X1 , . . . , Xn ).
The sequence {Tₙ}_{n=1,2,...} is weakly consistent if for every ε > 0,
lim_{n→∞} Pr_θ{τ(θ) − ε < Tₙ < τ(θ) + ε} = 1.
In other words, a sequence of estimators is weakly consistent as long as the probability that it is
eventually within any small interval around the true value τ (θ) tends towards one. This idea can be
seen as the formalisation of the notion that, as the amount of information increases, our estimation
procedure should give better and better estimates with larger and larger probability.
We note, however, that just because a sequence of estimators is weakly consistent does not
necessarily imply that it has any nice small sample properties. For instance, it is possible for a
sequence of estimators to be weakly consistent even though each member of the sequence is biased;
that is, E_θ(Tₙ) ≠ τ(θ) for any n. Indeed, it need not even be the case that the bias vanishes
in the limit; that is, that lim_{n→∞} E_θ(Tₙ) = τ(θ). Now, at the least, it seems reasonable to ask that a
sequence of estimators have this last property, generally referred to as the estimator sequence being
asymptotically unbiased. It turns out that we can ensure this behaviour if we define a stronger
kind of consistency:
Definition 2.4: Let T1 , T2 , . . . be a sequence of estimators of τ (θ), where Tn = tn (X1 , . . . , Xn ).
The sequence {Tₙ}_{n=1,2,...} is mean-square consistent if and only if
lim_{n→∞} E_θ[{Tₙ − τ(θ)}²] = 0.
It can be shown that if a sequence of estimators is mean-square consistent then it must be asymp-
totically unbiased (a fact which follows directly from the relationship between the MSE and the
variance and bias of the estimator Tn ). Moreover, if an estimator is mean-square consistent it must
also be weakly consistent (of course, as noted earlier, the reverse implication is not true). The
demonstration of this fact relies on the so-called Chebychev inequality, which states that for any
random variable Z and any constants a > 0 and c it must be the case that
Pr(|Z − c| ≥ a) ≤ E{(Z − c)²}/a².
To see this, suppose that Z has density function fZ (z), and note that
E{(Z − c)²} = ∫_{−∞}^{∞} (z − c)² f_Z(z) dz
           = ∫_{|z−c|<a} (z − c)² f_Z(z) dz + ∫_{|z−c|≥a} (z − c)² f_Z(z) dz
           ≥ ∫_{|z−c|≥a} (z − c)² f_Z(z) dz
           ≥ ∫_{|z−c|≥a} a² f_Z(z) dz
           = a² ∫_{|z−c|≥a} f_Z(z) dz
           = a² Pr(|Z − c| ≥ a),
which provides the desired result after some simple algebraic rearrangement. Now, using this result
we note that
Pr_θ{τ(θ) − ε < Tₙ < τ(θ) + ε} = Pr_θ{|Tₙ − τ(θ)| < ε}
                               = 1 − Pr_θ{|Tₙ − τ(θ)| ≥ ε}
                               ≥ 1 − E_θ[{Tₙ − τ(θ)}²]/ε².
Thus, if the sequence {Tₙ}_{n=1,2,...} is mean-square consistent, so that lim_{n→∞} E_θ[{Tₙ − τ(θ)}²] = 0,
we see that
lim_{n→∞} Pr_θ{τ(θ) − ε < Tₙ < τ(θ) + ε} ≥ 1.
Of course, since probabilities cannot exceed unity, this inequality must be an equality, which is
precisely the defining equation for weak consistency.
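The defining limit of weak consistency can be illustrated by simulation; the sketch below (Python; normal data and ε = 0.1 are arbitrary choices) estimates Pr_θ{|X̄ₙ − µ| < ε} for increasing n:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, eps, reps = 1.0, 0.1, 5000

for n in (10, 100, 1000):
    xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) < eps))
# the estimated probability increases towards one as n grows
```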
We close this section with the second of our large sample properties for estimators. This
property is generally referred to as asymptotic relative efficiency and to define it, we must first
define the notion of asymptotic normality. Of course, all standard introductions to statistical
inference teach the Central Limit Theorem, and thus we are familiar with the concept of a random
variable having a normal distribution “in the limit” as the sample size increases, but this notion
is rarely defined more precisely in introductory units. Here we will start to give a more formal
definition of what it means for something to have a normal distribution “in the limit”:
Definition 2.5: Let Z1 , Z2 , . . . be a sequence of random variables with cumulative distribution
functions F1 (z), F2 (z), . . .. The sequence {Zn }n=1,2,... is said to be asymptotically normal if:
i. lim_{n→∞} E(Zₙ) = µ for some value µ;
ii. lim_{n→∞} Var(Zₙ) = σ² > 0 for some positive value σ²; and,
iii. lim_{n→∞} Fₙ(z) = Φ{(z − µ)/σ} for all z ∈ (−∞, ∞), where Φ(·) is the CDF of the standard
normal distribution.
[NOTE: While this definition provides an explanation of what it means for the distribution of a
sequence of random variables to converge to a normal distribution (and, indeed, the above definition
is an example of a more general concept known as “convergence in distribution”), it is rarely very
practical to demonstrate that a sequence of random variables is asymptotically normal by examining
the limit of their CDFs. Generally, it is easier (and turns out to be equivalent) to show that the
associated moment generating functions of the Zn ’s converge to the moment generating function
of a normal distribution with mean µ and variance σ 2 .]
Once we have a formal notion of what it means for a sequence of random variables to be
asymptotically normal, we can then define asymptotic relative efficiency as follows:
Definition 2.6: Let T1 , T2 , . . . and U1 , U2 , . . . be two weakly consistent sequences of estimators
of τ(θ), and define the new random variables Zₙ = √n{Tₙ − τ(θ)} and Wₙ = √n{Uₙ − τ(θ)}.
Further, assume that the sequences {Zₙ}_{n=1,2,...} and {Wₙ}_{n=1,2,...} are asymptotically normal
with means µ_Z = µ_W = 0 and variances σ²_Z = σ²_Z(θ) and σ²_W = σ²_W(θ), where, as the notation
suggests, the limiting variances of the Zₙ's and the Wₙ's depend on the true underlying value
of the parameter θ. The asymptotic relative efficiency of the sequence {Tₙ}_{n=1,2,...} with respect
to the sequence {Uₙ}_{n=1,2,...} is defined as e_{T,U} = σ²_W/σ²_Z.
As a simple example of this concept, suppose that X₁, ..., Xₙ are a sample from a normal population
with mean µ and variance σ². The usual sequence of estimators for µ, X̄ₙ = (1/n) Σ_{i=1}^n X_i, is well
known to be weakly consistent (indeed, it is mean-square consistent, which follows from the Law
of Large Numbers), and the sequence of random variables Zₙ = √n(X̄ₙ − µ) is well known to
be asymptotically normal with mean zero and variance σ² (by the Central Limit Theorem). It
can be shown (though it is rather difficult and thus omitted here) that the sequence of estimators
X̃ₙ = median(X₁, ..., Xₙ) is also weakly consistent and that the sequence of random variables Wₙ =
√n(X̃ₙ − µ) is asymptotically normal with mean zero and variance σ²/{2φ(0)}², where φ(·) is the
density function of the standard normal distribution. Now, a simple exercise shows that φ(0) =
1/√(2π), and thus the asymptotic relative efficiency of the sample average with respect to the sample
median (in the case of normal data) is e_{X̄,X̃} = π/2. Since this value is larger than one, we see
that the sample average is more efficient than the sample median when the data are truly from
a normal population. Since asymptotic efficiencies are based on asymptotic variances, and these
variances are used in assessing the accuracy of estimators (which the reader will recall from their
introductory unit in statistics and which we will deal with in more detail in Section 3), one useful
interpretation of the relative efficiency is “the amount of extra data required for one estimation
procedure to be as accurate as another”. For our example of the sample mean and sample median,
then, we can see that in order for the sample median to be as accurate as the sample mean, we
must have a sample which has π/2 ≈ 1.57 times as many observations. [Provided, of course, we
believe the normality assumption, and indeed if the data are not normally distributed then it is
possible for the median to be more efficient than the mean.]
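The value π/2 can be checked by simulation; the sketch below (Python) compares the Monte Carlo variances of √n(X̄ₙ − µ) and √n(X̃ₙ − µ) for standard normal data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 1000, 10_000

data = rng.normal(0.0, 1.0, size=(reps, n))  # mu = 0, sigma = 1
z = np.sqrt(n) * data.mean(axis=1)           # sqrt(n) * (mean - mu)
w = np.sqrt(n) * np.median(data, axis=1)     # sqrt(n) * (median - mu)

print(np.var(z), np.var(w), np.var(w) / np.var(z))
# roughly 1, pi/2 = 1.571 and pi/2, respectively
```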
Finally, once we have the notion of relative efficiency, we might ask whether we can find
best asymptotically normal (BAN) estimator sequences, which are essentially those for which the
relative efficiency with respect to any other sequence is always larger than or equal to one. In other
words, a weakly consistent sequence of estimators {Tn }n=1,2,... is BAN for τ (θ) if:
i. the sequence of random variables Zₙ = √n{Tₙ − τ(θ)} is asymptotically normal with mean
µ = 0 and variance σ² = σ²(θ); and,
ii. any other weakly consistent sequence of estimators {T′ₙ}_{n=1,2,...}, for which the sequence of
random variables Z′ₙ = √n{T′ₙ − τ(θ)} is asymptotically normal with mean µ′ = 0 and
variance σ′² = σ′²(θ), has σ′²(θ) ≥ σ²(θ) for all θ ∈ Θ.
Of course, it is generally very difficult to prove that a sequence is BAN from this definition, since
we must be able to verify the minimality of the asymptotic variance over all other consistent,
asymptotically normal estimator sequences. However, it can be shown that many of the common
estimators are indeed best asymptotically normal. For instance, the sample mean is a BAN esti-
mator for the mean µ of a normal population. Unfortunately, the limiting nature of the definition
of relative efficiency means that BAN estimators are rarely unique. For instance, the sequence of
estimators Tₙ = {1/(n+1)} Σ_{i=1}^n X_i is also BAN for µ from a normal population, since its asymptotic
variance is clearly the same as that of the usual sample average, the additional one in the divisor
becoming essentially negligible as the sample size increases towards infinity.
2.2.4. Loss Functions and Minimax Estimation: In this section, we examine the notion behind
the M SE and extend its defining concept. If we consider the problem of estimating τ (θ) from the
perspective of making a choice or decision among the possible values of τ (θ), then an estimator
T = t(X1 , . . . , Xn ) is sometimes referred to as a decision function or a decision rule. Obviously,
the random nature of the observations means that the actual estimate t = t(x1 , . . . , xn ) based
on the particular observed values x1 , . . . , xn will inevitably be in error. However, it is generally
the case that some errors are more severe than others, and we can quantify this idea by defining
an appropriate loss function, ℓ(t; θ). There are many ways of measuring the loss associated with
estimating τ(θ) to be the value t, and the three most common ones are:
i. Squared-Error: ℓ(t; θ) = {t − τ(θ)}²;
ii. Absolute-Error: ℓ(t; θ) = |t − τ(θ)|; and,
iii. Constant-Error: ℓ(t; θ) = A·I{|t − τ(θ)| > ε}.
The first two of these functions measure the loss as an increasing function of the discrepancy between
the true value of τ(θ) and the estimated value t. The third function assigns a loss of some
fixed value A if the estimate differs from the true value τ(θ) by more than some pre-specified value
ε, and a loss of zero otherwise (i.e., as long as the estimate is within ε of the true value there is no
loss). Of course, there are many other potential measures of loss, and the context of any particular
problem may suggest which loss function is the most sensible in the circumstances (in particular,
the three loss functions discussed here are all symmetric, so that errors below and errors above of
the same size incur equal losses; however, there are situations in which the direction of the error
will affect the loss, and in such situations asymmetric loss functions are necessary).
Suppose, however, that we have been able to determine the most sensible loss function for a
given problem (which is a quite large supposition, of course). Obviously, we would like to pick
a decision function (i.e., an estimator) which has a small associated loss. Of course, since the
estimators are based on random observations, we cannot hope to find a decision rule which can
guarantee small loss for every possible outcome of the random observations. As such, we must
lower our sights somewhat, and instead we will try and minimise the average loss over the possible
outcomes of the observations. Doing so leads to the definition of the so-called risk function, R_t(θ) =
E_θ{ℓ(T; θ)}. The risk function allows us to compare competing decision rules. In particular, suppose
that we have two competing decision functions t1 (X1 , . . . , Xn ) and t2 (X1 , . . . , Xn ), then we can say
that t1 is a better estimator than t2 if Rt1 (θ) ≤ Rt2 (θ) for all θ ∈ Θ, and Rt1 (θ) < Rt2 (θ) for at
least one value of θ in the parameter space Θ. As a final piece of nomenclature, we shall say that
an estimator is admissible if there is no better estimator (i.e., if there is no estimator with smaller
or equal risk for all possible parameter values).
Given these ideas, we can then attempt to determine a decision rule (i.e., an estimation pro-
cedure) which has minimal risk among the admissible estimators. However, we quickly see that if
we choose the squared-error loss function, then the risk function simply becomes our now familiar
MSE_t(θ), for which we know that no uniformly minimal estimator generally exists. Indeed, for
almost any loss function we choose (and certainly the three common loss functions defined previ-
ously), there will not be a general estimator which has uniformly minimal risk over the entire range
of possible values for the parameter θ. The problem, as we have seen, is that the risk function
depends on θ. Earlier, we suggested reducing the class of estimators to overcome this problem, and
we will investigate the idea further in subsequent sections. However, an alternate approach might
be to find an estimator which has the smallest “overall” risk over all possible values of θ. Of course,
we must more formally specify what we mean by an “overall” risk. This idea will be more fully
discussed in Section 2.5. For now, though, we discuss a simple definition of overall risk; namely,
the maximal risk, supθ∈Θ Rt (θ).
Definition 2.7: Suppose that T = t(X1 , . . . , Xn ) is an estimation procedure (or decision rule)
for the quantity τ (θ). Also, suppose that the chosen loss function for the estimation problem
is given by ℓ(t; θ), so that the risk function for T is given by R_t(θ) = E_θ{ℓ(T; θ)}. If, for any
other estimation procedure T′ = t′(X₁, ..., Xₙ) with risk function R_{t′}(θ) = E_θ{ℓ(T′; θ)}, the
risk function of T satisfies
sup_{θ∈Θ} {R_t(θ)} ≤ sup_{θ∈Θ} {R_{t′}(θ)},
then T is said to be a minimax estimator of τ(θ), since it minimises the maximal (i.e., worst-case) risk.
2.3. Sufficiency
One of the most important uses of statistical methods is to effect data reduction and summari-
sation. In particular, in our present parametric estimation setting, we would like to distill the
information regarding the parameter θ from our sample of random observations. Clearly, not all
of the information in these observations will be relevant to θ (indeed, some part of the observed
values are simply based on random chance). As such, we will want to reduce or summarise our
observations by ignoring extraneous information. Of course, we will not want to reduce our data
to the extent that we start to lose information which is relevant to the parameter θ. Reduction of
data takes place through the construction of statistics (or estimators), and a statistic which retains
all the information relevant to the parameter θ which was contained in the original data values is
termed sufficient for θ. The general notion here is to replace the actual observations by the value
of a sufficient statistic which removes as much extraneous information (presumably caused by the
underlying randomness in the data) as possible and still maintains all of the relevant information
in the data. As such, decisions made on the basis of sufficient statistics instead of the full set of
observations can be seen to be equally valid and useful.
More formally, suppose that X1 , . . . , Xn is a random sample from a distribution family having
densities fX (x; θ). Let X represent the sample space of the random vector (X1 , . . . , Xn ), then a
statistic T = t(X1 , . . . , Xn ) can be viewed as a partitioning of X . In other words, if we define T
to be the sample space of T and define the sets Xt = {(x1 , . . . , xn ) ∈ X : t(x1 , . . . , xn ) = t} for
each t ∈ T , then the collection {Xt }t∈T forms a partition of X . The usefulness of a statistic in
terms of its data reduction properties can then be judged by how effective this partitioning is in
both reducing the number of “possible” values to be considered as well as the degree to which all
relevant information regarding the parameter θ is retained. With regard to the partitioning induced
by a statistic, we can see that if decisions are based on the value of a statistic instead of the actual
observed data, then clearly the decision will be the same for any dataset within the same partition
of the sample space, Xt . As such, in order for a statistic to be sufficient (i.e., retain all relevant
information regarding the parameter θ) the information which distinguishes the individual elements
of each Xt should have no bearing on the value of θ (i.e., if the observed sample is known to be in a
given Xt , the probability of the sample taking any of the values within this member of the sample
space partition should not depend on the value of θ). We shall give a formal characterisation of
when we can expect this to happen, but first we examine a simple example which illustrates the
ideas behind sufficiency:
Example 2.5: Let X1 , X2 , X3 be a sample of size n = 3 from a Bernoulli distribution with
parameter p [i.e., Pr_p(X_i = 1) = p and Pr_p(X_i = 0) = 1 − p]. In this case, the sample space
for (X1 , X2 , X3 ) consists of the 8 values:
X = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)}.
Consider the two statistics T₁ = t₁(X₁, X₂, X₃) = X₁X₂ + X₃ and T₂ = t₂(X₁, X₂, X₃) =
X₁ + X₂ + X₃. The possible values T₁ = 0, 1, 2 and T₂ = 0, 1, 2, 3 partition X into the sets
X_{0,1} = {(0,0,0), (0,1,0), (1,0,0)}, X_{1,1} = {(0,0,1), (0,1,1), (1,0,1), (1,1,0)}, X_{2,1} = {(1,1,1)}
and
X_{0,2} = {(0,0,0)}, X_{1,2} = {(0,0,1), (0,1,0), (1,0,0)}, X_{2,2} = {(0,1,1), (1,0,1), (1,1,0)}, X_{3,2} = {(1,1,1)},
respectively. We now examine the distribution of the sample space values within each element of
these two partitions. First, suppose that we are told that T1 = 0, so that the possible values for
our original sample are the set X0,1 = {(0, 0, 0), (0, 1, 0), (1, 0, 0)}. We can then easily calculate
the chance that the actual dataset was all zeroes as:
Prp(X1 = 0, X2 = 0, X3 = 0 | T1 = 0) = Prp(X1 = 0, X2 = 0, X3 = 0, T1 = 0)/Prp(T1 = 0)
= Prp(X1 = 0, X2 = 0, X3 = 0)/Prp(X1 = 0, X2 = 0, X3 = 0 or X1 = 0, X2 = 1, X3 = 0 or X1 = 1, X2 = 0, X3 = 0)
= (1 − p)³/{(1 − p)³ + 2p(1 − p)²}
= (1 − p)/(1 + p).
From this calculation, we can see that the statistic T1 is not sufficient, since it does not induce
an appropriate partition. In particular, if we were to base any decision or estimate on the value
of T1 = 0, it would have to be the same regardless of whether the actual sample had been
the vector (0, 0, 0) or the vector (0, 1, 0). However, these two samples clearly contain different
information about the parameter p. By contrast, suppose that we are told that T2 = 1, so that
the possible values for our original sample are the set X1,2 = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}. We
can then easily calculate the chance that the actual dataset was (0, 1, 0) as:
Prp(X1 = 0, X2 = 1, X3 = 0 | T2 = 1) = Prp(X1 = 0, X2 = 1, X3 = 0, T2 = 1)/Prp(T2 = 1)
= Prp(X1 = 0, X2 = 1, X3 = 0)/Prp(X1 = 0, X2 = 0, X3 = 1 or X1 = 0, X2 = 1, X3 = 0 or X1 = 1, X2 = 0, X3 = 0)
= p(1 − p)²/{3p(1 − p)²}
= 1/3.
Indeed, similar calculations show that for any value T2 = t, the chance that the actual dataset
was one of the possible elements of Xt,2 does not depend on p. Thus, T2 is indeed a sufficient
statistic, since basing estimates on its value retains all of the relevant information in the sample
(X1 , X2 , X3 ) regarding the parameter p, the remaining distinctions being determined entirely
by underlying random chance.
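The conditional-probability calculations in this example are easy to check numerically. The following short Python sketch (purely illustrative; the function name and structure are our own) enumerates the sample space and confirms that the conditional distribution of the sample given the sufficient statistic T2 is free of p:

```python
# Numerical check of Example 2.5: the conditional distribution of
# (X1, X2, X3) given T2 = X1 + X2 + X3 does not depend on p.
from itertools import product
from fractions import Fraction

def conditional_dist(p, t):
    """Exact conditional distribution of the sample given T2 = t."""
    samples = [x for x in product([0, 1], repeat=3) if sum(x) == t]
    probs = [p ** sum(x) * (1 - p) ** (3 - sum(x)) for x in samples]
    total = sum(probs)
    return {x: pr / total for x, pr in zip(samples, probs)}

# The same conditional probabilities (each 1/3 when t = 1) arise for every p:
for p in [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]:
    print(p, conditional_dist(p, 1))
```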
Based on this example, we can now formally define a sufficient statistic:
Definition 2.8: Let X1 , . . . , Xn be a random sample from a distribution family with density
function fX (x; θ), where θ is a parameter (vector). A (vector-valued) statistic S = s(X1 , . . . , Xn )
is sufficient for θ if and only if the conditional distribution of X1 , . . . , Xn given S does not depend
on θ. If S is vector valued so that S = (S1 , . . . , Sk ) we generally refer to the individual scalar
components S1 , . . . , Sk as jointly sufficient statistics.
From this definition, we can easily see that the sample itself X = (X1 , . . . , Xn ) is a sufficient
statistic, as is the collection of order statistics Y = (Y1 , . . . , Yn ) = sort(X1 , . . . , Xn ) [i.e., Y1 is the
smallest of the Xi ’s, Y2 the second smallest and so on up to Yn , the largest of the Xi ’s] since the
conditional distribution of X given Y is simply the one which puts equal probability on each of the
n! permutations of the elements of Y . Moreover, if we recall that the central notion of a statistic
is that it sets up a partition of the sample space X , then it is clear that if S = s(X1 , . . . , Xn ) is a
sufficient statistic and h(·) is an invertible function then h(S) is also a sufficient statistic, since h(S)
will create the same sample space partition (due to the one-to-one nature of invertible functions)
as S [i.e., for any value s, we have
Xh(s) = {(x1, . . . , xn) ∈ X : h{s(x1, . . . , xn)} = h(s)}
= {(x1, . . . , xn) ∈ X : h^{−1}[h{s(x1, . . . , xn)}] = h^{−1}{h(s)}}
= {(x1, . . . , xn) ∈ X : s(x1, . . . , xn) = s}
= Xs,
since we have assumed that h(·) is invertible]. However, neither this last result nor the definition
itself is very useful for directly determining whether a statistic is sufficient (since finding the con-
ditional distribution of X given S is usually extremely difficult). Fortunately, there is an easier
method of finding sufficient statistics which we introduce in the next section.
2.3.1. Factorisation Criterion: We now present an extremely important theorem which can be
used to determine whether or not a statistic is sufficient:
Theorem 2.3: Let X1 , . . . , Xn be a random sample from a distribution family having density
function fX (x; θ) for some parameter vector θ. A statistic S = s(X1 , . . . , Xn ) is sufficient if and
only if the joint density function of the Xi ’s factors as:
fX1,...,Xn(x1, . . . , xn; θ) = ∏_{i=1}^n fX(xi; θ) = h1{s(x1, . . . , xn); θ} h2(x1, . . . , xn),
for some non-negative function h1 (·; θ) which depends on the xi ’s only through the value
s(x1 , . . . , xn ) and some non-negative function h2 (·) which does not depend on θ.
Proof: The proof is tedious and not very enlightening and is thus omitted from these notes.
We note that Theorem 2.3 provides a way to determine whether a certain statistic is sufficient,
however, just because we are unable to find an appropriate factorisation for some statistic does not
necessarily imply that no such factorisation exists. Thus, the theorem is rarely useful in determining
whether a statistic is not sufficient. Of course, to determine that a statistic T is not sufficient we
merely need to show that the distribution of the observations X1 , . . . , Xn given T = t depends on θ
for some value of t. In fact, the main usefulness of Theorem 2.3 is in discovering sufficient statistics,
as the following examples demonstrate:
Example 2.6: Let X1 , . . . , Xn be a random sample from the uniform distribution on the interval
[θ1, θ2], so that the density function is given by fX(x; θ1, θ2) = (θ2 − θ1)^{−1} I(θ1 ≤ x ≤ θ2) for θ1 < θ2.
The joint density of the Xi ’s can then be written as:
fX1,...,Xn(x1, . . . , xn; θ1, θ2) = ∏_{i=1}^n (θ2 − θ1)^{−1} I(θ1 ≤ xi ≤ θ2)
= (θ2 − θ1)^{−n} ∏_{i=1}^n I(θ1 ≤ xi ≤ θ2)
= (θ2 − θ1)^{−n} I{(θ1 ≤ x1 ≤ θ2) ∩ · · · ∩ (θ1 ≤ xn ≤ θ2)}
= (θ2 − θ1)^{−n} I[{θ1 ≤ min(x1, . . . , xn)} ∩ {max(x1, . . . , xn) ≤ θ2}]
= (θ2 − θ1)^{−n} I{θ1 ≤ min(x1, . . . , xn)} I{max(x1, . . . , xn) ≤ θ2}.
Thus, if we set h1 (y1 , yn ; θ1 , θ2 ) = (θ2 −θ1 )−n I(θ1 ≤y1 ) I(yn ≤θ2 ) and h2 (x1 , . . . , xn ) = 1, we see that
Y1 = min(X1 , . . . , Xn ) and Yn = max(X1 , . . . , Xn ) are jointly sufficient statistics. Alternatively,
if we assume that we know θ1 = 0, then the joint density of the sample can be written as:
fX1,...,Xn(x1, . . . , xn; θ2) = θ2^{−n} I{max(x1, . . . , xn) ≤ θ2} I{0 ≤ min(x1, . . . , xn)},
and we can then define h1(yn; θ2) = θ2^{−n} I(yn ≤ θ2) and h2(x1, . . . , xn) = I{0 ≤ min(x1, . . . , xn)} to see
that Yn = max(X1 , . . . , Xn ) is now a sufficient statistic.
Example 2.7: Let X1 , . . . , Xn be a random sample from a normal distribution family with
density function
φµ,σ²(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},
for parameters µ and σ 2 > 0. The joint density of the Xi ’s can then be written as:
fX1,...,Xn(x1, . . . , xn; µ, σ²) = ∏_{i=1}^n φµ,σ²(xi)
= (2πσ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²}
= (2πσ²)^{−n/2} exp[−(1/(2σ²)){Σ_{i=1}^n xi² − 2µ Σ_{i=1}^n xi + nµ²}].
Thus, we see that the joint density itself can be written as a function of the two quantities
S1 = Σ_{i=1}^n Xi and S2 = Σ_{i=1}^n Xi², which means that we can define h1(s1, s2; µ, σ²) to be the
joint density itself and h2(x1, . . . , xn) = 1, and thus S1 and S2 are jointly sufficient. Moreover,
it is relatively easy to see that the vector-valued function h(S1, S2) = {n^{−1}S1, (n − 1)^{−1}(S2 −
n^{−1}S1²)} = (X̄, s²) is invertible (since it is one-to-one), and therefore the average, X̄, and the
usual sample variance, s², are also jointly sufficient.
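The practical content of this example is that the normal likelihood "sees" the data only through S1 and S2. As a quick illustration (the datasets and function names below are our own, chosen for the sketch), two samples sharing the same values of S1 = Σxi and S2 = Σxi² have identical likelihoods at every (µ, σ²):

```python
# Sketch for Example 2.7: two datasets with equal S1 = sum(x) and
# S2 = sum(x**2) yield identical normal log-likelihoods for all (mu, sigma2).
import math

def normal_loglik(xs, mu, sigma2):
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

a = [1.0, 2.0, 3.0, 4.0]                      # S1 = 10, S2 = 30
d, e = math.sqrt(2.0), math.sqrt(0.5)
b = [2.5 - d, 2.5 - e, 2.5 + e, 2.5 + d]      # also S1 = 10, S2 = 30
for mu, s2 in [(0.0, 1.0), (2.5, 0.5), (-1.0, 4.0)]:
    assert abs(normal_loglik(a, mu, s2) - normal_loglik(b, mu, s2)) < 1e-9
print("identical likelihoods at every parameter value tested")
```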
The result of Theorem 2.3 is intuitively evident when we consider that if the joint density factors
as indicated then the log-likelihood function is essentially equal to ln{h1 (s1 , . . . , sk ; θ)} [where we
have written s = s(x1 , . . . , xn ) = (s1 , . . . , sk ) when s(·, . . . , ·) is a vector-valued function with k
components and we have used the standard reduction of eliminating additive terms from the log-
likelihood which do not depend on the parameter θ]. In other words, all the information about
θ contained in the likelihood is contained in the vector-valued statistic S, which is precisely the
notion behind sufficiency. Indeed, this argument forms the basis of the following important result:
Theorem 2.4: Let X1 , . . . , Xn be a random sample from a distribution family with density
function fX(x; θ). Also, let S = s(X1, . . . , Xn) be a sufficient statistic for θ. Then, the MLE of
θ depends on the sample observations only through the sufficient statistic. In other words, the
MLE is a function of the sufficient statistic S.
Proof: Since S is sufficient, we know that the likelihood function (which is the same as the
joint density function) can be written in the form:
L(θ; x1, . . . , xn) = ∏_{i=1}^n fX(xi; θ) = h1{s(x1, . . . , xn); θ} h2(x1, . . . , xn).
Clearly, L(θ; x1 , . . . , xn ) is maximised in θ at the same place that h1 {s(x1 , . . . , xn ); θ} is, since
the factor h2 (x1 , . . . , xn ) does not depend on θ. Moreover, the value of θ which maximises
h1{s(x1, . . . , xn); θ} = h1(s; θ) can clearly only depend on s. Formally, we have θ̂MLE =
argmax_{θ∈Θ} h1(s; θ), and thus θ̂MLE must be a function of s only.
As an example of Theorem 2.4, we note that the MLEs of µ and σ² for the normal family are
µ̂ = X̄ = n^{−1} Σ_{i=1}^n Xi and σ̂² = n^{−1} Σ_{i=1}^n (Xi − X̄)² = n^{−1} Σ_{i=1}^n Xi² − (n^{−1} Σ_{i=1}^n Xi)², which
are clearly functions of the sufficient statistics found in Example 2.7; namely, S1 = Σ_{i=1}^n Xi
and S2 = Σ_{i=1}^n Xi². We note, however, that it is possible for method of moments or method of
percentiles estimators not to be functions of sufficient statistics.
Example 2.6 (cont’d): If X1 , . . . , Xn are uniformly distributed on the interval [0, θ], then we
saw that Yn = max(X1, . . . , Xn) was a sufficient statistic. Moreover, we can write the log-
likelihood for θ based on the sample as:
l(θ; x1, . . . , xn) = −n ln(θ) + ln[I{max(x1, . . . , xn) ≤ θ}],
where the term ln[I{0 ≤ min(x1, . . . , xn)}] has been left out since it does not depend on θ. Now,
−n ln(θ) is a decreasing function of θ, so to maximise the log-likelihood we must choose θ as
small as possible; however, since ln(0) = −∞, the only possible range for θ on which the log-
likelihood is not negatively infinite is θ ≥ max(x1, . . . , xn). These two facts together show that
the MLE of θ is given by Yn = max(X1, . . . , Xn), which is clearly a function of a sufficient
statistic. On the other hand, the expected value of any Xi is θ/2. Therefore, the method of
moments estimator of θ is easily calculated as θ̂MOM = 2X̄. The method of moments estimator
is clearly not a function of Yn , and indeed it can be shown that it is not a function of any
sufficient statistic (though the demonstration is somewhat technical and so we will omit it).
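Although θ̂MOM = 2X̄ is not a function of a sufficient statistic while the MLE is, it is instructive to compare the two by simulation. The following sketch (the values θ = 1 and n = 20 are our own arbitrary choices) estimates the mean squared error of each:

```python
# Simulation sketch: MLE max(X) versus method of moments 2*mean(X)
# for a Uniform[0, theta] sample.
import random

random.seed(1)
theta, n, reps = 1.0, 20, 10000
mle, mom = [], []
for _ in range(reps):
    xs = [random.uniform(0, theta) for _ in range(n)]
    mle.append(max(xs))
    mom.append(2 * sum(xs) / n)

def mse(est):
    return sum((e - theta) ** 2 for e in est) / reps

print("MSE of MLE:", mse(mle))   # about 2*theta^2/((n+1)(n+2)) ~ 0.0043
print("MSE of MOM:", mse(mom))   # about theta^2/(3n) ~ 0.0167
```

The MLE is typically far more accurate here, reflecting the information lost when an estimator ignores the sufficient statistic Yn.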
We close this section by discussing our original objective in introducing sufficient statistics, which
was data reduction. Recall that the idea behind sufficient statistics is that they contain all the
relevant information regarding the parameter θ while removing (some) extraneous information. In
particular, if we have a sample of size n, X1 , . . . , Xn from a distribution family with densities
fX (x; θ) and a sufficient statistic S = (S1 , . . . , Sk ), then we can effectively reduce the number of
relevant pieces of information regarding θ from n down to k. Recall, also, that we could conceive
of this reduction in terms of a partitioning of the sample space X into the subsets Xs for each
s in the range of S. Effectively, then, we have reduced the number of possible outcomes which
need to be considered from the size of X (the individual elements of which can be considered as a
partition induced by the sample itself X1 , . . . , Xn ) down to the number of elements in the range of
S. However, we have seen that there is not simply a unique sufficient statistic, and the question
then arises as to whether a particular sufficient statistic has effected the greatest possible reduction
in the data. If a particular sufficient statistic does indeed effect the maximal reduction, we shall
refer to it as a minimal sufficient statistic (the adjective “minimal” here referring to the fact that
such statistics will have the smallest number of components, k, possible). Equivalently, we can
view minimal sufficient statistics as those for which the induced partition of the sample space has
the fewest members (i.e., subsets Xs ). Generically, then, a sufficient statistic is termed minimal if
no other sufficient statistic condenses the data to a greater extent. Formally, we have the following
definition:
Definition 2.9: A sufficient statistic S is termed minimal sufficient if and only if for any other
sufficient statistic S′ there exists a function h(·) such that S = h(S′).
Unfortunately, this definition is rarely useful in identifying minimal sufficient statistics. Indeed, in
general it is quite difficult to determine minimal sufficient statistics. There is, however, a particular
class of distribution families for which minimal sufficient statistics can be determined, and we focus
on these families in the next section.
2.3.2. Exponential Families: We now introduce a class of distribution families which have very
convenient mathematical properties and which include most of the standard probability models
which are commonly dealt with in statistical applications. The class of distributions are known as
exponential families and are defined as follows:
Definition 2.10: A distribution family which has density functions of the form:
fX(x; θ) = exp{ Σ_{i=1}^k ci(θ) di(x) − b(θ) − a(x) },
for a k-dimensional parameter θ = (θ1 , . . . , θk ) and suitable choices of the functions a(·), b(·)
ci (·) and di (·) (for i = 1, . . . , k) is termed a k-parameter exponential family.
Note that it is important that the number of ci (·) and di (·) functions is the same as the dimension
of the parameter vector. We recall, also, that in the case of discrete distribution families we should
interpret the density function fX (x; θ) as a probability mass function (pmf). Before presenting a few
examples of exponential families, we note that if we define the reparameterisation η = (η1 , . . . , ηk ) =
c(θ) = {c1 (θ), . . . , ck (θ)} then η is referred to as the canonical parameter for the exponential family
and the density function can be written in the form:
fX(x; η) = exp{ Σ_{i=1}^k ηi di(x) − B(η) − a(x) }.
Moreover, in this parameterisation we have B(η) = b{c−1 (η)} [where c−1 (η) is the inverse function
of the reparameterisation function η = c(θ), which must exist for the reparameterisation to be valid
and which can be guaranteed to exist in the case of exponential families], and based on this function
we can define KD (t) = B(η + t) − B(η), which is the so-called joint cumulant generating function
of the random variable D = {d1 (X), . . . , dk (X)}, so-called because its derivatives evaluated at
t = 0 yield the cumulants of D, the first cumulant being the mean, the second cumulant being the
variance and the third cumulant being the skewness. In other words, some simple vector calculus
shows:
E(D) = (∂/∂t) KD(t)|_{t=0} ⟹ E(Di) = ∂B(η)/∂ηi ;
Var(D) = (∂²/∂t ∂tᵀ) KD(t)|_{t=0} ⟹ Cov(Di, Dj) = ∂²B(η)/∂ηi∂ηj ;
Skew(Di) = (∂³/∂ti³) KD(t)|_{t=0} ⟹ Skew(Di) = ∂³B(η)/∂ηi³.
Finally, we note that it is reasonably straightforward to show that KD (t) = ln{mD (t)} where mD (t)
is the joint moment generating function of the random vector D.
Example 2.8: If X has a Poisson distribution with rate parameter λ, then we can see that the
pmf can be written as:
fX(x; λ) = λ^x e^{−λ}/x! = exp{x ln(λ) − λ − ln(x!)}, x = 0, 1, 2, . . . .
Thus, the Poisson family is a one-dimensional exponential family with functions a(x) = ln(x!),
b(λ) = λ, c1(λ) = ln(λ) and d1(x) = x. Moreover, we see that the canonical parameter is
η = ln(λ), leading to the inverse relationship λ = e^η and
B(η) = b(e^η) = e^η ⟹ KD(t) = e^{η+t} − e^η = λ(e^t − 1) ⟹ mD(t) = exp{λ(e^t − 1)},
which yields the form of the mgf for a Poisson random variable with which we are familiar, since
D = X in this case.
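Since B(η) = e^η here, the cumulant-generating machinery above can be checked with simple finite differences; the sketch below (the step size h is an arbitrary choice of ours) recovers the familiar fact that the Poisson mean and variance both equal λ:

```python
# Finite-difference check that derivatives of B(eta) = exp(eta) yield the
# Poisson cumulants: mean = variance = lambda.
import math

lam = 3.0
eta = math.log(lam)           # canonical parameter
B = math.exp                  # B(eta) for the Poisson family
h = 1e-5
mean = (B(eta + h) - B(eta - h)) / (2 * h)            # first derivative
var = (B(eta + h) - 2 * B(eta) + B(eta - h)) / h**2   # second derivative
print(mean, var)              # both approximately lam = 3.0
```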
Example 2.9: If X has a Normal distribution with mean µ and variance σ 2 , then we can see
that the pdf can be written as:
φµ,σ²(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)} = exp{(µ/σ²)x − (1/(2σ²))x² − µ²/(2σ²) − ½ ln(σ²) − ½ ln(2π)}.
Thus, the Normal family is a two-dimensional exponential family with functions a(x) = ½ ln(2π),
b(µ, σ²) = µ²/(2σ²) + ½ ln(σ²), c1(µ, σ²) = µ/σ², c2(µ, σ²) = −1/(2σ²), d1(x) = x and d2(x) = x². Moreover,
we see that the canonical parameters are η1 = µ/σ² and η2 = −1/(2σ²), leading to the inverse
relationship µ = −η1(2η2)^{−1}, σ² = −(2η2)^{−1} and
B(η) = b{−η1/(2η2), −(2η2)^{−1}} = −η1²/(4η2) − ½ ln(−2η2),
so that:
E(D1) = ∂B(η)/∂η1 = −η1/(2η2) = µ = E(X);
E(D2) = ∂B(η)/∂η2 = η1²/(4η2²) − 1/(2η2) = µ² + σ² = E(X²);
Var(D1) = ∂²B(η)/∂η1² = −1/(2η2) = σ² = Var(X);
Skew(D1) = ∂³B(η)/∂η1³ = 0 = Skew(X).
Alternatively, if we assume that σ² is a known constant rather than a parameter, the density
then has the form of a one-parameter exponential family with functions a(x) = ½ ln(2π) +
½ ln(σ²) + x²/(2σ²), b(µ) = µ²/(2σ²), c1(µ) = µ/σ² and d1(x) = x. Therefore, we see that the canonical
parameter is η = µ/σ², leading to the inverse relationship µ = ησ² and
B(η) = b(ησ²) = η²σ²/2 ⟹ KD(t) = (η + t)²σ²/2 − η²σ²/2 = (σ²/2)(2ηt + t²) = µt + ½σ²t²
⟹ mD(t) = exp{µt + ½σ²t²},
which is the familiar moment generating function for the normal distribution since D = X in
this case.
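The derivative calculations in this example are mechanical and can be verified symbolically; the following sketch uses the sympy library (a convenience choice on our part) to confirm each identity:

```python
# Symbolic verification of the B(eta) derivatives in Example 2.9.
import sympy as sp

eta1, eta2 = sp.symbols("eta1 eta2")
B = -eta1**2 / (4 * eta2) - sp.Rational(1, 2) * sp.log(-2 * eta2)
mu = -eta1 / (2 * eta2)
sigma2 = -1 / (2 * eta2)

assert sp.simplify(sp.diff(B, eta1) - mu) == 0                # E(D1) = mu
assert sp.simplify(sp.diff(B, eta2) - (mu**2 + sigma2)) == 0  # E(D2) = mu^2 + sigma^2
assert sp.simplify(sp.diff(B, eta1, 2) - sigma2) == 0         # Var(D1) = sigma^2
assert sp.diff(B, eta1, 3) == 0                               # Skew(D1) = 0
print("all identities verified")
```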
The main reason that we focus on exponential families is that the form of the densities makes
application of Theorem 2.3 straightforward. In particular, for an iid sample X1 , . . . , Xn from an
exponential family it is straightforward to see that Σ_{i=1}^n Di = (Σ_{i=1}^n d1(Xi), . . . , Σ_{i=1}^n dk(Xi)) is
a sufficient statistic. Moreover, it can be shown (though we will not provide a proof since it is rather
technical) that this is a minimal sufficient statistic. In fact, it turns out that Σ_{i=1}^n Di is not only
minimal sufficient, but is also complete, a concept which we will discuss briefly in the next section.
Finally, before proceeding to the next section, we note that while most of the common distributions
which arise in statistical applications are of exponential class, not all are. In particular, one simple
example of a family which is not of exponential form is the family of uniform distributions on the
interval [θ1 , θ2 ].
2.4. Minimum-Variance Unbiased Estimation
Definition 2.11: If X1, . . . , Xn are a random sample from a distribution having density function
fX(x; θ) for some parameter value θ ∈ Θ and T = t(X1, . . . , Xn) is an unbiased estimator of
τ(θ), so that Eθ(T) = τ(θ), then T is called a uniformly minimum-variance unbiased (UMVU)
estimator if and only if Varθ(T) ≤ Varθ(T′) for all values of θ ∈ Θ and any other unbiased
estimator T′ = t′(X1, . . . , Xn) [i.e., for any other estimator satisfying Eθ(T′) = τ(θ)].
In the following sections, we will investigate when UMVU estimators exist, what their variance is
and how to find them.
2.4.1. Variance Bound for Unbiased Estimators: Before finding UMVU estimators, it is helpful
to investigate the general properties of the variance of unbiased estimators. In particular, we will
be able to determine a lower bound below which the variance of an unbiased estimator cannot fall.
Thus, if we find an estimator which achieves this bound uniformly for all values of the parameter
θ, we can conclude that we have a UMVU estimator. Before we state and prove the lower bound,
we need to make some assumptions (generally referred to as regularity conditions) to ensure that
we exclude strange cases for which the lower bound does not hold (rest assured, however, that
the following assumptions are true for almost all distributions and situations of practical interest).
Let X1 , . . . , Xn be a random sample from a distribution having density function fX (x; θ) with θ
assumed to be scalar, let T = t(X1 , . . . , Xn ) be an unbiased estimator of τ (θ) and assume:
i. (∂/∂θ) ln{fX(x; θ)} exists for all x and θ;
ii. interchange of integration and differentiation is permissible insofar as
(∂/∂θ) ∫_{−∞}^∞ · · · ∫_{−∞}^∞ ∏_{i=1}^n fX(xi; θ) dx1 · · · dxn = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ (∂/∂θ) ∏_{i=1}^n fX(xi; θ) dx1 · · · dxn
and
(∂/∂θ) ∫_{−∞}^∞ · · · ∫_{−∞}^∞ t(x1, . . . , xn) ∏_{i=1}^n fX(xi; θ) dx1 · · · dxn
= ∫_{−∞}^∞ · · · ∫_{−∞}^∞ t(x1, . . . , xn) (∂/∂θ) ∏_{i=1}^n fX(xi; θ) dx1 · · · dxn ;
iii. the expectation i(θ) = Eθ[{(∂/∂θ) ln fX(X; θ)}²], where X is a generic random variable
having distribution with density fX(x; θ), is finite for all θ ∈ Θ.
Under these assumptions, we can formally state the Information Inequality which is also known as
the Cramér-Rao Inequality:
Theorem 2.5: Let X1 , . . . , Xn be a random sample from a distribution family with density
function fX (x; θ) where θ is a scalar parameter. Also, let T = t(X1 , . . . , Xn ) be an unbiased
estimator for τ (θ). Then, assuming conditions (i), (ii) and (iii) above hold,
Varθ(T) ≥ {τ′(θ)}²/{n i(θ)},
where τ′(θ) = (d/dθ) τ(θ). Further, equality occurs if and only if there exists a function K(θ, n),
not depending on the xi's, such that:
Σ_{i=1}^n (∂/∂θ) ln{fX(xi; θ)} = K(θ, n){t(x1, . . . , xn) − τ(θ)}.
Proof: The proof relies on the Cauchy-Schwarz Inequality which, in one of its simpler forms,
states that:
{E(XY)}² ≤ E(X²)E(Y²),
with equality only if X = cY for some constant c (i.e., a quantity not involving X or Y). A
demonstration of the Cauchy-Schwarz inequality is left as an exercise, while a fully rigorous
proof of the current inequality is omitted since it is not overly enlightening. However, a basic
argument demonstrating the validity of the result proceeds as follows. Clearly, the assumption
that T is unbiased for τ(θ) implies that:
0 = Eθ{T − τ(θ)} = ∫ · · · ∫ {t(x1, . . . , xn) − τ(θ)} ∏_{i=1}^n fX(xi; θ) dx1 . . . dxn.
Thus, the expected Fisher information is I(θ) = n i(θ) = nθ^{−2}. Alternatively, if we had
used the characterisation −Eθ{l″(θ)} for the expected Fisher information, we see that l(θ) =
−n ln(θ) − θ^{−1} Σ_{i=1}^n Xi, so that l″(θ) = nθ^{−2} − 2θ^{−3} Σ_{i=1}^n Xi, which leads to the same result for
I(θ). [NOTE: Calculating the expected Fisher information from the characterisation Eθ[{l′(θ)}²]
would have been made somewhat complicated due to the squaring operation performed on the
summation of the Xi's.] Thus, the lower bound for the variance of any unbiased estimator
T = t(X1, . . . , Xn) of θ is given by:
Varθ(T) ≥ 1/(nθ^{−2}) = θ²/n.
Finally, we note that the sample average, X̄ = n^{−1} Σ_{i=1}^n Xi, is clearly unbiased and
Varθ(X̄) = Varθ(X)/n = θ²/n.
Thus, since the variance of X̄ achieves the Cramér-Rao lower bound, X̄ must be a UMVU
estimator. Indeed, we can see that
Σ_{i=1}^n (∂/∂θ) ln{fX(xi; θ)} = (1/θ²) Σ_{i=1}^n (xi − θ) = (n/θ²)(x̄ − θ),
and thus, setting K(θ, n) = nθ−2 , we see that the sample average satisfies the conditions for
equality in Theorem 2.5.
We note that Theorem 2.5 is also true for discrete distributions, as long as the conditions required
for the density function in the continuous case are satisfied by the pmf in the discrete case (with
integrals replaced by summations, of course).
Example 2.8 (cont’d): If X has a Poisson distribution with rate parameter θ, so that pX(x; θ) =
θ^x e^{−θ}/x! for x = 0, 1, 2, . . ., and τ(θ) = θ, then we have τ′(θ) = 1 and (d/dθ) ln{pX(x; θ)} =
(d/dθ){x ln(θ) − θ − ln(x!)} = xθ^{−1} − 1, so that
i(θ) = Eθ[{(d/dθ) ln pX(X; θ)}²] = (1/θ²) E{(X − θ)²} = (1/θ²) Varθ(X) = 1/θ.
Thus, the expected Fisher information is I(θ) = n i(θ) = nθ^{−1} and the lower bound for the
variance of any unbiased estimator T = t(X1, . . . , Xn) of θ is given by:
Varθ(T) ≥ 1/(nθ^{−1}) = θ/n.
As in Example 2.10, the lower bound is achieved by the estimator X̄, since Varθ(X̄) =
n^{−1} Varθ(X) = n^{−1}θ. Thus, X̄ is a UMVU estimator of θ. Alternatively, suppose that
τ(θ) = e^{−θ} = Prθ(X = 0). In this case, we have τ′(θ) = −e^{−θ}, and we see that the lower
bound for the variance of unbiased estimators of τ(θ) is given by n^{−1}θ(−e^{−θ})² = n^{−1}θe^{−2θ}. It
is easy to verify that the estimator T = n^{−1} Σ_{i=1}^n I(Xi=0) is unbiased for e^{−θ}. Moreover, it is
easy to see that nT has a binomial distribution with parameters n and p = e^{−θ}. Therefore,
Varθ(T) = n^{−1}p(1 − p) = n^{−1}e^{−θ}(1 − e^{−θ}). It is not difficult to show that e^θ ≥ 1 + θ for any
θ, and this fact then easily implies that
Varθ(T) = n^{−1}e^{−θ}(1 − e^{−θ}) ≥ n^{−1}θe^{−2θ},
as should be the case according to Theorem 2.5. In fact, it can be shown that equality only
occurs when θ = 0. Thus the variance of T does not achieve the Cramér-Rao lower bound. Of
course, it could still be a UMVU estimator of e^{−θ} if no estimator achieves the Cramér-Rao lower
bound. However, it turns out that T is not a UMVU estimator, and we shall find a UMVU estimator for
this quantity in the next section.
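A small Monte Carlo experiment makes the two comparisons above concrete. In the sketch below (sample size, θ and seed are arbitrary choices of ours), the variance of X̄ essentially sits on the bound θ/n, while the variance of T stays strictly above its bound θe^{−2θ}/n:

```python
# Monte Carlo sketch of the Poisson Cramer-Rao comparisons.
import random, math

def pois(lam):
    # Knuth's multiplication method for Poisson variates (fine for small lam).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

random.seed(2)
theta, n, reps = 2.0, 25, 20000
xbar, T = [], []
for _ in range(reps):
    xs = [pois(theta) for _ in range(n)]
    xbar.append(sum(xs) / n)
    T.append(sum(1 for x in xs if x == 0) / n)

print("Var(xbar):", var(xbar), "bound:", theta / n)                   # ~ equal
print("Var(T):", var(T), "bound:", theta * math.exp(-2 * theta) / n)  # above
```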
We close this section with several remarks regarding Cramér-Rao bounds. These results are some-
what more advanced and technical, and detailed discussions are beyond the scope of these notes.
i. If θ is a vector parameter of dimension k, then there is an analogue to the Cramér-Rao variance
lower bound which states that if T is an unbiased estimator of τ(θ) then
Varθ(T) ≥ ∇τ(θ)ᵀ I^{−1}(θ) ∇τ(θ),
where ∇τ(θ) = (∂τ(θ)/∂θ1, . . . , ∂τ(θ)/∂θk)ᵀ is the gradient vector (written as a column) of τ(θ) and
I^{−1}(θ) is the matrix inverse of the expected Fisher information matrix I(θ) defined to have
(i, j)th component
Iij(θ) = Eθ{(∂l(θ)/∂θi)(∂l(θ)/∂θj)} = −Eθ{∂²l(θ)/∂θi∂θj}.
In other words, I(θ) is the variance-covariance matrix of the score function (i.e., the gradient
of the log-likelihood).
ii. In general, the Cramér-Rao lower bound is not sharp. In other words, in many cases there is no
estimator with a variance equal to the lower bound value. This does not, however, necessarily
mean that there is no U M V U estimator in such cases. We shall see an example of this in the
next section.
iii. If the MLE of θ, θ̂MLE, is a solution to the score equation, l′(θ) = 0, (as opposed to being
a boundary value of the parameter space Θ) and T = t(X1, . . . , Xn) is an unbiased estimator
of τ(θ) the variance of which achieves the Cramér-Rao lower bound, then it must be the case
that T = τ(θ̂MLE). In other words, if there is an unbiased estimator of τ(θ) the variance of
which achieves the Cramér-Rao lower bound, it must be the MLE of τ. Again, we note that
there may be UMVU estimators the variances of which do not achieve the Cramér-Rao lower
bound and in these cases, the estimators need not be the MLEs.
iv. Finally, as a follow-up to the previous remark, we note that it can be shown that estima-
tors whose variance achieves the Cramér-Rao lower bound exist only in the case where the
probability model is an exponential family (which adds another piece of evidence as to why
these families are so special and important). In fact, it can be further shown that even within
exponential families, only a very limited collection of functions of the parameters, τ (θ), have
unbiased estimators for which the variance achieves the Cramér-Rao lower bound. At first,
this may seem to indicate that seeking UMVU estimators, even in exponential families, is
essentially fruitless. Recall, however, that UMVU estimators need not have variances which
achieve the Cramér-Rao lower bound [see remark (ii) above]. As such, the remark here merely
indicates that the Cramér-Rao inequality is not the most fruitful method of finding UMVU es-
timators. Indeed, the next section presents an alternative, and more useful, method of finding
UMVU estimators.
2.4.2. The Rao-Blackwell Theorem and Completeness: In the previous section we saw that
unbiased estimators could not have variances which fell below a specific bound. As such, if we
could find an unbiased estimator the variance of which achieved this bound, then clearly such an
estimator would be a uniformly minimum-variance unbiased (UMVU) estimator. Unfortunately,
it is rarely possible to find an unbiased estimator with a variance equal to the Cramér-Rao lower
bound. So, we now present some results which provide an alternative approach to finding UMVU
estimators.
It should seem reasonable that an estimator based on a sufficient statistic would be less variable
than one which is not so based, since the idea of sufficiency was the removal of irrelevant information
(which by its nature would tend to increase variability). Indeed, suppose that T = t(X1 , . . . , Xn ) is
an unbiased estimator of the parameter τ = τ (θ) and suppose that S = s(X1 , . . . , Xn ) is a (possibly
vector-valued) sufficient statistic. The following theorem, known as the Rao-Blackwell Theorem,
shows that we can construct an unbiased estimator from T and S which has smaller variance than
T . Specifically, we have:
Theorem 2.6: Let X1 , . . . , Xn be a random sample from a distribution family with density
function fX (x; θ) for some parameter θ ∈ Θ, and let S = s(X1 , . . . , Xn ) be a sufficient statistic
[NOTE: S may be vector-valued, in which case we write S = (S1 , . . . , Sk )]. Further, let T =
t(X1 , . . . , Xn ) be an unbiased estimator of τ = τ (θ). If we define the new quantity T1 = Eθ (T |S)
then:
i. T1 is a statistic (i.e., it does not depend on θ) and is a function of the sufficient statistic,
T1 = t1 (S) = t1 (S1 , . . . , Sk );
ii. T1 is an unbiased estimator of τ (θ); and,
iii. Varθ(T1) ≤ Varθ(T) for all θ ∈ Θ, and Varθ(T1) < Varθ(T) for some θ ∈ Θ unless T1 = T.
Proof: (i.) Since S is a sufficient statistic, we know that the distribution of (X1 , . . . , Xn ) given
S cannot depend on θ from Definition 2.8. Clearly, then, the distribution of any function of
(X1 , . . . , Xn ) given S cannot depend on θ either. Thus, T1 does not depend on θ; in other words,
T1 is a statistic, since it is a function of only the data. Also, from the definition of conditional
expectations, it is clear that T1 depends on the Xi ’s only through the value of S; in other words,
T1 is a function of S.
(ii.) Using the law of the iterated expectation, we know that E{E(Y|Z)} = E(Y) for any
random variables Y and Z. In particular, then, we have:
Eθ(T1) = Eθ{Eθ(T|S)} = Eθ(T) = τ(θ),
so that T1 is unbiased for τ(θ).
(iii.) Now, since T1 is simply a function of the sufficient statistic S, we can further see that:
Eθ[(T − T1){T1 − τ(θ)}] = Eθ(Eθ[(T − T1){T1 − τ(θ)}|S]) = Eθ({T1 − τ(θ)} Eθ(T − T1|S)) = 0,
since Eθ(T − T1|S) = Eθ(T|S) − T1 = 0. Therefore, we see that Varθ(T) = Eθ{(T − T1)²} + Varθ(T1) ≥ Varθ(T1), and the inequality
is strict unless T = T1. [NOTE: An alternate derivation of this result is based on the extension
of the law of the iterated expectation to the case of variances:
Varθ(T) = Eθ{Varθ(T|S)} + Varθ{Eθ(T|S)} = Eθ{Varθ(T|S)} + Varθ(T1) ≥ Varθ(T1),
where we have used the obvious fact that Eθ{Varθ(T|S)} ≥ 0 since it is the expected value of
a conditional variance which clearly cannot be negative.]
So, Theorem 2.6 provides a way of finding an unbiased estimator with “low” variance (i.e., at
least as low as the variance of any other given unbiased estimator). Whether or not the resultant
estimator is a UMVU estimator will be taken up shortly. Before discussing this important issue,
we present an example:
Example 2.8 (cont’d): If X has a Poisson distribution with rate parameter θ, we saw that
T = n^{−1} Σ_{i=1}^n I(Xi=0) is an unbiased estimator for e^{−θ} and we determined its variance as
Varθ(T) = n^{−1}e^{−θ}(1 − e^{−θ}). Furthermore, since the Poisson family of distributions was seen
to be an exponential family with D = d1(X) = X, we know that S = Σ_{i=1}^n Di = Σ_{i=1}^n Xi is a
sufficient statistic (and indeed, a minimal sufficient statistic). So, according to Theorem 2.6, if
we define T1 = Eθ(T|S) we should get an unbiased estimator for e^{−θ} which has lower variance
than T. First, to determine the explicit form of the estimator, we note that:
Eθ(T|S = s) = Eθ(n^{−1} Σ_{i=1}^n I(Xi=0) | Σ_{i=1}^n Xi = s) = n^{−1} Σ_{i=1}^n Eθ(I(Xi=0) | Σ_{i=1}^n Xi = s)
= Eθ(I(X1=0) | Σ_{i=1}^n Xi = s) = Prθ(X1 = 0 | Σ_{i=1}^n Xi = s)
= Prθ(X1 = 0, Σ_{i=1}^n Xi = s)/Prθ(Σ_{i=1}^n Xi = s) = Prθ(X1 = 0, Σ_{i=2}^n Xi = s)/Prθ(Σ_{i=1}^n Xi = s)
= Prθ(X1 = 0) Prθ(Σ_{i=2}^n Xi = s)/Prθ(Σ_{i=1}^n Xi = s) = [e^{−θ} {(n − 1)θ}^s e^{−(n−1)θ}/s!]/[(nθ)^s e^{−nθ}/s!]
= ((n − 1)/n)^s.
Thus, T1 = ((n − 1)/n)^S is the new estimator. To verify directly that T1 is unbiased and has lower
variance than T, we note that S has a Poisson distribution with rate parameter nθ, so that:
Eθ(T1) = Σ_{s=0}^∞ ((n − 1)/n)^s (nθ)^s e^{−nθ}/s! = e^{−nθ} Σ_{s=0}^∞ {(n − 1)θ}^s/s! = e^{−nθ} e^{(n−1)θ} = e^{−θ},
which yields Varθ(T1) = e^{θ(n^{−1}−2)} − (e^{−θ})² = e^{−2θ}(e^{θ/n} − 1). To see that this variance is smaller
than Varθ(T) = n^{−1}e^{−θ}(1 − e^{−θ}) = n^{−1}e^{−2θ}(e^θ − 1), we note that:
(1/n)(e^θ − 1) = (1/n) Σ_{m=1}^∞ θ^m/m! = Σ_{m=1}^∞ θ^m/{n(m!)} > Σ_{m=1}^∞ (θ/n)^m/m! = e^{θ/n} − 1.
Alternatively, we know that the Cramér-Rao lower bound on the variance of unbiased estimators
in this case is given by n^{−1}θe^{−2θ}, and since we know that e^y − 1 > y for any y ≠ 0, we have:
e^{−2θ}(e^{θ/n} − 1) > e^{−2θ} θ/n,
so that the variance of T1 does not achieve the Cramér-Rao lower bound. Nonetheless, we shall
see that T1 turns out to be a UMVU estimator.
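The variance reduction achieved by conditioning on S can also be seen directly by simulation; the sketch below (the parameter values are our own choices) estimates both variances:

```python
# Simulation sketch comparing T = (1/n) * #{Xi = 0} with its
# Rao-Blackwellised version T1 = ((n - 1)/n)**S, where S = sum(Xi).
import random, math

def pois(lam):
    # Knuth's multiplication method for Poisson variates.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

random.seed(3)
theta, n, reps = 1.5, 10, 20000
T_vals, T1_vals = [], []
for _ in range(reps):
    xs = [pois(theta) for _ in range(n)]
    T_vals.append(sum(1 for x in xs if x == 0) / n)
    T1_vals.append(((n - 1) / n) ** sum(xs))

print("target e^-theta:", math.exp(-theta))               # ~ 0.223
print("means:", sum(T_vals) / reps, sum(T1_vals) / reps)  # both ~ 0.223
print("variances:", var(T_vals), var(T1_vals))            # Var(T1) smaller
```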
Recall that sufficient statistics are not unique; that is, there may be two different (possibly vector-
valued) statistics S1 and S2 both of which are sufficient. In this case, we can define multiple new
estimators from an original unbiased estimator T as
i. T1 = Eθ (T |S1 );
ii. T2 = Eθ (T |S2 );
iii. T3 = Eθ (T1 |S2 ); and
iv. T4 = Eθ (T2 |S1 ).
[NOTE: Since T1 is a function of S1 , we see that Eθ (T1 |S1 ) = T1 , so re-conditioning on the
same sufficient statistic does not aid in arriving at unbiased estimators with reduced variance].
Now, Theorem 2.6 indicates that Varθ(T) ≥ Varθ(T1) ≥ Varθ(T3) and Varθ(T) ≥ Varθ(T2) ≥
Varθ(T4). However, Theorem 2.6 does not give us any indication as to whether T3 or T4 will have
the smaller variance; indeed, there may be no clear cut winner, as Varθ(T3) may be less than
Varθ(T4) for some values of θ while the reverse is true for other values of θ. This problem is
generally alleviated by choosing to condition on a minimal sufficient statistic, since if S1 is minimal
sufficient we know that for any other sufficient statistic S2 there exists a function h(·) such that
S1 = h(S2 ), in which case
T3 = Eθ(T1|S2) = Eθ{Eθ(T|S1)|S2} = Eθ[Eθ{T|h(S2)}|S2] = Eθ{T|h(S2)} = Eθ(T|S1) = T1,
where the fourth equality follows from the fact that Eθ{T|h(S2)} is, by definition, a function of S2.
In other words, conditioning on a minimal sufficient statistic implies that any further conditioning
will not result in any further variance reduction (indeed, it will not even result in a new unbiased
estimator).
Moreover, if we have another unbiased estimator T′, then Theorem 2.6 indicates that T1′ =
Eθ(T′|S1) has smaller variance than T′, but it does not indicate whether T1 or T1′ has the lower
variance. So, while Theorem 2.6 gives us a method for deriving estimators with reduced variances,
it does not necessarily give us a method of deriving UMVU estimators. We shall see, however,
that there are conditions under which the result of Theorem 2.6 does yield a UMVU estimator.
Unfortunately, these conditions are rather technical and we only present a basic introduction.
We start by defining the concept of completeness of a statistic or estimator T . The general
idea is that a statistic is complete if no function of it has expectation zero for all values of θ unless
the function is the zero function, z(x) ≡ 0 for all x. In particular, this means that if g(T ) is an
unbiased estimator for some parameter τ = τ (θ), then there is no other function of T which is also
an unbiased estimator of τ . To see this, note that if h(T ) was another unbiased estimator of τ
then z(T ) = g(T ) − h(T ) would be a non-zero function of T (since the two functions g and h are
assumed to be distinct) for which Eθ {z(T )} = Eθ {g(T )} − Eθ {h(T )} = τ − τ = 0, contradicting
the assumption of completeness for T . Thus, complete statistics have at most one form in which
they can be used to estimate a parameter in an unbiased fashion. Formally, we have the following
definition:
Definition 2.12: If X1 , . . . , Xn are a random sample from a distribution having density function
fX (x; θ) with parameter θ ∈ Θ , then a statistic T = t(X1 , . . . , Xn ) is termed complete if and
only if
Eθ{z(T)} = 0 ⟹ Prθ{z(T) = 0} = 1,
for all θ ∈ Θ.
To illustrate, let X1, . . . , Xn again be a random sample from the uniform distribution on the
interval [0, θ] (θ > 0), and consider the statistics T1 = (Y1, Yn) and T2 = Yn, where Y1 and Yn
denote the sample minimum and maximum. It is a simple exercise to show that Eθ(Y1) = θ/(n + 1)
and Eθ(Yn) = nθ/(n + 1), so that the function z(T1) = Yn − nY1 satisfies Eθ{z(T1)} = 0 for all θ > 0;
but clearly Prθ{z(T1) = 0} = Prθ(Yn = nY1) ≠ 1 for any n > 1 (in fact, it can be shown that
this probability actually equals zero as long as n > 1). Thus, T1 is not a complete statistic
(although it is sufficient in this case, since we saw that Yn on its own is sufficient and thus any
vector-valued statistic which includes Yn as a component must be sufficient as well, though of
course it will not be minimal sufficient in such cases). Alternatively, suppose that z2(t2) is such
that Eθ{z2(T2)} = Eθ{z2(Yn)} = 0 for all θ > 0. This means that
∫_0^θ z2(y) fYn(y; θ) dy = 0
for all θ > 0. It is again a simple exercise (left to the reader) to show that the density function
associated with the distribution of Yn is given by fYn(y; θ) = nθ^{−n}y^{n−1}, so that z2(Yn) having
zero expectation implies that:
(n/θ^n) ∫_0^θ z2(y) y^{n−1} dy = 0 ⟹ ∫_0^θ z2(y) y^{n−1} dy = 0,
for all θ > 0.
for all θ > 0. Differentiating the second equation above with respect to θ, shows that z2 (Yn )
having zero expectation implies z2 (θ)θn−1 = 0 for all θ > 0. This equation, in turn, implies that
z2 (θ) = 0 for all θ > 0. In other words, z2 (·) is the zero function, so that P rθ {z2 (T2 ) = 0} =
P rθ (0 = 0) = 1. Thus, T2 = Yn is seen to be a complete (as well as sufficient) statistic.
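The contrast between T1 and T2 here can be illustrated by simulation: the function z(T1) = Yn − nY1 averages to zero for every θ and yet is essentially never itself zero. The sketch below (θ and n are arbitrary choices of ours) shows both facts:

```python
# Sketch illustrating why T1 = (Y1, Yn) is not complete for Uniform[0, theta]:
# z(T1) = Yn - n*Y1 has expectation zero but is (almost surely) nonzero.
import random

random.seed(4)
theta, n, reps = 2.0, 5, 100000
zs = []
for _ in range(reps):
    xs = sorted(random.uniform(0, theta) for _ in range(n))
    zs.append(xs[-1] - n * xs[0])     # Yn - n*Y1

print("average of z(T1):", sum(zs) / reps)                       # ~ 0
print("fraction exactly zero:", sum(z == 0 for z in zs) / reps)  # 0.0
```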
In general, demonstrating completeness for a given statistic can be quite complicated. Fortunately,
it turns out that completeness can be demonstrated for specific statistics in exponential families. In
particular, the (minimal) sufficient statistic Σ_{i=1}^n Di, where Di = {d1(Xi), . . . , dk(Xi)}, is complete
(the proof of this fact is rather technical and is omitted). The true importance of complete, sufficient
statistics is demonstrated in the following theorem:
statistics is demonstrated in the following theorem:
Theorem 2.7: Let X1 , . . . , Xn be a random sample from a distribution with density function
fX (x; θ) for some parameter θ ∈ Θ. If S = s(X1 , . . . , Xn ) is a complete and sufficient statistic,
and T = t(S) is an unbiased estimator of τ = τ (θ), then T is a UMVU estimator.
Proof: Let T′ = t′(S) be any unbiased estimator of τ which is a function of the complete,
sufficient statistic (we have assumed that T is one such estimator, but there may be others).
Then we have Eθ(T − T′) = 0 for all θ ∈ Θ. However, since T and T′ are functions of S, we
can define T − T′ = z(S) = t(S) − t′(S). Since S is assumed complete, it must be the case that
Prθ{z(S) = 0} = Prθ(T = T′) = 1. In other words, there can be only one unbiased estimator
of τ which is a function of S. Now, let T1 be any unbiased estimator of τ (not necessarily a
function of S). Since Eθ(T1|S) is unbiased and a function of S (by Theorem 2.6), it must be
the case that Eθ(T1|S) = T, regardless of the initial unbiased estimator T1. Now, Theorem 2.6
also states that Varθ{Eθ(T1|S)} = Varθ(T) ≤ Varθ(T1) for all θ ∈ Θ. Since T1 was an arbitrary
unbiased estimator of τ, we see that this final implication means that T has smaller variance
than any other unbiased estimator; in other words, T is a UMVU estimator.
Theorem 2.7 is often referred to as the Lehmann-Scheffé Theorem. The implication of the theorem
is extremely important. If there is a complete, sufficient statistic S (which we know exists in the
case of an exponential family) and there is some unbiased estimator of τ , say T1 then there is
a UMVU estimator of τ which can be arrived at by combining Theorems 2.6 and 2.7; that is,
by taking the conditional expectation of the unbiased estimator given the complete and sufficient
statistic, T = Eθ (T1 |S), since this estimator will be unbiased and will be a function of the complete,
sufficient statistic. Moreover, if we happen to have (or can easily determine) an unbiased estimator
which is a function of a complete, sufficient statistic we know that it must be a UMVU estimator
without any further modification.
Example 2.8 (cont’d): Since the Poisson distributions form an exponential family with d1(Xi) =
Xi, we know that S = Σ_{i=1}^n Xi is a complete and sufficient statistic. Furthermore, we have
seen that the statistic
T = ((n − 1)/n)^S
is an unbiased estimator of τ = τ(θ) = e^{−θ} = Prθ(Xi = 0). Thus, we have an unbiased estimator
which is a function of a complete, sufficient statistic, which implies that T must be a UMVU
estimator (even though, as we saw previously, its variance does not achieve the Cramér-Rao
lower bound).
As a final remark, we note that it is possible in certain situations for some functions of the parameter,
τ = τ (θ), to have no unbiased estimators, though the situations in which this occurs are rare and
usually not of much practical importance. Also, it is possible for unbiased estimators to exist, but
for there to be no UMVU estimator; in other words, there is no unbiased estimator whose variance
is minimal for all values of θ ∈ Θ.
2.5. Bayes Estimation
In the previous sections, our estimators have been functions of the data; in other words, they have
been based solely on the observed information, which certainly seems sensible. However, as we have
noted, the randomness in the observations means error in the estimates is inevitable. In particular,
occasionally there will be observed data which yields an estimated value for the parameter of interest
which may be “unbelievable”. In such situations, we may be tempted to conclude that our chosen
probability model is wrong. To address this concern, we may choose a new probability model, or
use so-called non-parametric methods which are less dependent on the choice of probability models
(and we shall briefly investigate this approach in Section 2.6). Suppose, however, that we believe
our chosen probability model is correct. This creates somewhat of a quandary, since we must
seemingly choose between our belief in the model and our belief that the resultant estimate of the
parameters is highly errant. The resolution to this dilemma comes from asking a simple question:
Why do we feel that the resultant estimate based on the data is so “unbelievable”? Clearly, we
must have some prior knowledge of what a “reasonable” estimate of the parameter is in order to
make such a judgement. If so, we should try to incorporate the information contained in our prior
knowledge of the specific problem under study into the estimation procedure (i.e., we should base
our estimator not only on the observed data, but also on some quantification of our prior ideas
about the likely values of the parameters being estimated).
Formally, suppose that we can model our prior belief about the “likelihood” that the parameter
of interest, θ, takes on any specified value in the parameter space, Θ, with the density function, π(θ),
referred to as the prior distribution of θ. The function π(θ) contains our beliefs about the relative
likelihood that a particular value of θ in Θ is the “true” value of the parameter (i.e., that it is the
actual value of the parameter which indexes the distribution used to characterise the population
that gave rise to the observed data). Since we are still assuming that the chosen probability model
is correct, some value of θ must indeed be the correct one, and thus the integral of π(θ) over the
full range of the parameter space, Θ, must be unity, which is why we choose π(θ) to be a density
function (or a pmf if the parameter space is discrete).
The question now arises as to how to incorporate this prior distribution into the estimation
procedure. To do this, we note that our attachment of a prior distribution to the parameter θ is
equivalent to considering it as a random variable itself. Moreover, with this interpretation of θ, we
see that the density function for the observed random variables, fX (x; θ), can be thought of as the
conditional density of the Xi ’s given θ. To combine the information regarding our prior belief and
our observed data, we focus on the “change” to our prior belief brought about by the data. In other
words, we want to examine the “likelihood” of values for the parameter θ given the new observed
data information. Formally, then, we define the posterior distribution of θ, π(θ|X1 , . . . , Xn ), using
Bayes’ Rule (which is where the name Bayesian estimation derives) as:
π(θ|X1, . . . , Xn) = L(θ; X1, . . . , Xn) π(θ) / ∫_Θ L(t; X1, . . . , Xn) π(t) dt.
[NOTE: Recall that the likelihood function of the data, L(θ; X1 , . . . , Xn ), is equivalent to the joint
density of the Xi ’s. In fact, it is the joint conditional density of the Xi ’s given θ in this case, since
θ is now assumed to follow a random distribution. Also, note that the denominator in the above
definition is just the unconditional, or marginal, density function of the Xi ’s. As such, it does
not depend on θ and, from the perspective of the posterior distribution of θ, is therefore just a
normalising constant which ensures that the posterior distribution integrates to unity. Heuristically,
then, we see that the definition of the posterior distribution can be thought of as:
Pr(θ|X1, . . . , Xn) = Pr(X1, . . . , Xn|θ) Pr(θ) / Pr(X1, . . . , Xn),
which is precisely the standard form of Bayes’ Rule.]
The posterior distribution incorporates both forms of information that we have about the
parameter; namely, our prior beliefs and the observed data. Of course, as it is a distribution
function, it does not directly give us a point estimate for the parameter of interest. Using the
posterior distribution to arrive at point estimates is the subject of the rest of this section. Before
proceeding to this discussion, however, we close with an important comment. For the remainder
of this section, we will assume that we have been given (or have made a choice of) an appropriate
prior distribution (i.e., one which accurately reflects our prior knowledge regarding the parameter
θ). Of course, in practice, the proper choice of a prior distribution is extremely difficult, and is
generally quite crucial to the end result of the estimation procedure. Unfortunately, a full discussion
regarding the proper choice of prior distributions is complex and beyond the scope of these notes.
Here, we only note that priors are often chosen for reasons of mathematical simplicity (which is
rarely a strong practical justification for the use of a specific prior).
2.5.1. Posterior Bayes Estimators: We noted previously that the posterior distribution incor-
porates all the available information regarding the parameter in our new Bayesian framework, in
much the same way that the likelihood function itself does for the specified probability model. As
such, we might consider estimating θ by using the value which maximises the posterior distribu-
tion; that is, we might use the posterior mode. Alternatively, since the posterior distribution is
indeed a distribution for θ (recall that the likelihood function is a distribution for the Xi ’s but not
necessarily for θ), we might use its mean or median as an estimator as well. Primarily for reasons
of mathematical simplicity (though we shall see there are other good reasons), we shall focus on
the posterior mean, or posterior Bayes estimator, of any parameter of interest τ = τ (θ):
τ̂π = E{τ(θ)|X1, . . . , Xn} = ∫_Θ τ(θ) π(θ|X1, . . . , Xn) dθ,
where we interpret the farthest right-hand expression as a multiple integral if θ is a vector, and
we replace integrals by appropriate sums if θ is discrete. Also, we note that the chosen notation is
designed to indicate the dependence of the estimator on the chosen prior distribution π(θ). Using
the definition of the posterior distribution, and the fact that the likelihood function is just the joint
(conditional) density of the data, we can write
τ̂π = E{τ(θ)|X1, . . . , Xn} = ∫_Θ τ(θ) π(θ|X1, . . . , Xn) dθ = ∫_Θ τ(θ) [L(θ; X1, . . . , Xn) π(θ) / ∫_Θ L(t; X1, . . . , Xn) π(t) dt] dθ
= ∫_Θ τ(θ) {∏_{i=1}^n fX(xi; θ)} π(θ) dθ / ∫_Θ {∏_{i=1}^n fX(xi; θ)} π(θ) dθ,
provided the observed Xi ’s are independent and identically distributed [NOTE: in the denominator
of final expression, we have switched the integration variable from t to θ, since once this integral
is factored outside the integral in the numerator, there is no longer any possibility of ambiguity].
Note the similarity between this estimator and the Pitman estimator of location defined in Section
2.2.2.
Example 2.5 (cont’d): Let X1 , . . . , Xn be a sample from the Bernoulli distribution with pa-
rameter θ, so that fX(x; θ) = θ^x(1 − θ)^{1−x} for x = 0, 1. Suppose that we choose a uniform
distribution over the range Θ = (0, 1) to represent our prior belief regarding θ, so that π(θ) = 1
for 0 ≤ θ ≤ 1 (note that the uniform prior indicates that we believe each value is as likely
as any other, so that this prior may serve to indicate the general notion of “no prior belief”
regarding the value of θ). So, to estimate τ (θ) = θ, the parameter itself, using the posterior
Bayes estimator, we have:
θ̂π = [∫_0^1 θ ∏_{i=1}^n θ^{xi}(1 − θ)^{1−xi} π(θ) dθ] / [∫_0^1 ∏_{i=1}^n θ^{xi}(1 − θ)^{1−xi} π(θ) dθ]
= [∫_0^1 θ^{1+Σ_{i=1}^n xi} (1 − θ)^{n−Σ_{i=1}^n xi} dθ] / [∫_0^1 θ^{Σ_{i=1}^n xi} (1 − θ)^{n−Σ_{i=1}^n xi} dθ].
Now, it is not difficult to show (and is left as an exercise) that the Beta integral can be calculated
as:
∫_0^1 θ^{a−1}(1 − θ)^{b−1} dθ = Γ(a)Γ(b)/Γ(a + b),
for any positive constants a and b [and, of course, Γ(k) = ∫_0^∞ x^{k−1}e^{−x} dx is the usual Gamma
function, which satisfies the simple relationship Γ(k + 1) = kΓ(k), a fact which is easily demon-
strated using integration by parts]. Thus, we see that the posterior Bayes estimator for θ is
given by:
θ̂π = [Γ(2 + Σ_{i=1}^n xi) Γ(n + 1 − Σ_{i=1}^n xi)/Γ(n + 3)] × [Γ(n + 2)/{Γ(1 + Σ_{i=1}^n xi) Γ(n + 1 − Σ_{i=1}^n xi)}]
= {Γ(2 + Σ_{i=1}^n xi) Γ(n + 2)}/{Γ(1 + Σ_{i=1}^n xi) Γ(n + 3)} = (1 + Σ_{i=1}^n xi)/(n + 2).
Alternatively, suppose that we choose a Beta distribution as our prior, so that π(θ) = πa,b(θ) =
{Γ(a + b)/(Γ(a)Γ(b))} θ^{a−1}(1 − θ)^{b−1} for some chosen positive values of the constants a and b. In this case,
nearly identical calculations to those performed above (and based on the fact that this prior
leads to readily tractable mathematics, which is precisely why it was chosen) show that:
θ̂πa,b = (a + Σ_{i=1}^n xi)/(n + a + b).
[NOTE: The case a = b = 1 reduces to the case of a uniform prior, and yields the appropriate
result.] Finally, we note that the above estimator can be written as
θ̂πa,b = {n/(n + a + b)} x̄ + {(a + b)/(n + a + b)} · {a/(a + b)},
where x̄ = n^{−1} Σ_{i=1}^n xi is the observed sample average (which in this case is also the observed
proportion of data values which were equal to 1). It is a simple exercise to show that the expec-
tation of a random variable with a distribution having density πa,b (θ) (i.e., a Beta distribution
with parameters a and b) is given by a/(a + b). So, the new form of the estimator shows that in
this case the posterior Bayes estimator can be seen as the weighted average between the maxi-
mum likelihood estimator (i.e., the estimator we would commonly use when we were not trying
to incorporate prior information, but rather basing our estimate solely on the data) and the
“pure prior” estimator (i.e., the mean of the prior distribution, which is what the posterior
Bayes estimator reduces to if we have no observed data). In closing this example, however, we
note that it is not always possible to write a posterior Bayes estimator in such a form (i.e., as
a weighted average of the “pure prior” estimate and the MLE).
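The weighted-average identity is easy to confirm numerically; the sketch below (the data and prior constants are our own illustrative choices) computes the estimate both ways:

```python
# Sketch checking the weighted-average form of the Beta-prior Bayes estimator.
xs = [1, 0, 1, 1, 0, 1, 0, 1]        # 5 successes in n = 8 Bernoulli trials
a, b = 2.0, 3.0                      # Beta prior parameters
n, s = len(xs), sum(xs)
xbar = s / n

direct = (a + s) / (n + a + b)       # posterior Bayes estimate
weighted = (n / (n + a + b)) * xbar + ((a + b) / (n + a + b)) * (a / (a + b))
assert abs(direct - weighted) < 1e-12
print(direct)    # 7/13 ~ 0.538, between xbar = 0.625 and the prior mean 0.4
```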
We note that the (conditional) expectation of the estimator in the preceding example is given by:
E(θ̂πa,b|θ) = (nθ + a)/(n + a + b) ≠ θ,
unless a = b = 0, which is not allowed (as the parameters a and b must be positive). As such, the
posterior Bayes estimator in this instance is not (conditionally) unbiased. Indeed, this turns out to
be a general phenomenon, as the following theorem shows:
Theorem 2.8: Let τ̂π be the posterior Bayes estimator of τ = τ(θ) with respect to the prior
distribution π(θ). If both τ̂π and τ(θ) have finite variances, then either Pr{τ̂π = τ(θ)|θ} = 1
or else E(τ̂π|θ) ≠ τ(θ). In other words, the only way for a posterior Bayes estimator to be
(conditionally) unbiased is if it always yields exactly the correct value of τ(θ).
Proof: We start by supposing that τ̂π is (conditionally) unbiased, so that E(τ̂π|θ) = τ(θ).
Then, we have:
Var(τ̂π) = E{Var(τ̂π|θ)} + Var{E(τ̂π|θ)} = E{Var(τ̂π|θ)} + Var{τ(θ)}.
On the other hand, since τ̂π = E{τ(θ)|X1, . . . , Xn}, the same variance decomposition applied
in the other direction gives Var{τ(θ)} = E{Var(τ(θ)|X1, . . . , Xn)} + Var(τ̂π). Substituting this
expression into the first equality and cancelling Var(τ̂π) shows that:
0 = E{Var(τ̂π|θ)} + E{Var(τ(θ)|X1, . . . , Xn)},
and since both of the quantities on the right-hand side of this equality are non-negative (since
they are expectations of conditional variances, which cannot be negative), both of the quantities
must be zero. In particular, we see that E{Var(τ̂π|θ)} = 0, which implies Var(τ̂π|θ) = 0, since
again Var(τ̂π|θ) cannot be negative, and therefore the only way it can have zero expectation
is for it to always be zero. Finally, we note that the only way a random variable can have
(conditional) variance of zero is if it is always equal to its (conditional) expectation, and thus
we see that if τ̂π is assumed unbiased, we must have Pr{τ̂π = τ(θ)|θ} = 1. Thus, we have shown
that there are only two possibilities: either τ̂π is not unbiased, or else it is always equal to τ(θ),
as was required.
Finally, we note that the uniform prior chosen in Example 2.5 was seen to represent the notion
of “no prior information” regarding the parameter θ, since it gave equal likelihood to all possible
values. Such a prior distribution is often termed non-informative. It is sometimes argued that such
priors are the most sensible ones to choose in most situations. A full discussion of such ideas is
again beyond the scope of these notes; however, we note that it is not always possible to define
such non-informative priors. Moreover, even if we can define a non-informative prior distribution
for a particular parameter θ, if we reparameterise our probability model using the new parameter
η = η(θ), it is rarely the case that the non-informative prior for θ will transform into a corresponding
non-informative prior for η. In other words, we know that if θ has a distribution with density π(θ),
then any one-to-one function (which a reparameterisation must be) η = g(θ) has density function:
πη(η) = π{g^{−1}(η)} |d g^{−1}(η)/dη|,
where g −1 (η) is the inverse function of g(θ) (which again must exist since a reparameterisation is
a one-to-one function). Clearly, then, if π(θ) is the density of a uniform distribution, then it will
rarely be the case that πη (η) will also be a uniform distribution. Thus, assuming no information
on a particular parameter scale, generally means that we are assuming we do have information
on some other parameter scale. This lack of invariance for the property of non-informativeness in
prior distributions makes their use somewhat suspect. At the very least, we must be reasonably
sure about the appropriate scale on which to choose to represent our “lack of prior knowledge”
about the problem at hand. This is, of course, just another piece of evidence demonstrating
the difficulties involved in choosing an appropriate prior distribution. [NOTE: For those who are
interested, another popular choice of prior distribution, designed to represent the notion of a lack of
any prior information, is the so-called vague or Jeffreys prior, which is based on the square-root of
the expected Fisher information and does have the above noted invariance property. Alternatively,
the method of empirical Bayes estimation attempts to use the data itself to choose, at least in part,
the appropriate prior distribution.]
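The lack-of-invariance point is easily seen in a concrete case: a uniform prior for θ on (0, 1) is far from uniform on the scale η = ln(θ). The sketch below applies the change-of-variable formula above (the choice g(θ) = ln θ is our own illustration):

```python
# Sketch: a "non-informative" uniform prior on theta is informative on the
# log scale eta = ln(theta), by the change-of-variable formula.
import math

def prior_theta(theta):
    return 1.0 if 0.0 < theta < 1.0 else 0.0    # uniform prior on (0, 1)

def prior_eta(eta):
    # g(theta) = ln(theta), so g^{-1}(eta) = e^eta and |d g^{-1}/d eta| = e^eta.
    return prior_theta(math.exp(eta)) * math.exp(eta)

for eta in [-3.0, -2.0, -1.0, -0.5]:
    print(eta, prior_eta(eta))   # density e^eta: clearly not constant in eta
```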
2.5.2. Bayes Risk and Minimax Estimators: In Section 2.2.4, we introduced the concept of
loss functions, to measure the relative cost of making various errors in our estimation process. In
this section, we discuss how the use of a prior distribution can be combined with a selected loss
function to arrive at optimal estimators. Recall, however, that we have the same issues regarding
appropriate choice of a loss function that we do for prior distributions, and we will again simply
assume that an appropriate choice of prior and loss function have been made without delving into
the complex (and sometimes non-statistical) issues involved in this selection.
Formally, let X1 , . . . , Xn be a random sample from a distribution with density function fX (x; θ)
for some parameter θ ∈ Θ. We will assume that θ is a random variable with some (known) prior
distribution π(θ). Using this prior information as well as the sample observations, we wish to
estimate the parameter τ = τ (θ). In addition, we assume that the loss function (t; θ) has been
specified and determines the relative cost of estimating τ as t when θ is the true value of the
Statistical Inference (STAT3013/8027) Lecture Notes - Page 41
parameter (i.e., the particular outcome from the chosen prior distribution). For any estimator,
T = t(X1 , . . . , Xn ) (which may depend on the prior distribution as well), we defined the risk function
as Rt(θ) = Eθ{ℓ(T; θ)}, which we now will write as Rt(θ) = E{ℓ(T; θ)|θ} since θ is considered as
a random variable in our present context. Our original goal was to choose an estimator T which
had uniformly minimal risk over the entire range of θ values. Of course, in general, we saw that
no such estimator existed, the difficulty arising from the fact that the risk function depends on
θ, and for any pair of estimators one will generally be better for some possible values of θ and
worse for others. In the present situation, we have assumed that θ is a random variable; in other
words, we have an idea of which values of θ are the most likely. As such, we might try to choose
an estimator which minimises the risk appropriately averaged over the possible θ values; that is,
choose an estimator which does “best” for the most “likely” values of θ. Formally, we define the
Bayes risk of an estimator as follows:
Definition 2.13: Let X1 , . . . , Xn be a random sample from a distribution having density
function fX (x; θ) for some parameter θ ∈ Θ, θ being a random variable with prior distribution
π(θ). For estimating τ = τ(θ) using the loss function ℓ(t; θ) and an estimator T = t(X1, . . . , Xn),
the risk function was defined as Rt(θ) = E{ℓ(T; θ)|θ}. The Bayes risk of the estimator T with
respect to the chosen loss function and prior distribution is then defined as:
$$r(t) = r_{\ell,\pi}(t) = \int_\Theta R_t(\theta)\,\pi(\theta)\,d\theta = E_\pi\{R_t(\theta)\},$$
where the notation Eπ indicates expectation taken with respect to the prior distribution.
Note that the Bayes risk of an estimator is a weighted average of its risk function, Rt (θ), where
the weights represent the likelihood that the risk at any given value of θ is the pertinent one; that
is, the weights represent the likelihood of any θ value based on our prior information. Since the
Bayes risk is now a single number, rather than a function of θ as the risk function itself was, we
can easily define the “best” estimator in this context as the one which minimises the Bayes risk:
Definition 2.14: Under the structure determined in Definition 2.13, the Bayes estimator of
τ(θ) with respect to a chosen loss function and prior distribution is that estimator T = T_{ℓ,π} =
t_{ℓ,π}(X1, . . . , Xn) with the smallest Bayes risk. In other words, T_{ℓ,π} is a Bayes estimator if
$$r_{\ell,\pi}(t_{\ell,\pi}) \le r_{\ell,\pi}(t)$$
for any other estimator T = t(X1, . . . , Xn).
So, we now see that the posterior Bayes estimator introduced in Section 2.5.1 is indeed a Bayes
estimator with respect to squared-error loss. Furthermore, nearly identical calculations, combined
with the fact that the function h(a) = E(|Z − a|) for any random variable Z is minimised at a =
median(Z), show that the Bayes estimator of a scalar parameter θ under absolute-error loss is given
by the median of the posterior distribution, π(θ|X1 = x1 , . . . , Xn = xn ). [NOTE: Similarly, the
Bayes estimator under absolute-error loss of τ (θ) is given by the median of the posterior distribution
of τ (θ). Of course, to find the posterior distribution of τ (θ) we must use the change-of-variable
formula on the posterior distribution of the parameter itself, π(θ|X1 = x1 , . . . , Xn = xn ).] Finally,
we note that choosing the constant-error loss function with window-width ε, ℓ(t; θ) = A·I{|t−τ(θ)|>ε},
deriving the associated Bayes estimator and then letting ε tend to zero, yields the mode of the
posterior distribution of τ (θ) (again, requiring the use of the change of variable formula to arrive
at the appropriate posterior distribution for the parameter τ ). In other words, while the posterior
mode is not (necessarily) directly a Bayes estimator, it is the limit of a sequence of Bayes estimators
(of course, in some circumstances the posterior mode may be the Bayes estimator for some other
choice of loss function). The demonstration of this fact follows along the lines of the demonstration
for the posterior mean and posterior median Bayes estimators, however, it is rather technical and
unenlightening, and is thus omitted from these notes.
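Before turning to a worked example, we note that all three posterior summaries are easy to obtain numerically from an unnormalised posterior evaluated on a grid. The following Python sketch (assuming numpy; the Bernoulli likelihood, Beta(2, 2) prior and data counts are purely illustrative choices of ours) demonstrates the idea:

    import numpy as np

    def posterior_summaries(log_post, grid):
        # Normalise an unnormalised log-posterior evaluated on an equally
        # spaced grid, then return the posterior mean, median and mode.
        w = np.exp(log_post - log_post.max())
        w /= np.trapz(w, grid)
        cdf = np.cumsum(w) * (grid[1] - grid[0])
        mean = np.trapz(grid * w, grid)           # Bayes estimate: squared-error loss
        median = grid[np.searchsorted(cdf, 0.5)]  # Bayes estimate: absolute-error loss
        mode = grid[np.argmax(w)]                 # limit of constant-error loss
        return mean, median, mode

    # Illustration: 7 successes in n = 10 Bernoulli trials with a Beta(2, 2)
    # prior, so the posterior is Beta(9, 5) with kernel theta^8 (1-theta)^4.
    theta = np.linspace(1e-4, 1 - 1e-4, 100_000)
    log_post = 8 * np.log(theta) + 4 * np.log(1 - theta)
    print(posterior_summaries(log_post, theta))
    # mean = 9/14 = 0.643, mode = 8/12 = 0.667; the median lies between.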
Example 2.11: Suppose that X1 , . . . , Xn are independent random variables each having a
normal distribution with zero mean and variance (2θ)−1 . The joint conditional distribution of
the Xi ’s given θ (which is also the joint conditional likelihood function) is then:
$$L(\theta; x_1, \ldots, x_n) = \pi^{-n/2}\,\theta^{n/2}\,e^{-\theta\sum_{i=1}^n x_i^2}.$$
Further, suppose that we select a Gamma prior distribution for θ with shape parameter α and
scale parameter 1/β, so that
$$\pi(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\theta^{\alpha-1}\,e^{-\beta\theta}.$$
Thus, the posterior distribution for θ is:
$$\pi(\theta|x_1, \ldots, x_n) = C(x_1, \ldots, x_n)\,\theta^{n/2+\alpha-1}\,e^{-(\beta+y)\theta},$$
where $y = \sum_{i=1}^n x_i^2$ and C(x1, . . . , xn) is the appropriate normalising constant. Since this
clearly has the form of a Gamma density with shape parameter n/2 + α and scale parameter
(β + y)⁻¹, we can conclude that
$$C(x_1, \ldots, x_n) = \frac{(\beta+y)^{n/2+\alpha}}{\Gamma(n/2+\alpha)} = \frac{\left(\beta+\sum_{i=1}^n x_i^2\right)^{n/2+\alpha}}{\Gamma(n/2+\alpha)}.$$
If we select squared-error loss, then we know that the Bayes estimator for θ is given by
E(θ|x1 , . . . , xn ), the mean of the posterior distribution. In this case, the posterior distribu-
tion is a Gamma distribution which has mean (n/2 + α)/(β + y). Also, note that the vari-
ance of the Xi ’s is σ 2 = (2θ)−1 , which means that the Bayes estimate (under squared-error
loss) is E{(2θ)−1 |x1 , . . . , xn }. Now, it is a simple exercise (left to the reader) to show that
if Z has a Gamma distribution with shape parameter a > 1 and scale parameter b, then
E(1/Z) = {b(a − 1)}⁻¹. Therefore, the Bayes estimator of σ² is given by:
$$E\{(2\theta)^{-1}|x_1, \ldots, x_n\} = \frac{\beta+y}{2(n/2+\alpha-1)} = \frac{\beta+y}{n+2\alpha-2}.$$
The posterior distribution of σ² = (2θ)⁻¹ itself has the form of an inverse Gamma distribution
(the demonstration of this fact derives from a straightforward implementation of the change-of-variable
formula for probability densities and is left as an exercise). So, if we use absolute-error
loss, the Bayes estimator is the median of this inverse Gamma posterior distribution which,
unfortunately, does not admit a closed-form expression. Finally, if we take the limit of the Bayes
estimators associated with the constant-error loss function with window-width ε, we arrive at the
mode of the posterior distribution for σ² = (2θ)⁻¹ as our estimator, which is easily calculated as:
$$\text{mode}\{\pi_1(\sigma^2|x_1, \ldots, x_n)\} = \frac{\beta+\sum_{i=1}^n x_i^2}{n+2\alpha+2} = \frac{\beta+y}{n+2\alpha+2}.$$
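These closed-form results are easily verified numerically. The sketch below (assuming numpy and scipy; the values of n, α, β and the generating value of θ are arbitrary illustrations) uses the fact noted above that σ² = (2θ)⁻¹ has an inverse Gamma posterior:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha, beta = 3.0, 2.0                      # illustrative prior parameters
    n = 50
    x = rng.normal(0.0, np.sqrt(0.5), size=n)   # data generated with sigma^2 = 0.5
    y = np.sum(x**2)

    # theta | x ~ Gamma(n/2 + alpha, scale 1/(beta + y)), so sigma^2 = 1/(2 theta)
    # has an inverse Gamma posterior with shape n/2 + alpha, scale (beta + y)/2.
    post = stats.invgamma(n / 2 + alpha, scale=(beta + y) / 2)

    print((beta + y) / (n + 2 * alpha - 2), post.mean())  # mean: squared-error loss
    print(post.median())                                  # median: absolute-error loss
    print((beta + y) / (n + 2 * alpha + 2))               # mode: closed form above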
The Bayes estimators derived in the preceding example are seen to be functions of $Y = \sum_{i=1}^n X_i^2$,
which we have seen is the minimal sufficient statistic in this case. In fact, it can be shown quite
generally that Bayes estimators will be functions of the minimal sufficient statistics, as well as BAN
(best asymptotically normal),
for any choice of prior. So, even if we are unsure about our particular choice of prior distribution,
we can at least be sure that our Bayes estimator has some desirable properties regardless of our
choice of prior. In this vein, we close with a theorem which relates Bayes estimators to the minimax
estimators defined in Section 2.2.4. Recall that T = t(X1 , . . . , Xn ) is a minimax estimator of τ (θ)
for the specified loss function ℓ(t; θ) if the maximum value of its risk function, Rt(θ) = Eθ{ℓ(T; θ)},
over the parameter space, Θ, is smaller than the maximum value of the risk function for any other
estimator; in other words, T is minimax if
$$\sup_{\theta\in\Theta}\{R_t(\theta)\} \le \sup_{\theta\in\Theta}\{R_{t'}(\theta)\}$$
for any other estimator T′ = t′(X1, . . . , Xn) (see Definition 2.7). The idea behind minimax esti-
mators is a desire to be “conservative” or “risk averse”, as minimax estimators seek to minimise
the impact of the worst possible estimation outcome. Unfortunately, as we noted in Section 2.2.4,
finding minimax estimators is generally quite difficult. However, as the next theorem shows, we
can sometimes arrive at minimax estimators through a Bayesian estimation procedure:
Theorem 2.9: If T = t(X1, . . . , Xn) is the Bayes estimator for the parameter τ = τ(θ) under
the loss function ℓ(t; θ) and the prior distribution π(θ), and the risk function for T is constant
[i.e., Rt(θ) ≡ c for some value c which does not depend on θ], then T is a minimax estimator.
Proof: Since T is the Bayes estimator under the given loss function and prior distribution, we
know that it has smaller Bayes risk than any other estimator T′ = t′(X1, . . . , Xn). In other
words, we know that
$$r_{\ell,\pi}(t) = \int_\Theta R_t(\theta)\,\pi(\theta)\,d\theta \le \int_\Theta R_{t'}(\theta)\,\pi(\theta)\,d\theta = r_{\ell,\pi}(t'),$$
where R_{t′}(θ) is the risk function for the arbitrary new estimator T′. Therefore, since we have
assumed Rt(θ) ≡ c, we have:
$$\sup_{\theta\in\Theta}\{R_t(\theta)\} = c = \int_\Theta c\,\pi(\theta)\,d\theta = \int_\Theta R_t(\theta)\,\pi(\theta)\,d\theta \le \int_\Theta R_{t'}(\theta)\,\pi(\theta)\,d\theta \le \sup_{\theta\in\Theta}\{R_{t'}(\theta)\},$$
for any estimator T′ [NOTE: the final inequality follows from the fact that $\int_\Theta R_{t'}(\theta)\pi(\theta)\,d\theta = E_\pi\{R_{t'}(\theta)\}$,
and the expectation of a random variable clearly cannot be larger than the supremum
of the random variable over its sample space]. Thus, T must be a minimax estimator.
Example 2.5 (cont’d): We saw that for the parameter in a Bernoulli distribution, θ, the Bayes
estimator using squared-error loss and a Beta distribution prior with parameters a and b was
given by
$$\hat\theta_{\pi_{a,b}} = \frac{a+\sum_{i=1}^n X_i}{n+a+b}.$$
Now, writing A = (n + a + b)⁻¹, the risk function for θ̂_{π_{a,b}} under squared-error loss is given by:
$$R_{\hat\theta_{\pi_{a,b}}}(\theta) = E_\theta\left\{\left(\frac{a+\sum_{i=1}^n X_i}{n+a+b} - \theta\right)^2\right\} = E_\theta\left\{\left(A\sum_{i=1}^n X_i + aA - \theta\right)^2\right\}$$
$$= A^2\,E_\theta\left\{\left(\sum_{i=1}^n X_i\right)^2\right\} + 2A(aA-\theta)\,E_\theta\left(\sum_{i=1}^n X_i\right) + (aA-\theta)^2$$
$$= A^2\{n\theta(1-\theta) + n^2\theta^2\} + 2nA\theta(aA-\theta) + (aA-\theta)^2.$$
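This risk function is straightforward to explore numerically. In light of Theorem 2.9, choices of a and b for which the risk is constant in θ are of special interest; the sketch below (assuming numpy, with an arbitrary illustrative sample size) evaluates the expression just derived over a grid of θ values:

    import numpy as np

    def risk(theta, n, a, b):
        # Risk of the Beta(a, b)-prior Bayes estimator under squared-error
        # loss, using the expansion derived above, with A = 1/(n + a + b).
        A = 1.0 / (n + a + b)
        return (A**2 * (n * theta * (1 - theta) + n**2 * theta**2)
                + 2 * n * A * theta * (a * A - theta) + (a * A - theta)**2)

    theta = np.linspace(0.0, 1.0, 11)
    n = 25
    print(risk(theta, n, a=1.0, b=1.0))  # varies noticeably with theta
    print(risk(theta, n, a=np.sqrt(n) / 2, b=np.sqrt(n) / 2))
    # With a = b = sqrt(n)/2 the printed risk is constant in theta (here
    # 1/144), so by Theorem 2.9 that Bayes estimator is also minimax.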
2.6.1. The Empirical Distribution Function: For observed data values x1, . . . , xn, the empirical
distribution function is defined as:
$$\hat F(x) = \frac{1}{n}\sum_{i=1}^n I_{(x_i\le x)} = \frac{n_x}{n},$$
where nx is defined as the number of observed data values which are less than or equal to the
value x. Essentially, the empirical distribution function F̂ is the CDF of a new discrete random
variable, say X*, defined to take a value chosen at random from the collection of observed data
values X = {x1, . . . , xn}. In this way, the relationship between F̂ and X* mimics the relationship
between F and the original random variables representing the data values, X1, . . . , Xn (of course,
X* is by its nature discrete whereas the Xi's may be either discrete or continuous). We shall take
advantage of this relationship in more detail later, but for now it suffices to note that the obvious
analogy between the pairs (F, X) and (F̂, X*) means that it is reasonable to assume that studying
(F̂, X*) will likely yield information about (F, X). In particular, we note that, for any given value
x, F̂ (x) is an unbiased estimate of F (x), since
$$E_F\{\hat F(x)\} = E_F\left\{\frac{1}{n}\sum_{i=1}^n I_{(X_i\le x)}\right\} = \frac{1}{n}\sum_{i=1}^n E_F\{I_{(X_i\le x)}\} = \frac{1}{n}\sum_{i=1}^n Pr(X_i\le x) = \frac{1}{n}\sum_{i=1}^n F(x) = F(x),$$
where the notation EF is used to indicate expectation under the true distribution determined by
the CDF F (in just the same way that the previous notation Eθ indicated expectation under the
distribution indexed by the parameter value θ). Of course, this result also follows directly upon the
recognition that the random variable nx (the number of observed data values less than or equal to x)
is clearly binomially distributed with n trials and a “success” probability of p = P r(Xi ≤ x) = F (x).
Thus, we can see that EF (nx /n) = EF (nx )/n = nF (x)/n = F (x). This characterisation shows
further that:
$$Var_F\{\hat F(x)\} = Var_F\left(\frac{n_x}{n}\right) = \frac{1}{n^2}\,Var_F(n_x) = \frac{1}{n^2}\{np(1-p)\} = \frac{1}{n}\,F(x)\{1-F(x)\}.$$
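These two properties of F̂ are easily checked by simulation. The sketch below (assuming numpy and scipy; the small dataset, sample size, evaluation point and standard normal population are illustrative choices) does so directly:

    import numpy as np
    from scipy.stats import norm

    def edf(data):
        # Return the empirical distribution function of the observed data.
        data = np.sort(np.asarray(data, dtype=float))
        return lambda x: np.searchsorted(data, x, side="right") / data.size

    F_hat = edf([2.1, 0.3, 1.7, 3.4, 0.9])
    print(F_hat(1.7))  # n_x/n = 3/5, since 0.3, 0.9 and 1.7 are <= 1.7

    # Simulation check of E{F_hat(x)} = F(x) and Var{F_hat(x)} = F(1-F)/n:
    rng = np.random.default_rng(0)
    n, x0 = 20, 0.5
    sims = np.array([edf(rng.standard_normal(n))(x0) for _ in range(20_000)])
    print(sims.mean(), norm.cdf(x0))                          # both ~ 0.691
    print(sims.var(), norm.cdf(x0) * (1 - norm.cdf(x0)) / n)  # both ~ 0.0107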
As noted earlier, there are other methods of estimating F , but none are quite as simple and intuitive
as the empirical distribution function F̂ (indeed, in some sense, F̂ can be viewed as a MLE of F ).
Of course, as we noted in the introduction to this section, we are usually not interested in
estimating F directly, but rather some functional of it, θ(F ). The obvious estimator of this quantity
then becomes θ̂ = θ(F̂ ). Indeed, such an approach will lead us directly to our “common-sense”
estimators for many of the commonly used functionals of interest. In particular, suppose that θ(F )
represents the expectation of a random variable, X, having distribution F , so that θ(F ) = EF (X).
In this case, the estimator we arrive at for the expected value of F is given by
$$\hat\theta = \theta(\hat F) = E_{\hat F}(X^*) = \sum_{x\in\mathcal{X}} x\,p_{\hat F}(x) = \frac{1}{n}\sum_{i=1}^n x_i = \bar x,$$
since the (discrete) random variable X*, having a distribution with CDF F̂, was defined to have sample
space X = {x1 , . . . , xn } and pmf pF̂ (x) = n−1 for all x ∈ X . In this case, we can further see
that θ(F̂ ) is an unbiased estimator of θ(F ) (since the sample average is always unbiased for the
population expectation, regardless of the population distribution). Unfortunately, it will not always
be the case that θ(F̂ ) will be unbiased for θ(F ) when the functional θ(·) is a more complicated one,
despite the fact that we have seen that F̂ itself is always unbiased for F .
In the following sections, we investigate ways of assessing and correcting the bias of θ̂ = θ(F̂ ),
as well as estimating its variance, V arF {θ(F̂ )}. Before proceeding, however, we note that there are
alternative “non-parametric” estimation procedures, the most common ones based on the ranked
data. We shall discuss such procedures a little later, but for now we simply note that some of
the most elementary estimators such as the median and the inter-quartile range are “rank-based”
estimators, since their construction is based on examination of the sorted data values. Of course,
the median can also be viewed as an estimator based on F̂ , since defining θ(F ) to be the median
of the distribution characterised by the CDF F clearly implies that θ(F̂ ) is equal to the median
of the observed data (the distinction between this approach and that of “rank-based” methods is
that in the latter case we may wish to use the median as an estimator for the population mean as
opposed to the population median).
2.6.2. The Jackknife, Bias Correction and Variance Estimation: We now turn our attention to
assessing the properties of the estimator θ(F̂ ). In particular, we will be interested in investigating its
bias and variance. Moreover, our investigation of bias will generally have as its aim the subsequent
modification of our estimator so as to reduce the bias. In other words, we will want to construct a
new estimator of the form θ̃ = θ(F̂) − B̂ = θ̂ − B̂, where B̂ is an estimate of the bias of θ̂,
Bias_F{θ(F̂)} = E_F{θ(F̂)} − θ(F). The Jackknife approach to this problem is based on the
"leave-one-out" estimates
$$\hat\theta_i = \theta(\hat F_i),$$
where F̂i is the empirical distribution function based on the observations x1, . . . , xi−1, xi+1, . . . , xn;
that is, F̂i is the empirical distribution function based on the observed data after the iᵗʰ value has
been deleted. The idea behind this approach is that these θ̂i values can be seen as estimates of
θ̂, and the degree to which their average $\hat\theta_\bullet = \frac{1}{n}\sum_{i=1}^n \hat\theta_i$ differs from θ̂ (i.e., the degree to which
the θ̂i's are biased as estimators of θ̂) is a reasonable reflection of the level of bias in θ̂ itself as an
estimator of θ(F). Specifically, we will define
$$\hat B_J = (n-1)(\hat\theta_\bullet - \hat\theta),$$
and then define the Jackknife bias-corrected estimator of θ(F) to be θ̃J = θ̂ − B̂J.
The justification of this procedure is somewhat technical, but we can give a reasonable heuristic
explanation. Suppose that the bias of θ(F̂ ) decreases as the sample size increases in such a way
that
$$E_F\{\theta(\hat F)\} = E(\hat\theta) \approx \theta(F) + \frac{a(F)}{n},$$
for some (often unknown) constant a(F) depending on F. It turns out that this is quite generally
true for most of the commonly used functionals θ(·) of interest. As such, we see that
$$E_F\{\theta(\hat F_i)\} = E(\hat\theta_i) \approx \theta(F) + \frac{a(F)}{n-1},$$
since F̂i is just an empirical distribution function based on n − 1 observations rather than n.
Consequently,
$$E(\hat B_J) = (n-1)\{E(\hat\theta_\bullet) - E(\hat\theta)\} \approx (n-1)\left\{\frac{a(F)}{n-1} - \frac{a(F)}{n}\right\} = \frac{a(F)}{n},$$
which is precisely the approximate bias of θ̂, and thus E(θ̃J) = E(θ̂) − E(B̂J) ≈ θ(F).
In other words, θ̃J is approximately unbiased [and indeed, is exactly unbiased if the expected value
of θ(F̂) is exactly equal to θ(F) + (a/n), as the following example shows].
Example 2.12: Suppose that θ(F) is the variance functional; that is, θ(F) = σF² = E_F[{X −
E_F(X)}²]. In this case,
$$\hat\theta = \theta(\hat F) = E_{\hat F}[\{X^* - E_{\hat F}(X^*)\}^2] = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2.$$
[NOTE: The divisor here is n rather than n − 1 since, under F̂, the mean of the random variable
X* is "known" to be x̄. In other words, we are calculating the "population" variance for a random
variable with CDF F̂.] Clearly, this estimate is biased, and indeed it is easy to show that:
$$E_F(\hat\theta) = \frac{n-1}{n}\,\theta(F) = \sigma_F^2 - \frac{\sigma_F^2}{n}.$$
As such, we have seen that the Jackknife bias-corrected estimator will be exactly unbiased in
this case. Indeed, we see that in this case
$$\hat\theta_i = E_{\hat F_i}[\{X^* - E_{\hat F_i}(X^*)\}^2] = \frac{1}{n-1}\sum_{j\ne i}(x_j - \bar x_i)^2,$$
where $E_{\hat F_i}(X^*) = (n-1)^{-1}\sum_{j\ne i} x_j = \bar x_i$, since F̂i is the CDF of the discrete random variable
with sample space Xi = {x1, . . . , xi−1, xi+1, . . . , xn} and pmf p_{F̂i}(x) = (n − 1)⁻¹ for x ∈ Xi.
Some further straightforward (though rather tedious) algebraic manipulation (left as an exercise
for the reader) then shows that:
$$\hat\theta_\bullet = \frac{n-2}{(n-1)^2}\sum_{i=1}^n (x_i - \bar x)^2.$$
[NOTE: This calculation is made simpler upon noting that $\bar x_i = \frac{n}{n-1}\bar x - \frac{1}{n-1}x_i$.] Therefore, we
can calculate the Jackknife bias estimate as:
$$\hat B_J = (n-1)(\hat\theta_\bullet - \hat\theta) = (n-1)\left\{\frac{n-2}{(n-1)^2} - \frac{1}{n}\right\}\sum_{i=1}^n(x_i-\bar x)^2 = -\frac{1}{n(n-1)}\sum_{i=1}^n(x_i-\bar x)^2,$$
and thus
$$\tilde\theta_J = \hat\theta - \hat B_J = \frac{1}{n}\sum_{i=1}^n(x_i-\bar x)^2 + \frac{1}{n(n-1)}\sum_{i=1}^n(x_i-\bar x)^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)^2,$$
which is exactly unbiased, being the usual sample variance.
An alternative formulation of the Jackknife procedure is based on defining the quantities
$$\tilde\theta_i = \hat\theta + (n-1)(\hat\theta - \hat\theta_i),$$
which are generally referred to as the pseudo-values; the Jackknife bias-corrected estimate
of θ(F) is then given by $\tilde\theta_J = n^{-1}\sum_{i=1}^n \tilde\theta_i$. It is reasonably straightforward to extend these ideas to
develop an estimate of variance as well:
$$\widehat{Var}_J(\hat\theta) = \frac{1}{n(n-1)}\sum_{i=1}^n (\tilde\theta_i - \tilde\theta_J)^2 = \frac{1}{n}\,\tilde s^2,$$
where s̃2 is just the sample variance of the θ̃i ’s. This estimator has obvious intuitive appeal, with
the pseudo-values, θ̃i , used to find an unbiased estimate of θ(F ) or the variance of the estimator
θ(F̂ ) in direct analogy to how the observed data values themselves are used to find an unbiased
estimator of the population mean (i.e., the sample average) or the variance of the mean (i.e., the
usual sample variance divided by the sample size). Indeed, it can easily be shown that when
θ(F ) = EF (X), we have
$$\tilde\theta_i = \theta(\hat F) + (n-1)\{\theta(\hat F) - \theta(\hat F_i)\} = \bar x + (n-1)(\bar x - \bar x_i) = n\bar x - (n-1)\bar x_i = \sum_{j=1}^n x_j - \sum_{j\ne i} x_j = x_i;$$
that is, the pseudo-values are just the observed data values themselves. In this case, the Jackknife
bias estimate is clearly seen to be zero (as it should be, since θ̂ = x is unbiased in this case) and
the Jackknife estimate of variance is just s²/n, where $s^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar x)^2$ is the usual
sample variance. These values are precisely the usual estimates of mean and its standard error.
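The pseudo-value formulation translates directly into code. The following generic sketch (our own helper, assuming numpy, with the variance functional as an illustrative test case) returns the Jackknife bias-corrected estimate together with the bias and variance estimates for an arbitrary functional:

    import numpy as np

    def jackknife(data, stat):
        # Jackknife estimates via the pseudo-values
        # theta~_i = theta^ + (n - 1)(theta^ - theta^_i).
        data = np.asarray(data, dtype=float)
        n = data.size
        theta_hat = stat(data)
        loo = np.array([stat(np.delete(data, i)) for i in range(n)])
        pseudo = theta_hat + (n - 1) * (theta_hat - loo)
        corrected = pseudo.mean()                # bias-corrected estimate
        bias = theta_hat - corrected             # B^_J
        var = np.sum((pseudo - corrected)**2) / (n * (n - 1))
        return corrected, bias, var

    rng = np.random.default_rng(0)
    x = rng.normal(10.0, 2.0, size=30)
    pop_var = lambda d: np.mean((d - d.mean())**2)   # theta(F^), divisor n
    corrected, bias, var = jackknife(x, pop_var)
    print(corrected, np.var(x, ddof=1))      # correction recovers s^2 exactly
    print(bias, -pop_var(x) / (x.size - 1))  # B^_J = -theta^/(n-1), as derived above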
Example 2.12 (cont'd): When θ(F) is the variance functional, so that θ(F) = σF² = E_F[{X −
E_F(X)}²], we can see that the pseudo-values are:
$$\tilde\theta_i = \theta(\hat F) + (n-1)\{\theta(\hat F) - \theta(\hat F_i)\} = n\theta(\hat F) - (n-1)\theta(\hat F_i) = \sum_{j=1}^n (x_j - \bar x)^2 - \sum_{j\ne i}(x_j - \bar x_i)^2.$$
Writing $y_i = x_i - \bar x$ and using the relationship between $\bar x_i$ and $\bar x$ noted above, this expression
simplifies to $\tilde\theta_i = \frac{n}{n-1}\,y_i^2$, so that
$$\tilde\theta_J = \frac{1}{n}\sum_{i=1}^n \tilde\theta_i = \frac{1}{n}\sum_{i=1}^n \frac{n}{n-1}\,y_i^2 = \frac{1}{n-1}\sum_{i=1}^n y_i^2,$$
and therefore:
$$\widehat{Var}_J(\hat\theta) = \frac{1}{n}\,\tilde s^2 = \frac{1}{n(n-1)}\sum_{i=1}^n\left(\tilde\theta_i - \frac{1}{n-1}\sum_{j=1}^n y_j^2\right)^2$$
$$= \frac{1}{n(n-1)}\left\{\frac{n^2}{(n-1)^2}\sum_{i=1}^n y_i^4 - n\left(\frac{1}{n-1}\sum_{i=1}^n y_i^2\right)^2\right\}$$
$$= \frac{n^2}{(n-1)^3}\cdot\frac{1}{n}\sum_{i=1}^n y_i^4 - \frac{n^2}{(n-1)^3}\left(\frac{1}{n}\sum_{i=1}^n y_i^2\right)^2$$
$$= \frac{n^2}{(n-1)^3}\left\{\frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^4 - \left(\frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2\right)^2\right\}.$$
By way of comparison, we note that the exact variance of the estimator $n^{-1}\sum_{i=1}^n (x_i - \bar x)^2$ is
given by:
$$Var_F\{\theta(\hat F)\} = \frac{(n-1)^2}{n^3}\,\mu_{4,F} - \frac{(n-1)(n-3)}{n^3}\,(\sigma_F^2)^2 = \frac{(n-1)^2}{n^3}\left\{\mu_{4,F} - \frac{n-3}{n-1}\,(\sigma_F^2)^2\right\},$$
where µ4,F = EF [{X − EF (X)}4 ] is the fourth central moment of the distribution with CDF
F. Note that for sufficiently large values of n, we have:
$$\frac{n^2}{(n-1)^3} \approx \frac{1}{n} \approx \frac{(n-1)^2}{n^3} \qquad\text{and}\qquad \frac{n-3}{n-1} \approx 1,$$
so that, for large n, the Jackknife variance estimate closely matches this exact variance, with the
sample moments simply replacing the population moments µ4,F and (σF²)².
[NOTE: This calculation follows along identical lines to those used in calculating the Jackknife
variance of $\theta(\hat F) = n^{-1}\sum_{i=1}^n (x_i - \bar x)^2$ above, and is left as an exercise for the reader.]
Unfortunately, there are drawbacks to the Jackknife variance estimate. It turns out that
the Jackknife estimate of variance is not always an accurate (or even consistent) estimate of the
true variance V arF {θ(F̂ )}. In particular, if θ(F ) is defined to be the median of the distribution
with CDF F , the Jackknife estimate of variance is not a valid estimate of the true variance of
the sample median θ(F̂ ). The reasons for this breakdown in the Jackknife variance estimator are
rather technical, and we will not discuss them here. However, the ideas behind the Jackknife lead us
directly to the methods of the next section, wherein we will arrive at a more generally accurate and
valid non-parametric estimate of variance for θ(F̂ ) [as well as for other estimators δ(X1 , . . . , Xn )].
Before we proceed to this new approach, though, we give a brief development of a method
of variance estimation which has close ties to the ideas in the Jackknife (but which actually pre-
dates the Jackknife) called the δ-method. The general idea is based upon simple first-order Taylor
expansion. In particular, suppose that Y = (Y1 , . . . , Yn ) is a random vector with known mean
vector µ = (µ1, . . . , µn)ᵀ and known variance-covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{pmatrix},$$
and let Z = g(Y ) for some differentiable function g(·). In order to estimate the variance of Z in
terms of the mean and variance of Y , we first note that first-order Taylor expansion of g(Y ) about
the point Y = µ yields:
$$Z = g(Y) \approx g(\mu) + \sum_{i=1}^n g_i(\mu)(Y_i - \mu_i) = g(\mu) + \nabla g(\mu)^T (Y - \mu),$$
where $g_i(Y) = \frac{\partial}{\partial Y_i} g(Y)$ and $\nabla g(\mu) = \{g_1(\mu), \ldots, g_n(\mu)\}^T$. From this approximation, we can
directly estimate the variance of Z as
$$Var(Z) \approx Var\{g(\mu) + \nabla g(\mu)^T (Y - \mu)\} = Var\{\nabla g(\mu)^T (Y - \mu)\} = \nabla g(\mu)^T\,\Sigma\,\nabla g(\mu).$$
[Recall that for any constant vector a and any random vector W, we have Var(aᵀW) = aᵀVar(W)a.]
This approximation is generally known as the δ-method estimate of variance. [NOTE: The approx-
imation is based on “linearising” the function g(·), and thus the accuracy of the estimate is closely
tied to how good this linear approximation to g(·) is, particularly near the mean vector µ.] Now,
if we assume that the Xi's are an iid sample from some distribution with known mean µX and
variance σX², then µ = (µX, . . . , µX)ᵀ and Σ is an n × n diagonal matrix with each diagonal element
equal to σX². Thus, we can calculate the δ-method estimate of the variance of an estimator
θ̂t = t(X1, . . . , Xn) as:
$$Var(\hat\theta_t) \approx \nabla t(\mu)^T\,\Sigma\,\nabla t(\mu) = (t_{1,\mu}\ \cdots\ t_{n,\mu})\begin{pmatrix}\sigma_X^2 & 0 & \cdots & 0\\ 0 & \sigma_X^2 & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & \sigma_X^2\end{pmatrix}\begin{pmatrix}t_{1,\mu}\\ \vdots\\ t_{n,\mu}\end{pmatrix} = \sigma_X^2\sum_{i=1}^n t_{i,\mu}^2,$$
where $t_{i,\mu} = t_i(\mu)$ and $t_i(X) = \frac{\partial}{\partial X_i}\,t(X_1, \ldots, X_n)$. Finally, we note that if we do not know
the true mean vector µ and the true variance σX², then we can just substitute any convenient
estimates for them; typically, the most sensible choice would be to use the data vector X itself to
estimate µ and the sample variance s² to estimate σX². For example, taking t(X1, . . . , Xn) = s² =
$(n-1)^{-1}\sum_{j=1}^n (X_j - \bar X)^2$, we have $s_i(X) = \frac{\partial s^2}{\partial X_i} = \frac{2}{n-1}(X_i - \bar X)$, and the δ-method estimate of
the variance of the sample variance becomes:
$$Var(s^2) \approx \sigma_X^2\sum_{i=1}^n (s_{i,\mu})^2 \approx s^2\sum_{i=1}^n (s_{i,X})^2 = s^2\sum_{i=1}^n\left\{\frac{2}{n-1}(X_i - \bar X)\right\}^2 = \frac{4}{n-1}\,s^4.$$
Alternatively, we can view s² as a function of just two sample moments, writing
$s^2 = t(Y) = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2\right)$ with $Y = \left(\bar X,\ \frac{1}{n}\sum_{i=1}^n X_i^2\right)^T$. This vector has mean
µ = (µX, σX² + µX²)ᵀ and variance-covariance matrix
$$\Sigma = \frac{1}{n}\begin{pmatrix}\sigma_X^2 & \mu_{3,X}-\mu_X(\sigma_X^2+\mu_X^2)\\ \mu_{3,X}-\mu_X(\sigma_X^2+\mu_X^2) & \mu_{4,X}-(\sigma_X^2+\mu_X^2)^2\end{pmatrix},$$
where µ3,X = E(X³) and µ4,X = E(X⁴). Furthermore, we can see that $\nabla t(\mu) = \frac{n}{n-1}(-2\mu_X,\ 1)^T$.
Therefore, the δ-method variance estimate for the sample variance in this form is readily calculated
(using matrix multiplication and some straightforward algebra) to be:
$$Var(s^2) \approx \nabla t(\mu)^T\,\Sigma\,\nabla t(\mu) = \frac{n}{(n-1)^2}\left(\mu_{4,X} - 4\mu_X\mu_{3,X} + 6\mu_X^2\sigma_X^2 + 3\mu_X^4 - \sigma_X^4\right) = \frac{n}{(n-1)^2}\left(\tilde\mu_{4,X} - \sigma_X^4\right),$$
where $\tilde\mu_{4,X} = E[\{X - \mu_X\}^4]$ denotes the fourth central moment.
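The two δ-method expressions for Var(s²) can be compared by simulation. The sketch below (our own illustration, assuming numpy, with normal data and an arbitrary n) plugs sample quantities into both forms and contrasts them with the Monte Carlo variance of s²:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 40

    def delta_quick(x):
        # First delta-method form: Var(s^2) ~ 4 s^4 / (n - 1).
        s2 = np.var(x, ddof=1)
        return 4.0 * s2**2 / (x.size - 1)

    def delta_moments(x):
        # Second form: Var(s^2) ~ n (mu_4 - sigma^4) / (n - 1)^2, with the
        # central moments replaced by their sample counterparts.
        mu4 = np.mean((x - x.mean())**4)
        s2 = np.var(x, ddof=1)
        return x.size * (mu4 - s2**2) / (x.size - 1)**2

    x = rng.normal(5.0, 2.0, size=n)
    print(delta_quick(x), delta_moments(x))

    # Monte Carlo "truth": for normal data, Var(s^2) = 2 sigma^4/(n-1) ~ 0.82.
    reps = np.array([np.var(rng.normal(5.0, 2.0, size=n), ddof=1)
                     for _ in range(50_000)])
    print(reps.var())

For normal data the moment-based form lies close to the exact value 2σ⁴/(n−1), while the first, cruder form roughly doubles it; the discrepancy illustrates that the accuracy of the δ-method depends heavily on the quality of the underlying linear approximation.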
2.6.3. The Bootstrap Method: The notion behind the Jackknife pseudo-values, θ̃i, is a reasonable
one. We can "mimic" the behaviour of the random variables Xi, and therefore of the estimator
θ(F̂), under the true distribution F by using the θ̃i values, which are essentially estimates of θ(F̂)
based on "re-samples" (drawn under the distribution F̂) of specified sub-collections of n − 1 of
the original data points. The behaviour of the θ̃i's is then mapped back to estimate the true
behaviour of θ̂ = θ(F̂). However, if we are truly to create a proper analogy for the behaviour of θ̂
under F, it makes more sense to examine the behaviour of the quantity θ̂* = θ(F̂*), where the
distribution F̂* is the empirical distribution associated with the random variables X1*, . . . , Xn* having
distribution F̂. In other words, to examine the behaviour of the quantity θ̂ under the population
distribution F, we simply imagine that our observed data forms its own "population" from which
we randomly sample according to the "true" distribution F̂ and construct an estimate of the "true"
population parameter θ(F̂) using the "re-sampled" data, arriving at θ(F̂*) as our estimator of θ(F̂).
The advantage of this approach is that we "know the truth" regarding θ(F̂), since we know the
"true" distribution F̂. Thus, we can determine exactly (assuming we are willing to conduct the
appropriate algebraic calculations) the bias and variance of θ(F̂*). If the analogy holds, and it will
in most cases, we can then use the bias and variance of θ(F̂*) under F̂ as estimators of the bias
and variance of θ(F̂) under F. This approach is generally referred to as the bootstrap, since we
are using the data itself to estimate its behaviour under F, effectively "pulling ourselves up by our
own bootstraps".
Formally, then, we will define the bootstrap estimators of bias and variance as:
$$\hat B_B = E_{\hat F}\{\theta(\hat F^*)\} - \theta(\hat F); \qquad \widehat{Var}_B\{\theta(\hat F)\} = Var_{\hat F}\{\theta(\hat F^*)\},$$
where F̂* is the empirical distribution function of a random sample X1*, . . . , Xn*, each drawn at
random from the collection X = {X1, . . . , Xn} (i.e., each Xi* has CDF F̂). We note that these
formulae are seen to be directly derived by writing the expressions for the bias and variance of θ(F̂):
$$Bias_F\{\theta(\hat F)\} = E_F\{\theta(\hat F)\} - \theta(F); \qquad Var_F\{\theta(\hat F)\} = E_F\{\theta(\hat F)^2\} - [E_F\{\theta(\hat F)\}]^2,$$
and replacing each instance of F by F̂, and each instance of F̂ by F̂*. Of course, we are now
in the position of having to calculate expectations and variances of θ(F̂*). These calculations are
occasionally possible exactly in the case of simple functionals, θ(·), but generally the necessary
quantities will need to be estimated. Fortunately, and this is the real strength of the bootstrap
method, this can be accomplished in a very computationally straightforward way. First, we note
that since we have the observed values x1, . . . , xn in our possession, we can easily create realisations
of the random sample X1*, . . . , Xn* by simply randomly drawing n values from the collection X =
{x1, . . . , xn} with replacement. Suppose that we repeat this re-sampling exercise a large number
of times, say B, leading to the re-sampled datasets:
$$\{X^*_{1,1},\ldots,X^*_{n,1}\},\ \ldots,\ \{X^*_{1,b},\ldots,X^*_{n,b}\},\ \ldots,\ \{X^*_{1,B},\ldots,X^*_{n,B}\}.$$
In turn, these B "bootstrap" datasets can be used to construct the estimates θ̂*b = θ(F̂*b), where F̂*b
is the empirical distribution function derived from the re-sampled dataset {X*_{1,b}, . . . , X*_{n,b}}. Using
these θ̂*b values we can approximate the bootstrap bias and variance as:
$$\hat B_B = E_{\hat F}\{\theta(\hat F^*)\} - \theta(\hat F) \approx \frac{1}{B}\sum_{b=1}^B \hat\theta^*_b - \hat\theta$$
$$\widehat{Var}_B\{\theta(\hat F)\} = Var_{\hat F}\{\theta(\hat F^*)\} \approx \frac{1}{B-1}\sum_{b=1}^B\left(\hat\theta^*_b - \frac{1}{B}\sum_{c=1}^B \hat\theta^*_c\right)^2.$$
Note that we have simply estimated the expected value and the variance of θ(F̂*) by the sample
average and sample variance of the θ̂*b's, respectively. As such, as long as B is large enough, we
can be certain that these estimates are reasonably accurate (in fact, it can be shown that the
variance of these Monte Carlo approximations decreases linearly in B⁻¹, so that the associated
simulation error is approximately of size B^{−1/2}).
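In code, the Monte Carlo approximation just described takes only a few lines. The following generic sketch (our own helper, assuming numpy; the exponential data and the choices of B and seed are illustrative) computes the bootstrap bias and variance estimates for any functional of the empirical distribution:

    import numpy as np

    def bootstrap_bias_var(data, stat, B=2000, seed=0):
        # Approximate the bootstrap bias and variance estimates using B
        # re-samples of size n drawn with replacement from the data.
        rng = np.random.default_rng(seed)
        data = np.asarray(data)
        n = data.shape[0]
        reps = np.array([stat(data[rng.integers(0, n, size=n)])
                         for _ in range(B)])
        bias = reps.mean() - stat(data)   # ~ E_{F^}{theta(F^*)} - theta(F^)
        var = reps.var(ddof=1)            # ~ Var_{F^}{theta(F^*)}
        return bias, var

    rng = np.random.default_rng(1)
    x = rng.exponential(2.0, size=50)
    pop_var = lambda d: np.mean((d - d.mean())**2)   # divisor-n variance
    print(bootstrap_bias_var(x, pop_var, B=5000))
    print(-pop_var(x) / x.size)  # known bias -theta(F^)/n, for comparison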
We further note that, just as for the Jackknife, the notion of the bootstrap can be extended to
estimators other than θ(F̂ ). In particular, if θ̂δ = δ(X1 , . . . , Xn ) is any estimator of θ(F ), we can
use the bootstrap to estimate the bias and variance of this estimator as:
$$\hat B_B = E_{\hat F}\{\delta(X^*_1,\ldots,X^*_n)\} - \theta(\hat F) \approx \frac{1}{B}\sum_{b=1}^B \delta(X^*_{1,b},\ldots,X^*_{n,b}) - \theta(\hat F)$$
$$\widehat{Var}_B(\hat\theta_\delta) = Var_{\hat F}\{\delta(X^*_1,\ldots,X^*_n)\} \approx \frac{1}{B-1}\sum_{b=1}^B\left(\delta(X^*_{1,b},\ldots,X^*_{n,b}) - \frac{1}{B}\sum_{c=1}^B \delta(X^*_{1,c},\ldots,X^*_{n,c})\right)^2.$$
Note that the bootstrap notion of replacing F by F̂ and F̂ by F̂* has simply been augmented to
include the replacement of Xi by Xi*.
Example 2.13: Suppose that we have observed the following data pairs, which represent the
average LSAT (Law School Admission Test, a common entrance exam for prospective law
school students in the United States) and GPA (grade point average) scores for the 1973 entering
class at a random sample of 15 U.S. law schools (the data are also plotted below):
LSAT GPA ρ̂i − ρ̂ LSAT GPA ρ̂i − ρ̂ LSAT GPA ρ̂i − ρ̂
576 3.39 0.1166 635 3.30 −0.0127 558 2.81 −0.0214
578 3.03 −0.0003 666 3.44 −0.0451 580 3.07 0.0036
555 3.00 0.0082 661 3.43 −0.0402 651 3.36 −0.0246
605 3.13 −0.0003 653 3.12 0.0417 575 2.74 0.0093
545 2.76 −0.0360 572 2.88 −0.0093 594 2.96 0.0035
Suppose that we are interested in estimating the correlation between LSAT scores (Yi ’s) and
GPAs (Zi ’s), so that our functional of interest is
$$\theta(F) = \rho_F = \frac{Cov_F(Y, Z)}{\sqrt{Var_F(Y)\,Var_F(Z)}},$$
where F represents the joint distribution of the pairs Xi = (Yi , Zi ). Further, suppose that we
use the usual correlation estimator
$$\hat\rho = \frac{\sum_{i=1}^n (y_i - \bar y)(z_i - \bar z)}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2\,\sum_{i=1}^n (z_i - \bar z)^2}}.$$
The sample correlation coefficient for these 15 pairs is easily calculated as ρ̂ = 0.7764. Moreover,
the table provides values for ρ̂i − ρ̂ (where ρ̂i is the sample correlation calculated without the ith
data value), which can then be used to create Jackknife pseudo-values, ρ̃i = ρ̂ + (n − 1)(ρ̂ − ρ̂i ).
These pseudo-values can then be used to estimate the bias and variance of ρ̂ as:
$$\hat B_J = \hat\rho - \frac{1}{n}\sum_{i=1}^n \tilde\rho_i = -0.007; \qquad \widehat{Var}_J(\hat\rho) = \frac{1}{n(n-1)}\sum_{i=1}^n\left(\tilde\rho_i - \frac{1}{n}\sum_{j=1}^n \tilde\rho_j\right)^2 = 0.0203.$$
Alternatively, we can select B re-samples from the 15 observed pairs and create bootstrap
replicates of the correlation estimate, ρ̂*b (b = 1, . . . , B). For example, one re-sample might be:
X*1 = X7 = (555, 3.00), X*2 = X15 = (594, 2.96), X*3 = X14 = (572, 2.88),
X*4 = X3 = (558, 2.81), X*5 = X7 = (555, 3.00), X*6 = X14 = (572, 2.88),
X*7 = X7 = (555, 3.00), X*8 = X7 = (555, 3.00), X*9 = X12 = (575, 2.74),
X*10 = X3 = (558, 2.81), X*11 = X6 = (580, 3.07), X*12 = X6 = (580, 3.07),
X*13 = X1 = (576, 3.39), X*14 = X10 = (605, 3.13), X*15 = X12 = (575, 2.74).
For this particular re-sample, we can see that ρ̂* = 0.2585 (note how different this value is
from ρ̂ = 0.7764, indicating that the correlation estimator in this case may be quite variable).
Table 2.2 shows the bootstrap bias and variance estimates based on various values of B (each
replicated three times):
Table 2.2: Bootstrap Bias and Variance Estimates for the Correlation Coefficient
Note that for B = 10, the bootstrap estimates of bias and variance are quite variable (which
was foreshadowed by the fact that the single re-sample we examined earlier yielded a value of
ρ̂* quite different from ρ̂), but by the time B = 10,000 there is essentially no variability in the
estimates. As such, we must be careful when implementing the bootstrap to ensure that we
have chosen a large enough value of B (of course, we do not want to choose an overly large value
as this will incur excessive computational costs and thus make our estimation procedure overly
time consuming). It is generally accepted that bootstrap bias and standard deviation estimates
typically require a few thousand re-samples to ensure that the variability due to the random
selection of re-samples (generally referred to as simulation error) is sufficiently small.
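For this example, the entire bootstrap computation can be scripted in a few lines. The sketch below (assuming numpy; B = 10,000 mirrors the discussion above, while the random seed is an arbitrary choice) re-samples the 15 (LSAT, GPA) pairs with replacement:

    import numpy as np

    lsat = np.array([576, 635, 558, 578, 666, 580, 555, 661, 651,
                     605, 653, 575, 545, 572, 594], dtype=float)
    gpa = np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36,
                    3.13, 3.12, 2.74, 2.76, 2.88, 2.96])

    corr = lambda y, z: np.corrcoef(y, z)[0, 1]
    rho_hat = corr(lsat, gpa)                 # 0.776

    rng = np.random.default_rng(0)
    B, n = 10_000, lsat.size
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)      # re-sample pairs, with replacement
        reps[b] = corr(lsat[idx], gpa[idx])

    print(reps.mean() - rho_hat)              # bootstrap bias estimate
    print(reps.var(ddof=1))                   # bootstrap variance estimate
    # A histogram of reps reproduces the skewed bootstrap histogram of the
    # 10,000 replicates discussed below.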
Finally, we present a plot of the data and a “bootstrap histogram” on the top of the following
page (i.e., a histogram of 10,000 values of ρ̂ calculated on randomly re-sampled datasets). The
plot of the data indicates a reasonably linear relationship (the least-squares linear regression
line is superimposed on the plot), which confirms that the use of a correlation coefficient as a
measure of relationship is reasonable. Moreover, the plot uncovers a potential outlier (which
happens to be the first data point corresponding to an LSAT value of 576 and a GPA value
of 3.39). The presence of this outlier has an adverse effect on the variability of the bootstrap
estimators, which is why we required B = 10, 000 re-samples before the bootstrap estimators
stopped varying noticeably from trial to trial. Of course, what should be done regarding this
outlier is an important subject, but is beyond the scope of these notes. The histogram of the
10,000 bootstrap values also indicates the inherent variability in the ρ̂* values.
[Figure: (left) scatterplot of the average LSAT and GPA scores for the entering classes at 15 U.S.
law schools, with the least-squares regression line superimposed; (right) histogram of 10,000
bootstrap replicates of the correlation coefficient.]
Moreover, the
histogram provides another interesting piece of information; namely, the distribution of the ρ̂*
values is quite skewed. Indeed, the bootstrap histogram yields information regarding the actual
distribution of ρ̂* under F̂, and this information may be used to infer the behaviour (following
the standard bootstrap paradigm) of ρ̂ under F. For comparison purposes, the theoretical
distribution of ρ̂ under the assumption that the Xi's follow a bivariate normal distribution
with true correlation of ρ = 0.7764 is superimposed on the histogram. We will further investigate
the use of this distributional information (in the pursuit of confidence intervals) in subsequent
sections.
The idea behind the bootstrap is powerful and extremely intuitively appealing. Moreover, the
implementation is reasonably easy (though computationally intensive). Why, then, has the boot-
strap not replaced parametric approaches? One drawback is that, as implemented, the bootstrap
method yields a different answer every time (of course, the differences will be very small if B is
large). Another drawback is that if θ(·) is complicated to calculate (perhaps because it is implicitly
defined as the solution to an equation, just as the MLE was) then computing its value for each of
B re-sampled datasets is computationally quite expensive and time consuming. Moreover, as we
have discussed in the previous sections, if we truly believe the parametric structure we have set up,
then the parametric estimators have nice optimal properties. Still, the bootstrap is a very flexible
and widely applicable approach which deserves more attention than it currently gets among statis-
tical practitioners (particularly given the speed with which modern computers can implement its
requirements). Indeed, the bootstrap can even be extended to circumstances beyond the iid setting
on which we have focussed here. Finally, however, a word of warning. We must be somewhat careful
since we cannot always guarantee that applying the bootstrap paradigm (i.e., estimating bias and
variance using quantities derived by replacing F by F̂ and F̂ by F̂* in the defining expressions for
the true bias and variance) will yield valid estimates in more complicated settings (particularly if
the observed data points are not independent of one another).