
Elements of Statistical Inference

Chiranjit Mukhopadhyay
Indian Institute of Science

Statistical inference mainly deals with estimation of, and hypothesis testing about, unknown
population parameters, given a set of observations on the variable whose population behaviour
we want to study or model. Using the developed model for forecasting or prediction
also falls within the realm of statistical inference, but its general theory will not be discussed
here; it will be introduced only when it first arises, in the context of regression. Here we
shall only give a flavour of the general mathematical statistical treatment of the problems
of estimation and hypothesis testing about unknown population parameters.

1 Some Terminology

1.1 Parameter

Parameters are quantities defined in the population. To emphasise this fact we shall refer to
them as population parameters in this subsection. Population parameters are quantities derived
from the population probability model, like its mean, median, variance, 90-th percentile, or
even the entire c.d.f. itself. In practice we often assume, from external or empirical
considerations, that observations arise from a given parametric family of probability models
like the Binomial, Geometric, Negative Binomial, Poisson, Exponential, Gamma, Normal, Weibull etc.
In such situations population parameters are nothing but the quantities which appear in
the expressions of the p.m.f. or p.d.f. of those probability models, like the p of the Binomial,
Geometric or Negative Binomial distribution, the λ of the Poisson or Exponential model, the
(µ, σ²) of the Normal probability model, the (α, λ) of the Gamma distribution, or the (λ, β) of the
Weibull model. This is because in these parametric families any other population quantity
of interest can be expressed in terms of these basic model parameters. For instance the
standard deviation of a Bernoulli population is √(p(1 − p)), the mean of a Gamma population
is α/λ, and the 75-th percentile of a Normal population is µ + 0.6745 × σ.
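These relationships are easy to check numerically. The sketch below (Python, standard library only; the parameter values p = 0.3, (α, λ) = (2, 0.5) and (µ, σ) = (10, 2) are illustrative) computes the three quantities just mentioned:

```python
import math
from statistics import NormalDist

# Checking the three parameter relationships quoted above (values illustrative).
p = 0.3
bernoulli_sd = math.sqrt(p * (1 - p))  # sd of a Bernoulli(p) population

alpha, lam = 2.0, 0.5
gamma_mean = alpha / lam               # mean of a Gamma(alpha, lam) population

mu, sigma = 10.0, 2.0
q75 = NormalDist(mu, sigma).inv_cdf(0.75)  # 75th percentile of N(mu, sigma^2);
                                           # approximately mu + 0.6745 * sigma
print(round(bernoulli_sd, 4), gamma_mean, round(q75, 4))
```

The `inv_cdf` call confirms that 0.6745 is (to four decimals) the 75-th percentile of the standard Normal.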

1.2 Statistic

Statistics are quantities which are computed from the data or observations. For example, for a
random sample Y1 , Y2 , . . . , Yn from a N (µ, σ²) population, the sample mean Y = (1/n) Σ_{i=1}^n Yi and
the naive sample variance s_n² = (1/n) Σ_{i=1}^n (Yi − Y )² are statistics, as opposed to their respective
population counterparts µ and σ², which are population parameters. The sample proportion p̂
= (# Successes in the sample)/n is another example of a statistic, with a sample of size n
from a Bernoulli 0-1 or Success-Failure population. In general a statistic is a function of the
observations Y1 , Y2 , . . . , Yn , which is denoted by T (Y1 , Y2 , . . . , Yn ) to emphasise the fact that
it is a formula used to derive quantities based on a sample Y1 , Y2 , . . . , Yn .

1.3 Sampling Distribution

In classical statistics the optimality of any method is judged in terms of its repeated use
over different samples from the same population. This philosophy gives rise to the consid-
eration of the possible values a statistic T (Y1 , Y2 , . . . , Yn ) can take and the frequency with
which it takes such values over repeated sampling. Or in other words we consider the prob-
ability distribution of a statistic T (Y1 , Y2 , . . . , Yn ) over repeated sampling. This probability
distribution is called the sampling distribution of T .
For example consider the sample mean Y . Though for a given population its mean µ is
something fixed (but unknown), we cannot expect to get the same value of the sample mean
Y for all different possible samples that we can draw from the population. However we can
consider and theoretically derive how Y would behave over repeated sampling for all possible
samples in terms of its probability distribution. This probability distribution of Y over all
possible samples is called the sampling distribution of the sample mean Y .

Example 1: Consider a population consisting of only four numbers 1, 2, 3 and 4 with 30%
1’s, 40% 2’s, 20% 3’s and the remaining 10% 4’s. That is the probability distribution in the
population may be expressed in terms of the p.m.f. pY (y) of the population random variable
Y as follows:
y 1 2 3 4
pY (y) 0.3 0.4 0.2 0.1

Now consider drawing a random sample of size 2 from this population, and the three statistics
Y = (Y1 + Y2 )/2, s2² = (1/2) Σ_{i=1}^{2} (Yi − Y )² and s1² = Σ_{i=1}^{2} (Yi − Y )², where Y1 and Y2 are
the two observations. The sampling distributions of these three statistics can be figured
out by considering all possible samples of size 2 that can be drawn from this population,
the corresponding probabilities of drawing each such sample, and the values of each of the
statistics for every such sample. These steps are presented in the following table:

Possible Samples   {1,1}  {1,2}  {1,3}  {1,4}  {2,2}  {2,3}  {2,4}  {3,3}  {3,4}  {4,4}
Probability         0.09   0.24   0.12   0.06   0.16   0.16   0.08   0.04   0.04   0.01
Y                   1      1.5    2      2.5    2      2.5    3      3      3.5    4
s2²                 0      0.25   1      2.25   0      0.25   1      0      0.25   0
s1²                 0      0.5    2      4.5    0      0.5    2      0      0.5    0

Finally consolidating the distinct values that have been assumed by these statistics and
adding the corresponding probabilities of obtaining the samples for the same values of the
statistics, we obtain the sampling distributions of these statistics as follows:

Sampling Distribution of the Sample Mean Y

y       1.0   1.5   2.0   2.5   3.0   3.5   4.0
pY (y)  0.09  0.24  0.28  0.22  0.12  0.04  0.01

Sampling Distribution of s2²            Sampling Distribution of s1²

s2²    0     0.25  1     2.25           s1²    0     0.5   2     4.5
prob.  0.30  0.44  0.20  0.06           prob.  0.30  0.44  0.20  0.06
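The enumeration above can be reproduced mechanically. The following sketch (Python; the population p.m.f. is the one from Example 1) enumerates all ordered samples of size 2 and recovers the sampling distribution of the sample mean; ordered pairs such as (1, 2) and (2, 1) together carry the 0.24 shown for the sample {1, 2} in the table:

```python
from collections import defaultdict
from itertools import product

# Population p.m.f. from Example 1.
pmf = {1: 0.3, 2: 0.4, 3: 0.2, 4: 0.1}

# Enumerate all ordered samples (y1, y2) of size 2 and accumulate the
# probability of each possible value of the sample mean.
sampling_dist = defaultdict(float)
for y1, y2 in product(pmf, repeat=2):
    sampling_dist[(y1 + y2) / 2] += pmf[y1] * pmf[y2]

for ybar in sorted(sampling_dist):
    print(ybar, round(sampling_dist[ybar], 2))
```

The printed probabilities match the table for Y above, and they sum to one, as any sampling distribution must.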

2 Estimation
Suppose we have a probability model in the population characterised by the p.d.f. f (y|θ)
(or p.m.f. p(y|θ)), and we have n independent and identically distributed (henceforth called
i.i.d.) observations Y1 , Y2 , . . . , Yn on the population random variable Y , which has the p.d.f.
f (y|θ) (or p.m.f. p(y|θ)). The first inference problem at hand is to estimate the unknown
population parameter θ. The problem of estimation has two facets - point estimation and
interval estimation. In the first case, i.e. point estimation, as the name suggests, one deals
with a single-valued estimator of θ; while in the latter case, i.e. interval estimation, one reports
an interval of values which is supposed to contain the true unknown value of the parameter
θ.

2.1 Point Estimation

We start our discussion on point estimation by first considering the nature of a “good”
estimator. That is, we shall first try to understand the kind of properties and behaviour a
reasonable estimator should have, or in other words, the desirable criteria for “good”
estimators. Then, failing to devise any straightforward method of obtaining such
estimators, we shall resort to some general methods of estimation which are reasonably
well-behaved for large samples for the so-called “regular” models.
Let the statistics θ̂ and θ̂0 be two different estimators of the population parameter θ. Say for
example with a sample from a N (µ, 1) population, for the unknown population parameter µ,
let µ̂ be the sample mean and µ̂0 be the sample median, as µ is both the population mean as
well as the population median. Likewise with a sample from a Poisson(λ) population, for the
unknown population parameter λ, let λ̂ denote the sample mean and λ̂0 denote the sample
variance, as λ may be interpreted as both the population mean and population variance.
Now θ̂ would be considered to be better than θ̂0 if

Probθ (θ − a < θ̂ < θ + b) ≥ Probθ (θ − a < θ̂0 < θ + b) ∀a, b > 0 and ∀θ ∈ Θ (1)

where the probabilities above are according to the sampling distributions of θ̂ and θ̂0 (it is
subscripted with a θ to emphasise the fact that these probabilities in general depend on the
unknown θ), and Θ is the parameter space, the set of all possible values the unknown
population parameter θ can take. A slightly weaker version of inequality (1) is

Eθ [(θ̂ − θ)2 ] ≤ Eθ [(θ̂0 − θ)2 ] ∀θ ∈ Θ (2)

where the expectation Eθ [·] is w.r.t. the sampling distributions of the estimators θ̂ and θ̂0
(again the Eθ [·] is subscripted with a θ to emphasise the fact that the expectation in general
depends on the unknown θ). Inequality (2) is weaker than inequality (1) in the sense that if
inequality (1) is satisfied by two estimators θ̂ and θ̂0 , then they must also satisfy
inequality (2). Since inequality (2), though weaker, is less clumsy to deal with, as it does
not involve additional arbitrary constants like the a and b in inequality (1), the “goodness” of an
estimator is measured in terms of this weaker criterion. This criterion is called the Mean
Squared Error or MSE, which is formally defined as follows.
Definition 1: Mean Squared Error or MSE of an estimator θ̂ is given by MSEθ (θ̂ ) =
Eθ [(θ̂ − θ)2 ].
Intuitively, if θ̂ is an estimator of θ, then the error of θ̂ is (θ̂ − θ), the amount by which
it misses the quantity θ it is trying to estimate. But this error could be either positive or
negative and in general we are not interested in its sign but only the magnitude. A smooth
way of getting rid of the sign of a quantity is to square it.1 Thus we consider the squared
error (θ̂ − θ)2 . Now this squared error is a random quantity as its value depends on the
value of the estimator θ̂ , which varies from sample to sample. One way of consolidating the
value of this random criterion, viz. the squared error, is to look at its mean over repeated
sampling, i.e. to ask by how much the estimator θ̂ misses its target value θ on average. This
leads to the criterion mean squared error Eθ [(θ̂ − θ)²].
Note that though MSE of an estimator θ̂ does not depend on the value of θ̂ (since we have
averaged it out), in general it still depends on θ. Thus MSE of θ̂ is a function of θ. Since
we do not know the value of θ, which is the crux of the problem, one may now seek a
super estimator which has the smallest MSE among all the estimators of θ, no matter what
the value of θ may be, i.e. for all values of θ ∈ Θ. Unfortunately such super estimators do
not exist, because when θ = θ0 , some fixed value in Θ, no other estimator θ̂ can have a
smaller MSE than the (rather silly) estimator θ̂0 ≡ θ0 . That is, if we allow anything and
everything in the universe to be an estimator of θ and then seek an estimator which
uniformly minimises the MSE, we are simply asking for too much.
The solution becomes apparent once we have realised the problem. If we can somehow keep
trivial and silly estimators out of consideration and then seek an MSE-minimising
estimator, there might be some hope of obtaining a solution. That is, we have to confine
our search for a “good” or MSE-minimising estimator within a narrower class of
“reasonable” estimators. One such smaller or restricted class of estimators which is usually
considered in practice is the class of unbiased estimators, defined as follows.
Definition 2: An estimator θ̂ of the unknown population parameter θ is called unbiased,
if Eθ [θ̂ ] = θ ∀θ ∈ Θ.
1
Getting rid of the sign of a quantity is a recurring problem in statistics, and we always use squaring to
accomplish this task. Simply ignoring the sign is not a smooth procedure, as the graph of f (x) = |x| for
−∞ < x < ∞ exhibits at x = 0. The closest one can come to this through a smooth procedure is by squaring
(study the graph of g(x) = x² for −∞ < x < ∞). Raising the quantity whose sign we want to get rid
of to any even power would also be a smooth procedure, but the magnitude is distorted the least when this
power is 2 (for instance compare the graphs of f (x) = |x|, g(x) = x² and h(x) = x⁴ for −∞ < x < ∞). We
want the process of ridding the sign to be smooth because otherwise it poses mathematical difficulties, like
not being able to use standard calculus methods for subsequent theoretical development.

The above definition requires the mean of the sampling distribution of an unbiased estimator to
coincide with the unknown population parameter it is trying to estimate, for all its possible
values. Intuitively, this means that if an estimator is unbiased, on average it will always
hit the target, no matter what or where the target is. This looks like a fairly reasonable
thing to demand of an estimator. Also notice that the unbiasedness requirement gets
rid of trivial constant estimators, as clearly an estimator like θ̂ ≡ θ0 has an expected value
of θ0 , which does not coincide with the value of the unknown population parameter θ for
any value other than θ0 in Θ.
Example 2: Suppose Y1 , Y2 , . . . , Yn is a random sample from an arbitrary population with
mean µ. Then since E[Yi ] = µ ∀µ and ∀i, Y5 is an unbiased estimator of µ, so is (Y2 + Y3 )/2,
so is 0.2Y8 + 0.5Y11 + 0.3Y13 , and so of course is the sample mean Y = (1/n) Σ_{i=1}^n Yi . □
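A small Monte Carlo study makes the point concrete. The sketch below (illustrative choices: Normal(µ, 1) data with µ = 5, n = 20, and three of the estimators from Example 2) shows all of them centering on µ while their variances differ greatly:

```python
import random
import statistics

# Monte Carlo check (illustrative) of the unbiased estimators in Example 2,
# applied to samples from a Normal(mu, 1) population.
random.seed(0)
mu, n, reps = 5.0, 20, 20000

est_y5, est_pair, est_mean = [], [], []
for _ in range(reps):
    y = [random.gauss(mu, 1.0) for _ in range(n)]
    est_y5.append(y[4])                   # Y_5 alone
    est_pair.append((y[1] + y[2]) / 2)    # (Y_2 + Y_3)/2
    est_mean.append(statistics.fmean(y))  # the sample mean

# All three averages hover around mu = 5, i.e. each estimator is unbiased,
# but the sample mean has by far the smallest variance (about 1/n).
print(round(statistics.fmean(est_y5), 2),
      round(statistics.fmean(est_pair), 2),
      round(statistics.fmean(est_mean), 2),
      round(statistics.variance(est_mean), 3))
```

Unbiasedness alone therefore does not distinguish these estimators; their variances do.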

The above example goes to show that usually the class of unbiased estimators is quite
large. Now the task is to seek an estimator with minimum MSE within this reasonably
large class of unbiased estimators. Unbiased estimators have a simple, neatly interpretable
expression for their MSE, as the following theorem and corollary show.
Theorem 1: MSEθ (θ̂ ) = Vθ [θ̂ ] + Bθ (θ̂ )², where Vθ [θ̂ ] is the variance of the estimator
θ̂ w.r.t. its sampling distribution and Bθ (θ̂ ) = Eθ [θ̂ − θ] is the bias of the estimator θ̂ .
Proof:

MSEθ (θ̂ )
= Eθ [(θ̂ − θ)²]
= Eθ [{(θ̂ − Eθ [θ̂ ]) + (Eθ [θ̂ ] − θ)}²]
= Eθ [(θ̂ − Eθ [θ̂ ])²] + (Eθ [θ̂ ] − θ)² + 2Eθ [(θ̂ − Eθ [θ̂ ])(Eθ [θ̂ ] − θ)]
    (because Eθ [θ̂ ] − θ is a constant, whose square has expectation equal to the constant
    itself)
= Vθ [θ̂ ] + Bθ (θ̂ )² + 2(Eθ [θ̂ ] − θ)Eθ [θ̂ − Eθ [θ̂ ]]
    (by the definitions of Vθ [θ̂ ] and Bθ (θ̂ ), and the constancy of Eθ [θ̂ ] − θ)
= Vθ [θ̂ ] + Bθ (θ̂ )²    (as Eθ [θ̂ − Eθ [θ̂ ]] = 0) □

Corollary 1: If θ̂ is unbiased for θ, MSEθ (θ̂ ) = Vθ [θ̂ ].

Proof: If θ̂ is unbiased for θ, its bias Bθ (θ̂ ) = Eθ [θ̂ − θ] = 0 by the definition of unbiasedness,
and the result follows from Theorem 1. □
In view of the above corollary it is usually customary to report the value of an (unbiased) point
estimator together with its √(Vθ [θ̂ ]), called its standard error. Corollary 1 reduces the
task of seeking an MSE-minimising estimator among the class of unbiased estimators to
that of seeking an unbiased estimator with uniformly minimum variance or standard error.
This leads to the discussion of obtaining in general the “best” point estimators, called the
Uniformly Minimum Variance Unbiased Estimators or UMVUE.
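The decomposition MSE = variance + bias² of Theorem 1 can be checked by simulation. The sketch below (illustrative choices: Normal(0, 1) data, n = 10, and the naive sample variance s_n² of §1.2 as a biased estimator of σ² = 1) compares the empirical MSE with the empirical variance plus squared bias:

```python
import random
import statistics

# Monte Carlo illustration (illustrative choices) of Theorem 1,
# MSE = variance + bias^2, for the naive sample variance
# s_n^2 = (1/n) * sum((y_i - ybar)^2) as an estimator of sigma^2 = 1
# in a Normal(0, 1) population.
random.seed(1)
n, reps, sigma2 = 10, 50000, 1.0

vals = []
for _ in range(reps):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    ybar = statistics.fmean(y)
    vals.append(sum((yi - ybar) ** 2 for yi in y) / n)

bias = statistics.fmean(vals) - sigma2  # theoretical bias is -sigma2 / n
var = statistics.variance(vals)
mse = statistics.fmean((v - sigma2) ** 2 for v in vals)
print(round(mse, 3), round(var + bias ** 2, 3), round(bias, 3))
```

The first two printed numbers agree, and the bias is close to the theoretical value −σ²/n = −0.1, confirming both the theorem and the fact that s_n² is biased.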

2.1.1 UMVUE

The foregoing discussion goes to show the desirability of having a UMVUE. But unfortunately
there is no direct method of obtaining a UMVUE for an arbitrary parameter of a given
population probability model. By a direct method we mean a ready-made algorithmic
technique which readily yields a (possibly computerised numerical) solution as soon as the
problem of obtaining a UMVUE is stated or formulated. Instead there are a couple of
theorems which help one show or prove that an estimator is a UMVUE. Before discussing
these theorems, however, it is first necessary to introduce a couple more concepts and results,
without which the main theorems regarding UMVUE would remain inaccessible. The first
such concept is called sufficiency.
Definition 3: A statistic T (Y1 , Y2 , . . . , Yn ) is said to be sufficient for θ if the conditional
distribution of the original observations Y1 , Y2 , . . . , Yn given T (Y1 , Y2 , . . . , Yn ) = t does not
depend on θ.
Sufficiency plays a central role in mathematical statistics, not just for UMVUE. Intuitively,
sufficient statistics provide a way of reducing the data without losing any information
about the unknown parameter θ. This is because if one has the value of the sufficient
statistic T but the original data set Y1 , Y2 , . . . , Yn is lost, one can still reconstruct a set of
Y1 , Y2 , . . . , Yn (using, for example, a random number generator), as doing so does not require
knowledge of the unknown θ (by definition); the reconstructed set is equivalent to the original
data set in the sense that its probability distribution is the same as that of the original data
set. Thus if sufficient statistics exist, one need not carry around the entire original raw data
set for drawing inference about the model parameters. Just having the values of the sufficient
statistics is good enough or sufficient, as these statistics carry all the relevant information
about θ contained in the observations Y1 , Y2 , . . . , Yn .
Example 3: Suppose Y1 and Y2 are i.i.d. Poisson(λ) i.e. we have a sample of size 2 from a
Poisson population. Consider the statistic T = Y1 + Y2 .

P (Y1 = y | T = t)
= P (Y1 = y, Y2 = t − y) / P (T = t)
= P (Y1 = y)P (Y2 = t − y) / (e^{−2λ} (2λ)^t / t!)
    (because Y1 and Y2 are independent and T ∼ Poisson(2λ))
= (e^{−2λ} λ^y λ^{t−y} / (y!(t − y)!)) / (e^{−2λ} (2λ)^t / t!)
= [t!/(y!(t − y)!)] (1/2)^y (1/2)^{t−y} ,

which does not depend on the unknown population parameter λ. It should now be easy
to see that if we had Y1 , Y2 , . . . , Yn , a sample of size n from a Poisson(λ) population, and
T = Y1 + Y2 + · · · + Yn , then

P (Y1 = y1 , Y2 = y2 , . . . , Yn = yn | T = t) = [t!/(y1 ! y2 ! · · · yn !)] (1/n)^{y1} (1/n)^{y2} · · · (1/n)^{yn} ,

which does not depend on λ. Thus, according to the definition, T = Σ_{i=1}^n Yi is a
sufficient statistic for a Poisson sample. This is because if one has the value of T as t,
one can reconstruct a version of the original sample by generating a set of values from a
Multinomial(t; 1/n, . . . , 1/n) distribution, without bothering to carry around all the n values
Y1 , Y2 , . . . , Yn . □
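The claim in Example 3 can be checked by simulation. In the sketch below (illustrative: λ = 2 and conditioning on T = 4), the conditional frequency of Y1 = 2 given T = 4 is compared with the Binomial(4, 1/2) probability [4!/(2!2!)](1/2)⁴ = 6/16 = 0.375; the Poisson sampler uses Knuth's multiplication method, since the Python standard library has none:

```python
import math
import random

# Simulation check (illustrative) of Example 3: with Y1, Y2 i.i.d. Poisson(lam)
# and T = Y1 + Y2, the law of Y1 given T = t is Binomial(t, 1/2), free of lam.
random.seed(2)

def rpois(lam):
    # Knuth's multiplication method for Poisson sampling.
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

lam, t = 2.0, 4
hits = y1_eq_2 = 0
for _ in range(200000):
    y1, y2 = rpois(lam), rpois(lam)
    if y1 + y2 == t:
        hits += 1
        y1_eq_2 += (y1 == 2)

# Conditional frequency of Y1 = 2 given T = 4; the Binomial(4, 1/2)
# probability at 2 is 6/16 = 0.375.
print(round(y1_eq_2 / hits, 3))
```

Rerunning with a different λ leaves the conditional frequency unchanged, which is exactly what sufficiency of T asserts.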
Now trying to intuitively guess a statistic and then show that it is sufficient from the definition,
as has been done in Example 3 above, is an arduous if not impossible task. Fortunately
there is a theorem, called the Factorisation Theorem, which helps one obtain a sufficient
statistic in a routine manner from the expression of the p.m.f./p.d.f. of a probability model.
At this point it will also be wise to broaden our horizon by considering all the unknown
parameters in the population at once. Thus from now on we shall use a bold-faced θ to
denote a vector (more than one) of unknown parameters, while preserving the notation
θ in the case of a single unknown.2 Before presenting the Factorisation Theorem we
need to introduce another extremely important statistical concept called the likelihood
function.
Definition 4: If Y1 , Y2 , . . . , Yn are i.i.d. with p.d.f. f (y|θ) (or p.m.f. p(y|θ)), with
realised values Y1 = y1 , Y2 = y2 , . . . , Yn = yn , the likelihood function of θ is given by
L(θ|y1 , y2 , . . . , yn ) = Π_{i=1}^n f (yi |θ) (or Π_{i=1}^n p(yi |θ) in the discrete case).

Very loosely speaking the likelihood function sort of gives the probability of observing the
data at hand given a value of the model parameter θ. But since θ is unknown, we try to
view this quantity in its totality as a function of the unknown θ as it varies over its domain
Θ. It is important to realise that in the expression of the likelihood function, the variable of
interest is θ and not the observed data Y1 = y1 , Y2 = y2 , . . . , Yn = yn . It is something akin to
a probability only when viewed as a function of y1 , y2 , . . . , yn , but since the likelihood must
be viewed as a function of θ it is not a probability.
Example 4: A. If Y1 , Y2 , . . . , Yn are i.i.d. N (µ, σ²), with realised values Y1 = y1 , Y2 =
y2 , . . . , Yn = yn , then the likelihood function of (µ, σ²) is given by

L(µ, σ²|y1 , y2 , . . . , yn ) = (2πσ²)^{−n/2} e^{−(1/(2σ²)) Σ_{i=1}^n (yi −µ)²} = (2πσ²)^{−n/2} e^{−(1/(2σ²)) {Σ_{i=1}^n (yi −y)² + n(y−µ)²}}     (3)
B. If Y1 , Y2 , . . . , Yn are i.i.d. Bernoulli(p), so that each Yi is 0-1 valued, with probability p of
assuming the value 1 and 1 − p of assuming the value 0, which may be expressed as the p.m.f.
p(y|p) = p^y (1 − p)^{1−y} for y = 0, 1, with realised values Y1 = y1 , Y2 = y2 , . . . , Yn = yn , then
the likelihood function of p is given by

L(p|y1 , y2 , . . . , yn ) = p^{Σ_{i=1}^n yi} (1 − p)^{n − Σ_{i=1}^n yi}     (4)
2
This convention of using bold-face for a vector and ordinary font for a scalar - be it for the parameters or
statistics - will be used throughout these notes.

C. If Y1 , Y2 , . . . , Yn are i.i.d. Poisson(λ) with realised values Y1 = y1 , Y2 = y2 , . . . , Yn = yn ,
then the likelihood function of λ is given by

L(λ|y1 , y2 , . . . , yn ) = e^{−nλ} λ^{Σ_{i=1}^n yi} / Π_{i=1}^n yi !     (5)
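To see a likelihood as a function of the parameter rather than of the data, the sketch below (illustrative data: n = 10, Σ yi = 7) scans the Bernoulli likelihood (4) over a grid of p values; it is maximised at the sample proportion:

```python
# The Bernoulli likelihood (4) as a function of p (illustrative data):
# with n = 10 and sum(y_i) = 7, L(p | y) = p^7 * (1 - p)^3.
n, s = 10, 7

def lik(p):
    return p ** s * (1 - p) ** (n - s)

# Scan a grid of p values in (0, 1); the maximiser sits at the sample
# proportion s / n = 0.7.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lik)
print(best)
```

Note that the data y1 , . . . , yn are held fixed throughout the scan; only p varies, which is exactly the change of viewpoint described above.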

Theorem 2 (Factorisation Theorem): If the population random variable Y has p.d.f.
f (y|θ) (or p.m.f. p(y|θ)), then given the observed data Y1 = y1 , Y2 = y2 , . . . , Yn = yn , a
statistic T (Y1 , Y2 , . . . , Yn ) is sufficient for θ if and only if the likelihood function can be
factored as (t being the realised value of T )

L(θ|y1 , y2 , . . . , yn ) = g(t(y1 , y2 , . . . , yn ), θ) h(y1 , y2 , . . . , yn ).     (6)

That is, t(y1 , y2 , . . . , yn ) is sufficient for θ ⇐⇒ the likelihood function can be factored into
two components, where the expression of the first component involves θ, with the terms
involving y1 , y2 , . . . , yn appearing only through t(y1 , y2 , . . . , yn ), and the expression of the
second component involves only y1 , y2 , . . . , yn without any term involving θ.
Proof: We shall present the proof for discrete Y , which is a little more intuitive and illus-
trative but less technical than the continuous case.
“only if” or =⇒ part: Suppose T (Y1 , Y2 , . . . , Yn ) is sufficient for θ and let t(y1 , y2 , . . . , yn ) =
t denote the observed value of T . Then

L(θ|y1 , y2 , . . . , yn )
= P (Y1 = y1 , Y2 = y2 , . . . , Yn = yn |θ)
(by definition of likelihood function)
= P (Y1 = y1 , Y2 = y2 , . . . , Yn = yn , T = t|θ)
    (as the two events are the same)
= P (T = t|θ) P (Y1 = y1 , Y2 = y2 , . . . , Yn = yn |T = t, θ)
    (by the definition of conditional probability)
= g(t, θ) h(y1 , y2 , . . . , yn )

where g(t, θ) = P (T = t|θ), which involves only θ and t (without directly involving
y1 , y2 , . . . , yn - the only way y1 , y2 , . . . , yn appear in the expression of g(t, θ) is through t);
and h(y1 , y2 , . . . , yn ) = P (Y1 = y1 , Y2 = y2 , . . . , Yn = yn |T = t, θ), which does not involve
θ, by definition of the sufficiency of T .
“if” or ⇐= part: Suppose the probability model of Y1 , Y2 , . . . , Yn is such that it admits the
factorisation (6). Then

P (Y1 = y1 , Y2 = y2 , . . . , Yn = yn |T = t, θ)
= P (Y1 = y1 , Y2 = y2 , . . . , Yn = yn , T = t|θ) / P (T = t|θ)
= g(t, θ) h(y1 , y2 , . . . , yn ) / Σ_{{y1 ,...,yn : T (y1 ,...,yn )=t}} g(t, θ) h(y1 , y2 , . . . , yn )
= g(t, θ) h(y1 , y2 , . . . , yn ) / [g(t, θ) Σ_{{y1 ,...,yn : T (y1 ,...,yn )=t}} h(y1 , y2 , . . . , yn )]
= h(y1 , y2 , . . . , yn ) / Σ_{{y1 ,...,yn : T (y1 ,...,yn )=t}} h(y1 , y2 , . . . , yn ),

which does not depend on θ. □
Example 5: A. i. Consider the Normal likelihood given in (3). Assume that σ² is known
but not µ. Let T (Y1 , Y2 , . . . , Yn ) = Y , so that t = y. Then the likelihood function (3) can
be factorised into g(t, µ) = e^{−n(t−µ)²/(2σ²)} and h(y1 , y2 , . . . , yn ) = (2πσ²)^{−n/2} e^{−(1/(2σ²)) Σ_{i=1}^n (yi −y)²},
showing that Y is sufficient for µ in a N (µ, σ²) model with known σ².
A. ii. Again consider the Normal likelihood given in (3). This time assume that µ is known
but not σ². Let T (Y1 , Y2 , . . . , Yn ) = Σ_{i=1}^n (Yi − µ)². Note that this T is a statistic because
µ is known. In this case define g(t, σ²) = (2πσ²)^{−n/2} e^{−t/(2σ²)} and h(y1 , y2 , . . . , yn ) = 1, so that
L(σ²|y1 , y2 , . . . , yn ) = g(t, σ²) h(y1 , y2 , . . . , yn ). Thus in this case Σ_{i=1}^n (Yi − µ)² is sufficient
for σ².
A. iii. Finally consider the Normal likelihood in (3) with both (µ, σ²) unknown. Note that
in this case the unknown parameter is vector valued, with θ = (µ, σ²), and we should
correspondingly have a vector-valued sufficient statistic T . Thus let T = (Y , Σ_{i=1}^n (Yi − Y )²),
g(t, θ) = (2πσ²)^{−n/2} e^{−(1/(2σ²))(Σ_{i=1}^n (yi −y)² + n(y−µ)²)} and h(y1 , y2 , . . . , yn ) = 1, so that
L(θ|y1 , y2 , . . . , yn ) = g(t, θ) h(y1 , y2 , . . . , yn ). Thus in this case (Y , Σ_{i=1}^n (Yi − Y )²) is
sufficient for θ = (µ, σ²).

B. For the Bernoulli likelihood in (4), let T = Σ_{i=1}^n Yi , g(t, p) = p^t (1 − p)^{n−t} and
h(y1 , y2 , . . . , yn ) = 1. Then L(p|y1 , y2 , . . . , yn ) = g(t, p) h(y1 , y2 , . . . , yn ), and thus
Σ_{i=1}^n Yi is sufficient for p.

C. Σ_{i=1}^n Yi is sufficient for λ in the Poisson model, because for the Poisson model we may
define T = Σ_{i=1}^n Yi , g(t, λ) = e^{−nλ} λ^t , and h(y1 , y2 , . . . , yn ) = 1/ Π_{i=1}^n yi !, so that the
Poisson likelihood in (5) equals g(t, λ) h(y1 , y2 , . . . , yn ). □
We are now in a position to state the first theorem pertaining to the UMVUE, due to Rao
and Blackwell, which is as follows.
Theorem 3 (Rao-Blackwell Theorem): If T is sufficient for θ and U is any unbiased
estimate of θ, then the statistic h(T ) = E[U |T ] is also unbiased for θ and Vθ [h(T )] ≤ Vθ [U ].
Proof: First note that h(T ), the conditional expectation of U given T , is a statistic, i.e. it
can be completely determined from the data: since T is sufficient, the conditional
distribution of Y1 , Y2 , . . . , Yn given T , and hence E[U (Y1 , Y2 , . . . , Yn )|T ] = h(T ), does not
depend on θ.
Next observe that h(T ) is unbiased for θ. This is because Eθ [h(T )] = Eθ [E[U |T ]] = Eθ [U ] =
θ ∀θ ∈ Θ. The last equality follows because U is unbiased for θ, and the one before that
follows because E[E[X|Y ]] = E[X]. Now,

Vθ [U ]
= Eθ [(U − θ)²]
    (because U is unbiased for θ)
= Eθ [{(U − h(T )) + (h(T ) − θ)}²]
= Eθ [(U − h(T ))²] + Eθ [(h(T ) − θ)²] + 2Eθ [(U − h(T ))(h(T ) − θ)]
= Eθ [(U − h(T ))²] + Vθ [h(T )] + 2Eθ [E[(U − h(T ))(h(T ) − θ)|T ]]
    (because h(T ) is unbiased for θ, and E[E[X|Y ]] = E[X])
= Eθ [(U − h(T ))²] + Vθ [h(T )] + 2Eθ [(h(T ) − θ) E[(U − h(T ))|T ]]
    (because given T , (h(T ) − θ) is a constant and hence can be taken out of the inner
    expectation)
= Eθ [(U − h(T ))²] + Vθ [h(T )]
    (because E[(U − h(T ))|T ] = E[U |T ] − E[h(T )|T ] = h(T ) − h(T ) = 0)
≥ Vθ [h(T )]
    (because Eθ [(U − h(T ))²] ≥ 0) □

The Rao-Blackwell theorem only goes to show how an arbitrary unbiased estimator may
be improved, in the sense of reducing its variance while preserving its unbiasedness. It states
that the way to reduce the variance of an unbiased estimator while preserving its unbiasedness
is to consider a new statistic which is the conditional expectation of the unbiased estimator
given a sufficient statistic. Incidentally, this process of taking the conditional expectation of
a statistic given a sufficient statistic is called Rao-Blackwellisation. As such, the theorem
does not directly state anything about an estimator being a UMVUE. Indeed, according
to the theorem, if U1 and U2 are two unbiased estimators of θ then they can respectively
be improved upon by considering h1 (T ) = E[U1 |T ] and h2 (T ) = E[U2 |T ], but it says
nothing comparing the variances of h1 (T ) and h2 (T ). This problem is resolved by
introducing another concept called completeness.
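A classical illustration of Rao-Blackwellisation (the model and target are standard; the numbers below are illustrative): estimating θ = e^{−λ} = P (Y = 0) from a Poisson(λ) sample. Here U = 1{Y1 = 0} is unbiased, and conditioning on the sufficient statistic T = Σ Yi gives h(T ) = ((n − 1)/n)^T , since Y1 | T = t ∼ Binomial(t, 1/n) by the computation of Example 3:

```python
import math
import random

# Rao-Blackwellisation sketch (illustrative numbers): estimating
# theta = exp(-lam) = P(Y = 0) from a Poisson(lam) sample of size n.
# U = 1{Y_1 = 0} is unbiased; conditioning on the sufficient statistic
# T = sum(Y_i) gives h(T) = ((n - 1) / n)^T.
random.seed(3)

def rpois(lam):
    # Knuth's multiplication method for Poisson sampling.
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

lam, n, reps = 1.5, 10, 20000
theta = math.exp(-lam)

u_vals, h_vals = [], []
for _ in range(reps):
    y = [rpois(lam) for _ in range(n)]
    u_vals.append(1.0 if y[0] == 0 else 0.0)
    h_vals.append(((n - 1) / n) ** sum(y))

mean_u = sum(u_vals) / reps
mean_h = sum(h_vals) / reps
var_u = sum((u - mean_u) ** 2 for u in u_vals) / (reps - 1)
var_h = sum((h - mean_h) ** 2 for h in h_vals) / (reps - 1)
# Both estimators centre on theta, but the Rao-Blackwellised version
# has a much smaller variance.
print(round(mean_u, 3), round(mean_h, 3), var_h < var_u)
```

Both estimators are unbiased for θ, but the Rao-Blackwellised h(T ) has a far smaller variance, exactly as Theorem 3 guarantees.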
Definition 5: A statistic T is called complete if Eθ [g(T )] = 0 ∀θ ∈ Θ =⇒ g(T ) ≡ 0.3
3
The intuition behind completeness of a statistic can be understood only together with sufficiency. The
opposite of a sufficient statistic is an ancillary statistic: a statistic is called ancillary if its distribution
does not depend on the unknown θ. For example if Y1 , Y2 , . . . , Yn are i.i.d. Uniform[θ − 1/2, θ + 1/2] or
Uniform[θ, θ + 1], the smallest and largest observations, denoted by (Y(1) , Y(n) ), are sufficient (use the
Factorisation Theorem) but Y(n) − Y(1) is ancillary in either case. A sufficient statistic packs in itself all the
information the data has got about the unknown θ, while an ancillary statistic is useless for drawing any
inference about θ, since its distribution does not depend on θ. Now a sufficient statistic is most successful
in optimal data reduction when there is no ancillary statistic. In all the instances of Example 5, the number
of parameters and the corresponding number of sufficient statistics agree, because in those cases there does
not exist any ancillary statistic. But in the Uniform[θ − 1/2, θ + 1/2] or Uniform[θ, θ + 1] examples above,
the sufficient statistic is two-dimensional for a single parameter θ. This is because of the presence of the
ancillary statistic. Now a weaker notion of ancillarity, called first-order ancillarity, occurs when the
expectation of a non-constant statistic does not depend on θ. Ancillarity requires the entire distribution to
be free of θ, while first-order ancillarity requires the same only of its mean. Completeness of a sufficient
statistic guarantees that there is no first-order ancillary statistic, and thus no ancillary statistic (because
ancillarity implies first-order ancillarity). For if T is complete and some function of it has a constant
expectation c no matter what the value of θ is, then that function must identically equal c (if Eθ [f (T )] =
c ∀θ ∈ Θ, then Eθ [f (T ) − c] = 0 ∀θ ∈ Θ, implying f (T ) ≡ c by completeness). Thus if a sufficient statistic
is also complete it does not contain any ancillarity, and as a result we can expect good results from functions
of this sufficient statistic, such as a UMVUE.

Theorem 4 (Lehmann-Scheffé Theorem): If T is a complete sufficient statistic then for
any unbiased estimate U of θ, h(T ) = E[U |T ] is the unique UMVUE for θ.
Proof: We have already shown that if T is sufficient, then for any arbitrary unbiased estimate
U of θ, Vθ [h(T )] ≤ Vθ [U ], where h(T ) = E[U |T ]. Now if U1 and U2 are two different
unbiased estimators of θ, and h1 (T ) = E[U1 |T ] and h2 (T ) = E[U2 |T ], then since Eθ [h1 (T )] =
Eθ [h2 (T )] = θ ∀θ ∈ Θ, we have Eθ [h1 (T ) − h2 (T )] = 0 ∀θ ∈ Θ, which implies h1 (T ) ≡ h2 (T ),
since T is complete. Thus no matter which unbiased estimator U one starts with, after
its Rao-Blackwellisation w.r.t. a complete sufficient statistic T , one always ends up with
the same, unique Rao-Blackwellised version h(T ), whose variance is no greater than that of
any arbitrary unbiased estimator of θ. Thus h(T ) is the unique UMVUE for θ. □
Example 6: A random variable Y is said to belong to the exponential family of distributions
if it has p.d.f. (or p.m.f.)

f (y|η) = exp{ Σ_{i=1}^k ηi Ti (y) + A(η) + B(y) }     (7)

where η = (η1 , η2 , . . . , ηk ) is the vector of unknown parameters and A(·), B(·), T1 (·), T2 (·),
. . . , Tk (·) are known functions.
The p.d.f./p.m.f. of many standard probability models can be expressed in the form given in (7). If Y ∼ B(n, p), the Binomial distribution with p as the parameter of interest and known n,

    p(y|p) = \binom{n}{y} p^y (1 − p)^{n−y} = exp{ ηy + n log(1/(1 + e^η)) + log \binom{n}{y} }

where η = log(p/(1 − p)). This is in the form of (7) with k = 1, T₁(y) = y, A(η) = n log(1/(1 + e^η)) and B(y) = log \binom{n}{y}.
If Y ∼ N(µ, σ²), the Normal distribution with (µ, σ²) as the unknown parameters,

    f(y|µ, σ²) = (1/√(2πσ²)) exp{ −(y − µ)²/(2σ²) } = exp{ η₁y + η₂y² + (1/2)( η₁²/(2η₂) + log(−η₂) − log π ) }

where η₁ = µ/σ² and η₂ = −1/(2σ²). This is in the form of (7) with k = 2, T₁(y) = y, T₂(y) = y², A(η) = (1/2)( η₁²/(2η₂) + log(−η₂) − log π ) and B(y) = 0.
If Y ∼ Poisson(λ), p(y|λ) = e^{−λ}λ^y/y! = exp{ ηy − e^η − log y! }, where η = log λ. This is in the form of (7) with k = 1, T₁(y) = y, A(η) = −e^η and B(y) = − log y!.
If Y ∼ Gamma(α, λ), with both α and λ as unknown parameters,

    f(y|α, λ) = (λ^α/Γ(α)) y^{α−1} e^{−λy} = exp{ (η₁ log y + η₂y) + (η₁ log(−η₂) − log Γ(η₁)) − log y }

where η₁ = α and η₂ = −λ. This is in the form of (7) with k = 2, T₁(y) = log y, T₂(y) = y, A(η) = η₁ log(−η₂) − log Γ(η₁) and B(y) = − log y.
Similarly, the Negative Binomial, Hypergeometric and Beta distributions can also be shown to belong to the exponential family.⁴ Note that the parameter η in (7) is not the natural parameter of interest for the standard models discussed above; it is called the canonical parameter. However, in all cases there exists a one-to-one relationship between the natural parameter of a model and the canonical parameter, as illustrated in the important special models above, so we need not worry too much about this issue: we shall still be able to draw inference about the natural parameter of interest. Introduction of the exponential family merely facilitates the theoretical development, in the sense that it allows us to prove results like sufficiency and completeness, and hence obtain the UMVUE, for many probability models of interest at once in a unified manner.
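The canonical form (7) can be made concrete with a small numerical check. The Python sketch below is our own illustration (the names n, p, canonical_pmf and binomial_pmf are hypothetical, not from the notes); it verifies that the canonical parameterisation of the Binomial worked out above reproduces the usual p.m.f.

```python
import math

# Check that the canonical exponential-family form (7) reproduces the
# Binomial p.m.f.  With eta = log(p/(1-p)),
#   A(eta) = n log(1/(1+e^eta))  and  B(y) = log C(n, y).
n, p = 10, 0.3
eta = math.log(p / (1 - p))                     # canonical parameter
A = n * math.log(1.0 / (1.0 + math.exp(eta)))   # A(eta), equals n log(1-p)

def canonical_pmf(y):
    B = math.log(math.comb(n, y))               # B(y)
    return math.exp(eta * y + A + B)

def binomial_pmf(y):
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

for y in range(n + 1):
    assert abs(canonical_pmf(y) - binomial_pmf(y)) < 1e-12
```

The same exercise goes through for the Poisson and Gamma parameterisations above.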
Sufficiency: If the distribution of Y belongs to the exponential family, and Y₁, Y₂, . . . , Y_n is an i.i.d. sample of Y, then it is immediate from the Factorisation Theorem that T(Y₁, Y₂, . . . , Y_n) = ( Σ_{i=1}^n T₁(Y_i), . . . , Σ_{i=1}^n T_k(Y_i) ) is sufficient for η. This is because, in this case, the likelihood of η is

    L(η|y₁, y₂, . . . , y_n) = exp{ Σ_{j=1}^k η_j Σ_{i=1}^n T_j(y_i) + nA(η) } exp{ Σ_{i=1}^n B(y_i) } = g(t, η) h(y₁, y₂, . . . , y_n)                (8)

where g(t, η) = exp{ Σ_{j=1}^k η_j Σ_{i=1}^n T_j(y_i) + nA(η) } and h(y₁, y₂, . . . , y_n) = exp{ Σ_{i=1}^n B(y_i) }.
Completeness: We shall prove completeness of the exponential family only for k = 1, which is easily generalisable to higher dimensions. Suppose the distribution of Y belongs to the exponential family, and E_η[g(Y)] = 0 ∀η. Then from (7) it follows that ∫ {g(y) exp(B(y))} exp(ηy) dy⁵ = 0 ∀η. But the l.h.s. is nothing but the Laplace transform of g(y) exp(B(y)). Since the Laplace transform of a function is unique, g(y) exp(B(y)) ≡ 0, implying g(y) ≡ 0.⁶
Thus for the exponential family, by the Rao-Blackwell and Lehmann-Scheffe theorems, any function of T is the UMVUE of its expectation. This is because the conditional expectation of any function of T given T is that function itself, and any statistic is an unbiased estimate of its own expectation. Now since T is also complete, such a function of T is the UMVUE.
⁴ However, not all important probability models useful in applications belong to the exponential family. For example, the Weibull distribution with p.d.f. f(y|λ, β) = λβ y^{β−1} e^{−λy^β} can at best be simplified to exp{ β log y − λe^{β log y} + (log λ + log β) − log y }, which is not in the form given in (7).
⁵ In case Y is discrete this integral is to be replaced by a summation and the rest of the argument carries through.
⁶ Completeness of the exponential family should be understood in terms of the distribution of the corresponding sufficient statistic T. For example, if Y ∼ N(0, σ²) then it belongs to the exponential family, but Y itself is not complete. This is because E_{σ²}[Y^α] = 0 ∀σ² > 0 for the odd powers α = 1, 3, 5, . . .; indeed the expectation of any odd function of Y is 0. But such functions are not identically 0, showing that Y is not complete. In this case, however, the sufficient statistic is Y², which is complete.
Thus if Y ∼ B(n, p), since E_p[Y/n] = p and Y is complete and sufficient, the sample proportion p̂ = Y/n is the UMVUE of the population proportion p.
If Y₁, Y₂, . . . , Y_n are i.i.d. N(µ, σ²), ( Σ_{i=1}^n Y_i, Σ_{i=1}^n Y_i² ) is complete and sufficient for (µ/σ², −1/(2σ²)). But since ( Σ_{i=1}^n Y_i, Σ_{i=1}^n Y_i² ) ↔ ( Ȳ = (1/n)Σ_{i=1}^n Y_i, s²_{n−1} = (1/(n−1))Σ_{i=1}^n (Y_i − Ȳ)² ) and (µ/σ², −1/(2σ²)) ↔ (µ, σ²) are one-to-one, the sample mean and variance (Ȳ, s²_{n−1}) are complete and sufficient for (µ, σ²). Now since E_{µ,σ²}[Ȳ] = (1/n)Σ_{i=1}^n E_{µ,σ²}[Y_i] = (1/n)·nµ = µ, Ȳ is the UMVUE of µ. Similarly, since

    E_{µ,σ²}[s²_{n−1}] = (1/(n−1)) E_{µ,σ²}[ Σ_{i=1}^n (Y_i − Ȳ)² ] = (1/(n−1)) E_{µ,σ²}[ Σ_{i=1}^n Y_i² − nȲ² ]
      = (1/(n−1)) Σ_{i=1}^n { V_{µ,σ²}[Y_i] + (E_{µ,σ²}[Y_i])² } − (n/(n−1)){ V_{µ,σ²}[Ȳ] + (E_{µ,σ²}[Ȳ])² }
        (because for any random variable X, E[X²] = V[X] + (E[X])²)
      = (1/(n−1)) Σ_{i=1}^n (σ² + µ²) − (n/(n−1))(σ²/n + µ²)
        (because V_{µ,σ²}[Ȳ] = (1/n²)Σ_{i=1}^n σ² = σ²/n, and we have just shown that E_{µ,σ²}[Ȳ] = µ)
      = ( n/(n−1) − 1/(n−1) ) σ² = σ²,

s²_{n−1} is the UMVUE of σ².
If Y₁, Y₂, . . . , Y_n are i.i.d. Poisson(λ) then Σ_{i=1}^n Y_i is complete and sufficient. Since E_λ[ Σ_{i=1}^n Y_i ] = nλ, Ȳ = (1/n)Σ_{i=1}^n Y_i is the UMVUE of λ. □
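The unbiasedness claims above are easy to corroborate by simulation. The following is our own Monte Carlo sketch (variable names are hypothetical; only the Python standard library is used): repeated N(µ, σ²) samples are drawn, and the averages of Ȳ and s²_{n−1} are checked against µ and σ².

```python
import random
import statistics

# Monte Carlo illustration (not a proof) of the unbiasedness of the
# complete-sufficient-statistic based estimators: for i.i.d. N(mu, sigma^2)
# samples, the sample mean and s^2_{n-1} average out to mu and sigma^2.
random.seed(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 20_000

mean_est, var_est = 0.0, 0.0
for _ in range(reps):
    ys = [random.gauss(mu, sigma) for _ in range(n)]
    mean_est += statistics.fmean(ys) / reps
    var_est += statistics.variance(ys) / reps   # divides by n - 1

assert abs(mean_est - mu) < 0.05        # close to mu
assert abs(var_est - sigma**2) < 0.3    # close to sigma^2
```

Replacing `statistics.variance` (which divides by n − 1) with a divide-by-n version would show the familiar downward bias of the MLE of σ².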
We move on to the other method of obtaining an UMVUE after presenting an interesting example of obtaining an UMVUE through Rao-Blackwellisation.
Example 7: If T denotes the life of a system, its reliability at time t is defined as P[T > t]. The exponential distribution is a very popular model for the life of a system. So let T have an exponential distribution with mean θ, denoted by T ∼ exp(θ), and suppose we are interested in estimating the reliability of the system at a given mission time t. For this purpose suppose we observe the times to failure, or lives, of n independent systems, T₁ = t₁, . . . , T_n = t_n. Now if T ∼ exp(θ), P[T > t] = e^{−t/θ}, and T̄ = (1/n)Σ_{i=1}^n T_i is the UMVUE of θ, because Σ_{i=1}^n T_i is a complete and sufficient statistic for exp(θ) and E_θ[ (1/n)Σ_{i=1}^n T_i ] = θ. But this does not imply that e^{−t/T̄} is the UMVUE for e^{−t/θ}. In fact e^{−t/T̄} is not even an unbiased estimate of e^{−t/θ}, and this leads to the search for an UMVUE of e^{−t/θ}.
With the first observation T₁, the estimator I_t(T₁), which equals 1 if T₁ > t and 0 otherwise, the indicator function of the event T₁ > t, is an unbiased estimate of P[T > t] for any life distribution T (not just exponential), because E_θ[I_t(T₁)] = P[T > t]. Now, for the exponential distribution, since Σ_{i=1}^n T_i is a complete and sufficient statistic for θ, by the Rao-Blackwell and Lehmann-Scheffe theorems E[ I_t(T₁) | Σ_{i=1}^n T_i ] is the UMVUE for P[T > t]. The derivation of this quantity only requires some probability calculation, which is as follows.
Let S = Σ_{i=1}^n T_i and R = Σ_{i=2}^n T_i. Since I_t(T₁) is a function of T₁, its conditional expectation is found by deriving the conditional density f_{T₁|S}(t₁|s) of T₁|S, which requires the joint density f_{T₁,S}(t₁, s) of (T₁, S). This is accomplished by considering the joint density of (T₁, R) (which is easy because of the independence of T₁ and R) and then considering the one-to-one onto transformation (T₁, R) ↔ (T₁, S) defined as T₁ = T₁ and S = R + T₁. Note that S ∼ Gamma(n, 1/θ) and likewise R ∼ Gamma(n − 1, 1/θ). Thus the density of R is

    f_R(r) = (1/(θ^{n−1}(n − 2)!)) r^{n−2} e^{−r/θ}
and since T₁ ∼ exp(θ) the joint density of (T₁, R) is given by

    f_{T₁,R}(t₁, r) = (1/(θⁿ(n − 2)!)) r^{n−2} e^{−(r+t₁)/θ}.

Now applying the one-to-one onto transformation (T₁, R) ↔ (T₁, S) we get the joint density of (T₁, S) as

    f_{T₁,S}(t₁, s) = (1/(θⁿ(n − 2)!)) (s − t₁)^{n−2} e^{−s/θ}.

The conditional density f_{T₁|S}(t₁|s) of T₁|S is found by dividing f_{T₁,S}(t₁, s) by the marginal density f_S(s) of S, which, since S ∼ Gamma(n, 1/θ), gives

    f_{T₁|S}(t₁|s) = ((n − 1)/s) (1 − t₁/s)^{n−2},   0 < t₁ < s.

Now since I_t(T₁) is the indicator function of the event T₁ > t, E[I_t(T₁)|S] is found by integrating f_{T₁|S}(t₁|s) from t to t₁'s upper bound s. This integral is found as follows:

    E[I_t(T₁)|S] = ((n − 1)/s) ∫_t^s (1 − t₁/s)^{n−2} dt₁ = (n − 1) ∫_0^{1−t/s} u^{n−2} du = (1 − t/s)^{n−1}

(with the substitution u = 1 − t₁/s we have du = −dt₁/s, and the range of the integral runs from 1 − t/s to 0, which after adjusting for the negative sign becomes 0 to 1 − t/s).
Thus the UMVUE of P[T > t] = e^{−t/θ}, the reliability at mission time t of a system having an exponential life distribution with mean θ, based on observations T₁, . . . , T_n, is given by ( 1 − t/Σ_{i=1}^n T_i )^{n−1}. □
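The difference between this UMVUE and the naive plug-in estimator e^{−t/T̄} can be seen numerically. The sketch below is our own Monte Carlo illustration (parameter values and names are hypothetical): at a small sample size the plug-in estimator shows visible bias, while the Rao-Blackwellised estimator averages out to the true reliability.

```python
import math
import random

# Compare the UMVUE (1 - t/sum(T_i))^{n-1} with the plug-in exp(-t/Tbar)
# for exponential reliability P[T > t] = exp(-t/theta).
random.seed(1)
theta, t, n, reps = 5.0, 2.0, 5, 40_000
true_rel = math.exp(-t / theta)

umvue_avg = plugin_avg = 0.0
for _ in range(reps):
    ts = [random.expovariate(1.0 / theta) for _ in range(n)]
    s = sum(ts)
    # the conditional probability is 0 when the total life does not exceed t
    umvue = (1 - t / s) ** (n - 1) if s > t else 0.0
    plugin = math.exp(-t / (s / n))
    umvue_avg += umvue / reps
    plugin_avg += plugin / reps

# the UMVUE's Monte Carlo mean sits on the true reliability, while the
# plug-in estimator is noticeably biased at this small n
assert abs(umvue_avg - true_rel) < abs(plugin_avg - true_rel)
```

Increasing n shrinks the plug-in bias, in line with the asymptotic results for MLEs discussed later in these notes.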
Now we shall present the second and final method of determining an UMVUE. Though so far, while discussing the theory, we have pretended as if our interest lies only in the natural population parameter θ, as Example 7 reveals, many times we may be interested in some function φ(θ) of the original population parameter θ. Note that this does not require redoing the previous Rao-Blackwell Lehmann-Scheffe theory of determining the UMVUE: in this case one confines oneself to unbiased estimators of φ(θ) and restricts one's search for the UMVUE only among functions of complete sufficient statistics of θ.
So now suppose we are interested in estimating some parametric function φ(θ) of θ. In this discussion we shall only deal with scalar-valued θ and φ(θ).
Theorem 5 (Cramer-Rao Theorem): If the population p.d.f. f(y|θ) is "regular", Y₁, Y₂, . . . , Y_n are i.i.d. f(y|θ) and T(Y₁, Y₂, . . . , Y_n) is an unbiased estimate of φ(θ), then

    V_θ[T] ≥ {φ′(θ)}² / ( −n E_θ[ ∂² log f(Y|θ)/∂θ² ] ).⁷

⁷ In the discrete case, replace the p.d.f. f(y|θ) by the p.m.f. p(y|θ) in the statement of the theorem and replace the integrals by summations in its proof, and the same result follows.
Proof: Let Y = (Y₁, Y₂, . . . , Y_n) and y = (y₁, y₂, . . . , y_n). Since the joint density of Y₁, Y₂, . . . , Y_n is the same as the likelihood function of θ, as defined in Definition 4, E_θ[g(Y)] = ∫_{ℝⁿ} g(y) L(θ|y) dy for any function g of Y, and

    ∫_{ℝⁿ} L(θ|y) dy = 1.

Partially differentiating the above w.r.t. θ within the integral sign (which is allowed since the model is assumed to be "regular") we get

    ∫_{ℝⁿ} ∂L(θ|y)/∂θ dy = 0

which can be rewritten as

    E_θ[ ∂ log L(θ|Y)/∂θ ] = ∫_{ℝⁿ} { (1/L(θ|y)) ∂L(θ|y)/∂θ } L(θ|y) dy = 0.                (9)

If we again differentiate (9) w.r.t. θ under the integral sign (which is allowed under a second "regularity" condition) we get

    ∫_{ℝⁿ} { (1/L(θ|y)) (∂L(θ|y)/∂θ) (∂L(θ|y)/∂θ) + (∂/∂θ)( (1/L(θ|y)) ∂L(θ|y)/∂θ ) L(θ|y) } dy
      = ∫_{ℝⁿ} { (∂ log L(θ|y)/∂θ) (1/L(θ|y)) (∂L(θ|y)/∂θ) + (∂/∂θ)( ∂ log L(θ|y)/∂θ ) } L(θ|y) dy
      = ∫_{ℝⁿ} { (∂ log L(θ|y)/∂θ)² + ∂² log L(θ|y)/∂θ² } L(θ|y) dy = 0.

The last equality implies that

    E_θ[ (∂ log L(θ|Y)/∂θ)² ] = −E_θ[ ∂² log L(θ|Y)/∂θ² ].                (10)

Now if T(Y) is an unbiased estimator of φ(θ),

    ∫_{ℝⁿ} T(y) L(θ|y) dy = φ(θ).                (11)

Differentiating (11) under the integral sign (which is allowed under a third "regularity" condition) we obtain

    ∫_{ℝⁿ} T(y) (∂ log L(θ|y)/∂θ) L(θ|y) dy = φ′(θ)

and by (9)

    ∫_{ℝⁿ} {T(y) − φ(θ)} (∂ log L(θ|y)/∂θ) L(θ|y) dy = φ′(θ).                (12)

Hence by (9), (11) and (12),

    Cov_θ[ T(Y), ∂ log L(θ|Y)/∂θ ] = φ′(θ).                (13)
Now by (9) and (10),

    V_θ[ ∂ log L(θ|Y)/∂θ ] = −E_θ[ ∂² log L(θ|Y)/∂θ² ] = −E_θ[ (∂²/∂θ²) Σ_{i=1}^n log f(Y_i|θ) ] = −n E_θ[ ∂² log f(Y|θ)/∂θ² ].                (14)

Let ρ_{T(Y), ∂ log L(θ|Y)/∂θ} denote the correlation coefficient between T(Y) and ∂ log L(θ|Y)/∂θ. Then by (13) and (14),

    ρ²_{T(Y), ∂ log L(θ|Y)/∂θ} = { Cov_θ[ T(Y), ∂ log L(θ|Y)/∂θ ] }² / ( V_θ[T(Y)] V_θ[ ∂ log L(θ|Y)/∂θ ] ) = {φ′(θ)}² / ( V_θ[T(Y)] { −n E_θ[ ∂² log f(Y|θ)/∂θ² ] } ) ≤ 1,

implying V_θ[T] ≥ {φ′(θ)}² / ( −n E_θ[ ∂² log f(Y|θ)/∂θ² ] ). □
The quantity {φ′(θ)}² / ( −n E_θ[ ∂² log f(Y|θ)/∂θ² ] ) is called the Cramer-Rao lower bound (CRLB). The quantity −E_θ[ ∂² log f(Y|θ)/∂θ² ], which by virtue of (10) (for n = 1) equals E_θ[ (∂ log f(Y|θ)/∂θ)² ], is called the Fisher Information and is denoted by I(θ). The Fisher Information for the entire sample, denoted by I_n(θ), which in general is given in (10), reduces to nI(θ) in the i.i.d. case. The intuitive reason behind calling this quantity "information" will be clear in the next sub-subsection, when we take up the Method of Maximum Likelihood or Maximum Likelihood Estimation (MLE). In this connection note that the CRLB will be attained by an unbiased estimator T(Y) of φ(θ) if and only if its correlation with ∂ log L(θ|Y)/∂θ is ±1, which happens if and only if T(Y) − φ(θ) is a constant multiple of ∂ log L(θ|Y)/∂θ, since by (9) E_θ[ ∂ log L(θ|Y)/∂θ ] = 0. But since in general this constant multiple may depend on θ, it may be said that the CRLB is attained by an unbiased estimator T(Y) of φ(θ) if and only if

    ∂ log L(θ|Y)/∂θ = A(θ) {T(Y) − φ(θ)}.                (15)
We shall close this sub-subsection by illustrating the utility of the CRLB in the determination of an UMVUE. Theorem 5 gives a lower bound for the variance of an unbiased estimator T(Y) of a parametric function φ(θ). Thus if we can somehow catch hold of an unbiased estimator φ̂ of φ(θ) whose variance coincides with this lower bound, then we know that the variance of any other unbiased estimator is at least that of φ̂, and thus φ̂ is an UMVUE. That is, if the CRLB is attained by the variance of an unbiased estimator of φ(θ), then that estimator is an UMVUE.
Example 8: A. If Y₁, Y₂, . . . , Y_n are i.i.d. Bernoulli(p),

    I(p) = V_p[ (∂/∂p){ Y log p + (1 − Y) log(1 − p) } ] = V_p[ (Y − p)/(p(1 − p)) ] = 1/(p(1 − p))

and V_p[ p̂ = (1/n)Σ_{i=1}^n Y_i ] = p(1 − p)/n = 1/I_n(p), the CRLB, showing the sample proportion to be the UMVUE for the population proportion.
B. Let Y₁, Y₂, . . . , Y_n be i.i.d. N(µ, 1). Then

    I_n(µ) = −E_µ[ (∂²/∂µ²) log( Π_{i=1}^n (1/√(2π)) e^{−(Y_i−µ)²/2} ) ] = −E_µ[ (∂/∂µ) Σ_{i=1}^n (Y_i − µ) ] = n

and V_µ[Ȳ] = 1/n = 1/I_n(µ). Since Ȳ is unbiased for µ, it is an UMVUE, as its variance attains the CRLB.
C. Let Y₁, Y₂, . . . , Y_n be i.i.d. exponential with mean θ. Then f(y|θ) = (1/θ)e^{−y/θ}. Thus (∂²/∂θ²) log f(y|θ) = (∂/∂θ){ −(1/θ) + (y/θ²) } = (1/θ²) − 2(y/θ³), so that I(θ) = −E_θ[ (1/θ²) − 2(Y/θ³) ] = 1/θ² and I_n(θ) = n/θ². Now if we are interested in estimating θ, since E_θ[Ȳ] = θ and V_θ[Ȳ] = θ²/n = 1/I_n(θ) = the CRLB, Ȳ is the UMVUE for θ. □
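Example 8A can be checked numerically. The sketch below is our own illustration (all names are hypothetical): it estimates I(p) = E[(∂ log f(Y|p)/∂p)²] by Monte Carlo, compares it with the closed form 1/{p(1 − p)}, and confirms that Var(p̂) = p(1 − p)/n equals the CRLB 1/{nI(p)}.

```python
import random

# Monte Carlo check of the Bernoulli Fisher Information and the CRLB.
random.seed(2)
p, reps = 0.3, 200_000

score_sq = 0.0
for _ in range(reps):
    y = 1 if random.random() < p else 0
    # score = d/dp { y log p + (1-y) log(1-p) } = (y - p)/(p(1-p))
    score = y / p - (1 - y) / (1 - p)
    score_sq += score ** 2 / reps

closed_form = 1.0 / (p * (1 - p))           # I(p)
assert abs(score_sq - closed_form) / closed_form < 0.05

n = 50
crlb = 1.0 / (n * closed_form)              # CRLB for unbiased estimators of p
assert abs(crlb - p * (1 - p) / n) < 1e-15  # Var(p_hat) attains it
```

The same Monte Carlo recipe, with the score function swapped, works for parts B and C.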
2.1.2 MLE
So far we have discussed some criteria under which an estimator may be called “good”
and have only provided at best some indirect methods of obtaining such estimators. In
general it is desirable to have some automatic methods which yield estimators which are
at least approximately “good”. One such method is the method of Maximum Likelihood
(ML) yielding an MLE. If a UMVUE does not exist or cannot be obtained using the results
discussed in §2.1.1, then by far the most popular method of estimation employed in practice is
the method of Maximum Likelihood. This is partly because instead of having an assortment
of theorems guiding one in the search of a “good” estimator as in the case of UMVUE,
it is an automatic method producing at least numerical estimates; and partly (actually
philosophically the only reason) because the method yields estimators which are “good” in
the asymptotic sense and thus work very well for large samples.
Let the parameter of interest θ be vector valued, and let L(θ|y) denote the likelihood of
θ given the observations y, where the likelihood function L(θ|y) is as has been defined in
Definition 4.
Definition 6: θ̂ is called the Maximum Likelihood Estimator (MLE) of θ if L(θ̂|y) ≥
L(θ|y) ∀θ ∈ Θ .
Intuitively, granting the informal interpretation of the likelihood function discussed in the
paragraph following Definition 4, MLE θ̂ is that value of θ for which one is most likely to
observe a data set such as the one at hand namely Y = y, among all possible values of
θ ∈ Θ. Other than this intuitive appeal, in the so-called "regular"⁸ cases computation of the MLE is conceptually straightforward. All one has to do is find the maxima of L(θ|y), which is typically accomplished by solving the system of equations ∂L(θ|y)/∂θ |_{θ=θ̂} = 0, or equivalently by solving

    ∂ℓ(θ|y)/∂θ |_{θ=θ̂} = 0, where ℓ(θ|y) = log(L(θ|y))                (16)

⁸ A model f(y|θ) or p(y|θ) is regular if it satisfies the conditions required for interchanging the differentiation and integration signs in Theorem 5, yielding the CRLB.
((16) is usually easier to deal with because the likelihood function is a product, which after taking logs becomes a sum), and then showing the resulting solution θ̂ to be the global maxima⁹.
Example 9: A. If Y₁, Y₂, . . . , Y_n are i.i.d. Bernoulli(p),

    L(p|y) = p^{Σ_{i=1}^n y_i} (1 − p)^{n − Σ_{i=1}^n y_i} ⇒ ℓ(p|y) = log(p/(1 − p)) Σ_{i=1}^n y_i + n log(1 − p)

and

    ∂ℓ(p|y)/∂p |_{p=p̂} = 0 ⇒ p̂ = (1/n) Σ_{i=1}^n y_i.

Since ∂²ℓ(p|y)/∂p² = −( Σ_{i=1}^n y_i/p² + (n − Σ_{i=1}^n y_i)/(1 − p)² ) < 0 ∀ 0 < p < 1, ℓ(p|y) is concave and thus p̂ = (1/n)Σ_{i=1}^n y_i, the sample proportion, is the MLE of the unknown population proportion p.
B. Let Y₁, Y₂, . . . , Y_n be i.i.d. N(µ, σ²). Then

    L(µ, σ²) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{i=1}^n (y_i − µ)² }
    ⇒ ℓ(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − µ)².

Now, setting the two partial derivatives of ℓ(µ, σ²|y) to 0 at (µ̂, σ̂²),

    (1/σ̂²) Σ_{i=1}^n (y_i − µ̂) = 0 and −n/(2σ̂²) + (1/(2(σ̂²)²)) Σ_{i=1}^n (y_i − µ̂)² = 0
    ⇒ (µ̂, σ̂²) = ( ȳ, (1/n) Σ_{i=1}^n (y_i − ȳ)² ).

The second derivative matrix of ℓ(µ, σ²|y) has entries

    ∂²ℓ/∂µ² = −n/σ², ∂²ℓ/∂µ∂(σ²) = −(1/(σ²)²) Σ_{i=1}^n (y_i − µ), ∂²ℓ/∂(σ²)² = (1/(σ²)²){ n/2 − (1/σ²) Σ_{i=1}^n (y_i − µ)² },

which is negative-definite at (µ̂, σ̂²), and asymptotically so as n → ∞ in probability¹⁰, because by the law of large numbers

    (1/n) × (second derivative matrix) —P→ [ −1/σ²  0 ; 0  −1/(2(σ²)²) ].

Hence (µ̂, σ̂²) = ( ȳ, (1/n)Σ_{i=1}^n (y_i − ȳ)² ) is the MLE of (µ, σ²). □
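A quick numerical sanity check of Example 9B (our own sketch; it assumes nothing beyond the closed forms derived above): the closed-form MLE should beat any nearby parameter value on the log-likelihood.

```python
import math
import random

# Confirm numerically that (ybar, (1/n) sum (y_i - ybar)^2) maximises the
# N(mu, sigma^2) log-likelihood against nearby parameter values.
random.seed(3)
ys = [random.gauss(1.0, 2.0) for _ in range(200)]
n = len(ys)

def loglik(mu, s2):
    return -0.5 * n * math.log(2 * math.pi * s2) \
           - sum((y - mu) ** 2 for y in ys) / (2 * s2)

mu_hat = sum(ys) / n
s2_hat = sum((y - mu_hat) ** 2 for y in ys) / n   # MLE divides by n, not n-1

best = loglik(mu_hat, s2_hat)
for dmu in (-0.1, 0.0, 0.1):
    for ds2 in (-0.1, 0.0, 0.1):
        # any perturbed value can do no better than the MLE
        assert loglik(mu_hat + dmu, s2_hat + ds2) <= best + 1e-9
```

A grid search like this is of course no substitute for the concavity argument above, but it catches algebra slips cheaply.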
Having understood the method i.e. how to implement it, let us now turn our attention
to the more important question of the justification of this method. That is why in a given
⁹ Showing something to be a global maxima in general is a daunting task, but in most MLE computations this feat is typically accomplished by showing that the log-likelihood is concave.
¹⁰ A sequence of random variables {X_n}_{n=1}^∞ is said to converge to a random variable X in probability, notationally represented as X_n —P→ X, if ∀ε > 0 the sequence of real numbers P[|X_n − X| < ε] → 1 as n → ∞.
situation the MLE should be considered, and what optimality properties it possesses. We shall present the proofs only in the single parameter case; they are easily extendable to the multi- or vector-valued parameter case. Also we shall only be concerned with regular cases. For non-regular cases ML may or may not be the best approach for estimation. But before discussing optimality of the MLE, we have to first introduce another criterion for an estimator being "good", called consistency.
Definition 7: An estimator θ̂_n, based on a sample of size n, is said to be consistent for θ if θ̂_n —P→ θ ∀θ ∈ Θ.
That is, if an estimator is consistent, it has the desirable property of converging to the true unknown value as the sample size increases.
Example 10 (Weak Law of Large Numbers): Let Ȳ_n = (1/n)Σ_{i=1}^n Y_i denote the sample mean based on n i.i.d. observations Y₁, Y₂, . . . , Y_n from an arbitrary population having mean µ and variance σ². Since E[Ȳ_n] = µ (Example 2) and V[Ȳ_n] = σ²/n (Example 6, page 13), by Chebyshev's inequality

    P[ |Ȳ_n − µ| < ε ] = P[ |Ȳ_n − µ| < (√n ε/σ)(σ/√n) ] ≥ 1 − σ²/(nε²) → 1 as n → ∞.

Thus the sample mean is always a consistent estimate of the population mean, provided the population mean and variance exist. □
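The concentration in Example 10 is easy to watch by simulation. Below is a sketch of ours (the exponential population and all names are arbitrary choices, not from the notes) showing P[|Ȳ_n − µ| < ε] climbing toward 1 as n grows.

```python
import random

# Empirical illustration of the Weak Law of Large Numbers for an
# Exponential(mean 2) population: the probability that the sample mean
# lies within eps of mu increases with the sample size n.
random.seed(4)
mu, eps, reps = 2.0, 0.25, 2_000

def coverage(n):
    hits = 0
    for _ in range(reps):
        ybar = sum(random.expovariate(1 / mu) for _ in range(n)) / n
        hits += abs(ybar - mu) < eps
    return hits / reps

cov_small, cov_large = coverage(20), coverage(500)
assert cov_small < cov_large     # concentration improves with n
assert cov_large > 0.95
```

Chebyshev's bound 1 − σ²/(nε²) is quite loose here; the empirical coverage at n = 500 is far closer to 1 than the bound guarantees.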
Theorem 6: Assume the probability model is "regular". Then as n → ∞, (16) always has a solution with probability tending to 1. Furthermore such a solution is unique with probability tending to 1, and it is consistent for θ.
Proof: The above theorem essentially makes three claims: existence of a solution, uniqueness of the solution, and consistency of the solution. We shall prove existence, consistency and uniqueness, in that order, after some preliminaries.
Let θ₀ denote the true unknown value of the parameter of interest θ. Consider the random variable L(θ|Y)/L(θ₀|Y), where L(θ|Y) is the likelihood function of the data set Y = (Y₁, Y₂, . . . , Y_n). Since an arithmetic mean is always greater than the corresponding geometric mean, so are their logarithms, which means the logarithm of an expectation (an arithmetic mean) is always greater than the expectation of the logarithm (because that is what the logarithm of a geometric mean reduces to). Thus

    log( E_{θ₀}[ L(θ|Y)/L(θ₀|Y) ] ) > E_{θ₀}[ log( L(θ|Y)/L(θ₀|Y) ) ]

where E_{θ₀}[g(Y)] denotes the expectation of the random variable g(Y) when θ equals its true unknown value θ₀. Now ∀θ ∈ Θ,

    E_{θ₀}[ L(θ|Y)/L(θ₀|Y) ] = ∫_{ℝⁿ} ( L(θ|y)/L(θ₀|y) ) Π_{i=1}^n f(y_i|θ₀) dy = ∫_{ℝⁿ} ( L(θ|y)/L(θ₀|y) ) L(θ₀|y) dy = ∫_{ℝⁿ} Π_{i=1}^n f(y_i|θ) dy = 1.

Thus ∀θ ∈ Θ,

    E_{θ₀}[ log(L(θ₀|Y)) ] ≥ E_{θ₀}[ log(L(θ|Y)) ].

But (1/n) log(L(θ|Y)) = (1/n) Σ_{i=1}^n log(f(Y_i|θ)) ∀θ ∈ Θ, and thus by the law of large numbers (see Example 10 above), with probability tending to 1, (1/n) log(L(θ|Y)) will be close to its expectation. Hence ∀θ ∈ Θ,

    log(L(θ₀|Y)) ≥ log(L(θ|Y)) with probability tending to 1.                (17)
Existence: All the statements below are true with probability tending to 1. According to (17), log(L(θ|Y)) or ℓ(θ|Y) has a maxima at θ₀. Hence its derivative must vanish at θ₀. Therefore equation (16) must have a solution (at θ₀ in particular). This shows that there exists a solution of equation (16).
Consistency: Substituting θ = θ̂, the MLE of θ, in (17) we get log(L(θ₀|Y)) ≥ log(L(θ̂|Y)) with probability tending to 1. But on the other hand, by the definition of the MLE (vide Definition 6), log(L(θ̂|Y)) ≥ log(L(θ|Y)) ∀θ ∈ Θ, and thus in particular for θ = θ₀, log(L(θ̂|Y)) ≥ log(L(θ₀|Y)). Therefore the MLE θ̂ has to coincide with θ₀, the true unknown value of the parameter of interest, with probability tending to 1 as n → ∞, showing that the MLE is consistent.
Uniqueness: We shall outline the proof for the single parameter case. The proof for the general multi-parameter case is analogous, but since it requires a little additional notation it is not included here. The MLE is obtained by solving equation (16) and then checking whether it is at least a local maxima (because equation (16) is also satisfied by local minima as well as saddle points of the log-likelihood function ℓ(θ|Y)) by looking at the sign of ∂²ℓ(θ|Y)/∂θ² |_{θ=θ̂}, which needs to be negative for θ̂ to be a maxima. Let θ̂_c be any consistent estimator of θ. Then by the law of large numbers

    (1/n) ∂²ℓ(θ|Y)/∂θ² |_{θ=θ̂_c} = (1/n) Σ_{i=1}^n ∂² log(f(Y_i|θ))/∂θ² |_{θ=θ̂_c} —P→ E_{θ₀}[ ∂² log(f(Y|θ))/∂θ² |_{θ=θ₀} ].

Now by (10) (for n = 1), ∀θ ∈ Θ,

    E_θ[ ∂² log(f(Y|θ))/∂θ² ] = −E_θ[ (∂ log(f(Y|θ))/∂θ)² ] < 0

and hence E_{θ₀}[ ∂² log(f(Y|θ))/∂θ² |_{θ=θ₀} ] < 0, which means that if θ̂_c is a consistent estimator then by the law of large numbers, with probability tending to 1, ∂²ℓ(θ|Y)/∂θ² |_{θ=θ̂_c} < 0. Now let θ̂₁ < θ̂₂ be two distinct MLE's. Then they both must be consistent. Furthermore, since the regularity conditions guarantee that the likelihood function is smooth, there must exist a minima θ̂₃ such that θ̂₁ < θ̂₃ < θ̂₂. Since θ̂₃ is a minima, ∂²ℓ(θ|Y)/∂θ² |_{θ=θ̂₃} > 0. But since θ̂₁ and θ̂₂ are consistent and θ̂₁ < θ̂₃ < θ̂₂, so is θ̂₃, in which case ∂²ℓ(θ|Y)/∂θ² |_{θ=θ̂₃} < 0, which is a contradiction, and thus one cannot have more than one MLE. □
In the foregoing proof of uniqueness, we saw that ∂²ℓ(θ|Y)/∂θ² |_{θ=θ̂} played a crucial role in establishing an estimate to be an MLE. It was also seen that this quantity approximately equals −E_θ[ ∂²ℓ(θ|Y)/∂θ² ] = I_n(θ) with θ = θ₀. The quantity I(θ) = (1/n)I_n(θ) also arose in §2.1.1 in the context of the CRLB, where I(θ) was called the Fisher Information. It is called "information" for the following reasons, stated below without proof:
(i) By (10),

    I(θ) = −E_θ[ ∂²ℓ(θ|Y)/∂θ² ] = E_θ[ (∂ℓ(θ|Y)/∂θ)² ] > 0 ∀θ ∈ Θ.
(ii) I(θ) is additive in the sense that if I⁽¹⁾(θ) and I⁽²⁾(θ) respectively denote the informations for two samples Y₁ and Y₂ of size 1 each, then the information I₂(θ) for the combined sample {Y₁, Y₂} of size 2 equals I⁽¹⁾(θ) + I⁽²⁾(θ), provided of course they are independent.

(iii) If I⁽ᵀ⁾(θ) denotes the information based on a statistic T(Y), then I(θ) ≥ I⁽ᵀ⁾(θ), with equality iff T is sufficient. This shows that one always loses some information by summarising raw data, unless it is summarised using sufficient statistics, in which case no information is lost relative to the original raw data.
(iv) A variety of distance measures between the two distributions corresponding to two different values of θ have an approximately increasing relationship with I(θ). What this means is that I(θ) can be viewed as a measure of the sensitivity of the population distribution to changes in θ.
Apart from the above mathematical facts, the so-called likelihood principle says that whatever a data set Y has to say about the unknown parameter θ is packed in the likelihood or log-likelihood function ℓ(θ|Y). Now, for the same parameter, of two different data sets the one with a sharper (more peaked) ℓ(θ|Y) is more informative about θ. That is, the flatter the likelihood, the less the information about θ. I_n(θ) (= nI(θ)), the negative of the expected value of the second derivative of the log-likelihood, simply measures how peaked one can expect this log-likelihood function to be. The negative sign simply ensures that the quantity is positive. The larger this value, the more peaked the likelihood is expected to be, and thus the more the information about θ; hence the name "information".
Fisher Information, being an expectation, is a population quantity, and thus depicts the kind of informative behaviour to expect from a likelihood in general, for an arbitrary sample from this population. But for a given set of data the peakedness of the likelihood can be measured exactly, without resorting to its expected value. This consideration gives rise to the sample version of Fisher Information, called the Observed Information, which is given by Î_n(θ) = −∂²ℓ(θ|Y)/∂θ². Note that by the law of large numbers, ∀θ ∈ Θ, (1/n)Î_n(θ) —P→ I(θ), which allows us to consistently estimate the Fisher Information by the Observed Information.
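When analytic second derivatives are tedious, the observed information can also be obtained numerically. The sketch below is our own illustration (the Poisson sampler and all names are hypothetical): a central finite difference of a Poisson log-likelihood at the MLE λ̂ = ȳ is compared with the exact observed information Σy_i/λ̂².

```python
import math
import random

# Observed information for a Poisson(lambda) sample, computed two ways:
# exactly, and by a central finite difference of the log-likelihood.
random.seed(5)
lam_true, n = 4.0, 300

def poisson_draw(lam):
    # product-of-uniforms method (adequate for moderate lam)
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= limit:
            return k
        k += 1

ys = [poisson_draw(lam_true) for _ in range(n)]
s = sum(ys)
lam_hat = s / n

def loglik(lam):
    # additive log(y_i!) terms dropped: they do not affect derivatives in lam
    return s * math.log(lam) - n * lam

h = 1e-4
obs_info_fd = -(loglik(lam_hat + h) - 2 * loglik(lam_hat)
                + loglik(lam_hat - h)) / h**2
obs_info_exact = s / lam_hat**2          # = n / lam_hat at the MLE

assert abs(obs_info_fd - obs_info_exact) / obs_info_exact < 1e-3
```

For the Poisson the observed and expected information coincide at the MLE, which makes it a convenient test case for the finite-difference recipe.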
In the multi-parameter or vector-valued θ case, instead of a scalar Fisher/observed information we have a Fisher Information Matrix I(θ) and its sample analogue, the Observed Information Matrix Î(θ), which are as follows. Let θ = (θ₁, . . . , θ_k) have k components. Then I(θ) is the k × k matrix whose (j, l)-th element is

    [I(θ)]_{jl} = −E_θ[ ∂²ℓ(θ|Y)/∂θ_j ∂θ_l ],   j, l = 1, 2, . . . , k,

and likewise Î(θ) is the k × k matrix whose (j, l)-th element is

    [Î(θ)]_{jl} = −(1/n) Σ_{i=1}^n ∂²ℓ(θ|Y_i)/∂θ_j ∂θ_l,   j, l = 1, 2, . . . , k.
As far as the optimality of the MLE is concerned, so far we have only shown that it is consistent. Though often the MLE is the starting point of the search for an UMVUE, its relationship with an UMVUE still remains to be examined. Actually the connection between the two in the single parameter case is best studied through Theorem 5, establishing the CRLB. Thus suppose there exists an unbiased estimator T(Y) for a parametric function φ(θ) which attains the CRLB. Then by (15) and (16), θ̂, the MLE of θ, must satisfy T(Y) = φ(θ̂), which goes to show that the UMVUE of φ(θ) is a function of the MLE. In case φ(θ) ≡ θ and T(Y) is unbiased for θ attaining the CRLB, the above argument also shows that T(Y) must equal the MLE θ̂.
The connection between the UMVUE and the MLE in the multi-parameter case is easiest to appreciate for the exponential family of distributions (introduced in Example 6), which covers most of the probability models applied in practice. Again, as usual, we shall show the mathematics for the continuous case, which carries through exactly the same way in the discrete case by replacing the p.d.f. by the p.m.f. and the integrals by summations. By (7), since f(y|η) is a p.d.f.,

    ∫_ℝ exp{ Σ_{l=1}^k η_l T_l(y) + B(y) } dy = e^{−A(η)}.

Differentiating the above w.r.t. η_j, interchanging the differentiation and integral signs, denoting ∂A(η)/∂η_j by A_j(η), and then multiplying both sides by e^{A(η)}, we get ∀j = 1, 2, . . . , k,

    ∫_ℝ T_j(y) exp{ Σ_{l=1}^k η_l T_l(y) + B(y) + A(η) } dy = −A_j(η)

or ∫_ℝ T_j(y) f(y|η) dy = E_η[T_j(Y)] = −A_j(η). Thus T_j(Y) is an unbiased estimator of −A_j(η), and as shown in Example 6, based on a sample of size n, (1/n)Σ_{i=1}^n T_j(Y_i) is the UMVUE for −A_j(η). But by (9), the likelihood equation (16) in this case reduces to

    (1/n) Σ_{i=1}^n T_j(Y_i) = −A_j(η̂) ∀j = 1, 2, . . . , k,

where η̂ is the MLE of η, showing that functions of T, being UMVUE estimates of their expectations (vide Example 6), are necessarily functions of the MLE η̂.
We shall close our discussion of MLE after proving an important theorem. This theorem, besides enabling one to approximate the sampling distribution of the MLE in large samples, also helps one to see the connection between the MLE and the CRLB.
Theorem 7: For "regular" models, for large n, the sampling distribution of the MLE θ̂ may be approximated by a k-variate Normal distribution with mean θ₀ and variance-covariance matrix I_n^{−1}(θ₀), where θ₀ is the true unknown value of θ.
Proof: We shall give the outline of the proof only for k = 1. For general k, though the
proof is similar, is a little more technical in nature and hence omitted. By Taylor’s theorem,
expanding the l.h.s. of the likelihood equation (16) about θ0 , the true unknown value of θ,
we get,
∂`(θ|Y ) ∂`(θ|Y ) ∂ 2 `(θ|Y )
= + (θ̂ − θ0 )
∂θ θ=θ̂
∂θ θ=θ0
∂θ2 θ=θ∗

where θ∗ lies in between θ̂ and θ0 . But by (16) since the l.h.s. of the above equation is 0,

∂ 2 `(θ|Y )
,( )
∂`(θ|Y )
(θ̂ − θ0 ) = − .
∂θ θ=θ0
∂θ2 θ=θ∗
q
Multiplying both sides of above by nI(θ0 ), and multiplying the numerator and denominator
of the r.h.s. by n1 this equation may be rewritten as
( ),
1 ∂`(θ|Y )
q
q n ∂θ
I(θ0 )/n
θ=θ0
(θ̂ − θ0 ) nI(θ0 ) =   . (18)
1 ∂ 2 `(θ|Y )
n ∂θ2
{−I(θ0 )}
θ=θ∗

Now by law of large numbers, definition of I(θ0 ), and because θ∗ is sandwitched between θ0
P
and θ̂, with θ̂ −→ θ0 ,
n
1 ∂ 2 `(θ|Y ) 1X ∂ 2 log f (Yi |θ) P
= −→ −I(θ0 ).
n ∂θ2 θ=θ∗
n i=1 ∂θ2 θ=θ∗

Thus the denominator of (18) converges in probability to 1. For handling the numerator
note that
n
1 ∂`(θ|Y ) 1X
= Ui
n ∂θ θ=θ0
n i=1
∂ log f (Yi |θ)
where Ui = ∂θ
. By (9), Eθ0 [Ui ] = 0 and by (9) and (10), Vθ0 [Ui ] = I(θ0 ) = σ 2
θ=θ0
Ui by U , the numerator of (18) may be written as √U −0
Pn
(say). Thus by denoting n1 i=1 2 σ /n
whose distribution would approach that of a standard Normalqdistribution as n → ∞ by
the Central Limit Theorem. Thus the distribution of (θ̂ − θ0 ) nI(θ0 ), the l.h.s. of (18),

23
must also approach that of a standard Normal distribution as $n \to \infty$, implying that $\hat{\theta}$ is asymptotically Normal with mean $\theta_0$ and variance $1/\{nI(\theta_0)\}$. □
The above theorem shows that the asymptotic variance of the MLE is the same as the CRLB. In this sense the MLE may be called an “efficient” estimator. However a general discussion of efficiency is beyond the scope of these notes, and thus we shall wrap up our discussion about MLE by making a couple of final remarks.
Note that optimality properties of the MLE like consistency and asymptotic Normality with the CRLB as the asymptotic variance are all large sample properties. That is, they are true only if the sample size $n \to \infty$. Thus it must be borne in mind that one can expect the MLE to behave reasonably well only when the sample size is large, unless other results show it to be optimal for small samples.
The second point is regarding numerical computation. Since more often than not the likelihood equation (16) does not admit a closed form analytical solution, one has to employ numerical methods. In such situations one typically employs a standard numerical method like Newton-Raphson to directly solve (16), or utilises the method of steepest ascent for direct maximisation of $\ell(\theta|Y)$. Newton-Raphson requires the second derivative matrix of $\ell(\theta|Y)$. When one uses the expression of the expected values of these second derivatives in the Newton-Raphson algorithm, the method is called Rao's method of scoring, which typically converges faster than the usual Newton-Raphson.
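As an illustrative sketch (not an example from these notes), consider i.i.d. Cauchy$(\theta, 1)$ data, whose likelihood equation has no closed-form solution. The per-observation Fisher information is $I(\theta) = 1/2$, so Rao's method of scoring replaces the observed second derivative in Newton-Raphson by its expected value $-n/2$; the sample size, seed, and true location below are arbitrary:

```python
import math
import random
import statistics

def score(theta, y):
    # First derivative of the Cauchy(theta, 1) log-likelihood
    return sum(2 * (yi - theta) / (1 + (yi - theta) ** 2) for yi in y)

def mle_scoring(y, tol=1e-10, max_iter=200):
    """Rao's method of scoring: Newton-Raphson with the observed second
    derivative replaced by its expectation, -n * I(theta) = -n / 2."""
    theta = statistics.median(y)               # robust starting value
    for _ in range(max_iter):
        step = score(theta, y) / (len(y) / 2.0)
        theta += step
        if abs(step) < tol:
            break
    return theta

random.seed(1)
# Simulated Cauchy(2, 1) sample via the inverse c.d.f.
y = [2.0 + math.tan(math.pi * (random.random() - 0.5)) for _ in range(500)]
theta_hat = mle_scoring(y)
```

Because the expected information is a constant here, each scoring step costs only one evaluation of the score function, whereas plain Newton-Raphson would also recompute the second derivative at every iterate.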
$1/\sqrt{nI(\theta_0)}$ is called the asymptotic standard error of the MLE. Since one does not know $\theta_0$, the true unknown value of $\theta$, one may replace $\theta_0$ by the MLE $\hat{\theta}$ in the expression for $I(\theta)$, or alternatively report $1/\sqrt{I_n(\hat{\theta})}$ as the estimated asymptotic standard error of the MLE, which
also is a consistent estimator. As shall be shortly seen in §2.2, the section on interval esti-
mation, standard error is typically best interpreted in its intuitive sense when the sampling
distribution of the estimator is Normal. In case of MLE, by virtue of Theorem 7 since it is
so, this asymptotic standard error typically is a fairly interpretable quantity whose use and
interpretation will be further clarified in §2.2.
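For instance (an illustrative sketch, not from these notes), for i.i.d. Exponential$(\lambda)$ data the MLE is $\hat{\lambda} = 1/\overline{Y}$ and $I(\lambda) = 1/\lambda^2$, so the estimated asymptotic standard error $1/\sqrt{nI(\hat{\lambda})}$ reduces to $\hat{\lambda}/\sqrt{n}$; the seed, rate, and sample size are arbitrary:

```python
import math
import random
import statistics

random.seed(7)
lam0 = 3.0
y = [random.expovariate(lam0) for _ in range(400)]

lam_hat = 1.0 / statistics.fmean(y)            # MLE of the rate
# I(lam) = 1 / lam**2, so 1 / sqrt(n * I(lam_hat)) = lam_hat / sqrt(n)
se_hat = lam_hat / math.sqrt(len(y))

print(round(lam_hat, 3), round(se_hat, 3))
```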

2.1.3 Other Methods


For a given estimation problem, though we first strive to procure a UMVUE, with the search possibly guided by first obtaining an analytical expression for the MLE, because of computational difficulties one sometimes resorts to other alternative methods. In this subsection we shall discuss a few such other methods.

Method of Moments (MM)


In this method, given an i.i.d. sample $Y_1, Y_2, \ldots, Y_n$ from the population, one equates the sample moment(s) with the theoretical population moment(s), which are typically some functions of the original population parameters of interest $\theta$. If $\theta$ has $k$ components one typically considers the first $k$ raw or central moments, whichever is convenient. Let $\mu_r$ denote the $r$-th raw moment in the population and $\overline{Y^r}$ denote the corresponding $r$-th raw moment in

the sample. Then
$$\mu_r = \int_{\Re} y^r f(y|\theta)\,dy = g_r(\theta) \ \text{(say)} \qquad \text{and} \qquad \overline{Y^r} = \frac{1}{n}\sum_{i=1}^n Y_i^r.$$

Now MM requires one to form the system of $k$ equations
$$g_r(\theta) = \overline{Y^r} \quad \text{for } r = 1, 2, \ldots, k,$$
which are then solved for $\theta$ to obtain the MM estimator $\hat{\theta}_M$ of $\theta$.


Example 11: Let $Y_1, Y_2, \ldots, Y_n$ be i.i.d. Gamma$(\alpha, \lambda)$ with population p.d.f. $\frac{\lambda^{\alpha}}{\Gamma(\alpha)} y^{\alpha-1} e^{-\lambda y}$. Now as can be immediately seen, the method of ML would require messing around with the di-gamma function (the derivative of the $\log \Gamma(\cdot)$ function), which is not exactly a routine numerical task. But by appealing to the method of moments we immediately observe that the population mean or the first raw moment equals $\alpha/\lambda$ and the population variance or the second central moment equals $\alpha/\lambda^2$. Equating these two to their respective (unbiased) sample counterparts $\overline{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ and $s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \overline{Y})^2$, and solving for $(\alpha, \lambda)$, we immediately obtain the MM estimates of $(\alpha, \lambda)$ as $\hat{\alpha}_M = \overline{Y}^2/s_{n-1}^2$ and $\hat{\lambda}_M = \overline{Y}/s_{n-1}^2$. □
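The computation in Example 11 can be sketched as follows (illustrative code with simulated data, not part of the notes); note that Python's `random.gammavariate` takes the shape $\alpha$ and the scale $1/\lambda$, and the true parameter values and seed below are arbitrary:

```python
import random
import statistics

random.seed(0)
alpha0, lam0 = 2.5, 4.0
# random.gammavariate takes shape alpha and SCALE, i.e. 1 / lambda
y = [random.gammavariate(alpha0, 1.0 / lam0) for _ in range(5000)]

ybar = statistics.fmean(y)
s2 = statistics.variance(y)      # unbiased sample variance (divides by n - 1)

# Solve alpha / lam = ybar and alpha / lam**2 = s2 for (alpha, lam)
alpha_mm = ybar ** 2 / s2
lam_mm = ybar / s2
```

With a few thousand observations the MM estimates land close to the true $(\alpha, \lambda)$, consistent with the consistency result discussed next.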
Now let us address the issue of optimality of MM. By the law of large numbers, $\overline{Y^r} \stackrel{P}{\longrightarrow} \mu_r$. Now if the $g_r(\cdot)$ functions are such that $(g_1(\theta), \ldots, g_k(\theta)) \longleftrightarrow (\theta_1, \ldots, \theta_k)$ is one-to-one with a continuous inverse (as in the gamma example above), then $\hat{\theta}_M$ is a consistent estimator of $\theta$. Furthermore the $\overline{Y^r}$'s are unbiased estimators of the $\mu_r$'s. (But this does not imply that $\hat{\theta}_M$ is unbiased for $\theta$, unless of course the $g_r(\cdot)$'s are linear, which is extremely rare.) Other than these there is no other compelling reason to use MM. In fact MM estimators in general are less efficient than the MLE, where the notion of efficiency was very briefly touched upon in §2.1.2.
