Maximum Likelihood Estimation
Point Estimation
Recall that a probability space (sometimes called a probability model) is a triple (Ω, B, P), where Ω is a sample space, B is a σ-algebra of events (subsets of Ω) and P is a probability function (defined on B). In statistics we need to be able to study several probability functions simultaneously.
Definition. A statistical experiment (sometimes called a statistical model) is a triple (Ω, B, 𝒫), where Ω is a sample space, B is a σ-algebra of events and 𝒫 is a collection of probability functions defined on B; that is, 𝒫 is a collection of probability functions such that (Ω, B, P) is a probability space for each P ∈ 𝒫.
Definition. A statistical model (Ω, B, 𝒫) is parametric if 𝒫 is of the form {Pθ : θ ∈ Θ}, Θ ⊆ R^k, where θ is a parameter, which takes on values in the parameter space Θ, a subset of R^k.
For concreteness, we will only consider parametric statistical models. Moreover, we will assume that the elementary outcomes are vectors of real numbers and that these outcomes are realizations of a collection of i.i.d. random variables. That is, each outcome is of the form
$$ x = (x_1, \ldots, x_n)', $$
where x1, …, xn are realizations of i.i.d. random variables X1, …, Xn whose common distribution has cdf F(·|θ).
Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ [0, 1] is an unknown parameter. In this case, θ = p, Θ = [0, 1] ⊆ R and
$$ F(x|p) = \begin{cases} 0 & \text{for } x < 0 \\ 1-p & \text{for } 0 \le x < 1 \\ 1 & \text{for } x \ge 1. \end{cases} $$
Example. Suppose Xi ∼ i.i.d. U[0, θ], where θ > 0 is an unknown parameter. In this case, Θ = R++ ⊆ R and
$$ F(x|\theta) = \begin{cases} 0 & \text{for } x < 0 \\ x/\theta & \text{for } 0 \le x < \theta \\ 1 & \text{for } x \ge \theta. \end{cases} $$
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In this case, θ = (µ, σ²)′, Θ = R × R++ ⊆ R² and
$$ F(x|\mu,\sigma^2) = \int_{-\infty}^{x} \phi(t|\mu,\sigma^2)\,dt, \quad x \in \mathbb{R}, $$
where
$$ \phi(t|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(t-\mu)^2\right), \quad t \in \mathbb{R}. $$
Definition. Let X1, …, Xn be a random sample from a distribution with cdf F(·|θ), where θ is an unknown parameter. A point estimator is any statistic W(X1, …, Xn).
At this level of generality, a point estimator is just a random variable. A realized value of a (point)
estimator is called a (point) estimate. The quantity we are trying to estimate (typically θ) is called the
estimand. It is not required that the range of the estimator coincides with the range of the estimand;
that is, W (X1 , . . . , Xn ) ∈ Θ is not required. On the other hand, an estimator W (X1 , . . . , Xn ) of θ is
a good estimator (only) if it is “close” to θ in some probabilistic sense and this will typically require
W (X1 , . . . , Xn ) ∈ Θ.
Casella and Berger (Sections 7.2.1-7.2.3) discuss three methods that can be used to generate estimators
under quite general circumstances. We will cover two of these methods, the method of moments and
the maximum likelihood procedure. Method of moments estimators are obtained by solving a system of
equations, while maximum likelihood estimators are constructed by solving a maximization problem.
Suppose X1, …, Xn is a random sample from a distribution with cdf F(·|θ), where θ ∈ Θ is an unknown scalar parameter. Let µ : Θ → R be the function defined by
$$ \mu(\theta) = \int_{-\infty}^{\infty} x\, dF(x|\theta), \quad \theta \in \Theta. $$
As defined, µ(θ) is the expected value of a random variable with cdf F(·|θ). The true parameter value θ solves the equation
$$ E(X) = \mu(\theta), $$
where X is a random variable with cdf F(·|θ). The method of moments estimator θ̂ of θ is obtained by solving the sample analogue of this equation,
$$ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \mu(\hat{\theta}). $$
To the extent that X̄ is a good estimator of E (X) (it turns out that it often is), one would expect θ̂ to be
a good estimator of θ.
Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ [0, 1] is an unknown parameter. In this case, θ = p, Θ = [0, 1] and
$$ \mu(p) = p, \quad 0 \le p \le 1. $$
The method of moments estimator p̂ is found by solving the equation µ(p̂) = X̄:
$$ \hat{p} = \bar{X}. $$
Example. Suppose Xi ∼ i.i.d. U[0, θ], where θ > 0 is an unknown parameter. In this case,
$$ \mu(\theta) = \frac{\theta}{2}, \quad \theta > 0. $$
The method of moments estimator θ̂ is found by solving the equation µ(θ̂) = X̄:
$$ \mu(\hat{\theta}) = \frac{\hat{\theta}}{2} = \bar{X} \iff \hat{\theta} = 2\bar{X}. $$
As it turns out, this method of moments estimator is not a terribly good estimator. Notice that even
though θ is unknown, we do know that Xi > θ is impossible when Xi ∼ U [0, θ] . It is possible to have
Xi > θ̂ for some i (e.g. if n = 3 and X1 = X2 = 1 and X3 = 7), so it seems plausible that a better
estimator can be constructed.
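As a quick numerical illustration (a minimal simulation sketch that is not part of the original notes; the true θ, sample size, and seed below are arbitrary choices), one can check that 2X̄ may fall below the largest observation, even though no observation can exceed the true θ:

```python
import numpy as np

# Minimal sketch: method of moments for U[0, theta] (assumed theta = 5, n = 10).
rng = np.random.default_rng(0)
theta = 5.0
x = rng.uniform(0.0, theta, size=10)

theta_mom = 2 * x.mean()    # method of moments estimate 2 * X-bar
print(theta_mom, x.max())   # theta_mom can be smaller than max(x_i),
                            # even though X_i > theta is impossible
```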
In the case of a scalar parameter θ, the method of moments estimator is constructed by solving one (moment) equation in one unknown parameter. When θ is a k-dimensional parameter vector, θ = (θ1, …, θk)′, the method of moments estimator of θ is constructed by solving k equations in the k unknown parameters θ1, …, θk.
Definition. Let X1, …, Xn be a random sample from a distribution with cdf F(·|θ), where θ ∈ Θ ⊆ R^k is a vector of unknown parameters. For any j = 1, …, k, let µj : Θ → R be defined by
$$ \mu_j(\theta) = \int_{-\infty}^{\infty} x^j\, dF(x|\theta), \quad \theta \in \Theta. $$
The method of moments estimator θ̂ of θ solves the system of equations
$$ \frac{1}{n}\sum_{i=1}^{n} X_i^j = \mu_j(\hat{\theta}), \quad j = 1, \ldots, k. $$
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In this case, θ = (µ, σ²)′, Θ = R × R++ ⊆ R² and the functions µ1(·) and µ2(·) are given by
$$ \mu_1(\mu,\sigma^2) = \mu, \qquad \mu_2(\mu,\sigma^2) = \sigma^2 + \mu^2. $$
Any solution (µ̂, σ̂²)′ to the equation
$$ \frac{1}{n}\sum_{i=1}^{n} X_i = \mu_1(\hat{\mu},\hat{\sigma}^2) = \hat{\mu} $$
satisfies
$$ \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}, $$
and the second equation, (1/n) Σ Xi² = µ2(µ̂, σ̂²) = σ̂² + µ̂², then gives σ̂² = (1/n) Σ Xi² − X̄².
In all of the examples considered so far, there is a unique solution θ̂ ∈ Θ to the system
$$ \frac{1}{n}\sum_{i=1}^{n} X_i^j = \int_{-\infty}^{\infty} x^j\, dF(x|\hat{\theta}), \quad j = 1, \ldots, k, $$
of estimating equations. It is not difficult to construct examples where the method of moments breaks down. In such cases, some variant of the method of moments may work.
Example. Suppose Xi ∼ i.i.d. N(0, σ²), where σ² > 0 is an unknown parameter. In this case, θ = σ², Θ = R++ ⊆ R and the function µ1(·) is given by
$$ \mu_1(\sigma^2) = 0. $$
The equation
$$ \bar{X} = \mu_1(\hat{\sigma}^2) = 0 $$
does not involve σ̂² at all, so the first moment cannot be used to estimate σ². The second moment can: solving (1/n) Σ Xi² = µ2(σ̂²) = σ̂² gives σ̂² = (1/n) Σ Xi².
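A short numerical check of this point (a sketch that is not in the original notes; the true σ², the sample size, and the seed are arbitrary assumptions):

```python
import numpy as np

# Minimal sketch: for N(0, sigma^2) the first sample moment is uninformative,
# but the second sample moment estimates sigma^2 directly (sigma^2 = 4, n = 200 assumed).
rng = np.random.default_rng(1)
sigma2 = 4.0
x = rng.normal(0.0, np.sqrt(sigma2), size=200)

print(x.mean())        # close to 0 regardless of sigma^2
print((x**2).mean())   # variant method of moments estimate of sigma^2
```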
Generalizing this example, let X1, …, Xn be a random sample from a distribution with cdf F(·|θ), where θ ∈ Θ ⊆ R^k is a vector of unknown parameters. Even if we cannot find a unique solution θ̂ to the estimating equations
$$ \frac{1}{n}\sum_{i=1}^{n} X_i^j = \int_{-\infty}^{\infty} x^j\, dF(x|\hat{\theta}), \quad j = 1, \ldots, k, $$
it may be possible to choose functions g1, …, gk of the data such that the alternative system
$$ \frac{1}{n}\sum_{i=1}^{n} g_j(X_i) = \int_{-\infty}^{\infty} g_j(x)\, dF(x|\hat{\theta}), \quad j = 1, \ldots, k, $$
has a unique solution θ̂ ∈ Θ. Estimators θ̂ constructed in this way are also called method of moments estimators.
Definition. Let X = (X1, …, Xn)′ be a discrete (continuous) n-dimensional random vector with joint pmf (pdf) fX(·|θ) : R^n → R+, where θ ∈ Θ is an unknown parameter vector. For any x = (x1, …, xn)′, the likelihood function given x is the function L(·|x) : Θ → R+ given by
$$ L(\theta|x) = f_X(x|\theta), \quad \theta \in \Theta. $$
The log likelihood function given x is the function l(·|x) : Θ → [−∞, ∞) given by
$$ l(\theta|x) = \log L(\theta|x), \quad \theta \in \Theta. $$
When X1, …, Xn is a random sample from a discrete (continuous) distribution with pmf (pdf) f(·|θ), the likelihood function given x = (x1, …, xn)′ is
$$ L(\theta|x) = \prod_{i=1}^{n} f(x_i|\theta), \quad \theta \in \Theta, $$
and the log likelihood function is
$$ l(\theta|x) = \sum_{i=1}^{n} \log f(x_i|\theta), \quad \theta \in \Theta. $$
Definition. Let X1, …, Xn be a random sample from a discrete (continuous) distribution with pmf (pdf) f(·|θ), where θ ∈ Θ is an unknown parameter vector. When X = (X1, …, Xn)′ = x, a maximum likelihood estimate θ̂(x) of θ satisfies
$$ L(\hat{\theta}(x)|x) = \max_{\theta \in \Theta} L(\theta|x), $$
where L(·|x) is the likelihood function given x. The estimator θ̂(X) is a maximum likelihood estimator (MLE) of θ.
Maximum likelihood estimators often enjoy favorable large sample properties. That result is related to
the following fact, which in itself can be used to motivate the maximum likelihood estimator.
Theorem (Information Inequality; Ruud, Lemma D.2). Let X be a discrete (continuous) random variable with pmf (pdf) f0 and let f1 be any other pmf (pdf). Then
$$ E\left(\log f_1(X)\right) \le E\left(\log f_0(X)\right), $$
where the expectations are computed using f0.
Proof. Let Y = f1(X)/f0(X) and let 𝒳 = {x : f0(x) > 0}. Because log(·) is concave, Jensen's inequality gives
$$ E(\log(Y)) \le \log(E(Y)). $$
Now,
$$ E(Y) = \sum_{x \in \mathcal{X}} \frac{f_1(x)}{f_0(x)} \cdot f_0(x) = \sum_{x \in \mathcal{X}} f_1(x) \le \sum_{x \in \mathbb{R}} f_1(x) = 1 $$
if X is discrete, while
$$ E(Y) = \int_{\mathcal{X}} \frac{f_1(x)}{f_0(x)} \cdot f_0(x)\, dx = \int_{\mathcal{X}} f_1(x)\, dx \le \int_{-\infty}^{\infty} f_1(x)\, dx = 1 $$
if X is continuous. Therefore E(log f1(X)) − E(log f0(X)) = E(log(Y)) ≤ log(E(Y)) ≤ log 1 = 0, as was to be shown. ∎
Let X1, …, Xn be a random sample from a discrete (continuous) distribution with pmf (pdf) f(·|θ), where θ ∈ Θ is unknown. It follows from the information inequality that
$$ E_\theta\left(\log f(X|\theta^*)\right) \le E_\theta\left(\log f(X|\theta)\right) $$
for any θ* ∈ Θ, where Eθ(·) denotes the expected value computed using the true (unknown) cdf F(·|θ) of the random variable X. As a consequence, the true parameter value θ solves the problem of maximizing
$$ E_\theta\left(\log f(X|\theta^*)\right) $$
with respect to θ* ∈ Θ.
The sample analogue of this problem is that of maximizing the average log likelihood
$$ \frac{1}{n}\sum_{i=1}^{n} \log f(X_i|\theta^*) $$
with respect to θ* ∈ Θ. The average log likelihood is a strictly increasing function of L(θ*|X1, …, Xn). Specifically,
$$ \frac{1}{n}\sum_{i=1}^{n} \log f(X_i|\theta^*) = \frac{1}{n}\log L(\theta^*|X_1, \ldots, X_n). $$
Therefore, a maximum likelihood estimator θ̂(X1, …, Xn) maximizes the average log likelihood with respect to θ*:
$$ \frac{1}{n}\sum_{i=1}^{n} \log f\left(X_i|\hat{\theta}(X_1, \ldots, X_n)\right) = \max_{\theta^* \in \Theta} \frac{1}{n}\sum_{i=1}^{n} \log f(X_i|\theta^*). $$
Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ [0, 1] is an unknown parameter. Each Xi is discrete with pmf
$$ f(x|p) = \begin{cases} 1-p & \text{for } x = 0 \\ p & \text{for } x = 1 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} p^x (1-p)^{1-x} & \text{for } x \in \{0,1\} \\ 0 & \text{otherwise,} \end{cases} $$
where 0⁰ = 1 and 1(·) is the indicator function. It suffices to consider the case where xi ∈ {0, 1} for i = 1, …, n, as the likelihood is zero for all other values of x = (x1, …, xn)′.
The likelihood given x is
$$ L(p|x) = \prod_{i=1}^{n} f(x_i|p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^{n} x_i}\,(1-p)^{n - \sum_{i=1}^{n} x_i}, \quad p \in [0,1], $$
and the log likelihood is
$$ l(p|x) = \sum_{i=1}^{n} \log f(x_i|p) = \sum_{i=1}^{n}\left(x_i \log p + (1-x_i)\log(1-p)\right) = \left(\sum_{i=1}^{n} x_i\right)\log p + \left(n - \sum_{i=1}^{n} x_i\right)\log(1-p), \quad p \in [0,1], $$
where 0 · log 0 = 0.
If Σ_{i=1}^n xi = 0, then
$$ L(p|x) = (1-p)^n $$
is a decreasing function of p and p = 0 maximizes L(p|x) with respect to p ∈ [0, 1]. If Σ_{i=1}^n xi = n, then
$$ L(p|x) = p^n $$
is an increasing function of p and p = 1 maximizes L(p|x) with respect to p ∈ [0, 1]. If 0 < Σ_{i=1}^n xi < n, the maximizer lies in the interior of [0, 1] and satisfies the first-order condition
$$ \left.\frac{d}{dp}\, l(p|x)\right|_{p=\hat{p}} = \left(\sum_{i=1}^{n} x_i\right)\frac{1}{\hat{p}} - \left(n - \sum_{i=1}^{n} x_i\right)\frac{1}{1-\hat{p}} = 0 \iff \hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}. $$
In every case the maximizer is Σ_{i=1}^n xi / n, so
$$ \hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X} $$
is the maximum likelihood estimator of p. In this case, the maximum likelihood estimator coincides with the method of moments estimator.
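As a numerical cross-check (a minimal sketch outside the original notes; the true p, the sample size, the seed, and the grid are arbitrary assumptions), maximizing the log likelihood over a grid reproduces p̂ = X̄:

```python
import numpy as np

# Minimal sketch: Bernoulli MLE.  Assumed true p = 0.3 and n = 50.
rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=50)

def loglik(p):
    # log L(p|x), valid for p strictly between 0 and 1
    return x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)
p_grid = grid[np.argmax([loglik(p) for p in grid])]
print(p_grid, x.mean())   # grid maximizer agrees with the closed form X-bar
```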
Exercise. Verify that p* = p maximizes
$$ E_p\left(\log f(X_i|p^*)\right) = p \cdot \log p^* + (1-p)\cdot \log(1-p^*) $$
with respect to p* ∈ [0, 1].
When a unique maximum likelihood estimator θ̂ of θ = (θ1, …, θk)′ exists, it can usually be constructed by solving the likelihood equations
$$ \left.\frac{\partial}{\partial \theta_j}\, l(\theta|X_1, \ldots, X_n)\right|_{\theta=\hat{\theta}} = 0, \quad j = 1, \ldots, k, $$
and verifying that a second-order condition holds. For instance, if θ is a scalar parameter, a unique solution θ̂ to
$$ \left.\frac{d}{d\theta}\, l(\theta|x_1, \ldots, x_n)\right|_{\theta=\hat{\theta}} = 0 $$
is the maximum likelihood estimate if the second derivative of l(·|x1, …, xn) is negative at θ̂ and the likelihood does not attain a larger value on the boundary of Θ.
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. The marginal pdf of Xi is f(·|µ, σ²), where
$$ f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right). $$
The likelihood given x is
$$ L(\mu,\sigma^2|x) = \prod_{i=1}^{n} f(x_i|\mu,\sigma^2) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(x_i-\mu)^2\right) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right), $$
and the log likelihood is
$$ l(\mu,\sigma^2|x) = \sum_{i=1}^{n}\log f(x_i|\mu,\sigma^2) = \sum_{i=1}^{n}\left(-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(x_i-\mu)^2\right) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2. $$
The likelihood equations are
$$ \left.\frac{\partial}{\partial\mu}\, l(\mu,\sigma^2|X_1,\ldots,X_n)\right|_{\theta=\hat{\theta}} = \frac{1}{\hat{\sigma}^2}\sum_{i=1}^{n}(X_i - \hat{\mu}) = 0 $$
and
$$ \left.\frac{\partial}{\partial\sigma^2}\, l(\mu,\sigma^2|X_1,\ldots,X_n)\right|_{\theta=\hat{\theta}} = -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\sum_{i=1}^{n}(X_i-\hat{\mu})^2 = 0, $$
where θ = (µ, σ²)′ and θ̂ = (µ̂, σ̂²)′. The unique solution to these equations is
$$ \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\hat{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2. $$
The matrix
$$ \left.\frac{\partial^2}{\partial\theta\,\partial\theta'}\, l(\mu,\sigma^2|x)\right|_{\theta=\hat{\theta}} = \left.\begin{pmatrix} \frac{\partial^2}{\partial\mu\,\partial\mu} l(\mu,\sigma^2|x) & \frac{\partial^2}{\partial\mu\,\partial\sigma^2} l(\mu,\sigma^2|x) \\ \frac{\partial^2}{\partial\sigma^2\,\partial\mu} l(\mu,\sigma^2|x) & \frac{\partial^2}{\partial\sigma^2\,\partial\sigma^2} l(\mu,\sigma^2|x) \end{pmatrix}\right|_{\theta=\hat{\theta}} = \begin{pmatrix} -\frac{n}{\hat{\sigma}^2} & 0 \\ 0 & -\frac{n}{2\hat{\sigma}^4} \end{pmatrix} $$
is negative definite, so (µ̂, σ̂²)′ is a local maximizer of l(µ, σ²|x). In fact,
$$ \lim_{|\mu|\to\infty} L(\mu,\sigma^2|x) = 0 $$
for any σ² > 0, and
$$ \lim_{\sigma^2 \to 0^+} L(\mu,\sigma^2|x) = \lim_{\sigma^2 \to \infty} L(\mu,\sigma^2|x) = 0 $$
for any µ ∈ R, so (µ̂, σ̂²)′ is the maximum likelihood estimator of (µ, σ²)′. Once again, the maximum likelihood estimator coincides with the method of moments estimator.
In the present case, the second-order condition can also be verified using univariate calculus (Casella and Berger, Example 7.2.11).
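A minimal computational sketch (not in the original notes; the true parameters, sample size, and seed are arbitrary assumptions) of the closed-form Gaussian MLE:

```python
import numpy as np

# Minimal sketch: Gaussian MLE in closed form (true mu = 1, sigma^2 = 4, n = 100 assumed).
rng = np.random.default_rng(3)
x = rng.normal(1.0, 2.0, size=100)

mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()   # divides by n, not n - 1
print(mu_hat, sigma2_hat)
print(np.var(x))                          # np.var uses ddof=0, so it matches sigma2_hat
```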
Exercise. Verify that (µ*, σ*²) = (µ, σ²) maximizes
$$ E_{(\mu,\sigma^2)}\left(\log f(X_i|\mu^*,\sigma^{*2})\right) = E_{(\mu,\sigma^2)}\left(-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^{*2}) - \frac{1}{2\sigma^{*2}}(X_i-\mu^*)^2\right) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma^{*2} - \frac{1}{2\sigma^{*2}}\left(\sigma^2 + (\mu-\mu^*)^2\right) $$
with respect to (µ*, σ*²) ∈ R × R++.
One case where the maximum likelihood estimator cannot be constructed by solving the likelihood equations is the following.
Example. Suppose Xi ∼ i.i.d. U[0, θ], where θ > 0 is an unknown parameter. Each Xi is continuous with pdf
$$ f(x|\theta) = \begin{cases} 1/\theta & \text{for } 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases} = \frac{1}{\theta}\, 1(0 \le x \le \theta). $$
It suffices to consider the case where xi ≥ 0 for i = 1, …, n, as the likelihood is zero for all other values of x = (x1, …, xn)′.
The likelihood given x is
$$ L(\theta|x) = \prod_{i=1}^{n} f(x_i|\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\,1(0 \le x_i \le \theta) = \frac{1}{\theta^n}\prod_{i=1}^{n} 1(0 \le x_i \le \theta) = \frac{1}{\theta^n}\, 1\!\left(\max_{1\le i\le n} x_i \le \theta\right), \quad \theta > 0, $$
which is zero for θ < max_{1≤i≤n} xi and strictly decreasing in θ for θ ≥ max_{1≤i≤n} xi. The likelihood is therefore maximized at
$$ \hat{\theta} = \max_{1\le i\le n} X_i. $$
In this case, the maximum likelihood estimator is different from the method of moments estimator (the latter is 2·X̄). Unlike the method of moments estimator, the maximum likelihood estimator has the property that θ̂ ≥ Xi for every i. On the other hand, since max_{1≤i≤n} Xi is a lower bound on the true θ, θ̂ will tend to underestimate θ. Indeed, P(θ̂ ≤ θ) = 1.
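A small simulation sketch of these two properties (not part of the original notes; the true θ, sample size, number of replications, and seed are arbitrary assumptions):

```python
import numpy as np

# Minimal sketch: U[0, theta] with theta = 5 assumed; compare the MLE max(X_i)
# with the method of moments estimator 2 * X-bar over repeated samples.
rng = np.random.default_rng(4)
theta, n, reps = 5.0, 20, 10_000
x = rng.uniform(0.0, theta, size=(reps, n))

mle = x.max(axis=1)
mom = 2 * x.mean(axis=1)
print((mle <= theta).mean())             # always 1: the MLE never exceeds theta
print(mle.mean(), theta * n / (n + 1))   # MLE is biased downward, mean ~ n*theta/(n+1)
print(mom.mean())                        # MoM is unbiased on average but can exceed max(x_i)
```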
In the sense of the following definition, the maximum likelihood estimator max_{1≤i≤n} Xi in the preceding example is the nth order statistic and is often denoted by X(n).
Definition. Let X1, …, Xn be a random sample. The order statistics are the sample values placed in ascending order. They are denoted by X(1), …, X(n).
Example. For any random sample X1, …, Xn, X(1) = min_{1≤i≤n} Xi and X(n) = max_{1≤i≤n} Xi.
Remark. The pmf (pdf) of order statistics obtained from a discrete (continuous) distribution can be characterized using combinatorial arguments (Casella and Berger, Section 5.4).
Remark. Maximum likelihood estimators are equivariant in the sense that if θ̂ is a maximum likelihood estimator of θ, then τ(θ̂) is a maximum likelihood estimator of τ(θ) for any function τ(·) defined on Θ (Casella and Berger, Theorem 7.2.10).
Remark. Method of moments and maximum likelihood estimators are special cases of two general classes of estimators. A Z-estimator θ̂ of θ solves a system of estimating equations of the form
$$ \frac{1}{n}\sum_{i=1}^{n} \psi_j(X_i, \hat{\theta}) = 0, \quad j = 1, \ldots, k, $$
where each ψj : R × Θ → R is a function. Z-estimators are usually motivated by showing that θ is the only value of θ* ∈ Θ for which
$$ E_\theta\, \psi_j(X, \theta^*) = 0 \quad \forall j \in \{1, \ldots, k\}. $$
The leading special case is the method of moments estimator, which is a Z-estimator with
$$ \psi_j(X_i, \theta^*) = X_i^j - \int_{-\infty}^{\infty} x^j\, dF(x|\theta^*), \quad j = 1, \ldots, k. $$
An M-estimator θ̂ of θ solves a maximization problem of the form
$$ \frac{1}{n}\sum_{i=1}^{n} m(X_i, \hat{\theta}) = \max_{\theta^* \in \Theta} \frac{1}{n}\sum_{i=1}^{n} m(X_i, \theta^*), $$
where m : R × Θ → R is a function. M-estimators are usually motivated by showing that θ is the unique maximizer (with respect to θ* ∈ Θ) of
$$ E_\theta\, m(X, \theta^*). $$
The leading special case is the maximum likelihood estimator, which is an M-estimator with
$$ m(X_i, \theta^*) = \log f(X_i|\theta^*). $$
In particular, many maximum likelihood estimators can be interpreted as method of moments estimators with
$$ g_j(X_i, \theta^*) = \left.\frac{\partial}{\partial\theta_j}\log f(X_i|\theta)\right|_{\theta=\theta^*}, \quad j = 1, \ldots, k. $$
It will almost always be possible to find a set of functions {ψj : j = 1, …, k} such that θ is the only value of θ* ∈ Θ for which
$$ E_\theta\, \psi_j(X, \theta^*) = 0 \quad \forall j \in \{1, \ldots, k\}, $$
or a function m such that θ is the unique maximizer (with respect to θ* ∈ Θ) of
$$ E_\theta\, m(X, \theta^*). $$
An important exception occurs when the model fails to be identified in the sense of the following definition.
Definition. Let (Ω, B, {Pθ : θ ∈ Θ}) be a parametric statistical model. A parameter value θ1 ∈ Θ is identified if there does not exist another parameter value θ2 ∈ Θ such that Pθ1 = Pθ2. The model (Ω, B, {Pθ : θ ∈ Θ}) is identified if every parameter value θ ∈ Θ is identified.
In other words, a model is identified if knowledge of the true marginal cdf F(·|θ) implies knowledge of the parameter θ. This is a very modest and reasonable requirement. Identification is a property of the model and its parameterization, not of any particular estimation procedure.
Example. Suppose we observe Xi = 1(Yi ≤ 0), where Yi ∼ i.i.d. N(µ, σ²) and µ ∈ R and σ² > 0 are unknown parameters. In this case, Xi ∼ i.i.d. Ber(Φ(−µ/σ)), where Φ(·) is the cdf of the standard normal distribution:
$$ \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}t^2\right) dt. $$
As a consequence, the marginal distribution of each Xi depends on (µ, σ²)′ only through µ/σ and any parameter value (µ1, σ1²)′ is unidentified. We can achieve identification by imposing an identifying assumption on the parameters. In this case, a natural identifying assumption is σ = 1.
Remark. If θ1 ∈ Θ is an unidentified parameter value, there is another parameter value θ2 ∈ Θ such that F(·|θ1) = F(·|θ2), implying
$$ E_{\theta_1} g(X) = \int_{-\infty}^{\infty} g(x)\, dF(x|\theta_1) = \int_{-\infty}^{\infty} g(x)\, dF(x|\theta_2) = E_{\theta_2} g(X) $$
for any function g. In particular, any solution θ* to a system of estimating equations based on the distribution indexed by θ1, or any maximizer of
$$ E_{\theta_1} m(X, \theta^*), $$
solves the same problem based on the distribution indexed by θ2, in particular the problem of maximizing
$$ E_{\theta_2} m(X, \theta^*). $$
If one attempts to estimate the parameters of an unidentified model, unique method of moments (maximum likelihood) estimators typically cannot be found.
Definition. The mean squared error (MSE) matrix of an estimator θ̂ of θ is the function (of θ) given by
$$ MSE_\theta(\hat{\theta}) = E_\theta\left[(\hat{\theta}-\theta)(\hat{\theta}-\theta)'\right], \quad \theta \in \Theta. $$
An estimator θ̂ of θ is unbiased if
$$ E_\theta(\hat{\theta}) = \theta \quad \forall \theta \in \Theta. $$
Many results derived using MSE generalize to other measures of closeness. It is convenient to use MSE because it is analytically tractable and has a straightforward interpretation in terms of the variance and bias of the estimator θ̂. Specifically,
$$ MSE_\theta(\hat{\theta}) = E_\theta\left[(\hat{\theta}-\theta)(\hat{\theta}-\theta)'\right] = E_\theta\left[\left(\hat{\theta}-E_\theta(\hat{\theta}) + E_\theta(\hat{\theta})-\theta\right)\left(\hat{\theta}-E_\theta(\hat{\theta}) + E_\theta(\hat{\theta})-\theta\right)'\right] $$
$$ = E_\theta\left[\left(\hat{\theta}-E_\theta(\hat{\theta})\right)\left(\hat{\theta}-E_\theta(\hat{\theta})\right)'\right] + \left(E_\theta(\hat{\theta})-\theta\right)\left(E_\theta(\hat{\theta})-\theta\right)' + E_\theta\left[\hat{\theta}-E_\theta(\hat{\theta})\right]\left(E_\theta(\hat{\theta})-\theta\right)' + \left(E_\theta(\hat{\theta})-\theta\right)E_\theta\left[\hat{\theta}-E_\theta(\hat{\theta})\right]' $$
$$ = Var_\theta(\hat{\theta}) + Bias_\theta(\hat{\theta})\cdot Bias_\theta(\hat{\theta})' $$
because Eθ[θ̂ − Eθ(θ̂)] = 0 and Eθ(θ̂) − θ = Biasθ(θ̂) is non-random. In particular, when θ is a scalar,
$$ MSE_\theta(\hat{\theta}) = Var_\theta(\hat{\theta}) + Bias_\theta(\hat{\theta})^2. $$
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. Two estimators of σ² are the sample variance
$$ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2 $$
and the maximum likelihood estimator
$$ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2. $$
Recall that (1/σ²) Σ(Xi − X̄)² ∼ χ²_{n−1}, with mean n − 1 and variance 2(n − 1), implying
$$ E(S^2) = \frac{\sigma^2}{n-1}\cdot E\left(\frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\bar{X})^2\right) = \frac{\sigma^2}{n-1}\cdot(n-1) = \sigma^2 $$
and
$$ Var(S^2) = \left(\frac{\sigma^2}{n-1}\right)^2 Var\left(\frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\bar{X})^2\right) = \left(\frac{\sigma^2}{n-1}\right)^2\cdot 2(n-1) = \frac{2}{n-1}\,\sigma^4. $$
In particular,
$$ MSE_{(\mu,\sigma^2)}(S^2) = \frac{2}{n-1}\,\sigma^4. $$
Similarly,
$$ E(\hat{\sigma}^2) = \frac{\sigma^2}{n}\cdot E\left(\frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\bar{X})^2\right) = \frac{\sigma^2}{n}\cdot(n-1) = \frac{n-1}{n}\,\sigma^2 $$
and
$$ Var(\hat{\sigma}^2) = \left(\frac{\sigma^2}{n}\right)^2 Var\left(\frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\bar{X})^2\right) = \left(\frac{\sigma^2}{n}\right)^2\cdot 2(n-1) = \frac{2(n-1)}{n^2}\,\sigma^4, $$
so
$$ MSE_{(\mu,\sigma^2)}(\hat{\sigma}^2) = \frac{2(n-1)}{n^2}\,\sigma^4 + \left(\frac{1}{n}\right)^2\sigma^4 = \frac{2n-1}{n^2}\,\sigma^4 = \frac{2-1/n}{n}\,\sigma^4 < \frac{2}{n-1}\,\sigma^4 = MSE_{(\mu,\sigma^2)}(S^2). $$
Unlike S 2 , σ̂ 2 is biased. Nonetheless, its variance is so much smaller than that of S 2 that its MSE is
smaller for all values of µ and σ 2 .
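A Monte Carlo check of this MSE ranking (a sketch not in the original notes; the parameter values, sample size, replication count, and seed are arbitrary assumptions):

```python
import numpy as np

# Minimal sketch: Monte Carlo check of MSE(sigma_hat^2) < MSE(S^2)
# (true mu = 0, sigma^2 = 1, n = 10, 100_000 replications assumed).
rng = np.random.default_rng(5)
n, reps, sigma2 = 10, 100_000, 1.0
x = rng.normal(0.0, 1.0, size=(reps, n))

s2 = x.var(axis=1, ddof=1)        # sample variance, unbiased
sig2_hat = x.var(axis=1, ddof=0)  # MLE, biased
print(((s2 - sigma2) ** 2).mean(), 2 / (n - 1))          # ~ 2*sigma^4/(n-1)
print(((sig2_hat - sigma2) ** 2).mean(), (2 - 1/n) / n)  # ~ (2 - 1/n)*sigma^4/n
```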
In this example, the MSE ranking does not depend on the true value of the parameter(s). In spite of this we do not know whether an even better estimator exists. To answer that question, it might appear natural to look for an estimator that minimizes MSE uniformly in µ and σ². Unfortunately, such an estimator does not exist.
The point is that in order to find a uniformly (in the value of the parameters) best estimator, we need to impose certain restrictions on the class of estimators under consideration.
Definition. Let W be a class of estimators of θ. An estimator θ̂ ∈ W is efficient relative to W if
$$ MSE_\theta(\hat{\theta}) \le MSE_\theta(W) \quad \forall \theta \in \Theta $$
for every W ∈ W.
Remark. When θ is a vector, the notation “MSEθ(θ̂) ≤ MSEθ(W)” is shorthand for “the matrix MSEθ(W) − MSEθ(θ̂) is positive semi-definite”.
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. The estimator (of σ²) σ̂² is efficient relative to W = {σ̂², S²}.
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. Consider the following class of estimators (of σ²):
$$ \mathcal{W} = \left\{\tilde{\sigma}^2_c = \frac{1}{c}\sum_{i=1}^{n}(X_i-\bar{X})^2 : c > 0\right\}. $$
The estimators
$$ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2 = \tilde{\sigma}^2_n \qquad \text{and} \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2 = \tilde{\sigma}^2_{n-1} $$
are both members of W. It is not hard to show that σ̃²_{n+1} is efficient relative to W.
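A quick way to see why c = n + 1 wins (a sketch not in the original notes; n = 10 is an arbitrary assumption) is to evaluate MSE(σ̃²_c) over a grid of c. Using the mean and variance of Σ(Xi − X̄)² from the previous example, MSE(σ̃²_c) = [2(n − 1) + (n − 1 − c)²] σ⁴ / c²:

```python
import numpy as np

# Minimal sketch: MSE of sigma_tilde_c^2 = (1/c) * sum (X_i - X-bar)^2 under N(mu, sigma^2),
# in units of sigma^4.  MSE(c) = (2(n-1) + (n-1-c)^2) / c^2; n = 10 assumed.
n = 10
c = np.linspace(1.0, 30.0, 2901)
mse = (2 * (n - 1) + (n - 1 - c) ** 2) / c ** 2
print(c[np.argmin(mse)])   # minimized at c = n + 1 = 11
```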
Definition. Let Wu(θ) denote the class of unbiased estimators of θ with finite variance. An estimator θ̂ ∈ Wu(θ) is a uniformly minimum variance unbiased (UMVU) estimator of θ if Varθ(θ̂) ≤ Varθ(W) for every θ ∈ Θ and every W ∈ Wu(θ). For unbiased estimators, the MSE is simply the variance.
It turns out that UMVU estimators often exist. The Rao-Blackwell Theorem facilitates the search for UMVU estimators by showing that UMVU estimators can always be based on statistics that are sufficient in the sense of the following definition.
Definition. Let X1, …, Xn be a random sample from a distribution with cdf F(·|θ), where θ ∈ Θ is unknown. A statistic T = T(X1, …, Xn) is a sufficient statistic for θ if the conditional distribution of (X1, …, Xn)′ given T does not depend on θ.
Theorem (Rao-Blackwell Theorem; Casella and Berger, Theorem 7.3.17). Let θ̂ ∈ Wu(θ) and let T be any sufficient statistic for θ. Then
$$ \tilde{\theta} = E_{X|T}(\hat{\theta}|T) \in \mathcal{W}_u(\theta) $$
and
$$ Var_\theta(\tilde{\theta}) \le Var_\theta(\hat{\theta}) \quad \forall \theta \in \Theta. $$
Remark. The inequality Varθ(θ̃) ≤ Varθ(θ̂) is strict unless Pθ(θ̃ = θ̂) = 1.
Proof. The distribution of the estimator θ̂ = θ̂(X1, …, Xn) conditional on T does not depend on θ when T is sufficient. Therefore,
$$ \tilde{\theta} = E_{X|T}(\hat{\theta}|T) $$
is a function of T = T(X1, …, Xn) that does not depend on the true (unknown) value of θ. In particular, θ̃ is an estimator.
The estimator θ̃ is unbiased because
$$ E_\theta(\tilde{\theta}) = E_\theta\left(E_{X|T}(\hat{\theta}|T)\right) = E_\theta(\hat{\theta}) = \theta, $$
where the second equality uses the law of iterated expectations and the last equality uses the fact that θ̂ is unbiased.
Applying the conditional variance identity, we have:
$$ Var_\theta(\hat{\theta}) = Var_\theta\left(E_{X|T}(\hat{\theta}|T)\right) + E_\theta\left(Var_{X|T}(\hat{\theta}|T)\right) \ge Var_\theta\left(E_{X|T}(\hat{\theta}|T)\right) = Var_\theta(\tilde{\theta}). \;\blacksquare $$
Remark. With a little more effort, a proof of the relation Varθ(θ̂) ≥ Varθ(θ̃) can be based on the conditional version of Jensen's inequality:
$$ Var_\theta(\hat{\theta}) = E_\theta\left((\hat{\theta}-\theta)^2\right) = E_\theta\left(E_{X|T}\left((\hat{\theta}-\theta)^2|T\right)\right) \ge E_\theta\left(\left(E_{X|T}(\hat{\theta}|T)-\theta\right)^2\right) = E_\theta\left((\tilde{\theta}-\theta)^2\right) = Var_\theta(\tilde{\theta}), $$
where the second equality uses the law of iterated expectations and the inequality uses the conditional version of Jensen's inequality. This method of proof is applicable whenever the measure of closeness is of the form Eθ(L(θ̂, θ)), where L(θ̂, θ) is a convex function of θ̂:
$$ E_\theta\left(L(\hat{\theta},\theta)\right) = E_\theta\left(E_{X|T}\left(L(\hat{\theta},\theta)|T\right)\right) \ge E_\theta\left(L\left(E_{X|T}(\hat{\theta}|T),\theta\right)\right) = E_\theta\left(L(\tilde{\theta},\theta)\right). $$
For instance, |θ̂ − θ| is a convex function of θ̂ and therefore
$$ E_\theta\left(|\hat{\theta}-\theta|\right) = E_\theta\left(E_{X|T}\left(|\hat{\theta}-\theta|\;\middle|\;T\right)\right) \ge E_\theta\left(\left|E_{X|T}(\hat{\theta}|T)-\theta\right|\right) = E_\theta\left(|\tilde{\theta}-\theta|\right). $$
It follows from the Rao-Blackwell Theorem that when looking for UMVU estimators, there is no need to consider estimators that cannot be written as functions of a sufficient statistic. Indeed, it suffices to look at estimators that are necessary statistics in the sense that they can be written as functions of every sufficient statistic. Here, the word “every” is crucial because any estimator is a function of (X1, …, Xn)′ and (X1, …, Xn)′ is always a sufficient statistic.
Of course, the usefulness of the Rao-Blackwell Theorem depends on the extent to which sufficient statistics of low dimension are available and easy to find. As it turns out, sufficient statistics of the same dimension as θ are available in many cases. In cases where determination of sufficient statistics by means of the definition is tedious, the following characterization of sufficiency may be useful.
Theorem (Factorization Theorem; Casella and Berger, Theorem 6.2.6). A statistic T(X1, …, Xn) is a sufficient statistic for θ if and only if the joint pmf (pdf) of (X1, …, Xn)′ factors as
$$ f_X(x_1, \ldots, x_n|\theta) = g\left(T(x_1, \ldots, x_n)|\theta\right)\cdot h(x_1, \ldots, x_n) $$
for some functions g and h.
Example. For any random sample X1, …, Xn from a discrete (continuous) distribution, two sufficient statistics are (X1, …, Xn)′ and (X(1), …, X(n))′.
Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ [0, 1] is an unknown parameter. The joint pmf of (X1, …, Xn)′ is
$$ f_X(x_1, \ldots, x_n|p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}\cdot 1(x_i \in \{0,1\}) = p^{\sum_{i=1}^{n} x_i}(1-p)^{n-\sum_{i=1}^{n} x_i}\prod_{i=1}^{n} 1(x_i \in \{0,1\}) = g\left(\sum_{i=1}^{n} x_i \,\Big|\, p\right)\cdot h(x_1, \ldots, x_n), $$
where
$$ g(t|p) = p^t(1-p)^{n-t} \qquad \text{and} \qquad h(x_1, \ldots, x_n) = \prod_{i=1}^{n} 1(x_i \in \{0,1\}). $$
Therefore, Σ_{i=1}^n Xi is a sufficient statistic for p.
The maximum likelihood (and method of moments) estimator
$$ \hat{p} = \frac{\sum_{i=1}^{n} X_i}{n} $$
is a function of Σ_{i=1}^n Xi.
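A small simulation sketch of what sufficiency means here (not in the original notes; n, the conditioning value t, the two values of p, and the seed are arbitrary assumptions): conditional on the total Σ Xi = t, the distribution of the sample no longer depends on p.

```python
import numpy as np

# Minimal sketch: condition on sum(X_i) = t and check that P(X_1 = 1 | sum = t)
# does not depend on p (n = 5, t = 2 assumed; the answer is t/n = 0.4 for both p values).
rng = np.random.default_rng(6)
n, t, reps = 5, 2, 200_000

for p in (0.2, 0.7):
    x = rng.binomial(1, p, size=(reps, n))
    keep = x[x.sum(axis=1) == t]     # keep only samples with sum(X_i) = t
    print(p, keep[:, 0].mean())      # ~ 0.4 regardless of p
```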
Example. Suppose Xi ∼ i.i.d. U[0, θ], where θ > 0 is an unknown parameter. The joint pdf of (X1, …, Xn)′ is
$$ f_X(x_1, \ldots, x_n|\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\,1(0 \le x_i \le \theta) = \frac{1}{\theta^n}\prod_{i=1}^{n} 1(0 \le x_i \le \theta) = \frac{1}{\theta^n}\,1\!\left(x_{(n)} \le \theta\right)\cdot 1\!\left(x_{(1)} \ge 0\right) = g\left(x_{(n)}|\theta\right)\cdot h(x_1, \ldots, x_n), $$
where
$$ g(t|\theta) = \frac{1}{\theta^n}\,1(t \le \theta) \qquad \text{and} \qquad h(x_1, \ldots, x_n) = 1\!\left(x_{(1)} \ge 0\right). $$
Therefore, X(n) is a sufficient statistic for θ, and the maximum likelihood estimator θ̂ = X(n) is a function of X(n).
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. The joint pdf of (X1, …, Xn)′ is
$$ f_X(x_1, \ldots, x_n|\mu,\sigma^2) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(x_i-\mu)^2\right) = (2\pi)^{-n/2}(\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i^2 + \mu^2 - 2\mu x_i\right)\right) $$
$$ = (2\pi)^{-n/2}(\sigma^2)^{-n/2}\exp\left(-\frac{n\mu^2}{2\sigma^2}\right)\exp\left(\frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2\right) = g\left(\sum_{i=1}^{n} x_i, \sum_{i=1}^{n} x_i^2 \,\Big|\, \mu,\sigma^2\right)\cdot h(x_1, \ldots, x_n), $$
where
$$ g(t_1, t_2|\mu,\sigma^2) = (2\pi)^{-n/2}(\sigma^2)^{-n/2}\exp\left(-\frac{n\mu^2}{2\sigma^2}\right)\exp\left(\frac{\mu}{\sigma^2}t_1 - \frac{1}{2\sigma^2}t_2\right) \qquad \text{and} \qquad h(x_1, \ldots, x_n) = 1. $$
Therefore, (Σ_{i=1}^n Xi, Σ_{i=1}^n Xi²)′ is a sufficient statistic for (µ, σ²)′.
The maximum likelihood (and method of moments) estimator of µ,
$$ \hat{\mu} = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, $$
is a function of (Σ_{i=1}^n Xi, Σ_{i=1}^n Xi²)′, as is the maximum likelihood (and method of moments) estimator of σ²,
$$ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^2. $$
A sufficient statistic T is particularly useful if unbiased estimators based on T are essentially unique in the sense that any two unbiased estimators based on T are equal with probability one. Indeed, if we can somehow find a sufficient statistic T such that unbiased estimators based on T are essentially unique, then any θ̂ ∈ Wu(θ) based on T is UMVU. The Lehmann-Scheffé Theorem establishes essential uniqueness of unbiased estimators based on a complete sufficient statistic T.
Definition. A statistic T is complete if, for every function g,
$$ E_\theta(g(T)) = 0 \quad \forall \theta \in \Theta $$
implies
$$ P_\theta(g(T) = 0) = 1 \quad \forall \theta \in \Theta. $$
Theorem (Lehmann-Scheffé Theorem; Casella and Berger, Theorem 7.5.1). Unbiased estimators based on complete sufficient statistics are essentially unique.
Complete sufficient statistics can often be found if the family {f(·|θ) : θ ∈ Θ} of pmfs/pdfs is an exponential family.
Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ (0, 1). The marginal pmf satisfies
$$ f(x|p) = p^x(1-p)^{1-x} = h(x)\,c(p)\exp\left(\eta(p)\,t(x)\right), \quad x \in \{0,1\}, $$
where
$$ h(x) = 1, \qquad c(p) = 1-p, \qquad \eta(p) = \log\left(\frac{p}{1-p}\right), \qquad t(x) = x. $$
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In this case, the marginal pdf satisfies
$$ f(x|\mu,\sigma^2) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) = (2\pi)^{-1/2}(\sigma^2)^{-1/2}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)\exp\left(\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2\right) = h(x)\,c(\mu,\sigma^2)\exp\left(\sum_{i=1}^{2}\eta_i(\mu,\sigma^2)\,t_i(x)\right), $$
where
$$ h(x) = 1, \qquad c(\mu,\sigma^2) = (2\pi)^{-1/2}(\sigma^2)^{-1/2}\exp\left(-\frac{\mu^2}{2\sigma^2}\right), $$
$$ \eta_1(\mu,\sigma^2) = \frac{\mu}{\sigma^2}, \qquad t_1(x) = x, \qquad \eta_2(\mu,\sigma^2) = -\frac{1}{2\sigma^2}, \qquad t_2(x) = x^2. $$
Theorem (Casella and Berger, Theorem 6.2.25). Let X1, …, Xn be a random sample from a discrete (continuous) exponential family with pmf (pdf)
$$ f(x|\theta) = h(x)\,c(\theta)\exp\left(\sum_{i=1}^{d}\eta_i(\theta)\,t_i(x)\right), \quad x \in \mathbb{R},\; \theta \in \Theta. $$
If the set {(η1(θ), …, ηd(θ))′ : θ ∈ Θ} ⊆ R^d contains an open set, then
$$ T = \left(\sum_{i=1}^{n} t_1(X_i), \ldots, \sum_{i=1}^{n} t_d(X_i)\right)' $$
is a complete sufficient statistic for θ.
Remark. A set A ⊆ R^d contains an open set if and only if we can find constants a1^L < a1^U, …, ad^L < ad^U such that
$$ [a_1^L, a_1^U] \times \cdots \times [a_d^L, a_d^U] \subseteq A; $$
that is, a set A ⊆ R^d contains an open set if and only if it contains a d-dimensional rectangle.
Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ (0, 1). The marginal pmf satisfies
$$ f(x|p) = h(x)\,c(p)\exp\left(\eta(p)\,t(x)\right), \quad x \in \{0,1\}, $$
where
$$ h(x) = 1, \qquad c(p) = 1-p, \qquad \eta(p) = \log\left(\frac{p}{1-p}\right), \qquad t(x) = x. $$
The set
$$ \{\eta(p) : p \in (0,1)\} = \left\{\log\left(\frac{p}{1-p}\right) : p \in (0,1)\right\} = \mathbb{R} $$
is open, so
$$ \sum_{i=1}^{n} t(X_i) = \sum_{i=1}^{n} X_i $$
is a complete sufficient statistic for p. The maximum likelihood (and method of moments) estimator p̂ = X̄ is unbiased,
$$ E_p(\hat{p}) = E_p(X_i) = p, $$
and is based on Σ_{i=1}^n Xi. Therefore, p̂ is a UMVU estimator of p.
The conclusion is not affected if Θ = [0, 1] is considered, as Varp(p̂) = 0 when p ∈ {0, 1}.
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In this case, the marginal pdf satisfies
$$ f(x|\mu,\sigma^2) = h(x)\,c(\mu,\sigma^2)\exp\left(\sum_{i=1}^{2}\eta_i(\mu,\sigma^2)\,t_i(x)\right), $$
where
$$ h(x) = 1, \qquad c(\mu,\sigma^2) = (2\pi)^{-1/2}(\sigma^2)^{-1/2}\exp\left(-\frac{\mu^2}{2\sigma^2}\right), $$
$$ \eta_1(\mu,\sigma^2) = \frac{\mu}{\sigma^2}, \qquad t_1(x) = x, \qquad \eta_2(\mu,\sigma^2) = -\frac{1}{2\sigma^2}, \qquad t_2(x) = x^2. $$
The set
$$ \left\{\left(\eta_1(\mu,\sigma^2), \eta_2(\mu,\sigma^2)\right)' : \mu \in \mathbb{R},\ \sigma^2 > 0\right\} = \left\{\left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)' : \mu \in \mathbb{R},\ \sigma^2 > 0\right\} = \mathbb{R}\times(-\infty, 0) $$
is open, so
$$ \left(\sum_{i=1}^{n} t_1(X_i), \sum_{i=1}^{n} t_2(X_i)\right)' = \left(\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2\right)' $$
is a complete sufficient statistic for (µ, σ²)′. The estimator
$$ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2 = \frac{1}{n-1}\sum_{i=1}^{n} X_i^2 - \frac{1}{(n-1)n}\left(\sum_{i=1}^{n} X_i\right)^2 $$
is unbiased and is based on (Σ_{i=1}^n Xi, Σ_{i=1}^n Xi²)′; S² is therefore a UMVU estimator of σ².
Outside the exponential family of distributions, we typically have to find complete sufficient statistics (if they exist) by applying the definition of completeness.
Example. Suppose Xi ∼ i.i.d. U[0, θ], where θ > 0 is an unknown parameter. The sufficient statistic T = T(X1, …, Xn) = X(n) is continuous with pdf f(·|θ) given by (Casella and Berger, Example 6.2.23)
$$ f(t|\theta) = n t^{n-1}\theta^{-n}\,1(0 \le t \le \theta). $$
The statistic T is complete if the set {x ∈ R+ : g(x) ≠ 0} has (Lebesgue) measure zero whenever g : R → R is a function satisfying
$$ E_\theta(g(T)) = \int_{0}^{\theta} g(t)\, n t^{n-1}\theta^{-n}\, dt = 0 \quad \forall \theta > 0. $$
Given any such function g, let g⁺ and g⁻ be the positive and negative parts of g, respectively; that is, let g⁺(t) = max(0, g(t)) and g⁻(t) = max(0, −g(t)). By assumption,
$$ \int_{0}^{\theta} g^{+}(t)\, t^{n-1}\, dt = \int_{0}^{\theta} g^{-}(t)\, t^{n-1}\, dt \quad \forall \theta > 0. $$
It can be shown (using the Radon-Nikodym theorem) that this implies that the set {x ∈ R+ : g⁺(x) ≠ g⁻(x)} has measure zero. Therefore, the set {x ∈ R+ : g(x) = g⁺(x) − g⁻(x) ≠ 0} has measure zero. In particular, X(n) is a complete sufficient statistic.
The maximum likelihood estimator θ̂ML = X(n) is based on X(n) but is biased because
$$ E_\theta(\hat{\theta}_{ML}) = \int_{0}^{\theta} t f(t|\theta)\, dt = \int_{0}^{\theta} n t^{n}\theta^{-n}\, dt = \left.\frac{n}{n+1}\, t^{n+1}\theta^{-n}\right|_{t=0}^{\theta} = \frac{n}{n+1}\,\theta. $$
The rescaled estimator
$$ \hat{\theta} = \frac{n+1}{n}\, X_{(n)} $$
is unbiased and is based on the complete sufficient statistic X(n). As a consequence, θ̂ is a UMVU estimator of θ.
In this example, we constructed a UMVU estimator of θ by finding “the” function θ̂(·) such that
$$ E_\theta(\hat{\theta}(T)) = \theta \quad \forall \theta \in \Theta. $$
In cases where an unbiased estimator θ̂ (not based on a complete sufficient statistic) has already been found, a UMVU estimator of θ can be found by “Rao-Blackwellization”; that is,
$$ \tilde{\theta} = E(\hat{\theta}|T) $$
is a UMVU estimator of θ whenever T is a complete sufficient statistic.
Example. Suppose Xi ∼ i.i.d. U[0, θ], where θ > 0 is an unknown parameter. Since Eθ(Xi) = θ/2, an unbiased estimator of θ is
$$ \hat{\theta} = 2X_1. $$
Rao-Blackwellizing with respect to the complete sufficient statistic X(n) gives
$$ \tilde{\theta} = E\left(2X_1|X_{(n)}\right) = \frac{n+1}{n}\, X_{(n)}, $$
which is a UMVU estimator of θ.
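A simulation sketch of the variance reduction from Rao-Blackwellization (not in the original notes; the true θ, sample size, replication count, and seed are arbitrary assumptions):

```python
import numpy as np

# Minimal sketch: Rao-Blackwellization for U[0, theta] (theta = 5, n = 10 assumed).
# Both 2*X_1 and (n+1)/n * max(X_i) are unbiased; the latter has much smaller variance.
rng = np.random.default_rng(7)
theta, n, reps = 5.0, 10, 100_000
x = rng.uniform(0.0, theta, size=(reps, n))

crude = 2 * x[:, 0]                  # unbiased but ignores most of the sample
rb = (n + 1) / n * x.max(axis=1)     # E(2*X_1 | X_(n)) = (n+1)/n * X_(n)
print(crude.mean(), rb.mean())       # both ~ theta
print(crude.var(), rb.var())         # variance drops dramatically
```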
If a UMVU estimator does not exist or is hard to find, it is nice to have a benchmark against which all estimators can be compared. That is, it is nice to have a lower bound on the variance of any unbiased estimator.
Definition. An estimator θ̂ ∈ Wu(θ) is a locally minimum variance unbiased (LMVU) estimator at θ0 ∈ Θ if Varθ0(θ̂) ≤ Varθ0(W) for every W ∈ Wu(θ).
Suppose an LMVU estimator exists at every θ0 ∈ Θ and let VL(θ0) denote the variance of the LMVU estimator at θ0. The function VL(·) : Θ → R+ provides us with a lower bound on the variance of any estimator θ̂ ∈ Wu(θ). The bound is sharp in the sense that it can be attained for any θ0 ∈ Θ. If the LMVU estimator (or its variance) is hard to find, we may nonetheless be able to find a nontrivial lower bound on the variance of any unbiased estimator of θ. A very useful bound, the Cramér-Rao bound, can be obtained by applying the covariance inequality, a corollary of the Cauchy-Schwarz inequality.
Corollary (Covariance Inequality; Casella and Berger, Example 4.7.4). If (X, Y) is a bivariate random vector with Var(Y) > 0, then
$$ Var(X) \ge \frac{Cov(X,Y)^2}{Var(Y)}. $$
Let X1, …, Xn be a random sample from a discrete (continuous) distribution with pmf (pdf) f(·|θ), where θ ∈ Θ is an unknown parameter vector. Moreover, let θ̂ = θ̂(X1, …, Xn) be any estimator of θ and let ψ = ψ(X1, …, Xn; θ) be any vector-valued function of (X1, …, Xn)′ and θ. It follows from the (multivariate version of the) covariance inequality that
$$ Var_\theta(\hat{\theta}) \ge Cov_\theta(\hat{\theta}, \psi)\, Var_\theta(\psi)^{-1}\, Cov_\theta(\hat{\theta}, \psi)'. $$
In general, the lower bound on the right hand side depends on θ̂ and therefore the inequality may not seem to be very helpful.
Remark. It can be shown that the function (of θ and θ̂) Covθ(θ̂, ψ) depends on θ̂ only through Eθ(θ̂) if and only if
$$ Cov_\theta(\psi, U) = 0 \quad \forall \theta \in \Theta $$
for every statistic U = U(X1, …, Xn) satisfying Eθ(U) = 0 ∀θ ∈ Θ.
Definition. Let X1, …, Xn be a random sample from a discrete (continuous) distribution with pmf (pdf) f(·|θ), where θ ∈ Θ ⊆ R^k is an unknown parameter vector. The score function is the (random) function S(·|X1, …, Xn) : Θ → R^k given by
$$ S(\theta|X_1, \ldots, X_n) = \frac{\partial}{\partial\theta}\sum_{i=1}^{n}\log f(X_i|\theta), \quad \theta \in \Theta. $$
Theorem (Cramér-Rao Inequality). Let θ̂ = θ̂(X1, …, Xn) be an estimator of θ. If
$$ E_\theta\left(S(\theta|X_1, \ldots, X_n)\right) = 0 $$
and
$$ E_\theta\left(\hat{\theta}\cdot S(\theta|X_1, \ldots, X_n)'\right) = I, $$
then
$$ Var_\theta(\hat{\theta}) \ge Var_\theta\left(S(\theta|X_1, \ldots, X_n)\right)^{-1}. $$
Proof. Writing S(θ) = S(θ|X1, …, Xn), the assumptions imply
$$ Cov_\theta\left(\hat{\theta}, S(\theta)\right) = E_\theta\left(\hat{\theta}\cdot\left(S(\theta) - E_\theta(S(\theta))\right)'\right) = E_\theta\left(\hat{\theta}\cdot S(\theta)'\right) = I, $$
so the covariance inequality gives
$$ Var_\theta(\hat{\theta}) \ge Cov_\theta\left(\hat{\theta}, S(\theta)\right) Var_\theta\left(S(\theta)\right)^{-1} Cov_\theta\left(\hat{\theta}, S(\theta)\right)' = Var_\theta\left(S(\theta)\right)^{-1}. \;\blacksquare $$
The conditions
$$ E_\theta\left(S(\theta|X_1, \ldots, X_n)\right) = 0 \qquad \text{and} \qquad E_\theta\left(\hat{\theta}\cdot S(\theta|X_1, \ldots, X_n)'\right) = I $$
both have natural interpretations. Suppose X = (X1, …, Xn)′ is continuous with joint pdf fX(·|θ) and let T(X1, …, Xn) be any statistic with Eθ|T(X1, …, Xn)| < ∞. If we can interchange the order of integration and differentiation, then
$$ \frac{\partial}{\partial\theta'} E_\theta(T(X)) = \frac{\partial}{\partial\theta'}\int_{\mathbb{R}^n} T(x) f_X(x|\theta)\, dx = \int_{\mathbb{R}^n} T(x)\frac{\partial}{\partial\theta'} f_X(x|\theta)\, dx = \int_{\mathbb{R}^n} T(x)\left(\frac{\partial}{\partial\theta'}\log f_X(x|\theta)\right) f_X(x|\theta)\, dx = E_\theta\left(T(X)\, S(\theta|X)'\right) $$
since
$$ S(\theta|X_1, \ldots, X_n) = \frac{\partial}{\partial\theta}\sum_{i=1}^{n}\log f(X_i|\theta) = \frac{\partial}{\partial\theta}\log\left(\prod_{i=1}^{n} f(X_i|\theta)\right) = \frac{\partial}{\partial\theta}\log f_X(X_1, \ldots, X_n|\theta) $$
when X1, …, Xn is a random sample from a distribution with pdf f(·|θ). Setting T(X) = 1 and T(X) = θ̂ (an unbiased estimator of θ), we obtain the relations
$$ E_\theta\left(S(\theta|X_1, \ldots, X_n)\right) = 0 \qquad \text{and} \qquad E_\theta\left(\hat{\theta}\cdot S(\theta|X_1, \ldots, X_n)'\right) = I, $$
respectively.
Conditions under which we can interchange the order of integration and differentiation are available (Casella and Berger, Section 2.4).
Lemma. Let X1, …, Xn be a random sample from a discrete (continuous) distribution with pmf (pdf) f(·|θ), where θ ∈ Θ. Suppose
(i) Θ is open.
(ii) The set {x ∈ R : f(x|θ) > 0} does not depend on θ.
(iii) For every θ ∈ Θ there is a function bθ : R → R+ and a constant ∆θ > 0 such that
$$ E_\theta\left(b_\theta(X)\right) < \infty $$
and
$$ \left|\frac{f(x|\theta+\delta) - f(x|\theta)}{\delta}\right| < b_\theta(x) \quad \forall x \in \mathbb{R} $$
whenever 0 < |δ| < ∆θ. Then
$$ \frac{\partial}{\partial\theta'} E_\theta(T(X)) = E_\theta\left(T(X)\, S(\theta|X)'\right) $$
for any statistic T(X) = T(X1, …, Xn) with Eθ|T(X)| < ∞.
Remark. Conditions (ii) and (iii) of the lemma hold whenever f(x|θ) is of the form
$$ f(x|\theta) = h(x)\,c(\theta)\exp\left(\sum_{i=1}^{d}\eta_i(\theta)\,t_i(x)\right), \quad x \in \mathbb{R},\; \theta \in \Theta; $$
that is, whenever {f(·|θ) : θ ∈ Θ} is an exponential family.
The quantity
$$ I(\theta) = E_\theta\left(S(\theta|X_1, \ldots, X_n)\, S(\theta|X_1, \ldots, X_n)'\right) $$
is called the information matrix, or the Fisher information. Under the assumptions of the Cramér-Rao Inequality, the Fisher information is
$$ I(\theta) = Var_\theta\left(S(\theta|X_1, \ldots, X_n)\right) $$
and I(θ)^{−1} provides a lower bound on the variance of any estimator θ̂ ∈ Wu(θ).
The Fisher information is easy to compute when X1, …, Xn is a random sample (Casella and Berger, Corollary 7.3.10):
$$ I(\theta) = E_\theta\left(\left(\frac{\partial}{\partial\theta}\sum_{i=1}^{n}\log f(X_i|\theta)\right)\left(\frac{\partial}{\partial\theta}\sum_{i=1}^{n}\log f(X_i|\theta)\right)'\right) = n\cdot E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)'\right), $$
where X is a random variable with cdf F(·|θ). Under additional regularity conditions,
$$ I(\theta) = n\cdot E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)'\right) = -n\cdot E_\theta\left(\frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(X|\theta)\right). $$
If the conditions of the Cramér-Rao inequality are satisfied and it just so happens that an unbiased estimator attains the bound, then the estimator is UMVU. The Cramér-Rao inequality can therefore be used to establish optimality in some cases.
Example. Suppose Xi ∼ i.i.d. Ber (p) , where p ∈ (0, 1) is unknown. When x ∈ {0, 1} , we have:
$$ \log f(x|p) = \log\left(p^x(1-p)^{1-x}\right) = x\cdot\log p + (1-x)\log(1-p) $$
and
$$ \frac{\partial}{\partial p}\log f(x|p) = \frac{x}{p} - \frac{1-x}{1-p} = \frac{x(1-p) + p(x-1)}{p(1-p)} = \frac{x}{p(1-p)} - \frac{1}{1-p}. $$
The conditions of the Cramér-Rao inequality are satisfied, so the Fisher information is
$$ I(p) = n\cdot Var_p\left(\frac{X_i}{p(1-p)} - \frac{1}{1-p}\right) = n\cdot\left(\frac{1}{p(1-p)}\right)^2 Var_p(X_i) = n\cdot\left(\frac{1}{p(1-p)}\right)^2 p(1-p) = \frac{n}{p(1-p)}, $$
and the Cramér-Rao bound is
$$ I(p)^{-1} = \frac{p(1-p)}{n}. $$
The variance of the maximum likelihood estimator p̂ = X̄ of p satisfies
$$ Var_p(\hat{p}) = Var_p(\bar{X}) = \frac{Var_p(X_i)}{n} = \frac{p(1-p)}{n}. $$
The maximum likelihood estimator attains the lower bound I(p)^{−1} and is therefore UMVU.
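A quick empirical confirmation that the bound is attained (a sketch not in the original notes; the true p, sample size, replication count, and seed are arbitrary assumptions):

```python
import numpy as np

# Minimal sketch: the variance of the Bernoulli MLE p-hat = X-bar attains the
# Cramer-Rao bound p(1-p)/n (p = 0.3, n = 25, 200_000 replications assumed).
rng = np.random.default_rng(8)
p, n, reps = 0.3, 25, 200_000
p_hat = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
print(p_hat.var(), p * (1 - p) / n)   # empirical variance ~ the bound
```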
There are cases where the Cramér-Rao bound does not apply or fails to be sharp.
Example. Suppose Xi ∼ i.i.d. U [0, θ] , where θ > 0 is an unknown parameter. When 0 < x < θ,
we have:
$$ \log f(x|\theta) = \log\frac{1}{\theta} = -\log\theta $$
and
$$ \frac{\partial}{\partial\theta}\log f(x|\theta) = -\frac{1}{\theta}. $$
The conditions of the Cramér-Rao inequality are not satisfied: the support of f(·|θ) depends on θ, and Eθ(∂ log f(X|θ)/∂θ) = −1/θ ≠ 0. There is therefore no guarantee that the quantity
$$ \left(n\cdot E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right)\right)^{-1} = \frac{\theta^2}{n} $$
delivers a lower bound on the variance of unbiased estimators. In fact, the variance of the UMVU estimator
$$ \hat{\theta} = \frac{n+1}{n}\, X_{(n)} $$
is θ²/(n(n + 2)) < θ²/n, so the "bound" fails.
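A simulation sketch of this failure (not in the original notes; the true θ, sample size, replication count, and seed are arbitrary assumptions):

```python
import numpy as np

# Minimal sketch: for U[0, theta] the naive "Cramer-Rao bound" theta^2/n is not a bound;
# the UMVU estimator (n+1)/n * max(X_i) has variance theta^2/(n(n+2)) (theta = 5, n = 10 assumed).
rng = np.random.default_rng(9)
theta, n, reps = 5.0, 10, 200_000
x = rng.uniform(0.0, theta, size=(reps, n))
umvu = (n + 1) / n * x.max(axis=1)
print(umvu.var())                              # ~ theta^2 / (n(n+2)) ~ 0.208
print(theta**2 / (n * (n + 2)), theta**2 / n)  # the naive "bound" is larger
```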
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. We have:
$$ \log f(x|\mu,\sigma^2) = \log\left((2\pi\sigma^2)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)\right) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(x-\mu)^2, $$
$$ \frac{\partial}{\partial\mu}\log f(x|\mu,\sigma^2) = \frac{1}{\sigma^2}(x-\mu), $$
and
$$ \frac{\partial}{\partial\sigma^2}\log f(x|\mu,\sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2. $$
The conditions of the Cramér-Rao inequality are satisfied, so the Fisher information is
$$ I(\mu,\sigma^2) = n\cdot\begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} $$
because
$$ Var_{(\mu,\sigma^2)}\left(\frac{\partial}{\partial\mu}\log f(X_i|\mu,\sigma^2)\right) = \frac{1}{\sigma^4}\, Var_{(\mu,\sigma^2)}(X_i-\mu) = \frac{1}{\sigma^2}, $$
$$ Var_{(\mu,\sigma^2)}\left(\frac{\partial}{\partial\sigma^2}\log f(X_i|\mu,\sigma^2)\right) = \frac{1}{(2\sigma^4)^2}\, Var_{(\mu,\sigma^2)}\left((X_i-\mu)^2\right) = \frac{1}{2\sigma^4}, $$
and
$$ Cov_{(\mu,\sigma^2)}\left(\frac{\partial}{\partial\mu}\log f(X_i|\mu,\sigma^2),\ \frac{\partial}{\partial\sigma^2}\log f(X_i|\mu,\sigma^2)\right) = Cov_{(\mu,\sigma^2)}\left(\frac{1}{\sigma^2}(X_i-\mu),\ \frac{1}{2\sigma^4}(X_i-\mu)^2\right) = 0. $$
As a consequence, a lower bound on the covariance matrix of an unbiased estimator of (µ, σ²)′ is
$$ I(\mu,\sigma^2)^{-1} = \begin{pmatrix} \frac{\sigma^2}{n} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{pmatrix}. $$
In particular, no unbiased estimator of µ can have variance smaller than σ²/n. This lower bound is attained by the maximum likelihood estimator µ̂ = X̄. The Cramér-Rao lower bound on the variance of unbiased estimators of σ² is 2σ⁴/n. The variance of the UMVU estimator S² is 2σ⁴/(n − 1) and the Cramér-Rao bound therefore cannot be attained.
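A numerical sketch of both facts (not in the original notes; the parameter values, sample size, replication count, and seed are arbitrary assumptions):

```python
import numpy as np

# Minimal sketch: for N(mu, sigma^2), X-bar attains the bound sigma^2/n, while the
# UMVU estimator S^2 has variance 2*sigma^4/(n-1) > 2*sigma^4/n
# (mu = 0, sigma^2 = 1, n = 10, 200_000 replications assumed).
rng = np.random.default_rng(10)
n, reps = 10, 200_000
x = rng.normal(0.0, 1.0, size=(reps, n))
print(x.mean(axis=1).var(), 1 / n)                       # ~ sigma^2/n: bound attained
print(x.var(axis=1, ddof=1).var(), 2 / (n - 1), 2 / n)   # S^2 misses the bound 2*sigma^4/n
```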
Theorem. Suppose θ̂0 is a UMVU estimator of a scalar parameter θ and θ̂1 ∈ Wu(θ). Then
$$ Cov_\theta\left(\hat{\theta}_0, \hat{\theta}_1 - \hat{\theta}_0\right) = 0 \quad \forall \theta \in \Theta. $$
Proof. For any a ∈ R, the estimator θ̂a = θ̂0 + a(θ̂1 − θ̂0) is unbiased and
$$ Var_\theta(\hat{\theta}_a) = Var_\theta\left(\hat{\theta}_0 + a(\hat{\theta}_1 - \hat{\theta}_0)\right) = Var_\theta(\hat{\theta}_0) + a^2\, Var_\theta(\hat{\theta}_1 - \hat{\theta}_0) + 2a\cdot Cov_\theta\left(\hat{\theta}_0, \hat{\theta}_1 - \hat{\theta}_0\right). $$
Therefore, if Varθ(θ̂a) ≥ Varθ(θ̂0) for any θ ∈ Θ and any a ∈ R, then
$$ \left.\frac{d}{da}\, Var_\theta(\hat{\theta}_a)\right|_{a=0} = 2\cdot Cov_\theta\left(\hat{\theta}_0, \hat{\theta}_1 - \hat{\theta}_0\right) = 0 \quad \forall \theta \in \Theta. \;\blacksquare $$
Remark. Conversely, if θ̂ ∈ Wu(θ) satisfies
$$ Cov_\theta\left(\hat{\theta}, \tilde{\theta} - \hat{\theta}\right) = 0 \quad \forall \theta \in \Theta $$
for every θ̃ ∈ Wu(θ), then θ̂ is UMVU, because
$$ Var_\theta(\tilde{\theta}) = Var_\theta\left(\hat{\theta} + (\tilde{\theta} - \hat{\theta})\right) = Var_\theta(\hat{\theta}) + Var_\theta(\tilde{\theta} - \hat{\theta}) + 2\, Cov_\theta\left(\hat{\theta}, \tilde{\theta} - \hat{\theta}\right) = Var_\theta(\hat{\theta}) + Var_\theta(\tilde{\theta} - \hat{\theta}) \ge Var_\theta(\hat{\theta}) $$
whenever Covθ(θ̂, θ̃ − θ̂) = 0.
Corollary. If θ̂ is a UMVU estimator of θ and θ̃ ∈ Wu(θ), then
$$ Var_\theta\left(\tilde{\theta} - \hat{\theta}\right) = Var_\theta(\tilde{\theta}) - Var_\theta(\hat{\theta}) $$
and
$$ Cov_\theta\left(\hat{\theta}, \tilde{\theta}\right) = Var_\theta(\hat{\theta}) $$
for every θ ∈ Θ.
Proof. Now,
$$ Cov_\theta\left(\hat{\theta}, \tilde{\theta}\right) = Cov_\theta\left(\hat{\theta}, \tilde{\theta} - \hat{\theta} + \hat{\theta}\right) = Cov_\theta\left(\hat{\theta}, \tilde{\theta} - \hat{\theta}\right) + Var_\theta(\hat{\theta}) = Var_\theta(\hat{\theta}), $$
where the last equality holds because Covθ(θ̂, θ̃ − θ̂) = 0 in view of the theorem. Using this relation,
$$ Var_\theta\left(\tilde{\theta} - \hat{\theta}\right) = Var_\theta(\tilde{\theta}) + Var_\theta(\hat{\theta}) - 2\, Cov_\theta\left(\hat{\theta}, \tilde{\theta}\right) = Var_\theta(\tilde{\theta}) - Var_\theta(\hat{\theta}). \;\blacksquare $$
Corollary (Casella and Berger, Theorem 7.3.19). UMVU estimators are essentially unique in the sense that if θ̂ is a UMVU estimator of θ and θ̃ ∈ Wu(θ), then
$$ Var_\theta(\tilde{\theta}) > Var_\theta(\hat{\theta}) $$
unless Pθ(θ̃ = θ̂) = 1.
Proof. If a random variable X has Var(X) = 0, then P(X = E(X)) = 1. Unbiasedness implies Eθ(θ̃ − θ̂) = Eθ(θ̃) − Eθ(θ̂) = 0 and it therefore suffices to show that Varθ(θ̃) > Varθ(θ̂) unless Varθ(θ̃ − θ̂) = 0. The stated result now follows from the previous corollary. ∎