
Economics 240A, Fall 2003
Department of Economics, University of California at Berkeley
M. Jansson

Point Estimation

Recall that a probability space (sometimes called a probability model) is a triple (Ω, B, P), where Ω is
a sample space, B is a σ-algebra of events (subsets of Ω) and P is a probability function (defined on B).
In statistics we need to be able to study several probability functions simultaneously.

Definition. A statistical experiment (sometimes called a statistical model) is a triple (Ω, B, P), where
Ω is a sample space, B is a σ-algebra of events and P is a collection of probability functions defined on
B; that is, P is a collection of probability functions such that (Ω, B, P) is a probability space for each P ∈ P.

Definition. A statistical model (Ω, B, P) is parametric if P is of the form {Pθ : θ ∈ Θ}, Θ ⊆ Rk, where
θ is a parameter, which takes on values in the parameter space Θ, a subset of Rk.

For concreteness, we will only consider parametric statistical models. Moreover, we will assume that the
elementary outcomes are vectors of real numbers and that these outcomes are realizations of a collection
of i.i.d. random variables. That is, each outcome is of the form

x = (x1 , . . . , xn )′,

where each xi (i = 1, . . . , n) is a realization of a random variable Xi and the random variables X1 , . . . , Xn
are a random sample from some distribution with cdf F(·|θ), where θ ∈ Θ is unknown. Under these
assumptions, each of the probability functions Pθ appearing in the definition of a parametric statistical
model is uniquely determined by the corresponding cdf F(·|θ). In other words, there is a one-to-one cor-
respondence between the collection P = {Pθ : θ ∈ Θ} and the associated family F = {F(·|θ) : θ ∈ Θ} of
marginal cdfs, so we can (and typically will) specify a statistical model in terms of the latter.

Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ [0, 1] is an unknown parameter. In this case, θ = p,
Θ = [0, 1] ⊆ R and

F(x|p) = 0      for x < 0,
         1 − p  for 0 ≤ x < 1,
         1      for x ≥ 1.

Example. Suppose Xi ∼ i.i.d. U[0, θ], where θ > 0 is an unknown parameter. In this case, Θ = R++ ⊆ R
and

F(x|θ) = 0    for x < 0,
         x/θ  for 0 ≤ x < θ,
         1    for x ≥ θ.


Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In
this case, θ = (µ, σ²)′, Θ = R × R++ ⊆ R² and

F(x|µ, σ²) = ∫_{−∞}^{x} φ(t|µ, σ²) dt,  x ∈ R,

where

φ(t|µ, σ²) = (2πσ²)^{−1/2} exp(−(t − µ)²/(2σ²)),  t ∈ R.

Definition. Let X1 , . . . , Xn be a random sample from a distribution with cdf F(·|θ), where θ is an
unknown parameter. A point estimator is any statistic W (X1 , . . . , Xn ) .

At this level of generality, a point estimator is just a random variable. A realized value of a (point)
estimator is called a (point) estimate. The quantity we are trying to estimate (typically θ) is called the
estimand. It is not required that the range of the estimator coincides with the range of the estimand;
that is, W (X1 , . . . , Xn ) ∈ Θ is not required. On the other hand, an estimator W (X1 , . . . , Xn ) of θ is
a good estimator (only) if it is “close” to θ in some probabilistic sense and this will typically require
W (X1 , . . . , Xn ) ∈ Θ.
Casella and Berger (Sections 7.2.1-7.2.3) discuss three methods that can be used to generate estimators
under quite general circumstances. We will cover two of these methods, the method of moments and
the maximum likelihood procedure. Method of moments estimators are obtained by solving a system of
equations, while maximum likelihood estimators are constructed by solving a maximization problem.
Suppose X1 , . . . , Xn is a random sample from a distribution with cdf F (·|θ) , where θ ∈ Θ is an unknown
scalar parameter. Let µ : Θ → R be the function defined by

µ(θ) = ∫_{−∞}^{∞} x dF(x|θ),  θ ∈ Θ.

As defined, µ(θ) is the expected value of a random variable with cdf F(·|θ). The true parameter value θ
solves the equation

E (X) = µ (θ) ,

where X is a random variable with the same (marginal) distribution as Xi (i = 1, . . . , n). A method of
moments estimator θ̂ solves the sample analogue of this equation, viz.

X̄ = (1/n) ∑_{i=1}^{n} Xi = µ(θ̂).

To the extent that X̄ is a good estimator of E (X) (it turns out that it often is), one would expect θ̂ to be
a good estimator of θ.

Example. Suppose Xi ∼ i.i.d. Ber (p) , where p ∈ [0, 1] is an unknown parameter. In this case, θ = p,
Θ = [0, 1] and

µ (p) = p, 0 ≤ p ≤ 1.


Therefore, the method of moments estimator of p is

p̂ = X̄.

Example. Suppose Xi ∼ i.i.d. U [0, θ] , where θ > 0 is an unknown parameter. We have:

µ(θ) = θ/2,  θ > 0.

The method of moments estimator θ̂ is found by solving the equation µ(θ̂) = X̄:

µ(θ̂) = θ̂/2 = X̄  ⇔  θ̂ = 2X̄.
As it turns out, this method of moments estimator is not a terribly good estimator. Notice that even
though θ is unknown, we do know that Xi > θ is impossible when Xi ∼ U [0, θ] . It is possible to have
Xi > θ̂ for some i (e.g. if n = 3 and X1 = X2 = 1 and X3 = 7), so it seems plausible that a better
estimator can be constructed.
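
Remark (added numerical illustration, not part of the original notes). The last point is easy to see by simulation: the method of moments estimate 2X̄ can fall below the sample maximum, an impossible value for θ. The sketch below assumes Python with numpy as a dependency.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 1.0, 10, 10_000

count = 0
for _ in range(reps):
    x = rng.uniform(0.0, theta, size=n)
    theta_mm = 2 * x.mean()        # method of moments estimate
    if theta_mm < x.max():         # estimate below an observed data point
        count += 1

print(f"P(2*Xbar < max Xi) approx {count / reps:.3f}")
```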

In the case of a scalar parameter θ, the method of moments estimator is constructed by solving one (mo-
ment) equation in one unknown parameter. When θ is a k-dimensional parameter vector, θ = (θ1 , . . . , θk )0 ,
the method of moments estimator of θ is constructed by solving k equations in the k unknown parameters
θ1 , . . . , θk .

Definition. Let X1 , . . . , Xn be a random sample from a distribution with cdf F(·|θ), where θ ∈ Θ ⊆ Rk
is a vector of unknown parameters. For any j = 1, . . . , k, let µj : Θ → R be defined by

µj(θ) = ∫_{−∞}^{∞} x^j dF(x|θ),  θ ∈ Θ.

A method of moments estimator θ̂ of θ solves the estimating equations

(1/n) ∑_{i=1}^{n} Xi^j = µj(θ̂),  j = 1, . . . , k.

Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In
this case, θ = (µ, σ²)′, Θ = R × R++ ⊆ R² and the functions µ1(·) and µ2(·) are given by

µ1(µ, σ²) = µ,
µ2(µ, σ²) = σ² + µ².

Any solution (µ̂, σ̂²)′ to the equation

(1/n) ∑_{i=1}^{n} Xi = µ1(µ̂, σ̂²) = µ̂

satisfies

µ̂ = (1/n) ∑_{i=1}^{n} Xi = X̄.

Using this relation, the equation

(1/n) ∑_{i=1}^{n} Xi² = µ2(µ̂, σ̂²) = σ̂² + µ̂²

can be solved for σ̂² to yield

σ̂² = (1/n) ∑_{i=1}^{n} Xi² − µ̂² = (1/n) ∑_{i=1}^{n} Xi² − X̄² = (1/n) ∑_{i=1}^{n} (Xi − X̄)².
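
Remark (added numerical illustration, not part of the original notes). The two estimating equations can be solved directly on simulated data; numpy is an assumed dependency.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n = 2.0, 4.0, 500
x = rng.normal(mu, np.sqrt(sigma2), size=n)

# Solve mean(X) = mu_hat and mean(X^2) = sigma2_hat + mu_hat^2.
mu_hat = x.mean()
sigma2_hat = (x ** 2).mean() - mu_hat ** 2   # equals mean((X - Xbar)^2)

print(mu_hat, sigma2_hat)
```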

In all of the examples considered so far, there is a unique solution θ̂ ∈ Θ to the system
(1/n) ∑_{i=1}^{n} Xi^j = ∫_{−∞}^{∞} x^j dF(x|θ̂),  j = 1, . . . , k,

of estimating equations. It is not difficult to construct examples where the method of moments breaks
down. In such cases, some variant of the method of moments may work.
Example. Suppose Xi ∼ i.i.d. N(0, σ²), where σ² > 0 is an unknown parameter. In this case, θ = σ²,
Θ = R++ ⊆ R and the function µ1(·) is given by

µ1(σ²) = 0.

The equation

X̄ = µ1(σ̂²) = 0

has infinitely many solutions when X̄ = 0 and no solutions when X̄ ≠ 0.

In contrast, the sample counterpart of the equation

E(X²) = µ2(σ²) = σ²

has a unique solution:

σ̂² = (1/n) ∑_{i=1}^{n} Xi².

Generalizing this example, let X1 , . . . , Xn be a random sample from a distribution with cdf F (·|θ) ,
where θ ∈ Θ ⊆ Rk is a vector of unknown parameters. Even if we cannot find a unique solution θ̂ to the
estimating equations

(1/n) ∑_{i=1}^{n} Xi^j = ∫_{−∞}^{∞} x^j dF(x|θ̂),  j = 1, . . . , k,

we may be able to find functions gj : R → R (j = 1, . . . , k) such that the system of equations

(1/n) ∑_{i=1}^{n} gj(Xi) = ∫_{−∞}^{∞} gj(x) dF(x|θ̂),  j = 1, . . . , k,
has a unique solution θ̂ ∈ Θ. Estimators θ̂ constructed in this way are also called method of moments
estimators.

Definition. Let X = (X1 , . . . , Xn )′ be a discrete (continuous) n-dimensional random vector with joint
pmf (pdf) fX(·|θ) : Rn → R+, where θ ∈ Θ is an unknown parameter vector. For any x = (x1 , . . . , xn )′,
the likelihood function given x is the function L (·|x) : Θ → R+ given by

L (θ|x) = L (θ|x1 , . . . , xn ) = fX (x|θ) , θ ∈ Θ.

The log likelihood function given x is the function l (·|x) : Θ → [−∞, ∞) given by

l (θ|x) = l (θ|x1 , . . . , xn ) = log L (θ|x) , θ ∈ Θ.

When X1 , . . . , Xn is a random sample from a discrete (continuous) distribution with pmf (pdf) f (·|θ) ,
the likelihood function given x = (x1 , . . . , xn )′ is

L(θ|x) = ∏_{i=1}^{n} f(xi|θ),  θ ∈ Θ,

while the log likelihood function given x is


l(θ|x) = ∑_{i=1}^{n} log f(xi|θ),  θ ∈ Θ.

Definition. Let X1 , . . . , Xn be a random sample from a discrete (continuous) distribution with pmf
(pdf) f(·|θ), where θ ∈ Θ is an unknown parameter vector. When X = (X1 , . . . , Xn )′ = x, a maximum
likelihood estimate θ̂(x) of θ satisfies

L(θ̂(x)|x) = max_{θ∈Θ} L(θ|x),

where L (·|x) is the likelihood function given x. The estimator θ̂ (X) is a maximum likelihood estimator
(MLE) of θ.

Maximum likelihood estimators often enjoy favorable large sample properties. That result is related to
the following fact, which in itself can be used to motivate the maximum likelihood estimator.


Theorem (Information Inequality; Ruud, Lemma D.2). Let X be a discrete (continuous) ran-
dom variable with pmf (pdf ) f0 and let f1 be any other pmf (pdf). Then

E (log f0 (X)) ≥ E (log (f1 (X))) .

Remark. The information inequality is strict unless P (f0 (X) = f1 (X)) = 1.

Proof. The claim is that E(log(Y)) ≤ 0, where

Y = f1(X)/f0(X)  for X ∈ 𝒳,
    0            for X ∉ 𝒳,

where 𝒳 = {x : f0(x) > 0} is the support of X.


Recall the following implication of Jensen’s inequality: If Y is a random variable with P (Y ≥ 0) = 1,
then

E (log (Y )) ≤ log (E (Y )) .

Now,
E(Y) = ∑_{x∈𝒳} [f1(x)/f0(x)] · f0(x) = ∑_{x∈𝒳} f1(x) ≤ ∑_{x∈R} f1(x) = 1

if X is discrete, while

E(Y) = ∫_{𝒳} [f1(x)/f0(x)] · f0(x) dx = ∫_{𝒳} f1(x) dx ≤ ∫_{−∞}^{∞} f1(x) dx = 1

if X is continuous. In both cases, E (Y ) ≤ 1 and it follows from Jensen’s inequality that

E (log (Y )) ≤ log (E (Y )) ≤ log (1) = 0,

as was to be shown. ¥
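
Remark (added numerical illustration, not part of the original notes). The inequality can be checked by Monte Carlo: draw X from f0 = N(0, 1) and compare sample averages of log f0(X) and log f1(X) for another density f1 = N(1, 1). scipy.stats is an assumed dependency.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=100_000)                 # X ~ f0 = N(0, 1)

e_log_f0 = norm.logpdf(x, loc=0.0, scale=1.0).mean()
e_log_f1 = norm.logpdf(x, loc=1.0, scale=1.0).mean()   # f1 = N(1, 1)

# Information inequality: E(log f0(X)) >= E(log f1(X)).
print(e_log_f0, e_log_f1, e_log_f0 >= e_log_f1)
```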

Let X1 , . . . , Xn be a random sample from a discrete (continuous) distribution with pmf (pdf) f (·|θ) ,
where θ ∈ Θ is unknown. It follows from the information inequality that

Eθ (log f (X|θ)) ≥ Eθ (log f (X|θ∗ ))

for any θ∗ ∈ Θ, where Eθ (·) denotes the expected value computed using the true (unknown) cdf F (·|θ) of
the random variable X. As a consequence, the true parameter value θ solves the problem of maximizing

Eθ (log f (X|θ∗ ))

with respect to θ∗ ∈ Θ; that is,


Eθ (log f (X|θ)) = maxθ∗ ∈Θ Eθ (log f (X|θ∗ )) .

The sample analogue of this problem is that of maximizing the average log likelihood

(1/n) ∑_{i=1}^{n} log f(Xi|θ∗)

with respect to θ∗ ∈ Θ. The average log likelihood is a strictly increasing function of L(θ∗|X1 , . . . , Xn).
Specifically,

(1/n) ∑_{i=1}^{n} log f(Xi|θ∗) = (1/n) log L(θ∗|X1 , . . . , Xn).

Therefore, a maximum likelihood estimator θ̂(X1 , . . . , Xn) maximizes the average log likelihood with respect to θ∗:

(1/n) ∑_{i=1}^{n} log f(Xi|θ̂(X1 , . . . , Xn)) = max_{θ∗∈Θ} (1/n) ∑_{i=1}^{n} log f(Xi|θ∗).

Example. Suppose Xi ∼ i.i.d. Ber (p) , where p ∈ [0, 1] is an unknown parameter. Each Xi is discrete
with pmf

f(x|p) = 1 − p  for x = 0,
         p      for x = 1,
         0      otherwise

       = p^x (1 − p)^{1−x}  for x ∈ {0, 1},
         0                  otherwise,

       = p^x (1 − p)^{1−x} · 1(x ∈ {0, 1}),

where 0⁰ = 1 and 1(·) is the indicator function. It suffices to consider the case where xi ∈ {0, 1} for
i = 1, . . . , n, as the likelihood is zero for all other values of x = (x1 , . . . , xn )′.
The likelihood given x is

L(p|x) = ∏_{i=1}^{n} f(xi|p) = ∏_{i=1}^{n} p^{xi} (1 − p)^{1−xi}
       = p^{∑_{i=1}^{n} xi} (1 − p)^{n − ∑_{i=1}^{n} xi},  p ∈ [0, 1],

while the log likelihood given x is


l(p|x) = ∑_{i=1}^{n} log f(xi|p) = ∑_{i=1}^{n} (xi log p + (1 − xi) log(1 − p))
       = (∑_{i=1}^{n} xi) log p + (n − ∑_{i=1}^{n} xi) log(1 − p),  p ∈ [0, 1],

where 0 · log 0 = 0.
If ∑_{i=1}^{n} xi = 0, then

L(p|x) = (1 − p)^n

is a decreasing function of p and p = 0 maximizes L(p|x) with respect to p ∈ [0, 1].
If ∑_{i=1}^{n} xi = n, then

L(p|x) = p^n

and p = 1 maximizes L(p|x) with respect to p ∈ [0, 1].
In intermediate cases where 0 < ∑_{i=1}^{n} xi < n, the maximum likelihood estimate can be found by solving
the first-order condition for an interior maximum:

(d/dp) l(p|x)|_{p=p̂} = (∑_{i=1}^{n} xi)(1/p̂) − (n − ∑_{i=1}^{n} xi)(1/(1 − p̂)) = 0

⇔

p̂ = (∑_{i=1}^{n} xi)/n.

This unique solution to the first-order condition is a maximizer because

(d²/dp²) l(p|x)|_{p=p̂} = −(∑_{i=1}^{n} xi)(1/p̂²) − (n − ∑_{i=1}^{n} xi)(1/(1 − p̂)²) < 0.

Combining the results, we see that


p̂ = X̄ = (∑_{i=1}^{n} Xi)/n

is the maximum likelihood estimator of p. In this case, the maximum likelihood estimator coincides with
the method of moments estimator.
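
Remark (added numerical illustration, not part of the original notes). One can maximize the Bernoulli log likelihood numerically and check that the maximizer agrees with X̄; numpy and scipy.optimize are assumed dependencies.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=200)        # Bernoulli(p = 0.3) sample

def neg_loglik(p):
    # -l(p|x) = -[(sum x_i) log p + (n - sum x_i) log(1 - p)]
    s, n = x.sum(), x.size
    return -(s * np.log(p) + (n - s) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())                    # numerical MLE vs closed-form p_hat = Xbar
```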

Exercise. Verify that p∗ = p maximizes


Ep (log f (Xi |p∗ )) = Ep (Xi log p∗ + (1 − Xi ) log (1 − p∗ ))

= p · log p∗ + (1 − p) · log (1 − p∗ )

with respect to p∗ ∈ [0, 1] .

When a unique maximum likelihood estimator θ̂ of θ = (θ1 , . . . , θk )′ exists, it can usually be constructed
by solving the likelihood equations

(∂/∂θj) l(θ|X1 , . . . , Xn)|_{θ=θ̂} = 0,  j = 1, . . . , k,

and verifying that a second-order condition holds. For instance, if θ is a scalar parameter, a unique solution
θ̂ to

(d/dθ) l(θ|x1 , . . . , xn)|_{θ=θ̂} = 0

is a maximum likelihood estimate if l(·|x1 , . . . , xn) is twice differentiable, Θ is an interval (possibly unbounded) and

(d²/dθ²) l(θ|x1 , . . . , xn)|_{θ=θ̂} < 0.

Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. The
marginal pdf of Xi is f(·|µ, σ²), where

f(x|µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) = (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)).

The likelihood given x = (x1 , . . . , xn )′ is

L(µ, σ²|x) = ∏_{i=1}^{n} f(xi|µ, σ²) = ∏_{i=1}^{n} (2πσ²)^{−1/2} exp(−(xi − µ)²/(2σ²))
           = (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^{n} (xi − µ)²),

while the log likelihood given x is

l(µ, σ²|x) = ∑_{i=1}^{n} log f(xi|µ, σ²)
           = ∑_{i=1}^{n} [−(1/2) log(2π) − (1/2) log σ² − (xi − µ)²/(2σ²)]
           = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^{n} (xi − µ)².

The likelihood equations are:


(∂/∂µ) l(µ, σ²|X1 , . . . , Xn)|_{θ=θ̂} = (1/σ̂²) ∑_{i=1}^{n} (Xi − µ̂) = 0

and

(∂/∂σ²) l(µ, σ²|X1 , . . . , Xn)|_{θ=θ̂} = −n/(2σ̂²) + (1/(2σ̂⁴)) ∑_{i=1}^{n} (Xi − µ̂)² = 0,

where θ = (µ, σ²)′ and θ̂ = (µ̂, σ̂²)′. The unique solution to these equations is

µ̂ = (1/n) ∑_{i=1}^{n} Xi = X̄,
σ̂² = (1/n) ∑_{i=1}^{n} (Xi − µ̂)² = (1/n) ∑_{i=1}^{n} (Xi − X̄)².

The matrix

(∂²/∂θ∂θ′) l(µ, σ²|x)|_{θ=θ̂} = [ −n/σ̂²        0
                                   0       −n/(2σ̂⁴) ]

is negative definite, so (µ̂, σ̂²)′ is a local maximizer of l(µ, σ²|x). In fact,

lim_{|µ|→∞} L(µ, σ²|x) = 0

for any σ² > 0 and

lim_{σ²→0} L(µ, σ²|x) = lim_{σ²→∞} L(µ, σ²|x) = 0

for any µ ∈ R, so (µ̂, σ̂²)′ is the maximum likelihood estimator of (µ, σ²)′. Once again, the maximum
likelihood estimator coincides with the method of moments estimator.
In the present case, the second-order condition can also be verified using univariate calculus (Casella
and Berger, Example 7.2.11).
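
Remark (added numerical illustration, not part of the original notes). The closed-form solution can be checked by maximizing the log likelihood numerically, parameterizing σ² through its logarithm to keep it positive; numpy and scipy.optimize are assumed dependencies.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(2.0, 3.0, size=500)        # N(mu = 2, sigma^2 = 9) sample

def neg_loglik(par):
    mu, log_s2 = par
    s2 = np.exp(log_s2)                   # enforces sigma^2 > 0
    n = x.size
    return 0.5 * n * np.log(2 * np.pi * s2) + ((x - mu) ** 2).sum() / (2 * s2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_hat, s2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, s2_hat, x.mean(), x.var())  # numpy's var divides by n by default
```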


Exercise. Verify that (µ∗, σ∗²) = (µ, σ²) maximizes

E_(µ,σ²)(log f(Xi|µ∗, σ∗²)) = E_(µ,σ²)[−(1/2) log(2π) − (1/2) log σ∗² − (Xi − µ∗)²/(2σ∗²)]
                            = −(1/2) log 2π − (1/2) log σ∗² − (1/(2σ∗²))(σ² + (µ − µ∗)²)

with respect to (µ∗, σ∗²) ∈ R × R++.

One case where the maximum likelihood estimator cannot be constructed by solving the likelihood
equations is the following.

Example. Suppose Xi ∼ i.i.d. U [0, θ] , where θ > 0 is an unknown parameter. Each Xi is continu-
ous with pdf

f(x|θ) = 1/θ  for 0 ≤ x ≤ θ,
         0    otherwise
       = (1/θ) 1(0 ≤ x ≤ θ).

It suffices to consider the case where xi ≥ 0 for i = 1, . . . , n, as the likelihood is zero for all other values of
x = (x1 , . . . , xn )′.
The likelihood given x is

L(θ|x) = ∏_{i=1}^{n} f(xi|θ) = ∏_{i=1}^{n} (1/θ) 1(0 ≤ xi ≤ θ)
       = (1/θ^n) ∏_{i=1}^{n} 1(0 ≤ xi ≤ θ)
       = (1/θ^n) 1(max_{1≤i≤n} xi ≤ θ),  θ > 0,

where the third equality uses the fact that xi ≥ 0 for i = 1, . . . , n.


The likelihood given x is zero for θ < max1≤i≤n xi and is a decreasing function of θ for θ ≥ max1≤i≤n xi .
As a consequence, the maximum likelihood estimator of θ is

θ̂ = max1≤i≤n Xi .

In this case, the maximum likelihood estimator is different from the method of moments estimator (the latter
is 2 · X̄). Unlike the method of moments estimator, the maximum likelihood estimator has the property


that θ̂ ≥ Xi for every i. On the other hand, since max_{1≤i≤n} Xi is a lower bound on the true θ, θ̂ will tend
to underestimate θ. Indeed, P(θ̂ ≤ θ) = 1.
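
Remark (added numerical illustration, not part of the original notes). A short Monte Carlo comparison makes these properties visible: 2X̄ is unbiased but more variable, while X(n) sits below θ with probability one. numpy is an assumed dependency.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 1.0, 20, 20_000

x = rng.uniform(0.0, theta, size=(reps, n))
mom = 2 * x.mean(axis=1)                  # method of moments: 2 * Xbar
mle = x.max(axis=1)                       # maximum likelihood: X_(n)

for name, est in [("MoM", mom), ("MLE", mle)]:
    bias = est.mean() - theta
    mse = ((est - theta) ** 2).mean()
    print(name, "bias:", round(bias, 4), "MSE:", round(mse, 5))

print("P(MLE <= theta) approx", (mle <= theta).mean())
```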

Exercise. Verify that θ∗ = θ maximizes

Eθ (log f (Xi |θ∗ )) = − log θ∗ − ∞ · Pθ (Xi > θ∗ )

with respect to θ∗ > 0, where ∞ · 0 = 0.

In the sense of the following definition, the maximum likelihood estimator max_{1≤i≤n} Xi in the preceding
example is the nth order statistic and is often denoted by X(n).

Definition. Let X1 , . . . , Xn be a random sample. The order statistics are the sample values placed
in ascending order. They are denoted by X(1) , . . . , X(n).

Example. For any random sample X1 , . . . , Xn , X(1) = min_{1≤i≤n} Xi and X(n) = max_{1≤i≤n} Xi.

Remark. The pmf (pdf) of order statistics obtained from a discrete (continuous) distribution can be
characterized using combinatorial arguments (Casella and Berger, Section 5.4).

Remark. Maximum likelihood estimators are equivariant in the sense that if θ̂ is a maximum likelihood
estimator of θ, then τ(θ̂) is a maximum likelihood estimator of τ(θ) for any function τ(·) defined
on Θ (Casella and Berger, Theorem 7.2.10).

Let X1 , . . . , Xn be a random sample from a distribution with cdf F(·|θ), where θ ∈ Θ ⊆ Rk is an
unknown parameter vector. An estimator θ̂ of θ is called a Z-estimator if it is (implicitly) defined as a
solution of a system of equations of the form

(1/n) ∑_{i=1}^{n} ψj(Xi, θ̂) = 0,  j = 1, . . . , k,

where each ψj : R × Θ → R is a function. Z-estimators are usually motivated by showing that θ is the
only value of θ∗ ∈ Θ for which

Eθ ψj(X, θ∗) = 0  ∀ j ∈ {1, . . . , k}.

The leading special case is the method of moments estimator, which is a Z-estimator with

ψj(Xi, θ∗) = Xi^j − ∫_{−∞}^{∞} x^j dF(x|θ∗),  j = 1, . . . , k,

or, more generally,

ψj(Xi, θ∗) = gj(Xi) − ∫_{−∞}^{∞} gj(x) dF(x|θ∗),  j = 1, . . . , k.

An estimator θ̂ of θ is called an M-estimator if it is (implicitly) defined as

(1/n) ∑_{i=1}^{n} m(Xi, θ̂) = max_{θ∗∈Θ} (1/n) ∑_{i=1}^{n} m(Xi, θ∗),

where m : R × Θ → R is a function. M -estimators are usually motivated by showing that θ is the unique
maximizer (with respect to θ∗ ∈ Θ) of

Eθ m (X, θ∗ ) .

The leading special case is the maximum likelihood estimator, which is an M -estimator with

m (Xi , θ∗ ) = log f (Xi |θ∗ ) ,

where f (·|θ∗ ) is the pmf/pdf of the cdf F (·|θ∗ ) .


Many M-estimators satisfy first-order conditions of the form

(1/n) ∑_{i=1}^{n} (∂/∂θj) m(Xi, θ)|_{θ=θ̂} = 0,  j = 1, . . . , k,

and can therefore be interpreted as Z-estimators with

ψj(Xi, θ∗) = (∂/∂θj) m(Xi, θ)|_{θ=θ∗},  j = 1, . . . , k.

In particular, many maximum likelihood estimators can be interpreted as method of moments estimators
with

gj(Xi, θ∗) = (∂/∂θj) log f(Xi|θ)|_{θ=θ∗},  j = 1, . . . , k.
It will almost always be possible to find a set of functions {ψj : j = 1, . . . , k} such that θ is the only
value of θ∗ ∈ Θ for which

Eθ ψ j (X, θ∗ ) = 0 ∀j ∈ {1, . . . , k}

or a function m such that θ is the unique maximizer (with respect to θ∗ ∈ Θ) of

Eθ m (X, θ∗ ) .

An important exception occurs when the model is not identified.

Definition. Let (Ω, B, {Pθ : θ ∈ Θ}) be a parametric statistical model. A parameter value θ1 ∈ Θ
is identified if there does not exist another parameter value θ2 ∈ Θ such that Pθ1 = Pθ2. The model
(Ω, B, {Pθ : θ ∈ Θ}) is identified if every parameter value θ ∈ Θ is identified.

In other words, a model is identified if knowledge of the true marginal cdf F(·|θ) implies knowledge
of the parameter θ. This is a very modest and reasonable requirement. Identification is a property of


the parameterization/specification of a statistical model. When the cdfs {F(·|θ) : θ ∈ Θ} characterizing a
statistical model are specified directly, identification usually holds. On the other hand, problems may arise
when the observed sample is assumed to be generated by a transformation model.

Example. Suppose X1 , . . . , Xn is a random sample generated by the model

Xi = 1(Xi∗ > 0) = 0  for Xi∗ ≤ 0,
                  1  for Xi∗ > 0,      Xi∗ ∼ i.i.d. N(µ, σ²),

where µ ∈ R and σ² > 0 are unknown parameters. In this case, Xi ∼ i.i.d. Ber(Φ(µ/σ)), where Φ(·) is
the cdf of the standard normal distribution:

Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−t²/2) dt.

As a consequence, the marginal distribution of each Xi depends on (µ, σ²)′ only through µ/σ and any
parameter value (µ1, σ1²)′ is unidentified. We can achieve identification by imposing an identifying assumption
on the parameters. In this case, a natural identifying assumption is σ = 1.
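
Remark (added numerical illustration, not part of the original notes). The identification failure is easy to see numerically: two distinct parameter values with the same ratio µ/σ produce exactly the same Bernoulli distribution for Xi. scipy.stats is an assumed dependency.

```python
from scipy.stats import norm

# Two distinct parameter values with the same ratio mu/sigma ...
p1 = norm.cdf(1.0 / 2.0)    # (mu, sigma) = (1, 2)
p2 = norm.cdf(2.0 / 4.0)    # (mu, sigma) = (2, 4)

# ... imply the same P(Xi = 1) = Phi(mu/sigma), so the data cannot tell them apart.
print(p1, p2, p1 == p2)
```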

Remark. If θ1 ∈ Θ is an unidentified parameter value, there is another parameter value θ2 ∈ Θ such that
F(·|θ1) = F(·|θ2), implying

Eθ1 g(X) = ∫_{−∞}^{∞} g(x) dF(x|θ1) = ∫_{−∞}^{∞} g(x) dF(x|θ2) = Eθ2 g(X)

for any function g. In particular, any solution θ∗ to a system of equations of the form

Eθ1 ψ j (X, θ∗ ) = 0 ∀j ∈ {1, . . . , k}

will also be a solution to the following system of equations:

Eθ2 ψ j (X, θ∗ ) = 0 ∀j ∈ {1, . . . , k} .

Similarly, any maximizer (with respect to θ∗ ∈ Θ) of

Eθ1 m (X, θ∗ )

will also be a maximizer of

Eθ2 m (X, θ∗ ) .

If one attempts to estimate the parameters of an unidentified model, unique method of moments (maximum
likelihood) estimators typically cannot be found.

An estimator W(X1 , . . . , Xn) of θ is a good estimator (only) if it is “close” to θ in some probabilistic
sense. We will use mean squared error as our measure of closeness.


Definition. The mean squared error (MSE) matrix of an estimator θ̂ of θ is the function (of θ) given by

MSEθ(θ̂) = Eθ[(θ̂ − θ)(θ̂ − θ)′],  θ ∈ Θ.

Definition. The bias of an estimator θ̂ of θ is

Biasθ(θ̂) = Eθ(θ̂) − θ,  θ ∈ Θ.

An estimator θ̂ of θ is unbiased if

Eθ(θ̂) = θ  ∀ θ ∈ Θ.

Many results derived using MSE generalize to other measures of closeness. It is convenient to use MSE
because it is analytically tractable and has a straightforward interpretation in terms of the variance and
bias of the estimator θ̂. Specifically,

MSEθ(θ̂) = Eθ[(θ̂ − θ)(θ̂ − θ)′]
         = Eθ[(θ̂ − Eθ(θ̂) + Eθ(θ̂) − θ)(θ̂ − Eθ(θ̂) + Eθ(θ̂) − θ)′]
         = Eθ[(θ̂ − Eθ(θ̂))(θ̂ − Eθ(θ̂))′]
           + (Eθ(θ̂) − θ)(Eθ(θ̂) − θ)′
           + Eθ[θ̂ − Eθ(θ̂)](Eθ(θ̂) − θ)′
           + (Eθ(θ̂) − θ) Eθ[θ̂ − Eθ(θ̂)]′
         = Varθ(θ̂) + Biasθ(θ̂) · Biasθ(θ̂)′

because Eθ[θ̂ − Eθ(θ̂)] = 0 and Eθ(θ̂) − θ = Biasθ(θ̂) is non-random. In particular,

MSEθ(θ̂) = Varθ(θ̂) + Biasθ(θ̂)²

when θ is a scalar parameter.


Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. Two
estimators of σ² are the maximum likelihood estimator

σ̂² = (1/n) ∑_{i=1}^{n} (Xi − X̄)²

and the sample variance

S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)².

The sample variance satisfies

(n − 1)S²/σ² = (1/σ²) ∑_{i=1}^{n} (Xi − X̄)² ∼ χ²(n − 1),

implying

E(S²) = (σ²/(n − 1)) · E[(1/σ²) ∑_{i=1}^{n} (Xi − X̄)²] = (σ²/(n − 1)) · (n − 1) = σ²

and

Var(S²) = (σ²/(n − 1))² · Var[(1/σ²) ∑_{i=1}^{n} (Xi − X̄)²] = (σ²/(n − 1))² · 2(n − 1) = (2/(n − 1)) σ⁴.

In particular,

MSE_(µ,σ²)(S²) = (2/(n − 1)) σ⁴.

Similarly,

E(σ̂²) = (σ²/n) · E[(1/σ²) ∑_{i=1}^{n} (Xi − X̄)²] = (σ²/n) · (n − 1) = ((n − 1)/n) σ²

and

Var(σ̂²) = (σ²/n)² · Var[(1/σ²) ∑_{i=1}^{n} (Xi − X̄)²] = (σ²/n)² · 2(n − 1) = (2(n − 1)/n²) σ⁴,

so

MSE_(µ,σ²)(σ̂²) = (2(n − 1)/n²) σ⁴ + (σ²/n)² = ((2n − 1)/n²) σ⁴ = ((2 − 1/n)/n) σ⁴ < (2/(n − 1)) σ⁴ = MSE_(µ,σ²)(S²).

Unlike S 2 , σ̂ 2 is biased. Nonetheless, its variance is so much smaller than that of S 2 that its MSE is
smaller for all values of µ and σ 2 .
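
Remark (added numerical illustration, not part of the original notes). These formulas can be verified by simulation: draw repeated normal samples, compute both estimators, and compare empirical MSEs with the expressions above. numpy is an assumed dependency.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2, n, reps = 0.0, 2.0, 10, 50_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sig_hat = x.var(axis=1, ddof=0)           # MLE: divides by n
s2 = x.var(axis=1, ddof=1)                # sample variance: divides by n - 1

mse_hat = ((sig_hat - sigma2) ** 2).mean()
mse_s2 = ((s2 - sigma2) ** 2).mean()

print("MSE(sigma_hat^2):", mse_hat, "theory:", (2 - 1 / n) / n * sigma2 ** 2)
print("MSE(S^2):        ", mse_s2, "theory:", 2 / (n - 1) * sigma2 ** 2)
```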


In this example, the MSE ranking does not depend on the true value of the parameter(s). In spite of
this we do not know whether an even better estimator exists. To answer that question, it might appear
natural to look for an estimator that minimizes MSE uniformly in µ and σ2 . Unfortunately, such an esti-
mator does not exist.

Example. As a competitor to σ̂², consider the estimator σ̃² = 1. Evidently, σ̃² is a perfect estimator
if σ² happens to equal unity, but is an inferior estimator for most other values of σ².

The point is that in order to find a uniformly (in the value of the parameters) best estimator, we need
to impose certain restrictions on the class of estimators under consideration.

Definition. Let W be a class of estimators. An estimator θ̂ of θ is efficient relative to W if

MSEθ(θ̂) ≤ MSEθ(W)  ∀ θ ∈ Θ

for every W ∈ W.

Remark. When θ is a vector, the notation “MSEθ(θ̂) ≤ MSEθ(W)” is shorthand for “the matrix
MSEθ(W) − MSEθ(θ̂) is positive semi-definite”.
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. The
estimator (of σ²) σ̂² is efficient relative to W = {σ̂², S²}.

Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. Consider
the following class of estimators (of σ²):

W = {σ̃²_c = (1/c) ∑_{i=1}^{n} (Xi − X̄)² : c > 0}.

The estimators

σ̂² = (1/n) ∑_{i=1}^{n} (Xi − X̄)² = σ̃²_n

and

S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)² = σ̃²_{n−1}

are both members of W. It is not hard to show that σ̃²_{n+1} is efficient relative to W.

A “natural” class of estimators with fairly general applicability is the class

Wu (θ) = {W : Eθ (W ) = θ and V arθ (W ) < ∞ for every θ ∈ Θ}

of unbiased estimators of θ with finite variance. For unbiased estimators, the MSE is simply the variance.


Definition. An estimator θ̂ ∈ Wu(θ) of θ is a uniform minimum variance unbiased (UMVU) estimator
of θ if θ̂ is efficient relative to Wu(θ).

It turns out that UMVU estimators often exist. The Rao-Blackwell Theorem facilitates the search for
UMVU estimators by showing that UMVU estimators can always be based on statistics that are sufficient
in the sense of the following definition.

Definition. Let X1 , . . . , Xn be a random sample from a distribution with cdf F(·|θ), where θ ∈ Θ is
unknown. A statistic T = T(X1 , . . . , Xn) is a sufficient statistic for θ if the conditional distribution of
(X1 , . . . , Xn )′ given T does not depend on θ.

Theorem (Rao-Blackwell Theorem; Casella and Berger, Theorem 7.3.17). Let θ̂ ∈ Wu (θ)
and let T be any sufficient statistic for θ. Then
θ̃ = E_{X|T}(θ̂|T) ∈ Wu(θ)

and

Varθ(θ̃) ≤ Varθ(θ̂)  ∀ θ ∈ Θ.

Remark. The inequality Varθ(θ̃) ≤ Varθ(θ̂) is strict unless Pθ(θ̃ = θ̂) = 1.

Proof. The distribution of the estimator θ̂ = θ̂ (X1 , . . . , Xn ) conditional on T does not depend on θ
when T is sufficient. Therefore,
θ̃ = E_{X|T}(θ̂|T)

is a function of T = T(X1 , . . . , Xn) that does not depend on the true (unknown) value of θ. In particular,
θ̃ is an estimator.
The estimator θ̃ is unbiased because

Eθ(θ̃) = Eθ(E_{X|T}(θ̂|T)) = Eθ(θ̂) = θ,

where the second equality uses the law of iterated expectations and the last equality uses the fact that θ̂
is unbiased.
Applying the conditional variance identity, we have:

Varθ(θ̂) = Varθ(E_{X|T}(θ̂|T)) + Eθ(Var_{X|T}(θ̂|T))
        ≥ Varθ(E_{X|T}(θ̂|T)) = Varθ(θ̃). ¥


Remark. With a little more effort, a proof of the relation Varθ(θ̂) ≥ Varθ(θ̃) can be based on the
conditional version of Jensen’s inequality:

Varθ(θ̂) = Eθ[(θ̂ − θ)²] = Eθ(E_{X|T}[(θ̂ − θ)²|T])
         ≥ Eθ[(E_{X|T}(θ̂|T) − θ)²] = Eθ[(θ̃ − θ)²] = Varθ(θ̃),

where the second equality uses the law of iterated expectations and the inequality uses the conditional
version of Jensen’s inequality. This method of proof is applicable whenever the measure of closeness is of
the form Eθ(L(θ̂, θ)), where L(θ̂, θ) is a convex function of θ̂:

Eθ(L(θ̂, θ)) = Eθ(E_{X|T}(L(θ̂, θ)|T))
            ≥ Eθ(L(E_{X|T}(θ̂|T), θ)) = Eθ(L(θ̃, θ)).

For instance, |θ̂ − θ| is a convex function of θ̂ and therefore

Eθ(|θ̂ − θ|) = Eθ(E_{X|T}(|θ̂ − θ| |T))
            ≥ Eθ(|E_{X|T}(θ̂|T) − θ|) = Eθ(|θ̃ − θ|).

It follows from the Rao-Blackwell Theorem that when looking for UMVU estimators, there is no need
to consider estimators that cannot be written as functions of a sufficient statistic. Indeed, it suffices to
look at estimators that are necessary statistics in the sense that they can be written as functions of every
sufficient statistic. Here, the word “every” is crucial because any estimator is a function of (X1 , . . . , Xn )0
and (X1 , . . . , Xn )0 is always a sufficient statistic.
Of course, the usefulness of the Rao-Blackwell Theorem depends on the extent to which sufficient statistics
of low dimension are available and easy to find. As it turns out, sufficient statistics of the same
dimension as θ are available in many cases. In cases where determination of sufficient statistics by means
of the definition is tedious, the following characterization of sufficiency may be useful.

Theorem (Factorization Criterion; Casella and Berger, Theorem 6.2.6). Let X1 , . . . , Xn be
a random sample from a discrete (continuous) distribution with cdf F(·|θ), where θ ∈ Θ is unknown. A
statistic T = T(X1 , . . . , Xn) is a sufficient statistic for θ if and only if there exist functions g(·|·) and
h(·) such that fX(·|θ) is a pmf (pdf) of (X1 , . . . , Xn )′, where

fX (x1 , . . . , xn |θ) = g (T (x1 , . . . , xn ) |θ) h (x1 , . . . , xn )

for every (x1 , . . . , xn ) ∈ Rn and every θ ∈ Θ.


Example. For any random sample X1 , . . . , Xn from a discrete (continuous) distribution, two sufficient
statistics are (X1 , . . . , Xn )′ and (X(1) , . . . , X(n) )′.

Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ [0, 1] is an unknown parameter. The joint pmf of
(X1 , . . . , Xn )′ is

fX(x1 , . . . , xn |p) = ∏_{i=1}^{n} p^{xi} (1 − p)^{1−xi} · 1(xi ∈ {0, 1})
                      = p^{∑_{i=1}^{n} xi} (1 − p)^{n − ∑_{i=1}^{n} xi} · ∏_{i=1}^{n} 1(xi ∈ {0, 1})
                      = g(∑_{i=1}^{n} xi |p) · h(x1 , . . . , xn),

where

g(t|p) = p^t (1 − p)^{n−t}

and

h(x1 , . . . , xn) = ∏_{i=1}^{n} 1(xi ∈ {0, 1}).

Therefore, ∑_{i=1}^{n} Xi is a sufficient statistic for p.
The maximum likelihood (and method of moments) estimator

p̂ = (∑_{i=1}^{n} Xi)/n

is a function of ∑_{i=1}^{n} Xi.

Example. Suppose Xi ∼ i.i.d. U [0, θ] , where θ > 0 is an unknown parameter. The joint pdf of
(X1 , . . . , Xn )′ is

fX(x1 , . . . , xn |θ) = ∏_{i=1}^{n} (1/θ) 1(0 ≤ xi ≤ θ)
                      = (1/θ^n) ∏_{i=1}^{n} 1(0 ≤ xi ≤ θ)
                      = (1/θ^n) 1(x(n) ≤ θ) · 1(x(1) ≥ 0)
                      = g(x(n)|θ) · h(x1 , . . . , xn),

where

g(t|θ) = (1/θ^n) 1(t ≤ θ)

and

h(x1 , . . . , xn) = 1(x(1) ≥ 0).

Therefore, X(n) = max_{1≤i≤n} Xi is a sufficient statistic for θ.


The method of moments estimator θ̂MM = 2X̄ is unbiased but is not a function of X(n) (unless n = 1)
and therefore cannot be UMVU. On the other hand, the maximum likelihood estimator θ̂ML = X(n) is
based on X(n) .
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In
this case, the joint pdf of (X1 , . . . , Xn )′ is

fX(x1 , . . . , xn |µ, σ²) = ∏_{i=1}^{n} (2πσ²)^{−1/2} exp(−(xi − µ)²/(2σ²))
    = (2π)^{−n/2} (σ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^{n} (xi² + µ² − 2µxi))
    = (2π)^{−n/2} (σ²)^{−n/2} exp(−nµ²/(2σ²)) exp((µ/σ²) ∑_{i=1}^{n} xi − (1/(2σ²)) ∑_{i=1}^{n} xi²)
    = g(∑_{i=1}^{n} xi , ∑_{i=1}^{n} xi² |µ, σ²) · h(x1 , . . . , xn),

where

g(t1 , t2 |µ, σ²) = (2π)^{−n/2} (σ²)^{−n/2} exp(−nµ²/(2σ²)) exp((µ/σ²) t1 − (1/(2σ²)) t2)

and

h(x1 , . . . , xn) = 1.

Therefore, (∑_{i=1}^{n} Xi , ∑_{i=1}^{n} Xi²)′ is a sufficient statistic for (µ, σ²)′.
The maximum likelihood (and method of moments) estimator of µ,

µ̂ = X̄ = (∑_{i=1}^{n} Xi)/n,

is a function of (∑_{i=1}^{n} Xi , ∑_{i=1}^{n} Xi²)′, as is the maximum likelihood (and method of moments) estimator
of σ²,

σ̂² = (1/n) ∑_{i=1}^{n} Xi² − ((1/n) ∑_{i=1}^{n} Xi)².

A sufficient statistic T is particularly useful if unbiased estimators based on T are essentially unique in
the sense that any two unbiased estimators based on T are equal with probability one. Indeed, if we can
somehow find a sufficient statistic T such that unbiased estimators based on T are essentially unique, then
any θ̂ ∈ Wu(θ) based on T is UMVU. The Lehmann-Scheffé Theorem establishes essential uniqueness of
unbiased estimators based on a complete sufficient statistic T.

Definition. A sufficient statistic T for θ is complete if

Eθ (g (T )) = 0 ∀θ ∈ Θ

implies

Pθ (g (T ) = 0) = 1 ∀θ ∈ Θ.

Theorem (Lehmann-Scheffé Theorem; Casella and Berger, Theorem 7.5.1). Unbiased esti-
mators based on complete sufficient statistics are essentially unique.

Proof. Suppose θ̂ = θ̂(T) and θ̃ = θ̃(T) are unbiased estimators of θ based on a sufficient statistic T.
Then

Eθ(θ̂(T) − θ̃(T)) = Eθ(θ̂(T)) − Eθ(θ̃(T)) = θ − θ = 0  ∀ θ ∈ Θ

because θ̂ and θ̃ are unbiased. If T is complete, then

Pθ(θ̂(T) − θ̃(T) = 0) = Pθ(θ̂(T) = θ̃(T)) = 1  ∀ θ ∈ Θ. ¥

Corollary. If T is a complete sufficient statistic and θ̂ ∈ Wu(θ) is based on T, then θ̂ is a UMVU
estimator of θ.

Proof. Suppose there exists an estimator θ̃ ∈ Wu(θ) such that Varθ∗(θ̃) < Varθ∗(θ̂) for some θ∗ ∈ Θ.
By the Rao-Blackwell theorem, E(θ̃|T) ∈ Wu(θ) is based on T and satisfies

Varθ∗(E(θ̃|T)) ≤ Varθ∗(θ̃) < Varθ∗(θ̂).

In view of the Lehmann-Scheffé theorem, this is impossible because Varθ∗(E(θ̃|T)) < Varθ∗(θ̂) implies
that θ̂ is not an essentially unique unbiased estimator of θ based on T. ¥


Complete sufficient statistics can often be found if the family {f (·|θ) : θ ∈ Θ} of pmfs/pdfs is an expo-
nential family.

Definition. A family {f(·|θ) : θ ∈ Θ} of pmfs/pdfs is called a d-dimensional exponential family if there
exist functions h : R → R+ , c : Θ → R+ , ηi : Θ → R (i = 1, . . . , d) and ti : R → R (i = 1, . . . , d) such that

f(x|θ) = h(x) c(θ) exp(∑_{i=1}^{d} ηi(θ) ti(x)),  ∀ x ∈ R, θ ∈ Θ.

Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ (0, 1). The marginal pmf satisfies

f(x|p) = p^x (1 − p)^{1−x} · 1(x ∈ {0, 1})
       = (p/(1 − p))^x (1 − p) · 1(x ∈ {0, 1})
       = exp(log(p/(1 − p)) · x) (1 − p) · 1(x ∈ {0, 1})
       = h(x) c(p) exp(η(p) t(x)),

where

h(x) = 1(x ∈ {0, 1}),
c(p) = 1 − p,
η(p) = log(p/(1 − p)),
t(x) = x.

Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In
this case, the marginal pdf satisfies

f(x|µ, σ²) = (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²))
           = (2π)^{−1/2} (σ²)^{−1/2} exp(−µ²/(2σ²)) exp((µ/σ²) x − (1/(2σ²)) x²)
           = h(x) c(µ, σ²) exp(∑_{i=1}^{2} ηi(µ, σ²) ti(x)),

where

h(x) = 1,
c(µ, σ²) = (2π)^{−1/2} (σ²)^{−1/2} exp(−µ²/(2σ²)),
η1(µ, σ²) = µ/σ²,
t1(x) = x,
η2(µ, σ²) = −1/(2σ²),
t2(x) = x².

Theorem (Casella and Berger, Theorem 6.2.25). Let X1 , . . . , Xn be a random sample from a discrete
(continuous) exponential family with pmf (pdf)
f(x|θ) = h(x) c(θ) exp(∑_{i=1}^{d} ηi(θ) ti(x)),  x ∈ R, θ ∈ Θ.

The sufficient (for θ) statistic

T(X1 , . . . , Xn) = (∑_{i=1}^{n} t1(Xi) , . . . , ∑_{i=1}^{n} td(Xi))′

is complete if the set {(η1(θ) , . . . , ηd(θ)) : θ ∈ Θ} contains an open set.

Remark. A set A ⊆ Rd contains an open set if and only if we can find constants a1^L < a1^U , . . . , ad^L < ad^U
such that

[a1^L , a1^U] × . . . × [ad^L , ad^U] ⊆ A;

that is, a set A ⊆ Rd contains an open set if and only if it contains a d-dimensional rectangle.

Example. Suppose Xi ∼ i.i.d. Ber(p), where p ∈ (0, 1). The marginal pmf satisfies

f(x|p) = h(x) c(p) exp(η(p) t(x)),

where

h(x) = 1(x ∈ {0, 1}),
c(p) = 1 − p,
η(p) = log(p/(1 − p)),
t(x) = x.

The set

{η(p) : p ∈ (0, 1)} = {log(p/(1 − p)) : p ∈ (0, 1)} = R

is open, so

∑_{i=1}^{n} t(Xi) = ∑_{i=1}^{n} Xi

is a complete sufficient statistic.

The maximum likelihood estimator

p̂ = (∑_{i=1}^{n} Xi)/n

is unbiased,

Ep(p̂) = Ep(Xi) = p,

and is based on ∑_{i=1}^{n} Xi. Therefore, p̂ is a UMVU estimator of p.
The conclusion is not affected if Θ = [0, 1] is considered, as Varp(p̂) = 0 when p ∈ {0, 1}.
Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. In
this case, the marginal pdf satisfies

f(x|µ, σ²) = h(x) c(µ, σ²) exp(∑_{i=1}^{2} ηi(µ, σ²) ti(x)),

where

h(x) = 1,
c(µ, σ²) = (2π)^{−1/2} (σ²)^{−1/2} exp(−µ²/(2σ²)),
η1(µ, σ²) = µ/σ²,
t1(x) = x,
η2(µ, σ²) = −1/(2σ²),
t2(x) = x².

The set

{(η1(µ, σ²), η2(µ, σ²))′ : µ ∈ R, σ² > 0} = {(µ/σ², −1/(2σ²))′ : µ ∈ R, σ² > 0} = R × (−∞, 0)

is open, so

(∑_{i=1}^{n} t1(Xi) , ∑_{i=1}^{n} t2(Xi))′ = (∑_{i=1}^{n} Xi , ∑_{i=1}^{n} Xi²)′

is a complete sufficient statistic.

The maximum likelihood estimator

µ̂ = X̄ = (∑_{i=1}^{n} Xi)/n

of µ is unbiased and is based on (∑_{i=1}^{n} Xi , ∑_{i=1}^{n} Xi²)′, and X̄ is therefore a UMVU estimator of µ.
An alternative UMVU estimator of µ is

µ̃ = X̄  for X̄ ≠ 0,
     1   for X̄ = 0.

Therefore, X̄ is only an essentially unique UMVU estimator of µ.

The sample variance

S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)² = (1/(n − 1)) ∑_{i=1}^{n} Xi² − (1/((n − 1)n)) (∑_{i=1}^{n} Xi)²

is unbiased and is based on (∑_{i=1}^{n} Xi , ∑_{i=1}^{n} Xi²)′, and S² is therefore a UMVU estimator of σ².


Outside the exponential family of distributions, we typically have to find complete sufficient statistics
(if they exist) by applying the definition of completeness.

Example. Suppose Xi ∼ i.i.d. U[0, θ], where θ > 0 is an unknown parameter. The sufficient statistic
T = T(X1 , . . . , Xn) = X(n) is continuous with pdf f(·|θ) given by (Casella and Berger, Example
6.2.23)

f(t|θ) = n t^{n−1} θ^{−n} 1(0 ≤ t ≤ θ),  θ > 0.

The statistic T is complete if the set {x ∈ R+ : g(x) ≠ 0} has (Lebesgue) measure zero whenever g : R → R
is a function satisfying

Eθ(g(T)) = ∫_{0}^{θ} g(t) n t^{n−1} θ^{−n} dt = 0  ∀ θ > 0.

Given any such function g, let g+ and g− be the positive and negative parts of g, respectively; that is, let
g+(t) = max(0, g(t)) and g−(t) = max(0, −g(t)). By assumption,

∫_{0}^{θ} g+(t) t^{n−1} dt = ∫_{0}^{θ} g−(t) t^{n−1} dt  ∀ θ > 0.

It can be shown (using the Radon-Nikodym theorem) that this implies that the set {x ∈ R+ : g+(x) ≠ g−(x)}
has measure zero. Therefore, the set {x ∈ R+ : g(x) = g+(x) − g−(x) ≠ 0} has measure zero. In particular,
X(n) is a complete sufficient statistic.
The maximum likelihood estimator θ̂ML = X(n) is based on X(n) but is biased because

Eθ(θ̂ML) = ∫_{0}^{θ} t f(t|θ) dt = ∫_{0}^{θ} n t^n θ^{−n} dt = [n/(n + 1)] t^{n+1} θ^{−n} |_{t=0}^{θ} = [n/(n + 1)] θ.

On the other hand, the estimator

θ̂ = [(n + 1)/n] X(n) = [(n + 1)/n] θ̂ML

is unbiased and is based on the complete sufficient statistic X(n) . As a consequence, θ̂ is a UMVU estimator
of θ.

In this example, we constructed a UMVU estimator of θ by finding “the” function θ̂(·) such that

Eθ(θ̂(T)) = θ  ∀ θ ∈ Θ.


In cases where an unbiased estimator θ̂ (not based on a complete sufficient statistic) has already been
found, a UMVU estimator of θ can be found by “Rao-Blackwellization”; that is,
θ̃ = E(θ̂|T)

is UMVU if θ̂ is unbiased and T is a complete sufficient statistic.

Example. Suppose Xi ∼ i.i.d. U [0, θ] , where θ > 0 is an unknown parameter. Since Eθ (Xi ) = θ/2,
an unbiased estimator of θ is

θ̂ = 2X1 .

The conditional distribution of X1 given X(n) = x(n) is a mixture distribution. Specifically, X1 equals x(n)
with probability 1/n and is uniformly distributed on [0, x(n)] with probability (n − 1)/n. As a consequence,

E(θ̂|X(n)) = 2E(X1|X(n)) = 2[(1/n) X(n) + ((n − 1)/n) · (X(n)/2)] = [(n + 1)/n] X(n)

is a UMVU estimator of θ.
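
Remark (added numerical illustration, not part of the original notes). A simulation shows the Rao-Blackwell improvement at work: both 2X1 and [(n + 1)/n] X(n) are approximately unbiased in the simulation, but the Rao-Blackwellized estimator has a far smaller variance. numpy is an assumed dependency.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 1.0, 10, 50_000

x = rng.uniform(0.0, theta, size=(reps, n))
crude = 2 * x[:, 0]                            # theta_hat = 2 * X_1
rao_blackwell = (n + 1) / n * x.max(axis=1)    # E(2 X_1 | X_(n)) = (n + 1)/n * X_(n)

print("means:    ", crude.mean(), rao_blackwell.mean())     # both approx theta
print("variances:", crude.var(), rao_blackwell.var())       # RB much smaller
```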

If a UMVU estimator does not exist or is hard to find, it is nice to have a benchmark against which
all estimators can be compared. That is, it is nice to have a lower bound on the variance of any unbiased
estimator.

Definition. An estimator θ̂ ∈ Wu(θ) of θ is locally minimum variance unbiased (LMVU) at θ0 if

Varθ0(θ̂) ≤ Varθ0(W)

for any W ∈ Wu (θ) .

Suppose an LMVU estimator exists at every θ0 ∈ Θ and let VL (θ0 ) denote the variance of the LMVU
estimator at θ0 . The function VL (·) : Θ → R+ provides us with a lower bound on the variance of any
estimator θ̂ ∈ Wu (θ) . The bound is sharp in the sense that it can be attained for any θ0 ∈ Θ. If the LMVU
estimator (or its variance) is hard to find, we may nonetheless be able to find a nontrivial lower bound on
the variance of any unbiased estimator of θ. A very useful bound, the Cramér-Rao bound, can be obtained
by applying the covariance inequality, a corollary of the Cauchy-Schwarz inequality.

Theorem (Cauchy-Schwarz Inequality; Casella and Berger, Theorem 4.7.3). If (X, Y) is a
bivariate random vector, then

|E(XY)| ≤ E(|XY|) ≤ (E(X²))^{1/2} (E(Y²))^{1/2}.

Corollary (Covariance Inequality; Casella and Berger, Example 4.7.4). If (X, Y) is a bivariate
random vector, then

Cov(X, Y)² ≤ Var(X) Var(Y).

Let X1 , . . . , Xn be a random sample from a discrete (continuous) distribution with pmf (pdf) f (·|θ) ,
where θ ∈ Θ is an unknown parameter vector. Moreover, let θ̂ = θ̂ (X1 , . . . , Xn ) be any estimator of θ
and let ψ = ψ (X1 , . . . , Xn ; θ) be any vector-valued function of (X1 , . . . , Xn )0 and θ. It follows from the
(multivariate version of the) covariance inequality that
Varθ(θ̂) ≥ Covθ(θ̂, ψ) Varθ(ψ)^{−1} Covθ(θ̂, ψ)′.

In general, the lower bound on the right hand side depends on θ̂ and therefore the inequality may not seem
to be very helpful.
Remark. It can be shown that the function (of θ and θ̂) Covθ(θ̂, ψ) depends on θ̂ only through Eθ(θ̂)
if and only if

Covθ(ψ, U) = 0  ∀ θ ∈ Θ

whenever U is a random variable with Eθ(U) = 0 and Varθ(U) < ∞ for every θ ∈ Θ.
In particular, when θ̂ ∈ Wu(θ), the function (of θ and θ̂) Covθ(θ̂, ψ) depends on θ̂ only through θ if and only if

Covθ(ψ, U) = 0  ∀ θ ∈ Θ

whenever U is a random variable with Eθ(U) = 0 and Varθ(U) < ∞ for every θ ∈ Θ.

It turns out that under certain conditions (on f(·|θ)), Covθ(θ̂, ψ) is independent of θ̂ when θ̂ ∈ Wu(θ)
and ψ is the score function evaluated at θ.

Definition. Let X1 , . . . , Xn be a random sample from a discrete (continuous) distribution with pmf
(pdf) f(·|θ), where θ ∈ Θ ⊆ Rk is an unknown parameter vector. The score function is the (random)
function S(·|X1 , . . . , Xn) : Θ → Rk given by

S(θ|X1 , . . . , Xn) = (∂/∂θ) ∑_{i=1}^{n} log f(Xi|θ),  θ ∈ Θ.

Theorem (Cramér-Rao Inequality; Casella and Berger, Theorem 7.3.9). Let X1 , . . . , Xn be
a random sample from a discrete (continuous) distribution with pmf (pdf) f(·|θ), where θ ∈ Θ is an
unknown parameter vector. Moreover, let θ̂ ∈ Wu(θ). If

Eθ(S(θ|X1 , . . . , Xn)) = 0

and

Eθ(θ̂ · S(θ|X1 , . . . , Xn)′) = I,

then

Varθ(θ̂) ≥ Varθ(S(θ|X1 , . . . , Xn))^{−1}

whenever Varθ(S(θ|X1 , . . . , Xn)) exists and is positive definite.

Proof. Let S (θ) = S (θ|X1 , . . . , Xn ) . Under the stated assumptions,

Covθ(θ̂, S(θ)) = Eθ(θ̂ · (S(θ) − Eθ(S(θ)))′) = Eθ(θ̂ · S(θ)′) = I,

and it follows from the covariance inequality that

Varθ(θ̂) ≥ Covθ(θ̂, S(θ)) Varθ(S(θ))^{−1} Covθ(θ̂, S(θ))′ = Varθ(S(θ))^{−1}. ¥

The high-level assumptions

Eθ (S (θ|X1 , . . . , Xn )) = 0

and
Eθ(θ̂ · S(θ|X1 , . . . , Xn)′) = I

both have natural interpretations. Suppose X = (X1 , . . . , Xn )′ is continuous with joint pdf fX(·|θ) and let
T(X1 , . . . , Xn) be any statistic with Eθ|T(X1 , . . . , Xn)| < ∞. If we can interchange the order of integration
and differentiation, then

(∂/∂θ′) Eθ(T(X)) = (∂/∂θ′) ∫_{Rn} T(x) fX(x|θ) dx
                 = ∫_{Rn} T(x) (∂/∂θ′) fX(x|θ) dx
                 = ∫_{Rn} T(x) ((∂/∂θ′) log fX(x|θ)) fX(x|θ) dx
                 = Eθ(T(X) S(θ|X)′)


where the last equality holds because

S(θ|X1 , . . . , Xn) = (∂/∂θ) ∑_{i=1}^{n} log f(Xi|θ)
                     = (∂/∂θ) log(∏_{i=1}^{n} f(Xi|θ))
                     = (∂/∂θ) log fX(X1 , . . . , Xn |θ)

when X1 , . . . , Xn is a random sample from a distribution with pdf f(·|θ). Setting T(X) = 1 and T(X) = θ̂,
we obtain the relations

Eθ(S(θ|X1 , . . . , Xn)) = 0

and

Eθ(θ̂ · S(θ|X1 , . . . , Xn)′) = I,

respectively.
Conditions under which we can interchange the order of integration and differentiation are available
(Casella and Berger, Section 2.4).

Lemma. Let X1 , . . . , Xn be a random sample from a discrete (continuous) distribution with pmf (pdf )
f (·|θ) , where θ ∈ Θ. Suppose

(i) Θ is open.
(ii) The set {x ∈ R : f (x|θ) > 0} does not depend on θ.
(iii) For every θ ∈ Θ there is a function bθ : R → R+ and a constant ∆θ > 0 such that

V arθ (bθ (X)) < ∞

and
|(f(x|θ + δ) − f(x|θ))/δ| < bθ(x)  ∀ x ∈ R

whenever |δ| < ∆θ.

Then

(∂/∂θ′) Eθ(T(X)) = Eθ(T(X) S(θ|X)′)

for any statistic T(X1 , . . . , Xn) and any θ ∈ Θ.


Remark. Conditions (ii) and (iii) of the lemma hold whenever f (x|θ) is of the form
f(x|θ) = h(x) c(θ) exp(∑_{i=1}^{d} ηi(θ) ti(x)),  x ∈ R, θ ∈ Θ,

where each ηi (·) is differentiable.

The quantity
I(θ) = Eθ(S(θ|X1 , . . . , Xn) S(θ|X1 , . . . , Xn)′)

is called the information matrix, or the Fisher information. Under the assumptions of the Cramér-Rao
Inequality, the Fisher information is

I (θ) = V arθ (S (θ|X1 , . . . , Xn ))

and I (θ)−1 provides a lower bound on the variance of any estimator θ̂ ∈ Wu (θ) .
The Fisher information is easy to compute when X1 , . . . , Xn is a random sample (Casella and Berger,
Corollary 7.3.10):

I(θ) = Eθ[((∂/∂θ) ∑_{i=1}^{n} log f(Xi|θ)) ((∂/∂θ) ∑_{i=1}^{n} log f(Xi|θ))′]
     = n · Eθ[((∂/∂θ) log f(X|θ)) ((∂/∂θ) log f(X|θ))′],

where X is a random variable with pmf/pdf f(·|θ). Moreover, if

∫_{−∞}^{∞} (∂²/∂θ∂θ′) f(x|θ) dx = (∂²/∂θ∂θ′) ∫_{−∞}^{∞} f(x|θ) dx = 0,

then (Casella and Berger, Lemma 7.3.11)

I(θ) = n · Eθ[((∂/∂θ) log f(X|θ)) ((∂/∂θ) log f(X|θ))′] = −n · Eθ[(∂²/∂θ∂θ′) log f(X|θ)].

If the conditions of the Cramér-Rao inequality are satisfied and it just so happens that an unbiased
estimator attains the bound, then the estimator is UMVU. The Cramér-Rao inequality can therefore be
used to establish optimality in some cases.

Example. Suppose Xi ∼ i.i.d. Ber (p) , where p ∈ (0, 1) is unknown. When x ∈ {0, 1} , we have:


log f(x|p) = log(p^x (1 − p)^{1−x}) = x · log p + (1 − x) log(1 − p),

and

(∂/∂p) log f(x|p) = x/p − (1 − x)/(1 − p) = (1 − p)x/(p(1 − p)) + p(x − 1)/(p(1 − p)) = x/(p(1 − p)) − 1/(1 − p).

The conditions of the Cramér-Rao inequality are satisfied, so the Fisher information is

I(p) = n · Varp(Xi/(p(1 − p)) − 1/(1 − p))
     = n · (1/(p(1 − p)))² Varp(Xi)
     = n · (1/(p(1 − p)))² p(1 − p)
     = n/(p(1 − p)).

The variance of any unbiased estimator of p is bounded from below by

I(p)^{−1} = p(1 − p)/n.

Now, the maximum likelihood estimator

p̂ = (∑_{i=1}^{n} Xi)/n = X̄

of p satisfies

Varp(p̂) = Varp(X̄) = Varp(Xi)/n = p(1 − p)/n.

The maximum likelihood estimator attains the lower bound I (p)−1 and is therefore UMVU.
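
Remark (added numerical illustration, not part of the original notes). The claim can be checked by simulation: the empirical variance of p̂ = X̄ across many Bernoulli samples is close to the Cramér-Rao bound p(1 − p)/n. numpy is an assumed dependency.

```python
import numpy as np

rng = np.random.default_rng(8)
p, n, reps = 0.3, 50, 100_000

x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)

print("Var(p_hat):      ", p_hat.var())
print("Cramer-Rao bound:", p * (1 - p) / n)
```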

There are cases where the Cramér-Rao bound does not apply or fails to be sharp.

Example. Suppose Xi ∼ i.i.d. U [0, θ] , where θ > 0 is an unknown parameter. When 0 < x < θ,
we have:
log f(x|θ) = log(1/θ) = − log θ

and

(∂/∂θ) log f(x|θ) = −1/θ.

The conditions of the Cramér-Rao inequality are violated because

Eθ(S(θ|X1 , . . . , Xn)) = n · Eθ((∂/∂θ) log f(Xi|θ)) = −n/θ ≠ 0.

We therefore cannot be sure that the Fisher information

I(θ) = n · Eθ((−1/θ)²) = n/θ²

delivers a lower bound on the variance of unbiased estimators. In fact, the variance of the UMVU estimator

θ̂ = [(n + 1)/n] X(n)

of θ is (Casella and Berger, Example 7.3.13)

Varθ(θ̂) = θ²/(n(n + 2)) < θ²/n = I(θ)^{−1}.

Example. Suppose Xi ∼ i.i.d. N(µ, σ²), where µ ∈ R and σ² > 0 are unknown parameters. We
have:

log f(x|µ, σ²) = log((2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)))
              = −(1/2) log(2π) − (1/2) log σ² − (x − µ)²/(2σ²),

(∂/∂µ) log f(x|µ, σ²) = (x − µ)/σ²,

and

(∂/∂σ²) log f(x|µ, σ²) = −1/(2σ²) + (x − µ)²/(2σ⁴).

The conditions of the Cramér-Rao inequality are satisfied, so the Fisher information is

I(µ, σ²) = n · [ 1/σ²       0
                 0       1/(2σ⁴) ]

because

Var_(µ,σ²)((∂/∂µ) log f(Xi|µ, σ²)) = (1/σ⁴) Var_(µ,σ²)(Xi − µ) = 1/σ²,

Var_(µ,σ²)((∂/∂σ²) log f(Xi|µ, σ²)) = (1/(2σ⁴))² Var_(µ,σ²)((Xi − µ)²) = 1/(2σ⁴),

and

Cov_(µ,σ²)((∂/∂µ) log f(Xi|µ, σ²), (∂/∂σ²) log f(Xi|µ, σ²)) = Cov_(µ,σ²)((Xi − µ)/σ², (Xi − µ)²/(2σ⁴)) = 0.

As a consequence, a lower bound on the covariance matrix of an unbiased estimator of (µ, σ²)′ is

I(µ, σ²)^{−1} = [ σ²/n      0
                  0      2σ⁴/n ].

In particular, no unbiased estimator of µ can have variance smaller than σ 2 /n. This lower bound is attained
by the maximum likelihood estimator µ̂ = X̄. The Cramér-Rao lower bound on the variance of unbiased
estimators of σ2 is 2σ 4 /n. The variance of the UMVU estimator S 2 is 2σ 4 / (n − 1) and the Cramér-Rao
bound therefore cannot be attained.
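
Remark (added numerical illustration, not part of the original notes). A short simulation shows the gap: the empirical variance of S² is close to 2σ⁴/(n − 1), strictly above the Cramér-Rao bound 2σ⁴/n, matching the statement that the bound cannot be attained. numpy is an assumed dependency.

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma2, n, reps = 0.0, 2.0, 10, 100_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)                # unbiased sample variance

print("Var(S^2):        ", s2.var())
print("2*sigma^4/(n-1): ", 2 * sigma2 ** 2 / (n - 1))
print("Cramer-Rao bound:", 2 * sigma2 ** 2 / n)
```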

The following theorem delivers a couple of useful implications of optimality.

Theorem. If θ̂0 is a UMVU estimator of θ and θ̂1 ∈ Wu (θ) , then


Covθ(θ̂0 , θ̂1 − θ̂0) = 0  ∀ θ ∈ Θ.

Proof. For any a ∈ R, let θ̂a be the unbiased estimator of θ given by

θ̂a = (1 − a) θ̂0 + a θ̂1 = θ̂0 + a(θ̂1 − θ̂0).

If θ̂0 is UMVU, then Varθ(θ̂a) ≥ Varθ(θ̂0) for any θ ∈ Θ and any a ∈ R. For any θ ∈ Θ, Varθ(θ̂a) is a
differentiable (indeed, a quadratic) function of a:

Varθ(θ̂a) = Varθ(θ̂0 + a(θ̂1 − θ̂0))
          = Varθ(θ̂0) + a² Varθ(θ̂1 − θ̂0) + 2a · Covθ(θ̂0 , θ̂1 − θ̂0).

Therefore, if Varθ(θ̂a) ≥ Varθ(θ̂0) for any θ ∈ Θ and any a ∈ R, then

(d/da) Varθ(θ̂a)|_{a=0} = 2 · Covθ(θ̂0 , θ̂1 − θ̂0) = 0  ∀ θ ∈ Θ. ¥

Remark. The theorem also has a converse. Indeed, if θ̂ ∈ Wu (θ) and

Covθ(θ̂, θ̃ − θ̂) = 0  ∀ θ ∈ Θ

for every θ̃ ∈ Wu(θ), then θ̂ is a UMVU estimator of θ because

Varθ(θ̃) = Varθ(θ̂ + θ̃ − θ̂)
         = Varθ(θ̂) + Varθ(θ̃ − θ̂) + 2Covθ(θ̂, θ̃ − θ̂)
         = Varθ(θ̂) + Varθ(θ̃ − θ̂)
         ≥ Varθ(θ̂)

whenever Covθ(θ̂, θ̃ − θ̂) = 0.

Corollary (Casella and Berger, Theorem 7.3.20). If θ̂ is a UMVU estimator of θ and U is a
random variable with Eθ(U) = 0 and Varθ(U) < ∞ for every θ ∈ Θ, then

Covθ(θ̂, U) = 0  ∀ θ ∈ Θ.

Proof. Apply the theorem to θ̂0 = θ̂ and θ̂1 = θ̂ + U. ¥

Corollary. If θ̂ is a UMVU estimator of θ and θ̃ ∈ Wu (θ) , then


Varθ(θ̃ − θ̂) = Varθ(θ̃) − Varθ(θ̂)

and

Covθ(θ̂, θ̃) = Varθ(θ̂)

for every θ ∈ Θ.

Proof. Now,

Covθ(θ̂, θ̃) = Covθ(θ̂, θ̃ − θ̂ + θ̂)
            = Covθ(θ̂, θ̃ − θ̂) + Varθ(θ̂)
            = Varθ(θ̂),

where the last equality holds because Covθ(θ̂, θ̃ − θ̂) = 0 in view of the theorem. Using this relation,

Varθ(θ̃ − θ̂) = Varθ(θ̃) + Varθ(θ̂) − 2Covθ(θ̂, θ̃)
             = Varθ(θ̃) − Varθ(θ̂). ¥

Corollary (Casella and Berger, Theorem 7.3.19). UMVU estimators are essentially unique in the
sense that if θ̂ is a UMVU estimator of θ and θ̃ ∈ Wu (θ) , then
Varθ(θ̃) > Varθ(θ̂)

unless Pθ(θ̃ = θ̂) = 1.

Proof. If a random variable X has Var(X) = 0, then P(X = E(X)) = 1. Unbiasedness implies
Eθ(θ̃ − θ̂) = Eθ(θ̃) − Eθ(θ̂) = 0 and it therefore suffices to show that Varθ(θ̃) > Varθ(θ̂) unless
Varθ(θ̃ − θ̂) = 0. The stated result now follows from the previous corollary. ¥
