
Notes on the Cramér-Rao Inequality

Kimball Martin
February 8, 2012

Suppose X is a random variable with pdf fX (x; θ), θ being an unknown parameter. Let X1 , . . . , Xn be
a random sample and θ̂ = θ̂(X1 , . . . , Xn ). We’ve seen that E(θ̂), or rather E(θ̂) − θ, is a measure of how
biased θ̂ is. We've also seen that Var(θ̂) provides a measure of efficiency, i.e., the smaller the variance of θ̂,
the more likely θ̂ is to provide an accurate estimate of θ.
Given a specific unbiased estimator θ̂, how do we know if it is the best (most efficient, i.e., smallest
variance) one, or if there is a better one? A key tool in understanding this question is a theoretical lower
bound on how small V ar(θ̂) can be. This is the Cramér-Rao Inequality.
From now on, we assume X is continuous and θ is a single real parameter (i.e., there is only one
unknown). We will also assume the range of X does not depend on θ. To be more precise, we will assume
there exist a, b ∈ R ∪ {±∞} independent of θ such that
fX(x; θ) > 0  if a < x < b,
fX(x; θ) = 0  if x < a or x > b.

For instance if (a, b) = (−∞, ∞), then we are assuming fX (x; θ) > 0 for all θ and all real x. Things like the
normal distribution on R and the exponential distribution on [0, ∞) satisfy these conditions. An example
which does not satisfy this regularity condition is X uniform on [0, θ], because then we would need to take
a = 0 and b = b(θ) = θ, which is dependent upon θ.
To discuss the Cramér-Rao Inequality, it will be helpful to introduce a bit more notation and terminology.
As a bit of motivation, we've already seen in the maximum likelihood method that it is sometimes useful to
work with the function ln fX(x; θ) (the natural log of the likelihood function L(θ) = ∏ fX(Xi; θ) becomes
ln L(θ) = Σ ln fX(Xi; θ)).

Definition 1. We call V(θ) = VX(θ) = ln fX(X; θ) the score with respect to θ. The number

I(θ) = E(V′(θ)²)

is called Fisher's information number.

What does the above mean?


Example 1. Suppose X is exponential with pdf fX (x; θ) = θe−θx for x ≥ 0. As usual, θ is an unknown
parameter. Then
V(θ) = ln(θe^(−θX)) = ln θ + ln(e^(−θX)) = ln θ − θX.
For a given value of θ, the score gives a random variable. It makes sense to talk about its expectation
value. Similarly, the derivative with respect to θ also gives a random variable for each θ. The expectation
value of the square of the derivative is Fisher's information number, which is a function of θ. Namely, in our
example,

V′(θ) = (d/dθ) V(θ) = 1/θ − X

so

I(θ) = E(V′(θ)²) = E((1/θ − X)²) = E(1/θ² − 2X/θ + X²) = 1/θ² − (2/θ)E(X) + E(X²).
From probability we know E(X) = 1/θ and E(X²) = Var(X) + E(X)² = 1/θ² + 1/θ² = 2/θ². Thus

I(θ) = 1/θ² − (2/θ)·(1/θ) + 2/θ² = 1/θ².
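As a quick sanity check (my own aside, not part of the notes), one can estimate I(θ) = E(V′(θ)²) by simulation and compare it with the closed form 1/θ²; the sketch below assumes NumPy is available and uses the arbitrary value θ = 2.

```python
import numpy as np

# Sanity check for Example 1: for the exponential pdf f(x; theta) = theta*exp(-theta*x),
# the derivative of the score is V'(theta) = 1/theta - X, so I(theta) = E(V'(theta)^2)
# should come out to 1/theta^2.
rng = np.random.default_rng(0)
theta = 2.0                                          # arbitrary choice of the parameter
X = rng.exponential(scale=1/theta, size=1_000_000)   # samples from f(x; theta)

score_derivative = 1/theta - X                       # V'(theta) evaluated at each sample
I_hat = np.mean(score_derivative**2)                 # Monte Carlo estimate of E(V'(theta)^2)

print(I_hat, 1/theta**2)                             # both should be close to 0.25
```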

Remark. Since the logarithm only makes sense for a positive argument, the score only makes sense when
fX(X; θ) ≠ 0. We also require fX(X; θ) to be differentiable with respect to θ, which we assume. (This
means we need fX(x; θ) > 0 for all a < x < b and all θ.)
Lemma 1. If V′(θ) is continuous (except possibly at finitely many points), then E(V′(θ)) = 0.
Proof. Note

V′(θ) = [∂fX(x; θ)/∂θ] / fX(x; θ).    (1)
Thus, by definition
E(V′(θ)) = ∫_a^b V′_x(θ) fX(x; θ) dx = ∫_a^b (∂/∂θ) fX(x; θ) dx = (∂/∂θ) ∫_a^b fX(x; θ) dx = (∂/∂θ) 1 = 0.
Here we used Leibnitz’s rule, which says one can interchange order of differentiation and integration with
respect to independent variables assuming continuous partial derivatives.
Corollary 1. I(θ) = Var(V′(θ)).
Proof.
I(θ) = E(V′(θ)²) = Var(V′(θ)) + E(V′(θ))² = Var(V′(θ)).

Thus Fisher’s information number is the variance of the derivative of the score, i.e., the variance of the
logarithmic derivative of the pdf fX (X; θ) (cf. (1)). The logarithmic derivative is often a useful quantity to
work with mathematically. For us, the point is that I(θ) appears in the Cramér-Rao bound. I’m sure you’re
anxious to get to this bound, now that I’ve hyped it up so much, but permit me one more
Lemma 2. Assume V(θ) has continuous first and second derivatives. Then I(θ) = −E(V″(θ)).
Exercise 1. (*for grad students) Prove Lemma 2.
The point is that this often gives a simpler way to compute I(θ).
Example 2. Returning to our example above of the exponential distribution,
V″(θ) = −1/θ².
Since there is no dependence on X, we could more quickly compute the Fisher information as
I(θ) = −E(V″(θ)) = −V″(θ) = 1/θ².
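For what it's worth, the two routes to I(θ) taken in Examples 1 and 2 can also be compared symbolically. The sketch below is my own check (not from the notes); it assumes SymPy and uses the exponential pdf of Example 1.

```python
import sympy as sp

# Compare E(V'(theta)^2) with -E(V''(theta)) for the exponential pdf f(x; theta) = theta*exp(-theta*x).
x, theta = sp.symbols('x theta', positive=True)
f = theta * sp.exp(-theta * x)            # pdf on (0, infinity)
V = sp.log(f)                             # the score V(theta) = ln f(x; theta)

Vp = sp.diff(V, theta)                    # V'(theta)
Vpp = sp.diff(V, theta, 2)                # V''(theta)

I_from_Vp = sp.integrate(Vp**2 * f, (x, 0, sp.oo))      # E(V'(theta)^2)
I_from_Vpp = -sp.integrate(Vpp * f, (x, 0, sp.oo))      # -E(V''(theta))

print(sp.simplify(I_from_Vp), sp.simplify(I_from_Vpp))  # both simplify to 1/theta**2
```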
Theorem 1. (Cramér-Rao Inequality.) Assume V (θ) has continuous first derivative (except possibly at
finitely many points). Then for any unbiased estimator θ̂,
Var(θ̂) ≥ 1/(nI(θ)).

This is the desired theoretical bound on how efficient an estimator can be. The theorem is in fact valid
under weaker assumptions (see the text), i.e., V (θ) does not need to be differentiable everywhere, but we
assume this for simplicity.
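To make the bound concrete, here is a small simulation (my own illustration, not from the notes). For the exponential pdf of Example 1, θ̂ = (n − 1)/(X1 + · · · + Xn) is an unbiased estimator of θ when n ≥ 2 (a standard fact, not proved here), so its variance must sit above the Cramér-Rao bound 1/(nI(θ)) = θ²/n.

```python
import numpy as np

# Illustration of Theorem 1 for the exponential pdf f(x; theta) = theta*exp(-theta*x).
# theta_hat = (n - 1) / (X_1 + ... + X_n) is unbiased for theta, so its variance
# must be >= 1/(n*I(theta)) = theta^2/n.
rng = np.random.default_rng(1)
theta, n, trials = 2.0, 10, 200_000

samples = rng.exponential(scale=1/theta, size=(trials, n))   # 'trials' random samples of size n
theta_hat = (n - 1) / samples.sum(axis=1)                    # one estimate per sample

print("mean of theta_hat:", theta_hat.mean())                # ~ 2.0, consistent with unbiasedness
print("variance of theta_hat:", theta_hat.var())             # ~ 0.5, noticeably above the bound
print("Cramer-Rao bound theta^2/n:", theta**2 / n)           # 0.4
```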
To prove this result, first we need a little material from Section 11.4 of the text.
Definition 2. If X and Y are random variables, their covariance is

Cov(X, Y ) = E(XY ) − E(X)E(Y ).

Note Cov(X, X) = Var(X). Also note that if X and Y are independent, then Cov(X, Y) = E(X)E(Y) −
E(X)E(Y) = 0, so covariance measures how dependent X and Y are.
Lemma 3. |Cov(X, Y)| ≤ √(Var(X)Var(Y)).
Proof. Compute

Var(X ± Y) = E((X ± Y)²) − E(X ± Y)² = Var(X) ± 2Cov(X, Y) + Var(Y).

Since this is always ≥ 0 (for either choice of sign), we have

|Cov(X, Y)| ≤ (Var(X) + Var(Y))/2.
Applying this inequality to the normalized random variables X′ = (X − µX)/σX and Y′ = (Y − µY)/σY gives

|Cov(X′, Y′)| ≤ (Var(X′) + Var(Y′))/2.
Note E(X′) = E(Y′) = 0 and Var(X′) = Var(Y′) = 1, so we have

|Cov(X′, Y′)| = |E(X′Y′)| ≤ 1.

Since

E(X′Y′) = E(XY − µX Y − µY X + µX µY)/(σX σY) = [E(XY) − µX E(Y) − µY E(X) + µX µY]/(σX σY)
        = [E(XY) − E(X)E(Y) − E(X)E(Y) + E(X)E(Y)]/√(Var(X)Var(Y))
        = Cov(X, Y)/√(Var(X)Var(Y)),

we are done.
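A quick empirical illustration of Lemma 3 (again my own aside, assuming NumPy is available): the sample covariance and sample variances of any data set satisfy the same inequality, checked below for strongly dependent data.

```python
import numpy as np

# Empirical check of Lemma 3, |Cov(X, Y)| <= sqrt(Var(X) Var(Y)), on dependent data Y = 2X + noise.
rng = np.random.default_rng(2)
X = rng.normal(size=100_000)
Y = 2 * X + rng.normal(size=100_000)

Xc, Yc = X - X.mean(), Y - Y.mean()                 # centered data
cov_xy = np.mean(Xc * Yc)                           # sample covariance (roughly 2)
bound = np.sqrt(np.mean(Xc**2) * np.mean(Yc**2))    # sqrt of product of sample variances (roughly sqrt(5))
print(abs(cov_xy) <= bound, abs(cov_xy), bound)     # the inequality holds
```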
Now we have all we need to piece together a proof of the theorem, at least when n = 1.
Proof. (of Theorem when n = 1) Here θ̂ = θ̂(X1 ) is just a function of X1 , so we may think of it as a function
of X. Observe
Cov(V′(θ), θ̂) = E(V′(θ) · θ̂) − E(V′(θ))E(θ̂) = E(V′(θ) · θ̂)
by Lemma 1. By Lemma 3, we have
|E(V′(θ) · θ̂)| = |Cov(V′(θ), θ̂)| ≤ √(Var(V′(θ))Var(θ̂)).

Hence
Var(θ̂) ≥ |E(V′(θ) · θ̂)|²/Var(V′(θ)) = |E(V′(θ) · θ̂)|²/I(θ),

where we used Corollary 1 for the equality on the right. Thus to prove the theorem, it suffices to show

|E(V′(θ) · θ̂)|² = 1.

Note by (1) and the definition of expected value,


E(V′(θ) · θ̂) = ∫_a^b (∂/∂θ)fX(x; θ) · θ̂(x) dx = (∂/∂θ) ∫_a^b fX(x; θ) · θ̂(x) dx = (∂/∂θ) E(θ̂) = (∂/∂θ) θ = 1.

Here we used Leibnitz's rule again in the middle, and in the next to the last step we used the fact that θ̂ is
unbiased.
The proof for n > 1 is similar, but enters into a little unfamiliar territory that we have carefully
sidestepped until now. Namely, we have only considered random variables defined on a 1-dimensional sample
space. For example, let us suppose for concreteness X is defined on the sample space (0, ∞). In other words,
X is a function from (0, ∞) to R. We also looked at new random variables defined as functions of X, e.g.,
X 2 or 3X or X 2 + X + 1. These are still functions from (0, ∞) to R, and one can determine their pdfs from
that of X.
Then we took a random sample X1 , . . . Xn of X and considered a statistic (e.g., estimator) θ̂ = θ̂(X1 , . . . , Xn ).
For example θ̂ = (1/n) Σ Xi. Since this is a sum of random variables (like 3X = X + X + X or X² + X which
we’ve looked at before), we also said this is a random variable. Then we computed its expected value, say,
as
E(θ̂) = (1/n) Σ E(Xi) = (1/n) Σ E(X) = E(X).
Well, the jig is up. While θ̂ is a sum of random variables, there is a qualitative difference between something
like X1 + X2 and X + X. Even though X1 and X2 have the same pdf as X, they are distinct measurements
(they are independent!). Since X1 and X2 are independent functions from (0, ∞) to R, their sum

X1 + X2 : (0, ∞) × (0, ∞) → R

needs to be viewed as a function of (0, ∞) × (0, ∞), not (0, ∞). If I think of X1 + X2 as a function of just
one parameter in (0, ∞) then I won’t be able to have X1 = 1 and X2 = 3, since X1 will always equal X2 .
What it means for X1 and X2 to have identical pdfs is that they really are the same function from
(0, ∞) to R. What distinguishes them as independent is that I think of them as living on two
different sample spaces S1 and S2 , which just both happen to be represented by (0, ∞). Hence X1 + X2 is
not a random variable in the sense we defined earlier (namely, a 1-dimensional sample space). However, we
can (and at least I will) think of X1 + X2 as being a single random variable, only defined on a 2-dimensional
sample space (0, ∞) × (0, ∞).
Since X1 and X2 are independent, we could still compute expectation values and variance for any specific
simple example of an estimator θ̂(X1 , X2 ) without resorting to thinking of θ̂(X1 , X2 ) as a random variable
on a 2-dimensional sample space. For instance if θ̂(X1, X2) = (1/4)(X1 + X2)², then

E(θ̂) = (1/4)E((X1 + X2)²) = (1/4)E(X1²) + (1/2)E(X1)E(X2) + (1/4)E(X2²) = (1/2)E(X²) + (1/2)E(X)² = (1/2)Var(X) + E(X)²,

and thus one can reduce E(θ̂) to computing things like E(Xᵏ).
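As a quick check of this reduction (my own aside, assuming NumPy), one can simulate it for, say, the exponential X of Example 1, where Var(X) = 1/θ² and E(X) = 1/θ.

```python
import numpy as np

# Check that E((X1 + X2)^2 / 4) = Var(X)/2 + E(X)^2 for i.i.d. X1, X2.
# Exponential X from Example 1 with theta = 2, so Var(X) = 1/4 and E(X) = 1/2.
rng = np.random.default_rng(4)
theta = 2.0
X1 = rng.exponential(scale=1/theta, size=1_000_000)
X2 = rng.exponential(scale=1/theta, size=1_000_000)

lhs = np.mean((X1 + X2)**2 / 4)          # Monte Carlo estimate of E(theta_hat)
rhs = (1/theta**2)/2 + (1/theta)**2      # Var(X)/2 + E(X)^2 = 0.375
print(lhs, rhs)                          # both close to 0.375
```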
However, if one wants to think about things in any generality, then a more sophisticated point of view
needs to be considered. Recall the result that, given a nice function h : R → R we can compute the expected
value of the continuous random variable h(X) as
E(h(X)) = ∫_{−∞}^{∞} h(x) fX(x) dx.

For instance if h(X) = X² + 3X + 1, then

E(h(X)) = ∫_{−∞}^{∞} (x² + 3x + 1) fX(x) dx.

Now what happens if we try the same thing with something like our previous example θ̂(X1, X2) = (1/4)(X1 + X2)²?
It doesn't make sense to write

E(θ̂) = ∫_{−∞}^{∞} θ̂ fX(x; θ) dx

because θ̂ is now a function of two independent variables X1 and X2 , so a single integral won’t cut it. Instead,
we need to look at a double integral. Here, what should be true is
E(θ̂) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} θ̂(x1, x2) fX1(x1; θ) fX2(x2; θ) dx1 dx2 = (1/4) ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x1 + x2)² fX(x1; θ) fX(x2; θ) dx1 dx2.

This is in fact correct—we will skip the details, but they can be found in more advanced probability and
statistics texts, along with much more general statements—and is essentially Theorem 3.9.1 in our text.
Precisely, we state what we need as the following
Proposition 1. Let X1 , . . . , Xn be a random sample for a continuous random variable X, and consider a
function h : Rⁿ → R which gives a continuous random variable h(X1 , . . . , Xn ) on an n-dimensional sample
space. Then
E(h(X1, . . . , Xn)) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x1, . . . , xn) fX(x1; θ) · · · fX(xn; θ) dx1 · · · dxn.
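For readers who want to see Proposition 1 in action without pencil and paper, here is a sketch (my own, not from the text) that evaluates such a multiple integral numerically. It assumes SciPy and NumPy, uses the exponential pdf of Example 1 rather than the uniform pdf of the exercises below, and takes h(x1, x2) = x1·x2 purely as an illustration.

```python
import numpy as np
from scipy import integrate

# Proposition 1 for n = 2: E(h(X1, X2)) as a double integral of h * f * f,
# checked against a plain Monte Carlo average.  Exponential pdf, theta = 2 (arbitrary).
theta = 2.0
f = lambda x: theta * np.exp(-theta * x)          # f_X(x; theta) on (0, infinity)
h = lambda x1, x2: x1 * x2                        # illustrative choice of h

# Double integral of h(x1, x2) f(x1) f(x2) over (0, inf) x (0, inf); dblquad expects func(y, x).
val, _ = integrate.dblquad(lambda x2, x1: h(x1, x2) * f(x1) * f(x2), 0, np.inf, 0, np.inf)

# Monte Carlo version: average h over a large random sample (X1, X2).
rng = np.random.default_rng(3)
X1 = rng.exponential(scale=1/theta, size=500_000)
X2 = rng.exponential(scale=1/theta, size=500_000)

print(val, np.mean(h(X1, X2)), 1/theta**2)        # all close to E(X1)E(X2) = 1/theta^2
```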

Exercise 2. Suppose X is uniform with pdf fX(x; θ) = 1/θ for 0 < x < θ, where θ > 0. Consider the
estimator θ̂(X1, X2) = (1/4)(X1 + X2)² for a random sample of size 2.
(i) Compute E(θ̂) by reducing it to calculations of terms like E(Xᵏ) as discussed above.
(ii) Compute E(θ̂) using the proposition above (i.e., Theorem 3.9.1 in the text), and check it agrees with
your answer for (i).

In fact, there are fairly simple estimators for which the above proposition is the easiest way to calculate
expected values. See, for example, Example 5.4.5 in the text. Here is another example where you should use
the above proposition.
Exercise 3. Let X and X1, X2 be as in the previous exercise, and consider the random variable h(X1, X2) = √(X1 + X2).
Compute E(h(X1, X2)). Is this a reasonable estimator?

Using the above proposition, we can now give a proof of the Cramér-Rao inequality for an arbitrary
sample size n.
Proof. (of Theorem for general n) Let
V′(θ) = Σ_{i=1}^n V′_{Xi}(θ) = Σ_{i=1}^n [∂fX(xi; θ)/∂θ] / fX(xi; θ).

One easily sees from the product rule

V′(θ) = (∂/∂θ)[fX(x1; θ) · · · fX(xn; θ)] / [fX(x1; θ) · · · fX(xn; θ)].    (2)

For any θ, this gives a random variable with an n-dimensional sample space.

By Lemma 1 and independence we still have

E(V′(θ)) = Σ_{i=1}^n E(V′_{Xi}(θ)) = n E(V′_X(θ)) = 0.

Thus, as in the n = 1 case,

Cov(V′(θ), θ̂) = E(V′(θ) · θ̂) − E(V′(θ))E(θ̂) = E(V′(θ) · θ̂).

By Lemma 3, we have
|E(V′(θ) · θ̂)| = |Cov(V′(θ), θ̂)| ≤ √(Var(V′(θ))Var(θ̂)).

By Corollary 1, i.e., I(θ) = Var(V′_X(θ)), we see


Var(V′(θ)) = Σ_{i=1}^n Var(V′_{Xi}(θ)) = nI(θ).

Hence
Var(θ̂) ≥ |E(V′(θ) · θ̂)|²/Var(V′(θ)) = |E(V′(θ) · θ̂)|²/(nI(θ)),
where we used Corollary 1 for the equality on the right. Thus to prove the theorem, it suffices to show

|E(V′(θ) · θ̂)|² = 1.

Note Proposition 1 and (2) imply,


E(V′(θ) · θ̂) = ∫_a^b · · · ∫_a^b ( Σ_{i=1}^n [∂fX(xi; θ)/∂θ] / fX(xi; θ) ) fX(x1; θ) · · · fX(xn; θ) θ̂(x1, . . . , xn) dx1 · · · dxn
             = ∫_a^b · · · ∫_a^b (∂/∂θ)[fX(x1; θ) · · · fX(xn; θ)] θ̂(x1, . . . , xn) dx1 · · · dxn
             = (∂/∂θ) ∫_a^b · · · ∫_a^b fX(x1; θ) · · · fX(xn; θ) θ̂(x1, . . . , xn) dx1 · · · dxn
             = (∂/∂θ) E(θ̂) = (∂/∂θ) θ = 1.
