Statistical Estimation
7.7: Properties of Estimators II
(From “Probability & Statistics with Applications to Computing” by Alex Tsun)
We’ll discuss even more desirable properties of estimators. Last time we talked about bias, variance, and
MSE. Bias measured whether or not, in expectation, our estimator was equal to the true value of θ. MSE
measured the expected squared difference between our estimator and the true value of θ. If our estimator
was unbiased, then the MSE of our estimator was precisely the variance.
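As a quick sanity check on that last claim (this sketch is mine, not from the text), the following NumPy snippet estimates the bias, variance, and MSE of an estimator by simulation, using $\hat{\theta} = 2\bar{x}$ for $\text{Unif}(0, \theta)$ (which reappears in the next example) as a concrete stand-in; the particular $\theta$, sample size, and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 5.0, 50, 100_000

# Each row is one simulated dataset of size n from Unif(0, theta);
# the estimator is 2 times the sample mean of each row.
samples = rng.uniform(0, theta, size=(trials, n))
theta_hat = 2 * samples.mean(axis=1)

bias = theta_hat.mean() - theta
variance = theta_hat.var()
mse = np.mean((theta_hat - theta) ** 2)

print(f"bias ≈ {bias:.4f}, variance ≈ {variance:.4f}, MSE ≈ {mse:.4f}")
print(f"bias^2 + variance ≈ {bias ** 2 + variance:.4f} (matches MSE, and bias ≈ 0)")
```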
7.7.1 Consistency
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from (continuous) $\text{Unif}(0, \theta)$, then
$$\hat{\theta}_n = \hat{\theta}_{n,\text{MoM}} = 2 \cdot \frac{1}{n} \sum_{i=1}^n x_i$$
Show that $\hat{\theta}_n$ is a consistent estimator of $\theta$; that is, show that for any $\varepsilon > 0$, $P(|\hat{\theta}_n - \theta| > \varepsilon) \to 0$ as $n \to \infty$.
Solution
Since $\hat{\theta}_n$ is unbiased, we have that
$$P\left(|\hat{\theta}_n - \theta| > \varepsilon\right) = P\left(|\hat{\theta}_n - E[\hat{\theta}_n]| > \varepsilon\right)$$
because we can replace $\theta$ with the expected value of the estimator. Now, we can apply Chebyshev's inequality (6.1) to see that
$$P\left(|\hat{\theta}_n - E[\hat{\theta}_n]| > \varepsilon\right) \le \frac{\text{Var}(\hat{\theta}_n)}{\varepsilon^2}$$
Now, we can take the $2^2$ out of the estimator's expression, and we are left only with the variance of the sample mean, which is always just $\frac{\sigma^2}{n} = \frac{\text{Var}(x_i)}{n}$:
$$P\left(|\hat{\theta}_n - E[\hat{\theta}_n]| > \varepsilon\right) \le \frac{\text{Var}(\hat{\theta}_n)}{\varepsilon^2} = \frac{2^2 \, \text{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right)}{\varepsilon^2} = \frac{4 \cdot \text{Var}(x_i)/n}{\varepsilon^2}$$
Since $\text{Var}(x_i)$ is a fixed constant, this bound goes to $0$ as $n \to \infty$, so $P(|\hat{\theta}_n - \theta| > \varepsilon) \to 0$ and $\hat{\theta}_n$ is consistent.
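To make this concrete, here is a small simulation sketch (mine, not from the text) that compares the empirical probability $P(|\hat{\theta}_n - \theta| > \varepsilon)$ for the MoM estimator against the Chebyshev bound $4 \cdot \text{Var}(x_i)/(n\varepsilon^2)$ derived above, using the fact that $\text{Var}(x_i) = \theta^2/12$ for $\text{Unif}(0, \theta)$. The values of $\theta$, $\varepsilon$, and the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, eps, trials = 5.0, 0.5, 20_000
var_xi = theta ** 2 / 12          # Var(x_i) for Unif(0, theta)

for n in [10, 50, 200, 1000]:
    x = rng.uniform(0, theta, size=(trials, n))
    theta_hat = 2 * x.mean(axis=1)                    # MoM estimate per trial
    empirical = np.mean(np.abs(theta_hat - theta) > eps)
    chebyshev = 4 * var_xi / (n * eps ** 2)           # bound from the derivation
    # For small n the bound can exceed 1, in which case it is trivially true.
    print(f"n={n:>5}: empirical ≈ {empirical:.4f}   Chebyshev bound = {chebyshev:.4f}")
```

Both columns shrink toward $0$ as $n$ grows, which is exactly the consistency statement.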
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from (continuous) $\text{Unif}(0, \theta)$, then the maximum likelihood estimator is
$$\hat{\theta}_n = \hat{\theta}_{n,\text{MLE}} = \max\{x_1, \dots, x_n\}$$
Show that $\hat{\theta}_n$ is a consistent estimator of $\theta$.
Solution
In this case, we unfortunately cannot use Chebyshev's inequality, because the maximum likelihood estimator is not unbiased. The CDF of $\hat{\theta}_n$ is
$$F_{\hat{\theta}_n}(t) = P\left(\hat{\theta}_n \le t\right)$$
which is the probability that each individual sample is at most $t$, because only in that case will the max be at most $t$; since the samples are independent, we can say
$$P\left(\hat{\theta}_n \le t\right) = P(X_1 \le t)\, P(X_2 \le t) \cdots P(X_n \le t)$$
This is just the CDF of $X_i$ raised to the $n$-th power, where the CDF of $\text{Unif}(0, \theta)$ is just $\frac{t}{\theta}$ on $[0, \theta]$ (see the distribution sheet):
$$F_{\hat{\theta}_n}(t) = F_X(t)^n = \begin{cases} 0, & t < 0 \\ \left(\frac{t}{\theta}\right)^n, & 0 \le t \le \theta \\ 1, & t > \theta \end{cases}$$
There are two ways the absolute difference from before can be greater than $\varepsilon$:
$$P\left(|\hat{\theta}_n - \theta| > \varepsilon\right) = P\left(\hat{\theta}_n > \theta + \varepsilon\right) + P\left(\hat{\theta}_n < \theta - \varepsilon\right)$$
The first term is $0$, because there's no way our estimator is greater than $\theta + \varepsilon$: it's never even greater than $\theta$ by definition (the samples are between $0$ and $\theta$, so there's no way the max of the samples exceeds $\theta$). So now we can just use the CDF on the remaining term and plug in for $t$:
$$P\left(\hat{\theta}_n > \theta + \varepsilon\right) + P\left(\hat{\theta}_n < \theta - \varepsilon\right) = P\left(\hat{\theta}_n < \theta - \varepsilon\right) = \begin{cases} \left(\frac{\theta - \varepsilon}{\theta}\right)^n, & \varepsilon < \theta \\ 0, & \varepsilon \ge \theta \end{cases}$$
We can assume that $\varepsilon < \theta$, because we really only care about very small $\varepsilon$, so we have that
$$P\left(|\hat{\theta}_n - \theta| > \varepsilon\right) = \left(\frac{\theta - \varepsilon}{\theta}\right)^n$$
Thus, when we take the limit as $n$ approaches infinity, the quantity in parentheses is a number less than $1$ raised to the $n$-th power, so it goes to $0$:
$$\lim_{n \to \infty} P\left(|\hat{\theta}_n - \theta| > \varepsilon\right) = 0$$
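The closed form $\left(\frac{\theta - \varepsilon}{\theta}\right)^n$ is easy to check by simulation. Here is a minimal sketch (mine, not from the text) that estimates $P(|\hat{\theta}_n - \theta| > \varepsilon)$ for the max estimator by Monte Carlo and compares it to the formula; the choices of $\theta$, $\varepsilon$, and $n$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, eps, trials = 5.0, 0.5, 100_000

for n in [5, 10, 20, 50]:
    x = rng.uniform(0, theta, size=(trials, n))
    mle = x.max(axis=1)                               # MLE: max of each sample
    empirical = np.mean(np.abs(mle - theta) > eps)
    exact = ((theta - eps) / theta) ** n              # the derived closed form
    print(f"n={n:>3}: simulated ≈ {empirical:.4f}   exact = {exact:.4f}")
```

The two columns agree up to Monte Carlo noise, and both decay to $0$ as $n$ grows.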
Now we’ve seen that, even though the MLE and MoM estimators of θ given iid samples from Unif(0, θ) are
different, they are both consistent! That means, as n → ∞, they will both converge to the true parameter
θ. This is clearly a good property of an estimator.
You may be wondering, what’s the difference between consistency and unbiasedness? I, for one, was very
confused about the difference for a while as well. There is, in fact, a subtle difference, which we’ll see by
comparing estimators for θ in the continuous Unif(0, θ) distribution.
1. For instance, an unbiased and consistent estimator was the MoM for the uniform distribution: $\hat{\theta}_{n,\text{MoM}} = 2\bar{x}$. We proved it was unbiased in 7.6, meaning it is correct in expectation. It converges to the true parameter (consistent) since its variance goes to $0$.
2. However, if you ignore all the samples except the first one and just multiply it by $2$, $\hat{\theta} = 2X_1$, it is unbiased (as $E[2X_1] = 2 \cdot \frac{\theta}{2} = \theta$), but it's not consistent; our estimator doesn't get better with more samples because we're not using all $n$ of them. Consistency requires that as we get more samples, we approach the true parameter.
3. Biased but consistent, on the other hand, was the MLE estimator. We showed its expectation was $\frac{n}{n+1}\theta$, which is actually "asymptotically unbiased" since $E[\hat{\theta}_{n,\text{MLE}}] = \frac{n}{n+1}\theta \to \theta$ as $n \to \infty$. It does get better and better as $n \to \infty$.
4. Neither unbiased nor consistent would be just some random expression, such as $\hat{\theta} = \frac{1}{X_1^2}$.

A short simulation comparing these four estimators appears below.
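The following sketch (my own, not from the book) simulates all four estimators on $\text{Unif}(0, \theta)$ data at increasing sample sizes and prints the Monte Carlo mean and standard deviation of each. The first concentrates on $\theta$, the second stays centered on $\theta$ but never concentrates, the third concentrates on $\theta$ from below, and the fourth does neither (its expectation is not even finite, so its printed mean is huge and unstable).

```python
import numpy as np

rng = np.random.default_rng(3)
theta, trials = 5.0, 20_000

# Each estimator maps a (trials, n) matrix of Unif(0, theta) samples
# to one estimate per row (per simulated dataset).
estimators = {
    "2 * sample mean (unbiased, consistent)": lambda x: 2 * x.mean(axis=1),
    "2 * X1 (unbiased, not consistent)": lambda x: 2 * x[:, 0],
    "max (biased, consistent)": lambda x: x.max(axis=1),
    "1 / X1^2 (neither)": lambda x: 1 / x[:, 0] ** 2,
}

for n in [10, 100, 1000]:
    x = rng.uniform(0, theta, size=(trials, n))
    print(f"n = {n}")
    for name, estimator in estimators.items():
        est = estimator(x)
        print(f"  {name:<40} mean ≈ {est.mean():10.3f}   sd ≈ {est.std():10.3f}")
```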
7.7.3 Efficiency
To talk about our last topic, efficiency, we first have to define Fisher information. Efficiency says that our estimator has as low a variance as possible. This property, combined with consistency and unbiasedness, means that our estimator is on target (unbiased), converges to the true parameter (consistent), and does so as fast as possible (efficient).
Let $\mathbf{x} = (x_1, \dots, x_n)$ be iid realizations from probability mass function $p_X(t \mid \theta)$ (if $X$ is discrete), or from density function $f_X(t \mid \theta)$ (if $X$ is continuous), where $\theta$ is a parameter (or vector of parameters).
The Fisher information of the parameter $\theta$ is defined to be:
$$I(\theta) = E\left[\left(\frac{\partial \ln L(\mathbf{x} \mid \theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \ln L(\mathbf{x} \mid \theta)}{\partial \theta^2}\right]$$
where $L(\mathbf{x} \mid \theta)$ denotes the likelihood of the data given parameter $\theta$ (defined in 7.1). From Wikipedia, it "is a way of measuring the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$ upon which the probability of $X$ depends".
That written definition is definitely a mouthful, but if you stop and parse it, you'll see it's not too bad to compute. We always take the second derivative of the log-likelihood to confirm that our MLE was a maximizer; now all you have to do is take its expectation (and flip the sign) to get the Fisher information. There's no way, though, that I can interpret the negative expected value of the second derivative of the log-likelihood; it's just too gross and messy.
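Even if the quantity is hard to interpret, it is easy to check numerically. The sketch below (mine, not from the text) Monte Carlo estimates both forms of the Fisher information for the $\text{Poi}(\theta)$ model used later in this section, where $\frac{\partial}{\partial \theta} \ln L(\mathbf{x} \mid \theta) = \sum_{i=1}^n \frac{x_i}{\theta} - n$ and $\frac{\partial^2}{\partial \theta^2} \ln L(\mathbf{x} \mid \theta) = -\sum_{i=1}^n \frac{x_i}{\theta^2}$, and compares them to the closed form $n/\theta$ derived below.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, trials = 3.0, 20, 200_000

# Each row is one Poisson(theta) dataset of size n.
x = rng.poisson(theta, size=(trials, n))

# For Poi(theta): d/dθ ln L = Σ x_i/θ - n,  d²/dθ² ln L = -Σ x_i/θ².
score = x.sum(axis=1) / theta - n
second_derivative = -x.sum(axis=1) / theta ** 2

print("E[(d lnL/dθ)^2]  ≈", np.mean(score ** 2))
print("-E[d² lnL/dθ²]   ≈", -np.mean(second_derivative))
print("closed form n/θ  =", n / theta)
```

All three numbers should agree up to Monte Carlo noise.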
Why did we define that nasty Fisher information? (Actually, it's much worse when $\theta$ is a vector instead of a single number, as the second derivative becomes a matrix of second partial derivatives.) It would be great if the mean squared error of an estimator $\hat{\theta}$ were as low as possible. The Cramer-Rao Lower Bound actually gives a lower bound on the variance of any unbiased estimator $\hat{\theta}$ for $\theta$. That is, if $\hat{\theta}$ is any unbiased estimator for $\theta$, there is a minimum possible variance (and variance = MSE for unbiased estimators). If your estimator achieves this lowest possible variance, it is said to be efficient. This is also a highly desirable property of estimators. The bound is called the Cramer-Rao Lower Bound.
Definition 7.7.3: Cramer-Rao Lower Bound (CRLB)
Let $\mathbf{x} = (x_1, \dots, x_n)$ be iid realizations from probability mass function $p_X(t \mid \theta)$ (if $X$ is discrete), or from density function $f_X(t \mid \theta)$ (if $X$ is continuous), where $\theta$ is a parameter (or vector of parameters).
If θ̂ is an unbiased estimator for θ, then
$$\text{MSE}(\hat{\theta}, \theta) = \text{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}$$
where $I(\theta)$ is the Fisher information defined earlier. What this is saying is: for any unbiased estimator $\hat{\theta}$ for $\theta$, the variance (= MSE) is at least $\frac{1}{I(\theta)}$. If we achieve this lower bound, meaning our variance is exactly equal to $\frac{1}{I(\theta)}$, then we have the best variance possible for our estimate. That is, we have the minimum variance unbiased estimator (MVUE) for $\theta$.
Since we want to find the lowest variance possible, we can look at this through the frame of finding the
estimator’s efficiency.
$$e(\hat{\theta}, \theta) = \frac{I(\theta)^{-1}}{\text{Var}(\hat{\theta})} \le 1$$
This will always be between 0 and 1: if your variance is equal to the CRLB, the efficiency equals 1, and a larger variance results in a smaller efficiency. We want our efficiency to be as high as possible (1).
An unbiased estimator is said to be efficient if it achieves the CRLB - meaning e(θ̂, θ) = 1. That is,
it could not possibly have a lower variance. Again, the CRLB is not guaranteed for biased estimators.
That was super complicated - let’s see how to verify the MLE of Poi(θ) is efficient. It looks scary - but it’s
just messy algebra!
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from $X \sim \text{Poi}(\theta)$ (recall $E[X] = \text{Var}(X) = \theta$), then
$$\hat{\theta} = \hat{\theta}_{\text{MLE}} = \hat{\theta}_{\text{MoM}} = \frac{1}{n}\sum_{i=1}^n x_i$$
Is θ̂ efficient?
Solution
First, you have to check that it's unbiased, as the CRLB only holds for unbiased estimators...
$$E[\hat{\theta}] = E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = E[x_i] = \theta$$
...which it is! Otherwise, we wouldn't be able to use this bound. We also need to compute the variance. The variance of the sample mean (the estimator) is just $\frac{\sigma^2}{n}$, and the variance of a Poisson is just $\theta$:
$$\text{Var}(\hat{\theta}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{\text{Var}(x_i)}{n} = \frac{\theta}{n}$$
Then, we’re going to compute that weird Fisher Information, which gives us the CRLB, and see if our
variance matches. Remember, we take the second derivative of the log-likelihood, which we did earlier in 7.2
6 Probability & Statistics with Applications to Computing 7.7
Then, we need to take the expected value of this. It turns out, with some algebra, you get − nθ .
$$E\left[\frac{\partial^2 \ln L(\mathbf{x} \mid \theta)}{\partial \theta^2}\right] = E\left[-\sum_{i=1}^n \frac{x_i}{\theta^2}\right] = -\frac{1}{\theta^2}\sum_{i=1}^n E[x_i] = -\frac{1}{\theta^2}\, n\theta = -\frac{n}{\theta}$$
Our Fisher information was the negative expected value of the second derivative of the log-likelihood, so we just flip the sign to get $\frac{n}{\theta}$:
$$I(\theta) = -E\left[\frac{\partial^2 \ln L(\mathbf{x} \mid \theta)}{\partial \theta^2}\right] = \frac{n}{\theta}$$
Finally, our efficiency is the inverse of the Fisher Information over the variance:
$$e(\hat{\theta}, \theta) = \frac{I(\theta)^{-1}}{\text{Var}(\hat{\theta})} = \frac{(n/\theta)^{-1}}{\theta/n} = \frac{\theta/n}{\theta/n} = 1$$
Thus, we’ve shown that, since our efficiency is 1, our estimator is efficient. That is, it has the best pos-
sible variance among all unbiased estimators of θ. This, again, is a really good property that we want to have.
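As a final numerical sanity check (this sketch is mine, not part of the text), we can estimate $\text{Var}(\hat{\theta})$ for the Poisson sample mean by simulation, compute the CRLB $1/I(\theta) = \theta/n$ from the Fisher information above, and form the efficiency ratio, which should come out near 1. The choices of $\theta$, $n$, and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, trials = 3.0, 40, 200_000

# Sample mean of each simulated Poisson(theta) dataset of size n.
theta_hat = rng.poisson(theta, size=(trials, n)).mean(axis=1)

var_hat = theta_hat.var()       # empirical Var(θ̂)
crlb = theta / n                # 1/I(θ), since I(θ) = n/θ
efficiency = crlb / var_hat     # e(θ̂, θ) = I(θ)^{-1} / Var(θ̂)

print(f"empirical Var(θ̂) ≈ {var_hat:.5f}")
print(f"CRLB θ/n          = {crlb:.5f}")
print(f"efficiency        ≈ {efficiency:.3f}   (should be ≈ 1)")
```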
To reiterate, this means we cannot possibly do better in terms of mean squared error: our bias is 0, and our variance is as low as it can possibly go. The sample mean is unequivocally the best estimator of the parameter of a Poisson distribution in terms of efficiency, bias, and MSE (it also happens to be consistent, so it has a lot of good properties).
As you can see, showing efficiency is just a bunch of tedious calculations!