Statistical Estimation
7.6: Properties of Estimators I
(From “Probability & Statistics with Applications to Computing” by Alex Tsun)
Now that we have all these techniques to compute estimators, you might be wondering which one is the
“best”. Actually, a better question would be: how can we determine which estimator is “better” (rather
than which technique)? There are many other ways to estimate parameters besides MLE/MoM/MAP, and in
different scenarios, different techniques may work better. In these notes, we will consider some properties of
estimators that allow us to compare their “goodness”.
7.6.1 Bias
The first estimator property we’ll cover is bias. The bias of an estimator measures whether or not, in
expectation, the estimator equals the true parameter.

Let θ̂ be an estimator of a parameter θ. The bias of θ̂ as an estimator of θ is defined as

Bias(θ̂, θ) = E[θ̂] − θ

• If Bias(θ̂, θ) = 0, or equivalently E[θ̂] = θ, then we say θ̂ is an unbiased estimator of θ.
• If Bias(θ̂, θ) > 0, then θ̂ tends to overestimate θ (in expectation); if Bias(θ̂, θ) < 0, it tends to underestimate θ.
Example(s)
First, recall that if x_1, ..., x_n are iid realizations from Poi(θ), then the MLE and MoM estimators were both the
sample mean:

θ̂ = θ̂_MLE = θ̂_MoM = (1/n) Σ_{i=1}^n x_i

What is the bias of this estimator?
Solution
" n #
h i 1X
E θ̂ = E xi
n i=1
n
1X
= E [xi ] [LoE]
n i=1
n
1X
= θ [E [Poi(θ)] = θ]
n i=1
1
= nθ
n
=θ
This makes sense: the average of your samples should be “on-target” for the true average!
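To see this numerically, here is a minimal Monte Carlo sketch (not from the notes; it assumes numpy, and the values of theta, n, and the number of trials are arbitrary illustrative choices):

import numpy as np

# Empirically check that the sample mean of Poi(theta) data is unbiased.
rng = np.random.default_rng(0)
theta, n, trials = 3.0, 50, 100_000                   # hypothetical settings

samples = rng.poisson(lam=theta, size=(trials, n))    # each row: n iid Poi(theta) draws
theta_hat = samples.mean(axis=1)                      # sample mean per trial

print("average of theta_hat:", theta_hat.mean())           # close to theta = 3.0
print("empirical bias:      ", theta_hat.mean() - theta)   # close to 0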
Example(s)
First, recall that, if x1 , ..., xn are iid realizations from (continuous) Unif(0, θ), then
θ̂_MLE = x_max        θ̂_MoM = 2 · (1/n) Σ_{i=1}^n x_i
Sure, θ̂_MLE maximizes the likelihood, so in a way θ̂_MLE is better than θ̂_MoM. But what are the biases
of these estimators? Before doing any computation: do you think θ̂_MLE and θ̂_MoM are overestimates,
underestimates, or unbiased?
Solution I actually think θ̂_MoM is spot-on, since the average of the samples should be close to θ/2, and
multiplying by 2 would seem to give the true θ. On the other hand, θ̂_MLE is probably a bit of an underestimate,
since the largest sample is most likely not exactly θ (θ is probably a little larger than our maximum).
This makes sense because if I had 3 samples from Unif(0, 1) for example, I would expect them at
1/4, 2/4, 3/4, and so n/(n + 1) would be my expected max. Similarly, if I had 4 samples, then I would
expect them at 1/5, 2/5, 3/5, 4/5, and so my expected max would again be n/(n + 1).
Finally, since E[x_max] = (n/(n + 1))θ for n iid Unif(0, θ) samples,

Bias(θ̂_MLE, θ) = E[θ̂_MLE] − θ = (n/(n + 1))θ − θ = −(1/(n + 1))θ

and, since E[x_i] = θ/2,

Bias(θ̂_MoM, θ) = E[2 · (1/n) Σ_{i=1}^n x_i] − θ = 2 · (θ/2) − θ = 0
• Analysis of Results
This means that θ̂_MLE typically underestimates θ, while θ̂_MoM is an unbiased estimator of θ. But something
isn’t quite right... Suppose, for example, that our sample (with n = 3) is x_1 = 1, x_2 = 9, x_3 = 2. Then

θ̂_MLE = max{1, 9, 2} = 9        θ̂_MoM = 2 · (1/3)(1 + 9 + 2) = 8
However, based on our sample, the MoM estimate is impossible. If the actual parameter were 8, then
the distribution we pulled the sample from would be Unif(0, 8), in which case the likelihood
of observing a 9 is 0. But we did see a 9 in our sample. So, even though θ̂_MoM is unbiased, it still
yields an impossible estimate. This just goes to show that finding the right estimator is actually quite
tricky.
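Just to re-do the arithmetic above in code (plain Python; the sample values are the ones from the example):

# The three observations from the example above.
sample = [1, 9, 2]
n = len(sample)

theta_mle = max(sample)            # 9: the largest observation
theta_mom = 2 * sum(sample) / n    # 8: impossible, since we observed a 9 > 8

print(theta_mle, theta_mom)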
A good solution would be to “de-bias” the MLE by scaling it appropriately. If you decided to have a
new estimator based on the MLE:
θ̂ = ((n + 1)/n) · θ̂_MLE
you would now get an unbiased estimator that never yields an impossible estimate (it is always at least x_max)!
But now it does not maximize the likelihood anymore...
Actually, the MLE is what we call “asymptotically unbiased”, meaning unbiased in the limit.
This is because

Bias(θ̂_MLE, θ) = −(1/(n + 1))θ → 0

as n → ∞. So usually we just leave the MLE as is, since we can’t seem to win either way...
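As a sanity check, here is a small simulation sketch (not from the notes; assumes numpy, with arbitrary illustrative values of theta, n, and the number of trials) comparing the empirical biases of θ̂_MLE, θ̂_MoM, and the de-biased MLE for Unif(0, θ):

import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 10.0, 5, 200_000            # hypothetical settings

x = rng.uniform(0.0, theta, size=(trials, n))  # each row: n iid Unif(0, theta) draws

mle      = x.max(axis=1)                       # theta_hat_MLE = x_max
mom      = 2.0 * x.mean(axis=1)                # theta_hat_MoM = 2 * sample mean
debiased = (n + 1) / n * x.max(axis=1)         # (n + 1)/n * x_max

for name, est in [("MLE", mle), ("MoM", mom), ("de-biased MLE", debiased)]:
    print(f"{name:>14}: empirical bias ~ {est.mean() - theta:+.4f}")

# The MLE's bias should be near -theta/(n + 1) = -10/6 ≈ -1.67; the other two near 0.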
Example(s)
Recall that if x1 , . . . , xn ∼ Exp(θ) are iid, our MLE and MoM estimates were both the inverse sample
mean:
θ̂ = θ̂_MLE = θ̂_MoM = 1/x̄ = n / Σ_{i=1}^n x_i
What can you say about the bias of this estimator?
Solution
E[θ̂] = E[n / Σ_{i=1}^n x_i]
     ≥ n / Σ_{i=1}^n E[x_i]        [Jensen’s inequality]
     = n / Σ_{i=1}^n (1/θ)         [E[Exp(θ)] = 1/θ]
     = n / (n/θ)
     = θ
The inequality comes from Jensen’s inequality (section 6.3): since g(x_1, ..., x_n) = 1 / Σ_{i=1}^n x_i is convex (at least in the
positive octant where all x_i ≥ 0), we have that E[g(x_1, ..., x_n)] ≥ g(E[x_1], E[x_2], ..., E[x_n]). It is convex
for a reason similar to why 1/x is a convex function. So E[θ̂] ≥ θ systematically, and we typically have an
overestimate.
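A quick simulation sketch of this overestimation (not from the notes; assumes numpy, with arbitrary illustrative settings; note that numpy parameterizes the exponential by scale = 1/rate):

import numpy as np

rng = np.random.default_rng(2)
theta, n, trials = 2.0, 10, 200_000                   # hypothetical settings

# Exp(theta) with rate theta corresponds to scale = 1/theta in numpy.
x = rng.exponential(scale=1.0 / theta, size=(trials, n))

theta_hat = n / x.sum(axis=1)                         # inverse sample mean

print("average of theta_hat:", theta_hat.mean())          # noticeably above theta = 2.0
print("empirical bias:      ", theta_hat.mean() - theta)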
We are often also interested in how much an estimator varies (we would like it to be unbiased and have small
variance, so that it is more accurate). One metric that captures this property of estimators is the estimator’s
variance.
The variance of an estimator θ̂ is

Var(θ̂) = E[(θ̂ − E[θ̂])²]
This is just the definition of variance applied to the random variable θ̂ and isn’t actually a new definition.
But maybe instead of just computing the variance, we want a slightly different metric, one which measures
the expected squared difference between the estimator and the true parameter, not just the difference from the
estimator’s own expectation:

E[(θ̂ − θ)²]
We call this property the mean squared error (MSE), and it is related to both bias and variance! Look
closely at the difference: if θ̂ is unbiased, then E[θ̂] = θ and the MSE and variance are actually equal!
This leads to what is known as the “Bias-Variance Tradeoff” in machine learning and statistics. Usually, we
want to minimize MSE, and these two quantities are often inversely related: decreasing one leads
to an increase in the other, and finding the balance will minimize the MSE. It’s hard to see why that might
be the case here, since we aren’t working with very complex estimators yet (we’re just learning the basics!).
Proof of Alternate MSE Formula. We will prove that MSE(θ̂, θ) = Var(θ̂) + Bias(θ̂, θ)².

MSE(θ̂, θ) = E[(θ̂ − θ)²]                                                       [def of MSE]
          = E[((θ̂ − E[θ̂]) + (E[θ̂] − θ))²]                                     [add and subtract E[θ̂]]
          = E[(θ̂ − E[θ̂])²] + 2E[(θ̂ − E[θ̂])(E[θ̂] − θ)] + E[(E[θ̂] − θ)²]        [(a + b)² = a² + 2ab + b²]
          = Var(θ̂) + 0 + Bias(θ̂, θ)²                                           [def of var and bias; E[θ̂ − E[θ̂]] = 0, and the bias is a constant]
          = Var(θ̂) + Bias(θ̂, θ)²
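A numerical sketch of this identity (not from the notes; assumes numpy, reuses the Unif(0, θ) MLE as the estimator, and the settings are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(3)
theta, n, trials = 10.0, 5, 500_000              # hypothetical settings

x = rng.uniform(0.0, theta, size=(trials, n))
est = x.max(axis=1)                              # theta_hat_MLE = x_max

mse     = np.mean((est - theta) ** 2)            # E[(theta_hat - theta)^2]
var     = est.var()                              # Var(theta_hat)
bias_sq = (est.mean() - theta) ** 2              # Bias(theta_hat, theta)^2

print(f"MSE ≈ {mse:.4f}   Var + Bias^2 ≈ {var + bias_sq:.4f}")   # should agree closely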
Example(s)
Recall again that if x_1, ..., x_n are iid realizations from Poi(θ), then the MLE and MoM estimators were both the
sample mean:

θ̂ = θ̂_MLE = θ̂_MoM = (1/n) Σ_{i=1}^n x_i

What is the MSE of this estimator?
Solution To compute the MSE, let’s compute the bias and variance separately. Earlier, we showed that
Bias(θ̂, θ) = E[θ̂] − θ = θ − θ = 0
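(Before continuing with the variance, here is a simulation sketch, not from the notes, that estimates the bias, variance, and MSE of this estimator empirically; it assumes numpy and uses arbitrary illustrative settings.)

import numpy as np

rng = np.random.default_rng(4)
theta, n, trials = 3.0, 50, 200_000                         # hypothetical settings

theta_hat = rng.poisson(lam=theta, size=(trials, n)).mean(axis=1)

bias = theta_hat.mean() - theta                             # ~ 0, as shown above
var  = theta_hat.var()                                      # empirical Var(theta_hat)
mse  = np.mean((theta_hat - theta) ** 2)                    # with zero bias, MSE ~ Var

print("empirical bias:", bias)
print("empirical var :", var)
print("empirical MSE :", mse)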