Lecture 15: Fisher Information
Introducing the digamma function ψ(α) = Γ′(α)/Γ(α), the MLE α̂ is obtained by (numerically) solving

0 = l′(α) = −nψ(α) + ∑_{i=1}^n log Xi.
What is the sampling distribution of α̂? We compute
∂²/∂α² log f(x|α) = −ψ′(α).

As this does not depend on x, the Fisher information is I(α) = −Eα[−ψ′(α)] = ψ′(α). Then for large n, α̂ is distributed approximately as N(α, 1/(nψ′(α))).
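To make this concrete, here is a minimal Python sketch (assuming numpy/scipy are available; the true value α = 3, the sample size n = 500, and the simulated data are illustrative choices, not part of the notes) that solves the score equation 0 = −nψ(α) + ∑ log Xi numerically and reports the large-sample standard error 1/√(nψ′(α̂)).

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma, polygamma
from scipy.stats import gamma

# Illustrative data from Gamma(alpha, 1) with a made-up true shape alpha = 3.
rng = np.random.default_rng(0)
alpha_true, n = 3.0, 500
x = gamma.rvs(alpha_true, size=n, random_state=rng)

# Score equation for the Gamma(alpha, 1) model, scaled by 1/n:
# 0 = mean(log X_i) - psi(alpha).  digamma is increasing, so brentq can bracket the root.
mean_log = np.mean(np.log(x))
alpha_hat = brentq(lambda a: mean_log - digamma(a), 1e-6, 1e6)

# Approximate standard error from I(alpha) = psi'(alpha): alpha_hat ~ N(alpha, 1/(n psi'(alpha))).
se = 1.0 / np.sqrt(n * polygamma(1, alpha_hat))
print(f"alpha_hat = {alpha_hat:.4f}, approximate SE = {se:.4f}")
```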
Asymptotic normality of the MLE extends naturally to the setting of multiple parameters:
Theorem 15.2. Let {f(x|θ) : θ ∈ Ω} be a parametric model, where θ ∈ ℝᵏ has k parameters. Let X1, . . . , Xn be IID from f(x|θ) for θ ∈ Ω, and let θ̂n be the MLE based on X1, . . . , Xn. Define the Fisher information matrix I(θ) ∈ ℝᵏˣᵏ as the matrix whose (i, j) entry is given by the equivalent expressions

I(θ)ij = Covθ[ ∂/∂θi log f(X|θ), ∂/∂θj log f(X|θ) ] = −Eθ[ ∂²/(∂θi ∂θj) log f(X|θ) ].   (15.1)
Then under the same conditions as Theorem 14.1,
√n (θ̂n − θ) → N(0, I(θ)⁻¹),

where I(θ)⁻¹ is the k × k matrix inverse of I(θ) (and the distribution on the right is the multivariate normal distribution having this covariance).

(For k = 1, this definition of I(θ) is exactly the same as our previous definition, and I(θ)⁻¹ is just 1/I(θ). The proof of the above result is analogous to the k = 1 case from last lecture, employing a multivariate Taylor expansion of the equation 0 = ∇l(θ̂) around θ̂ = θ0.)
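As an illustrative check of Theorem 15.2 (not part of the notes), the following Python sketch estimates I(θ) for the two-parameter Gamma(α, β) model directly from the first expression in (15.1), using the partial derivatives of its log-density (derived in Example 15.3 below), and compares I(θ)⁻¹ with the simulated covariance of √n(θ̂n − θ), where the MLE is computed with scipy.stats.gamma.fit. The values α = 3, β = 2, n = 400 are made up.

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import gamma

alpha, beta, n, reps = 3.0, 2.0, 400, 1000   # illustrative settings
rng = np.random.default_rng(0)

# Estimate I(theta) from the score covariance (first expression in (15.1)).
x = gamma.rvs(alpha, scale=1.0 / beta, size=500_000, random_state=rng)
score = np.column_stack([np.log(beta) - digamma(alpha) + np.log(x),  # d/d alpha of log f(x|alpha,beta)
                         alpha / beta - x])                          # d/d beta  of log f(x|alpha,beta)
I_mc = np.cov(score, rowvar=False)

# Simulate the sampling distribution of sqrt(n) * (theta_hat - theta).
devs = []
for _ in range(reps):
    xs = gamma.rvs(alpha, scale=1.0 / beta, size=n, random_state=rng)
    a_hat, _, scale_hat = gamma.fit(xs, floc=0)          # MLE with the location parameter fixed at 0
    devs.append(np.sqrt(n) * np.array([a_hat - alpha, 1.0 / scale_hat - beta]))

print("empirical covariance of sqrt(n)(theta_hat - theta):\n", np.cov(np.array(devs), rowvar=False))
print("I(theta)^{-1} from the Monte Carlo Fisher information:\n", np.linalg.inv(I_mc))
```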
Example 15.3. Consider now the full Gamma model, X1, . . . , Xn IID from Gamma(α, β). Numerical computation of the MLEs α̂ and β̂ in this model was discussed in Lecture 13. To
approximate their sampling distributions, note
log f(x|α, β) = log[ (β^α / Γ(α)) x^{α−1} e^{−βx} ] = α log β − log Γ(α) + (α − 1) log x − βx,
so
∂²/∂α² log f(x|α, β) = −ψ′(α),   ∂²/(∂α ∂β) log f(x|α, β) = 1/β,   ∂²/∂β² log f(x|α, β) = −α/β².
These partial derivatives again do not depend on x, so the Fisher information matrix is
I(α, β) = [ ψ′(α)   −1/β ]
          [ −1/β    α/β² ],

so for large n, (α̂, β̂) is distributed approximately as N((α, β), (1/n) I(α, β)⁻¹).
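Here is a short sketch (with illustrative values α = 3, β = 2 and a hypothetical sample size n = 500, not from the notes) that builds this matrix with scipy.special.polygamma and forms the approximate covariance I(α, β)⁻¹/n of (α̂, β̂), whose diagonal gives approximate variances for the two estimates.

```python
import numpy as np
from scipy.special import polygamma

alpha, beta, n = 3.0, 2.0, 500                 # illustrative values

# Closed-form Fisher information matrix for Gamma(alpha, beta); polygamma(1, .) is psi'.
I = np.array([[polygamma(1, alpha), -1.0 / beta],
              [-1.0 / beta,          alpha / beta**2]])

# Approximate covariance of (alpha_hat, beta_hat) based on n observations.
cov_approx = np.linalg.inv(I) / n
se_alpha, se_beta = np.sqrt(np.diag(cov_approx))
print("approximate covariance:\n", cov_approx)
print(f"approx SE(alpha_hat) = {se_alpha:.4f}, approx SE(beta_hat) = {se_beta:.4f}")
```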
More generally, for any 2 × 2 Fisher information matrix

I = [ a   b ]
    [ b   c ],

the first definition of equation (15.1) implies that a, c ≥ 0. The upper-left element of I⁻¹ is 1/(a − b²/c), which (since b²/c ≥ 0) is always at least 1/a. This implies, for any model with a single parameter
θ1 that is contained inside a larger model with parameters (θ1 , θ2 ), that the variability of
the MLE for θ1 in the larger model is always at least that of the MLE for θ1 in the smaller
model; they are equal when the off-diagonal entry b is equal to 0. The same observation is
true for any number of parameters k ≥ 2 in the larger model.
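Continuing the Gamma example with the same illustrative α = 3, β = 2, the sketch below compares the asymptotic variance of √n(α̂ − α) in the two-parameter model, the upper-left entry of I(α, β)⁻¹, with the variance 1/ψ′(α) obtained when β is treated as known, illustrating the inflation described above.

```python
import numpy as np
from scipy.special import polygamma

alpha, beta = 3.0, 2.0                        # illustrative values
a, b, c = polygamma(1, alpha), -1.0 / beta, alpha / beta**2

I = np.array([[a, b], [b, c]])
var_full    = np.linalg.inv(I)[0, 0]          # asymptotic variance of sqrt(n)(alpha_hat - alpha), beta unknown
var_reduced = 1.0 / a                         # same quantity when beta is treated as known

print(var_full, 1.0 / (a - b**2 / c))         # equal: the upper-left entry of I^{-1} is 1/(a - b^2/c)
print("inflation factor:", var_full / var_reduced)
```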
This is a simple example of a trade-off between model complexity and accuracy of esti-
mation, which is fundamental to many areas of statistics and machine learning: a complex
model with more parameters might better capture the true distribution of data, but these
parameters will also be more difficult to estimate than those in a simpler model.
I(θ0 ) measures the expected curvature of the log-likelihood function l(θ) around the true
parameter θ = θ0 . If l(θ) is sharply curved around θ0 —in other words, I(θ0 ) is large—then a
small change in θ can lead to a large decrease in the log-likelihood l(θ), and hence the data
provides a lot of “information” that the true value of θ is close to θ0 . Conversely, if I(θ0 )
is small, then a small change in θ does not affect l(θ) by much, and the data provides less
information about θ. In this (heuristic) sense, I(θ0 ) quantifies the amount of information
that each observation Xi contains about the unknown parameter.
The Fisher information I(θ) is an intrinsic property of the model {f (x|θ) : θ ∈ Ω}, not
of any specific estimator. (We’ve shown that it is related to the variance of the MLE, but
its definition does not involve the MLE.) There are various information-theoretic results
stating that I(θ) describes a fundamental limit to how accurate any estimator of θ based on
X1, . . . , Xn can be. We'll prove one such result, called the Cramér-Rao lower bound:

Theorem (Cramér-Rao lower bound). Let T = T(X1, . . . , Xn) be any unbiased estimator of θ based on X1, . . . , Xn IID from f(x|θ). Then, under appropriate regularity conditions,

Varθ[T] ≥ 1/(nI(θ)).
Proof. Recall the score function

z(x, θ) = ∂/∂θ log f(x|θ) = (∂/∂θ f(x|θ)) / f(x|θ),
and let Z := Z(X1, . . . , Xn, θ) = ∑_{i=1}^n z(Xi, θ). By the definition of correlation and the fact that the correlation of two random variables is always between −1 and 1,

Covθ[T, Z]² ≤ Varθ[T] · Varθ[Z],   i.e.,   Varθ[T] ≥ Covθ[T, Z]² / Varθ[Z].

The random variables z(X1, θ), . . . , z(Xn, θ) are IID, and by Lemma 14.1, they have mean 0 and variance I(θ). Then

Varθ[Z] = ∑_{i=1}^n Varθ[z(Xi, θ)] = nI(θ).
Since T is unbiased,
θ = Eθ[T] = ∫_{ℝⁿ} T(x1, . . . , xn) f(x1|θ) × · · · × f(xn|θ) dx1 · · · dxn.
Differentiating both sides with respect to θ and applying the product rule of differentiation,
1 = ∫_{ℝⁿ} T(x1, . . . , xn) [ (∂/∂θ f(x1|θ)) × f(x2|θ) × · · · × f(xn|θ)
                             + f(x1|θ) × (∂/∂θ f(x2|θ)) × · · · × f(xn|θ) + · · ·
                             + f(x1|θ) × f(x2|θ) × · · · × (∂/∂θ f(xn|θ)) ] dx1 · · · dxn

  = ∫_{ℝⁿ} T(x1, . . . , xn) Z(x1, . . . , xn, θ) f(x1|θ) × · · · × f(xn|θ) dx1 · · · dxn

  = Eθ[T Z].
Since Eθ[Z] = 0, this implies Covθ[T, Z] = Eθ[T Z] = 1, so Varθ[T] ≥ 1/(nI(θ)), as desired.
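As an illustrative sanity check (not part of the notes; α = 3, n = 200, and 2000 replications are arbitrary choices), the following simulation estimates the variance of the MLE α̂ in the Gamma(α, 1) model over repeated samples and compares it with 1/(nψ′(α)); the two are close for large n, consistent with the MLE being asymptotically efficient.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma, polygamma
from scipy.stats import gamma

alpha_true, n, reps = 3.0, 200, 2000      # illustrative settings
rng = np.random.default_rng(1)

def mle_shape(x):
    """MLE of alpha in the Gamma(alpha, 1) model: solve mean(log X_i) = psi(alpha)."""
    m = np.mean(np.log(x))
    return brentq(lambda a: m - digamma(a), 1e-6, 1e6)

alpha_hats = np.array([mle_shape(gamma.rvs(alpha_true, size=n, random_state=rng))
                       for _ in range(reps)])

print("simulated Var(alpha_hat):       ", alpha_hats.var())
print("Cramer-Rao bound 1/(n psi'(a)): ", 1.0 / (n * polygamma(1, alpha_true)))
```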
For two unbiased estimators of θ, the ratio of their variances is called their relative efficiency. An unbiased estimator is efficient if its variance equals the lower bound 1/(nI(θ)). Since the MLE achieves this lower bound asymptotically, we say it is asymptotically efficient.
The Cramér-Rao bound ensures that no unbiased estimator can achieve asymptotically lower variance than the MLE. Stronger results, which we will not prove in this class, in fact show that no estimator, biased or unbiased, can asymptotically achieve lower mean-squared-error than 1/(nI(θ)), except possibly on a small set of special values θ ∈ Ω.¹ In particular,
when the method-of-moments estimator differs from the MLE, we expect it to have higher
mean-squared-error than the MLE for large n, which explains why the MLE is usually the
preferred estimator in simple parametric models.
¹ For example, the constant estimator θ̂ = c for fixed c ∈ Ω achieves 0 mean-squared-error if the true parameter happened to be the special value c, but at all other parameter values is worse than the MLE for sufficiently large n.
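To illustrate this last point, here is a simulation sketch comparing the mean-squared-error of the method-of-moments estimator of α with that of the MLE in the Gamma(α, β) model (the MLE is computed with scipy.stats.gamma.fit with the location fixed at 0; the values α = 3, β = 2, n = 100 are illustrative, not from the notes).

```python
import numpy as np
from scipy.stats import gamma

alpha_true, beta_true, n, reps = 3.0, 2.0, 100, 1000   # illustrative settings
rng = np.random.default_rng(2)

mom_sq_err, mle_sq_err = [], []
for _ in range(reps):
    x = gamma.rvs(alpha_true, scale=1.0 / beta_true, size=n, random_state=rng)
    # Method of moments: match mean = alpha/beta and variance = alpha/beta^2.
    m, v = x.mean(), x.var()
    alpha_mom = m**2 / v
    # MLE via scipy's numerical fit, with the location parameter fixed at 0.
    alpha_mle, _, _ = gamma.fit(x, floc=0)
    mom_sq_err.append((alpha_mom - alpha_true) ** 2)
    mle_sq_err.append((alpha_mle - alpha_true) ** 2)

print("method-of-moments MSE for alpha:", np.mean(mom_sq_err))
print("MLE MSE for alpha:              ", np.mean(mle_sq_err))
```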