Maximum Likelihood Estimation
1 MLE
Let f₁(·|θ) with θ ∈ Θ be a parametric family of pdfs. Let X = (X₁, ..., Xₙ) be a random sample from the distribution f₁(·|θ₀) with θ₀ ∈ Θ. Then the joint pdf is f(x|θ) = ∏_{i=1}^n f₁(xᵢ|θ), where x = (x₁, ..., xₙ). The log-likelihood is ℓ(θ|x) = ∑_{i=1}^n log f₁(xᵢ|θ) = ∑_{i=1}^n ℓ₁(θ|xᵢ), where ℓ₁(θ|xᵢ) = log f₁(xᵢ|θ) denotes the log-likelihood of one observation. The maximum likelihood estimator is, by definition,

θ̂_ML = arg max_{θ∈Θ} ℓ(θ|X).
The FOC is

(1/n) ∑_{i=1}^n ∂ℓ₁(θ̂_ML|xᵢ)/∂θ = 0.
Note that the first information equality is E[∂ℓ₁(θ₀|Xᵢ)/∂θ] = 0. Thus the MLE is the method of moments estimator corresponding to the first information equality, so we can expect the MLE to be consistent. Indeed, the following theorem holds.
Theorem 1 (MLE consistency). In the setting above, assume that (1) θ₀ is identifiable, i.e. for any θ ≠ θ₀ there exists x such that f(x|θ) ≠ f(x|θ₀), (2) the support of f(·|θ) does not depend on θ, and (3) θ₀ is an interior point of the parameter space Θ. Then θ̂_ML →p θ₀.
The proof of MLE consistency will be given in 14.382 and 14.385. Roughly, the proof shows that the function g(θ) = E_{θ₀}[ℓ₁(θ|Xᵢ)] (here Xᵢ ∼ f₁(·|θ₀)) is maximized at θ = θ₀, and that the random process (1/n)ℓ(θ|X) converges to the function g(θ) uniformly in probability. It then argues that the maximizer of (1/n)ℓ(θ|X), which is θ̂_ML, must converge in probability to the maximizer of g(θ), which is θ₀.
Theorem 2 (MLE asymptotic normality). In the setting above, assume that conditions (1)-(3) in the MLE consistency theorem hold. In addition, assume that (4) f₁(xᵢ|θ) is thrice differentiable with respect to θ and we can interchange integration with respect to x and differentiation with respect to θ, and (5) |∂³ log f₁(xᵢ|θ)/∂θ³| ≤ M(xᵢ) with E[M(Xᵢ)] < ∞. Then

√n(θ̂_ML − θ₀) ⇒ N(0, I₁⁻¹(θ₀)),

where I₁(θ₀) = E[(∂ℓ₁(θ₀|Xᵢ)/∂θ)²] = −E[∂²ℓ₁(θ₀|Xᵢ)/∂θ²] is the Fisher information of one observation.
Proof. This is a sketch of the proof, as it omits an important step. By definition, ∂ℓ(θ̂_ML|X)/∂θ = 0. By Taylor's theorem with a remainder, there is some random variable θ̃ with value between θ₀ and θ̂_ML such that

0 = ∂ℓ(θ̂_ML|X)/∂θ = ∂ℓ(θ₀|X)/∂θ + (∂²ℓ(θ̃|X)/∂θ²)(θ̂_ML − θ₀).

So,

√n(θ̂_ML − θ₀) = −[(1/n) ∂²ℓ(θ̃|X)/∂θ²]⁻¹ · (1/√n) ∂ℓ(θ₀|X)/∂θ.
Since θ̂_ML →p θ₀ and θ̃ is between θ₀ and θ̂_ML, θ̃ →p θ₀ as well. From θ̃ →p θ₀, one can prove that

(1/n) ∂²ℓ(θ̃|X)/∂θ² − (1/n) ∂²ℓ(θ₀|X)/∂θ² →p 0.

We will not discuss this result here since it requires the concept of asymptotic equicontinuity, which we do not cover in this class. You will learn it in 14.385. Note, however, that this result does not follow from the Continuous mapping theorem, since we have a sequence of random functions ℓ(θ|X) instead of just one non-random function. Suppose we believe this result. Then, by the Law of large numbers,
(1/n) ∂²ℓ(θ₀|X)/∂θ² = (1/n) ∑_{i=1}^n ∂² log f₁(Xᵢ|θ₀)/∂θ² →p E[∂² log f₁(Xᵢ|θ₀)/∂θ²] = −I₁(θ₀).
Next, by the first information equality, E[∂ log f₁(Xᵢ|θ₀)/∂θ] = 0, while Var[∂ log f₁(Xᵢ|θ₀)/∂θ] = I₁(θ₀). Thus, by the Central limit theorem,

(1/√n) ∂ℓ(θ₀|X)/∂θ = (1/√n) ∑_{i=1}^n ∂ log f₁(Xᵢ|θ₀)/∂θ ⇒ N(0, I₁(θ₀)).

Combining the last three displays (together with Slutsky's theorem) gives √n(θ̂_ML − θ₀) ⇒ I₁(θ₀)⁻¹ · N(0, I₁(θ₀)) = N(0, I₁⁻¹(θ₀)).
One interpretation of the MLE asymptotics is that the MLE is asymptotically efficient (it hits the Rao-Cramer lower bound).
Example. Let X₁, ..., Xₙ be a random sample from a distribution with pdf f₁(x|λ) = λ exp(−λx). This distribution is called exponential. Its log-likelihood for one draw is ℓ₁(λ|xᵢ) = log λ − λxᵢ. So ∂ℓ₁(λ|xᵢ)/∂λ = 1/λ − xᵢ and ∂²ℓ₁(λ|xᵢ)/∂λ² = −1/λ². So the Fisher information is I₁(λ) = 1/λ². Let us find the MLE for λ. The log-likelihood for the whole sample is ℓ(λ|x) = n log λ − λ ∑_{i=1}^n xᵢ. The FOC is n/λ̂_ML − ∑_{i=1}^n xᵢ = 0. So λ̂_ML = 1/X̄ₙ. Its asymptotic distribution is given by √n(λ̂_ML − λ) ⇒ N(0, λ²).
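As a quick numerical check of this example (not part of the original notes), here is a minimal simulation sketch; the true value λ = 2, the sample size, and the number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true, n, n_rep = 2.0, 500, 2000

# Monte Carlo: draw many samples and compute the closed-form MLE lambda_hat = 1 / X_bar on each
lam_hats = np.empty(n_rep)
for r in range(n_rep):
    x = rng.exponential(scale=1.0 / lam_true, size=n)   # pdf lambda * exp(-lambda * x)
    lam_hats[r] = 1.0 / x.mean()

# The theorem predicts sqrt(n)(lam_hat - lam) => N(0, lam^2), i.e. sd(lam_hat) ~ lam / sqrt(n)
print("mean of the MLEs:", lam_hats.mean(), "(true value:", lam_true, ")")
print("sd of the MLEs:  ", lam_hats.std(), " vs lam/sqrt(n) =", lam_true / np.sqrt(n))
```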
2 Inference using MLE
We will have a longer discussion about how to estimate the asymptotic variance of the MLE, I₁⁻¹(θ₀), later when we discuss asymptotic tests. Right now I want to mention several suggestions.
First of all, if I₁(θ) is a continuous function of θ (which is needed for the asymptotic results), then, given that θ̂_ML is consistent for θ₀, the quantity (I₁(θ̂_ML))⁻¹ is consistent for (I₁(θ₀))⁻¹.
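For instance, in the exponential example above, I₁(λ) = 1/λ², so this plug-in estimate of the asymptotic variance is (I₁(λ̂_ML))⁻¹ = λ̂_ML², and the resulting standard error of λ̂_ML is λ̂_ML/√n.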
Second, by the definition of Fisher information, it equals the expectation of either the negative second derivative of the log-likelihood or the squared score. Instead of taking the expectation, one may approximate it by taking the corresponding sample average evaluated at θ̂_ML, i.e. use either −(1/n) ∑_{i=1}^n ∂²ℓ₁(θ̂_ML|Xᵢ)/∂θ² or (1/n) ∑_{i=1}^n (∂ℓ₁(θ̂_ML|Xᵢ)/∂θ)².
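Continuing with the exponential example, the following sketch (with hypothetical simulated data and an arbitrary seed) computes both sample-based approximations of the Fisher information and the implied standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=0.5, size=1000)     # hypothetical data, true lambda = 2
n = len(x)
lam_hat = 1.0 / x.mean()                      # MLE

# (a) Hessian-based estimate: minus the sample average of d^2 l1/d lambda^2 = -1/lambda^2.
#     In this model the second derivative does not depend on x_i, so the average is just 1/lam_hat^2.
info_hess = 1.0 / lam_hat**2
# (b) Outer-product-of-scores estimate: sample average of (d l1/d lambda)^2 = (1/lambda - x_i)^2.
info_opg = np.mean((1.0 / lam_hat - x) ** 2)

se_hess = 1.0 / np.sqrt(n * info_hess)        # standard error of lam_hat based on (a)
se_opg = 1.0 / np.sqrt(n * info_opg)          # standard error of lam_hat based on (b)
print(info_hess, info_opg, se_hess, se_opg)
```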
The third idea to be used in this context is the parametric bootstrap. Assume θ̂_ML is the MLE we obtained from the data. We can then simulate many new samples of size n from the fitted distribution f(·|θ̂_ML), recompute the MLE on each simulated sample, and use the distribution of these bootstrap estimates to approximate the sampling distribution (in particular, the variance) of θ̂_ML.
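Here is a minimal parametric-bootstrap sketch for the same exponential model; the "observed" data are simulated for illustration, and the number of bootstrap draws B = 1000 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=0.5, size=200)      # observed (simulated) data, true lambda = 2
lam_hat = 1.0 / x.mean()                      # MLE on the observed data

# Parametric bootstrap: resample from the fitted model f(.|lam_hat) and re-estimate each time
B = 1000
lam_boot = np.empty(B)
for b in range(B):
    x_b = rng.exponential(scale=1.0 / lam_hat, size=len(x))
    lam_boot[b] = 1.0 / x_b.mean()

se_boot = lam_boot.std()                      # bootstrap estimate of the standard error
ci = np.quantile(lam_boot, [0.025, 0.975])    # a simple percentile confidence interval
print(lam_hat, se_boot, ci)
```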
Example. A word of caution. For asymptotic normality of the MLE, the support of the distribution must not depend on the parameter (condition (2)). Let us see what might happen otherwise. Let X₁, ..., Xₙ be a random sample from U[0, θ]. Then θ̂_ML = X(n), the largest observation.
So √n(θ̂_ML − θ) is always nonpositive, and hence it cannot converge to a mean-zero normal distribution. In fact, E[X(n)] = (n/(n+1))θ and V(X(n)) = θ²n/((n+1)²(n+2)) ≈ θ²/n². On the other hand, if the theorem worked, we would have V(X(n)) ≈ I₁(θ)⁻¹/n. The MLE happens to be super-consistent here, meaning that it converges to the true value at a faster rate than the regular parametric rate of 1/√n.
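A small simulation sketch (illustrative choices of θ, n, and seed) makes the 1/n rate visible: quadrupling n roughly quarters the spread of X(n) instead of halving it.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n_rep = 1.0, 5000

# The spread of the MLE X_(n) shrinks like theta/n rather than like 1/sqrt(n)
for n in (100, 400):
    theta_hats = np.array([rng.uniform(0.0, theta, size=n).max() for _ in range(n_rep)])
    print(n, "sd of MLE:", theta_hats.std(), " vs theta/n =", theta / n)
```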
Example. Now let us consider what might happen if the true parameter value θ₀ were on the boundary of Θ. Let X₁, ..., Xₙ be a random sample from the distribution N(μ, 1) with μ ≥ 0. As an exercise, check that μ̂_ML = X̄ₙ if X̄ₙ ≥ 0 and μ̂_ML = 0 otherwise. Suppose that μ₀ = 0. Then √n(μ̂_ML − μ₀) is always nonnegative, so again it cannot converge to a mean-zero normal distribution; in fact, its limit is max{Z, 0} with Z ∼ N(0, 1).
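The following sketch (illustrative sample size and seed) shows what this limiting behavior looks like: roughly half of the simulated values of √n(μ̂_ML − μ₀) are exactly zero, and the rest behave like the positive part of a standard normal.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_rep = 400, 5000

# True mu_0 = 0 lies on the boundary of Theta = [0, infinity)
stats = np.empty(n_rep)
for r in range(n_rep):
    xbar = rng.normal(0.0, 1.0, size=n).mean()
    mu_hat = max(xbar, 0.0)                   # MLE: X_bar if nonnegative, 0 otherwise
    stats[r] = np.sqrt(n) * mu_hat            # sqrt(n) * (mu_hat - mu_0)

print("fraction exactly zero:", np.mean(stats == 0.0))        # about one half
print("mean of the positive part:", stats[stats > 0].mean())  # about sqrt(2/pi), like N(0,1) given > 0
```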
Example. Finally, note that it is implicitly assumed both in the consistency and the asymptotic normality theorems that the parameter space Θ is fixed, i.e. independent of n. In particular, the number of parameters must not grow with the sample size. For example, let

Xᵢ = (X₁ᵢ, X₂ᵢ)' ∼ N((μᵢ, μᵢ)', σ²I₂)

for i = 1, ..., n, where I₂ is the 2×2 identity matrix, and let X₁, ..., Xₙ be mutually independent. One can show that even as the sample size n increases to infinity, the MLE for σ² is inconsistent in this case, though a consistent estimator for σ² exists.
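For a numerical illustration, one can check (this derivation is not in the notes) that the MLE here is μ̂ᵢ = (X₁ᵢ + X₂ᵢ)/2 and σ̂²_ML = (1/(4n)) ∑_{i=1}^n (X₁ᵢ − X₂ᵢ)², which converges in probability to σ²/2; doubling it gives a consistent estimator. A minimal simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 1.0

# Each pair adds a new nuisance parameter mu_i, and the MLE of sigma^2 converges to sigma^2 / 2
for n in (100, 10_000):
    mu = rng.normal(size=n)                              # arbitrary individual means
    x1 = mu + rng.normal(scale=np.sqrt(sigma2), size=n)
    x2 = mu + rng.normal(scale=np.sqrt(sigma2), size=n)
    sigma2_mle = np.sum((x1 - x2) ** 2) / (4 * n)        # MLE of sigma^2
    print(n, "MLE:", sigma2_mle, " consistent correction:", 2 * sigma2_mle)
```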
What is interesting, though we will not show it here, is that the bootstrap does not help in these cases; that is, the bootstrap approximation to the distribution of θ̂_ML is not close to the true finite-sample distribution of θ̂_ML.
4 Pseudo-MLE
Let us have a sample X = (X₁, ..., Xₙ), i.i.d. from some distribution. We do not know what distribution it is; say it has pdf g(xᵢ). But we wrongly assumed a specific parametric family, that is, we assumed Xᵢ ∼ f₁(xᵢ|θ). What would happen if we do MLE? It turns out that the MLE will be estimating a pseudo-true parameter value θ₀ which minimizes, in a Kullback-Leibler sense, the distance between g(·) and the family f₁(·|θ). In particular:
θ₀ = arg max_θ ∫ log[f₁(xᵢ|θ)] g(xᵢ) dxᵢ = arg max_θ E[log f₁(Xᵢ|θ)].
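As a concrete illustration (not worked out in the notes): suppose the model is the exponential family f₁(x|λ) = λ exp(−λx) from the earlier example, but the true pdf g is some other distribution on (0, ∞) with mean m = E[Xᵢ]. Then E[log f₁(Xᵢ|λ)] = log λ − λm, which is maximized at λ₀ = 1/m. The pseudo-true value simply matches the reciprocal of the true mean, even though the model is wrong.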
Parameter θ₀ may or may not be of interest. Under some regularity conditions θ̂_ML →p θ₀, and most of the logic of the proof of the asymptotic normality theorem still goes through. However, the information equality generally fails under misspecification: if we let Σ₁ = Var[∂ log f₁(Xᵢ|θ₀)/∂θ] and Σ₂ = −E[∂² log f₁(Xᵢ|θ₀)/∂θ²], then in general Σ₁ ≠ Σ₂. But using the logic of the proof, we can prove that

√n(θ̂_ML − θ₀) ⇒ N(0, Σ₂⁻¹ Σ₁ Σ₂⁻¹).

This asymptotic variance is often called the sandwich formula, and the corresponding standard errors are called robust (or sandwich) standard errors.
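A minimal sketch of how such sandwich standard errors could be computed in the scalar exponential example, assuming (hypothetically) that the data are actually lognormal; Σ₁ and Σ₂ are replaced by their sample analogues evaluated at θ̂_ML.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000

# Hypothetical misspecification: fit the exponential model f1(x|lambda) = lambda*exp(-lambda*x)
# to data that are actually lognormal; the pseudo-true value is lambda_0 = 1 / E[X_i].
x = rng.lognormal(mean=0.0, sigma=1.0, size=n)
lam_hat = 1.0 / x.mean()                      # (pseudo-)MLE

score = 1.0 / lam_hat - x                     # per-observation score at lam_hat
Sigma1 = np.mean(score ** 2)                  # sample analogue of the variance of the score
Sigma2 = 1.0 / lam_hat ** 2                   # sample analogue of minus the mean second derivative

se_naive = np.sqrt(1.0 / (n * Sigma2))             # standard error pretending the model is correct
se_sandwich = np.sqrt(Sigma1 / Sigma2 ** 2 / n)    # sandwich standard error: Sigma2^-1 Sigma1 Sigma2^-1 / n
print(lam_hat, se_naive, se_sandwich)
```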
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms