
Lecture 7

Maximum Likelihood Estimation.

1 MLE

Let f1(·|θ) with θ ∈ Θ be a parametric family. Let X = (X1, ..., Xn) be a random sample from the distribution f1(·|θ0) with θ0 ∈ Θ. Then the joint pdf is f(x|θ) = ∏_{i=1}^n f1(xi|θ), where x = (x1, ..., xn). The log-likelihood is ℓ(θ|x) = ∑_{i=1}^n log f1(xi|θ). The maximum likelihood estimator is, by definition,

θ̂_ML = arg max_{θ∈Θ} ℓ(θ|x).

The FOC is

(1/n) ∑_{i=1}^n ∂ℓ1(θ̂_ML|xi)/∂θ = 0.

Note that the first information equality is E[∂ℓ1(θ0|Xi)/∂θ] = 0. Thus the MLE is the method of moments estimator corresponding to the first information equality, so we can expect the MLE to be consistent. Indeed, the theorem below gives the consistency result for the MLE:

Theorem 1 (MLE consistency). In the setting above, assume that (1) θ0 is identifiable, i.e. for any θ ≠ θ0 there exists x such that f(x|θ) ≠ f(x|θ0), (2) the support of f(·|θ) does not depend on θ, and (3) θ0 is an interior point of the parameter space Θ. Then θ̂_ML →p θ0.
The proof of MLE consistency will be given in 14.382 and 14.385. Roughly, the proof shows that the function g(θ) = E_{θ0}[ℓ1(θ|Xi)] (here Xi ∼ f1(·|θ0)) is maximized at θ = θ0 and that the random process (1/n)ℓ(θ|X) converges to g(θ) uniformly in probability. It then argues that the maximizer θ̂_ML of this process converges to θ0.
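For intuition, here is a quick sketch of why g(θ) is maximized at θ0 (assuming the expectations below exist and using the support condition (2)). By Jensen's inequality,

g(θ) − g(θ0) = E_{θ0}[log(f1(Xi|θ)/f1(Xi|θ0))] ≤ log E_{θ0}[f1(Xi|θ)/f1(Xi|θ0)] = log ∫ f1(x|θ) dx = log 1 = 0,

and the identifiability condition (1) is what rules out maximizers other than θ0.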


Once we know that the estimator is consistent, we can think about the asymptotic distribution of the

estimator. The next theorem gives the asymptotic distribution of MLE:

Theorem 2 (MLE asymptotic normality). In the setting above, assume that conditions (1)-(3) in the MLE consistency theorem hold. In addition, assume that (4) f1(x|θ) is thrice differentiable with respect to θ and we can interchange integration with respect to x and differentiation with respect to θ, and (5) |∂³ log f1(x|θ)/∂θ³| ≤ M(x) with E[M(Xi)] < ∞. Then

√n(θ̂_ML − θ0) ⇒ N(0, I1(θ0)⁻¹).

Proof. This is a sketch of the proof, as it omits an important step. By definition, ∂ℓ(θ̂_ML|x)/∂θ = 0. By Taylor's theorem with a remainder, there is some random variable θ̃ with value between θ0 and θ̂_ML such that

∂ℓ(θ̂_ML|X)/∂θ = ∂ℓ(θ0|X)/∂θ + (∂²ℓ(θ̃|X)/∂θ²)(θ̂_ML − θ0).

So,

√n(θ̂_ML − θ0) = [−(1/√n) ∂ℓ(θ0|X)/∂θ] / [(1/n) ∂²ℓ(θ̃|X)/∂θ²].

Since θ̂_ML →p θ0 and θ̃ is between θ0 and θ̂_ML, θ̃ →p θ0 as well. From θ̃ →p θ0, one can prove that

(1/n) ∂²ℓ(θ̃|X)/∂θ² − (1/n) ∂²ℓ(θ0|X)/∂θ² = op(1).

We will not discuss this result here since it requires knowledge of the concept of asymptotic equicontinuity

which we do not cover in this class. You will learn it in 14.385. Note, however, that this result does not

follow from the Continuous mapping theorem since we have a sequence of random functions ℓ(θ|X) instead

of just one non-random function. Suppose we believe in this result. Then, by the Law of large numbers,

(1/n) ∂²ℓ(θ0|X)/∂θ² = (1/n) ∑_{i=1}^n ∂² log f1(Xi|θ0)/∂θ² →p E[∂² log f1(Xi|θ0)/∂θ²] = −I1(θ0).

Next, by the first information equality, E[∂ log f1(Xi|θ0)/∂θ] = 0, while Var[∂ log f1(Xi|θ0)/∂θ] = I1(θ0). Thus, by the central limit theorem,

(1/√n) ∂ℓ(θ0|X)/∂θ = (1/√n) ∑_{i=1}^n ∂ log f1(Xi|θ0)/∂θ ⇒ N(0, I1(θ0)).

Finally, by Slutsky's theorem,

√n(θ̂_ML − θ0) ⇒ N(0, I1(θ0)⁻¹).

One interpretation of the MLE asymptotics is that the MLE is asymptotically efficient (it hits the Cramer-Rao bound in very large samples).

Example. Let X1, ..., Xn be a random sample from a distribution with pdf f(x|λ) = λ exp(−λx), x ≥ 0. This distribution is called exponential. Its log-likelihood for one draw is ℓ1(λ|xi) = log λ − λxi. So ∂ℓ1(λ|xi)/∂λ = 1/λ − xi and ∂²ℓ1(λ|xi)/∂λ² = −1/λ². So the Fisher information is I1(λ) = 1/λ². Let us find the MLE for λ. The log-likelihood for the whole sample is ℓ(λ|x) = n log λ − λ ∑_{i=1}^n xi. The FOC is n/λ̂_ML − ∑_{i=1}^n xi = 0. So λ̂_ML = 1/X̄n. Its asymptotic distribution is given by √n(λ̂_ML − λ) ⇒ N(0, λ²).
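This approximation is easy to check by simulation. Below is a minimal Python sketch (the values of λ, n, and the number of replications are arbitrary illustrative choices) that compares the simulated variance of √n(λ̂_ML − λ) with the theoretical value λ²:

import numpy as np

rng = np.random.default_rng(0)
lam, n, B = 2.0, 500, 5000                      # true lambda, sample size, replications

# MLE in the exponential model is 1 / sample mean
lam_hat = np.array([1.0 / rng.exponential(scale=1.0 / lam, size=n).mean() for _ in range(B)])

z = np.sqrt(n) * (lam_hat - lam)                # should be approximately N(0, lambda^2)
print("simulated variance:", z.var())
print("theoretical variance lambda^2:", lam**2)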

2 Inference using MLE
We will have a longer discussion about how to estimate the asymptotic variance of the MLE, I1(θ0)⁻¹, later when we discuss asymptotic tests. Right now I want to mention several suggestions.

First of all, if I1(θ) is a continuous function of θ (which is needed for the asymptotic results), then given that θ̂_ML is consistent for θ0, the quantity I1(θ̂_ML)⁻¹ is consistent for I1(θ0)⁻¹.

Second, by the definition of Fisher information, it equals the expectation of either the negative second derivative of the log-likelihood or the squared score. Instead of taking the expectation, one may approximate it by averaging. For example,

Î = −(1/n) ∑_{i=1}^n ∂²ℓ1(θ̂_ML|Xi)/∂θ²

will be a consistent estimator of the Fisher information.
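For the exponential example above, both the plug-in and the averaging ideas take only a few lines of Python; this is only a sketch, where x is assumed to be a NumPy array containing the sample:

import numpy as np

def fisher_estimates(x):
    lam_hat = 1.0 / x.mean()                                 # MLE of lambda
    I_plugin = 1.0 / lam_hat**2                              # plug-in: I1(lam_hat) = 1 / lam_hat^2
    I_hessian = np.mean(np.full_like(x, 1.0 / lam_hat**2))   # average of -d2 l1 / d lam2 (constant in x here)
    I_score = np.mean((1.0 / lam_hat - x)**2)                # average of squared scores at lam_hat
    return I_plugin, I_hessian, I_score

In this particular model the second derivative of ℓ1 does not depend on the data, so the plug-in and Hessian-based averages coincide exactly, while the squared-score average differs slightly in finite samples.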

The third idea to be used in this context is the parametric bootstrap. Assume θ̂_ML is the MLE we obtained from our sample of size n. For b = 1, ..., B do the following:

• Simulate a sample X*_b = (X*_{1b}, ..., X*_{nb}) as i.i.d. draws from f1(·|θ̂_ML) (that is, assuming that θ̂_ML is the true parameter value).

• Find the MLE using sample X*_b; denote it θ̂*_b.

Calculate the sample variance of (θ̂*_1, ..., θ̂*_B); it gives the bootstrap approximation to (n I1(θ0))⁻¹. You may also do bootstrap bias correction using a similar procedure.
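As an illustration, here is a minimal Python sketch of this procedure for the exponential example (x is assumed to be the observed sample as a NumPy array, and B is an arbitrary number of replications):

import numpy as np

def parametric_bootstrap_var(x, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = x.size
    lam_hat = 1.0 / x.mean()                                   # MLE from the original sample
    lam_star = np.empty(B)
    for b in range(B):
        x_star = rng.exponential(scale=1.0 / lam_hat, size=n)  # i.i.d. draws from f1(.|lam_hat)
        lam_star[b] = 1.0 / x_star.mean()                      # MLE on the bootstrap sample
    return lam_star.var(ddof=1)                                # approximates (n I1(lambda_0))^{-1} = lambda_0^2 / n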

3 When MLE asymptotic theory fails us...

Example. A word of caution: for asymptotic normality of the MLE, we should have common support. Let us see what might happen otherwise. Let X1, ..., Xn be a random sample from U[0, θ]. Then θ̂_ML = X(n). So √n(θ̂_ML − θ) is always nonpositive, and hence it does not converge to a mean-zero normal distribution. In fact, E[X(n)] = (n/(n+1))θ and V(X(n)) = θ²n/((n+1)²(n+2)) ≈ θ²/n². On the other hand, if the theorem worked, we would have V(X(n)) ≈ 1/(n I(θ)). The MLE happens to be super-consistent here, meaning it converges to the true value at a faster speed than the regular parametric speed of 1/√n.
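A quick simulation illustrates the 1/n² rate of the variance (the value of θ, the sample sizes, and the number of replications are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
theta, B = 1.0, 20000

for n in (50, 100, 200):
    theta_hat = rng.uniform(0.0, theta, size=(B, n)).max(axis=1)   # MLE = X_(n)
    print(n, theta_hat.var(), theta**2 / n**2)                     # simulated variance vs theta^2 / n^2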

Example. Now let us consider what might happen if the true parameter value θ0 were on the boundary of Θ. Let X1, ..., Xn be a random sample from the distribution N(µ, 1) with µ ≥ 0. As an exercise, check that µ̂_ML = X̄n if X̄n ≥ 0 and µ̂_ML = 0 otherwise. Suppose that µ0 = 0. Then √n(µ̂_ML − µ0) is always nonnegative, so it does not converge to a mean-zero normal distribution.
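A small simulation (again with arbitrary illustrative choices of n and the number of replications) shows what the limit looks like at the boundary: roughly half of the mass of √n(µ̂_ML − µ0) sits exactly at zero, and the rest behaves like the positive part of a standard normal:

import numpy as np

rng = np.random.default_rng(0)
n, B = 200, 20000
xbar = rng.normal(0.0, 1.0, size=(B, n)).mean(axis=1)
mu_hat = np.maximum(xbar, 0.0)                        # MLE under the constraint mu >= 0
z = np.sqrt(n) * mu_hat
print("fraction exactly zero:", np.mean(z == 0.0))    # about 0.5
print("mean of the positive part:", z[z > 0].mean())  # about E[Z | Z > 0] ~ 0.8 for Z ~ N(0, 1)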

Example. Finally, note that it is implicitly assumed both in the consistency and asymptotic normality theorems that the parameter space Θ is fixed, i.e. independent of n. In particular, the number of parameters should not depend on n. Indeed, let

Xi = (X1i, X2i)′ ∼ N((µi, µi)′, σ²I2), where I2 is the 2×2 identity matrix,

for i = 1, ..., n, and let X1, ..., Xn be mutually independent. One can show that if the sample size n increases to infinity, the MLE for σ² is inconsistent in this case, though a consistent estimator for σ² exists.

What is interesting, though we will not show it here, is that the bootstrap does not help in these cases; that is, the bootstrap approximation to the distribution of θ̂_ML is not close to the true finite-sample distribution of θ̂_ML.
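To see the inconsistency in this two-observations-per-mean example, here is a minimal simulation sketch (with arbitrary σ², n, and µi). Profiling out each µi gives an MLE for σ² based on within-pair differences, which converges to σ²/2 rather than σ², while the rescaled version is consistent:

import numpy as np

rng = np.random.default_rng(0)
sigma2, n = 1.0, 100000
mu = rng.normal(size=n)                              # arbitrary individual means mu_i
x1 = mu + rng.normal(scale=np.sqrt(sigma2), size=n)
x2 = mu + rng.normal(scale=np.sqrt(sigma2), size=n)

sigma2_mle = np.mean((x1 - x2) ** 2) / 4.0           # MLE after profiling out mu_i; tends to sigma2 / 2
sigma2_fix = np.mean((x1 - x2) ** 2) / 2.0           # bias-corrected, consistent estimator
print(sigma2_mle, sigma2_fix)                        # roughly 0.5 and 1.0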

4 Pseudo-MLE

Let us have a sample X = (X1, ..., Xn), i.i.d. from some distribution. We do not know what distribution it is; say it has pdf g(·). But we wrongly assumed a specific parametric family, that is, we assumed Xi ∼ f1(xi|θ). What would happen if we do MLE? It turns out that the MLE will be estimating a pseudo-true parameter value θ0 which minimizes, in some sense, the distance between g(·) and the family f1(·|θ). In particular,

θ0 = arg max_θ ∫ log[f1(xi|θ)] g(xi) dxi = arg max_θ E[log f1(Xi|θ)].

Parameter θ0 may or may not be of interest. Under some regularity conditions θ̂_ML →p θ0, and the logic of the proof of the asymptotic normality theorem mostly goes through. However, the information equality would fail. Define

Σ1 = E[(∂ log f1(Xi|θ0)/∂θ)²],

Σ2 = −E[∂² log f1(Xi|θ0)/∂θ²],
where the expectations in both cases are taken assuming that Xi ∼ g(·). If g is not in the parametric family, then in general Σ1 ≠ Σ2. But using the logic of the proof, we can show that

√n(θ̂_ML − θ0) ⇒ N(0, Σ2⁻¹Σ1Σ2⁻¹).

This asymptotic variance Σ2⁻¹Σ1Σ2⁻¹ is often called White's variance, after White's (1980) paper, and the corresponding standard errors are called White's standard errors.
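For the exponential model used earlier, the sandwich variance can be estimated from sample analogues of Σ1 and Σ2 evaluated at the pseudo-MLE. The following is only a sketch, with x assumed to be a NumPy array of data drawn from some unknown g:

import numpy as np

def sandwich_variance(x):
    lam_hat = 1.0 / x.mean()                      # pseudo-MLE in the exponential model
    score = 1.0 / lam_hat - x                     # d log f1 / d lam evaluated at lam_hat
    sigma1_hat = np.mean(score ** 2)              # sample analogue of Sigma_1
    sigma2_hat = 1.0 / lam_hat ** 2               # sample analogue of Sigma_2 (-d2 log f1 / d lam2 is constant here)
    return sigma1_hat / sigma2_hat ** 2 / x.size  # estimate of Var(lam_hat): Sigma2^{-1} Sigma1 Sigma2^{-1} / n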

MIT OpenCourseWare
https://ocw.mit.edu

14.381 Statistical Method in Economics
Fall 2018

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms
