ML Lectures
Eric Zivot
The likelihood function is defined as the joint density treated as a function of the
parameters $\theta$:
$$L(\theta|x_1,\ldots,x_n) = f(x_1,\ldots,x_n;\theta) = \prod_{i=1}^n f(x_i;\theta).$$
Notice that the likelihood function is a $k$-dimensional function of $\theta$ given the data
$x_1,\ldots,x_n$. It is important to keep in mind that the likelihood function, being a
function of $\theta$ and not the data, is not a proper pdf. It is always positive but
$$\int \cdots \int L(\theta|x_1,\ldots,x_n)\,d\theta_1 \cdots d\theta_k \neq 1.$$
(Footnote: If $X_1,\ldots,X_n$ are discrete random variables, then $f(x_1,\ldots,x_n;\theta) = \Pr(X_1 = x_1,\ldots,X_n = x_n)$ for a fixed value of $\theta$.)
To simplify notation, let the vector $x = (x_1,\ldots,x_n)$ denote the observed sample.
Then the joint pdf and likelihood function may be expressed as $f(x;\theta)$ and $L(\theta|x)$.
For a given value of $\theta$ and observed sample $x$, $f(x;\theta)$ gives the probability of observing
the sample. For example, suppose $n = 5$ and $x = (0,\ldots,0)$. Now some values of $\theta$
are more likely to have generated this sample than others. In particular, it is more
likely that $\theta$ is close to zero than to one. To see this, note that the likelihood function
for this sample is
$$L(\theta|(0,\ldots,0)) = (1-\theta)^5.$$
This function is illustrated in figure xxx. The likelihood function has a clear maximum
at $\theta = 0$. That is, $\theta = 0$ is the value of $\theta$ that makes the observed sample $x = (0,\ldots,0)$
most likely (highest probability).
Similarly, suppose $x = (1,\ldots,1)$. Then the likelihood function is
$$L(\theta|(1,\ldots,1)) = \theta^5,$$
which is illustrated in figure xxx. Now the likelihood function has a maximum at
$\theta = 1$.
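The short Python sketch below (an illustration, not part of the original notes) evaluates these two Bernoulli likelihoods on a grid of $\theta$ values and confirms that the grid maxima occur at $\theta = 0$ and $\theta = 1$, respectively.

```python
import numpy as np

def bernoulli_likelihood(theta, x):
    """Likelihood L(theta|x) = theta^(sum x_i) * (1-theta)^(n - sum x_i) for 0/1 data."""
    x = np.asarray(x)
    return theta ** x.sum() * (1.0 - theta) ** (len(x) - x.sum())

theta_grid = np.linspace(0.0, 1.0, 1001)

for sample in [np.zeros(5), np.ones(5)]:
    lik = np.array([bernoulli_likelihood(t, sample) for t in theta_grid])
    print(sample, "argmax over grid:", theta_grid[lik.argmax()])
# Expected: argmax 0.0 for the all-zeros sample and 1.0 for the all-ones sample.
```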
Figure xxx illustrates the normal likelihood for a representative sample of size $n = 25$.
Notice that the likelihood has the same bell shape as a bivariate normal density.
Suppose $\sigma^2 = 1$. Then
$$L(\theta|x) = L(\mu|x) = (2\pi)^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\right).$$
Now
$$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2 = \sum_{i=1}^n \left[(x_i - \bar{x})^2 + 2(x_i - \bar{x})(\bar{x} - \mu) + (\bar{x} - \mu)^2\right]$$
$$= \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2,$$
since the cross-product term vanishes because $\sum_{i=1}^n (x_i - \bar{x}) = 0$,
so that
$$L(\mu|x) = (2\pi)^{-n/2}\exp\left(-\frac{1}{2}\left[\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\right]\right).$$
Since $\sum_{i=1}^n (x_i - \bar{x})^2$ does not depend on $\mu$ and $n(\bar{x} - \mu)^2 \geq 0$, it is clear that $L(\mu|x)$ is maximized at
$\mu = \bar{x}$. This is illustrated in figure xxx.
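As an illustration (not from the original notes), the sketch below evaluates this likelihood on a grid of $\mu$ values for a simulated sample with $\sigma^2 = 1$ and confirms that the grid maximizer is essentially the sample mean $\bar{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=25)  # representative sample, sigma^2 = 1

def normal_log_lik_mu(mu, x):
    """Log-likelihood ln L(mu|x) for N(mu, 1) data (log scale avoids underflow)."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

mu_grid = np.linspace(x.mean() - 2, x.mean() + 2, 4001)
loglik = np.array([normal_log_lik_mu(m, x) for m in mu_grid])
print("grid maximizer:", mu_grid[loglik.argmax()], "sample mean:", x.mean())
```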
Example 3 Linear Regression Model with Normal Errors
Consider the linear regression
$$y_i = x_i'\beta + \varepsilon_i, \quad i = 1,\ldots,n,$$
where $x_i'$ is $(1\times k)$ and $\beta$ is $(k\times 1)$.
The log-likelihood function is then
$$\ln L(\theta|y,X) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).$$
To be completed
The maximum likelihood estimator (MLE) of $\theta$ solves
$$\max_\theta L(\theta|x),$$
or, equivalently, since the log is a monotonic transformation,
$$\max_\theta \ln L(\theta|x).$$
With random sampling, the log-likelihood has the particularly simple form
$$\ln L(\theta|x) = \ln\left(\prod_{i=1}^n f(x_i;\theta)\right) = \sum_{i=1}^n \ln f(x_i;\theta).$$
Since the MLE is defined as a maximization problem, we would like to know the
conditions under which we may determine the MLE using the techniques of calculus.
A regular pdf $f(x;\theta)$ provides a sufficient set of such conditions. We say that $f(x;\theta)$
is regular if
1. The support of the random variable $X$, $S_X = \{x : f(x;\theta) > 0\}$, does not
depend on $\theta$.
If $f(x;\theta)$ is regular then we may find the MLE by differentiating $\ln L(\theta|x)$ and
solving the first order conditions
$$\frac{\partial \ln L(\hat{\theta}_{mle}|x)}{\partial \theta} = 0.$$
Since $\theta$ is $(k\times 1)$, the first order conditions define $k$, potentially nonlinear, equations
in $k$ unknown values:
$$\frac{\partial \ln L(\hat{\theta}_{mle}|x)}{\partial \theta} = \begin{pmatrix} \dfrac{\partial \ln L(\hat{\theta}_{mle}|x)}{\partial \theta_1} \\ \vdots \\ \dfrac{\partial \ln L(\hat{\theta}_{mle}|x)}{\partial \theta_k} \end{pmatrix} = 0.$$
The vector of derivatives of the log-likelihood function is called the score vector
and is denoted
$$S(\theta|x) = \frac{\partial \ln L(\theta|x)}{\partial \theta}.$$
By definition, the MLE satisfies
$$S(\hat{\theta}_{mle}|x) = 0.$$
Under random sampling the score for the sample becomes the sum of the scores for
each observation $x_i$:
$$S(\theta|x) = \sum_{i=1}^n \frac{\partial \ln f(x_i;\theta)}{\partial \theta} = \sum_{i=1}^n S(\theta|x_i),$$
where $S(\theta|x_i) = \frac{\partial \ln f(x_i;\theta)}{\partial \theta}$ is the score associated with $x_i$.
Example 5 Bernoulli example continued
The log-likelihood function is
$$\ln L(\theta|x) = \ln\left(\theta^{\sum_{i=1}^n x_i}(1-\theta)^{n - \sum_{i=1}^n x_i}\right)$$
$$= \left(\sum_{i=1}^n x_i\right)\ln(\theta) + \left(n - \sum_{i=1}^n x_i\right)\ln(1-\theta).$$
The MLE satisfies $S(\hat{\theta}_{mle}|x) = 0$, which, after a little algebra, produces the MLE
$$\hat{\theta}_{mle} = \frac{1}{n}\sum_{i=1}^n x_i.$$
Hence, the sample average is the MLE for $\theta$ in the Bernoulli model.
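To make this concrete, here is a small Python sketch (an illustration, not part of the notes) that maximizes the Bernoulli log-likelihood numerically and compares the result with the sample average; `scipy.optimize.minimize_scalar` is applied to the negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)  # simulated Bernoulli(0.3) sample

def neg_log_lik(theta, x):
    """Negative Bernoulli log-likelihood; minimized instead of maximized."""
    s = x.sum()
    n = len(x)
    return -(s * np.log(theta) + (n - s) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), args=(x,), method="bounded")
print("numerical MLE:", res.x, "sample average:", x.mean())  # should agree closely
```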
Example 6 Normal example continued
Since the normal pdf is regular, we may determine the MLE for $\theta = (\mu, \sigma^2)$ by
maximizing the log-likelihood
$$\ln L(\theta|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2,$$
where
$$\frac{\partial \ln L(\theta|x)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)$$
$$\frac{\partial \ln L(\theta|x)}{\partial \sigma^2} = -\frac{n}{2}(\sigma^2)^{-1} + \frac{1}{2}(\sigma^2)^{-2}\sum_{i=1}^n (x_i - \mu)^2.$$
The first order conditions are
$$\frac{\partial \ln L(\hat{\theta}_{mle}|x)}{\partial \mu} = \frac{1}{\hat{\sigma}^2_{mle}}\sum_{i=1}^n (x_i - \hat{\mu}_{mle}) = 0$$
$$\frac{\partial \ln L(\hat{\theta}_{mle}|x)}{\partial \sigma^2} = -\frac{n}{2}(\hat{\sigma}^2_{mle})^{-1} + \frac{1}{2}(\hat{\sigma}^2_{mle})^{-2}\sum_{i=1}^n (x_i - \hat{\mu}_{mle})^2 = 0.$$
Solving the first equation for $\hat{\mu}_{mle}$ gives
$$\hat{\mu}_{mle} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}.$$
Hence, the sample average is the MLE for $\mu$. Using $\hat{\mu}_{mle} = \bar{x}$ and solving the second
equation for $\hat{\sigma}^2_{mle}$ gives
$$\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$
Notice that $\hat{\sigma}^2_{mle}$ is not equal to the sample variance, which divides by $n-1$ rather than $n$.
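The following sketch (illustrative, not from the notes) computes the normal MLEs from a simulated sample and contrasts $\hat{\sigma}^2_{mle}$ with the usual sample variance.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=50)

mu_mle = x.mean()                        # MLE of mu is the sample average
sigma2_mle = np.mean((x - mu_mle) ** 2)  # MLE of sigma^2 divides by n
s2 = x.var(ddof=1)                       # sample variance divides by n - 1

print("mu_mle:", mu_mle)
print("sigma2_mle:", sigma2_mle, "sample variance s^2:", s2)
print("ratio sigma2_mle / s^2:", sigma2_mle / s2)  # equals (n - 1) / n
```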
Example 7 Linear regression example continued
The log-likelihood is
$$\ln L(\theta|y,X) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).$$
The MLE of $\theta$ satisfies $S(\hat{\theta}_{mle}|y,X) = 0$, where $S(\theta|y,X) = \frac{\partial}{\partial\theta}\ln L(\theta|y,X)$ is the
score vector. Now
$$\frac{\partial \ln L(\theta|y,X)}{\partial \beta} = -\frac{1}{2\sigma^2}\frac{\partial}{\partial\beta}\left[y'y - 2y'X\beta + \beta'X'X\beta\right]$$
$$= -(\sigma^2)^{-1}\left[-X'y + X'X\beta\right]$$
$$\frac{\partial \ln L(\theta|y,X)}{\partial \sigma^2} = -\frac{n}{2}(\sigma^2)^{-1} + \frac{1}{2}(\sigma^2)^{-2}(y - X\beta)'(y - X\beta).$$
Solving $\frac{\partial \ln L(\theta|y,X)}{\partial \beta} = 0$ for $\beta$ gives
$$\hat{\beta}_{mle} = (X'X)^{-1}X'y = \hat{\beta}_{OLS}.$$
The sample information matrix is defined as
$$I(\theta|x) = -E[H(\theta|x)],$$
where $H(\theta|x) = \frac{\partial^2 \ln L(\theta|x)}{\partial\theta\partial\theta'}$ is the Hessian of the log-likelihood, and, under random sampling,
$$I(\theta|x) = -\sum_{i=1}^n E[H(\theta|x_i)] = -nE[H(\theta|x_i)] = nI(\theta|x_i).$$
The last result says that the sample information matrix is equal to $n$ times the
information matrix for an observation.
The following proposition relates some properties of the score function to the
information matrix.
Proposition 8 Let $f(x_i;\theta)$ be a regular pdf. Then
1. $E[S(\theta|x_i)] = \int S(\theta|x_i) f(x_i;\theta)\,dx_i = 0$
2. If $\theta$ is a scalar then
$$\mathrm{var}(S(\theta|x_i)) = E[S(\theta|x_i)^2] = \int S(\theta|x_i)^2 f(x_i;\theta)\,dx_i = I(\theta|x_i).$$
If $\theta$ is a vector then
$$\mathrm{var}(S(\theta|x_i)) = E[S(\theta|x_i)S(\theta|x_i)'] = \int S(\theta|x_i)S(\theta|x_i)' f(x_i;\theta)\,dx_i = I(\theta|x_i).$$
Next, recall that $I(\theta|x_i) = -E[H(\theta|x_i)]$ and
$$-E[H(\theta|x_i)] = -\int \frac{\partial^2 \ln f(x_i;\theta)}{\partial\theta^2} f(x_i;\theta)\,dx_i.$$
Now, by the chain rule,
$$\frac{\partial^2}{\partial\theta^2}\ln f(x_i;\theta) = \frac{\partial}{\partial\theta}\left(\frac{1}{f(x_i;\theta)}\frac{\partial}{\partial\theta}f(x_i;\theta)\right)$$
$$= -f(x_i;\theta)^{-2}\left(\frac{\partial}{\partial\theta}f(x_i;\theta)\right)^2 + f(x_i;\theta)^{-1}\frac{\partial^2}{\partial\theta^2}f(x_i;\theta).$$
Then
$$-E[H(\theta|x_i)] = -\int\left[-f(x_i;\theta)^{-2}\left(\frac{\partial}{\partial\theta}f(x_i;\theta)\right)^2 + f(x_i;\theta)^{-1}\frac{\partial^2}{\partial\theta^2}f(x_i;\theta)\right]f(x_i;\theta)\,dx_i$$
$$= \int f(x_i;\theta)^{-1}\left(\frac{\partial}{\partial\theta}f(x_i;\theta)\right)^2 dx_i - \int \frac{\partial^2}{\partial\theta^2}f(x_i;\theta)\,dx_i$$
$$= E[S(\theta|x_i)^2] - \int \frac{\partial^2}{\partial\theta^2}f(x_i;\theta)\,dx_i$$
$$= E[S(\theta|x_i)^2],$$
since regularity allows the order of differentiation and integration to be interchanged, so that $\int \frac{\partial^2}{\partial\theta^2}f(x_i;\theta)\,dx_i = \frac{\partial^2}{\partial\theta^2}\int f(x_i;\theta)\,dx_i = \frac{\partial^2}{\partial\theta^2}(1) = 0$.
Returning to the normal example, the first order condition for $\sigma^2$ is
$$\frac{\partial \ln L(\theta|x)}{\partial \sigma^2} = -\frac{n}{2}(\sigma^2)^{-1} + \frac{1}{2}(\sigma^2)^{-2}\sum_{i=1}^n (x_i - \mu)^2 = 0,$$
which, for a given value of $\mu$, is solved by
$$\sigma^2(\mu) = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2.$$
Notice that any value of $\sigma^2(\mu)$ defined this way satisfies the first order condition
$\frac{\partial \ln L(\theta|x)}{\partial \sigma^2} = 0$. If we substitute $\sigma^2(\mu)$ for $\sigma^2$ in the log-likelihood function for $\theta$ we get
the following concentrated log-likelihood function for $\mu$:
$$\ln L^c(\mu|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2(\mu)) - \frac{1}{2\sigma^2(\mu)}\sum_{i=1}^n (x_i - \mu)^2$$
$$= -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\left(\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2\right) - \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2\right)^{-1}\sum_{i=1}^n (x_i - \mu)^2$$
$$= -\frac{n}{2}(\ln(2\pi) + 1) - \frac{n}{2}\ln\left(\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2\right).$$
Now we may determine the MLE for $\mu$ by maximizing the concentrated log-
likelihood function $\ln L^c(\mu|x)$. The first order conditions are
$$\frac{\partial \ln L^c(\hat{\mu}_{mle}|x)}{\partial \mu} = \frac{\sum_{i=1}^n (x_i - \hat{\mu}_{mle})}{\frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{mle})^2} = 0,$$
which is satisfied by $\hat{\mu}_{mle} = \bar{x}$ provided not all of the $x_i$ values are identical.
For some models it may not be possible to analytically concentrate the log-
likelihood with respect to a subset of parameters. Nonetheless, it is still possible
in principle to numerically concentrate the log-likelihood, as the sketch below illustrates.
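The following Python sketch (an illustration, not part of the notes) concentrates the normal log-likelihood numerically: for each candidate $\mu$ it plugs in $\sigma^2(\mu)$ and then maximizes the resulting function of $\mu$ alone, recovering $\hat{\mu}_{mle} = \bar{x}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.normal(loc=1.5, scale=2.0, size=100)
n = len(x)

def concentrated_neg_log_lik(mu, x):
    """Negative concentrated log-likelihood: sigma^2 is replaced by sigma^2(mu)."""
    sigma2_mu = np.mean((x - mu) ** 2)
    return 0.5 * n * (np.log(2 * np.pi) + 1) + 0.5 * n * np.log(sigma2_mu)

res = minimize_scalar(concentrated_neg_log_lik, args=(x,))
print("mu from concentrated likelihood:", res.x, "sample mean:", x.mean())
```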
Sometimes more than one value of $\theta$ produces the same value of the likelihood function. When this happens we
say that $\theta$ is not identified. Formally, $\theta$ is identified if for all $\theta_1 \neq \theta_2$ there exists a
sample $x$ for which $L(\theta_1|x) \neq L(\theta_2|x)$.
The curvature of the log-likelihood is measured by its second derivative (Hessian)
$$H(\theta|x) = \frac{\partial^2 \ln L(\theta|x)}{\partial\theta\partial\theta'}.$$
Since the Hessian is negative semi-definite, the information in
the sample about $\theta$ may be measured by $-H(\theta|x)$. If $\theta$ is a scalar then $-H(\theta|x)$ is
a positive number. The expected amount of information in the sample about the
parameter $\theta$ is the information matrix $I(\theta|x) = -E[H(\theta|x)]$. As we shall see, the
information matrix is directly related to the precision of the MLE.
Let $X_1,\ldots,X_n$ be an iid sample with pdf $f(x;\theta)$. Let $\hat{\theta}$ be an unbiased estimator
of $\theta$; i.e., $E[\hat{\theta}] = \theta$. If $f(x;\theta)$ is regular then
$$\mathrm{var}(\hat{\theta}) \geq I(\theta|x)^{-1},$$
where $I(\theta|x) = -E[H(\theta|x)]$ denotes the sample information matrix. Hence, the
Cramer-Rao Lower Bound (CRLB) is the inverse of the information matrix. If $\theta$ is a
vector then $\mathrm{var}(\hat{\theta}) \geq I(\theta|x)^{-1}$ means that $\mathrm{var}(\hat{\theta}) - I(\theta|x)^{-1}$ is positive semi-definite.
To determine the CRLB the information matrix must be evaluated. The information
matrix may be computed as
$$I(\theta|x) = -E[H(\theta|x)]$$
or
$$I(\theta|x) = \mathrm{var}(S(\theta|x)).$$
Further, due to random sampling, $I(\theta|x) = n \cdot I(\theta|x_i) = n \cdot \mathrm{var}(S(\theta|x_i))$. In the Bernoulli example, using
the chain rule it can be shown that
$$H(\theta|x_i) = \frac{d}{d\theta}S(\theta|x_i) = \frac{d}{d\theta}\left(\frac{x_i - \theta}{\theta(1-\theta)}\right)$$
$$= -\left(\frac{1 + S(\theta|x_i) - 2\theta S(\theta|x_i)}{\theta(1-\theta)}\right).$$
Since $E[S(\theta|x_i)] = 0$, taking expectations gives
$$I(\theta|x_i) = -E[H(\theta|x_i)] = \frac{1}{\theta(1-\theta)},$$
so that $I(\theta|x) = \frac{n}{\theta(1-\theta)}$ and the CRLB is $I(\theta|x)^{-1} = \frac{\theta(1-\theta)}{n}$. The MLE is $\hat{\theta}_{mle} = \bar{x}$, and
$$E[\hat{\theta}_{mle}] = E[\bar{x}] = \theta$$
$$\mathrm{var}(\hat{\theta}_{mle}) = \mathrm{var}(\bar{x}) = \frac{\theta(1-\theta)}{n}.$$
Notice that the MLE is unbiased and its variance is equal to the CRLB. Therefore,
$\hat{\theta}_{mle}$ is efficient.
Remarks
• If $\theta = 0$ or $\theta = 1$ then $I(\theta|x) = \infty$ and $\mathrm{var}(\hat{\theta}_{mle}) = 0$ (why?)
• $I(\theta|x)$ is smallest when $\theta = \frac{1}{2}$.
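These remarks can be checked numerically; the sketch below (illustrative, not from the notes) evaluates $I(\theta|x) = n/(\theta(1-\theta))$ over a grid and confirms that it is smallest at $\theta = 1/2$ and blows up near the boundaries.

```python
import numpy as np

n = 100
theta = np.linspace(0.01, 0.99, 99)
info = n / (theta * (1 - theta))   # sample information for Bernoulli(theta)
crlb = 1 / info                    # CRLB = theta (1 - theta) / n

print("theta minimizing information:", theta[info.argmin()])   # 0.5
print("information near boundaries:", info[0], info[-1])       # large values
print("largest CRLB:", crlb.max(), "attained at theta =", theta[crlb.argmax()])
```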
For the normal example with $\theta = (\mu, \sigma^2)'$, the second derivatives of $\ln f(x_i;\theta)$ are
$$\frac{\partial^2 \ln f(x_i;\theta)}{\partial \mu^2} = -(\sigma^2)^{-1}$$
$$\frac{\partial^2 \ln f(x_i;\theta)}{\partial \mu \partial \sigma^2} = -(\sigma^2)^{-2}(x_i - \mu)$$
$$\frac{\partial^2 \ln f(x_i;\theta)}{\partial \sigma^2 \partial \mu} = -(\sigma^2)^{-2}(x_i - \mu)$$
$$\frac{\partial^2 \ln f(x_i;\theta)}{\partial (\sigma^2)^2} = \frac{1}{2}(\sigma^2)^{-2} - (\sigma^2)^{-3}(x_i - \mu)^2$$
so that
$$I(\theta|x_i) = -E[H(\theta|x_i)] = \begin{pmatrix} (\sigma^2)^{-1} & (\sigma^2)^{-2}E[(x_i - \mu)] \\ (\sigma^2)^{-2}E[(x_i - \mu)] & -\frac{1}{2}(\sigma^2)^{-2} + (\sigma^2)^{-3}E[(x_i - \mu)^2] \end{pmatrix}.$$
Using the results$^2$
$$E[(x_i - \mu)] = 0$$
$$E\left[\frac{(x_i - \mu)^2}{\sigma^2}\right] = 1$$
we then have
$$I(\theta|x_i) = \begin{pmatrix} (\sigma^2)^{-1} & 0 \\ 0 & \frac{1}{2}(\sigma^2)^{-2} \end{pmatrix}.$$
The information matrix for the sample is then
$$I(\theta|x) = n \cdot I(\theta|x_i) = \begin{pmatrix} n(\sigma^2)^{-1} & 0 \\ 0 & \frac{n}{2}(\sigma^2)^{-2} \end{pmatrix}.$$
$^2$ $(x_i - \mu)^2/\sigma^2$ is a chi-square random variable with one degree of freedom. The expected value
of a chi-square random variable is equal to its degrees of freedom.
and the CRLB is
$$\mathrm{CRLB} = I(\theta|x)^{-1} = \begin{pmatrix} \frac{\sigma^2}{n} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{pmatrix}.$$
Notice that the information matrix and the CRLB are diagonal matrices. The CRLB
for an unbiased estimator of $\mu$ is $\frac{\sigma^2}{n}$ and the CRLB for an unbiased estimator of $\sigma^2$
is $\frac{2\sigma^4}{n}$.
The MLEs for $\mu$ and $\sigma^2$ are
$$\hat{\mu}_{mle} = \bar{x}$$
$$\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{mle})^2.$$
Now
$$E[\hat{\mu}_{mle}] = \mu$$
$$E[\hat{\sigma}^2_{mle}] = \frac{n-1}{n}\sigma^2$$
so that $\hat{\mu}_{mle}$ is unbiased whereas $\hat{\sigma}^2_{mle}$ is biased. This illustrates the fact that MLEs
are not necessarily unbiased. Furthermore,
$$\mathrm{var}(\hat{\mu}_{mle}) = \frac{\sigma^2}{n} = \mathrm{CRLB}$$
and so $\hat{\mu}_{mle}$ is efficient.
The MLE for $\sigma^2$ is biased and so the CRLB result does not apply. Consider the
unbiased estimator of $\sigma^2$
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2.$$
Is the variance of $s^2$ equal to the CRLB? No. To see this, recall that
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1).$$
Further, if $X \sim \chi^2(n-1)$ then $E[X] = n-1$ and $\mathrm{var}(X) = 2(n-1)$. Therefore,
$$s^2 = \frac{\sigma^2}{n-1}X$$
$$\Rightarrow \mathrm{var}(s^2) = \frac{\sigma^4}{(n-1)^2}\mathrm{var}(X) = \frac{2\sigma^4}{n-1}.$$
Hence, $\mathrm{var}(s^2) = \frac{2\sigma^4}{n-1} > \mathrm{CRLB} = \frac{2\sigma^4}{n}$.
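A quick Monte Carlo check (illustrative only, not part of the notes) compares the simulated variance of $s^2$ with $2\sigma^4/(n-1)$ and with the CRLB $2\sigma^4/n$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, reps = 20, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = samples.var(axis=1, ddof=1)  # unbiased estimator s^2 in each replication

print("simulated var(s^2):", s2.var())
print("theory 2*sigma^4/(n-1):", 2 * sigma2**2 / (n - 1))
print("CRLB 2*sigma^4/n:      ", 2 * sigma2**2 / n)
```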
Remarks
• The diagonal elements of $I(\theta|x) \to \infty$ as $n \to \infty$.
Returning to the regression example, write the score in terms of $\varepsilon = y - X\beta$:
$$S(\theta|y,X) = \begin{pmatrix} -(\sigma^2)^{-1}(-X'\varepsilon) \\ -\frac{n}{2}(\sigma^2)^{-1} + \frac{1}{2}(\sigma^2)^{-2}\varepsilon'\varepsilon \end{pmatrix}.$$
Now $E[\varepsilon] = 0$ and $E[\varepsilon'\varepsilon] = n\sigma^2$ (since $\varepsilon'\varepsilon/\sigma^2 \sim \chi^2(n)$) so that
$$E[S(\theta|y,X)] = \begin{pmatrix} -(\sigma^2)^{-1}(-X'E[\varepsilon]) \\ -\frac{n}{2}(\sigma^2)^{-1} + \frac{1}{2}(\sigma^2)^{-2}E[\varepsilon'\varepsilon] \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
To determine the Hessian and information matrix we need the second derivatives of
$\ln L(\theta|y,X)$:
$$\frac{\partial^2 \ln L(\theta|y,X)}{\partial \beta \partial \beta'} = \frac{\partial}{\partial \beta'}\left(-(\sigma^2)^{-1}[-X'y + X'X\beta]\right) = -(\sigma^2)^{-1}X'X$$
$$\frac{\partial^2 \ln L(\theta|y,X)}{\partial \beta \partial \sigma^2} = \frac{\partial}{\partial \sigma^2}\left(-(\sigma^2)^{-1}[-X'y + X'X\beta]\right) = -(\sigma^2)^{-2}X'\varepsilon$$
$$\frac{\partial^2 \ln L(\theta|y,X)}{\partial \sigma^2 \partial \beta'} = -(\sigma^2)^{-2}\varepsilon'X$$
$$\frac{\partial^2 \ln L(\theta|y,X)}{\partial (\sigma^2)^2} = \frac{\partial}{\partial \sigma^2}\left(-\frac{n}{2}(\sigma^2)^{-1} + \frac{1}{2}(\sigma^2)^{-2}\varepsilon'\varepsilon\right) = \frac{n}{2}(\sigma^2)^{-2} - (\sigma^2)^{-3}\varepsilon'\varepsilon.$$
Therefore,
$$H(\theta|y,X) = \begin{pmatrix} -(\sigma^2)^{-1}X'X & -(\sigma^2)^{-2}X'\varepsilon \\ -(\sigma^2)^{-2}\varepsilon'X & \frac{n}{2}(\sigma^2)^{-2} - (\sigma^2)^{-3}\varepsilon'\varepsilon \end{pmatrix}$$
and
$$I(\theta|y,X) = -E[H(\theta|y,X)] = \begin{pmatrix} (\sigma^2)^{-1}X'X & 0 \\ 0 & \frac{n}{2}(\sigma^2)^{-2} \end{pmatrix}.$$
Notice that the information matrix is block diagonal in $\beta$ and $\sigma^2$. The CRLB for
unbiased estimators of $\theta$ is then
$$I(\theta|y,X)^{-1} = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{pmatrix}.$$
Do the MLEs of $\beta$ and $\sigma^2$ achieve the CRLB? First, $\hat{\beta}_{mle}$ is unbiased and $\mathrm{var}(\hat{\beta}_{mle}|X) =
\sigma^2(X'X)^{-1} = \mathrm{CRLB}$ for an unbiased estimator of $\beta$. Hence, $\hat{\beta}_{mle}$ is the most efficient
unbiased estimator (BUE). This is an improvement over the Gauss-Markov theorem,
which says that $\hat{\beta}_{mle} = \hat{\beta}_{OLS}$ is the most efficient linear unbiased estimator
(BLUE). Next, note that $\hat{\sigma}^2_{mle}$ is not unbiased (why?) so the CRLB result does not
apply. What about the unbiased estimator $s^2 = (n-k)^{-1}(y - X\hat{\beta}_{OLS})'(y - X\hat{\beta}_{OLS})$?
It can be shown that $\mathrm{var}(s^2|X) = \frac{2\sigma^4}{n-k} > \frac{2\sigma^4}{n} = \mathrm{CRLB}$ for an unbiased estimator of
$\sigma^2$. Hence $s^2$ is not the most efficient unbiased estimator of $\sigma^2$.
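As a numerical illustration (not from the notes), the sketch below computes $\hat{\beta}_{mle}$ via the closed-form OLS formula, the biased $\hat{\sigma}^2_{mle}$, and the unbiased $s^2$, for a simulated regression with hypothetical dimensions $n = 200$ and $k = 3$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # include an intercept
beta_true = np.array([1.0, 2.0, -0.5])
sigma = 1.5
y = X @ beta_true + rng.normal(scale=sigma, size=n)

beta_mle = np.linalg.solve(X.T @ X, X.T @ y)  # MLE = OLS under normal errors
resid = y - X @ beta_mle
sigma2_mle = resid @ resid / n                # biased MLE of sigma^2
s2 = resid @ resid / (n - k)                  # unbiased estimator

print("beta_mle:", beta_mle)
print("sigma2_mle:", sigma2_mle, "s^2:", s2)
```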
Recall that for the normal model the MLEs of $\mu$ and $\sigma^2$ are
$$\hat{\mu}_{mle} = \bar{x}$$
$$\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{mle})^2.$$
Suppose we are interested in the MLE for $\sigma = h(\sigma^2) = (\sigma^2)^{1/2}$, which is a one-to-one
function for $\sigma^2 > 0$. The invariance property says that
$$\hat{\sigma}_{mle} = (\hat{\sigma}^2_{mle})^{1/2} = \left(\frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{mle})^2\right)^{1/2}.$$
Under regularity conditions, the MLE has the following asymptotic properties:
1. $\hat{\theta}_{mle} \overset{p}{\to} \theta$
2. $\sqrt{n}(\hat{\theta}_{mle} - \theta) \overset{d}{\to} N(0, I(\theta|x_i)^{-1})$, where
$$I(\theta|x_i) = -E[H(\theta|x_i)] = -E\left[\frac{\partial^2 \ln f(x_i;\theta)}{\partial\theta\partial\theta'}\right].$$
That is,
$$\mathrm{avar}(\sqrt{n}(\hat{\theta}_{mle} - \theta)) = I(\theta|x_i)^{-1}.$$
Alternatively,
$$\hat{\theta}_{mle} \overset{A}{\sim} N\left(\theta, \frac{1}{n}I(\theta|x_i)^{-1}\right) = N(\theta, I(\theta|x)^{-1}).$$
Remarks:
2. Asymptotic normality of $\hat{\theta}_{mle}$ follows from an exact first order Taylor series
expansion of the first order conditions for a maximum of the log-likelihood about
$\theta_0$:
$$0 = S(\hat{\theta}_{mle}|x) = S(\theta_0|x) + H(\bar{\theta}|x)(\hat{\theta}_{mle} - \theta_0),$$
where $\bar{\theta}$ lies between $\hat{\theta}_{mle}$ and $\theta_0$, so that
$$\sqrt{n}(\hat{\theta}_{mle} - \theta_0) = \left(-\frac{1}{n}H(\bar{\theta}|x)\right)^{-1}\frac{1}{\sqrt{n}}S(\theta_0|x).$$
Therefore
$$\sqrt{n}(\hat{\theta}_{mle} - \theta_0) \overset{d}{\to} I(\theta_0|x_i)^{-1}N(0, I(\theta_0|x_i)) = N(0, I(\theta_0|x_i)^{-1}).$$
√
3. Since I(θ|xi ) = −E[H(θ|xi )] = var(S(θ|xi )) is generally not known, avar( n(θ̂mle −
θ)) must be estimated. The most common estimates for I(θ|xi ) are
X
n
ˆ θ̂mle |xi ) = − 1
I( H(θ̂mle |xi )
n i=1
X
n
ˆ θ̂mle |xi ) = 1
I( S(θ̂mle |xi )(θ̂mle |xi )0
n i=1
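Both estimates are easy to compute; the sketch below (illustrative, not from the notes) evaluates them for a Bernoulli sample using the analytic score and Hessian for a single observation, and compares them with the theoretical value $1/(\theta(1-\theta))$ evaluated at $\hat{\theta}_{mle}$.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.4, size=500)
theta_hat = x.mean()  # Bernoulli MLE

def score_i(theta, xi):
    """Score for one Bernoulli observation: (x_i - theta) / (theta (1 - theta))."""
    return (xi - theta) / (theta * (1 - theta))

def hessian_i(theta, xi):
    """Hessian for one Bernoulli observation: derivative of the score w.r.t. theta."""
    s = score_i(theta, xi)
    return -(1 + s - 2 * theta * s) / (theta * (1 - theta))

info_hessian = -np.mean([hessian_i(theta_hat, xi) for xi in x])
info_outer = np.mean([score_i(theta_hat, xi) ** 2 for xi in x])
print("Hessian-based estimate: ", info_hessian)
print("Outer-product estimate: ", info_outer)
print("1/(theta_hat(1-theta_hat)):", 1 / (theta_hat * (1 - theta_hat)))
```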
To prove consistency of the MLE, one must show that $Q_0(\theta) = E[\ln f(x_i|\theta)]$ is
uniquely maximized at $\theta = \theta_0$. To do this, let $f(x, \theta_0)$ denote the true density and
let $f(x, \theta_1)$ denote the density evaluated at any $\theta_1 \neq \theta_0$. Define the Kullback-Leibler
Information Criterion (KLIC) as
$$K(f(x,\theta_0), f(x,\theta_1)) = E_{\theta_0}\left[\ln\frac{f(x,\theta_0)}{f(x,\theta_1)}\right] = \int \ln\frac{f(x,\theta_0)}{f(x,\theta_1)}\, f(x,\theta_0)\,dx,$$
where
$$\ln\frac{f(x,\theta_0)}{f(x,\theta_1)} = \infty \;\text{ if } f(x,\theta_1) = 0 \text{ and } f(x,\theta_0) > 0,$$
$$K(f(x,\theta_0), f(x,\theta_1)) = 0 \;\text{ if } f(x,\theta_0) = 0.$$
The KLIC is a measure of the ability of the likelihood ratio to distinguish between
$f(x,\theta_0)$ and $f(x,\theta_1)$ when $f(x,\theta_0)$ is true. The Shannon-Kolmogorov Information
Inequality gives the following result:
Let $X_1,\ldots,X_n$ be an iid sample with $X \sim$ Bernoulli($\theta$). Recall,
$$\hat{\theta}_{mle} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$
$$I(\theta|x_i) = \frac{1}{\theta(1-\theta)}.$$
The asymptotic properties of the MLE tell us that
$$\hat{\theta}_{mle} \overset{p}{\to} \theta$$
$$\sqrt{n}(\hat{\theta}_{mle} - \theta) \overset{d}{\to} N(0, \theta(1-\theta)).$$
Alternatively,
$$\hat{\theta}_{mle} \overset{A}{\sim} N\left(\theta, \frac{\theta(1-\theta)}{n}\right).$$
An estimate of the asymptotic variance of $\hat{\theta}_{mle}$ is the plug-in value $\bar{x}(1-\bar{x})/n$.
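A small simulation (illustrative, not part of the notes) confirms the asymptotic approximation: across many replications, the variance of $\hat{\theta}_{mle}$ is close to $\theta(1-\theta)/n$.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 0.3, 100, 100_000

theta_hats = rng.binomial(n, theta, size=reps) / n  # MLE = sample proportion per replication
print("simulated var(theta_hat):", theta_hats.var())
print("theta(1-theta)/n:        ", theta * (1 - theta) / n)
```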
Consider again the linear regression model
$$y_i = x_i'\beta + \varepsilon_i, \quad i = 1,\ldots,n,$$
$$\varepsilon_i | x_i \sim \text{iid } N(0, \sigma^2).$$
Further, the block diagonality of the information matrix implies that $\hat{\beta}_{mle}$ is asymptotically
independent of $\hat{\sigma}^2_{mle}$.
1.8 Relationship Between ML and GMM
Let $X_1,\ldots,X_n$ be an iid sample from some underlying economic model. To do ML
estimation, you need to know the pdf, $f(x_i|\theta)$, of an observation in order to form the
log-likelihood function
$$\ln L(\theta|x) = \sum_{i=1}^n \ln f(x_i|\theta),$$
where $\theta \in \mathbb{R}^p$. The MLE satisfies the first order conditions
$$\frac{\partial \ln L(\hat{\theta}_{mle}|x)}{\partial \theta} = S(\hat{\theta}_{mle}|x) = 0.$$
For general models, the first order conditions are $p$ nonlinear equations in $p$ unknowns.
Under regularity conditions, the MLE is consistent, asymptotically normally distributed,
and efficient in the class of asymptotically normal estimators:
$$\hat{\theta}_{mle} \overset{A}{\sim} N\left(\theta, \frac{1}{n}I(\theta|x_i)^{-1}\right),$$
where $I(\theta|x_i) = -E[H(\theta|x_i)] = E[S(\theta|x_i)S(\theta|x_i)']$.
To do GMM estimation, you need to know $k \geq p$ population moment conditions
$$E[g(x_i, \theta)] = 0.$$
The GMM estimator matches sample moments with the population moments. The
sample moments are
$$g_n(\theta) = \frac{1}{n}\sum_{i=1}^n g(x_i, \theta).$$
If $k > p$, the efficient GMM estimator minimizes the objective function
$$J(\theta, \hat{S}^{-1}) = n\, g_n(\theta)'\hat{S}^{-1}g_n(\theta),$$
where $\hat{S}$ is a consistent estimate of $S = \mathrm{var}(g(x_i,\theta))$, and satisfies the first order conditions
$$\frac{\partial J(\hat{\theta}_{gmm}, \hat{S}^{-1})}{\partial \theta} = G_n(\hat{\theta}_{gmm})'\hat{S}^{-1}g_n(\hat{\theta}_{gmm}) = 0.$$
Under regularity conditions, the efficient GMM estimator is consistent, asymptotically
normally distributed, and efficient in the class of asymptotically normal GMM
estimators for a given set of moment conditions:
$$\hat{\theta}_{gmm} \overset{A}{\sim} N\left(\theta, \frac{1}{n}(G'S^{-1}G)^{-1}\right),$$
where $G = E\left[\frac{\partial g_n(\theta)}{\partial \theta'}\right]$.
The asymptotic efficiency of the MLE in the class of consistent and asymptotically
normal estimators implies that
$$\mathrm{avar}(\hat{\theta}_{mle}) - \mathrm{avar}(\hat{\theta}_{gmm}) \leq 0.$$
That is, the efficient GMM estimator is generally less efficient than the ML estimator.
The GMM estimator will be equivalent to the ML estimator if the moment conditions
happen to correspond with the score associated with the pdf of an observation.
That is, if
$$g(x_i, \theta) = S(\theta|x_i).$$
In this case, there are $p$ moment conditions and the model is just identified. The
GMM estimator then satisfies the sample moment equations
$$\frac{1}{n}\sum_{i=1}^n S(\hat{\theta}_{gmm}|x_i) = 0,$$
which are just the first order conditions for the MLE, so that $\hat{\theta}_{gmm} = \hat{\theta}_{mle}$.
Consider testing the simple hypothesis $H_0: \theta = \theta_0$. The likelihood ratio is defined as
$$\lambda = \frac{L(\theta_0|x)}{L(\hat{\theta}_{mle}|x)},$$
which is the ratio of the likelihood evaluated under the null to the likelihood evaluated
at the MLE. By construction $0 < \lambda \leq 1$. If $H_0: \theta = \theta_0$ is true, then we should see
$\lambda \approx 1$; if $H_0: \theta = \theta_0$ is not true then we should see $\lambda < 1$. The likelihood ratio
(LR) statistic is a simple transformation of $\lambda$ such that the value of LR is small if
$H_0: \theta = \theta_0$ is true, and the value of LR is large when $H_0: \theta = \theta_0$ is not true.
Formally, the LR statistic is
$$LR = -2\ln\lambda = -2\ln\frac{L(\theta_0|x)}{L(\hat{\theta}_{mle}|x)}$$
$$= -2[\ln L(\theta_0|x) - \ln L(\hat{\theta}_{mle}|x)].$$
From Figure xxx, notice that the distance between $\ln L(\hat{\theta}_{mle}|x)$ and $\ln L(\theta_0|x)$
depends on the curvature of $\ln L(\theta|x)$ near $\theta = \hat{\theta}_{mle}$. If the curvature is sharp (i.e.,
information is high) then LR will be large for $\theta_0$ values away from $\hat{\theta}_{mle}$. If, however,
the curvature of $\ln L(\theta|x)$ is flat (i.e., information is low) then LR will be small for $\theta_0$
values away from $\hat{\theta}_{mle}$.
Under general regularity conditions, if $H_0: \theta = \theta_0$ is true then
$$LR \overset{d}{\to} \chi^2(1).$$
Next, the asymptotic normality of the MLE implies that, under $H_0$, $\hat{\theta}_{mle} \overset{A}{\sim} N(\theta_0, \hat{I}(\hat{\theta}_{mle}|x)^{-1})$,
where $\hat{I}(\hat{\theta}_{mle}|x)$ is a consistent estimate of the sample information matrix. An
implication of the asymptotic normality result is that the usual t-ratio for testing
$H_0: \theta = \theta_0$ is
$$t = \frac{\hat{\theta}_{mle} - \theta_0}{\widehat{SE}(\hat{\theta}_{mle})} = \frac{\hat{\theta}_{mle} - \theta_0}{\sqrt{\hat{I}(\hat{\theta}_{mle}|x)^{-1}}} = \left(\hat{\theta}_{mle} - \theta_0\right)\sqrt{\hat{I}(\hat{\theta}_{mle}|x)}.$$
The Wald statistic is defined to be simply the square of this t-ratio:
$$Wald = \frac{\left(\hat{\theta}_{mle} - \theta_0\right)^2}{\hat{I}(\hat{\theta}_{mle}|x)^{-1}} = \left(\hat{\theta}_{mle} - \theta_0\right)^2 \hat{I}(\hat{\theta}_{mle}|x).$$
The intuition behind the Wald statistic is illustrated in Figure xxx. If the curvature
of $\ln L(\theta|x)$ near $\theta = \hat{\theta}_{mle}$ is big (high information) then the squared distance
$(\hat{\theta}_{mle} - \theta_0)^2$ gets blown up when constructing the Wald statistic. If the curvature
of $\ln L(\theta|x)$ near $\theta = \hat{\theta}_{mle}$ is low, then $\hat{I}(\hat{\theta}_{mle}|x)$ is small and the squared distance
$(\hat{\theta}_{mle} - \theta_0)^2$ gets attenuated when constructing the Wald statistic.
Finally, recall that the MLE satisfies
$$0 = \frac{d\ln L(\hat{\theta}_{mle}|x)}{d\theta} = S(\hat{\theta}_{mle}|x).$$
If $H_0: \theta = \theta_0$ is true, then we should expect that
$$0 \approx \frac{d\ln L(\theta_0|x)}{d\theta} = S(\theta_0|x).$$
If $H_0: \theta = \theta_0$ is not true, then we should expect that
$$0 \neq \frac{d\ln L(\theta_0|x)}{d\theta} = S(\theta_0|x).$$
The Lagrange multiplier (score) statistic is based on how far $S(\theta_0|x)$ is from zero.
Recall the following properties of the score $S(\theta|x_i)$. If $H_0: \theta = \theta_0$ is true then
$$E[S(\theta_0|x_i)] = 0$$
$$\mathrm{var}(S(\theta_0|x_i)) = I(\theta_0|x_i)$$
and, by the central limit theorem,
$$n^{-1/2}S(\theta_0|x) = \frac{1}{\sqrt{n}}\sum_{i=1}^n S(\theta_0|x_i) \overset{d}{\to} N(0, I(\theta_0|x_i)),$$
so that
$$S(\theta_0|x) \overset{A}{\sim} N(0, I(\theta_0|x)).$$
This result motivates the statistic
$$LM = \frac{S(\theta_0|x)^2}{I(\theta_0|x)} = S(\theta_0|x)^2 I(\theta_0|x)^{-1}.$$
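To tie the three tests together, here is a sketch (illustrative, not from the notes) computing LR, Wald, and LM for testing $H_0: \theta = \theta_0$ in the Bernoulli model, using the analytic log-likelihood, score, and information derived above.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.binomial(1, 0.35, size=200)
n, theta0 = len(x), 0.5
theta_hat = x.mean()

def log_lik(theta):
    s = x.sum()
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

def score(theta):
    return (x.sum() - n * theta) / (theta * (1 - theta))  # sum of observation scores

def info(theta):
    return n / (theta * (1 - theta))  # sample information I(theta|x)

LR = -2 * (log_lik(theta0) - log_lik(theta_hat))
Wald = (theta_hat - theta0) ** 2 * info(theta_hat)
LM = score(theta0) ** 2 / info(theta0)
print("LR:", LR, "Wald:", Wald, "LM:", LM)  # all asymptotically chi-square(1) under H0
```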