lecture1_ml_MLE

The lecture focuses on the Maximum Likelihood Estimator (MLE), covering its definition, properties, and applications in statistical estimation and testing. Key topics include the derivation of asymptotic properties, the likelihood function, and the importance of identification in ensuring the parameters are uniquely determined by the data. The lecture also emphasizes the logic behind MLE and its role in fitting distributions to observed data.

Lecture 1: Maximum Likelihood Estimator

Professor: Mauricio Sarrias

Universidad de Talca

2020
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Reading

Reading (Mandatory):
(Ruud)- Chapters 14 and 15.
Buse, A. (1982). The likelihood ratio, Wald, and Lagrange multiplier
tests: An expository note. The American Statistician, 36(3a), 153-157.
Suggested:
(Winkelmann & Boes)- Chapters 2 and 3
Goals

Understand the logic behind the Maximum Likelihood Estimator.


Derive the asymptotic properties of the MLE.
Understand and derive the basic test for the MLE.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Motivating Example I
Let’s assume we weighed 1000 people from Talca

[Figure: histogram of the 1000 observed weights — frequency on the vertical axis, weight in kilograms (0 to 140) on the horizontal axis.]
Motivating Example I

The goal of maximum likelihood is to find the optimal way to fit a distribution
to the data.

Remark I
Generally, we can write the probability or density function of y_i, i = 1, ..., n, as f(y_i; θ), where y_i is the ith draw from the population and θ is the parameter of the distribution.

Remark II
We usually assume independent sampling, i.e., the ith draw from the population is independent of all other draws i′ ≠ i.

So the question is:

Which distribution best fits the previous weight data?
Motivating Example I
[Figure: four candidate densities fitted to the weight data (in kg) — a Normal, a Chi-squared, an Exponential, and a Gamma distribution.]
Motivating Example I

It seems that the normal distribution is the best option.
We expect most of the weights to be close to the mean.
We expect the weights to be relatively symmetrical around the mean.
Ok ..., but not every normal fits our data.
What mean, µ, and variance, σ², are the best “estimates”?

[Figure: a normal density overlaid on the weight data in kg.]
Motivating Example I

Maximum Likelihood Principle
1 We observe some data.
2 We pick the distribution we think generated the data.
3 We find the estimator(s) of the distribution, \hat\theta, that make the sample we are observing most likely.

IOW, the problem consists of estimating an unknown parameter of a population when the population distribution is known (up to the unknown parameter).

[Figure: a fitted normal density over the weight data in kilograms.]
Motivating Example II

Example
A random sample of 100 trials was performed and 10 resulted in success.
What can be inferred about the unknown probability of success p0 ?

Note that we are observing the sample; somehow we know the distribution; and we are asking what is the \hat{p} that makes the sample we are observing most likely.
Motivating Example II

For any potential value of p for the probability of success, the probability of y successes from n trials is given by:

f(y; n, p) = \Pr(Y = y) = \binom{n}{y} p^y (1 - p)^{n - y}

where

\binom{n}{y} = \frac{n!}{y!(n - y)!}

With y = 10 successes from n = 100 trials,

L(p) = \Pr(Y = 10) = \frac{100!}{90!\,10!}\, p^{10} (1 - p)^{90} = 1.731 \times 10^{13} \times p^{10} (1 - p)^{90}
Motivating Example II

[Figure: the likelihood L(p) plotted against p over the range 0.05 to 0.25; it peaks at p = 0.10.]
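A minimal numerical sketch of this binomial example (not part of the original slides): evaluating L(p) on a grid and locating the value of p that makes the observed sample most likely. It assumes NumPy/SciPy are available; the grid and seed choices are arbitrary.

```python
# Sketch: evaluate L(p) = C(100,10) p^10 (1-p)^90 on a grid and find its maximizer.
import numpy as np
from scipy.special import comb

n, y = 100, 10
p_grid = np.linspace(0.01, 0.99, 981)
L = comb(n, y) * p_grid**y * (1 - p_grid)**(n - y)   # likelihood L(p)

p_hat = p_grid[np.argmax(L)]
print(f"grid maximizer: {p_hat:.3f}, analytic MLE y/n = {y / n:.3f}")
# Both are 0.10: the sample proportion maximizes the binomial likelihood.
```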
Likelihood Function

The likelihood function, denoted by a capital L, is:

L(\theta; y|X) = \prod_{i=1}^n L(\theta; y_i|x_i) = \prod_{i=1}^n f(y_i|x_i; \theta)

where y = (y_1, ..., y_n).


L(θ; yi |xi ) is the likelihood contribution of the i-th observation,
L(θ, y|X) is the likelihood function of the whole sample.

The likelihood function says that, for any given sample y|X, the likelihood
of having obtained that particular sample depends on the parameter θ.
Whenever we can write down the joint probability function of the sample
we can in principle use ML estimation.
Log Likelihood Function

The log-likelihood function is:


\ln L(\theta; y|X) = \ln L(\theta) = \ln\Bigg(\underbrace{\prod_{i=1}^n f(y_i|x_i; \theta)}_{f(y|X;\theta)}\Bigg) = \sum_{i=1}^n \ln f(y_i|x_i; \theta)

The log-likelihood function is a monotonically increasing function of L(\theta; y|X):
I Any maximizing value \hat\theta of \ln L(\theta; y|X) must also maximize L(\theta; y|X).
Taking logarithms converts products into sums:
I It allows some simplification in the numerical determination of the MLE.
I Likelihood values are often extremely small (but can also be extremely large), which makes numerical optimization of the likelihood itself highly problematic.
I It simplifies the study of the properties of the estimator.
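To illustrate the numerical point above, here is a small sketch (not from the slides, assuming an i.i.d. standard normal sample and SciPy's norm): the raw likelihood underflows while the log-likelihood stays well behaved.

```python
# Sketch: the product of densities of even a moderate sample underflows to 0.0 in
# double precision, while the sum of log-densities remains finite and usable.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=2000)

likelihood = np.prod(norm.pdf(y, loc=0.0, scale=1.0))        # product of densities
log_likelihood = np.sum(norm.logpdf(y, loc=0.0, scale=1.0))  # sum of log-densities

print(likelihood)       # 0.0 (numerical underflow)
print(log_likelihood)   # a finite number, roughly -0.5*n*log(2*pi) - 0.5*sum(y**2)
```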
Example

Example (Binomial Example)


Let \{Y_n\} be a random sample of a binomial random variable with parameters (n, p), where n is assumed to be known and p unknown. The likelihood function for individual i is given by:

L_i(p; y_i) = f(y_i; p) = \binom{n}{y_i} p^{y_i} (1 - p)^{n - y_i}

Since the sample is iid, the likelihood function is:

L(p; y) = f(y_1, y_2, ..., y_n; p) = \prod_{i=1}^n f(y_i; p) = \prod_{i=1}^n \binom{n}{y_i} p^{y_i} (1 - p)^{n - y_i}
Example

Example (Binomial Example)


Taking the log we get the log-likelihood function:

\ln L(p; y) = \ln\left(\prod_{i=1}^n f(y_i; p)\right)
= \ln\left(\prod_{i=1}^n \binom{n}{y_i} p^{y_i} (1 - p)^{n - y_i}\right)
= \ln\left[\left(\prod_{i=1}^n \binom{n}{y_i}\right) p^{\sum_{i=1}^n y_i} (1 - p)^{\sum_{i=1}^n (n - y_i)}\right]
= \sum_{i=1}^n \ln\binom{n}{y_i} + \left(\sum_{i=1}^n y_i\right) \ln p + \left(n^2 - \sum_{i=1}^n y_i\right) \ln(1 - p)
Example: Linear Regression

Example (Linear Regression)


Consider that \{y_i, x_i\} is i.i.d. and y_i = x_i^\top\beta_0 + \varepsilon_i, where \varepsilon_i|x_i \sim N(0, \sigma_0^2). So, with \theta = (\beta^\top, \sigma^2)^\top and w_i = (y_i, x_i^\top)^\top, the conditional pdf is

f(y_i|x_i; \theta_0) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y_i - x_i^\top\beta_0)^2}{2\sigma_0^2}\right) = \phi(y_i - x_i^\top\beta_0, \sigma_0^2)

The joint pdf of the sample is:

\prod_{i=1}^n f(y_i|x_i; \theta_0) = \left(2\pi\sigma_0^2\right)^{-n/2} \exp\left(-\frac{(y - X\beta_0)^\top(y - X\beta_0)}{2\sigma_0^2}\right) = \phi(y - X\beta_0, \sigma_0^2 \cdot I_n)

The parameter space \Theta is \mathbb{R}^K \times \mathbb{R}_{++}, where K is the dimension of \beta and \mathbb{R}_{++} is the set of positive real numbers, reflecting the a priori restriction that \sigma_0^2 > 0.
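As a sketch of this example (not from the slides; the simulated design and the "true" values beta0 = (1, 2), sigma0^2 = 1 are purely illustrative), the average Gaussian log-likelihood can be evaluated directly at any candidate (β, σ²):

```python
# Sketch: evaluate (1/n) * sum_i log phi(y_i - x_i' beta, sigma2) at candidate parameters.
import numpy as np

def avg_loglik(beta, sigma2, y, X):
    """Average conditional log-likelihood of y_i | x_i ~ N(x_i' beta, sigma2)."""
    resid = y - X @ beta
    return np.mean(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0, sigma0_sq = np.array([1.0, 2.0]), 1.0
y = X @ beta0 + rng.normal(scale=np.sqrt(sigma0_sq), size=n)

print(avg_loglik(beta0, sigma0_sq, y, X))            # near its maximum
print(avg_loglik(np.array([0.0, 0.0]), 1.0, y, X))   # clearly lower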
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Maximum Likelihood Estimator

Definition (ML Estimator)


The MLE is a value of the parameter vector that maximizes the sample
average log-likelihood function:
\hat\theta_n \equiv \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ln f(y_i|x_i; \theta)

where \Theta denotes the parameter space in which the parameter vector \theta lies. Usually \Theta = \mathbb{R}^K.
Maximum Likelihood Estimator: Maximization
[Figure: the log-likelihood \ln L(\theta, y) plotted against \theta; its maximum defines \hat\theta = \arg\max_\theta \ln L(\theta, y).]
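Since the maximization rarely has a closed form in practice, here is a hedged sketch (not the lecture's own code) of the definition above: the MLE of a normal (µ, σ²) model obtained by numerically minimizing the negative average log-likelihood. The data, starting values and use of scipy.optimize.minimize are all illustrative choices.

```python
# Sketch: theta_hat = argmax (1/n) sum_i log f(y_i; theta) for the normal model,
# computed by minimizing the negative average log-likelihood. sigma is optimized
# on the log scale so the search is unconstrained.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
y = rng.normal(loc=70.0, scale=12.0, size=1000)   # hypothetical weight data

def neg_avg_loglik(params, y):
    mu, log_sigma = params
    return -np.mean(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_avg_loglik, x0=np.array([50.0, np.log(5.0)]), args=(y,), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)          # close to the analytic MLEs below
print(y.mean(), y.std(ddof=0))    # sample mean and (1/n) standard deviation
```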
Maximum Likelihood Estimator

Remark:
By the nature of the objective function, the MLE is the estimator which
makes the observed data most likely to occur. In other words, the MLE
is the best “rationalization” of what we observed.

Population analogue

E[\ln L(\theta; y|X)] \equiv \int \ln L(\theta; y|X)\, dF(y|X; \theta_0)

where F(y|X; \theta_0) is the joint CDF of (y, X).


Maximum Likelihood Estimator

Assumption I: Distribution
The sample {yi , xi } is i.i.d with true conditional density f (yi |xi ; θ0 ).

Since the sample is i.i.d, we can write:

f (y|X; θ) = f (y1 |x1 ; θ) × f (y2 |x2 ; θ) × ... × f (yn |xn ; θ)


Expected Log-Likelihood Inequality

Is E [ln L(θ; y|X)] maximized at θ0 ?

Assumption II: Dominance I


E [supθ∈Θ |ln L(θ; y|X)|] exists.

Lemma (Expected Log-likelihood Inequality)


If Dominance I assumption holds, then

E [ln f (y|x; θ)] ≤ E [ln f (y|x; θ0 )]


Example
The conditional log-likelihood function of y_i|x_i \sim N(x_i^\top\beta_0, \sigma_0^2) is

\log f(y_i|x_i; \theta) = -0.5\log(2\pi\sigma^2) - \frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}   (1)

The conditional expectation is:

E[\log f(y_i|x_i;\theta)|x_i] = -0.5\log(2\pi\sigma^2) - \frac{E[(y_i - x_i^\top\beta)^2]}{2\sigma^2}

Note that:

E[(y_i - x_i^\top\beta)^2] = E[(x_i^\top\beta_0 + \varepsilon_i - x_i^\top\beta)^2]   (since y_i = x_i^\top\beta_0 + \varepsilon_i)
= E[(\varepsilon_i + x_i^\top(\beta_0 - \beta))^2]
= E[\varepsilon_i^2 + 2\varepsilon_i x_i^\top(\beta_0 - \beta) + (\beta_0 - \beta)^\top x_i x_i^\top (\beta_0 - \beta)]
= E(\varepsilon_i^2) + (\beta_0 - \beta)^\top E(x_i x_i^\top)(\beta_0 - \beta)
= \sigma_0^2 + (\beta_0 - \beta)^\top E(x_i x_i^\top)(\beta_0 - \beta)
When is this expectation finite?
Example

The last term is finite if E(x_i x_i^\top) is. This implies that X is full-column rank.

E[\log f(y_i|x_i;\theta)|x_i] = -0.5\log(2\pi\sigma^2) - \frac{\sigma_0^2 + (\beta_0 - \beta)^\top E(x_i x_i^\top)(\beta_0 - \beta)}{2\sigma^2}

Now, E[\log f(y_i|x_i;\theta)|x_i] is uniquely maximized at x_i^\top\beta = x_i^\top\beta_0 and \sigma^2 = \sigma_0^2.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Identification

Before employing MLE, it is necessary to check whether the data-generating process is sufficiently informative about the parameters of the model.
Recall OLS: for \hat\beta to be unique, X must be full-column rank. Otherwise, ...
The question is: is the population objective E[\ln f(y_i|x_i;\theta)] uniquely maximized at \theta_0?
I If there exists another \theta \neq \theta_0 that maximizes E[\ln f(y_i|x_i;\theta)], then the MLE is not identified.
This is satisfied if (conditional density identification):

f(y_i|x_i; \theta) \neq f(y_i|x_i; \theta_0) \quad \forall \theta \neq \theta_0


Identification

Definition (Global Identification)


The parameter vector \theta_0 is globally identified in \Theta if, for every \theta_1 \in \Theta, \theta_0 \neq \theta_1 implies that:

\Pr[f(y_i|x_i;\theta_0) \neq f(y_i|x_i;\theta_1)] > 0

Assumption III: Global Identification


Every parameter vector θ0 ∈ Θ is globally identified.
Identification

Lemma (Strict Expected Log-Likelihood Inequality)


Under the Assumptions of Distribution, Dominance I and Global
Identification, then

\theta \neq \theta_0 \implies E[\ln f(y|x;\theta)] < E[\ln f(y|x;\theta_0)]

In words, the expected value of the log-likelihood is maximized at the true


value of the parameters.
Proof.
Let w = (y, x^\top)^\top and define

a(w) \equiv f(y|x;\theta)/f(y|x;\theta_0)

First, WTS that a(w) \neq 1 with positive probability, so that a(w) is a nonconstant random variable (so we can apply Jensen's Inequality).

a(w) \neq 1 \iff f(y|x;\theta) \neq f(y|x;\theta_0)
\Pr[a(w) \neq 1] = \Pr[f(y|x;\theta) \neq f(y|x;\theta_0)]

But, by Global Identification:

\Pr[f(y|x;\theta) \neq f(y|x;\theta_0)] > 0 \implies \Pr[a(w) \neq 1] > 0

Now, WTS E[\log a(w)] < \log\{E[a(w)]\}. We use the strict version of Jensen's inequality, which states that if c(x) is a strictly concave function and x is a nonconstant random variable, then E[c(x)] < c[E(x)].
Proof.
Set c(x) = log(x), since log(x) is strictly concave and a(w) is non-constant.
Therefore

E [log a(w)] < log {E [a(w)]}


Now, WTS that E[a(w)] = 1. Note that the conditional mean of a(w) equals 1 because:

E[a(w)|x] = \int a(w) f(y|x;\theta_0)\,dy = \int \frac{f(y|x;\theta)}{f(y|x;\theta_0)} f(y|x;\theta_0)\,dy = \int f(y|x;\theta)\,dy = 1

By the Law of Total Expectations, E[a(w)] = 1. Combining the results:

E[\log a(w)] < \log(1) = 0

But \log a(w) = \log f(y|x;\theta) - \log f(y|x;\theta_0), so E[\log f(y|x;\theta)] - E[\log f(y|x;\theta_0)] < 0, which is the desired result.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Differentiability
Assumption IV: Integrability
The pdf f (yi |xi ; θ) is twice continuously differentiable in θ for all θ ∈ Θ.
Furthermore, the support S(θ) of f (yi |xi ; θ) does not depend on θ, and
differentiation and integration are interchangeable in the sense that

\frac{\partial}{\partial\theta}\int_S dF(y_i|x_i;\theta) = \int_S \frac{\partial}{\partial\theta}\, dF(y_i|x_i;\theta)
\frac{\partial^2}{\partial\theta\,\partial\theta^\top}\int_S dF(y_i|x_i;\theta) = \int_S \frac{\partial^2}{\partial\theta\,\partial\theta^\top}\, dF(y_i|x_i;\theta)

and

\frac{\partial E[\ln f(y_i|x_i;\theta)|x_i]}{\partial\theta} = E\left[\left.\frac{\partial \ln f(y_i|x_i;\theta)}{\partial\theta}\right| x_i\right]
\frac{\partial^2 E[\ln f(y_i|x_i;\theta)|x_i]}{\partial\theta\,\partial\theta^\top} = E\left[\left.\frac{\partial^2 \ln f(y_i|x_i;\theta)}{\partial\theta\,\partial\theta^\top}\right| x_i\right]

where all terms exist. In this case, we denote the support of F(y) simply by S.
The Score Function
Definition (Score Function)
The score function is defined as the vector of first partial derivatives of the
log-likelihood function with respect to the parameter vector θ:
s(w; \theta) = \frac{\partial \ln f(y|X;\theta)}{\partial\theta} = \begin{pmatrix} \partial \ln f(y|X;\theta)/\partial\theta_1 \\ \partial \ln f(y|X;\theta)/\partial\theta_2 \\ \vdots \\ \partial \ln f(y|X;\theta)/\partial\theta_K \end{pmatrix}

The score vector for observation i is:

s(w_i; \theta) = \frac{\partial \ln f(y_i|x_i;\theta)}{\partial\theta}

Because of the additivity of terms in the log-likelihood function, we can write:

s(w; \theta) = \sum_{i=1}^n s(w_i; \theta)
Score Identity

Lemma (Score Identity)


Under Integrability and Distribution Assumption:

E[s(w; \theta_0)] = 0

We have to be clear whether we are speaking about the score of a single observation, s(w_i; \theta), or the score of the sample, s(w; \theta).
Since under random sampling s(w; \theta) = \sum_{i=1}^n s(w_i; \theta), it is sufficient to establish that E[s(w_i; \theta_0)] = 0.
Proof.
First, we derive an integral property of the pdf. Because we are assuming F(y|x;\theta) is a proper cdf,

\int_S dF(y_i|x_i;\theta) = \int_S f(y_i|x_i;\theta)\,dy_i = 1   (2)

for all \theta \in \Theta. Given differentiability, we can differentiate both sides of this equality with respect to \theta:

0 = \int_S \frac{\partial}{\partial\theta} f(y_i|x_i;\theta)\,dy_i   (3)

This equation states how changes in f(y_i|x_i;\theta) resulting from changes in \theta are restricted by (2). We can rewrite (3) as

0 = \int_S \frac{f(y_i|x_i;\theta)}{f(y_i|x_i;\theta)} \frac{\partial}{\partial\theta} f(y_i|x_i;\theta)\,dy_i
0 = \int_S \frac{1}{f(y_i|x_i;\theta)} \frac{\partial f(y_i|x_i;\theta)}{\partial\theta}\, \underbrace{dF(y_i|x_i;\theta)}_{f(y_i|x_i;\theta)dy_i}   (4)
Proof.
Now we interpret this integral equation as an expectation. Consider:

\frac{\partial \ln f(y_i|x_i;\theta)}{\partial\theta} \equiv \frac{1}{f(y_i|x_i;\theta)} \frac{\partial f(y_i|x_i;\theta)}{\partial\theta}
s(w_i;\theta) \equiv \frac{1}{f(y_i|x_i;\theta)} \frac{\partial f(y_i|x_i;\theta)}{\partial\theta}   (5)
s(w_i;\theta)\, f(y_i|x_i;\theta) \equiv \frac{\partial f(y_i|x_i;\theta)}{\partial\theta}

Then, substituting into (4):

\int_S s(w_i;\theta)\,dF(y_i|x_i;\theta) = 0

This holds for any \theta \in \Theta, in particular for \theta = \theta_0. Setting \theta = \theta_0, we obtain:

\int_S s(w_i;\theta_0)\,dF(y_i|x_i;\theta_0) = E[s(w_i;\theta_0)|x] = 0

Then, by the Law of Total Expectations, we obtain the desired result.
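A Monte Carlo sketch of the score identity (not from the slides; the normal (µ, σ²) model and the particular true values are illustrative): the sample average of the score, evaluated at the true parameters, should be close to the zero vector.

```python
# Sketch: E[s(w_i; theta_0)] = 0 for the normal model with theta = (mu, sigma^2).
import numpy as np

rng = np.random.default_rng(7)
mu0, sigma0_sq = 2.0, 4.0
y = rng.normal(loc=mu0, scale=np.sqrt(sigma0_sq), size=1_000_000)

score_mu = (y - mu0) / sigma0_sq                                         # d log f / d mu
score_s2 = -1.0 / (2 * sigma0_sq) + (y - mu0) ** 2 / (2 * sigma0_sq**2)  # d log f / d sigma^2

print(score_mu.mean(), score_s2.mean())   # both approximately 0
```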
What if the support depends on θ?
In this case the support is S(\theta) = \{y : A(\theta) \le y \le B(\theta)\}. By definition:

\int_{A(\theta)}^{B(\theta)} f(y|x;\theta)\,dy = 1

Now, using Leibniz's theorem gives:

\frac{\partial}{\partial\theta}\int_{A(\theta)}^{B(\theta)} f(y|x;\theta)\,dy = 0
\int_{A(\theta)}^{B(\theta)} \frac{\partial f(y|x;\theta)}{\partial\theta}\,dy + f(B(\theta)|\theta)\frac{\partial B(\theta)}{\partial\theta} - f(A(\theta)|\theta)\frac{\partial A(\theta)}{\partial\theta} = 0

To interchange the operations of differentiation and integration we need the second and third terms to go to zero. The necessary condition is that

\lim_{y \to A(\theta)} f(y|x;\theta) = 0, \qquad \lim_{y \to B(\theta)} f(y|x;\theta) = 0

Sufficient conditions are that the support does not depend on the parameter, which means that \partial A(\theta)/\partial\theta = \partial B(\theta)/\partial\theta = 0, or that the density is zero at the terminal points.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Hessian

Since we are doing an optimization analysis, we need the Hessian matrix.

H(w;\theta) = \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta\,\partial\theta^\top} =
\begin{pmatrix}
\frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_1^2} & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_1\partial\theta_K} \\
\frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_2^2} & \cdots & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_2\partial\theta_K} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_K\partial\theta_1} & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_K\partial\theta_2} & \cdots & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_K^2}
\end{pmatrix}

If the log-likelihood function is strictly concave in \theta, H(w;\theta) is negative definite. In the scalar case (K = 1), this simply means that the second derivative of the log-likelihood function is negative.
Hessian

Because of the additivity of terms in the log-likelihood function:

H(w;\theta) = \sum_{i=1}^n H(w_i;\theta), \qquad \text{where } H(w_i;\theta) = \frac{\partial^2 \ln f(y_i|x_i;\theta)}{\partial\theta\,\partial\theta^\top}

Remark
It is important to keep in mind that both the score and Hessian depend on the
sample and are therefore random variables (they differ in repeated samples).
Information Identity

To analyze the variance and the limiting distribution of the ML


estimator, we require some results on the Fisher information matrix.
It is very related to the Hessian matrix.
The information matrix of a sample is simply defined as the negative
expectation of the Hessian Matrix:

I(θ) = −E [H(w, θ)]

Why is it useful?
I It can be used to assess whether the likelihood function is “well behaved”
(Identification)
I Important result: the information matrix is the inverse of the variance of
the maximum likelihood estimator.
I Cramér Rao lower bound.
Information matrix equality

Information matrix equality


The information matrix can be derived in two ways: either as minus the expected Hessian, or alternatively as the variance of the score function, both evaluated at the true parameter \theta_0.
Information Identity

Assumption V: Finite Information



 
Var\left[\frac{\partial}{\partial\theta}\ln f(y|X;\theta)\right] \equiv Var[s(w;\theta)] exists.

Lemma (Information Identity)

Under the Distribution, Differentiability and Finite Information Assumption:

E\left[\frac{\partial^2}{\partial\theta\,\partial\theta^\top}\ln f(y|X;\theta)\right] = -\,Var[s(w;\theta)]

Proof: (Homework)
Information Identity

Note the following:

Var[s(w_i;\theta_0)] = E\big[\underbrace{s(w_i;\theta_0)}_{(K\times 1)}\underbrace{s(w_i;\theta_0)^\top}_{(1\times K)}\big] - \underbrace{E[s(w_i;\theta_0)]\,E[s(w_i;\theta_0)]^\top}_{=0}
= E[s(w_i;\theta_0)s(w_i;\theta_0)^\top]

Therefore we can write:

-I(\theta_0) = E[H(w_i;\theta_0)] = -\,Var[s(w_i;\theta_0)] = -E[s(w_i;\theta_0)s(w_i;\theta_0)^\top]


Example

Recall that:

\log f(y_i|x_i;\theta) = -0.5\log(2\pi\sigma^2) - \frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}

We have:

s(w_i;\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_i \hat\varepsilon_i \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}\hat\varepsilon_i^2 \end{pmatrix}

H(w_i;\theta) = \begin{pmatrix} -\frac{1}{\sigma^2} x_i x_i^\top & -\frac{1}{\sigma^4} x_i \hat\varepsilon_i \\ -\frac{1}{\sigma^4} x_i^\top \hat\varepsilon_i & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}\hat\varepsilon_i^2 \end{pmatrix}

s(w_i;\theta)s(w_i;\theta)^\top = \begin{pmatrix} \frac{1}{\sigma^4} x_i x_i^\top \hat\varepsilon_i^2 & -\frac{1}{2\sigma^4} x_i \hat\varepsilon_i + \frac{1}{2\sigma^6} x_i \hat\varepsilon_i^3 \\ -\frac{1}{2\sigma^4} x_i^\top \hat\varepsilon_i + \frac{1}{2\sigma^6} x_i^\top \hat\varepsilon_i^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\hat\varepsilon_i^2 + \frac{1}{4\sigma^8}\hat\varepsilon_i^4 \end{pmatrix}

where w_i = (y_i, x_i^\top)^\top, \theta = (\beta^\top, \sigma^2)^\top and \hat\varepsilon_i \equiv y_i - x_i^\top\beta.
Example

So for \theta = \theta_0 the \hat\varepsilon_i in these expressions can be replaced by \varepsilon_i. In the linear regression model, E(\varepsilon_i|x_i) = 0. Also, since \varepsilon_i \sim N(0, \sigma_0^2), we have E(\varepsilon_i^3) = 0 and E(\varepsilon_i^4) = 3\sigma_0^4. Using these relations, we have:

-E[H(w_i;\theta_0)] = E[s(w_i;\theta_0)s(w_i;\theta_0)^\top] = \begin{pmatrix} \frac{1}{\sigma_0^2} E(x_i x_i^\top) & 0 \\ 0^\top & \frac{1}{2\sigma_0^4} \end{pmatrix}

If E(x_i x_i^\top) is nonsingular, then E[H(w_i;\theta_0)] is nonsingular.
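A simulation sketch of the information matrix equality for this example (not from the slides; the design, sample size and true values are illustrative): at θ0, the average of −H(wᵢ; θ0) and the average of s(wᵢ; θ0)s(wᵢ; θ0)′ both approach diag(E(xᵢxᵢ′)/σ0², 1/(2σ0⁴)).

```python
# Sketch: compare the mean of -H(w_i; theta_0) with the mean outer product of scores.
import numpy as np

rng = np.random.default_rng(3)
n, beta0, s2 = 200_000, np.array([1.0, -0.5]), 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(scale=np.sqrt(s2), size=n)        # eps plays the role of y_i - x_i'beta_0

# Score blocks evaluated at the true parameters.
s_beta = X * (eps / s2)[:, None]                                # n x 2
s_s2 = -1.0 / (2 * s2) + eps**2 / (2 * s2**2)                   # n
scores = np.column_stack([s_beta, s_s2])                        # n x 3

outer_mean = scores.T @ scores / n                              # average of s s'
minus_hess_mean = np.zeros((3, 3))                              # average of -H(w_i; theta_0)
minus_hess_mean[:2, :2] = X.T @ X / (n * s2)
minus_hess_mean[:2, 2] = minus_hess_mean[2, :2] = (X * eps[:, None]).mean(axis=0) / s2**2
minus_hess_mean[2, 2] = -1.0 / (2 * s2**2) + (eps**2).mean() / s2**3

print(np.round(outer_mean, 3))
print(np.round(minus_hess_mean, 3))   # both close to diag(E(xx')/s2, 1/(2*s2**2))
```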


1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Some Ideas

For the OLS estimator, consistency can be shown by writing down the sampling error explicitly and applying a LLN.
This cannot be done for a nonlinear estimator such as the MLE, since a closed-form expression for the finite-sample estimator does not exist.
That is, the MLE is an implicit function of the random sample.

Question
How can we proceed?
Some Ideas
Using some LLN we know that:

\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) \xrightarrow{p} E[\log f(y_i|x_i;\theta)]   (6)

for any fixed parameter value \theta. That is, the sample average log-likelihood function converges to the expected log-likelihood for any value of \theta. Recall that:

\hat\theta_n \equiv \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta)
\theta_0 \equiv \arg\max_{\theta\in\Theta} E[\log f(y_i|x_i;\theta)]

We would like to say that, given that \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) \xrightarrow{p} E[\log f(y_i|x_i;\theta)], then \hat\theta_n \xrightarrow{p} \theta_0.
Some Ideas

We might be able to do this using the continuous mapping theorem.
Let X_n = \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta), and g(\cdot) = \arg\max_{\theta\in\Theta}(\cdot).
Then we would like to say that if X_n \xrightarrow{p} X then g(X_n) \xrightarrow{p} g(X).

In words:
If the sample average of the log-likelihood function is close to the true expected value of the log-likelihood function, then we would expect that \hat\theta_n will be close to the maximizer of the expected log-likelihood (as n increases without bound).

However, we cannot do that!


What is the problem?

The problem is that the argument of \arg\max_{\theta\in\Theta}(\cdot) is a function of \theta, not a real vector:
I The concept of convergence in probability was defined for sequences of random variables.
Therefore, we need to define what we mean by the probability limit of a sequence of random functions, as opposed to a sequence of random variables:

Convergence for a sequence of random variables: X_n = X_n(\omega), \omega \in \Omega
Convergence for a sequence of random functions: Q_n = Q_n(\omega, \theta), \omega \in \Omega
Example

Example
In ML estimation, the log-likelihood is a function of the sample data (a
random vector that depends on ω) and of a parameter θ. By increasing the
sample size, we obtain a sequence of log-likelihoods that depend on ω and θ.
Consistency

How is the distance between two functions over a set containing an infinite
number of possible comparisons at different values of θ measured?

IOW, since we are dealing with convergence on a function space we


need to define when two functions are close to one another.
To reduce the infinite dimensional character of a function to a
one-dimensional concept of convergence, we take the supremum of the
absolute difference of the function values over all θ in Θ
Uniform Convergence in Probability

Definition (Uniform Convergence in Probability)


The sequence of real-valued functions \{Q_n(\theta)\} converges uniformly in probability to the limit function Q_0(\theta) if \sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0. We will say that Q_n(\theta) \xrightarrow{p} Q_0(\theta) uniformly.
Another way to express uniform convergence in probability is:

\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| = o_p(1)

IOW, instead of requiring that the distance |Q_n(\theta) - Q_0(\theta)| converge in probability to 0 for each \theta, we require convergence of \sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)|, which is the maximum distance that can be found by ranging over the parameter space.
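A small sketch of this definition (not from the slides; the Bernoulli model, the grid and the restricted parameter range are illustrative choices): the sup distance between the sample average log-likelihood Q_n(θ) and its expectation Q_0(θ) shrinks as n grows.

```python
# Sketch: sup_theta |Q_n(theta) - Q_0(theta)| for the Bernoulli log-likelihood.
import numpy as np

rng = np.random.default_rng(5)
p0 = 0.3
theta = np.linspace(0.05, 0.95, 181)                       # grid over a restricted parameter space
Q0 = p0 * np.log(theta) + (1 - p0) * np.log(1 - theta)     # expected log-likelihood

for n in [100, 10_000, 1_000_000]:
    y = rng.binomial(1, p0, size=n)
    ybar = y.mean()
    Qn = ybar * np.log(theta) + (1 - ybar) * np.log(1 - theta)  # sample average log-likelihood
    print(n, np.max(np.abs(Qn - Q0)))                      # the sup distance decreases with n
```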
Uniform Convergence in Probability
[Figure: Q_n(\theta) lying inside the uniform band Q_0(\theta) \pm \varepsilon over the whole parameter space, with the maximizer \theta_0 and the maximum Q_0(\theta_0) marked.]
Uniform Convergence in Probability

Extending the concept to random vectors is straightforward. Now suppose


that {Qn (θ)} is a sequence of K × 1 random vectors that depend both on the
data and on the parameter θ ∈ Θ. This sequence of random vectors is
uniformly convergent in probability to Q0 (θ) if and only if

\sup_{\theta\in\Theta}\|Q_n(\theta) - Q_0(\theta)\| = o_p(1)

where \|Q_n(\theta) - Q_0(\theta)\| denotes the Euclidean norm of the vector Q_n(\theta) - Q_0(\theta). By taking the supremum over \theta we obtain another random quantity that does not depend on \theta.
Pointwise Convergence in probability

Definition (Pointwise Convergence in probability)


The sequence of real-valued functions \{Q_n(\theta)\} converges pointwise in probability if and only if |Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0 for each \theta \in \Theta.

Uniform convergence is stronger than pointwise convergence.


Uniform LLN

Now we present the uniform LLN to study sequences of random functions, which is analogous to Chebychev's LLN for averages of random variables.

Theorem (Uniform LLN)

Suppose that Q(\theta; U) is a continuous function over \theta \in \Theta, a closed and bounded subset of \mathbb{R}^p, and that \{U_n\} is a sequence of i.i.d. random variables with cdf F_U(u). If E[\sup_{\theta\in\Theta}\|Q(\theta; U)\|] exists, then
1 E[Q(\theta; U)] is continuous over \theta \in \Theta, and
2 \frac{1}{n}\sum_{i=1}^n Q(\theta; u_i) \xrightarrow{p} E[Q(\theta; U)] uniformly.

Or, as Newey and McFadden (1994, Lemma 2.4) state:

\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^n Q(\theta; u_i) - E[Q(\theta; U)]\right| \xrightarrow{p} 0
Uniform LLN

The following Theorem makes the connection between the uniform convergence of \frac{1}{n}\sum_{i=1}^n Q(\theta; u_i) to E[Q(\theta; U)] and the convergence of \hat\theta_n to \theta_0, using the assumption of a compact parameter space.
Consistency

Theorem (Consistency of Maxima with Compact Parameter Space)


Suppose that:
1 (compact parameter space) \Theta \subset \mathbb{R}^p is a closed and bounded parameter space,
2 (uniform convergence) Q_n(\theta) is a sequence of functions that converges in probability uniformly to a function Q_0(\theta) on \Theta,
3 (continuity) Q_n(\theta) is continuous in \theta for any data (w_1, ..., w_n),
4 (identification) Q_0(\theta) is uniquely maximized at \theta_0 \in \Theta.
Then \hat\theta_n \equiv \arg\max_{\theta\in\Theta} Q_n(\theta) converges in probability to \theta_0.
Consistency

Intuition:
If Q_n(\theta) converges uniformly to Q_0(\theta), then the characteristics of Q_n(\theta) will be close to the characteristics of Q_0(\theta) as n \to \infty. One particular characteristic is the point \theta_0 where Q_0(\theta) is uniquely maximized. Then, it is expected that the maximizer of Q_n(\theta), \hat\theta, will be close to the maximizer of Q_0(\theta).
Consistency

Theorem (Consistency of Maxima without Compactness)


Suppose that:
1 (interior) \theta_0 is an element of the interior of a convex parameter space \Theta,
2 (pointwise convergence) Q_n(\theta) converges in probability to Q_0(\theta) for all \theta \in \Theta,
3 (concavity) Q_n(\theta) is concave over the parameter space for any data (w_1, ..., w_n),
4 (identification) Q_0(\theta) is uniquely maximized at \theta_0 \in \Theta.
Then, as n \to \infty, \hat\theta_n exists with probability approaching 1 and \hat\theta_n \xrightarrow{p} \theta_0.
Consistency
Theorem (Consistency of conditional ML with compact parameter)
Let \{y_i, x_i\} be i.i.d. with conditional density f(y_i|x_i;\theta_0) and let \hat\theta be the conditional ML estimator, which maximizes the average log conditional likelihood:

\hat\theta_n = \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta)

Suppose the model is correctly specified so that \theta_0 is in \Theta. Suppose that
1 (compactness) the parameter space \Theta is a compact subset of \mathbb{R}^K,
2 f(y_i|x_i;\theta) is continuous in \theta for all (y_i, x_i) (here, note Weierstrass's theorem),
3 f(y_i|x_i;\theta) is measurable in (y_i, x_i) for all \theta \in \Theta (so \hat\theta is a well-defined random variable),
4 (identification) \Pr[f(y_i|x_i;\theta) \neq f(y_i|x_i;\theta_0)] > 0 for all \theta \neq \theta_0 in \Theta,
5 (dominance) E[\sup_{\theta\in\Theta}|\log f(y_i|x_i;\theta)|] < \infty (note: the expectation is over y_i and x_i).
Then \hat\theta \xrightarrow{p} \theta_0.
Sketch of Proof.
We would like to apply the Consistency of Maxima with Compact Parameter Space Theorem. In this case, let Q(\theta; U) = \log f(y|x;\theta). Now we verify that the conditions of the theorem are met:
f(y_i|x_i;\theta) is continuous,
Compactness states that \Theta is a closed and bounded subset of \mathbb{R}^K,
(y_i, x_i) are i.i.d. with conditional density f(y_i|x_i;\theta_0),
Dominance I states that E[\sup_{\theta\in\Theta}|\log f(y_i|x_i;\theta)|] exists.
Therefore, E[\log f(y_i|x_i;\theta)] is continuous and

\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) \xrightarrow{p} E[\log f(y_i|x_i;\theta)]   (7)

uniformly. Let Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) and Q_0(\theta) = E[\log f(y_i|x_i;\theta)]. Under the additional assumption of Likelihood Identification, we can invoke the strict expected log-likelihood inequality: \theta \neq \theta_0 implies E[\log f(y|x;\theta)] < E[\log f(y|x;\theta_0)]. This implies that Q_0(\theta) is uniquely maximized at \theta_0. Therefore

\hat\theta_n = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) \xrightarrow{p} \theta_0
Consistency
Theorem (Consistency of conditional ML without Compactness)
Let \{y_i, x_i\} be i.i.d. with conditional density f(y_i|x_i;\theta_0) and let \hat\theta be the conditional ML estimator, which maximizes the average log conditional likelihood:

\hat\theta_n = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta)

Suppose the model is correctly specified so that \theta_0 is in \Theta. Suppose that
1 the true parameter vector \theta_0 is an element of the interior of a convex parameter space \Theta \subset \mathbb{R}^K,
2 \log f(y_i|x_i;\theta) is concave in \theta for all (y_i, x_i),
3 \log f(y_i|x_i;\theta) is measurable in (y_i, x_i),
4 (identification) \Pr[f(y_i|x_i;\theta) \neq f(y_i|x_i;\theta_0)] > 0 for all \theta \neq \theta_0 in \Theta,
5 E[|\log f(y_i|x_i;\theta)|] < \infty (i.e., E[\log f(y_i|x_i;\theta)] exists and is finite) for all \theta \in \Theta (note: the expectation is over y_i and x_i).
Then, as n \to \infty, \hat\theta exists with probability approaching 1 and \hat\theta_n \xrightarrow{p} \theta_0.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Asymptotic
Theorem (Asymptotic Normality of Conditional ML)
Let w_i \equiv (y_i, x_i^\top)^\top be i.i.d. Suppose the conditions of either Theorem 18 or 19 are satisfied, so that \hat\theta_n \xrightarrow{p} \theta_0. Suppose, in addition, that:
1 \theta_0 is in the interior of \Theta,
2 f(y_i|x_i;\theta_0) is twice continuously differentiable in \theta for all (y_i, x_i),
3 E[s(w_i;\theta_0)] = 0 and -E[H(w_i;\theta_0)] = E[s(w_i;\theta_0)s(w_i;\theta_0)^\top],
4 (local dominance condition on the Hessian) for some neighborhood N of \theta_0,

E\left[\sup_{\theta\in N}\|H(w_i;\theta)\|\right] < \infty

so that for any consistent estimator \tilde\theta, \frac{1}{n}\sum_{i=1}^n H(w_i;\tilde\theta) \xrightarrow{p} E[H(w_i;\theta_0)],
5 E[H(w_i;\theta_0)] is nonsingular.
Then:

\sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N(0, V), \qquad V = -\{E[H(w_i;\theta_0)]\}^{-1} = \{E[s(w_i;\theta_0)s(w_i;\theta_0)^\top]\}^{-1}
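As a quick illustration of this theorem (not part of the lecture; the exponential model, sample size and number of replications are arbitrary choices), the sketch below simulates the MLE many times and checks that √n(θ̂ − θ0) behaves like a N(0, V) draw. For the exponential density f(y; λ) = λe^{−λy} the MLE is λ̂ = 1/ȳ and V = λ0², since −E[H(wᵢ; λ0)] = 1/λ0².

```python
# Sketch: Monte Carlo check of sqrt(n)(lam_hat - lam0) ~ N(0, lam0^2) for the exponential model.
import numpy as np

rng = np.random.default_rng(11)
lam0, n, reps = 2.0, 500, 20_000

draws = rng.exponential(scale=1 / lam0, size=(reps, n))
lam_hat = 1 / draws.mean(axis=1)          # closed-form MLE in each replication
z = np.sqrt(n) * (lam_hat - lam0)

print(z.mean(), z.var())                   # approximately 0 and lam0**2 = 4
print(np.mean(np.abs(z) <= 1.96 * lam0))   # approximately 0.95, as for a N(0, lam0^2) draw
```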
Asymptotic

The intuition is the following:
Since we usually don't have an explicit solution for the estimator, we need to focus on the asymptotic behaviour of the score function.
Assuming that \hat\theta_n \xrightarrow{p} \theta_0, the behaviour of the score function matters only within an arbitrarily small neighbourhood of \theta_0.
... after all, \hat\theta_n will fall within such neighbourhoods with arbitrarily high probability for a large enough sample size...
... and within such a neighbourhood the score function is essentially linear (Taylor series expansion).
Mean Value Theorem

Theorem (Mean Value Theorem)


Let s : \mathbb{R}^K \to \mathbb{R} be defined on an open convex set \Theta \subset \mathbb{R}^K such that s is continuously differentiable on \Theta with gradient \nabla s. Then for any points \theta and \theta_0 in \Theta there exists a point \bar\theta on the line segment joining them such that s(\theta) = s(\theta_0) + \nabla s(\bar\theta)(\theta - \theta_0).
Asymptotic
Proof.
The objective function is:

Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ln f(y_i|x_i;\theta)

Given that f(y_i|x_i;\theta_0) is twice continuously differentiable in \theta, and given that \theta_0 is in the interior of \Theta, the maximum likelihood estimator satisfies

\frac{\partial \log L(\hat\theta)}{\partial\theta} = s(w;\hat\theta) = 0

We need to know about the behavior of the gradient around the true parameter. Expand this set of equations in a Taylor series around the true parameter \theta_0. We will use the mean value theorem to truncate the Taylor series at the second term:

\underbrace{\frac{\partial \log L(\hat\theta)}{\partial\theta}}_{(K\times 1)} = \underbrace{\frac{\partial \log L(\theta_0)}{\partial\theta}}_{(K\times 1)} + \underbrace{\frac{\partial^2 \log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}}_{(K\times K)}\underbrace{(\hat\theta - \theta_0)}_{(K\times 1)}

Dividing through by n,

0 = \frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0) + \left[\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta)\right](\hat\theta - \theta_0)

where \bar\theta = \alpha\hat\theta + (1-\alpha)\theta_0 for some \alpha \in (0,1).
Proof.
So,

\sqrt{n}\,(\hat\theta - \theta_0) = \left[-\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta)\right]^{-1}\left(\sqrt{n}\,\frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0)\right)

We know that

\hat\theta \xrightarrow{p} \theta_0 \implies \bar\theta \xrightarrow{p} \theta_0

By the uniform LLN, we know that

\frac{1}{n}\sum_{i=1}^n H(w_i;\theta) \xrightarrow{p} E[H(w_i;\theta)] \quad \text{uniformly in } \theta\in\Theta

Then, applying our Lemma:

\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta_n) \xrightarrow{p} E[H(w_i;\theta_0)]

since E[H(w_i;\theta_0)] exists. Finally, using probability limit continuity and nonsingular information:

\left[\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta_n)\right]^{-1} \xrightarrow{p} E[H(w_i;\theta_0)]^{-1}
Proof.
Since (y_i, x_i) are i.i.d., \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0)\right) is a scaled sum of the i.i.d. variables s(w_i;\theta_0). The score identity lemma implies that E[s(w_i;\theta_0)] = 0, and the Information Identity implies that

Var[s(w_i;\theta_0)] = E[s(w_i;\theta_0)s(w_i;\theta_0)^\top] = -E[H(w_i;\theta_0)]

The Lindeberg-Levy CLT therefore implies:

\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0)\right) \xrightarrow{d} N(0, -E[H(w_i;\theta_0)])
Proof.
Then:

\sqrt{n}\,(\hat\theta - \theta_0) = \left[-\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta)\right]^{-1}\left(\sqrt{n}\,\frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0)\right)
\xrightarrow{d} -E[H(w_i;\theta_0)]^{-1}\, N(0, -E[H(w_i;\theta_0)])
= N\left(0,\; -E[H(w_i;\theta_0)]^{-1} E[H(w_i;\theta_0)] E[H(w_i;\theta_0)]^{-1}\right)
= N\left(0,\; -E[H(w_i;\theta_0)]^{-1}\right)
= N\left(0,\; [I(w_i;\theta_0)]^{-1}\right)
Variance Estimation
For large but finite samples, we can therefore write the approximate distribution of \hat\theta_n as

\hat\theta \overset{a}{\sim} N\left(\theta_0,\; n^{-1}[I(\theta_0)]^{-1}\right)

where here I(\theta_0) = -E[H(w_i;\theta_0)] denotes the information of a single observation. We have three potential estimators of V = [I(\theta_0)]^{-1}, based on three ways of estimating I(\theta_0):

The empirical mean of minus the Hessian:

\hat V_1 = \left(\frac{1}{n}\sum_{i=1}^n -H(w_i;\hat\theta)\right)^{-1}

The empirical variance of the score:

\hat V_2 = \left(\frac{1}{n}\sum_{i=1}^n s(w_i;\hat\theta)s(w_i;\hat\theta)^\top\right)^{-1}

Minus the expected Hessian evaluated at \hat\theta:

\hat V_3 = \left(\frac{1}{n}\sum_{i=1}^n -E\big[H(w_i;\hat\theta)\big]\right)^{-1}
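A sketch of the first two estimators (not from the slides; the exponential model and the numbers are illustrative): for f(y; λ) = λe^{−λy} the per-observation Hessian is −1/λ² and the score is 1/λ − y, so both V̂₁ and V̂₂ estimate V = λ0².

```python
# Sketch: V1_hat (inverse of the mean of -H) and V2_hat (inverse of the mean of s^2)
# for the exponential model, evaluated at the MLE lam_hat = 1 / ybar.
import numpy as np

rng = np.random.default_rng(13)
lam0, n = 2.0, 5_000
y = rng.exponential(scale=1 / lam0, size=n)
lam_hat = 1 / y.mean()

scores = 1 / lam_hat - y                     # s(w_i; lam_hat)
hessians = -np.full(n, 1 / lam_hat**2)       # H(w_i; lam_hat)

V1 = 1 / np.mean(-hessians)                  # (mean of -H)^(-1)
V2 = 1 / np.mean(scores**2)                  # (mean of s^2)^(-1)

print(V1, V2, lam0**2)   # both estimate the asymptotic variance V = lam0^2
# Var(lam_hat) itself is then approximated by V1/n (or V2/n).
```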
Proof of Consistency

Evaluated at a θ ∈ Θ, each estimator converges in probability uniformly


to its expectation.
Because \hat\theta_n \xrightarrow{p} \theta_0, evaluated at \hat\theta_n each estimator converges in probability to I(\theta_0).
Because matrix inversion is a continuous transformation, the inverse of each matrix is also a consistent estimator for the variance matrix of the asymptotic distribution of \sqrt{n}(\hat\theta_n - \theta_0).
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Hypothesis Testing

ML estimators are distributed asymptotically normally:
I As the sample size increases, the sampling distribution of an ML estimator becomes approximately normal:

\hat\beta \overset{a}{\sim} N(\beta, V_{\hat\beta})

where, for three coefficients:

V_{\hat\beta} = Var\begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix} = \begin{pmatrix} \sigma^2_{\hat\beta_0} & \sigma_{\hat\beta_0,\hat\beta_1} & \sigma_{\hat\beta_0,\hat\beta_2} \\ \sigma_{\hat\beta_1,\hat\beta_0} & \sigma^2_{\hat\beta_1} & \sigma_{\hat\beta_1,\hat\beta_2} \\ \sigma_{\hat\beta_2,\hat\beta_0} & \sigma_{\hat\beta_2,\hat\beta_1} & \sigma^2_{\hat\beta_2} \end{pmatrix}
Hypothesis Testing

Consider the simple hypothesis

H_0: \beta_k = \beta^*

where \beta^* is the hypothesized value, often equal to 0. Since \sigma^2_{\hat\beta_k} is unknown, it must be estimated, which results in the test statistic:

z = \frac{\hat\beta_k - \beta^*}{\hat\sigma_{\hat\beta_k}}

Under the assumptions justifying ML, if H_0 is true, then z is distributed approximately normally with mean 0 and variance 1 in large samples.
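A tiny sketch with hypothetical numbers (not from the slides): computing z and its two-sided p-value from an estimated coefficient and its ML standard error.

```python
# Sketch: z statistic for H0: beta_k = 0, using assumed values for illustration.
from scipy.stats import norm

beta_hat, se_hat, beta_star = 0.42, 0.15, 0.0
z = (beta_hat - beta_star) / se_hat
p_value = 2 * norm.sf(abs(z))              # two-sided p-value

print(z, p_value)   # z = 2.8, p approximately 0.005: reject H0 at the 5% level
```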
Testing

[Figure: the standard normal density f(z) with the two-sided rejection regions for H_0 beyond z = -1.96 and z = 1.96.]
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
The Trinity

For more complex hypotheses we can use the Wald, likelihood ratio (LR), or Lagrange multiplier (LM) test. These tests can be thought of as a comparison between the estimates obtained after the constraints implied by the hypothesis have been imposed and the estimates obtained without the constraints.
The Trinity: LR Test

The log-likelihood function \log L(\beta) is the solid curve.
\hat\beta_U: unconstrained estimator.
The H_0: \beta = \beta^* imposes the constraint \beta = \beta^*, so that the constrained estimate is \hat\beta_C = \beta^*.
Unless \hat\beta_U = \beta^*, \ln L(\hat\beta_C) \le \ln L(\hat\beta_U).
If the constraint significantly reduces the likelihood, then the null hypothesis is rejected.

[Figure: a log-likelihood curve with its maximum \log L(\hat\beta_U) at \hat\beta_U and the lower value \log L(\hat\beta_C) at \hat\beta_C = \beta^*.]
The Trinity: Wald Test
The Wald test estimates the model without constraints, and assesses the constraint by considering two things:
1. It measures the distance \hat\beta_U - \hat\beta_C = \hat\beta_U - \beta^*. The larger the distance, the less likely it is that the constraint is true.
2. The distance is weighted by the curvature of the log-likelihood function, \partial^2 \log L(\beta)/\partial\beta^2. The larger the second derivative, the faster the curve is changing.
The first log-likelihood function (dashed curve) is nearly flat, so the second derivative evaluated at \hat\beta_U is relatively small. When the second derivative is small, the distance between \hat\beta_U and \hat\beta_C is minor relative to the sampling variation.
The second log-likelihood function has a larger second derivative. With a larger second derivative, the same distance between \hat\beta_U and \hat\beta_C might be significant.

[Figure: two log-likelihood curves with the same maximizer \hat\beta_U but different curvatures \partial^2\log L(\beta)/\partial\beta^2, and the constrained value \hat\beta_C = \beta^*.]
The Trinity: LM (Score) Test

It only estimates the constrained model.
It assesses the slope of the log-likelihood function at the constraint, \partial\log L(\beta)/\partial\beta.
If H_0 is true, the slope (score) at the constraint should be close to 0.
As with the Wald test, the curvature of the log-likelihood function at the constraint is used to assess the significance of a nonzero slope.

[Figure: a log-likelihood curve with the tangent line (the score) drawn at the constrained estimate \hat\beta_C = \beta^*, and the unconstrained maximizer \hat\beta_U.]
The Trinity

When H_0 is true, the Wald, LR, and LM tests are asymptotically equivalent.
As n \to \infty, the sampling distributions of the three tests converge in distribution to \chi^2_r, where r is the number of constraints being tested.
They are equivalent only as n \to \infty. In small samples this is not necessarily true.
Test Statistics

Assume that we have r (r < K) nonlinear restrictions (which include linear restrictions as a special case):

\underbrace{r(\theta_0)}_{r\times 1} = \underbrace{0}_{r\times 1}

The hypotheses are:

H_0: r(\theta_0) = 0
H_1: r(\theta_0) \neq 0

Also, denote \hat\theta as the unconstrained estimator and \tilde\theta as the constrained estimator. So:

\tilde\theta_{ML} = \arg\max_{\theta\in\Theta}\left\{\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta)\right\} \quad \text{s.t. } r(\theta) = 0
Test Statistics

Then the Wald, LM and LR statistics are defined as:

W = \underbrace{r(\hat\theta)^\top}_{1\times r}\Big[\underbrace{R(\hat\theta)}_{r\times K}\underbrace{\left(\tfrac{1}{n}\hat V\right)}_{K\times K}\underbrace{R(\hat\theta)^\top}_{K\times r}\Big]^{-1}\underbrace{r(\hat\theta)}_{r\times 1}

LM = \underbrace{s(\tilde\theta)^\top}_{1\times K}\underbrace{\left(\tfrac{1}{n}\tilde V\right)}_{K\times K}\underbrace{s(\tilde\theta)}_{K\times 1}

LR = 2\cdot n\left[\underbrace{\frac{1}{n}\sum_{i=1}^n \ln f(y_i|x_i;\hat\theta)}_{1\times 1} - \underbrace{\frac{1}{n}\sum_{i=1}^n \ln f(y_i|x_i;\tilde\theta)}_{1\times 1}\right]
Test Statistics

where:

\underbrace{R(\theta_0)}_{r\times K} = \frac{\partial r(\theta_0)}{\partial\theta^\top} \quad \text{(Jacobian of } r(\theta_0)\text{)}

\underbrace{\hat V}_{K\times K} = \hat I^{-1} = \left[-\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta^\top}\ln f(y_i|x_i;\hat\theta)\right]^{-1}
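A sketch of the three statistics for a very simple case (not from the slides): the Bernoulli model with the single scalar restriction H0: p = p*, so r(θ) = p − p*, R = 1, and V = p(1−p) is the inverse information of one observation; s below denotes the full-sample score. The true p0, p* and n are illustrative.

```python
# Sketch: Wald, LM and LR statistics for H0: p = p_star in the Bernoulli model.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(17)
p0, p_star, n = 0.35, 0.30, 2_000
y = rng.binomial(1, p0, size=n)

p_hat = y.mean()     # unconstrained MLE
p_til = p_star       # constrained estimate under H0

W = n * (p_hat - p_star) ** 2 / (p_hat * (1 - p_hat))
score_at_ptil = np.sum((y - p_til) / (p_til * (1 - p_til)))    # full-sample score at p*
LM = score_at_ptil**2 * p_til * (1 - p_til) / n

def loglik(p):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

LR = 2 * (loglik(p_hat) - loglik(p_til))

for name, stat in [("Wald", W), ("LM", LM), ("LR", LR)]:
    print(name, round(stat, 2), "p-value:", round(chi2.sf(stat, df=1), 4))
# The three statistics are close to one another in large samples and share a chi2(1) limit.
```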
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Proof: Wald Statistic.
Write W as:

W = c_n^\top Z_n^{-1} c_n, \qquad c_n \equiv \sqrt{n}\,r(\hat\theta), \qquad Z_n \equiv R(\hat\theta)\hat V R(\hat\theta)^\top   (8)

First, we will show that c_n \xrightarrow{d} N(0, R(\theta_0)VR(\theta_0)^\top). Applying the MVT (21) (to truncate the Taylor series) to r(\hat\theta) around \theta_0:

r(\hat\theta) = \underbrace{r(\theta_0)}_{=0 \text{ under } H_0} + R(\bar\theta)(\hat\theta - \theta_0)

\underbrace{\sqrt{n}\,r(\hat\theta)}_{r\times 1} = \underbrace{R(\bar\theta)}_{r\times K}\underbrace{\sqrt{n}(\hat\theta - \theta_0)}_{K\times 1} \quad \text{(multiplying by } \sqrt{n}\text{)}
= R(\bar\theta)\sqrt{n}(\hat\theta - \theta_0) + R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) - R(\theta_0)\sqrt{n}(\hat\theta - \theta_0)
= \big[R(\bar\theta) - R(\theta_0)\big]\sqrt{n}(\hat\theta - \theta_0) + R(\theta_0)\sqrt{n}(\hat\theta - \theta_0)

where \bar\theta = \alpha\hat\theta + (1-\alpha)\theta_0.
Proof: Wald Statistic.
Note that

\hat\theta \xrightarrow{p} \theta_0 \implies \bar\theta \xrightarrow{p} \theta_0

and, since R(\cdot) is continuous, R(\bar\theta) \xrightarrow{p} R(\theta_0).

Note that \sqrt{n}(\hat\theta - \theta_0) converges in distribution to some random variable, so we can write:

\sqrt{n}\,r(\hat\theta) = \underbrace{\big[R(\bar\theta) - R(\theta_0)\big]\sqrt{n}(\hat\theta - \theta_0)}_{\xrightarrow{p} 0} + R(\theta_0)\sqrt{n}(\hat\theta - \theta_0)
\sqrt{n}\,r(\hat\theta) = R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) + o_p(1)

Furthermore, we know that:

\sqrt{n}(\hat\theta - \theta_0) = \{-E[H(w_i;\theta_0)]\}^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n s(w_i;\theta_0) + o_p(1)   (9)
Proof: Wald Statistic.
Then:

\sqrt{n}\,r(\hat\theta) = R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) + o_p(1)
= R(\theta_0)\left[\{-E[H(w_i;\theta_0)]\}^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n s(w_i;\theta_0) + o_p(1)\right] + o_p(1)
= R(\theta_0)\{-E[H(w_i;\theta_0)]\}^{-1}\underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n s(w_i;\theta_0)}_{\xrightarrow{d} N(0,\,-E[H(w_i;\theta_0)])} + o_p(1)

Then:

\sqrt{n}\,r(\hat\theta) \xrightarrow{d} N\Big(0,\; R(\theta_0)\{-E[H(w_i;\theta_0)]\}^{-1}\big(-E[H(w_i;\theta_0)]\big)\{-E[H(w_i;\theta_0)]\}^{-1}R(\theta_0)^\top\Big)
= N\Big(0,\; R(\theta_0)\underbrace{\{-E[H(w_i;\theta_0)]\}^{-1}}_{V}R(\theta_0)^\top\Big)
Proof: Wald Statistic.
It follows that:

W = c_n^\top Z_n^{-1} c_n = \sqrt{n}\,r(\hat\theta)^\top\big[R(\theta_0)VR(\theta_0)^\top\big]^{-1}\sqrt{n}\,r(\hat\theta) \xrightarrow{d} \chi^2(\#r)

If we have consistent estimators of R(\theta_0) and V, then limit results for continuous functions imply that:

\sqrt{n}\,r(\hat\theta)^\top\big[R(\hat\theta)\hat V R(\hat\theta)^\top\big]^{-1}\sqrt{n}\,r(\hat\theta) \xrightarrow{d} \chi^2(\#r)
Preliminaries to the next two statistics

The Lagrangian for the constrained problem is

\mathcal{L} = \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) + \underbrace{\lambda^\top}_{1\times r}\underbrace{r(\theta)}_{r\times 1}

Then the FOCs are:

\sqrt{n}\,s(\tilde\theta) + \sqrt{n}\,R(\tilde\theta)^\top\lambda = 0
\sqrt{n}\,r(\tilde\theta) = 0   (10)

Following a similar reasoning as in the previous proof:

\sqrt{n}\,r(\tilde\theta) = R(\theta_0)\sqrt{n}(\tilde\theta - \theta_0) + o_p(1) = O_p(1)
Preliminaries to the next two statistics
A Taylor expansion of s(\tilde\theta) yields:

\frac{\partial\log L(\tilde\theta)}{\partial\theta} = \frac{\partial\log L(\theta_0)}{\partial\theta} + \frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \theta_0)
= \frac{\partial\log L(\theta_0)}{\partial\theta} + \frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \theta_0) + \left[\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top} - \frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\right](\tilde\theta - \theta_0)

Multiplying by \sqrt{n} (here \log L denotes the average log-likelihood):

\sqrt{n}\,\frac{\partial\log L(\tilde\theta)}{\partial\theta} = \underbrace{\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta}}_{\xrightarrow{d} N(0,\,-E[H(w_i;\theta_0)])} + \underbrace{\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}}_{\xrightarrow{p} E[H(w_i;\theta_0)]}\underbrace{\sqrt{n}(\tilde\theta - \theta_0)}_{O_p(1)} + \underbrace{\left[\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top} - \frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\right]}_{o_p(1)}\sqrt{n}(\tilde\theta - \theta_0)

= \sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} + E[H(w_i;\theta_0)]\sqrt{n}(\tilde\theta - \theta_0) + o_p(1)
= O_p(1) + O_p(1)O_p(1) + o_p(1) = O_p(1)
Preliminaries to the next two statistics

Then

R(\tilde\theta)^\top\sqrt{n}\,\lambda_n = -\underbrace{\sqrt{n}\,s(\tilde\theta)}_{O_p(1)} \implies R(\tilde\theta)^\top\sqrt{n}\,\lambda_n = O_p(1)

Since R(\tilde\theta) \xrightarrow{p} R_0, we obtain:

R(\tilde\theta)^\top\sqrt{n}\,\lambda_n = R_0^\top\sqrt{n}\,\lambda_n + \big[R(\tilde\theta) - R_0\big]^\top\sqrt{n}\,\lambda_n = R_0^\top\sqrt{n}\,\lambda_n + o_p(1)

Substituting these three equations into the FOCs, we obtain:

\begin{pmatrix} \underbrace{E[H(w_i;\theta_0)]}_{K\times K} & \underbrace{R_0^\top}_{K\times r} \\ \underbrace{R_0}_{r\times K} & \underbrace{0}_{r\times r} \end{pmatrix}\begin{pmatrix} \underbrace{\sqrt{n}(\tilde\theta - \theta_0)}_{K\times 1} \\ \underbrace{\sqrt{n}\,\lambda_n}_{r\times 1} \end{pmatrix} = \begin{pmatrix} \underbrace{-\sqrt{n}\,\dfrac{\partial\log L(\theta_0)}{\partial\theta}}_{K\times 1} \\ \underbrace{0}_{r\times 1} \end{pmatrix} + o_p(1)
Preliminaries to the next two statistics

This can be solved using the partitioned inverse formula:

\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}A_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \\ -(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}A_{21}A_{11}^{-1} & (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \end{pmatrix}

Then:

\begin{pmatrix} \sqrt{n}(\tilde\theta - \theta_0) \\ \sqrt{n}\,\lambda_n \end{pmatrix} = \begin{pmatrix} -E[H(w_i;\theta_0)]^{-1} + E[H(w_i;\theta_0)]^{-1}R_0^\top\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}R_0E[H(w_i;\theta_0)]^{-1} \\ -\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}R_0E[H(w_i;\theta_0)]^{-1} \end{pmatrix}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} + o_p(1)
Proof LR.
By a second-order Taylor expansion:

\log L(\tilde\theta) = \log L(\hat\theta) + \frac{\partial\log L(\hat\theta)}{\partial\theta}^{\!\top}(\tilde\theta - \hat\theta) + \frac{1}{2}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \hat\theta)

where \bar\theta = \alpha\tilde\theta + (1-\alpha)\hat\theta for some \alpha \in [0,1]. Recall that \frac{\partial\log L(\hat\theta)}{\partial\theta} = 0 and that \frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top} \xrightarrow{p} E[H(w_i;\theta_0)] (here \log L is the average log-likelihood). It follows that:

\log L(\tilde\theta) - \log L(\hat\theta) = \frac{1}{2}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \hat\theta)
2\big[\log L(\tilde\theta) - \log L(\hat\theta)\big] = (\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \hat\theta)
2\cdot n\cdot\big[\log L(\tilde\theta) - \log L(\hat\theta)\big] = \sqrt{n}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}\sqrt{n}(\tilde\theta - \hat\theta)
Proof LR.
Adding and subtracting \sqrt{n}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\sqrt{n}(\tilde\theta - \hat\theta):

2\cdot n\cdot\big[\log L(\tilde\theta) - \log L(\hat\theta)\big] = \sqrt{n}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\sqrt{n}(\tilde\theta - \hat\theta) + \underbrace{\sqrt{n}(\tilde\theta - \hat\theta)^\top\left[\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top} - \frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\right]\sqrt{n}(\tilde\theta - \hat\theta)}_{o_p(1)\,=\,O_p(1)o_p(1)O_p(1)}
= \sqrt{n}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\sqrt{n}(\tilde\theta - \hat\theta) + o_p(1)
= \sqrt{n}(\tilde\theta - \hat\theta)^\top E[H(w_i;\theta_0)]\sqrt{n}(\tilde\theta - \hat\theta) + o_p(1)

Therefore

2\cdot n\cdot\big[\log L(\hat\theta) - \log L(\tilde\theta)\big] = -\sqrt{n}(\tilde\theta - \hat\theta)^\top E[H(w_i;\theta_0)]\sqrt{n}(\tilde\theta - \hat\theta) + o_p(1)
Proof LR.
We know that

\sqrt{n}(\tilde\theta - \hat\theta) = \sqrt{n}(\tilde\theta - \theta_0) - \sqrt{n}(\hat\theta - \theta_0)
= \Big[-E[H(w_i;\theta_0)]^{-1} + E[H(w_i;\theta_0)]^{-1}R_0^\top\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}R_0E[H(w_i;\theta_0)]^{-1}\Big]\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} - \{-E[H(w_i;\theta_0)]\}^{-1}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} + o_p(1)
= E[H(w_i;\theta_0)]^{-1}R_0^\top\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}R_0E[H(w_i;\theta_0)]^{-1}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} + o_p(1)

Then:

2\cdot n\cdot\big[\log L(\hat\theta) - \log L(\tilde\theta)\big] = -\sqrt{n}(\tilde\theta - \hat\theta)^\top E[H(w_i;\theta_0)]\sqrt{n}(\tilde\theta - \hat\theta) + o_p(1)
= -\left(\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta}\right)^{\!\top} E[H(w_i;\theta_0)]^{-1}R_0^\top\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}\underbrace{R_0E[H(w_i;\theta_0)]^{-1}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta}}_{\xrightarrow{d} N\left(0,\,-R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\right)} + o_p(1)
Proof LR.
Then:

R_0E[H(w_i;\theta_0)]^{-1}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} \xrightarrow{d} N\Big(0,\; R_0E[H(w_i;\theta_0)]^{-1}\big(-E[H(w_i;\theta_0)]\big)E[H(w_i;\theta_0)]^{-1}R_0^\top\Big) = N\Big(0,\; -R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\Big)

This asymptotic variance cancels against the central term of the quadratic form, and hence we are looking at the squared norm of an \#r-dimensional standard normal vector:

LR \equiv 2\cdot n\cdot\big[\log L(\hat\theta) - \log L(\tilde\theta)\big] \xrightarrow{d} \chi^2(\#r)
