lecture1_ml_MLE
Universidad de Talca
2020
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix
2 Asymptotic Properties
Consistency
Asymptotic Normality
3 Estimation of Variance
4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Reading
Mandatory:
(Ruud) Chapters 14 and 15.
Buse, A. (1982). The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. The American Statistician, 36(3a), 153-157.
Suggested:
(Winkelmann & Boes) Chapters 2 and 3.
Goals
Motivating Example I
Let’s assume we weighed 1000 people from Talca.
[Figure: histogram of the 1000 weights; x-axis: weight in kilograms, y-axis: frequency.]
Motivating Example I
The goal of maximum likelihood is to find the optimal way to fit a distribution
to the data.
Remark I
Generally, we can write the probability or density function of yi, i = 1, ..., n, as f(yi; θ), where yi is the ith draw from the population and θ is the parameter of the distribution.
Remark II
We usually assume independent sampling, i.e., the ith draw from the population is independent of all other draws i′ ≠ i.
[Figure: four candidate density fits overlaid on the weight data; x-axis: weight in kg, y-axis: density.]
Motivating Example I
Normal Distribution
It seems that the normal distribution is the best option:
We expect most of the weights to be close to the mean.
We expect the weights to be relatively symmetrical around the mean.
OK..., but not every normal fits our data.
What mean, µ, and variance, σ², are the best “estimates”?
[Figure: normal density over the weight data; x-axis: weight in kg, y-axis: density.]
Motivating Example I
1 We observe some data.
2 We pick the distribution we think generated the data.
3 We find the estimator(s) of the distribution, θ̂, that make the sample we are observing most likely.
IOW, the problem consists of estimating an unknown parameter of a population when the population distribution is known up to that unknown parameter.
[Figure: fitted density over the observed weights; x-axis: weight in kilograms, y-axis: density.]
Motivating Example II
Example
A random sample of 100 trials was performed and 10 resulted in success.
What can be inferred about the unknown probability of success p0 ?
Note that we are observing the sample; somehow we know the distribution; and we are asking which value p̂ makes the sample we are observing most likely.
Motivating Example II
For any potential value p of the probability of success, the probability of y successes in n trials is given by:
\[ f(y; n, p) = \Pr(Y = y) = \binom{n}{y} p^y (1 - p)^{n-y}, \qquad \text{where } \binom{n}{y} = \frac{n!}{y!\,(n-y)!} \]
With y = 10 successes from n = 100 trials,
[Figure: the likelihood of p given y = 10 and n = 100, maximized near p = 0.1; x-axis: p, y-axis: likelihood.]
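As a quick numerical check (a sketch not taken from the slides; the grid and values below are illustrative choices), the binomial likelihood for y = 10 successes in n = 100 trials peaks at p̂ = y/n = 0.1:

# Sketch: evaluate the binomial likelihood on a grid of candidate p values
import numpy as np
from scipy.stats import binom

n, y = 100, 10
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = binom.pmf(y, n, p_grid)      # f(y; n, p) for each candidate p
p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)                              # approximately 0.1 = y/n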
Likelihood Function
The likelihood function says that, for any given sample y|X, the likelihood
of having obtained that particular sample depends on the parameter θ.
Whenever we can write down the joint probability function of the sample
we can in principle use ML estimation.
Log Likelihood Function
\[ f(y_i \mid x_i; \theta_0) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left( -\frac{(y_i - x_i'\beta_0)^2}{2\sigma_0^2} \right) = \phi(y_i - x_i'\beta_0,\ \sigma_0^2) \]
The joint pdf of the sample is:
\[ \prod_{i=1}^{n} f(y_i \mid x_i; \theta_0) = \left( 2\pi\sigma_0^2 \right)^{-n/2} \exp\left( -\frac{(y - X\beta_0)'(y - X\beta_0)}{2\sigma_0^2} \right) = \phi(y - X\beta_0,\ \sigma_0^2 \cdot I_n) \]
The parameter space is Θ = R^K × R_{++}, where K is the dimension of β and R_{++} is the set of positive real numbers, reflecting the a priori restriction that σ_0^2 > 0.
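Taking logs of this joint density gives the log-likelihood that is maximized in what follows (a standard step, written out here for completeness):
\[ \ln L(\theta; y \mid X) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \]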
Maximum Likelihood Estimator
The maximum likelihood estimator is defined as θ̂ ≡ arg max_{θ∈Θ} ln L(θ; y), where Θ denotes the parameter space in which the parameter vector θ lies. Usually Θ = R^K.
Maximum Likelihood Estimator: Maximization
\[ \max_{\theta \in \Theta}\ \ln L(\theta; y) \]
Remark:
By the nature of the objective function, the MLE is the estimator which
makes the observed data most likely to occur. In other words, the MLE
is the best “rationalization” of what we observed.
Population analogue
\[ E[\ln L(\theta; y \mid X)] \equiv \int \ln L(\theta; y \mid X)\, dF(y \mid X; \theta_0) \]
Assumption I: Distribution
The sample {y_i, x_i} is i.i.d. with true conditional density f(y_i | x_i; θ_0).
The last term is finite if E(x_i x_i') is. This implies that X has full column rank.
Identification
Now, we want to show that E[log a(w)] < log{E[a(w)]}, where a(w) ≡ f(y_i|x_i; θ)/f(y_i|x_i; θ_0) for θ ≠ θ_0. We use the strict version of Jensen's inequality, which states that if c(x) is a strictly concave function and x is a nonconstant random variable, then E[c(x)] < c[E(x)].
Proof.
Set c(x) = log(x); log(x) is strictly concave and a(w) is non-constant. Therefore E[log a(w)] < log{E[a(w)]}.
By the Law of Total Expectations,
\[ E[a(w)] = E\!\left[ \int_S \frac{f(y_i \mid x_i; \theta)}{f(y_i \mid x_i; \theta_0)}\, f(y_i \mid x_i; \theta_0)\, dy_i \right] = 1. \]
Combining the results: E[log a(w)] < log(1) = 0.
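To make explicit why this delivers identification (the conclusion this slide is building toward), note that with a(w) as above:
\[ E[\ln f(y_i \mid x_i; \theta)] - E[\ln f(y_i \mid x_i; \theta_0)] = E[\log a(w)] < 0 \quad \text{for all } \theta \neq \theta_0, \]
so θ_0 is the unique maximizer of the expected log-likelihood.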
Differentiability
Assumption IV: Integrability
The pdf f(y_i|x_i; θ) is twice continuously differentiable in θ for all θ ∈ Θ. Furthermore, the support S(θ) of f(y_i|x_i; θ) does not depend on θ, and differentiation and integration are interchangeable in the sense that
\[ \frac{\partial}{\partial\theta} \int_S dF(y_i \mid x_i; \theta) = \int_S \frac{\partial}{\partial\theta}\, dF(y_i \mid x_i; \theta), \qquad \frac{\partial^2}{\partial\theta\,\partial\theta'} \int_S dF(y_i \mid x_i; \theta) = \int_S \frac{\partial^2}{\partial\theta\,\partial\theta'}\, dF(y_i \mid x_i; \theta) \]
and
\[ \frac{\partial E[\ln f(y_i \mid x_i; \theta) \mid x_i]}{\partial\theta} = E\!\left[ \left. \frac{\partial \ln f(y_i \mid x_i; \theta)}{\partial\theta} \right| x_i \right], \qquad \frac{\partial^2 E[\ln f(y_i \mid x_i; \theta) \mid x_i]}{\partial\theta\,\partial\theta'} = E\!\left[ \left. \frac{\partial^2 \ln f(y_i \mid x_i; \theta)}{\partial\theta\,\partial\theta'} \right| x_i \right], \]
where all terms exist. In this case, we denote the support of F(y) simply by S.
The Score Function
Definition (Score Function)
The score function is defined as the vector of first partial derivatives of the
log-likelihood function with respect to the parameter vector θ:
\[ s(w; \theta) = \frac{\partial \ln f(y \mid X; \theta)}{\partial\theta} = \begin{pmatrix} \partial \ln f(y \mid X; \theta)/\partial\theta_1 \\ \partial \ln f(y \mid X; \theta)/\partial\theta_2 \\ \vdots \\ \partial \ln f(y \mid X; \theta)/\partial\theta_K \end{pmatrix}, \qquad s(w_i; \theta) = \frac{\partial \ln f(y_i \mid x_i; \theta)}{\partial\theta} \]
Because of the additivity of terms in the log-likelihood function, we can write:
\[ s(w; \theta) = \sum_{i=1}^{n} s(w_i; \theta) \]
Score Identity
\[ E[s(w; \theta_0)] = 0 \]
Proof.
Differentiating \(\int_S f(y_i \mid x_i; \theta)\, dy_i = 1\) with respect to θ, and multiplying and dividing by f(y_i|x_i; θ):
\[ 0 = \int_S \frac{f(y_i \mid x_i; \theta)}{f(y_i \mid x_i; \theta)} \frac{\partial}{\partial\theta} f(y_i \mid x_i; \theta)\, dy_i \]
\[ 0 = \int_S \frac{1}{f(y_i \mid x_i; \theta)} \frac{\partial f(y_i \mid x_i; \theta)}{\partial\theta}\, \underbrace{dF(y_i \mid x_i; \theta)}_{f(y_i \mid x_i; \theta)\, dy_i} \tag{4} \]
Now we interpret this integral equation as an expectation. Consider:
\[ \frac{\partial}{\partial\theta} \ln f(y_i \mid x_i; \theta) \equiv \frac{1}{f(y_i \mid x_i; \theta)} \frac{\partial}{\partial\theta} f(y_i \mid x_i; \theta), \qquad s(w_i; \theta) \equiv \frac{1}{f(y_i \mid x_i; \theta)} \frac{\partial}{\partial\theta} f(y_i \mid x_i; \theta) \tag{5} \]
so that \( s(w_i; \theta)\, f(y_i \mid x_i; \theta) \equiv \partial f(y_i \mid x_i; \theta)/\partial\theta \). Then, substituting into (4):
\[ \int_S s(w_i; \theta)\, dF(y_i \mid x_i; \theta) = 0 \]
This holds for any θ ∈ Θ, in particular for θ = θ_0. Setting θ = θ_0, we obtain:
\[ \int_S s(w_i; \theta_0)\, dF(y_i \mid x_i; \theta_0) = E[\, s(w_i; \theta_0) \mid x_i\,] = 0 \]
Then, by the Law of Total Expectations, we obtain the desired result.
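A small Monte Carlo sketch of the score identity (my own illustration, using a simple N(µ0, σ0²) model rather than the slides' regression example): at the true parameters the average score is approximately zero.

# Sketch: average score at the true parameters of a normal model
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0, n = 70.0, 10.0, 1_000_000
y = rng.normal(mu0, sigma0, size=n)

# Score of ln f(y; mu, sigma^2) evaluated at (mu0, sigma0^2)
s_mu = (y - mu0) / sigma0**2
s_sigma2 = -1.0 / (2 * sigma0**2) + (y - mu0) ** 2 / (2 * sigma0**4)
print(s_mu.mean(), s_sigma2.mean())   # both approximately 0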
What if the support depends on θ?
In this case the support is S(θ) = {y : A(θ) ≤ y ≤ B(θ)}. By definition:
\[ \int_{A(\theta)}^{B(\theta)} f(y \mid x; \theta)\, dy = 1 \]
Now, applying Leibniz's rule gives:
\[ \frac{\partial}{\partial\theta} \int_{A(\theta)}^{B(\theta)} f(y \mid x; \theta)\, dy = 0 \]
\[ \int_{A(\theta)}^{B(\theta)} \frac{\partial f(y \mid x; \theta)}{\partial\theta}\, dy + f(B(\theta) \mid \theta)\, \frac{\partial B(\theta)}{\partial\theta} - f(A(\theta) \mid \theta)\, \frac{\partial A(\theta)}{\partial\theta} = 0 \]
To interchange the operations of differentiation and integration we need the second and third terms to vanish. The necessary condition is that
\[ \lim_{y \to A(\theta)} f(y \mid x; \theta) = 0, \qquad \lim_{y \to B(\theta)} f(y \mid x; \theta) = 0. \]
Sufficient conditions are that the support does not depend on the parameter, which means that ∂A(θ)/∂θ = ∂B(θ)/∂θ = 0, or that the density is zero at the terminal points.
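A classic illustration (not from the slides, added for intuition) is the uniform distribution on [0, θ], whose support depends on θ; there the boundary term does not vanish and the score identity fails:
\[ f(y; \theta) = \frac{1}{\theta}\,\mathbf{1}\{0 \le y \le \theta\}, \qquad s(y; \theta) = \frac{\partial \ln f(y; \theta)}{\partial\theta} = -\frac{1}{\theta}, \qquad E[s(y; \theta)] = -\frac{1}{\theta} \neq 0. \]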
Hessian
The Hessian is the matrix of second derivatives of the log-likelihood: H(w_i; θ) ≡ ∂² ln f(y_i|x_i; θ)/∂θ∂θ', and H(w; θ) = Σ_i H(w_i; θ).
Remark
It is important to keep in mind that both the score and the Hessian depend on the sample and are therefore random variables (they differ in repeated samples).
Information Identity
Why is it useful?
• It can be used to assess whether the likelihood function is “well behaved” (identification).
• Important result: the information matrix is the inverse of the (asymptotic) variance of the maximum likelihood estimator.
• Cramér-Rao lower bound.
Information matrix equality
\[ -E[H(w_i; \theta_0)] = E[s(w_i; \theta_0)\, s(w_i; \theta_0)'] \equiv I(\theta_0) \]
Proof: (Homework)
Information Identity
Recall that, for the normal linear model with ε_i ≡ y_i − x_i'β:
\[ s(w_i; \theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_i \varepsilon_i \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} \varepsilon_i^2 \end{pmatrix} \]
\[ H(w_i; \theta) = \begin{pmatrix} -\frac{1}{\sigma^2} x_i x_i' & -\frac{1}{\sigma^4} x_i \varepsilon_i \\ -\frac{1}{\sigma^4} x_i' \varepsilon_i & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6} \varepsilon_i^2 \end{pmatrix} \]
\[ s(w_i; \theta)\, s(w_i; \theta)' = \begin{pmatrix} \frac{1}{\sigma^4} x_i x_i' \varepsilon_i^2 & -\frac{1}{2\sigma^4} x_i \varepsilon_i + \frac{1}{2\sigma^6} x_i \varepsilon_i^3 \\ -\frac{1}{2\sigma^4} x_i' \varepsilon_i + \frac{1}{2\sigma^6} x_i' \varepsilon_i^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} \varepsilon_i^2 + \frac{1}{4\sigma^8} \varepsilon_i^4 \end{pmatrix} \]
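A simulation sketch of the information matrix equality for this model (illustrative only; the scalar regressor, seed, and parameter values below are assumptions of the sketch, not part of the slides):

# Sketch: compare -E[H] with E[s s'] for the normal linear model
import numpy as np

rng = np.random.default_rng(1)
n, beta0, sigma2 = 500_000, 2.0, 4.0
x = rng.normal(size=n)
eps = rng.normal(scale=np.sqrt(sigma2), size=n)

# Per-observation score s(w_i; theta0) from the slide
s = np.column_stack([x * eps / sigma2,
                     -0.5 / sigma2 + eps**2 / (2 * sigma2**2)])

# Average of the per-observation Hessians H(w_i; theta0)
H_avg = np.array([[-(x**2).mean() / sigma2, -(x * eps).mean() / sigma2**2],
                  [-(x * eps).mean() / sigma2**2,
                   0.5 / sigma2**2 - (eps**2).mean() / sigma2**3]])

print(-H_avg)            # approximately equal to ...
print(s.T @ s / n)       # ... the average outer product of the scores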
Some Ideas
Question
How can we proceed?
Some Ideas
Using some LLN we know that:
\[ \frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta) \xrightarrow{p} E[\log f(y_i \mid x_i; \theta)] \tag{6} \]
for any fixed parameter value θ. That is, the sample average log-likelihood function converges to the expected log-likelihood for any value of θ. Recall that:
\[ \hat\theta_n \equiv \arg\max_{\theta\in\Theta} \frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta), \qquad \theta_0 \equiv \arg\max_{\theta\in\Theta} E[\log f(y_i \mid x_i; \theta)] \]
In words:
If the sample average of the log likelihood function is close to the true expected
value of the log likelihood function, then we would expect that θbn will be close
to the maximum of the expected likelihood (as n increases without bound)
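A minimal sketch of this idea (assuming, purely for illustration, y_i ~ N(5, 1), so the MLE of the mean is the sample average): as n grows, the maximizer of the sample-average log-likelihood drifts toward the maximizer of the expected log-likelihood, θ_0 = 5.

# Sketch: MLE of a normal mean approaching theta0 as n grows
import numpy as np

rng = np.random.default_rng(2)
for n in (10, 100, 10_000, 1_000_000):
    y = rng.normal(5.0, 1.0, size=n)
    print(n, y.mean())        # theta_hat_n, approaching 5.0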
The problem is that the argument of arg max_{θ∈Θ}(·) is a function of θ, not a real vector:
• The concept of convergence in probability was defined for sequences of random variables.
Therefore, we need to define what we mean by the probability limit of a sequence of random functions, as opposed to a sequence of random variables:
Example
In ML estimation, the log-likelihood is a function of the sample data (a
random vector that depends on ω) and of a parameter θ. By increasing the
sample size, we obtain a sequence of log-likelihoods that depend on ω and θ.
Consistency
How do we measure the distance between two functions over a set containing an infinite number of possible comparisons at different values of θ?
IOW, instead of requiring that the distance |Q_n(θ) − Q_0(θ)| converge in probability to 0 for each θ, we require convergence of sup_{θ∈Θ} |Q_n(θ) − Q_0(θ)|, which is the maximum distance that can be found by ranging over the parameter space.
Uniform Convergence in Probability
[Figure: Q_n(θ) contained in an ε-band Q_0(θ) ± ε around the limit function Q_0(θ), which is maximized at θ_0.]
Uniform Convergence in Probability
\[ \sup_{\theta\in\Theta} \| Q_n(\theta) - Q_0(\theta) \| \xrightarrow{p} 0, \]
where ‖Q_n(θ) − Q_0(θ)‖ denotes the Euclidean norm of the vector Q_n(θ) − Q_0(θ). By taking the supremum over θ we obtain another random quantity that does not depend on θ.
Pointwise Convergence in Probability
Intuition:
If Q_n(θ) converges uniformly to Q_0(θ), then the characteristics of Q_n(θ) will be close to the characteristics of Q_0(θ) as n → ∞. One particular characteristic is the point θ_0 where Q_0(θ) is uniquely maximized. Then, it is expected that the maximizer of Q_n(θ), θ̂, will be close to the maximizer of Q_0(θ).
Consistency
Asymptotic
Theorem (Asymptotic Normality of Conditional ML)
Let w_i ≡ (y_i, x_i')' be i.i.d. Suppose the conditions of either Theorem 18 or 19 are satisfied, so that θ̂_n →p θ_0. Suppose, in addition, that:
1 θ_0 is in the interior of Θ,
2 f(y_i|x_i; θ) is twice continuously differentiable in θ for all (y_i, x_i),
3 E[s(w_i; θ_0)] = 0 and −E[H(w_i; θ_0)] = E[s(w_i; θ_0) s(w_i; θ_0)'],
4 (local dominance condition on the Hessian) for some neighborhood N of θ_0,
\[ E\Big[ \sup_{\theta\in N} \| H(w_i; \theta) \| \Big] < \infty, \]
so that for any consistent estimator θ̃, (1/n) Σ_{i=1}^n H(w_i; θ̃) →p E[H(w_i; θ_0)],
5 E[H(w_i; θ_0)] is nonsingular.
Then:
\[ \sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N(0, V), \qquad V = -\{E[H(w_i; \theta_0)]\}^{-1} = \{E[s(w_i; \theta_0)\, s(w_i; \theta_0)']\}^{-1} \]
Asymptotic
\[ \frac{\partial \log L(\hat\theta)}{\partial\theta} = s(w; \hat\theta) = 0 \]
We need to know about the behavior of the gradient around the true parameter. Expand this set of equations in a Taylor series around the true parameter θ_0. We will use the mean value theorem to truncate the Taylor series at the second term,
\[ \underbrace{\frac{\partial \log L(\hat\theta)}{\partial\theta}}_{(K\times 1)} = \underbrace{\frac{\partial \log L(\theta_0)}{\partial\theta}}_{(K\times 1)} + \underbrace{\frac{\partial^2 \log L(\bar\theta)}{\partial\theta\,\partial\theta'}}_{(K\times K)} \underbrace{(\hat\theta - \theta_0)}_{(K\times 1)} \]
\[ 0 = \frac{1}{n}\sum_{i=1}^n s(w_i; \theta_0) + \Big[ \frac{1}{n}\sum_{i=1}^n H(w_i; \bar\theta) \Big] (\hat\theta - \theta_0) \]
Solving for θ̂ − θ_0 and scaling by √n:
\[ \sqrt{n}\,(\hat\theta - \theta_0) = -\Big[ \frac{1}{n}\sum_{i=1}^n H(w_i; \bar\theta) \Big]^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^n s(w_i; \theta_0) \]
By the LLN the bracketed term converges in probability to E[H(w_i; θ_0)], and by the CLT the second term converges in distribution to N(0, E[s(w_i; θ_0) s(w_i; θ_0)']). Combining these results with the information matrix equality:
\[ \sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N\big(0,\ -E[H(w_i; \theta_0)]^{-1}\big) = N\big(0,\ [I(\theta_0)]^{-1}\big) \]
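A simulation sketch of this result (assuming, for illustration, y_i ~ N(0, 4), so the MLE of the mean is ȳ and I(µ_0) = 1/σ²): the standardized estimator is approximately standard normal.

# Sketch: sqrt(n)(mu_hat - mu0) scaled by I(mu0)^{1/2} is approximately N(0, 1)
import numpy as np

rng = np.random.default_rng(3)
n, reps, sigma2 = 200, 5000, 4.0
draws = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
z = np.sqrt(n) * draws.mean(axis=1) / np.sqrt(sigma2)
print(z.mean(), z.var())                 # approximately 0 and 1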
Variance Estimation
For large but finite samples, we can therefore write the approximate distribution of θ̂_n as
\[ \hat\theta \overset{a}{\sim} N\big( \theta_0,\ n^{-1}[I(\theta_0)]^{-1} \big) \]
We have three potential estimators of I(θ_0), and hence of the variance V = [I(θ_0)]^{-1}:
The empirical mean of minus the Hessian,
\[ \hat V_1 = \Big( \frac{1}{n}\sum_{i=1}^n -H(w_i, \hat\theta) \Big)^{-1} \]
the empirical mean of the outer product of the score,
\[ \hat V_2 = \Big( \frac{1}{n}\sum_{i=1}^n s(w_i, \hat\theta)\, s(w_i, \hat\theta)' \Big)^{-1} \]
and the expected Hessian evaluated at θ̂,
\[ \hat V_3 = \Big( \frac{1}{n}\sum_{i=1}^n -E\big[H(w_i, \hat\theta)\big] \Big)^{-1} \]
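A sketch comparing V̂_1 and V̂_2 in a case where they can also be computed by hand (the Bernoulli model below is an assumption of the sketch, not the slides' example):

# Sketch: Hessian-based and outer-product estimators of I(p)^{-1} for Bernoulli data
import numpy as np

rng = np.random.default_rng(4)
y = rng.binomial(1, 0.3, size=100_000)
p_hat = y.mean()                                   # the ML estimator

s = y / p_hat - (1 - y) / (1 - p_hat)              # score per observation
minus_H = y / p_hat**2 + (1 - y) / (1 - p_hat)**2  # minus the Hessian per observation

V1 = 1.0 / minus_H.mean()                          # empirical mean of -H
V2 = 1.0 / (s**2).mean()                           # outer product of the score
print(V1, V2, p_hat * (1 - p_hat))                 # all equal I(p_hat)^{-1}; divide by n for Var(p_hat)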
Proof of Consistency
Hypothesis Testing
\[ H_0: \beta_k = \beta^* \]
\[ z = \frac{\hat\beta_k - \beta^*}{\hat\sigma_{\hat\beta_k}} \]
where σ̂_{β̂_k} is the estimated standard error of β̂_k.
Under the assumptions justifying ML, if H0 is true, then z is distributed
approximately normally with mean of 0 and variance of 1 for large samples.
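A minimal sketch of this z test (the estimate and standard error below are hypothetical numbers, purely for illustration):

# Sketch: z test of H0: beta_k = beta_star given an ML estimate and its standard error
from scipy.stats import norm

beta_hat, se, beta_star = 0.84, 0.31, 0.0          # illustrative values only
z = (beta_hat - beta_star) / se
p_value = 2 * (1 - norm.cdf(abs(z)))
print(z, p_value)                                   # reject at 5% if |z| > 1.96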
Testing
[Figure: standard normal density f(z) with rejection regions for H_0 in both tails, beyond −1.96 and 1.96.]
The Trinity
For more complex hypotheses we can use the Wald, likelihood ratio (LR), or Lagrange multiplier (LM) test. These tests can be thought of as a comparison between the estimates obtained after the constraints implied by the hypothesis have been imposed and the estimates obtained without the constraints.
The Trinity: LR Test
The log-likelihood function log L(β) is the red solid line in the figure.
β̂_U: unconstrained estimator.
The null H_0: β = β^* imposes the constraint β = β^*, so that the constrained estimate is β̂_C = β^*.
Unless β̂_U = β^*, ln L(β̂_C) ≤ ln L(β̂_U).
If the constraint significantly reduces the likelihood, then the null hypothesis is rejected.
[Figure: log-likelihood curve with log L(β̂_U) at its maximum and the lower value log L(β̂_C) at β̂_C = β^*.]
The Trinity: Wald Test
The Wald test estimates the model without constraints, and assesses the constraint by considering two things:
1. It measures the distance β̂_U − β̂_C = β̂_U − β^*. The larger the distance, the less likely it is that the constraint is true.
2. The distance is weighted by the curvature of the log-likelihood function, ∂² log L(β)/∂β². The larger the second derivative, the faster the curve is changing.
The first log-likelihood function (dashed line) is nearly flat, so the second derivative evaluated at β̂_U is relatively small. When the second derivative is small, the distance between β̂_U and β̂_C is minor relative to the sampling variation.
The second log-likelihood function has a larger second derivative. With a larger second derivative, the same distance between β̂_U and β̂_C might be significant.
[Figure: two log-likelihood curves, one nearly flat and one sharply curved, with the same distance between β̂_C = β^* and β̂_U.]
The Trinity: LM (Score) Test
The LM (score) test works with the constrained estimate only: it evaluates the score of the unrestricted log-likelihood at β̂_C and rejects H_0 if this score is far from zero.
The three tests are asymptotically equivalent, i.e., they are similar only when n → ∞. In small samples this is not necessarily true.
Test Statistics
The null hypothesis imposes r restrictions on θ:
\[ \underbrace{r(\theta_0)}_{r\times 1} = \underbrace{0}_{r\times 1} \]
Let θ̂ denote the unrestricted and θ̃ the restricted ML estimator. The likelihood ratio statistic is
\[ LR = 2 \cdot n \cdot \Big( \underbrace{\frac{1}{n}\sum_{i=1}^n \ln f(y_i \mid x_i; \hat\theta)}_{1\times 1} - \underbrace{\frac{1}{n}\sum_{i=1}^n \ln f(y_i \mid x_i; \tilde\theta)}_{1\times 1} \Big) = 2\big[ \log L(\hat\theta) - \log L(\tilde\theta) \big] \]
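A sketch of the LR statistic in a simple case (Bernoulli data and H_0: p = 0.5 are assumptions of the sketch, chosen only so the restricted and unrestricted log-likelihoods are easy to write down):

# Sketch: LR = 2*(lnL(p_hat) - lnL(0.5)) compared with a chi-squared(1) critical value
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.55, size=400)
p_hat, p_0 = y.mean(), 0.5

def loglik(p):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

LR = 2 * (loglik(p_hat) - loglik(p_0))
print(LR, chi2.ppf(0.95, df=1))                    # reject H0 if LR exceeds 3.84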
Test Statistics
The Wald statistic is
\[ W = n \cdot r(\hat\theta)' \big[ R(\hat\theta)\, \hat V\, R(\hat\theta)' \big]^{-1} r(\hat\theta), \]
where:
\[ \underbrace{R(\theta_0)}_{r\times K} = \frac{\partial r(\theta_0)}{\partial\theta'} \quad \text{(Jacobian of } r(\theta_0)\text{)}, \qquad \underbrace{\hat V}_{K\times K} = \hat I^{-1} = \Big( -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f(y_i \mid x_i; \hat\theta)}{\partial\theta\,\partial\theta'} \Big)^{-1} \]
Proof: Wald Statistic.
Write W as:
\[ W = c_n' Z_n^{-1} c_n, \qquad c_n \equiv \sqrt{n}\, r(\hat\theta), \qquad Z_n \equiv R(\hat\theta)\, \hat V\, R(\hat\theta)' \tag{8} \]
First, we will show that c_n →d N(0, R(θ_0) V R(θ_0)'). Applying the MVT (21) (to truncate the Taylor series) to r(θ̂) around θ_0:
\[ r(\hat\theta) = \underbrace{r(\theta_0)}_{=0 \text{ under } H_0} + R(\bar\theta)(\hat\theta - \theta_0) \]
\[ \underbrace{\sqrt{n}\, r(\hat\theta)}_{r\times 1} = \underbrace{R(\bar\theta)}_{r\times K}\, \underbrace{\sqrt{n}(\hat\theta - \theta_0)}_{K\times 1} \qquad \text{multiplying by } \sqrt{n} \]
\[ = R(\bar\theta)\sqrt{n}(\hat\theta - \theta_0) + R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) - R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) = \big[ R(\bar\theta) - R(\theta_0) \big]\sqrt{n}(\hat\theta - \theta_0) + R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) \]
Then:
\[ \sqrt{n}\, r(\hat\theta) \xrightarrow{d} N\Big( 0,\ R(\theta_0)\, E[H(w_i; \theta_0)]^{-1} E[s(w_i; \theta_0) s(w_i; \theta_0)']\, E[H(w_i; \theta_0)]^{-1} R(\theta_0)' \Big) = N\Big( 0,\ R(\theta_0) \underbrace{\big\{ -E[H(w_i; \theta_0)] \big\}^{-1}}_{V} R(\theta_0)' \Big) \]
Proof: Wald Statistic.
It follows that:
\[ W = c_n' Z_n^{-1} c_n, \quad \text{and since } \sqrt{n}\, r(\hat\theta) \xrightarrow{d} N\big(0, R(\theta_0) V R(\theta_0)'\big), \quad \sqrt{n}\, r(\hat\theta)' \big[ R(\theta_0)\, V\, R(\theta_0)' \big]^{-1} \sqrt{n}\, r(\hat\theta) \xrightarrow{d} \chi^2(\#r) \]
If we have consistent estimators of R(θ_0) and V, then limit results for continuous functions imply that:
\[ \sqrt{n}\, r(\hat\theta)' \big[ R(\hat\theta)\, \hat V\, R(\hat\theta)' \big]^{-1} \sqrt{n}\, r(\hat\theta) \xrightarrow{d} \chi^2(\#r) \]
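A sketch of the corresponding Wald computation in the same simple Bernoulli setting used above (again an assumption of the sketch, with r(p) = p − 0.5 so that R(p) = 1):

# Sketch: W = n * r(p_hat)' [R V R']^{-1} r(p_hat) for H0: p = 0.5
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
y = rng.binomial(1, 0.55, size=400)
p_hat = y.mean()

r, R = p_hat - 0.5, 1.0
V = p_hat * (1 - p_hat)                 # consistent estimate of -E[H]^{-1}
W = len(y) * r * (1.0 / (R * V * R)) * r
print(W, chi2.ppf(0.95, df=1))          # compare with the chi-squared(1) critical value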
Preliminaries to the next two statistics
The restricted estimator θ̃ maximizes the log-likelihood subject to r(θ) = 0; let λ_n denote the vector of Lagrange multipliers. The FOC are:
\[ \sqrt{n}\, s_n(\tilde\theta) + \sqrt{n}\, R(\tilde\theta)' \lambda_n = 0, \qquad \sqrt{n}\, r(\tilde\theta) = 0 \tag{10} \]
where s_n(θ) ≡ n^{-1} ∂ log L(θ)/∂θ. A mean value expansion of the score around θ_0 gives:
\[ \sqrt{n}\, s_n(\tilde\theta) = \underbrace{\sqrt{n}\, s_n(\theta_0)}_{\xrightarrow{d}\ N(0,\ -E[H(w_i;\theta_0)])} + \underbrace{\frac{1}{n}\frac{\partial^2 \log L(\theta_0)}{\partial\theta\,\partial\theta'}}_{\xrightarrow{p}\ E[H(w_i;\theta_0)]} \sqrt{n}\,(\tilde\theta - \theta_0) + o_p(1) \]
Then
\[ R(\tilde\theta)' \sqrt{n}\,\lambda_n = -\underbrace{\sqrt{n}\, s_n(\tilde\theta)}_{O_p(1)} \ \Longrightarrow\ R(\tilde\theta)' \sqrt{n}\,\lambda_n = O_p(1) \]
Since R(θ̃) →p R_0, we obtain:
\[ R(\tilde\theta)' \sqrt{n}\,\lambda_n = R_0' \sqrt{n}\,\lambda_n + \big[ R(\tilde\theta) - R_0 \big]' \sqrt{n}\,\lambda_n = R_0' \sqrt{n}\,\lambda_n + o_p(1) \]
Collecting the FOC and the expansion in matrix form:
\[ \begin{pmatrix} \underset{(K\times K)}{E[H(w_i;\theta_0)]} & \underset{(K\times r)}{R_0'} \\ \underset{(r\times K)}{R_0} & \underset{(r\times r)}{0} \end{pmatrix} \begin{pmatrix} \underset{(K\times 1)}{\sqrt{n}\,(\tilde\theta - \theta_0)} \\ \underset{(r\times 1)}{\sqrt{n}\,\lambda_n} \end{pmatrix} = \begin{pmatrix} \underset{(K\times 1)}{-\sqrt{n}\, s_n(\theta_0)} \\ 0 \end{pmatrix} + o_p(1) \]
Preliminaries to the next two statistics
Using the partitioned inverse formula
\[ \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}A_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \\ -(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}A_{21}A_{11}^{-1} & (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \end{pmatrix} \]
we can solve the system. Then:
\[ \begin{pmatrix} \underset{(K\times 1)}{\sqrt{n}\,(\tilde\theta - \theta_0)} \\ \underset{(r\times 1)}{\sqrt{n}\,\lambda_n} \end{pmatrix} = \begin{pmatrix} -E[H(w_i;\theta_0)]^{-1} + E[H(w_i;\theta_0)]^{-1} R_0' \big( R_0 E[H(w_i;\theta_0)]^{-1} R_0' \big)^{-1} R_0 E[H(w_i;\theta_0)]^{-1} \\ -\big( R_0 E[H(w_i;\theta_0)]^{-1} R_0' \big)^{-1} R_0 E[H(w_i;\theta_0)]^{-1} \end{pmatrix} \sqrt{n}\, s_n(\theta_0) + o_p(1) \]
Proof LR.
By a second-order Taylor expansion:
\[ \log L(\tilde\theta) = \log L(\hat\theta) + \left( \frac{\partial \log L(\hat\theta)}{\partial\theta} \right)'(\tilde\theta - \hat\theta) + \frac{1}{2}(\tilde\theta - \hat\theta)' \frac{\partial^2 \log L(\bar\theta)}{\partial\theta\,\partial\theta'} (\tilde\theta - \hat\theta) \]
where θ̄ = αθ̃ + (1 − α)θ̂ for some α ∈ [0, 1]. Recall that ∂ log L(θ̂)/∂θ = 0 and n^{-1} ∂² log L(θ̄)/∂θ∂θ' →p E[H(w_i, θ_0)]. It follows that:
\[ \log L(\tilde\theta) - \log L(\hat\theta) = \frac{1}{2}(\tilde\theta - \hat\theta)' \frac{\partial^2 \log L(\bar\theta)}{\partial\theta\,\partial\theta'} (\tilde\theta - \hat\theta) \]
\[ 2\big[ \log L(\tilde\theta) - \log L(\hat\theta) \big] = (\tilde\theta - \hat\theta)' \frac{\partial^2 \log L(\bar\theta)}{\partial\theta\,\partial\theta'} (\tilde\theta - \hat\theta) = \sqrt{n}\,(\tilde\theta - \hat\theta)' \Big[ n^{-1} \frac{\partial^2 \log L(\bar\theta)}{\partial\theta\,\partial\theta'} \Big] \sqrt{n}\,(\tilde\theta - \hat\theta) \]
Proof LR.
Adding and subtracting \(\sqrt{n}\,(\tilde\theta - \hat\theta)' \big[ n^{-1}\partial^2 \log L(\theta_0)/\partial\theta\,\partial\theta' \big] \sqrt{n}\,(\tilde\theta - \hat\theta)\):
\[ 2\big[ \log L(\tilde\theta) - \log L(\hat\theta) \big] = \sqrt{n}\,(\tilde\theta - \hat\theta)' \Big[ n^{-1}\frac{\partial^2 \log L(\theta_0)}{\partial\theta\,\partial\theta'} \Big] \sqrt{n}\,(\tilde\theta - \hat\theta) + \underbrace{\sqrt{n}\,(\tilde\theta - \hat\theta)' \Big[ n^{-1}\frac{\partial^2 \log L(\bar\theta)}{\partial\theta\,\partial\theta'} - n^{-1}\frac{\partial^2 \log L(\theta_0)}{\partial\theta\,\partial\theta'} \Big] \sqrt{n}\,(\tilde\theta - \hat\theta)}_{o_p(1) = O_p(1)\, o_p(1)\, O_p(1)} \]
\[ = \sqrt{n}\,(\tilde\theta - \hat\theta)' \Big[ n^{-1}\frac{\partial^2 \log L(\theta_0)}{\partial\theta\,\partial\theta'} \Big] \sqrt{n}\,(\tilde\theta - \hat\theta) + o_p(1) = \sqrt{n}\,(\tilde\theta - \hat\theta)'\, E[H(w_i, \theta_0)]\, \sqrt{n}\,(\tilde\theta - \hat\theta) + o_p(1) \]
Next,
\[ \sqrt{n}\,(\tilde\theta - \hat\theta) = \sqrt{n}\,(\tilde\theta - \theta_0) - \sqrt{n}\,(\hat\theta - \theta_0) \]
Using the expression for \(\sqrt{n}(\tilde\theta - \theta_0)\) from the preliminaries and \(\sqrt{n}(\hat\theta - \theta_0) = -E[H(w_i;\theta_0)]^{-1} \sqrt{n}\, s_n(\theta_0) + o_p(1)\):
\[ \sqrt{n}\,(\tilde\theta - \hat\theta) = \Big( -E[H(w_i;\theta_0)]^{-1} + E[H(w_i;\theta_0)]^{-1} R_0' \big( R_0 E[H(w_i;\theta_0)]^{-1} R_0' \big)^{-1} R_0 E[H(w_i;\theta_0)]^{-1} + E[H(w_i;\theta_0)]^{-1} \Big) \sqrt{n}\, s_n(\theta_0) + o_p(1) \]
\[ = E[H(w_i;\theta_0)]^{-1} R_0' \big( R_0 E[H(w_i;\theta_0)]^{-1} R_0' \big)^{-1} R_0 E[H(w_i;\theta_0)]^{-1}\, \sqrt{n}\, s_n(\theta_0) + o_p(1) \]
Then:
\[ 2\big[ \log L(\hat\theta) - \log L(\tilde\theta) \big] = -\sqrt{n}\,(\tilde\theta - \hat\theta)'\, E[H(w_i, \theta_0)]\, \sqrt{n}\,(\tilde\theta - \hat\theta) + o_p(1) \]
\[ = -\underbrace{\sqrt{n}\, s_n(\theta_0)'}_{\xrightarrow{d}\ N(0,\ -E[H(w_i;\theta_0)])'} E[H(w_i;\theta_0)]^{-1} R_0' \big( R_0 E[H(w_i;\theta_0)]^{-1} R_0' \big)^{-1} R_0 E[H(w_i;\theta_0)]^{-1}\, \sqrt{n}\, s_n(\theta_0) + o_p(1) \]
Proof LR.
This asymptotic variance cancels against the central term of the quadratic form, and hence we are looking at the squared norm of an #r-dimensional standard normal vector:
\[ LR \equiv 2\big[ \log L(\hat\theta) - \log L(\tilde\theta) \big] \xrightarrow{d} \chi^2(\#r) \]