lecture1_ml_MLE

The lecture focuses on the Maximum Likelihood Estimator (MLE), covering its definition, properties, and applications in statistical estimation and testing. Key topics include the derivation of asymptotic properties, the likelihood function, and the importance of identification in ensuring the parameters are uniquely determined by the data. The lecture also emphasizes the logic behind MLE and its role in fitting distributions to observed data.

Lecture 1: Maximum Likelihood Estimator

Professor: Mauricio Sarrias

Universidad de Talca

2020
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Reading

Reading (Mandatory):
(Ruud)- Chapters 14 and 15.
Buse, A. (1982). The likelihood ratio, Wald, and Lagrange multiplier
tests: An expository note. The American Statistician, 36(3a), 153-157.
Suggested:
(Winkelmann & Boes)- Chapters 2 and 3
Goals

Understand the logic behind the Maximum Likelihood Estimator.


Derive the asymptotic properties of the MLE.
Understand and derive the basic test for the MLE.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Motivating Example I
Let’s assume we weighed 1000 people from Talca

[Figure: histogram of the 1000 observed weights — frequency on the vertical axis, weight in kilograms (0 to 140) on the horizontal axis.]
Motivating Example I

The goal of maximum likelihood is to find the optimal way to fit a distribution
to the data.

Remark I
Generally, we can write the probability or density function of y_i, i = 1, ..., n, as f(y_i; θ), where y_i is the ith draw from the population and θ is the parameter of the distribution.

Remark II
We usually assume independent sampling, i.e., the ith draw from the population is independent of all other draws i′ ≠ i.

So the question is:

Which distribution best fits the previous weight data?
Motivating Example I
[Figure: four candidate densities fitted to the weight data (in kg) — a Normal, a Chi-squared, an Exponential, and a Gamma distribution.]
Motivating Example I

It seems that the normal distribution is the best option.
We expect most of the weights to be close to the mean.
We expect the weights to be relatively symmetrical around the mean.
Ok ..., but not every normal fits our data.
What mean, µ, and variance, σ², are the best “estimates”?

[Figure: a normal density overlaid on the weight data in kg.]
Motivating Example I

Maximum Likelihood Principle
1 We observe some data.
2 We pick the distribution we think generated the data.
3 We find the estimator(s) of the distribution, \hat\theta, that make the sample we are observing most likely.

IOW, the problem consists of estimating an unknown parameter of a population when the population distribution is known (up to the unknown parameter).

[Figure: a fitted normal density over the weight data in kilograms.]
Motivating Example II

Example
A random sample of 100 trials was performed and 10 resulted in success.
What can be inferred about the unknown probability of success p0 ?

Note that we are observing the sample; somehow we know the distribution; and we are asking what is the \hat{p} that makes the sample we are observing most likely.
Motivating Example II

For any potential value of p for the probability of success, the probability of y successes from n trials is given by:

f(y; n, p) = \Pr(Y = y) = \binom{n}{y} p^y (1 - p)^{n - y}

where

\binom{n}{y} = \frac{n!}{y!(n - y)!}

With y = 10 successes from n = 100 trials,

L(p) = \Pr(Y = 10) = \frac{100!}{90!\,10!}\, p^{10} (1 - p)^{90} = 1.731 \times 10^{13} \times p^{10} (1 - p)^{90}
Motivating Example II

[Figure: the likelihood L(p) plotted against p over the range 0.05 to 0.25; it peaks at p = 0.10.]
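A minimal numerical sketch of this binomial example (not part of the original slides): evaluating L(p) on a grid and locating the value of p that makes the observed sample most likely. It assumes NumPy/SciPy are available; the grid and seed choices are arbitrary.

```python
# Sketch: evaluate L(p) = C(100,10) p^10 (1-p)^90 on a grid and find its maximizer.
import numpy as np
from scipy.special import comb

n, y = 100, 10
p_grid = np.linspace(0.01, 0.99, 981)
L = comb(n, y) * p_grid**y * (1 - p_grid)**(n - y)   # likelihood L(p)

p_hat = p_grid[np.argmax(L)]
print(f"grid maximizer: {p_hat:.3f}, analytic MLE y/n = {y / n:.3f}")
# Both are 0.10: the sample proportion maximizes the binomial likelihood.
```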
Likelihood Function

The likelihood function, denoted by a capital L, is:

L(\theta; y|X) = \prod_{i=1}^n L(\theta; y_i|x_i) = \prod_{i=1}^n f(y_i|x_i; \theta)

where y = (y_1, ..., y_n).


L(θ; yi |xi ) is the likelihood contribution of the i-th observation,
L(θ, y|X) is the likelihood function of the whole sample.

The likelihood function says that, for any given sample y|X, the likelihood
of having obtained that particular sample depends on the parameter θ.
Whenever we can write down the joint probability function of the sample
we can in principle use ML estimation.
Log Likelihood Function

The log-likelihood function is:


\ln L(\theta; y|X) = \ln L(\theta) = \ln\Bigg(\underbrace{\prod_{i=1}^n f(y_i|x_i; \theta)}_{f(y|X;\theta)}\Bigg) = \sum_{i=1}^n \ln f(y_i|x_i; \theta)

The log-likelihood function is a monotonically increasing function of L(\theta; y|X):
I Any maximizing value \hat\theta of \ln L(\theta; y|X) must also maximize L(\theta; y|X).
Taking logarithms converts products into sums:
I It allows some simplification in the numerical determination of the MLE.
I Likelihood values are often extremely small (but can also be extremely large), which makes numerical optimization of the likelihood itself highly problematic.
I It simplifies the study of the properties of the estimator.
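To illustrate the numerical point above, here is a small sketch (not from the slides, assuming an i.i.d. standard normal sample and SciPy's norm): the raw likelihood underflows while the log-likelihood stays well behaved.

```python
# Sketch: the product of densities of even a moderate sample underflows to 0.0 in
# double precision, while the sum of log-densities remains finite and usable.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=2000)

likelihood = np.prod(norm.pdf(y, loc=0.0, scale=1.0))        # product of densities
log_likelihood = np.sum(norm.logpdf(y, loc=0.0, scale=1.0))  # sum of log-densities

print(likelihood)       # 0.0 (numerical underflow)
print(log_likelihood)   # a finite number, roughly -0.5*n*log(2*pi) - 0.5*sum(y**2)
```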
Example

Example (Binomial Example)


Let \{Y_n\} be a random sample of a binomial random variable with parameters (n, p), where n is assumed to be known and p unknown. The likelihood function for individual i is given by:

L_i(p; y_i) = f(y_i; p) = \binom{n}{y_i} p^{y_i} (1 - p)^{n - y_i}

Since the sample is iid, the likelihood function is:

L(p; y) = f(y_1, y_2, ..., y_n; p) = \prod_{i=1}^n f(y_i; p) = \prod_{i=1}^n \binom{n}{y_i} p^{y_i} (1 - p)^{n - y_i}
Example

Example (Binomial Example)


Taking the log we get the log-likelihood function:

\ln L(p; y) = \ln\left(\prod_{i=1}^n f(y_i; p)\right)
= \ln\left(\prod_{i=1}^n \binom{n}{y_i} p^{y_i} (1 - p)^{n - y_i}\right)
= \ln\left[\left(\prod_{i=1}^n \binom{n}{y_i}\right) p^{\sum_{i=1}^n y_i} (1 - p)^{\sum_{i=1}^n (n - y_i)}\right]
= \sum_{i=1}^n \ln\binom{n}{y_i} + \left(\sum_{i=1}^n y_i\right) \ln p + \left(n^2 - \sum_{i=1}^n y_i\right) \ln(1 - p)
Example: Linear Regression

Example (Linear Regression)


Consider that \{y_i, x_i\} is i.i.d. and y_i = x_i^\top\beta_0 + \varepsilon_i, where \varepsilon_i|x_i \sim N(0, \sigma_0^2). So, with \theta = (\beta^\top, \sigma^2)^\top and w_i = (y_i, x_i^\top)^\top, the conditional pdf is

f(y_i|x_i; \theta_0) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y_i - x_i^\top\beta_0)^2}{2\sigma_0^2}\right) = \phi(y_i - x_i^\top\beta_0, \sigma_0^2)

The joint pdf of the sample is:

\prod_{i=1}^n f(y_i|x_i; \theta_0) = \left(2\pi\sigma_0^2\right)^{-n/2} \exp\left(-\frac{(y - X\beta_0)^\top(y - X\beta_0)}{2\sigma_0^2}\right) = \phi(y - X\beta_0, \sigma_0^2 \cdot I_n)

The parameter space \Theta is \mathbb{R}^K \times \mathbb{R}_{++}, where K is the dimension of \beta and \mathbb{R}_{++} is the set of positive real numbers, reflecting the a priori restriction that \sigma_0^2 > 0.
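As a sketch of this example (not from the slides; the simulated design and the "true" values beta0 = (1, 2), sigma0^2 = 1 are purely illustrative), the average Gaussian log-likelihood can be evaluated directly at any candidate (β, σ²):

```python
# Sketch: evaluate (1/n) * sum_i log phi(y_i - x_i' beta, sigma2) at candidate parameters.
import numpy as np

def avg_loglik(beta, sigma2, y, X):
    """Average conditional log-likelihood of y_i | x_i ~ N(x_i' beta, sigma2)."""
    resid = y - X @ beta
    return np.mean(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0, sigma0_sq = np.array([1.0, 2.0]), 1.0
y = X @ beta0 + rng.normal(scale=np.sqrt(sigma0_sq), size=n)

print(avg_loglik(beta0, sigma0_sq, y, X))            # near its maximum
print(avg_loglik(np.array([0.0, 0.0]), 1.0, y, X))   # clearly lower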
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Maximum Likelihood Estimator

Definition (ML Estimator)


The MLE is a value of the parameter vector that maximizes the sample
average log-likelihood function:
\hat\theta_n \equiv \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ln f(y_i|x_i; \theta)

where \Theta denotes the parameter space in which the parameter vector \theta lies. Usually \Theta = \mathbb{R}^K.
Maximum Likelihood Estimator: Maximization
[Figure: the log-likelihood \ln L(\theta, y) plotted against \theta; its maximum defines \hat\theta = \arg\max_\theta \ln L(\theta, y).]
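Since the maximization rarely has a closed form in practice, here is a hedged sketch (not the lecture's own code) of the definition above: the MLE of a normal (µ, σ²) model obtained by numerically minimizing the negative average log-likelihood. The data, starting values and use of scipy.optimize.minimize are all illustrative choices.

```python
# Sketch: theta_hat = argmax (1/n) sum_i log f(y_i; theta) for the normal model,
# computed by minimizing the negative average log-likelihood. sigma is optimized
# on the log scale so the search is unconstrained.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
y = rng.normal(loc=70.0, scale=12.0, size=1000)   # hypothetical weight data

def neg_avg_loglik(params, y):
    mu, log_sigma = params
    return -np.mean(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_avg_loglik, x0=np.array([50.0, np.log(5.0)]), args=(y,), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)          # close to the analytic MLEs below
print(y.mean(), y.std(ddof=0))    # sample mean and (1/n) standard deviation
```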
Maximum Likelihood Estimator

Remark:
By the nature of the objective function, the MLE is the estimator which
makes the observed data most likely to occur. In other words, the MLE
is the best “rationalization” of what we observed.

Population analogue

E[\ln L(\theta; y|X)] \equiv \int \ln L(\theta; y|X)\, dF(y|X; \theta_0)

where F(y|X; \theta_0) is the joint CDF of (y, X).


Maximum Likelihood Estimator

Assumption I: Distribution
The sample {yi , xi } is i.i.d with true conditional density f (yi |xi ; θ0 ).

Since the sample is i.i.d, we can write:

f (y|X; θ) = f (y1 |x1 ; θ) × f (y2 |x2 ; θ) × ... × f (yn |xn ; θ)


Expected Log-Likelihood Inequality

Is E [ln L(θ; y|X)] maximized at θ0 ?

Assumption II: Dominance I


E [supθ∈Θ |ln L(θ; y|X)|] exists.

Lemma (Expected Log-likelihood Inequality)


If Dominance I assumption holds, then

E [ln f (y|x; θ)] ≤ E [ln f (y|x; θ0 )]


Example
The conditional log-likelihood function of y_i|x_i \sim N(x_i^\top\beta_0, \sigma_0^2) is

\log f(y_i|x_i; \theta) = -0.5\log(2\pi\sigma^2) - \frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}   (1)

The conditional expectation is:

E[\log f(y_i|x_i;\theta)|x_i] = -0.5\log(2\pi\sigma^2) - \frac{E[(y_i - x_i^\top\beta)^2]}{2\sigma^2}

Note that:

E[(y_i - x_i^\top\beta)^2] = E[(x_i^\top\beta_0 + \varepsilon_i - x_i^\top\beta)^2]   (since y_i = x_i^\top\beta_0 + \varepsilon_i)
= E[(\varepsilon_i + x_i^\top(\beta_0 - \beta))^2]
= E[\varepsilon_i^2 + 2\varepsilon_i x_i^\top(\beta_0 - \beta) + (\beta_0 - \beta)^\top x_i x_i^\top (\beta_0 - \beta)]
= E(\varepsilon_i^2) + (\beta_0 - \beta)^\top E(x_i x_i^\top)(\beta_0 - \beta)
= \sigma_0^2 + (\beta_0 - \beta)^\top E(x_i x_i^\top)(\beta_0 - \beta)
When is this expectation finite?
Example

The last term is finite if E(x_i x_i^\top) is. This implies that X is full-column rank.

E[\log f(y_i|x_i;\theta)|x_i] = -0.5\log(2\pi\sigma^2) - \frac{\sigma_0^2 + (\beta_0 - \beta)^\top E(x_i x_i^\top)(\beta_0 - \beta)}{2\sigma^2}

Now, E[\log f(y_i|x_i;\theta)|x_i] is uniquely maximized at x_i^\top\beta = x_i^\top\beta_0 and \sigma^2 = \sigma_0^2.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Identification

Before employing MLE, it is necessary to check whether the data-generating process is sufficiently informative about the parameters of the model.
Recall OLS: for \hat\beta to be unique, X must be full-column rank. Otherwise, ...
The question is: is the population objective E[\ln f(y_i|x_i;\theta)] uniquely maximized at \theta_0?
I If there exists another \theta \neq \theta_0 that maximizes E[\ln f(y_i|x_i;\theta)], then the MLE is not identified.
This is satisfied if (conditional density identification):

f(y_i|x_i; \theta) \neq f(y_i|x_i; \theta_0) \quad \forall \theta \neq \theta_0


Identification

Definition (Global Identification)


The parameter vector \theta_0 is globally identified in \Theta if, for every \theta_1 \in \Theta, \theta_0 \neq \theta_1 implies that:

\Pr[f(y_i|x_i;\theta_0) \neq f(y_i|x_i;\theta_1)] > 0

Assumption III: Global Identification


Every parameter vector θ0 ∈ Θ is globally identified.
Identification

Lemma (Strict Expected Log-Likelihood Inequality)


Under the Assumptions of Distribution, Dominance I and Global
Identification, then

\theta \neq \theta_0 \implies E[\ln f(y|x;\theta)] < E[\ln f(y|x;\theta_0)]

In words, the expected value of the log-likelihood is maximized at the true


value of the parameters.
Proof.
Let w = (y, x^\top)^\top and define

a(w) \equiv f(y|x;\theta)/f(y|x;\theta_0)

First, WTS that a(w) \neq 1 with positive probability, so that a(w) is a nonconstant random variable (so we can apply Jensen's Inequality).

a(w) \neq 1 \iff f(y|x;\theta) \neq f(y|x;\theta_0)
\Pr[a(w) \neq 1] = \Pr[f(y|x;\theta) \neq f(y|x;\theta_0)]

But, by Global Identification:

\Pr[f(y|x;\theta) \neq f(y|x;\theta_0)] > 0 \implies \Pr[a(w) \neq 1] > 0

Now, WTS E[\log a(w)] < \log\{E[a(w)]\}. We use the strict version of Jensen's inequality, which states that if c(x) is a strictly concave function and x is a nonconstant random variable, then E[c(x)] < c[E(x)].
Proof.
Set c(x) = log(x), since log(x) is strictly concave and a(w) is non-constant.
Therefore

E [log a(w)] < log {E [a(w)]}


Now, WTS that E[a(w)] = 1. Note that the conditional mean of a(w) equals 1 because:

E[a(w)|x] = \int a(w) f(y|x;\theta_0)\,dy = \int \frac{f(y|x;\theta)}{f(y|x;\theta_0)} f(y|x;\theta_0)\,dy = \int f(y|x;\theta)\,dy = 1

By the Law of Total Expectations, E[a(w)] = 1. Combining the results:

E[\log a(w)] < \log(1) = 0

But \log a(w) = \log f(y|x;\theta) - \log f(y|x;\theta_0), so E[\log f(y|x;\theta)] - E[\log f(y|x;\theta_0)] < 0, which is the desired result.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Differentiability
Assumption IV: Integrability
The pdf f (yi |xi ; θ) is twice continuously differentiable in θ for all θ ∈ Θ.
Furthermore, the support S(θ) of f (yi |xi ; θ) does not depend on θ, and
differentiation and integration are interchangeable in the sense that

\frac{\partial}{\partial\theta}\int_S dF(y_i|x_i;\theta) = \int_S \frac{\partial}{\partial\theta}\, dF(y_i|x_i;\theta)
\frac{\partial^2}{\partial\theta\,\partial\theta^\top}\int_S dF(y_i|x_i;\theta) = \int_S \frac{\partial^2}{\partial\theta\,\partial\theta^\top}\, dF(y_i|x_i;\theta)

and

\frac{\partial E[\ln f(y_i|x_i;\theta)|x_i]}{\partial\theta} = E\left[\left.\frac{\partial \ln f(y_i|x_i;\theta)}{\partial\theta}\right| x_i\right]
\frac{\partial^2 E[\ln f(y_i|x_i;\theta)|x_i]}{\partial\theta\,\partial\theta^\top} = E\left[\left.\frac{\partial^2 \ln f(y_i|x_i;\theta)}{\partial\theta\,\partial\theta^\top}\right| x_i\right]

where all terms exist. In this case, we denote the support of F(y) simply by S.
The Score Function
Definition (Score Function)
The score function is defined as the vector of first partial derivatives of the
log-likelihood function with respect to the parameter vector θ:
s(w; \theta) = \frac{\partial \ln f(y|X;\theta)}{\partial\theta} = \begin{pmatrix} \partial \ln f(y|X;\theta)/\partial\theta_1 \\ \partial \ln f(y|X;\theta)/\partial\theta_2 \\ \vdots \\ \partial \ln f(y|X;\theta)/\partial\theta_K \end{pmatrix}

The score vector for observation i is:

s(w_i; \theta) = \frac{\partial \ln f(y_i|x_i;\theta)}{\partial\theta}

Because of the additivity of terms in the log-likelihood function, we can write:

s(w; \theta) = \sum_{i=1}^n s(w_i; \theta)
Score Identity

Lemma (Score Identity)


Under Integrability and Distribution Assumption:

E[s(w; \theta_0)] = 0

We have to be clear whether we are speaking about the score of a single observation, s(w_i; \theta), or the score of the sample, s(w; \theta).
Since under random sampling s(w; \theta) = \sum_{i=1}^n s(w_i; \theta), it is sufficient to establish that E[s(w_i; \theta_0)] = 0.
Proof.
First, we derive an integral property of the pdf. Because we are assuming F(y|x;\theta) is a proper cdf,

\int_S dF(y_i|x_i;\theta) = \int_S f(y_i|x_i;\theta)\,dy_i = 1   (2)

for all \theta \in \Theta. Given differentiability, we can differentiate both sides of this equality with respect to \theta:

0 = \int_S \frac{\partial}{\partial\theta} f(y_i|x_i;\theta)\,dy_i   (3)

This equation states how changes in f(y_i|x_i;\theta) resulting from changes in \theta are restricted by (2). We can rewrite (3) as

0 = \int_S \frac{f(y_i|x_i;\theta)}{f(y_i|x_i;\theta)} \frac{\partial}{\partial\theta} f(y_i|x_i;\theta)\,dy_i
0 = \int_S \frac{1}{f(y_i|x_i;\theta)} \frac{\partial f(y_i|x_i;\theta)}{\partial\theta}\, \underbrace{dF(y_i|x_i;\theta)}_{f(y_i|x_i;\theta)dy_i}   (4)
Proof.
Now we interpret this integral equation as an expectation. Consider:

\frac{\partial \ln f(y_i|x_i;\theta)}{\partial\theta} \equiv \frac{1}{f(y_i|x_i;\theta)} \frac{\partial f(y_i|x_i;\theta)}{\partial\theta}
s(w_i;\theta) \equiv \frac{1}{f(y_i|x_i;\theta)} \frac{\partial f(y_i|x_i;\theta)}{\partial\theta}   (5)
s(w_i;\theta)\, f(y_i|x_i;\theta) \equiv \frac{\partial f(y_i|x_i;\theta)}{\partial\theta}

Then, substituting into (4):

\int_S s(w_i;\theta)\,dF(y_i|x_i;\theta) = 0

This holds for any \theta \in \Theta, in particular for \theta = \theta_0. Setting \theta = \theta_0, we obtain:

\int_S s(w_i;\theta_0)\,dF(y_i|x_i;\theta_0) = E[s(w_i;\theta_0)|x] = 0

Then, by the Law of Total Expectations, we obtain the desired result.
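A Monte Carlo sketch of the score identity (not from the slides; the normal (µ, σ²) model and the particular true values are illustrative): the sample average of the score, evaluated at the true parameters, should be close to the zero vector.

```python
# Sketch: E[s(w_i; theta_0)] = 0 for the normal model with theta = (mu, sigma^2).
import numpy as np

rng = np.random.default_rng(7)
mu0, sigma0_sq = 2.0, 4.0
y = rng.normal(loc=mu0, scale=np.sqrt(sigma0_sq), size=1_000_000)

score_mu = (y - mu0) / sigma0_sq                                         # d log f / d mu
score_s2 = -1.0 / (2 * sigma0_sq) + (y - mu0) ** 2 / (2 * sigma0_sq**2)  # d log f / d sigma^2

print(score_mu.mean(), score_s2.mean())   # both approximately 0
```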
What if the support depends on θ?
In this case the support is S(\theta) = \{y : A(\theta) \le y \le B(\theta)\}. By definition:

\int_{A(\theta)}^{B(\theta)} f(y|x;\theta)\,dy = 1

Now, using Leibniz's theorem gives:

\frac{\partial}{\partial\theta}\int_{A(\theta)}^{B(\theta)} f(y|x;\theta)\,dy = 0
\int_{A(\theta)}^{B(\theta)} \frac{\partial f(y|x;\theta)}{\partial\theta}\,dy + f(B(\theta)|\theta)\frac{\partial B(\theta)}{\partial\theta} - f(A(\theta)|\theta)\frac{\partial A(\theta)}{\partial\theta} = 0

To interchange the operations of differentiation and integration we need the second and third terms to go to zero. The necessary condition is that

\lim_{y \to A(\theta)} f(y|x;\theta) = 0, \qquad \lim_{y \to B(\theta)} f(y|x;\theta) = 0

Sufficient conditions are that the support does not depend on the parameter, which means that \partial A(\theta)/\partial\theta = \partial B(\theta)/\partial\theta = 0, or that the density is zero at the terminal points.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Hessian

Since we are doing an optimization analysis, we need the Hessian matrix.

H(w;\theta) = \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta\,\partial\theta^\top} =
\begin{pmatrix}
\frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_1^2} & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_1\partial\theta_K} \\
\frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_2^2} & \cdots & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_2\partial\theta_K} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_K\partial\theta_1} & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_K\partial\theta_2} & \cdots & \frac{\partial^2 \ln f(y|X;\theta)}{\partial\theta_K^2}
\end{pmatrix}

If the log-likelihood function is strictly concave in \theta, H(w;\theta) is negative definite. In the scalar case (K = 1), this simply means that the second derivative of the log-likelihood function is negative.
Hessian

Because of the additivity of terms in the log-likelihood function:

H(w;\theta) = \sum_{i=1}^n H(w_i;\theta), \qquad \text{where } H(w_i;\theta) = \frac{\partial^2 \ln f(y_i|x_i;\theta)}{\partial\theta\,\partial\theta^\top}

Remark
It is important to keep in mind that both the score and Hessian depend on the
sample and are therefore random variables (they differ in repeated samples).
Information Identity

To analyze the variance and the limiting distribution of the ML


estimator, we require some results on the Fisher information matrix.
It is very related to the Hessian matrix.
The information matrix of a sample is simply defined as the negative
expectation of the Hessian Matrix:

I(θ) = −E [H(w, θ)]

Why is it useful?
I It can be used to assess whether the likelihood function is “well behaved”
(Identification)
I Important result: the information matrix is the inverse of the variance of
the maximum likelihood estimator.
I Cramér Rao lower bound.
Information matrix equality

Information matrix equality


The information matrix can be derived in two ways: either as minus the expected Hessian, or alternatively as the variance of the score function, both evaluated at the true parameter \theta_0.
Information Identity

Assumption V: Finite Information



 
Var\left[\frac{\partial}{\partial\theta}\ln f(y|X;\theta)\right] \equiv Var[s(w;\theta)] exists.

Lemma (Information Identity)

Under the Distribution, Differentiability and Finite Information Assumption:

E\left[\frac{\partial^2}{\partial\theta\,\partial\theta^\top}\ln f(y|X;\theta)\right] = -\,Var[s(w;\theta)]

Proof: (Homework)
Information Identity

Note the following:

Var[s(w_i;\theta_0)] = E\big[\underbrace{s(w_i;\theta_0)}_{(K\times 1)}\underbrace{s(w_i;\theta_0)^\top}_{(1\times K)}\big] - \underbrace{E[s(w_i;\theta_0)]\,E[s(w_i;\theta_0)]^\top}_{=0}
= E[s(w_i;\theta_0)s(w_i;\theta_0)^\top]

Therefore we can write:

-I(\theta_0) = E[H(w_i;\theta_0)] = -\,Var[s(w_i;\theta_0)] = -E[s(w_i;\theta_0)s(w_i;\theta_0)^\top]


Example

Recall that:

\log f(y_i|x_i;\theta) = -0.5\log(2\pi\sigma^2) - \frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}

We have:

s(w_i;\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_i \hat\varepsilon_i \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}\hat\varepsilon_i^2 \end{pmatrix}

H(w_i;\theta) = \begin{pmatrix} -\frac{1}{\sigma^2} x_i x_i^\top & -\frac{1}{\sigma^4} x_i \hat\varepsilon_i \\ -\frac{1}{\sigma^4} x_i^\top \hat\varepsilon_i & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}\hat\varepsilon_i^2 \end{pmatrix}

s(w_i;\theta)s(w_i;\theta)^\top = \begin{pmatrix} \frac{1}{\sigma^4} x_i x_i^\top \hat\varepsilon_i^2 & -\frac{1}{2\sigma^4} x_i \hat\varepsilon_i + \frac{1}{2\sigma^6} x_i \hat\varepsilon_i^3 \\ -\frac{1}{2\sigma^4} x_i^\top \hat\varepsilon_i + \frac{1}{2\sigma^6} x_i^\top \hat\varepsilon_i^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\hat\varepsilon_i^2 + \frac{1}{4\sigma^8}\hat\varepsilon_i^4 \end{pmatrix}

where w_i = (y_i, x_i^\top)^\top, \theta = (\beta^\top, \sigma^2)^\top and \hat\varepsilon_i \equiv y_i - x_i^\top\beta.
Example

So for \theta = \theta_0 the \hat\varepsilon_i in these expressions can be replaced by \varepsilon_i. In the linear regression model, E(\varepsilon_i|x_i) = 0. Also, since \varepsilon_i \sim N(0, \sigma_0^2), we have E(\varepsilon_i^3) = 0 and E(\varepsilon_i^4) = 3\sigma_0^4. Using these relations, we have:

-E[H(w_i;\theta_0)] = E[s(w_i;\theta_0)s(w_i;\theta_0)^\top] = \begin{pmatrix} \frac{1}{\sigma_0^2} E(x_i x_i^\top) & 0 \\ 0^\top & \frac{1}{2\sigma_0^4} \end{pmatrix}

If E(x_i x_i^\top) is nonsingular, then E[H(w_i;\theta_0)] is nonsingular.
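A simulation sketch of the information matrix equality for this example (not from the slides; the design, sample size and true values are illustrative): at θ0, the average of −H(wᵢ; θ0) and the average of s(wᵢ; θ0)s(wᵢ; θ0)′ both approach diag(E(xᵢxᵢ′)/σ0², 1/(2σ0⁴)).

```python
# Sketch: compare the mean of -H(w_i; theta_0) with the mean outer product of scores.
import numpy as np

rng = np.random.default_rng(3)
n, beta0, s2 = 200_000, np.array([1.0, -0.5]), 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(scale=np.sqrt(s2), size=n)        # eps plays the role of y_i - x_i'beta_0

# Score blocks evaluated at the true parameters.
s_beta = X * (eps / s2)[:, None]                                # n x 2
s_s2 = -1.0 / (2 * s2) + eps**2 / (2 * s2**2)                   # n
scores = np.column_stack([s_beta, s_s2])                        # n x 3

outer_mean = scores.T @ scores / n                              # average of s s'
minus_hess_mean = np.zeros((3, 3))                              # average of -H(w_i; theta_0)
minus_hess_mean[:2, :2] = X.T @ X / (n * s2)
minus_hess_mean[:2, 2] = minus_hess_mean[2, :2] = (X * eps[:, None]).mean(axis=0) / s2**2
minus_hess_mean[2, 2] = -1.0 / (2 * s2**2) + (eps**2).mean() / s2**3

print(np.round(outer_mean, 3))
print(np.round(minus_hess_mean, 3))   # both close to diag(E(xx')/s2, 1/(2*s2**2))
```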


1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Some Ideas

For the OLS estimator, consistency can be shown by writing down the sampling error explicitly and applying a LLN.
This cannot be done for a nonlinear estimator such as the MLE, since a closed-form expression for the finite-sample estimator does not exist.
That is, the MLE is an implicit function of the random sample.

Question
How can we proceed?
Some Ideas
Using some LLN we know that:

\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) \xrightarrow{p} E[\log f(y_i|x_i;\theta)]   (6)

for any fixed parameter value \theta. That is, the sample average log-likelihood function converges to the expected log-likelihood for any value of \theta. Recall that:

\hat\theta_n \equiv \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta)
\theta_0 \equiv \arg\max_{\theta\in\Theta} E[\log f(y_i|x_i;\theta)]

We would like to say that, given that \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) \xrightarrow{p} E[\log f(y_i|x_i;\theta)], then \hat\theta_n \xrightarrow{p} \theta_0.
Some Ideas

We might be able to do this using the continuous mapping theorem.
Let X_n = \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta), and g(\cdot) = \arg\max_{\theta\in\Theta}(\cdot).
Then we would like to say that if X_n \xrightarrow{p} X then g(X_n) \xrightarrow{p} g(X).

In words:
If the sample average of the log-likelihood function is close to the true expected value of the log-likelihood function, then we would expect that \hat\theta_n will be close to the maximizer of the expected log-likelihood (as n increases without bound).

However, we cannot do that!


What is the problem?

The problem is that the argument of \arg\max_{\theta\in\Theta}(\cdot) is a function of \theta, not a real vector:
I The concept of convergence in probability was defined for sequences of random variables.
Therefore, we need to define what we mean by the probability limit of a sequence of random functions, as opposed to a sequence of random variables:

Convergence for a sequence of random variables: X_n = X_n(\omega), \omega \in \Omega
Convergence for a sequence of random functions: Q_n = Q_n(\omega, \theta), \omega \in \Omega
Example

Example
In ML estimation, the log-likelihood is a function of the sample data (a
random vector that depends on ω) and of a parameter θ. By increasing the
sample size, we obtain a sequence of log-likelihoods that depend on ω and θ.
Consistency

How is the distance between two functions over a set containing an infinite
number of possible comparisons at different values of θ measured?

IOW, since we are dealing with convergence on a function space we


need to define when two functions are close to one another.
To reduce the infinite dimensional character of a function to a
one-dimensional concept of convergence, we take the supremum of the
absolute difference of the function values over all θ in Θ
Uniform Convergence in Probability

Definition (Uniform Convergence in Probability)


The sequence of real-valued functions \{Q_n(\theta)\} converges uniformly in probability to the limit function Q_0(\theta) if \sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0. We will say that Q_n(\theta) \xrightarrow{p} Q_0(\theta) uniformly.
Another way to express uniform convergence in probability is:

\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| = o_p(1)

IOW, instead of requiring that the distance |Q_n(\theta) - Q_0(\theta)| converge in probability to 0 for each \theta, we require convergence of \sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)|, which is the maximum distance that can be found by ranging over the parameter space.
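A small sketch of this definition (not from the slides; the Bernoulli model, the grid and the restricted parameter range are illustrative choices): the sup distance between the sample average log-likelihood Q_n(θ) and its expectation Q_0(θ) shrinks as n grows.

```python
# Sketch: sup_theta |Q_n(theta) - Q_0(theta)| for the Bernoulli log-likelihood.
import numpy as np

rng = np.random.default_rng(5)
p0 = 0.3
theta = np.linspace(0.05, 0.95, 181)                       # grid over a restricted parameter space
Q0 = p0 * np.log(theta) + (1 - p0) * np.log(1 - theta)     # expected log-likelihood

for n in [100, 10_000, 1_000_000]:
    y = rng.binomial(1, p0, size=n)
    ybar = y.mean()
    Qn = ybar * np.log(theta) + (1 - ybar) * np.log(1 - theta)  # sample average log-likelihood
    print(n, np.max(np.abs(Qn - Q0)))                      # the sup distance decreases with n
```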
Uniform Convergence in Probability
[Figure: Q_n(\theta) lying inside the uniform band Q_0(\theta) \pm \varepsilon over the whole parameter space, with the maximizer \theta_0 and the maximum Q_0(\theta_0) marked.]
Uniform Convergence in Probability

Extending the concept to random vectors is straightforward. Now suppose


that {Qn (θ)} is a sequence of K × 1 random vectors that depend both on the
data and on the parameter θ ∈ Θ. This sequence of random vectors is
uniformly convergent in probability to Q0 (θ) if and only if

\sup_{\theta\in\Theta}\|Q_n(\theta) - Q_0(\theta)\| = o_p(1)

where \|Q_n(\theta) - Q_0(\theta)\| denotes the Euclidean norm of the vector Q_n(\theta) - Q_0(\theta). By taking the supremum over \theta we obtain another random quantity that does not depend on \theta.
Pointwise Convergence in probability

Definition (Pointwise Convergence in probability)


The sequence of real-valued functions \{Q_n(\theta)\} converges pointwise in probability if and only if |Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0 for each \theta \in \Theta.

Uniform convergence is stronger than pointwise convergence.


Uniform LLN

Now we present the uniform LLN to study sequences of random functions, which is analogous to Chebychev's LLN for averages of random variables.

Theorem (Uniform LLN)

Suppose that Q(\theta; U) is a continuous function over \theta \in \Theta, a closed and bounded subset of \mathbb{R}^p, and that \{U_n\} is a sequence of i.i.d. random variables with cdf F_U(u). If E[\sup_{\theta\in\Theta}\|Q(\theta; U)\|] exists, then
1 E[Q(\theta; U)] is continuous over \theta \in \Theta, and
2 \frac{1}{n}\sum_{i=1}^n Q(\theta; u_i) \xrightarrow{p} E[Q(\theta; U)] uniformly.

Or, as Newey and McFadden (1994, Lemma 2.4) state:

\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^n Q(\theta; u_i) - E[Q(\theta; U)]\right| \xrightarrow{p} 0
Uniform LLN

The following Theorem makes the connection between the uniform convergence of \frac{1}{n}\sum_{i=1}^n Q(\theta; u_i) to E[Q(\theta; U)] and the convergence of \hat\theta_n to \theta_0, using the assumption of a compact parameter space.
Consistency

Theorem (Consistency of Maxima with Compact Parameter Space)


Suppose that:
1 (compact parameter space) \Theta \subset \mathbb{R}^p is a closed and bounded parameter space,
2 (uniform convergence) Q_n(\theta) is a sequence of functions that converges in probability uniformly to a function Q_0(\theta) on \Theta,
3 (continuity) Q_n(\theta) is continuous in \theta for any data (w_1, ..., w_n),
4 (identification) Q_0(\theta) is uniquely maximized at \theta_0 \in \Theta.
Then \hat\theta_n \equiv \arg\max_{\theta\in\Theta} Q_n(\theta) converges in probability to \theta_0.
Consistency

Intuition:
If Q_n(\theta) converges uniformly to Q_0(\theta), then the characteristics of Q_n(\theta) will be close to the characteristics of Q_0(\theta) as n \to \infty. One particular characteristic is the point \theta_0 where Q_0(\theta) is uniquely maximized. Then, it is expected that the maximizer of Q_n(\theta), \hat\theta, will be close to the maximizer of Q_0(\theta).
Consistency

Theorem (Consistency of Maxima without Compactness)


Suppose that:
1 (interior) \theta_0 is an element of the interior of a convex parameter space \Theta,
2 (pointwise convergence) Q_n(\theta) converges in probability to Q_0(\theta) for all \theta \in \Theta,
3 (concavity) Q_n(\theta) is concave over the parameter space for any data (w_1, ..., w_n),
4 (identification) Q_0(\theta) is uniquely maximized at \theta_0 \in \Theta.
Then, as n \to \infty, \hat\theta_n exists with probability approaching 1 and \hat\theta_n \xrightarrow{p} \theta_0.
Consistency
Theorem (Consistency of conditional ML with compact parameter)
Let \{y_i, x_i\} be i.i.d. with conditional density f(y_i|x_i;\theta_0) and let \hat\theta be the conditional ML estimator, which maximizes the average log conditional likelihood:

\hat\theta_n = \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta)

Suppose the model is correctly specified so that \theta_0 is in \Theta. Suppose that
1 (compactness) the parameter space \Theta is a compact subset of \mathbb{R}^K,
2 f(y_i|x_i;\theta) is continuous in \theta for all (y_i, x_i) (here, note Weierstrass's theorem),
3 f(y_i|x_i;\theta) is measurable in (y_i, x_i) for all \theta \in \Theta (so \hat\theta is a well-defined random variable),
4 (identification) \Pr[f(y_i|x_i;\theta) \neq f(y_i|x_i;\theta_0)] > 0 for all \theta \neq \theta_0 in \Theta,
5 (dominance) E[\sup_{\theta\in\Theta}|\log f(y_i|x_i;\theta)|] < \infty (note: the expectation is over y_i and x_i).
Then \hat\theta \xrightarrow{p} \theta_0.
Sketch of Proof.
We would like to apply the Consistency of Maxima with Compact Parameter Space Theorem. In this case, let Q(\theta; U) = \log f(y|x;\theta). Now we verify that the conditions of the theorem are met:
f(y_i|x_i;\theta) is continuous,
Compactness states that \Theta is a closed and bounded subset of \mathbb{R}^K,
(y_i, x_i) are i.i.d. with conditional density f(y_i|x_i;\theta_0),
Dominance I states that E[\sup_{\theta\in\Theta}|\log f(y_i|x_i;\theta)|] exists.
Therefore, E[\log f(y_i|x_i;\theta)] is continuous and

\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) \xrightarrow{p} E[\log f(y_i|x_i;\theta)]   (7)

uniformly. Let Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) and Q_0(\theta) = E[\log f(y_i|x_i;\theta)]. Under the additional assumption of Likelihood Identification, we can invoke the strict expected log-likelihood inequality: \theta \neq \theta_0 implies E[\log f(y|x;\theta)] < E[\log f(y|x;\theta_0)]. This implies that Q_0(\theta) is uniquely maximized at \theta_0. Therefore

\hat\theta_n = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) \xrightarrow{p} \theta_0
Consistency
Theorem (Consistency of conditional ML without Compactness)
Let \{y_i, x_i\} be i.i.d. with conditional density f(y_i|x_i;\theta_0) and let \hat\theta be the conditional ML estimator, which maximizes the average log conditional likelihood:

\hat\theta_n = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta)

Suppose the model is correctly specified so that \theta_0 is in \Theta. Suppose that
1 the true parameter vector \theta_0 is an element of the interior of a convex parameter space \Theta \subset \mathbb{R}^K,
2 \log f(y_i|x_i;\theta) is concave in \theta for all (y_i, x_i),
3 \log f(y_i|x_i;\theta) is measurable in (y_i, x_i),
4 (identification) \Pr[f(y_i|x_i;\theta) \neq f(y_i|x_i;\theta_0)] > 0 for all \theta \neq \theta_0 in \Theta,
5 E[|\log f(y_i|x_i;\theta)|] < \infty (i.e., E[\log f(y_i|x_i;\theta)] exists and is finite) for all \theta \in \Theta (note: the expectation is over y_i and x_i).
Then, as n \to \infty, \hat\theta exists with probability approaching 1 and \hat\theta_n \xrightarrow{p} \theta_0.
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Asymptotic
Theorem (Asymptotic Normality of Conditional ML)
Let w_i \equiv (y_i, x_i^\top)^\top be i.i.d. Suppose the conditions of either Theorem 18 or 19 are satisfied, so that \hat\theta_n \xrightarrow{p} \theta_0. Suppose, in addition, that:
1 \theta_0 is in the interior of \Theta,
2 f(y_i|x_i;\theta_0) is twice continuously differentiable in \theta for all (y_i, x_i),
3 E[s(w_i;\theta_0)] = 0 and -E[H(w_i;\theta_0)] = E[s(w_i;\theta_0)s(w_i;\theta_0)^\top],
4 (local dominance condition on the Hessian) for some neighborhood N of \theta_0,

E\left[\sup_{\theta\in N}\|H(w_i;\theta)\|\right] < \infty

so that for any consistent estimator \tilde\theta, \frac{1}{n}\sum_{i=1}^n H(w_i;\tilde\theta) \xrightarrow{p} E[H(w_i;\theta_0)],
5 E[H(w_i;\theta_0)] is nonsingular.
Then:

\sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N(0, V), \qquad V = -\{E[H(w_i;\theta_0)]\}^{-1} = \{E[s(w_i;\theta_0)s(w_i;\theta_0)^\top]\}^{-1}
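As a quick illustration of this theorem (not part of the lecture; the exponential model, sample size and number of replications are arbitrary choices), the sketch below simulates the MLE many times and checks that √n(θ̂ − θ0) behaves like a N(0, V) draw. For the exponential density f(y; λ) = λe^{−λy} the MLE is λ̂ = 1/ȳ and V = λ0², since −E[H(wᵢ; λ0)] = 1/λ0².

```python
# Sketch: Monte Carlo check of sqrt(n)(lam_hat - lam0) ~ N(0, lam0^2) for the exponential model.
import numpy as np

rng = np.random.default_rng(11)
lam0, n, reps = 2.0, 500, 20_000

draws = rng.exponential(scale=1 / lam0, size=(reps, n))
lam_hat = 1 / draws.mean(axis=1)          # closed-form MLE in each replication
z = np.sqrt(n) * (lam_hat - lam0)

print(z.mean(), z.var())                   # approximately 0 and lam0**2 = 4
print(np.mean(np.abs(z) <= 1.96 * lam0))   # approximately 0.95, as for a N(0, lam0^2) draw
```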
Asymptotic

The intuition is the following:
Since we usually don't have an explicit solution for the estimator, we need to focus on the asymptotic behaviour of the score function.
Assuming that \hat\theta_n \xrightarrow{p} \theta_0, the behaviour of the score function matters only within an arbitrarily small neighbourhood of \theta_0.
... after all, \hat\theta_n will fall within such neighbourhoods with arbitrarily high probability for a large enough sample size...
... and within such a neighbourhood the score function is essentially linear (Taylor series expansion).
Mean Value Theorem

Theorem (Mean Value Theorem)


Let s : \mathbb{R}^K \to \mathbb{R} be defined on an open convex set \Theta \subset \mathbb{R}^K such that s is continuously differentiable on \Theta with gradient \nabla s. Then for any points \theta and \theta_0 in \Theta there exists a point \bar\theta on the line segment joining them such that s(\theta) = s(\theta_0) + \nabla s(\bar\theta)(\theta - \theta_0).
Asymptotic
Proof.
The objective function is:

Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ln f(y_i|x_i;\theta)

Given that f(y_i|x_i;\theta_0) is twice continuously differentiable in \theta, and given that \theta_0 is in the interior of \Theta, the maximum likelihood estimator satisfies

\frac{\partial \log L(\hat\theta)}{\partial\theta} = s(w;\hat\theta) = 0

We need to know about the behavior of the gradient around the true parameter. Expand this set of equations in a Taylor series around the true parameter \theta_0. We will use the mean value theorem to truncate the Taylor series at the second term:

\underbrace{\frac{\partial \log L(\hat\theta)}{\partial\theta}}_{(K\times 1)} = \underbrace{\frac{\partial \log L(\theta_0)}{\partial\theta}}_{(K\times 1)} + \underbrace{\frac{\partial^2 \log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}}_{(K\times K)}\underbrace{(\hat\theta - \theta_0)}_{(K\times 1)}

Dividing through by n,

0 = \frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0) + \left[\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta)\right](\hat\theta - \theta_0)

where \bar\theta = \alpha\hat\theta + (1-\alpha)\theta_0 for some \alpha \in (0,1).
Proof.
So,

\sqrt{n}\,(\hat\theta - \theta_0) = \left[-\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta)\right]^{-1}\left(\sqrt{n}\,\frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0)\right)

We know that

\hat\theta \xrightarrow{p} \theta_0 \implies \bar\theta \xrightarrow{p} \theta_0

By the uniform LLN, we know that

\frac{1}{n}\sum_{i=1}^n H(w_i;\theta) \xrightarrow{p} E[H(w_i;\theta)] \quad \text{uniformly in } \theta\in\Theta

Then, applying our Lemma:

\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta_n) \xrightarrow{p} E[H(w_i;\theta_0)]

since E[H(w_i;\theta_0)] exists. Finally, using probability limit continuity and nonsingular information:

\left[\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta_n)\right]^{-1} \xrightarrow{p} E[H(w_i;\theta_0)]^{-1}
Proof.
Since (y_i, x_i) are i.i.d., \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0)\right) is a scaled sum of the i.i.d. variables s(w_i;\theta_0). The score identity lemma implies that E[s(w_i;\theta_0)] = 0, and the Information Identity implies that

Var[s(w_i;\theta_0)] = E[s(w_i;\theta_0)s(w_i;\theta_0)^\top] = -E[H(w_i;\theta_0)]

The Lindeberg-Levy CLT therefore implies:

\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0)\right) \xrightarrow{d} N(0, -E[H(w_i;\theta_0)])
Proof.
Then:

\sqrt{n}\,(\hat\theta - \theta_0) = \left[-\frac{1}{n}\sum_{i=1}^n H(w_i;\bar\theta)\right]^{-1}\left(\sqrt{n}\,\frac{1}{n}\sum_{i=1}^n s(w_i;\theta_0)\right)
\xrightarrow{d} -E[H(w_i;\theta_0)]^{-1}\, N(0, -E[H(w_i;\theta_0)])
= N\left(0,\; -E[H(w_i;\theta_0)]^{-1} E[H(w_i;\theta_0)] E[H(w_i;\theta_0)]^{-1}\right)
= N\left(0,\; -E[H(w_i;\theta_0)]^{-1}\right)
= N\left(0,\; [I(w_i;\theta_0)]^{-1}\right)
Variance Estimation
For large but finite samples, we can therefore write the approximate distribution of \hat\theta_n as

\hat\theta \overset{a}{\sim} N\left(\theta_0,\; n^{-1}[I(\theta_0)]^{-1}\right)

where here I(\theta_0) = -E[H(w_i;\theta_0)] denotes the information of a single observation. We have three potential estimators of V = [I(\theta_0)]^{-1}, based on three ways of estimating I(\theta_0):

The empirical mean of minus the Hessian:

\hat V_1 = \left(\frac{1}{n}\sum_{i=1}^n -H(w_i;\hat\theta)\right)^{-1}

The empirical variance of the score:

\hat V_2 = \left(\frac{1}{n}\sum_{i=1}^n s(w_i;\hat\theta)s(w_i;\hat\theta)^\top\right)^{-1}

Minus the expected Hessian evaluated at \hat\theta:

\hat V_3 = \left(\frac{1}{n}\sum_{i=1}^n -E\big[H(w_i;\hat\theta)\big]\right)^{-1}
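A sketch of the first two estimators (not from the slides; the exponential model and the numbers are illustrative): for f(y; λ) = λe^{−λy} the per-observation Hessian is −1/λ² and the score is 1/λ − y, so both V̂₁ and V̂₂ estimate V = λ0².

```python
# Sketch: V1_hat (inverse of the mean of -H) and V2_hat (inverse of the mean of s^2)
# for the exponential model, evaluated at the MLE lam_hat = 1 / ybar.
import numpy as np

rng = np.random.default_rng(13)
lam0, n = 2.0, 5_000
y = rng.exponential(scale=1 / lam0, size=n)
lam_hat = 1 / y.mean()

scores = 1 / lam_hat - y                     # s(w_i; lam_hat)
hessians = -np.full(n, 1 / lam_hat**2)       # H(w_i; lam_hat)

V1 = 1 / np.mean(-hessians)                  # (mean of -H)^(-1)
V2 = 1 / np.mean(scores**2)                  # (mean of s^2)^(-1)

print(V1, V2, lam0**2)   # both estimate the asymptotic variance V = lam0^2
# Var(lam_hat) itself is then approximated by V1/n (or V2/n).
```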
Proof of Consistency

Evaluated at a θ ∈ Θ, each estimator converges in probability uniformly


to its expectation.
Because \hat\theta_n \xrightarrow{p} \theta_0, evaluated at \hat\theta_n each estimator converges in probability to I(\theta_0).
Because matrix inversion is a continuous transformation, the inverse of each matrix is also a consistent estimator for the variance matrix of the asymptotic distribution of \sqrt{n}(\hat\theta_n - \theta_0).
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Hypothesis Testing

ML estimators are distributed asymptotically normally:
I As the sample size increases, the sampling distribution of an ML estimator becomes approximately normal:

\hat\beta \overset{a}{\sim} N(\beta, V_{\hat\beta})

where, for three coefficients:

V_{\hat\beta} = Var\begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix} = \begin{pmatrix} \sigma^2_{\hat\beta_0} & \sigma_{\hat\beta_0,\hat\beta_1} & \sigma_{\hat\beta_0,\hat\beta_2} \\ \sigma_{\hat\beta_1,\hat\beta_0} & \sigma^2_{\hat\beta_1} & \sigma_{\hat\beta_1,\hat\beta_2} \\ \sigma_{\hat\beta_2,\hat\beta_0} & \sigma_{\hat\beta_2,\hat\beta_1} & \sigma^2_{\hat\beta_2} \end{pmatrix}
Hypothesis Testing

Consider the simple hypothesis

H_0: \beta_k = \beta^*

where \beta^* is the hypothesized value, often equal to 0. Since \sigma^2_{\hat\beta_k} is unknown, it must be estimated, which results in the test statistic:

z = \frac{\hat\beta_k - \beta^*}{\hat\sigma_{\hat\beta_k}}

Under the assumptions justifying ML, if H_0 is true, then z is distributed approximately normally with mean 0 and variance 1 in large samples.
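A tiny sketch with hypothetical numbers (not from the slides): computing z and its two-sided p-value from an estimated coefficient and its ML standard error.

```python
# Sketch: z statistic for H0: beta_k = 0, using assumed values for illustration.
from scipy.stats import norm

beta_hat, se_hat, beta_star = 0.42, 0.15, 0.0
z = (beta_hat - beta_star) / se_hat
p_value = 2 * norm.sf(abs(z))              # two-sided p-value

print(z, p_value)   # z = 2.8, p approximately 0.005: reject H0 at the 5% level
```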
Testing

[Figure: the standard normal density f(z) with the two-sided rejection regions for H_0 beyond z = -1.96 and z = 1.96.]
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
The Trinity

For more complex hypotheses we can use the Wald, likelihood ratio (LR), or Lagrange multiplier (LM) test. These tests can be thought of as a comparison between the estimates obtained after the constraints implied by the hypothesis have been imposed and the estimates obtained without the constraints.
The Trinity: LR Test

The log-likelihood function \log L(\beta) is the solid curve.
\hat\beta_U: unconstrained estimator.
The H_0: \beta = \beta^* imposes the constraint \beta = \beta^*, so that the constrained estimate is \hat\beta_C = \beta^*.
Unless \hat\beta_U = \beta^*, \ln L(\hat\beta_C) \le \ln L(\hat\beta_U).
If the constraint significantly reduces the likelihood, then the null hypothesis is rejected.

[Figure: a log-likelihood curve with its maximum \log L(\hat\beta_U) at \hat\beta_U and the lower value \log L(\hat\beta_C) at \hat\beta_C = \beta^*.]
The Trinity: Wald Test
The Wald test estimates the model without constraints, and assesses the constraint by considering two things:
1. It measures the distance \hat\beta_U - \hat\beta_C = \hat\beta_U - \beta^*. The larger the distance, the less likely it is that the constraint is true.
2. The distance is weighted by the curvature of the log-likelihood function, \partial^2 \log L(\beta)/\partial\beta^2. The larger the second derivative, the faster the curve is changing.
The first log-likelihood function (dashed curve) is nearly flat, so the second derivative evaluated at \hat\beta_U is relatively small. When the second derivative is small, the distance between \hat\beta_U and \hat\beta_C is minor relative to the sampling variation.
The second log-likelihood function has a larger second derivative. With a larger second derivative, the same distance between \hat\beta_U and \hat\beta_C might be significant.

[Figure: two log-likelihood curves with the same maximizer \hat\beta_U but different curvatures \partial^2\log L(\beta)/\partial\beta^2, and the constrained value \hat\beta_C = \beta^*.]
The Trinity: LM (Score) Test

It only estimates the constrained model.
It assesses the slope of the log-likelihood function at the constraint, \partial\log L(\beta)/\partial\beta.
If H_0 is true, the slope (score) at the constraint should be close to 0.
As with the Wald test, the curvature of the log-likelihood function at the constraint is used to assess the significance of a nonzero slope.

[Figure: a log-likelihood curve with the tangent line (the score) drawn at the constrained estimate \hat\beta_C = \beta^*, and the unconstrained maximizer \hat\beta_U.]
The Trinity

When H_0 is true, the Wald, LR, and LM tests are asymptotically equivalent.
As n \to \infty, the sampling distributions of the three tests converge in distribution to \chi^2_r, where r is the number of constraints being tested.
They are equivalent only as n \to \infty. In small samples this is not necessarily true.
Test Statistics

Assume that we have r (r < K) nonlinear restrictions (which include linear restrictions as a special case):

\underbrace{r(\theta_0)}_{r\times 1} = \underbrace{0}_{r\times 1}

The hypotheses are:

H_0: r(\theta_0) = 0
H_1: r(\theta_0) \neq 0

Also, denote \hat\theta as the unconstrained estimator and \tilde\theta as the constrained estimator. So:

\tilde\theta_{ML} = \arg\max_{\theta\in\Theta}\left\{\frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta)\right\} \quad \text{s.t. } r(\theta) = 0
Test Statistics

Then the Wald, LM and LR statistics are defined as:

W = \underbrace{r(\hat\theta)^\top}_{1\times r}\Big[\underbrace{R(\hat\theta)}_{r\times K}\underbrace{\left(\tfrac{1}{n}\hat V\right)}_{K\times K}\underbrace{R(\hat\theta)^\top}_{K\times r}\Big]^{-1}\underbrace{r(\hat\theta)}_{r\times 1}

LM = \underbrace{s(\tilde\theta)^\top}_{1\times K}\underbrace{\left(\tfrac{1}{n}\tilde V\right)}_{K\times K}\underbrace{s(\tilde\theta)}_{K\times 1}

LR = 2\cdot n\left[\underbrace{\frac{1}{n}\sum_{i=1}^n \ln f(y_i|x_i;\hat\theta)}_{1\times 1} - \underbrace{\frac{1}{n}\sum_{i=1}^n \ln f(y_i|x_i;\tilde\theta)}_{1\times 1}\right]
Test Statistics

where:

\underbrace{R(\theta_0)}_{r\times K} = \frac{\partial r(\theta_0)}{\partial\theta^\top} \quad \text{(Jacobian of } r(\theta_0)\text{)}

\underbrace{\hat V}_{K\times K} = \hat I^{-1} = \left[-\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta^\top}\ln f(y_i|x_i;\hat\theta)\right]^{-1}
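A sketch of the three statistics for a very simple case (not from the slides): the Bernoulli model with the single scalar restriction H0: p = p*, so r(θ) = p − p*, R = 1, and V = p(1−p) is the inverse information of one observation; s below denotes the full-sample score. The true p0, p* and n are illustrative.

```python
# Sketch: Wald, LM and LR statistics for H0: p = p_star in the Bernoulli model.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(17)
p0, p_star, n = 0.35, 0.30, 2_000
y = rng.binomial(1, p0, size=n)

p_hat = y.mean()     # unconstrained MLE
p_til = p_star       # constrained estimate under H0

W = n * (p_hat - p_star) ** 2 / (p_hat * (1 - p_hat))
score_at_ptil = np.sum((y - p_til) / (p_til * (1 - p_til)))    # full-sample score at p*
LM = score_at_ptil**2 * p_til * (1 - p_til) / n

def loglik(p):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

LR = 2 * (loglik(p_hat) - loglik(p_til))

for name, stat in [("Wald", W), ("LM", LM), ("LR", LR)]:
    print(name, round(stat, 2), "p-value:", round(chi2.sf(stat, df=1), 4))
# The three statistics are close to one another in large samples and share a chi2(1) limit.
```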
1 Introduction
Motivation
Maximum Likelihood Estimator
Identification
The Score Function
The Information Matrix

2 Asymptotic Properties
Consistency
Asymptotic Normality

3 Estimation of Variance

4 Testing
Intuition
The Trinity
Proof, Proof and More Proof
Proof: Wald Statistic.
Write W as:

W = c_n^\top Z_n^{-1} c_n, \qquad c_n \equiv \sqrt{n}\,r(\hat\theta), \qquad Z_n \equiv R(\hat\theta)\hat V R(\hat\theta)^\top   (8)

First, we will show that c_n \xrightarrow{d} N(0, R(\theta_0)VR(\theta_0)^\top). Applying the MVT (21) (to truncate the Taylor series) to r(\hat\theta) around \theta_0:

r(\hat\theta) = \underbrace{r(\theta_0)}_{=0 \text{ under } H_0} + R(\bar\theta)(\hat\theta - \theta_0)

\underbrace{\sqrt{n}\,r(\hat\theta)}_{r\times 1} = \underbrace{R(\bar\theta)}_{r\times K}\underbrace{\sqrt{n}(\hat\theta - \theta_0)}_{K\times 1} \quad \text{(multiplying by } \sqrt{n}\text{)}
= R(\bar\theta)\sqrt{n}(\hat\theta - \theta_0) + R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) - R(\theta_0)\sqrt{n}(\hat\theta - \theta_0)
= \big[R(\bar\theta) - R(\theta_0)\big]\sqrt{n}(\hat\theta - \theta_0) + R(\theta_0)\sqrt{n}(\hat\theta - \theta_0)

where \bar\theta = \alpha\hat\theta + (1-\alpha)\theta_0.
Proof: Wald Statistic.
Note that

\hat\theta \xrightarrow{p} \theta_0 \implies \bar\theta \xrightarrow{p} \theta_0

and, since R(\cdot) is continuous, R(\bar\theta) \xrightarrow{p} R(\theta_0).

Note that \sqrt{n}(\hat\theta - \theta_0) converges in distribution to some random variable, so we can write:

\sqrt{n}\,r(\hat\theta) = \underbrace{\big[R(\bar\theta) - R(\theta_0)\big]\sqrt{n}(\hat\theta - \theta_0)}_{\xrightarrow{p} 0} + R(\theta_0)\sqrt{n}(\hat\theta - \theta_0)
\sqrt{n}\,r(\hat\theta) = R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) + o_p(1)

Furthermore, we know that:

\sqrt{n}(\hat\theta - \theta_0) = \{-E[H(w_i;\theta_0)]\}^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n s(w_i;\theta_0) + o_p(1)   (9)
Proof: Wald Statistic.
Then:

\sqrt{n}\,r(\hat\theta) = R(\theta_0)\sqrt{n}(\hat\theta - \theta_0) + o_p(1)
= R(\theta_0)\left[\{-E[H(w_i;\theta_0)]\}^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n s(w_i;\theta_0) + o_p(1)\right] + o_p(1)
= R(\theta_0)\{-E[H(w_i;\theta_0)]\}^{-1}\underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n s(w_i;\theta_0)}_{\xrightarrow{d} N(0,\,-E[H(w_i;\theta_0)])} + o_p(1)

Then:

\sqrt{n}\,r(\hat\theta) \xrightarrow{d} N\Big(0,\; R(\theta_0)\{-E[H(w_i;\theta_0)]\}^{-1}\big(-E[H(w_i;\theta_0)]\big)\{-E[H(w_i;\theta_0)]\}^{-1}R(\theta_0)^\top\Big)
= N\Big(0,\; R(\theta_0)\underbrace{\{-E[H(w_i;\theta_0)]\}^{-1}}_{V}R(\theta_0)^\top\Big)
Proof: Wald Statistic.
It follows that:

W = c_n^\top Z_n^{-1} c_n = \sqrt{n}\,r(\hat\theta)^\top\big[R(\theta_0)VR(\theta_0)^\top\big]^{-1}\sqrt{n}\,r(\hat\theta) \xrightarrow{d} \chi^2(\#r)

If we have consistent estimators of R(\theta_0) and V, then limit results for continuous functions imply that:

\sqrt{n}\,r(\hat\theta)^\top\big[R(\hat\theta)\hat V R(\hat\theta)^\top\big]^{-1}\sqrt{n}\,r(\hat\theta) \xrightarrow{d} \chi^2(\#r)
Preliminaries to the next two statistics

The Lagrangian for the constrained problem is

\mathcal{L} = \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i;\theta) + \underbrace{\lambda^\top}_{1\times r}\underbrace{r(\theta)}_{r\times 1}

Then the FOCs are:

\sqrt{n}\,s(\tilde\theta) + \sqrt{n}\,R(\tilde\theta)^\top\lambda = 0
\sqrt{n}\,r(\tilde\theta) = 0   (10)

Following a similar reasoning as in the previous proof:

\sqrt{n}\,r(\tilde\theta) = R(\theta_0)\sqrt{n}(\tilde\theta - \theta_0) + o_p(1) = O_p(1)
Preliminaries to the next two statistics
A Taylor expansion of s(\tilde\theta) yields:

\frac{\partial\log L(\tilde\theta)}{\partial\theta} = \frac{\partial\log L(\theta_0)}{\partial\theta} + \frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \theta_0)
= \frac{\partial\log L(\theta_0)}{\partial\theta} + \frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \theta_0) + \left[\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top} - \frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\right](\tilde\theta - \theta_0)

Multiplying by \sqrt{n} (here \log L denotes the average log-likelihood):

\sqrt{n}\,\frac{\partial\log L(\tilde\theta)}{\partial\theta} = \underbrace{\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta}}_{\xrightarrow{d} N(0,\,-E[H(w_i;\theta_0)])} + \underbrace{\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}}_{\xrightarrow{p} E[H(w_i;\theta_0)]}\underbrace{\sqrt{n}(\tilde\theta - \theta_0)}_{O_p(1)} + \underbrace{\left[\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top} - \frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\right]}_{o_p(1)}\sqrt{n}(\tilde\theta - \theta_0)

= \sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} + E[H(w_i;\theta_0)]\sqrt{n}(\tilde\theta - \theta_0) + o_p(1)
= O_p(1) + O_p(1)O_p(1) + o_p(1) = O_p(1)
Preliminaries to the next two statistics

Then

R(\tilde\theta)^\top\sqrt{n}\,\lambda_n = -\underbrace{\sqrt{n}\,s(\tilde\theta)}_{O_p(1)} \implies R(\tilde\theta)^\top\sqrt{n}\,\lambda_n = O_p(1)

Since R(\tilde\theta) \xrightarrow{p} R_0, we obtain:

R(\tilde\theta)^\top\sqrt{n}\,\lambda_n = R_0^\top\sqrt{n}\,\lambda_n + \big[R(\tilde\theta) - R_0\big]^\top\sqrt{n}\,\lambda_n = R_0^\top\sqrt{n}\,\lambda_n + o_p(1)

Substituting these three equations into the FOCs, we obtain:

\begin{pmatrix} \underbrace{E[H(w_i;\theta_0)]}_{K\times K} & \underbrace{R_0^\top}_{K\times r} \\ \underbrace{R_0}_{r\times K} & \underbrace{0}_{r\times r} \end{pmatrix}\begin{pmatrix} \underbrace{\sqrt{n}(\tilde\theta - \theta_0)}_{K\times 1} \\ \underbrace{\sqrt{n}\,\lambda_n}_{r\times 1} \end{pmatrix} = \begin{pmatrix} \underbrace{-\sqrt{n}\,\dfrac{\partial\log L(\theta_0)}{\partial\theta}}_{K\times 1} \\ \underbrace{0}_{r\times 1} \end{pmatrix} + o_p(1)
Preliminaries to the next two statistics

This can be solved using the partitioned inverse formula:

\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}A_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \\ -(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}A_{21}A_{11}^{-1} & (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \end{pmatrix}

Then:

\begin{pmatrix} \sqrt{n}(\tilde\theta - \theta_0) \\ \sqrt{n}\,\lambda_n \end{pmatrix} = \begin{pmatrix} -E[H(w_i;\theta_0)]^{-1} + E[H(w_i;\theta_0)]^{-1}R_0^\top\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}R_0E[H(w_i;\theta_0)]^{-1} \\ -\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}R_0E[H(w_i;\theta_0)]^{-1} \end{pmatrix}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} + o_p(1)
Proof LR.
By a second-order Taylor expansion:

\log L(\tilde\theta) = \log L(\hat\theta) + \frac{\partial\log L(\hat\theta)}{\partial\theta}^{\!\top}(\tilde\theta - \hat\theta) + \frac{1}{2}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \hat\theta)

where \bar\theta = \alpha\tilde\theta + (1-\alpha)\hat\theta for some \alpha \in [0,1]. Recall that \frac{\partial\log L(\hat\theta)}{\partial\theta} = 0 and that \frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top} \xrightarrow{p} E[H(w_i;\theta_0)] (here \log L is the average log-likelihood). It follows that:

\log L(\tilde\theta) - \log L(\hat\theta) = \frac{1}{2}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \hat\theta)
2\big[\log L(\tilde\theta) - \log L(\hat\theta)\big] = (\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}(\tilde\theta - \hat\theta)
2\cdot n\cdot\big[\log L(\tilde\theta) - \log L(\hat\theta)\big] = \sqrt{n}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top}\sqrt{n}(\tilde\theta - \hat\theta)
Proof LR.
Adding and subtracting \sqrt{n}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\sqrt{n}(\tilde\theta - \hat\theta):

2\cdot n\cdot\big[\log L(\tilde\theta) - \log L(\hat\theta)\big] = \sqrt{n}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\sqrt{n}(\tilde\theta - \hat\theta) + \underbrace{\sqrt{n}(\tilde\theta - \hat\theta)^\top\left[\frac{\partial^2\log L(\bar\theta)}{\partial\theta\,\partial\theta^\top} - \frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\right]\sqrt{n}(\tilde\theta - \hat\theta)}_{o_p(1)\,=\,O_p(1)o_p(1)O_p(1)}
= \sqrt{n}(\tilde\theta - \hat\theta)^\top\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta^\top}\sqrt{n}(\tilde\theta - \hat\theta) + o_p(1)
= \sqrt{n}(\tilde\theta - \hat\theta)^\top E[H(w_i;\theta_0)]\sqrt{n}(\tilde\theta - \hat\theta) + o_p(1)

Therefore

2\cdot n\cdot\big[\log L(\hat\theta) - \log L(\tilde\theta)\big] = -\sqrt{n}(\tilde\theta - \hat\theta)^\top E[H(w_i;\theta_0)]\sqrt{n}(\tilde\theta - \hat\theta) + o_p(1)
Proof LR.
We know that

\sqrt{n}(\tilde\theta - \hat\theta) = \sqrt{n}(\tilde\theta - \theta_0) - \sqrt{n}(\hat\theta - \theta_0)
= \Big[-E[H(w_i;\theta_0)]^{-1} + E[H(w_i;\theta_0)]^{-1}R_0^\top\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}R_0E[H(w_i;\theta_0)]^{-1}\Big]\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} - \{-E[H(w_i;\theta_0)]\}^{-1}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} + o_p(1)
= E[H(w_i;\theta_0)]^{-1}R_0^\top\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}R_0E[H(w_i;\theta_0)]^{-1}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} + o_p(1)

Then:

2\cdot n\cdot\big[\log L(\hat\theta) - \log L(\tilde\theta)\big] = -\sqrt{n}(\tilde\theta - \hat\theta)^\top E[H(w_i;\theta_0)]\sqrt{n}(\tilde\theta - \hat\theta) + o_p(1)
= -\left(\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta}\right)^{\!\top} E[H(w_i;\theta_0)]^{-1}R_0^\top\big(R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\big)^{-1}\underbrace{R_0E[H(w_i;\theta_0)]^{-1}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta}}_{\xrightarrow{d} N\left(0,\,-R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\right)} + o_p(1)
Proof LR.
Then:

R_0E[H(w_i;\theta_0)]^{-1}\sqrt{n}\,\frac{\partial\log L(\theta_0)}{\partial\theta} \xrightarrow{d} N\Big(0,\; R_0E[H(w_i;\theta_0)]^{-1}\big(-E[H(w_i;\theta_0)]\big)E[H(w_i;\theta_0)]^{-1}R_0^\top\Big) = N\Big(0,\; -R_0E[H(w_i;\theta_0)]^{-1}R_0^\top\Big)

This asymptotic variance cancels against the central term of the quadratic form, and hence we are looking at the squared norm of an \#r-dimensional standard normal vector:

LR \equiv 2\cdot n\cdot\big[\log L(\hat\theta) - \log L(\tilde\theta)\big] \xrightarrow{d} \chi^2(\#r)
