
ST2132

Topic 06: Large Sample Theory for MLE


part a: Asymptotic Normality, Confidence Intervals

Semester 1 20/21

1 / 22
Introduction

I The fact that many MLEs are consistent and asymptotically normal is
of great importance. In particular, large-sample confidence intervals
are feasible.

I This can be viewed as the parametric version of the fact in survey
sampling that X̄ is asymptotically normal. There, the population is
non-parametric, i.e., not described by a simple density.

I Not surprisingly, the common underlying tool is the Central Limit
Theorem. We will explore that in the heuristic proof in part b.

2 / 22
Theorem: Asymptotic normality of MLE

I Let X1 , . . . , Xn be IID with density f (·|θ), where θ is an unknown
constant in the parameter space Θ ⊂ R. Let θ̂ be the MLE of θ. As
n → ∞,

      √(nI(θ)) (θ̂ − θ) → N(0, 1) in distribution

I Consequently, for large n, approximately

      θ̂ ∼ N(θ, I(θ)⁻¹/n)

In particular, as n → ∞, θ̂ → θ: the MLE is consistent.
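A minimal Monte Carlo sketch of this result (my own illustration, not from the slides; it assumes numpy is available and uses the Exponential(rate λ) model, for which the MLE is 1/X̄ and I(λ) = 1/λ²):

```python
import numpy as np

# Monte Carlo sketch (illustration only): for Exponential(rate lam) data,
# the MLE is 1/xbar and I(lam) = 1/lam^2, so sqrt(n I(lam)) (mle - lam)
# should be approximately N(0, 1) when n is large.
rng = np.random.default_rng(0)
lam, n, reps = 2.0, 500, 10_000

x = rng.exponential(scale=1 / lam, size=(reps, n))   # reps samples of size n
mle = 1 / x.mean(axis=1)                             # MLE of the rate in each sample
z = np.sqrt(n / lam**2) * (mle - lam)                # sqrt(n I(lam)) (theta_hat - theta)

print(z.mean(), z.std())                             # should be close to 0 and 1
```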

3 / 22
Asymptotic normality of MLE: the vector version

I Let X1 , . . . , Xn be IID with density f (·|θ), where θ is an unknown
vector in the parameter space Θ ⊂ Rp . Let θ̂ be the MLE of θ. As
n → ∞,

      √(nI(θ)) (θ̂ − θ) → N(0, Ip ) in distribution

I For large n, approximately

      θ̂ ∼ N(θ, I(θ)⁻¹/n)

4 / 22
Interpretation

I Recall that nI(θ) is the amount of information in n IID samples with
density f (·|θ).

I The asymptotic variance of the MLE is inversely proportional to the
sample size n. The notation I(θ)⁻¹ emphasises this point: it is
similar to σ² in sample survey (slide 16, 3 sur sam a.pdf).

5 / 22
The Poisson

I X1 , . . . , Xn IID Poisson(λ). θ = λ. θ̂ = X̄ . I(θ) = 1/λ. According to
the theorem, if n is large, approximately

      X̄ ∼ N(λ, λ/n)
I The theorem confirms what we already know (slides 11 and 13,
4 par est a.pdf).
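A quick simulation sketch of this approximation (my own check, assuming numpy; λ and n are arbitrary choices):

```python
import numpy as np

# Sketch: for Poisson(lam) data, the theorem says X-bar ~ N(lam, lam/n) approximately.
rng = np.random.default_rng(1)
lam, n, reps = 3.0, 200, 10_000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
print(xbar.mean(), xbar.var())   # approximately lam and lam/n
print(lam, lam / n)              # theoretical values
```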

6 / 22
The normal case (a)

I X1 , . . . , Xn IID N(µ, σ²). θ = (µ, σ). θ̂ = (X̄ , σ̂).

      I(θ) = [ 1/σ²  0 ; 0  2/σ² ]

I The vector-version of the theorem implies that if n is large,
approximately

      (X̄ , σ̂) ∼ N( (µ, σ), [ σ²/n  0 ; 0  σ²/(2n) ] )

This is a bivariate normal distribution.

I We already know X̄ ∼ N(µ, σ²/n) exactly and X̄ ⊥ σ̂ . The
approximate normality of σ̂ is new.
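A simulation sketch of the new claim about σ̂ (my own check, assuming numpy; µ, σ and n are arbitrary):

```python
import numpy as np

# Sketch: sigma_hat (the MLE, dividing by n) should be approximately
# N(sigma, sigma^2/(2n)) for large n.
rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 3.0, 400, 10_000

x = rng.normal(mu, sigma, size=(reps, n))
sigma_hat = x.std(axis=1, ddof=0)                # MLE of sigma in each sample

print(sigma_hat.mean(), sigma_hat.std())         # approx sigma and sigma/sqrt(2n)
print(sigma, sigma / np.sqrt(2 * n))             # theoretical values
```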
7 / 22
The normal case (b)

I X1 , . . . , Xn IID N(µ, ν = σ²). θ = (µ, ν). θ̂ = (X̄ , σ̂²).

      I(θ) = [ 1/σ²  0 ; 0  1/(2σ⁴) ]

I If n is large,

      (X̄ , σ̂²) ∼ N( (µ, σ²), [ σ²/n  0 ; 0  2σ⁴/n ] )

I Again, the approximate normality of σ̂² is new.

8 / 22
The HWE (Rice page 283)
I Let W1 , . . . , Wn be IID Multinomial(1, p), where
p = ((1 − θ)², 2θ(1 − θ), θ²). Wi takes values (1,0,0), (0,1,0) and
(0,0,1) with these probabilities. W1 + · · · + Wn = X ∼ Multinomial(n, p).

I The random loglikelihood is

      L(θ) = Σ_{i=1}^n (Wi,1 log p1 + Wi,2 log p2 + Wi,3 log p3 )
           = (2X1 + X2 ) log(1 − θ) + (X2 + 2X3 ) log θ + X2 log 2

The MLE based on the W’s is the same as that based on X:

      θ̂ = (X2 + 2X3 )/(2n)
9 / 22
The HWE (continued)
I To avoid confusion with the Fisher information based on X, let the
Fisher information based on W be

      I∗(θ) = 2 / (θ(1 − θ))

I We apply the theorem to the n IID W’s. For large n, approximately

      θ̂ ∼ N(θ, θ(1 − θ)/(2n))

I Let I(θ) = nI∗(θ) be the Fisher information based on X. Then

      θ̂ ∼ N(θ, I(θ)⁻¹)

It is hard to apply the theorem directly on X, since the sample size is 1.
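A small numerical sketch of the HWE estimate and its plug-in SE (the genotype counts below are made up for illustration, not Rice's data; assumes numpy):

```python
import numpy as np

# Sketch: MLE and plug-in SE for theta under HWE, from hypothetical genotype
# counts (x1, x2, x3) with probabilities ((1-theta)^2, 2 theta(1-theta), theta^2).
x1, x2, x3 = 360, 480, 160
n = x1 + x2 + x3

theta_hat = (x2 + 2 * x3) / (2 * n)                   # MLE from the slide
se = np.sqrt(theta_hat * (1 - theta_hat) / (2 * n))   # from theta_hat ~ N(theta, theta(1-theta)/(2n))
print(theta_hat, se)
```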
10 / 22
The general trinomial distribution
I Let W1 , . . . , Wn be IID Multinomial(1, p), where θ = (p1 , p2 ). As in
HWE, the MLE based on the W’s is the same as that based on
X = W1 + · · · + Wn : θ̂ = (X1 /n, X2 /n).

I The Fisher information based on W is

      I∗(θ) = [ 1/p1 + 1/p3  1/p3 ; 1/p3  1/p2 + 1/p3 ]

I Applying the vector version of the theorem to the W’s, for large n,
approximately

      θ̂ ∼ N( (p1 , p2 ), [ p1 (1 − p1 )/n  −p1 p2 /n ; −p1 p2 /n  p2 (1 − p2 )/n ] )

We already know the expectation and variance are exact, but the
approximate normality is new.
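A quick numerical check (my own, assuming numpy) that inverting the per-observation information I∗(θ) gives the covariance matrix shown above:

```python
import numpy as np

# Check numerically that the inverse of I*(theta) equals
# [[p1(1-p1), -p1 p2], [-p1 p2, p2(1-p2)]]; p1 and p2 are arbitrary.
p1, p2 = 0.2, 0.5
p3 = 1 - p1 - p2

I_star = np.array([[1/p1 + 1/p3, 1/p3],
                   [1/p3, 1/p2 + 1/p3]])
print(np.linalg.inv(I_star))
print(np.array([[p1*(1 - p1), -p1*p2],
                [-p1*p2, p2*(1 - p2)]]))
```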
11 / 22
The SE

I Recall that the SE of an estimate of θ is defined as the SD of the
corresponding estimator θ̂. For a maximum likelihood estimate, the
theorem implies that

      SE = SD(θ̂) ≈ √(I(θ)⁻¹/n)

I Since θ is unknown, we use the bootstrap: calculate the Fisher
information at the estimate instead. If we switch notation, denoting
the estimate as θ̂, then

      SE ≈ √(I(θ̂)⁻¹/n)

12 / 22
Sufficient conditions for theorem

Suppose there is δ > 0 such that for each x ∈ R,

I The derivatives ∂ᵏ/∂θᵏ log f (x|θ), k = 1, 2, 3,
exist on (θ − δ, θ + δ) and are continuous.

I |∂³/∂θ³ log f (x|θ)| < M(x)
on (θ − δ, θ + δ), with Eθ (M) < K , a constant.

13 / 22
Sufficient conditions for theorem (continued)

I These conditions are satisfied in all our examples, and in practically
all applications.

I The first condition allows interchanging of differentiation and
integration.

I For the vector version, similar conditions are required, and I(θ) is
assumed to be invertible.

14 / 22
Random interval

This prepares the construction of large-sample CI for θ. Let θ̂ be the ML
estimator of θ. For large n,

      1 − α ≈ Pr( −zα/2 ≤ (θ̂ − θ)/√(I(θ)⁻¹/n) ≤ zα/2 )

so

      1 − α ≈ Pr( θ̂ − zα/2 √(I(θ)⁻¹/n) ≤ θ ≤ θ̂ + zα/2 √(I(θ)⁻¹/n) )

Unlike slide 3 of 3 sur sam b.pdf, in general SD(θ̂) is not exactly
√(I(θ)⁻¹/n).
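A simulation sketch of the coverage statement (my own illustration using the Poisson model, assuming numpy and scipy; as on this slide, the interval uses I(θ)⁻¹ at the true θ):

```python
import numpy as np
from scipy.stats import norm

# Sketch: empirical coverage of the random interval for a Poisson rate, with
# I(lam)^{-1} = lam evaluated at the true lam.
rng = np.random.default_rng(3)
lam, n, reps, alpha = 3.0, 200, 5_000, 0.05
z = norm.ppf(1 - alpha / 2)

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)   # ML estimator in each replicate
half = z * np.sqrt(lam / n)                            # z_{alpha/2} sqrt(I(lam)^{-1}/n)
covered = (xbar - half <= lam) & (lam <= xbar + half)
print(covered.mean())                                  # close to 1 - alpha
```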

15 / 22
Confidence interval

I Let θ̂ be the ML estimate of θ. For large n, an approximate
(1 − α)-CI for θ is

      ( θ̂ − zα/2 √(I(θ̂)⁻¹/n) , θ̂ + zα/2 √(I(θ̂)⁻¹/n) )

I The approximate SE √(I(θ)⁻¹/n) is estimated by √(I(θ̂)⁻¹/n) (the
bootstrap).

I The CI is a realisation of a random interval, so it is fixed. θ̂ is a
realisation of the ML estimator. θ is either in the confidence interval
or not, and we will not know which is the case.
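A generic sketch of this interval (the function name and arguments are my own, not from the slides; assumes numpy and scipy):

```python
import numpy as np
from scipy.stats import norm

def wald_ci(theta_hat, info_at_estimate, n, alpha=0.05):
    """Large-sample CI: theta_hat -/+ z_{alpha/2} * sqrt(I(theta_hat)^{-1} / n)."""
    z = norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(1 / (info_at_estimate * n))
    return theta_hat - half, theta_hat + half

# Hypothetical usage: estimate 0.43 with per-observation information 5.4 from n = 150.
print(wald_ci(0.43, 5.4, 150))
```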

16 / 22
CI for Poisson rate λ

I θ = λ, estimated by θ̂ = x̄. I(θ)⁻¹ = λ, estimated by I(θ̂)⁻¹ = x̄.

I For large n, an approximate (1 − α)-CI for λ is

      ( x̄ − zα/2 √(x̄/n) , x̄ + zα/2 √(x̄/n) )

This is the same as slide 13 of 4 par est a.pdf.
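A worked sketch on simulated data (the rate and sample size are made up; assumes numpy and scipy):

```python
import numpy as np
from scipy.stats import norm

# Sketch: approximate 95% CI for a Poisson rate, ( xbar - z sqrt(xbar/n), xbar + z sqrt(xbar/n) ).
rng = np.random.default_rng(4)
x = rng.poisson(lam=4.0, size=200)       # simulated data, true rate 4

xbar, n = x.mean(), x.size
z = norm.ppf(0.975)
half = z * np.sqrt(xbar / n)
print(xbar - half, xbar + half)
```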

17 / 22
CI for µ and σ from N(µ, σ 2 )

I θ = (µ, σ), estimated by θ̂ = (x̄, σ̂). I(θ)⁻¹ is estimated by

      I(θ̂)⁻¹ = [ σ̂²  0 ; 0  σ̂²/2 ]

I For large n, an approximate (1 − α)-CI for µ is

      ( x̄ − zα/2 σ̂/√n , x̄ + zα/2 σ̂/√n )

an approximate (1 − α)-CI for σ is

      ( σ̂ − zα/2 σ̂/√(2n) , σ̂ + zα/2 σ̂/√(2n) )
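A worked sketch on simulated data (the numbers are made up; assumes numpy and scipy), using the MLE σ̂ with divisor n:

```python
import numpy as np
from scipy.stats import norm

# Sketch: approximate 95% CIs for mu and sigma from N(mu, sigma^2) data.
rng = np.random.default_rng(5)
x = rng.normal(10.0, 2.0, size=500)      # simulated data

n = x.size
xbar = x.mean()
sigma_hat = x.std(ddof=0)                # MLE of sigma (divisor n, not n - 1)
z = norm.ppf(0.975)

print(xbar - z * sigma_hat / np.sqrt(n),
      xbar + z * sigma_hat / np.sqrt(n))          # CI for mu
print(sigma_hat - z * sigma_hat / np.sqrt(2 * n),
      sigma_hat + z * sigma_hat / np.sqrt(2 * n)) # CI for sigma
```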

18 / 22
CI for µ and σ 2 from N(µ, σ 2 )

I θ = (µ, σ²), estimated by θ̂ = (x̄, σ̂²). I(θ)⁻¹ is estimated by

      I(θ̂)⁻¹ = [ σ̂²  0 ; 0  2σ̂⁴ ]

I For large n, an approximate (1 − α)-CI for σ² is

      ( σ̂² − zα/2 σ̂² √(2/n) , σ̂² + zα/2 σ̂² √(2/n) )

For µ, the CI is the same as in the previous slide.

19 / 22
The bivariate normal distribution
I The density on Rice page 81 can be written as

      f (x) = 1/(2π |Σ|^(1/2)) exp( −(1/2)(x − µ)′ Σ⁻¹ (x − µ) )

      x = (x1 , x2 )′,  µ = (µ1 , µ2 )′,  Σ = [ σ1²  ρσ1 σ2 ; ρσ1 σ2  σ2² ]

We write X ∼ N(µ, Σ).

I It can be shown that any bivariate normal X can be written as
X = AZ + b ∼ N(b, AA′), where Z is 2 × 1 with IID N(0,1) components,
and A (2 × 2) and b (2 × 1) are constants.

I The multivariate normal density (x is p × 1) looks the same, except
that the power of 2π in the denominator is p/2.
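A simulation sketch of the representation X = AZ + b (my own, assuming numpy; A is taken to be the Cholesky factor of Σ, one valid choice since then AA′ = Σ):

```python
import numpy as np

# Sketch: generate bivariate normal draws as X = A Z + mu and check the
# sample mean and covariance against mu and Sigma.
rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
sigma1, sigma2, rho = 2.0, 1.5, 0.6
Sigma = np.array([[sigma1**2,         rho*sigma1*sigma2],
                  [rho*sigma1*sigma2, sigma2**2]])

A = np.linalg.cholesky(Sigma)                 # A A' = Sigma
Z = rng.standard_normal(size=(2, 10_000))     # IID N(0,1) components
X = A @ Z + mu[:, None]                       # each column ~ N(mu, Sigma)

print(X.mean(axis=1))                         # approx mu
print(np.cov(X))                              # approx Sigma
```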
20 / 22
Examples

I Let Y1 , . . . , Yn be IID N(µ, σ²). What is the distribution of
(Y1 , . . . , Yn )?

I Let Y1 , . . . , Yn be independent, with Yi ∼ N(µi , σ²). What is the
distribution of (Y1 , . . . , Yn )?

21 / 22
Linear regression

I Let Y1 , . . . , Yn be random variables with

      Yi = β1 xi1 + · · · + βp xip + εi

where
      X is a fixed known n × p matrix.
      p × 1 β is fixed unknown.
      n × 1 ε ∼ N(0, σ² In ), with σ² fixed unknown.
What is the joint distribution of the n × 1 Y? More compactly, we
can write

      Y = Xβ + ε

I Given realisation y of Y, how can we get ML estimates of β and σ²?
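A sketch of the standard answer to the last question (least squares for β, and the residual sum of squares divided by n for σ²); the simulated design and coefficients below are made up, and numpy is assumed:

```python
import numpy as np

# Sketch: ML estimates in the Gaussian linear model Y = X beta + eps,
# eps ~ N(0, sigma^2 I_n): beta_hat by least squares, sigma2_hat = RSS / n.
rng = np.random.default_rng(7)
n, p = 100, 3
X = rng.normal(size=(n, p))                        # made-up design matrix
beta = np.array([1.0, -0.5, 2.0])                  # made-up true coefficients
y = X @ beta + rng.normal(scale=0.8, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # ML estimate of beta
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n                     # ML estimate of sigma^2
print(beta_hat, sigma2_hat)
```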

22 / 22
