
Maximum Likelihood Methods

Instructor: Xinyue Li

Department of Data Science, City University of Hong Kong

Lecture 7

Motivating example

Motivating example: Given an unfair coin, or p-coin, such that


X = \begin{cases} 1 \text{ (head)} & \text{with probability } p \\ 0 \text{ (tail)} & \text{with probability } 1 - p \end{cases}

how would you determine the value of p?


You need to try the coin several times, say, three times. What you obtain is "HHT".
Draw a conclusion from the experiment you just made.

Solutions
Rationale: The choice of the parameter p should be the value that
maximizes the probability of the sample.
P(X_1 = 1, X_2 = 1, X_3 = 0) = P(X_1 = 1)\, P(X_2 = 1)\, P(X_3 = 0) = p^2 (1 - p)

Maximize f(p) = p^2 (1 - p) . . .
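As a quick check of the calculus, the same maximizer can be found numerically. The sketch below is illustrative only (it assumes NumPy is available and simply scans a grid of candidate p values); setting f'(p) = 2p - 3p^2 = 0 gives p̃ = 2/3, and the grid search recovers the same value for the sample "HHT".

```python
import numpy as np

def likelihood(p):
    """L(p) = P(HHT) = p^2 * (1 - p) for a p-coin."""
    return p ** 2 * (1 - p)

grid = np.linspace(0.0, 1.0, 10_001)          # candidate values of p
p_hat = grid[np.argmax(likelihood(grid))]     # grid point with the largest likelihood
print(p_hat)                                  # approximately 0.6667, i.e. 2/3
```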
Bernoulli (p)

X_1, ..., X_n are i.i.d. random variables, each following Bernoulli(p).


Suppose the outcomes of the random sample are X_1 = k_1, ..., X_n = k_n.
What is your choice of p based on the above random sample?

Statistic

X_1, ..., X_n are i.i.d. random variables, each following the same given pdf.
A statistic or an estimator is a function of the random sample.
A statistic/estimator is a random variable!
e.g.,
p̂ = \frac{1}{n} \sum_{i=1}^{n} X_i

The outcome of a statistic/estimator is called an estimate. e.g.,


p̃ = \frac{1}{n} \sum_{i=1}^{n} k_i

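To make the estimator/estimate distinction concrete, here is a small illustrative simulation (not from the slides; it assumes NumPy and uses made-up values of p and n): the estimator p̂ is a random variable, so each new random sample produces a different estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n = 0.7, 100                          # hypothetical true parameter and sample size

for trial in range(3):
    sample = rng.binomial(1, p_true, size=n)  # one Bernoulli(p) random sample of size n
    estimate = sample.mean()                  # the realized estimate (1/n) * sum(k_i)
    print(f"sample {trial + 1}: estimate = {estimate:.3f}")
```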
Estimating parameters

MLE (Maximum Likelihood Estimation)


MME (Method of Moments Estimation)

Maximum Likelihood Estimation

Definition 1
For a random sample of size n from the discrete (resp. continuous) population pdf p_X(k; θ) (resp. f_Y(y; θ)), the likelihood function L(θ) is the product of the pdf evaluated at X_i = k_i (resp. Y_i = y_i), i.e.,

L(θ) = \prod_{i=1}^{n} p_X(k_i; θ) \quad \left( \text{resp. } L(θ) = \prod_{i=1}^{n} f_Y(y_i; θ) \right)

Definition 2
Let L(θ) be as defined above. If θ̃ is a value of the parameter such that L(θ̃) ≥ L(θ) for all possible values of θ, then we call θ̃ the maximum likelihood estimate for θ.

How to obtain the MLE? Often, but not always, the MLE can be obtained by setting the first derivative of the (log-)likelihood equal to zero:

Poisson distribution:
Poisson distribution: p_X(k) = e^{-λ} \frac{λ^k}{k!}, k = 0, 1, ....

L(λ) = \prod_{i=1}^{n} e^{-λ} \frac{λ^{k_i}}{k_i!} = e^{-nλ}\, λ^{\sum_{i=1}^{n} k_i} \left( \prod_{i=1}^{n} k_i! \right)^{-1}

\ln L(λ) = -nλ + \left( \sum_{i=1}^{n} k_i \right) \ln λ - \ln\left( \prod_{i=1}^{n} k_i! \right)

\frac{d}{dλ} \ln L(λ) = -n + \frac{1}{λ} \sum_{i=1}^{n} k_i

\frac{d}{dλ} \ln L(λ) = 0 \;\Longrightarrow\; \tilde{λ} = \frac{1}{n} \sum_{i=1}^{n} k_i =: k̄

The critical point is indeed a global maximum because

\frac{d^2}{dλ^2} \ln L(λ) = -\frac{1}{λ^2} \sum_{i=1}^{n} k_i < 0
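A numerical cross-check of the closed form (illustrative only, not from the lecture; it assumes NumPy/SciPy and a made-up true rate): maximizing ln L(λ) over a grid lands on the sample mean k̄.

```python
import numpy as np
from scipy.special import gammaln             # ln(k!) = gammaln(k + 1)

rng = np.random.default_rng(1)
k = rng.poisson(lam=4.2, size=500)            # simulated Poisson sample (true lambda = 4.2 assumed)

def log_lik(lam):
    """Poisson log-likelihood: -n*lambda + (sum k_i) ln(lambda) - sum ln(k_i!)."""
    return -len(k) * lam + k.sum() * np.log(lam) - gammaln(k + 1).sum()

grid = np.linspace(0.1, 10.0, 5_000)
lam_grid = grid[np.argmax([log_lik(l) for l in grid])]
print(lam_grid, k.mean())                     # the two values agree up to grid resolution
```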
Exponential distribution:

Exponential distribution: f_Y(y) = λ e^{-λy} for y ≥ 0.

L(λ) = \prod_{i=1}^{n} λ e^{-λ y_i} = λ^n \exp\left( -λ \sum_{i=1}^{n} y_i \right)

\ln L(λ) = n \ln λ - λ \sum_{i=1}^{n} y_i

\frac{d}{dλ} \ln L(λ) = \frac{n}{λ} - \sum_{i=1}^{n} y_i

\frac{d}{dλ} \ln L(λ) = 0 \;\Longrightarrow\; \tilde{λ} = \frac{n}{\sum_{i=1}^{n} y_i} =: \frac{1}{ȳ}

Recall that the exponential distribution models waiting times...

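The same kind of check for the exponential case (a sketch, assuming SciPy is available; the rate 0.8 is made up): minimizing the negative log-likelihood numerically reproduces λ̃ = 1/ȳ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
y = rng.exponential(scale=1 / 0.8, size=400)   # waiting times with assumed true rate lambda = 0.8

def neg_log_lik(lam):
    """Negative of ln L(lambda) = n ln(lambda) - lambda * sum(y_i)."""
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / y.mean())                     # both close to the true rate 0.8
```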
Gamma distribution:
Gamma distribution: f_Y(y; λ) = \frac{λ^r}{Γ(r)} y^{r-1} e^{-λy} for y ≥ 0, with r > 1 known.

L(λ) = \prod_{i=1}^{n} \frac{λ^r}{Γ(r)} y_i^{r-1} e^{-λ y_i} = λ^{rn}\, Γ(r)^{-n} \left( \prod_{i=1}^{n} y_i^{r-1} \right) \exp\left( -λ \sum_{i=1}^{n} y_i \right)

\ln L(λ) = rn \ln λ - n \ln Γ(r) + \ln\left( \prod_{i=1}^{n} y_i^{r-1} \right) - λ \sum_{i=1}^{n} y_i

\frac{d}{dλ} \ln L(λ) = \frac{rn}{λ} - \sum_{i=1}^{n} y_i

\frac{d}{dλ} \ln L(λ) = 0 \;\Longrightarrow\; \tilde{λ} = \frac{rn}{\sum_{i=1}^{n} y_i} = \frac{r}{ȳ}.

When r = 1, this reduces to the exponential distribution case.


If r is also unknown, what is the MLE?
It becomes much more complicated: there is no closed-form solution, only numerical ones (see the sketch below).

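When r is also unknown, one maximizes the log-likelihood numerically. The following is a hedged sketch (assuming SciPy; the shape 3 and rate 2 are made-up values), not a prescribed method from the lecture.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(3)
y = rng.gamma(shape=3.0, scale=1 / 2.0, size=1_000)   # assumed true r = 3, lambda = 2

def neg_log_lik(params):
    """Negative gamma log-likelihood in (r, lambda)."""
    r, lam = params
    if r <= 0 or lam <= 0:
        return np.inf                                  # keep the optimizer in the valid region
    return -(len(y) * (r * np.log(lam) - gammaln(r))
             + (r - 1) * np.log(y).sum() - lam * y.sum())

res = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)                                           # numerical MLE, close to (3, 2)
```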
Normal distribution:

The normal distribution has more than one parameter.

The PDF: f_Y(y; μ, σ^2) = \frac{1}{\sqrt{2πσ^2}}\, e^{-\frac{(y-μ)^2}{2σ^2}}, \quad y ∈ ℝ.

L(μ, σ^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2πσ^2}}\, e^{-\frac{(y_i-μ)^2}{2σ^2}} = (2πσ^2)^{-n/2} \exp\left( -\frac{1}{2σ^2} \sum_{i=1}^{n} (y_i - μ)^2 \right)

\ln L(μ, σ^2) = -\frac{n}{2} \ln(2πσ^2) - \frac{1}{2σ^2} \sum_{i=1}^{n} (y_i - μ)^2

\frac{∂}{∂μ} \ln L(μ, σ^2) = \frac{1}{σ^2} \sum_{i=1}^{n} (y_i - μ)

\frac{∂}{∂σ^2} \ln L(μ, σ^2) = -\frac{n}{2σ^2} + \frac{1}{2σ^4} \sum_{i=1}^{n} (y_i - μ)^2

Setting both partial derivatives to zero gives

\tilde{μ} = ȳ, \qquad \tilde{σ}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - ȳ)^2.

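A numerical sanity check for the two-parameter case (illustrative only; assumes SciPy, with made-up true values): the optimizer lands on the sample mean and the 1/n average of squared deviations, matching the closed forms above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
y = rng.normal(loc=1.5, scale=2.0, size=1_000)

def neg_log_lik(params):
    """Negative normal log-likelihood in (mu, sigma^2)."""
    mu, sigma2 = params
    if sigma2 <= 0:
        return np.inf
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + ((y - mu) ** 2).sum() / (2 * sigma2)

res = minimize(neg_log_lik, x0=[0.0, 1.0], method="Nelder-Mead")
print(res.x)                                    # numerical MLE (mu, sigma^2)
print(y.mean(), ((y - y.mean()) ** 2).mean())   # closed-form MLEs from the slide
```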
Uniform Distribution

Example 1
Uniform distribution on [a, b] with a < b:

f_Y(y; a, b) = \begin{cases} \frac{1}{b-a} & \text{if } y ∈ [a, b] \\ 0 & \text{otherwise.} \end{cases}

What is the MLE for a and b?

Properties of Estimators

Question: Estimators are not in general unique (MLE, MME, ...). How do we select one estimator?
Recall: For a random sample of size n from a population with a given pdf, we have X_1, ..., X_n, which are i.i.d. random variables. The estimator θ̂ is a function of the X_i's:

θ̂ = θ̂(X_1, ..., X_n)

Criteria:
Unbiasedness (mean)
Efficiency, i.e., the minimum-variance estimator (variance)
Sufficiency
Consistency (asymptotic behavior)

Unbiasedness

Definition 3
Given a random sample of size n whose population distribution depends on an unknown parameter θ, let θ̂ be an estimator of θ. Then θ̂ is called unbiased if E(θ̂) = θ, and θ̂ is called asymptotically unbiased if lim_{n→∞} E(θ̂) = θ.

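A small Monte Carlo illustration of the definition (not from the slides; it assumes NumPy and made-up values of θ, n, and the number of replications): the sample mean X̄ satisfies E(X̄) = θ, so averaging many independent estimates recovers θ.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, n_reps = 5.0, 20, 100_000            # assumed true mean, sample size, replications

# n_reps independent samples of size n; one estimate (the sample mean) per sample
estimates = rng.normal(loc=theta, scale=2.0, size=(n_reps, n)).mean(axis=1)
print(estimates.mean())                        # close to theta = 5.0, consistent with E(theta_hat) = theta
```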
Unbiasedness

Example 2
Let X_1, ..., X_n be a random sample of size n with the unknown parameter σ^2 = Var(X). Consider two statistics:

σ̂^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - X̄)^2 \qquad Unbiased?

Unbiasedness

Example 3
Let Y_1, ..., Y_n be a random sample from the exponential distribution with parameter λ. The MLE is λ̂ = 1/Ȳ. Is it unbiased?

Efficiency

Definition 4
Let θ̂_1 and θ̂_2 be two unbiased estimators for a parameter θ. If Var(θ̂_2) < Var(θ̂_1), then we say that θ̂_2 is more efficient than θ̂_1. The relative efficiency of θ̂_2 w.r.t. θ̂_1 is the ratio Var(θ̂_2) / Var(θ̂_1).

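For a concrete comparison (an illustrative simulation, not from the slides; NumPy with made-up values): both the single observation X_1 and the sample mean X̄ are unbiased for the mean, but X̄ has the smaller variance, so it is the more efficient of the two; the simulated relative efficiency is about 1/n.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n, n_reps = 3.0, 25, 200_000

samples = rng.normal(loc=mu, scale=1.0, size=(n_reps, n))
theta_hat_1 = samples[:, 0]          # estimator 1: the first observation
theta_hat_2 = samples.mean(axis=1)   # estimator 2: the sample mean

print(theta_hat_1.var(), theta_hat_2.var())        # about 1 and 1/25
print(theta_hat_2.var() / theta_hat_1.var())       # relative efficiency, about 1/n
```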
Efficiency

Example 4
Let X_1, ..., X_n be a random sample of size n with the unknown parameter θ = E(X) (suppose σ^2 = Var(X) < ∞).
(1) Show that as long as \sum_{i=1}^{n} a_i = 1, θ̂ = \sum_{i=1}^{n} a_i X_i is an unbiased estimator.
(2) Among all unbiased estimators of the form θ̂ = \sum_{i=1}^{n} a_i X_i with \sum_{i=1}^{n} a_i = 1, find the one with the smallest variance.

Minimum Variance Estimator (MVE)

Question: Can one identify the unbiased estimator having the smallest
variance?
Short answer: In many cases, yes!
We are going to develop the theory to answer this question in detail!

Fisher information

Regularity condition: The set of y (resp. k) values where f_Y(y; θ) ≠ 0 (resp. p_X(k; θ) ≠ 0) does not depend on θ.
In other words, the support of the pdf does not depend on the parameter (so that one can differentiate under the integral sign).

Definition 5
The Fisher information of a continuous (resp. discrete) random variable Y (resp. X) with pdf f_Y(y; θ) (resp. p_X(k; θ)) is defined as

I(θ) = E\left[ \left( \frac{∂ \ln f_Y(Y; θ)}{∂θ} \right)^2 \right] \quad \left( \text{resp. } E\left[ \left( \frac{∂ \ln p_X(X; θ)}{∂θ} \right)^2 \right] \right)

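As a worked example of the definition (a standard calculation, not taken from the slides), consider X ~ Bernoulli(p):

```latex
% Fisher information of X ~ Bernoulli(p), computed directly from Definition 5.
\begin{align*}
\ln p_X(k;p) &= k\ln p + (1-k)\ln(1-p), \\
\frac{\partial}{\partial p}\ln p_X(k;p) &= \frac{k}{p}-\frac{1-k}{1-p}
  = \frac{k-p}{p(1-p)}, \\
I(p) &= E\!\left[\left(\frac{X-p}{p(1-p)}\right)^{2}\right]
  = \frac{\operatorname{Var}(X)}{p^{2}(1-p)^{2}}
  = \frac{1}{p(1-p)}.
\end{align*}
```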
Fisher information
Lemma 5
Under the regularity condition, let Y_1, ..., Y_n be a random sample of size n from the continuous population pdf f_Y(y; θ). Then the Fisher information in the random sample Y_1, ..., Y_n equals n times the Fisher information in a single observation Y:

E\left[ \left( \frac{∂ \ln f_{Y_1, ..., Y_n}(Y_1, ..., Y_n; θ)}{∂θ} \right)^2 \right] = n\, E\left[ \left( \frac{∂ \ln f_Y(Y; θ)}{∂θ} \right)^2 \right] = n I(θ)

(A similar statement holds for the discrete case p_X(k; θ).)

Proof. Based on two observations:

\text{LHS} = E\left[ \left( \sum_{i=1}^{n} \frac{∂}{∂θ} \ln f_{Y_i}(Y_i; θ) \right)^2 \right]

E\left[ \frac{∂}{∂θ} \ln f_{Y_i}(Y_i; θ) \right] = \int_{ℝ} \frac{\frac{∂}{∂θ} f_Y(y; θ)}{f_Y(y; θ)}\, f_Y(y; θ)\, dy = \int_{ℝ} \frac{∂}{∂θ} f_Y(y; θ)\, dy \overset{\text{R.C.}}{=} \frac{∂}{∂θ} \int_{ℝ} f_Y(y; θ)\, dy = \frac{∂}{∂θ} 1 = 0.
Fisher information
Lemma 6
Under the regularity condition, if ln f_Y(y; θ) is twice differentiable in θ, then

I(θ) = -E\left[ \frac{∂^2}{∂θ^2} \ln f_Y(Y; θ) \right]

(A similar statement holds for the discrete case p_X(k; θ).)

Proof. Based on two observations:

(1) \frac{∂^2}{∂θ^2} \ln f_Y(Y; θ) = \frac{\frac{∂^2}{∂θ^2} f_Y(Y; θ)}{f_Y(Y; θ)} - \underbrace{\left( \frac{\frac{∂}{∂θ} f_Y(Y; θ)}{f_Y(Y; θ)} \right)^2}_{= \left( \frac{∂}{∂θ} \ln f_Y(Y; θ) \right)^2}

(2) E\left[ \frac{\frac{∂^2}{∂θ^2} f_Y(Y; θ)}{f_Y(Y; θ)} \right] = \int_{ℝ} \frac{\frac{∂^2}{∂θ^2} f_Y(y; θ)}{f_Y(y; θ)}\, f_Y(y; θ)\, dy = \int_{ℝ} \frac{∂^2}{∂θ^2} f_Y(y; θ)\, dy = \frac{∂^2}{∂θ^2} \int_{ℝ} f_Y(y; θ)\, dy = \frac{∂^2}{∂θ^2} 1 = 0.

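A worked example of Lemma 6 (a standard calculation, not from the slides), using the Poisson distribution from the earlier MLE slide:

```latex
% Fisher information of X ~ Poisson(lambda) via the second-derivative formula of Lemma 6.
\begin{align*}
\ln p_X(k;\lambda) &= -\lambda + k\ln\lambda - \ln k!, \\
\frac{\partial^{2}}{\partial\lambda^{2}}\ln p_X(k;\lambda) &= -\frac{k}{\lambda^{2}}, \\
I(\lambda) &= -E\!\left[-\frac{X}{\lambda^{2}}\right]
  = \frac{E(X)}{\lambda^{2}} = \frac{1}{\lambda}.
\end{align*}
```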
MVE: The Cramér-Rao Lower Bound

Theorem 7
(Cramér-Rao Inequality) Under the regularity condition, let Y_1, ..., Y_n be a random sample of size n from the continuous population pdf f_Y(y; θ). Let θ̂ = θ̂(Y_1, ..., Y_n) be any unbiased estimator for θ. Then

\mathrm{Var}(θ̂) ≥ \frac{1}{n I(θ)}

(A similar statement holds for the discrete case p_X(k; θ).)

Proof of The Cramér-Rao Lower Bound

If n = 1, then by the Cauchy-Schwarz inequality,

E\left[ (θ̂ - θ) \frac{∂}{∂θ} \ln f_Y(Y; θ) \right] ≤ \sqrt{ \mathrm{Var}(θ̂) \times I(θ) }

On the other hand,

E\left[ (θ̂ - θ) \frac{∂}{∂θ} \ln f_Y(Y; θ) \right] = \int_{ℝ} (θ̂ - θ) \frac{\frac{∂}{∂θ} f_Y(y; θ)}{f_Y(y; θ)}\, f_Y(y; θ)\, dy = \int_{ℝ} (θ̂ - θ) \frac{∂}{∂θ} f_Y(y; θ)\, dy = \frac{∂}{∂θ} \underbrace{\int_{ℝ} (θ̂ - θ) f_Y(y; θ)\, dy}_{= E(θ̂ - θ) = 0} + 1 = 1.

The Cramér-Rao Lower Bound

Definition 6
Let Θ be the set of all estimators θ̂ that are unbiased for the parameter θ. We say that θ̂^* is a best or minimum-variance estimator (MVE) if θ̂^* ∈ Θ and

\mathrm{Var}(θ̂^*) ≤ \mathrm{Var}(θ̂) \text{ for all } θ̂ ∈ Θ

Definition 7
An unbiased estimator θ̂ is efficient if Var(θ̂) is equal to the Cramér-Rao lower bound, i.e., Var(θ̂) = (n I(θ))^{-1}.

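Tying the definitions back to the Poisson example (a standard check, not stated on the slides): with I(λ) = 1/λ, the Cramér-Rao bound is λ/n, which is exactly the variance of the MLE λ̃ = k̄, so the Poisson MLE is efficient.

```latex
% The Poisson MLE k_bar attains the Cramér-Rao lower bound, hence is efficient.
\operatorname{Var}(\bar{k}) = \frac{\operatorname{Var}(X)}{n} = \frac{\lambda}{n}
  = \frac{1}{n\,I(\lambda)}.
```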
