Lecture 7: Maximum Likelihood Estimation (MLE)
Instructor: Xinyue Li
Motivating example

Flip a (possibly biased) coin three times and observe $X_1 = 1$, $X_2 = 1$, $X_3 = 0$, where each $X_i \sim \text{Bernoulli}(p)$ with unknown success probability $p$. Which value of $p$ should we report?
Solutions

Rationale: the choice of the parameter $p$ should be the value that maximizes the probability of the observed sample.

$$P(X_1 = 1, X_2 = 1, X_3 = 0) = P(X_1 = 1)\,P(X_2 = 1)\,P(X_3 = 0) = p^2(1 - p)$$

Maximize $f(p) = p^2(1 - p)$ . . .
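Carrying the elided maximization through: $f'(p) = 2p - 3p^2 = p(2 - 3p)$, which vanishes on $(0, 1)$ only at $p = 2/3$, so
$$\hat p = \frac{2}{3},$$
consistent with having observed two successes in three trials.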
Bernoulli(p)
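A sketch of the general case, following the same reasoning as the three-coin example (the notation anticipates the formal definitions below): for observations $k_1, \dots, k_n \in \{0, 1\}$,
$$L(p) = \prod_{i=1}^n p^{k_i}(1 - p)^{1 - k_i} = p^{\sum_i k_i}(1 - p)^{n - \sum_i k_i},$$
$$\ln L(p) = \left(\sum_{i=1}^n k_i\right)\ln p + \left(n - \sum_{i=1}^n k_i\right)\ln(1 - p),$$
$$\frac{d}{dp}\ln L(p) = 0 \implies p_e = \frac{1}{n}\sum_{i=1}^n k_i,$$
the sample proportion of successes.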
Statistic

$X_1, \cdots, X_n$ are i.i.d. random variables, each following the same given pdf.

A statistic or an estimator is a function of the random sample.

A statistic/estimator is a random variable!

e.g.,
$$\hat p = \frac{1}{n}\sum_{i=1}^n X_i$$
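A minimal simulation (an illustration, not from the slides) of the point that an estimator is itself a random variable: each fresh sample of size $n$ yields a different realization of $\hat p$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 0.3

# Draw 5 independent Bernoulli(p) samples of size n and compute
# p-hat = (1/n) * sum(X_i) for each; the values vary from sample to sample.
for _ in range(5):
    sample = rng.binomial(1, p, size=n)
    print(sample.mean())  # a different realization of p-hat each time
```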
Estimating parameters
Maximum Likelihood Estimation

Definition 1
For a random sample of size $n$ from the discrete (resp. continuous) population pdf $p_X(k;\theta)$ (resp. $f_Y(y;\theta)$), the likelihood function, $L(\theta)$, is the product of the pdf evaluated at $X_i = k_i$ (resp. $Y_i = y_i$), i.e.,
$$L(\theta) = \prod_{i=1}^n p_X(k_i;\theta) \qquad \left(\text{resp. } L(\theta) = \prod_{i=1}^n f_Y(y_i;\theta)\right)$$

Definition 2
Let $L(\theta)$ be as defined above. If $\theta_e$ is a value of the parameter such that $L(\theta_e) \ge L(\theta)$ for all possible values of $\theta$, then we call $\theta_e$ the maximum likelihood estimate for $\theta$.

How to obtain the MLE? Often, but not always, the MLE can be obtained by setting the first derivative (of $L(\theta)$, or more conveniently of $\ln L(\theta)$) equal to zero:
$$\frac{d}{d\theta}\ln L(\theta) = 0.$$
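When the derivative equation has no convenient closed form, the same recipe can be carried out numerically. A sketch on the motivating example (SciPy usage is an assumption; any one-dimensional optimizer works): maximize $f(p) = p^2(1-p)$ over $(0, 1)$.

```python
from scipy.optimize import minimize_scalar

# Maximize the likelihood by minimizing its negative over the bounded interval.
res = minimize_scalar(lambda p: -(p**2 * (1 - p)),
                      bounds=(0.0, 1.0), method="bounded")
print(res.x)  # approximately 2/3, matching the calculus solution
```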
Poisson distribution

Poisson distribution: $p_X(k) = e^{-\lambda}\dfrac{\lambda^k}{k!}$, $k = 0, 1, \cdots$.

$$L(\lambda) = \prod_{i=1}^n e^{-\lambda}\frac{\lambda^{k_i}}{k_i!} = e^{-n\lambda}\,\lambda^{\sum_{i=1}^n k_i}\left(\prod_{i=1}^n k_i!\right)^{-1}$$

$$\ln L(\lambda) = -n\lambda + \left(\sum_{i=1}^n k_i\right)\ln\lambda - \ln\left(\prod_{i=1}^n k_i!\right)$$

$$\frac{d}{d\lambda}\ln L(\lambda) = -n + \frac{1}{\lambda}\sum_{i=1}^n k_i.$$

$$\frac{d}{d\lambda}\ln L(\lambda) = 0 \implies \lambda_e = \frac{1}{n}\sum_{i=1}^n k_i =: \bar k$$
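A quick numerical cross-check (illustrative, not from the slides): maximize $\ln L(\lambda)$ on a grid for simulated Poisson data and compare with the closed-form MLE $\bar k$ derived above.

```python
import numpy as np

rng = np.random.default_rng(1)
k = rng.poisson(lam=3.5, size=1000)

grid = np.linspace(0.1, 10, 2000)
# ln L up to the additive constant -ln(prod k_i!), which does not move the argmax.
loglik = -len(k) * grid + k.sum() * np.log(grid)
print(grid[np.argmax(loglik)], k.mean())  # both close to 3.5
```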
Exponential distribution

Exponential distribution: $f_Y(y;\lambda) = \lambda e^{-\lambda y}$ for $y \ge 0$.

$$L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda y_i} = \lambda^n \exp\left(-\lambda \sum_{i=1}^n y_i\right)$$

$$\ln L(\lambda) = n \ln\lambda - \lambda \sum_{i=1}^n y_i$$

$$\frac{d}{d\lambda}\ln L(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n y_i$$

$$\frac{d}{d\lambda}\ln L(\lambda) = 0 \implies \lambda_e = \frac{n}{\sum_{i=1}^n y_i} =: \frac{1}{\bar y}$$

Recall that the exponential distribution models the waiting time...
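One step the slide leaves implicit: the stationary point is indeed a maximum, since
$$\frac{d^2}{d\lambda^2}\ln L(\lambda) = -\frac{n}{\lambda^2} < 0 \quad \text{for all } \lambda > 0,$$
so $\ln L$ is concave and $\lambda_e = 1/\bar y$ is the unique maximizer. The same check applies to the Poisson and Gamma cases.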
Gamma distribution

Gamma distribution: $f_Y(y;\lambda) = \dfrac{\lambda^r}{\Gamma(r)}\, y^{r-1} e^{-\lambda y}$ for $y \ge 0$, with $r > 1$ known.

$$L(\lambda) = \prod_{i=1}^n \frac{\lambda^r}{\Gamma(r)}\, y_i^{r-1} e^{-\lambda y_i} = \lambda^{rn}\,\Gamma(r)^{-n}\left(\prod_{i=1}^n y_i^{r-1}\right)\exp\left(-\lambda \sum_{i=1}^n y_i\right)$$

$$\ln L(\lambda) = rn \ln\lambda - n \ln\Gamma(r) + \ln\left(\prod_{i=1}^n y_i^{r-1}\right) - \lambda \sum_{i=1}^n y_i$$

$$\frac{d}{d\lambda}\ln L(\lambda) = \frac{rn}{\lambda} - \sum_{i=1}^n y_i$$

$$\frac{d}{d\lambda}\ln L(\lambda) = 0 \implies \lambda_e = \frac{rn}{\sum_{i=1}^n y_i} = \frac{r}{\bar y}.$$
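An illustrative simulation check (a sketch, not part of the lecture): for Gamma$(r, \lambda)$ data with $r$ known, the estimate $r/\bar y$ should recover the true rate.

```python
import numpy as np

rng = np.random.default_rng(2)
r, lam = 3.0, 2.0
# NumPy parameterizes the gamma by shape and scale = 1/rate.
y = rng.gamma(shape=r, scale=1 / lam, size=10_000)
print(r / y.mean())  # lambda_e = r / y-bar, approximately 2.0
```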
Normal distribution

$$L\left(\mu, \sigma^2\right) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y_i - \mu)^2}{2\sigma^2}} = \left(2\pi\sigma^2\right)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right)$$

$$\ln L\left(\mu, \sigma^2\right) = -\frac{n}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2$$

$$\frac{\partial}{\partial\mu}\ln L\left(\mu, \sigma^2\right) = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu)$$

$$\frac{\partial}{\partial\sigma^2}\ln L\left(\mu, \sigma^2\right) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2$$

Setting both partial derivatives to zero:
$$\frac{\partial}{\partial\mu}\ln L\left(\mu, \sigma^2\right) = 0, \quad \frac{\partial}{\partial\sigma^2}\ln L\left(\mu, \sigma^2\right) = 0 \implies \mu_e = \bar y, \quad \sigma^2_e = \frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2$$
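An illustrative check (a sketch): the normal MLEs are the sample mean and the *biased* variance (divisor $n$, not $n - 1$), matching the formulas above.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_e = y.mean()
sigma2_e = ((y - mu_e) ** 2).mean()   # divisor n, as in the MLE formula
print(mu_e, sigma2_e, y.var(ddof=0))  # sigma2_e equals var with ddof=0
```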
Uniform Distribution

Example 1
Uniform distribution on $[a, b]$ with $a < b$:
$$f_Y(y; a, b) = \begin{cases} \dfrac{1}{b - a} & \text{if } y \in [a, b] \\ 0 & \text{otherwise.} \end{cases}$$
Find the MLEs of $a$ and $b$.
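This example is precisely a case where setting derivatives to zero fails; a sketch of the standard argument: for observations $y_1, \dots, y_n$,
$$L(a, b) = \frac{1}{(b - a)^n} \quad \text{whenever all } y_i \in [a, b], \text{ and } L(a, b) = 0 \text{ otherwise.}$$
$L(a, b)$ increases as the interval $[a, b]$ shrinks, so the likelihood is maximized by the tightest interval still containing every observation:
$$a_e = \min_{1 \le i \le n} y_i, \qquad b_e = \max_{1 \le i \le n} y_i.$$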
Properties of Estimators

Question: estimators are in general not unique (MLE, MME, ...). How do we select one estimator?

Recall: for a random sample of size $n$ from the population with given pdf, we have $X_1, \cdots, X_n$, which are i.i.d. r.v.s. The estimator $\hat\theta$ is a function of the $X_i$'s:
$$\hat\theta = \hat\theta\left(X_1, \cdots, X_n\right)$$

Criteria:
Unbiasedness. (Mean)
Efficiency, the minimum-variance estimator. (Variance)
Sufficiency.
Consistency. (Asymptotic behavior)
Unbiasedness

Definition 3
Given a random sample of size $n$ whose population distribution depends on an unknown parameter $\theta$, let $\hat\theta$ be an estimator of $\theta$. Then $\hat\theta$ is called unbiased if $E(\hat\theta) = \theta$; and $\hat\theta$ is called asymptotically unbiased if $\lim_{n\to\infty} E(\hat\theta) = \theta$.
Unbiasedness

Example 2
Let $X_1, \cdots, X_n$ be a random sample of size $n$ with the unknown parameter $\sigma^2 = \text{Var}(X)$. Consider two statistics
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X\right)^2 \qquad \text{and} \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar X\right)^2.$$
Unbiased?
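A sketch of the standard computation answering the question: using $E(\bar X) = E(X)$ and $\text{Var}(\bar X) = \sigma^2/n$, one finds
$$E\left(\hat\sigma^2\right) = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2,$$
so $\hat\sigma^2$ is biased (though asymptotically unbiased, since $\frac{n-1}{n} \to 1$), while $S^2$ with the $\frac{1}{n-1}$ factor is exactly unbiased.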
Unbiasedness

Example 3
Let $Y_1, \ldots, Y_n$ be a random sample from the exponential distribution with parameter $\lambda$. The MLE is $\hat\lambda = 1/\bar Y$. Is it unbiased?
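A sketch of the standard answer: $\sum_{i=1}^n Y_i \sim \text{Gamma}(n, \lambda)$, and for $n > 1$ one has $E\left[\left(\sum_i Y_i\right)^{-1}\right] = \lambda/(n-1)$, so
$$E(\hat\lambda) = E\left(\frac{n}{\sum_{i=1}^n Y_i}\right) = \frac{n}{n-1}\,\lambda \neq \lambda.$$
The MLE is therefore biased, but asymptotically unbiased since $\frac{n}{n-1} \to 1$.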
Efficiency

Definition 4
Let $\hat\theta_1$ and $\hat\theta_2$ be two unbiased estimators for a parameter $\theta$. If $\text{Var}(\hat\theta_2) < \text{Var}(\hat\theta_1)$, then we say that $\hat\theta_2$ is more efficient than $\hat\theta_1$. The relative efficiency of $\hat\theta_2$ w.r.t. $\hat\theta_1$ is the ratio $\text{Var}(\hat\theta_2) / \text{Var}(\hat\theta_1)$.
Efficiency

Example 4
Let $X_1, \cdots, X_n$ be a random sample of size $n$ with the unknown parameter $\theta = E(X)$ (suppose $\sigma^2 = \text{Var}(X) < \infty$).
(1) Show that as long as $\sum_{i=1}^n a_i = 1$, $\hat\theta = \sum_{i=1}^n a_i X_i$ is an unbiased estimator.
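A sketch of part (1): by linearity of expectation,
$$E(\hat\theta) = \sum_{i=1}^n a_i\, E(X_i) = \theta \sum_{i=1}^n a_i = \theta.$$
Among such weighted averages, $\text{Var}(\hat\theta) = \sigma^2 \sum_{i=1}^n a_i^2$ is smallest when $a_i = 1/n$ for all $i$, so the sample mean is the most efficient estimator of this form.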
Minimum Variance Estimator (MVE)

Question: Can one identify the unbiased estimator having the smallest variance?
Short answer: in many cases, yes!
We are going to develop the theory to answer this question in detail!
Fisher information

Definition 5
The Fisher information of a continuous (resp. discrete) random variable $Y$ (resp. $X$) with pdf $f_Y(y;\theta)$ (resp. $p_X(k;\theta)$) is defined as
$$I(\theta) = E\left[\left(\frac{\partial \ln f_Y(Y;\theta)}{\partial\theta}\right)^2\right] \qquad \left(\text{resp. } E\left[\left(\frac{\partial \ln p_X(X;\theta)}{\partial\theta}\right)^2\right]\right)$$
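A worked instance for concreteness (the standard Bernoulli computation): for $X \sim \text{Bernoulli}(p)$, $\ln p_X(k; p) = k \ln p + (1-k)\ln(1-p)$, so
$$\frac{\partial}{\partial p}\ln p_X(X; p) = \frac{X}{p} - \frac{1-X}{1-p} = \frac{X - p}{p(1-p)}, \qquad I(p) = \frac{E\left[(X-p)^2\right]}{p^2(1-p)^2} = \frac{1}{p(1-p)}.$$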
Fisher information

Lemma 5
Under regularity conditions, let $Y_1, \cdots, Y_n$ be a random sample of size $n$ from the continuous population pdf $f_Y(y;\theta)$. Then the Fisher information in the random sample $Y_1, \cdots, Y_n$ equals $n$ times the Fisher information in a single observation:
$$E\left[\left(\frac{\partial \ln f_{Y_1,\cdots,Y_n}(Y_1, \cdots, Y_n;\theta)}{\partial\theta}\right)^2\right] = nE\left[\left(\frac{\partial \ln f_Y(Y;\theta)}{\partial\theta}\right)^2\right] = nI(\theta)$$
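A sketch of why the information is additive (the standard argument under the regularity conditions): by independence, $\ln f_{Y_1,\cdots,Y_n} = \sum_{i=1}^n \ln f_Y(Y_i;\theta)$, so the score of the sample is a sum of $n$ i.i.d. single-observation scores. Expanding the square of this sum, the cross terms vanish because each score has mean zero, $E\left[\frac{\partial}{\partial\theta}\ln f_Y(Y_i;\theta)\right] = 0$, leaving $n$ identical diagonal terms, i.e., $nI(\theta)$.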
Fisher information

Lemma 6
Under regularity conditions, if $\ln f_Y(y;\theta)$ is twice differentiable in $\theta$, then
$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\ln f_Y(Y;\theta)\right]$$
(A similar statement holds for the discrete case $p_X(k;\theta)$.)

Proof sketch: (1) by the quotient rule,
$$\frac{\partial^2}{\partial\theta^2}\ln f_Y(y;\theta) = \frac{\frac{\partial^2}{\partial\theta^2} f_Y(y;\theta)}{f_Y(y;\theta)} - \left(\frac{\partial}{\partial\theta}\ln f_Y(y;\theta)\right)^2,$$
so taking expectations gives $E\left[\frac{\partial^2}{\partial\theta^2}\ln f_Y(Y;\theta)\right] = E\left[\frac{\frac{\partial^2}{\partial\theta^2} f_Y(Y;\theta)}{f_Y(Y;\theta)}\right] - I(\theta)$.
(2) The first term vanishes:
$$E\left[\frac{\frac{\partial^2}{\partial\theta^2} f_Y(Y;\theta)}{f_Y(Y;\theta)}\right] = \int_{\mathbb{R}} \frac{\frac{\partial^2}{\partial\theta^2} f_Y(y;\theta)}{f_Y(y;\theta)}\, f_Y(y;\theta)\,dy = \int_{\mathbb{R}} \frac{\partial^2}{\partial\theta^2} f_Y(y;\theta)\,dy = \frac{\partial^2}{\partial\theta^2}\int_{\mathbb{R}} f_Y(y;\theta)\,dy = \frac{\partial^2}{\partial\theta^2}\,1 = 0.$$
MVE: The Cramér-Rao Lower Bound

Theorem 7
(Cramér-Rao Inequality) Under regularity conditions, let $Y_1, \cdots, Y_n$ be a random sample of size $n$ from the continuous population pdf $f_Y(y;\theta)$. Let $\hat\theta = \hat\theta(Y_1, \cdots, Y_n)$ be any unbiased estimator for $\theta$. Then
$$\text{Var}(\hat\theta) \ge \frac{1}{nI(\theta)}$$
(A similar statement holds for the discrete case $p_X(k;\theta)$.)
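An illustrative simulation (a sketch, not from the slides): for Poisson$(\lambda)$, $I(\lambda) = 1/\lambda$, so the bound for an unbiased estimator is $\lambda/n$, and the sample mean $\bar k$ attains it exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 3.5, 40, 20_000

# Repeatedly draw samples of size n and record k-bar for each.
means = rng.poisson(lam, size=(reps, n)).mean(axis=1)
print(means.var())  # empirical Var(k-bar)
print(lam / n)      # Cramer-Rao lower bound lambda / n
```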
Proof of The Cramér-Rao Lower Bound
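A sketch of the standard argument: let $S = \frac{\partial}{\partial\theta}\ln L(\theta)$ be the score of the sample, so that $E(S) = 0$ and $\text{Var}(S) = nI(\theta)$ by Lemma 5. Differentiating the unbiasedness identity $E(\hat\theta) = \theta$ under the integral sign (permitted by the regularity conditions) gives $\text{Cov}(\hat\theta, S) = 1$. The Cauchy-Schwarz inequality then yields
$$1 = \text{Cov}(\hat\theta, S)^2 \le \text{Var}(\hat\theta)\,\text{Var}(S) = \text{Var}(\hat\theta)\, nI(\theta),$$
which rearranges to the stated bound.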
The Cramér-Rao Lower Bound

Definition 6
Let $\Theta$ be the set of all estimators $\hat\theta$ that are unbiased for the parameter $\theta$. We say that $\hat\theta^*$ is a best or minimum-variance estimator (MVE) if $\hat\theta^* \in \Theta$ and
$$\text{Var}(\hat\theta^*) \le \text{Var}(\hat\theta) \quad \text{for all } \hat\theta \in \Theta.$$

Definition 7
An unbiased estimator $\hat\theta$ is efficient if $\text{Var}(\hat\theta)$ is equal to the Cramér-Rao lower bound, i.e., $\text{Var}(\hat\theta) = (nI(\theta))^{-1}$.