Unit 04 - Maximum Likelihood Estimation - 1 Per Page
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
MLE: an improved estimation approach
• In Unit 3, we learned one way to construct
estimators: Method of Moments
• This is not necessarily the best way to find
estimators. In fact, it is rarely used in practice, but it
is so simple that it makes a good teaching tool for
introducing estimation methods.
• A more efficient and widely-used approach:
Maximum Likelihood Estimation (MLE).
• This approach essentially chooses the value for the
parameter, θ, that maximizes the likelihood of seeing
the sample data that is collected.
Likelihood Function
• Again, what is inference?
• It is making statements about parameter(s), θ, given a
sample of data, X1, X2, …, Xn.
• It sure would be nice to have a function of θ given
the sample data in order to help us make these
inferential statements.
• That is exactly what the likelihood function is doing:
lik(\theta) = f(X_1, X_2, \ldots, X_n \mid \theta)
\hat{\theta} = \arg\max_{\theta}\; l(\theta)
[Figure: plot of a likelihood curve with its maximum, \hat{\theta}, marked]
http://www.youtube.com/watch?v=JGS90HEbP5U
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
MLE Example 1: Poisson distribution
• Suppose we have i.i.d. Xi ~ Pois(λ).
• What is the likelihood function?
lik(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{X_i} e^{-\lambda}}{X_i!}
• What is the log-likelihood function?
l(\lambda) = \log \prod_{i=1}^{n} \frac{\lambda^{X_i} e^{-\lambda}}{X_i!}
= \sum_{i=1}^{n}\left[ X_i \log(\lambda) - \log(X_i!) - \lambda \right]
= \log(\lambda) \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \log(X_i!) - n\lambda
MLE Example 1: Poisson dist. (cont.)
• What is the maximum likelihood estimator for λ?
• First, differentiate the log-likelihood to get l'(λ):
l'(\lambda) = \frac{d}{d\lambda}\left[ \log(\lambda) \sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\log(X_i!) - n\lambda \right]
= \frac{1}{\lambda}\sum_{i=1}^{n} X_i - n
• Then set it to zero, lʹ(λ) = 0, and solve for λ.
l'(\lambda) = \frac{1}{\lambda}\sum_{i=1}^{n} X_i - n = 0
\quad\Longrightarrow\quad \hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}
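• As a quick numerical check, a minimal R sketch that maximizes the Poisson log-likelihood directly and compares the result with the closed-form answer \bar{X}. The simulated data and the function name pois.loglik are illustrative assumptions, not part of the lecture code.
# Sketch: maximize the Poisson log-likelihood numerically and compare
# with the closed-form MLE, lambda-hat = mean(x).
set.seed(1)
x = rpois(200, lambda=3)                 # simulated data (assumption)
pois.loglik = function(lambda, x){
  sum(x)*log(lambda) - sum(lgamma(x+1)) - length(x)*lambda
}
opt = optimize(pois.loglik, interval=c(0.01, 20), x=x, maximum=TRUE)
opt$maximum   # numerical maximizer
mean(x)       # closed-form MLE; the two should agree closely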
MLE Example 2: Normal distribution
• Suppose Xi ~ N(μ, σ2) and i.i.d.
• What is the likelihood function?
lik(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right)
• What is the log-likelihood function?
l(\mu, \sigma^2) = \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right) \right]
= -\sum_{i=1}^{n} \log(\sigma) - \sum_{i=1}^{n} \log\sqrt{2\pi} - \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{2\sigma^2}
= -n\log(\sigma) - n\log\sqrt{2\pi} - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2
MLE Example 2: Normal dist. (cont.)
• Now there are two unknown parameters so we will
need to find the separate partial derivatives:
\frac{\partial l(\mu,\sigma^2)}{\partial \mu}
= \frac{\partial}{\partial \mu}\left[ -n\log(\sigma) - n\log\sqrt{2\pi} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i-\mu)^2 \right]
= \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu)

\frac{\partial l(\mu,\sigma^2)}{\partial \sigma^2}
= \frac{\partial}{\partial \sigma^2}\left[ -\frac{n}{2}\log(\sigma^2) - n\log\sqrt{2\pi} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i-\mu)^2 \right]
= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(X_i-\mu)^2
MLE Example 2: Normal dist. (cont.)
• Set the separate partial derivatives to zero and solve for
the specific parameter:
\frac{\partial l(\mu,\sigma^2)}{\partial \mu} = \frac{1}{\hat{\sigma}^2}\sum_{i=1}^{n}(X_i - \hat{\mu}) = 0
\quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}

\frac{\partial l(\mu,\sigma^2)}{\partial \sigma^2} = -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\sum_{i=1}^{n}(X_i - \hat{\mu})^2 = 0
\quad\Longrightarrow\quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2
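• As a side check, a small R sketch comparing these closed-form MLEs with R's var(), which divides by n − 1 rather than n. The simulated data are an illustrative assumption.
# Sketch: Normal MLEs on simulated data.
set.seed(2)
x = rnorm(100, mean=5, sd=2)
mu.hat     = mean(x)                      # MLE of mu
sigma2.hat = mean((x - mu.hat)^2)         # MLE of sigma^2 (divides by n)
mu.hat
sigma2.hat
var(x)*(length(x)-1)/length(x)            # rescaled var() matches sigma2.hat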
MLE Example 3: Gamma distribution
• Suppose Xi ~ Gamma(a, λ) and i.i.d.
• What is the likelihood function?
lik(a, \lambda) = \prod_{i=1}^{n} \frac{\lambda^{a}}{\Gamma(a)}\, X_i^{\,a-1} e^{-\lambda X_i}
• What is the log-likelihood function?
l(a, \lambda) = \log \prod_{i=1}^{n} \frac{\lambda^{a}}{\Gamma(a)}\, X_i^{\,a-1} e^{-\lambda X_i}
= \sum_{i=1}^{n} a\log(\lambda) - \sum_{i=1}^{n} \log\Gamma(a) + (a-1)\sum_{i=1}^{n}\log X_i - \lambda\sum_{i=1}^{n} X_i
= na\log(\lambda) - n\log\Gamma(a) + (a-1)\sum_{i=1}^{n}\log X_i - \lambda\sum_{i=1}^{n} X_i
MLE Example 3: Gamma dist. (cont.)
• Two unknown parameters, θ = {a, λ}, so take the two partial derivatives separately:
\frac{\partial l(a,\lambda)}{\partial a} = \frac{\partial}{\partial a}\left[ na\log(\lambda) - n\log\Gamma(a) + (a-1)\sum_{i=1}^{n}\log X_i - \lambda\sum_{i=1}^{n} X_i \right]
= n\log(\lambda) - n\,\frac{\Gamma'(a)}{\Gamma(a)} + \sum_{i=1}^{n}\log X_i
\frac{\partial l(a,\lambda)}{\partial \lambda} = \frac{\partial}{\partial \lambda}\left[ na\log(\lambda) - n\log\Gamma(a) + (a-1)\sum_{i=1}^{n}\log X_i - \lambda\sum_{i=1}^{n} X_i \right]
= \frac{na}{\lambda} - \sum_{i=1}^{n} X_i
MLE Example 3: Gamma dist. (cont.)
• And set to zero (solve the λ-partial first):
\frac{\partial l(a,\lambda)}{\partial \lambda} = \frac{n\hat{a}}{\hat{\lambda}} - \sum_{i=1}^{n} X_i = 0
\quad\Longrightarrow\quad \hat{\lambda} = \frac{n\hat{a}}{\sum_{i=1}^{n} X_i} = \frac{\hat{a}}{\bar{X}}
Which Newton is That?
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
Functions in R
• We’d like to calculate the log-likelihood (or likelihood) for a
model given a set of parameter(s), θ, and the data, x.
• The best way to do this is to write a user-defined function in
R so that this calculation can be done over and over again (so
we can draw the function, determine its maximum, etc.).
• A user-defined function in R looks like this:
my.function = function(arg1,arg2,...){
result = ... # do some work
return(result)
}
• my.function has several parts: the function name, arguments,
body, and results to be passed back to the user in the regular
R environment.
• The work (like the result variable above) is done internally in the
function, and cannot be accessed outside the function unless
it is explicitly returned to the user with the return expression.
• An example would be helpful…
R-Function: Gamma log-lik
• Recall that the log-likelihood of i.i.d. Xi ~ Gamma(a, λ):
l(a, \lambda) = na\log(\lambda) - n\log\Gamma(a) + (a-1)\sum_{i=1}^{n}\log X_i - \lambda\sum_{i=1}^{n} X_i
> hist(precip,col="gray",main="")
[Figure: histogram of the precip data; y-axis is Frequency, x-axis is precip]
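• The code on the following slides calls a user-defined function gamma.loglik; a minimal sketch consistent with the log-likelihood above is given below. The interface theta = c(a, lambda) and x = data is an assumption based on how the function is called later.
# Sketch of gamma.loglik (assumed interface: theta = c(a, lambda), x = data),
# computing l(a, lambda) = n*a*log(lambda) - n*log(Gamma(a))
#                          + (a-1)*sum(log(x)) - lambda*sum(x)
gamma.loglik = function(theta, x){
  a      = theta[1]
  lambda = theta[2]
  n      = length(x)
  n*a*log(lambda) - n*lgamma(a) + (a-1)*sum(log(x)) - lambda*sum(x)
}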
Plotting the l(a, λ): a double for loop
• We would like to plot the log-likelihood for the Boston storm
dataset for various values of a and λ.
• This poses a difficulty since there are two unknown
parameters we need to search over
• We need to use a “double” for loop. See code below:
a=1:100/100                  # grid of candidate a values: 0.01, 0.02, ..., 1
lambda=1:200/50              # grid of candidate lambda values: 0.02, 0.04, ..., 4
loglik=matrix(NA,nrow=length(a),ncol=length(lambda))
dim(loglik)
for(i in 1:length(a)){
  for(j in 1:length(lambda)){
    loglik[i,j]=gamma.loglik(theta=c(a[i],lambda[j]),x=precip)
  }
}
Plotting l(a, λ) and Finding MLEs
• Let’s plot it (we need a 3D plot, so we will use an R package):
require(scatterplot3d)
persp(x=a,y=lambda,z=loglik,shade=0.5,axes=T,
col=c("darkred"),phi=20, theta=-60,
ticktype="detailed")
• And with some careful indexing, we can find the correct values
for a and λ that maximize our calculated log-likelihoods:
index=which(loglik==max(loglik))
a[index%%length(a)]
lambda[ceiling(index/length(a))]

> a[index%%length(a)]
[1] 0.59
> lambda[ceiling(index/length(a))]
[1] 2
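• A side note: base R can recover the row and column indices directly with which(..., arr.ind=TRUE), which avoids the modular arithmetic above (and its edge case when the linear index is an exact multiple of length(a)). A small sketch:
# Equivalent lookup: arr.ind=TRUE returns (row, column) = (a index, lambda index)
idx = which(loglik==max(loglik), arr.ind=TRUE)
a[idx[1,"row"]]
lambda[idx[1,"col"]]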
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
Newton’s Method
• Newton’s Method (sometimes called the Newton-Raphson
Method) is a numerical way to solve for roots of a
function.
• It is an iterative algorithm based on the following equation:
x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}
\qquad\text{and in general}\qquad
x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}
• You iteratively update a potential root x_k until |x_k − x_{k−1}| < ε for some small tolerance ε.
• Some issues can arise, but the key is that your starting point, x_0,
needs to be reasonably close to the root.
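• A minimal R sketch of this update rule; the example function, its derivative, and the starting point are illustrative assumptions.
# Generic Newton's method: iterate x <- x - f(x)/f'(x) until the step is tiny.
newton = function(f, fprime, x0, tol=1e-8, max.iter=100){
  x = x0
  for(k in 1:max.iter){
    x.new = x - f(x)/fprime(x)
    if(abs(x.new - x) < tol) return(x.new)
    x = x.new
  }
  warning("did not converge")
  return(x)
}
# Example: root of f(x) = x^2 - 2 starting near 1 (should return sqrt(2))
newton(f=function(x) x^2 - 2, fprime=function(x) 2*x, x0=1)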
Newton’s Method (cont.)
• Newton’s Method is quite applicable to maximum
likelihood estimation! The situation often arises that
the equation l’(θ) = 0 does not have a closed-form
solution.
• So for MLE, the f(x) from the previous slide is actually
l’(θ). Thus the formulas become:
\theta_1 = \theta_0 - \frac{l'(\theta_0)}{l''(\theta_0)}
\qquad\qquad
\theta_{k+1} = \theta_k - \frac{l'(\theta_k)}{l''(\theta_k)}
Using Newton’s Method: Gamma dist.
• Recall, the a-partial equation for l’(a) for a gamma
distribution was:
\frac{\partial l(a,\lambda)}{\partial a} = \log(\hat{a}) - \frac{\Gamma'(\hat{a})}{\Gamma(\hat{a})} - \log(\bar{X}) + \frac{1}{n}\sum_{i=1}^{n}\log X_i = 0
• So we need to do 2 steps in R to solve this equation:
1) Create a user-defined function to calculate the result of
this function (l’(a)) given a value of the parameter a
(and given the data X1, X2, …, Xn).
2) Use the function uniroot to find the appropriate root
for this equation
*And don’t forget there is another parameter to estimate
afterwards: \hat{\lambda} = \hat{a}/\bar{X}
A Function in R to Calculate l’(a)
• Step #1: create a user-defined function (let’s call it a.partial)
to calculate the results of the function of the a-partial
derivative (given parameter, a, and data, x):
\frac{\partial l(a,\lambda)}{\partial a} = \log(\hat{a}) - \frac{\Gamma'(\hat{a})}{\Gamma(\hat{a})} - \log(\bar{X}) + \frac{1}{n}\sum_{i=1}^{n}\log X_i = 0
a.partial = function(a,x){
  # digamma(a) = Gamma'(a)/Gamma(a); mean(log(x)) = (1/n)*sum(log(x))
  f = log(a)-digamma(a)-log(mean(x))+mean(log(x))
  return(f)
}
uniroot in R to solve l’(a) = 0
• Step #2: Use the function uniroot to find the appropriate
root for this equation:
?uniroot
uniroot(f=a.partial,interval=c(min(precip),max(precip)),x=precip)
result1=uniroot(f=a.partial,interval=c(min(precip),max(precip)),x=precip)
a.mle=result1$root
lambda.mle=a.mle/mean(precip)
a.mle
lambda.mle

> result1
$root
[1] 0.5920035
$f.root
[1] -4.113931e-06
$iter
[1] 10
$estim.prec
[1] 6.103516e-05

> a.mle
[1] 0.5920035
> lambda.mle
[1] 1.997605
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
Fisher’s Information
• Finally, back to real Statistics (and not computation)
• Recall, l(θ) is measuring the likelihood of the potential
values of θ given the data, X1, X2, …, Xn.
• We’d like a measure for the uncertainty of an MLE
(like the variance of \hat{\theta}_{MLE}).
• The Fisher information, In(θ), does exactly that:
I_n(\theta) = E\!\left[ \left( \frac{\partial}{\partial\theta}\, l(\theta) \right)^{2} \right]
• Which is mathematically equivalent to (for ease of calculation):
I_n(\theta) = -\,E\!\left[ \frac{\partial^2}{\partial\theta^2}\, l(\theta) \right]
*See DeGroot p.515-516 for proof.
Fisher’s Information
• Theorem: for MLEs, we can show that under “mild
conditions”:
\hat{\theta}_{MLE} \;\overset{\text{approx.}}{\sim}\; N\!\left( \theta_0,\; \frac{1}{I_n(\theta_0)} \right)
where \theta_0 is the true unknown parameter(s).
• What distribution does \sqrt{I_n(\theta_0)}\,\left(\hat{\theta}_{MLE} - \theta_0\right) have?
• What does this say about the bias of MLEs? What
about the consistency of MLEs?
• What is Fisher’s information, In(θ0) , measuring? Do
you want it to be large or small?
Derivation of Theorem
• By definition of the MLE, l'(\hat{\theta}_{MLE}) = 0. Applying a Taylor
series expansion around \theta_0:
l'(\hat{\theta}_{MLE}) \approx l'(\theta_0) + (\hat{\theta}_{MLE} - \theta_0)\, l''(\theta_0)
\Longrightarrow\quad \hat{\theta}_{MLE} - \theta_0 \approx -\frac{l'(\theta_0)}{l''(\theta_0)}
\Longrightarrow\quad \sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \approx \frac{\tfrac{1}{\sqrt{n}}\, l'(\theta_0)}{-\tfrac{1}{n}\, l''(\theta_0)}
• Expand the numerator:
\frac{1}{\sqrt{n}}\, l'(\theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i \mid \theta_0)
• What is this a sum of?
• i.i.d. random variables. So if CLT holds…
Derivation of Theorem (cont.)
• By CLT, we know:
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0) \;\approx\; N\!\left( E\!\left[\frac{\partial}{\partial\theta}\log f(X \mid \theta_0)\right],\; \mathrm{Var}\!\left[\frac{\partial}{\partial\theta}\log f(X \mid \theta_0)\right] \right)
• What are E\!\left[\frac{\partial}{\partial\theta}\log f(X \mid \theta_0)\right] and \mathrm{Var}\!\left[\frac{\partial}{\partial\theta}\log f(X \mid \theta_0)\right]?
E\!\left[\frac{\partial}{\partial\theta}\log f(X \mid \theta_0)\right] = \int \frac{\partial}{\partial\theta}\left[\log f(x \mid \theta_0)\right] f(x \mid \theta_0)\, dx
= \int \frac{\partial f(x \mid \theta_0)/\partial\theta}{f(x \mid \theta_0)}\, f(x \mid \theta_0)\, dx
= \int \frac{\partial}{\partial\theta} f(x \mid \theta_0)\, dx = \frac{\partial}{\partial\theta}\int f(x \mid \theta_0)\, dx = 0
Bring it home!
\mathrm{Var}\!\left[\frac{\partial}{\partial\theta}\log f(X \mid \theta_0)\right] = E\!\left[\left(\frac{\partial}{\partial\theta}\log f(X \mid \theta_0)\right)^{2}\right] - 0^2 = I(\theta_0)
• Hooray! So (if n is large enough, by the CLT):
\frac{1}{\sqrt{n}}\, l'(\theta_0) \;\sim\; N\!\left(0,\; I(\theta_0)\right)
• What about the denominator?
\frac{1}{n}\, l''(\theta_0) = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2}\log f(X_i \mid \theta_0) \;\longrightarrow\; E\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta_0)\right] = -I(\theta_0)
• Thus:
\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \approx \frac{\tfrac{1}{\sqrt{n}}\, l'(\theta_0)}{-\tfrac{1}{n}\, l''(\theta_0)} \;\sim\; \frac{1}{I(\theta_0)}\, N\!\left(0,\; I(\theta_0)\right) = N\!\left(0,\; \frac{1}{I(\theta_0)}\right)
so that \hat{\theta}_{MLE} \approx N\!\left(\theta_0,\; \frac{1}{n\, I(\theta_0)}\right) = N\!\left(\theta_0,\; \frac{1}{I_n(\theta_0)}\right).
Proof is over!
• So what did we just show?
• That the sampling distribution of any MLE will be
approximately Normally distributed, given that:
• n is large enough,
• you don’t have too extreme outliers in l’(θ0)
• and your observations are i.i.d.
• So what? Now we have an easy way to construct
confidence intervals and conduct hypothesis tests
• Note this also holds in the multi-dimensional
parameter case, but what are the dimensions of I_n(\theta_0)?
• So it needs to be written as:
\hat{\theta}_{MLE} \;\approx\; N\!\left( \theta_0,\; I_n^{-1}(\theta_0) \right)
Happy National Battery Day!
• Batteries were invented by this guy in 1800:
Alessandro Volta
P\!\left( \hat{\theta}_{MLE} - \frac{z^*_{1-\alpha/2}}{\sqrt{I_n(\hat{\theta}_{MLE})}} \;\le\; \theta_0 \;\le\; \hat{\theta}_{MLE} + \frac{z^*_{1-\alpha/2}}{\sqrt{I_n(\hat{\theta}_{MLE})}} \right) \approx 1 - \alpha
• By using I_n(\hat{\theta}) instead of I_n(\theta_0), we will technically
alter the asymptotic normal distribution (just like
using s in place of σ altered the normal distribution
when constructing intervals for µ).
• But it’s close enough to a normal distribution when n is
large (just like a t-distribution looks like a normal
when df is large).
Example: MLE-based C.I. for a Poisson
• Let i.i.d Xi ~ Pois(λ).
• What is \hat{\lambda}_{MLE}? What is Var(\hat{\lambda}_{MLE})?
\hat{\lambda}_{MLE} = \bar{X} \qquad\qquad \mathrm{Var}(\bar{X}) = \lambda/n
• What is Fisher’s Information, In(θ)?
I_n(\lambda) = -E\!\left[ \frac{\partial^2}{\partial\lambda^2}\, l(\lambda) \right]
= -E\!\left[ \frac{\partial^2}{\partial\lambda^2}\left( \log(\lambda)\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\log(X_i!) - n\lambda \right) \right]
= -E\!\left[ -\frac{1}{\lambda^2}\sum_{i=1}^{n} X_i \right]
= \frac{1}{\lambda^2}\, E\!\left[\sum_{i=1}^{n} X_i\right] = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}
• Thus the estimated information is I_n(\hat{\lambda}) = n/\hat{\lambda}.
Example: MLE-based C.I. for a Poisson
(cont.)
• Construct an asymptotic 95% C.I. for λ.
\left( \hat{\lambda}_{MLE} - \frac{z^*_{1-\alpha/2}}{\sqrt{I_n(\hat{\lambda}_{MLE})}},\;\; \hat{\lambda}_{MLE} + \frac{z^*_{1-\alpha/2}}{\sqrt{I_n(\hat{\lambda}_{MLE})}} \right)
= \left( \hat{\lambda} - z^*_{1-\alpha/2}\sqrt{\frac{\hat{\lambda}}{n}},\;\; \hat{\lambda} + z^*_{1-\alpha/2}\sqrt{\frac{\hat{\lambda}}{n}} \right)
= \left( \hat{\lambda} - 1.96\sqrt{\frac{\hat{\lambda}}{n}},\;\; \hat{\lambda} + 1.96\sqrt{\frac{\hat{\lambda}}{n}} \right)
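• A minimal R sketch of this interval on simulated Poisson data; the data are an illustrative assumption.
# Asymptotic 95% C.I. for lambda using the MLE and Fisher information n/lambda-hat.
set.seed(3)
x = rpois(150, lambda=4)
lambda.hat = mean(x)                        # MLE of lambda
se.hat     = sqrt(lambda.hat/length(x))     # 1/sqrt(I_n(lambda-hat))
lambda.hat + c(-1, 1)*qnorm(0.975)*se.hat   # lower and upper C.I. limits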
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
Mean Square Error
• We have talked about a few properties of estimators so
far: bias, consistency, and variance.
• Another way to measure how good an estimator is: the
mean squared error (MSE):
\mathrm{MSE}(\hat{\theta}) = E\!\left[ (\hat{\theta} - \theta_0)^2 \right] = \mathrm{Var}(\hat{\theta}) + \left[ E(\hat{\theta}) - \theta_0 \right]^2
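• The decomposition follows by adding and subtracting E(\hat{\theta}) inside the square (a short check worked out below):
\begin{aligned}
E\big[(\hat{\theta}-\theta_0)^2\big]
  &= E\big[\big(\hat{\theta}-E(\hat{\theta})+E(\hat{\theta})-\theta_0\big)^2\big] \\
  &= E\big[(\hat{\theta}-E(\hat{\theta}))^2\big]
     + 2\big(E(\hat{\theta})-\theta_0\big)\,E\big[\hat{\theta}-E(\hat{\theta})\big]
     + \big(E(\hat{\theta})-\theta_0\big)^2 \\
  &= \mathrm{Var}(\hat{\theta}) + \big[E(\hat{\theta})-\theta_0\big]^2 ,
\end{aligned}
since E[\hat{\theta}-E(\hat{\theta})] = 0 kills the cross term.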
• What is \hat{\theta}_{MLE}?    \hat{\theta}_{MLE} = \bar{X}
• What is Var(\hat{\theta}_{MLE})?    \mathrm{Var}(\bar{X}) = \lambda/n
Asymptotic Efficiency of MLEs
• What is the asymptotic bias of any maximum likelihood
estimator, \hat{\theta}_{MLE}? Meaning: what is the bias as n → ∞?
• In the last lecture we saw that E(\hat{\theta}_{MLE}) → θ as n → ∞.
• What is the asymptotic variance of \hat{\theta}_{MLE}?
\mathrm{Var}(\hat{\theta}_{MLE}) \;\longrightarrow\; \frac{1}{I_n(\theta)}
• So MLEs achieve the Cramer-Rao lower bound
asymptotically.
• This means MLEs are asymptotically efficient! It’s the
best we can do (asymptotically).
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
Optimization in R
• In the last lecture, we numerically calculated the
MLE by solving the derivative of the log-likelihood
function, l’(θ) = 0, using the uniroot function
in R.
• This works great if we can analytically write down
l’(θ). Sometimes this is not easy.
• That’s OK, there’s another way to numerically solve
for MLEs based on the log-likelihood function
directly: using R’s optim function.
• This will allow us to maximize a function that has
multiple parameters at once. But the key is: you
need a good starting spot!
optim in R to minimize or
maximize functions
?optim
# par: starting values, here the method-of-moments estimates a.mom and lambda.mom
#      (assumed to have been computed earlier).
# control=list(fnscale=-1): makes optim maximize the log-likelihood instead of minimizing it.
theta=optim(par=c(a.mom,lambda.mom), fn=gamma.loglik,
            control=list(fnscale=-1), x=precip)
theta
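• A side note: optim can also return the Hessian at the optimum via hessian=TRUE, which connects back to Fisher's information. In the sketch below the negative log-likelihood is minimized so that the returned Hessian directly estimates the observed information; gamma.loglik is assumed to be defined as earlier, and the starting values and box constraints are illustrative assumptions.
# Minimize the negative log-likelihood; the Hessian then estimates the observed
# information, and its inverse approximates the variance-covariance of the MLEs.
neg.gamma.loglik = function(theta, x) -gamma.loglik(theta, x)
fit = optim(par=c(0.5, 2), fn=neg.gamma.loglik, x=precip,
            method="L-BFGS-B", lower=c(1e-6, 1e-6), hessian=TRUE)
fit$par                          # MLEs of (a, lambda)
solve(fit$hessian)               # approximate variance-covariance matrix
sqrt(diag(solve(fit$hessian)))   # approximate standard errors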
So which approach should we take?
Hooray for MLEs!!
• What’s your favorite MLE?
Managed Learning Environment? Major League Eating?