
Stat 111

Unit 4: Maximum Likelihood Estimation (MLE)
Sections 7.2, 7.5, 7.6, 8.5, 8.8 in DeGroot

1
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
2
MLE: an improved estimation approach
• In Unit 3, we learned one way to construct
estimators: Method of Moments
• This is not necessarily the best way to find
estimators. In fact, it is rarely used in practice (it is,
however, simple, which makes it a good teaching tool
for introducing estimation methods).
• A more efficient and widely-used approach:
Maximum Likelihood Estimation (MLE).
• This approach essentially chooses the value of the
parameter, θ, that maximizes the likelihood of seeing
the sample data that was collected.

3
Likelihood Function
• Again, what is inference?
• It is making statements about parameter(s), θ, given a
sample of data, X1, X2, …, Xn.
• It sure would be nice to have a function of θ given
the sample data in order to help us make these
inferential statements.
• That is exactly what the likelihood function is doing:
lik(θ) = f(X_1, X_2, ..., X_n | θ)
• In English: the likelihood function is the probability
of observing the data, viewed as a function of θ.
• Note: other notations used: lik(θ) = ℓ(θ | x) = L(θ | x)
4
Likelihood Function
• So really, this is just a sleight of hand, or thinking about
the pdf from a different perspective.
• Instead of writing the PDF as a function of x given θ,
we can think of it as a function of θ, but it loses the
interpretation that the PDF would have.
• Mathematically this is fine to do. Once the sample is
drawn, the unknown is now the parameter (for now we
will not think of it as a random variable…since we are
first taking the classical Frequentist approach).
• If X1, X2, …, Xn are i.i.d. from f(x|θ) (this is usually the
case, but not always), the likelihood function
simplifies to:
lik(θ) = ∏_{i=1}^n f(X_i | θ)
5
Log-Likelihood Function
• What happens to this likelihood function when there is a
lot of data (n is large)?
• Since you are multiplying a lot of things (which are
usually quite small), the likelihood is likely to get very
small.
• To fix this computational problem we can instead deal
with the log-likelihood for convenience’s sake (and for
other reasons too, which we will get into much later).
• We define the log-likelihood, l(θ) = log[lik(θ)].
• And for i.i.d. observations:
l(θ) = log[lik(θ)] = ∑_{i=1}^n log f(X_i | θ)
• Note: Statisticians are sloppy: log = log_e = ln. R too!


6
Maximum Likelihood Estimator (MLE)
• What is the likelihood function really measuring?
• It’s a measure of how likely any values of the
parameter(s) seem to be given a specific observed sample
of data
• So what’s better? Higher or lower?
• How can we maximize this function in terms of θ?
• Hooray for calculus!
• We call the value of θ that maximizes the likelihood
function the maximum likelihood estimator (MLE) of θ:
θ̂ = argmax_θ lik(θ)
• Note: it will sometimes be written as θ̂_MLE to differentiate it
from other estimators
7
Finding the MLE
• Since the logarithm function is monotonic, the value of θ
that maximizes the likelihood function also maximizes
the log-likelihood function.
• So instead we can solve:
θ̂ = argmax_θ l(θ)

[Figure: plot of the log function, y = log(x)]
• Simplest MLE example: fair coin vs. biased coin.


• You own 2 coins: a fair one (with p = 0.50 of landing
heads) and a biased one (with p = 0.80). You reach
into your pocket and select one coin at random to flip.
• You flip it 4 times and see 3 heads and one tail.
• What is the maximum likelihood estimate for p?
8
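A quick R sketch of this coin example (the data and probabilities are from the slide; the object names are mine). Since we own only these two coins, we just compare the likelihood of the observed "3 heads in 4 flips" under each value of p:

heads = 3
flips = 4
lik.fair   = dbinom(heads, size=flips, prob=0.50)   # = 4*(0.5^3)*(0.5) = 0.25
lik.biased = dbinom(heads, size=flips, prob=0.80)   # = 4*(0.8^3)*(0.2) = 0.4096
lik.fair
lik.biased

The biased coin gives the larger likelihood, so the maximum likelihood estimate over {0.50, 0.80} is p = 0.80.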
Bigger is better. For treehouses…
and for likelihood functions.

http://www.youtube.com/watch?v=JGS90HEbP5U

9
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
10
MLE Example 1: Poisson distribution
• Suppose we have i.i.d. Xi ~ Pois(λ).
• What is the likelihood function?
lik(λ) = ∏_{i=1}^n [ λ^{X_i} · e^{−λ} / X_i! ]
• What is the log-likelihood function?
l(λ) = log( ∏_{i=1}^n [ λ^{X_i} · e^{−λ} / X_i! ] )
     = ∑_{i=1}^n [ X_i log(λ) − log(X_i!) − λ ]
     = log(λ) · ∑_{i=1}^n X_i − ∑_{i=1}^n log(X_i!) − nλ
11
MLE Example 1: Poisson dist. (cont.)
• What is the maximum likelihood estimator for λ?
• First differentiate log-likelihood = lʹ(λ):
l'(λ) = d/dλ [ log(λ) · ∑ X_i − ∑ log(X_i!) − nλ ] = (1/λ) ∑_{i=1}^n X_i − n
• Then set it to zero, l'(λ) = 0, and solve for λ:
(1/λ) ∑_{i=1}^n X_i − n = 0   ⟹   λ̂ = (1/n) ∑_{i=1}^n X_i = X̄
12
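A small R sketch (with simulated data, so the numbers are only illustrative) confirming that the sample mean maximizes the Poisson log-likelihood:

set.seed(111)                       # illustrative seed
x = rpois(n=200, lambda=3)          # simulated Poisson sample

pois.loglik = function(lambda, x){
  sum(x*log(lambda) - lambda - lfactorial(x))
}

lambda.grid = seq(0.1, 10, by=0.01)            # grid of candidate lambdas
ll = sapply(lambda.grid, pois.loglik, x=x)

lambda.grid[which.max(ll)]   # grid maximizer of the log-likelihood
mean(x)                      # analytical MLE; agrees up to the grid spacing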
MLE Example 2: Normal distribution
• Suppose Xi ~ N(μ, σ2) and i.i.d.
• What is the likelihood function?
lik(μ, σ²) = ∏_{i=1}^n (1/(σ·√(2π))) · exp( −(X_i − μ)² / (2σ²) )
• What is the log-likelihood function?
l(μ, σ²) = ∑_{i=1}^n log[ (1/(σ·√(2π))) · exp( −(X_i − μ)² / (2σ²) ) ]
         = −∑_{i=1}^n log(σ) − ∑_{i=1}^n log(√(2π)) − ∑_{i=1}^n (X_i − μ)²/(2σ²)
         = −n·log(σ) − n·log(√(2π)) − (1/(2σ²)) ∑_{i=1}^n (X_i − μ)²
13
MLE Example 2: Normal dist. (cont.)
• Now there are two unknown parameters so we will
need to find the separate partial derivatives:
∂l(μ, σ²)/∂μ = ∂/∂μ [ −(n/2)·log(σ²) − n·log(√(2π)) − (1/(2σ²)) ∑ (X_i − μ)² ]
             = (1/σ²) ∑_{i=1}^n (X_i − μ)

∂l(μ, σ²)/∂σ² = ∂/∂σ² [ −(n/2)·log(σ²) − n·log(√(2π)) − (1/(2σ²)) ∑ (X_i − μ)² ]
              = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^n (X_i − μ)²
14
MLE Example 2: Normal dist. (cont.)
• Set the separate partial derivatives to zero and solve for
the specific parameter:
∂l(μ, σ²)/∂μ = (1/σ̂²) ∑_{i=1}^n (X_i − μ̂) = 0
   ⟹   μ̂ = (1/n) ∑_{i=1}^n X_i = X̄

∂l(μ, σ²)/∂σ² = −n/(2σ̂²) + (1/(2σ̂⁴)) ∑_{i=1}^n (X_i − μ̂)² = 0
   ⟹   σ̂² = (1/n) ∑_{i=1}^n (X_i − μ̂)² = (1/n) ∑_{i=1}^n (X_i − X̄)²
15
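One detail to flag in R: the MLE of σ² divides by n, while var() divides by n − 1. A quick sketch with simulated data (the settings and names are mine):

set.seed(111)
x = rnorm(100, mean=5, sd=2)
n = length(x)

mu.hat     = mean(x)                 # MLE of mu
sigma2.hat = mean((x - mu.hat)^2)    # MLE of sigma^2 (divides by n)

sigma2.hat
var(x)*(n-1)/n                       # same value: rescaling the sample variance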
MLE Example 3: Gamma distribution
• Suppose Xi ~ Gamma(a, λ) and i.i.d.
• What is the likelihood function?
lik(a, λ) = ∏_{i=1}^n [ (λ^a / Γ(a)) · X_i^{a−1} · e^{−λX_i} ]
• What is the log-likelihood function?
l(a, λ) = ∑_{i=1}^n log[ (λ^a / Γ(a)) · X_i^{a−1} · e^{−λX_i} ]
        = ∑_{i=1}^n a·log(λ) − ∑_{i=1}^n log Γ(a) + ∑_{i=1}^n (a−1)·log X_i − λ ∑_{i=1}^n X_i
        = na·log(λ) − n·log Γ(a) + (a−1) ∑_{i=1}^n log X_i − λ ∑_{i=1}^n X_i
MLE Example 3: Gamma dist. (cont.)
• Two unknown parameters: θ ={a, λ}, so take the two
partial derivatives separately:

∂l(a, λ)/∂a = ∂/∂a [ na·log(λ) − n·log Γ(a) + (a−1) ∑ log X_i − λ ∑ X_i ]
            = n·log(λ) − n·Γ'(a)/Γ(a) + ∑_{i=1}^n log X_i

∂l(a, λ)/∂λ = ∂/∂λ [ na·log(λ) − n·log Γ(a) + (a−1) ∑ log X_i − λ ∑ X_i ]
            = na/λ − ∑_{i=1}^n X_i
MLE Example 3: Gamma dist. (cont.)
• And set to zero (solve the λ-partial first):
∂l(a, λ)/∂λ = nâ/λ̂ − ∑_{i=1}^n X_i = 0   ⟹   λ̂ = nâ / ∑_{i=1}^n X_i = â / X̄

• And plug into the a-partial equation:
∂l(a, λ)/∂a = n·log(λ̂) − n·Γ'(â)/Γ(â) + ∑_{i=1}^n log X_i = 0
   ⟹   log(â) − log(X̄) − Γ'(â)/Γ(â) + (1/n) ∑_{i=1}^n log X_i = 0
MLE Example 3: Gamma dist. (cont.)
• That second equation (the a-partial solution) is a non-
linear equation…a closed form solution does not exist
(and boy is it ugly!)
• For a particular application, we need a numerical method
to solve. What are some examples?
• bisection method
• Newton’s method (also called Newton-Raphson)
• etc…
• We can get R to do this for us!
• How can we get sampling distributions for the MLEs?
• Empirically via simulation/resampling methods!
19
Which Newton is That?

20
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
21
Functions in R
• We’d like to calculate the log-likelihood (or likelihood) for a
model given a set of parameter(s), θ, and the data, x.
• The best way to do this is to write a user-defined function in
R so that this calculation can be done over and over again (so
we can draw the function, determine its maximum, etc…).
• A user-defined function in R looks like this:
my.function = function(arg1,arg2,...){
result = ... # do some work
return(result)
}
• my.function has several parts: the function name, arguments,
body, and results to be passed back to the user in the regular
R environment.
• The work (like the result var) is done internally in the
function, and cannot be accessed outside the function unless
it is explicitly returned to the user with the return expression.
• An example would be helpful…
22
R-Function: Gamma log-lik
• Recall that the log-likelihood of i.i.d. Xi ~ Gamma(a, λ):
l(a, λ) = na·log(λ) − n·log Γ(a) + (a−1) ∑_{i=1}^n log X_i − λ ∑_{i=1}^n X_i

• A user-defined function in R to calculate this:


gamma.loglik = function(theta, x){
  n = length(x)
  a = theta[1]
  lambda = theta[2]
  l = n*(a*log(lambda) - log(gamma(a)) + (a-1)*mean(log(x)) - lambda*mean(x))
  return(l)
}

• So how do we use this function? Need an application…


23
Boston Storm Data
• Recall the Boston Storm Data from 2013:
• Observations are the daily rainfall for each day that it
rained in Boston (n = 129)
• The histogram loosely resembles a Gamma distribution
(right-skewed):
> hist(precip, col="gray", main="")

[Figure: histogram of precip (x-axis: precip, y-axis: Frequency)]

• We will use this as an illustration for numerically calculating the


log-likelihood and finding the MLE estimates for the
gamma distribution.
24
Some old R code (to find MOMs)
f = file.choose()
storm = read.csv(f, header=T)
dim(storm)
names(storm)
attach(storm)

n = length(precip)
xbar = mean(precip)
s2 = var(precip)
sigma2.hat = s2*(n-1)/n
lambda.mom = xbar/sigma2.hat
a.mom = xbar^2/sigma2.hat
lambda.mom
a.mom

> lambda.mom
[1] 1.459674
> a.mom
[1] 0.4325839
25
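With the MOM values in hand, we can already try out gamma.loglik from the previous slide (an illustrative call; the MLE should achieve a log-likelihood at least this large):

gamma.loglik(theta=c(a.mom, lambda.mom), x=precip)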
Plotting the l(a, λ): a double for loop
• We would like to plot the log-likelihood for the Boston storm
dataset for various values of a and λ.
• This poses a difficulty since there are two unknowns
(parameters here) we need to look over
• We need to use a “double” for loop. See code below:
a = 1:100/100
lambda = 1:200/50
loglik = matrix(NA, nrow=length(a), ncol=length(lambda))
dim(loglik)

for(i in 1:length(a)){
  for(j in 1:length(lambda)){
    loglik[i,j] = gamma.loglik(theta=c(a[i],lambda[j]), x=precip)
  }
}
Plotting l(a, λ) and Finding MLEs
• Let’s plot it (we need a 3-D plot, so we load an R package).
See code below:
require(scatterplot3d)
persp(x=a,y=lambda,z=loglik,shade=0.5,axes=T,
col=c("darkred"),phi=20, theta=-60,
ticktype="detailed")

• And with some careful indexing, we can find the correct values
for a and λ that maximize our calculated log-likelihoods:
index = which(loglik == max(loglik))
a[index %% length(a)]
lambda[ceiling(index/length(a))]

> a[index %% length(a)]
[1] 0.59
> lambda[ceiling(index/length(a))]
[1] 2
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
28
Newton’s Method
• Newton’s Method (sometimes called the Newton-Raphson
Method) is a numerical way to solve for roots of a
function.
• It is an iterative algorithm based on the following equation:
x_1 = x_0 − f(x_0)/f'(x_0),   and in general,   x_{k+1} = x_k − f(x_k)/f'(x_k)
• You iteratively update a potential root x_k until
| x_k − x_{k−1} | < ε for some small error tolerance ε.
• Some issues can arise, but the key is that your starting point, x_0,
needs to be somewhat close to the root you are after.
29
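A bare-bones Newton's method in R might look like the sketch below. This is not a function we use later (uniroot and optim will do the work for us); it is just an illustration of the update rule, and all of the names are mine:

newton = function(f, fprime, x0, tol=1e-8, max.iter=100){
  x = x0
  for(k in 1:max.iter){
    x.new = x - f(x)/fprime(x)          # Newton update: x_{k+1} = x_k - f(x_k)/f'(x_k)
    if(abs(x.new - x) < tol) return(x.new)
    x = x.new
  }
  warning("did not converge")
  return(x)
}

# Example: find the root of f(x) = x^2 - 2 starting at x0 = 1 (approx. sqrt(2))
newton(f=function(x) x^2 - 2, fprime=function(x) 2*x, x0=1)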
Newton’s Method (cont.)
• Newton’s Method is quite applicable for maximum
likelihood estimation! The situation often arises when
the roots of l’(θ) = 0 do not have a closed form
expression.
• So for MLE, the f(x) from the previous slide is actually
l’(θ). Thus the formulas become:
l ' ( 0 ) l ' ( k )
1   0   k 1   k 
l ' ' ( 0 ) l ' ' ( k )

• We will talk about l’’(θ) in a bit. But for now, we can


just use the predefined functions in R: uniroot and
polyroot

30
Using Newton’s Method: Gamma dist.
• Recall, the a-partial equation for l’(a) for a gamma
distribution was:
∂l(a, λ)/∂a = log(â) − Γ'(â)/Γ(â) − log(X̄) + (1/n) ∑_{i=1}^n log X_i = 0
• So we need to do 2 steps in R to solve this equation:
1) Create a user-defined function to calculate the result of
this function (l’(a)) given a value of the parameter a
(and given the data X1, X2, …, Xn).
2) Use the function uniroot to find the appropriate root
for this equation
*And don’t forget there’s another parameter to estimate
afterwards: λ̂ = â / X̄
A Function in R to Calculate l’(a)
• Step #1: create a user-defined function (let’s call it a.partial)
to calculate the results of the function of the a-partial
derivative (given parameter, a, and data, x):
∂l(a, λ)/∂a = log(â) − Γ'(â)/Γ(â) − log(X̄) + (1/n) ∑_{i=1}^n log X_i = 0

a.partial = function(a, x){
  f = log(a) - digamma(a) - log(mean(x)) + mean(log(x))
  return(f)
}

• What is the name of the function? What are the arguments


of the function? What is the body of the function? What is
the result of the function? How do we use this function?

32
uniroot in R to solve l’(a) = 0
• Step #2: Use the function uniroot to find the appropriate
root for this equation:
?uniroot

uniroot(f=a.partial, interval=c(min(precip), max(precip)), x=precip)

result1 = uniroot(f=a.partial, interval=c(min(precip), max(precip)), x=precip)
a.mle = result1$root
lambda.mle = a.mle/mean(precip)
a.mle
lambda.mle

> result1
$root
[1] 0.5920035
$f.root
[1] -4.113931e-06
$iter
[1] 10
$estim.prec
[1] 6.103516e-05

> a.mle
[1] 0.5920035
> lambda.mle
[1] 1.997605
33
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
34
Fisher’s Information
• Finally, back to real Statistics (and not computation)!
• Recall, l(θ) is measuring the likelihood of the potential
values of θ given the data, X1, X2, …, Xn.
• We’d like a measure for the uncertainty of an MLE
(like the variance of θ̂_MLE).
• The Fisher information, I_n(θ), does exactly that:
I_n(θ) = E[ ( ∂/∂θ l(θ) )² ]
• Which is mathematically equivalent to (for ease of
calculations):
I_n(θ) = −E[ ∂²/∂θ² l(θ) ]
*See DeGroot p.515-516 for proof.
35
Fisher’s Information
• Theorem: for MLEs, we can show that under “mild
conditions”:
θ̂_MLE ~ N( μ_θ̂ = θ_0 ,  σ²_θ̂ = 1/I_n(θ_0) )   (approximately)
where θ_0 is the true unknown parameter(s).

• What distribution does √(I_n(θ_0)) · (θ̂_MLE − θ_0) have?
• What does this say about the bias of MLEs? What
about consistency of MLEs?
• What is Fisher’s information, I_n(θ_0), measuring? Do
you want it to be large or small?
36
Derivation of Theorem
• By definition of the MLE: l'(θ̂_MLE) = 0. Applying a Taylor
series expansion around θ_0:
l'(θ̂_MLE) ≈ l'(θ_0) + (θ̂_MLE − θ_0) · l''(θ_0)
⟹   θ̂_MLE − θ_0 ≈ −l'(θ_0) / l''(θ_0)
⟹   √n · (θ̂_MLE − θ_0) ≈ [ (1/√n) · l'(θ_0) ] / [ −(1/n) · l''(θ_0) ]
• Expand the numerator:
(1/√n) · l'(θ_0) = (1/√n) ∑_{i=1}^n ∂/∂θ log f(X_i | θ_0)
• What is this a sum of?
• i.i.d. random variables. So if CLT holds…
37
Derivation of Theorem (cont.)
• By the CLT, we know:
(1/√n) ∑_{i=1}^n ∂/∂θ log f(X_i | θ_0)  →  N( μ = E[ ∂/∂θ log f(X | θ_0) ] ,  σ² = Var[ ∂/∂θ log f(X | θ_0) ] )

• What are E[ ∂/∂θ log f(X | θ_0) ] and Var[ ∂/∂θ log f(X | θ_0) ]?

E[ ∂/∂θ log f(X | θ_0) ] = ∫ [ ∂/∂θ log f(x | θ_0) ] · f(x | θ_0) dx
                         = ∫ [ (∂/∂θ f(x | θ_0)) / f(x | θ_0) ] · f(x | θ_0) dx
                         = ∫ ∂/∂θ f(x | θ_0) dx  =  ∂/∂θ ∫ f(x | θ_0) dx  =  0
38
Bring it home!
Var[ ∂/∂θ log f(X | θ_0) ] = E[ ( ∂/∂θ log f(X | θ_0) )² ] − 0²  =  I(θ_0)

• Hooray! So (if n is large enough, by the CLT):
(1/√n) · l'(θ_0)  ~  N( 0, I(θ_0) )   (approximately)

• What about the denominator?
−(1/n) · l''(θ_0) = −(1/n) ∑_{i=1}^n ∂²/∂θ² log f(X_i | θ_0)  →  −E[ ∂²/∂θ² log f(X | θ_0) ]  =  I(θ_0)

• Thus:
√n · (θ̂_MLE − θ_0) = [ (1/√n) · l'(θ_0) ] / [ −(1/n) · l''(θ_0) ]
                   →  (1/I(θ_0)) · N( 0, I(θ_0) )  =  N( 0, I(θ_0)/I(θ_0)² )  =  N( 0, 1/I(θ_0) )

so θ̂_MLE is approximately N( θ_0 , 1/(n·I(θ_0)) ) = N( θ_0 , 1/I_n(θ_0) ), since I_n(θ_0) = n·I(θ_0).
39
Proof is over!
• So what did we just show?
• That the sampling distribution of any MLE will be
approximately Normally distributed, given:
• n is large enough,
• you don’t have too extreme outliers in l’(θ0)
• and your observations are i.i.d.
• So what? Now we have an easy way to construct
confidence intervals and conduct hypothesis tests!
• Note this also holds in the multi-dimensional
parameter case, but what are the dimensions of I_n(θ_0)?
• So it needs to be written as:
θ̂_MLE ≈ N( μ = θ_0 ,  Σ = I_n^{-1}(θ_0) )
40
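We can also see the theorem empirically. The sketch below (the simulation settings are mine) repeatedly draws Poisson samples, computes the MLE λ̂ = X̄ each time, and compares the simulated sampling distribution to the N( λ_0 , 1/I_n(λ_0) ) = N( λ_0 , λ_0/n ) approximation:

set.seed(111)
lambda0 = 3; n = 100; nsim = 5000

lambda.hat = replicate(nsim, mean(rpois(n, lambda=lambda0)))   # MLE from each sample

mean(lambda.hat)    # close to lambda0 (approximately unbiased)
var(lambda.hat)     # close to lambda0/n = 1/I_n(lambda0)
lambda0/n

hist(lambda.hat, breaks=40, freq=FALSE, main="", xlab="MLE of lambda")
curve(dnorm(x, mean=lambda0, sd=sqrt(lambda0/n)), add=TRUE)    # normal approximation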
Happy National Battery Day!
• Batteries were invented by this guy in 1800:

Alessandro Volta

• Batteries are good for:


• Transferring chemical energy
into electrical energy.
• 9V: Shocking your tongue:

• D-cell: Throwing at JD Drew:


41
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
42
Confidence Intervals Revisited
• Back in Unit 2, we talked about confidence intervals from an
Intro. Stat (104) perspective.
• What’s the purpose of calculating a confidence interval?
• Gives an interval estimate for a parameter for the
population/theoretical distribution.
• What is the interpretation of a confidence interval?
• It is a range of plausible values for the parameter of the
distribution that generated our sample
• “We are 95% confident that the true population parameter falls
in the range”
• “If we were to repeatedly sample and recalculate this interval
over-and-over again, we’d expect 95% of them to cover the
true parameter we are estimating (the randomness comes from
the sampling/data generation process).”
43
Confidence Intervals Revisited
• How is one classically calculated?
• We calculate the quantiles of the sampling distribution of the
statistic assuming the population parameter was equal to the
observed statistic in the actual sample.
• For example, we’d like to construct a 95% confidence interval
for the mean of a normal distribution, µ. Then we can calculate
a 95% C.I. for μ based on a t-distribution, since the statistic
(X̄ − μ)/(S/√n) is known to follow a t-distribution:
n X    *
P (t / 2 
*
 t1 / 2 )  1  
S • What’s the estimator
here?
  *  S   S  
P    X  t / 2  , X  t1 / 2 
*
   1   • What’s the difference
   n  n   between the 2nd and
3rd lines? Why is this
   S   S   legal?
P    X  t1 / 2 
*
, X  t1 / 2 
*
   1  
   n  n  
44
Pivotal Statistic
• In the previous slide, we mentioned that the statistic
T = (X̄ − μ)/(S/√n) is known to follow a t-distribution.
• Not only that, but this t-distribution that the statistic follows
does not depend on the values of the parameters. Thus, T is
called a:
• pivotal statistic: a statistic, V, (which is a random variable)
whose distribution is the same for all values of the
parameter(s).
• Pivotal statistics are useful because they can be used as the
basis of confidence intervals: just select the appropriate
quantiles of the distribution of the pivotal statistic, V, (based
on its inverse CDF: FV-1) in order to build the bounds of the
confidence interval.
• This is what we did in the previous slide: t*_{α/2} and t*_{1−α/2}
are the quantiles from the dist. of the pivotal statistic, T.
45
Calculating Confidence Intervals
• There are 3 approaches (that we will cover) to calculate
confidence intervals:
• Exact method (based on the analytical solution to
the sampling distribution)
• Approximation based on large sample theory
(usually based on the normal dist.)
• Resampling methods (bootstrap, etc…)
• In each approach, you pull off the (α/2) th and (1 –
α/2)th quantiles of the sampling distribution, where α
is some small error rate (usually α = 0.05).
• For our MLEs, we can use any of the 3. But usually
the exact distribution is difficult to solve, so we will
usually rely on the asymptotic normality of MLEs
46
Conf. Int. Construction for MLEs
(based on asymptotic normality)
• From earlier in Unit 4, we saw:
√(I_n(θ_0)) · (θ̂_MLE − θ_0)  →  N(0, 1)

• What is θ_0? What is I_n(θ_0)?

• So we can always construct an asymptotic confidence
interval from this result:
P( z*_{α/2}  ≤  √(I_n(θ_0)) · (θ̂_MLE − θ_0)  ≤  z*_{1−α/2} )  =  1 − α

P( θ_0 ∈ ( θ̂_MLE − z*_{1−α/2}/√(I_n(θ_0)) ,  θ̂_MLE + z*_{1−α/2}/√(I_n(θ_0)) ) )  =  1 − α

• Uh-oh, this assumes we know the true I_n(θ_0). What
should we use instead?  I_n(θ̂_MLE)
47
Conf. Int. Construction for MLEs
(based on asymptotic normality)
P( θ_0 ∈ ( θ̂_MLE − z*_{1−α/2}/√(I_n(θ̂_MLE)) ,  θ̂_MLE + z*_{1−α/2}/√(I_n(θ̂_MLE)) ) )  ≈  1 − α

• By using I_n(θ̂_MLE) instead of I_n(θ_0), we technically
mess up the asymptotic normal distribution (just like
using s in place of σ messed up the normal distribution
in the C.I. for µ).
• But it’s close enough to a normal distribution when n is
large (just like a t-distribution looks like a normal
when df is large).
48
Example: MLE-based C.I. for a Poisson
• Let i.i.d. X_i ~ Pois(λ).
• What is λ̂_MLE? What is Var(λ̂_MLE)?
λ̂_MLE = X̄        Var(X̄) = λ/n
• What is Fisher’s Information, I_n(λ)?
I_n(λ) = −E[ ∂²/∂λ² l(λ) ]
       = −E[ ∂²/∂λ² ( log(λ)·∑ X_i − ∑ log(X_i!) − nλ ) ]
       = −E[ −(1/λ²) ∑_{i=1}^n X_i ]
       =  E[ ∑_{i=1}^n X_i ] / λ²  =  nλ/λ²  =  n/λ

• Thus the estimated Information is I_n(λ̂) = n/λ̂.
49
Example: MLE-based C.I. for a Poisson
(cont.)
• Construct an asymptotic 95% C.I. for λ:
( λ̂_MLE − z*_{1−α/2}/√(I_n(λ̂_MLE)) ,  λ̂_MLE + z*_{1−α/2}/√(I_n(λ̂_MLE)) )

   =  ( λ̂ − z*_{1−α/2}·√(λ̂/n) ,  λ̂ + z*_{1−α/2}·√(λ̂/n) )

   =  ( λ̂ − 1.96·√(λ̂/n) ,  λ̂ + 1.96·√(λ̂/n) )
50
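A short R sketch of this interval (simulated data; the settings and names are mine):

set.seed(111)
x = rpois(150, lambda=4)        # simulated Poisson sample
n = length(x)

lambda.mle = mean(x)
se.mle = sqrt(lambda.mle/n)     # 1/sqrt(I_n(lambda.mle))

lambda.mle + c(-1, 1)*qnorm(0.975)*se.mle   # asymptotic 95% C.I. for lambda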
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
51
Mean Square Error
• We have talked about a few properties of estimators so
far: bias, consistency, and variance.
• Another way to measure how good an estimator is: the
mean squared error (MSE):

MSE (ˆ)  E (ˆ   0 ) 2 

 Var (ˆ)  E (ˆ)   0 
2

• MSE is an intuitive measure of how good an estimator


is. Just look at the squared distance from what you want
it to be. To compare estimators, we’d like to minimize
this measure.
• Note: MSE takes into account both the bias, E[θ̂] − θ_0,
of an estimator and its variance.
52
Efficiency of Estimators
• The relative efficiency of two estimators, θ̂ and θ̃, is
defined to be:
eff(θ̂, θ̃) = MSE(θ̃) / MSE(θ̂)
• If θ̂ and θ̃ have the same bias, this simplifies to:
eff(θ̂, θ̃) = Var(θ̃) / Var(θ̂)
• Example: Suppose i.i.d. X_i ~ N(μ, σ²). Let θ̂ = μ̂_MLE = X̄
and θ̃ = X_1.
• What are these estimators’ biases? What are their
variances? What is the relative efficiency of θ̂ to θ̃?
53
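A quick simulation sketch of this comparison (both estimators are unbiased for μ, so the relative efficiency reduces to a ratio of variances; the settings are mine):

set.seed(111)
n = 25; nsim = 10000

sims = replicate(nsim, {
  x = rnorm(n, mean=0, sd=1)
  c(xbar = mean(x), x1 = x[1])    # theta.hat = Xbar and theta.tilde = X_1
})

var(sims["x1", ]) / var(sims["xbar", ])   # relative efficiency: roughly n = 25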
Cramer-Rao Lower Bound
• For a statistical estimation problem, the optimal
estimate is often considered to be the one with
min(MSE).
• If we are comparing two unbiased estimators, then
MSE(θ̂) = Var(θ̂).
• So what is the best we can do as far as variance?
• Cramer-Rao Lower Bound: Let θ̂ be an unbiased
estimator of θ. Then:
Var(θ̂)  ≥  1 / I_n(θ)
• If an estimator achieves the Cramer-Rao lower bound, it
is said to be completely efficient.
54
Cramer-Rao Example: Poisson
• Let i.i.d. X_i ~ Pois(λ).

• What is λ̂_MLE?
λ̂_MLE = X̄

• What is Var(λ̂_MLE)?
Var(X̄) = λ/n

• What is Fisher’s Information, I_n(λ)?
I_n(λ) = n/λ

• We’ve achieved the Cramer-Rao Lower Bound!!!
55
Asymptotic Efficiency of MLEs
• What is the asymptotic bias of any maximum likelihood
estimator, θ̂_MLE? Meaning: what is the bias as n → ∞?
• In the last lecture we saw that E(θ̂_MLE) → θ as n → ∞.
• What is the asymptotic variance of θ̂_MLE?
Var(θ̂_MLE)  →  1 / I_n(θ)
• So MLEs achieve the Cramer-Rao lower bound
asymptotically.
• This means MLEs are asymptotically efficient! It’s the
best we can do (asymptotically).
56
Unit 4 Outline
• The Likelihood Function and Maximum Likelihood
Estimation (MLE)
• MLE Examples
• Functions in R
• Newton’s Method to find roots
• Fisher’s Information and the Asymptotic Normality of
MLEs
• Confidence Intervals for MLEs
• Mean Square Error, Efficiency, and the Cramer-Rao Lower
Bound
• Optimization in R
57
Optimization in R
• In the last lecture, we numerically calculated the
MLE by solving the derivative of the log-likelihood
function, l’(θ) = 0, by using the uniroot function
in R.
• This works great if we can analytically write down
l’(θ). Sometimes this is not easy.
• That’s OK, there’s another way to numerically solve
for MLEs based on the log-likelihood function
directly: using R’s optim function.
• This will allow us to maximize a function that has
multiple parameters at once. But the key is: you
need a good starting spot!
58
optim in R to minimize or
maximize functions
?optim

optim(par, fn, gr = NULL, ...,
      control = list(), hessian = FALSE)

par: the initial estimate(s) of the parameter(s) of the function
fn: the function which you are minimizing or maximizing
gr: optional: an analytical function for the gradient
control: other arg’s: use “control=list(fnscale=-1)” to maximize
hessian: whether to return the numeric hessian matrix for fn
…: other arguments to pass along to the function fn
*Note: read the help file for other arguments.
59
optim: Example l(θ) Maximization
• Using the function optim to maximize a likelihood
function:

gamma.loglik = function(theta, x){
  n = length(x)
  a = theta[1]
  lambda = theta[2]
  l = n*(a*log(lambda) - log(gamma(a)) + (a-1)*mean(log(x)) - lambda*mean(x))
  return(l)
}

theta = optim(par=c(a.mom, lambda.mom), fn=gamma.loglik,
              control=list(fnscale=-1), x=precip)
theta

> theta
$par
[1] 0.59196 1.99734

$value
[1] 42.54922

$counts
function gradient
     101       NA

$convergence
[1] 0

$message
NULL
60
So which approach should we take?

• We’ve learned 3 ways to calculate the MLEs:


1) Analytical Solution
2) Numerically solving l’(θ) (uniroot)
3) Numerically maximizing l(θ) directly (optim)
• So which is the best choice? Does it really matter?
• At the very least, you can double-check your
solutions with another approach…

61
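For the storm-data gamma example, a quick cross-check might look like this (assuming the objects from the earlier slides are still in the workspace):

c(a.mle, lambda.mle)   # MLEs from numerically solving l'(a) = 0 with uniroot
theta$par              # MLEs from numerically maximizing l(a, lambda) with optim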
Hooray for MLEs!!
• What’s your favorite MLE?
Managed Learning Environment? Major League Eating?

DJ MLE? Maximum Likelihood Estimation?

62
