Week 6: MLE
University of Colorado
Anschutz Medical Campus
Bernoulli example
Suppose that we know that the following ten numbers were simulated
using a Bernoulli distribution: 0 0 0 1 1 1 0 1 1 1
We can denote them by $y_1, y_2, \ldots, y_{10}$. So $y_1 = 0$ and $y_{10} = 1$
Recall that the pdf of a Bernoulli random variable is
$f(y; p) = p^y (1 - p)^{1-y}$, where $y \in \{0, 1\}$
The probability of 1 is p while the probability of 0 is (1 − p)
We want to figure out what p was used to simulate the ten numbers
All we know is that 1) they come from a Bernoulli distribution and 2)
they are independent of each other
Bernoulli example
Bernoulli example
We use the product symbol $\prod$ to simplify the notation. For example,
$\prod_{i=1}^{2} x_i = x_1 \cdot x_2$
So we can write the joint probability, or the likelihood $L$, of seeing
those 10 numbers as:
$L(p) = \prod_{i=1}^{10} p^{y_i} (1 - p)^{1-y_i}$
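As a quick numerical illustration (just a sketch; the two values of p below are arbitrary guesses): the ten numbers contain six 1s and four 0s, so we can evaluate the likelihood at a couple of candidate values of p in Stata:
di 0.5^6 * (1 - 0.5)^4    // L(0.5)
di 0.6^6 * (1 - 0.6)^4    // L(0.6) is larger, so p = 0.6 looks like a better guess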
Bernoulli example
Remember that we are trying to find the p that was used to generate
the 10 numbers. That’s our unknown
In other words, we want to find the p that maximizes the likelihood
function L(p). Once we find it, we write it as our estimated parameter p̂
Yet another way: we want to find the p̂ that makes the joint
likelihood of seeing those numbers as high as possible
Sounds like a calculus problem... We can take the derivative of L(p)
with respect to p and set it to zero to find the optimal p̂
Of course, the second step is to verify that it’s a maximum and not a
minimum (take the second derivative) and also verify that it's unique, etc.
We will skip those steps
Bernoulli example
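Taking the log turns the product into a sum, which is much easier to differentiate. With $n$ observations and $\bar{y}$ the proportion of 1s (so $\sum_i y_i = n\bar{y}$), the log likelihood is:
$\ln L(p) = \sum_{i=1}^{n} \left[ y_i \ln(p) + (1 - y_i)\ln(1-p) \right] = n\bar{y}\,\ln(p) + (n - n\bar{y})\ln(1-p)$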
Bernoulli example
$\frac{d \ln L(p)}{dp} = \frac{n\bar{y}}{p} - \frac{n - n\bar{y}}{1-p} = 0$
After solving, we'll find that $\hat{p} = \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$
So that’s the MLE estimator of p. This is saying more or less the
obvious: our best guess for the p that generated the data is the
proportion of 1s, in this case p̂ = 0.6
We would need to verify that our estimator satisfies the three basic
properties of an estimator: unbiasedness, efficiency, and consistency (this will
be on your exam)
Note that we can plug the optimal p̂ back into the ln likelihood function:
$\ln L(\hat{p}) = n\bar{y}\,\ln(\hat{p}) + (n - n\bar{y})\ln(1 - \hat{p}) = a$, where $a$ is the highest
value of the log likelihood we can achieve (we chose p̂ that way)
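For the ten numbers above, p̂ = 0.6, and we can plug it back in with a quick check in Stata (just a sketch of the arithmetic):
di 10*0.6*ln(0.6) + (10 - 10*0.6)*ln(1 - 0.6)    // about -6.73, the maximized log likelihood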
Example
di 100*0.46*ln(0.46) + (100-100*0.46)*ln(1-0.46)
-68.994376
And we just did logistic regression “by hand.” A logistic model with
only a constant (no covariates), also known as the null model
Example
We will use the logit command to model indicator variables, like
whether a person died
logit bernie
Iteration 0: log likelihood = -68.994376
Iteration 1: log likelihood = -68.994376
------------------------------------------------------------------------------
bernie | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | -.1603427 .2006431 -0.80 0.424 -.5535959 .2329106
------------------------------------------------------------------------------
di 1/(1+exp( .1603427 ))
.45999999
Let’s plot the - ln(L) function with respect to p
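A minimal sketch of such a plot, assuming we use the logit example above (n = 100 with 46 ones) and the twoway function command that appears later in these notes; the minimum of -ln(L) should sit at p = 0.46:
twoway function y = -(100*0.46*ln(x) + (100 - 100*0.46)*ln(1 - x)), range(0.01 0.99) ///
    xline(0.46) xtitle("p") ytitle("-ln L(p)") title("-ln L(p) for the Bernoulli example")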
What about the precision (standard error) of the estimate?
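A sketch of the usual answer: the standard error comes from the curvature (second derivative) of the log likelihood at the maximum. For the Bernoulli MLE this gives $SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$, and on the log-odds scale $1/\sqrt{n\hat{p}(1-\hat{p})}$, which we can check against the logit output above:
di sqrt(0.46*(1 - 0.46)/100)       // SE of p-hat
di 1/sqrt(100*0.46*(1 - 0.46))     // SE on the log-odds scale; compare with .2006431 above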
What about covariates?
Normal example
I tell you that they were simulated from a normal distribution with
parameters µ and σ². The numbers are independent. Your job is to
come up with the best guess for the two parameters
Same problem as with the Bernoulli example. We can solve it in
exactly the same way
Normal example
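Since the draws are independent, the joint likelihood is the product of the normal densities:
$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(y_i - \mu)^2}{2\sigma^2} \right)$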
Normal example
As before, we can simplify the problem by taking the log to make the
derivatives easier. But first, a digression:
Alert: Perhaps you are wondering, why are we using the pdf of the
normal if we know that the probability of any single number is zero?
Because we can think of the pdf as giving us the probability that y falls
in a tiny interval $[y_i, y_i + d]$ (the pdf times d) as $d \to 0$
We need computers with a lot of floating-point capability. MLE goes
back to Fisher in the 1920s, but it was super difficult to implement before
modern computers. In the 80s, we had Commodore 64s
https://en.wikipedia.org/wiki/Commodore_64
Researchers could use mainframe computers with punch cards. Your
iPhone is faster than mainframes that used to fill an entire building...
https://en.wikipedia.org/wiki/Mainframe_computer
Maybe you were not wondering that but I was at some point. I
wonder a lot in general. And daydream on a minute by minute basis...
Normal example
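A sketch of the standard steps for the mean: take the log, differentiate with respect to $\mu$, and set the derivative to zero.
$\ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2$
$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \bar{y}$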
Normal example
We can also figure out the variance by taking the derivative with
respect to $\sigma^2$
We will find that $\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{\mu})^2}{n}$
If you remember the review lecture on probability and statistics, we
know that this formula is biased. We need to divide by (n − 1) instead
(What is the definition of bias?)
This happens often in MLE. The MLE estimate of the variance is
often biased but it is easy to correct for it
Normal example Stata
We just figured out that the best guess is to calculate the sample
mean and sample variance
We can easily verify in Stata
clear
set seed 1234567
set obs 100
gen ynorm = rnormal(100, 10)
sum ynorm
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
ynorm | 100 98.52294 10.03931 74.16368 123.5079
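Note that sum reports the sample SD, which divides by n - 1, while the MLE divides by n; a quick conversion using the sample SD reported above:
di sqrt((100 - 1)/100) * 10.03931    // MLE estimate of sigma, slightly smaller than the sample SD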
Linear regression: adding covariates
Linear regression
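With covariates, we keep the same normal density but let the mean depend on x; a sketch for the one-covariate case:
$\ln L(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2$
Maximizing over $\beta_0$ and $\beta_1$ is the same as minimizing the sum of squared errors, which is why MLE and OLS give the same coefficients; the variance estimate is where they differ (MLE divides the SSE by n)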
The regression command again
The regression command does not use MLE but it does give you the
log likelihood
use auto
qui reg price weight mpg
ereturn list
scalars:
e(N) = 74
e(df_m) = 2
e(df_r) = 71
e(F) = 14.7398153853841
e(r2) = .2933891231947529
e(rmse) = 2514.028573297152
e(mss) = 186321279.739451
e(rss) = 448744116.3821706
e(r2_a) = .27348459145376
e(ll) = -682.8636883111164
e(ll_0) = -695.7128688987767
e(rank) = 3
The log likelihood of the estimated model is stored in e(ll). The log
likelihood of the null model (with no covariates) is stored in e(ll_0).
From the numbers above, e(ll) > e(ll_0)
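Where does e(ll) come from? For a linear model with normal errors, the maximized log likelihood can be written as $-\frac{n}{2}\left[\ln(2\pi) + \ln(RSS/n) + 1\right]$; a quick check, assuming the reg results are still in memory:
di -e(N)/2 * ( ln(2*_pi) + ln(e(rss)/e(N)) + 1 )    // should match e(ll) = -682.8637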
The regression command again
Easy MLE in Stata
To estimate in MLE using Stata you need to write a program but
Stata now makes it a lot easier (for teaching purposes) with the
mlexp command
mlexp (ln(normalden(price, {xb: weight mpg _cons}, {sigma})))
initial: log likelihood = -<inf> (could not be evaluated)
feasible: log likelihood = -803.76324
rescale: log likelihood = -729.85758
rescale eq: log likelihood = -697.2346
Iteration 0: log likelihood = -697.2346
Iteration 1: log likelihood = -687.4506
Iteration 2: log likelihood = -682.92425
Iteration 3: log likelihood = -682.86401
Iteration 4: log likelihood = -682.86369
Iteration 5: log likelihood = -682.86369
Maximum likelihood estimation
Log likelihood = -682.86369 Number of obs = 74
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
weight | 1.746559 .6282189 2.78 0.005 .5152727 2.977846
mpg | -49.51222 84.39157 -0.59 0.557 -214.9167 115.8922
_cons | 1946.069 3523.382 0.55 0.581 -4959.634 8851.771
-------------+----------------------------------------------------------------
/sigma | 2462.542 202.4197 12.17 0.000 2065.806 2859.277
------------------------------------------------------------------------------
Almost the same
The SEs are slightly different, and so is the Root MSE. With MLE, Stata
calculates the SEs from the second derivatives of the log likelihood, and the
MLE of $\sigma^2$ divides the SSE by n rather than n - k
. reg price weight mpg
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | 1.746559 .6413538 2.72 0.008 .467736 3.025382
mpg | -49.51222 86.15604 -0.57 0.567 -221.3025 122.278
_cons | 1946.069 3597.05 0.54 0.590 -5226.245 9118.382
------------------------------------------------------------------------------
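A quick check of both differences (a sketch, assuming the reg results are still in memory): the MLE of sigma divides the residual sum of squares by n, and for this model the MLE standard errors are the OLS ones rescaled by $\sqrt{(n-k)/n}$:
di sqrt(e(rss)/e(N))            // MLE sigma-hat; compare with /sigma = 2462.542 above
di .6413538 * sqrt(71/74)       // OLS SE for weight rescaled; compare with .6282189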
Asymptotic properties are so important in stats
The auto dataset has only 74 obs. What if we use the MEPS, which has
about 15,000? (That's really overkill, but just to make the point)
mlexp (ln(normalden(lexp, {xb: age _cons}, {sigma})))
initial: log likelihood = -<inf> (could not be evaluated)
could not find feasible values
* I tried giving it starting values but it didn't work. Easier to do it the old-fashioned way
Asymptotic properties are so important in stats
ml maximize
initial: log likelihood = -453412.9
alternative: log likelihood = -163550.49
.
Iteration 5: log likelihood = -29153.79
Number of obs = 15,946
Wald chi2(2) = 2981.22
Log likelihood = -29153.79 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
lexp | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
age | .0358123 .0006779 52.82 0.000 .0344836 .0371411
female | .3511679 .024252 14.48 0.000 .303635 .3987009
_cons | 5.329011 .0373155 142.81 0.000 5.255874 5.402148
-------------+----------------------------------------------------------------
lnsigma |
_cons | .4093438 .0055996 73.10 0.000 .3983687 .4203189
------------------------------------------------------------------------------
display exp([lnsigma]_cons)
1.5058293
Asymptotic properties are SO IMPORTANT in stats
Same model estimated with reg: with almost 16,000 observations, the OLS and MLE standard errors are essentially identical
------------------------------------------------------------------------------
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0358123 .000678 52.82 0.000 .0344834 .0371413
female | .3511679 .0242542 14.48 0.000 .3036269 .398709
_cons | 5.329011 .037319 142.80 0.000 5.255861 5.40216
------------------------------------------------------------------------------
So is Stata taking derivatives and finding formulas? Nope
Stata uses numerical methods to maximize the likelihood. There are
many and some work better than others in some situations. Type
“help mle” for the gory details
A classic one is the Newton-Raphson algorithm
The idea requires Taylor expansions (a way to approximate nonlinear
functions using linear functions)
The steps are:
1 Make a guess about the parameters, say just one parameter θ0
2 Approximate the derivative of the log likelihood with a Taylor series at θ0 and
set it equal to zero (easier to solve because it's a linear function)
3 Find the new θ, say, θ1 . Check if the log likelihood has improved
4 Repeat until the -2 log likelihood changes by only a small amount, say
0.02
The idea behind stopping when the change in -2 log likelihood is less than
0.02 is that a change that small would not alter statistical inference, since
-2 log likelihood is on the chi-square scale (more on this in a sec)
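A minimal numerical sketch of these steps, using the Bernoulli log likelihood from the ten-number example (n = 10, ȳ = 0.6); this is only an illustration with made-up local macro names, not Stata's actual ml machinery:
local n    = 10
local ybar = 0.6
local p    = 0.2                                               // step 1: initial guess
forvalues it = 1/6 {
    local U = `n'*`ybar'/`p' - `n'*(1 - `ybar')/(1 - `p')      // first derivative (score) at p
    local H = -`n'*`ybar'/`p'^2 - `n'*(1 - `ybar')/(1 - `p')^2 // second derivative at p
    local p = `p' - `U'/`H'                                    // steps 2-3: Newton update
    di "iteration `it': p = " `p'                              // step 4: watch it converge to 0.6
}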
Why is the log likelihood function negative?
The likelihood function L(p) is a small number between 0 and 1 since it's the
joint probability of observing the outcome values, and the log of a number
between 0 and 1 is negative (see the plot of log(x) below)
Different types of MLE methods
twoway function y =log(x), range(-2 2) xline(0 1) yline(0) ///
color(red) title("y = log(x)")
graph export logy.png, replace
What we get from MLE
1) It is clear that we are modeling a conditional expectation
function: E[Y|X]
Perhaps this got lost but it’s worth repeating. We started with the
normal density:
$f(y_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(y_i - \mu)^2}{2\sigma^2} \right)$
We then said that the mean µ is a function of one or more
covariates x and we made no assumptions about the distribution of
x:
$f(y_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(y_i - (\beta_0 + \beta_1 x_i))^2}{2\sigma^2} \right)$
That’s why I said many times that the assumption $\epsilon \sim N(0, \sigma^2)$ is the
same as saying that $y \sim N(\beta_0 + \beta_1 x, \sigma^2)$, since
$\mu = \beta_0 + \beta_1 x$
Note that with MLE we did not assume anything about the errors.
In fact, the errors are not even in the equations
What we get from MLE
What we get from MLE
6) MLE is much more general than OLS. You will use MLE for logit,
probit, Poisson, mixture models, survival models. Pretty much all the
standard models an applied researcher needs
7) Learning to model using likelihood ratio tests is more useful across
more types of models than using the SSE for nested models
8) AIC and BIC to compare non-nested models are based on the log
likelihood function
Here is a more detailed proof of MLE for the normal:
https://www.statlect.com/fundamentals-of-statistics/
normal-distribution-maximum-likelihood
Likelihood ratio test (LRT)
Likelihood ratio test: sketch of theory
The theory of LRTs is a bit dense but the intuition is not that
difficult to understand
We can write it as $LR = -2\ln\left(\frac{L(RM)}{L(FM)}\right) = -2\left[\ln L(RM) - \ln L(FM)\right]$, since $\log(a/b) = \log(a) - \log(b)$
So we are comparing the likelihood of the reduced model to the full
model and wondering if the reduced model alone is just fine. Sounds
familiar? Not that different from the F test comparing SSEs of nested
models
Keep in mind that the estimated model parameters are those that
maximized the value of the likelihood
The more theoretical part is to figure out how the LRT distributes
and under which conditions the LRT is valid (models must be nested)
Recall the F test
We have $LR = -2\ln\left(\frac{L(RM)}{L(FM)}\right)$
The F test was $F = \frac{[SSE(RM) - SSE(FM)]/(p+1-k)}{SSE(FM)/(n-p-1)}$
Both are using a measure of fit to compare models
With MLE, we want to know if reaching a higher likelihood is due to
chance under the null
With the F test, we want to know if the additional reduction in the
residual variance is due to chance under the null
The requirement is that models must be nested
Example
Note that the log likelihood (ll) gets larger for better fitting models;
we will cover AIC and BIC later
Example
LR tests
lrtest m3 m2
Likelihood-ratio test LR chi2(1) = 8.94
(Assumption: m2 nested in m3) Prob > chi2 = 0.0028
. lrtest m3 m1
It seems logical that LRT and F-test comparing nested models should
be equivalent (asymptotically)
LRT and F-tests
Compare tests
qui reg colgpa
est sto m0
scalar ll0 = e(ll)
reg colgpa male campus
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(2, 138) = 0.62
Model | .171856209 2 .085928105 Prob > F = 0.5413
Residual | 19.2342432 138 .139378574 R-squared = 0.0089
-------------+---------------------------------- Adj R-squared = -0.0055
Total | 19.4060994 140 .138614996 Root MSE = .37333
...
est sto m1
scalar ll1 = e(ll)
lrtest m0 m1
Likelihood-ratio test LR chi2(2) = 1.25
(Assumption: m0 nested in m1) Prob > chi2 = 0.5341
* By hand
di -2*(ll0 - ll1)
1.2542272
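Note that the F test already reported by reg (F(2, 138) = 0.62, p = 0.5413) and the LR test (chi2(2) = 1.25, p = 0.5341) point to the same conclusion: adding male and campus does not significantly improve the fit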