Week 6: MLE
University of Colorado
Anschutz Medical Campus
Bernoulli example
Suppose that we know that the following ten numbers were simulated
using a Bernoulli distribution: 0 0 0 1 1 1 0 1 1 1
We can denote them by $y_1, y_2, \ldots, y_{10}$. So $y_1 = 0$ and $y_{10} = 1$
Recall that the pdf of a Bernoulli random variable is
$f(y; p) = p^y (1 - p)^{1-y}$, where $y \in \{0, 1\}$
The probability of 1 is p while the probability of 0 is (1 − p)
We want to figure out what p was used to simulate the ten numbers
All we know is that 1) they come from a Bernoulli distribution and 2)
they are independent of each other
Bernoulli example
Bernoulli example
We use the product symbol $\prod$ to simplify the notation. For example,
$\prod_{i=1}^{2} x_i = x_1 \cdot x_2$
So we can write the joint probability, or the likelihood $L$, of seeing
those 10 numbers as:
$L(p) = \prod_{i=1}^{10} p^{y_i} (1 - p)^{1-y_i}$
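As a quick numerical illustration (just a sketch; the two values of p below are arbitrary guesses): the ten numbers contain six 1s and four 0s, so we can evaluate the likelihood at a couple of candidate values of p in Stata:
di 0.5^6 * (1 - 0.5)^4    // L(0.5)
di 0.6^6 * (1 - 0.6)^4    // L(0.6) is larger, so p = 0.6 looks like a better guess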
Bernoulli example
Remember that we are trying to find the p that was used to generate
the 10 numbers. That’s our unknown
In other words, we want to find the p that maximizes the likelihood
function L(p). Once we find it, we write it as our estimated parameter p̂
Yet another way: we want to find the p̂ that makes the joint
likelihood of seeing those numbers as high as possible
Sounds like a calculus problem... We can take the derivative of L(p)
with respect to p and set it to zero to find the optimal p̂
Of course, the second step is to verify that it’s a maximum and not a
minimum (take the second derivative) and also verify that it's unique, etc.
We will skip those steps
Bernoulli example
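Taking the log turns the product into a sum, which is much easier to differentiate. With $n$ observations and $\bar{y}$ the proportion of 1s (so $\sum_i y_i = n\bar{y}$), the log likelihood is:
$\ln L(p) = \sum_{i=1}^{n} \left[ y_i \ln(p) + (1 - y_i)\ln(1-p) \right] = n\bar{y}\,\ln(p) + (n - n\bar{y})\ln(1-p)$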
Bernoulli example
$\frac{d \ln L(p)}{dp} = \frac{n\bar{y}}{p} - \frac{n - n\bar{y}}{1-p} = 0$
After solving, we'll find that $\hat{p} = \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$
So that’s the MLE estimator of p. This is saying more or less the
obvious: our best guess for the p that generated the data is the
proportion of 1s, in this case p̂ = 0.6
We would need to verify that our estimator satisfies the three basic
properties of an estimator: unbiasedness, efficiency, and consistency (this will
be on your exam)
Note that we can plug the optimal p̂ back into the ln likelihood function:
$\ln L(\hat{p}) = n\bar{y}\,\ln(\hat{p}) + (n - n\bar{y})\ln(1 - \hat{p}) = a$, where $a$ is the highest
value of the log likelihood we can achieve (we chose p̂ that way)
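For the ten numbers above, p̂ = 0.6, and we can plug it back in with a quick check in Stata (just a sketch of the arithmetic):
di 10*0.6*ln(0.6) + (10 - 10*0.6)*ln(1 - 0.6)    // about -6.73, the maximized log likelihood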
Example
di 100*0.46*ln(0.46) + (100-100*0.46)*ln(1-0.46)
-68.994376
And we just did logistic regression “by hand.” A logistic model with
only a constant (no covariates), also known as the null model
Example
We will use the logit command to model indicator variables, like
whether a person died
logit bernie
Iteration 0: log likelihood = -68.994376
Iteration 1: log likelihood = -68.994376
------------------------------------------------------------------------------
bernie | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | -.1603427 .2006431 -0.80 0.424 -.5535959 .2329106
------------------------------------------------------------------------------
di 1/(1+exp( .1603427 ))
.45999999
Let’s plot the - ln(L) function with respect to p
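A minimal sketch of such a plot, assuming we use the logit example above (n = 100 with 46 ones) and the twoway function command that appears later in these notes; the minimum of -ln(L) should sit at p = 0.46:
twoway function y = -(100*0.46*ln(x) + (100 - 100*0.46)*ln(1 - x)), range(0.01 0.99) ///
    xline(0.46) xtitle("p") ytitle("-ln L(p)") title("-ln L(p) for the Bernoulli example")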
What about the precision (standard error) of the estimate?
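A sketch of the usual answer: the standard error comes from the curvature (second derivative) of the log likelihood at the maximum. For the Bernoulli MLE this gives $SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$, and on the log-odds scale $1/\sqrt{n\hat{p}(1-\hat{p})}$, which we can check against the logit output above:
di sqrt(0.46*(1 - 0.46)/100)       // SE of p-hat
di 1/sqrt(100*0.46*(1 - 0.46))     // SE on the log-odds scale; compare with .2006431 above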
What about covariates?
Normal example
I tell you that they were simulated from a normal distribution with
parameters µ and σ². The numbers are independent. Your job is to
come up with the best guess for the two parameters
Same problem as with the Bernoulli example. We can solve it in
exactly the same way
Normal example
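Since the draws are independent, the joint likelihood is the product of the normal densities:
$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(y_i - \mu)^2}{2\sigma^2} \right)$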
Normal example
As before, we can simplify the problem by taking the log to make the
derivatives easier. But first, a digression:
Alert: Perhaps you are wondering, why are we using the pdf of the
normal if we know that the probability of any single number is zero?
Because we can think of the pdf as giving us the probability that y falls
in a tiny interval $[y_i, y_i + d]$ (the pdf times d) as $d \to 0$
We need computers with a lot of floating-point capability. MLE goes
back to Fisher in the 1920s, but it was super difficult to implement before
modern computers. In the 80s, we had Commodore 64s
https://en.wikipedia.org/wiki/Commodore_64
Researchers could use mainframe computers with punch cards. Your
iPhone is faster than mainframes that used to fill an entire building...
https://en.wikipedia.org/wiki/Mainframe_computer
Maybe you were not wondering that but I was at some point. I
wonder a lot in general. And daydream on a minute by minute basis...
Normal example
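A sketch of the standard steps for the mean: take the log, differentiate with respect to $\mu$, and set the derivative to zero.
$\ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2$
$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \bar{y}$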
Normal example
We can also figure out the variance by taking the derivative with
respect to $\sigma^2$
We will find that $\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{\mu})^2}{n}$
If you remember the review lecture on probability and statistics, we
know that this formula is biased. We need to divide by (n − 1) instead
(What is the definition of bias?)
This happens often in MLE. The MLE estimate of the variance is
often biased but it is easy to correct for it
Normal example Stata
We just figured out that the best guess is to calculate the sample
mean and sample variance
We can easily verify in Stata
clear
set seed 1234567
set obs 100
gen ynorm = rnormal(100, 10)
sum ynorm
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
ynorm | 100 98.52294 10.03931 74.16368 123.5079
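Note that sum reports the sample SD, which divides by n - 1, while the MLE divides by n; a quick conversion using the sample SD reported above:
di sqrt((100 - 1)/100) * 10.03931    // MLE estimate of sigma, slightly smaller than the sample SD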
Linear regression: adding covariates
Linear regression
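With covariates, we keep the same normal density but let the mean depend on x; a sketch for the one-covariate case:
$\ln L(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2$
Maximizing over $\beta_0$ and $\beta_1$ is the same as minimizing the sum of squared errors, which is why MLE and OLS give the same coefficients; the variance estimate is where they differ (MLE divides the SSE by n)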
The regression command again
The regression command does not use MLE but it does give you the
log likelihood
use auto
qui reg price weight mpg
ereturn list
scalars:
e(N) = 74
e(df_m) = 2
e(df_r) = 71
e(F) = 14.7398153853841
e(r2) = .2933891231947529
e(rmse) = 2514.028573297152
e(mss) = 186321279.739451
e(rss) = 448744116.3821706
e(r2_a) = .27348459145376
e(ll) = -682.8636883111164
e(ll_0) = -695.7128688987767
e(rank) = 3
The log likelihood of the estimated model is stored in e(ll). The log
likelihood of the null model (with no covariates) is stored in e(ll_0).
From the numbers above, e(ll) > e(ll_0)
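Where does e(ll) come from? For a linear model with normal errors, the maximized log likelihood can be written as $-\frac{n}{2}\left[\ln(2\pi) + \ln(RSS/n) + 1\right]$; a quick check, assuming the reg results are still in memory:
di -e(N)/2 * ( ln(2*_pi) + ln(e(rss)/e(N)) + 1 )    // should match e(ll) = -682.8637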
The regression command again
Easy MLE in Stata
To estimate in MLE using Stata you need to write a program but
Stata now makes it a lot easier (for teaching purposes) with the
mlexp command
mlexp (ln(normalden(price, {xb: weight mpg _cons}, {sigma})))
initial: log likelihood = -<inf> (could not be evaluated)
feasible: log likelihood = -803.76324
rescale: log likelihood = -729.85758
rescale eq: log likelihood = -697.2346
Iteration 0: log likelihood = -697.2346
Iteration 1: log likelihood = -687.4506
Iteration 2: log likelihood = -682.92425
Iteration 3: log likelihood = -682.86401
Iteration 4: log likelihood = -682.86369
Iteration 5: log likelihood = -682.86369
Maximum likelihood estimation
Log likelihood = -682.86369 Number of obs = 74
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
weight | 1.746559 .6282189 2.78 0.005 .5152727 2.977846
mpg | -49.51222 84.39157 -0.59 0.557 -214.9167 115.8922
_cons | 1946.069 3523.382 0.55 0.581 -4959.634 8851.771
-------------+----------------------------------------------------------------
/sigma | 2462.542 202.4197 12.17 0.000 2065.806 2859.277
------------------------------------------------------------------------------
Almost the same
The SEs are slightly different, and so is the Root MSE. With MLE, Stata
calculates the SEs from the second derivatives of the log likelihood, and the
MLE of $\sigma^2$ divides the SSE by n rather than n - k
. reg price weight mpg
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | 1.746559 .6413538 2.72 0.008 .467736 3.025382
mpg | -49.51222 86.15604 -0.57 0.567 -221.3025 122.278
_cons | 1946.069 3597.05 0.54 0.590 -5226.245 9118.382
------------------------------------------------------------------------------
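A quick check of both differences (a sketch, assuming the reg results are still in memory): the MLE of sigma divides the residual sum of squares by n, and for this model the MLE standard errors are the OLS ones rescaled by $\sqrt{(n-k)/n}$:
di sqrt(e(rss)/e(N))            // MLE sigma-hat; compare with /sigma = 2462.542 above
di .6413538 * sqrt(71/74)       // OLS SE for weight rescaled; compare with .6282189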
Asymptotic properties are so important in stats
The auto dataset has only 74 obs. What if we use the MEPS, which has
about 15,000? (That's really overkill, but just to make the point)
mlexp (ln(normalden(lexp, {xb: age _cons}, {sigma})))
initial: log likelihood = -<inf> (could not be evaluated)
could not find feasible values
* I tried giving it starting values but it didn't work. Easier to do it the old-fashioned way
Asymptotic properties are so important in stats
ml maximize
initial: log likelihood = -453412.9
alternative: log likelihood = -163550.49
.
Iteration 5: log likelihood = -29153.79
Number of obs = 15,946
Wald chi2(2) = 2981.22
Log likelihood = -29153.79 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
lexp | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
age | .0358123 .0006779 52.82 0.000 .0344836 .0371411
female | .3511679 .024252 14.48 0.000 .303635 .3987009
_cons | 5.329011 .0373155 142.81 0.000 5.255874 5.402148
-------------+----------------------------------------------------------------
lnsigma |
_cons | .4093438 .0055996 73.10 0.000 .3983687 .4203189
------------------------------------------------------------------------------
display exp([lnsigma]_cons)
1.5058293
Asymptotic properties are SO IMPORTANT in stats
Same model estimated with reg: with almost 16,000 observations, the OLS and MLE standard errors are essentially identical
------------------------------------------------------------------------------
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0358123 .000678 52.82 0.000 .0344834 .0371413
female | .3511679 .0242542 14.48 0.000 .3036269 .398709
_cons | 5.329011 .037319 142.80 0.000 5.255861 5.40216
------------------------------------------------------------------------------
So is Stata taking derivatives and finding formulas? Nope
Stata uses numerical methods to maximize the likelihood. There are
many and some work better than others in some situations. Type
“help mle” for the gory details
A classic one is the Newton-Raphson algorithm
The idea requires Taylor expansions (a way to approximate nonlinear
functions using linear functions)
The steps are:
1 Make a guess about the parameters, say just one parameter θ0
2 Approximate the derivative of the log likelihood with a Taylor series at θ0 and
set it equal to zero (easier to solve because it's a linear function)
3 Find the new θ, say, θ1 . Check if the log likelihood has improved
4 Repeat until the -2 log likelihood changes by only a small amount, say
0.02
The idea behind stopping when the change in -2 log likelihood is less than
0.02 is that a change that small would not alter statistical inference, since
-2 log likelihood is on the chi-square scale (more on this in a sec)
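A minimal numerical sketch of these steps, using the Bernoulli log likelihood from the ten-number example (n = 10, ȳ = 0.6); this is only an illustration with made-up local macro names, not Stata's actual ml machinery:
local n    = 10
local ybar = 0.6
local p    = 0.2                                               // step 1: initial guess
forvalues it = 1/6 {
    local U = `n'*`ybar'/`p' - `n'*(1 - `ybar')/(1 - `p')      // first derivative (score) at p
    local H = -`n'*`ybar'/`p'^2 - `n'*(1 - `ybar')/(1 - `p')^2 // second derivative at p
    local p = `p' - `U'/`H'                                    // steps 2-3: Newton update
    di "iteration `it': p = " `p'                              // step 4: watch it converge to 0.6
}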
Why is the log likelihood function negative?
The likelihood function L(p) is a small number between 0 and 1 since it's the
joint probability of observing the outcome values, and the log of a number
between 0 and 1 is negative (see the plot of log(x) below)
Different types of MLE methods
twoway function y =log(x), range(-2 2) xline(0 1) yline(0) ///
color(red) title("y = log(x)")
graph export logy.png, replace
What we get from MLE
1) It is clear that we are modeling a conditional expectation
function: E[Y|X]
Perhaps this got lost but it’s worth repeating. We started with the
normal density:
$f(y_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(y_i - \mu)^2}{2\sigma^2} \right)$
We then said that the mean µ is a function of one or more
covariates x and we made no assumptions about the distribution of
x:
$f(y_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(y_i - (\beta_0 + \beta_1 x_i))^2}{2\sigma^2} \right)$
That’s why I said many times that the assumption $\epsilon \sim N(0, \sigma^2)$ is the
same as saying that $y \sim N(\beta_0 + \beta_1 x, \sigma^2)$, since
$\mu = \beta_0 + \beta_1 x$
Note that with MLE we did not assume anything about the errors.
In fact, the errors are not even in the equations
What we get from MLE
What we get from MLE
6) MLE is much more general than OLS. You will use MLE for logit,
probit, Poisson, mixture models, survival models. Pretty much all the
standard models an applied researcher needs
7) Learning to model using likelihood ratio tests is more useful across
more types of models than using the SSE for nested models
8) AIC and BIC to compare non-nested models are based on the log
likelihood function
Here is a more detailed proof of MLE for the normal:
https://www.statlect.com/fundamentals-of-statistics/
normal-distribution-maximum-likelihood
Likelihood ratio test (LRT)
Likelihood ratio test: sketch of theory
The theory of LRTs is a bit dense but the intuition is not that
difficult to understand
We can write it as $LR = -2\ln\left(\frac{L(RM)}{L(FM)}\right) = -2\left[\ln L(RM) - \ln L(FM)\right]$, since $\log(a/b) = \log(a) - \log(b)$
So we are comparing the likelihood of the reduced model to the full
model and wondering if the reduced model alone is just fine. Sounds
familiar? Not that different from the F test comparing SSEs of nested
models
Keep in mind that the estimated model parameters are those that
maximized the value of the likelihood
The more theoretical part is to figure out how the LRT distributes
and under which conditions the LRT is valid (models must be nested)
Recall the F test
We have $LR = -2\ln\left(\frac{L(RM)}{L(FM)}\right)$
The F test was $F = \frac{[SSE(RM) - SSE(FM)]/(p+1-k)}{SSE(FM)/(n-p-1)}$
Both are using a measure of fit to compare models
With MLE, we want to know if reaching a higher likelihood is due to
chance under the null
With the F test, we want to know if the additional reduction in the
residual variance is due to chance under the null
The requirement is that models must be nested
Example
Note that the log likelihood (ll) gets larger for better fitting models;
we will cover AIC and BIC later
Example
LR tests
lrtest m3 m2
Likelihood-ratio test LR chi2(1) = 8.94
(Assumption: m2 nested in m3) Prob > chi2 = 0.0028
. lrtest m3 m1
It seems logical that LRT and F-test comparing nested models should
be equivalent (asymptotically)
LRT and F-tests
Compare tests
qui reg colgpa
est sto m0
scalar ll0 = e(ll)
reg colgpa male campus
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(2, 138) = 0.62
Model | .171856209 2 .085928105 Prob > F = 0.5413
Residual | 19.2342432 138 .139378574 R-squared = 0.0089
-------------+---------------------------------- Adj R-squared = -0.0055
Total | 19.4060994 140 .138614996 Root MSE = .37333
...
est sto m1
scalar ll1 = e(ll)
lrtest m0 m1
Likelihood-ratio test LR chi2(2) = 1.25
(Assumption: m0 nested in m1) Prob > chi2 = 0.5341
* By hand
di -2*(ll0 - ll1)
1.2542272
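Note that the F test already reported by reg (F(2, 138) = 0.62, p = 0.5413) and the LR test (chi2(2) = 1.25, p = 0.5341) point to the same conclusion: adding male and campus does not significantly improve the fit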