
Maximum Likelihood Estimation
CSE 312 Summer 21, Lecture 21
Important Dates!

• Real World 2 – Wednesday, Aug 11


• Review Summary 3 – Friday, Aug 13
• Problem Set 7 – Monday, Aug 16
• Final Released – Friday, Aug 13
• Final Due & Key Released – Tuesday, Aug 17
Asking The Opposite Question
So far:
Give you rules for an experiment.
Give you the event/outcome we’re interested in.
You calculate/estimate/bound what the probability is.
Today:
Give you some of the rules of the experiment.
Tell you what happened.
You estimate what the rest of the rules of the experiment were.
Example
Suppose you flip a coin independently 10 times, and you see

HTTTHHTHHH

What is your estimate of the probability the coin comes up heads?


a) 2/5
b) 1/2
c) 3/5
d) 55/100
Fill out the poll everywhere so Kushal knows how long to explain.
Go to pollev.com/cse312su21
Maximum Likelihood Estimation
Idea: we got the results we got.
High probability events happen more often than low probability events.
So, guess the rules that maximize the probability of the events we saw
(relative to other choices of the rules).

Since that event happened, might as well guess the set of rules for
which that event was most likely.
Maximum Likelihood Estimation
Formally, we are trying to estimate a parameter of the experiment (here:
the probability of a coin flip being heads).
The likelihood of an event 𝐸 given a parameter 𝜃, written ℒ(𝐸; 𝜃), is ℙ(𝐸) when the experiment is run with parameter 𝜃.
We’ll use the notation ℙ(𝐸; 𝜃) for probability when run with parameter 𝜃
where the semicolon means “extra rules” rather than conditioning

We will choose 𝜃̂ = argmax_𝜃 ℒ(𝐸; 𝜃)

argmax is the argument that produces the maximum, i.e., the 𝜃 that causes ℒ(𝐸; 𝜃) to be maximized.
Notation comparison
ℙ(𝑋|𝑌) probability of 𝑋, conditioned on the event 𝑌 having happened
(𝑌 is a subset of the sample space)
ℙ(𝑋; 𝜃) probability of 𝑋, where to properly define our probability space
we need to know the extra piece of information 𝜃. Since 𝜃 isn’t an event,
this is not conditioning
ℒ(𝑋; 𝜃) the likelihood of event 𝑋, given that an experiment was run with
parameter 𝜃. Likelihoods don’t have all the properties we associate with
probabilities (e.g. they don’t all sum up to 1) and this isn’t conditioning
on an event (𝜃 is a parameter/rule of how the event could be
generated).
MLE

Maximum Likelihood Estimator


The maximum likelihood estimator of the parameter 𝜃 is:

𝜃̂ = argmax_𝜃 ℒ(𝐸; 𝜃)

𝜃 is a variable, 𝜃̂ is a number (or a formula given the event).

We’ll also use the notation 𝜃̂_MLE if we want to emphasize how we found this estimator.
The Coin Example
ℒ(HTTTHHTHHH; 𝜃) = 𝜃^6 (1 − 𝜃)^4

Where is this likelihood maximized?
How do we usually find a maximum?
Calculus!!

d/d𝜃 [𝜃^6 (1 − 𝜃)^4] = 6𝜃^5 (1 − 𝜃)^4 − 4𝜃^6 (1 − 𝜃)^3

Set equal to 0 and solve:

6𝜃̂^5 (1 − 𝜃̂)^4 − 4𝜃̂^6 (1 − 𝜃̂)^3 = 0 ⇒ 6(1 − 𝜃̂) − 4𝜃̂ = 0 ⇒ −10𝜃̂ = −6 ⇒ 𝜃̂ = 3/5
The Coin Example
For this problem, 𝜃 must be in the closed interval [0, 1]. Since ℒ(⋅; 𝜃) is a continuous function, the maximum must occur at an endpoint or where the derivative is 0.

Evaluate the endpoints: ℒ(⋅; 0) = 0 and ℒ(⋅; 1) = 0.

At 𝜃 = 0.6 we get a positive value,
so 𝜃 = 0.6 is the maximizer on the interval [0, 1].
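Not on the original slide, but here is a minimal numeric sanity check (a sketch in Python with numpy; the grid size is an arbitrary choice): evaluate ℒ(𝜃) = 𝜃^6 (1 − 𝜃)^4 on a fine grid over [0, 1] and confirm the maximizer lands near 0.6 while both endpoints give 0.

```python
# Grid search over theta in [0, 1] for the coin likelihood L(theta) = theta^6 * (1 - theta)^4.
import numpy as np

theta = np.linspace(0.0, 1.0, 10_001)        # fine grid over the closed interval [0, 1]
likelihood = theta**6 * (1.0 - theta)**4     # L(HTTTHHTHHH; theta)

best = theta[np.argmax(likelihood)]
print(f"numeric maximizer ~= {best:.4f}")                        # ~0.6000
print(f"L at the endpoints: {likelihood[0]}, {likelihood[-1]}")  # both 0.0
```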
Maximizing a Function
CLOSED INTERVALS
Set the derivative equal to 0 and solve.
Evaluate the likelihood at the endpoints and any critical points.
The largest value found is the maximum on that interval.

SECOND DERIVATIVE TEST
Set the derivative equal to 0 and solve.
Take the second derivative. If it is negative everywhere, then the critical point is the maximizer.
A Math Trick
We’re going to be taking the derivative of products a lot.
The product rule is not fun. There has to be a better way!
Take the log!
ln(𝑎 ⋅ 𝑏) = ln(𝑎) + ln(𝑏)
We don’t need the product rule if our expression is a sum!

Can we still take the max? ln() is an increasing function, so


argmax_𝜃 ln(ℒ(𝐸; 𝜃)) = argmax_𝜃 ℒ(𝐸; 𝜃)
Coin flips is easier
ℒ(HTTTHHTHHH; 𝜃) = 𝜃^6 (1 − 𝜃)^4

ln(ℒ(HTTTHHTHHH; 𝜃)) = 6 ln(𝜃) + 4 ln(1 − 𝜃)

d/d𝜃 ln(ℒ(⋅)) = 6/𝜃 − 4/(1 − 𝜃)

Set to 0 and solve:

6/𝜃̂ − 4/(1 − 𝜃̂) = 0 ⇒ 6/𝜃̂ = 4/(1 − 𝜃̂) ⇒ 6 − 6𝜃̂ = 4𝜃̂ ⇒ 𝜃̂ = 3/5

d²/d𝜃² ln(ℒ) = −6/𝜃² − 4/(1 − 𝜃)² < 0 everywhere, so any critical point must be a maximum.
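A quick symbolic double-check of this derivation (a sketch only; it assumes the sympy package is available, which the lecture itself does not use):

```python
# Symbolically differentiate ln L = 6 ln(theta) + 4 ln(1 - theta) and solve for the critical point.
import sympy as sp

theta = sp.symbols('theta', positive=True)
log_lik = 6 * sp.log(theta) + 4 * sp.log(1 - theta)

critical_points = sp.solve(sp.diff(log_lik, theta), theta)
print(critical_points)              # [3/5]

second_derivative = sp.diff(log_lik, theta, 2)
print(second_derivative)            # -6/theta**2 - 4/(1 - theta)**2 (up to sympy's formatting), negative on (0, 1)
```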
What about continuous random variables?
Can’t use probability, since the probability is going to be 0.
Can use the density!
It’s supposed to show relative chances, which is all we’re trying to find anyway.

ℒ(𝑥_1, 𝑥_2, …, 𝑥_𝑛; 𝜃) = ∏_{i=1}^{n} 𝑓_𝑋(𝑥_𝑖; 𝜃)


Continuous Example
Suppose you get values 𝑥_1, 𝑥_2, …, 𝑥_𝑛 from independent draws of a normal random variable 𝒩(𝜇, 1) (for 𝜇 unknown).
We’ll also call these “realizations” of the random variable.

ℒ(𝑥_1, …, 𝑥_𝑛; 𝜇) = ∏_{i=1}^{n} (1/√(2𝜋)) exp(−(1/2)(𝑥_𝑖 − 𝜇)²)

ln(ℒ(𝑥_1, …, 𝑥_𝑛; 𝜇)) = ∑_{i=1}^{n} [ ln(1/√(2𝜋)) − (1/2)(𝑥_𝑖 − 𝜇)² ]
Finding 𝜇̂

ln(ℒ) = ∑_{i=1}^{n} [ ln(1/√(2𝜋)) − (1/2)(𝑥_𝑖 − 𝜇)² ]

d/d𝜇 ln(ℒ) = ∑_{i=1}^{n} (𝑥_𝑖 − 𝜇)

Setting the derivative equal to 0 and solving:

∑_{i=1}^{n} (𝑥_𝑖 − 𝜇̂) = 0 ⇒ ∑_{i=1}^{n} 𝑥_𝑖 = 𝜇̂ ⋅ 𝑛 ⇒ 𝜇̂ = (∑_{i=1}^{n} 𝑥_𝑖) / 𝑛

Check using the second derivative test:

d²/d𝜇² ln(ℒ) = −𝑛

The second derivative is negative everywhere, so the log-likelihood is concave down
and the average of the 𝑥_𝑖 is a maximizer.
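A small simulation sketch of this result (the true mean, sample size, and seed below are made-up values for illustration; it assumes numpy and scipy are available):

```python
# Draw n realizations from N(mu=2, 1), then check that the sample mean
# maximizes the log-likelihood numerically.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(312)
x = rng.normal(loc=2.0, scale=1.0, size=1_000)

def neg_log_likelihood(mu):
    # ln L(x; mu) = sum_i [ ln(1/sqrt(2*pi)) - (x_i - mu)^2 / 2 ], negated so a
    # minimizer of this function is a maximizer of the log-likelihood.
    return -np.sum(np.log(1.0 / np.sqrt(2 * np.pi)) - 0.5 * (x - mu) ** 2)

result = minimize_scalar(neg_log_likelihood, bounds=(-10, 10), method="bounded")
print(f"sample mean    = {x.mean():.5f}")
print(f"numeric argmax = {result.x:.5f}")   # agrees with the sample mean
```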
Summary
Given: an event 𝐸 (usually 𝑛 i.i.d. samples from a distribution with
unknown parameter 𝜃).
1. Find likelihood ℒ(𝐸; 𝜃)
Usually ∏ ℙ(𝑥_𝑖; 𝜃) for discrete and ∏ 𝑓(𝑥_𝑖; 𝜃) for continuous
2. Maximize the likelihood. Usually:
A. Take the log (if it will make the math easier)
B. Take the derivative
C. Set the derivative to 0 and solve
3. Use the second derivative test to confirm you have a maximizer
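When the calculus is inconvenient, the same recipe can be followed numerically. Here is a minimal sketch (my own illustration, reusing the coin data from earlier and substituting a plain grid search for steps 2B–3):

```python
# Build the log-likelihood from per-sample log-probabilities (steps 1 and 2A)
# and maximize it with a grid search instead of calculus.
import numpy as np

flips = np.array([1, 0, 0, 0, 1, 1, 0, 1, 1, 1])   # H T T T H H T H H H (1 = heads)

def log_likelihood(theta, data):
    # sum of ln P(x_i; theta) for independent Bernoulli(theta) flips
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

grid = np.linspace(0.001, 0.999, 999)               # avoid log(0) at the endpoints
theta_hat = grid[np.argmax([log_likelihood(t, flips) for t in grid])]
print(f"theta_hat ~= {theta_hat:.3f}")              # ~0.6, matching the calculus
```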
Two Parameter Estimation
Two Parameter Estimation Setup
We just saw that to estimate 𝜇 for 𝒩(𝜇, 1) we get:

𝜇̂ = (∑_{i=1}^{n} 𝑥_𝑖) / 𝑛

Now what happens if we know our data is normal but nothing else? Both the mean and the variance are unknown.
Log-likelihood
Let 𝜃_𝜇 and 𝜃_𝜎² be the unknown mean and variance of a normal distribution. Suppose we get independent draws 𝑥_1, 𝑥_2, …, 𝑥_𝑛.

ℒ(𝑥_1, …, 𝑥_𝑛; 𝜃_𝜇, 𝜃_𝜎²) = ∏_{i=1}^{n} (1/√(2𝜋 𝜃_𝜎²)) exp(−(1/2) ⋅ (𝑥_𝑖 − 𝜃_𝜇)²/𝜃_𝜎²)

ln ℒ(𝑥_𝑖; 𝜃_𝜇, 𝜃_𝜎²) = ∑_{i=1}^{n} [ ln(1/√(2𝜋 𝜃_𝜎²)) − (1/2) ⋅ (𝑥_𝑖 − 𝜃_𝜇)²/𝜃_𝜎² ]
Expectation
(Arithmetic is nearly identical to the known-variance case.)

ln ℒ(𝑥_𝑖; 𝜃_𝜇, 𝜃_𝜎²) = ∑_{i=1}^{n} [ ln(1/√(2𝜋 𝜃_𝜎²)) − (1/2) ⋅ (𝑥_𝑖 − 𝜃_𝜇)²/𝜃_𝜎² ]

∂/∂𝜃_𝜇 ln ℒ = ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃_𝜇)/𝜃_𝜎²

Setting equal to 0 and solving:

∑_{i=1}^{n} (𝑥_𝑖 − 𝜃̂_𝜇)/𝜃̂_𝜎² = 0 ⇒ ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃̂_𝜇) = 0 ⇒ ∑_{i=1}^{n} 𝑥_𝑖 = 𝑛 ⋅ 𝜃̂_𝜇 ⇒ 𝜃̂_𝜇 = (∑_{i=1}^{n} 𝑥_𝑖) / 𝑛

∂²/∂𝜃_𝜇² ln ℒ = −𝑛/𝜃_𝜎²

𝜃_𝜎² is an estimate of a variance. It’ll never be negative (and as long as the draws aren’t identical it won’t be 0). So, the second derivative is negative, and we really have a maximizer.
Variance
ln ℒ(𝑥_𝑖; 𝜃_𝜇, 𝜃_𝜎²) = ∑_{i=1}^{n} [ ln(1/√(2𝜋 𝜃_𝜎²)) − (1/2) ⋅ (𝑥_𝑖 − 𝜃_𝜇)²/𝜃_𝜎² ]

= ∑_{i=1}^{n} [ −(1/2) ln(𝜃_𝜎²) − (1/2) ln(2𝜋) − (1/2) ⋅ (𝑥_𝑖 − 𝜃_𝜇)²/𝜃_𝜎² ]

= −(𝑛/2) ln(𝜃_𝜎²) − (𝑛 ln(2𝜋))/2 − (1/(2𝜃_𝜎²)) ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃_𝜇)²

∂/∂𝜃_𝜎² ln ℒ = −𝑛/(2𝜃_𝜎²) + (1/(2(𝜃_𝜎²)²)) ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃_𝜇)²
Variance
∂/∂𝜃_𝜎² ln ℒ = −𝑛/(2𝜃_𝜎²) + (1/(2(𝜃_𝜎²)²)) ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃_𝜇)²

Setting equal to 0 and solving:

−𝑛/(2𝜃̂_𝜎²) + (1/(2(𝜃̂_𝜎²)²)) ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃_𝜇)² = 0

⇒ −(𝑛/2) 𝜃̂_𝜎² + (1/2) ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃_𝜇)² = 0   (multiply through by (𝜃̂_𝜎²)²)

⇒ 𝜃̂_𝜎² = (1/𝑛) ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃_𝜇)²

To get the overall maximum, we’ll plug in 𝜃̂_𝜇 for 𝜃_𝜇.
Summary
If you get independent samples 𝑥_1, 𝑥_2, …, 𝑥_𝑛 from a 𝒩(𝜇, 𝜎²) where 𝜇 and 𝜎² are unknown, the maximum likelihood estimates are:

𝜃̂_𝜇 = (∑_{i=1}^{n} 𝑥_𝑖) / 𝑛   and   𝜃̂_𝜎² = (1/𝑛) ∑_{i=1}^{n} (𝑥_𝑖 − 𝜃̂_𝜇)²

The maximum likelihood estimator of the mean is the sample mean; that is, the estimate of 𝜇 is the average value of all the data points.
The MLE for the variance is the variance of the experiment “choose one of the 𝑥_𝑖 at random.”
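A quick simulation sketch of these formulas (the true 𝜇, 𝜎², sample size, and seed are made-up values for illustration; it assumes numpy is available):

```python
# Check the two-parameter MLE on simulated normal data.
import numpy as np

rng = np.random.default_rng(21)
x = rng.normal(loc=5.0, scale=3.0, size=10_000)      # N(mu=5, sigma^2=9)

mu_hat = np.sum(x) / len(x)                          # MLE of the mean: the sample mean
sigma2_hat = np.sum((x - mu_hat) ** 2) / len(x)      # MLE of the variance: divide by n, not n-1

print(f"mu_hat     = {mu_hat:.3f}")                  # close to 5
print(f"sigma2_hat = {sigma2_hat:.3f}")              # close to 9
# np.var(x) uses the same 1/n convention by default (ddof=0), so it matches sigma2_hat.
```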
