
Basics of parameter estimation: MLE and MAP


Estimating the Bias of a Coin
Problem: Assume we can flip a coin with bias θ several times. Estimate θ, the probability that the coin comes up heads on a given flip.
Each flip yields a Boolean value for X, with X ~ Bernoulli(θ): the random variable X follows a Bernoulli distribution with parameter θ and has only two outcomes, typically represented as 0 (failure/tails) and 1 (success/heads).
Bernoulli random variable: P(X = 1) = θ; P(X = 0) = 1 − θ
MLE for Bernoulli Variables
● Note that if the data D consists of just one coin flip, then P(D|θ) = θ if that flip results in X = 1, and P(D|θ) = 1 − θ if the result is instead X = 0.
● Furthermore, if we observe a set of i.i.d. coin flips such as D = {1,1,0,1,0}, then we can easily calculate P(D|θ) by multiplying together the probabilities of each individual coin flip:
● P(D = {1,1,0,1,0} | θ) = θ · θ · (1−θ) · θ · (1−θ) = θ^3 · (1−θ)^2
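As a quick numeric check, here is a short Python sketch (our own illustration, not from the slides) that evaluates P(D|θ) on a grid and confirms the maximum sits at θ = 3/5:

# Evaluate the Bernoulli likelihood P(D | theta) on a grid of theta values.
def likelihood(theta, data):
    # Product of per-flip probabilities for i.i.d. Bernoulli flips.
    p = 1.0
    for x in data:
        p *= theta if x == 1 else (1 - theta)
    return p

data = [1, 1, 0, 1, 0]                   # D = {1,1,0,1,0}
grid = [i / 100 for i in range(1, 100)]  # theta in (0, 1)
print(max(grid, key=lambda t: likelihood(t, data)))  # 0.6 = 3 heads / 5 flips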
MLE
● Maximizing P(D|θ) with respect to θ is equivalent to maximizing its logarithm, ln P(D|θ), with respect to θ, because ln(x) increases monotonically with x.
● It often simplifies the mathematics to maximize ln P(D|θ) rather than P(D|θ).
● Set the derivative to 0: because ln P(D|θ) is a concave function of θ, the value of θ where this derivative is zero is the value that maximizes ln P(D|θ).
MLE
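Carrying this out for the Bernoulli case: if D contains α1 ones (heads) and α0 zeros (tails), then

ln P(D|θ) = α1 ln θ + α0 ln(1−θ)
d/dθ ln P(D|θ) = α1/θ − α0/(1−θ) = 0

which gives θ_MLE = α1/(α1 + α0), the observed fraction of heads.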
MLE - Examples
1. Suppose that X is a discrete random variable with the following probability mass function, where 0 ≤ θ ≤ 1 is a parameter:
P(X = 0) = 2θ/3   P(X = 1) = θ/3   P(X = 2) = 2(1−θ)/3   P(X = 3) = (1−θ)/3
The following 10 independent observations were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1). What is the maximum likelihood estimate of θ?
MLE - Examples
Solution:

Since the sample is (3,0,2,1,3,2,1,0,2,1), the likelihood is
L(θ) = P(X = 3) P(X = 0) P(X = 2) P(X = 1) P(X = 3)
× P(X = 2) P(X = 1) P(X = 0) P(X = 2) P(X = 1)
Substituting from the probability mass function given above, we have
L(θ) = (2θ/3)^2 · (θ/3)^3 · (2(1−θ)/3)^3 · ((1−θ)/3)^2 = c · θ^5 · (1−θ)^5, where c = 2^5/3^10 is a constant.


MLE - Examples
Solution:

Taking the logarithm:
ln L(θ) = ln c + 2 ln θ + 3 ln θ + 3 ln(1−θ) + 2 ln(1−θ) = ln c + 5 ln θ + 5 ln(1−θ)
Setting the derivative to 0 and solving:
d/dθ ln L(θ) = d/dθ (5 ln θ + 5 ln(1−θ)) = 5/θ − 5/(1−θ)
Equating to 0 we get 1/θ = 1/(1−θ), so θ = 1/2. The maximum likelihood estimate of θ is 0.5.
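A quick numeric confirmation in Python (our own sketch, not part of the original solution):

# Likelihood of the sample (3,0,2,1,3,2,1,0,2,1): two 0s, three 1s,
# three 2s, and two 3s under the pmf above.
def L(theta):
    return (2 * theta / 3) ** 2 * (theta / 3) ** 3 \
         * (2 * (1 - theta) / 3) ** 3 * ((1 - theta) / 3) ** 2

grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=L))  # 0.5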
MLE - Examples
● Suppose X1, X2, …, Xn are i.i.d. random variables with density function f(x | σ). What is the MLE for σ?
MLE - Examples
● Maximum likelihood for continuous distributions
Suppose that the lifetime of light bulbs is modeled by an exponential distribution with (unknown) parameter λ. We test 5 bulbs and find they have lifetimes of 2, 3, 1, 3, and 4 years, respectively. What is the MLE for λ?
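A sketch of the solution, using the exponential density f(x|λ) = λe^(−λx):

L(λ) = λ^5 e^(−λ(2+3+1+3+4)) = λ^5 e^(−13λ)
ln L(λ) = 5 ln λ − 13λ

Setting d/dλ ln L(λ) = 5/λ − 13 = 0 gives λ_MLE = 5/13 ≈ 0.385.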
MLE - Examples
● Maximum likelihood to estimate multiple parameters
Suppose the data x1, x2, …, xn is drawn i.i.d. from a N(μ, σ²) distribution, where μ and σ² are unknown. Find the maximum likelihood estimate for the pair (μ, σ²).
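For reference, the standard result of this maximization:

μ_MLE = (1/n) Σ xᵢ (the sample mean)
σ²_MLE = (1/n) Σ (xᵢ − μ_MLE)² (the sample variance with divisor n, not n−1)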

Maximum A Posteriori (MAP) Estimation
● MLE is powerful when you have enough data. However, it doesn't work well when the observed data set is small.
● If we flip the coin 50 times, observing 24 heads and 26 tails, then we will estimate θ_MLE = 0.48.
● If we observe only 3 flips of the coin, we might observe 1 head and 2 tails, and the estimate is θ_MLE = 0.33.
● If we have prior knowledge about the coin, e.g. that θ is close to 0.5, then we might respond by still believing the probability is closer to 0.5 than to 0.33.
● This leads to Maximum A Posteriori (MAP) estimation.
Maximum A Posteriori (MAP) Estimation
● Maximum likelihood estimation (MLE) says that we should find the parameter θ that maximizes the likelihood ("probability") of seeing the data.
● But MAP allows us to incorporate prior knowledge
into our estimate
● We can determine the MAP estimation by using Bayes
theorem to calculate the posterior probability of each
candidate.
Maximum A Posteriori (MAP) Estimation
● Prior Knowledge: E.g., I know that the coin is “close”
to 50-50.
Maximum A Posteriori (MAP) Estimation

It chooses the value that is most probable given the observed data and prior belief.
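Formally, by Bayes' theorem:

θ_MAP = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ) / P(D)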
Maximum A Posteriori Estimation

As P(D) does not depend on θ, we can simplify this by ignoring the denominator:
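θ_MAP = argmax_θ P(D|θ) P(θ)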
Maximum A Posteriori (Example)
1. Suppose our samples are x = (0, 0, 1, 1, 0), drawn from Bernoulli(θ), where θ is unknown. Assume θ is unrestricted; that is, θ ∈ (0, 1). What is the MLE for θ?
L(x | θ) = θ^2 (1 − θ)^3
θ_MLE = argmax_{θ∈[0,1]} θ^2 (1 − θ)^3
= (number of heads) / (number of heads + number of tails)
= 2/5
2. Suppose we impose the restriction that θ ∈ {0.2, 0.5, 0.7}. What is the MLE for θ?
L(x | 0.2) = 0.2^2 · 0.8^3 = 0.02048
L(x | 0.5) = 0.5^2 · 0.5^3 = 0.03125
L(x | 0.7) = 0.7^2 · 0.3^3 = 0.01323
θ_MLE = argmax_{θ∈{0.2, 0.5, 0.7}} L(x | θ) = 0.5
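The restricted case is easy to verify numerically; a minimal Python sketch (ours):

# x = (0, 0, 1, 1, 0): two heads (1s) and three tails (0s).
def L(theta):
    return theta ** 2 * (1 - theta) ** 3

for theta in (0.2, 0.5, 0.7):
    print(theta, L(theta))           # 0.02048, 0.03125, 0.01323
print(max((0.2, 0.5, 0.7), key=L))   # 0.5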
Maximum A Posteriori (Example)
Consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has cancer, and (2) that the patient does not. The available data is from a particular laboratory test with two possible outcomes: positive and negative. We have prior knowledge that over the entire population only 0.008 of people have this disease. Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. In the other cases, the test returns the opposite result.
Maximum A Posteriori (Example)
The above situation can be summarized by the following probabilities:

P(positive|cancer) = 0.98 P(negative|cancer) = 0.02
P(positive|not cancer) = 0.03 P(negative|not cancer) = 0.97
P(cancer) = 0.008 P(not cancer) = 0.992

Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not? To decide, calculate the maximum a posteriori estimate.


Maximum A Posteriori (Example)
Maximum a posteriori estimation: since P(positive) is common to both hypotheses, we compare the unnormalized posteriors:

P(cancer | positive) ∝ P(positive | cancer) · P(cancer) = 0.98 · 0.008 = 0.00784
P(not cancer | positive) ∝ P(positive | not cancer) · P(not cancer) = 0.03 · 0.992 = 0.02976

Thus the MAP estimate diagnoses the patient as not having cancer.
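The same comparison in Python, with a follow-up step beyond the slide's comparison: normalizing the two numerators gives the actual posterior probability (a sketch, ours):

# Unnormalized posteriors: the numerator of Bayes' theorem for each hypothesis.
post_cancer = 0.98 * 0.008    # P(positive|cancer) * P(cancer) = 0.00784
post_healthy = 0.03 * 0.992   # P(positive|not cancer) * P(not cancer) = 0.02976

print(post_cancer < post_healthy)  # True -> MAP hypothesis is "not cancer"

# Normalizing shows a positive test still leaves only ~21% probability of cancer.
print(post_cancer / (post_cancer + post_healthy))  # ~0.2085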
Maximum A Posteriori (Example)
● Consider the same example of flipping a coin. Now let's try to construct a MAP estimate of θ for the same Bernoulli experiment. Obviously, we now need a prior belief distribution for the parameter θ to be estimated.
● Our prior belief in possible values for θ must reflect the following constraints:
○ The prior for θ must be zero outside the [0, 1] interval.
○ Within the [0, 1] interval, we are free to specify our beliefs in
any way we wish.
○ In most cases, we would want to choose a distribution for the
prior beliefs that peaks somewhere in the [0, 1] interval.
Maximum A Posteriori (Example)
● The following beta distribution can express our prior beliefs
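One standard form, with β1, β0 > 0 read as prior pseudo-counts of heads and tails respectively:

P(θ) = θ^(β1−1) (1−θ)^(β0−1) / B(β0, β1), for θ ∈ [0, 1],

where B(β0, β1) is the Beta function that normalizes the density.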

The MAP estimate for θ

Ignore the denominator as B(β0,β1) is independent of θ
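θ_MAP = argmax_θ P(D|θ) P(θ) ∝ θ^(α1) (1−θ)^(α0) · θ^(β1−1) (1−θ)^(β0−1) = θ^(α1+β1−1) (1−θ)^(α0+β0−1),

where α1 and α0 count the observed heads and tails.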


Maximum A Posteriori (Example)
● Solve for the value of θ that maximizes the expression.
● It is identical in form to the maximum likelihood function.
● We can therefore reuse the derivation of θ_MLE.

Here you can see that both P(θ) and P(θ|D) follow a Beta distribution, i.e. P(θ|D) ~ Beta(β0 + α0, β1 + α1): the prior and the posterior are in the same family.
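The mode of this Beta posterior gives the closed form (under the heads/tails convention above):

θ_MAP = (α1 + β1 − 1) / (α1 + β1 + α0 + β0 − 2)

With β1 = β0 = 1 (a uniform prior) this reduces to the MLE α1/(α1 + α0).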
Maximum A Posteriori (Example)

If the posterior distribution is in the same family as the prior distribution, then we say that the prior distribution is the conjugate prior for the likelihood function.

Here the Beta distribution is the conjugate prior for the Bernoulli likelihood.
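A minimal Python sketch of the conjugate update (our own illustration; the function names are hypothetical). With a Beta(5, 5) prior encoding "close to fair" and the earlier 1-head, 2-tails sample, the MAP estimate lands near 0.5 instead of at the MLE 1/3:

# Posterior pseudo-counts: Beta(b1 + heads, b0 + tails).
def beta_bernoulli_update(b1, b0, flips):
    heads = sum(flips)
    return b1 + heads, b0 + (len(flips) - heads)

def beta_mode(a, b):
    # MAP estimate = mode of Beta(a, b); defined for a, b > 1.
    return (a - 1) / (a + b - 2)

b1, b0 = beta_bernoulli_update(5, 5, [1, 0, 0])  # -> (6, 7)
print(beta_mode(b1, b0))                         # 5/11 ~ 0.4545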


Priors

Examples
A gamma distribution with parameters α, β has the following density function, where Γ(t) is the gamma function.
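Assuming the rate parameterization (either convention is common), the density is

f(x | α, β) = (β^α / Γ(α)) x^(α−1) e^(−βx), for x > 0.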

Using the Gamma distribution as a prior, show that the Gamma distribution is a conjugate prior for the Exponential likelihood. Also, find the maximum a posteriori estimator for the parameter of the Exponential distribution as a function of α and β.
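A sketch, assuming observed lifetimes x1, …, xn drawn i.i.d. from Exponential(λ), so the likelihood is λ^n e^(−λ Σxᵢ):

P(λ | x) ∝ λ^n e^(−λ Σxᵢ) · λ^(α−1) e^(−βλ) = λ^(n+α−1) e^(−(β+Σxᵢ)λ)

This is the kernel of Gamma(α + n, β + Σxᵢ), so the posterior stays in the Gamma family: the Gamma prior is conjugate for the Exponential likelihood. Taking the mode gives λ_MAP = (α + n − 1) / (β + Σxᵢ); with no data (n = 0) this is the prior mode (α − 1)/β.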
