03 Lecturenote MLE MAP
Learning Objective
Understand that many ml algorithms estimate probabilities. Appreciate that estimating probability distributions is beneficial. Get to know common estimators for the parameters of probability distributions.
Application
Imagine that we want to predict the production of bushels of corn per acre on a farm as a function of the proportion of that farm’s planting area that was treated with a new pesticide. We have training data for 100 farms for this problem (https://www.developer.com/mgmt/real-world-machine-learning-model-evaluation-and-optimization.html).
Take some time to answer the following warm-up questions: (1) Is this a regression or classification problem? (2) What are the features? (3) What is the prediction task? (4) How well do you think a linear model will perform? http://www.corncapitalinnovations.com/production/300-bushel-corn/
1 Introduction
For a general machine learning problem, our goal is to estimate the function f which satisfies f (x) = y. For example, ordinary least squares regression estimates the vector w such that f (x) = w⊺ x. However, this approach has some limitations: (1) y is only a point estimate, and (2) how do we deal with noise?
Therefore, it is a better idea to estimate the conditional probability p(y ∣ x). In this way, we can deal with uncertain outcomes/noise and incorporate prior knowledge.
1.1 Noise
Let’s look at noise in our corn production application. Plotting the target (bushels of corn per acre) versus the feature (% of the farm treated), it is clear that an increasing, non-linear relationship exists, and that the data also have random fluctuations, cf. Figure 1.
You have the following data: H, T, T, H, H, H, T, T, T, T. What is p(y = H) given that we observed nH heads and nT tails? So, intuitively,

p(y = H) = nH / (nH + nT) = 0.4        (1)

Note: we have no x’s in this example. Let’s formally derive this probability.
Formally (mle principle): Find θ̂ that maximizes the likelihood of the data p(D ∣ θ):
For the sequence of coin flips we can use the binomial distribution (cf. fcml 2.3.2) to model p(D ∣ θ):
p(D ∣ θ) = C(nH + nT, nH) θ^nH (1 − θ)^nT = Bin(nH ∣ n, θ)        (2)

where C(n, k) denotes the binomial coefficient “n choose k” and n = nH + nT.
Now,

θ̂mle = arg max_θ C(nH + nT, nH) θ^nH (1 − θ)^nT
     = arg max_θ log C(nH + nT, nH) + log θ^nH + log(1 − θ)^nT        (3)
     = arg max_θ nH log θ + nT log(1 − θ)
We can now solve for θ by taking the derivative and equating it to zero. This results in:

nH/θ = nT/(1 − θ)  ⇒  nH(1 − θ) = nT θ  ⇒  θ̂mle = nH / (nH + nT)        (4)
Note that we found the (arg) maximum of the log-likelihood instead of the likelihood, which often leads to equations that are much easier to solve!
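The counting estimate and the derivation above can be cross-checked numerically. A minimal sketch, using the ten flips from this example (the grid search is just a sanity check, not how you would estimate θ in practice):

```python
import math

# The ten coin flips from the example above: 4 heads, 6 tails.
data = ["H", "T", "T", "H", "H", "H", "T", "T", "T", "T"]
n_heads = data.count("H")
n_tails = data.count("T")

# Closed form from Eq. (4): theta_mle = nH / (nH + nT).
theta_mle = n_heads / (n_heads + n_tails)

# Sanity check: maximize the log-likelihood
# nH*log(theta) + nT*log(1 - theta) over a fine grid of theta values.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: n_heads * math.log(t) + n_tails * math.log(1 - t))

print(theta_mle)   # 0.4
print(theta_grid)  # 0.4 (up to grid resolution)
```

Dropping the binomial coefficient in the grid search is fine: it does not depend on θ, so it does not move the arg max, exactly as in Eq. (3).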
Advantages:
• mle gives the most likely explanation of the data you observed.
• If n is large and your model/distribution is correct (that is, H includes the true model), then mle finds the true parameters.
Disadvantages:
• The mle can overfit the data if n is small.
• If you do not have the correct model (and n is small) then mle can be terribly wrong.
p(y; θ) = θ e^(−θy).
In order to use this distribution to compute the probability of your own jX phone dying next month, we need to estimate its parameter θ.
(b) How do you derive the mle estimator θ̂ based on ll(θ, y1 , . . . , yn )? No computation required.
Exercise 2.2. Assume you model your data y1 , . . . , yn with a Poisson distribution:
P (y; θ) = θ^y e^(−θ) / y!    for y = 0, 1, 2, . . . , K.
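Independently of the analytical derivation the exercises ask for, you can always inspect the log-likelihood ll(θ) numerically. A sketch for the Poisson case, using a small hypothetical dataset of counts:

```python
import math

# Hypothetical count data (a stand-in for y_1, ..., y_n).
y = [2, 0, 3, 1, 2, 4, 1, 2]

def log_likelihood(theta):
    # ll(theta) = sum_i [ y_i * log(theta) - theta - log(y_i!) ]
    return sum(yi * math.log(theta) - theta - math.log(math.factorial(yi)) for yi in y)

# Evaluate ll on a grid over (0, 10] and pick the maximizer.
grid = [i / 100 for i in range(1, 1001)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)
```

Plotting ll over the grid is also a good way to see how sharply (or not) the data pin down θ.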
For example, set q = 0.5, then add m heads and m tails to dataset D. Now,
θ̂ = (nH + m) / (nH + nT + 2m)        (5)
For large n, this change is insignificant; for small n, it incorporates your prior belief about θ. From Figure 2
we can see that map smoothing works well. But note that q is uncertain!
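Eq. (5) is easy to play with in code. A minimal sketch with our 4 heads and 6 tails, assuming q = 0.5 (i.e., m imaginary heads and m imaginary tails):

```python
# Pseudo-count smoothing from Eq. (5): add m imaginary heads and m imaginary
# tails (encoding the prior belief q = 0.5) before estimating theta.
def smoothed_estimate(n_heads, n_tails, m):
    return (n_heads + m) / (n_heads + n_tails + 2 * m)

print(smoothed_estimate(4, 6, 0))    # 0.4 -- plain mle
print(smoothed_estimate(4, 6, 1))    # ~0.417, pulled slightly toward 0.5
print(smoothed_estimate(4, 6, 100))  # ~0.495, the prior dominates for small n
```

The last call shows the flip side: with few real observations, a heavy pseudo-count m drowns out the data.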
4
Figure 2: Comparing mle to map smoothing, we see that incorporating prior knowledge helps when observing few training examples.
The Bayesian way is to model θ as a random variable drawn from a distribution p(θ). Note that θ is not
a random variable associated with an event in a sample space. In frequentist statistics, this is forbidden. In
Bayesian statistics, this is allowed.
Since θ is a continuous univariate RV on [0, 1], we can use the Beta distribution (cf. fcml 2.5.2) to model p(θ):
p(θ) = θ^(α−1) (1 − θ)^(β−1) / b(α, β) = Beta(θ ∣ α, β)        (6)

where b(α, β) = Γ(α)Γ(β)/Γ(α + β) is the Beta function that acts as a normalization constant. Note that here we only need a distribution over a univariate (1d) random variable. The multivariate generalization of the Beta distribution is the Dirichlet distribution.
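As a sanity check on Eq. (6), the density can be evaluated directly with the Gamma function from the standard library; a crude Riemann sum should then integrate to 1 (α = 3, β = 2 below are arbitrary example values):

```python
import math

def beta_pdf(theta, alpha, beta):
    # b(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
    b = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1) / b

# The density should integrate to 1 over [0, 1] (midpoint Riemann sum).
n = 10_000
total = sum(beta_pdf((i + 0.5) / n, 3.0, 2.0) for i in range(n)) / n
print(total)  # ~1.0
```

Evaluating beta_pdf on a grid for different (α, β) pairs is a quick way to build intuition for how the prior shape encodes prior belief about θ.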
Note: multiplying the binomial likelihood (2) by the Beta prior (6), the posterior over θ is again a Beta distribution (the Beta prior is conjugate to the binomial likelihood):

p(θ ∣ D) = Beta(θ ∣ nH + α, nT + β)        (7)
Note that in general θ are the parameters of our model. For the coin flipping scenario θ = p(y = H). So far,
we have a distribution over θ. How can we get an estimate for θ?
Formally (map principle): Find θ̂ that maximizes the posterior distribution over parameters p(θ ∣ D):
Advantages:
• as n → ∞, θ̂map → θ̂mle
• map is a great estimator if prior belief exists and is accurate
Disadvantages:
• if n is small, it can be very wrong if prior belief is wrong
• also we have to choose a reasonable prior (p(θ) > 0 ∀ θ)
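For the coin flips the map estimate has a simple closed form: the posterior (7) is Beta(nH + α, nT + β), and the mode of a Beta(a, b) with a, b > 1 is (a − 1)/(a + b − 2). A minimal sketch (the α, β values below are example choices):

```python
# map estimate = mode of the Beta posterior Beta(nH + alpha, nT + beta),
# valid when nH + alpha > 1 and nT + beta > 1.
def theta_map(n_heads, n_tails, alpha, beta):
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

# Uniform prior (alpha = beta = 1): map reduces to the mle.
print(theta_map(4, 6, 1, 1))  # 0.4
# alpha = beta = 2 pulls toward 0.5 and matches Eq. (5) with m = 1.
print(theta_map(4, 6, 2, 2))  # 5/12 ~ 0.417
```

The second call makes the connection to pseudo-count smoothing explicit: choosing α = β = m + 1 reproduces Eq. (5).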
4 Posterior Mean
So, instead of the maximum as we did with map, we can use the posterior mean (and even its variance).
For large n all three estimators (mle, map, mean) will be the same, however, for a small number of
observations we see that they are different as shown in Figure 4.
Figure 4: Probability density function (pdf) for likelihood, prior, and posterior (over parameters) and
estimators (mle, map, posterior mean) when the number of data points is small (left) and large (right).
Exercise 4.1. In our coin flipping example, show that θ̂mean → θ̂mle and θ̂map → θ̂mle as n → ∞.
p(y = H ∣ D) = ∫ p(y = H, θ ∣ D) dθ
             = ∫ p(y = H ∣ θ, D) p(θ ∣ D) dθ
             = ∫ θ p(θ ∣ D) dθ

where the integrals are over θ.
Here, we used the fact that we defined p(y = H) = θ and that p(y = H) = p(y = H ∣ θ, D) (this is only the case
for coin flipping - not in general). Note that the prediction using the predictive posterior distribution is the
same as θ̂mean . Again, this nice result only holds for this particular example (coin flipping) and not in general.
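Since the mean of a Beta(a, b) is a/(a + b), the posterior mean here is θ̂mean = (nH + α)/(nH + nT + α + β). A small sketch comparing it to the mle as n grows at a fixed 40% heads ratio (α = β = 2 is an example prior):

```python
# Posterior mean of Beta(nH + alpha, nT + beta): E[theta | D] = a / (a + b).
def theta_mean(n_heads, n_tails, alpha, beta):
    a, b = n_heads + alpha, n_tails + beta
    return a / (a + b)

def theta_mle(n_heads, n_tails):
    return n_heads / (n_heads + n_tails)

# Same 40% heads ratio at increasing n: the two estimators converge.
for n in (10, 100, 10_000):
    nh = int(0.4 * n)
    print(n, theta_mle(nh, n - nh), theta_mean(nh, n - nh, 2.0, 2.0))
```

At n = 10 the prior visibly pulls the posterior mean toward 0.5; by n = 10,000 the gap to the mle is negligible, matching Exercise 4.1.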
5 Summary
List of concepts and terms to understand from this lecture:
• data likelihood
• log-likelihood
• negative log-likelihood
• prior distribution over parameters
• posterior distribution over parameters
• posterior predictive distribution
• mle estimator
• map estimator
(a) Using your own words, summarize each of the concepts listed above in 2-3 sentences, retrieving the knowledge off the top of your head.
(b) Why are we interested in the log likelihood?
(c) What is the main difference between the mle estimator and the map estimator?
(d) What is the difference between the posterior distribution over parameters and the posterior predictive
distribution?
(e) Why do we need the posterior distribution over parameters? Give a guess now. We will cover this in
the next lecture!
And always remember: It’s not bad to get it wrong. Getting it wrong is part of learning! Use your notes or
other resources to get the correct answer or come to our office hours to get help!