03 Lecturenote MLE MAP

The document outlines a lecture on estimating probabilities from data in a machine learning context, focusing on concepts such as maximum likelihood estimation (MLE) and Bayesian methods. It discusses the importance of estimating probability distributions for handling noise and uncertain outcomes, using examples like coin tossing and corn production. Key terms introduced include likelihood, prior and posterior distributions, and different estimation techniques, emphasizing the benefits and limitations of each approach.


CSE517A Machine Learning Fall 2022

Lecture 3: Estimating Probabilities from Data


Instructor: Marion Neumann
Reading: fcml 2.1-2.6 (Random Variables and Probability), 3.1-3.7 (Coin Game)

Learning Objective
Understand that many ml algorithms estimate probabilities. Appreciate that estimating probability distributions is beneficial. Get to know common estimators for the parameters of probability distributions.

Application
Imagine that we want to predict the production of bushels of corn per acre on a farm as a function of the proportion of that farm’s planting area that was treated with a new pesticide. We have training data for 100 farms for this problem (https://www.developer.com/mgmt/real-world-machine-learning-model-evaluation-and-optimization.html).
Take some time to answer the following warm-up questions:
(1) Is this a regression or classification problem? (2) What are the features? (3) What is the prediction task? (4) How well do you think a linear model will perform? https://www.corncapitalinnovations.com/production/300-bushel-corn/

1 Introduction
For a general machine learning problem, our goal is to estimate the function f which satisfies f (x) = y. For example, ordinary least squares regression estimates the vector w such that f (x) = w⊺ x. However, this has some limitations: (1) y is only a point estimate, and (2) how do we deal with noise?
Therefore, it is a better idea to estimate the conditional probability p(y ∣ x). In this way, we can deal with uncertain outcomes/noise and incorporate prior knowledge.

1.1 Noise
Let’s look at noise in our corn production application. Plotting the target (bushels of corn per acre) versus the feature (% of the farm treated), it is clear that an increasing, non-linear relationship exists, and that the data also have random fluctuations, cf. Figure 1.[1]

Figure 1: Signal-to-noise ratio in corn production application.


Exercise 1.1. What does noise look like in classification problems such as image classification?
[1] image source: https://www.developer.com/mgmt/real-world-machine-learning-model-evaluation-and-optimization.html


Discriminative vs. Generative Learning


Most supervised machine learning methods can be viewed as estimating p(y ∣ x) or p(x, y). When we
estimate p(y ∣ x) directly, then we call it discriminative learning. When we estimate p(x, y) = p(x ∣ y)p(y),
then we call it generative learning. In the following lectures, we will introduce examples for both.

1.2 Basic Problem: Tossing a Coin


Before we start thinking about estimating probability distributions in the context of regression and classification, let’s start with a simpler example. Imagine you find a funny looking coin and you start flipping it.
Naturally, you ask yourself: What is the probability that it comes up heads?

You have the following data: H, T, T, H, H, H, T, T, T, T. What is p(y = H) given that we observed nH
heads and nT tails? So, intuitively,
p(y = H) = nH / (nH + nT) = 4/10 = 0.4

Note: we have no x’s in this example. Let’s formally derive this probability.
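As a quick sanity check, the intuitive estimate can be computed directly from the observed sequence. A minimal Python sketch (the variable names are ours):

```python
# Count-based estimate of p(y = H) from the observed flips.
flips = ["H", "T", "T", "H", "H", "H", "T", "T", "T", "T"]

n_heads = flips.count("H")
n_tails = flips.count("T")
p_head = n_heads / (n_heads + n_tails)

print(p_head)  # 0.4
```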

2 Maximum Likelihood Estimation


Let p(y = H) = θ, where θ is the unknown parameter. All we have is D (sequence of heads and tails). So,
the goal is to choose θ such that the observed data D is most likely.

Formally (mle principle): Find θ̂ that maximizes the likelihood of the data p(D ∣ θ):

θ̂mle = arg max_θ p(D ∣ θ)   (1)

For the sequence of coin flips we can use the binomial distribution (cf. fcml 2.3.2) to model p(D ∣ θ):
p(D ∣ θ) = (nH + nT choose nH) θ^nH (1 − θ)^nT = Bin(nH ∣ n, θ)   (2)
Now,

θ̂mle = arg max_θ (nH + nT choose nH) θ^nH (1 − θ)^nT
     = arg max_θ log (nH + nT choose nH) + log θ^nH + log (1 − θ)^nT   (3)
     = arg max_θ nH log θ + nT log(1 − θ)

We can now solve for θ by taking the derivative and equating it to zero. This results in:

nH / θ = nT / (1 − θ)  ⇒  nH (1 − θ) = nT θ  ⇒  θ̂mle = nH / (nH + nT)   (4)

Note that we found the (arg) maximum of the log-likelihood instead of the likelihood, which often leads to equations that are much easier to solve!
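The closed form in Eq. 4 can be sanity-checked numerically: a simple grid search that maximizes the log-likelihood from Eq. 3 should land on nH/(nH + nT). A minimal sketch:

```python
import math

# Log-likelihood of nH heads and nT tails (the binomial coefficient is a
# constant in θ, so it is dropped; it does not change the arg max).
def log_likelihood(theta, n_h, n_t):
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

n_h, n_t = 4, 6
grid = [i / 1000 for i in range(1, 1000)]  # θ in (0, 1)
theta_mle = max(grid, key=lambda t: log_likelihood(t, n_h, n_t))

print(theta_mle)  # 0.4 = nH / (nH + nT)
```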

Advantages:
• mle gives the most likely explanation of the data you observed.
• If n is large and your model/distribution is correct (that is H includes the true model), then mle finds
the true parameters.

Disadvantages:
• But the mle can overfit the data if n is small.
• If you do not have the correct model (and n is small) then mle can be terribly wrong.

Exercise 2.1. What is the probability that my smartphone dies?


Let y1 , . . . , yn ∈ R with yi ≥ 0 be the customer-reported lifetimes of pear’s popular smartphone jX. We
further assume that lifetimes follow an exponential distribution:

p(y; θ) = θ e^(−θy).

In order for you to use this distribution to compute the probability of your own jX phone dying next month,
we need to estimate its parameter θ.

(a) Derive the log-likelihood ll(θ, y1 , . . . , yn ).

(b) How do you derive the mle estimator θ̂ based on ll(θ, y1 , . . . , yn )? No computation required.
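A numerical sketch for checking your derivation: the lifetimes below are made up for illustration, and the mle is located by grid search over the log-likelihood rather than by any closed form:

```python
import math

# Hypothetical lifetimes (in years); illustration only, not real data.
lifetimes = [0.5, 1.2, 2.0, 0.8, 1.5]

def log_likelihood(theta, ys):
    # ll(θ) = Σ log(θ e^(−θ y_i)) = n log θ − θ Σ y_i
    return len(ys) * math.log(theta) - theta * sum(ys)

grid = [i / 1000 for i in range(1, 5000)]  # θ in (0, 5)
theta_hat = max(grid, key=lambda t: log_likelihood(t, lifetimes))

print(theta_hat)
```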

Exercise 2.2. Assume you model your data y1 , . . . , yn with a Poisson distribution:

P(y; θ) = θ^y e^(−θ) / y!   for y = 0, 1, 2, . . . , K.

(a) Derive the negative log-likelihood (nll) of your data as a function of θ.


(b) For what data/events do you use a Poisson distribution? (You may search the internet for an answer.)
Name a general definition and find at least two example applications.
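As with Exercise 2.1, a grid search over the nll can serve as a numerical check of your derivation (the counts below are made up for illustration):

```python
import math

# Hypothetical counts (e.g., events per hour); illustration only.
counts = [2, 0, 3, 1, 2]

def nll(theta, ys):
    # Negative log of the product of Poisson pmfs.
    return -sum(y * math.log(theta) - theta - math.log(math.factorial(y))
                for y in ys)

grid = [i / 1000 for i in range(1, 10000)]  # θ in (0, 10)
theta_hat = min(grid, key=lambda t: nll(t, counts))

print(theta_hat)
```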

3 The Bayesian Way and Maximum-a-posterior Estimation


Suppose you observe H,H,H,H,H. What is θ̂mle? (It is 1: the mle predicts that tails can never come up.) Can we do something about this?

Answer: incorporate prior knowledge!

Say we think θ is close to q.


Simple fix (map smoothing): add m imaginary throws that would result in q.

For example, set q = 0.5, then add m heads and m tails to dataset D. Now,
θ̂ = (nH + m) / (nH + nT + 2m)   (5)
For large n, this change is insignificant; for small n, it incorporates your prior belief about θ. From Figure 2
we can see that map smoothing works well. But note that q is uncertain!
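Eq. 5 is only a few lines of code; here is a sketch for the all-heads sequence from above, with m = 5 imaginary throws as an arbitrary choice:

```python
# mle vs. map smoothing (Eq. 5, q = 0.5) after observing H,H,H,H,H.
n_h, n_t = 5, 0
m = 5  # number of imaginary heads and tails added (arbitrary choice)

theta_mle = n_h / (n_h + n_t)
theta_smooth = (n_h + m) / (n_h + n_t + 2 * m)

print(theta_mle, theta_smooth)  # 1.0 vs. 10/15 ≈ 0.67
```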

Figure 2: Comparing mle to map smoothing, we see that incorporating prior knowledge helps when observing few training examples.

The Bayesian way is to model θ as a random variable drawn from a distribution p(θ). Note that θ is not
a random variable associated with an event in a sample space. In frequentist statistics, this is forbidden. In
Bayesian statistics, this is allowed.

Now, we can look at p(θ ∣ D) = p(D ∣ θ) p(θ) / p(D) (Bayes rule), where

• p(D ∣ θ) is the likelihood of the data given the parameter(s) θ


• p(θ) is the prior distribution over the parameter(s) θ

• p(θ ∣ D) is the posterior distribution over the parameter(s) θ

Figure 3: Prior distribution, likelihood, and posterior distribution.



Since θ is a continuous univariate RV on [0, 1], we can use the Beta distribution (cf. fcml 2.5.2) to model p(θ):

p(θ) = θ^(α−1) (1 − θ)^(β−1) / b(α, β) = Beta(θ ∣ α, β)   (6)

where b(α, β) = Γ(α)Γ(β) / Γ(α + β) is the Beta function that acts as a normalization constant. Note that here we only need a distribution over a univariate (1d) random variable. The multivariate generalization of the Beta distribution is the Dirichlet distribution.

Why do we use the Beta distribution?


• it models continuous probabilities (θ lives on [0, 1] and ∑i θi = 1)
• it is of the same distributional family as the binomial distribution (conjugate prior )
→ p(θ ∣ D) ∝ p(D ∣ θ) p(θ) ∝ θ^(nH+α−1) (1 − θ)^(nT+β−1)

Note:
p(θ ∣ D) = Beta(nH + α, nT + β) (7)
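Conjugacy makes the posterior update in Eq. 7 a matter of adding counts. A minimal sketch (α = β = 2 and the counts are assumed values for illustration):

```python
import math

def beta_pdf(theta, a, b):
    # Beta(θ | a, b) density; b(a, b) = Γ(a)Γ(b)/Γ(a+b) is the normalizer.
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / norm

alpha, beta = 2, 2   # assumed prior hyperparameters
n_h, n_t = 4, 6      # observed heads and tails

def posterior(theta):
    # Conjugacy (Eq. 7): the posterior is Beta with the counts added.
    return beta_pdf(theta, n_h + alpha, n_t + beta)

print(posterior(0.4))
```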

Note that in general θ are the parameters of our model. For the coin flipping scenario θ = p(y = H). So far,
we have a distribution over θ. How can we get an estimate for θ?

For example, choose θ̂ to be the most likely θ given D.

Formally (map principle): Find θ̂ that maximizes the posterior distribution over parameters p(θ ∣ D):

θ̂map = arg max_θ p(θ ∣ D)
     = arg max_θ p(D ∣ θ) p(θ) / p(D)   (8)
     = arg max_θ log p(D ∣ θ) + log p(θ)

For the coin flipping scenario, we get:

θ̂map = arg max_θ log p(D ∣ θ) + log p(θ)
     = arg max_θ nH log θ + nT log(1 − θ) + (α − 1) log θ + (β − 1) log(1 − θ)   (9)
     = arg max_θ (nH + α − 1) log θ + (nT + β − 1) log(1 − θ)
Solve for θ by taking the derivative and equating it to zero. This results in:

θ̂map = (nH + α − 1) / (nH + nT + α + β − 2)   (10)

Note that we found the (arg) maximum of the log-posterior instead of the posterior, which often leads to equations that are much easier to solve!
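A sketch comparing Eq. 10 with the mle on the all-heads sequence from above (the prior α = β = 5 is an assumed choice encoding a belief that θ is near 0.5):

```python
# map (Eq. 10) vs. mle on the all-heads sequence H,H,H,H,H.
alpha, beta = 5, 5  # assumed prior: belief that θ is near 0.5
n_h, n_t = 5, 0

theta_mle = n_h / (n_h + n_t)
theta_map = (n_h + alpha - 1) / (n_h + n_t + alpha + beta - 2)

print(theta_mle, theta_map)  # 1.0 vs. 9/13 ≈ 0.69
```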

Advantages:
• as n → ∞, θ̂map → θ̂mle
• map is a great estimator if prior belief exists and is accurate
Disadvantage:
• if n is small, it can be very wrong if prior belief is wrong
• also we have to choose a reasonable prior (p(θ) > 0 ∀ θ)

4 “True” Bayesian Approach


map is only one way to get an estimator for θ. There is much more information in p(θ ∣ D).

Posterior Mean
So, instead of the maximum as we did with map, we can use the posterior mean (and even its variance).

θ̂mean = E[θ ∣ D] = ∫_θ θ p(θ ∣ D) dθ   (11)

For coin flipping, this can be computed as θ̂mean = (nH + α) / (nH + α + nT + β), as the posterior distribution is a Beta distribution, cf. Eq. 7, and the mean of Beta(α, β) is given by µ = α / (α + β).

For large n all three estimators (mle, map, mean) will be the same, however, for a small number of
observations we see that they are different as shown in Figure 4.

Figure 4: Probability density function (pdf) for likelihood, prior, and posterior (over parameters) and
estimators (mle, map, posterior mean) when the number of data points is small (left) and large (right).
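The behavior in Figure 4 can be reproduced with the three closed forms (α = β = 2 is an assumed prior choice):

```python
# mle, map, and posterior-mean estimates for a Beta prior (Eqs. 4, 10, 11).
alpha, beta = 2, 2  # assumed prior hyperparameters

def estimators(n_h, n_t):
    mle = n_h / (n_h + n_t)
    map_ = (n_h + alpha - 1) / (n_h + n_t + alpha + beta - 2)
    mean = (n_h + alpha) / (n_h + alpha + n_t + beta)
    return mle, map_, mean

print(estimators(4, 6))        # small n: the three estimates differ
print(estimators(4000, 6000))  # large n: the estimates nearly coincide
```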

Exercise 4.1. In our coin flipping example, show that θ̂mean and θ̂map both converge to θ̂mle as n → ∞.

Posterior Predictive Distribution


So far, we talked about modeling and estimating parameters. But in Machine Learning we are actually
interested in predictions. To directly estimate y from the given data, we can use the posterior predictive
distribution. In our coin tossing example, this is given by:

p(y = H ∣ D) = ∫_θ p(y = H, θ ∣ D) dθ
             = ∫_θ p(y = H ∣ θ, D) p(θ ∣ D) dθ   (12)
             = ∫_θ θ p(θ ∣ D) dθ

Here, we used the fact that we defined p(y = H) = θ and that p(y = H) = p(y = H ∣ θ, D) (this is only the case for coin flipping, not in general). Note that the prediction using the posterior predictive distribution is the same as θ̂mean. Again, this nice result only holds for this particular example (coin flipping) and not in general.
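Since the posterior is Beta(nH + α, nT + β), Eq. 12 can be checked by numerically integrating θ p(θ ∣ D) (α = β = 2 and the counts are assumed values for illustration):

```python
import math

def beta_pdf(theta, a, b):
    # Beta(θ | a, b) density with normalizer Γ(a)Γ(b)/Γ(a+b).
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / norm

alpha, beta = 2, 2              # assumed prior hyperparameters
n_h, n_t = 4, 6                 # observed heads and tails
a, b = n_h + alpha, n_t + beta  # posterior Beta parameters (Eq. 7)

# Riemann-sum approximation of p(y = H | D) = ∫ θ p(θ | D) dθ (Eq. 12).
steps = 10_000
p_heads = sum((i / steps) * beta_pdf(i / steps, a, b)
              for i in range(1, steps)) / steps

print(p_heads)  # close to the posterior mean (nH + α) / (nH + α + nT + β)
```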

5 Summary
List of concepts and terms to understand from this lecture:

• data likelihood
• log-likelihood
• negative log-likelihood
• prior distribution over parameters
• posterior distribution over parameters
• posterior predictive distribution
• mle estimator
• map estimator

Exercise 5.1. Practice Retrieving!


For this summary exercise, it is intended that your answers are based on your own (current) understanding
of the concepts (and not on the definitions you read and copy from these notes or from elsewhere). Don’t
hesitate to say it out loud to your seat neighbor, your pet or stuffed animal, or to yourself before writing
it down. Research studies show that this practice of retrieval and phrasing out loud will help you retain
the knowledge!

(a) Using your own words, summarize each of the concepts listed above in 2-3 sentences by retrieving the
knowledge from the top of your head.
(b) Why are we interested in the log-likelihood?
(c) What is the main difference between the mle estimator and the map estimator?
(d) What is the difference between the posterior distribution over parameters and the posterior predictive
distribution?
(e) Why do we need the posterior distribution over parameters? Give a guess now. We will cover this in
the next lecture!

And always remember: It’s not bad to get it wrong. Getting it wrong is part of learning! Use your notes or
other resources to get the correct answer or come to our office hours to get help!
