03 Lecturenote MLE MAP
Learning Objective
Understand that many ml algorithms estimate probabilities. Appreciate that estimating probability distributions is beneficial. Get to know common estimators for the parameters of probability distributions.
Application
Imagine that we want to predict the production of bushels of corn per acre on a farm as a function of the proportion of that farm’s planting area that was treated with a new pesticide. We have training data for 100 farms for this problem (https://www.developer.com/mgmt/real-world-machine-learning-model-evaluation-and-optimization.html).
Take some time to answer the following warm-up questions: (1) Is this a regression or classification problem? (2) What are the features? (3) What is the prediction task? (4) How well do you think a linear model will perform? http://www.corncapitalinnovations.com/production/300-bushel-corn/
1 Introduction
For a general machine learning problem, our goal is to estimate the function f which satisfies f (x) = y. For example, ordinary least squares regression estimates the vector w such that f (x) = w⊺ x. However, this approach has some limitations: (1) y is only a point estimate, and (2) how do we deal with noise?
Therefore, it is a better idea to estimate the conditional probability p(y ∣ x). In this way, we can deal with uncertain outcomes/noise and incorporate prior knowledge.
1.1 Noise
Let’s look at noise in our corn production application. Plotting the target (bushels of corn per acre) versus the feature (% of the farm treated), it is clear that an increasing, non-linear relationship exists, and that the data also have random fluctuations, cf. Figure 1.
You have the following data: H, T, T, H, H, H, T, T, T, T. What is p(y = H) given that we observed nH heads and nT tails? So, intuitively,

p(y = H) = nH / (nH + nT) = 0.4        (1)

Note: we have no x’s in this example. Let’s formally derive this probability.
Formally (mle principle): Find θ̂ that maximizes the likelihood of the data p(D ∣ θ):
For the sequence of coin flips we can use the binomial distribution (cf. fcml 2.3.2) to model p(D ∣ θ):
p(D ∣ θ) = C(nH + nT, nH) θ^nH (1 − θ)^nT = Bin(nH ∣ n, θ)        (2)

where C(n, k) denotes the binomial coefficient “n choose k” and n = nH + nT.
Now,

θ̂mle = arg max_θ C(nH + nT, nH) θ^nH (1 − θ)^nT
     = arg max_θ log C(nH + nT, nH) + log θ^nH + log(1 − θ)^nT        (3)
     = arg max_θ nH log θ + nT log(1 − θ)
We can now solve for θ by taking the derivative and equating it to zero. This results in:

nH/θ = nT/(1 − θ)  ⇒  nH(1 − θ) = nT θ  ⇒  θ̂mle = nH / (nH + nT)        (4)
Note that we found the (arg) maximum of the log-likelihood instead of the likelihood, which often leads to equations that are much easier to solve!
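The counting estimate and the derivation above can be cross-checked numerically. A minimal sketch, using the ten flips from this example (the grid search is just a sanity check, not how you would estimate θ in practice):

```python
import math

# The ten coin flips from the example above: 4 heads, 6 tails.
data = ["H", "T", "T", "H", "H", "H", "T", "T", "T", "T"]
n_heads = data.count("H")
n_tails = data.count("T")

# Closed form from Eq. (4): theta_mle = nH / (nH + nT).
theta_mle = n_heads / (n_heads + n_tails)

# Sanity check: maximize the log-likelihood
# nH*log(theta) + nT*log(1 - theta) over a fine grid of theta values.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: n_heads * math.log(t) + n_tails * math.log(1 - t))

print(theta_mle)   # 0.4
print(theta_grid)  # 0.4 (up to grid resolution)
```

Dropping the binomial coefficient in the grid search is fine: it does not depend on θ, so it does not move the arg max, exactly as in Eq. (3).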
Advantages:
• mle gives the most likely explanation of the data you observed.
• If n is large and your model/distribution is correct (that is, H includes the true model), then mle finds the true parameters.
Disadvantages:
• The mle can overfit the data if n is small.
• If you do not have the correct model (and n is small) then mle can be terribly wrong.
p(y; θ) = θ e^(−θy).
In order to use this distribution to compute the probability of your own jX phone dying next month, we need to estimate its parameter θ.
(b) How do you derive the mle estimator θ̂ based on ll(θ, y1 , . . . , yn )? No computation required.
Exercise 2.2. Assume you model your data y1 , . . . , yn with a Poisson distribution:
P (y; θ) = θ^y e^(−θ) / y!    for y = 0, 1, 2, . . . , K.
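Independently of the analytical derivation the exercises ask for, you can always inspect the log-likelihood ll(θ) numerically. A sketch for the Poisson case, using a small hypothetical dataset of counts:

```python
import math

# Hypothetical count data (a stand-in for y_1, ..., y_n).
y = [2, 0, 3, 1, 2, 4, 1, 2]

def log_likelihood(theta):
    # ll(theta) = sum_i [ y_i * log(theta) - theta - log(y_i!) ]
    return sum(yi * math.log(theta) - theta - math.log(math.factorial(yi)) for yi in y)

# Evaluate ll on a grid over (0, 10] and pick the maximizer.
grid = [i / 100 for i in range(1, 1001)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)
```

Plotting ll over the grid is also a good way to see how sharply (or not) the data pin down θ.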
For example, set q = 0.5, then add m heads and m tails to dataset D. Now,
θ̂ = (nH + m) / (nH + nT + 2m)        (5)
For large n, this change is insignificant; for small n, it incorporates your prior belief about θ. From Figure 2
we can see that map smoothing works well. But note that q is uncertain!
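Eq. (5) is easy to play with in code. A minimal sketch with our 4 heads and 6 tails, assuming q = 0.5 (i.e., m imaginary heads and m imaginary tails):

```python
# Pseudo-count smoothing from Eq. (5): add m imaginary heads and m imaginary
# tails (encoding the prior belief q = 0.5) before estimating theta.
def smoothed_estimate(n_heads, n_tails, m):
    return (n_heads + m) / (n_heads + n_tails + 2 * m)

print(smoothed_estimate(4, 6, 0))    # 0.4 -- plain mle
print(smoothed_estimate(4, 6, 1))    # ~0.417, pulled slightly toward 0.5
print(smoothed_estimate(4, 6, 100))  # ~0.495, the prior dominates for small n
```

The last call shows the flip side: with few real observations, a heavy pseudo-count m drowns out the data.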
4
Figure 2: Comparing mle to map smoothing, we see that incorporating prior knowledge helps when observing few training examples.
The Bayesian way is to model θ as a random variable drawn from a distribution p(θ). Note that θ is not
a random variable associated with an event in a sample space. In frequentist statistics, this is forbidden. In
Bayesian statistics, this is allowed.
Since θ is a continuous univariate RV on [0, 1], we can use the Beta distribution (cf. fcml 2.5.2) to model p(θ):
p(θ) = θ^(α−1) (1 − θ)^(β−1) / b(α, β) = Beta(θ ∣ α, β)        (6)

where b(α, β) = Γ(α)Γ(β)/Γ(α + β) is the Beta function that acts as a normalization constant. Note that here we only need a distribution over a univariate (1d) random variable. The multivariate generalization of the Beta distribution is the Dirichlet distribution.
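As a sanity check on Eq. (6), the density can be evaluated directly with the Gamma function from the standard library; a crude Riemann sum should then integrate to 1 (α = 3, β = 2 below are arbitrary example values):

```python
import math

def beta_pdf(theta, alpha, beta):
    # b(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
    b = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1) / b

# The density should integrate to 1 over [0, 1] (midpoint Riemann sum).
n = 10_000
total = sum(beta_pdf((i + 0.5) / n, 3.0, 2.0) for i in range(n)) / n
print(total)  # ~1.0
```

Evaluating beta_pdf on a grid for different (α, β) pairs is a quick way to build intuition for how the prior shape encodes prior belief about θ.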
Note: multiplying the binomial likelihood (2) by the Beta prior (6), the posterior over θ is again a Beta distribution (the Beta prior is conjugate to the binomial likelihood):

p(θ ∣ D) = Beta(θ ∣ nH + α, nT + β)        (7)
Note that in general θ are the parameters of our model. For the coin flipping scenario θ = p(y = H). So far,
we have a distribution over θ. How can we get an estimate for θ?
Formally (map principle): Find θ̂ that maximizes the posterior distribution over parameters p(θ ∣ D):
Advantages:
• as n → ∞, θ̂map → θ̂mle
• map is a great estimator if prior belief exists and is accurate
Disadvantages:
• if n is small, it can be very wrong if prior belief is wrong
• also we have to choose a reasonable prior (p(θ) > 0 ∀ θ)
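For the coin flips the map estimate has a simple closed form: the posterior (7) is Beta(nH + α, nT + β), and the mode of a Beta(a, b) with a, b > 1 is (a − 1)/(a + b − 2). A minimal sketch (the α, β values below are example choices):

```python
# map estimate = mode of the Beta posterior Beta(nH + alpha, nT + beta),
# valid when nH + alpha > 1 and nT + beta > 1.
def theta_map(n_heads, n_tails, alpha, beta):
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

# Uniform prior (alpha = beta = 1): map reduces to the mle.
print(theta_map(4, 6, 1, 1))  # 0.4
# alpha = beta = 2 pulls toward 0.5 and matches Eq. (5) with m = 1.
print(theta_map(4, 6, 2, 2))  # 5/12 ~ 0.417
```

The second call makes the connection to pseudo-count smoothing explicit: choosing α = β = m + 1 reproduces Eq. (5).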
4 Posterior Mean
So, instead of the maximum as we did with map, we can use the posterior mean (and even its variance).
For large n all three estimators (mle, map, mean) will be the same, however, for a small number of
observations we see that they are different as shown in Figure 4.
Figure 4: Probability density function (pdf) for likelihood, prior, and posterior (over parameters) and
estimators (mle, map, posterior mean) when the number of data points is small (left) and large (right).
Exercise 4.1. In our coin flipping example, show that θ̂mean → θ̂mle and θ̂map → θ̂mle as n → ∞.
p(y = H ∣ D) = ∫ p(y = H, θ ∣ D) dθ
             = ∫ p(y = H ∣ θ, D) p(θ ∣ D) dθ
             = ∫ θ p(θ ∣ D) dθ

where the integrals are over θ.
Here, we used the fact that we defined p(y = H) = θ and that p(y = H) = p(y = H ∣ θ, D) (this is only the case
for coin flipping - not in general). Note that the prediction using the predictive posterior distribution is the
same as θ̂mean . Again, this nice result only holds for this particular example (coin flipping) and not in general.
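Since the mean of a Beta(a, b) is a/(a + b), the posterior mean here is θ̂mean = (nH + α)/(nH + nT + α + β). A small sketch comparing it to the mle as n grows at a fixed 40% heads ratio (α = β = 2 is an example prior):

```python
# Posterior mean of Beta(nH + alpha, nT + beta): E[theta | D] = a / (a + b).
def theta_mean(n_heads, n_tails, alpha, beta):
    a, b = n_heads + alpha, n_tails + beta
    return a / (a + b)

def theta_mle(n_heads, n_tails):
    return n_heads / (n_heads + n_tails)

# Same 40% heads ratio at increasing n: the two estimators converge.
for n in (10, 100, 10_000):
    nh = int(0.4 * n)
    print(n, theta_mle(nh, n - nh), theta_mean(nh, n - nh, 2.0, 2.0))
```

At n = 10 the prior visibly pulls the posterior mean toward 0.5; by n = 10,000 the gap to the mle is negligible, matching Exercise 4.1.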
5 Summary
List of concepts and terms to understand from this lecture:
• data likelihood
• log-likelihood
• negative log-likelihood
• prior distribution over parameters
• posterior distribution over parameters
• posterior predictive distribution
• mle estimator
• map estimator
(a) Using your own words, summarize each of the concepts listed above in 2-3 sentences, retrieving the knowledge off the top of your head.
(b) Why are we interested in the log likelihood?
(c) What is the main difference between the mle estimator and the map estimator?
(d) What is the difference between the posterior distribution over parameters and the posterior predictive
distribution?
(e) Why do we need the posterior distribution over parameters? Give a guess now. We will cover this in
the next lecture!
And always remember: It’s not bad to get it wrong. Getting it wrong is part of learning! Use your notes or
other resources to get the correct answer or come to our office hours to get help!