
2 - Probability Review and Maximum Likelihood Estimation

UCLA Math156: Machine Learning


Instructor: Lara Kassab
Motivation

A key concept in the field of pattern recognition is that of uncertainty.

It arises both through noise on measurements and through the finite size of data sets.

Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition.
Joint and Conditional Probabilities

Joint Probability

p(X, Y): the probability of X and Y

Conditional Probability

p(X|Y): the probability of X given Y
Independent and Conditional Probabilities

1. Assuming that p(B) ≠ 0, the conditional probability of A given B is
   p(A|B) = p(A, B) / p(B)
   Product Rule: p(A, B) = p(A|B) p(B) = p(B|A) p(A)

2. Two events A and B are independent if p(A, B) = p(A) p(B), i.e., the joint probability is the product of the marginals.

3. Two events A and B are conditionally independent given C if they are independent after conditioning on C:
   p(A, B|C) = p(B|A, C) p(A|C) = p(B|C) p(A|C).
Marginalization and Law of Total Probability

Sum Rule (Marginalization): p(X) = Σ_Y p(X, Y)

Law of Total Probability: p(X) = Σ_Y p(X|Y) p(Y)
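With a small joint table, the sum rule is a one-liner; a minimal sketch (the probabilities below are made up for illustration):

```python
# Hypothetical joint distribution p(X, Y) over X in {0, 1}, Y in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.15, (1, 1): 0.25, (1, 2): 0.20,
}

# Sum rule: p(X = x) = sum over Y of p(X = x, Y = y).
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

# The marginal must itself sum to one.
print(abs(sum(p_x.values()) - 1.0) < 1e-9)
```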
Bayes’ Theorem

From the product rule, together with the symmetry property p(X, Y) = p(Y, X), we immediately obtain the following relationship between conditional probabilities:

p(Y|X) = p(X|Y) p(Y) / p(X)

Note the denominator p(X) = Σ_Y p(X|Y) p(Y).
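As a quick numerical sanity check, Bayes' theorem can be applied to a small two-valued example (the numbers below are hypothetical, chosen only for illustration):

```python
# Hypothetical example: Y = 1 means a condition is present, X = 1 a positive test.
p_Y = 0.01                # prior p(Y = 1)
p_X_given_Y1 = 0.95       # likelihood p(X = 1 | Y = 1)
p_X_given_Y0 = 0.05       # false-positive rate p(X = 1 | Y = 0)

# Denominator via the law of total probability: p(X=1) = sum_Y p(X=1|Y) p(Y).
p_X = p_X_given_Y1 * p_Y + p_X_given_Y0 * (1 - p_Y)

# Bayes' theorem: p(Y=1 | X=1) = p(X=1 | Y=1) p(Y=1) / p(X=1).
p_Y_given_X = p_X_given_Y1 * p_Y / p_X
print(round(p_Y_given_X, 3))
```

Even with a fairly accurate test, the small prior keeps the posterior well below the likelihood, which is exactly the prior-times-likelihood structure the theorem encodes.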
Bayes’ Theorem

Suppose we have a statistical model with some parameters w, and we have observed some data D. Bayes’ Theorem:

p(w|D) = p(D|w) p(w) / p(D)

posterior = (likelihood · prior) / evidence

posterior ∝ (likelihood function) · (prior distribution)


Bayes’ Theorem

p(w|D) ∝ p(D|w)p(w)

p(w|D): represents our updated belief about the values of the parameters after observing the data

p(D|w): represents how well the model with parameters w explains the observed data

p(w): represents our beliefs about the values of the parameters prior to observing any data
Discrete vs Continuous Random Variables

Discrete Random Variables

Distribution defined by a probability mass function (pmf)

Marginalization: p(X) = Σ_Y p(X, Y)

Continuous Random Variables

Distribution defined by a probability density function (pdf)

Marginalization: p(X) = ∫ p(X, Y) dY
Probability Distribution Statistics: Expectation

The average value of some function f(x) under a probability distribution p(x) is called the expectation of f(x) and will be denoted by E[f].

For a discrete distribution, it is given by

E[f] = Σ_x p(x) f(x)

In the case of continuous variables,

E[f] = ∫ p(x) f(x) dx

One of the most important operations involving probabilities is that of finding weighted averages of functions.
Probability Distribution Statistics: Expectation

In either case, if we are given a finite number N of points drawn from the probability distribution or probability density, then the expectation can be approximated as a finite sum over these points:

E[f] ≈ (1/N) Σ_{n=1}^N f(x_n)

The approximation becomes exact in the limit N → ∞.
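This finite-sample (Monte Carlo) approximation is easy to check numerically; a minimal sketch using f(x) = x² under a standard normal, for which the exact expectation E[x²] = 1:

```python
import random

random.seed(0)
N = 200_000

# Draw N samples x_n from p(x) = N(0, 1) and average f(x_n) = x_n**2.
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
approx = sum(x * x for x in samples) / N   # (1/N) sum_n f(x_n)

# E[x^2] = var + mean^2 = 1 for a standard normal, so approx should be near 1.
print(abs(approx - 1.0) < 0.02)
```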


Probability Distribution Statistics: Variance

The variance of f(x) is defined by

var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²

and provides a measure of how much variability there is in f(x) around its mean value E[f(x)].
Probability Distribution Statistics: Covariance

For two random variables x and y, the covariance is defined by

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[xy] − E[x] E[y]

which expresses the extent to which x and y vary together. If x and y are independent, then their covariance vanishes.
Probability Distribution Statistics: Covariance

In the case of two vectors of random variables x and y, the covariance is a matrix:

cov[x, y] = E_{x,y}[(x − E[x])(y⊤ − E[y⊤])] = E_{x,y}[xy⊤] − E[x] E[y⊤].

Note. In this class, vectors are denoted by lower case bold Roman
letters such as x, and all vectors are assumed to be column vectors.
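The identity cov[x, y] = E[xy⊤] − E[x] E[y⊤] can be verified on sample data, where the expectations become sample averages; a sketch using NumPy, computing both sides directly rather than calling a library covariance routine:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))   # 1000 draws of a 2-dimensional vector x
y = rng.normal(size=(1000, 3))   # 1000 draws of a 3-dimensional vector y

# Left-hand side: average of (x - E[x])(y - E[y])^T, a 2x3 matrix.
xc = x - x.mean(axis=0)
yc = y - y.mean(axis=0)
lhs = xc.T @ yc / len(x)

# Right-hand side: E[x y^T] - E[x] E[y]^T.
rhs = (x.T @ y) / len(x) - np.outer(x.mean(axis=0), y.mean(axis=0))

print(np.allclose(lhs, rhs))     # the two expressions agree exactly (up to rounding)
```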
I.I.D.

We say the r.v.’s in x = (x₁, · · · , x_n)⊤ are independent and identically distributed (i.i.d.) if they are drawn independently from the same distribution.

In this case, the joint density is

p(x) = ∏_{i=1}^n p(x_i)
The Gaussian Distribution

The Gaussian or Normal distribution, for the case of a single real-valued variable x, is defined by

N(x | µ, σ²) = (1 / (2πσ²)^{1/2}) exp(−(x − µ)² / (2σ²))

which is governed by two parameters: µ, called the mean, and σ², called the variance.

The square root of the variance, given by σ, is called the standard deviation, and the reciprocal of the variance, written as β = 1/σ², is called the precision.
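The density can be coded directly from the formula; a minimal sketch using only the standard library (at x = µ the exponent vanishes, so the density is simply the normalizing constant 1/√(2πσ²)):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma2) for a single real-valued x; sigma2 is the variance."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# At the mean, the density equals 1 / sqrt(2*pi*sigma2); here sigma2 = 2.
print(math.isclose(gaussian_pdf(0.0, 0.0, 2.0), 1 / math.sqrt(4 * math.pi)))
```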
The Gaussian Distribution

The Gaussian distribution defined over a D-dimensional vector x of continuous variables is given by

N(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−½ (x − µ)⊤ Σ⁻¹ (x − µ))

where the D-dimensional vector µ is called the mean, the D × D matrix Σ is called the covariance, and |Σ| denotes the determinant of Σ.
MLE for Parameter Estimation

Suppose you obtain i.i.d. samples from a r.v. that you think is normally distributed. How can we determine the normal distribution (i.e., the parameters µ and σ²) these samples likely came from?

→ We can estimate them using Maximum Likelihood Estimation (MLE). Let’s discuss this.
Bayesian probabilities

Roughly speaking:

1. Frequentist Approach: probability is interpreted as the long-run frequency of a ‘repeatable’ event, which led to the notion of confidence intervals.

2. Bayesian Approach: probability is interpreted as a measure of uncertainty, or degree of belief in an event occurring, given the information available.
Maximum Likelihood Estimation

1. A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D|w).

2. This corresponds to choosing the value of w for which the probability of the observed data set is maximized.

3. In other words, choosing the parameters w that make the observed data the most likely.
Maximum Likelihood Estimation

Assume the data we have, D = {x₁, x₂, · · · , x_n}, is independent and identically distributed (i.i.d.).

Therefore, the likelihood of our data given parameters w is

p(D|w) = ∏_{i=1}^n p(x_i|w)

In MLE, our goal is to choose values of our parameters w that maximize the likelihood function p(D|w), i.e.,

w_ML = argmax_w p(D|w)
Maximum Likelihood Estimation

Since log is a monotonically increasing function, the argmax of a function is the same as the argmax of the log of the function. That’s nice because logs make the math simpler.

Therefore, for MLE, we first write the log likelihood function as:

w_ML = argmax_w p(D|w) = argmax_w ∏_{i=1}^n p(x_i|w)
     = argmax_w log( ∏_{i=1}^n p(x_i|w) )
     = argmax_w Σ_{i=1}^n log p(x_i|w)
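The invariance of the argmax under log can be seen numerically: maximizing the likelihood and maximizing the log likelihood over the same grid pick out the same parameter. A sketch for a Bernoulli likelihood with hypothetical counts (7 heads in 10 flips, so the maximizer is w = 0.7):

```python
import math

# Hypothetical Bernoulli data: 7 heads out of 10 flips; w = p(heads).
heads, n = 7, 10
grid = [i / 100 for i in range(1, 100)]   # candidate values of w in (0, 1)

lik = [w ** heads * (1 - w) ** (n - heads) for w in grid]
loglik = [heads * math.log(w) + (n - heads) * math.log(1 - w) for w in grid]

# log is monotonically increasing, so both searches return the same w.
w_lik = grid[lik.index(max(lik))]
w_log = grid[loglik.index(max(loglik))]
print(w_lik == w_log)
```

In practice the log form is preferred because the raw product of many small densities underflows floating point, while the sum of logs does not.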
MLE for Gaussian

Assume the data we have, x = (x₁, x₂, · · · , x_n)⊤, is i.i.d. from a Gaussian distribution whose mean µ and variance σ² are unknown.

→ We would like to determine these parameters from the data set. Let us do this together by MLE:
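Setting the derivatives of the log likelihood Σᵢ log N(xᵢ | µ, σ²) to zero gives the standard closed forms µ_ML = (1/n) Σᵢ xᵢ and σ²_ML = (1/n) Σᵢ (xᵢ − µ_ML)². A numerical sketch (with synthetic data) that computes these estimates and checks that perturbing either parameter does not increase the log likelihood:

```python
import math
import random

random.seed(1)
data = [random.gauss(5.0, 2.0) for _ in range(5000)]  # true mu = 5, sigma^2 = 4

n = len(data)
mu_ml = sum(data) / n                                 # mu_ML: sample mean
var_ml = sum((x - mu_ml) ** 2 for x in data) / n      # sigma^2_ML: 1/n sum of squares

def log_likelihood(mu, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

best = log_likelihood(mu_ml, var_ml)
# The closed-form estimates maximize the log likelihood, so small
# perturbations of either parameter should never do better.
print(all(log_likelihood(mu_ml + dm, var_ml + dv) <= best
          for dm in (-0.1, 0.0, 0.1) for dv in (-0.1, 0.0, 0.1)))
```

Note that σ²_ML divides by n, not n − 1, so it is a (slightly) biased estimate of the true variance.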
MLE for Gaussian (Continued)
Maximum A Posteriori Estimation

MLE is great, but it is not the only way to estimate parameters!

Maximum A Posteriori (MAP) states that we should choose the value for our parameters that is the most likely given the data.

Recall that MLE chooses the value of parameters that makes the data most likely. One of the disadvantages of MLE is that it best explains the data we have seen and makes no attempt to generalize to unseen data.

In MAP, we incorporate prior belief about our parameters, and then we update our posterior belief of the parameters based on the data we have seen.
Maximum A Posteriori Estimation

Assume the data we have, D = {x₁, x₂, · · · , x_n}, is independent and identically distributed (i.i.d.).

In MAP, our goal is to choose values of our parameters w that maximize the posterior distribution p(w|D), i.e.,

w_MAP = argmax_w p(w|D)

Using Bayes’ Theorem:

w_MAP = argmax_w p(w|D) = argmax_w p(D|w) p(w) / p(D)
      = argmax_w ∏_{i=1}^n p(x_i|w) p(w)

(The evidence p(D) does not depend on w, so it can be dropped from the argmax.)
Maximum A Posteriori Estimation

As before, it will be more convenient to find the argmax of the log of the MAP objective, which yields:

w_MAP = argmax_w ( log p(w) + Σ_{i=1}^n log p(x_i|w) )

Note. If you compare with MLE, notice that MAP is the argmax
of the same function plus a term for the log of the prior. We will
discuss some of the implications of this!
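One concrete implication of the extra log-prior term: for a Gaussian mean with known variance σ² and a Gaussian prior N(µ | µ₀, σ₀²), completing the square gives the standard closed form µ_MAP = (σ² µ₀ + σ₀² Σᵢ xᵢ) / (σ² + n σ₀²), which shrinks the MLE toward the prior mean. A sketch with made-up numbers:

```python
import random

random.seed(2)
sigma2 = 1.0               # known data variance (assumed)
mu0, sigma0_2 = 0.0, 0.5   # Gaussian prior on the mean (assumed values)
data = [random.gauss(3.0, 1.0) for _ in range(10)]   # few samples, true mean 3

n = len(data)
mu_mle = sum(data) / n     # MLE: the sample mean

# Standard closed form for the MAP estimate of a Gaussian mean.
mu_map = (sigma2 * mu0 + sigma0_2 * sum(data)) / (sigma2 + n * sigma0_2)

# With few data points, MAP lies between the prior mean mu0 and the MLE.
print(mu0 < mu_map < mu_mle)
```

As n grows, the likelihood term dominates the fixed log-prior term and µ_MAP approaches µ_MLE, so the prior mainly matters in the small-data regime.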
