
2 - Probability Review and Maximum Likelihood Estimation

UCLA Math156: Machine Learning


Instructor: Lara Kassab
Motivation

A key concept in the field of pattern recognition is that of uncertainty.

It arises both through noise on measurements and through the finite size of data sets.

Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition.
Joint and Conditional Probabilities

Joint Probability

p(X, Y): the probability of X and Y

Conditional Probability

p(X|Y): the probability of X given Y
Independent and Conditional Probabilities

1. Assuming that p(B) ≠ 0, the conditional probability of A given B is
   p(A|B) = p(A, B) / p(B)
   Product Rule: p(A, B) = p(A|B) p(B) = p(B|A) p(A)

2. Two events A and B are independent if p(A, B) = p(A) p(B), i.e., the joint probability is the product of the marginals.

3. Two events A and B are conditionally independent given C if they are independent after conditioning on C:
   p(A, B|C) = p(B|A, C) p(A|C) = p(B|C) p(A|C).
Marginalization and Law of Total Probability

Sum Rule (Marginalization): p(X) = Σ_Y p(X, Y)

Law of Total Probability: p(X) = Σ_Y p(X|Y) p(Y)
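With a small joint table, the sum rule is a one-liner; a minimal sketch (the probabilities below are made up for illustration):

```python
# Hypothetical joint distribution p(X, Y) over X in {0, 1}, Y in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.15, (1, 1): 0.25, (1, 2): 0.20,
}

# Sum rule: p(X = x) = sum over Y of p(X = x, Y = y).
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

# The marginal must itself sum to one.
print(abs(sum(p_x.values()) - 1.0) < 1e-9)
```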
Bayes’ Theorem

From the product rule, together with the symmetry property p(X, Y) = p(Y, X), we immediately obtain the following relationship between conditional probabilities:

p(Y|X) = p(X|Y) p(Y) / p(X)

Note the denominator p(X) = Σ_Y p(X|Y) p(Y).
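As a quick numerical sanity check, Bayes' theorem can be applied to a small two-valued example (the numbers below are hypothetical, chosen only for illustration):

```python
# Hypothetical example: Y = 1 means a condition is present, X = 1 a positive test.
p_Y = 0.01                # prior p(Y = 1)
p_X_given_Y1 = 0.95       # likelihood p(X = 1 | Y = 1)
p_X_given_Y0 = 0.05       # false-positive rate p(X = 1 | Y = 0)

# Denominator via the law of total probability: p(X=1) = sum_Y p(X=1|Y) p(Y).
p_X = p_X_given_Y1 * p_Y + p_X_given_Y0 * (1 - p_Y)

# Bayes' theorem: p(Y=1 | X=1) = p(X=1 | Y=1) p(Y=1) / p(X=1).
p_Y_given_X = p_X_given_Y1 * p_Y / p_X
print(round(p_Y_given_X, 3))
```

Even with a fairly accurate test, the small prior keeps the posterior well below the likelihood, which is exactly the prior-times-likelihood structure the theorem encodes.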
Bayes’ Theorem

Suppose we have a statistical model with some parameters w, and we have observed some data D. Bayes’ Theorem:

p(w|D) = p(D|w) p(w) / p(D)

posterior = (likelihood · prior) / evidence

posterior ∝ (likelihood function) · (prior distribution)


Bayes’ Theorem

p(w|D) ∝ p(D|w)p(w)

p(w|D): represents our updated belief about the values of the parameters after observing the data

p(D|w): represents how well the model with parameters w explains the observed data

p(w): represents our beliefs about the values of the parameters prior to observing any data
Discrete vs Continuous Random Variables

Discrete Random Variables

Distribution defined by a probability mass function (pmf)

Marginalization: p(X) = Σ_Y p(X, Y)

Continuous Random Variables

Distribution defined by a probability density function (pdf)

Marginalization: p(X) = ∫ p(X, Y) dY
Probability Distribution Statistics: Expectation

The average value of some function f(x) under a probability distribution p(x) is called the expectation of f(x) and will be denoted by E[f].

For a discrete distribution, it is given by

E[f] = Σ_x p(x) f(x)

In the case of continuous variables,

E[f] = ∫ p(x) f(x) dx

One of the most important operations involving probabilities is that of finding weighted averages of functions.
Probability Distribution Statistics: Expectation

In either case, if we are given a finite number N of points drawn from the probability distribution or probability density, then the expectation can be approximated as a finite sum over these points:

E[f] ≈ (1/N) Σ_{n=1}^N f(x_n)

The approximation becomes exact in the limit N → ∞.
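This finite-sample (Monte Carlo) approximation is easy to check numerically; a minimal sketch using f(x) = x² under a standard normal, for which the exact expectation E[x²] = 1:

```python
import random

random.seed(0)
N = 200_000

# Draw N samples x_n from p(x) = N(0, 1) and average f(x_n) = x_n**2.
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
approx = sum(x * x for x in samples) / N   # (1/N) sum_n f(x_n)

# E[x^2] = var + mean^2 = 1 for a standard normal, so approx should be near 1.
print(abs(approx - 1.0) < 0.02)
```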


Probability Distribution Statistics: Variance

The variance of f(x) is defined by

var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²

and provides a measure of how much variability there is in f(x) around its mean value E[f(x)].
Probability Distribution Statistics: Covariance

For two random variables x and y, the covariance is defined by

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[xy] − E[x] E[y]

which expresses the extent to which x and y vary together. If x and y are independent, then their covariance vanishes.
Probability Distribution Statistics: Covariance

In the case of two vectors of random variables x and y, the covariance is a matrix:

cov[x, y] = E_{x,y}[(x − E[x])(y⊤ − E[y⊤])] = E_{x,y}[xy⊤] − E[x] E[y⊤].

Note. In this class, vectors are denoted by lower case bold Roman
letters such as x, and all vectors are assumed to be column vectors.
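The identity cov[x, y] = E[xy⊤] − E[x] E[y⊤] can be verified on sample data, where the expectations become sample averages; a sketch using NumPy, computing both sides directly rather than calling a library covariance routine:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))   # 1000 draws of a 2-dimensional vector x
y = rng.normal(size=(1000, 3))   # 1000 draws of a 3-dimensional vector y

# Left-hand side: average of (x - E[x])(y - E[y])^T, a 2x3 matrix.
xc = x - x.mean(axis=0)
yc = y - y.mean(axis=0)
lhs = xc.T @ yc / len(x)

# Right-hand side: E[x y^T] - E[x] E[y]^T.
rhs = (x.T @ y) / len(x) - np.outer(x.mean(axis=0), y.mean(axis=0))

print(np.allclose(lhs, rhs))     # the two expressions agree exactly (up to rounding)
```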
I.I.D.

We say the r.v.’s in x = (x₁, · · · , x_n)⊤ are independent and identically distributed (i.i.d.) if they are drawn independently from the same distribution.

In this case, the joint density is

p(x) = ∏_{i=1}^n p(x_i)
The Gaussian Distribution

The Gaussian or Normal distribution, for the case of a single real-valued variable x, is defined by

N(x | µ, σ²) = (1 / (2πσ²)^{1/2}) exp(−(x − µ)² / (2σ²))

which is governed by two parameters: µ, called the mean, and σ², called the variance.

The square root of the variance, given by σ, is called the standard deviation, and the reciprocal of the variance, written as β = 1/σ², is called the precision.
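The density can be coded directly from the formula; a minimal sketch using only the standard library (at x = µ the exponent vanishes, so the density is simply the normalizing constant 1/√(2πσ²)):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma2) for a single real-valued x; sigma2 is the variance."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# At the mean, the density equals 1 / sqrt(2*pi*sigma2); here sigma2 = 2.
print(math.isclose(gaussian_pdf(0.0, 0.0, 2.0), 1 / math.sqrt(4 * math.pi)))
```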
The Gaussian Distribution

The Gaussian distribution defined over a D-dimensional vector x of continuous variables is given by

N(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−½ (x − µ)⊤ Σ⁻¹ (x − µ))

where the D-dimensional vector µ is called the mean, the D × D matrix Σ is called the covariance, and |Σ| denotes the determinant of Σ.
MLE for Parameter Estimation

Suppose you obtain i.i.d. samples from a r.v. that you think is normally distributed. How can we determine the normal distribution (i.e., the parameters µ and σ²) these samples likely came from?

→ We can estimate them using Maximum Likelihood Estimation (MLE). Let’s discuss this.
Bayesian probabilities

Roughly speaking:

1. Frequentist Approach: probability is interpreted as the long-run frequency of a ‘repeatable’ event, which led to the notion of confidence intervals.

2. Bayesian Approach: probability is interpreted as a measure of uncertainty, or degree of belief in an event occurring, given the information available.
Maximum Likelihood Estimation

1. A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D|w).

2. This corresponds to choosing the value of w for which the probability of the observed data set is maximized.

3. In other words, choosing the parameters w that make the observed data the most likely.
Maximum Likelihood Estimation

Assume the data we have, D = {x₁, x₂, · · · , x_n}, is independent and identically distributed (i.i.d.).

Therefore, the likelihood of our data given parameters w is

p(D|w) = ∏_{i=1}^n p(x_i|w)

In MLE, our goal is to choose values of our parameters w that maximize the likelihood function p(D|w), i.e.,

w_ML = argmax_w p(D|w)
Maximum Likelihood Estimation

Since log is a monotonically increasing function, the argmax of a function is the same as the argmax of the log of the function. That’s nice because logs make the math simpler.

Therefore, for MLE, we first write the log likelihood function as:

w_ML = argmax_w p(D|w) = argmax_w ∏_{i=1}^n p(x_i|w)
     = argmax_w log( ∏_{i=1}^n p(x_i|w) )
     = argmax_w Σ_{i=1}^n log p(x_i|w)
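The invariance of the argmax under log can be seen numerically: maximizing the likelihood and maximizing the log likelihood over the same grid pick out the same parameter. A sketch for a Bernoulli likelihood with hypothetical counts (7 heads in 10 flips, so the maximizer is w = 0.7):

```python
import math

# Hypothetical Bernoulli data: 7 heads out of 10 flips; w = p(heads).
heads, n = 7, 10
grid = [i / 100 for i in range(1, 100)]   # candidate values of w in (0, 1)

lik = [w ** heads * (1 - w) ** (n - heads) for w in grid]
loglik = [heads * math.log(w) + (n - heads) * math.log(1 - w) for w in grid]

# log is monotonically increasing, so both searches return the same w.
w_lik = grid[lik.index(max(lik))]
w_log = grid[loglik.index(max(loglik))]
print(w_lik == w_log)
```

In practice the log form is preferred because the raw product of many small densities underflows floating point, while the sum of logs does not.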
MLE for Gaussian

Assume the data we have, x = (x₁, x₂, · · · , x_n)⊤, is i.i.d. from a Gaussian distribution whose mean µ and variance σ² are unknown.

→ We would like to determine these parameters from the data set. Let us do this together by MLE:
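Setting the derivatives of the log likelihood Σᵢ log N(xᵢ | µ, σ²) to zero gives the standard closed forms µ_ML = (1/n) Σᵢ xᵢ and σ²_ML = (1/n) Σᵢ (xᵢ − µ_ML)². A numerical sketch (with synthetic data) that computes these estimates and checks that perturbing either parameter does not increase the log likelihood:

```python
import math
import random

random.seed(1)
data = [random.gauss(5.0, 2.0) for _ in range(5000)]  # true mu = 5, sigma^2 = 4

n = len(data)
mu_ml = sum(data) / n                                 # mu_ML: sample mean
var_ml = sum((x - mu_ml) ** 2 for x in data) / n      # sigma^2_ML: 1/n sum of squares

def log_likelihood(mu, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

best = log_likelihood(mu_ml, var_ml)
# The closed-form estimates maximize the log likelihood, so small
# perturbations of either parameter should never do better.
print(all(log_likelihood(mu_ml + dm, var_ml + dv) <= best
          for dm in (-0.1, 0.0, 0.1) for dv in (-0.1, 0.0, 0.1)))
```

Note that σ²_ML divides by n, not n − 1, so it is a (slightly) biased estimate of the true variance.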
MLE for Gaussian (Continued)
Maximum A Posteriori Estimation

MLE is great, but it is not the only way to estimate parameters!

Maximum A Posteriori (MAP) states that we should choose the value for our parameters that is the most likely given the data.

Recall that MLE chooses the value of parameters that makes the data most likely. One of the disadvantages of MLE is that it best explains the data we have seen and makes no attempt to generalize to unseen data.

In MAP, we incorporate prior belief about our parameters, and then we update our posterior belief of the parameters based on the data we have seen.
Maximum A Posteriori Estimation

Assume the data we have, D = {x₁, x₂, · · · , x_n}, is independent and identically distributed (i.i.d.).

In MAP, our goal is to choose values of our parameters w that maximize the posterior distribution p(w|D), i.e.,

w_MAP = argmax_w p(w|D)

Using Bayes’ Theorem:

w_MAP = argmax_w p(w|D) = argmax_w p(D|w) p(w) / p(D)
      = argmax_w ∏_{i=1}^n p(x_i|w) p(w)

(The evidence p(D) does not depend on w, so it can be dropped from the argmax.)
Maximum A Posteriori Estimation

As before, it will be more convenient to find the argmax of the log of the MAP objective, which yields:

w_MAP = argmax_w ( log p(w) + Σ_{i=1}^n log p(x_i|w) )

Note. If you compare with MLE, notice that MAP is the argmax
of the same function plus a term for the log of the prior. We will
discuss some of the implications of this!
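One concrete implication of the extra log-prior term: for a Gaussian mean with known variance σ² and a Gaussian prior N(µ | µ₀, σ₀²), completing the square gives the standard closed form µ_MAP = (σ² µ₀ + σ₀² Σᵢ xᵢ) / (σ² + n σ₀²), which shrinks the MLE toward the prior mean. A sketch with made-up numbers:

```python
import random

random.seed(2)
sigma2 = 1.0               # known data variance (assumed)
mu0, sigma0_2 = 0.0, 0.5   # Gaussian prior on the mean (assumed values)
data = [random.gauss(3.0, 1.0) for _ in range(10)]   # few samples, true mean 3

n = len(data)
mu_mle = sum(data) / n     # MLE: the sample mean

# Standard closed form for the MAP estimate of a Gaussian mean.
mu_map = (sigma2 * mu0 + sigma0_2 * sum(data)) / (sigma2 + n * sigma0_2)

# With few data points, MAP lies between the prior mean mu0 and the MLE.
print(mu0 < mu_map < mu_mle)
```

As n grows, the likelihood term dominates the fixed log-prior term and µ_MAP approaches µ_MLE, so the prior mainly matters in the small-data regime.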
