lecture03c_maximum_likelihood_annotated

The document discusses the concept of Maximum Likelihood Estimation (MLE) in the context of a machine learning course, specifically focusing on its relationship with least-squares problems. It explains how MLE can be derived from a probabilistic model where data is generated with Gaussian noise, and demonstrates that maximizing the log-likelihood is equivalent to minimizing the mean squared error. Additionally, it outlines the properties of MLE, including consistency, asymptotic normality, and efficiency.


Machine Learning Course - CS-433

Maximum Likelihood

Sept 25, 2024

Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
In the previous lecture 3a we arrived at the least-squares problem in the following way: we postulated a particular cost function (the square loss) and then, given data, found the model that minimizes this cost function. In the current lecture we will take an alternative route. The final answer will be the same, but our starting point will be probabilistic. In this way we find a second interpretation of the least-squares problem.

[Figure: left, the data y plotted against the input x together with a linear fit and the prediction errors e_n = y_n − x_n^⊤w; right, a histogram of these errors in prediction.]
Gaussian distribution and independence
Recall the definition of a Gaussian random variable y ∈ R with mean µ and variance σ². It has the density

p(y \mid \mu, \sigma^2) = \mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y-\mu)^2}{2\sigma^2} \right).

In a similar manner, the density of a Gaussian random vector y ∈ R^D with mean µ ∈ R^D and covariance Σ ∈ R^{D×D} (which must be a positive semi-definite matrix) is

\mathcal{N}(y \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (y-\mu)^\top \Sigma^{-1} (y-\mu) \right).

Also recall that two random variables X and Y are called independent when p(x, y) = p(x) p(y).
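As a quick sanity check of these two formulas, the following minimal Python sketch evaluates both densities directly with NumPy and compares the results against scipy.stats; the particular values of µ, σ² and Σ below are arbitrary example choices.

import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_pdf(y, mu, sigma2):
    # univariate N(y | mu, sigma^2)
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gaussian_vector_pdf(y, mu, Sigma):
    # multivariate N(y | mu, Sigma) for y, mu in R^D
    D = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (y - mu)^T Sigma^{-1} (y - mu)
    norm_const = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# arbitrary example values, chosen only for illustration
mu, sigma2, y = 1.0, 0.5, 1.3
print(gaussian_pdf(y, mu, sigma2), norm.pdf(y, loc=mu, scale=np.sqrt(sigma2)))

mu_vec = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
y_vec = np.array([0.5, 0.8])
print(gaussian_vector_pdf(y_vec, mu_vec, Sigma),
      multivariate_normal.pdf(y_vec, mean=mu_vec, cov=Sigma))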
A probabilistic model for least-squares
We assume that our data is generated by the model

y_n = x_n^\top w + \varepsilon_n,

where the noise ε_n is a zero-mean Gaussian random variable with variance σ², i.e. p(\varepsilon_n) = \mathcal{N}(\varepsilon_n \mid 0, \sigma^2), the noise added to the various samples is independent across samples and independent of the input. Note that the model w is unknown. Under this assumption,

p(y_n \mid x_n, w) = \mathcal{N}(y_n \mid x_n^\top w, \sigma^2).

Therefore, given N samples, the likelihood of the data vector y = (y_1, \dots, y_N) given the input X (each row is one input) and the model w is, by the independence assumption on the noise, equal to

p(y \mid X, w) = \prod_{n=1}^{N} p(y_n \mid x_n, w) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^\top w, \sigma^2).

The probabilistic viewpoint is that we should maximize this likelihood over the choice of model w. I.e., the "best" model is the one that maximizes this likelihood.
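To make this generative model concrete, here is a minimal Python sketch that samples data from y_n = x_n^⊤w + ε_n and evaluates the likelihood p(y | X, w) for two candidate models; the problem size, the true w and the noise level σ are arbitrary choices made only for illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# arbitrary problem size and "true" parameters, chosen only for illustration
N, D = 100, 3
sigma = 2.0
w_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(N, D))              # each row is one input x_n
eps = rng.normal(scale=sigma, size=N)    # i.i.d. zero-mean Gaussian noise
y = X @ w_true + eps                     # y_n = x_n^T w + eps_n

def likelihood(w, X, y, sigma):
    # p(y | X, w) = prod_n N(y_n | x_n^T w, sigma^2)
    return np.prod(norm.pdf(y, loc=X @ w, scale=sigma))

print(likelihood(w_true, X, y, sigma))       # likelihood at the true model
print(likelihood(np.zeros(D), X, y, sigma))  # much smaller for a poor model

Note how the product of N densities becomes extremely small; for larger N it underflows, which is one practical reason to work with the logarithm of the likelihood, as done in the next section.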
Defining cost with log-likelihood


Instead of maximizing the likelihood, we can take the logarithm of the likelihood and maximize it instead. The resulting expression is called the log-likelihood (LL):

\mathcal{L}_{LL}(w) := \log p(y \mid X, w) = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top w, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top w)^2 + \text{cnst},

where the constant is independent of w.

Compare the LL (to be maximized) to the MSE (mean squared error, to be minimized):

\mathcal{L}_{LL}(w) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top w)^2 + \text{cnst}

\mathcal{L}_{MSE}(w) = \frac{1}{2N} \sum_{n=1}^{N} (y_n - x_n^\top w)^2

From this comparison it is already apparent that the maximizer of the LL coincides with the minimizer of the MSE.
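The relation between the two costs can also be checked numerically. In the following minimal sketch (with arbitrary synthetic data, as before), L_LL(w) + (N/σ²)·L_MSE(w) evaluates to the same constant for every w, confirming that the two costs differ only by a sign, a scaling and a constant.

import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 100, 3, 2.0
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

def log_likelihood(w):
    resid = y - X @ w
    # sum_n log N(y_n | x_n^T w, sigma^2)
    return -0.5 * np.sum(resid**2) / sigma**2 - 0.5 * N * np.log(2 * np.pi * sigma**2)

def mse(w):
    resid = y - X @ w
    return 0.5 * np.mean(resid**2)

# L_LL(w) + (N / sigma^2) * L_MSE(w) is the same constant for every w
for w in [w_true, np.zeros(D), rng.normal(size=D)]:
    print(log_likelihood(w) + (N / sigma**2) * mse(w))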
Maximum-likelihood estimator (MLE)
It is clear that maximizing the LL is equivalent to minimizing the MSE:

\arg\min_w \mathcal{L}_{MSE}(w) = \arg\max_w \mathcal{L}_{LL}(w).

This gives us another way to design cost functions.

MLE can also be interpreted as finding the model under which the observed data is most likely to have been generated (probabilistically). This interpretation has some advantages that we discuss now.
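As a minimal sketch of this equivalence in code (again with arbitrary synthetic data), numerically maximizing the log-likelihood recovers the same w as the closed-form least-squares solution.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, D, sigma = 100, 3, 2.0
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

def neg_log_likelihood(w):
    resid = y - X @ w
    return 0.5 * np.sum(resid**2) / sigma**2 + 0.5 * N * np.log(2 * np.pi * sigma**2)

# MLE: maximize the LL, i.e. minimize the negative LL
w_mle = minimize(neg_log_likelihood, x0=np.zeros(D)).x

# least squares: minimize the MSE in closed form
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_mle)   # the two estimates agree up to numerical precision
print(w_ls)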
Properties of MLE
MLE is a sample approximation to the expected log-likelihood:

\frac{1}{N}\mathcal{L}_{LL}(w) = \frac{1}{N}\sum_{n=1}^{N} \log p(y_n \mid x_n, w) \;\approx\; \mathbb{E}_{p(y,x)}\left[ \log p(y \mid x, w) \right].

MLE is consistent, i.e., it will give us the correct model assuming that we have a sufficient amount of data (this can be proven under some weak conditions):

w_{\text{MLE}} \;\rightarrow_p\; w_{\text{true}} \quad \text{(convergence in probability)}.

The MLE is asymptotically normal (optional), i.e.,

(w_{\text{MLE}} - w_{\text{true}}) \;\rightarrow_d\; \mathcal{N}\!\left( w_{\text{MLE}} \,\middle|\, 0, \tfrac{1}{N} F^{-1}(w_{\text{true}}) \right),

where F(w) = \mathbb{E}_{p(y)}\!\left[ \frac{\partial^2 \mathcal{L}}{\partial w \, \partial w^\top} \right] is the Fisher information.

MLE is efficient, i.e. it achieves the Cramér-Rao lower bound:

\text{Covariance}(w_{\text{MLE}}) = F^{-1}(w_{\text{true}}).
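The consistency property can be made visible with a small simulation; the generating model and sample sizes below are arbitrary choices. Since the MLE under Gaussian noise is the least-squares solution, its distance to w_true should shrink as N grows.

import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
sigma = 2.0

for N in [10, 100, 1_000, 10_000, 100_000]:
    X = rng.normal(size=(N, len(w_true)))
    y = X @ w_true + rng.normal(scale=sigma, size=N)
    w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)   # = MLE under Gaussian noise
    print(N, np.linalg.norm(w_mle - w_true))        # error shrinks roughly like 1/sqrt(N)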
Another example
We can replace the Gaussian distribution by a Laplace distribution:

p(y_n \mid x_n, w) = \frac{1}{2b} \exp\left( -\frac{1}{b} \left| y_n - x_n^\top w \right| \right).

The corresponding log-likelihood is, up to a constant and a scaling factor, the negative of the MAE (mean absolute error) \frac{1}{N}\sum_{n=1}^{N} |y_n - x_n^\top w|, so maximizing it amounts to minimizing the MAE.
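A minimal sketch (with arbitrary synthetic data) of why Laplace noise leads to the MAE: the negative log-likelihood differs from (N/b) times the MAE only by a constant, so both have the same minimizer.

import numpy as np

rng = np.random.default_rng(0)
N, D, b = 100, 3, 1.5
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.laplace(scale=b, size=N)   # Laplace noise instead of Gaussian

def neg_log_likelihood_laplace(w):
    # -sum_n log( 1/(2b) * exp(-|y_n - x_n^T w| / b) )
    return np.sum(np.abs(y - X @ w)) / b + N * np.log(2 * b)

def mae(w):
    return np.mean(np.abs(y - X @ w))

# the difference is the same constant for every w, so argmin NLL = argmin MAE
for w in [w_true, np.zeros(D), rng.normal(size=D)]:
    print(neg_log_likelihood_laplace(w) - (N / b) * mae(w))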

You might also like