lecture03c_maximum_likelihood_annotated

The document discusses the concept of Maximum Likelihood Estimation (MLE) in the context of a machine learning course, specifically focusing on its relationship with least-squares problems. It explains how MLE can be derived from a probabilistic model where data is generated with Gaussian noise, and demonstrates that maximizing the log-likelihood is equivalent to minimizing the mean squared error. Additionally, it outlines the properties of MLE, including consistency, asymptotic normality, and efficiency.

Machine Learning Course - CS-433

Maximum Likelihood

Sept 25, 2024

Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
In the previous lecture 3a we arrived at the least-squares problem in the following way: we postulated a particular cost function (the square loss) and then, given data, found the model that minimizes this cost function. In the current lecture we will take an alternative route. The final answer will be the same, but our starting point will be probabilistic. In this way we find a second interpretation of the least-squares problem.

[Figure: left panel shows the data y plotted against the input x; right panel shows a histogram of the prediction errors e_n = y_n - x_n^\top w, centered around zero.]
Gaussian distribution and independence
Recall the definition of a Gaussian random variable y \in \mathbb{R} with mean \mu and variance \sigma^2. It has the density

p(y | \mu, \sigma^2) = \mathcal{N}(y | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(y - \mu)^2}{2\sigma^2}\Big).

In a similar manner, the density of a Gaussian random vector y \in \mathbb{R}^D with mean \mu \in \mathbb{R}^D and covariance \Sigma \in \mathbb{R}^{D \times D} (which must be a positive semi-definite matrix) is

\mathcal{N}(y | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\Big(-\frac{1}{2} (y - \mu)^\top \Sigma^{-1} (y - \mu)\Big).

Also recall that two random variables X and Y are called independent when p(x, y) = p(x) p(y).
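As an illustration (added here, not part of the original notes), a minimal Python sketch of both densities; the function names and example numbers are made up, and the scalar formula is recovered as the D = 1 special case of the vector one.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Density of a scalar Gaussian N(y | mu, sigma^2)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def multivariate_gaussian_pdf(y, mu, Sigma):
    """Density of a Gaussian vector N(y | mu, Sigma) in D dimensions."""
    D = y.shape[0]
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (y - mu)^T Sigma^{-1} (y - mu)
    norm_const = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# The scalar formula is the D = 1 special case of the vector formula.
y, mu, sigma2 = 1.3, 1.0, 0.5
print(gaussian_pdf(y, mu, sigma2))
print(multivariate_gaussian_pdf(np.array([y]), np.array([mu]), np.array([[sigma2]])))
```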
A probabilistic model for least-squares
We assume that our data is generated by the model

y_n = x_n^\top w + \epsilon_n,

where the noise \epsilon_n is a zero-mean Gaussian random variable with variance \sigma^2, the noise added to the various samples is independent across samples, and it is independent of the input. Note that the model w is unknown. Equivalently, for each sample,

p(y_n | x_n, w) = \mathcal{N}(y_n | x_n^\top w, \sigma^2).

Therefore, given N samples, the likelihood of the data vector y = (y_1, \dots, y_N) given the input X (each row is one input) and the model w is, by the independence assumption on the noise, equal to

p(y | X, w) = \prod_{n=1}^{N} p(y_n | x_n, w) = \prod_{n=1}^{N} \mathcal{N}(y_n | x_n^\top w, \sigma^2).

The probabilistic viewpoint is that we should maximize this likelihood over the choice of model w. That is, the "best" model is the one that maximizes this likelihood.
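To make the generative model concrete, here is a small Python sketch (an added illustration; the sizes, noise level, and function name are arbitrary choices) that samples data according to y_n = x_n^\top w + \epsilon_n and evaluates the likelihood as a product of Gaussian densities.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 10, 3, 0.5          # small N so the product of densities stays away from underflow
w_true = rng.normal(size=D)

# Generate data from the assumed model y_n = x_n^T w_true + eps_n, with eps_n ~ N(0, sigma^2).
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

def likelihood(w, X, y, sigma):
    """p(y | X, w) = prod_n N(y_n | x_n^T w, sigma^2), using the noise-independence assumption."""
    densities = np.exp(-(y - X @ w) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return np.prod(densities)

print(likelihood(w_true, X, y, sigma))       # relatively large: the data was generated with w_true
print(likelihood(np.zeros(D), X, y, sigma))  # typically much smaller
```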
Defining cost with log-likelihood
Instead of maximizing the likelihood, we can take the logarithm of the likelihood and maximize it instead. This expression is called the log-likelihood (LL). Taking the log turns the product over samples into a sum, and the log of each Gaussian density contributes -\frac{1}{2\sigma^2}(y_n - x_n^\top w)^2 plus a constant, so

\mathcal{L}_{LL}(w) := \log p(y | X, w) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top w)^2 + \text{cnst}.

Compare the LL, which we maximize, to the MSE (mean squared error), which we minimize:

\max_w \; \mathcal{L}_{LL}(w) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top w)^2 + \text{cnst} \qquad \text{(the constant is independent of w)}

\min_w \; \mathcal{L}_{MSE}(w) = \frac{1}{2N} \sum_{n=1}^{N} (y_n - x_n^\top w)^2

so \arg\max_w \mathcal{L}_{LL}(w) = \arg\min_w \mathcal{L}_{MSE}(w).
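A quick numerical check of this derivation (an added sketch, not from the original notes; it assumes NumPy/SciPy and made-up problem sizes): the log of the product of Gaussian densities matches the closed form up to floating-point error.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
N, D, sigma = 20, 2, 0.7
w = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w + rng.normal(scale=sigma, size=N)

# Log-likelihood computed directly as the log of the product of Gaussian densities.
ll_direct = np.sum(norm.logpdf(y, loc=X @ w, scale=sigma))

# Same quantity via the closed form -1/(2 sigma^2) sum_n (y_n - x_n^T w)^2 + cnst,
# where cnst = -N/2 * log(2 pi sigma^2) does not depend on w.
cnst = -0.5 * N * np.log(2 * np.pi * sigma**2)
ll_closed = -0.5 * np.sum((y - X @ w) ** 2) / sigma**2 + cnst

print(np.isclose(ll_direct, ll_closed))  # True
```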
Maximum-likelihood estimator (MLE)
It is clear that maximizing the LL is equivalent to minimizing the MSE:

\arg\min_w \mathcal{L}_{MSE}(w) = \arg\max_w \mathcal{L}_{LL}(w).

This gives us another way to design cost functions.
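The equivalence can also be checked numerically. The sketch below (an added illustration with arbitrary problem sizes) fits the least-squares solution and verifies that any perturbation of it has both a larger MSE and a smaller log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, sigma = 200, 3, 0.5
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

def mse(w):
    return np.mean((y - X @ w) ** 2) / 2

def log_likelihood(w):
    r = y - X @ w
    return -0.5 * np.sum(r**2) / sigma**2 - 0.5 * N * np.log(2 * np.pi * sigma**2)

# Least-squares solution (minimizer of the MSE).
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Any perturbation of w_ls has a larger MSE and a smaller log-likelihood.
w_other = w_ls + 0.1 * rng.normal(size=D)
assert mse(w_ls) < mse(w_other)
assert log_likelihood(w_ls) > log_likelihood(w_other)
```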

MLE can also be interpreted as finding the model w under which the observed data is most likely to have been generated (probabilistically). This interpretation has some advantages that we discuss now.

Properties of MLE
MLE is a sample approximation to the expected log-likelihood:

\frac{1}{N} \mathcal{L}_{LL}(w) \approx \mathbb{E}_{p(y,x)}\big[\log p(y | x, w)\big]

MLE is consistent, i.e., it will give us the correct model assuming that we have a sufficient amount of data (this can be proven under some weak conditions):

w_{MLE} \to w_{true} \quad \text{in probability}

The MLE is asymptotically normal (optional), i.e.,

\sqrt{N}\,(w_{MLE} - w_{true}) \;\to\; \mathcal{N}\big(w_{MLE} \,|\, 0,\; F^{-1}(w_{true})\big) \quad \text{in distribution},

where F(w) = \mathbb{E}_{p(y)}\big[-\frac{\partial^2 \mathcal{L}_{LL}(w)}{\partial w \, \partial w^\top}\big] is the Fisher information.

MLE is efficient, i.e., it achieves the Cramér-Rao lower bound:

\text{Covariance}(w_{MLE}) = F^{-1}(w_{true})
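Consistency can be illustrated empirically. The following sketch (added here; under the Gaussian noise model the MLE coincides with the least-squares fit, and the sizes are arbitrary) shows the distance between w_MLE and w_true shrinking as N grows.

```python
import numpy as np

rng = np.random.default_rng(2)
D, sigma = 3, 0.5
w_true = rng.normal(size=D)

# Consistency, empirically: the MLE (= least-squares fit under the Gaussian
# noise model) gets closer to w_true as the number of samples N grows.
for N in [10, 100, 1000, 10000]:
    X = rng.normal(size=(N, D))
    y = X @ w_true + rng.normal(scale=sigma, size=N)
    w_mle = np.linalg.lstsq(X, y, rcond=None)[0]
    print(N, np.linalg.norm(w_mle - w_true))
```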
Another example
We can replace the Gaussian distribution by a Laplace distribution:

p(y_n | x_n, w) = \frac{1}{2b} \exp\Big(-\frac{1}{b} |y_n - x_n^\top w|\Big).

Maximizing the corresponding log-likelihood is then equivalent to minimizing the MAE (mean absolute error) \frac{1}{N} \sum_{n=1}^{N} |y_n - x_n^\top w|.
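A short sketch (an added illustration with made-up sizes and scale b) confirming that the Laplace negative log-likelihood is the sum of absolute errors scaled by 1/b plus a constant, so its minimizer is the same as that of the MAE.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, b = 50, 3, 1.0
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + rng.laplace(scale=b, size=N)

def laplace_neg_log_likelihood(w):
    """-log prod_n (1/(2b)) exp(-|y_n - x_n^T w| / b)."""
    return np.sum(np.abs(y - X @ w)) / b + N * np.log(2 * b)

def mae(w):
    return np.mean(np.abs(y - X @ w))

# Up to the constant N*log(2b) and the scaling N/b, the two costs agree,
# so minimizing one minimizes the other.
w = rng.normal(size=D)
print(laplace_neg_log_likelihood(w))
print(N / b * mae(w) + N * np.log(2 * b))
```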
