
Machine Learning Course - CS-433

Maximum Likelihood

Sept 25, 2024

Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
In the previous lecture 3a we arrived at the least-squares
problem in the following way: we postulated a particular
cost function (the square loss) and then, given data, found the
model that minimizes this cost function. In the current lecture
we will take an alternative route. The final answer will
be the same, but our starting point will be probabilistic. In
this way we find a second interpretation of the least-squares
problem.
[Figure: two panels. Left: the data, y plotted against the input x. Right: a histogram of the error in prediction.]
Gaussian distribution and independence
Recall the definition of a Gaussian random variable in R with mean µ
and variance σ². It has a density of

\[
p(y \mid \mu, \sigma^2) = \mathcal{N}(y \mid \mu, \sigma^2)
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right).
\]

In a similar manner, the density of a Gaussian random vector in R^D with mean
µ and covariance Σ (which must be a positive semi-definite matrix) is

\[
\mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \frac{1}{\sqrt{(2\pi)^D \det(\boldsymbol{\Sigma})}}
\exp\left(-\frac{1}{2} (\mathbf{y}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{y}-\boldsymbol{\mu})\right).
\]
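As a quick sanity check (not spelled out in the notes): for D = 1, setting Σ = σ² recovers the scalar density above,

\[
\mathcal{N}(y \mid \mu, \sigma^2)
= \frac{1}{\sqrt{2\pi\,\sigma^2}}
\exp\left(-\frac{1}{2}\,\frac{(y-\mu)^2}{\sigma^2}\right).
\]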

Also recall that two random vari-


ables X and Y are called indepen-
dent when p(x, y) = p(x)p(y).
A probabilistic model for least-squares
We assume that our data is generated by the model

\[
y_n = \mathbf{x}_n^\top \mathbf{w} + \epsilon_n,
\]

where ε_n (the noise) is a zero-mean Gaussian random variable
with variance σ². The noise terms added to the different samples are
independent of each other and independent of the input. Note that
the model w is unknown.
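To make the generative assumption concrete, here is a minimal sketch (Python/NumPy; the particular values of N, D, w_true and sigma are illustrative assumptions, not from the lecture) that draws samples from this model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from the lecture): sample size, input dimension,
# a "true" model vector w, and the noise standard deviation sigma.
N, D = 100, 3
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3

# Each row of X is one input x_n; y_n = x_n^T w + eps_n with i.i.d. Gaussian noise.
X = rng.normal(size=(N, D))
eps = rng.normal(loc=0.0, scale=sigma, size=N)
y = X @ w_true + eps
```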
Therefore, given N samples, the likelihood of the data vector
y = (y_1, ..., y_N) given the input X (each row is one input) and the
model w is equal to

\[
p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})
= \prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \mathbf{w})
= \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{x}_n^\top \mathbf{w}, \sigma^2\big).
\]

The probabilistic viewpoint is that we should maximize this likelihood
over the choice of model w. That is, the "best" model is the one that
maximizes this likelihood.
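To make this concrete, here is a minimal sketch (Python/NumPy; the data X, y and the values of w and sigma are illustrative assumptions, not from the lecture) of evaluating the likelihood as a product of per-sample Gaussian densities:

```python
import numpy as np

def gaussian_pdf(y, mean, sigma):
    # Density N(y | mean, sigma^2), applied elementwise.
    return np.exp(-(y - mean) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def likelihood(y, X, w, sigma):
    # p(y | X, w) = prod_n N(y_n | x_n^T w, sigma^2)
    return np.prod(gaussian_pdf(y, X @ w, sigma))

# Tiny illustrative example (values are assumptions, not from the lecture).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
w = np.array([1.0, -1.0])
y = X @ w + rng.normal(scale=0.1, size=5)
print(likelihood(y, X, w, sigma=0.1))
```

In practice this product underflows to zero for even moderately large N, which is one more reason to work with the log-likelihood introduced next.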
Defining cost with log-likelihood
Instead of maximizing the likelihood, we can take the logarithm of
the likelihood and maximize that instead. The resulting expression is called the
log-likelihood (LL):

\[
\mathcal{L}_{LL}(\mathbf{w}) := \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})
= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - \mathbf{x}_n^\top \mathbf{w}\big)^2 + \text{cnst}.
\]

Compare the LL to the MSE (mean squared error):

\[
\mathcal{L}_{LL}(\mathbf{w}) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - \mathbf{x}_n^\top \mathbf{w}\big)^2 + \text{cnst},
\qquad
\mathcal{L}_{MSE}(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \big(y_n - \mathbf{x}_n^\top \mathbf{w}\big)^2.
\]
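For completeness, the LL above follows directly from the Gaussian density and the factorization of the likelihood; a short derivation, using only the definitions given earlier, is

\[
\begin{aligned}
\mathcal{L}_{LL}(\mathbf{w})
&= \log \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{x}_n^\top \mathbf{w}, \sigma^2\big)
 = \sum_{n=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}}
   \exp\left(-\frac{(y_n - \mathbf{x}_n^\top \mathbf{w})^2}{2\sigma^2}\right) \right] \\
&= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - \mathbf{x}_n^\top \mathbf{w}\big)^2
   - \frac{N}{2}\log\big(2\pi\sigma^2\big),
\end{aligned}
\]

where the last term does not depend on w and is the constant "cnst" above.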
Maximum-likelihood estimator (MLE)
It is clear that maximizing the LL is equivalent to minimizing the MSE:

\[
\arg\min_{\mathbf{w}} \mathcal{L}_{MSE}(\mathbf{w}) = \arg\max_{\mathbf{w}} \mathcal{L}_{LL}(\mathbf{w}).
\]

This gives us another way to design cost functions.
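As a quick numerical check of this equivalence, here is a minimal sketch (Python/NumPy; the data-generating choices are illustrative assumptions, not from the lecture) that fits w by ordinary least squares and confirms that the same w minimizes the MSE and maximizes the LL:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, sigma = 200, 3, 0.5
w_true = np.array([2.0, -1.0, 0.3])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

def mse(w):
    # L_MSE(w) = (1/2N) * sum_n (y_n - x_n^T w)^2
    return 0.5 * np.mean((y - X @ w) ** 2)

def log_likelihood(w):
    # L_LL(w) = -(1/(2 sigma^2)) * sum_n (y_n - x_n^T w)^2 + cnst
    return -0.5 / sigma**2 * np.sum((y - X @ w) ** 2) \
           - 0.5 * N * np.log(2 * np.pi * sigma**2)

# The least-squares solution minimizes the MSE ...
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# ... and any perturbation of it has higher MSE and lower LL.
w_perturbed = w_ls + 0.1 * rng.normal(size=D)
assert mse(w_ls) <= mse(w_perturbed)
assert log_likelihood(w_ls) >= log_likelihood(w_perturbed)
```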

MLE can also be interpreted as find-


ing the model under which the ob-
served data is most likely to have
been generated from (probabilisti-
cally). This interpretation has some
advantages that we discuss now.
Properties of MLE
The LL is (up to normalization by N) a sample approximation to the
expected log-likelihood:

\[
\frac{1}{N}\,\mathcal{L}_{LL}(\mathbf{w}) \approx \mathbb{E}_{p(y,\mathbf{x})}\big[\log p(y \mid \mathbf{x}, \mathbf{w})\big].
\]

MLE is consistent, i.e., it will give us the correct model assuming that
we have a sufficient amount of data (this can be proven under some weak
conditions):

\[
\mathbf{w}_{\text{MLE}} \;\longrightarrow\; \mathbf{w}_{\text{true}} \quad \text{in probability.}
\]
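Consistency can also be observed empirically; in the sketch below (Python/NumPy, with illustrative data-generating choices that are not from the lecture), the MLE under the Gaussian model, i.e. the least-squares fit, gets closer to w_true as N grows:

```python
import numpy as np

rng = np.random.default_rng(2)
D, sigma = 3, 1.0
w_true = np.array([1.0, -0.5, 2.0])

for N in [10, 100, 1000, 10000]:
    X = rng.normal(size=(N, D))
    y = X @ w_true + rng.normal(scale=sigma, size=N)
    # Under Gaussian noise the MLE coincides with the least-squares solution.
    w_mle = np.linalg.lstsq(X, y, rcond=None)[0]
    print(N, np.linalg.norm(w_mle - w_true))  # error shrinks roughly like 1/sqrt(N)
```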

The MLE is asymptotically normal, i.e.,

\[
(\mathbf{w}_{\text{MLE}} - \mathbf{w}_{\text{true}})
\;\xrightarrow{d}\;
\frac{1}{\sqrt{N}}\, \mathcal{N}\big(\mathbf{w}_{\text{MLE}} \mid \mathbf{0}, \mathbf{F}^{-1}(\mathbf{w}_{\text{true}})\big),
\]

where

\[
\mathbf{F}(\mathbf{w}) = -\mathbb{E}_{p(y)}\!\left[\frac{\partial^2 \mathcal{L}}{\partial \mathbf{w}\, \partial \mathbf{w}^\top}\right]
\]

is the Fisher information.

MLE is efficient, i.e. it achieves the Cramér-Rao lower bound:

\[
\operatorname{Covariance}(\mathbf{w}_{\text{MLE}}) = \mathbf{F}^{-1}(\mathbf{w}_{\text{true}}).
\]
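As a concrete illustration (not worked out in these notes), the per-sample Fisher information can be computed in closed form for the Gaussian model above: with log p(y | x, w) = −(y − x⊤w)²/(2σ²) + cnst,

\[
\frac{\partial^2}{\partial \mathbf{w}\,\partial \mathbf{w}^\top} \log p(y \mid \mathbf{x}, \mathbf{w})
= -\frac{\mathbf{x}\mathbf{x}^\top}{\sigma^2}
\qquad\Longrightarrow\qquad
\mathbf{F}(\mathbf{w}) = \frac{\mathbb{E}\big[\mathbf{x}\mathbf{x}^\top\big]}{\sigma^2},
\]

so a larger noise variance σ² means less information per sample and, correspondingly, a larger asymptotic covariance F⁻¹ for the MLE.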
Another example
We can replace the Gaussian distribution by a Laplace distribution:

\[
p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \frac{1}{2b} \exp\left(-\frac{1}{b}\,\big|y_n - \mathbf{x}_n^\top \mathbf{w}\big|\right).
\]
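Following the same steps as in the Gaussian case (a short derivation, using only the density above), the log-likelihood under this Laplace model becomes

\[
\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})
= -\frac{1}{b} \sum_{n=1}^{N} \big|y_n - \mathbf{x}_n^\top \mathbf{w}\big| - N \log(2b),
\]

so maximizing the likelihood now corresponds to minimizing the mean absolute error (MAE) instead of the MSE.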
