
Machine Learning Course - CS-433

Maximum Likelihood

Sept 25, 2024

Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
In the previous lecture 3a we arrived at the least-squares problem in the following way: we postulated a particular cost function (the square loss) and then, given data, found the model that minimizes this cost function. In the current lecture we will take an alternative route. The final answer will be the same, but our starting point will be probabilistic. In this way we find a second interpretation of the least-squares problem.
[Figure: left panel, the data y plotted against the input x; right panel, the error in prediction.]
Gaussian distribution and independence
Recall the definition of a Gaussian random variable in R with mean µ and variance σ². It has the density
$$
p(y \mid \mu, \sigma^2) = \mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right).
$$
In a similar manner, the density of a Gaussian random vector with mean µ and covariance Σ (which must be a positive semi-definite matrix) is
$$
\mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D \det(\boldsymbol{\Sigma})}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{y}-\boldsymbol{\mu})\right).
$$
Also recall that two random variables X and Y are called independent when p(x, y) = p(x) p(y).
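As a quick illustration (not part of the original lecture), here is a minimal NumPy/SciPy sketch that evaluates both densities directly from the formulas above and checks them against scipy.stats; the particular values of µ, σ², Σ and y are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate Gaussian density N(y | mu, sigma^2)
mu, sigma2 = 0.0, 4.0
y = 1.5
p_manual = np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
p_scipy = norm.pdf(y, loc=mu, scale=np.sqrt(sigma2))  # scale is the std dev, not the variance
print(p_manual, p_scipy)  # both ~0.1506

# Multivariate Gaussian density N(y | mu, Sigma) in D = 2 dimensions
mu_vec = np.zeros(2)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite covariance
y_vec = np.array([1.0, -0.5])
D = len(mu_vec)
diff = y_vec - mu_vec
p_manual_mv = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / np.sqrt(
    (2 * np.pi) ** D * np.linalg.det(Sigma)
)
p_scipy_mv = multivariate_normal.pdf(y_vec, mean=mu_vec, cov=Sigma)
print(p_manual_mv, p_scipy_mv)  # identical up to floating-point error
```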
A probabilistic model for least-squares
We assume that our data is generated by the model
$$
y_n = \mathbf{x}_n^\top \mathbf{w} + \epsilon_n,
$$
where $\epsilon_n$ (the noise) is a zero-mean Gaussian random variable with variance σ², and the noise added to the various samples is independent across samples and independent of the input. Note that the model w is unknown.

Therefore, given N samples, the likelihood of the data vector $\mathbf{y} = (y_1, \ldots, y_N)$ given the input X (each row is one input) and the model w is equal to
$$
p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\!\left(y_n \mid \mathbf{x}_n^\top \mathbf{w}, \sigma^2\right).
$$

The probabilistic viewpoint is that we should maximize this likelihood over the choice of model w. That is, the "best" model is the one that maximizes this likelihood.
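To make the model concrete, the following sketch (ours, for illustration only) generates synthetic data according to $y_n = \mathbf{x}_n^\top \mathbf{w} + \epsilon_n$ and evaluates the likelihood as the product of independent Gaussian densities; the chosen w_true, σ and N are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic data following the assumed model y_n = x_n^T w + eps_n
N, D = 50, 3
sigma = 1.0
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, D))                       # each row is one input x_n
y = X @ w_true + sigma * rng.normal(size=N)       # i.i.d. zero-mean Gaussian noise

def likelihood(w, X, y, sigma):
    """p(y | X, w): product of independent Gaussian densities N(y_n | x_n^T w, sigma^2)."""
    return np.prod(norm.pdf(y, loc=X @ w, scale=sigma))

def log_likelihood(w, X, y, sigma):
    """log p(y | X, w), numerically safer than taking the product first."""
    return np.sum(norm.logpdf(y, loc=X @ w, scale=sigma))

print(likelihood(w_true, X, y, sigma))        # tiny number: a product of N densities
print(log_likelihood(w_true, X, y, sigma))    # the log keeps it in a workable range
```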
Defining cost with log-likelihood
Instead of maximizing the likelihood, we can take the logarithm of the likelihood and maximize it instead. This expression is called the log-likelihood (LL):
$$
\mathcal{L}_{LL}(\mathbf{w}) := \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(y_n - \mathbf{x}_n^\top \mathbf{w}\right)^2 + \text{cnst}.
$$

Compare the LL to the MSE (mean squared error):
$$
\mathcal{L}_{LL}(\mathbf{w}) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(y_n - \mathbf{x}_n^\top \mathbf{w}\right)^2 + \text{cnst},
\qquad
\mathcal{L}_{MSE}(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left(y_n - \mathbf{x}_n^\top \mathbf{w}\right)^2.
$$
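The relationship between the two cost functions can be checked numerically: for every w, $\mathcal{L}_{LL}(\mathbf{w})$ equals $-(N/\sigma^2)\,\mathcal{L}_{MSE}(\mathbf{w})$ plus a constant that does not depend on w. The sketch below (our illustration, on arbitrary synthetic data) verifies this.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, sigma = 100, 2, 0.5
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(N, D))
y = X @ w_true + sigma * rng.normal(size=N)

def L_LL(w):
    res = y - X @ w
    # Gaussian log-likelihood, including its constant term
    return -np.sum(res ** 2) / (2 * sigma ** 2) - N / 2 * np.log(2 * np.pi * sigma ** 2)

def L_MSE(w):
    res = y - X @ w
    return np.sum(res ** 2) / (2 * N)

# For any w, L_LL(w) = -(N / sigma^2) * L_MSE(w) + cnst, so the sum below is constant
for w in [w_true, np.zeros(D), rng.normal(size=D)]:
    print(L_LL(w) + (N / sigma ** 2) * L_MSE(w))   # same value for every w
```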
Maximum-likelihood estimator (MLE)
It is clear that maximizing the LL is equivalent to minimizing the MSE:
$$
\arg\min_{\mathbf{w}} \mathcal{L}_{MSE}(\mathbf{w}) = \arg\max_{\mathbf{w}} \mathcal{L}_{LL}(\mathbf{w}).
$$
This gives us another way to design cost functions.

MLE can also be interpreted as finding the model under which the observed data is most likely to have been generated (probabilistically). This interpretation has some advantages that we discuss now.
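A minimal sketch of this equivalence, again on synthetic data of our choosing: the least-squares solution returned by np.linalg.lstsq also maximizes the Gaussian log-likelihood, so it is the MLE under this model.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, sigma = 200, 3, 0.3
w_true = np.array([0.5, 1.5, -1.0])
X = rng.normal(size=(N, D))
y = X @ w_true + sigma * rng.normal(size=N)

# Minimizing the MSE (ordinary least squares) ...
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# ... is the same as maximizing the Gaussian log-likelihood:
def log_lik(w):
    res = y - X @ w
    return -np.sum(res ** 2) / (2 * sigma ** 2) - N / 2 * np.log(2 * np.pi * sigma ** 2)

print(w_mle)                                   # close to w_true
for _ in range(3):                             # any other w has a lower log-likelihood
    w_other = w_mle + 0.1 * rng.normal(size=D)
    print(log_lik(w_other) <= log_lik(w_mle))  # always True
```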
Properties of MLE
MLE is a sample approximation to the expected log-likelihood:
$$
\frac{1}{N}\,\mathcal{L}_{LL}(\mathbf{w}) \approx \mathbb{E}_{p(y,\mathbf{x})}\left[\log p(y \mid \mathbf{x}, \mathbf{w})\right].
$$

MLE is consistent, i.e., it will give us the correct model assuming that we have a sufficient amount of data (this can be proven under some weak conditions):
$$
\mathbf{w}_{MLE} \;\xrightarrow{\;p\;}\; \mathbf{w}_{true} \quad \text{(convergence in probability)}.
$$
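Consistency can be illustrated (though of course not proven) empirically: as N grows, the MLE, which under the Gaussian model is the least-squares solution, approaches w_true. A minimal sketch, with arbitrary choices of D, σ and w_true:

```python
import numpy as np

rng = np.random.default_rng(3)
D, sigma = 3, 1.0
w_true = np.array([1.0, -0.5, 2.0])

# As N grows, the MLE (here the least-squares solution) approaches w_true
for N in [10, 100, 1_000, 10_000, 100_000]:
    X = rng.normal(size=(N, D))
    y = X @ w_true + sigma * rng.normal(size=N)
    w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(N, np.linalg.norm(w_mle - w_true))   # error shrinks roughly like 1/sqrt(N)
```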

The MLE is asymptotically normal, i.e.,
$$
(\mathbf{w}_{MLE} - \mathbf{w}_{true}) \;\xrightarrow{\;d\;}\; \frac{1}{\sqrt{N}}\, \mathcal{N}\!\left(\mathbf{w}_{MLE} \mid \mathbf{0},\, \mathbf{F}^{-1}(\mathbf{w}_{true})\right),
$$
where
$$
\mathbf{F}(\mathbf{w}) = -\mathbb{E}_{p(y)}\!\left[\frac{\partial^2 \mathcal{L}}{\partial \mathbf{w}\, \partial \mathbf{w}^\top}\right]
$$
is the Fisher information.

MLE is efficient, i.e., it achieves the Cramér-Rao lower bound:
$$
\operatorname{Covariance}(\mathbf{w}_{MLE}) = \mathbf{F}^{-1}(\mathbf{w}_{true}).
$$
Another example
We can replace the Gaussian distribution by a Laplace distribution:
$$
p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \frac{1}{2b} \exp\!\left(-\frac{1}{b}\left|y_n - \mathbf{x}_n^\top \mathbf{w}\right|\right).
$$
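Under this Laplace model the log-likelihood becomes $-\frac{1}{b}\sum_n \left|y_n - \mathbf{x}_n^\top \mathbf{w}\right| - N \log(2b)$, so maximizing it amounts to minimizing the mean absolute error rather than the squared error. A small sketch of this (our illustration; scipy.optimize.minimize is just one convenient way to handle the non-smooth objective):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, D, b = 200, 2, 0.5
w_true = np.array([1.0, -1.5])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.laplace(scale=b, size=N)   # Laplace noise instead of Gaussian

# Under the Laplace model, log p(y | X, w) = -(1/b) * sum |y_n - x_n^T w| - N log(2b),
# so maximizing the likelihood is the same as minimizing the mean absolute error.
def mae(w):
    return np.mean(np.abs(y - X @ w))

w_mae = minimize(mae, x0=np.zeros(D), method="Nelder-Mead").x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)    # Gaussian MLE on the same data, for comparison
print(w_mae)   # Laplace MLE: close to w_true, less sensitive to outliers
print(w_ls)    # least-squares fit
```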
