Maximum Likelihood
Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
In the previous lecture 3a we arrived at the least-squares
cost function (square loss) and then, given data, found the
model that minimizes this cost function. In the current lecture
we will take an alternative route. The final answer will
be the same, but our starting point will be probabilistic. In
this way we find a second interpretation of the least-squares
problem.
[Figure: the data $y_n$ plotted against $x_n$ (left); histogram of the prediction errors $e_n = y_n - \mathbf{x}_n^\top \mathbf{w}$ (right).]
Gaussian distribution and independence
Recall the definition of a Gaussian random variable $y \in \mathbb{R}$ with mean $\mu$ and variance $\sigma^2$. It has the density
\[
p(y \mid \mu, \sigma^2) = \mathcal{N}(y \mid \mu, \sigma^2)
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{(y-\mu)^2}{2\sigma^2}\right].
\]
In a similar manner, the density of a Gaussian random vector $\mathbf{y} \in \mathbb{R}^D$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma} \in \mathbb{R}^{D \times D}$ (which must be a positive semi-definite matrix) is
\[
\mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \frac{1}{\sqrt{(2\pi)^D \det(\boldsymbol{\Sigma})}}
\exp\!\left[-\tfrac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{y}-\boldsymbol{\mu})\right].
\]
Also recall that two random variables $X$ and $Y$ are called independent when $p(x, y) = p(x)\,p(y)$.
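As a quick sanity check (a minimal sketch, not part of the original notes; all names and values are illustrative), both density formulas can be written out in NumPy and compared against SciPy's reference implementations:

```python
import numpy as np
from scipy import stats

# Univariate Gaussian density, written out as in the formula above.
def gauss_pdf(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Multivariate Gaussian density for y in R^D.
def mvgauss_pdf(y, mu, Sigma):
    D = y.shape[0]
    diff = y - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Check against SciPy.
y, mu, sigma2 = 1.3, 0.5, 2.0
assert np.isclose(gauss_pdf(y, mu, sigma2),
                  stats.norm.pdf(y, loc=mu, scale=np.sqrt(sigma2)))

yv = np.array([0.2, -1.0])
muv = np.zeros(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
assert np.isclose(mvgauss_pdf(yv, muv, Sigma),
                  stats.multivariate_normal.pdf(yv, mean=muv, cov=Sigma))
```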
A probabilistic model for least-squares
We assume that our data is generated by the model
\[
y_n = \mathbf{x}_n^\top \mathbf{w} + \epsilon_n,
\]
where the noise $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$ is Gaussian and independent across samples. The likelihood of the data then factorizes:
\[
p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})
= \prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \mathbf{w})
= \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{x}_n^\top \mathbf{w}, \sigma^2).
\]
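To make the model concrete, here is a small simulation sketch (hypothetical sizes $N$, $D$ and noise level; not from the notes) that draws data from it and evaluates the factorized likelihood in log form:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, D, sigma = 100, 3, 0.5                    # illustrative sizes and noise level
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + sigma * rng.normal(size=N)  # y_n = x_n^T w + eps_n

def log_likelihood(w, X, y, sigma):
    # Sum over n of log N(y_n | x_n^T w, sigma^2).
    return stats.norm.logpdf(y, loc=X @ w, scale=sigma).sum()

print(log_likelihood(w_true, X, y, sigma))
```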
Working with a product is inconvenient, and since the logarithm is strictly increasing we can maximize the logarithm of the likelihood instead. This expression is called the log-likelihood (LL):
\[
\mathcal{L}_{\mathrm{LL}}(\mathbf{w}) := \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})
= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(y_n - \mathbf{x}_n^\top \mathbf{w}\right)^2 + \text{cnst}.
\]
Maximizing the log-likelihood,
\[
\max_{\mathbf{w}} \; \mathcal{L}_{\mathrm{LL}}(\mathbf{w})
= \max_{\mathbf{w}} \; -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(y_n - \mathbf{x}_n^\top \mathbf{w}\right)^2 + \text{cnst},
\]
is therefore the same problem as minimizing the mean squared error,
\[
\min_{\mathbf{w}} \; \mathcal{L}_{\mathrm{MSE}}(\mathbf{w})
= \min_{\mathbf{w}} \; \frac{1}{2N} \sum_{n=1}^{N} \left(y_n - \mathbf{x}_n^\top \mathbf{w}\right)^2.
\]
Maximum-likelihood estimator (MLE)
It is clear that maximizing the LL is equivalent to minimizing the MSE:
\[
\arg\min_{\mathbf{w}} \mathcal{L}_{\mathrm{MSE}}(\mathbf{w}) = \arg\max_{\mathbf{w}} \mathcal{L}_{\mathrm{LL}}(\mathbf{w}).
\]
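This equivalence is easy to verify numerically. The sketch below (same hypothetical setup as above, repeated so it runs on its own) computes the MSE minimizer in closed form and checks that perturbing it only lowers the log-likelihood:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, D, sigma = 100, 3, 0.5
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + sigma * rng.normal(size=N)

# MSE minimizer = least-squares solution.
w_mse = np.linalg.lstsq(X, y, rcond=None)[0]

def log_likelihood(w):
    return stats.norm.logpdf(y, loc=X @ w, scale=sigma).sum()

# Any perturbation of the MSE minimizer has strictly lower log-likelihood.
for _ in range(5):
    w_other = w_mse + 0.1 * rng.normal(size=D)
    assert log_likelihood(w_other) < log_likelihood(w_mse)
```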
Properties of MLE
The MLE is a sample approximation to the expected log-likelihood:
\[
\mathcal{L}_{\mathrm{LL}}(\mathbf{w}) \approx \mathbb{E}_{p(y,\mathbf{x})} \big[\log p(y \mid \mathbf{x}, \mathbf{w})\big].
\]
The MLE is consistent, i.e. $\mathbf{w}_{\mathrm{MLE}} \to \mathbf{w}_{\mathrm{true}}$ in probability, and asymptotically its covariance is the inverse Fisher information,
\[
\mathrm{Covariance}(\mathbf{w}_{\mathrm{MLE}}) = \mathbf{F}^{-1}(\mathbf{w}_{\mathrm{true}}),
\qquad \text{where } \mathbf{F}(\mathbf{w}) = \mathbb{E}_{p(y)}\!\left[-\frac{\partial^2 \mathcal{L}_{\mathrm{LL}}}{\partial \mathbf{w}\,\partial \mathbf{w}^\top}\right]
\]
is the Fisher information.
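The consistency claim can be probed empirically. In this sketch (illustrative dimensions and noise level), the MLE, here the least-squares fit, is computed on growing samples and its distance to $\mathbf{w}_{\mathrm{true}}$ shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)
D, sigma = 3, 0.5
w_true = rng.normal(size=D)

# The error of the MLE (= least-squares fit) shrinks as N grows.
for N in [10, 100, 1000, 10_000]:
    X = rng.normal(size=(N, D))
    y = X @ w_true + sigma * rng.normal(size=N)
    w_mle = np.linalg.lstsq(X, y, rcond=None)[0]
    print(N, np.linalg.norm(w_mle - w_true))
```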
Another example
We can replace the Gaussian distribution by a Laplace distribution:
\[
p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \frac{1}{2b} \, e^{-\frac{1}{b}\left| y_n - \mathbf{x}_n^\top \mathbf{w} \right|}.
\]
Maximizing the resulting log-likelihood is then equivalent to minimizing the mean absolute error (MAE).
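A sketch of that correspondence (hypothetical data, scale $b = 1$; the optimizer choice is illustrative): the negative Laplace log-likelihood and the MAE differ only by a positive scaling and an additive constant, so minimizing either with the same solver from the same start yields the same weights up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, D, b = 200, 3, 1.0
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + rng.laplace(scale=b, size=N)

def neg_laplace_ll(w):
    # -log prod_n (1/(2b)) exp(-|y_n - x_n^T w| / b)
    return np.abs(y - X @ w).sum() / b + N * np.log(2 * b)

def mae(w):
    return np.abs(y - X @ w).mean()

w0 = np.zeros(D)
w_ll = minimize(neg_laplace_ll, w0, method="Nelder-Mead").x
w_mae = minimize(mae, w0, method="Nelder-Mead").x
print(w_ll)   # the two minimizers coincide up to optimizer tolerance
print(w_mae)
```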