2 MLE Estimation
Joint Probability: p(X, Y) is the probability of X and Y.
Conditional Probability: p(X|Y) is the probability of X given Y.
Independent and Conditional Probabilities
Sum Rule (Marginalization): p(X) = Σ_Y p(X, Y)
Law of Total Probability: p(X) = Σ_Y p(X|Y) p(Y)
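As a quick sanity check of both rules, here is a minimal Python/NumPy sketch over a small discrete joint distribution (the table entries are made up for illustration): marginalizing the joint table and applying the law of total probability give the same p(X).

```python
import numpy as np

# Hypothetical joint distribution p(X, Y) over two binary variables,
# stored as a table with rows indexed by X and columns by Y (values made up).
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])           # entries sum to 1

# Sum rule (marginalization): p(X) = Σ_Y p(X, Y)
p_x = p_xy.sum(axis=1)                    # -> [0.40, 0.60]

# Law of total probability: p(X) = Σ_Y p(X|Y) p(Y)
p_y = p_xy.sum(axis=0)                    # marginal p(Y)
p_x_given_y = p_xy / p_y                  # column j holds p(X | Y = j)
p_x_again = (p_x_given_y * p_y).sum(axis=1)

assert np.allclose(p_x, p_x_again)        # both routes agree
```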
Bayes' Theorem

p(Y|X) = p(X|Y) p(Y) / p(X)

Note the denominator p(X) = Σ_Y p(X|Y) p(Y).
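A minimal numerical sketch of Bayes' theorem, with made-up prior and likelihood values for a binary hypothesis Y and a binary observation X:

```python
# Minimal Bayes' theorem sketch; all probabilities are made up.
# Y is a binary hypothesis, X is a binary observation (here X = 1 is observed).
p_y = {True: 0.2, False: 0.8}                        # prior p(Y)
p_x1_given_y = {True: 0.9, False: 0.1}               # p(X = 1 | Y)

# Denominator via the law of total probability: p(X=1) = Σ_Y p(X=1|Y) p(Y)
p_x1 = sum(p_x1_given_y[y] * p_y[y] for y in p_y)

# Bayes' theorem: p(Y = True | X = 1)
posterior = p_x1_given_y[True] * p_y[True] / p_x1
print(posterior)   # 0.18 / 0.26 ≈ 0.692
```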
Bayes' Theorem

p(w|D) = p(D|w) p(w) / p(D)

posterior = (likelihood · prior) / evidence

p(w|D) ∝ p(D|w) p(w)
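The proportionality is the practically useful part: we can evaluate p(D|w) p(w) over candidate values of w and normalize only at the end. Below is a minimal sketch with made-up coin-flip data, a Bernoulli likelihood, and a flat prior evaluated on a grid; all of these are illustrative choices.

```python
import numpy as np

# Hypothetical setup: w is a coin's heads probability, D is a set of flips (made up).
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])
w_grid = np.linspace(0.01, 0.99, 99)            # candidate values of w

prior = np.ones_like(w_grid)                    # flat prior p(w) on the grid (a choice)
likelihood = np.array([(w ** flips).prod() * ((1 - w) ** (1 - flips)).prod()
                       for w in w_grid])        # p(D|w) for i.i.d. Bernoulli flips

unnormalized = likelihood * prior               # p(w|D) ∝ p(D|w) p(w)
posterior = unnormalized / unnormalized.sum()   # normalize (grid stand-in for p(D))
```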
Note. In this class, vectors are denoted by lower case bold Roman
letters such as x, and all vectors are assumed to be column vectors.
I.I.D.
Suppose you obtain i.i.d. samples from an r.v. that you think are normally distributed. How can we determine the normal distribution (i.e., the parameters µ and σ) these samples likely came from?
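Before turning to the frequentist and Bayesian framings discussed next, here is what the maximum likelihood answer to this question looks like in code: a minimal NumPy sketch with made-up "true" parameters (µ = 2.0, σ = 1.5) used only to generate synthetic samples. The MLE for a Gaussian is the sample mean and the square root of the mean squared deviation.

```python
import numpy as np

# Synthetic i.i.d. Gaussian samples; the "true" µ = 2.0 and σ = 1.5 are made up.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=1000)

# Maximum likelihood estimates for a Gaussian:
#   µ_MLE = sample mean
#   σ_MLE = sqrt of the mean squared deviation (divides by n, not n - 1)
mu_mle = samples.mean()
sigma_mle = np.sqrt(((samples - mu_mle) ** 2).mean())
print(mu_mle, sigma_mle)   # should land close to 2.0 and 1.5
```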
Roughly speaking:
1 Frequentist Approach: probability is interpreted as the long-run frequency of a 'repeatable' event, which leads to the notion of confidence intervals.
2 Bayesian Approach: probability is interpreted as a degree of belief (a quantification of uncertainty), which allows us to place a prior distribution over the unknown parameters.
Maximum A Posteriori Estimation

w_MAP = argmax_w p(w|D) = argmax_w p(D|w) p(w) / p(D)
      = argmax_w Π_{i=1}^n p(x_i|w) · p(w)

since p(D) does not depend on w and the samples are i.i.d. Taking the log,
w_MAP = argmax_w ( log p(w) + Σ_{i=1}^n log p(x_i|w) )
Note. If you compare with MLE, notice that MAP is the argmax
of the same function plus a term for the log of the prior. We will
discuss some of the implications of this!
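To see the effect of that extra log-prior term, here is a minimal sketch (all numbers made up) for i.i.d. Gaussian data with known σ, an unknown mean w, and a Gaussian prior on w: the MAP estimate maximizes log p(w) + Σ_i log p(x_i|w), while the MLE simply drops the prior term.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical setup (all numbers made up): i.i.d. Gaussian data with known sigma,
# unknown mean w, and a Gaussian prior w ~ N(0, tau^2).
rng = np.random.default_rng(1)
sigma, tau = 1.0, 0.5
x = rng.normal(loc=1.0, scale=sigma, size=20)

def neg_log_posterior(w):
    log_prior = -0.5 * (w / tau) ** 2                   # log p(w), up to a constant
    log_lik = -0.5 * np.sum(((x - w) / sigma) ** 2)     # Σ_i log p(x_i|w), up to a constant
    return -(log_prior + log_lik)

w_map = minimize_scalar(neg_log_posterior).x            # maximize the log posterior
w_mle = x.mean()                                        # MLE: drop the log-prior term
print(w_mle, w_map)   # the prior pulls w_MAP toward 0 relative to w_MLE
```

Because the prior is centered at 0, the log-prior term shrinks w_MAP toward 0 relative to w_MLE; as more data arrive, the likelihood term dominates and the two estimates converge.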