Lecture 6
[Figure: regression illustration — model predictions wᵀxₙ plotted against targets yₙ for inputs xₙ over the range −5 to 5.]
Previously, we saw that linear or ridge regression tries to minimize the sum of squared errors. Now let's instead maximize the likelihood: maximum-likelihood estimation will eventually reduce the error as well.
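To see why, assume Gaussian noise around the linear model (a standard assumption): yn = wᵀxn + εn with εn ∼ N(0, σ²). Then the log-likelihood of the data is

log p(y|w) = Σn log N(yn | wᵀxn, σ²) = −(1/2σ²) Σn (yn − wᵀxn)² + const,

so maximizing the likelihood over w is exactly minimizing the sum of squared errors.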
Multivariate Gaussian Distribution
[Figure: density plots p(x1) and p(x2) of multivariate Gaussians, with an exercise to identify the corresponding heatmaps.]
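As a rough sketch of how such a heatmap is produced (the mean and covariance below are illustrative values, not taken from the slide):

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D Gaussian; mean and covariance are made up, not from the slide.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])   # positively correlated components

# Evaluate the density on a grid; rendering this grid as an image
# (e.g., with matplotlib's imshow) gives a heatmap of p(x).
xs = np.linspace(-5.0, 5.0, 200)
X, Y = np.meshgrid(xs, xs)
grid = np.dstack([X, Y])                        # shape (200, 200, 2)
density = multivariate_normal(mean, cov).pdf(grid)
print(density.shape)                            # (200, 200)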
Probabilistic Modeling of Data
Why probabilistic modeling?
● Parameter Estimation: Find θ using input data.
● Prediction: Make predictions for new test data by formulating the task as supervised or unsupervised learning.
Assumption:
● Training data is given by y = {y1, y2, ..., yN} and there is no x, for simplicity.
● It is obtained from a probability model: yn ∼ p(y|θ), ∀n, where θ is an unknown parameter.
● The data is independently & identically distributed (i.i.d.).
● One such case is a sequence of N coin toss outcomes, each a binary random variable: Head = 1, Tail = 0 (see the simulation sketch below).
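A minimal simulation of this setup, assuming a Bernoulli model (the true θ below is a made-up value):

import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                            # made-up true probability of heads
N = 50
y = rng.binomial(1, theta_true, size=N)     # i.i.d. tosses: Head = 1, Tail = 0

# The MLE for a Bernoulli parameter is the fraction of heads.
theta_mle = y.mean()
print(f"heads: {y.sum()}/{N}, MLE: {theta_mle:.3f}")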
The prior p(θ) can be combined with the likelihood p(y|θ) using Bayes' rule to get the posterior distribution p(θ|y) over θ, i.e.,

p(θ|y) = p(y|θ) p(θ) / p(y) ∝ p(y|θ) p(θ).
θ can be estimated by MAP, i.e., by maximizing the posterior probability p(θ|y) (finding the θ that is most likely given the data and our prior belief), rather than by MLE, which maximizes the likelihood.
The MLE θMLE and the prior p(θ) sort of attract each other to reach a final consensus.
Maximum-a-Posteriori (MAP) Estimation
● MAP is the same as MLE except for an additional log-prior term, which provides regularization on θ (written out below).
● When p(θ) is a uniform prior, MAP reduces to MLE.
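Written out (a standard identity, using the notation above):

θMAP = argmaxθ log p(θ|y) = argmaxθ [log p(y|θ) + log p(θ)],

since log p(y) does not depend on θ; the log p(θ) term is the regularization mentioned above, and it is constant for a uniform prior.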
Our Example
● If the prior is informative, the posterior is a mixture of the prior and the data.
● The more informative the prior, the more data you need to "change" your beliefs, so to speak, because the posterior is very much driven by the prior information.
● If you have a lot of data, the data will dominate the posterior distribution (it will overwhelm the prior).
How do we choose a prior that models an equi-probable, high-confidence belief in heads and tails?
● Observation: MAP estimation "pulls" the estimate toward the prior.
● The more focused our prior belief, the larger the pull toward the prior.
● By using larger values for α and β (but keeping them equal), we can narrow the peak of the beta distribution around the value p = 0.5. This would cause the MAP estimate to move closer to the prior, as the sketch below shows.
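A small sketch of this pull, using the standard closed-form MAP estimate for a Bernoulli likelihood with a Beta(α, β) prior, θMAP = (H + α − 1)/(N + α + β − 2), where H is the number of heads in N tosses (the data below are made up):

H, N = 8, 10                  # made-up data: 8 heads in 10 tosses
theta_mle = H / N             # MLE ignores the prior: 0.8

# Symmetric Beta(alpha, alpha) priors peak at 0.5; larger alpha = sharper peak.
for alpha in [1, 2, 10, 100]:
    beta = alpha
    theta_map = (H + alpha - 1) / (N + alpha + beta - 2)
    print(f"alpha = beta = {alpha:>3}: MAP = {theta_map:.3f}")

With α = β = 1 (uniform prior) the MAP estimate equals the MLE (0.8); with α = β = 100 it is pulled to about 0.51, close to the prior's peak at 0.5.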
Fully Bayesian Inference
● Both MLE and MAP provide a single value of θ, computed from the likelihood p(y|θ) and (for MAP) the prior p(θ).
● Fully Bayesian estimation instead calculates the full posterior distribution p(θ|y).
● Recall: in our coin toss example, the prior p(θ) is a beta distribution with parameters α and β.
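Concretely, conjugacy gives this posterior in closed form (a standard result): with H heads and T tails in y,

p(θ|y) ∝ θ^H (1 − θ)^T · θ^(α−1) (1 − θ)^(β−1) = θ^(α+H−1) (1 − θ)^(β+T−1),

so p(θ|y) = Beta(θ | α + H, β + T): the posterior is again a beta distribution, with the prior's parameters updated by the observed counts.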
How to make predictions?
● Once θ is learned from the training data, it can be used to make predictions about the test set.
● E.g., predict the probability of the next toss being heads by analyzing the previous coin tosses.
● This can be accomplished by using point estimates (MLE/MAP) or the full posterior.
● Hence, the predictions in our coin toss example are:
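The standard predictive probabilities for this Beta–Bernoulli setup, which the colon above introduces, are:

Point estimate (MLE or MAP): p(yN+1 = 1 | θ̂) = θ̂, with θ̂MLE = H/N and θ̂MAP = (H + α − 1)/(N + α + β − 2).

Fully Bayesian: p(yN+1 = 1 | y) = ∫ θ p(θ|y) dθ = E[θ|y] = (α + H)/(α + β + N).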