Lecture 6

MACHINE LEARNING (CS 403/603)

Probabilistic modelling: Introduction

Dr. Puneet Gupta


Univariate Gaussian Distribution
[Figure: Gaussian density curves over x ∈ [-5, 5] for (μ = 0, σ = 1), (μ = 0, σ = 2), (μ = 2, σ = 2), and (μ = 2, σ = 0.5).]

● Distribution over a real-valued scalar random variable x.
● Defined by a scalar mean μ and a scalar variance σ².
● Useful in regression with a single input.
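To make the density concrete, here is a minimal sketch that evaluates the univariate Gaussian pdf directly from its formula (the function name `gaussian_pdf` is illustrative, not from the slides):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The peak height at the mean scales as 1/sigma: a smaller sigma gives
# a taller, narrower bell (cf. the sigma = 0.5 vs sigma = 2 curves).
print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.3989
print(gaussian_pdf(2.0, mu=2.0, sigma=0.5))   # ~0.7979
```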
Linear Regression: Probabilistic View

[Figure: two panels plotting yn against xn, each showing the model prediction wᵀxn and the conditional distribution p(y|w,x) around it.]

Previously, we saw that linear (or ridge) regression tries to minimize the sum of squared errors. Let's instead maximize the likelihood; the resulting estimates will eventually reduce the same error.
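The link between the two views can be sketched numerically. Assuming Gaussian noise, yn ~ N(wᵀxn, σ²), the negative log-likelihood is the sum of squared errors divided by 2σ² plus a constant, so maximizing the likelihood and minimizing the squared error pick the same w (names below are illustrative):

```python
import math

def neg_log_likelihood(y, y_hat, sigma=1.0):
    """NLL of the data under y_n ~ N(y_hat_n, sigma^2):
    SSE / (2 sigma^2) plus a w-independent constant."""
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return sse / (2 * sigma ** 2) + len(y) * math.log(sigma * math.sqrt(2 * math.pi))

y     = [1.0, 2.0, 3.0]
pred1 = [1.1, 1.9, 3.2]   # lower sum of squared errors ...
pred2 = [0.5, 2.5, 2.0]   # ... higher sum of squared errors

# The NLL ranks predictions exactly as the squared error does.
assert neg_log_likelihood(y, pred1) < neg_log_likelihood(y, pred2)
```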
Multivariate Gaussian Distribution

[Figure: heatmaps of two bivariate Gaussian densities, p(x1) and p(x2); exercise: identify which density corresponds to which heatmap.]
Probabilistic Modeling of Data
Why probabilistic modeling?
● Parameter Estimation: Find θ using the input data.
● Prediction: Make predictions on new test data by formulating the task as supervised or unsupervised learning.

Assumption:
● Training data is given by y = {y1, y2, ..., yN} and, for simplicity, there is no x.
● It is obtained from a probability model: yn ∼ p(y|θ), ∀n, where θ is an unknown parameter.
● The data is independently and identically distributed (i.i.d.).
● One such case is a sequence of N coin-toss outcomes. Each outcome is a binary random variable: Head = 1, Tail = 0.

The likelihood is a function of θ, but how can the best θ be estimated?


Parameter Estimation
Maximum Likelihood Estimation (MLE)
How to choose the best θ? Find the θ that fits the data most probably, i.e., the θ that maximizes the likelihood p(y|θ) w.r.t. θ.

[Figure: likelihood p(y|θ) as a function of θ, peaking at θMLE.]

In MLE, we maximize the log-likelihood rather than the likelihood because: i) it reduces numerical instabilities by converting a product into a sum; and ii) log is monotonic and thus does not affect the estimate.

● It is an optimization problem w.r.t. θ: differentiate the log-likelihood w.r.t. θ and set it to 0.
● The negative log-likelihood can be thought of as a loss function; MLE is then minimizing the training-data loss. Since MLE uses the training data without any regularization, it can overfit in the absence of abundant data.

Example: What probability distribution can model the N coin-toss outcomes?
● The Bernoulli distribution can model it.
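For the coin-toss example, setting the derivative of the Bernoulli log-likelihood to zero gives the familiar closed form θMLE = N1/N. A minimal sketch (the toss data and function names are illustrative):

```python
import math

def bernoulli_mle(tosses):
    """MLE for the Bernoulli parameter: theta_MLE = N1 / N."""
    return sum(tosses) / len(tosses)

def log_likelihood(theta, tosses):
    """Bernoulli log-likelihood: N1*log(theta) + N0*log(1 - theta)."""
    n1 = sum(tosses)
    n0 = len(tosses) - n1
    return n1 * math.log(theta) + n0 * math.log(1 - theta)

tosses = [1, 0, 1, 1, 0, 1, 1, 0]   # 5 heads, 3 tails
theta_mle = bernoulli_mle(tosses)   # 0.625 = N1 / N

# Sanity check: no grid point attains a higher log-likelihood than N1/N.
grid = [i / 100 for i in range(1, 100)]
assert all(log_likelihood(theta_mle, tosses) >= log_likelihood(t, tosses) for t in grid)
```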
Incorporating Prior Belief in Parameter Estimation
In probabilistic models, we can incorporate our prior belief by specifying a prior distribution p(θ) on the unknown θ.

[Figure: example of a prior distribution p(θ) over θ ∈ [0, 1].]

Why is it required?
● For specifying which values of θ are more likely than others.
● It also aids in regularization for θ (more on this soon).

The prior p(θ) can be combined with the likelihood p(y|θ) using Bayes' rule to get the posterior distribution p(θ|y) over θ.

What happens if the prior is a uniform distribution?
● It is the same as using no prior.

[Figure: likelihood p(y|θ), prior p(θ), and posterior p(θ|y) over θ, with θMAP and θMLE marked.]

θ can be estimated by MAP: maximizing the posterior probability p(θ|y) (i.e., finding the θ that is most likely given the data and our prior belief) rather than by MLE, which maximizes the likelihood.

θMLE and the prior p(θ) sort of attract each other to reach a final consensus.
Maximum-a-Posteriori (MAP) Estimation
● MAP is the same as MLE except for an additional log-prior term, which provides regularization on θ.
● When p(θ) is a uniform prior, MAP reduces to MLE.

Our Example
● The hyperparameters of p(θ), α and β, are the expected numbers of heads and tails, respectively, before tossing the coin.
● If α = β = 1, p(θ) is a uniform prior; hence there is no regularization and θMAP = θMLE.
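Concretely, for a Beta(α, β) prior the MAP estimate is the mode of the Beta(α + N1, β + N0) posterior, θMAP = (N1 + α − 1)/(N + α + β − 2). A minimal sketch (toss data and names are illustrative):

```python
def bernoulli_map(tosses, alpha=1.0, beta=1.0):
    """MAP estimate of theta under a Beta(alpha, beta) prior:
    the mode of the Beta(alpha + N1, beta + N0) posterior."""
    n1 = sum(tosses)
    n0 = len(tosses) - n1
    return (n1 + alpha - 1) / (n1 + n0 + alpha + beta - 2)

tosses = [1, 0, 1, 1, 0, 1, 1, 0]      # 5 heads, 3 tails
print(bernoulli_map(tosses))            # alpha = beta = 1 (uniform prior): 0.625, same as MLE
print(bernoulli_map(tosses, 3, 3))      # informative prior pulls the estimate toward 0.5
```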
How to choose hyperparameters?
These rules of thumb follow directly from the nature of the Bayesian analysis
procedure:
● If the prior is uninformative, the posterior is very much determined by the data
(the posterior is data-driven)

● If the prior is informative, the posterior is a mixture of the prior and the data

● The more informative the prior, the more data you need to "change" your
beliefs, so to speak because the posterior is very much driven by the prior
information

● If you have a lot of data, the data will dominate the posterior distribution (they
will overwhelm the prior)
How do we choose a prior that models an equi-probable, high-confidence belief about heads and tails?
● Observation: MAP estimation "pulls" the estimate toward the prior.
● The more focused our prior belief, the larger the pull toward the prior.
● By using larger values for α and β (but keeping them equal), we can narrow the peak of the Beta distribution around the value p = 0.5. This causes the MAP estimate to move closer to the prior.
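The "pull" can be seen directly from the MAP formula with a symmetric Beta(a, a) prior: as a grows, the estimate marches from N1/N toward 0.5. A small sketch under the same coin-toss counts as before (names are illustrative):

```python
def beta_map(n1, n0, a):
    """MAP of theta under a symmetric Beta(a, a) prior."""
    return (n1 + a - 1) / (n1 + n0 + 2 * a - 2)

n1, n0 = 5, 3                       # observed heads / tails
for a in [1, 5, 50, 500]:
    # As a increases, the prior narrows around 0.5 and the
    # estimate moves monotonically from 0.625 toward 0.5.
    print(a, beta_map(n1, n0, a))
```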
Fully Bayesian Inference
● Both MLE and MAP provide a single value of θ.
● Bayesian estimation calculates the full posterior distribution p(θ|y). We can then select a value that we consider best in some sense, as in MAP.

[Figure (MLE/MAP): likelihood p(y|θ) and prior p(θ) with the point estimates θMLE and θMAP marked. Figure (fully Bayesian): the full posterior p(θ|y), whose large variance indicates low confidence.]

Importance:
● The variance calculated for θ from its posterior distribution gives us some confidence in our prediction. If the variance is too large, we may declare that no good estimate of θ exists.
● Online learning: the old posterior becomes the new prior, so our belief about θ keeps getting updated as we see more and more data.

[Figure: two-step online update; at each step the previous posterior p(θ|y) serves as the prior p(θ) and is combined with the likelihood p(y|θ) of the new data.]

Bayesian estimation is made complex because now the denominator in Bayes' rule, a.k.a. the evidence, cannot be ignored.
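For the coin-toss model, online learning is particularly simple: each Beta posterior is the next step's prior, and streaming the tosses one at a time lands on exactly the same posterior as a single batch update. A minimal sketch (prior values and names are illustrative):

```python
def beta_update(alpha, beta, toss):
    """One Bayesian update: a Beta(alpha, beta) prior combined with a
    single Bernoulli outcome yields a Beta posterior, which then
    serves as the prior for the next observation."""
    return (alpha + toss, beta + (1 - toss))

alpha, beta = 2.0, 2.0             # prior belief
tosses = [1, 0, 1, 1, 0]
for t in tosses:                   # stream the data one toss at a time
    alpha, beta = beta_update(alpha, beta, t)

# Sequential updating matches the batch posterior Beta(a + N1, b + N0).
n1 = sum(tosses)
n0 = len(tosses) - n1
assert (alpha, beta) == (2.0 + n1, 2.0 + n0)
```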
Fully Bayesian Inference
For a given likelihood function, if we have a choice in how we express our prior beliefs, we should use the form that allows us to carry out the integration in the numerator of Bayes' rule. This is known as choosing a conjugate prior.

Example of coin toss

With a Bernoulli likelihood for the parameter θ and a Beta(α, β) prior (a conjugate pair), the posterior is Beta(α + N1, β + N0), where N1 and N0 are the numbers of heads and tails, respectively.
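The conjugacy claim can be checked numerically: prior × likelihood is proportional to the Beta(α + N1, β + N0) kernel, so their ratio is the same constant at every θ. A short sketch (the sample counts and names are illustrative):

```python
def beta_kernel(theta, a, b):
    """Unnormalized Beta(a, b) density: theta^(a-1) (1-theta)^(b-1)."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1)

def bernoulli_lik(theta, n1, n0):
    """Bernoulli likelihood of n1 heads and n0 tails."""
    return theta ** n1 * (1 - theta) ** n0

a, b, n1, n0 = 2.0, 3.0, 5, 3
# prior * likelihood / posterior-kernel is constant in theta,
# i.e., the posterior really is Beta(a + n1, b + n0).
ratios = [bernoulli_lik(t, n1, n0) * beta_kernel(t, a, b)
          / beta_kernel(t, a + n1, b + n0)
          for t in (0.2, 0.5, 0.8)]
assert max(ratios) - min(ratios) < 1e-12
```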

Important point to remember for the Beta distribution: its normalizing constant B(α, β) is known as the Beta function, and p(x) is a Beta distribution with parameters α and β.

Details of the Beta and Gamma functions can be found at: http://math.feld.cvut.cz/mt/txtd/5/txe3da5h.htm


How to make predictions?
● Once θ is learned from the training data, it can be used to make predictions about the test set.
● E.g., predict the probability of the next toss being a head by analyzing the previous coin tosses.
● This can be accomplished using point estimates (MLE/MAP) or the full posterior.
● Hence, predictions in our coin-toss example are:
  • When doing MLE/MAP, we approximate the posterior p(θ|y) by a single point.
  • For fully Bayesian prediction, we compute the predictive distribution by averaging over the full posterior: calculate p(yN+1|θ) for each possible θ, weight it by how likely that θ is under the posterior p(θ|y), and sum all such posterior-weighted predictions. Note that not every value of θ is given equal importance in the averaging.
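For the Beta-Bernoulli pair this posterior-weighted average has a closed form: the probability that the next toss is a head equals the posterior mean, (α + N1)/(α + β + N). A minimal sketch comparing it with the plug-in point estimate (counts and names are illustrative):

```python
def predictive_heads(alpha, beta, n1, n0):
    """Fully Bayesian probability that the next toss is a head:
    the mean of the Beta(alpha + n1, beta + n0) posterior."""
    return (alpha + n1) / (alpha + beta + n1 + n0)

n1, n0 = 5, 3
theta_mle = n1 / (n1 + n0)                   # plug-in point estimate: 0.625
bayes = predictive_heads(2.0, 2.0, n1, n0)   # posterior-averaged: 7/12

# The Bayesian predictive is shrunk toward the prior mean of 0.5,
# whereas the plug-in estimate ignores posterior uncertainty.
print(theta_mle, bayes)
```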

Recall: How to make predictions?

Why is the delta function required? When using a point estimate (MLE/MAP), the posterior p(θ|y) is replaced by a delta function centered at the single estimated θ, so the predictive average collapses to plugging in that one value.

Here, the fully Bayesian approach to prediction averages over all possible values of θ, weighted by their respective posterior probabilities (easy in this example, but a hard problem in general).
● A probabilistic model is an intuitive and flexible way to model data, where the likelihood (corresponding to a loss function) and the prior (corresponding to a regularizer) are chosen based on the properties of the data.
● MLE and MAP estimation can be viewed as unregularized and regularized loss-function minimization, respectively.
