CS772 Lec1
▪ Proration: If you miss any quiz/mid-sem, we can prorate it using end-sem marks
▪ Proration only allowed on limited grounds (e.g., health related)
Textbooks and Readings
▪ Some books that you may use as reference (freely available online)
▪ Kevin P. Murphy, Probabilistic Machine Learning: An Introduction (PML-1), The MIT Press, 2022.
▪ Kevin P. Murphy, Probabilistic Machine Learning: Advanced Topics (PML-2), The MIT Press, 2022.
▪ Chris Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2007.
▪ Chris Bishop and Hugh Bishop, Deep Learning: Foundations and Concepts (DLFC), Springer, 2023.
▪ Follow the suggested readings for each lecture (which may also include some portions from these books), rather than trying to read these books in a linear fashion
Probabilistic Machine Learning
▪ Machine Learning primarily deals with
▪ Predicting the output $y_*$ for new (test) inputs $\boldsymbol{x}_*$, given training data $(\boldsymbol{X}, \boldsymbol{y}) = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$
▪ Generating new (synthetic) data, given some training data $\boldsymbol{X} = \{\boldsymbol{x}_i\}_{i=1}^{N}$
▪ Probabilistic ML gives a natural way to solve both these tasks (with some advantages)
▪ Prediction: Learning the predictive distribution $p(y_* \mid x_*, \boldsymbol{X}, \boldsymbol{y})$
Using this, we can not only get the mean but also the variance (uncertainty) of the predicted output $y_*$
▪ Generation: Learning a generative model of data $p(\boldsymbol{x}_* \mid \boldsymbol{X})$
Can "sample" (simulate) from this distribution to generate new data
▪ Both are conditional distributions. PML is about estimating these distributions accurately and efficiently; estimating them exactly is hard in general, but we can use approximations
▪ At its core, both problems require estimating the underlying distribution of data
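Below is a minimal sketch (not from the course material) of these two tasks on toy 1-D data. It uses simple maximum-likelihood point estimates rather than full posteriors, just to show that both prediction and generation reduce to estimating a distribution and then querying or sampling it; all variable names and numbers are illustrative.

```python
# A minimal sketch (not from the course) of the two tasks, using maximum-likelihood
# point estimates in place of full posteriors, just to show that both tasks reduce to
# estimating a distribution and then querying/sampling it.
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D training data: inputs X and outputs y = 2x + noise
X = rng.normal(size=100)
y = 2.0 * X + rng.normal(scale=0.5, size=100)

# --- Prediction: estimate a conditional distribution p(y* | x*) ---
# Here: a linear-Gaussian model fit by least squares; sigma2 is the residual variance.
w = np.sum(X * y) / np.sum(X * X)
sigma2 = np.mean((y - w * X) ** 2)
x_star = 1.5
print("p(y*|x*) ~ Normal(mean=%.2f, var=%.2f)" % (w * x_star, sigma2))

# --- Generation: estimate a distribution p(x) over inputs and sample from it ---
mu, var = X.mean(), X.var()
new_samples = rng.normal(mu, np.sqrt(var), size=5)  # "simulate" new synthetic inputs
print("synthetic inputs:", np.round(new_samples, 2))
```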
Probabilistic Machine Learning
▪ With a probabilistic approach to ML, we can also easily incorporate “domain knowledge”
▪ Can specify our assumptions about data using suitable probability distributions over inputs/outputs, usually in one of these forms, where $\theta$ denotes the unknown parameters of the distribution (a small sketch follows this list):
▪ $p(y_n \mid x_n, \theta)$: probability distribution of the output as a function of the input
▪ $p(x_n \mid y_n, \theta)$: distribution of the input conditioned on its "label/output"
▪ $p(x_n \mid \theta)$: distribution of the inputs
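Here is a minimal sketch (my own example, with hypothetical parameter values standing in for $\theta$) of evaluating the three modeling forms above: a logistic model for $p(y_n \mid x_n, \theta)$, a class-conditional Gaussian for $p(x_n \mid y_n, \theta)$, and a Gaussian for $p(x_n \mid \theta)$.

```python
# A minimal sketch (my own example, not from the slides) of the three modeling forms,
# with hypothetical parameter values theta plugged in for concreteness.
import numpy as np
from scipy.stats import norm

x_n, y_n = 1.2, 1          # one toy input/output pair
w, b = 0.8, -0.1           # hypothetical parameters theta for the discriminative model

# p(y_n | x_n, theta): output distribution as a function of the input
# (here a Bernoulli whose probability is a logistic function of x_n)
p_y_given_x = 1.0 / (1.0 + np.exp(-(w * x_n + b)))
print("p(y_n=1 | x_n, theta) =", round(p_y_given_x, 3))

# p(x_n | y_n, theta): input distribution conditioned on the label
# (here a class-conditional Gaussian with class-specific means)
class_means = {0: -1.0, 1: +1.0}
print("p(x_n | y_n, theta) =", round(norm.pdf(x_n, loc=class_means[y_n], scale=1.0), 3))

# p(x_n | theta): distribution of the inputs alone (unsupervised modeling)
print("p(x_n | theta) =", round(norm.pdf(x_n, loc=0.0, scale=2.0), 3))
```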
▪ Can specify our assumptions about the unknowns 𝜃 using a “prior distribution”
$p(\theta)$: represents our belief about the unknown parameters before we see the data
▪ After seeing some data $\mathcal{D}$, can update the prior into a posterior distribution $p(\theta \mid \mathcal{D})$
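A minimal prior-to-posterior sketch (my own example, not from the slides), using the conjugate Beta-Bernoulli pair: a Beta prior on a coin's bias $\theta$ is updated to a Beta posterior simply by adding the observed counts of heads and tails.

```python
# A minimal prior-to-posterior sketch (illustrative example), using the conjugate
# Beta-Bernoulli pair: prior Beta(a, b) over a coin's bias theta, data D = coin flips.
# The posterior is again a Beta, obtained by adding the observed counts to (a, b).
import numpy as np

a, b = 2.0, 2.0                      # prior p(theta) = Beta(a, b): belief before data
D = np.array([1, 0, 1, 1, 1, 0, 1])  # observed flips (1 = heads)

heads, tails = D.sum(), len(D) - D.sum()
a_post, b_post = a + heads, b + tails  # posterior p(theta | D) = Beta(a_post, b_post)

print("prior mean of theta    :", a / (a + b))
print("posterior mean of theta:", a_post / (a_post + b_post))
```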
The Core of PML: Two Basic Rules of Probability
▪ Sum Rule (marginalization): Distribution of $a$ obtained by considering all possibilities of $b$
$p(a) = \sum_b p(a, b)$ if $b$ is a discrete r.v., or $p(a) = \int p(a, b)\, db$ if $b$ is a continuous r.v.
▪ Product Rule
$p(a, b) = p(a)\, p(b \mid a) = p(b)\, p(a \mid b)$
▪ These two rules are the core of most of probabilistic/Bayesian ML
▪ Bayes rule is easily derived from the sum and product rules
$p(b \mid a) = \dfrac{p(b)\, p(a \mid b)}{p(a)} = \dfrac{p(b)\, p(a \mid b)}{\int p(a, b)\, db}$   (assuming $b$ is a continuous r.v.)
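A quick worked example with hypothetical numbers (not from the slides), showing the sum, product, and Bayes rules in action for a binary $b$ and a binary observation $a$:

```latex
% Toy discrete example (hypothetical numbers) of the sum, product, and Bayes rules.
% Prior: p(b{=}1) = 0.3, p(b{=}0) = 0.7. Likelihood: p(a{=}1 \mid b{=}1) = 0.9, p(a{=}1 \mid b{=}0) = 0.2.
\begin{align*}
p(a{=}1) &= \sum_b p(a{=}1, b) = p(b{=}1)\,p(a{=}1 \mid b{=}1) + p(b{=}0)\,p(a{=}1 \mid b{=}0) \\
         &= 0.3 \times 0.9 + 0.7 \times 0.2 = 0.41, \\[4pt]
p(b{=}1 \mid a{=}1) &= \frac{p(b{=}1)\,p(a{=}1 \mid b{=}1)}{p(a{=}1)} = \frac{0.27}{0.41} \approx 0.66.
\end{align*}
```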
ML and Uncertainty
(and how PML handles uncertainty)
Uncertainty due to Limited Training Data
▪ Model/parameter uncertainty is due to not having enough training data
[Figure: one example showing the same model class (linear models) but uncertainty about the weights, and another showing uncertainty not just about the weights but also about the model class]
Image credit: Balaji L, Dustin T, Jasper N. (NeurIPS 2020 tutorial)
Uncertainty due to Inherent Noise in Training Data
▪ Data uncertainty can be due to various reasons, e.g.,
▪ Intrinsic hardness of labeling, class overlap
▪ Labeling errors/disagreements (for difficult training inputs)
▪ Noisy or missing features
Image credit: Eric Nalisnick
Image source: "Improving machine classification using human uncertainty measurements" (Battleday et al, 2021)
Image source: "Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods" (H&W 2021)
How to Estimate Uncertainty?
In this course, we will mostly focus on the Bayesian approach, but the other two approaches are also popular and will also be discussed.
▪ Uncertainty in predictions: Usually estimated by computing and reporting the mean and variance of predictions made using many possible values of $\theta$ (see the sketch below). Commonly reported as:
▪ Predictive distribution $p(y_* \mid x_*, \mathcal{D})$: can get both the mean and the variance/quantiles of the prediction
▪ Sets/intervals of possible predictions
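A minimal sketch (my own example, under simplified assumptions: a 1-D Bayesian linear regression with a Gaussian prior on the weight and known noise variance) of this recipe: sample many plausible values of $\theta$ from the posterior, predict with each, and report the mean and variance of those predictions.

```python
# A minimal sketch (illustrative example) of reporting predictive uncertainty by
# averaging over many plausible parameter values theta. Here theta is the weight of a
# 1-D Bayesian linear regression with a Gaussian prior and known noise variance, so
# the posterior over theta is Gaussian and easy to sample from.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=30)
y = 2.0 * X + rng.normal(scale=0.5, size=30)

sigma2, tau2 = 0.25, 10.0                      # assumed noise variance and prior variance
post_var = 1.0 / (np.sum(X * X) / sigma2 + 1.0 / tau2)
post_mean = post_var * np.sum(X * y) / sigma2  # posterior p(theta | D) = N(post_mean, post_var)

# Monte Carlo estimate of the predictive mean and variance at a test input x*
x_star = 1.5
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=5000)
pred_samples = theta_samples * x_star + rng.normal(scale=np.sqrt(sigma2), size=5000)

print("predictive mean    :", pred_samples.mean())
print("predictive variance:", pred_samples.var())
```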
Predictive Uncertainty
▪ Information about uncertainty gives an idea about how much to trust a prediction
▪ It can also “guide” us in sequential decision-making:
$p(y_* \mid x_*, \mathcal{D}) = \mathcal{N}(y_* \mid \mu_*, \sigma_*^2)$, where $y_*$ is the test output and $x_*$ is the test input
[Figure: training data with a blue curve showing the mean of the function learned so far using the available data; the shaded region denotes the current predictive uncertainty]
Given our current estimate of the regression function, which training input(s) should we add next to improve its estimate the most? Uncertainty can help here: acquire training inputs from regions where the function is most uncertain about its current predictions (a sketch follows below).
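A minimal sketch (my own example, reusing the same simplified Bayesian linear regression setup) of this uncertainty-guided acquisition idea: among a pool of candidate inputs, pick the one with the largest current predictive variance.

```python
# A minimal sketch (illustrative example) of uncertainty-guided data acquisition:
# among candidate inputs, pick the one with the largest current predictive variance.
# Uses a 1-D Bayesian linear regression (Gaussian prior on the weight, known noise).
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=10)                # small initial training set
y = 2.0 * X + rng.normal(scale=0.5, size=10)

sigma2, tau2 = 0.25, 10.0
post_var = 1.0 / (np.sum(X * X) / sigma2 + 1.0 / tau2)   # posterior variance of the weight

# For this model, the predictive variance at input x is x^2 * post_var + sigma2,
# so it grows for inputs far from the region already covered by the data.
candidates = np.linspace(-3, 3, 61)
pred_var = candidates ** 2 * post_var + sigma2

next_x = candidates[np.argmax(pred_var)]
print("acquire next training input at x =", next_x)
```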
▪ Assume that both training and test data come from the same distribution
▪ This assumption, although standard, may be violated in real-world applications of ML and
there are “adaptation” methods to handle that