
Probabilistic ML

Extra Class: July 03, 2024 (Wednesday), 4PM, L18
In lieu of the missed class on June 14, 2024
Timing and venue same for both … only the day of the week is different
Probabilistic ML 3
Till now we have looked at ML techniques that assign a single label to every data point (the label is from the set {−1, +1} for binary classification, from [K] for multiclass classification with K classes, from ℝ for regression, etc.)
Examples include DT, linear models
Probabilistic ML techniques, given a data point, do not output a single label; they instead output a distribution over all possible labels
For binary classification, output a PMF over {−1, +1}; for multiclassification, output a PMF over [K]; for regression, output a PDF over ℝ
The probability mass/density of a label in the output PMF/PDF indicates how likely the ML model thinks that label is the correct one for that data point
Note: the algorithm is allowed to output a possibly different PMF/PDF for every data point. However, the support of these PMFs/PDFs is always the set of all possible labels (i.e., even very unlikely labels are included in the support)
Probabilistic ML for Classification 4
Say we have somehow learnt a PML model which, for a data point x, gives us a PMF P[y | x, w] over the set of all possible labels – {−1, +1} for binary classification, [K] for multiclassification
Note that the PMF is conditioned on x and w, which are not r.v.s at this moment but are nevertheless fixed, since we are looking at the data point x using the model w
Exactly! Suppose we have three classes and, for a data point, the ML model gives us a PMF that puts only 40% of its mass on the second class. The second class does win, being the mode, but the model seems not very certain about this prediction (only 40% confidence).
True! Suppose another model gives us a PMF on the same data point where the second class still wins, but this time the model is very certain about this prediction (since it is giving a very high 85% confidence to it).
We may use this PMF in very creative ways
Predict the mode of this PMF if someone wants a single label predicted
May use the median/mean as well – Bayesian ML exploits this possibility
Use the PMF to find out if the ML model is confident about its prediction or totally confused about which label is the correct one!
May use the variance of the PMF to find this out as well (low variance = very confident prediction, high variance = less confident/confused prediction)
I could not agree more. However, in many ML applications (e.g. active learning), if we find that the model is making very unsure predictions, we can switch to another model or just ask a human to step in. Thus, confidence info can be used fruitfully
Warning! Just because a prediction is made with more confidence does not mean it must be correct!
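As an illustration, here is a minimal sketch (Python/NumPy, not from the slides) of reading off the mode, mean and variance of a predicted PMF to judge confidence. The two PMFs are made-up examples; only the 40% and 85% figures come from the discussion above, and treating the class index as a numeric label (to take a mean/variance) is an assumption for illustration.

```python
import numpy as np

# Two illustrative PMFs over three classes (off-mode masses are made up;
# only the 40% and 85% figures appear in the slide discussion).
pmf_unsure  = np.array([0.30, 0.40, 0.30])
pmf_certain = np.array([0.05, 0.85, 0.10])

def summarize(pmf):
    classes = np.arange(1, len(pmf) + 1)              # class indices 1..K
    mode = classes[np.argmax(pmf)]                    # single-label prediction
    mean = np.sum(classes * pmf)                      # mean label (Bayesian ML may use this)
    var  = np.sum(((classes - mean) ** 2) * pmf)      # low variance = confident prediction
    return mode, mean, var

print(summarize(pmf_unsure))    # mode 2, high variance  -> unsure prediction
print(summarize(pmf_certain))   # mode 2, low variance   -> confident prediction
```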
Probabilistic Binary Classification 5
Find a way to map every data point x to a Rademacher distribution
Another way of saying this: map every data point x to a probability value p(x) ∈ [0, 1]
Will give us a PMF, i.e. P[y = +1 | x] = p(x) and P[y = −1 | x] = 1 − p(x)
If using the mode predictor, i.e. predict +1 if p(x) > 0.5 and −1 otherwise, then this PMF will give us the correct label only if the following happens
When the true label of x is +1, p(x) > 0.5, in other words P[y = +1 | x] > P[y = −1 | x]
When the true label of x is −1, p(x) < 0.5, in other words P[y = −1 | x] > P[y = +1 | x]
Note that if p(x) = 0.5, it means the ML model is totally confused about the label of x
Data points for whom p(x) = 0.5 are on the decision boundary!!
Of course, as usual we want a healthy margin
If the true label of the data point is +1, then we want p(x) ≈ 1, i.e. p(x) ≫ 0.5
If the true label of the data point is −1, then we want p(x) ≈ 0, i.e. p(x) ≪ 0.5
Probabilistic Binary Classification 6
How to map feature vectors x to probability values p(x)?
Could treat it as a regression problem since probability values are real numbers after all
Will need to modify the training set a bit to do this (basically change all −1 labels to 0, since we want p(x) ≈ 0 if the label is −1 and p(x) ≈ 1 if it is +1)
Could use DT etc. to solve this regression problem
Using linear models to do this presents a challenge
If we learn a linear model w using ridge regression, it may happen that for some data point x we have wᵀx > 1 or else wᵀx < 0
That won't make sense in this case – not a valid PMF!!
DT doesn't suffer from this problem since it always predicts a value in [0, 1]
DT uses averages of a bunch of train labels to obtain test predictions – the average of a bunch of 0s and 1s is always a value in the range [0, 1]
So can we never use linear models to do probabilistic ML?
We can – one way to solve the problem of using linear methods to map x to a probability is called logistic regression – we have seen it before
Ah! The name makes sense now – logistic regression is used to solve binary classification problems, but since it does so by mapping x to a real value, experts thought it would be cool to have the term "regression" in the name
Yes, but there is a trick involved. Let us take a look at it
Sigmoid Function 7
Trick: learn a linear model w and map x ↦ σ(wᵀx), where σ(t) = 1/(1 + exp(−t)) is the sigmoid function
May have an explicit/hidden bias term as well
This will always give us a value in the range (0, 1), hence a valid PMF
Note that σ(t) > 0.5 if t > 0 and σ(t) < 0.5 if t < 0, and also that σ(t) → 1 as t → ∞ and σ(t) → 0 as t → −∞
This means that our sigmoidal map will predict +1 if wᵀx > 0 and −1 if wᵀx < 0
There are several other such wrapper/squashing/link/activation functions which do similar jobs, e.g. tanh, ramp, ReLU
Nice! So I want to learn a linear model w such that once I do this sigmoidal map, data points with label +1 get mapped to a probability value close to 1, whereas data points with label −1 get mapped to a probability value close to 0
How do I learn such a model w?
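A minimal sketch of the sigmoidal map described above, assuming the standard sigmoid σ(t) = 1/(1 + exp(−t)); the model w and data point x below are made-up values for illustration.

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + exp(-t)); always lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def rademacher_pmf(w, x):
    # p = P[y = +1 | x, w], 1 - p = P[y = -1 | x, w]
    p = sigmoid(np.dot(w, x))
    return {+1: p, -1: 1.0 - p}

w = np.array([1.0, -2.0])    # hypothetical linear model (no bias term)
x = np.array([0.5, 0.1])
print(rademacher_pmf(w, x))  # w.x = 0.3 > 0, so the mode predictor outputs +1
```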
Likelihood 8
Suppose we have a linear model w (assume the bias is hidden for now)
Given a data point x and w, the use of the sigmoidal map gives us a Rademacher PMF P[y | x, w]
The probability that this PMF gives to the correct label y, i.e. P[y | x, w], is called the likelihood of this model with respect to this data point
It is easy to show that P[y | x, w] = σ(y · wᵀx)
Hint: use the fact that 1 − σ(t) = σ(−t) and that y ∈ {−1, +1}
If we have several points then we define the likelihood of w w.r.t. the entire dataset as P[{y_i} | {x_i}, w]
Usually we assume data points are independent, so we use the product rule to get P[{y_i} | {x_i}, w] = ∏_i P[y_i | x_i, w]
Data might not actually be independent, e.g. my visiting a website may not be independent of my friend visiting the same website if I have found an offer on that website and posted about it on social media. However, we often assume independence nevertheless, to make life simple
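A small sketch of the dataset likelihood under the independence assumption, using the sigmoidal PMF above; the toy data and model are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def likelihood(w, X, y):
    # P[y_i | x_i, w] = sigma(y_i * w.x_i); independence gives a product over points
    return np.prod(sigmoid(y * (X @ w)))

# Hypothetical toy data: labels in {-1, +1}
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([+1, -1, +1])
w = np.array([2.0, -1.0])
print(likelihood(w, X, y))   # a number in (0, 1); larger means the labels look more likely
```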
Maximum Likelihood 9
The expression P[y | x, w] tells us whether the model w thinks the label y is a very likely label given the feature vector x, or not likely at all!
Similarly, P[{y_i} | {x_i}, w] tells us how likely the model thinks the labels {y_i} are, given the feature vectors {x_i}
Since we trust our training data as clean and representative of reality, we should look for a w that considers the training labels to be very likely
E.g. in the RecSys example, let y = +1 if the customer makes a purchase and y = −1 otherwise. If we trust that these labels do represent reality, i.e. what our customers like and dislike, then we should learn a model accordingly
Totally different story if we mistrust our data – there are different techniques for that
Maximum Likelihood Estimator (MLE): the model that gives the highest likelihood to the observed labels, i.e. ŵ_MLE = argmax_w ∏_i P[y_i | x_i, w]
Logistic Regression 10
Suppose we learn a model ŵ as the MLE while using the sigmoidal map, i.e. ŵ = argmax_w ∏_i σ(y_i · wᵀx_i)
Working with products can be numerically unstable
Since each term σ(y_i · wᵀx_i) ∈ (0, 1), a product of several such values can be extremely small
Solution: take logarithms and exploit the fact that log is strictly increasing, so maximizing the log-likelihood gives the same model
ŵ = argmax_w ∑_i log σ(y_i · wᵀx_i) = argmin_w ∑_i log(1 + exp(−y_i · wᵀx_i))
The minimized quantity is also called the negative log-likelihood
Thus, the logistic loss function pops out automatically when we try to learn a model that maximizes the likelihood function
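A sketch of the resulting negative log-likelihood (the logistic loss) and a plain gradient-descent MLE; the toy dataset, step size and iteration count below are arbitrary choices for illustration, not from the slides.

```python
import numpy as np

def logistic_loss(w, X, y):
    # Negative log-likelihood: sum_i log(1 + exp(-y_i * w.x_i))
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

def logistic_loss_grad(w, X, y):
    # Gradient of the NLL: sum_i -y_i * x_i / (1 + exp(y_i * w.x_i))
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))
    return X.T @ coeffs

# Minimal gradient-descent MLE on hypothetical data with labels in {-1, +1}
X = np.array([[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, -0.4]])
y = np.array([+1, +1, -1, -1])
w = np.zeros(2)
for _ in range(500):
    w -= 0.1 * logistic_loss_grad(w, X, y)
print(w, logistic_loss(w, X, y))   # loss decreases as the labels become more likely under w
```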
Probabilistic Multiclassification 11
Just as we had the Bernoulli distributions over the support {0, 1}, if the support instead has K elements, then the distributions are called either Multinoulli distributions or Categorical distributions
Suppose we have K classes; then for every data point we would have to output a PMF over the support [K]
To specify a multinoulli distribution over K labels, we need to specify K non-negative numbers that add up to one
Popular way: assign a positive score to all classes and normalize so that the scores form a proper probability distribution
Common trick to convert any score to a positive score – exponentiate!!
Learn K models w_1, …, w_K; given a point x, compute a score w_kᵀx for every class k
Assign a positive score exp(w_kᵀx) per class
Normalize to obtain a PMF: P[y = k | x, {w_k}] = exp(w_kᵀx) / ∑_j exp(w_jᵀx) for any k ∈ [K]
Likelihood in this case is P[y | x, {w_k}] = exp(w_yᵀx) / ∑_j exp(w_jᵀx)
Log-likelihood in this case is log P[y | x, {w_k}] = w_yᵀx − log ∑_j exp(w_jᵀx)
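A minimal sketch of the exponentiate-and-normalize recipe; the 3-class model W and point x are made up, and the max-subtraction line is a common numerical-stability trick rather than something stated on the slide.

```python
import numpy as np

def softmax_pmf(W, x):
    # W holds one row w_k per class; exponentiate scores, then normalize
    scores = W @ x                      # one real-valued score per class
    scores -= scores.max()              # numerical-stability trick (does not change the PMF)
    positive = np.exp(scores)           # positive score per class
    return positive / positive.sum()    # a valid PMF over the K classes

# Hypothetical 3-class model in 2 dimensions
W = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 1.0]])
x = np.array([0.3, 0.7])
print(softmax_pmf(W, x))   # sums to 1; argmax gives the mode prediction
```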
Softmax Regression 12
If we now want to learn the MLE, we would have to find {ŵ_k} = argmax ∏_i P[y_i | x_i, {w_k}], where P[y = k | x, {w_k}] = exp(w_kᵀx) / ∑_j exp(w_jᵀx)
Using the negative log-likelihood for numerical stability: argmin_{w_1,…,w_K} ∑_i ( log ∑_j exp(w_jᵀx_i) − w_{y_i}ᵀx_i )
Note: this is nothing but the softmax loss function we saw earlier, also known as the cross-entropy loss function here
Reason for the name: it corresponds to something known as the cross entropy between the PMF given by the model and the true label of the data point
I may find other ways to assign a PMF over [K] to each data point by choosing some function other than exp, e.g. ReLU, to assign positive scores, i.e. let s_k = ReLU(w_kᵀx) and let P[y = k | x] = s_k / ∑_j s_j, and then proceed to obtain an MLE. Something similar to this is indeed used in deep learning
It should be noted that this is not the only way to do probabilistic multiclassification. It is just that this way is simple to understand and implement, and hence popular
I could also do DT and invoke the "probability as proportions" interpretation to assign to a test data point a PMF that simply gives the proportion of each label in the leaf of that data point!!
However, be warned that generating a PMF using a DT need not necessarily be an MLE since we have not explicitly maximized any likelihood function
General Recipe for MLE Algorithms 13
Given a problem with label set Y, find a way to map data features x to PMFs P[y | x, θ] with support Y
The notation θ captures the parameters in the model (e.g. vectors, bias terms)
For binary classification, Y = {−1, +1} and P[y | x, w] = σ(y · wᵀx)
For multiclassification, Y = [K] and P[y = k | x, {w_k}] = exp(w_kᵀx) / ∑_j exp(w_jᵀx)
The function P[{y_i} | {x_i}, θ] is often called the likelihood function
The function −∑_i log P[y_i | x_i, θ] is called the negative log-likelihood function
Given data {(x_i, y_i)}, find the model parameters that maximize the likelihood function, i.e. that think the training labels are very likely
Probabilistic Regression 14
In order to perform probabilistic regression, I have to assign a label distribution over all of ℝ for every data point, using a PDF
Suppose I decide to do that using a Gaussian distribution – need to decide on a mean and a variance
Popular choice: let the mean be wᵀx and the variance be σ², i.e. P[y | x, w] = N(y; wᵀx, σ²)
We can also choose a different variance for every data point – more complicated
Likelihood function w.r.t. a data point then becomes P[y | x, w] = (1/√(2πσ²)) · exp(−(y − wᵀx)² / (2σ²))
Negative log-likelihood w.r.t. a set of data points: ∑_i [ ½ log(2πσ²) + (y_i − wᵀx_i)² / (2σ²) ]
But apart from the first term and the scaling factor, both of which are constants and do not depend on the model w, the rest is just the least-squares loss term!
The MLE with respect to the Gaussian likelihood indeed minimizes the least-squares loss: min_{w ∈ ℝᵈ} ∑_i (y_i − wᵀx_i)²
Also note that if we set all the variances to the same value, then it does not matter which σ we choose – we will get the same model
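A sketch of the Gaussian negative log-likelihood, showing that up to constants and the 1/(2σ²) scaling it is exactly the least-squares loss; the toy data below are made up.

```python
import numpy as np

def gaussian_nll(w, X, y, sigma=1.0):
    # -log N(y_i; w.x_i, sigma^2) summed over the data; apart from the constant
    # term and the 1/(2 sigma^2) scaling, this is the least-squares loss
    residuals = y - X @ w
    const = 0.5 * np.log(2 * np.pi * sigma**2) * len(y)
    return const + np.sum(residuals**2) / (2 * sigma**2)

# Hypothetical 1-feature regression data
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 1.9, 3.2])
print(gaussian_nll(np.array([1.0]), X, y))
```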
Probabilistic Regression 15
Be warned though – the σ we chose will start mattering the moment we add regularization! It is just that in these simple cases it does not matter. σ is usually treated like a hyperparameter and tuned.
Suppose I decide to use a Laplacian distribution instead and choose mean wᵀx and scale b, i.e. P[y | x, w] = (1/(2b)) · exp(−|y − wᵀx| / b)
Likelihood function w.r.t. a data point then becomes this Laplacian density
Negative log-likelihood w.r.t. a set of data points: ∑_i [ log(2b) + |y_i − wᵀx_i| / b ]
Thus, if we change the likelihood function to use the Laplacian distribution instead, the MLE ends up minimizing the absolute loss!
As before, it does not matter which scale b we choose
So I am a bit confused. All MLEs (classification/regression) demand a model that places maximum probability on the true label. Why don't we just ask the model to predict the true label itself? That is like asking the PMF/PDF to place probability 1 on the true label and 0 everywhere else – why can't we do just that?
For the same reason we needed slack variables in CSVM – to allow for the fact that in realistic situations, no linear model may be able to do what we would ideally like. In probabilistic ML, allowing the model to place a less-than-1 probability on the true label is much like a slack – it allows us to learn good models even if not perfect ones
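The analogous sketch for the Laplacian likelihood, whose negative log reduces to the absolute (L1) loss up to constants; again the data are made up.

```python
import numpy as np

def laplace_nll(w, X, y, b=1.0):
    # -log Laplace(y_i; w.x_i, b) summed over the data; up to the constant term
    # and the 1/b scaling, this is the absolute loss
    residuals = np.abs(y - X @ w)
    const = np.log(2 * b) * len(y)
    return const + np.sum(residuals) / b

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 1.9, 3.2])
print(laplace_nll(np.array([1.0]), X, y))
```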
Probabilistic Regularization?? 16
We have seen that MLE often reduces to loss minimization, e.g. logistic regression/least-squares regression, but without regularization terms
Even probabilistic methods can do regularization, by way of priors
Recall: regularization basically tells us which kinds of models we prefer
L2 regularization means we prefer models with small L2 norm
L1 regularization means we prefer models with small L1 norm/sparse models
In the language of probability, the most direct way of specifying such a preference is by specifying a probability distribution itself
Prior: a probability distribution over all possible models
Just like we usually decide regularization before seeing any data, the prior distribution also does not consider/condition on any data
But our models are vectors, right? Can we have probability distributions over vectors as well?
Of course we can. But first, let us see the basic operations in a toy 1D setting before getting into the complications of vector-valued r.v.s
Can you Guess the Mean? 17
There is a Gaussian with unknown mean μ but known variance σ² (for sake of simplicity) from which we receive independent samples x_1, …, x_n
Can we estimate the "model" μ from these samples?
Likelihood function: for a candidate model μ and sample x_i, P[x_i | μ] = N(x_i; μ, σ²)
MLE: μ_MLE = argmax_μ ∏_i P[x_i | μ] = (1/n) ∑_i x_i, the sample mean
Suppose we believe (e.g. someone tells us), even before the samples have been presented, that μ definitely lies in the interval [0, 2] (but could otherwise be any value within that interval)
In this case we are said to have a prior belief, or simply a prior, on the models μ – in this case the uniform prior UNIF([0, 2]). This means that unless we see any data to make us believe otherwise, we will think μ ∈ [0, 2].
What happens when we do see some data, namely the actual samples from the distribution?
We use the samples and the rules of probability to update our beliefs about what μ can and cannot be.
Let us see how to do this
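A small sketch of this setup: for a Gaussian with known variance, the MLE of μ is simply the sample mean. The true mean, variance and sample size below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma = 1.3, 1.0                    # hypothetical ground truth
samples = rng.normal(true_mu, sigma, size=50)

# MLE for the mean of a Gaussian with known variance:
#   argmax_mu  prod_i N(x_i; mu, sigma^2)  =  (1/n) sum_i x_i
mu_mle = samples.mean()
print(mu_mle)
```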
Posterior 18
Before we see any data, we have a prior belief P[μ] on the models
It tells us which models are more likely/less likely before we have seen data
Note that when we say P[μ] or P[μ | x_1, …, x_n], we mean probability density and not probability mass, since μ is a continuous r.v.
Then we see data and we wish to update our belief. Basically we want to find out P[μ | x_1, …, x_n]
This quantity has a name: posterior belief or simply posterior
It tells us which models are more likely/less likely after we have seen data
Bayes Rule: P[μ | x_1, …, x_n] = P[x_1, …, x_n | μ] · P[μ] / P[x_1, …, x_n]
Samples are independent: P[x_1, …, x_n | μ] = ∏_i P[x_i | μ]
Law of total probability: P[x_1, …, x_n] = ∫ P[x_1, …, x_n | μ] · P[μ] dμ
With the prior P[μ] = UNIF([0, 2]):
if μ ∈ [0, 2], then P[μ | x_1, …, x_n] ∝ ∏_i N(x_i; μ, σ²)
else if μ ∉ [0, 2], then P[μ | x_1, …, x_n] = 0
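A sketch that evaluates this posterior numerically on a grid under the UNIF([0, 2]) prior, using Bayes rule, independence and the law of total probability as above; the grid resolution and the simulated data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
samples = rng.normal(1.3, sigma, size=20)    # hypothetical data

# Unnormalized posterior on a grid of candidate mu values.
# Prior UNIF([0, 2]): constant density 0.5 on [0, 2], zero outside.
grid = np.linspace(-1.0, 3.0, 401)
prior = np.where((grid >= 0.0) & (grid <= 2.0), 0.5, 0.0)
# log of prod_i N(x_i; mu, sigma^2), dropping constants that cancel after normalization
log_lik = np.array([np.sum(-0.5 * ((samples - mu) / sigma) ** 2) for mu in grid])
unnorm_post = prior * np.exp(log_lik - log_lik.max())
posterior = unnorm_post / np.trapz(unnorm_post, grid)   # normalize (law of total probability)

print(grid[np.argmax(posterior)])   # the most likely model after seeing data (MAP, next slide)
```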
Maximum a Posteriori (MAP) Estimate 19
Just as MLE gave us the model μ_MLE = argmax_μ P[x_1, …, x_n | μ], MAP gives us the model μ_MAP = argmax_μ P[μ | x_1, …, x_n]
Thus, MAP returns the model that becomes the most likely one after we have seen some data
Note: the posterior probability (density) of some models may be larger than their prior probability (density), i.e. after seeing data those models seem more likely; for other models it may go down, i.e. they seem less likely after seeing the data
Note: however, if the prior probability (density) of some model is 0, the posterior probability (density) has to be zero as well
It is better to choose priors that do not completely exclude some models by giving them 0 probability (as we did)
Indeed! For example, if we were wrong and μ was actually not in [0, 2], then no matter how many samples we see, we will never estimate μ correctly!!
True! Even in general, if your priors are bad, or too strong, then you may end up getting funny models as a result of doing MAP estimation
Warning: do not read too much into the names likelihood, prior, posterior. All of them tell us how likely something is, given or not given something else
MAP vs Regularization 20
μ_MAP = argmax_μ P[μ | x_1, …, x_n] = argmax_μ P[x_1, …, x_n | μ] · P[μ]
Taking negative log likelihoods on both sides: μ_MAP = argmin_μ { −∑_i log P[x_i | μ] − log P[μ] }
However, P[μ] is constant for μ ∈ [0, 2] and 0 otherwise (−log P[μ] = ∞ there)
Thus μ_MAP = argmin_μ ∑_i (x_i − μ)² / (2σ²)  s.t.  μ ∈ [0, 2]
Thus, even MAP solutions can correspond to optimization problems!
In this case, what was the prior became a constraint
In general, the prior becomes a regularizer
MAP vs Regularization 21
Consider the same problem as before, but with a different prior
This time we do not believe μ must have been in the interval [0, 2]; we hold a much milder prior that μ is not too large in magnitude
A good way to express this is to use a Gaussian prior, e.g. P[μ] = N(0, τ²)
MAP: μ_MAP = argmax_μ P[μ | x_1, …, x_n] = argmin_μ { ∑_i (x_i − μ)² / (2σ²) + μ² / (2τ²) }
Thus, a Gaussian prior gave us L2 regularization!
Similarly, had we used a Laplacian prior, we would have obtained L1 regularization instead
Note: τ effectively dictates the regularization constant – the prior is not useless!!
The regularization constant is dictated by the strength of the prior. Be careful not to have strong priors (uninformed strong opinions are bad in real life too)
Note: this is basically ridge regression except in one dimension!!
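A sketch of the 1D MAP estimate under a Gaussian prior N(0, τ²), showing the closed-form ridge-style shrinkage of the sample mean; the value of τ, the noise level and the simulated data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                  # known noise standard deviation
samples = rng.normal(1.3, sigma, size=20)    # hypothetical data

# Gaussian prior mu ~ N(0, tau^2). Minimizing the negative log posterior
#   sum_i (x_i - mu)^2 / (2 sigma^2) + mu^2 / (2 tau^2)
# gives a closed-form MAP estimate: a shrunken (ridge-style) sample mean.
tau = 0.5                                    # prior std; controls the regularization strength
n = len(samples)
mu_mle = samples.mean()
mu_map = (n / sigma**2) * mu_mle / (n / sigma**2 + 1 / tau**2)
print(mu_mle, mu_map)                        # MAP is pulled towards the prior mean 0
```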