Ds 6
Extra Class
July 03, 2024 (Wednesday), 4PM, L18
Thus, the logistic loss function pops out automatically when we try
to learn a model that maximizes the likelihood function
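To make this concrete, here is a brief sketch of that derivation (assuming labels $y_i \in \{-1, +1\}$ and the sigmoid model $\mathbb{P}[y \mid \mathbf{x}, \mathbf{w}] = 1/(1 + \exp(-y\,\mathbf{w}^\top \mathbf{x}))$; this is the usual convention, though the exact notation may differ from the earlier slides):

\[
\max_{\mathbf{w}} \prod_{i=1}^n \mathbb{P}[y_i \mid \mathbf{x}_i, \mathbf{w}]
\;\Leftrightarrow\;
\min_{\mathbf{w}} \sum_{i=1}^n -\log \mathbb{P}[y_i \mid \mathbf{x}_i, \mathbf{w}]
= \min_{\mathbf{w}} \sum_{i=1}^n \log\bigl(1 + \exp(-y_i\,\mathbf{w}^\top \mathbf{x}_i)\bigr),
\]

i.e. maximizing the likelihood is the same as minimizing the logistic loss.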
Probabilistic Multiclassification

Just as we had the Bernoulli distributions over a support with two elements, if the support instead has $K$ elements, then the distributions are called either Multinoulli distributions or Categorical distributions.
Suppose we have $K$ classes; then for every data point we would have to output a PMF over the support $\{1, \dots, K\}$. (To specify a multinoulli distribution over $K$ outcomes, we need to specify $K$ non-negative numbers that add up to one.)

Popular way: assign a positive score to all classes/labels, and normalize so that the scores form a proper probability distribution.

Common trick to convert any score into a positive score – exponentiate!!

Learn models $\mathbf{w}_1, \dots, \mathbf{w}_K$; given a point $\mathbf{x}$, assign a positive score per class, $s_k = \exp(\mathbf{w}_k^\top \mathbf{x})$.

Normalize to obtain a PMF: for any $k \in \{1, \dots, K\}$, $\hat{\mathbb{P}}[y = k \mid \mathbf{x}] = \frac{\exp(\mathbf{w}_k^\top \mathbf{x})}{\sum_{j=1}^K \exp(\mathbf{w}_j^\top \mathbf{x})}$.

Likelihood in this case is $\prod_{i=1}^n \hat{\mathbb{P}}[y_i \mid \mathbf{x}_i]$.

Log-likelihood in this case is $\sum_{i=1}^n \log \hat{\mathbb{P}}[y_i \mid \mathbf{x}_i] = \sum_{i=1}^n \Bigl(\mathbf{w}_{y_i}^\top \mathbf{x}_i - \log \sum_{j=1}^K \exp(\mathbf{w}_j^\top \mathbf{x}_i)\Bigr)$.
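As an illustration, here is a minimal numerical sketch of this construction (assuming NumPy; the variable names and toy data are my own, not from the slides):

import numpy as np

# Toy setup: n = 4 points in d = 3 dimensions, K = 3 classes (all values made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                  # data points x_i as rows
y = np.array([0, 2, 1, 0])                   # true labels y_i in {0, ..., K-1}
W = rng.normal(size=(3, 3))                  # one model w_k per class, stored as columns

scores = X @ W                               # raw scores w_k^T x_i for every point and class
scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability (PMF unchanged)
P = np.exp(scores)                           # exponentiate so every score is positive
P /= P.sum(axis=1, keepdims=True)            # normalize each row into a proper PMF

log_likelihood = np.log(P[np.arange(len(y)), y]).sum()  # log-likelihood of the true labels
print(P.round(3))
print(log_likelihood)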
Softmax Regression
I may find other ways to assign a PMF over $\{1, \dots, K\}$ to each data point by choosing some function other than $\exp$, e.g. ReLU, to assign positive scores, i.e. let $s_k = \max(\mathbf{w}_k^\top \mathbf{x}, 0)$ and let $\hat{\mathbb{P}}[y = k \mid \mathbf{x}] = \frac{s_k}{\sum_{j=1}^K s_j}$ (a short numerical sketch follows below).
If we now want to learn the models via MLE, we would have to write down the likelihood under this choice and then proceed to obtain an MLE. Something similar to this is indeed used in deep learning.
$\min_{\mathbf{w} \in \mathbb{R}^d} \; -\sum_{i=1}^n \log \hat{\mathbb{P}}[y_i \mid \mathbf{x}_i, \mathbf{w}]$ (i.e. minimize the negative log-likelihood)
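For contrast, a small sketch of the ReLU-based alternative mentioned above (again assuming NumPy; the variable names and toy scores are my own, and the small constant eps is my own safeguard against an all-zero row rather than something from the slides):

import numpy as np

def relu_pmf(scores, eps=1e-8):
    # Use ReLU(.) = max(., 0) instead of exp(.) as the positivity-enforcing function
    pos = np.maximum(scores, 0.0) + eps        # eps keeps each row sum strictly positive
    return pos / pos.sum(axis=1, keepdims=True)

scores = np.array([[2.0, -1.0, 0.5],
                   [-3.0, -0.5, 1.0]])         # made-up raw scores w_k^T x_i for two points
print(relu_pmf(scores))                        # each row is a valid PMF over the 3 classes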
Probabilistic Regression
Be warned though – the noise scale parameter we chose will start mattering the moment we add regularization! It is just that in these simple cases it does not matter. It is usually treated like a hyperparameter and tuned.
Suppose I decide to use a Laplacian distribution instead and choose its location to be $\mathbf{w}^\top \mathbf{x}$ and its scale to be $b$, i.e. $\hat{\mathbb{P}}[y \mid \mathbf{x}, \mathbf{w}] = \frac{1}{2b} \exp\!\left(-\frac{|y - \mathbf{w}^\top \mathbf{x}|}{b}\right)$.

The likelihood function w.r.t. a data point $(\mathbf{x}_i, y_i)$ then becomes $\frac{1}{2b} \exp\!\left(-\frac{|y_i - \mathbf{w}^\top \mathbf{x}_i|}{b}\right)$.

MLE: maximize this likelihood over all the data points (a worked derivation follows below).

(So I am a bit confused. All MLEs (classification/regression) demand a model that places maximum probability on the true label. Why don't we just ask the model to predict the true label itself?)
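A brief worked sketch of the resulting MLE (assuming $n$ i.i.d. samples and treating the scale $b$ as a fixed constant, as with the scale parameter discussed above):

\[
\max_{\mathbf{w}} \prod_{i=1}^n \frac{1}{2b} \exp\!\left(-\frac{|y_i - \mathbf{w}^\top \mathbf{x}_i|}{b}\right)
\;\Leftrightarrow\;
\max_{\mathbf{w}} \sum_{i=1}^n \left(-\log 2b - \frac{|y_i - \mathbf{w}^\top \mathbf{x}_i|}{b}\right)
\;\Leftrightarrow\;
\min_{\mathbf{w}} \sum_{i=1}^n |y_i - \mathbf{w}^\top \mathbf{x}_i|,
\]

i.e. the Laplacian likelihood gives us the absolute-error loss, just as a Gaussian likelihood would give the squared-error loss.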
Suppose we believe (e.g. someone tells us), even before the samples have been presented, that the model $\mathbf{w}$ definitely lies in a certain interval (but otherwise could be any value within that interval).

We use the samples and the rules of probability to update our beliefs about what $\mathbf{w}$ can and cannot be.
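As an illustration (for a one-dimensional model $w$, and with the interval $[-1, 1]$ being my own made-up choice rather than one from the slides), such a belief corresponds to a uniform prior density:

\[
\mathbb{P}(w) = \begin{cases} \tfrac{1}{2} & \text{if } w \in [-1, 1] \\ 0 & \text{otherwise,} \end{cases}
\]

so every value inside the interval is considered equally plausible, and every value outside it is ruled out.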
Let us see how to do this
Posterior
Before we see any data, we have a prior belief $\mathbb{P}(\mathbf{w})$ on the models. It tells us which models are more likely/less likely before we have seen data.

(Note that when we say $\mathbb{P}(\mathbf{w})$ or $\mathbb{P}(\mathbf{w} \mid \text{data})$, we mean probability density and not probability mass, since $\mathbf{w}$ is a continuous r.v.)

Then we see data and we wish to update our belief. Basically we want to find out $\mathbb{P}(\mathbf{w} \mid \text{data})$.

This quantity has a name: the posterior belief, or simply the posterior. It tells us which models are more likely/less likely after we have seen data.

Bayes Rule

Samples are independent
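The update referred to above is just Bayes rule applied to models and data; a brief sketch using the notation from the previous slide, with the independence of the samples made explicit:

\[
\mathbb{P}(\mathbf{w} \mid \text{data})
= \frac{\mathbb{P}(\text{data} \mid \mathbf{w})\,\mathbb{P}(\mathbf{w})}{\mathbb{P}(\text{data})}
\;\propto\; \mathbb{P}(\mathbf{w}) \prod_{i=1}^n \mathbb{P}(y_i \mid \mathbf{x}_i, \mathbf{w}),
\]

where the product form of the likelihood uses the fact that the samples are independent.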