
Probabilistic ML

Extra Class: July 03, 2024 (Wednesday), 4PM, L18
In lieu of the missed class on June 14, 2024
Timing and venue same for both … only the day of the week is different
Probabilistic ML 3
Till now we have looked at ML techniques that assign a single label to every data point (the label is from the set {−1, +1} for binary classification, from [K] for multiclass classification with K classes, from ℝ for regression, etc.)
Examples include DT, linear models
Probabilistic ML techniques, given a data point, do not output a single label; they instead output a distribution over all possible labels
For binary classification, output a PMF over {−1, +1}; for multiclassification, output a PMF over [K]; for regression, output a PDF over ℝ
The probability mass/density of a label in the output PMF/PDF indicates how likely the ML model thinks that label is the correct one for that data point
Note: the algorithm is allowed to output a possibly different PMF/PDF for every data point. However, the support of these PMFs/PDFs is always the set of all possible labels (i.e., even very unlikely labels are included in the support)
Probabilistic ML for Classification 4
Say we have somehow learnt a PML model which, for a data point x, gives us a PMF P[y | x, w] over the set of all possible labels – {−1, +1} for binary classification, [K] for multiclassification
Note that the PMF is conditioned on x and w, which are not r.v.s at this moment but are nevertheless fixed, since we are looking at the data point x using the model w
Exactly! Suppose we have three classes and, for a data point, the ML model gives us a PMF that puts only 40% of its mass on the second class. The second class does win, being the mode, but the model seems not very certain about this prediction (only 40% confidence).
True! Suppose another model gives us a PMF on the same data point where the second class still wins, but this time the model is very certain about this prediction (since it is giving a very high 85% confidence to it).
We may use this PMF in very creative ways
Predict the mode of this PMF if someone wants a single label predicted
May use the median/mean as well – Bayesian ML exploits this possibility
Use the PMF to find out if the ML model is confident about its prediction or totally confused about which label is the correct one!
May use the variance of the PMF to find this out as well (low variance = very confident prediction, high variance = less confident/confused prediction)
I could not agree more. However, in many ML applications (e.g. active learning), if we find that the model is making very unsure predictions, we can switch to another model or just ask a human to step in. Thus, confidence info can be used fruitfully
Warning! Just because a prediction is made with more confidence does not mean it must be correct!
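As an illustration, here is a minimal sketch (Python/NumPy, not from the slides) of reading off the mode, mean and variance of a predicted PMF to judge confidence. The two PMFs are made-up examples; only the 40% and 85% figures come from the discussion above, and treating the class index as a numeric label (to take a mean/variance) is an assumption for illustration.

```python
import numpy as np

# Two illustrative PMFs over three classes (off-mode masses are made up;
# only the 40% and 85% figures appear in the slide discussion).
pmf_unsure  = np.array([0.30, 0.40, 0.30])
pmf_certain = np.array([0.05, 0.85, 0.10])

def summarize(pmf):
    classes = np.arange(1, len(pmf) + 1)              # class indices 1..K
    mode = classes[np.argmax(pmf)]                    # single-label prediction
    mean = np.sum(classes * pmf)                      # mean label (Bayesian ML may use this)
    var  = np.sum(((classes - mean) ** 2) * pmf)      # low variance = confident prediction
    return mode, mean, var

print(summarize(pmf_unsure))    # mode 2, high variance  -> unsure prediction
print(summarize(pmf_certain))   # mode 2, low variance   -> confident prediction
```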
Probabilistic Binary Classification 5
Find a way to map every data point x to a Rademacher distribution
Another way of saying this: map every data point x to a probability value p(x) ∈ [0, 1]
Will give us a PMF, i.e. P[y = +1 | x] = p(x) and P[y = −1 | x] = 1 − p(x)
If using the mode predictor, i.e. predict +1 if p(x) > 0.5 and −1 otherwise, then this PMF will give us the correct label only if the following happens
When the true label of x is +1, p(x) > 0.5, in other words P[y = +1 | x] > P[y = −1 | x]
When the true label of x is −1, p(x) < 0.5, in other words P[y = −1 | x] > P[y = +1 | x]
Note that if p(x) = 0.5, it means the ML model is totally confused about the label of x
Data points for whom p(x) = 0.5 are on the decision boundary!!
Of course, as usual we want a healthy margin
If the true label of the data point is +1, then we want p(x) ≈ 1, i.e. p(x) ≫ 0.5
If the true label of the data point is −1, then we want p(x) ≈ 0, i.e. p(x) ≪ 0.5
Probabilistic Binary Classification 6
How to map feature vectors x to probability values p(x)?
Could treat it as a regression problem since probability values are real numbers after all
Will need to modify the training set a bit to do this (basically change all −1 labels to 0, since we want p(x) ≈ 0 if the label is −1 and p(x) ≈ 1 if it is +1)
Could use DT etc. to solve this regression problem
Using linear models to do this presents a challenge
If we learn a linear model w using ridge regression, it may happen that for some data point x we have wᵀx > 1 or else wᵀx < 0
That won't make sense in this case – not a valid PMF!!
DT doesn't suffer from this problem since it always predicts a value in [0, 1]
DT uses averages of a bunch of train labels to obtain test predictions – the average of a bunch of 0s and 1s is always a value in the range [0, 1]
So can we never use linear models to do probabilistic ML?
We can – one way to solve the problem of using linear methods to map x to a probability is called logistic regression – we have seen it before
Ah! The name makes sense now – logistic regression is used to solve binary classification problems, but since it does so by mapping x to a real value, experts thought it would be cool to have the term "regression" in the name
Yes, but there is a trick involved. Let us take a look at it
Sigmoid Function 7
Trick: learn a linear model w and map x ↦ σ(wᵀx), where σ(t) = 1/(1 + exp(−t)) is the sigmoid function
May have an explicit/hidden bias term as well
This will always give us a value in the range (0, 1), hence a valid PMF
Note that σ(t) > 0.5 if t > 0 and σ(t) < 0.5 if t < 0, and also that σ(t) → 1 as t → ∞ and σ(t) → 0 as t → −∞
This means that our sigmoidal map will predict +1 if wᵀx > 0 and −1 if wᵀx < 0
There are several other such wrapper/squashing/link/activation functions which do similar jobs, e.g. tanh, ramp, ReLU
Nice! So I want to learn a linear model w such that once I do this sigmoidal map, data points with label +1 get mapped to a probability value close to 1, whereas data points with label −1 get mapped to a probability value close to 0
How do I learn such a model w?
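A minimal sketch of the sigmoidal map described above, assuming the standard sigmoid σ(t) = 1/(1 + exp(−t)); the model w and data point x below are made-up values for illustration.

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + exp(-t)); always lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def rademacher_pmf(w, x):
    # p = P[y = +1 | x, w], 1 - p = P[y = -1 | x, w]
    p = sigmoid(np.dot(w, x))
    return {+1: p, -1: 1.0 - p}

w = np.array([1.0, -2.0])    # hypothetical linear model (no bias term)
x = np.array([0.5, 0.1])
print(rademacher_pmf(w, x))  # w.x = 0.3 > 0, so the mode predictor outputs +1
```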
Likelihood 8
Suppose we have a linear model w (assume the bias is hidden for now)
Given a data point x and w, the use of the sigmoidal map gives us a Rademacher PMF P[y | x, w]
The probability that this PMF gives to the correct label y, i.e. P[y | x, w], is called the likelihood of this model with respect to this data point
It is easy to show that P[y | x, w] = σ(y · wᵀx)
Hint: use the fact that 1 − σ(t) = σ(−t) and that y ∈ {−1, +1}
If we have several points then we define the likelihood of w w.r.t. the entire dataset as P[{y_i} | {x_i}, w]
Usually we assume data points are independent, so we use the product rule to get P[{y_i} | {x_i}, w] = ∏_i P[y_i | x_i, w]
Data might not actually be independent, e.g. my visiting a website may not be independent of my friend visiting the same website if I have found an offer on that website and posted about it on social media. However, we often assume independence nevertheless, to make life simple
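A small sketch of the dataset likelihood under the independence assumption, using the sigmoidal PMF above; the toy data and model are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def likelihood(w, X, y):
    # P[y_i | x_i, w] = sigma(y_i * w.x_i); independence gives a product over points
    return np.prod(sigmoid(y * (X @ w)))

# Hypothetical toy data: labels in {-1, +1}
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([+1, -1, +1])
w = np.array([2.0, -1.0])
print(likelihood(w, X, y))   # a number in (0, 1); larger means the labels look more likely
```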
Maximum Likelihood 9
The expression P[y | x, w] tells us whether the model w thinks the label y is a very likely label given the feature vector x, or not likely at all!
Similarly, P[{y_i} | {x_i}, w] tells us how likely the model thinks the labels {y_i} are, given the feature vectors {x_i}
Since we trust our training data as clean and representative of reality, we should look for a w that considers the training labels to be very likely
E.g. in the RecSys example, let y = +1 if the customer makes a purchase and y = −1 otherwise. If we trust that these labels do represent reality, i.e. what our customers like and dislike, then we should learn a model accordingly
Totally different story if we mistrust our data – there are different techniques for that
Maximum Likelihood Estimator (MLE): the model that gives the highest likelihood to the observed labels, i.e. ŵ_MLE = argmax_w ∏_i P[y_i | x_i, w]
Logistic Regression 10
Suppose we learn a model ŵ as the MLE while using the sigmoidal map, i.e. ŵ = argmax_w ∏_i σ(y_i · wᵀx_i)
Working with products can be numerically unstable
Since each term σ(y_i · wᵀx_i) ∈ (0, 1), a product of several such values can be extremely small
Solution: take logarithms and exploit the fact that log is strictly increasing, so maximizing the log-likelihood gives the same model
ŵ = argmax_w ∑_i log σ(y_i · wᵀx_i) = argmin_w ∑_i log(1 + exp(−y_i · wᵀx_i))
The minimized quantity is also called the negative log-likelihood
Thus, the logistic loss function pops out automatically when we try to learn a model that maximizes the likelihood function
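A sketch of the resulting negative log-likelihood (the logistic loss) and a plain gradient-descent MLE; the toy dataset, step size and iteration count below are arbitrary choices for illustration, not from the slides.

```python
import numpy as np

def logistic_loss(w, X, y):
    # Negative log-likelihood: sum_i log(1 + exp(-y_i * w.x_i))
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

def logistic_loss_grad(w, X, y):
    # Gradient of the NLL: sum_i -y_i * x_i / (1 + exp(y_i * w.x_i))
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))
    return X.T @ coeffs

# Minimal gradient-descent MLE on hypothetical data with labels in {-1, +1}
X = np.array([[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, -0.4]])
y = np.array([+1, +1, -1, -1])
w = np.zeros(2)
for _ in range(500):
    w -= 0.1 * logistic_loss_grad(w, X, y)
print(w, logistic_loss(w, X, y))   # loss decreases as the labels become more likely under w
```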
Probabilistic Multiclassification 11
Just as we had the Bernoulli distributions over the support {0, 1}, if the support instead has K elements, then the distributions are called either Multinoulli distributions or Categorical distributions
Suppose we have K classes; then for every data point we would have to output a PMF over the support [K]
To specify a multinoulli distribution over K labels, we need to specify K non-negative numbers that add up to one
Popular way: assign a positive score to all classes and normalize so that the scores form a proper probability distribution
Common trick to convert any score to a positive score – exponentiate!!
Learn K models w_1, …, w_K; given a point x, compute a score w_kᵀx for every class k
Assign a positive score exp(w_kᵀx) per class
Normalize to obtain a PMF: P[y = k | x, {w_k}] = exp(w_kᵀx) / ∑_j exp(w_jᵀx) for any k ∈ [K]
Likelihood in this case is P[y | x, {w_k}] = exp(w_yᵀx) / ∑_j exp(w_jᵀx)
Log-likelihood in this case is log P[y | x, {w_k}] = w_yᵀx − log ∑_j exp(w_jᵀx)
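A minimal sketch of the exponentiate-and-normalize recipe; the 3-class model W and point x are made up, and the max-subtraction line is a common numerical-stability trick rather than something stated on the slide.

```python
import numpy as np

def softmax_pmf(W, x):
    # W holds one row w_k per class; exponentiate scores, then normalize
    scores = W @ x                      # one real-valued score per class
    scores -= scores.max()              # numerical-stability trick (does not change the PMF)
    positive = np.exp(scores)           # positive score per class
    return positive / positive.sum()    # a valid PMF over the K classes

# Hypothetical 3-class model in 2 dimensions
W = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 1.0]])
x = np.array([0.3, 0.7])
print(softmax_pmf(W, x))   # sums to 1; argmax gives the mode prediction
```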
Softmax Regression 12
If we now want to learn the MLE, we would have to find {ŵ_k} = argmax ∏_i P[y_i | x_i, {w_k}], where P[y = k | x, {w_k}] = exp(w_kᵀx) / ∑_j exp(w_jᵀx)
Using the negative log-likelihood for numerical stability: argmin_{w_1,…,w_K} ∑_i ( log ∑_j exp(w_jᵀx_i) − w_{y_i}ᵀx_i )
Note: this is nothing but the softmax loss function we saw earlier, also known as the cross-entropy loss function here
Reason for the name: it corresponds to something known as the cross entropy between the PMF given by the model and the true label of the data point
I may find other ways to assign a PMF over [K] to each data point by choosing some function other than exp, e.g. ReLU, to assign positive scores, i.e. let s_k = ReLU(w_kᵀx) and let P[y = k | x] = s_k / ∑_j s_j, and then proceed to obtain an MLE. Something similar to this is indeed used in deep learning
It should be noted that this is not the only way to do probabilistic multiclassification. It is just that this way is simple to understand and implement, and hence popular
I could also do DT and invoke the "probability as proportions" interpretation to assign to a test data point a PMF that simply gives the proportion of each label in the leaf of that data point!!
However, be warned that generating a PMF using a DT need not necessarily be an MLE since we have not explicitly maximized any likelihood function
General Recipe for MLE Algorithms 13
Given a problem with label set Y, find a way to map data features x to PMFs P[y | x, θ] with support Y
The notation θ captures the parameters in the model (e.g. vectors, bias terms)
For binary classification, Y = {−1, +1} and P[y | x, w] = σ(y · wᵀx)
For multiclassification, Y = [K] and P[y = k | x, {w_k}] = exp(w_kᵀx) / ∑_j exp(w_jᵀx)
The function P[{y_i} | {x_i}, θ] is often called the likelihood function
The function −∑_i log P[y_i | x_i, θ] is called the negative log-likelihood function
Given data {(x_i, y_i)}, find the model parameters that maximize the likelihood function, i.e. that think the training labels are very likely
Probabilistic Regression 14
In order to perform probabilistic regression, I have to assign a label distribution over all of ℝ for every data point, using a PDF
Suppose I decide to do that using a Gaussian distribution – need to decide on a mean and a variance
Popular choice: let the mean be wᵀx and the variance be σ², i.e. P[y | x, w] = N(y; wᵀx, σ²)
We can also choose a different variance for every data point – more complicated
Likelihood function w.r.t. a data point then becomes P[y | x, w] = (1/√(2πσ²)) · exp(−(y − wᵀx)² / (2σ²))
Negative log-likelihood w.r.t. a set of data points: ∑_i [ ½ log(2πσ²) + (y_i − wᵀx_i)² / (2σ²) ]
But apart from the first term and the scaling factor, both of which are constants and do not depend on the model w, the rest is just the least-squares loss term!
The MLE with respect to the Gaussian likelihood indeed minimizes the least-squares loss: min_{w ∈ ℝᵈ} ∑_i (y_i − wᵀx_i)²
Also note that if we set all the variances to the same value, then it does not matter which σ we choose – we will get the same model
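A sketch of the Gaussian negative log-likelihood, showing that up to constants and the 1/(2σ²) scaling it is exactly the least-squares loss; the toy data below are made up.

```python
import numpy as np

def gaussian_nll(w, X, y, sigma=1.0):
    # -log N(y_i; w.x_i, sigma^2) summed over the data; apart from the constant
    # term and the 1/(2 sigma^2) scaling, this is the least-squares loss
    residuals = y - X @ w
    const = 0.5 * np.log(2 * np.pi * sigma**2) * len(y)
    return const + np.sum(residuals**2) / (2 * sigma**2)

# Hypothetical 1-feature regression data
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 1.9, 3.2])
print(gaussian_nll(np.array([1.0]), X, y))
```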
Probabilistic Regression 15
Be warned though – the σ we chose will start mattering the moment we add regularization! It is just that in these simple cases it does not matter. σ is usually treated like a hyperparameter and tuned.
Suppose I decide to use a Laplacian distribution instead and choose mean wᵀx and scale b, i.e. P[y | x, w] = (1/(2b)) · exp(−|y − wᵀx| / b)
Likelihood function w.r.t. a data point then becomes this Laplacian density
Negative log-likelihood w.r.t. a set of data points: ∑_i [ log(2b) + |y_i − wᵀx_i| / b ]
Thus, if we change the likelihood function to use the Laplacian distribution instead, the MLE ends up minimizing the absolute loss!
As before, it does not matter which scale b we choose
So I am a bit confused. All MLEs (classification/regression) demand a model that places maximum probability on the true label. Why don't we just ask the model to predict the true label itself? That is like asking the PMF/PDF to place probability 1 on the true label and 0 everywhere else – why can't we do just that?
For the same reason we needed slack variables in CSVM – to allow for the fact that in realistic situations, no linear model may be able to do what we would ideally like. In probabilistic ML, allowing the model to place a less-than-1 probability on the true label is much like a slack – it allows us to learn good models even if not perfect ones
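The analogous sketch for the Laplacian likelihood, whose negative log reduces to the absolute (L1) loss up to constants; again the data are made up.

```python
import numpy as np

def laplace_nll(w, X, y, b=1.0):
    # -log Laplace(y_i; w.x_i, b) summed over the data; up to the constant term
    # and the 1/b scaling, this is the absolute loss
    residuals = np.abs(y - X @ w)
    const = np.log(2 * b) * len(y)
    return const + np.sum(residuals) / b

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 1.9, 3.2])
print(laplace_nll(np.array([1.0]), X, y))
```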
Probabilistic Regularization?? 16
We have seen that MLE often reduces to loss minimization, e.g. logistic regression/least-squares regression, but without regularization terms
Even probabilistic methods can do regularization, by way of priors
Recall: regularization basically tells us which kinds of models we prefer
L2 regularization means we prefer models with small L2 norm
L1 regularization means we prefer models with small L1 norm/sparse models
In the language of probability, the most direct way of specifying such a preference is by specifying a probability distribution itself
Prior: a probability distribution over all possible models
Just like we usually decide regularization before seeing any data, the prior distribution also does not consider/condition on any data
But our models are vectors, right? Can we have probability distributions over vectors as well?
Of course we can. But first, let us see the basic operations in a toy 1D setting before getting into the complications of vector-valued r.v.s
Can you Guess the Mean? 17
There is a Gaussian with unknown mean μ but known variance σ² (for sake of simplicity) from which we receive independent samples x_1, …, x_n
Can we estimate the "model" μ from these samples?
Likelihood function: for a candidate model μ and sample x_i, P[x_i | μ] = N(x_i; μ, σ²)
MLE: μ_MLE = argmax_μ ∏_i P[x_i | μ] = (1/n) ∑_i x_i, the sample mean
Suppose we believe (e.g. someone tells us), even before the samples have been presented, that μ definitely lies in the interval [0, 2] (but could otherwise be any value within that interval)
In this case we are said to have a prior belief, or simply a prior, on the models μ – in this case the uniform prior UNIF([0, 2]). This means that unless we see any data to make us believe otherwise, we will think μ ∈ [0, 2].
What happens when we do see some data, namely the actual samples from the distribution?
We use the samples and the rules of probability to update our beliefs about what μ can and cannot be.
Let us see how to do this
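A small sketch of this setup: for a Gaussian with known variance, the MLE of μ is simply the sample mean. The true mean, variance and sample size below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma = 1.3, 1.0                    # hypothetical ground truth
samples = rng.normal(true_mu, sigma, size=50)

# MLE for the mean of a Gaussian with known variance:
#   argmax_mu  prod_i N(x_i; mu, sigma^2)  =  (1/n) sum_i x_i
mu_mle = samples.mean()
print(mu_mle)
```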
Posterior 18
Before we see any data, we have a prior belief P[μ] on the models
It tells us which models are more likely/less likely before we have seen data
Note that when we say P[μ] or P[μ | x_1, …, x_n], we mean probability density and not probability mass, since μ is a continuous r.v.
Then we see data and we wish to update our belief. Basically we want to find out P[μ | x_1, …, x_n]
This quantity has a name: posterior belief or simply posterior
It tells us which models are more likely/less likely after we have seen data
Bayes Rule: P[μ | x_1, …, x_n] = P[x_1, …, x_n | μ] · P[μ] / P[x_1, …, x_n]
Samples are independent: P[x_1, …, x_n | μ] = ∏_i P[x_i | μ]
Law of total probability: P[x_1, …, x_n] = ∫ P[x_1, …, x_n | μ] · P[μ] dμ
With the prior P[μ] = UNIF([0, 2]):
if μ ∈ [0, 2], then P[μ | x_1, …, x_n] ∝ ∏_i N(x_i; μ, σ²)
else if μ ∉ [0, 2], then P[μ | x_1, …, x_n] = 0
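A sketch that evaluates this posterior numerically on a grid under the UNIF([0, 2]) prior, using Bayes rule, independence and the law of total probability as above; the grid resolution and the simulated data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
samples = rng.normal(1.3, sigma, size=20)    # hypothetical data

# Unnormalized posterior on a grid of candidate mu values.
# Prior UNIF([0, 2]): constant density 0.5 on [0, 2], zero outside.
grid = np.linspace(-1.0, 3.0, 401)
prior = np.where((grid >= 0.0) & (grid <= 2.0), 0.5, 0.0)
# log of prod_i N(x_i; mu, sigma^2), dropping constants that cancel after normalization
log_lik = np.array([np.sum(-0.5 * ((samples - mu) / sigma) ** 2) for mu in grid])
unnorm_post = prior * np.exp(log_lik - log_lik.max())
posterior = unnorm_post / np.trapz(unnorm_post, grid)   # normalize (law of total probability)

print(grid[np.argmax(posterior)])   # the most likely model after seeing data (MAP, next slide)
```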
Maximum a Posteriori (MAP) Estimate 19
Just as MLE gave us the model μ_MLE = argmax_μ P[x_1, …, x_n | μ], MAP gives us the model μ_MAP = argmax_μ P[μ | x_1, …, x_n]
Thus, MAP returns the model that becomes the most likely one after we have seen some data
Note: the posterior probability (density) of some models may be larger than their prior probability (density), i.e. after seeing data those models seem more likely; for other models it may go down, i.e. they seem less likely after seeing the data
Note: however, if the prior probability (density) of some model is 0, the posterior probability (density) has to be zero as well
It is better to choose priors that do not completely exclude some models by giving them 0 probability (as we did)
Indeed! For example, if we were wrong and μ was actually not in [0, 2], then no matter how many samples we see, we will never estimate μ correctly!!
True! Even in general, if your priors are bad, or too strong, then you may end up getting funny models as a result of doing MAP estimation
Warning: do not read too much into the names likelihood, prior, posterior. All of them tell us how likely something is, given or not given something else
MAP vs Regularization 20
μ_MAP = argmax_μ P[μ | x_1, …, x_n] = argmax_μ P[x_1, …, x_n | μ] · P[μ]
Taking negative log likelihoods on both sides: μ_MAP = argmin_μ { −∑_i log P[x_i | μ] − log P[μ] }
However, P[μ] is constant for μ ∈ [0, 2] and 0 otherwise (−log P[μ] = ∞ there)
Thus μ_MAP = argmin_μ ∑_i (x_i − μ)² / (2σ²)  s.t.  μ ∈ [0, 2]
Thus, even MAP solutions can correspond to optimization problems!
In this case, what was the prior became a constraint
In general, the prior becomes a regularizer
MAP vs Regularization 21
Consider the same problem as before, but with a different prior
This time we do not believe μ must have been in the interval [0, 2]; we hold a much milder prior that μ is not too large in magnitude
A good way to express this is to use a Gaussian prior, e.g. P[μ] = N(0, τ²)
MAP: μ_MAP = argmax_μ P[μ | x_1, …, x_n] = argmin_μ { ∑_i (x_i − μ)² / (2σ²) + μ² / (2τ²) }
Thus, a Gaussian prior gave us L2 regularization!
Similarly, had we used a Laplacian prior, we would have obtained L1 regularization instead
Note: τ effectively dictates the regularization constant – the prior is not useless!!
The regularization constant is dictated by the strength of the prior. Be careful not to have strong priors (uninformed strong opinions are bad in real life too)
Note: this is basically ridge regression except in one dimension!!
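A sketch of the 1D MAP estimate under a Gaussian prior N(0, τ²), showing the closed-form ridge-style shrinkage of the sample mean; the value of τ, the noise level and the simulated data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                  # known noise standard deviation
samples = rng.normal(1.3, sigma, size=20)    # hypothetical data

# Gaussian prior mu ~ N(0, tau^2). Minimizing the negative log posterior
#   sum_i (x_i - mu)^2 / (2 sigma^2) + mu^2 / (2 tau^2)
# gives a closed-form MAP estimate: a shrunken (ridge-style) sample mean.
tau = 0.5                                    # prior std; controls the regularization strength
n = len(samples)
mu_mle = samples.mean()
mu_map = (n / sigma**2) * mu_mle / (n / sigma**2 + 1 / tau**2)
print(mu_mle, mu_map)                        # MAP is pulled towards the prior mean 0
```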