02 - Linear Models - D (Multiclass Classification)

- Softmax regression is a single-layer neural network that uses the softmax function to normalize the outputs of a linear model so that they can be interpreted as probabilities.
- The softmax function ensures the outputs are all nonnegative and sum to 1, allowing them to represent a proper probability distribution over predicted classes.
- Cross-entropy loss is used as the loss function for softmax regression, comparing the predicted probabilities to the true class labels. This loss function encourages the model to estimate class probabilities accurately during training.


Linear Multiclass Classification

One Hot Encoding

• We should use one-hot encoding to represent a categorical value because it is not ordinal.
• This applies to both inputs and outputs. In a language model the output task is classification over words; the number of categories could be very large, but the CPU can handle it.
• A language model (LM) is a model predicting the next word given the past words. In English, how many candidates are there for the next word? Is predicting the next word a regression or a classification problem? Then what would be the dimension of the one-hot encoding in an LM? (A minimal encoding sketch follows below.)
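As a concrete illustration, here is a minimal NumPy sketch of one-hot encoding; the category names and the helper one_hot are made up for this example.

import numpy as np

def one_hot(index, num_classes):
    # Return a vector of length num_classes with a single 1 at position index.
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

# Example: three categories (cat, chicken, dog) -- no ordering is implied.
classes = ["cat", "chicken", "dog"]
print(one_hot(classes.index("chicken"), len(classes)))   # [0. 1. 0.]

# In a language model the categories are the vocabulary words, so the
# one-hot dimension equals the vocabulary size (tens of thousands or more).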
Linear Regression with Multiple Outputs

• Assume that the total number of classes is C.
• Can we simply extend the linear regression model to predict C outputs?

label        y_1 = 0        y_2 = 1        y_3 = 0
prediction   ŷ_1 = 0.82     ŷ_2 = 0.64     ŷ_3 = 0.24

Fig. 3.4.1: Softmax regression is a single-layer neural network.


• We want the predictions for each class to be probabilities, like a soft version of the one-hot vector, e.g., (cat, chicken, dog) = (0.2, 0.7, 0.1).

To express the model more compactly, we can use linear algebra notation. In vector form, we arrive at o = Wx + b, a form better suited both for mathematics and for writing code. Note that we have gathered all of our weights into a 3 × 4 matrix, so for a given example x the outputs o are unnormalized scores. Nothing constrains these numbers to sum to 1. Moreover, depending on the inputs, they can take negative values. These violate basic axioms of probability presented in Section 2.6.
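As a small sketch of this point (all numbers are made up), the linear model o = Wx + b with C = 3 outputs produces raw scores that are neither nonnegative nor normalized:

import numpy as np

np.random.seed(0)
C, d = 3, 4                      # 3 classes, 4 input features
W = np.random.randn(C, d)        # weights gathered into a 3 x 4 matrix
b = np.random.randn(C)           # one bias per class
x = np.random.randn(d)           # a single example

o = W @ x + b                    # logits: unnormalized scores per class
print(o)                         # entries can be negative ...
print(o.sum())                   # ... and they do not sum to 1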
Softmax Function
To interpret our outputs as probabilities, we must guarantee that (even on new data) they will be nonnegative and sum up to 1. Moreover, we need a training objective that encourages the model to estimate probabilities faithfully. Of all instances when a classifier outputs 0.5, we hope that half of those examples will actually belong to the predicted class. This is a property called calibration.

• This means we need to normalize the outputs of the linear model (called logits) so that the sum becomes 1 while all outputs are nonnegative.
• The softmax function makes sure of this.

The softmax function, invented in 1959 by the social scientist R. Duncan Luce in the context of choice models, does precisely this. To transform our logits such that they become nonnegative and sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring nonnegativity) and then divide by their sum (ensuring that they sum to 1):

ŷ = softmax(o)  where  ŷ_i = exp(o_i) / Σ_j exp(o_j).   (3.4.3)

Exponentiation makes sure each ŷ_i is nonnegative; dividing by the sum makes sure the ŷ_i sum to 1. It is easy to see that ŷ_1 + ŷ_2 + ŷ_3 = 1 with 0 ≤ ŷ_i ≤ 1 for all i. Thus, ŷ is a proper probability distribution and the values of ŷ can be interpreted accordingly. Note that the softmax operation does not change the ordering among the logits, and thus we can still pick out the most likely class by

î(o) = argmax_i o_i = argmax_i ŷ_i.   (3.4.4)
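A direct NumPy implementation of (3.4.3), as a sketch; subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result:

import numpy as np

def softmax(o):
    # Map a vector of logits o to a probability vector (eq. 3.4.3).
    z = np.exp(o - o.max())      # shift by max(o) for numerical stability
    return z / z.sum()

o = np.array([2.0, -1.0, 0.5])
y_hat = softmax(o)
print(y_hat)                              # every entry lies in [0, 1]
print(y_hat.sum())                        # sums to 1
print(np.argmax(o) == np.argmax(y_hat))   # True: ordering preserved (eq. 3.4.4)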
*Why it is called “softmax”

• The original form of softmax has a temperature hyperparameter τ:

  ŷ_i = exp(o_i / τ) / Σ_j exp(o_j / τ)

• When τ is low, the distribution puts nearly all of its mass on the maximum value, so sampling from it returns the max most of the time (depending on τ).
• Being differentiable is important in many applications: max() is non-differentiable, but softmax() is differentiable.
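A short sketch of the temperature version (values made up): as τ shrinks, the output approaches a one-hot vector at the argmax (a "hard" max), while a large τ flattens the distribution, and the function stays differentiable throughout.

import numpy as np

def softmax_t(o, tau):
    # Temperature softmax: exp(o_i / tau) / sum_j exp(o_j / tau).
    z = np.exp((o - o.max()) / tau)
    return z / z.sum()

o = np.array([2.0, 1.0, 0.1])
for tau in [5.0, 1.0, 0.1]:
    print(tau, softmax_t(o, tau))
# tau = 5.0 -> nearly uniform; tau = 0.1 -> almost all mass on the largest logit.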
Loss Function for Classification

Log-Likelihood

• Cross-Entropy Loss: Maximum-Likelihood for Classification

To measure the quality of the predicted probabilities we rely on likelihood maximization, the very same concept that we encountered when providing a probabilistic justification for the least squares objective in linear regression (Section 3.1).

The softmax function gives us a vector ŷ, which we can interpret as estimated conditional probabilities of each class given the input x, e.g., ŷ_1 = P̂(y = cat | x). We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features:
P(Y | X) = ∏_{i=1}^{n} P(y^(i) | x^(i))   and thus   −log P(Y | X) = Σ_{i=1}^{n} −log P(y^(i) | x^(i)).   (3.4.6)

Maximizing P(Y | X) (and thus equivalently minimizing −log P(Y | X)) corresponds to predicting the label well. This yields the loss function (we dropped the superscript (i) to avoid notation clutter):

l = −log P(y | x) = −Σ_j y_j log ŷ_j,   (3.4.7)

where ŷ_j is the predicted probability of class j.
For reasons explained later on, this loss function is commonly called the cross-entropy loss. Here, we used that by construction ŷ is a discrete probability distribution and that the vector y is a one-hot vector, i.e., y_j is the actual probability that class j is the correct one. Hence the sum over all coordinates j vanishes for all but one term. Since all ŷ_j are probabilities, their logarithm is never larger than 0. Consequently, the loss function cannot be minimized any further if we correctly predict y with certainty, i.e., if P(y | x) = 1 for the correct label. Note that this is often not possible. For example, there might be label noise in the dataset (some examples may be mislabeled). It may also not be possible when the input features are not sufficiently informative to classify every example perfectly.
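A minimal sketch of (3.4.7) for a single example with a one-hot label (logit values made up); note how only the true-class term survives in the sum:

import numpy as np

def softmax(o):
    z = np.exp(o - o.max())
    return z / z.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    # l = -sum_j y_j * log(y_hat_j) for a one-hot label y (eq. 3.4.7).
    return -np.sum(y * np.log(y_hat + eps))   # eps guards against log(0)

o = np.array([1.0, 2.5, 0.3])            # made-up logits
y_hat = softmax(o)                       # predicted probabilities
y = np.array([0.0, 1.0, 0.0])            # one-hot label: true class is index 1
print(cross_entropy(y, y_hat))           # equals -log(y_hat[1]) ...
print(-np.log(y_hat[1]))                 # ... because only one term survives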
Cross-Entropy Loss

l = −log P(y | x) = −Σ_j y_j log ŷ_j,   (3.4.7)

where j is the class index.

label        y_1 = 0        y_2 = 1        y_3 = 0
prediction   ŷ_1 = 0.12     ŷ_2 = 0.64     ŷ_3 = 0.24

The softmax produces the predictions ŷ from the logits, and minimizing the cross-entropy loss pushes the predicted probability of the true class (here ŷ_2) towards 1.

Fig. 3.4.1: Softmax regression is a single-layer neural network.
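As a quick check with the numbers above: y = (0, 1, 0) and ŷ = (0.12, 0.64, 0.24), so l = −(0·log 0.12 + 1·log 0.64 + 0·log 0.24) = −log 0.64 ≈ 0.45. Driving ŷ_2 towards 1 drives the loss towards 0, which is exactly what minimizing the cross-entropy encourages during training.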


The Gradients of The Cross Entropy Loss

• The softmax function is a non-linear function (due to exp), so the cross-entropy loss has no closed-form solution. This means that we need to use the gradient descent method.
• Try to derive the following gradient of the cross-entropy loss (a numerical check follows below):

  ∇_{w_j} L(w) = (1/N) Σ_{i=1}^{N} ( ŷ_j^(i) − y_j^(i) ) x^(i)
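As a sanity check on this formula, the sketch below (random made-up data; the rows of W are assumed to be the per-class weight vectors w_j, and L is the average cross-entropy over N examples) compares the analytic gradient with a finite-difference estimate:

import numpy as np

np.random.seed(0)
N, d, C = 5, 4, 3                        # examples, features, classes
X = np.random.randn(N, d)
labels = np.random.randint(0, C, size=N)
Y = np.eye(C)[labels]                    # one-hot labels, shape (N, C)
W = np.random.randn(C, d)

def loss(W):
    O = X @ W.T                                      # logits, shape (N, C)
    P = np.exp(O - O.max(axis=1, keepdims=True))
    P = P / P.sum(axis=1, keepdims=True)             # row-wise softmax
    return -np.mean(np.sum(Y * np.log(P), axis=1))   # average cross-entropy

# Analytic gradient: row j is (1/N) * sum_i (y_hat_j - y_j) * x^(i)
O = X @ W.T
P = np.exp(O - O.max(axis=1, keepdims=True))
P = P / P.sum(axis=1, keepdims=True)
grad = (P - Y).T @ X / N                 # shape (C, d)

# Finite-difference check of one entry of the gradient
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
print(grad[0, 0], (loss(W2) - loss(W)) / eps)   # the two values should match closely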
Decision Boundaries
