02 - Linear Models - D (Multiclass Classification)

- Softmax regression is a single-layer neural network that uses the softmax function to normalize the outputs of a linear model so that they can be interpreted as probabilities.
- The softmax function ensures the outputs are all nonnegative and sum to 1, allowing them to represent a proper probability distribution over predicted classes.
- Cross-entropy loss is used as the loss function for softmax regression, comparing the predicted probabilities to the true class labels. This loss function encourages the model to estimate class probabilities accurately during training.


Linear Multiclass Classification

One Hot Encoding

• We should use one-hot encoding to represent a categorical value because it is not ordinal.
• This applies to both inputs and outputs. In a language model the output task is classification over words; the number of categories could be very large, but the CPU can handle it.
• A language model (LM) is a model predicting the next word given the past words. In English, how many candidates are there for the next word? Is predicting the next word a regression or a classification problem? Then what would be the dimension of the one-hot encoding in an LM? (A minimal encoding sketch follows below.)
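As a concrete illustration, here is a minimal NumPy sketch of one-hot encoding; the category names and the helper one_hot are made up for this example.

import numpy as np

def one_hot(index, num_classes):
    # Return a vector of length num_classes with a single 1 at position index.
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

# Example: three categories (cat, chicken, dog) -- no ordering is implied.
classes = ["cat", "chicken", "dog"]
print(one_hot(classes.index("chicken"), len(classes)))   # [0. 1. 0.]

# In a language model the categories are the vocabulary words, so the
# one-hot dimension equals the vocabulary size (tens of thousands or more).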
Linear Regression with Multiple Outputs

• Assume that the total number of classes is C.
• Can we simply extend the linear regression model to predict C outputs?

label        y_1 = 0        y_2 = 1        y_3 = 0
prediction   ŷ_1 = 0.82     ŷ_2 = 0.64     ŷ_3 = 0.24

Fig. 3.4.1: Softmax regression is a single-layer neural network.


• We want the predictions for each class to be probabilities, like a soft version of the one-hot vector, e.g., (cat, chicken, dog) = (0.2, 0.7, 0.1).

To express the model more compactly, we can use linear algebra notation. In vector form, we arrive at o = Wx + b, a form better suited both for mathematics and for writing code. Note that we have gathered all of our weights into a 3 × 4 matrix, so for a given example x the outputs o are unnormalized scores. Nothing constrains these numbers to sum to 1. Moreover, depending on the inputs, they can take negative values. These violate basic axioms of probability presented in Section 2.6.
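As a small sketch of this point (all numbers are made up), the linear model o = Wx + b with C = 3 outputs produces raw scores that are neither nonnegative nor normalized:

import numpy as np

np.random.seed(0)
C, d = 3, 4                      # 3 classes, 4 input features
W = np.random.randn(C, d)        # weights gathered into a 3 x 4 matrix
b = np.random.randn(C)           # one bias per class
x = np.random.randn(d)           # a single example

o = W @ x + b                    # logits: unnormalized scores per class
print(o)                         # entries can be negative ...
print(o.sum())                   # ... and they do not sum to 1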
Softmax Function
To interpret our outputs as probabilities, we must guarantee that (even on new data) they will be nonnegative and sum up to 1. Moreover, we need a training objective that encourages the model to estimate probabilities faithfully. Of all instances when a classifier outputs 0.5, we hope that half of those examples will actually belong to the predicted class. This is a property called calibration.

• This means we need to normalize the outputs of the linear model (called logits) so that the sum becomes 1 while all outputs are nonnegative.
• The softmax function makes sure of this.

The softmax function, invented in 1959 by the social scientist R. Duncan Luce in the context of choice models, does precisely this. To transform our logits such that they become nonnegative and sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring nonnegativity) and then divide by their sum (ensuring that they sum to 1):

ŷ = softmax(o)  where  ŷ_i = exp(o_i) / Σ_j exp(o_j).   (3.4.3)

Exponentiation makes sure each ŷ_i is nonnegative; dividing by the sum makes sure the ŷ_i sum to 1. It is easy to see that ŷ_1 + ŷ_2 + ŷ_3 = 1 with 0 ≤ ŷ_i ≤ 1 for all i. Thus, ŷ is a proper probability distribution and the values of ŷ can be interpreted accordingly. Note that the softmax operation does not change the ordering among the logits, and thus we can still pick out the most likely class by

î(o) = argmax_i o_i = argmax_i ŷ_i.   (3.4.4)
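A direct NumPy implementation of (3.4.3), as a sketch; subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result:

import numpy as np

def softmax(o):
    # Map a vector of logits o to a probability vector (eq. 3.4.3).
    z = np.exp(o - o.max())      # shift by max(o) for numerical stability
    return z / z.sum()

o = np.array([2.0, -1.0, 0.5])
y_hat = softmax(o)
print(y_hat)                              # every entry lies in [0, 1]
print(y_hat.sum())                        # sums to 1
print(np.argmax(o) == np.argmax(y_hat))   # True: ordering preserved (eq. 3.4.4)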
*Why it is called “softmax”

• The original form of softmax has a temperature hyperparameter τ:

  ŷ_i = exp(o_i / τ) / Σ_j exp(o_j / τ)

• When τ is low, the distribution puts nearly all of its mass on the maximum value, so sampling from it returns the max most of the time (depending on τ).
• Being differentiable is important in many applications: max() is non-differentiable, but softmax() is differentiable.
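A short sketch of the temperature version (values made up): as τ shrinks, the output approaches a one-hot vector at the argmax (a "hard" max), while a large τ flattens the distribution, and the function stays differentiable throughout.

import numpy as np

def softmax_t(o, tau):
    # Temperature softmax: exp(o_i / tau) / sum_j exp(o_j / tau).
    z = np.exp((o - o.max()) / tau)
    return z / z.sum()

o = np.array([2.0, 1.0, 0.1])
for tau in [5.0, 1.0, 0.1]:
    print(tau, softmax_t(o, tau))
# tau = 5.0 -> nearly uniform; tau = 0.1 -> almost all mass on the largest logit.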
Loss Function for Classification

Log-Likelihood

• Cross-Entropy Loss: Maximum-Likelihood for Classification

To measure the quality of the predicted probabilities we rely on likelihood maximization, the very same concept that we encountered when providing a probabilistic justification for the least squares objective in linear regression (Section 3.1).

The softmax function gives us a vector ŷ, which we can interpret as estimated conditional probabilities of each class given the input x, e.g., ŷ_1 = P̂(y = cat | x). We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features:
P(Y | X) = ∏_{i=1}^{n} P(y^(i) | x^(i))   and thus   −log P(Y | X) = Σ_{i=1}^{n} −log P(y^(i) | x^(i)).   (3.4.6)

Maximizing P(Y | X) (and thus equivalently minimizing −log P(Y | X)) corresponds to predicting the label well. This yields the loss function (we dropped the superscript (i) to avoid notation clutter):

l = −log P(y | x) = −Σ_j y_j log ŷ_j,   (3.4.7)

where ŷ_j is the predicted probability of class j.
For reasons explained later on, this loss function is commonly called the cross-entropy loss. Here, we used that by construction ŷ is a discrete probability distribution and that the vector y is a one-hot vector, i.e., y_j is the actual probability that class j is the correct one. Hence the sum over all coordinates j vanishes for all but one term. Since all ŷ_j are probabilities, their logarithm is never larger than 0. Consequently, the loss function cannot be minimized any further if we correctly predict y with certainty, i.e., if P(y | x) = 1 for the correct label. Note that this is often not possible. For example, there might be label noise in the dataset (some examples may be mislabeled). It may also not be possible when the input features are not sufficiently informative to classify every example perfectly.
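A minimal sketch of (3.4.7) for a single example with a one-hot label (logit values made up); note how only the true-class term survives in the sum:

import numpy as np

def softmax(o):
    z = np.exp(o - o.max())
    return z / z.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    # l = -sum_j y_j * log(y_hat_j) for a one-hot label y (eq. 3.4.7).
    return -np.sum(y * np.log(y_hat + eps))   # eps guards against log(0)

o = np.array([1.0, 2.5, 0.3])            # made-up logits
y_hat = softmax(o)                       # predicted probabilities
y = np.array([0.0, 1.0, 0.0])            # one-hot label: true class is index 1
print(cross_entropy(y, y_hat))           # equals -log(y_hat[1]) ...
print(-np.log(y_hat[1]))                 # ... because only one term survives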
Cross-Entropy Loss

l = −log P(y | x) = −Σ_j y_j log ŷ_j,   (3.4.7)

where j is the class index.

label        y_1 = 0        y_2 = 1        y_3 = 0
prediction   ŷ_1 = 0.12     ŷ_2 = 0.64     ŷ_3 = 0.24

The softmax produces the predictions ŷ from the logits, and minimizing the cross-entropy loss pushes the predicted probability of the true class (here ŷ_2) towards 1.

Fig. 3.4.1: Softmax regression is a single-layer neural network.
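As a quick check with the numbers above: y = (0, 1, 0) and ŷ = (0.12, 0.64, 0.24), so l = −(0·log 0.12 + 1·log 0.64 + 0·log 0.24) = −log 0.64 ≈ 0.45. Driving ŷ_2 towards 1 drives the loss towards 0, which is exactly what minimizing the cross-entropy encourages during training.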


The Gradients of The Cross Entropy Loss

• The softmax function is a non-linear function (due to exp), so the cross-entropy loss has no closed-form solution. This means that we need to use the gradient descent method.
• Try to derive the following gradient of the cross-entropy loss (a numerical check follows below):

  ∇_{w_j} L(w) = (1/N) Σ_{i=1}^{N} ( ŷ_j^(i) − y_j^(i) ) x^(i)
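As a sanity check on this formula, the sketch below (random made-up data; the rows of W are assumed to be the per-class weight vectors w_j, and L is the average cross-entropy over N examples) compares the analytic gradient with a finite-difference estimate:

import numpy as np

np.random.seed(0)
N, d, C = 5, 4, 3                        # examples, features, classes
X = np.random.randn(N, d)
labels = np.random.randint(0, C, size=N)
Y = np.eye(C)[labels]                    # one-hot labels, shape (N, C)
W = np.random.randn(C, d)

def loss(W):
    O = X @ W.T                                      # logits, shape (N, C)
    P = np.exp(O - O.max(axis=1, keepdims=True))
    P = P / P.sum(axis=1, keepdims=True)             # row-wise softmax
    return -np.mean(np.sum(Y * np.log(P), axis=1))   # average cross-entropy

# Analytic gradient: row j is (1/N) * sum_i (y_hat_j - y_j) * x^(i)
O = X @ W.T
P = np.exp(O - O.max(axis=1, keepdims=True))
P = P / P.sum(axis=1, keepdims=True)
grad = (P - Y).T @ X / N                 # shape (C, d)

# Finite-difference check of one entry of the gradient
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
print(grad[0, 0], (loss(W2) - loss(W)) / eps)   # the two values should match closely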
Decision Boundaries
