DeepNotes: Softmax & Cross Entropy

The document summarizes the softmax function and cross entropy loss. Softmax takes a vector of real numbers and transforms it into a probability distribution. Cross entropy loss measures the performance of a classification model whose output is a probability distribution. The derivatives of softmax and cross entropy loss are also provided, which are needed for backpropagation when training deep neural networks.


Classification and Loss Evaluation -

Softmax and Cross Entropy Loss


Let's dig a little deeper into how we convert the output of our CNN into probabilities - Softmax - and into the loss measure that guides our optimization - Cross Entropy.

The Softmax Function

Derivative of Softmax

Cross Entropy Loss

Derivative of Cross Entropy Loss with Softmax

PARAS DAHAL

Note: Complete source code can be found here


https://github.com/parasdahal/deepnet

The Softmax Function

The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0,1) that add up to 1:

$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}$$

As the name suggests, the softmax function is a “soft” version of the max function. Instead of selecting one maximum value, it splits the whole (the total probability mass of 1), with the maximal element getting the largest portion of the distribution, but other smaller elements getting some of it as well.

This property of the softmax function, that it outputs a probability distribution, makes it suitable for probabilistic interpretation in classification tasks.

In Python, we can code the softmax function as follows:

import numpy as np

def softmax(X):
    exps = np.exp(X)            # elementwise exponentials
    return exps / np.sum(exps)  # normalize so the outputs sum to 1
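
For example (a quick check of ours, not from the original post), the outputs are non-negative and sum to 1:

print(softmax(np.array([1.0, 2.0, 3.0])))
# [0.09003057 0.24472847 0.66524096]  -- sums to 1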

We have to note that the numerical range of floating point numbers in NumPy is limited. For float64 the upper bound is about 1.8 × 10^308. With exponentials it is not difficult to overshoot that limit, in which case the result overflows to inf and the subsequent normalization produces nan.
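
A small illustration of this failure mode (our own example, not from the post):

x = np.array([1000.0, 2.0])
print(np.exp(x))                      # [inf 7.3890561] -- NumPy warns about overflow
print(np.exp(x) / np.sum(np.exp(x)))  # [nan 0.] because inf / inf is nan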

To make our softmax function numerically stable, we simply normalize the values in the vector by multiplying the numerator and denominator with a constant C.

$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}
= \frac{C e^{a_i}}{C \sum_{k=1}^{N} e^{a_k}}
= \frac{e^{a_i + \log(C)}}{\sum_{k=1}^{N} e^{a_k + \log(C)}}$$

We can choose an arbitrary value for the log(C) term, but generally log(C) = −max(a) is chosen, as it shifts all elements of the vector so that the largest becomes zero; large negative arguments then saturate to zero rather than overflowing to infinity, avoiding nan.

The code for our stable softmax is as follows:

def stable_softmax(X):
    exps = np.exp(X - np.max(X))  # shift by the max so the largest exponent is 0
    return exps / np.sum(exps)
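
A quick comparison (again our own check): the shift does not change the result on ordinary inputs, and it fixes the overflow case from above:

x = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(x), stable_softmax(x)))  # True
print(stable_softmax(np.array([1000.0, 2.0])))     # [1. 0.] instead of [nan 0.]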

Derivative of Softmax

Due to the desirable property of the softmax function outputting a probability distribution, we use it as the final layer in neural networks. For this we need to calculate its derivative (gradient) and pass it back to the previous layer during backpropagation.
$$\frac{\partial p_i}{\partial a_j} = \frac{\partial \left( \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \right)}{\partial a_j}$$

From the quotient rule we know that for $f(x) = \frac{g(x)}{h(x)}$, we have $f'(x) = \frac{g'(x)\,h(x) - h'(x)\,g(x)}{h(x)^2}$.

In our case, $g(x) = e^{a_i}$ and $h(x) = \sum_{k=1}^{N} e^{a_k}$. In $h(x)$, the derivative with respect to $a_j$ is always $e^{a_j}$, because the sum always contains the term $e^{a_j}$. But we have to note that in $g(x)$, the derivative $\frac{\partial e^{a_i}}{\partial a_j}$ is $e^{a_j}$ only if $i = j$; otherwise it is 0.

If $i = j$,

$$\frac{\partial}{\partial a_j} \left( \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \right)
= \frac{e^{a_i} \sum_{k=1}^{N} e^{a_k} - e^{a_j} e^{a_i}}{\left( \sum_{k=1}^{N} e^{a_k} \right)^2}
= \frac{e^{a_i} \left( \sum_{k=1}^{N} e^{a_k} - e^{a_j} \right)}{\left( \sum_{k=1}^{N} e^{a_k} \right)^2}
= \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \times \frac{\sum_{k=1}^{N} e^{a_k} - e^{a_j}}{\sum_{k=1}^{N} e^{a_k}}
= p_i (1 - p_j)$$

For $i \neq j$,

$$\frac{\partial}{\partial a_j} \left( \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \right)
= \frac{0 - e^{a_j} e^{a_i}}{\left( \sum_{k=1}^{N} e^{a_k} \right)^2}
= \frac{-e^{a_j}}{\sum_{k=1}^{N} e^{a_k}} \times \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}
= -p_j \, p_i$$

So the derivative of the softmax function is given as

$$\frac{\partial p_i}{\partial a_j} = \begin{cases} p_i (1 - p_j) & \text{if } i = j \\ -p_j \, p_i & \text{if } i \neq j \end{cases}$$

Or, using the Kronecker delta $\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$,

$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j)$$
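
As a sanity check (a minimal sketch of ours, not part of the original post; softmax_jacobian and the finite-difference loop are our own additions), we can build the full Jacobian $p_i (\delta_{ij} - p_j)$ for one input vector and compare it with a numerical estimate:

import numpy as np

def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)

def softmax_jacobian(a):
    # J[i, j] = dp_i / da_j = p_i * (delta_ij - p_j)
    p = stable_softmax(a)
    return np.diag(p) - np.outer(p, p)

a = np.array([1.0, 2.0, 3.0])
eps = 1e-6
numerical = np.zeros((a.size, a.size))
for j in range(a.size):
    d = np.zeros_like(a)
    d[j] = eps
    # central-difference approximation of column j of the Jacobian
    numerical[:, j] = (stable_softmax(a + d) - stable_softmax(a - d)) / (2 * eps)

print(np.allclose(softmax_jacobian(a), numerical, atol=1e-6))  # True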

Cross Entropy Loss

Cross entropy indicates the distance between what the model believes the output distribution should be and what the original distribution really is. It is defined as

$$H(y, p) = -\sum_i y_i \log(p_i)$$

Cross entropy is a widely used alternative to the squared error. It is used when node activations can be understood as representing the probability that each hypothesis might be true, i.e. when the output is a probability distribution. Thus it is used as a loss function in neural networks which have softmax activations in the output layer.

def cross_entropy(X, y):
    """
    X is the output from the fully connected layer (num_examples x num_classes)
    y is labels: a 1-D array of class indices with shape (num_examples,), not a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded labels if required.
    """
    m = y.shape[0]
    p = softmax(X)
    # We use multidimensional array indexing to extract
    # the softmax probability of the correct label for each sample.
    # Refer to https://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arra
    log_likelihood = -np.log(p[range(m), y])
    loss = np.sum(log_likelihood) / m
    return loss
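
As a toy usage example (ours, and it assumes a batch-aware softmax that normalizes each row of X, which the single-vector softmax above does not do), a confident correct prediction contributes a loss near 0 while a uniform prediction over 3 classes contributes log 3:

import numpy as np

X = np.array([[10.0, 0.0, 0.0],   # confidently (and correctly) predicts class 0
              [ 0.0, 0.0, 0.0]])  # uniform over the 3 classes
y = np.array([0, 1])

p = np.exp(X - X.max(axis=1, keepdims=True))
p = p / p.sum(axis=1, keepdims=True)         # row-wise stable softmax
loss = -np.log(p[range(2), y]).sum() / 2
print(loss)                                   # ~0.55, i.e. (0 + log 3) / 2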

Derivative of Cross Entropy Loss with Softmax

Cross entropy loss with softmax is used extensively as the output layer of classification networks. Now we use the derivative of softmax [1] that we derived earlier to derive the derivative of the cross entropy loss function. Here $o_i$ denotes the $i$-th logit, i.e. the output of the fully connected layer that feeds the softmax (the $a_i$ above), so $p_i = \text{softmax}(o)_i$.

$$L = -\sum_k y_k \log(p_k)$$

$$\frac{\partial L}{\partial o_i}
= -\sum_k y_k \frac{\partial \log(p_k)}{\partial o_i}
= -\sum_k y_k \frac{\partial \log(p_k)}{\partial p_k} \times \frac{\partial p_k}{\partial o_i}
= -\sum_k y_k \frac{1}{p_k} \times \frac{\partial p_k}{\partial o_i}$$

From the derivative of softmax we derived earlier,

$$\frac{\partial L}{\partial o_i} = -y_i (1 - p_i) - \sum_{k \neq i} y_k \frac{1}{p_k} (-p_k \, p_i)$$

$$= -y_i (1 - p_i) + \sum_{k \neq i} y_k \, p_i$$

$$= -y_i + y_i p_i + \sum_{k \neq i} y_k \, p_i$$

$$= p_i \left( y_i + \sum_{k \neq i} y_k \right) - y_i$$

$y$ is a one-hot encoded vector for the labels, so $\sum_k y_k = 1$ and therefore $y_i + \sum_{k \neq i} y_k = 1$. So we have

$$\frac{\partial L}{\partial o_i} = p_i - y_i$$

which is a very simple and elegant expression. Translating it into code [2]:

def delta_cross_entropy(X, y):
    """
    X is the output from the fully connected layer (num_examples x num_classes)
    y is labels: a 1-D array of class indices with shape (num_examples,), not a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded labels if required.
    """
    m = y.shape[0]
    grad = softmax(X)
    # p - y: subtract 1 only at the index of the correct class for each example
    grad[range(m), y] -= 1
    # average the gradient over the batch
    grad = grad / m
    return grad
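
A minimal finite-difference check (our own sketch; it redefines the helpers with a batch-aware, row-normalized softmax so the functions work on a whole batch at once):

import numpy as np

def softmax(X):
    exps = np.exp(X - np.max(X, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)

def cross_entropy(X, y):
    m = y.shape[0]
    return -np.log(softmax(X)[range(m), y]).sum() / m

def delta_cross_entropy(X, y):
    m = y.shape[0]
    grad = softmax(X)
    grad[range(m), y] -= 1
    return grad / m

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([0, 2, 1, 2])

# central differences on every entry of X
eps = 1e-6
numerical = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        d = np.zeros_like(X)
        d[i, j] = eps
        numerical[i, j] = (cross_entropy(X + d, y) - cross_entropy(X - d, y)) / (2 * eps)

print(np.allclose(delta_cross_entropy(X, y), numerical, atol=1e-6))  # True
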
References

1. Bendersky, E., 2016. The Softmax function and its derivative. [link]

2. Karpathy, A., 2016. CS231n Convolutional Neural Networks for Visual Recognition. [link]

Comments

chsand420 • 10 days ago


While calculating the gradient, why are we dividing it by m?

张强 • a month ago
The code for the delta_cross_entropy seems to have something wrong.
x = np.array([11., 42., 3.])
y = np.array([1])
delta_cross_entropy(x,y)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-21-bbab36a3762b> in <module>
----> 1 delta_cross_entropy(x,y)

<ipython-input-5-f8bbefba54bf> in delta_cross_entropy(X, y)
8 m = y.shape[0]
9 grad = softmax(X)
---> 10 grad[range(m),y] -= 1
11 grad = grad/m
12 return grad

IndexError: too many indices for array



Miro • 3 months ago • edited


You have typo in section From derivative of softmax we derived earlier. In the last four
equations you use k=/=1 instead of k=/=i.

Davide Riva • 6 months ago


I've found some typos:
* p_i =e^{a_i} / ∑e^a_k should be p_i =e^{a_i} / ∑e^{a_k}
* "In python, we the code for softmax function as follows:" should be "In python, we
code the softmax function..." or something similar

ComfortablyBumb • 7 months ago


The derivative you calculate is not for the stable version of softmax and cross entropy,
right? I mean I didn't see the max element accounted for anywhere in derivative. Or do
you assume the "-max" on inputs has no effect on the derivative?

Aleksandr Dremov > ComfortablyBumb • 2 months ago


It has no effect on derivation. Moreover, you can add any constant to the
exponent and it will return the same values.

Dibben Nandakishor • 7 months ago


Hi mate, I have a question on implementing cross entropy. In your code
grad[range(m),y] -= 1, when i tried it out it minuses 1 from the entire column rather
than just the intended value. Can you explain this, thanks.

Dibben Nandakishor > Dibben Nandakishor • 7 months ago


found the reason why y should be in the shape (num_examples,) not
(num_examples,1)

兴乐 安 • 9 months ago • edited


Thanks for your great work. I just wonder how to implement the derivative part if I use
Softmax as a usual activation function (since x is vector)?

class Activation:
    def forward(self, x):
        raise NotImplementedError
    def derivative(self, x):
        raise NotImplementedError
    def __call__(self, *inputs):
        return self.forward(*inputs)

class SoftMax(Activation):
    def forward(self, x, axis=-1):
        shift_x = x - np.max(x, axis=axis, keepdims=True)  # stable softmax
        exp = np.exp(shift_x + 1e-6)
        return exp / np.sum(exp, axis=axis, keepdims=True)
    def derivative(self, x):
        # TODO: HOW to implement the derivative of softmax?


Jia • 9 months ago • edited


Thanks for your great work, I am confused with the derivative of cross-entropy: dL/do_i
= p_i - y_i. Here y_i is a on-hot encoded vector, p_i is a scalar at least from the
definition of the softmax part? (solved)

מנהל האתר • a year ago


You didn't explain why cross-entropy is widely used, why not use mse, and why cross-
entropy is suitable for softmax outputs. this is not good enough. Thank you for your
explanations. But in interview, it's not good enough.

Sankhadip Mazumder > מנהל האתר • 10 months ago


Actually cross-entropy has nothing to do with softmax , it's an information and
coding theory concept . An entropy function always tends to have admissible
gradient (used for heavy penalty for wrong classification )and has less tendency
to get saturated at extreme points.

jcatanza > מנהל האתר • a year ago


Why don't you explain it then?

Elad • a year ago


You lost me once you used Oi without specifying what it is
willprice94 > Elad • a year ago
p_i = softmax(o_i)

Miro > willprice94 • 3 months ago • edited


Then he should have used a_i, which was used in the first definition of
softmax. Usage of o_i was therefore justifiably misleading.

David Snyder • a year ago


why is he using a cross entropy loss that uses y.argmax(axis=1) instead of just making
one that uses the one hot encoded vector?

Ugur • a year ago • edited


Thanks you for such a great explanations and derivations.

But you want to change first equation from $p_i = \frac{e^{a_i}}{\sum_{k=1}^N e^a_k}$ to p_i = \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}

and also if floating point precision exceeded, numpy returns $inf$



boy boy • a year ago


why are we subracting 1 in

grad[range(m),y] -= 1

Khoa Doan > boy boy • a year ago


that's p - y, but only need to -1 from elements of p vector whose corresponding
output (y) is 1. For example if p = [0.1 0.1 0.1] and y = [1 0 1] then we only need
to subtract 1 from the first and third elements of p.

Veronica • 2 years ago


I don't understand one thing: why in the cross-entropy loss derivative, he's doing it with
respect to Oi? What is Oi?
Many thanks.

Taha AIT > Veronica • a year ago • edited


I didn't read the article, but i know that to minimize the Loss function he must
derive with respect to all the weight parameters to find gradient, that will be used
after in optimization (gradient descent)

Nogret Humphrey > Veronica • a year ago


Oi is the output of the fully connected layer.

Jae Duk Seo > Veronica • 2 years ago


I think the ouput of the final layer

Miro > Jae Duk Seo • 3 months ago


It's a_i used in the first definition of softmax function. For some
unknown reason, it was changed to o_i in the last formula.

Ben • 2 years ago


...what i was looking for...

Amit Chaudhary • 2 years ago


As a software engineer, It's easier to understand the math when starting with the code
and then breaking apart the math backwards. Thank you for the article.

Safoora Yousefi • 2 years ago


I don't think your softmax implemetation works for X of size (num_examples x
num_classes), as numpy.sum function sums over ALL elements of the input in
axis=None. You only want to sum over columns (classes), not samples.

Shivam Singla • 2 years ago


I really appreciate your efforts to clarify it, thanks sir/madam.

Scott Lowe • 2 years ago


Your mathematical notation has gone awry in a couple of places.
1. Originally you write your softmax in terms of logits labelled a_j. But then in the final
part, they become o_i. The change in subscript is fine, but they should still be a's and
not o's, since o is never defined. The o_i should be replaced with a_j in the cross
entropy derivative steps.
2. Also in the cross entropy derivative, you are doing a sum over k=/=i, which becomes
k=/=1 on the second steps and stays the same for the rest of the derivation. This should
be k=/=i all the way through, since the derivation is for a general logit and not just the
first one.

If you make these fixes, it will make it much easier to follow your derivation! Thanks.

Nogret Humphrey > Scott Lowe • a year ago


Agreed.The last part is confusing.But great article.

Veronica > Scott Lowe • 2 years ago


I agree, it was kind of hard to follow with these small typos and mistakes.

Rui Zhou • 2 years ago


Hi thanks for sharing!
I wonder that, in your def delta_cross_entropy(X,y):, if I transfer y to a one-hot vector
y', is there an easier way to the grad, which is just softmax(X) - y', or (softmax(X) - y') /
m? Thanks!

NABIH NEBBACHE • 2 years ago


Thank you for that very clear explanation, I wanted vectorize your solution of the Loss
derivative but i couldn't, I got stuck in the the vectorized softmax derivative, could you
please provide some hints so i can finish my solution?

Miro > NABIH NEBBACHE • 3 months ago


What exactly do you want to vectorize?

theanh • 2 years ago


Thank you for your good explanations. I have a trouble with the loss function. If you
don't mind, plz help me. If output from the last dense layer if a matrix (or 3-dimension
tensor, if batch size is included) in which elements are belong to (0, 1) and the target is
also a matrix with the same size in which each element is 0 or 1, then which loss
function should I use? sigmoid_cross_entropy_with_logits or mean_squared_error?
Thank you in advance.

Paras Dahal Author > theanh • 2 years ago • edited


Since your output is 0 or 1, you can use cross entropy. In Tensorflow
sigmoid_cross_entropy_with_logits function actually applies sigmoid function
to your outputs, bring them from binary to [0,1]. So I would suggest calculating it
as `cross_entropy = -tf.reduce_sum(y * tf.log(output))`. However, you have to
be careful of numerical instability as log 0 is undefined, so add a small value like
10e-8 to the matrices avoid NaNs.

ihadanny • 2 years ago
thanks for this great post! Question: when optimizing for a multi-label setup (i.e. more
than one class can be true), it makes no sense to use softmax, but it still makes sense to
use crossentropy loss. Can you show how to derive the gradient in this case?

Paras Dahal Author > ihadanny • 2 years ago • edited


Yes, in multi-label setup, softmax doesn't make sense as it gives a probability
distribution around all the labels but we can use sigmoid to get independant
probabilities for each label.

We can derive the gradient by swapping in sigmoid's derivative in place of softmax's derivative.

Milk_Wise • 2 years ago


Hello, Thanks for the explanation, even though I'm still confused.
Can I ask why use the p[range(m),y] and what it purpose?

Paras Dahal Author > Milk_Wise • 2 years ago


We are using multidimensional array indexing in the term p[range(m), y], which
can be a bit confusing. Lets break it up into the two axis. In the first axis,
range(m) is selecting all the rows of the p matrix. But by using 'y' in the second
axis, we are only taking the probabilities of the correct label as 'y' is the index of
the correct label.
This is in essence, same as multiplying the one-hot encoded y and summing up
along the axis 1, which can be seen in many implementations.

Refer to https://docs.scipy.org/doc/... to understand multidimensional array indexing.

Amir Hossein • 2 years ago


Good Job on this explanation, but shouldn't the last equation on Derivative of Softmax
section be dp_i/da_j = pi(kr_ij - p_j)?

Paras Dahal Author > Amir Hossein • 2 years ago


Yes it should be and I've corrected it. Thanks for pointing it out :)

Yash Akhauri • 2 years ago • edited


In the delta cross entropy function, you say:
"""
X is the output from fully connected layer (num_examples x num_classes)
y is labels (num_examples x 1)
"""
But the label has to be one hot, as is your output from the fully connected layer. This
comment is misleading, I think.
Edit: I think we need to first convert the one-hot label to index to bring it to
(num_examples x 1) form, right?
Do correct me if i am wrong.
Thanks!

Paras Dahal Author > Yash Akhauri • 2 years ago


Yes you are correct, we need to convert the one-hot label to indexes. We pass the
indexes instead of one-hot encoded vectors directly in the function because we
use it to index the 'p' matrix. But we can always get the indexes from one-hot
vectors by using y.argmax(axis=1) in the function itself if required.

Artem Chernodub • 2 years ago


I found a mistake (typo) in indexes - please, see a pic here http://piccy.info/view3/124...

Paras Dahal Author > Artem Chernodub • 2 years ago


Thanks for sharing :)

Artem Chernodub > Paras Dahal • 2 years ago


You are welcome )) Probably this will be funny for you to know that your
post was occasionally popular among my students. They have a task in
thier homework about the softmax in neural networks. As a result, 50% of
them did "the same" mistake, but the rest corrected it ))) Great post,
thanks again. Have a nice day!

Paras Dahal Author > Artem Chernodub • 2 years ago


Haha, thats quite funny. I am glad I helped 50% of your students
:D

Daniel Severo • 2 years ago


I think you have some typos in your softmax expressions. You write p_j = ... but I think
you mean p_i

Great post btw!


