DeepNotes: Softmax & Cross Entropy

The document summarizes the softmax function and cross entropy loss. Softmax takes a vector of real numbers and transforms it into a probability distribution. Cross entropy loss measures the performance of a classification model whose output is a probability distribution. The derivatives of softmax and cross entropy loss are also provided, which are needed for backpropagation when training deep neural networks.


Classification and Loss Evaluation -

Softmax and Cross Entropy Loss


Let's dig a little deeper into how we convert the output of our CNN into probabilities - Softmax - and into the loss measure that guides our optimization - Cross Entropy.

The Softmax Function

Derivative of Softmax

Cross Entropy Loss

Derivative of Cross Entropy Loss with Softmax

PARAS DAHAL

Note: Complete source code can be found here


https://github.com/parasdahal/deepnet

The Softmax Function

The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0,1) that add up to 1:

$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}$$

As the name suggests, the softmax function is a “soft” version of the max function. Instead of selecting one maximum value, it splits the whole (the total probability mass of 1), with the maximal element getting the largest portion of the distribution, but other smaller elements getting some of it as well.

This property of the softmax function, that it outputs a probability distribution, makes it suitable for probabilistic interpretation in classification tasks.

In Python, we can code the softmax function as follows:

import numpy as np

def softmax(X):
    exps = np.exp(X)            # elementwise exponentials
    return exps / np.sum(exps)  # normalize so the outputs sum to 1
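
For example (a quick check of ours, not from the original post), the outputs are non-negative and sum to 1:

print(softmax(np.array([1.0, 2.0, 3.0])))
# [0.09003057 0.24472847 0.66524096]  -- sums to 1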

We have to note that the numerical range of floating point numbers in NumPy is limited. For float64 the upper bound is about 1.8 × 10^308. With exponentials it is not difficult to overshoot that limit, in which case the result overflows to inf and the subsequent normalization produces nan.
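
A small illustration of this failure mode (our own example, not from the post):

x = np.array([1000.0, 2.0])
print(np.exp(x))                      # [inf 7.3890561] -- NumPy warns about overflow
print(np.exp(x) / np.sum(np.exp(x)))  # [nan 0.] because inf / inf is nan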

To make our softmax function numerically stable, we simply normalize the values in the vector by multiplying the numerator and denominator with a constant C.

$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}
= \frac{C e^{a_i}}{C \sum_{k=1}^{N} e^{a_k}}
= \frac{e^{a_i + \log(C)}}{\sum_{k=1}^{N} e^{a_k + \log(C)}}$$

We can choose an arbitrary value for the log(C) term, but generally log(C) = −max(a) is chosen, as it shifts all elements of the vector so that the largest becomes zero; large negative arguments then saturate to zero rather than overflowing to infinity, avoiding nan.

The code for our stable softmax is as follows:

def stable_softmax(X):
    exps = np.exp(X - np.max(X))  # shift by the max so the largest exponent is 0
    return exps / np.sum(exps)
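
A quick comparison (again our own check): the shift does not change the result on ordinary inputs, and it fixes the overflow case from above:

x = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(x), stable_softmax(x)))  # True
print(stable_softmax(np.array([1000.0, 2.0])))     # [1. 0.] instead of [nan 0.]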

Derivative of Softmax

Due to the desirable property of the softmax function outputting a probability distribution, we use it as the final layer in neural networks. For this we need to calculate its derivative (gradient) and pass it back to the previous layer during backpropagation.
$$\frac{\partial p_i}{\partial a_j} = \frac{\partial \left( \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \right)}{\partial a_j}$$

From the quotient rule we know that for $f(x) = \frac{g(x)}{h(x)}$, we have $f'(x) = \frac{g'(x)\,h(x) - h'(x)\,g(x)}{h(x)^2}$.

In our case, $g(x) = e^{a_i}$ and $h(x) = \sum_{k=1}^{N} e^{a_k}$. In $h(x)$, the derivative with respect to $a_j$ is always $e^{a_j}$, because the sum always contains the term $e^{a_j}$. But we have to note that in $g(x)$, the derivative $\frac{\partial e^{a_i}}{\partial a_j}$ is $e^{a_j}$ only if $i = j$; otherwise it is 0.

If $i = j$,

$$\frac{\partial}{\partial a_j} \left( \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \right)
= \frac{e^{a_i} \sum_{k=1}^{N} e^{a_k} - e^{a_j} e^{a_i}}{\left( \sum_{k=1}^{N} e^{a_k} \right)^2}
= \frac{e^{a_i} \left( \sum_{k=1}^{N} e^{a_k} - e^{a_j} \right)}{\left( \sum_{k=1}^{N} e^{a_k} \right)^2}
= \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \times \frac{\sum_{k=1}^{N} e^{a_k} - e^{a_j}}{\sum_{k=1}^{N} e^{a_k}}
= p_i (1 - p_j)$$

For $i \neq j$,

$$\frac{\partial}{\partial a_j} \left( \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \right)
= \frac{0 - e^{a_j} e^{a_i}}{\left( \sum_{k=1}^{N} e^{a_k} \right)^2}
= \frac{-e^{a_j}}{\sum_{k=1}^{N} e^{a_k}} \times \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}
= -p_j \, p_i$$

So the derivative of the softmax function is given as

$$\frac{\partial p_i}{\partial a_j} = \begin{cases} p_i (1 - p_j) & \text{if } i = j \\ -p_j \, p_i & \text{if } i \neq j \end{cases}$$

Or, using the Kronecker delta $\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$,

$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j)$$
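
As a sanity check (a minimal sketch of ours, not part of the original post; softmax_jacobian and the finite-difference loop are our own additions), we can build the full Jacobian $p_i (\delta_{ij} - p_j)$ for one input vector and compare it with a numerical estimate:

import numpy as np

def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)

def softmax_jacobian(a):
    # J[i, j] = dp_i / da_j = p_i * (delta_ij - p_j)
    p = stable_softmax(a)
    return np.diag(p) - np.outer(p, p)

a = np.array([1.0, 2.0, 3.0])
eps = 1e-6
numerical = np.zeros((a.size, a.size))
for j in range(a.size):
    d = np.zeros_like(a)
    d[j] = eps
    # central-difference approximation of column j of the Jacobian
    numerical[:, j] = (stable_softmax(a + d) - stable_softmax(a - d)) / (2 * eps)

print(np.allclose(softmax_jacobian(a), numerical, atol=1e-6))  # True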

Cross Entropy Loss

Cross entropy indicates the distance between what the model believes the output distribution should be and what the original distribution really is. It is defined as

$$H(y, p) = -\sum_i y_i \log(p_i)$$

Cross entropy is a widely used alternative to the squared error. It is used when node activations can be understood as representing the probability that each hypothesis might be true, i.e. when the output is a probability distribution. Thus it is used as a loss function in neural networks which have softmax activations in the output layer.

def cross_entropy(X, y):
    """
    X is the output from the fully connected layer (num_examples x num_classes)
    y is labels: a 1-D array of class indices with shape (num_examples,), not a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded labels if required.
    """
    m = y.shape[0]
    p = softmax(X)
    # We use multidimensional array indexing to extract
    # the softmax probability of the correct label for each sample.
    # Refer to https://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arra
    log_likelihood = -np.log(p[range(m), y])
    loss = np.sum(log_likelihood) / m
    return loss
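
As a toy usage example (ours, and it assumes a batch-aware softmax that normalizes each row of X, which the single-vector softmax above does not do), a confident correct prediction contributes a loss near 0 while a uniform prediction over 3 classes contributes log 3:

import numpy as np

X = np.array([[10.0, 0.0, 0.0],   # confidently (and correctly) predicts class 0
              [ 0.0, 0.0, 0.0]])  # uniform over the 3 classes
y = np.array([0, 1])

p = np.exp(X - X.max(axis=1, keepdims=True))
p = p / p.sum(axis=1, keepdims=True)         # row-wise stable softmax
loss = -np.log(p[range(2), y]).sum() / 2
print(loss)                                   # ~0.55, i.e. (0 + log 3) / 2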

Derivative of Cross Entropy Loss with Softmax

Cross entropy loss with softmax is used extensively as the output layer of classification networks. Now we use the derivative of softmax [1] that we derived earlier to derive the derivative of the cross entropy loss function. Here $o_i$ denotes the $i$-th logit, i.e. the output of the fully connected layer that feeds the softmax (the $a_i$ above), so $p_i = \text{softmax}(o)_i$.

$$L = -\sum_k y_k \log(p_k)$$

$$\frac{\partial L}{\partial o_i}
= -\sum_k y_k \frac{\partial \log(p_k)}{\partial o_i}
= -\sum_k y_k \frac{\partial \log(p_k)}{\partial p_k} \times \frac{\partial p_k}{\partial o_i}
= -\sum_k y_k \frac{1}{p_k} \times \frac{\partial p_k}{\partial o_i}$$

From the derivative of softmax we derived earlier,

$$\frac{\partial L}{\partial o_i} = -y_i (1 - p_i) - \sum_{k \neq i} y_k \frac{1}{p_k} (-p_k \, p_i)$$

$$= -y_i (1 - p_i) + \sum_{k \neq i} y_k \, p_i$$

$$= -y_i + y_i p_i + \sum_{k \neq i} y_k \, p_i$$

$$= p_i \left( y_i + \sum_{k \neq i} y_k \right) - y_i$$

$y$ is a one-hot encoded vector for the labels, so $\sum_k y_k = 1$ and therefore $y_i + \sum_{k \neq i} y_k = 1$. So we have

$$\frac{\partial L}{\partial o_i} = p_i - y_i$$

which is a very simple and elegant expression. Translating it into code [2]:

def delta_cross_entropy(X, y):
    """
    X is the output from the fully connected layer (num_examples x num_classes)
    y is labels: a 1-D array of class indices with shape (num_examples,), not a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded labels if required.
    """
    m = y.shape[0]
    grad = softmax(X)
    # p - y: subtract 1 only at the index of the correct class for each example
    grad[range(m), y] -= 1
    # average the gradient over the batch
    grad = grad / m
    return grad
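
A minimal finite-difference check (our own sketch; it redefines the helpers with a batch-aware, row-normalized softmax so the functions work on a whole batch at once):

import numpy as np

def softmax(X):
    exps = np.exp(X - np.max(X, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)

def cross_entropy(X, y):
    m = y.shape[0]
    return -np.log(softmax(X)[range(m), y]).sum() / m

def delta_cross_entropy(X, y):
    m = y.shape[0]
    grad = softmax(X)
    grad[range(m), y] -= 1
    return grad / m

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([0, 2, 1, 2])

# central differences on every entry of X
eps = 1e-6
numerical = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        d = np.zeros_like(X)
        d[i, j] = eps
        numerical[i, j] = (cross_entropy(X + d, y) - cross_entropy(X - d, y)) / (2 * eps)

print(np.allclose(delta_cross_entropy(X, y), numerical, atol=1e-6))  # True
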
References

1. Bendersky, E., 2016. The Softmax function and its derivative. [link]

2. Karpathy, A., 2016. CS231n Convolutional Neural Networks for Visual Recognition. [link]

Comments

chsand420 • 10 days ago


While calculating the gradient, why are we dividing it by m?

张强 • a month ago
The code for the delta_cross_entropy seems to have something wrong.
x = np.array([11., 42., 3.])
y = np.array([1])
delta_cross_entropy(x,y)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-21-bbab36a3762b> in <module>
----> 1 delta_cross_entropy(x,y)

<ipython-input-5-f8bbefba54bf> in delta_cross_entropy(X, y)
8 m = y.shape[0]
9 grad = softmax(X)
---> 10 grad[range(m),y] -= 1
11 grad = grad/m
12 return grad

IndexError: too many indices for array



Miro • 3 months ago • edited


You have typo in section From derivative of softmax we derived earlier. In the last four
equations you use k=/=1 instead of k=/=i.

Davide Riva • 6 months ago


I've found some typos:
* p_i =e^{a_i} / ∑e^a_k should be p_i =e^{a_i} / ∑e^{a_k}
* "In python, we the code for softmax function as follows:" should be "In python, we
code the softmax function..." or something similar

ComfortablyBumb • 7 months ago


The derivative you calculate is not for the stable version of softmax and cross entropy,
right? I mean I didn't see the max element accounted for anywhere in derivative. Or do
you assume the "-max" on inputs has no effect on the derivative?

Aleksandr Dremov > ComfortablyBumb • 2 months ago


It has no effect on derivation. Moreover, you can add any constant to the
exponent and it will return the same values.

Dibben Nandakishor • 7 months ago


Hi mate, I have a question on implementing cross entropy. In your code
grad[range(m),y] -= 1, when i tried it out it minuses 1 from the entire column rather
than just the intended value. Can you explain this, thanks.

Dibben Nandakishor > Dibben Nandakishor • 7 months ago


found the reason why y should be in the shape (num_examples,) not
(num_examples,1)

兴乐 安 • 9 months ago • edited


Thanks for your great work. I just wonder how to implement the derivative part if I use
Softmax as a usual activation function (since x is vector)?

class Activation:
    def forward(self, x):
        raise NotImplementedError
    def derivative(self, x):
        raise NotImplementedError
    def __call__(self, *inputs):
        return self.forward(*inputs)

class SoftMax(Activation):
    def forward(self, x, axis=-1):
        shift_x = x - np.max(x, axis=axis, keepdims=True)  # stable softmax
        exp = np.exp(shift_x + 1e-6)
        return exp / np.sum(exp, axis=axis, keepdims=True)
    def derivative(self, x):
        # TODO: HOW to implement the derivative of softmax?


Jia • 9 months ago • edited


Thanks for your great work, I am confused with the derivative of cross-entropy: dL/do_i
= p_i - y_i. Here y_i is a on-hot encoded vector, p_i is a scalar at least from the
definition of the softmax part? (solved)

מנהל האתר • a year ago


You didn't explain why cross-entropy is widely used, why not use mse, and why cross-
entropy is suitable for softmax outputs. this is not good enough. Thank you for your
explanations. But in interview, it's not good enough.

Sankhadip Mazumder > מנהל האתר • 10 months ago


Actually cross-entropy has nothing to do with softmax , it's an information and
coding theory concept . An entropy function always tends to have admissible
gradient (used for heavy penalty for wrong classification )and has less tendency
to get saturated at extreme points.

jcatanza > מנהל האתר • a year ago


Why don't you explain it then?

Elad • a year ago


You lost me once you used Oi without specifying what it is
willprice94 > Elad • a year ago
p_i = softmax(o_i)

Miro > willprice94 • 3 months ago • edited


Then he should have used a_i, which was used in the first definition of
softmax. Usage of o_i was therefore justifiably misleading.

David Snyder • a year ago


why is he using a cross entropy loss that uses y.argmax(axis=1) instead of just making
one that uses the one hot encoded vector?

Ugur • a year ago • edited


Thanks you for such a great explanations and derivations.

But you want to change first equation from $p_i = \frac{e^{a_i}}{\sum_{k=1}^N e^a_k}$ to p_i = \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}

and also if floating point precision exceeded, numpy returns $inf$



boy boy • a year ago


why are we subracting 1 in

grad[range(m),y] -= 1

Khoa Doan > boy boy • a year ago


that's p - y, but only need to -1 from elements of p vector whose corresponding
output (y) is 1. For example if p = [0.1 0.1 0.1] and y = [1 0 1] then we only need
to subtract 1 from the first and third elements of p.

Veronica • 2 years ago


I don't understand one thing: why in the cross-entropy loss derivative, he's doing it with
respect to Oi? What is Oi?
Many thanks.

Taha AIT > Veronica • a year ago • edited


I didn't read the article, but i know that to minimize the Loss function he must
derive with respect to all the weight parameters to find gradient, that will be used
after in optimization (gradient descent)

Nogret Humphrey > Veronica • a year ago


Oi is the output of the fully connected layer.

Jae Duk Seo > Veronica • 2 years ago


I think the ouput of the final layer

Miro > Jae Duk Seo • 3 months ago


It's a_i used in the first definition of softmax function. For some
unknown reason, it was changed to o_i in the last formula.

Ben • 2 years ago


...what i was looking for...

Amit Chaudhary • 2 years ago


As a software engineer, It's easier to understand the math when starting with the code
and then breaking apart the math backwards. Thank you for the article.

Safoora Yousefi • 2 years ago


I don't think your softmax implemetation works for X of size (num_examples x
num_classes), as numpy.sum function sums over ALL elements of the input in
axis=None. You only want to sum over columns (classes), not samples.

Shivam Singla • 2 years ago


I really appreciate your efforts to clarify it, thanks sir/madam.

Scott Lowe • 2 years ago


Your mathematical notation has gone awry in a couple of places.
1. Originally you write your softmax in terms of logits labelled a_j. But then in the final
part, they become o_i. The change in subscript is fine, but they should still be a's and
not o's, since o is never defined. The o_i should be replaced with a_j in the cross
entropy derivative steps.
2. Also in the cross entropy derivative, you are doing a sum over k=/=i, which becomes
k=/=1 on the second steps and stays the same for the rest of the derivation. This should
be k=/=i all the way through, since the derivation is for a general logit and not just the
first one.

If you make these fixes, it will make it much easier to follow your derivation! Thanks.

Nogret Humphrey > Scott Lowe • a year ago


Agreed.The last part is confusing.But great article.

Veronica > Scott Lowe • 2 years ago


I agree, it was kind of hard to follow with these small typos and mistakes.

Rui Zhou • 2 years ago


Hi thanks for sharing!
I wonder that, in your def delta_cross_entropy(X,y):, if I transfer y to a one-hot vector
y', is there an easier way to the grad, which is just softmax(X) - y', or (softmax(X) - y') /
m? Thanks!

NABIH NEBBACHE • 2 years ago


Thank you for that very clear explanation, I wanted vectorize your solution of the Loss
derivative but i couldn't, I got stuck in the the vectorized softmax derivative, could you
please provide some hints so i can finish my solution?

Miro > NABIH NEBBACHE • 3 months ago


What exactly do you want to vectorize?

theanh • 2 years ago


Thank you for your good explanations. I have a trouble with the loss function. If you
don't mind, plz help me. If output from the last dense layer if a matrix (or 3-dimension
tensor, if batch size is included) in which elements are belong to (0, 1) and the target is
also a matrix with the same size in which each element is 0 or 1, then which loss
function should I use? sigmoid_cross_entropy_with_logits or mean_squared_error?
Thank you in advance.

Paras Dahal Author > theanh • 2 years ago • edited


Since your output is 0 or 1, you can use cross entropy. In Tensorflow
sigmoid_cross_entropy_with_logits function actually applies sigmoid function
to your outputs, bring them from binary to [0,1]. So I would suggest calculating it
as `cross_entropy = -tf.reduce_sum(y * tf.log(output))`. However, you have to
be careful of numerical instability as log 0 is undefined, so add a small value like
10e-8 to the matrices avoid NaNs.

ihadanny • 2 years ago
thanks for this great post! Question: when optimizing for a multi-label setup (i.e. more
than one class can be true), it makes no sense to use softmax, but it still makes sense to
use crossentropy loss. Can you show how to derive the gradient in this case?

Paras Dahal Author > ihadanny • 2 years ago • edited


Yes, in multi-label setup, softmax doesn't make sense as it gives a probability
distribution around all the labels but we can use sigmoid to get independant
probabilities for each label.

We can derive the gradient by swapping in sigmoid's derivative in place of softmax's derivative.

Milk_Wise • 2 years ago


Hello, Thanks for the explanation, even though I'm still confused.
Can I ask why use the p[range(m),y] and what it purpose?

Paras Dahal Author > Milk_Wise • 2 years ago


We are using multidimensional array indexing in the term p[range(m), y], which
can be a bit confusing. Lets break it up into the two axis. In the first axis,
range(m) is selecting all the rows of the p matrix. But by using 'y' in the second
axis, we are only taking the probabilities of the correct label as 'y' is the index of
the correct label.
This is in essence, same as multiplying the one-hot encoded y and summing up
along the axis 1, which can be seen in many implementations.

Refer to https://docs.scipy.org/doc/... to understand multidimensional array indexing.

Amir Hossein • 2 years ago


Good Job on this explanation, but shouldn't the last equation on Derivative of Softmax
section be dp_i/da_j = pi(kr_ij - p_j)?

Paras Dahal Author > Amir Hossein • 2 years ago


Yes it should be and I've corrected it. Thanks for pointing it out :)

Yash Akhauri • 2 years ago • edited


In the delta cross entropy function, you say:
"""
X is the output from fully connected layer (num_examples x num_classes)
y is labels (num_examples x 1)
"""
But the label has to be one hot, as is your output from the fully connected layer. This
comment is misleading, I think.
Edit: I think we need to first convert the one-hot label to index to bring it to
(num_examples x 1) form, right?
Do correct me if i am wrong.
Thanks!

Paras Dahal Author > Yash Akhauri • 2 years ago


Yes you are correct, we need to convert the one-hot label to indexes. We pass the
indexes instead of one-hot encoded vectors directly in the function because we
use it to index the 'p' matrix. But we can always get the indexes from one-hot
vectors by using y.argmax(axis=1) in the function itself if required.

Artem Chernodub • 2 years ago


I found a mistake (typo) in indexes - please, see a pic here http://piccy.info/view3/124...

Paras Dahal Author > Artem Chernodub • 2 years ago


Thanks for sharing :)

Artem Chernodub > Paras Dahal • 2 years ago


You are welcome )) Probably this will be funny for you to know that your
post was occasionally popular among my students. They have a task in
thier homework about the softmax in neural networks. As a result, 50% of
them did "the same" mistake, but the rest corrected it ))) Great post,
thanks again. Have a nice day!

Paras Dahal Author > Artem Chernodub • 2 years ago


Haha, thats quite funny. I am glad I helped 50% of your students
:D

Daniel Severo • 2 years ago


I think you have some typos in your softmax expressions. You write p_j = ... but I think
you mean p_i

Great post btw!


