DeepNotes: Softmax and Cross Entropy
Softmax Function
PARAS DAHAL
The softmax function takes an N-dimensional vector of arbitrary real values $a$ and produces another N-dimensional vector $p$ with values in the range (0, 1) that sum to 1:

$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}$$
As the name suggests, the softmax function is a "soft" version of the max function. Instead of selecting one maximal value, it splits the whole (1) into parts, with the maximal element getting the largest portion of the distribution, but the other smaller elements getting some of it as well.
This property of the softmax function, that it outputs a probability distribution, makes it suitable for probabilistic interpretation in classification tasks.
import numpy as np

def softmax(X):
    exps = np.exp(X)
    return exps / np.sum(exps)
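For example (the scores below are arbitrary), the outputs all lie in (0, 1) and sum to 1:

scores = np.array([3.0, 1.0, 0.2])  # arbitrary example scores
print(softmax(scores))        # approximately [0.836, 0.113, 0.051]
print(softmax(scores).sum())  # 1.0 (up to floating point)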
We have to note that the numerical range of floating point numbers in numpy is limited. For float64 the upper bound is $10^{308}$. For the exponential, it is not difficult to overshoot that limit, in which case the result overflows and the softmax output becomes nan.
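As a quick illustration of this failure mode (the input values are chosen arbitrarily), the naive implementation above produces nan for large scores:

softmax(np.array([1000, 2000, 3000]))
# np.exp overflows to inf (with a RuntimeWarning), and inf / inf is nan:
# array([nan, nan, nan])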
To make our softmax function numerically stable, we simply normalize the values in the vector by multiplying the numerator and denominator with a constant $C$:
$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} = \frac{C e^{a_i}}{C \sum_{k=1}^{N} e^{a_k}} = \frac{e^{a_i + \log(C)}}{\sum_{k=1}^{N} e^{a_k + \log(C)}}$$
We can choose an arbitrary value for the $\log(C)$ term, but generally $\log(C) = -\max(a)$ is chosen, as it shifts all of the elements in the vector so that the largest becomes zero. Large negative exponents then saturate to zero rather than overflowing to infinity, avoiding nan.
def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)
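As an illustrative check (values chosen arbitrarily), the stable version matches the naive one on ordinary inputs, since the shift cancels out, and stays well defined on the large inputs that previously produced nan:

x_small = np.array([3.0, 1.0, 0.2])
print(np.allclose(softmax(x_small), stable_softmax(x_small)))  # True

x_large = np.array([1000, 2000, 3000])
print(stable_softmax(x_large))  # [0., 0., 1.] -- the smaller terms underflow harmlessly to 0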
Derivative of Softmax
From the quotient rule we know that for $f(x) = \frac{g(x)}{h(x)}$, we have $f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{h(x)^2}$.
In our case, $g = e^{a_i}$ and $h = \sum_{k=1}^{N} e^{a_k}$. Note that $\frac{\partial e^{a_i}}{\partial a_j}$ will be $e^{a_j}$ only if $i = j$; otherwise it is 0.
If $i = j$,

$$\begin{aligned}
\frac{\partial}{\partial a_j}\frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}
&= \frac{e^{a_i}\sum_{k=1}^{N} e^{a_k} - e^{a_j} e^{a_i}}{\left(\sum_{k=1}^{N} e^{a_k}\right)^2} \\
&= \frac{e^{a_i}\left(\sum_{k=1}^{N} e^{a_k} - e^{a_j}\right)}{\left(\sum_{k=1}^{N} e^{a_k}\right)^2} \\
&= \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \times \frac{\sum_{k=1}^{N} e^{a_k} - e^{a_j}}{\sum_{k=1}^{N} e^{a_k}} \\
&= p_i(1 - p_j)
\end{aligned}$$
For $i \neq j$,

$$\begin{aligned}
\frac{\partial}{\partial a_j}\frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}
&= \frac{0 - e^{a_j} e^{a_i}}{\left(\sum_{k=1}^{N} e^{a_k}\right)^2} \\
&= \frac{-e^{a_j}}{\sum_{k=1}^{N} e^{a_k}} \times \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} \\
&= -p_j \cdot p_i
\end{aligned}$$
$$\frac{\partial p_i}{\partial a_j} = \begin{cases} p_i(1 - p_j) & \text{if } i = j \\ -p_j \cdot p_i & \text{if } i \neq j \end{cases}$$

Or, using the Kronecker delta $\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$,

$$\frac{\partial p_i}{\partial a_j} = p_i(\delta_{ij} - p_j)$$
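In vectorized form, this says the Jacobian of the softmax is $\operatorname{diag}(p) - p p^T$. The following sketch (not part of the original post; the test vector is arbitrary) computes it with numpy and spot-checks it against central finite differences:

def softmax_jacobian(a):
    # J[i, j] = dp_i/da_j = p_i * (delta_ij - p_j)
    p = stable_softmax(a)
    return np.diag(p) - np.outer(p, p)

a = np.array([0.5, -1.0, 2.0])  # arbitrary test vector
eps = 1e-6
numeric = np.zeros((a.size, a.size))
for j in range(a.size):
    da = np.zeros_like(a)
    da[j] = eps
    numeric[:, j] = (stable_softmax(a + da) - stable_softmax(a - da)) / (2 * eps)
print(np.allclose(softmax_jacobian(a), numeric, atol=1e-6))  # True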
Cross Entropy Loss

Cross entropy indicates the distance between what the model believes the output distribution should be and what the original distribution really is. It is defined as

$$H(y, p) = -\sum_i y_i \log(p_i)$$

Cross entropy is a widely used alternative to the squared error. It is used when node activations can be understood as representing the probability that each hypothesis might be true, i.e. when the output is a probability distribution. Thus it is used as a loss function in neural networks which have softmax activations in the output layer.
def cross_entropy(X, y):
    """
    X is the output from fully connected layer (num_examples x num_classes)
    y is labels (num_examples x 1)
    Note that y is not a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded vectors of labels if required.
    """
    m = y.shape[0]
    # Apply the (stable) softmax to each row, i.e. to each example independently.
    p = np.apply_along_axis(stable_softmax, 1, X)
    # We use multidimensional array indexing to extract
    # softmax probability of the correct label for each sample.
    # Refer to https://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arra
    log_likelihood = -np.log(p[range(m), y])
    loss = np.sum(log_likelihood) / m
    return loss
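A small illustrative call (the scores and labels below are arbitrary), with two examples and three classes:

X = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([0, 1])        # index of the correct class for each example
print(cross_entropy(X, y))  # mean negative log-probability of the correct classes (about 0.32 here)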
Cross entropy loss with the softmax function is used extensively as the output layer of neural networks. Now we use the derivative of softmax [1] that we derived earlier to derive the derivative of the cross entropy loss function.
$$L = -\sum_i y_i \log(p_i)$$
$$\begin{aligned}
\frac{\partial L}{\partial o_i}
&= -\sum_k y_k \frac{\partial \log(p_k)}{\partial o_i} \\
&= -\sum_k y_k \frac{\partial \log(p_k)}{\partial p_k} \times \frac{\partial p_k}{\partial o_i} \\
&= -\sum_k y_k \frac{1}{p_k} \times \frac{\partial p_k}{\partial o_i}
\end{aligned}$$
From the derivative of softmax we derived earlier (splitting the $k = i$ term from the $k \neq i$ terms),

$$\begin{aligned}
\frac{\partial L}{\partial o_i}
&= -y_i(1 - p_i) - \sum_{k \neq i} y_k \frac{1}{p_k}(-p_k \cdot p_i) \\
&= -y_i(1 - p_i) + \sum_{k \neq i} y_k \cdot p_i \\
&= -y_i + y_i p_i + \sum_{k \neq i} y_k \cdot p_i \\
&= p_i\left(y_i + \sum_{k \neq i} y_k\right) - y_i
\end{aligned}$$

Since $y$ is a one-hot encoded vector, $y_i + \sum_{k \neq i} y_k = \sum_k y_k = 1$, so

$$\frac{\partial L}{\partial o_i} = p_i - y_i$$
which is a very simple and elegant expression. Translating it into code [2]:
def delta_cross_entropy(X, y):
    """
    X is the output from fully connected layer (num_examples x num_classes)
    y is labels (num_examples x 1)
    Note that y is not a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded vectors of labels if required.
    """
    m = y.shape[0]
    # Row-wise softmax, as in cross_entropy above.
    grad = np.apply_along_axis(stable_softmax, 1, X)
    # Subtracting 1 from the correct-class probability implements p - y
    # for a one-hot y; then average over the batch.
    grad[range(m), y] -= 1
    grad = grad / m
    return grad
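As a sanity check (a sketch, not from the original post; the test values are arbitrary), the analytic gradient p − y should agree with a numerical gradient of cross_entropy computed by central finite differences:

X = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([0, 1])
analytic = delta_cross_entropy(X, y)

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        dX = np.zeros_like(X)
        dX[i, j] = eps
        numeric[i, j] = (cross_entropy(X + dX, y) - cross_entropy(X - dX, y)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))  # True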
References