9.b Handout-1-Loss Functions
Loss functions
We have discussed the regularization loss part of the objective, which can be seen as
penalizing some measure of complexity of the model. The second part of an objective is the
data loss, which in a supervised learning problem measures the compatibility between a
prediction (e.g. the class scores in classification) and the ground truth label. The data loss
takes the form of an average over the data losses for every individual example. That is,
$$L = \frac{1}{N} \sum_i L_i$$
where \(N\) is the number of training examples. Let's abbreviate \(f = f(x_i; W)\) to be
the activations of the output layer in a Neural Network. There are several types of problems
you might want to solve in practice:
Classification is the case that we have so far discussed at length. Here, we assume a dataset
of examples and a single correct label (out of a fixed set) for each example. One of the two most
commonly seen cost functions in this setting is the SVM loss (e.g. the Weston Watkins
formulation):
$$L_i = \sum_{j \neq y_i} \max(0, f_j - f_{y_i} + 1)$$
As we briefly alluded to, some people report better performance with the squared hinge loss
(i.e. instead using \(\max(0, f_j - f_{y_i} + 1)^2\)). The second common choice is the Softmax
classifier that uses the cross-entropy loss:
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$
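To make these two per-example losses concrete, here is a minimal numpy sketch (not part of the original notes; the function names, the explicit `delta` margin argument, and the max-shift used for numerical stability in the softmax are our own choices):

```python
import numpy as np

def svm_loss_i(f, y_i, delta=1.0):
    """Multiclass (Weston Watkins) SVM loss for one example.
    f: 1D array of class scores, y_i: index of the correct class."""
    margins = np.maximum(0, f - f[y_i] + delta)
    margins[y_i] = 0  # the correct class does not contribute to the loss
    return np.sum(margins)

def softmax_loss_i(f, y_i):
    """Cross-entropy loss for one example, with the usual max-shift
    for numerical stability (it does not change the probabilities)."""
    f = f - np.max(f)
    log_probs = f - np.log(np.sum(np.exp(f)))
    return -log_probs[y_i]

f = np.array([3.2, 5.1, -1.7])   # hypothetical class scores
print(svm_loss_i(f, y_i=0))      # 2.9
print(softmax_loss_i(f, y_i=0))  # ~2.04
```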
Problem: Large number of classes. When the set of labels is very large (e.g. words in English
dictionary, or ImageNet which contains 22,000 categories), computing the full softmax
probabilities becomes expensive. For certain applications, approximate versions are popular.
For instance, it may be helpful to use Hierarchical Softmax in natural language processing
tasks (see one explanation here (pdf)). The hierarchical softmax decomposes words as labels
in a tree. Each label is then represented as a path along the tree, and a Softmax classifier is
trained at every node of the tree to disambiguate between the left and right branch. The
structure of the tree strongly impacts the performance and is generally problem-dependent.
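As a rough, hypothetical sketch of this idea (the path encoding, variable names, and the per-node dot-product scoring below are illustrative assumptions, not a reference implementation): each label is encoded as a sequence of left/right decisions down the tree, and its probability is the product of per-node binary decisions along that path.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(h, node_vectors, path):
    """Probability of one label under a binary label tree.
    h: representation of the input (1D array).
    node_vectors: dict mapping internal node id -> weight vector.
    path: list of (node_id, go_right) decisions identifying the label."""
    p = 1.0
    for node_id, go_right in path:
        p_right = sigmoid(np.dot(node_vectors[node_id], h))
        p *= p_right if go_right else (1.0 - p_right)
    return p
```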
Attribute classification. Both losses above assume that there is a single correct answer \(y_i\).
But what if \(y_i\) is a binary vector where every example may or may not have a certain attribute,
and where the attributes are not exclusive? For example, images on Instagram can be thought
of as labeled with a certain subset of hashtags from a large set of all hashtags, and an image
may contain multiple. A sensible approach in this case is to build a binary classifier for every
single attribute independently. For example, a binary classifier for each category could take the form:
$$L_i = \sum_j \max(0, 1 - y_{ij} f_j)$$
where the sum is over all categories \(j\), and \(y_{ij}\) is either +1 or -1 depending on whether the i-th
example is labeled with the j-th attribute, and the score \(f_j\) will be positive when the
class is predicted to be present and negative otherwise. Notice that loss is accumulated if a
positive example has score less than +1, or when a negative example has score greater than -1.
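A minimal numpy sketch of this per-example attribute hinge loss (the function and variable names are our own; `f` holds one score per attribute and `y_i` the corresponding +1/-1 labels):

```python
import numpy as np

def attribute_hinge_loss_i(f, y_i):
    """Multi-label hinge loss for one example.
    f: scores for each attribute, y_i: +1/-1 label for each attribute."""
    return np.sum(np.maximum(0, 1 - y_i * f))

f   = np.array([0.5, -2.0, 1.5])   # hypothetical attribute scores
y_i = np.array([1,   -1,   1])     # this example has attributes 0 and 2
print(attribute_hinge_loss_i(f, y_i))  # 0.5 (only the first score is inside the margin)
```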
An alternative to this loss would be to train a logistic regression classifier for every attribute
independently. A binary logistic regression classifier has only two classes (0,1), and calculates
the probability of class 1 as:
$$P(y = 1 \mid x; w, b) = \frac{1}{1 + e^{-(w^T x + b)}} = \sigma(w^T x + b)$$
Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is
\(P(y = 0 \mid x; w, b) = 1 - P(y = 1 \mid x; w, b)\). Hence, an example is classified as a positive
example (\(y = 1\)) if \(\sigma(w^T x + b) > 0.5\), or equivalently if the score \(w^T x + b > 0\). The loss is
then derived by maximizing this probability; you can convince yourself that this simplifies to
minimizing the negative log-likelihood:

$$L_i = -\sum_j \left[ y_{ij} \log\left(\sigma(f_j)\right) + (1 - y_{ij}) \log\left(1 - \sigma(f_j)\right) \right]$$
where the labels \(y_{ij}\) are assumed to be either 1 (positive) or 0 (negative), and \(\sigma(\cdot)\) is the
sigmoid function. The expression above can look scary but the gradient on \(f\) is in fact
extremely simple and intuitive: \(\partial L_i / \partial f_j = \sigma(f_j) - y_{ij}\) (as you can double check yourself
by taking the derivatives).
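A minimal numpy sketch of this loss and its gradient, assuming `f` holds one score per attribute and `y_i` the corresponding 0/1 labels (the names are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_attribute_loss_i(f, y_i):
    """Negative log-likelihood over independent sigmoid classifiers,
    plus its gradient with respect to the scores f (0/1 labels in y_i)."""
    p = sigmoid(f)
    loss = -np.sum(y_i * np.log(p) + (1 - y_i) * np.log(1 - p))
    dL_df = p - y_i   # matches sigma(f_j) - y_ij from the text
    return loss, dL_df
```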
Regression is the task of predicting real-valued quantities, such as the price of houses or the
length of something in an image. For this task, it is common to compute the difference between
the predicted quantity and the true answer, and then measure its squared L2 norm or its L1 norm.
The squared L2 norm computes the loss for a single example as:
$$L_i = \|f - y_i\|_2^2$$
The reason the L2 norm is squared in the objective is that the gradient becomes much simpler,
without changing the optimal parameters since squaring is a monotonic operation. The L1
norm would be formulated by summing the absolute value along each dimension:
$$L_i = \|f - y_i\|_1 = \sum_j |f_j - (y_i)_j|$$
where the sum \(\sum_j\) is a sum over all dimensions of the desired prediction, if there is more than
one quantity being predicted. Looking at only the j-th dimension of the i-th example and
denoting the difference between the true and the predicted value by \(\delta_{ij}\), the gradient for this
dimension (i.e. \(\partial L_i / \partial f_j\)) is easily derived to be either \(\delta_{ij}\) with the L2 norm, or \(\mathrm{sign}(\delta_{ij})\) with the L1 norm.
That is, the gradient on the score will either be directly proportional to the difference in the
error, or it will be fixed and only inherit the sign of the difference.
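A minimal numpy sketch of both regression losses and their gradients with respect to the scores (the names are our own; differentiating the squared L2 loss as written introduces a factor of 2, which is often absorbed by putting a 1/2 in front of the objective):

```python
import numpy as np

def l2_loss_i(f, y_i):
    """Squared L2 loss for one example and its gradient w.r.t. the scores f."""
    diff = f - y_i
    return np.sum(diff ** 2), 2 * diff   # gradient is proportional to the difference

def l1_loss_i(f, y_i):
    """L1 loss for one example and its (sub)gradient w.r.t. the scores f."""
    diff = f - y_i
    return np.sum(np.abs(diff)), np.sign(diff)  # gradient only keeps the sign
```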
Word of caution: It is important to note that the L2 loss is much harder to optimize than a
more stable loss such as Softmax. Intuitively, it requires a very fragile and specific property
from the network to output exactly one correct value for each input (and its augmentations).
Notice that this is not the case with Softmax, where the precise value of each score is less
important: It only matters that their magnitudes are appropriate. Additionally, the L2 loss is
less robust because outliers can introduce huge gradients. When faced with a regression
problem, first consider if it is absolutely inadequate to quantize the output into bins. For
example, if you are predicting star rating for a product, it might work much better to use 5
independent classifiers for ratings of 1-5 stars instead of a regression loss. Classification has
the additional benefit that it can give you a distribution over the regression outputs, not just a
single output with no indication of its confidence. If you’re certain that classification is not
appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying
dropout in the network (especially in the layer right before the L2 loss) is not a great idea.
When faced with a regression task, first consider if it is absolutely necessary. Instead, have a
strong preference for discretizing your outputs into bins and performing classification over them
whenever possible.
Structured prediction. The structured loss refers to a case where the labels can be arbitrary
structures such as graphs, trees, or other complex objects. Usually it is also assumed that the
space of structures is very large and not easily enumerable. The basic idea behind the
structured SVM loss is to demand a margin between the correct structure y i and the highest-
scoring incorrect structure. It is not common to solve this problem as a simple unconstrained
optimization problem with gradient descent. Instead, special solvers are usually devised so
that the specific simplifying assumptions of the structure space can be taken advantage of.
We mention the problem briefly but consider the specifics to be outside of the scope of the
class.
Summary
In summary:
The recommended preprocessing is to center the data to have mean of zero, and
normalize its scale to [-1, 1] along each feature