Lecture 07
Deep Learning 1
Practical Aspects
▶ Utility-based loss functions
▶ Multiple tasks
1/28
Formulating the Learning Problem
▶ The choice of loss function, e.g. whether larger errors are penalized
more than small errors (today's lecture).
2/28
Part 1 Loss Functions for Regression
3/28
Regression Losses
Observations:
▶ In numerous applications, one needs to predict real values (e.g. age of
an organism, expected durability of a component, energy of a physical
system, value of a good, product of a chemical reaction, yield of a
machine, temperature next week, etc.).
4/28
The 0/1 Loss
Function to minimize:

ℓ(y, t) = 0 if −ϵ ≤ y − t ≤ ϵ, and 1 otherwise

[Plot: ℓ as a function of y − t; acceptable within [−ϵ, ϵ], unacceptable outside]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies (→ does not need
to fit the data exactly) and can therefore accommodate simple,
better-generalizing models.
Disadvantage:
▶ The gradient of that loss function is almost always zero → impossible
to optimize via gradient descent.
5/28
The Squared Loss
Function to minimize:

ℓ(y, t) = (y − t)²

[Plot: ℓ as a function of y − t]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Unlike the 0/1 loss, gradients are most of the time non-zero. This
makes this loss easy to optimize.
Disadvantage:
▶ Strongly affected by outliers (errors grow quadratically).
6/28
The Absolute Loss
Function to minimize:

ℓ(y, t) = |y − t|

[Plot: ℓ as a function of y − t]
Advantages:
▶ Compared to the square error, less affected by outliers (errors grow
only linearly).
Disadvantage:
▶ Unlike the 0/1 loss and the square error, it is not tolerant to small
errors (small errors incur a non-negligible cost).
7/28
The Log-Cosh Loss
Function to minimize:

ℓ(y, t) = (1/β) · log cosh(β · (y − t))

with β a positive-valued hyperparameter.

[Plot: ℓ as a function of y − t; acceptable within a small region around zero, unacceptable outside]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
8/28
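The behavior of the log-cosh loss can be sketched numerically (a minimal NumPy illustration; the function name and the stable reformulation log cosh(x) = |x| + log(1 + e^(−2|x|)) − log 2 are ours):

```python
import numpy as np

def log_cosh_loss(y, t, beta=1.0):
    # (1/beta) * log(cosh(beta * (y - t))), computed via the numerically
    # stable identity log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)
    x = np.abs(beta * (np.asarray(y, dtype=float) - t))
    return (x + np.log1p(np.exp(-2.0 * x)) - np.log(2.0)) / beta

# Near zero the loss is approximately quadratic, like the squared loss:
small = log_cosh_loss(0.1, 0.0)   # close to 0.1**2 / 2 = 0.005
# Far from zero it grows only linearly, like the absolute loss:
large = log_cosh_loss(10.0, 0.0)  # close to |10| - log(2)
```

This quadratic-near-zero, linear-far-away shape is what gives log-cosh both ϵ-tolerance and robustness to outliers.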
Regression Losses
Systematic comparison
loss                      optimizable   outlier-robust   ϵ-tolerant
0/1 loss                       ✗              ✓              ✓
squared loss (y − t)²          ✓              ✗              ✓
absolute loss |y − t|          ✓              ✓              ✗
log-cosh loss                  ✓              ✓              ✓
Note:
▶ Many further loss functions have been proposed in the literature (e.g.
Huber's loss, the ϵ-insensitive loss, etc.). They often implement
desirable properties similar to those of the log-cosh loss.
9/28
Regression Losses: Adding Predictive Uncertainty
Idea:
▶ Let the network output consist of two variables µ, σ, representing the
parameters of some probability distribution modeling the labels t, for
example a normal distribution y ∼ N (µ, σ).
▶ We can then define the log-likelihood function, which we would like to
maximize w.r.t. the parameters of the network:

log p(y = t | µ, σ) = − (t − µ)² / (2σ²) − log(√(2π) σ)
10/28
Regression Losses: Adding Predictive Uncertainty
Objective to maximize:
log p(y = t | µ, σ) = − (t − µ)² / (2σ²) − log(√(2π) σ)
Observation:
▶ The objective has a gradient w.r.t. µ and σ (as long as the scale σ is
positive and not too small). To ensure this, one can use a special
activation function to produce σ, e.g. the softplus σ = log(1 + exp(·)).
▶ If we set σ constant (i.e. disconnect it from the rest of the network),
the model reduces to an application of the square error loss function.
However, if we learn σ, the latter provides us with an indication of
prediction uncertainty.
11/28
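The construction above can be sketched in code (a minimal NumPy version; `softplus` and `gaussian_nll` are our names, and the negative log-likelihood is simply the objective with the sign flipped):

```python
import numpy as np

def softplus(a):
    # sigma = log(1 + exp(a)) keeps the predicted scale strictly positive
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

def gaussian_nll(mu, sigma, t):
    # Negative log-likelihood of label t under N(mu, sigma);
    # minimizing it maximizes the objective from the slide.
    return (t - mu) ** 2 / (2.0 * sigma ** 2) + np.log(np.sqrt(2.0 * np.pi) * sigma)

# With sigma held constant, only the term (t - mu)^2 / (2 sigma^2)
# depends on the network, recovering the squared loss up to scaling;
# learning sigma additionally yields an uncertainty estimate.
```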
Part 2 Loss Functions for Classification
12/28
Classification Losses
Observations:
▶ Classification is perhaps the most common scenario in machine
learning (e.g. detecting whether a tissue sample is cancerous or not,
determining whether to grant access or not to some resource,
detecting if some text is positive or negative, etc.)
▶ However, the output of the neural network is, like in the regression
case, real-valued. For binary classification, it is typically a real-valued
scalar the sign of which gives the class. The classification is then
correct if and only if:

((y > 0) ∧ (t = 1)) ∨ ((y < 0) ∧ (t = −1)),  i.e.  y · t > 0
13/28
0/1 Loss
Function to minimize:

ℓ(y, t) = 0 if y · t > 0, and 1 if y · t < 0

[Plot: ℓ as a function of y · t; incorrect decision left, correct decision right]
Properties:
▶ Using the 0/1 loss function is equivalent to minimizing the average
classification error on the training data.
Problem:
▶ The loss function has gradient zero everywhere ⇒ It can't be
optimized via gradient descent.
14/28
Perceptron Loss
Function to minimize:

ℓ(y, t) = 0 if y · t > 0, and |y| if y · t < 0

[Plot: ℓ as a function of y · t]
Advantages:
▶ Gradient is non-zero for misclassifications and indicates how to adapt
the model to reduce the classification errors.
▶ Remains fairly capable of dealing with misclassified data (like the 0/1
loss), because the error only grows linearly with y.
Disadvantage:
▶ Training stops as soon as training points are on the correct side of the
decision boundary → unlikely to generalize well to new data points
(the 0/1 loss function has the same problem).

[Plot: decision boundary after training, iteration 31]
15/28
Log-Loss
Function to minimize:

ℓ(y, t) = log(1 + exp(−y · t))

[Plot: ℓ as a function of y · t; incorrect decision left, correct decision right]
Advantages:
▶ Penalizes points that are correctly classified if the neural network
output is too close to the threshold. This pushes the decision boundary
away from the training data and provides intrinsic regularization
properties.

[Plot: decision boundary after training, iteration 999]
16/28
Log-Loss
Probabilistic interpretation:
Assuming the following mapping from neural network output y to class
probabilities
p = ( exp(−y) / (1 + exp(−y)) , exp(y) / (1 + exp(y)) )
minimizing the log-loss is equivalent to minimizing the cross-entropy
H(q, p) where q = (1t<0 , 1t>0 ) is a one-hot vector encoding the class.
Proof:

H(q, p) = − Σ_{i=1}^{2} q_i log p_i
        = − q₁ log p₁ − q₂ log p₂
        = − 1_{t<0} log( e^{−y} / (1 + e^{−y}) ) − 1_{t>0} log( e^{y} / (1 + e^{y}) )
        = − log( e^{yt} / (1 + e^{yt}) )
        = − log( 1 / (1 + e^{−yt}) )
        = log(1 + e^{−yt})
17/28
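The equivalence proved above is easy to check numerically (a small sketch; the function names are ours):

```python
import numpy as np

def two_class_probs(y):
    # Mapping from network output y to the class probabilities (p1, p2)
    return np.array([np.exp(-y) / (1.0 + np.exp(-y)),
                     np.exp(y) / (1.0 + np.exp(y))])

def cross_entropy(y, t):
    # H(q, p) with q the one-hot encoding of the label t in {-1, +1}
    q = np.array([1.0, 0.0]) if t < 0 else np.array([0.0, 1.0])
    return float(-np.sum(q * np.log(two_class_probs(y))))

def log_loss(y, t):
    return float(np.log1p(np.exp(-y * t)))

# cross_entropy(y, t) == log_loss(y, t) for any y and label t in {-1, +1}
```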
Classification Losses
Systematic comparison
loss                                 optimizable   mislabeling-robust   builds margin
0/1 loss                                  ✗                ✓                 ✗
perceptron loss max(0, −yt)               ✓                ✓                 ✗
log loss log(1 + exp(−yt))                ✓                ✓                 ✓
18/28
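The three losses from the table can be contrasted on a correctly classified point that lies close to the decision threshold (a minimal sketch; the function names are ours):

```python
import numpy as np

def zero_one_loss(y, t):
    return 0.0 if y * t > 0 else 1.0

def perceptron_loss(y, t):
    return max(0.0, -y * t)

def log_loss(y, t):
    return float(np.log1p(np.exp(-y * t)))

# Correct but barely (y*t = 0.1): only the log-loss still penalizes the
# point, which is what pushes the boundary away and builds a margin.
```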
Handling Multiple Classes
Blueprint:
▶ Build a neural network with as many outputs
as there are classes, call them y1 , . . . , yC .
▶ Classify as k = arg max[y1 , . . . , yC ].
Observation:
▶ The 0/1 loss function can then be straightforwardly generalized to the
multi-class case as:

ℓ(y, t) = 0 if arg maxₖ yₖ = t, and 1 otherwise

▶ However, this generalization of the 0/1 loss suffers from the same
problems as the original 0/1 loss, that is, the difficulty to optimize it,
and the fact that it does not promote margins between the
data/predictions and the decision boundary.
19/28
Handling Multiple Classes
Generalizing the log-loss to multiple classes:
▶ Let y₁, . . . , y_C be the C outputs of our network. Mapping these scores
to a probability vector via the softmax function

p_i = exp(y_i) / Σ_{j=1}^{C} exp(y_j),

the log-loss generalizes to

ℓ(y, t) = − log p_t = log Σ_{j=1}^{C} exp(y_j) − y_t
20/28
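In code, the generalized log-loss is usually computed in this log-sum-exp form, with the maximum score subtracted inside the exponentials for numerical stability (a minimal sketch; the function name is ours):

```python
import numpy as np

def multiclass_log_loss(y, t):
    # y: vector of C class scores, t: index of the correct class.
    # loss = log(sum_j exp(y_j)) - y_t; subtracting max(y) inside the
    # exponentials avoids overflow and cancels out exactly.
    y = np.asarray(y, dtype=float)
    m = y.max()
    return float(m + np.log(np.sum(np.exp(y - m))) - y[t])
```

Note that shifting all scores by the same constant leaves the loss unchanged, since the shift cancels between the two terms.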
Part 3 Practical Aspects
21/28
Practical Aspect 1: Non-Uniform Misclassication Costs
▶ Assume one type of error is much more costly than another, e.g. missing
the detection of a disease.
[Table: costs per actual vs. predicted outcome]
22/28
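One common way to implement such asymmetric costs is to weight the loss by a class-dependent factor (a hypothetical sketch; the function name and the weights 5.0/1.0 are our illustration, not prescribed values):

```python
import numpy as np

def cost_weighted_log_loss(y, t, c_pos=5.0, c_neg=1.0):
    # Errors on positive instances (e.g. a missed disease, t = +1) are
    # weighted c_pos/c_neg times more heavily than errors on negatives.
    w = c_pos if t > 0 else c_neg
    return w * float(np.log1p(np.exp(-y * t)))
```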
Practical Aspect 2: Labels of Varying Quality
Examples:
▶ Non-expert vs. expert labeler, outcome of a physics simulation
with/without approximations, noisy/clean measurement of an
experimental outcome.
Idea:
▶ In the presence of two similar instances with diverging labels, focus on
the high-quality one. Low-quality labels remain useful in regions with
scarce data.
23/28
Practical Aspect 2: Labels of Varying Quality
Idea (cont.):
▶ Use a different loss function for different data points, e.g. one
associates to instance i the loss function:
ℓi (y, t) = Ci ℓ(y, t)
24/28
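A minimal sketch of this per-instance weighting with the squared loss (the coefficient values and names are illustrative):

```python
import numpy as np

def weighted_squared_loss(y, t, c):
    # c[i] encodes the label quality of instance i: trusted labels get a
    # large C_i, noisy ones a small C_i, per l_i(y, t) = C_i * l(y, t)
    y, t, c = (np.asarray(a, dtype=float) for a in (y, t, c))
    return float(np.mean(c * (y - t) ** 2))

# Expert label (c = 1.0) vs. noisy crowd label (c = 0.2), both off by 1:
loss = weighted_squared_loss([1.0, 1.0], [0.0, 0.0], [1.0, 0.2])  # 0.6
```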
Practical Aspect 3: Multiple Tasks
In practice, we may want the same neural network to perform several tasks
simultaneously, e.g. multiple binary classication tasks, or some additional
regression tasks.
ℓ(y, t) = Σ_{j=1}^{L} ℓ_j(y_j, t_j)
25/28
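The summed multi-task loss can be sketched as follows (a minimal illustration; each output head may use its own loss function, and all names are ours):

```python
import numpy as np

def multitask_loss(y, t, loss_fns):
    # Total loss: sum over tasks j of l_j(y_j, t_j)
    return sum(fn(yj, tj) for fn, yj, tj in zip(loss_fns, y, t))

squared = lambda y, t: (y - t) ** 2                      # a regression head
logistic = lambda y, t: float(np.log1p(np.exp(-y * t)))  # a binary classification head

total = multitask_loss([0.5, 2.0], [1.0, 1.0], [squared, logistic])
```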
Practical Aspect 3: Multiple Tasks
Remark 1:
▶ When the different tasks are regression tasks (with similar scale and
weighting), and when applying the squared loss and absolute loss to
these different tasks, the multi-task loss takes the respective forms:
E(y, t) = Σ_{l=1}^{L} (y_l − t_l)² = ∥y − t∥²

E(y, t) = Σ_{l=1}^{L} |y_l − t_l| = ∥y − t∥₁ .
Remark 2:
▶ We distinguish between multi-class classification and multiple binary
classification tasks. For example, in image recognition, there are
typically multiple objects in one image, and one often prefers to
indicate for each object its presence or absence rather than to
associate to the image a single class.
26/28
Summary
27/28
Summary
▶ Lectures 5 and 6 have highlighted that the actual data on which we train
the model plays an important role. In Lecture 7, we have demonstrated
that an equally important role is played by the way we specify the
errors of the model through particular choices of a loss function ℓ.
▶ Many loss functions exist for tasks such as regression, binary
classification, multi-class classification, multi-task learning, etc.
28/28