
WiSe 2023/24

Deep Learning 1

Lecture 7: Loss Functions


Outline

Recap: Formulating the learning problem

Loss functions for regression
▶ 0/1 loss, squared loss, absolute loss, log-cosh
▶ Incorporating predictive uncertainty

Loss functions for classification
▶ 0/1 loss, perceptron loss, log loss
▶ Extensions to multiple classes

Practical aspects
▶ Utility-based loss functions
▶ Incorporating data quality
▶ Multiple tasks
Formulating the Learning Problem

Objective to minimize is often defined as the average over the training data of a loss function ℓ, measuring for each instance i the discrepancy between the prediction y_i = f(x_i, θ) and the ground truth t_i:

E(θ) = (1/N) ∑_{i=1}^N ℓ(y_i, t_i)

Two factors influence the learned model f:

▶ What data is available for training the model (Lectures 5 and 6).
▶ The choice of loss function, e.g. whether larger errors are penalized more than small errors (today's lecture).
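The averaged objective above can be sketched in a few lines of NumPy; the helper name `empirical_risk` and the example data are illustrative choices, not from the lecture:

```python
import numpy as np

def empirical_risk(loss, y, t):
    """Average of a per-instance loss over the training data:
    E(theta) = (1/N) * sum_i loss(y_i, t_i)."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.mean(loss(y, t)))

# Example with the squared loss (introduced later in this lecture):
squared = lambda y, t: (y - t) ** 2
risk = empirical_risk(squared, y=[1.0, 2.0, 3.0], t=[1.5, 2.0, 2.0])  # (0.25 + 0 + 1) / 3
```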

Part 1: Loss Functions for Regression
Regression Losses

Observations:
▶ In numerous applications, one needs to predict real values (e.g. age of an organism, expected durability of a component, energy of a physical system, value of a good, product of a chemical reaction, yield of a machine, temperature next week, etc.).
▶ For these applications, labels are provided as real-valued targets t ∈ R, and one needs to choose a loss function that quantifies well the difference between such a target value and the prediction f(x) ∈ R.

Several considerations for designing ℓ:
▶ What is the cost of making certain types of errors? Are small errors tolerated? Are big errors more costly?
▶ What is the quality of the ground-truth target values in the dataset? Are there some outliers?
The 0/1 Loss

Function to minimize:

ℓ(y, t) = { 0   if −ϵ ≤ y − t ≤ ϵ
          { 1   otherwise

Advantages:
▶ Tolerant to some small task-irrelevant discrepancies (→ does not need to fit the data exactly) and can therefore accommodate simple, better-generalizing models.
▶ Not affected by potential outliers in the data (just treat them as regular errors).

Disadvantage:
▶ The gradient of that loss function is almost always zero → impossible to optimize via gradient descent.
The Squared Loss

Function to minimize:

ℓ(y, t) = (y − t)²

Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Unlike the 0/1 loss, gradients are most of the time non-zero. This makes this loss easy to optimize.

Disadvantage:
▶ Strongly affected by outliers (errors grow quadratically).
The Absolute Loss

Function to minimize:

ℓ(y, t) = |y − t|

Advantages:
▶ Compared to the squared error, less affected by outliers (errors grow only linearly).
▶ Non-zero gradients → easy to optimize.

Disadvantage:
▶ Unlike the 0/1 loss and the squared error, it is not tolerant to small errors (small errors incur a non-negligible cost).
The Log-Cosh Loss

Function to minimize:

ℓ(y, t) = (1/β) log cosh(β · (y − t))

with β a positive-valued hyperparameter.

Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Non-zero gradients everywhere (except when the prediction is correct). This makes this loss easy to optimize.
▶ Only mildly affected by outliers (error grows linearly).
Regression Losses

Systematic comparison:

                          optimizable   outlier-robust   ϵ-tolerant
0/1 loss                       ✗              ✓              ✓
squared loss (y − t)²          ✓              ✗              ✓
absolute loss |y − t|          ✓              ✓              ✗
log-cosh loss                  ✓              ✓              ✓

Note:
▶ Many further loss functions have been proposed in the literature (e.g. Huber's loss, the ϵ-insensitive loss, etc.). They often implement similar desirable properties as the log-cosh loss.
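The four regression losses compared above can be written in a few lines of NumPy; a sketch in which the tolerance `eps` and the function names are our choices, not fixed by the lecture:

```python
import numpy as np

def zero_one(y, t, eps=0.5):
    # 1 outside the acceptable band |y - t| <= eps, 0 inside it
    return (np.abs(np.asarray(y, float) - t) > eps).astype(float)

def squared(y, t):
    return (np.asarray(y, float) - t) ** 2

def absolute(y, t):
    return np.abs(np.asarray(y, float) - t)

def logcosh(y, t, beta=1.0):
    # (1/beta) * log cosh(beta * (y - t)): quadratic for small errors,
    # linear (hence outlier-robust) for large ones
    return np.log(np.cosh(beta * (np.asarray(y, float) - t))) / beta
```

For a large residual r, log cosh(r) ≈ |r| − log 2, which makes the linear, outlier-robust regime of the log-cosh loss explicit.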
Regression Losses: Adding Predictive Uncertainty

Idea:
▶ Let the network output consist of two variables µ, σ , representing the
parameters of some probability distribution modeling the labels t, for
example a normal distribution y ∼ N (µ, σ).
▶ We can then define the log-likelihood function, which we would like to maximize w.r.t. the parameters of the network:

log p(y = t | µ, σ) = −(t − µ)²/(2σ²) − log(√(2π) σ)

Observation:
▶ The objective has a gradient w.r.t. µ and σ (as long as the scale σ is positive and not too small). To ensure this, one can use some special activation function to produce σ, e.g. σ = log(1 + exp(·)).
▶ If we set σ constant (i.e. disconnect it from the rest of the network), the model reduces to an application of the squared error loss function. However, if we learn σ, the latter provides us with an indication of prediction uncertainty.
▶ If we choose different data distributions, we recover different loss functions (e.g. the Laplace distribution yields the absolute loss, and the hyperbolic secant distribution yields the log-cosh loss).
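As a sketch of this idea, the negative log-likelihood from the slide (negated, so that it is minimized) and the softplus activation for σ can be written as follows; the function names are illustrative:

```python
import numpy as np

def softplus(z):
    # sigma = log(1 + exp(z)) > 0, the activation suggested above,
    # keeps the predicted scale strictly positive
    return np.log1p(np.exp(z))

def gaussian_nll(mu, sigma, t):
    # negative of log p(y = t | mu, sigma); minimizing this
    # maximizes the Gaussian log-likelihood
    return 0.5 * ((t - mu) / sigma) ** 2 + np.log(np.sqrt(2.0 * np.pi) * sigma)
```

With σ held constant at 1, `gaussian_nll` differs from the squared loss (t − µ)²/2 only by an additive constant, recovering the reduction mentioned above.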
Part 2: Loss Functions for Classification

Classification Losses

Observations:
▶ Classification is perhaps the most common scenario in machine learning (e.g. detecting whether some tissue contains cancerous cells or not, determining whether to grant access to some resource, detecting if some text is positive or negative, etc.).
▶ For these applications, labels are provided as elements of a set, typically t ∈ {−1, 1} for binary classification or t ∈ {1, 2, . . . , C} for multi-class classification.
▶ However, the output of the neural network is, like in the regression case, real-valued. For binary classification, it is typically a real-valued scalar the sign of which gives the class. The classification is then correct if and only if:

((y > 0) ∧ (t = 1)) ∨ ((y < 0) ∧ (t = −1))

and this can be written more compactly as:

y · t > 0
0/1 Loss

Function to minimize:

ℓ(y, t) = { 0   if y · t > 0
          { 1   if y · t < 0

Properties:
▶ Using the 0/1 loss function is equivalent to minimizing the average classification error on the training data.
▶ If the training data exactly corresponded to the test distribution, then minimizing this objective would exactly maximize what we are interested in, i.e. the classification accuracy.

Problem:
▶ The loss function has gradient zero everywhere ⇒ it can't be optimized via gradient descent.
Perceptron Loss

Function to minimize:

ℓ(y, t) = { 0    if y · t > 0
          { |y|  if y · t < 0

Note that it can also be formulated more compactly as ℓ(y, t) = max(0, −y · t).

Advantages:
▶ Gradient is non-zero for misclassifications and indicates how to adapt the model to reduce the classification errors.
▶ Remains fairly capable of dealing with misclassified data (like the 0/1 loss), because the error only grows linearly with y.

Disadvantage:
▶ Training stops as soon as training points are on the correct side of the decision boundary. → Unlikely to generalize well to new data points (the 0/1 loss function has the same problem). (Figure: the decision boundary learned after 31 iterations passes very close to the training data.)
Log-Loss

Function to minimize:

ℓ(y, t) = log(1 + exp(−y · t))

Advantages:
▶ Penalizes points that are correctly classified if the neural network output is too close to the threshold. This pushes the decision boundary away from the training data and provides intrinsic regularization properties. (Figure: the decision boundary learned after 999 iterations keeps a margin to the training data.)
Log-Loss

Probabilistic interpretation:
Assuming the following mapping from the neural network output y to class probabilities

p = ( exp(−y) / (1 + exp(−y)) , exp(y) / (1 + exp(y)) )

minimizing the log-loss is equivalent to minimizing the cross-entropy H(q, p), where q = (1_{t<0}, 1_{t>0}) is a one-hot vector encoding the class.

Proof:

H(q, p) = − ∑_{i=1}^2 q_i log p_i
        = −q_1 log p_1 − q_2 log p_2
        = −1_{t<0} log( e^{−y} / (1 + e^{−y}) ) − 1_{t>0} log( e^{y} / (1 + e^{y}) )
        = − log( e^{yt} / (1 + e^{yt}) )
        = − log( 1 / (1 + e^{−yt}) )
        = log(1 + e^{−yt})
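The equivalence proved above can also be checked numerically; a small sketch with an arbitrary score and label (the concrete values are illustrative):

```python
import numpy as np

y, t = 1.3, -1  # network output and label in {-1, +1}

# class probabilities as defined on the slide
p = np.array([np.exp(-y) / (1.0 + np.exp(-y)),
              np.exp(y) / (1.0 + np.exp(y))])

# one-hot encoding q = (1_{t<0}, 1_{t>0})
q = np.array([1.0, 0.0]) if t < 0 else np.array([0.0, 1.0])

cross_entropy = -np.sum(q * np.log(p))
log_loss = np.log(1.0 + np.exp(-y * t))
# cross_entropy and log_loss agree, as the proof shows
```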
Classification Losses

Systematic comparison:

                                optimizable   mislabeling-robust   builds margin
0/1 loss                             ✗                ✓                  ✗
perceptron loss max(0, −yt)          ✓                ✓                  ✗
log loss log(1 + exp(−yt))           ✓                ✓                  ✓
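The three classification losses from the table, sketched in NumPy for labels t ∈ {−1, +1} (the function names are our choices):

```python
import numpy as np

def zero_one(y, t):
    # 0 if y*t > 0 (correct decision), else 1
    return np.where(np.asarray(y, float) * t > 0, 0.0, 1.0)

def perceptron(y, t):
    # max(0, -y*t): linear penalty on the wrong side of the boundary
    return np.maximum(0.0, -np.asarray(y, float) * t)

def log_loss(y, t):
    # log(1 + exp(-y*t)), computed via logaddexp for numerical stability
    return np.logaddexp(0.0, -np.asarray(y, float) * t)
```

Note that the log-loss stays strictly positive even for correctly classified points close to the threshold, which is exactly the margin-building property of the table.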
Handling Multiple Classes

Blueprint:
▶ Build a neural network with as many outputs as there are classes, call them y_1, . . . , y_C.
▶ Classify as k = arg max[y_1, . . . , y_C].

Observation:
▶ The 0/1 loss function can then be straightforwardly generalized to the multi-class case as:

        t = 1   t = 2   ...   t = C
k = 1     0       1     ...     1
k = 2     1       0     ...     1
 ...                    ...
k = C     1       1     ...     0

▶ However, this generalization of the 0/1 loss suffers from the same problem as the original 0/1 loss, that is, the difficulty to optimize it, and the fact that it does not promote margins between the data/predictions and the decision boundary.
Handling Multiple Classes
Generalizing the log-loss to multiple classes:
▶ Let y_1, . . . , y_C be the C outputs of our network. Mapping these scores to a probability vector via the softmax function

p_i = exp(y_i) / ∑_{j=1}^C exp(y_j)

and constructing a one-hot encoding q of the class label t, we define the loss function as the cross-entropy H(q, p), i.e.

ℓ(y, t) = H(q, p) = − ∑_{i=1}^C q_i log p_i
                  = − log p_t
                  = log ∑_{j=1}^C exp(y_j) − y_t

which can be interpreted as the difference between the evidence found by the neural network for all classes and the evidence found by the neural network for the target class.
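The last expression, log ∑_j exp(y_j) − y_t, is how this loss is usually computed in practice; a sketch using the standard max-shift for numerical stability (the function name is illustrative, and t is a 0-based class index here rather than the slide's t ∈ {1, . . . , C}):

```python
import numpy as np

def multiclass_log_loss(y, t):
    """Cross-entropy between the one-hot encoding of class index t and
    the softmax of the scores y: log sum_j exp(y_j) - y_t."""
    y = np.asarray(y, dtype=float)
    m = y.max()  # subtracting the max before exp avoids overflow
    return m + np.log(np.sum(np.exp(y - m))) - y[t]
```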
Part 3: Practical Aspects
Practical Aspect 1: Non-Uniform Misclassification Costs

Example: medical diagnosis.
▶ Assume one type of error is much more costly than another, e.g. missing the detection of a disease.

                           Actual
Predicted        No infection    Infection
No infection           0           10000
Infection            2000            0

Approach for the 0/1 loss:
▶ To reflect this cost structure, the 0/1 loss can be straightforwardly enhanced by replacing the 1s in the loss function by the actual costs.
▶ Minimizing the loss function is then equivalent to minimizing the expected cost (or maximizing utility).

Approach for other losses:
▶ When the loss has a probabilistic interpretation (e.g. log-loss), one can treat the predicted probabilities p(y = k) as 'ground truth' and estimate the expected cost for class i as ∑_{k=1}^C cost(choose i | k) · p(y = k).
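The expected-cost decision rule can be sketched as follows, using the cost table above (the 2000/10000 figures come from the slide; the helper name is ours):

```python
import numpy as np

# cost[i, k] = cost of predicting class i when the actual class is k
# (rows and columns ordered as: no infection, infection)
cost = np.array([[0.0, 10000.0],
                 [2000.0, 0.0]])

def min_expected_cost_class(p):
    """Treat the predicted probabilities p(y = k) as ground truth and
    choose the class i minimizing sum_k cost(choose i | k) * p(y = k)."""
    expected = cost @ np.asarray(p, dtype=float)
    return int(np.argmin(expected))
```

With p = (0.8, 0.2), the expected costs are (2000, 1600), so the rule predicts "infection" even though "no infection" is the more probable class: the asymmetric costs shift the decision threshold.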
Practical Aspect 2: Labels of Varying Quality

(Figure: a dataset mixing low-quality and high-quality labels.)

Examples:
▶ Non-expert vs. expert labeler, outcome of a physics simulation with/without approximations, noisy/clean measurement of an experimental outcome.

Idea:
▶ In the presence of two similar instances with diverging labels, focus on the high-quality one. Low-quality labels remain useful in regions with scarce data.
Idea (cont.):
▶ Use a different loss function for different data points, e.g. associate to instance i the loss function

ℓ_i(y, t) = C_i · ℓ(y, t)

where C_i is a multiplicative factor set large if i is a high-quality data point or small if i is of low quality.
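This per-instance weighting can be sketched in NumPy (the function name and example data are illustrative):

```python
import numpy as np

def weighted_risk(loss, y, t, c):
    """Average of the per-instance losses C_i * loss(y_i, t_i), with C_i
    large for high-quality labels and small for low-quality ones."""
    y, t, c = (np.asarray(a, dtype=float) for a in (y, t, c))
    return float(np.mean(c * loss(y, t)))

squared = lambda y, t: (y - t) ** 2
# the second instance carries a low-quality label, so it is down-weighted
risk = weighted_risk(squared, y=[0.0, 0.0], t=[1.0, 1.0], c=[1.0, 0.1])
```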
Practical Aspect 3: Multiple Tasks
In practice, we may want the same neural network to perform several tasks simultaneously, e.g. multiple binary classification tasks, or some additional regression tasks.

Example: (New J. Phys. 15 095003, 2013)

Denoting by t = (t_1, . . . , t_L) the vector of targets for the L different tasks, and building a neural network with the corresponding number of outputs y = (y_1, . . . , y_L), we can define the loss function

ℓ(y, t) = ∑_{j=1}^L ℓ_j(y_j, t_j)

where ℓ_j is the loss function chosen for solving task j.
Remark 1:
▶ When the different tasks are regression tasks (with similar scale and weighting), and when applying the squared loss and the absolute loss to these different tasks, the multi-task loss takes the respective forms:

E(y, t) = ∑_{l=1}^L (y_l − t_l)² = ∥y − t∥²

E(y, t) = ∑_{l=1}^L |y_l − t_l| = ∥y − t∥₁
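These identities can be verified numerically; a sketch with per-task losses applied to illustrative vectors (names are our choices):

```python
import numpy as np

def multitask_loss(losses, y, t):
    # sum over tasks j of the per-task loss l_j(y_j, t_j)
    return sum(l(yj, tj) for l, yj, tj in zip(losses, y, t))

squared = lambda a, b: (a - b) ** 2
absolute = lambda a, b: abs(a - b)

y = np.array([1.0, -0.5, 2.0])
t = np.array([0.5, 0.0, 2.0])
# with the squared loss on every task the multi-task loss is ||y - t||^2;
# with the absolute loss it is ||y - t||_1
```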

Remark 2:
▶ We distinguish between multi-class classification and multiple binary classification tasks. For example, in image recognition, there are typically multiple objects in one image, and one often prefers to indicate for each object its presence or absence rather than to associate to the image a single class.
Summary

▶ Lectures 5 and 6 have highlighted that the actual data on which we train the model plays an important role. In Lecture 7, we have demonstrated that an equally important role is played by the way we specify the errors of the model through particular choices of a loss function ℓ.
▶ Many loss functions exist for tasks such as regression, binary classification, multi-class classification, multi-task learning, etc.
▶ Loss functions must be designed by taking multiple aspects into account, such as the ability to account for mislabelings, the ability to tolerate some noise, and the ability to support efficient optimization.
▶ Loss functions can be defined flexibly to address practical aspects such as the presence of asymmetric misclassification costs, subsets of the data with different data quality, or the presence of multiple subtasks.
