
WiSe 2023/24

Deep Learning 1

Lecture 7: Loss Functions


Outline

Recap: Formulating the learning problem

Loss functions for regression
▶ 0/1 loss, squared loss, absolute loss, log-cosh
▶ Incorporating predictive uncertainty

Loss functions for classification
▶ 0/1 loss, perceptron loss, log loss
▶ Extensions to multiple classes

Practical aspects
▶ Utility-based loss functions
▶ Incorporating data quality
▶ Multiple tasks
Formulating the Learning Problem

Objective to minimize is often defined as the average over the training data of a loss function ℓ, measuring for each instance i the discrepancy between the prediction y_i = f(x_i, θ) and the ground truth t_i:

E(θ) = (1/N) ∑_{i=1}^N ℓ(y_i, t_i)

Two factors influence the learned model f:

▶ What data is available for training the model (Lectures 5 and 6).
▶ The choice of loss function, e.g. whether larger errors are penalized more than small errors (today's lecture).
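The averaged objective above can be sketched in a few lines of NumPy; the helper name `empirical_risk` and the example data are illustrative choices, not from the lecture:

```python
import numpy as np

def empirical_risk(loss, y, t):
    """Average of a per-instance loss over the training data:
    E(theta) = (1/N) * sum_i loss(y_i, t_i)."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.mean(loss(y, t)))

# Example with the squared loss (introduced later in this lecture):
squared = lambda y, t: (y - t) ** 2
risk = empirical_risk(squared, y=[1.0, 2.0, 3.0], t=[1.5, 2.0, 2.0])  # (0.25 + 0 + 1) / 3
```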

Part 1: Loss Functions for Regression
Regression Losses

Observations:
▶ In numerous applications, one needs to predict real values (e.g. age of an organism, expected durability of a component, energy of a physical system, value of a good, product of a chemical reaction, yield of a machine, temperature next week, etc.).
▶ For these applications, labels are provided as real-valued targets t ∈ R, and one needs to choose a loss function that quantifies well the difference between such a target value and the prediction f(x) ∈ R.

Several considerations for designing ℓ:
▶ What is the cost of making certain types of errors? Are small errors tolerated? Are big errors more costly?
▶ What is the quality of the ground-truth target values in the dataset? Are there some outliers?
The 0/1 Loss

Function to minimize:

ℓ(y, t) = { 0   if −ϵ ≤ y − t ≤ ϵ
          { 1   otherwise

Advantages:
▶ Tolerant to some small task-irrelevant discrepancies (→ does not need to fit the data exactly) and can therefore accommodate simple, better-generalizing models.
▶ Not affected by potential outliers in the data (just treat them as regular errors).

Disadvantage:
▶ The gradient of that loss function is almost always zero → impossible to optimize via gradient descent.
The Squared Loss

Function to minimize:

ℓ(y, t) = (y − t)²

Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Unlike the 0/1 loss, gradients are most of the time non-zero. This makes this loss easy to optimize.

Disadvantage:
▶ Strongly affected by outliers (errors grow quadratically).
The Absolute Loss

Function to minimize:

ℓ(y, t) = |y − t|

Advantages:
▶ Compared to the squared error, less affected by outliers (errors grow only linearly).
▶ Non-zero gradients → easy to optimize.

Disadvantage:
▶ Unlike the 0/1 loss and the squared error, it is not tolerant to small errors (small errors incur a non-negligible cost).
The Log-Cosh Loss

Function to minimize:

ℓ(y, t) = (1/β) log cosh(β · (y − t))

with β a positive-valued hyperparameter.

Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Non-zero gradients everywhere (except when the prediction is correct). This makes this loss easy to optimize.
▶ Only mildly affected by outliers (error grows linearly).
Regression Losses

Systematic comparison:

                          optimizable   outlier-robust   ϵ-tolerant
0/1 loss                       ✗              ✓              ✓
squared loss (y − t)²          ✓              ✗              ✓
absolute loss |y − t|          ✓              ✓              ✗
log-cosh loss                  ✓              ✓              ✓

Note:
▶ Many further loss functions have been proposed in the literature (e.g. Huber's loss, the ϵ-insensitive loss, etc.). They often implement similar desirable properties as the log-cosh loss.
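The four regression losses compared above can be written in a few lines of NumPy; a sketch in which the tolerance `eps` and the function names are our choices, not fixed by the lecture:

```python
import numpy as np

def zero_one(y, t, eps=0.5):
    # 1 outside the acceptable band |y - t| <= eps, 0 inside it
    return (np.abs(np.asarray(y, float) - t) > eps).astype(float)

def squared(y, t):
    return (np.asarray(y, float) - t) ** 2

def absolute(y, t):
    return np.abs(np.asarray(y, float) - t)

def logcosh(y, t, beta=1.0):
    # (1/beta) * log cosh(beta * (y - t)): quadratic for small errors,
    # linear (hence outlier-robust) for large ones
    return np.log(np.cosh(beta * (np.asarray(y, float) - t))) / beta
```

For a large residual r, log cosh(r) ≈ |r| − log 2, which makes the linear, outlier-robust regime of the log-cosh loss explicit.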
Regression Losses: Adding Predictive Uncertainty

Idea:
▶ Let the network output consist of two variables µ, σ , representing the
parameters of some probability distribution modeling the labels t, for
example a normal distribution y ∼ N (µ, σ).
▶ We can then define the log-likelihood function, which we would like to maximize w.r.t. the parameters of the network:

log p(y = t | µ, σ) = −(t − µ)²/(2σ²) − log(√(2π) σ)

Observation:
▶ The objective has a gradient w.r.t. µ and σ (as long as the scale σ is positive and not too small). To ensure this, one can use some special activation function to produce σ, e.g. σ = log(1 + exp(·)).
▶ If we set σ constant (i.e. disconnect it from the rest of the network), the model reduces to an application of the squared error loss function. However, if we learn σ, the latter provides us with an indication of prediction uncertainty.
▶ If we choose different data distributions, we recover different loss functions (e.g. the Laplace distribution yields the absolute loss, and the hyperbolic secant distribution yields the log-cosh loss).
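As a sketch of this idea, the negative log-likelihood from the slide (negated, so that it is minimized) and the softplus activation for σ can be written as follows; the function names are illustrative:

```python
import numpy as np

def softplus(z):
    # sigma = log(1 + exp(z)) > 0, the activation suggested above,
    # keeps the predicted scale strictly positive
    return np.log1p(np.exp(z))

def gaussian_nll(mu, sigma, t):
    # negative of log p(y = t | mu, sigma); minimizing this
    # maximizes the Gaussian log-likelihood
    return 0.5 * ((t - mu) / sigma) ** 2 + np.log(np.sqrt(2.0 * np.pi) * sigma)
```

With σ held constant at 1, `gaussian_nll` differs from the squared loss (t − µ)²/2 only by an additive constant, recovering the reduction mentioned above.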
Part 2: Loss Functions for Classification

Classification Losses

Observations:
▶ Classification is perhaps the most common scenario in machine learning (e.g. detecting whether some tissue contains cancerous cells or not, determining whether to grant access to some resource, detecting if some text is positive or negative, etc.).
▶ For these applications, labels are provided as elements of a set, typically t ∈ {−1, 1} for binary classification or t ∈ {1, 2, . . . , C} for multi-class classification.
▶ However, the output of the neural network is, like in the regression case, real-valued. For binary classification, it is typically a real-valued scalar the sign of which gives the class. The classification is then correct if and only if:

((y > 0) ∧ (t = 1)) ∨ ((y < 0) ∧ (t = −1))

and this can be written more compactly as:

y · t > 0
0/1 Loss

Function to minimize:

ℓ(y, t) = { 0   if y · t > 0
          { 1   if y · t < 0

Properties:
▶ Using the 0/1 loss function is equivalent to minimizing the average classification error on the training data.
▶ If the training data exactly corresponded to the test distribution, then minimizing this objective would exactly maximize what we are interested in, i.e. the classification accuracy.

Problem:
▶ The loss function has gradient zero everywhere ⇒ it can't be optimized via gradient descent.
Perceptron Loss

Function to minimize:

ℓ(y, t) = { 0    if y · t > 0
          { |y|  if y · t < 0

Note that it can also be formulated more compactly as ℓ(y, t) = max(0, −y · t).

Advantages:
▶ Gradient is non-zero for misclassifications and indicates how to adapt the model to reduce the classification errors.
▶ Remains fairly capable of dealing with misclassified data (like the 0/1 loss), because the error only grows linearly with y.

Disadvantage:
▶ Training stops as soon as training points are on the correct side of the decision boundary. → Unlikely to generalize well to new data points (the 0/1 loss function has the same problem). (Figure: the decision boundary learned after 31 iterations passes very close to the training data.)
Log-Loss

Function to minimize:

ℓ(y, t) = log(1 + exp(−y · t))

Advantages:
▶ Penalizes points that are correctly classified if the neural network output is too close to the threshold. This pushes the decision boundary away from the training data and provides intrinsic regularization properties. (Figure: the decision boundary learned after 999 iterations keeps a margin to the training data.)
Log-Loss

Probabilistic interpretation:
Assuming the following mapping from the neural network output y to class probabilities

p = ( exp(−y) / (1 + exp(−y)) , exp(y) / (1 + exp(y)) )

minimizing the log-loss is equivalent to minimizing the cross-entropy H(q, p), where q = (1_{t<0}, 1_{t>0}) is a one-hot vector encoding the class.

Proof:

H(q, p) = − ∑_{i=1}^2 q_i log p_i
        = −q_1 log p_1 − q_2 log p_2
        = −1_{t<0} log( e^{−y} / (1 + e^{−y}) ) − 1_{t>0} log( e^{y} / (1 + e^{y}) )
        = − log( e^{yt} / (1 + e^{yt}) )
        = − log( 1 / (1 + e^{−yt}) )
        = log(1 + e^{−yt})
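The equivalence proved above can also be checked numerically; a small sketch with an arbitrary score and label (the concrete values are illustrative):

```python
import numpy as np

y, t = 1.3, -1  # network output and label in {-1, +1}

# class probabilities as defined on the slide
p = np.array([np.exp(-y) / (1.0 + np.exp(-y)),
              np.exp(y) / (1.0 + np.exp(y))])

# one-hot encoding q = (1_{t<0}, 1_{t>0})
q = np.array([1.0, 0.0]) if t < 0 else np.array([0.0, 1.0])

cross_entropy = -np.sum(q * np.log(p))
log_loss = np.log(1.0 + np.exp(-y * t))
# cross_entropy and log_loss agree, as the proof shows
```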
Classification Losses

Systematic comparison:

                                optimizable   mislabeling-robust   builds margin
0/1 loss                             ✗                ✓                  ✗
perceptron loss max(0, −yt)          ✓                ✓                  ✗
log loss log(1 + exp(−yt))           ✓                ✓                  ✓
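The three classification losses from the table, sketched in NumPy for labels t ∈ {−1, +1} (the function names are our choices):

```python
import numpy as np

def zero_one(y, t):
    # 0 if y*t > 0 (correct decision), else 1
    return np.where(np.asarray(y, float) * t > 0, 0.0, 1.0)

def perceptron(y, t):
    # max(0, -y*t): linear penalty on the wrong side of the boundary
    return np.maximum(0.0, -np.asarray(y, float) * t)

def log_loss(y, t):
    # log(1 + exp(-y*t)), computed via logaddexp for numerical stability
    return np.logaddexp(0.0, -np.asarray(y, float) * t)
```

Note that the log-loss stays strictly positive even for correctly classified points close to the threshold, which is exactly the margin-building property of the table.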
Handling Multiple Classes

Blueprint:
▶ Build a neural network with as many outputs as there are classes, call them y_1, . . . , y_C.
▶ Classify as k = arg max[y_1, . . . , y_C].

Observation:
▶ The 0/1 loss function can then be straightforwardly generalized to the multi-class case as:

        t = 1   t = 2   ...   t = C
k = 1     0       1     ...     1
k = 2     1       0     ...     1
 ...                    ...
k = C     1       1     ...     0

▶ However, this generalization of the 0/1 loss suffers from the same problem as the original 0/1 loss, that is, the difficulty to optimize it, and the fact that it does not promote margins between the data/predictions and the decision boundary.
Handling Multiple Classes
Generalizing the log-loss to multiple classes:
▶ Let y_1, . . . , y_C be the C outputs of our network. Mapping these scores to a probability vector via the softmax function

p_i = exp(y_i) / ∑_{j=1}^C exp(y_j)

and constructing a one-hot encoding q of the class label t, we define the loss function as the cross-entropy H(q, p), i.e.

ℓ(y, t) = H(q, p) = − ∑_{i=1}^C q_i log p_i
                  = − log p_t
                  = log ∑_{j=1}^C exp(y_j) − y_t

which can be interpreted as the difference between the evidence found by the neural network for all classes and the evidence found by the neural network for the target class.
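The last expression, log ∑_j exp(y_j) − y_t, is how this loss is usually computed in practice; a sketch using the standard max-shift for numerical stability (the function name is illustrative, and t is a 0-based class index here rather than the slide's t ∈ {1, . . . , C}):

```python
import numpy as np

def multiclass_log_loss(y, t):
    """Cross-entropy between the one-hot encoding of class index t and
    the softmax of the scores y: log sum_j exp(y_j) - y_t."""
    y = np.asarray(y, dtype=float)
    m = y.max()  # subtracting the max before exp avoids overflow
    return m + np.log(np.sum(np.exp(y - m))) - y[t]
```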
Part 3: Practical Aspects
Practical Aspect 1: Non-Uniform Misclassification Costs

Example: medical diagnosis.
▶ Assume one type of error is much more costly than another, e.g. missing the detection of a disease.

                           Actual
Predicted        No infection    Infection
No infection           0           10000
Infection            2000            0

Approach for the 0/1 loss:
▶ To reflect this cost structure, the 0/1 loss can be straightforwardly enhanced by replacing the 1s in the loss function by the actual costs.
▶ Minimizing the loss function is then equivalent to minimizing the expected cost (or maximizing utility).

Approach for other losses:
▶ When the loss has a probabilistic interpretation (e.g. log-loss), one can treat the predicted probabilities p(y = k) as 'ground truth' and estimate the expected cost for class i as ∑_{k=1}^C cost(choose i | k) · p(y = k).
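The expected-cost decision rule can be sketched as follows, using the cost table above (the 2000/10000 figures come from the slide; the helper name is ours):

```python
import numpy as np

# cost[i, k] = cost of predicting class i when the actual class is k
# (rows and columns ordered as: no infection, infection)
cost = np.array([[0.0, 10000.0],
                 [2000.0, 0.0]])

def min_expected_cost_class(p):
    """Treat the predicted probabilities p(y = k) as ground truth and
    choose the class i minimizing sum_k cost(choose i | k) * p(y = k)."""
    expected = cost @ np.asarray(p, dtype=float)
    return int(np.argmin(expected))
```

With p = (0.8, 0.2), the expected costs are (2000, 1600), so the rule predicts "infection" even though "no infection" is the more probable class: the asymmetric costs shift the decision threshold.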
Practical Aspect 2: Labels of Varying Quality

(Figure: a dataset mixing low-quality and high-quality labels.)

Examples:
▶ Non-expert vs. expert labeler, outcome of a physics simulation with/without approximations, noisy/clean measurement of an experimental outcome.

Idea:
▶ In the presence of two similar instances with diverging labels, focus on the high-quality one. Low-quality labels remain useful in regions with scarce data.
Idea (cont.):
▶ Use a different loss function for different data points, e.g. associate to instance i the loss function

ℓ_i(y, t) = C_i · ℓ(y, t)

where C_i is a multiplicative factor set large if i is a high-quality data point or small if i is of low quality.
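This per-instance weighting can be sketched in NumPy (the function name and example data are illustrative):

```python
import numpy as np

def weighted_risk(loss, y, t, c):
    """Average of the per-instance losses C_i * loss(y_i, t_i), with C_i
    large for high-quality labels and small for low-quality ones."""
    y, t, c = (np.asarray(a, dtype=float) for a in (y, t, c))
    return float(np.mean(c * loss(y, t)))

squared = lambda y, t: (y - t) ** 2
# the second instance carries a low-quality label, so it is down-weighted
risk = weighted_risk(squared, y=[0.0, 0.0], t=[1.0, 1.0], c=[1.0, 0.1])
```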
Practical Aspect 3: Multiple Tasks
In practice, we may want the same neural network to perform several tasks simultaneously, e.g. multiple binary classification tasks, or some additional regression tasks.

Example: (New J. Phys. 15 095003, 2013)

Denoting by t = (t_1, . . . , t_L) the vector of targets for the L different tasks, and building a neural network with the corresponding number of outputs y = (y_1, . . . , y_L), we can define the loss function

ℓ(y, t) = ∑_{j=1}^L ℓ_j(y_j, t_j)

where ℓ_j is the loss function chosen for solving task j.
Remark 1:
▶ When the different tasks are regression tasks (with similar scale and weighting), and when applying the squared loss and the absolute loss to these different tasks, the multi-task loss takes the respective forms:

E(y, t) = ∑_{l=1}^L (y_l − t_l)² = ∥y − t∥²

E(y, t) = ∑_{l=1}^L |y_l − t_l| = ∥y − t∥₁
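These identities can be verified numerically; a sketch with per-task losses applied to illustrative vectors (names are our choices):

```python
import numpy as np

def multitask_loss(losses, y, t):
    # sum over tasks j of the per-task loss l_j(y_j, t_j)
    return sum(l(yj, tj) for l, yj, tj in zip(losses, y, t))

squared = lambda a, b: (a - b) ** 2
absolute = lambda a, b: abs(a - b)

y = np.array([1.0, -0.5, 2.0])
t = np.array([0.5, 0.0, 2.0])
# with the squared loss on every task the multi-task loss is ||y - t||^2;
# with the absolute loss it is ||y - t||_1
```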

Remark 2:
▶ We distinguish between multi-class classification and multiple binary classification tasks. For example, in image recognition, there are typically multiple objects in one image, and one often prefers to indicate for each object its presence or absence rather than to associate to the image a single class.
Summary

▶ Lectures 5 and 6 have highlighted that the actual data on which we train the model plays an important role. In Lecture 7, we have demonstrated that an equally important role is played by the way we specify the errors of the model through particular choices of a loss function ℓ.
▶ Many loss functions exist for tasks such as regression, binary classification, multi-class classification, multi-task learning, etc.
▶ Loss functions must be designed by taking multiple aspects into account, such as the ability to account for mislabelings, the ability to tolerate some noise, and the ability to support efficient optimization.
▶ Loss functions can be defined flexibly to address practical aspects such as the presence of asymmetric misclassification costs, subsets of the data with different data quality, or the presence of multiple subtasks.
