
Loss Functions in Neural Networks

We will discuss some of the loss functions that are widely used in Neural Networks.

Remember:
The objective is to minimize the loss between the predictions and the actual outputs.
Mean Squared Error (L2 Loss)
Calculate the squared error (the squared difference between the actual output and the predicted output) for each sample, then sum them up and take the average.
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \frac{1}{n}\sum_{i=1}^{n}(\hat{Y}_i - Y_i)^2$$

$\hat{Y}_i$: Predicted Output
$Y_i$: Actual Output
$n$: Training samples in each minibatch (if not using minibatch training, then $n$ = number of training samples)

- Quadratic/Convex
- One Global Minimum to find
- Getting stuck at a local minimum is eliminated
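A minimal PyTorch sketch of this formula (the tensor values are made-up illustrations, not taken from the slides):

```python
import torch
import torch.nn as nn

# Made-up predictions and actual outputs for a minibatch of n = 4 samples
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])

# MSE written out directly: mean of squared differences
mse_manual = ((y_pred - y_true) ** 2).mean()

# The built-in loss gives the same number
mse_builtin = nn.MSELoss()(y_pred, y_true)

print(mse_manual.item(), mse_builtin.item())  # both 0.375
```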



Variations of Mean Squared Error
• Half of the Mean Squared Error:

$$\frac{1}{2n}\sum_{i=1}^{n}(\hat{Y}_i - Y_i)^2$$

• Root Mean Squared Error:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{Y}_i - Y_i)^2}$$
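The two variations, sketched on the same made-up tensors (only the scaling and the square root differ from plain MSE):

```python
import torch

y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])

mse = ((y_pred - y_true) ** 2).mean()   # plain MSE = 0.375
half_mse = 0.5 * mse                    # half of the MSE = 0.1875
rmse = torch.sqrt(mse)                  # root MSE ≈ 0.612

print(mse.item(), half_mse.item(), rmse.item())
```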



Example of MSE
Sample Predicted Actual Error Squared Error
1 48 60 -12 144
2 51 53 -2 4
3 57 60 -3 9

$$\frac{1}{3}\sum_{i=1}^{3}(\hat{Y}_i - Y_i)^2 = \frac{144 + 4 + 9}{3} = \frac{157}{3} \approx 52.3$$

BIG!!! This is what we want to MINIMIZE.
Note on the side:
If there is more than one output neuron, you would add the squared error for each output neuron in each training sample, take the average over the output neurons, and then take the average over all samples:

$$MSE = \frac{1}{n}\sum_{s=1}^{n}\frac{1}{j}\sum_{i=1}^{j}(Y_i - \hat{Y}_i)^2$$

$j$: number of output neurons
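The table above, checked in plain Python:

```python
predicted = [48, 51, 57]
actual = [60, 53, 60]

squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)   # [144, 4, 9]
print(round(mse, 1))    # 52.3
```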
Negative of the Logarithmic Function

Binary Cross Entropy

• Usually used when the output labels have values of 0 or 1.


• It can also be used when the output labels have values between 0 and 1.
• It is also widely used when we have only two classes (0 or 1), for example yes or no.

$$-\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{c}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

$y$: actual class label (0 or 1)
$p$: predicted probability for the class
$c$: number of classes
$n$: number of samples
When we only have two classes (binary classification):

$$-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$
For each sample:
If the label is 1: $-\log(p_i)$
If the label is 0: $-\log(1 - p_i)$

Example: a sigmoid output of 0.6 means
Class 1 $p$: 0.6
Class 2 $p$: 1 - 0.6 = 0.4

We do this procedure for all n samples and then take the average.
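A small sketch of this per-sample rule, assuming the sigmoid output of 0.6 from the slide:

```python
import math

p = 0.6                                # sigmoid output = probability of class 1
loss_if_label_is_1 = -math.log(p)      # -log(0.6) ≈ 0.511
loss_if_label_is_0 = -math.log(1 - p)  # -log(0.4) ≈ 0.916

print(loss_if_label_is_1, loss_if_label_is_0)
```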
Let's see what's happening.
Consider a problem with two classes [1 or 0]:

$$BCE\ Loss = -\frac{1}{n}\sum_{i=1}^{n}\left[y \log(p) + (1 - y)\log(1 - p)\right]$$

Consider one sample, n = 1:

If the label y is 1 and the prediction p is 0.1 → $-y\log(p) = -\log(0.1)$ → Loss is High
If the label y is 1 and the prediction p is 0.9 → $-y\log(p) = -\log(0.9)$ → Loss is Low
If the label y is 0 and the prediction p is 0.9 → $-(1-y)\log(1-p) = -\log(1-0.9) = -\log(0.1)$ → Loss is High
If the label y is 0 and the prediction p is 0.1 → $-(1-y)\log(1-p) = -\log(1-0.1) = -\log(0.9)$ → Loss is Low



What does that mean?

$$BCE\ Loss = -\frac{1}{n}\sum_{i=1}^{n}\left[y \log(p) + (1 - y)\log(1 - p)\right]$$

Consider n = 1:

If the label is 1 and the prediction is 0.1 → $-y\log(p) = -\log(0.1)$ → Loss is High → Minimize!
If the label is 1 and the prediction is 0.9 → $-y\log(p) = -\log(0.9)$ → Loss is Low
If the label is 0 and the prediction is 0.9 → $-(1-y)\log(1-p) = -\log(1-0.9) = -\log(0.1)$ → Loss is High → Minimize!
If the label is 0 and the prediction is 0.1 → $-(1-y)\log(1-p) = -\log(1-0.1) = -\log(0.9)$ → Loss is Low

Ideal case:
When the label is 1 and the prediction is 1 → $-\log(1) = 0$
When the label is 0 and the prediction is 0 → $-\log(1 - 0) = 0$
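The four cases and the ideal case, evaluated numerically (a sketch of the per-sample term, not a library call):

```python
import math

def bce_term(y, p):
    """Per-sample binary cross entropy: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce_term(1, 0.1))   # ≈ 2.303 -> label 1, prediction 0.1: loss is high
print(bce_term(1, 0.9))   # ≈ 0.105 -> label 1, prediction 0.9: loss is low
print(bce_term(0, 0.9))   # ≈ 2.303 -> label 0, prediction 0.9: loss is high
print(bce_term(0, 0.1))   # ≈ 0.105 -> label 0, prediction 0.1: loss is low
print(-math.log(1.0))     # 0.0 -> ideal case: label 1, prediction 1 (likewise -log(1-0) for label 0)
```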
PyTorch example on using BCE Loss

input – Tensor of arbitrary shape
target – Tensor of the same shape as input

torch.rand returns a tensor filled with random numbers from a uniform distribution on the interval [0, 1).
torch.randn returns a tensor filled with random numbers from a normal distribution with mean 0 and variance 1 (also called the standard normal distribution).
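The code on the original slide is an image that did not survive extraction; the sketch below reconstructs the described setup (torch.randn for raw scores, a sigmoid to map them into (0, 1), torch.rand for a target in [0, 1)), with an arbitrary shape of 3 elements as an assumption:

```python
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()
loss_fn = nn.BCELoss()

# Raw scores from a standard normal distribution (mean 0, variance 1)
input = torch.randn(3, requires_grad=True)
# Targets from a uniform distribution on [0, 1), same shape as the input
target = torch.rand(3)

# BCELoss expects probabilities, so the sigmoid is applied first
loss = loss_fn(sigmoid(input), target)
loss.backward()
print(loss.item())
```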
Multi-label Classification

Sigmoid outputs and target labels for one sample:

Sigmoid Output   Label
0.6              1
0.9              0
0.2              0
0.8              1

Consider n = 1 sample:

$$BCE\ Loss = -\sum_{i=1}^{c}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

$$BCE\ Loss = -\left[\log(0.6) + \log(1 - 0.9) + \log(1 - 0.2) + \log(0.8)\right]$$
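Computed directly for this sample (probabilities 0.6, 0.9, 0.2, 0.8 against labels 1, 0, 0, 1):

```python
import math

p = [0.6, 0.9, 0.2, 0.8]   # sigmoid outputs, one per label
y = [1, 0, 0, 1]           # multi-label targets

# Sum of the per-label BCE terms for this single sample (n = 1)
loss = -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
            for yi, pi in zip(y, p))
print(loss)   # -[log(0.6) + log(0.1) + log(0.8) + log(0.8)] ≈ 3.26
```

Note that PyTorch's nn.BCELoss averages over all elements by default (reduction='mean'); passing reduction='sum' matches the summed form above.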
Example of Multi-Label Classification

Predicted Attributes:
Beach
Dog
Brown
Sitting
Laying
People
Walking
Cross Entropy
$$CE = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{c} y_i \log(\hat{y}_i)$$

$i$: class number
$c$: number of classes
$y_i$: actual label
$\hat{y}_i$: predicted label

The actual outputs should be in the form of a one-hot vector.

Suppose you have 4 labels (4 outputs); then:

Class 1: [1 0 0 0]
Class 2: [0 1 0 0]
Class 3: [0 0 1 0]
Class 4: [0 0 0 1]
If a label is 0 (a wrong class), its term contributes no loss. If a label is 1 (the correct class), its term is calculated. The loss penalizes only the predicted probability of the correct class!
For example, suppose you have 4 different classes to classify.
For a single training example:
The ground truth (actual) labels are: [1 0 0 0]
The predicted labels (after softmax) are: [0.1 0.4 0.2 0.3]
The predicted probability of the correct class is 0.1, which is wrong: it should be close to 1.

$$Cross\ Entropy\ Loss = -\left[1 \times \log(0.1) + 0 + 0 + 0\right] = -\log(0.1) = 2.303 \rightarrow \text{Loss is High}$$

Ignore the loss for the 0 labels.
The loss doesn't depend on the probabilities for the incorrect classes!
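The same calculation in code; the log is taken on the softmax probabilities directly, mirroring the slide:

```python
import math

y_true = [1, 0, 0, 0]           # one-hot ground truth
y_pred = [0.1, 0.4, 0.2, 0.3]   # softmax probabilities

# Only the term where the label is 1 contributes anything
ce = -sum(t * math.log(p) for t, p in zip(y_true, y_pred))
print(round(ce, 3))   # 2.303 -> loss is high
```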

Sometimes, the cross entropy loss is averaged over the n training samples:
n → mini-batch size if using mini-batch training
n → total number of training samples if not using mini-batch training

$$J = -\frac{1}{n}\sum_{i=1}^{n} y_i \log(\hat{y}_i)$$

In the case of a one-hot vector, each sample has only one correct class and all other classes are 0. Thus, the summation over the classes c is eliminated.
For example, suppose you have 4 different classes to classify.
For a single training example:
The ground truth (actual) labels are: [1 0 0 0]
The predicted labels are: [0.9 0.01 0.05 0.04]
The predicted probability of the correct class is 0.9, which is correct: almost 1.

$$Cross\ Entropy\ Loss = -\left[1 \times \log(0.9) + 0 + 0 + 0\right] = -\log(0.9) = 0.105 \rightarrow \text{Loss is Low}$$

Ignore the loss for the 0 labels.
The loss doesn't depend on the probabilities for the incorrect classes!
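The low-loss case in code, plus a hedged note on the library form: nn.CrossEntropyLoss takes raw logits and an integer class index rather than a one-hot vector and softmax probabilities, so the logit values below are made up for illustration:

```python
import math
import torch
import torch.nn as nn

# Manual cross entropy for this example: only the correct class matters
y_pred = [0.9, 0.01, 0.05, 0.04]
print(round(-math.log(y_pred[0]), 3))   # 0.105 -> loss is low

# Library form: raw logits (assumed values) and the target class index 0
logits = torch.tensor([[4.0, -0.5, 1.1, 0.9]])  # shape (batch=1, num_classes=4)
target = torch.tensor([0])                       # class index, not a one-hot vector
print(nn.CrossEntropyLoss()(logits, target).item())
```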
