Loss functions are a crucial component of neural network training: every model is optimized by reducing some measure of its error, and that measure is the loss. Without a loss function, there is no signal to drive the model toward correct predictions. But what exactly are loss functions, and how do you use them? In this article, we'll look at the different loss functions available in PyTorch and how they are used to optimize your models.
Loss Function
When a model makes its first predictions, there is no reason to expect them to match the actual values: the weights are initialized randomly, so the initial outputs are essentially guesses. This is where the loss function comes in.
The loss function is an expression used to measure how close the predicted value is to the actual value. This expression outputs a value called loss, which tells us the performance of our model. By reducing this loss value in further training, the model can be optimized to output values that are closer to the actual values.
Loss Functions in PyTorch
PyTorch is a popular open-source Python library for building deep learning models. It provides a wide range of loss functions for different problems, which broadly fall into three categories: regression, classification, and ranking loss functions.
Regression losses are mostly for problems that deal with continuous values, such as predicting age or prices. Classification losses, on the other hand, are for problems that deal with discrete values, such as detecting whether an email is spam or ham. Ranking losses predict the relative distances between inputs.
All of these loss functions are packaged inside the nn module, PyTorch's package for building neural networks, so they are easy to import and use.
Python
#import nn module
import torch.nn as nn
mse_loss_fn = nn.MSELoss()
The above code stores an instance of MSELoss in the mse_loss_fn variable, which can then be called to return the loss value by passing the correct parameters, as shown below.
loss = mse_loss_fn(predicted_value, target)
#predicted value is what the model is predicting
#target is the actual value
Now let's look at how the various loss functions available in PyTorch work.
L1 loss
The L1 loss function, also called Mean Absolute Error (MAE), computes the average of the absolute differences between the predicted and the actual values. It first calculates the absolute difference between each predicted and actual value, sums these differences, and then divides by the number of samples to obtain the loss. L1 loss is mostly used for regression problems and is more robust to outliers than MSE.
Formula:
\mathcal{L}_{\text{L1}}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
Here,
- n represents the total number of observations or samples
- yi represents the actual or observed value for the ith sample,
- ŷi represents the predicted or estimated value for the ith sample.
Syntax:
torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')
Python
import torch
import torch.nn as nn

#initialising the loss function
loss = nn.L1Loss()
#randomly initialising the input and the target value...input is considered as predicted value here.
input = torch.randn(2, 4, requires_grad=True)
target = torch.randn(2, 4)
#passing both the values inside the loss function.
output = loss(input, target)
#backpropagation
output.backward()
print(output)
Output:
tensor(2.0804, grad_fn=<MeanBackward0>)
Advantage
- MAE is more robust to outliers compared to Mean Squared Error (MSE) because it takes the absolute difference, reducing the impact of extremely large errors.
- The MAE loss is straightforward to interpret as it represents the average magnitude of errors, making it easier to communicate the model's performance to stakeholders.
Disadvantage
- MAE treats all errors equally, regardless of their magnitude. This can be a disadvantage in cases where distinguishing between small and large errors is important.
- The gradient of MAE is a constant value, which can slow down convergence during optimization, especially in comparison to MSE, where the gradient decreases as the error decreases.
Mean Square Error
Also known as L2 loss, MSE is similar to MAE, but instead of taking absolute differences it computes the average of the squared differences between the predicted and actual values. The idea behind squaring is to penalise larger differences more heavily, so that the model avoids making large errors.
MSE is also used for regression problems, but it is less robust to outliers than MAE. It is a common default choice of loss function for regression problems in PyTorch.
Formula:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Here,
- n represents the total number of observations or samples,
- yi represents the actual or observed value for the ith sample,
- ŷi represents the predicted or estimated value for the ith sample.
Syntax:
torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')
Python
import torch
from torch import nn
#initialising the loss function
loss = nn.MSELoss()
#randomly initialising the input and the target value...input is considered as predicted value here.
input = torch.randn(2, 4, requires_grad=True)
target = torch.randn(2, 4)
#passing both the values inside the loss function.
output = loss(input, target)
#backpropagation
output.backward()
print(output)
Output:
tensor(0.9743, grad_fn=<MseLossBackward0>)
Advantages:
- Suitable for regression tasks where the final goal is to minimize the squared differences between predicted values and actual values.
- Differentiable and convex.
Disadvantages:
- Sensitive to outliers, because the squaring operation lets a few large errors dominate the optimization process.
- Not ideal for tasks where errors should be penalized differently.
Huber Loss
This loss is used for regression problems, especially when dealing with outliers. It combines MAE (Mean Absolute Error) and MSE (Mean Squared Error), and which behaviour applies depends on the delta value: the loss first calculates the absolute difference between the actual and predicted values, and if that difference is less than or equal to the threshold delta it behaves like MSE (quadratic), otherwise it switches to an MAE-like (linear) penalty.
Formula:
\text{HuberLoss}(x, \text{target}, \delta) = \frac{1}{N} \sum_{i} \begin{cases} \frac{1}{2} (x_i - \text{target}_i)^2 & \text{if } |x_i - \text{target}_i| \leq \delta \\ \delta(|x_i - \text{target}_i| - \frac{1}{2}\delta) & \text{otherwise}\end{cases}
Here,
- x represents the predicted values,
- target represents the ground truth or target values,
- δ is a parameter controlling the threshold for switching between quadratic and linear loss.
Syntax:
torch.nn.HuberLoss(reduction='mean', delta=1.0)
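The sketch below applies nn.HuberLoss to the same kind of random tensors used in the earlier examples; delta=1.0 is the default threshold and is written out only for clarity.
Python
import torch
import torch.nn as nn

#initialising the loss function with the default threshold
huber_loss = nn.HuberLoss(reduction='mean', delta=1.0)
#randomly initialising the input and the target value; the input is the predicted value here
input = torch.randn(2, 4, requires_grad=True)
target = torch.randn(2, 4)
#passing both the values inside the loss function
output = huber_loss(input, target)
#backpropagation
output.backward()
print(output)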
Advantages:
- Less sensitive to outliers than MSE, while still giving a more balanced evaluation of regression models than MAE.
Disadvantages:
- Introduces an extra hyperparameter (delta) that has to be tuned, which adds complexity to the model.
MAE, MSE and Huber loss are all used in regression problems, but which one should we use? MSE can be used when you want to penalize larger errors more heavily; it is useful when the data has no significant outliers and you assume the errors are roughly normally distributed. MAE can be used when you want a robust loss function that is less affected by outliers. Huber loss can be used when you want to combine the benefits of MAE and MSE, as illustrated in the sketch below.
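To make the difference concrete, this rough sketch evaluates all three losses on the same pair of tensors, where the last target value acts as an outlier; the specific numbers are arbitrary and chosen only for illustration.
Python
import torch
import torch.nn as nn

#predicted and actual values; the last actual value is an outlier
predicted = torch.tensor([2.5, 0.0, 2.0, 8.0])
actual = torch.tensor([3.0, -0.5, 2.0, 100.0])

#MAE grows linearly with the outlier
print(nn.L1Loss()(predicted, actual))
#MSE is dominated by the squared outlier
print(nn.MSELoss()(predicted, actual))
#Huber is quadratic for small errors and linear for the outlier
print(nn.HuberLoss(delta=1.0)(predicted, actual))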
Cross Entropy Loss
The Cross Entropy Loss function is used in classification problems; it computes the difference between two probability distributions for a given set of random variables. In PyTorch, nn.CrossEntropyLoss expects raw scores (logits) and applies a softmax internally, which ensures the resulting probability values lie in the range 0 to 1.
It is used in multi-class classification tasks, for which it is the standard loss function in PyTorch.
Formula:
\text{CrossEntropyLoss}(x, \text{target})
= -\frac{1}{N} \sum_{i} \left( \text{target}_i \cdot \log(x_i) \right)
Here,
- x represents the predicted values,
- target represents the ground truth or target values.
Syntax:
torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)
Python
import torch
import torch.nn as nn

#initialising the loss function
loss = nn.CrossEntropyLoss()
#randomly initialising the input and the target value...input is considered as predicted value here.
input = torch.randn(2, 4, requires_grad=True)
target = torch.empty(2, dtype=torch.long).random_(4)
#passing both the values inside the loss function.
output = loss(input, target)
#backpropagation
output.backward()
print(output)
Output:
tensor(1.5564, grad_fn=<NllLossBackward0>)
Advantages-
- The internal softmax makes the loss invariant to shifting all the logits by a constant.
- Provides a measure of uncertainty which helps in better understanding the model predictions.
Disadvantages-
- Sensitive to outliers and imbalanced data (can be biased towards majority class).
- It does not take the similarity between classes into account, which can be required in some cases.
Negative Log Likelihood Loss
Similar to Cross Entropy loss, this loss is also used for classification problems, but it operates on log-probabilities. The predicted probabilities (values between 0 and 1) are passed through a log, and the loss is the negative of the log-probability assigned to the correct class; the combined final layer is called a log-softmax layer. Generally, it is used in multi-class classification problems.
Formula:
\text{NegativeLogLikelihoodLoss}(x, \text{target}) = -\frac{1}{N} \sum_{i} \log(x_{\text{target}_i})
Here,
- x represents the predicted values,
- target represents the ground truth or target values
Syntax:
torch.nn.NLLLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
Python
import torch
import torch.nn as nn
# size of input (N x C) is = 3 x 5
input = torch.randn(3, 5, requires_grad=True)
# every element in target should have 0 <= value < C
target = torch.tensor([1, 0, 4])
#initialising loss function
m = nn.LogSoftmax(dim=1)
nll_loss = nn.NLLLoss()
output = nll_loss(m(input), target)
#backpropagation
output.backward()
#printing output
print(output)
Output:
tensor(1.8755, grad_fn=<NllLossBackward0>)
Binary Cross Entropy Loss
This loss is a special case of Cross Entropy Loss used for binary classification problems. The output of the neural network is passed through a sigmoid layer, which ensures that the final value lies between zero and one. As the name suggests, it is used for binary classification tasks.
Formula:
\text{BCELoss}(x, \text{target}) = -\frac{1}{N} \sum_{i} \left( \text{target}_i \cdot \log(x_i) + (1 - \text{target}_i) \cdot \log(1 - x_i) \right)
Here,
- x represents the predicted values,
- target represents the ground truth or target values,
- N represents the number of samples for which the loss is being calculated.
Syntax:
torch.nn.BCELoss(weight=None, size_average=None, reduce=None, reduction='mean')
Python
import torch
import torch.nn as nn

#sigmoid layer
m = nn.Sigmoid()
#initialising the loss
loss = nn.BCELoss()
#initialising input and target
input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
output = loss(m(input), target)
#backpropagation
output.backward()
#printing the loss value
print(output)
Output:
tensor(0.6782, grad_fn=<BinaryCrossEntropyBackward0>)
Advantages:
- It is commonly used in binary classification problems, and its per-sample weight argument can help when handling imbalanced datasets.
- It encourages the model to predict high probabilities for the correct class.
Disadvantages:
- It can suffer from vanishing gradients and slow convergence when the sigmoid saturates; the logits-based nn.BCEWithLogitsLoss is more numerically stable and is often preferred.
Kullback-Leibler Divergence Loss
Kullback-Leibler Divergence loss computes the difference between two probability distributions. It tells us how much divergence is introduced when the predicted probability distribution is used in place of the target probability distribution: the greater the divergence, the larger the loss, and a KL divergence of zero means the two distributions are identical. It is used for approximating complex functions, in multi-class classification tasks, and when you want the distribution of the predictions to match the distribution of the training data.
Formula:
\text{KLDivLoss}(x, \text{target}) = \frac{1}{N} \sum_{i} \left( \text{target}_i \cdot \log\left(\frac{\text{target}_i}{x_i}\right) \right)
Here,
- x represents the predicted values ,
- target represents the ground truth or target values.
Syntax:
torch.nn.KLDivLoss(size_average=None, reduce=None, reduction='mean', log_target=False)
Python
#import
import torch
from torch import nn
import torch.nn.functional as F
#initialise loss
kl_loss = nn.KLDivLoss(reduction="batchmean")
#define input
input = F.log_softmax(torch.randn(3, 5, requires_grad=True), dim=1)
#define the target value
target = F.softmax(torch.rand(3, 5), dim=1)
#passing the values of input and target to the loss function
output = kl_loss(input, target)
#print output
print(output)
Output:
tensor(0.4681, grad_fn=<DivBackward0>)
Advantages:
- It measures the difference between two probability distributions and is used frequently for tasks like variational autoencoders (VAEs).
- It is also used in training generative models and aligning the respective predicted distribution with the true distribution.
Disadvantages:
- It is not symmetric, and it is not usually suitable as a stand-alone loss function for training classifiers.
Hinge Embedding Loss
This loss is mostly used in semi-supervised learning tasks or for learning nonlinear embeddings. It takes an input tensor and target labels that are either 1 or -1, i.e. a binary classification setting, and measures whether two inputs are similar or dissimilar.
Syntax:
torch.nn.HingeEmbeddingLoss(margin=1.0, size_average=None, reduce=None, reduction='mean')
Python
#importing libraries
import torch
import torch.nn as nn
#initialising random input and target variables
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5).sign()  #labels must be 1 or -1
#initialising the loss function
hinge_loss = nn.HingeEmbeddingLoss()
#calculating the output by passing the input and target values
output = hinge_loss(input, target)
#backpropagation
output.backward()
#printing the output
print(output)
Output:
tensor(1.1289, grad_fn=<MeanBackward0>)
Margin Ranking Loss
The margin ranking loss function measures the relative ranking between pairs of inputs, where the labels contain only the binary values 1 and -1. Given two inputs, a target of 1 means the first input should be ranked higher than the second, whereas a target of -1 means the second input should be ranked higher. It is used mostly in ranking problems.
Syntax:
torch.nn.MarginRankingLoss(margin=0.0, size_average=None, reduce=None, reduction='mean')
Python
#importing libraries
import torch
import torch.nn as nn
#initialising two inputs and target label
input_one = torch.randn(3, requires_grad=True)
input_two = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()
#initialising the loss
ranking_loss = nn.MarginRankingLoss()
#passing the values into the loss function
output = ranking_loss(input_one, input_two, target)
#backpropagation
output.backward()
#printing the output
print(output)
Output:
tensor(0.0444, grad_fn=<MeanBackward0>)
CTC Loss
Connectionist Temporal Classification (CTC) loss calculates the loss between a continuous series of inputs and a target sequence. It sums over the probability of all the possible alignments of input to target, producing a loss value which is differentiable with respect to each input node.
Syntax:
torch.nn.CTCLoss(blank=0, reduction='mean', zero_infinity=False)
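The sketch below follows the input conventions described in the PyTorch documentation for nn.CTCLoss: log-probabilities of shape (T, N, C) (sequence length, batch size, number of classes including the blank), integer targets, and per-sample input and target lengths. The tensor sizes here are arbitrary choices for illustration.
Python
import torch
import torch.nn as nn

#T: input sequence length, C: number of classes (including the blank), N: batch size, S: max target length
T, C, N, S = 50, 20, 4, 10
#log-probabilities over the classes for every time step and batch element
log_probs = torch.randn(T, N, C).log_softmax(2).requires_grad_()
#random integer targets, avoiding the blank index 0
targets = torch.randint(1, C, (N, S), dtype=torch.long)
#every input sequence uses the full length T; target lengths vary per sample
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)
#initialising the loss function
ctc_loss = nn.CTCLoss(blank=0)
output = ctc_loss(log_probs, targets, input_lengths, target_lengths)
#backpropagation
output.backward()
print(output)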
There are many other losses that are not covered in this article, such as binary cross entropy with logits (nn.BCEWithLogitsLoss), a brief sketch of which is shown below. Refer to the documentation for an overview of all the loss functions available in PyTorch.
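As a minimal sketch of that loss, the snippet below mirrors the BCE example above, except that the raw logits are passed directly to nn.BCEWithLogitsLoss, which applies the sigmoid internally and is more numerically stable than a separate Sigmoid plus BCELoss.
Python
import torch
import torch.nn as nn

#initialising the loss; the sigmoid is applied internally
loss = nn.BCEWithLogitsLoss()
#raw logits, no sigmoid applied beforehand
input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
output = loss(input, target)
#backpropagation
output.backward()
print(output)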
In conclusion, loss functions are a crucial component of training neural networks, and PyTorch does a great job of providing them with enough abstraction that they can be understood and used easily.