Loss Functions
Dr. V. Sowmya,
Associate Professor,
Amrita School of Artificial Intelligence, Coimbatore,
Amrita Vishwa Vidyapeetham,
India.
27-01-2025.
Loss Functions
• During training, a loss function is used to optimize the model's parameters.
• It measures the difference between the model's predicted outputs and the expected (target) outputs.
• The objective of training is to minimize this difference, as the sketch below illustrates.
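As a concrete illustration (a minimal NumPy sketch; the data, initial weight, and learning rate are illustrative assumptions, not taken from the slides):

import numpy as np

# One gradient-descent step on a linear model y = w*x,
# using mean squared error as the loss.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])        # expected outputs
w = 0.5                              # initial parameter (assumed)

pred = w * x                         # predicted outputs
loss = np.mean((pred - y) ** 2)      # difference between predicted and expected
grad = np.mean(2 * (pred - y) * x)   # d(loss)/dw
w = w - 0.1 * grad                   # update the parameter to reduce the loss
print(loss, w)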
Loss Functions - Properties
Mean Squared Error (MSE) / L2 Loss
Properties:
• Non-negative.
• Sensitive to outliers: squaring the errors lets large residuals dominate the loss.
• Differentiable.
• Convex in the predictions (the overall training objective is non-convex in deep learning due to the multiple layers of non-linear activation functions).
• Usable both as a loss function and as a performance metric.
• Scale-dependent.
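A minimal NumPy sketch of MSE (array values are illustrative assumptions):

import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences; squaring makes large residuals
    # (e.g., from outliers) dominate the loss, and keeps the result
    # in the squared units of the target (hence scale-dependent).
    return np.mean((y_true - y_pred) ** 2)

print(mse(np.array([3.0, 5.0, 2.0]), np.array([2.5, 5.0, 4.0])))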
Mean Absolute Error (MAE) / L1 Loss
Properties:
• Non-negative.
• Robust to outliers: errors grow only linearly.
• Non-differentiable at zero.
• Convex in the predictions (non-convex overall in deep learning, as with MSE).
• Usable both as a loss function and as a performance metric.
• Scale-dependent.
Use Mean Absolute Percentage Error (MAPE) or Normalized Mean Absolute Error (NMAE) to compare models across different scales or units, as in the sketch below.
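A minimal NumPy sketch of MAE and MAPE (function names and behavior at zero targets are implementation assumptions):

import numpy as np

def mae(y_true, y_pred):
    # Mean of absolute differences; errors grow only linearly,
    # so outliers influence the loss less than in MSE.
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # Percentage error: scale-free, so it can compare models whose
    # targets have different units (requires non-zero y_true).
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))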
Huber Loss
Properties:
• Robust to outliers.
• Differentiable.
• Used in time series forecasting.
Set δ based on the error characteristics of the data: a small δ makes the loss behave like MAE (more robust to noise and outliers), while a large δ makes it behave like MSE (see the sketch below).
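A minimal NumPy sketch of the Huber loss (the default δ = 1.0 is an assumed example):

import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it: small residuals
    # are treated like MSE, large ones like MAE.
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2
    linear = delta * (np.abs(err) - 0.5 * delta)
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))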
Log-Cosh Loss
Properties:
• Smooth and differentiable.
• Less sensitive to outliers than MSE.
• More sensitive to small errors than the Huber loss.
Huber Loss - use when we have a reason to define a specific point where the loss function should switch from quadratic to linear, depending on the noise characteristics of the data.
Log-Cosh Loss - use when we do not have a clear reason to manually set a transition threshold as in Huber loss.
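A minimal NumPy sketch of the log-cosh loss, written in a numerically stable form (an implementation choice, not from the slides):

import numpy as np

def log_cosh(y_true, y_pred):
    e = np.abs(y_pred - y_true)
    # log(cosh(e)) = |e| + log1p(exp(-2|e|)) - log(2); this form avoids
    # the overflow that a direct np.cosh(e) would hit for large errors.
    return np.mean(e + np.log1p(np.exp(-2.0 * e)) - np.log(2.0))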
Quantile Loss
Used for predicting a quantile (and hence a prediction interval) instead of a single point value.
The loss is scaled by q for underestimations and (1 − q) for overestimations, as in the sketch below.
When q = 0.5, the quantile loss is equivalent to the Mean Absolute Error (MAE) up to a constant factor, making it a generalization of MAE that allows asymmetric penalties for underestimations and overestimations.
Applications: Financial Risk Management, Supply Chain and Inventory Management, Energy Production, Economic Forecasting, Weather Forecasting, Real Estate Pricing, Healthcare.
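A minimal NumPy sketch of the quantile (pinball) loss (q = 0.9 is an assumed example):

import numpy as np

def quantile_loss(y_true, y_pred, q=0.9):
    # Underestimations (y_true > y_pred) are scaled by q,
    # overestimations by (1 - q); q = 0.5 recovers MAE up to a
    # constant factor of 1/2.
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))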
Poisson Loss
Used when the target variable represents count data.
Applications: Traffic Modelling, Healthcare, Insurance, Customer Service, Internet Usage, Manufacturing, Crime Analysis.
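A minimal NumPy sketch of the Poisson loss (the constant log(y!) term is dropped, as is standard for optimization):

import numpy as np

def poisson_loss(y_true, y_pred):
    # Negative log-likelihood of a Poisson model for count targets;
    # y_pred must be a strictly positive predicted rate.
    return np.mean(y_pred - y_true * np.log(y_pred))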
Binary Cross Entropy (BCE) and Weighted BCE
Weighted BCE assigns a higher weight to the minority class, helping to balance the influence of each class on the training process (see the sketch below).
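A minimal NumPy sketch of weighted BCE (the class weights w_pos and w_neg are assumed values to be tuned per dataset):

import numpy as np

def weighted_bce(y_true, p, w_pos=5.0, w_neg=1.0):
    # Standard BCE with a larger weight on the positive (minority)
    # class, so its errors contribute more to the loss.
    p = np.clip(p, 1e-12, 1.0 - 1e-12)   # avoid log(0)
    return -np.mean(w_pos * y_true * np.log(p)
                    + w_neg * (1.0 - y_true) * np.log(1.0 - p))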
Categorical Cross Entropy (CCE)
Sparse Categorical Cross Entropy (SCCE)
SCCE computes the same loss as CCE but takes integer class indices as targets instead of one-hot vectors.
Cross-Entropy Loss with Label Smoothing
This technique has been shown to improve the generalization of models, particularly in scenarios with many categories or when the dataset contains noisy labels.
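A minimal NumPy sketch of CCE with label smoothing (the smoothing factor 0.1 is an assumed example):

import numpy as np

def cce_label_smoothing(y_onehot, p, smoothing=0.1):
    # Replace hard 0/1 targets with (1 - s)*y + s/K, so the model is
    # never pushed toward fully confident (and brittle) predictions.
    k = y_onehot.shape[-1]
    y_smooth = (1.0 - smoothing) * y_onehot + smoothing / k
    p = np.clip(p, 1e-12, 1.0)
    return -np.mean(np.sum(y_smooth * np.log(p), axis=-1))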
Negative Log Likelihood (NLL)
The negative log-probability assigned to the correct class; applied to softmax outputs, it coincides with cross-entropy.
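A minimal NumPy sketch of NLL (the array layout, one row of class probabilities per sample, is an assumption):

import numpy as np

def nll(targets, p):
    # Negative log-probability of the true class for each sample;
    # targets are integer class indices into each row of p.
    rows = np.arange(len(targets))
    return -np.mean(np.log(np.clip(p[rows, targets], 1e-12, 1.0)))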
Poly Loss
When ϵ = 0, Poly-1 reduces to the standard cross-entropy loss. When ϵ > 0, the loss function becomes more sensitive to confident predictions, reducing overfitting in imbalanced datasets or tasks requiring higher precision.
Useful when dealing with imbalanced datasets, and it simplifies the hyperparameter optimization process, since only ϵ needs tuning (see the sketch below).
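A minimal NumPy sketch of the Poly-1 loss (ϵ = 1.0 is an assumed default):

import numpy as np

def poly1_loss(y_onehot, p, epsilon=1.0):
    # Poly-1 = cross-entropy + epsilon * (1 - p_t), where p_t is the
    # probability assigned to the true class; epsilon = 0 recovers CE.
    p_t = np.clip(np.sum(y_onehot * p, axis=-1), 1e-12, 1.0)
    ce = -np.log(p_t)
    return np.mean(ce + epsilon * (1.0 - p_t))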
Hinge Loss
Used in maximum-margin classification tasks (e.g., support vector machines).
The product y · f(x) is the raw margin: it measures whether the prediction agrees in sign with the label and how far it lies from the decision boundary.
Squared Hinge Loss
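A minimal NumPy sketch of both hinge variants (labels are assumed to be in {-1, +1} and f_x to be raw model scores):

import numpy as np

def hinge(y, f_x):
    # Zero loss once the margin y*f(x) reaches 1; smaller margins
    # are penalized linearly.
    return np.mean(np.maximum(0.0, 1.0 - y * f_x))

def squared_hinge(y, f_x):
    # Squaring makes the loss differentiable at the margin boundary
    # and penalizes violations more strongly.
    return np.mean(np.maximum(0.0, 1.0 - y * f_x) ** 2)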