Loss Functions
Neural networks use optimization strategies like stochastic gradient descent to minimize the error
in the algorithm. The way we actually compute this error is by using a loss function, which
quantifies how well or how badly the model is performing.
Loss functions can be classified into two major categories depending upon the type of learning
task we are dealing with: regression losses and classification losses.
In classification, we try to predict an output from a finite set of categorical values, e.g. given a
large dataset of images of handwritten digits, categorize each image as one of the digits 0–9.
Regression, on the other hand, deals with predicting a continuous value, e.g. given the floor
area, number of rooms, and size of rooms, predict the price of the house.
NOTE
n - Number of training examples.
i - Index of the ith training example in the data set.
y(i) - Ground truth label for ith training example.
y_hat(i) - Prediction for ith training example.
Regression Losses
1. Mean Square Error/Quadratic Loss/L2 Loss
Mathematical formulation:

MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2

As the name suggests, mean square error is measured as the average of the squared differences
between predictions and actual observations.
It’s only concerned with the average magnitude of the error, irrespective of its direction. However,
due to the squaring, predictions which are far away from the actual values are penalized heavily in
comparison to less deviated predictions. In addition, the gradient of MSE is easy to calculate.
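As a quick sanity check on the formula, scikit-learn's mean_squared_error computes this directly (the toy values below are made up for illustration):
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375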
2. Mean Absolute Error/L1 Loss
Mean absolute error, on the other hand, is measured as the average of the sum of absolute differences
between predictions and actual observations. Like MSE, it also measures the magnitude of the error
without considering its direction.
MAE is more robust to outliers since it does not make use of the square.
Mathematical formulation:

MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y^{(i)} - \hat{y}^{(i)}\right|
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
MAE loss is useful if the training data is corrupted with outliers (i.e. we erroneously receive
unrealistically huge negative/positive values in our training environment, but not our testing
environment).
Deciding which loss function to use
If the outliers represent anomalies that are important for the business and should be detected, then we
should use MSE. On the other hand, if we believe that the outliers just represent corrupted data,
then we should choose MAE as the loss.
L1 loss is more robust to outliers, but its derivatives are not continuous, which makes finding the
solution less efficient. L2 loss is sensitive to outliers, but gives a more stable and closed-form solution.
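To make this concrete, here is a made-up comparison in which the last target value is an outlier:
>>> from sklearn.metrics import mean_squared_error, mean_absolute_error
>>> y_true = [1, 2, 3, 4, 100]
>>> y_pred = [1, 2, 3, 4, 5]
>>> mean_squared_error(y_true, y_pred)
1805.0
>>> mean_absolute_error(y_true, y_pred)
19.0
The single outlier dominates the MSE, while the MAE stays on the same scale as the typical error.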
3. Huber Loss
Mean square error (MSE) is great for learning outliers in the dataset; mean absolute
error (MAE), on the other hand, is good for ignoring them.
But in some cases, data points which look like outliers should not be ignored, yet those
points should not get high priority either. This is where Huber loss comes in.
Huber Loss = Combination of both MSE and MAE
Huber loss combines MSE and MAE: it is quadratic (MSE-like) when the error is small, and
linear (MAE-like) otherwise:

L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}\left(y - \hat{y}\right)^2 & \text{for } |y - \hat{y}| \le \delta \\ \delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}

Here delta is the hyperparameter that defines the boundary between the MAE and MSE regimes;
it can be tuned iteratively to make sure we find the correct delta value.
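A minimal NumPy sketch of Huber loss (the function name huber_loss and the default delta below are our own choices for illustration, not a standard library API):

import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # elementwise prediction error
    error = np.asarray(y_true) - np.asarray(y_pred)
    # quadratic (MSE-like) branch for small errors, linear (MAE-like) branch otherwise
    squared = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(np.abs(error) <= delta, squared, linear))

>>> huber_loss([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
0.1875

With delta = 1.0 every error in this toy example falls in the quadratic region, so the result is simply half the MSE (0.375 / 2).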
Classification Losses
1. Cross-Entropy Loss
Cross-entropy measures the performance of a classification model whose output is a probability
value between 0 and 1. It calculates the average difference between the predicted and actual
probabilities.
Each predicted probability is compared to the actual class output value (0 or 1), and a score is
calculated that penalizes the probability based on its distance from the expected value. The
penalty is logarithmic: small differences (0.1 or 0.2) yield a small score, while large
differences (0.9 or 1.0) yield an enormous score.
This is the most common setting for classification problems. Cross-entropy loss increases as the
predicted probability diverges from the actual label.
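To see the logarithmic penalty concretely, here is a quick check with made-up predicted probabilities for a true label of 1:
>>> import numpy as np
>>> for p in [0.9, 0.6, 0.1]:
...     print(p, round(-np.log(p), 3))
...
0.9 0.105
0.6 0.511
0.1 2.303
A confident correct prediction (0.9) costs almost nothing, while a confident wrong one (0.1) is penalized heavily.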
Consider a 4-class classification task where an image is classified as either a dog, cat,
horse or cheetah.
Let us calculate the probability generated by the first logit after softmax is applied. For a
vector of logits z, softmax is

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}

where e \approx 2.718 is Euler's number.
Softmax converts the logits into probabilities. The purpose of the cross-entropy is then to take
these output probabilities (P) and measure their distance from the truth values.
Cross-entropy is defined as

\text{CrossEntropy} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)

where C is the number of classes, y_c is 1 for the true class and 0 otherwise, and \hat{y}_c is
the predicted probability for class c.
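Putting softmax and cross-entropy together for the four-class example above (the logit values here are assumptions chosen for illustration, not taken from the original figure):

import numpy as np

# hypothetical logits for one image, ordered (dog, cat, horse, cheetah)
logits = np.array([2.0, 1.0, 0.1, -1.0])

# softmax converts the logits into probabilities P
probs = np.exp(logits) / np.sum(np.exp(logits))

# one-hot ground truth: the image is actually a dog
y_true = np.array([1.0, 0.0, 0.0, 0.0])

# cross-entropy measures the distance of P from the truth values
loss = -np.sum(y_true * np.log(probs))

print(probs)  # approximately [0.638 0.235 0.095 0.032]
print(loss)   # approximately 0.449, i.e. -log(0.638)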