Loss Functions in Neural Networks
Wed, Jun 7, 2017 · 10 min read
A loss function is an important part of artificial neural networks: it is used to measure the inconsistency between the predicted value $\hat{y}$ and the actual label $y$. It is a non-negative value, where the robustness of the model increases as the value of the loss function decreases. The loss function is the hard core of the empirical risk function as well as a significant component of the structural risk function. Generally, the structural risk function of a model consists of an empirical risk term and a regularization term, which can be represented as
$$
\begin{aligned}
\theta^* &= \arg\min_{\theta} \; L(\theta) + \lambda \cdot \Phi(\theta) \\
&= \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} L\big(y^{(i)}, \hat{y}^{(i)}\big) + \lambda \cdot \Phi(\theta) \\
&= \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} L\big(y^{(i)}, f(x^{(i)}, \theta)\big) + \lambda \cdot \Phi(\theta)
\end{aligned}
$$
where $\Phi(\theta)$ is the regularization term or penalty term, $\theta$ denotes the parameters of the model to be learned, $f(\cdot)$ represents the activation function, and $x^{(i)} = \big(x_1^{(i)}, x_2^{(i)}, \dots, x_M^{(i)}\big) \in \mathbb{R}^M$ denotes the $i$-th training sample.

Here we only concentrate on the empirical risk term (loss function)
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(y^{(i)}, \hat{y}^{(i)}\big)$$
and introduce the mathematical expressions of several commonly-used loss functions as well as the corresponding expressions in DeepLearning4J (https://deeplearning4j.org).
Mean Squared Error
Mean Squared Error (MSE), or quadratic, loss function is widely used in linear regression as the performance measure, and the method of minimizing MSE is called Ordinary Least Squares (OLS) (https://en.wikipedia.org/wiki/Ordinary_least_squares). The basic principle of OLS is that the optimized fitting line should be the line which minimizes the sum of distances of each point to the regression line, i.e. minimizes the quadratic sum. The standard form of the MSE loss function is defined as
$$L = \frac{1}{N} \sum_{i=1}^{N} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$$
where $\big(y^{(i)} - \hat{y}^{(i)}\big)$ is named the residual, and the target of the MSE loss function is to minimize the residual sum of squares. In DeepLearning4J, it is LossFunctions.LossFunction.MSE or LossFunctions.LossFunction.SQUARED_LOSS (they are the same in DL4J). However, if using Sigmoid (https://isaacchanghau.github.io/2017/0si2/Activation-Functions-in-Artificial-Neural-Networks/#Sigmoid-Units) as the activation function, the quadratic loss function suffers from the problem of slow convergence (learning speed); for other activation functions, it does not have such a problem.
For example, by using sigmoid, $\hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(\theta^{T} x^{(i)})$. For simplicity, we only consider one sample, say $(y - \sigma(z))^2$, and its derivative with respect to $\theta$ is proportional to
$$(y - \sigma(z)) \cdot \sigma'(z) \cdot x$$
According to the shape and features of Sigmoid, when $\sigma(z)$ tends to 0 or 1, $\sigma'(z)$ is close to zero, and when $\sigma(z)$ is close to 0.5, $\sigma'(z)$ reaches its maximum. In this case, when the difference between the predicted value and the true label $(y - \sigma(z))$ is large, $\sigma'(z)$ will be close to 0, which decreases the convergence speed. This is improper, since we expect the learning speed to be fast when the error is large.
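To make the argument concrete, here is a minimal NumPy sketch (the function name and toy values are my own, not part of DeepLearning4J) that evaluates the gradient factor $(y - \sigma(z)) \cdot \sigma'(z)$ for a single sample: when the sigmoid saturates, the gradient is tiny even though the error is close to 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_sigmoid_grad_factor(y, z):
    """Gradient factor of (y - sigmoid(z))^2 with respect to z,
    up to the constant -2: (y - sigmoid(z)) * sigmoid'(z)."""
    s = sigmoid(z)
    return (y - s) * s * (1.0 - s)  # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))

# Saturated sigmoid: the error is ~1, yet the gradient is tiny -> slow learning
print(mse_sigmoid_grad_factor(y=1.0, z=-6.0))  # ~0.0025
# Unsaturated sigmoid: smaller error, but a much larger gradient
print(mse_sigmoid_grad_factor(y=1.0, z=0.0))   # 0.125
```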
Mean Squared Logarithmic Error
Mean Squared Logarithmic Error (MSLE) loss function is a variant of MSE, which is defined as
$$L = \frac{1}{N} \sum_{i=1}^{N} \big(\log(y^{(i)} + 1) - \log(\hat{y}^{(i)} + 1)\big)^2$$
MSLE is also used to measure the difference between actual and predicted values. By taking the log of the predictions and actual values, what changes is the variance that you are measuring. It is usually used when you do not want to penalize huge differences in the predicted and the actual values when both predicted and actual values are huge numbers. Another thing is that MSLE penalizes under-estimates more than over-estimates.

1. If both predicted and actual values are small: MSE and MSLE are the same.
2. If either the predicted or the actual value is big: MSE > MSLE.
3. If both predicted and actual values are big: MSE > MSLE (MSLE becomes almost negligible).
It is expressed as LossFunctions.LossFunction.MEAN_SQUARED_LOGARITHMIC_ERROR in DeepLearning4J.
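As a quick illustration of the points above, here is a small NumPy sketch of the MSLE formula (the helper name and toy arrays are illustrative, not the DL4J API): a huge absolute gap between two huge values contributes roughly as much as a small gap between two small values.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error: mean of (log(y + 1) - log(yhat + 1))^2."""
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

y_true = np.array([1_000_000.0, 3.0])
y_pred = np.array([1_100_000.0, 2.5])
# The absolute error on the first sample is 100000, yet its MSLE contribution
# (~0.009) is comparable to that of the second sample (~0.018).
print(msle(y_true, y_pred))  # ~0.013
```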
L2
The L2 loss function is the square of the L2 norm of the difference between actual value and predicted value. It is mathematically similar to MSE, only without the division by $N$. It is computed by
$$L = \sum_{i=1}^{N} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$$
For more details, especially on the mathematics, please read the paper On Loss Functions for Deep Neural Networks in Classification (https://arxiv.org/abs/1702.05659), which gives a comprehensive explanation of several commonly-used loss functions, including the L2 and L1 loss functions. In DeepLearning4J, it is expressed as LossFunctions.LossFunction.L2.
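A one-line sketch of the L2 loss as defined above, i.e. the sum of squared residuals without the $1/N$ factor (names are illustrative, not the DL4J implementation):

```python
import numpy as np

def l2_loss(y_true, y_pred):
    """L2 loss: sum of squared residuals (MSE without the 1/N factor)."""
    return np.sum((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 3.5])
print(l2_loss(y_true, y_pred))                # 0.75
print(l2_loss(y_true, y_pred) / len(y_true))  # 0.25, which is the MSE
```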
Mean Absolute Error
Mean Absolute Error (MAE) is a quantity used to measure how close forecasts or predictions are to the eventual outcomes, which is computed by
$$L = \frac{1}{N} \sum_{i=1}^{N} \big|y^{(i)} - \hat{y}^{(i)}\big|$$
where $|\cdot|$ denotes the absolute value. Albeit both MSE and MAE are used in predictive modeling, there are several differences between them. MSE has nice mathematical properties which make it easier to compute the gradient, whereas MAE requires more complicated tools such as linear programming to compute the gradient. Because of the square, large errors have relatively greater influence on MSE than do smaller errors; therefore, MAE is more robust to outliers since it does not make use of the square. On the other hand, MSE is more useful when large errors have consequences much bigger than equivalent smaller ones. MSE also corresponds to maximizing the likelihood of Gaussian random variables. In DeepLearning4J, it is expressed as LossFunctions.LossFunction.MEAN_ABSOLUTE_ERROR.
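The following sketch (with toy data of my own, not from the post) contrasts MAE and MSE on a series containing a single outlier, which is the robustness difference described above:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of |y - yhat|."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean Squared Error, for comparison."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0, 100.0])  # the last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.0])
print(mae(y_true, y_pred))  # ~24.35   (the outlier enters linearly)
print(mse(y_true, y_pred))  # ~2352.27 (the outlier enters quadratically)
```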
Mean Absolute Percentage Error
Mean Absolute Percentage Error (MAPE) is a variant of MAE; it is computed by
$$L = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y^{(i)} - \hat{y}^{(i)}}{y^{(i)}} \right| \cdot 100$$
Although the concept of MAPE sounds very simple and convincing, it has major drawbacks in practical application:

1. It cannot be used if there are zero values (which sometimes happens, for example, in demand data), because there would be a division by zero.
2. For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
3. When MAPE is used to compare the accuracy of prediction methods, it is biased in that it will systematically select a method whose forecasts are too low. This little-known but serious issue can be overcome by using an accuracy measure based on the ratio of the predicted to actual value (called the Accuracy Ratio); this approach leads to superior statistical properties and to predictions which can be interpreted in terms of the geometric mean.
It is expressed as LossFunctions.LossFunction.MEAN_ABSOLUTE_PERCENTAGE_ERROR in DeepLearning4J.
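A short sketch (helper name and values are mine) that reproduces two of the drawbacks listed above: zero actuals break the formula, and errors on too-low forecasts are capped at 100% while errors on too-high forecasts are not.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: mean of |(y - yhat) / y| * 100."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

y_true = np.array([100.0, 100.0])
print(mape(y_true, np.array([0.0, 0.0])))      # 100.0 -> too-low forecasts are capped
print(mape(y_true, np.array([500.0, 500.0])))  # 400.0 -> too-high forecasts are unbounded
# mape(np.array([0.0]), np.array([1.0]))       # undefined: division by a zero actual value
```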
L1
The L1 loss function is the sum of absolute errors of the difference between actual value and predicted value. Similar to the relation between MSE and L2, L1 is mathematically similar to MAE, only without the division by $N$, and it is defined as
$$L = \sum_{i=1}^{N} \big|y^{(i)} - \hat{y}^{(i)}\big|$$
In DeepLearning4J, it is expressed as LossFunctions.LossFunction.L1.
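And the corresponding sketch for L1, which is simply MAE without the $1/N$ factor (again an illustrative helper, not the DL4J implementation):

```python
import numpy as np

def l1_loss(y_true, y_pred):
    """L1 loss: sum of absolute residuals (MAE without the 1/N factor)."""
    return np.sum(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 3.5])
print(l1_loss(y_true, y_pred))                # 1.5
print(l1_loss(y_true, y_pred) / len(y_true))  # 0.5, which is the MAE
```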
Kullback Leibler (KL) Divergence
KL Divergence, also known as relative entropy or information divergence/gain, is a measure of how one probability distribution diverges from a second, expected probability distribution. The KL divergence loss function is computed by
$$
\begin{aligned}
L &= \frac{1}{N} \sum_{i=1}^{N} D_{KL}\big(y^{(i)} \,\big\|\, \hat{y}^{(i)}\big) \\
&= \frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \cdot \log\!\left(\frac{y^{(i)}}{\hat{y}^{(i)}}\right) \right] \\
&= \frac{1}{N} \sum_{i=1}^{N} \big( y^{(i)} \cdot \log(y^{(i)}) \big) - \frac{1}{N} \sum_{i=1}^{N} \big( y^{(i)} \cdot \log(\hat{y}^{(i)}) \big)
\end{aligned}
$$
where the first term is the (negative) entropy of the labels and the remaining term, with its minus sign, is the cross entropy (another kind of loss function which will be introduced later). KL divergence is a distribution-wise asymmetric measure and thus does not qualify as a statistical metric of spread. In the simple case, a KL divergence of 0 indicates that we can expect similar, if not the same, behavior of two different distributions, while a KL divergence of 1 indicates that the two distributions behave in such a different manner that the expectation given the first distribution approaches zero. For more details, please visit the Wikipedia article (https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
In DeepLearning4J, it is expressed as LossFunctions.LossFunction.KL_DIVERGENCE. Moreover, the implementation of Reconstruction Cross Entropy (https://en.wikipedia.org/wiki/Cross_entropy) in DeepLearning4J is the same as Kullback-Leibler divergence, thus you can also use LossFunctions.LossFunction.RECONSTRUCTION_CROSSENTROPY.
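To make the entropy / cross-entropy decomposition above concrete, here is a small NumPy sketch for two discrete distributions (the distributions and helper names are my own, chosen for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # "true" label distribution y
q = np.array([0.5, 0.3, 0.2])  # predicted distribution yhat

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
print(kl_divergence(p, q))      # ~0.085
print(cross_entropy - entropy)  # same value: D_KL = cross entropy - entropy
```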
Cross Entropy