Loss Functions in Neural Networks
Isaac Changhau · Wed, Jun 7, 2017 · 10 min read

A loss function is an important part of artificial neural networks: it is used to measure the inconsistency between the predicted value $\hat{y}$ and the actual label $y$. It is a non-negative value, and the robustness of the model increases as the value of the loss function decreases. The loss function is the hard core of the empirical risk function as well as a significant component of the structural risk function. Generally, the structural risk function of a model consists of an empirical risk term and a regularization term, which can be represented as

$$\hat{\theta} = \arg\min_{\theta} L(\theta) + \lambda \cdot \Phi(\theta) = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\big(y^{(i)}, \hat{y}^{(i)}\big) + \lambda \cdot \Phi(\theta) = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\big(y^{(i)}, f(x^{(i)}, \theta)\big) + \lambda \cdot \Phi(\theta)$$

where $\Phi(\theta)$ is the regularization (penalty) term, $\theta$ denotes the parameters of the model to be learned, $f(\cdot)$ represents the activation function, and $x^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \dots, x_m^{(i)}\} \in \mathbb{R}^m$ denotes a training sample. Here we concentrate only on the empirical risk term (the loss function)

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(y^{(i)}, f(x^{(i)}, \theta)\big)$$

and introduce the mathematical expressions of several commonly used loss functions, as well as the corresponding expressions in DeepLearning4J (https://deeplearning4j.org).

Mean Squared Error

Mean Squared Error (MSE), or quadratic, loss function is widely used in linear regression as the performance measure, and the method of minimizing MSE is called Ordinary Least Squares (OLS, https://en.wikipedia.org/wiki/Ordinary_least_squares). The basic principle of OLS is that the optimized fitting line should minimize the sum of distances of each point to the regression line, i.e. minimize the quadratic sum. The standard form of the MSE loss function is defined as

$$L = \frac{1}{N} \sum_{i=1}^{N} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$$

where $(y^{(i)} - \hat{y}^{(i)})$ is called the residual, and the target of the MSE loss function is to minimize the residual sum of squares. In DeepLearning4J, it is LossFunctions.LossFunction.MSE or LossFunctions.LossFunction.SQUARED_LOSS (they are the same in DL4J). However, when using Sigmoid as the activation function, the quadratic loss function suffers from slow convergence (learning speed); other activation functions do not have this problem. For example, using sigmoid, $\hat{y} = \sigma(z) = \sigma(\theta^{T} x)$, and considering only one sample with $L = (y - \sigma(z))^2$, the derivative is

$$\frac{\partial L}{\partial \theta} = -2\,\big(y - \sigma(z)\big)\,\sigma'(z)\,x$$

According to the shape of the Sigmoid, when $\sigma(z)$ tends to 0 or 1, $\sigma'(z)$ is close to zero, and when $\sigma(z)$ is close to 0.5, $\sigma'(z)$ reaches its maximum. Consequently, when the difference between the predicted value and the true label, $(y - \sigma(z))$, is large, $\sigma'(z)$ is close to 0, which slows down convergence. This is undesirable, since we expect the learning speed to be fast when the error is large.
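As a quick illustration of this effect, here is a minimal NumPy sketch (the helper name, input vector, and parameter values are illustrative assumptions, not from the original post) that evaluates the per-sample squared-error gradient for a sigmoid output. The saturated case has a large error but a nearly vanishing gradient, while the non-saturated case learns faster despite a smaller error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradient_wrt_theta(y, x, theta):
    """Gradient of (y - sigmoid(theta^T x))^2 w.r.t. theta for a single sample."""
    s = sigmoid(np.dot(theta, x))
    # dL/dtheta = -2 * (y - s) * sigma'(z) * x, with sigma'(z) = s * (1 - s)
    return -2.0 * (y - s) * s * (1.0 - s) * x

x = np.array([1.0, 1.0])  # illustrative input
y = 1.0                   # true label

# Saturated case: sigmoid(z) is close to 0, the error (y - s) is large,
# but sigma'(z) is nearly 0, so the gradient almost vanishes.
print(mse_gradient_wrt_theta(y, x, theta=np.array([-4.0, -4.0])))

# Non-saturated case: sigmoid(z) is 0.5, sigma'(z) is at its maximum,
# so the gradient is much larger even though the error is smaller.
print(mse_gradient_wrt_theta(y, x, theta=np.array([0.0, 0.0])))
```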
Mean Squared Logarithmic Error

Mean Squared Logarithmic Error (MSLE) loss function is a variant of MSE, defined as

$$L = \frac{1}{N} \sum_{i=1}^{N} \Big(\log\big(y^{(i)} + 1\big) - \log\big(\hat{y}^{(i)} + 1\big)\Big)^2$$

MSLE is also used to measure the difference between actual and predicted values. By taking the log of the predictions and actual values, what changes is the variance that you are measuring. It is usually used when you do not want to penalize huge differences between the predicted and the actual values when both of them are huge numbers. Another property is that MSLE penalizes under-estimates more than over-estimates.

1. If both predicted and actual values are small: MSE and MSLE are the same.
2. If either the predicted or the actual value is big: MSE > MSLE.
3. If both predicted and actual values are big: MSE > MSLE (MSLE becomes almost negligible).

It is expressed as LossFunctions.LossFunction.MEAN_SQUARED_LOGARITHMIC_ERROR in DeepLearning4J.

L2

The L2 loss function is the square of the L2 norm of the difference between actual value and predicted value. It is mathematically similar to MSE, only without the division by $N$; it is computed by

$$L = \sum_{i=1}^{N} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$$

For more details, please read the paper On Loss Functions for Deep Neural Networks in Classification (https://arxiv.org/pdf/1702.05659), which gives a comprehensive explanation of several commonly used loss functions, including the L2 and L1 loss functions. In DeepLearning4J, it is expressed as LossFunctions.LossFunction.L2.

Mean Absolute Error

Mean Absolute Error (MAE) is a quantity used to measure how close forecasts or predictions are to the eventual outcomes; it is computed by

$$L = \frac{1}{N} \sum_{i=1}^{N} \big|y^{(i)} - \hat{y}^{(i)}\big|$$

where $|\cdot|$ denotes the absolute value. Although both MSE and MAE are used in predictive modeling, there are several differences between them. MSE has nice mathematical properties which make it easier to compute the gradient, whereas MAE requires more complicated tools, such as linear programming, to compute the gradient. Because of the square, large errors have relatively greater influence on MSE than smaller errors do. Therefore, MAE is more robust to outliers, since it does not make use of the square. On the other hand, MSE is more useful when large errors have consequences much bigger than equivalent smaller ones. MSE also corresponds to maximizing the likelihood of Gaussian random variables. In DeepLearning4J, it is expressed as LossFunctions.LossFunction.MEAN_ABSOLUTE_ERROR.

Mean Absolute Percentage Error

Mean Absolute Percentage Error (MAPE) is a variant of MAE; it is computed by

$$L = \frac{1}{N} \sum_{i=1}^{N} \left|\frac{y^{(i)} - \hat{y}^{(i)}}{y^{(i)}}\right| \cdot 100$$

Although the concept of MAPE sounds very simple and convincing, it has major drawbacks in practical application:

1. It cannot be used if there are zero values (which sometimes happens, for example, in demand data), because there would be a division by zero.
2. For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
3. When MAPE is used to compare the accuracy of prediction methods, it is biased in that it will systematically select a method whose forecasts are too low. This little-known but serious issue can be overcome by using an accuracy measure based on the ratio of the predicted to actual value (called the Accuracy Ratio); this approach leads to superior statistical properties and to predictions which can be interpreted in terms of the geometric mean.

It is expressed as LossFunctions.LossFunction.MEAN_ABSOLUTE_PERCENTAGE_ERROR in DeepLearning4J.
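To make these comparisons concrete, the following self-contained NumPy sketch (the sample arrays are made-up values, not taken from the post) computes MSE, MSLE, MAE, and MAPE side by side. Note how the single large error dominates MSE far more than MAE, and how MSLE stays small when both the prediction and the target are large numbers:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def msle(y, y_hat):
    return np.mean((np.log(y + 1.0) - np.log(y_hat + 1.0)) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    # Undefined when any true value is zero (division by zero).
    return np.mean(np.abs((y - y_hat) / y)) * 100.0

# Illustrative targets and predictions; the last pair has a large absolute error
# but both values are large numbers.
y     = np.array([3.0, 5.0, 2.5, 1000.0])
y_hat = np.array([2.5, 5.0, 3.0,  900.0])

print("MSE :", mse(y, y_hat))    # dominated by the squared error of the last pair
print("MSLE:", msle(y, y_hat))   # stays small because both large values are huge numbers
print("MAE :", mae(y, y_hat))    # grows only linearly with the large error
print("MAPE:", mape(y, y_hat))   # scale-independent percentage error
```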
L1

The L1 loss function is the sum of absolute errors of the difference between actual value and predicted value. Analogous to the relation between MSE and L2, L1 is mathematically similar to MAE, only without the division by $N$, and it is defined as

$$L = \sum_{i=1}^{N} \big|y^{(i)} - \hat{y}^{(i)}\big|$$

In DeepLearning4J, it is expressed as LossFunctions.LossFunction.L1.

Kullback-Leibler (KL) Divergence

KL divergence, also known as relative entropy or information divergence/gain, is a measure of how one probability distribution diverges from a second, expected probability distribution. The KL divergence loss function is computed by

$$L = \frac{1}{N} \sum_{i=1}^{N} D_{\mathrm{KL}}\big(y^{(i)} \,\|\, \hat{y}^{(i)}\big) = \frac{1}{N} \sum_{i=1}^{N} y^{(i)} \cdot \log\frac{y^{(i)}}{\hat{y}^{(i)}} = \frac{1}{N} \sum_{i=1}^{N} \big(y^{(i)} \cdot \log y^{(i)}\big) - \frac{1}{N} \sum_{i=1}^{N} \big(y^{(i)} \cdot \log \hat{y}^{(i)}\big)$$

where the first term is the (negative) entropy of $y$ and the second is the cross entropy between $y$ and $\hat{y}$ (another kind of loss function, which will be introduced later). KL divergence is a distribution-wise asymmetric measure and thus does not qualify as a statistical metric of spread. In the simple case, a KL divergence of 0 indicates that we can expect similar, if not the same, behavior from the two distributions, while a KL divergence of 1 indicates that the two distributions behave in such a different manner that the expectation given the first distribution approaches zero. For more details, please visit the Wikipedia article (https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence). In DeepLearning4J, it is expressed as LossFunctions.LossFunction.KL_DIVERGENCE. Moreover, the implementation of Reconstruction Cross Entropy (https://en.wikipedia.org/wiki/Cross_entropy) in DeepLearning4J is the same as Kullback-Leibler divergence, thus you can also use LossFunctions.LossFunction.RECONSTRUCTION_CROSSENTROPY.

Cross Entropy
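The KL decomposition above already contains the cross-entropy term. As a hedged, self-contained illustration (the discrete distributions p and q below are made-up values, and this is not the DeepLearning4J implementation), the following NumPy sketch computes entropy, cross entropy, and KL divergence and checks numerically that $D_{\mathrm{KL}}(p\,\|\,q) = H(p, q) - H(p)$:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p * log p."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum p * log q."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(p || q) = sum p * log(p / q)."""
    return np.sum(p * np.log(p / q))

# Two illustrative discrete probability distributions (assumed values).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print("H(p)       :", entropy(p))
print("H(p, q)    :", cross_entropy(p, q))
print("D_KL(p||q) :", kl_divergence(p, q))

# KL divergence equals cross entropy minus entropy, matching the decomposition above.
assert np.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
```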
