Davydenko 2013
the aggregated accuracy is surprisingly small, and one apparent reason for this is the inability to agree on appropriate accuracy metrics (Hoover, 2006). As McCarthy, Davis, Golicic, and Mentzer (2006) reported, only 55% of the companies surveyed believed that their forecasting performance was being formally evaluated.

The key issue when evaluating a forecasting process is the improvements achieved in supply chain performance. While this has only an indirect link to the forecasting accuracy, organisations rely on accuracy improvements as a suitable proxy measure, not least because of their ease of calculation. This paper examines the behaviours of various well-known error measures in the particular context of demand forecasting in the supply chain. We show that, due to the features of SKU demand data, well-known error measures are generally not advisable for the evaluation of judgmental adjustments, and can even give misleading results. To be useful in supply chain applications, an error measure usually needs to have the following properties: (i) scale independence, though it is sometimes desirable to weight measures according to some characteristic such as their profitability; (ii) robustness to outliers; and (iii) interpretability (though the focus might occasionally shift to extremes, e.g., where ensuring a minimum level of supply is important).

The most popular measure used in practice is the mean absolute percentage error, MAPE (Fildes & Goodwin, 2007), which has long been criticised (see, for example, Fildes, 1992, Hyndman & Koehler, 2006, Kolassa & Schutz, 2007). In particular, the use of percentage errors is often inadvisable, due to the large number of extremely high percentages which arise from relatively low actual demand values.

To overcome the disadvantages of percentage measures, the MASE (mean absolute scaled error) measure was proposed by Hyndman and Koehler (2006). The MASE is a relative error measure which uses the MAE (mean absolute error) of a benchmark forecast (specifically, the random walk) as its denominator. In this paper we analyse the MASE and show that, like the MAPE, it also has a number of disadvantages. Most importantly: (i) it introduces a bias towards overrating the performance of a benchmark forecast as a result of arithmetic averaging; and (ii) it is vulnerable to outliers, as a result of dividing by small benchmark MAE values.

To ensure a more reliable evaluation of the effectiveness of adjustments, this paper proposes the use of an enhanced measure that shows the average relative improvement in MAE. In contrast to MASE, it is proposed that the weighted geometric average be used to find the average relative MAE. By taking the statistical forecast as a benchmark, it becomes possible to evaluate the relative change in forecasting accuracy yielded by the use of judgmental adjustments, without experiencing the limitations of other standard measures. Therefore, the proposed statistic can be used to provide a more robust and easily interpretable indicator of changes in accuracy, meeting the criteria laid down earlier.

The importance of the choice of an appropriate error measure can be seen from the fact that previous studies of the gains in accuracy from the judgmental adjustment process have produced conflicting results (e.g., Fildes et al., 2009, Franses & Legerstee, 2010). In these studies, different measures were applied to different datasets and arrived at different conclusions. Some studies where a set of measures was employed reported an interesting picture, where adjustments improved the accuracy in certain settings according to MdAPE (median absolute percentage error), while harming the accuracy in the same settings according to MAPE (Fildes et al., 2009; Trapero, Pedregal, Fildes, & Weller, 2011). In practice, such results may be damaging for forecasters and forecast users, since they do not give a clear indication of the changes in accuracy that correspond to some well-known loss function. Using real-world data, this paper considers the appropriateness of various previously used measures, and demonstrates the use of the proposed enhanced accuracy measurement scheme.

The next section describes the data employed for the analysis in this paper. Section 3 illustrates the disadvantages and limitations of various well-known error measures when they are applied to SKU-level data. In Section 4, the proposed accuracy measure is introduced. Section 5 contains the results from measuring the accuracy of judgmental adjustments with real-world data using the alternative measures, explains the differences in the results, and demonstrates the benefits of the proposed enhanced accuracy measure. The concluding section summarises the results of the empirical evaluation and offers practical recommendations as to which of the different error measures can be employed safely.

2. Descriptive analysis of the source data

The current research employed data collected from a company specialising in the manufacture of fast-moving consumer goods (FMCG). This is an extended data set from one of the companies considered by Fildes et al. (2009). The company concerned is a leading European provider of household and personal care products to a wide range of major retailers. Table 1 summarises the data set and indicates the number of cases used for the analysis. Each case includes (i) the one-step-ahead monthly forecast prepared using some statistical method (this will be called the system forecast); (ii) the corresponding judgmentally adjusted forecast (this will be called the final forecast); and (iii) the corresponding actual demand value. The system forecast was obtained using an enterprise software package, and the final forecast was obtained as a result of a revision of the statistical forecast by experts (Fildes et al., 2009). The two forecasts coincide when the experts had no extra information to add. The data set is representative of most FMCG manufacturing or distribution companies which deal with large numbers of time series of different lengths relating to different products, and is similar to the other manufacturing data sets considered by Fildes et al. (2009) in terms of the total number of time series, the proportion of judgmentally adjusted forecasts, and the frequencies of occurrence of zero errors and zero actuals.

Since the data relate to FMCG, the numbers of cases of zero demand periods and zero errors are not large (see Table 1). However, the further investigation of the properties of error measures presented in Section 3 will
512 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522
also consider possible situations when the data involve small counts, and zero observations occur more frequently (as is common with intermittent demand data).

As Table 1 shows, for this particular data set, adjustments of positive sign occur more frequently than adjustments of negative sign. However, in order to characterise the average magnitude of the adjustments, an additional analysis is required. In their study of judgmental adjustments, Fildes et al. (2009) analysed the size of judgmental adjustments using the measure of relative adjustment, defined as 100 × (Final forecast − System forecast)/System forecast.

Since the values of the relative adjustments are scale-independent, they can be compared across time series. However, the above measure is asymmetrical. For example, if an expert doubles a statistical forecast (say from 10 units to 20 units), he/she increases it by 100%, but if he/she halves a statistical forecast (say from 20 units to 10 units), he/she decreases it by 50% (not 100%). The sampling distribution of the relative adjustment is bounded by −100% on the left side and unbounded on the right side (see Fig. 1). Generally, these effects mean that the distribution of the relative adjustment may become non-informative about the size of the adjustment as measured on the original scale. When defining a ‘symmetric measure’, Mathews and Diamantopoulos (1987) argued for a measure where the adjustment size is measured relative to an average of the system and final forecasts. The same principle is used in the symmetric MAPE (sMAPE) measure proposed by Makridakis (1993). However, Goodwin and Lawton (1999) later showed that such approaches still do not lead to the desirable property of symmetry.

In this paper, in order to avoid the problem of the non-symmetrical scale of the relative adjustment, we carry out the analysis of the magnitude of adjustments using the natural logarithm of the (Final forecast/System forecast) ratio. From Fig. 2, it can be seen that the log-transformed relative adjustment follows a leptokurtic distribution. As is well known, the sample mean is not an efficient measure of location under departures from normality (Wilcox, 2005). We therefore used the trimmed mean as a more robust measure of location.

3. Appropriateness of existing measures for SKU-level demand data

3.1. Percentage errors

Let the forecasting error for a given time period t and SKU i be

e_{i,t} = Y_{i,t} − F_{i,t},

where Y_{i,t} is the demand value for SKU i observed at time t, and F_{i,t} is the forecast of Y_{i,t}.

A traditional way to compare the accuracy of forecasts across multiple time series is based on using absolute percentage errors (Hyndman & Koehler, 2006). Let us define the percentage error (PE) as p_{i,t} = 100 × e_{i,t}/Y_{i,t}. Hence, the absolute percentage error (APE) is |p_{i,t}|. The most popular PE-based measures are MAPE and MdAPE, which are defined as follows:

MAPE = mean(|p_{i,t}|),
MdAPE = median(|p_{i,t}|),

where mean(|p_{i,t}|) denotes the sample mean of |p_{i,t}| over all available values of i and t, and median(|p_{i,t}|) is the sample median.

In the study by Fildes et al. (2009), these measures served as the main tool for the analysis of the accuracy of judgmental adjustments. In order to determine the change in forecasting accuracy, MAPE and MdAPE values of the statistical baseline forecasts and the final judgmentally adjusted forecasts were calculated and compared. The significance of the change in accuracy was assessed based on the distribution of the differences between the absolute percentage errors (APEs) of the forecasts. The difference between APEs is defined as

d^APE_{i,t} = p^f_{i,t} − p^s_{i,t},

where p^f_{i,t} and p^s_{i,t} denote the APEs of the final and baseline statistical forecasts, respectively, for a given SKU i and period t. It can be tested whether the final forecast APE differs statistically from the statistical forecast APE.
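The PE-based measures defined above are straightforward to compute; the following is a minimal sketch (our own illustrative code, not the implementation used in the study; the function names are ours), which also shows how a single low actual value inflates the MAPE while leaving the MdAPE unaffected:

```python
import numpy as np

def percentage_errors(actuals, forecasts):
    """p_{i,t} = 100 * (Y_{i,t} - F_{i,t}) / Y_{i,t}; cases with Y_{i,t} = 0 are dropped."""
    y = np.asarray(actuals, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    mask = y != 0  # the PE is undefined for zero actuals
    return 100.0 * (y[mask] - f[mask]) / y[mask]

def mape(actuals, forecasts):
    return np.mean(np.abs(percentage_errors(actuals, forecasts)))

def mdape(actuals, forecasts):
    return np.median(np.abs(percentage_errors(actuals, forecasts)))

# Two ordinary 10% errors plus one small actual (Y = 2 with an error of 10 units):
y = [100.0, 100.0, 2.0]
f = [90.0, 110.0, 12.0]
print(mape(y, f))   # mean(10, 10, 500), about 173.3: dominated by the extreme APE
print(mdape(y, f))  # 10.0: the median ignores the extreme case
```

A single low actual thus shifts the MAPE by hundreds of percentage points, which is exactly the instability discussed in the text.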
Table 2
Summary statistics for the magnitude of adjustment, ln(Final forecast/System forecast), by sign of adjustment: 1st quartile, median, 3rd quartile, mean (2% trim), and exp[mean (2% trim)].
Because of the non-normal distribution of the difference, Fildes et al. (2009) tested whether the median of d^APE_{i,t} differs significantly from zero using a paired (Wilcoxon) signed rank test.

The sample mean of d^APE_{i,t} is the difference between the MAPE values corresponding to the statistical and final forecasts:

mean(d^APE_{i,t}) = mean(p^f_{i,t}) − mean(p^s_{i,t}) = MAPE^f − MAPE^s.    (1)

Therefore, testing the mean or median (in cases where the underlying distribution is symmetric) of d^APE_{i,t} against zero using the above-mentioned test leads to establishing whether MAPE^f differs significantly from MAPE^s.

The results reported suggest that, overall, the value of MAPE was improved by the use of adjustments, but the accuracy of positive and negative adjustments differed substantially. Based on the MAPE measure, it was found that positive adjustments did not change the forecasting accuracy significantly, while negative adjustments led to significant improvements. However, percentage error measures have a number of disadvantages when applied to the adjustments data, as we explain below.

One well-known disadvantage of percentage errors is that when the actual value Y_{i,t} in the denominator is relatively small compared to the forecast error e_{i,t}, the resulting percentage error p_{i,t} becomes extremely large, which distorts the results of further analyses (Hyndman & Koehler, 2006). Such high values can be treated as outliers, since they often do not allow for a meaningful interpretation (large percentage errors are not necessarily harmful or damaging, as they can arise merely from relatively low actual values). However, identifying outliers in a skewed distribution is a non-trivial problem, where it is necessary to determine an appropriate trimming level in order to find robust estimates, while at the same time avoiding losing too much information. Usually authors choose the trimming level for MAPE based on their experience after experimentation (for example, Fildes et al., 2009, used a 2% trim), but this decision still remains subjective. Moreover, the trimmed mean gives a biased estimate of location for highly skewed distributions (Marques et al., 2000), which complicates the interpretation of the trimmed MAPE. In particular, for a random variable that follows a highly skewed distribution, the expected value of the trimmed mean differs from the expected value of the random variable itself. This bias depends on both the trim level and the number of observations used to calculate the trimmed mean. It is therefore difficult to compare measurement results based on trimmed means.
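Eq. (1) is just the linearity of the sample mean, and can be confirmed numerically; the sketch below uses simulated data (the variable names are ours) and computes the APE differences directly:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(10, 100, size=50).astype(float)  # actual demand values
f_stat = y + rng.normal(0.0, 5.0, size=50)        # baseline statistical forecasts
f_final = y + rng.normal(0.0, 5.0, size=50)       # judgmentally adjusted forecasts

ape_stat = 100.0 * np.abs(y - f_stat) / y
ape_final = 100.0 * np.abs(y - f_final) / y
d_ape = ape_final - ape_stat                      # per-case APE differences

# mean(d_ape) equals MAPE(final) - MAPE(statistical), as in Eq. (1)
assert np.isclose(d_ape.mean(), ape_final.mean() - ape_stat.mean())
```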
In particular, comparisons based on trimmed means are unreliable for samples that contain different numbers of observations, even when the trim level remains the same.

SKU-level demand time series typically exhibit a high degree of variation in actual values, due to seasonal effects and the changing stages of a product’s life cycle. Therefore, data on adjustments can contain a high proportion of low demand values, which makes PE-based measures particularly inadvisable in this context. Considering extremes, a common occurrence in the situation of intermittent demand is for many observations (and forecasts) to be zero (see the discussion by Syntetos & Boylan, 2005). All cases with zero actual values must be excluded from the analysis, since the percentage error cannot be computed when Y_{i,t} = 0, due to its definition.

The extreme percentage errors that can be obtained can be shown using scaled values of errors and actual demand values (Fig. 3). The variables shown were scaled by the standard deviation of the actual values in each series in order to eliminate the differences between time series. It can be seen that the final forecast errors have a skewed distribution and are correlated with both the actual values and the signs of adjustments; it is also clear that a substantial number of the errors are comparable to the actual demand values. Excluding observations with relatively low values on the original scale (here, all observations less than 10 were excluded from the analysis, as was done by Fildes et al., 2009) still cannot improve the properties of percentage errors sufficiently, since a large number of observations still remain in the area where the actual demand value is less than the absolute error. This results in extremely high APEs (>100%), which are all too easy to misinterpret (since very large APEs do not necessarily correspond to very damaging errors, and arise primarily because of low actual demand values). In Fig. 3, the area below the dashed line shows cases in which the errors were higher than the actual demand values. These cases result in extreme percentage errors, as shown in Fig. 4. Due to the presence of extreme percentages, the distribution of APEs becomes highly skewed and heavy-tailed, which makes MAPE-based estimates highly unstable.

A widely used robust alternative to MAPE is MdAPE. However, MdAPE is neither easily interpretable nor sufficiently indicative of changes in accuracy when forecasting methods have differently shaped error distributions. The sample median of the APEs is resistant to the influence of extreme cases, but is also insensitive to large errors, even if they are not outliers or extreme percentages. Comparing the accuracy using the MdAPE shows the changes in accuracy that relate to the lowest 50% of APEs. However, an improvement in MdAPE can be accompanied by more damaging errors remaining above the median if the shapes of the error distributions differ. In Section 5, it will be shown that, while the MdAPE indicates that judgmental adjustments improve the accuracy for a given dataset, the trimmed MAPE suggests the opposite to be the case. Therefore, additional indicators are required in order to be able to draw better-substantiated conclusions with regard to the forecasting accuracy.

Apart from the presence of extreme APEs, another problem with using PE-based measures is that they can bias the comparison in favour of methods that issue low forecasts (Armstrong, 1985; Armstrong & Collopy, 1992; Kolassa & Schutz, 2007). This happens because, under certain conditions, percentage errors put a heavier penalty on positive errors than on negative errors. In particular, we can observe this when the forecast is taken as fixed. To illustrate this phenomenon, Kolassa and Schutz (2007) provide the following example. Assume that we have a time series that contains values distributed uniformly between 10 and 50. If we are using a symmetrical loss function, the best forecast for this time series would be 30. However, a forecast of 22 produces a better accuracy in terms of MAPE. As a result, if the aim is to choose a method that is better in terms of a linear loss, then the values of PE-based measures can be misleading. The way in which the use of MAPE can bias the comparison of the performances of judgmental adjustments of different signs will be illustrated below.

One important effect which arises from the presence of cognitive biases and the non-negative nature of demand values is the fact that the most damaging positive adjustments (producing the largest absolute errors) typically correspond to relatively low actuals (left corner of Fig. 3(a)), while the worst negative adjustments (producing the largest absolute errors) correspond to higher actuals (centre section, Fig. 3(b)). More specifically, the following general dependency can be found within most time series. The difference between the absolute final forecast error |e^f_{i,t}| and the absolute statistical forecast error |e^s_{i,t}| is positively correlated with the actual value Y_{i,t} for positive adjustments, while there is a negative correlation for negative adjustments. To reveal this effect, distribution-free measures of the association between variables were used. For each SKU i, Spearman’s ρ coefficients were calculated, representing the correlation between the improvement in terms of absolute errors, |e^f_{i,t}| − |e^s_{i,t}|, and the actual value Y_{i,t}. Fig. 5 shows the distributions of the coefficients ρ_i^+, calculated for positive adjustments, and ρ_i^−, corresponding to negative adjustments (the coefficients can take values 1 and −1 when only a few observations are present in a series). For the given dataset, mean ρ_i^+ ≈ 0.47 and mean ρ_i^− ≈ −0.44, indicating that the improvement in forecasting is markedly correlated with the actual demand values. This illustrates the fact that positive adjustments are most effective for larger values of demand, and least effective (or even damaging) for smaller values of demand. Strictly speaking, efficient averaging of correlation coefficients requires applying Fisher’s z transformation to them and then transforming the result back (see, e.g., Mudholkar, 1983); here we used the raw coefficients because we only wanted to show that the ρ value clearly correlates with the adjustment sign.

Because of the division by a scale factor that is correlated with the numerator, the difference of APEs (which is calculated as d^APE_{i,t} = 100 × (|e^f_{i,t}| − |e^s_{i,t}|)/Y_{i,t}) will not reflect changes in forecasting accuracy in terms of a symmetric loss function. More specifically, for positive adjustments, d^APE_{i,t} will systematically downgrade improvements in accuracy and exaggerate degradations of accuracy (on the percentage scale). In contrast, for negative adjustments, the improvements will be exaggerated.
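The Kolassa and Schutz (2007) example is easy to reproduce by simulation (a sketch under the stated assumptions; for demand uniform on [10, 50], the MAPE-optimal fixed forecast can be shown to be sqrt(500) ≈ 22.4, while the optimum under linear loss is the median, 30):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.uniform(10.0, 50.0, size=100_000)  # demand distributed uniformly on [10, 50]

def mae_of(f):   # linear (symmetric) loss of a fixed forecast f
    return np.mean(np.abs(y - f))

def mape_of(f):  # percentage loss of the same fixed forecast
    return np.mean(100.0 * np.abs(y - f) / y)

assert mae_of(30.0) < mae_of(22.0)    # under linear loss, 30 is the better forecast
assert mape_of(22.0) < mape_of(30.0)  # under MAPE, the lower forecast 22 wins
```

MAPE thus rewards the method that issues lower forecasts, even though that method is worse under a symmetric loss.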
Fig. 3. Dependencies between forecast error, actual value, and the sign of adjustment (based on scaled data).
Fig. 4. Percentage errors, depending on the actual demand value and adjustment sign.
Fig. 5. Spearman’s ρ coefficients showing the correlation between the improvement in accuracy and the actual demand value for each time series (relative frequency histograms).
At the same time, the errors from harmful forecasts will receive smaller weights. Since the difference in MAPEs is calculated as the sample mean of d^APE_{i,t} (in accordance with Eq. (1)), the comparison of forecasts using MAPE will also give a result which is biased towards underrating positive adjustments and overrating negative adjustments. Consequently, since the forecast errors arising from adjustments of different signs are penalised differently, the MAPE measure is flawed when comparing the performances of adjustments of different signs. One of the aims of the present research has therefore been to reinterpret the results of previous studies through the use of alternative measures.

A second measure based on percentage errors was also used by Franses and Legerstee (2010). In order to evaluate the accuracy of improvements, the RMSPE (root mean square percentage error) was calculated for the statistical and judgmentally adjusted forecasts, and the resulting values were then compared. Based on this measure, it was concluded that the expert adjusted forecasts were no better than the model forecasts. However, the RMSPE is
also based on percentage errors, and is affected by the outliers and biases described above even more strongly.

3.2. Relative errors

Another approach to obtaining scale-independent measures is based on using relative errors. The relative error (RE) is defined as

RE_{i,t} = e_{i,t}/e^b_{i,t},

where e^b_{i,t} is the forecast error obtained from a benchmark method. Usually a naïve forecast is taken as the benchmark method.

Well-known measures based on relative errors include the Mean Relative Absolute Error (MRAE), the Median Relative Absolute Error (MdRAE), and the Geometric Mean Relative Absolute Error (GMRAE):

MRAE = mean(|RE_{i,t}|),
MdRAE = median(|RE_{i,t}|),
GMRAE = gmean(|RE_{i,t}|).

Averaging the ratios of absolute errors across individual observations overcomes the problems related to dividing by actual values. In particular, the RE-based measures are not affected by the presence of low actual values, or by the correlation between errors and actual outcomes. However, REs also have a number of limitations.

The calculation of RE_{i,t} requires division by the non-zero error of the benchmark forecast e^b_{i,t}. In the case of calculating GMRAE, it is also required that e_{i,t} ≠ 0. The actual and forecasted demands are usually count data, which means that the forecasting errors are count data as well. With count data, the probability of a zero error of the benchmark forecast can be non-zero. Such cases must be excluded from the analysis when using relative errors. When using intermittent demand data, the use of relative errors becomes impossible due to the frequent occurrences of zero errors (Hyndman, 2006; Syntetos & Boylan, 2005). As was pointed out by Hyndman and Koehler (2006), in the case of continuous distributions, the benchmark forecast error e^b_{i,t} can have a positive probability density at zero, and therefore the use of MRAE can be problematic. In particular, RE_{i,t} can follow a heavy-tailed distribution for which the sample mean becomes a highly inefficient estimate that is vulnerable to outliers. In addition, the distribution of |RE_{i,t}| is highly skewed. At the same time, while MdRAE is highly robust, it cannot be sufficiently informative, as it is insensitive to large REs which lie in the tails of the distribution. Thus, even if the large REs are not outliers which arise from the division by relatively small benchmark errors, they still will not be taken into account when using MdRAE. Averaging the absolute REs using GMRAE is preferable to using either MRAE or MdRAE, as it provides a reliable and robust estimate, and at the same time takes into account the values of REs which lie in the tails of the distribution. Also, when averaging the benchmark ratios, the geometric mean has the advantage that it produces rankings which are invariant to the choice of the benchmark (see Fleming & Wallace, 1986).

Fildes (1992) recommends the use of the Relative Geometric Root Mean Square Error (RelGRMSE). The RelGRMSE for a particular time series i is defined as

RelGRMSE_i = [ ∏_{t ∈ T_i} ( e_{i,t}^2 / (e^b_{i,t})^2 ) ]^{1/(2 n_i)},

where T_i is a set containing the time periods for which non-zero errors e_{i,t} and e^b_{i,t} are available, and n_i is the number of elements in T_i.

After obtaining the RelGRMSE for each series, Fildes (1992) recommends finding the geometric mean of the RelGRMSEs across all time series, thus obtaining gmean(RelGRMSE_i). As Hyndman (2006) pointed out, the Geometric Root Mean Square Error (GRMSE) and the Geometric Mean Absolute Error (GMAE) are identical, because the square roots cancel each other in a geometric mean. Similarly, it can be shown that

gmean(RelGRMSE_i) = GMRAE.

An alternative representation of GMRAE is

GMRAE = exp[ (1/Σ_{i=1}^m n_i) Σ_{i=1}^m Σ_{t ∈ T_i} ln|RE_{i,t}| ],

where m is the total number of time series, and the other variables retain their previous meanings.

For the adjustments data set under consideration, only a small proportion of observations contain zero errors (about 1%). It has been found empirically that, for the given data set, the log-transformed absolute REs, ln|RE_{i,t}|, can be approximated adequately using a distribution which has a finite variance. In fact, even if a heavy-tailed distribution of ln|RE_{i,t}| arises, the influence of extreme cases can be eliminated based on various robustifying schemes, such as trimming or Winsorizing. In contrast to APEs, the use of such schemes for ln|RE_{i,t}| is unlikely to lead to biased estimates, since the distribution of ln|RE_{i,t}| is not highly skewed.

Though GMRAE (or, equivalently, gmean(RelGRMSE_i)) has some desirable statistical properties and can give a reliable aggregated indication of changes in accuracy, its use can be complicated for two reasons. Firstly, as was mentioned previously, zero-error forecasts cannot be taken into account directly. Secondly, in a similar way to the median, the geometric mean of absolute errors generally does not reflect changes in accuracy under standard loss functions. For instance, for a particular time series, GMAE (and, hence, GMRAE) favours methods which produce errors with heavier-tailed distributions, while for the same series the RMSE (root mean square error) can suggest the opposite ranking.

The latter aspect of using GMRAE can be illustrated with the following example, in which method A produces independent and identically distributed errors e^A_t that follow a heavy-tailed distribution.
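Computationally, the pooled GMRAE shown above reduces to exponentiating the mean of ln|RE_{i,t}| over all available cases; a minimal sketch follows (our own helper, which assumes that periods with zero errors have already been excluded, as required):

```python
import math

def gmrae(errors, bench_errors):
    """exp of the average ln|e / e_b| over all series and time periods."""
    log_sum, n = 0.0, 0
    for errs, bench in zip(errors, bench_errors):  # one list of errors per series
        for e, eb in zip(errs, bench):
            log_sum += math.log(abs(e / eb))       # ln|RE_{i,t}|
            n += 1
    return math.exp(log_sum / n)

# Absolute REs of 2 and 1/2 cancel in the geometric mean (GMRAE = 1.0),
# whereas their arithmetic mean, as used by MRAE, would be 1.25.
print(gmrae([[2.0], [-1.0]], [[1.0], [-2.0]]))  # 1.0
```

Because the ratios are averaged on the log scale, swapping the roles of the forecast and the benchmark simply inverts the result, which is the benchmark-invariance property noted by Fleming and Wallace (1986).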
the t-distribution with ν = 3 degrees of freedom: eAt ∼ tν . penalty for bad forecasting becomes larger than the reward
Also, let method B produce independent errors that follow for good forecasting.
the normal distribution: eBt ∼ N (0, 3). Let method B be the To show how the MASE rewards and penalises forecasts,
benchmark method. It can be shown analytically
that
the it can be represented as
variances for eAt and eBt are equal: Var eAt = Var eBt = 3. m
1
Thus, the relative RMSE (RelRMSE, the ratio of the two RM- MASE = 1 + m
ni (ri − 1) .
SEs) for this series is 1. However, the Relative Geometric
ni i = 1
RMSE (or, equivalently, GMRAE) will show that method A i =1
is better than method B: GMRAE ≈ 0.69 (based on 106
The reward for improving the benchmark MAE from A
simulated pairs of eAt and eBt ). Now if, for example, eBt ∼
to B (A > B) in a series i is Ri = ni (1 − B/A), while the
N (0, 2.5), then the RelRMSE and GMRAE will be 1.10 and
penalty for harming MAE by changing it from B to A is Pi =
0.76, respectively. This means that method B is now prefer- ni (A/B − 1). Since Ri < Pi , the reward given for improving
able in terms of the variance of errors, while method A is the benchmark MAE cannot balance the penalty given for
still (substantially) better in terms of the GMRAE. However, reducing the benchmark MAE by the same quantity. As a
the geometric mean absolute error is rarely used when op- result, obtaining MASE > 1 does not necessarily indicate
timising predictions with the use of mathematical models. that the accuracy of the benchmark method was better on
Some authors claim that the comparison based on RelRMSE average. This leads to ambiguity in the comparison of the
can be more desirable, as in this case the criterion used for accuracy of forecasts.
the optimisation of predictions corresponds to the evalua- For example, suppose that the performance of some
tion criteria (Diebold, 1993; Zellner, 1986). forecasting method is compared with the performance
Thus, analogously to what was said with regard to PE- of the naïve method across two series (m = 2) which
based measures, if the aim of the comparison is to choose contain equal numbers of forecasts and observations. For
a method that is better in terms of a linear or a quadratic the first series, the MAE ratio is r1 = 1/2, and for the
loss, then GMRAE may not be sufficiently informative, or second series, the MAE ratio is the opposite: r2 = 2/1.
may even lead to counterintuitive conclusions.

3.3. Scaled errors

In order to overcome the imperfections of PE-based measures, Hyndman and Koehler (2006) proposed the use of the MASE (mean absolute scaled error). For the scenario when forecasts are produced from varying origins but with a constant horizon, the MASE is calculated as follows (see Appendix):

$$q_{i,t} = \frac{e_{i,t}}{\mathrm{MAE}_i^b}, \qquad \mathrm{MASE} = \operatorname{mean}\left(\left|q_{i,t}\right|\right),$$

where $q_{i,t}$ is the scaled error and $\mathrm{MAE}_i^b$ is the mean absolute error (MAE) of the naïve (benchmark) forecast for series $i$. Though this was not specified by Hyndman and Koehler (2006), it is possible to show (see Appendix) that in the given scenario, the MASE is equivalent to the weighted arithmetic mean of relative MAEs, where the number of available values of $e_{i,t}$ is used as the weight:

$$\mathrm{MASE} = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i r_i, \qquad r_i = \frac{\mathrm{MAE}_i}{\mathrm{MAE}_i^b}, \tag{2}$$

where $m$ is the total number of series, $n_i$ is the number of values of $e_{i,t}$ for series $i$, $\mathrm{MAE}_i^b$ is the MAE of the benchmark forecast for series $i$, and $\mathrm{MAE}_i$ is the MAE of the forecast being evaluated against the benchmark.

It is known that the arithmetic mean is not strictly appropriate for averaging observations representing relative quantities, and in such situations the geometric mean should be used instead (Spizman & Weinstein, 2008). As a result of using the arithmetic mean of MAE ratios, Eq. (2) introduces a bias towards overrating the accuracy of a benchmark forecasting method. In other words, the arithmetic mean of the ratios favours the benchmark. For example, suppose that for one series $r_1 = 1/2$, while for another $r_2 = 2$. The improvement in accuracy for the first series obtained using the forecasting method is the same as the reduction for the second series. However, averaging the ratios gives $\mathrm{MASE} = \frac{1}{2}(r_1 + r_2) = 1.25$, which indicates that the benchmark method is better. While this is a well-known point, its implications for error measures, with the potential for misleading conclusions, are widely ignored.

In addition to the above effect, the use of MASE (as for MAPE) may result in unstable estimates, as the arithmetic mean is severely influenced by extreme cases which arise from dividing by relatively small values. In this case, outliers occur when dividing by the relatively small MAEs of benchmark forecasts which can appear in short series.

Some authors (e.g., Hoover, 2006) recommend the use of the MAD/MEAN ratio. In contrast to the MASE, the MAD/MEAN ratio approach assumes that the forecasting errors are scaled by the mean of the time series elements, instead of by the in-sample MAE of the naïve forecast. The advantage of this scheme is that it reduces the risk of dividing by a small denominator (see Kolassa & Schutz, 2007). However, Hyndman (2006) notes that the MAD/MEAN ratio assumes that the mean is stable over time, which may make it unreliable when the data exhibit trends or seasonal patterns. In Section 5, we show that both the MASE and the MAD/MEAN ratio are prone to outliers for the data set we consider in this paper. Generally, the use of these schemes carries the risk of producing unreliable estimates that are based on highly skewed left-bounded distributions.

Thus, while the use of the standard MAPE has long been known to be flawed, the newly proposed MASE suffers from some of the same limitations, and may also lead to an unreliable interpretation of the empirical results. We therefore need a measure that does not suffer from these problems. The next section presents an improved statistic which is more suitable for comparing the accuracies of SKU-level forecasts.
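The equivalence in Eq. (2), and the averaging problem just described, can be checked numerically. The sketch below uses invented errors and benchmark MAEs; the function names and data are ours, not from the paper:

```python
import numpy as np

def mase(errors, benchmark_maes):
    """MASE computed directly: the mean of all absolute scaled errors."""
    scaled = np.concatenate([np.abs(e) / mae_b
                             for e, mae_b in zip(errors, benchmark_maes)])
    return scaled.mean()

def mase_weighted(errors, benchmark_maes):
    """Equivalent form (Eq. (2)): arithmetic mean of relative MAEs r_i,
    weighted by the number of errors n_i available for each series."""
    n = np.array([len(e) for e in errors])
    r = np.array([np.abs(e).mean() / mae_b
                  for e, mae_b in zip(errors, benchmark_maes)])
    return (n * r).sum() / n.sum()

# Two series with invented forecast errors and benchmark MAEs.
errors = [np.array([1.0, -2.0, 0.5]), np.array([-0.4, 0.8])]
benchmark_maes = [2.0, 0.5]
assert np.isclose(mase(errors, benchmark_maes),
                  mase_weighted(errors, benchmark_maes))

# The averaging problem: halving the MAE on one series (r1 = 1/2) and
# doubling it on another (r2 = 2) should cancel out, yet the arithmetic
# mean reports 1.25 ("benchmark better"), while the geometric mean
# gives 1.
r = np.array([0.5, 2.0])
print(r.mean())                  # 1.25
print(np.exp(np.log(r).mean()))  # 1.0
```

The geometric mean treats a ratio and its reciprocal symmetrically, which is the property exploited by the AvgRelMAE measure introduced in Section 4.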
518 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522
4. Recommended accuracy evaluation scheme

The recommended forecast evaluation scheme is based on averaging the relative efficiencies of adjustments across time series. The geometric mean is the correct average to use for averaging benchmark ratio results, since it gives equal weight to reciprocal relative changes (Fleming & Wallace, 1986). Using the geometric mean of MAE ratios, it is possible to define an appropriate measure of the average relative MAE (AvgRelMAE). If the baseline statistical forecast is taken as the benchmark, then the AvgRelMAE, showing how the judgmentally adjusted forecasts improve/reduce the accuracy, can be found as

$$\mathrm{AvgRelMAE} = \left( \prod_{i=1}^{m} r_i^{\,n_i} \right)^{1 / \sum_{i=1}^{m} n_i}, \qquad r_i = \frac{\mathrm{MAE}_i^f}{\mathrm{MAE}_i^s}, \tag{3}$$

where $\mathrm{MAE}_i^s$ is the MAE of the baseline statistical forecast for series $i$, $\mathrm{MAE}_i^f$ is the MAE of the judgmentally adjusted forecast for series $i$, $n_i$ is the number of available errors of judgmentally adjusted forecasts for series $i$, and $m$ is the total number of time series. This differs from the proposals of Fildes (1992), who examined the behaviour of the GRMSEs of the individual relative errors.

The MAEs in Eq. (3) are found as

$$\mathrm{MAE}_i^f = \frac{1}{n_i} \sum_{t \in T_i} \left|e_{i,t}^f\right|, \qquad \mathrm{MAE}_i^s = \frac{1}{n_i} \sum_{t \in T_i} \left|e_{i,t}^s\right|,$$

where $e_{i,t}^f$ is the error of the judgmentally adjusted forecast for period $t$ and series $i$, $T_i$ is a set containing the time periods for which $e_{i,t}^f$ are available, and $e_{i,t}^s$ is the error of the baseline statistical forecast for period $t$ and series $i$.

The AvgRelMAE is immediately interpretable, as it represents the average relative value of MAE adequately, and directly shows how the adjustments improve/reduce the MAE compared to the baseline statistical forecast. Obtaining $\mathrm{AvgRelMAE} < 1$ means that on average $\mathrm{MAE}_i^f < \mathrm{MAE}_i^s$, and therefore adjustments improve the accuracy, while $\mathrm{AvgRelMAE} > 1$ indicates the opposite. The average percentage improvement in the MAE of forecasts is found as $(1 - \mathrm{AvgRelMAE}) \times 100$. If required, Eq. (3) can also be extended to other measures of dispersion or loss functions. For example, instead of the MAE one might use the MSE (mean square error), interquartile range, or mean prediction interval length. The choice of the measure depends on the purposes of the analysis. In this study, we use the MAE, assuming that the penalty is proportional to the absolute error.

Equivalently, the geometric mean of MAE ratios can be found as

$$\mathrm{AvgRelMAE} = \exp\left( \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i \ln r_i \right).$$

If the distributions of errors $e_{i,t}^f$ and $e_{i,t}^s$ within a given series $i$ have different levels of kurtosis, then $\ln r_i$ is a biased estimate of $\ln\left( \mathrm{E}\left|e_{i,t}^f\right| / \mathrm{E}\left|e_{i,t}^s\right| \right)$. Thus, the indication of an improvement under linear loss given by the AvgRelMAE may be biased. In fact, if $n_i = 1$ for each $i$, then the AvgRelMAE becomes equivalent to the GMRAE, which has the limitations described in Section 3.2. However, our experiments have shown that the bias of $\ln r_i$ diminishes rapidly as $n_i$ increases, becoming negligible for $n_i > 4$.

To eliminate the influence of outliers and extreme cases, the trimmed mean can be used in order to define a measure of location for the relative MAE. The trimmed AvgRelMAE for a given threshold $t$ ($0 \le t \le 0.5$) is calculated by excluding the $[tm]$ lowest and $[tm]$ highest values of $n_i \ln r_i$ from the calculations (square brackets indicate the integer part of $tm$). As was mentioned in Section 2, the optimal trim level depends on the distribution. In practice, the choice of the trim level usually remains subjective, since the distribution is unknown. Wilcox (1996) wrote that ‘Currently there is no way of being certain how much trimming should be done in a given situation, but the important point is that some trimming often gives substantially better results, compared to no trimming’ (p. 16). Our experiments show that a 5% level can be recommended for the AvgRelMAE measure. This level ensures high efficiency, because the underlying distribution usually does not exhibit large departures from the normal distribution. A manual screening for outliers could also be performed in order to exclude time series with non-typical properties from the analysis.

The results described in the next section show that the robust estimates obtained using a 5% trimming level are very close to the estimates based on the whole sample. The distribution of $n_i \ln r_i$ is more symmetrical than the distribution of either the APEs or the absolute scaled errors. Therefore, the analysis of the outliers in relative MAEs can be performed more efficiently than the analysis of outliers when using the measures considered previously.

Since the AvgRelMAE does not require scaling by actual values, it can be used in cases of low or zero actuals, as well as in cases of zero forecasting errors. Consequently, it is suitable for intermittent demand forecasts. The only limitation is that the MAEs in Eq. (3) should be greater than zero for all series.

Thus, the advantages of the recommended accuracy evaluation scheme are that it (i) can be interpreted easily, (ii) represents the performance of the adjustments objectively (without the introduction of substantial biases or outliers), (iii) is informative and uses all available information efficiently, and (iv) is applicable in a wide range of settings, with minimal assumptions about the features of the data.
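Eq. (3) and its exp-log form can be sketched as follows. This is a minimal illustration with invented numbers; `avg_rel_mae` is our name, not the paper's:

```python
import numpy as np

def avg_rel_mae(mae_f, mae_s, n):
    """AvgRelMAE (Eq. (3)): geometric mean of relative MAEs r_i = MAE_f/MAE_s,
    weighted by the number of available errors n_i per series."""
    r = np.asarray(mae_f, dtype=float) / np.asarray(mae_s, dtype=float)
    n = np.asarray(n, dtype=float)
    # The exp-log form avoids overflow/underflow from multiplying many ratios.
    return np.exp((n * np.log(r)).sum() / n.sum())

# Reciprocal changes cancel: halving the MAE on one series and doubling
# it on another (equal weights) gives AvgRelMAE = 1, i.e., no net change.
assert np.isclose(avg_rel_mae([1.0, 2.0], [2.0, 1.0], [5, 5]), 1.0)

# A genuine improvement: AvgRelMAE < 1; the average percentage
# improvement in MAE is (1 - AvgRelMAE) * 100.
score = avg_rel_mae([0.8, 0.9, 1.2], [1.0, 1.0, 1.0], [4, 6, 5])
print(round((1 - score) * 100, 1))  # roughly a 4% improvement
```

Note the requirement stated above: every per-series MAE must be strictly positive, otherwise the logarithm is undefined.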
5. Results of empirical evaluation
Table 3
Accuracy of adjustments according to different error measures. (Columns: error measure; statistical forecast vs. adjusted forecast, reported separately for positive adjustments, negative adjustments, and all nonzero adjustments.)
Fig. 6. Box-and-whisker plot for absolute percentage errors (log scale, zero-error forecasts excluded).
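The skewness visible in Fig. 6 is a generic consequence of dividing by actuals that can be small. A self-contained simulation (our synthetic data, not the case company's) reproduces the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demand: actuals vary widely and can be close to zero,
# while forecast errors are symmetric around zero.
actuals = rng.uniform(0.5, 20.0, size=10_000)
errors = rng.normal(0.0, 2.0, size=10_000)
ape = 100 * np.abs(errors) / actuals  # absolute percentage errors

# Division by small actuals produces a heavy right tail: the mean APE is
# dragged far above the median, so mean-based summaries (MAPE) and
# median-based summaries (MdAPE) can tell different stories.
print(np.mean(ape), np.median(ape))
assert np.mean(ape) > np.median(ape)
```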
used a 2% trim level for MAPE values. However, as noted, it is difficult to determine an appropriate trim level. As a result, the difference in APEs between the system and final forecasts has a very high dispersion and cannot be used efficiently to assess improvements in accuracy. It can also be seen that the distribution of APEs is highly skewed, which means that the trimmed means cannot be considered as unbiased estimates of the location. Although the distribution of the APEs has a very high kurtosis, our experiments show that increasing the trim level (say from 2% to 5%) would substantially bias the estimates of the location of the APEs, due to the extremely high skewness of the distribution. We therefore use the 2% trimmed MAPE in this study. Also, the use of this trim level makes the measurement results comparable to the results of Fildes et al. (2009).

Table 3 shows that the rankings based on the trimmed MAPE and MdAPE differ, suggesting different conclusions about the effectiveness of adjustments. As was explained in Section 3.1, the interpretation of PE-based measures is not straightforward. While the MdAPE is resistant to outliers, it is not sufficiently informative, as it is insensitive to APEs which lie above the median. Also, PE-based measures produce a biased comparison, since the improvement on the real scale within each series is correlated markedly with the actual value. Therefore, applying percentage errors in the current setting leads to ambiguous results and to confusion in their interpretation. For example, for positive adjustments, the trimmed MAPE and MdAPE suggest opposite rankings: while the trimmed MAPE shows a substantial worsening of the final forecast due to the judgmental adjustments, the MdAPE value points in the opposite direction.

The absolute scaled errors found using the MASE scheme (as described in Section 3.3) also follow a non-symmetrical distribution and can take extremely large values (Fig. 7) in short series where the MAE of the naïve forecast is smaller than the error of the judgmental forecast. For the adjustments data, the lengths of the series vary substantially, so the MASE is affected seriously by outliers. Fig. 8 shows that using the MAD/MEAN scheme instead of the MASE does not improve the properties of the distribution of the scaled errors. Table 3 shows that a trimmed version of the MAD/MEAN scheme gives the opposite rankings with regard to the overall accuracy of adjustments, which indicates that this scheme is highly unstable. Moreover, with such distributions, the use of trimming for either the MASE or the MAD/MEAN ratio leads to biased estimates, as was the case with the MAPE.

Fig. 9 shows that the log-transformed relative absolute errors follow a symmetric distribution and contain outliers that are easier to detect and to eliminate. Based on the shape of the underlying distribution, it seems that using a 5% trimmed GMRAE would give a location estimate with a reasonable level of efficiency. Although the GMRAE measure is not vulnerable to outliers, its interpretation can present difficulties, for the reasons explained in Section 3.2.

Compared to the APEs and the absolute scaled errors, the log-transformed relative MAEs are not affected severely by outliers and have a more symmetrical distribution (Fig. 10). The AvgRelMAE can therefore serve as a more reliable indicator of changes in accuracy. At the same time, in terms of a linear loss function, the AvgRelMAE scheme represents the effectiveness of adjustments adequately and gives a directly interpretable meaning.

The AvgRelMAE result shows improvements from both positive and negative adjustments, whereas according to the MAPE and MASE, only negative adjustments improve the accuracy. For the whole sample, adjustments improve the MAE of statistical forecasts by 10%, on average. Positive adjustments are less accurate than negative adjustments and provide only minor improvements.
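The trimming applied in this section can be made concrete with a sketch of the trimmed AvgRelMAE from Section 4. This is our reading of the scheme (the $[tm]$ lowest and highest values of $n_i \ln r_i$ are dropped and the weights renormalised over the remaining series), and the numbers are invented:

```python
import numpy as np

def trimmed_avg_rel_mae(r, n, trim=0.05):
    """Trimmed AvgRelMAE: drop the [trim*m] lowest and highest values of
    n_i * ln(r_i), then take the weighted geometric mean of the rest."""
    r = np.asarray(r, dtype=float)
    n = np.asarray(n, dtype=float)
    w = n * np.log(r)
    k = int(trim * len(r))          # [tm]: integer part of trim * m
    order = np.argsort(w)
    keep = order[k:len(r) - k] if k > 0 else order
    return np.exp(w[keep].sum() / n[keep].sum())

# 40 well-behaved series (MAE ratio 0.9) plus one extreme outlier (40.0),
# as can arise when a benchmark MAE is tiny in a short series.
r = [0.9] * 40 + [40.0]
n = [5] * 41
print(trimmed_avg_rel_mae(r, n, trim=0.0))   # the outlier drags the estimate up
print(trimmed_avg_rel_mae(r, n, trim=0.05))  # 0.9: the outlier is excluded
```

With no trimming, the single outlier pulls the estimate close to 1, masking a consistent 10% improvement across the other 40 series; a 5% trim recovers it.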
Fig. 7. Box-and-whisker plot for the absolute scaled errors found by the MASE scheme (log scale, zero-error forecasts excluded).
Fig. 8. Box-and-whisker plot for absolute scaled errors found by the MAD/MEAN scheme (log scale, zero-error forecasts excluded).
Fig. 9. Box-and-whisker plot for the log-transformed relative absolute errors (using the statistical forecast as the benchmark).
Fig. 10. Box-and-whisker plot for the weighted log-transformed relative MAEs (ni ln ri ).
Table 4
Results of using the binomial test to analyse the frequency of a successful adjustment. (Columns: adjustment sign; total number of adjustments; number of adjustments that improved the forecast; p-value; probability of a successful adjustment; 95% confidence interval for the probability of a successful adjustment.)
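The two-sided binomial test reported in Table 4 can be sketched from first principles. The counts below are invented placeholders, and the Wilson score interval is a stand-in for whichever confidence interval the authors actually used:

```python
from math import comb, sqrt

def binom_pmf(k, n, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    that are no more likely than the observed count."""
    observed = binom_pmf(k, n, p)
    total = 0.0
    for j in range(n + 1):
        pj = binom_pmf(j, n, p)
        if pj <= observed * (1 + 1e-9):  # tolerance for floating-point ties
            total += pj
    return total

def wilson_ci(k, n, z=1.96):
    """Approximate 95% Wilson score interval for the success probability."""
    phat = k / n
    denom = 1 + z * z / n
    centre = (phat + z * z / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Invented example: 60 of 100 adjustments improved the forecast.
# Under H0 the probability of a successful adjustment is 0.5.
print(binom_test_two_sided(60, 100))  # ~0.057: weak evidence against p = 0.5
print(wilson_ci(60, 100))             # roughly (0.50, 0.69)
```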
To determine whether the probability of a successful adjustment is higher than 0.5, a two-sided binomial test was applied. The results are shown in Table 4.

Based on the p-values obtained for each sample, it can be concluded that adjustments improved the accuracy of forecasts more frequently than they reduced it. However, the probability of a successful intervention was rather low for positive adjustments.

6. Conclusions

The appropriate measurement of accuracy is important in many organizational settings, and is not of merely academic interest. Due to the specific features of SKU-level demand data, many well-known error measures are not appropriate for use in evaluating the effectiveness of adjustments. In particular, the use of percentage errors is not advisable because of the considerable proportion of low actual values, which lead to high percentage errors with no direct interpretation for practical use. Moreover, the errors corresponding to adjustments of different signs are penalised differently when using percentage errors, because the forecasting errors are correlated with both the actual demand values and the adjustment sign. As a result, measures such as MAPE and MdAPE do not provide sufficient indication of the effectiveness of adjustments, in terms of a linear loss function. Similar arguments were also found to apply to the calculation of MASE, which can also induce biases and outliers as a result of using the arithmetic mean to average relative quantities. Thus, an organization which determines its forecast improvement strategy based on an inadequate measure will misallocate its resources, and will therefore fail in its objective of improving the accuracy at the SKU level.

In order to overcome the disadvantages of existing measures, it is recommended that an average relative MAE
be used which is calculated as the geometric mean of relative MAE values. This scheme allows for the objective comparison of forecasts, and is more reliable for the analysis of adjustments.

For the empirical dataset, the analysis has shown that adjustments improved accuracy in terms of the average relative MAE (AvgRelMAE) by approximately 10%. For the same dataset, a range of well-known error measures, including MAPE, MdAPE, GMRAE, MASE, and the MAD/MEAN ratio, indicated conflicting results. The MAPE-based results suggested that, on the whole, adjustments did not improve the accuracy, while the MdAPE results showed a substantial improvement (dropping from 25% to 20%, approximately). The analysis using MASE and the MAD/MEAN ratio was complicated, due to a highly skewed underlying distribution, and did not allow any firm conclusions to be reached. The GMRAE showed that adjustments improved the accuracy by 13%, a result that is close to that obtained using the AvgRelMAE. Since analyses based on different measures can lead to different conclusions, it is important to have a clear understanding of the statistical properties of any error measure used. We have described various undesirable effects that complicate the interpretation of the well-known error measures. As an improved scheme which is appropriate for evaluating changes in accuracy under linear loss, we recommend using the AvgRelMAE. The generalisation of this scheme can be obtained straightforwardly for other loss functions as well.

The process by which a new error measure is developed and accepted by an organisation has not received any research attention. A case in point is intermittent demand, where service improvements can be achieved, but only by abandoning the standard error metrics and replacing them with service-level objectives (Syntetos & Boylan, 2005). When an organisation and those to whom the forecasting function reports insist on retaining the MAPE or similar (as will mostly be the case), the forecaster's objective must then shift to delivering to the organisation's chosen performance measure, whilst using a more appropriate measure, such as the AvgRelMAE, to interpret what is really going on with the data. In essence, the forecaster cannot reasonably resort to using the organisation's measure and expect to achieve a cost-effective result.

Appendix. Alternative representation of MASE

According to Hyndman and Koehler (2006), for the scenario when forecasts are made from varying origins but with a constant horizon (here taken as 1), the scaled error is defined as¹

$$q_{i,t} = \frac{e_{i,t}}{\mathrm{MAE}_i^b}, \qquad \mathrm{MAE}_i^b = \frac{1}{l_i - 1} \sum_{j=2}^{l_i} \left|Y_{i,j} - Y_{i,j-1}\right|,$$

where $\mathrm{MAE}_i^b$ is the MAE from the benchmark (naïve) method for series $i$, $e_{i,t}$ is the error of a forecast being evaluated against the benchmark for series $i$ and period $t$, $l_i$ is the number of elements in series $i$, and $Y_{i,j}$ is the actual value observed at time $j$ for series $i$.

Let the mean absolute scaled error (MASE) be calculated by averaging the absolute scaled errors across time periods and time series:

$$\mathrm{MASE} = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \sum_{t \in T_i} \frac{\left|e_{i,t}\right|}{\mathrm{MAE}_i^b},$$

where $n_i$ is the number of available values of $e_{i,t}$ for series $i$, $m$ is the total number of series, and $T_i$ is a set containing the time periods for which the errors $e_{i,t}$ are available for series $i$.

Then,

$$\begin{aligned}
\mathrm{MASE} &= \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \sum_{t \in T_i} \frac{\left|e_{i,t}\right|}{\mathrm{MAE}_i^b} = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \frac{\sum_{t \in T_i} \left|e_{i,t}\right|}{\mathrm{MAE}_i^b} \\
&= \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i \, \frac{\frac{1}{n_i} \sum_{t \in T_i} \left|e_{i,t}\right|}{\mathrm{MAE}_i^b} = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i r_i, \qquad r_i = \frac{\mathrm{MAE}_i}{\mathrm{MAE}_i^b},
\end{aligned}$$

where $\mathrm{MAE}_i$ is the MAE for series $i$ for the forecast being evaluated against the benchmark.

¹ The formula corresponds to the software implementation described by Hyndman and Khandakar (2008).

References

Armstrong, J. S. (1985). Long-range forecasting: from crystal ball to computer. New York: John Wiley.
Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: empirical comparisons. International Journal of Forecasting, 8, 69–80.
Armstrong, J. S., & Fildes, R. (1995). Correspondence on the selection of error measures for comparisons among forecasting methods. Journal of Forecasting, 14(1), 67–71.
Diebold, F. X. (1993). On the limitations of comparing mean square forecast errors: comment. Journal of Forecasting, 12, 641–642.
Fildes, R. (1992). The evaluation of extrapolative forecasting methods. International Journal of Forecasting, 8(1), 81–98.
Fildes, R., & Goodwin, P. (2007). Against your better judgment? How organizations can improve their use of management judgment in forecasting. Interfaces, 37, 570–576.
Fildes, R., Goodwin, P., Lawrence, M., & Nikolopoulos, K. (2009). Effective forecasting and judgmental adjustments: an empirical evaluation and strategies for improvement in supply-chain planning. International Journal of Forecasting, 25(1), 3–23.
Fleming, P. J., & Wallace, J. J. (1986). How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM, 29(3), 218–221.
Franses, P. H., & Legerstee, R. (2010). Do experts’ adjustments on model-based SKU-level forecasts improve forecast quality? Journal of Forecasting, 29, 331–340.
Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 4, 405–408.
Hill, M., & Dixon, W. J. (1982). Robustness in real life: a study of clinical laboratory data. Biometrics, 38, 377–396.
Hoover, J. (2006). Measuring forecast accuracy: omissions in today’s forecasting engines and demand-planning software. Foresight: The International Journal of Applied Forecasting, 4, 32–35.
Hyndman, R. J. (2006). Another look at forecast-accuracy metrics for intermittent demand. Foresight: The International Journal of Applied Forecasting, 4(4), 43–46.
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3).
Hyndman, R. J., & Koehler, A. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679–688.
Kolassa, S., & Schutz, W. (2007). Advantages of the MAD/MEAN ratio over the MAPE. Foresight: The International Journal of Applied Forecasting, 6, 40–43.
Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9, 527–529.
Marques, C. R., Neves, P. D., & Sarmento, L. M. (2000). Evaluating core inflation indicators. Working paper 3-00. Economics Research Department, Banco de Portugal.
Mathews, B., & Diamantopoulos, A. (1987). Alternative indicators of forecast revision and improvement. Marketing Intelligence, 5(2), 20–23.
McCarthy, T. M., Davis, D. F., Golicic, S. L., & Mentzer, J. T. (2006). The evolution of sales forecasting management: a 20-year longitudinal study of forecasting practice. Journal of Forecasting, 25, 303–324.
Mudholkar, G. S. (1983). Fisher’s z-transformation. Encyclopedia of Statistical Sciences, 3, 130–135.
Sanders, N., & Ritzman, L. (2004). Integrating judgmental and quantitative forecasts: methodologies for pooling marketing and operations information. International Journal of Operations and Production Management, 24, 514–529.
Spizman, L., & Weinstein, M. (2008). A note on utilizing the geometric mean: when, why and how the forensic economist should employ the geometric mean. Journal of Legal Economics, 15(1), 43–55.
Syntetos, A. A., & Boylan, J. E. (2005). The accuracy of intermittent demand estimates. International Journal of Forecasting, 21(2), 303–314.
Trapero, J. R., Pedregal, D. J., Fildes, R., & Weller, M. (2011). Analysis of judgmental adjustments in presence of promotions. Paper presented at the 31st International Symposium on Forecasting, ISF2011, Prague.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: heuristics and biases. Science, 185, 1124–1130.
Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.
Wilcox, R. R. (2005). Trimmed means. Encyclopedia of Statistics in Behavioral Science, 4, 2066–2067.
Zellner, A. (1986). A tale of forecasting 1001 series: the Bayesian knight strikes again. International Journal of Forecasting, 2, 491–494.

Andrey Davydenko is working in the area of the development and software implementation of statistical methods for business forecasting. He has a Ph.D. from Lancaster University. He holds a candidate of science degree in mathematical methods in economics. His current research focuses on the composite use of judgmental and statistical information in forecasting support systems.

Robert Fildes is Professor of Management Science in the School of Management, Lancaster University, and Director of the Lancaster Centre for Forecasting. He has a mathematics degree from Oxford and a Ph.D. in statistics from the University of California. He was co-founder of the Journal of Forecasting in 1981 and of the International Journal of Forecasting in 1985. For ten years from 1988 he was Editor-in-Chief of the IJF. He was president of the International Institute of Forecasters between 2000 and 2004. His current research interests are concerned with the comparative evaluation of different forecasting methods, the implementation of improved forecasting procedures in organizations and the design of forecasting systems.