0% found this document useful (0 votes)

38 views13 pages

Davydenko 2013

Uploaded by

rachanaudupaaa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

38 views13 pages

Davydenko 2013

Uploaded by

rachanaudupaaa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

International Journal of Forecasting 29 (2013) 510–522

Contents lists available at SciVerse ScienceDirect

International Journal of Forecasting

journal homepage: www.elsevier.com/locate/ijforecast

Measuring forecasting accuracy: The case of judgmental

adjustments to SKU-level demand forecasts
Andrey Davydenko ∗ , Robert Fildes
Department of Management Science, Lancaster University, Lancaster, Lancashire, LA1 4YX, United Kingdom

article info abstract

Keywords: Forecast adjustment commonly occurs when organizational forecasters adjust a statistical
Judgmental adjustments
forecast of demand to take into account factors which are excluded from the statistical
Forecasting support systems
calculation. This paper addresses the question of how to measure the accuracy of such
Forecast accuracy
Forecast evaluation adjustments. We show that many existing error measures are generally not suited to the
Forecast error measures task, due to specific features of the demand data. Alongside the well-known weaknesses
of existing measures, a number of additional effects are demonstrated that complicate the
interpretation of measurement results and can even lead to false conclusions being drawn.
In order to ensure an interpretable and unambiguous evaluation, we recommend the use
of a metric based on aggregating performance ratios across time series using the weighted
geometric mean. We illustrate that this measure has the advantage of treating over- and
under-forecasting even-handedly, has a more symmetric distribution, and is robust.
Empirical analysis using the recommended metric showed that, on average, adjust-
ments yielded improvements under symmetric linear loss, while harming accuracy in
terms of some traditional measures. This provides further support to the critical impor-
tance of selecting appropriate error measures when evaluating the forecasting accuracy.
© 2012 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

1. Introduction order to ensure the rational use of the organisation’s re-

sources which are invested in the forecasting process.
The most well-established approach to forecasting The task of measuring the accuracy of judgmental
within supply chain companies starts with a statistical adjustments is inseparably linked with the need to choose
time series forecast, which is then adjusted by managers in an appropriate error measure. In fact, the choice of an
the company based on their expert knowledge. This pro- error measure for assessing the accuracy of forecasts across
cess is usually carried out at a highly disaggregated level time series is itself an important topic for forecasting
of SKUs (stock-keeping units), where there are often hun- research. It has theoretical implications for the comparison
dreds if not thousands of series to consider (Fildes & Good- of forecasting methods and is of wide practical importance,
win, 2007; Sanders & Ritzman, 2004). At the same time, the since the forecasting function is often evaluated using
empirical evidence suggests that judgments under uncer- inappropriate measures (see, for example, Armstrong &
tainty are affected by various types of cognitive biases and Collopy, 1992; Armstrong & Fildes, 1995), and therefore
are inherently non-optimal (Tversky & Kahneman, 1974).
the link to economic performance may well be distorted.
Such biases and inefficiencies have been shown to apply
Despite the continuing interest in the topic, the choice of
specifically to judgmental adjustments (Fildes, Goodwin,
the most suitable error measure for evaluating companies’
Lawrence, & Nikolopoulos, 2009). Therefore, it is impor-
forecasts still remains controversial. Due to their statistical
tant to monitor the accuracy of judgmental adjustments in
properties, popular error measures do not always ensure
easily interpretable results when applied to real-world
∗ Corresponding author. Tel.: +44 1524 593879. data (Hyndman & Koehler, 2006; Kolassa & Schutz,
E-mail address: [email protected] (A. Davydenko). 2007). In practice, the proportion of firms which track
0169-2070/$ – see front matter © 2012 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
doi:10.1016/j.ijforecast.2012.09.002
A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522 511

the aggregated accuracy is surprisingly small, and one process have produced conflicting results (e.g., Fildes et al.,
apparent reason for this is the inability to agree on 2009, Franses & Legerstee, 2010). In these studies, different
appropriate accuracy metrics (Hoover, 2006). As McCarthy, measures were applied to different datasets and arrived at
Davis, Golicic, and Mentzer (2006) reported, only 55% of different conclusions. Some studies where a set of mea-
the companies surveyed believed that their forecasting sures was employed reported an interesting picture, where
performance was being formally evaluated. adjustments improved the accuracy in certain settings ac-
The key issue when evaluating a forecasting process is cording to MdAPE (median absolute percentage error),
the improvements achieved in supply chain performance. while harming the accuracy in the same settings accord-
While this has only an indirect link to the forecasting ing to MAPE (Fildes et al., 2009; Trapero, Pedregal, Fildes,
accuracy, organisations rely on accuracy improvements as & Weller, 2011). In practice, such results may be damaging
a suitable proxy measure, not least because of their ease of for forecasters and forecast users, since they do not give a
calculation. This paper examines the behaviours of various clear indication of the changes in accuracy that correspond
well-known error measures in the particular context of to some well-known loss function. Using real-world data,
demand forecasting in the supply chain. We show that, this paper considers the appropriateness of various previ-
due to the features of SKU demand data, well-known error ously used measures, and demonstrates the use of the pro-
measures are generally not advisable for the evaluation posed enhanced accuracy measurement scheme.
of judgmental adjustments, and can even give misleading The next section describes the data employed for the
results. To be useful in supply chain applications, an error analysis in this paper. Section 3 illustrates the disadvan-
measure usually needs to have the following properties: tages and limitations of various well-known error mea-
(i) scale independence—though it is sometimes desirable sures when they are applied to SKU-level data. In Section 4,
to weight measures according to some characteristic the proposed accuracy measure is introduced. Section 5
such as their profitability; (ii) robustness to outliers; and contains the results from measuring the accuracy of judg-
(iii) interpretability (though the focus might occasionally mental adjustments with real-world data using the alter-
shift to extremes, e.g., where ensuring a minimum level of native measures and explains the differences in the results,
supply is important). demonstrating the benefits of the proposed enhanced ac-
The most popular measure used in practice is the mean curacy measure. The concluding section summarises the
absolute percentage error, MAPE (Fildes & Goodwin, 2007), results of the empirical evaluation and offers practical rec-
which has long been being criticised (see, for example, ommendations as to which of the different error measures
Fildes, 1992, Hyndman & Koehler, 2006, Kolassa & Schutz, can be employed safely.
2007). In particular, the use of percentage errors is often
inadvisable, due to the large number of extremely high 2. Descriptive analysis of the source data
percentages which arise from relatively low actual demand
values. The current research employed data collected from a
To overcome the disadvantages of percentage mea- company specialising in the manufacture of fast-moving
sures, the MASE (mean absolute scaled error) measure was consumer goods (FMCG). This is an extended data set
proposed by Hyndman and Koehler (2006). The MASE is a from one of the companies considered by Fildes et al.
relative error measure which uses the MAE (mean abso- (2009). The company concerned is a leading European
lute error) of a benchmark forecast (specifically, the ran- provider of household and personal care products to a wide
dom walk) as its denominator. In this paper we analyse the range of major retailers. Table 1 summarises the data set
MASE and show that, like the MAPE, it also has a number of and indicates the number of cases used for the analysis.
disadvantages. Most importantly: (i) it introduces a bias to- Each case includes (i) the one-step-ahead monthly forecast
wards overrating the performance of a benchmark forecast prepared using some statistical method (this will be called
as a result of arithmetic averaging; and (ii) it is vulnerable the system forecast); (ii) the corresponding judgmentally
to outliers, as a result of dividing by small benchmark MAE adjusted forecast (this will be called the final forecast);
values. and (iii) the corresponding actual demand value. The
To ensure a more reliable evaluation of the effectiveness system forecast was obtained using an enterprise software
of adjustments, this paper proposes the use of an enhanced package, and the final forecast was obtained as a result of
measure that shows the average relative improvement in a revision of the statistical forecast by experts (Fildes et al.,
MAE. In contrast to MASE, it is proposed that the weighted 2009). The two forecasts coincide when the experts had
geometric average be used to find the average relative no extra information to add. The data set is representative
MAE. By taking the statistical forecast as a benchmark, of most FMCG manufacturing or distribution companies
it becomes possible to evaluate the relative change in which deal with large numbers of time series of different
forecasting accuracy yielded by the use of judgmental lengths relating to different products, and is similar to
adjustments, without experiencing the limitations of other the other manufacturing data sets considered by Fildes
standard measures. Therefore, the proposed statistic can et al. (2009), in terms of the total number of time series,
be used to provide a more robust and easily interpretable the proportion of judgmentally adjusted forecasts and the
indicator of changes in accuracy, meeting the criteria laid frequencies of occurrence of zero errors and zero actuals.
down earlier. Since the data relate to FMCG, the numbers of cases
The importance of the choice of an appropriate error of zero demand periods and zero errors are not large
measure can be seen from the fact that previous studies (see Table 1). However, the further investigation of the
of the gains in accuracy from the judgmental adjustment properties of error measures presented in Section 3 will
512 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522

Table 1 summary measure of location. The optimal trim level

Source data summary. that corresponds to the lowest variance of the trimmed
Total number of cases 6882 mean depends on the distribution, which is unknown
Total number of time series (SKUs) 412 in the current case. Some studies have shown that, for
Period of observations March 2004–July 2007
symmetrical distributions, a 5% trim generally ensures a
high efficiency with a useful degree of robustness (e.g., Hill
Total number of adjusted statistical 4779 (69%)
forecasts (% of total number of cases)
& Dixon, 1982). However, it is also known that the trimmed
mean gives a biased estimate if the distribution is skewed
Number of zero actual demand periods 271 (4%) (Marques, Neves, & Sarmento, 2000). We used a 2% trim
(% of total number of cases)
in order to eliminate the influence of outliers while at the
Number of zero-error statistical forecasts 47 (<1%) same time avoiding introducing a substantial bias.
(% of total number of cases) The results presented in Table 2 suggest that, on aver-
Number of zero-error judgmentally 61 (1%) age, for the dataset under consideration, the magnitude of
adjusted forecasts (% of total number of positive adjustments is higher than the magnitude of nega-
adjusted forecasts) tive adjustments, measured relative to the system forecast.
Number of positive adjustments 3394 (71%) The average magnitude of a positive relative adjustment is
(% of total number of adjusted forecasts) about twice as large as the average magnitude of a nega-
tive adjustment. Also, adjustments with positive signs have
Number of negative adjustments 1385 (29%)
(% of total number of adjusted forecasts) much higher ranges than negative ones.

also consider possible situations when the data involve 3. Appropriateness of existing measures for SKU-level
small counts, and zero observations occur more frequently demand data
(as is common with intermittent demand data).
As Table 1 shows, for this particular data set, ad- 3.1. Percentage errors
justments of positive sign occur more frequently than
adjustments of negative sign. However, in order to Let the forecasting error for a given time period t and
characterise the average magnitude of the adjustments, SKU i be
an additional analysis is required. In their study of judg-
mental adjustments, Fildes et al. (2009) analysed the size ei,t = Yi,t − Fi,t ,
of judgmental adjustments using the measure of rela- where Yi,t is a demand value for SKU i observed at time t,
tive adjustments that is defined as 100 × (Final forecast and Fi,t is the forecast of Yi,t .
− System forecast)/System forecast. A traditional way to compare the accuracy of forecasts
Since the values of the relative adjustments are scale- across multiple time series is based on using absolute
independent, they can be compared across time series. percentage errors (Hyndman & Koehler, 2006). Let us
However, the above measure is asymmetrical. For exam- define the percentage error (PE) as pi,t = 100 × ei,t /Yi,t .
ple, if an expert doubles a statistical forecast (say from 10 Hence, the absolute percentage error (APE) is |pi,t |. The
units to 20 units), he/she increases it by 100%, but if he/she most popular PE-based measures are MAPE and MdAPE,
halves a statistical forecast (say from 20 units to 10 units), which are defined as follows:
he/she decreases it by 50% (not 100%). The sampling distri-
bution of the relative adjustment is bounded by −100% on MAPE = mean(|pi,t |),
the left side and unbounded on the right side (see Fig. 1). MdAPE = median(|pi,t |),
Generally, these effects mean that the distribution of the
where mean(|pi,t |) denotes the sample mean of |pi,t | over
relative adjustment may become non-informative about
all available values of i and t, and median(|pi,t |) is the
the size of the adjustment as measured on the original
sample median.
scale. When defining a ‘symmetric measure’, Mathews and
Diamantopoulos (1987) argued for a measure where the In the study by Fildes et al. (2009), these measures
adjustment size is measured relative to an average of the served as the main tool for the analysis of the accuracy
system and final forecasts. The same principle is used in the of judgmental adjustments. In order to determine the
symmetric MAPE (sMAPE) measure proposed by Makri- change in forecasting accuracy, MAPE and MdAPE values
dakis (1993). However, Goodwin and Lawton (1999) later of the statistical baseline forecasts and final judgmentally
showed that such approaches still do not lead to the desir- adjusted forecasts were calculated and compared. The
able property of symmetry. significance of the change in accuracy was assessed based
In this paper, in order to avoid the problem of the non- on the distribution of the differences between the absolute
symmetrical scale of the relative adjustment, we carry out percentage errors (APEs) of forecasts. The difference
the analysis of the magnitude of adjustments using the between APEs is defined as
natural logarithm of the (Final forecast/System forecast)
,t = pi,t − pi,t ,
   
diAPE  f   s 
ratio. From Fig. 2, it can be seen that the log-transformed    
relative adjustment follows a leptokurtic distribution. As is where pfi,t  and psi,t  denote APEs for the final and baseline
well known, the sample mean is not an efficient measure of statistical forecasts, respectively, for a given SKU i and
location under departures from normality (Wilcox, 2005). period t. It can be tested whether the final forecast
We therefore used the trimmed mean as a more robust APE differs statistically from the statistical forecast APE;
A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522 513

Fig. 1. Histogram of the relative adjustment, measured in percentages.

Fig. 2. Histogram of ln(Final forecast/System forecast).

Table 2
Summary statistics for the magnitude of adjustment.
Sign of adjustment ln(Final forecast/System forecast)
1st quartile Median 3rd quartile Mean(2% trim) exp[Mean(2% trim) ]

Positive 0.123 0.273 0.592 0.412 1.510

Negative −0.339 −0.153 −0.071 −0.290 0.749
Both −0.043 0.144 0.425 0.218 1.243

because of the non-normal distribution of the difference, relatively small compared to the forecast error ei,t , the
Fildes et al. (2009) tested whether the median of diAPE,t resulting percentage error pi,t becomes extremely large,
differs significantly from zero using a two-sample paired which distorts the results of further analyses (Hyndman
(Wilcoxon) sign rank test. & Koehler, 2006). Such high values can be treated as
The sample mean of diAPE ,t is the difference between outliers, since they often do not allow for a meaningful
the MAPE values corresponding to the statistical and final interpretation (large percentage errors are not necessarily
forecasts: harmful or damaging, as they can arise merely from
relatively low actual values). However, identifying outliers
 
mean diAPE = mean pfi,t  − mean psi,t 
   
,t
in a skewed distribution is a non-trivial problem, where
= MAPEf − MAPEs . (1) it is necessary to determine an appropriate trimming
Therefore, testing the mean or median (in cases where level in order to find robust estimates, while at the
the underlying distribution is symmetric) of diAPE same time avoiding losing too much information. Usually
,t against
zero using the above-mentioned test leads to establishing authors choose the trimming level for MAPE based on
whether MAPEf differs significantly from MAPEs . their experience after experimentation (for example,
The results reported suggest that, overall, the value Fildes et al., 2009, used a 2% trim), but this decision
of MAPE was improved by the use of adjustments, but still remains subjective. Moreover, the trimmed mean
the accuracy of positive and negative adjustments differed gives a biased estimate of location for highly skewed
substantially. Based on the MAPE measure, it was found distributions (Marques et al., 2000), which complicates the
that positive adjustments did not change the forecasting interpretation of the trimmed MAPE. In particular, for a
accuracy significantly, while negative adjustments led random variable that follows a highly skewed distribution,
to significant improvements. However, percentage error the expected value of the trimmed mean differs from
measures have a number of disadvantages when applied the expected value of the random variable itself. This
to the adjustments data, as we explain below. bias depends on both the trim level and the number
One well-known disadvantage of percentage errors is of observations used to calculate the trimmed mean.
that when the actual value Yi,t in the denominator is Therefore, it is difficult to compare the measurement
514 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522

results based on the trimmed means for samples that bias the comparison in favour of methods that issue low
contain different numbers of observations, even when the forecasts (Armstrong, 1985; Armstrong & Collopy, 1992;
trim level remains the same. Kolassa & Schutz, 2007). This happens because, under
SKU-level demand time series typically exhibit a high certain conditions, percentage errors put a heavier penalty
degree of variation in actual values, due to seasonal on positive errors than on negative errors. In particular,
effects and the changing stages of a product’s life we can observe this when the forecast is taken as fixed.
cycle. Therefore, data on adjustments can contain a high To illustrate this phenomenon, Kolassa and Schutz (2007)
proportion of low demand values, which makes PE- provide the following example. Assume that we have
based measures particularly inadvisable in this context. a time series that contains values distributed uniformly
Considering extremes, a common occurrence in the between 10 and 50. If we are using a symmetrical loss
situation of intermittent demand is for many observations function, the best forecast for this time series would be
(and forecasts) to be zero (see the discussion by Syntetos 30. However, a forecast of 22 produces a better accuracy
& Boylan, 2005). All cases with zero actual values must in terms of MAPE. As a result, if the aim is to choose a
be excluded from the analysis, since the percentage error method that is better in terms of a linear loss, then the
cannot be computed when Yi,t = 0, due to its definition. values of PE-based measures can be misleading. The way
The extreme percentage errors that can be obtained in which the use of MAPE can bias the comparison of the
can be shown using scaled values of errors and actual performances of judgmental adjustments of different signs
demand values (Fig. 3). The variables shown were scaled will be illustrated below.
by the standard deviation of actual values in each series One important effect which arises from the presence of
in order to eliminate the differences between time series. cognitive biases and the non-negative nature of demand
It can be seen that the final forecast errors have a values is the fact that the most damaging positive
skewed distribution and are correlated with both the adjustments (producing the largest absolute errors)
actual values and the signs of adjustments; it is also clear typically correspond to relatively low actuals (left corner
that a substantial number of the errors are comparable of Fig. 3(a)), while the worst negative adjustments (pro-
to the actual demand values. Excluding observations with ducing the largest absolute errors) correspond to higher
relatively low values on the original scale (here, all actuals (centre section, Fig. 3(b)). More specifically, the fol-
observations less than 10 were excluded from the analysis, lowing general dependency can be found within most time
as was done by Fildes et al., 2009) still cannot improve the series. The
properties of percentage errors sufficiently, since a large
 difference between the absolute final forecast
 
error efi,t  and the absolute statistical forecast error esi,t  is
number of observations still remain in the area where positively correlated with the actual value Yi,t for positive
the actual demand value is less than the absolute error. adjustments, while there is a negative correlation for neg-
This results in extremely high APEs (>100%), which are
ative adjustments. To reveal this effect, distribution-free
all too easy to misinterpret (since very large APEs do not
measures of the association between variables were used.
necessarily correspond to very damaging errors, and arise
For each SKU i, Spearman’s ρ coefficients were calculated,
primarily because of low actual demand values). In Fig. 3,
representing the correlationbetween   the  improvement
the area below the dashed line shows cases in which
in terms of absolute errors efi,t  − esi,t  and the actual
the errors were higher than the actual demand values.
value Yi,t . Fig. 5 shows the distributions of the coefficients
These cases result in extreme percentage errors, as shown
ρi+ , calculated for positive adjustments, and ρi− , corre-
in Fig. 4. Due to the presence of extreme percentages,
sponding to negative adjustments (the coefficients can
the distribution of APEs becomes highly skewed and
take values 1 and −1 when only a few observations   are
heavy-tailed, which makes MAPE-based estimates highly
present in a series). For the given dataset, mean ρi+ ≈
unstable.
0.47 and mean ρi− ≈ −0.44, indicating that the im-
 
A widely used robust alternative to MAPE is MdAPE.
However, MdAPE is neither easily interpretable nor suffi- provement in forecasting is markedly correlated with the
ciently indicative of changes in accuracy when forecast- actual demand values. This illustrates the fact that posi-
ing methods have different shaped error distributions. tive adjustments are most effective for larger values of de-
The sample median of the APEs is resistant to the influ- mand, and least effective (or even damaging) for smaller
ence of extreme cases, but is also insensitive to large er- values of demand. Actually, efficient averaging of correla-
rors, even if they are not outliers or extreme percent- tion coefficients requires applying Fisher’s z transforma-
ages. Comparing the accuracy using the MdAPE shows the tion to them and then transforming back the result (see,
changes in accuracy that relate to the lowest 50% of APEs. e.g., Mudholkar, 1983). But here we used raw coefficients
However, MdAPE’s improvement can be accompanied by because we only wanted to show that the ρ value clearly
remaining more damaging errors which lie above the me- correlates with the adjustment sign.
dian if the shapes of the error distributions differ. In Because of the division by the scale factor that is
Section 5, it will be shown that, while the MdAPE indicates correlated with the numerator, the  difference
  of APEs
,t = 100 × ei,t − ei,t /Yi,t )
that judgmental adjustments improve the accuracy for a (which is calculated as diAPE  f   s 
given dataset, the trimmed MAPE suggests the opposite to will not reflect changes in forecasting accuracy in terms
be the case. Therefore, additional indicators are required in of a symmetric loss function. More specifically, for
order to be able to draw better-substantiated conclusions positive adjustments, diAPE
,t will systematically downgrade
with regard to the forecasting accuracy. improvements in accuracy and exaggerate degradations of
Apart from the presence of extreme APEs, another accuracy (on the percentage scale). In contrast, for negative
problem with using PE-based measures is that they can adjustments, the improvements will be exaggerated, while
A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522 515

scaled actual demand value

scaled forecast error scaled forecast error

(a) Positive adjustments. (b) Negative adjustments.

Fig. 3. Dependencies between forecast error, actual value, and the sign of adjustment (based on scaled data).
scaled actual demand value

percentage error, 100% percentage error, 100%

(a) Positive adjustments. (b) Negative adjustments.

Fig. 4. Percentage errors, depending on the actual demand value and adjustment sign.

(a) Positive adjustments. (b) Negative adjustments.

Fig. 5. Spearman’s ρ coefficients showing the correlation between the improvement in accuracy and the actual demand value for each time series (relative
frequency histograms).

the errors from harmful forecasts will receive smaller therefore been to reinterpret the results of previous studies
weights. Since the difference in MAPEs is calculated as through the use of alternative measures.
the sample mean of diAPE,t (in accordance with Eq. (1)), the A second measure based on percentage errors was also
comparison of forecasts using MAPE will also give a result used by Franses and Legerstee (2010). In order to evaluate
which is biased towards underrating positive adjustments the accuracy of improvements, the RMSPE (root mean
and overrating negative adjustments. Consequently, since square percentage error) was calculated for the statistical
the forecast errors arising from adjustments of different and judgmentally adjusted forecasts, and the resulting
signs are penalised differently, the MAPE measure is flawed values were then compared. Based on this measure, it
when comparing the performances of adjustments of was concluded that the expert adjusted forecasts were no
different signs. One of the aims of the present research has better than the model forecasts. However, the RMSPE is
516 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522

also based on percentage errors, and is affected by the Fildes (1992) recommends the use of the Relative
outliers and biases described above even more strongly. Geometric Root Mean Square Error (RelGRMSE). The
RelGRMSE for a particular time series i is defined as
3.2. Relative errors  2  2n1 i
ei,t
 t ∈Ti
Another approach to obtaining scale-independent mea- ,

RelGRMSEi = 
 2 
sures is based on using relative errors. The relative error ebi,t

(RE) is defined as t ∈Ti

REi,t = /
ei,t ebi,t , where Ti is a set containing time periods for which non-
zero errors ei,t and ebi,t are available, and ni is the number
where ebi,t is the forecast error obtained from a benchmark of elements in Ti .
method. Usually a naïve forecast is taken as the benchmark After obtaining the RelGRMSE for each series, Fildes
method. (1992) recommends finding the geometric mean of
Well-known measures based on relative errors include the RelGRMSEs across all time series, thus obtaining
Mean Relative Absolute Error (MRAE), Median Relative gmean (RelGRMSEi ). As Hyndman (2006) pointed out,
Absolute Error (MdRAE), and Geometric Mean Relative the Geometric Root Mean Square Error (GRMSE) and
Absolute Error (GMRAE): the Geometric Mean Absolute Error (GMAE) are identical
because the square roots cancel each other in a geometric
MRAE = mean REi,t  ,
 
mean. Similarly, it can be shown that
MdRAE = median REi,t  ,
 
gmean (RelGRMSEi ) = GMRAE.
GMRAE = gmean REi,t  .
 
An alternative representation of GMRAE is:
Averaging the ratios of absolute errors across individual  
observations overcomes the problems related to dividing
m 
by actual values. In particular, the RE-based measures are  1 
,
 
not affected by the presence of low actual values, or by the GMRAE = exp 
m
ln REi,t 
correlation between errors and actual outcomes. However, ni i=1 t ∈Ti
i =1
REs also have a number of limitations.
The calculation of REi,t requires division by the non- where m is the total number of time series, and other
zero error of the benchmark forecast ebi,t . In the case of variables retain their previous meaning.
calculating GMRAE, it is also required that ei,t ̸= 0. The For the adjustments data set under consideration, only a
actual and forecasted demands are usually count data, small proportion of observations contain zero errors (about
which means that the forecasting errors are count data as 1%). It has been found empirically that for the given data
well. With count data, the probability of a zero error of set, the log-transformed absolute REs, ln REi,t , can be
the benchmark forecast can be non-zero. Such cases must approximated adequately using a distribution which has a
be excluded from the analysis when using relative errors. finitevariance.
 In fact, even if a heavy-tailed distribution
When using intermittent demand data, the use of relative of ln REi,t  arises, the influence of extreme cases can be
errors becomes impossible due to the frequent occurrences eliminated based on various robustifying schemes such as
of zero errors (Hyndman, 2006; Syntetos & Boylan, 2005). trimming or Winsorizing.  In contrast to APEs, the use of
As was pointed out by Hyndman and Koehler (2006), such schemes for ln REi,t  is unlikely to lead to biased
 
in the case of continuous distributions, the benchmark estimates, since the distribution of ln REi,t  is not highly
forecast error ebi,t can have a positive probability density at skewed.
zero, and therefore the use of MRAE can be problematic. Though GMRAE (or, equivalently, gmean (RelGRMSEi ))
In particular, REi,t can follow a heavy-tailed distribution has some desirable statistical properties and can give a
for which the sample mean becomes a highly inefficient reliable aggregated indication of changes in accuracy, its
estimate that is vulnerable to outliers. In addition, the use can be complicated for the following two reasons.
distribution of |REi,t | is highly skewed. At the same time, Firstly, as was mentioned previously, zero-error forecasts
while MdRAE is highly robust, it cannot be sufficiently cannot be taken into account directly. Secondly, in a
informative, as it is insensitive to large REs which lie in similar way to the median, the geometric mean of absolute
the tails of the distribution. Thus, even if the large REs errors generally does not reflect changes in accuracy under
are not outliers which arise from the division by relatively standard loss functions. For instance, for a particular time
small benchmark errors, they still will not be taken into series, GMAE (and, hence, GMRAE) favours methods which
account when using MdRAE. Averaging the absolute REs produce errors with heavier tailed-distributions, while for
using GMRAE is preferable to using either MRAE or MdRAE, the same series RMSE (root mean square error) can suggest
as it provides a reliable and robust estimate, and at the the opposite ranking.
same time takes into account the values of REs which lie The latter aspect of using GMRAE can be illustrated us-
in the tails of the distribution. Also, when averaging the ing the following example. Suppose that for a particular
benchmark ratios, the geometric mean has the advantage time series, method A produces errors eAt that are inde-
that it produces rankings which are invariant to the choice pendent and identically distributed variables following a
of the benchmark (see Fleming & Wallace, 1986). heavy-tailed distribution. More specifically, let eAt follow
A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522 517

the t-distribution with ν = 3 degrees of freedom: eAt ∼ tν . penalty for bad forecasting becomes larger than the reward
Also, let method B produce independent errors that follow for good forecasting.
the normal distribution: eBt ∼ N (0, 3). Let method B be the To show how the MASE rewards and penalises forecasts,
benchmark method. It can be shown analytically
  that
 the it can be represented as
variances for eAt and eBt are equal: Var eAt = Var eBt = 3. m
1 
Thus, the relative RMSE (RelRMSE, the ratio of the two RM- MASE = 1 + m
ni (ri − 1) .
SEs) for this series is 1. However, the Relative Geometric 
ni i = 1
RMSE (or, equivalently, GMRAE) will show that method A i =1
is better than method B: GMRAE ≈ 0.69 (based on 106
The reward for improving the benchmark MAE from A
simulated pairs of eAt and eBt ). Now if, for example, eBt ∼
to B (A > B) in a series i is Ri = ni (1 − B/A), while the
N (0, 2.5), then the RelRMSE and GMRAE will be 1.10 and
penalty for harming MAE by changing it from B to A is Pi =
0.76, respectively. This means that method B is now prefer- ni (A/B − 1). Since Ri < Pi , the reward given for improving
able in terms of the variance of errors, while method A is the benchmark MAE cannot balance the penalty given for
still (substantially) better in terms of the GMRAE. However, reducing the benchmark MAE by the same quantity. As a
the geometric mean absolute error is rarely used when op- result, obtaining MASE > 1 does not necessarily indicate
timising predictions with the use of mathematical models. that the accuracy of the benchmark method was better on
Some authors claim that the comparison based on RelRMSE average. This leads to ambiguity in the comparison of the
can be more desirable, as in this case the criterion used for accuracy of forecasts.
the optimisation of predictions corresponds to the evalua- For example, suppose that the performance of some
tion criteria (Diebold, 1993; Zellner, 1986). forecasting method is compared with the performance
Thus, analogously to what was said with regard to PE- of the naïve method across two series (m = 2) which
based measures, if the aim of the comparison is to choose contain equal numbers of forecasts and observations. For
a method that is better in terms of a linear or a quadratic the first series, the MAE ratio is r1 = 1/2, and for the
loss, then GMRAE may not be sufficiently informative, or second series, the MAE ratio is the opposite: r2 = 2/1.
may even lead to counterintuitive conclusions. The improvement in accuracy for the first series obtained
using the forecasting method is the same as the reduction
3.3. Scaled errors for the second series. However, averaging the ratios gives
MASE = 1/2(r1 + r2 ) = 1.25, which indicates that
In order to overcome the imperfections of PE-based the benchmark method is better. While this is a well-
measures, Hyndman and Koehler (2006) proposed the use known point, its implications for error measures, with the
of the MASE (mean absolute scaled error). For the scenario potential for misleading conclusions, are widely ignored.
when forecasts are produced from varying origins but with In addition to the above effect, the use of MASE (as for
a constant horizon, the MASE is calculated as follows (see MAPE) may result in unstable estimates, as the arithmetic
Appendix): mean is severely influenced by extreme cases which arise
from dividing by relatively small values. In this case,
ei,t outliers occur when dividing by the relatively small MAEs
, MASE = mean |qi,t | ,
 
qi,t =
MAEbi of benchmark forecasts which can appear in short series.
Some authors (e.g., Hoover, 2006) recommend the use
where qi,t is the scaled error and MAEbi is the mean absolute of the MAD/MEAN ratio. In contrast to the MASE, the
error (MAE) of the naïve (benchmark) forecast for series i. MAD/MEAN ratio approach assumes that the forecasting
Though this was not specified by Hyndman and Koehler errors are scaled by the mean of time series elements,
(2006), it is possible to show (see Appendix) that in the instead of by the in-sample MAE of the naïve forecast.
given scenario, the MASE is equivalent to the weighted The advantage of this scheme is that it reduces the
arithmetic mean of relative MAEs, where the number of risk of dividing by a small denominator (see Kolassa &
available values of ei,t is used as the weight: Schutz, 2007). However, Hyndman (2006) notes that the
m
MAD/MEAN ratio assumes that the mean is stable over
1  MAEi time, which may make it unreliable when the data exhibit
MASE = m
ni r i , ri = , (2)
 MAEbi trends or seasonal patterns. In Section 5, we show that
ni i = 1 both the MASE and the MAD/MEAN are prone to outliers
i =1
for the data set we consider in this paper. Generally, the
where m is the total number of series, ni is the number of use of these schemes has the risk of producing unreliable
values of ei,t for series i, MAEbi is the MAE of the benchmark estimates that are based on highly skewed left-bounded
forecast for series i, and MAEi is the MAE of the forecast distributions.
being evaluated against the benchmark. Thus, while the use of the standard MAPE has long been
It is known that the arithmetic mean is not strictly known to be flawed, the newly proposed MASE also suffers
appropriate for averaging observations representing rela- from some of the same limitations, and may also lead to
tive quantities, and in such situations the geometric mean an unreliable interpretation of the empirical results. We
should be used instead (Spizman & Weinstein, 2008). therefore need a measure that does not suffer from these
As a result of using the arithmetic mean of MAE ratios, problems. The next section presents an improved statistic
Eq. (2) introduces a bias towards overrating the accuracy which is more suitable for comparing the accuracies of
of a benchmark forecasting method. In other words, the SKU-level forecasts.
518 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522

4. Recommended accuracy evaluation scheme of errors efi,t and esi,t within a given series i have different
levels
  of the  kurtosis, then ln ri is a biased estimate of
ln E efi,t /E esi,t  . Thus, the indication of an improvement

The recommended forecast evaluation scheme is based
on averaging the relative efficiencies of adjustments under linear loss given by the AvgRelMAE may be biased.
across time series. The geometric mean is the correct In fact, if ni = 1 for each i, then the AvgRelMAE becomes
average to use for averaging benchmark ratio results, equivalent to the GMRAE, which has the limitations
since it gives equal weight to reciprocal relative changes described in Section 3.2. However, our experiments have
(Fleming & Wallace, 1986). Using the geometric mean shown that the bias of ln ri diminishes rapidly as ni
of MAE ratios, it is possible to define an appropriate increases, becoming negligible for ni > 4.
measure of the average relative MAE (AvgRelMAE). If the To eliminate the influence of outliers and extreme
baseline statistical forecast is taken as the benchmark, then cases, the trimmed mean can be used in order to
the AvgRelMAE showing how the judgmentally adjusted define a measure of location for the relative MAE. The
forecasts improve/reduce the accuracy can be found as trimmed AvgRelMAE for a given threshold t (0 ≤
m
t ≤ 0.5) is calculated by excluding the [tm] lowest
 1/  ni and [tm] highest values of ni ln ri from the calculations
m
 n i=1 MAEfi
AvgRelMAE = ri i , ri = , (3) (square brackets indicate the integer part of tm). As was
i=1
MAEsi mentioned in Section 2, the optimal trim level depends
on the distribution. In practice, the choice of the trim
where MAEsi is the MAE of the baseline statistical forecast level usually remains subjective, since the distribution
for series i, MAEfi is the MAE of the judgmentally adjusted is unknown. Wilcox (1996) wrote that ‘Currently there
forecast for series i, ni is the number of available errors of is no way of being certain how much trimming should
judgmentally adjusted forecasts for series i, and m is the be done in a given situation, but the important point
total number of time series. This differs from the proposals is that some trimming often gives substantially better
of Fildes (1992), who examined the behaviour of the GRM- results, compared to no trimming’ (p. 16). Our experiments
SEs of the individual relative errors. show that a 5% level can be recommended for the
The MAEs in Eq. (3) are found as AvgRelMAE measure. This level ensures high efficiency,
1  f  1  because the underlying distribution usually does not
e , es ,

MAEfi = i,t MAEsi = i ,t exhibit large departures from the normal distribution. A
ni t ∈T ni t ∈T
i i manual screening for outliers could also be performed in
order to exclude time series with non-typical properties
where efi,t is the error of the judgmentally adjusted forecast
from the analysis.
for period t and series i, Ti is a set containing the time
The results described in the next section show that the
periods for which efi,t are available, and esi,t is the error of
robust estimates obtained using a 5% trimming level are
the baseline statistical forecast for period t and series i.
very close to the estimates based on the whole sample.
AvgRelMAE is immediately interpretable, as it rep-
The distribution of ni ln ri is more symmetrical than the
resents the average relative value of MAE adequately,
distribution of either the APEs or absolute scaled errors.
and directly shows how the adjustments improve/reduce
Therefore, the analysis of the outliers in relative MAEs can
the MAE compared to the baseline statistical forecast.
be performed more efficiently than the analysis of outliers
Obtaining AvgRelMAE < 1 means that on average MAEfi <
when using the measures considered previously.
MAEsi , and therefore adjustments improve the accuracy,
Since the AvgRelMAE does not require scaling by actual
while AvgRelMAE > 1 indicates the opposite. The aver-
values, it can be used in cases of low or zero actuals, as
age percentage improvement in MAE of forecasts is found
well as in cases of zero forecasting errors. Consequently,
as (1 − AvgRelMAE) × 100. If required, Eq. (3) can also be
it is suitable for intermittent demand forecasts. The only
extended to other measures of dispersion or loss functions.
For example, instead of MAE one might use the MSE (mean limitation is that the MAEs in Eq. (3) should be greater than
square error), interquartile range, or mean prediction in- zero for all series.
terval length. The choice of the measure depends on the Thus, the advantages of the recommended accuracy
purposes of analysis. In this study, we use MAE, assuming evaluation scheme are that it (i) can be interpreted
that the penalty is proportional to the absolute error. easily, (ii) represents the performance of the adjustments
Equivalently, the geometric mean of MAE ratios can be objectively (without the introduction of substantial biases
found as or outliers), (iii) is informative and uses all available
  information efficiently, and (iv) is applicable in a wide
m
range of settings, with minimal assumptions about the
 1  features of the data.
.

AvgRelMAE = exp 
m
ni ln ri 
ni i=1 5. Results of empirical evaluation
i=1

ri < 0 means an average

m
Therefore, obtaining i=1 ni ln The results of applying the measures described above
improvement of accuracy, and i=1 ni ln ri > 0 means the
m
are shown in Table 3.
opposite. For the given dataset, a large number of APEs have
In theory, the following effect may complicate the extreme values (>100%) which arise from low actual
interpretation of the AvgRelMAE value. If the distributions demand values (Fig. 6). Following Fildes et al. (2009), we
A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522 519

Table 3
Accuracy of adjustments according to different error measures.
Error measure Positive adjustments Negative adjustments All nonzero adjustments
Statistical Adjusted Statistical Adjusted Statistical Adjusted
forecast forecast forecast forecast forecast forecast

MAPE, % 38.85 61.54 70.45 45.13 47.88 56.85

MAPE, % (2% trimmed) 30.98 40.56 48.71 30.12 34.51 37.22
MdAPE, % 25.48 20.65 23.90 17.27 24.98 19.98
GMRAE 1.00 0.93 1.00 0.70 1.00 0.86
GMRAE (5% trimmed) 1.00 0.94 1.00 0.71 1.00 0.87
MASE 0.97 0.97 0.95 0.70 0.96 0.90
Mean (MAD/Mean) 0.37 0.42 0.33 0.24 0.36 0.37
Mean (MAD/Mean) (5% trimmed) 0.34 0.35 0.29 0.21 0.33 0.31
AvgRelMAE 1.00 0.96 1.00 0.71 1.00 0.90
AvgRelMAE (5% trimmed) 1.00 0.96 1.00 0.73 1.00 0.89
Avg. improvement based on AvgRelMAE 0.00 0.04 0.00 0.29 0.00 0.10

Fig. 6. Box-and-whisker plot for absolute percentage errors (log scale, zero-error forecasts excluded).

used a 2% trim level for MAPE values. However, as noted, forecast is smaller than the error of judgmental forecast.
it is difficult to determine an appropriate trim level. As For the adjustments data, the lengths of the series vary
a result, the difference in APEs between the system and substantially, so the MASE is affected seriously by outliers.
final forecasts has a very high dispersion and cannot be Fig. 8 shows that the use of the MAD/MEAN scheme
used efficiently to assess improvements in accuracy. It instead of the MASE does not improve the properties of
can also be seen that the distribution of APEs is highly the distribution of the scaled errors. Table 3 shows that
skewed, which means that the trimmed means cannot be a trimmed version of the MAD/MEAN scheme gives the
considered as unbiased estimates of the location. Albeit opposite rankings with regard to the overall accuracy of
the distribution of the APEs has a very high kurtosis, our adjustments, which indicates that this scheme is highly
experiments show that increasing the trim level (say from unstable. Moreover, with such distributions, the use of
2% to 5%) would substantially bias the estimates of the trimming for either MASE or MAD/MEAN leads to biased
location of the APEs due to the extremely high skewness of estimates, as was the case with MAPE.
the distribution. We therefore use the 2% trimmed MAPE Fig. 9 shows that the log-transformed relative absolute
in this study. Also, the use of this trim level makes the errors follow a symmetric distribution and contain outliers
measurement results comparable to the results of Fildes that are easier to detect and to eliminate. Based on the
et al. (2009).
shape of the underlying distribution, it seems that using
Table 3 shows that the rankings based on the trimmed
a 5% trimmed GMRAE would give a location estimate
MAPE and MdAPE differ, suggesting different conclusions
with a reasonable level of efficiency. Although the GMRAE
about the effectiveness of adjustments. As was explained
measure is not vulnerable to outliers, its interpretation
in Section 3.1, the interpretation of PE-based measures is
can present difficulties, for the reasons explained in
not straightforward. While MdAPE is resistant to outliers,
Section 3.2.
it is not sufficiently informative, as it is insensitive to APEs
which lie above the median. Also, PE-measures produce Compared to the APEs and the absolute scaled er-
a biased comparison, since the improvement on the real rors, the log-transformed relative MAEs are not affected
scale within each series is correlated markedly with the severely by outliers and have a more symmetrical distri-
actual value. Therefore, applying percentage errors in bution (Fig. 10). The AvgRelMAE can therefore serve as a
the current setting leads to ambiguous results and to more reliable indicator of changes in accuracy. At the same
confusion in their interpretation. For example, for positive time, in terms of a linear loss function, the AvgRelMAE
adjustments, the trimmed MAPE and MdAPE suggest the scheme represents the effectiveness of adjustments ade-
opposite rankings: while the trimmed MAPE shows a quately and gives a directly interpretable meaning.
substantial worsening of the final forecast due to the The AvgRelMAE result shows improvements from both
judgmental adjustments, the MdAPE value points in the positive and negative adjustments, whereas according to
opposite direction. MAPE and MASE, only negative adjustments improve the
The absolute scaled errors found using the MASE accuracy. For the whole sample, adjustments improve the
scheme (as described in Section 3.3) also follow a non- MAE of statistical forecasts by 10%, on average. Positive
symmetrical distribution and can take extremely large adjustments are less accurate than negative adjustments
values (Fig. 7) in short series where the MAE of the naïve and provide only minor improvements.
520 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522

Fig. 7. Box-and-whisker plot for the absolute scaled errors found by the MASE scheme (log scale, zero-error forecasts excluded).

Fig. 8. Box-and-whisker plot for absolute scaled errors found by the MAD/MEAN scheme (log scale, zero-error forecasts excluded).

Fig. 9. Box-and-whisker plot for the log-transformed relative absolute errors (using the statistical forecast as the benchmark).

Fig. 10. Box-and-whisker plot for the weighted log-transformed relative MAEs (ni ln ri ).

Table 4
Results of using the binomial test to analyse the frequency of a successful adjustment.
Adjustment Total number Number of adjustments p-value Probability of a successful 95% confidence interval for the
sign of adjustments that improved forecast adjustment probability of a successful adjustment

Positive 3394 1815 <0.001 0.535 0.518 0.552

Negative 1385 915 <0.001 0.661 0.635 0.686
Both 4779 2730 <0.001 0.571 0.557 0.585

To determine whether the probability of a successful low actual values, which lead to high percentage errors
adjustment is higher than 0.5, a two-sided binomial test with no direct interpretation for practical use. Moreover,
was applied. The results are shown in Table 4. the errors corresponding to adjustments of different signs
Based on the p-values obtained for each sample, it can are penalised differently when using percentage errors,
be concluded that adjustments improved the accuracy of because the forecasting errors are correlated with both
forecasts more frequently than they reduced it. However, the actual demand values and the adjustment sign. As a
the probability of a successful intervention was rather low result, measures such as MAPE and MdAPE do not provide
for positive adjustments. sufficient indication of the effectiveness of adjustments, in
terms of a linear loss function. Similar arguments were also
6. Conclusions found to apply to the calculation of MASE, which can also
induce biases and outliers as a result of using the arithmetic
The appropriate measurement of accuracy is important mean to average relative quantities. Thus, an organization
in many organizational settings, and is not of merely which determines its forecast improvement strategy based
academic interest. Due to the specific features of SKU- on an inadequate measure will misallocate its resources,
level demand data, many well-known error measures are and will therefore fail in its objective of improving the
not appropriate for use in evaluating the effectiveness of accuracy at the SKU level.
adjustments. In particular, the use of percentage errors is In order to overcome the disadvantages of existing
not advisable because of the considerable proportion of measures, it is recommended that an average relative MAE
A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522 521

be used which is calculated as the geometric mean of evaluated against the benchmark for series i and period t , li
relative MAE values. This scheme allows for the objective is the number of elements in series i, and Yi,j is the actual
comparison of forecasts, and is more reliable for the value observed at time j for series i.
analysis of adjustments. Let the mean absolute scaled error (MASE) be calculated
For the empirical dataset, the analysis has shown by averaging the absolute scaled errors across time periods
that adjustments improved accuracy in terms of the and time series:
average relative MAE (AvgRelMAE) by approximately  
m  
1  e i ,t 
10%. For the same dataset, a range of well-known error MASE = m
,
measures, including MAPE, MdAPE, GMRAE, MASE, and the  MAEbi
ni i=1 t ∈Ti
MAD/MEAN ratio, indicated conflicting results. The MAPE- i =1
based results suggested that, on the whole, adjustments
where ni is the number of available values of ei,t for series
did not improve the accuracy, while the MdAPE results
i, m is the total number of series, and Ti is a set containing
showed a substantial improvement (dropping from 25%
time periods for which the errors ei,t are available for series
to 20%, approximately). The analysis using MASE and
i.
the MAD/MEAN ratio was complicated, due to a highly
Then,
skewed underlying distribution, and did not allow any  
firm conclusions to be reached. The GMRAE showed that m  
1  ei,t 
adjustments improved the accuracy by 13%, a result that MASE = m
 MAEbi
is close to that obtained using the AvgRelMAE. Since ni i=1 t ∈Ti
analyses based on different measures can lead to different i=1

conclusions, it is important to have a clear understanding m
|ei,t |
of the statistical properties of any error measure used. We 1  t ∈Ti
= m
have described various undesirable effects that complicate  MAEbi
the interpretation of the well-known error measures. As ni i=1
i=1
an improved scheme which is appropriate for evaluating 1

changes in accuracy under linear loss, we recommend m ni
|ei,t |
1  t ∈Ti
using the AvgRelMAE. The generalisation of this scheme = ni
m
can be obtained straightforwardly for other loss functions  MAEbi
ni i=1
as well. i=1
The process by which a new error measure is developed m
1  MAEi
and accepted by an organisation has not received any = m
ni r i , ri = ,
research attention. A case in point is intermittent demand,  MAEbi
ni i=1
where service improvements can be achieved, but only by i=1
abandoning the standard error metrics and replacing them
where MAEi is the MAE for the series i for the forecast being
with service-level objectives (Syntetos & Boylan, 2005).
evaluated against the benchmark.
When an organisation and those to whom the forecasting
function reports insist on retaining the MAPE or similar
(as will mostly be the case), the forecaster’s objective References
must then shift to delivering to the organisation’s chosen
Armstrong, S. (1985). Long-range forecasting: from crystal ball to computer.
performance measure, whilst using a more appropriate New York: John Wiley.
measure, such as the AvgRelMAE, to interpret what is really Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing
going on with the data. In essence, the forecaster cannot about forecasting methods: empirical comparisons. International
Journal of Forecasting, 8, 69–80.
reasonably resort to using the organisation’s measure and
Armstrong, J. S., & Fildes, R. (1995). Correspondence on the selection of
expect to achieve a cost-effective result. error measures for comparisons among forecasting methods. Journal
of Forecasting, 14(1), 67–71.
Diebold, F. X. (1993). On the limitations of comparing mean square
Appendix. Alternative representation of MASE forecast errors: comment. Journal of Forecasting, 12, 641–642.
Fildes, R. (1992). The evaluation of extrapolative forecasting methods.
According to Hyndman and Koehler (2006), for the International Journal of Forecasting, 8(1), 81–98.
Fildes, R., & Goodwin, P. (2007). Against your better judgment? How
scenario when forecasts are made from varying origins but organizations can improve their use of management judgment in
with a constant horizon (here taken as 1), the scaled error forecasting. Interfaces, 37, 570–576.
is defined as1 Fildes, R., Goodwin, P., Lawrence, M., & Nikolopoulos, K. (2009). Effective
forecasting and judgmental adjustments: an empirical evaluation and
li strategies for improvement in supply-chain planning. International
ei,t 1 
qi,t = b
, MAEbi = |Yi,j − Yi,j−1 |, Journal of Forecasting, 25(1), 3–23.
MAEi l i − 1 j =2 Fleming, P. J., & Wallace, J. J. (1986). How not to lie with statistics: the
correct way to summarize benchmark results. Communications of the
ACM, 29(3), 218–221.
where MAEbi is the MAE from the benchmark (naïve) Franses, P. H., & Legerstee, R. (2010). Do experts’ adjustments on
method for series i, ei,t is the error of a forecast being model-based SKU-level forecasts improve forecast quality? Journal of
Forecasting, 29, 331–340.
Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric
MAPE. International Journal of Forecasting, 4, 405–408.
1 The formula corresponds to the software implementation described Hill, M., & Dixon, W. J. (1982). Robustness in real life: a study of clinical
by Hyndman and Khandakar (2008). laboratory data. Biometrics, 38, 377–396.
522 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522

Hoover, J. (2006). Measuring forecast accuracy: omissions in today’s Syntetos, A. A., & Boylan, J. E. (2005). The accuracy of intermittent demand
forecasting engines and demand-planning software. Foresight: The estimates. International Journal of Forecasting, 21(2), 303–314.
International Journal of Applied Forecasting, 4, 32–35. Trapero, J. R., Pedregal, D. J., Fildes, R., & Weller, M. (2011). Analysis of
Hyndman, R. J. (2006). Another look at forecast-accuracy metrics for judgmental adjustments in presence of promotions. Paper presented at
intermittent demand. Foresight: The International Journal of Applied the 31th international symposium on forecasting. ISF2011. Prague.
Forecasting, 4(4), 43–46. Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty:
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: heuristics and biases. Science, 185, 1124–1130.
the forecast package for R. Journal of Statistical Software, 27(3). Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA:
Hyndman, R., & Koehler, A. (2006). Another look at measures of forecast Academic Press.
accuracy. International Journal of Forecasting, 22(4), 679–688. Wilcox, R. R. (2005). Trimmed means. Encyclopedia of Statistics in
Kolassa, S., & Schutz, W. (2007). Advantages of the MAD/MEAN ratio over Behavioral Science, 4, 2066–2067.
the MAPE. Foresight: The International Journal of Applied Forecasting, 6, Zellner, A. (1986). A tale of forecasting 1001 series: the Bayesian knight
strikes again. International Journal of Forecasting, 2, 491–494.
40–43.
Makridakis, S. (1993). Accuracy measures: theoretical and practical
concerns. International Journal of Forecasting, 9, 527–529. Andrey Davydenko is working in the area of the development and
Marques, C. R., Neves, P. D., & Sarmento, L. M. (2000). Evaluating software implementation of statistical methods for business forecasting.
core inflation indicators. Working paper 3-00. Economics Research He has a Ph.D. from Lancaster University. He holds a candidate of science
Department. Banco de Portugal. degree in mathematical methods in economics. His current research
Mathews, B., & Diamantopoulos, A. (1987). Alternative indicators of focuses on the composite use of judgmental and statistical information
forecast revision and improvement. Marketing Intelligence, 5(2), in forecasting support systems.
20–23.
McCarthy, T. M., Davis, D. F., Golicic, S. L., & Mentzer, J. T. (2006). The
evolution of sales forecasting management: a 20-year longitudinal Robert Fildes is Professor of Management Science in the School of
study of forecasting practice. Journal of Forecasting, 25, 303–324. Management, Lancaster University, and Director of the Lancaster Centre
Mudholkar, G. S. (1983). Fisher’s z-transformation. Encyclopedia of for Forecasting. He has a mathematics degree from Oxford and a Ph.D.
Statistical Sciences, 3, 130–135. in statistics from the University of California. He was co-founder of
Sanders, N., & Ritzman, L. (2004). Integrating judgmental and quantitative the Journal of Forecasting in 1981 and of the International Journal of
forecasts: methodologies for pooling marketing and operations Forecasting in 1985. For ten years from 1988 he was Editor-in-Chief of
information. International Journal of Operations and Production the IJF. He was president of the International Institute of Forecasters
Management, 24, 514–529. between 2000 and 2004. His current research interests are concerned
Spizman, L., & Weinstein, M. (2008). A note on utilizing the geometric with the comparative evaluation of different forecasting methods, the
mean: when, why and how the forensic economist should employ the implementation of improved forecasting procedures in organizations and
geometric mean. Journal of Legal Economics, 15(1), 43–55. the design of forecasting systems.

CBSE Computer Science Class 12 Question Paper 2024 Solutions FREE PDF
No ratings yet
CBSE Computer Science Class 12 Question Paper 2024 Solutions FREE PDF
44 pages
EnCom LG ABS 40 - EnCom
No ratings yet
EnCom LG ABS 40 - EnCom
2 pages
Mip Report
No ratings yet
Mip Report
22 pages
Ramsey S Legacy 1st Edition Lillehammer Download PDF
100% (6)
Ramsey S Legacy 1st Edition Lillehammer Download PDF
84 pages
Dr. Devang Sharma
No ratings yet
Dr. Devang Sharma
6 pages
TENSION TEST ON Tor Steel
No ratings yet
TENSION TEST ON Tor Steel
7 pages
When Perfect Plans Meet AI Reality: A Leadership Survival Guide
From Everand
When Perfect Plans Meet AI Reality: A Leadership Survival Guide
Loren Cossette
No ratings yet
Implementation of a Data Reliability Program: Implementation of a Data Reliability Program
From Everand
Implementation of a Data Reliability Program: Implementation of a Data Reliability Program
Orlando Lopez
No ratings yet
Module-3-Electro Chem PDF
No ratings yet
Module-3-Electro Chem PDF
11 pages
CBSE Class 11 Biology Sample Paper Set 4
No ratings yet
CBSE Class 11 Biology Sample Paper Set 4
3 pages
Evaporators Performance
No ratings yet
Evaporators Performance
14 pages
Its A Small Small Small Small World
No ratings yet
Its A Small Small Small Small World
15 pages
(Touzi) Deterministic and Stochastic Control, Application To Finance
No ratings yet
(Touzi) Deterministic and Stochastic Control, Application To Finance
117 pages
Chapter 2 Fiber Optics A Brief History of Fiber Optics Lesson 4
No ratings yet
Chapter 2 Fiber Optics A Brief History of Fiber Optics Lesson 4
5 pages
Lecture Three-Forecasting
No ratings yet
Lecture Three-Forecasting
18 pages
CH 1 Pre-Assignment Practice
No ratings yet
CH 1 Pre-Assignment Practice
6 pages
Securing Healthcare Software: A Practical Guide to Functional Testing, Penetration Testing, and Compliance
From Everand
Securing Healthcare Software: A Practical Guide to Functional Testing, Penetration Testing, and Compliance
Tamerlan Mammadzada
No ratings yet
Shu Huang - Aggregation - Getting-More-Wisdom-From-The-Crowd
No ratings yet
Shu Huang - Aggregation - Getting-More-Wisdom-From-The-Crowd
50 pages
2013PDE5247 - Mohit Rajput
No ratings yet
2013PDE5247 - Mohit Rajput
87 pages
Atlas Copco Pf4000 Manual
67% (6)
Atlas Copco Pf4000 Manual
476 pages
Heat Treatment of Steel: Assessment Performance Criteria
No ratings yet
Heat Treatment of Steel: Assessment Performance Criteria
6 pages
The Forecasting Canon Nine Generalizatio
No ratings yet
The Forecasting Canon Nine Generalizatio
8 pages
CS5371 Theory of Computation
No ratings yet
CS5371 Theory of Computation
2 pages
1 s2.0 S1877050916325054 Main
No ratings yet
1 s2.0 S1877050916325054 Main
8 pages
Forecast Error Measures Davydenko 2
No ratings yet
Forecast Error Measures Davydenko 2
13 pages
KT Ykts
No ratings yet
KT Ykts
41 pages
Project Measurement
From Everand
Project Measurement
Steve Neuendorf
No ratings yet
Cabotaje Opman
No ratings yet
Cabotaje Opman
10 pages
Rexroth Servo Drives Programming:: Page 1 of 56
No ratings yet
Rexroth Servo Drives Programming:: Page 1 of 56
56 pages
Heat of Combustion Lab 2
No ratings yet
Heat of Combustion Lab 2
14 pages
Iso 8503-1 - 8503-2 - Surface Roughness Comprator PDF
No ratings yet
Iso 8503-1 - 8503-2 - Surface Roughness Comprator PDF
4 pages
1 s2.0 S0169207006000239 Main
No ratings yet
1 s2.0 S0169207006000239 Main
10 pages
Classroom Inventory List SCHOOL YEAR
No ratings yet
Classroom Inventory List SCHOOL YEAR
1 page
LTspice Tutorial Part 4 - Intermediate Circuits
No ratings yet
LTspice Tutorial Part 4 - Intermediate Circuits
23 pages
Management Science - Module 2.1 Intro To Forecasting
No ratings yet
Management Science - Module 2.1 Intro To Forecasting
26 pages
Husqvarna 2003 SM WRE 125 Manual
No ratings yet
Husqvarna 2003 SM WRE 125 Manual
2 pages
REVIEW PCP 18 - Demand Forecasting Errors in Industrial Context
No ratings yet
REVIEW PCP 18 - Demand Forecasting Errors in Industrial Context
6 pages
Certified Clinical Documentation Specialist Exam Pathway 2025/2026 Version: Master The Concepts With 600 Targeted Practice Questions
From Everand
Certified Clinical Documentation Specialist Exam Pathway 2025/2026 Version: Master The Concepts With 600 Targeted Practice Questions
Brittany Deaton
No ratings yet
Earned Value Analysis-15-12-2016 - AH PDF
No ratings yet
Earned Value Analysis-15-12-2016 - AH PDF
17 pages
Operations
No ratings yet
Operations
12 pages
The Definite Integrals
No ratings yet
The Definite Integrals
25 pages
Valid Benchmarks From Published Surveys of Forecast Accuracy - Foresight11
No ratings yet
Valid Benchmarks From Published Surveys of Forecast Accuracy - Foresight11
10 pages
Forecasting
No ratings yet
Forecasting
34 pages
Forecasting Overview 60 109
No ratings yet
Forecasting Overview 60 109
50 pages
Group3 Forecasting
No ratings yet
Group3 Forecasting
28 pages
Exercise 2: Nerve Conduction
No ratings yet
Exercise 2: Nerve Conduction
10 pages
Om 4
No ratings yet
Om 4
14 pages
Jsaer2015 02 02 52 70
No ratings yet
Jsaer2015 02 02 52 70
19 pages
Show Pro sm192m DMX CONTROLLER User Manual
No ratings yet
Show Pro sm192m DMX CONTROLLER User Manual
10 pages
Forecasting - I
No ratings yet
Forecasting - I
28 pages
Foresight
No ratings yet
Foresight
4 pages
Intermitent and Lmpy Forecast
No ratings yet
Intermitent and Lmpy Forecast
10 pages
Data-Driven Decision Making
From Everand
Data-Driven Decision Making
Aadinath Pothuvaal
No ratings yet
Risc VS Cisc
No ratings yet
Risc VS Cisc
2 pages
Computer Awareness: Computer Awareness For IBPS PO/MT and Clerk
No ratings yet
Computer Awareness: Computer Awareness For IBPS PO/MT and Clerk
10 pages
Using Forecasting Methodologies to Explore an Uncertain Future
From Everand
Using Forecasting Methodologies to Explore an Uncertain Future
James Poon
No ratings yet
Defect Prediction in Software Development & Maintainence
From Everand
Defect Prediction in Software Development & Maintainence
Rudra Kumar
No ratings yet
The Duty for Sponor Oversight in Clinical Trials: Practical Guide: 2nd Edition Clinical Data Review
From Everand
The Duty for Sponor Oversight in Clinical Trials: Practical Guide: 2nd Edition Clinical Data Review
Doris Breiner
No ratings yet
Forecasting: Overview: Learning Objectives
No ratings yet
Forecasting: Overview: Learning Objectives
5 pages
Demand Forecasting Errors in Industrial Context Measurement and Impacts PDF
No ratings yet
Demand Forecasting Errors in Industrial Context Measurement and Impacts PDF
6 pages
Demand Forecasting Errors in Industrial Context Measurement and Impacts PDF
No ratings yet
Demand Forecasting Errors in Industrial Context Measurement and Impacts PDF
6 pages
TR 1
No ratings yet
TR 1
12 pages
Measures of Forecast Accuracy
No ratings yet
Measures of Forecast Accuracy
7 pages
Knowing When To Sell - Critical Review
No ratings yet
Knowing When To Sell - Critical Review
3 pages
How to Align Employee Targets to the Strategy
From Everand
How to Align Employee Targets to the Strategy
Tawia Tsekumah
No ratings yet
Chapter 3 Part 3
No ratings yet
Chapter 3 Part 3
2 pages
How To Track Forecast Accuracy To Guide Forecast Process Improvement
No ratings yet
How To Track Forecast Accuracy To Guide Forecast Process Improvement
9 pages
OPM
No ratings yet
OPM
25 pages
Notes in Operations Research
From Everand
Notes in Operations Research
Rahul Basu
5/5 (1)
Fulltext PDF
No ratings yet
Fulltext PDF
20 pages
How To Track Forecast Accuracy To Guide Forecast Process Improvement
No ratings yet
How To Track Forecast Accuracy To Guide Forecast Process Improvement
9 pages
Guide to Audit Data Analytics
From Everand
Guide to Audit Data Analytics
AICPA
No ratings yet
Evaluating a Psychometric Test as an Aid to Selection
From Everand
Evaluating a Psychometric Test as an Aid to Selection
Zuzana Robertson C.Psychol
5/5 (1)
Another Look at Measures of Forecast Accuracy
No ratings yet
Another Look at Measures of Forecast Accuracy
18 pages
Procuring Penetration Testing Services
From Everand
Procuring Penetration Testing Services
CREST
No ratings yet
IT Specialist: Artificial Intelligence Exam Prep - 500 Questions for Certification Success (0225)
From Everand
IT Specialist: Artificial Intelligence Exam Prep - 500 Questions for Certification Success (0225)
Satou Takahiro
No ratings yet
The Concept of TQM
No ratings yet
The Concept of TQM
10 pages
Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers
From Everand
Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Forecasting Methods
No ratings yet
Forecasting Methods
20 pages
For Casting
No ratings yet
For Casting
18 pages
Great Expectations Checkpoints in Data Validation: The Complete Guide for Developers and Engineers
From Everand
Great Expectations Checkpoints in Data Validation: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Principles of Test-Driven Development: Definitive Reference for Developers and Engineers
From Everand
Principles of Test-Driven Development: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
NUnit in Practice: Definitive Reference for Developers and Engineers
From Everand
NUnit in Practice: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Applied Survival Analysis: Regression Modeling of Time-to-Event Data
From Everand
Applied Survival Analysis: Regression Modeling of Time-to-Event Data
David W. Hosmer, Jr.
4/5 (2)
Application Performance Management in Modern Systems: Definitive Reference for Developers and Engineers
From Everand
Application Performance Management in Modern Systems: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Workfront Implementation and Optimization Techniques: Definitive Reference for Developers and Engineers
From Everand
Workfront Implementation and Optimization Techniques: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
OneAgent Deployment and Optimization: Definitive Reference for Developers and Engineers
From Everand
OneAgent Deployment and Optimization: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Sentry Error Monitoring and Application Observability: Definitive Reference for Developers and Engineers
From Everand
Sentry Error Monitoring and Application Observability: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Site Reliability Engineering Foundations: Definitive Reference for Developers and Engineers
From Everand
Site Reliability Engineering Foundations: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Deequ for Scalable Data Quality Assurance: The Complete Guide for Developers and Engineers
From Everand
Deequ for Scalable Data Quality Assurance: The Complete Guide for Developers and Engineers
William Smith
No ratings yet

Davydenko 2013

Uploaded by

Davydenko 2013

Uploaded by

International Journal of Forecasting 29 (2013) 510–522

Contents lists available at SciVerse ScienceDirect

International Journal of Forecasting

Measuring forecasting accuracy: The case of judgmental

article info abstract

1. Introduction order to ensure the rational use of the organisation’s re-

Table 1 summary measure of location. The optimal trim level

Fig. 1. Histogram of the relative adjustment, measured in percentages.

Fig. 2. Histogram of ln(Final forecast/System forecast).

Positive 0.123 0.273 0.592 0.412 1.510

scaled actual demand value

scaled forecast error scaled forecast error

percentage error, 100% percentage error, 100%

(a) Positive adjustments. (b) Negative adjustments.

ri < 0 means an average

MAPE, % 38.85 61.54 70.45 45.13 47.88 56.85

Positive 3394 1815 <0.001 0.535 0.518 0.552

You might also like