Time Series With Python
https://doi.org/10.1007/s43069-022-00179-z
TUTORIAL
Alain Zemkoho1
Received: 29 October 2021 / Accepted: 16 November 2022 / Published online: 23 December 2022
© The Author(s) 2022
Abstract
The aim of this paper is to present a set of Python-based tools to develop forecasts
using time series data sets. The material is based on a 4-week course that the author
has taught for 7 years to students on operations research, management science, ana-
lytics, and statistics 1-year MSc programmes. However, it can easily be adapted to
various other audiences, including executive management or some undergraduate
programmes. No particular knowledge of Python is required to use this material.
Nevertheless, we assume a good level of familiarity with standard statistical forecast-
ing methods such as exponential smoothing, autoregressive integrated moving aver-
age (ARIMA), and regression-based techniques, which is required to deliver such a
course. Access to relevant data, codes, and lecture notes, which serve as the basis for
this material, is made available (see https://github.com/abzemkoho/forecasting) for
anyone interested in teaching such a course or developing some familiarity with the
mathematical background of relevant methods and tools.
1 Introduction
This article is part of the Topical Collection on Model Development for the Operations Research
Classroom.
* Alain Zemkoho
[email protected]
1 School of Mathematical Sciences & Centre for Operational Research, Management Sciences and Information Systems (CORMSIS), University of Southampton, Building 54 Mathematical Sciences, SO17 1BJ Highfield Campus, Southampton, England
Forecasting methods can generally be split into two categories: qualitative and quantitative forecasting methods, and
we can even add a third one that we label as semi-qualitative, where a combination
of both qualitative and quantitative methods can be employed to generate forecasts.
Qualitative forecasting methods are often used in situations where historical data is
not available. For more details on these concepts, interested readers are referred to
the books [1, 2] and references therein.
Our focus in this paper is on quantitative methods, as we assume that historical
time series data (i.e. data from a unit (or a group of units) observed in several suc-
cessive periods) is available for the variables of interest. Within quantitative meth-
ods, we also have a number of subcategories that can be broadly labelled as statisti-
cal methods, which are at the foundation of the subject, and machine learning ones,
which have been developing rapidly in recent years; see, e.g. [3–9] for a sample of
applications and surveys on the subject.
The material to be presented in this paper is based on statistical forecasting
methods; see, e.g. [1, 2, 10–12] for related details. Despite the fast development of
machine learning techniques, they have been consistently shown through the last two
M competitions [13, 14] to generally be outperformed by statistical methods in terms
of accuracy and computational requirements; these comparisons (see relevant details
in the papers [13, 14]) are done on more than 100 thousand practical data sets, related
to a wide range of industries, based on the ForeDeCk database (https://fsudataset.com/).
Note that the M competition series (with M referring to Spyros Makridakis,
one of the world leaders in the field) is a famous open competition, which can also be
seen as a benchmarking exercise, where competitors evaluate and compare the per-
formance of a wide range of forecasting methods on thousands of practical data sets.
The aim of this paper is to introduce the reader to existing Python tools that can be
used to deliver a practical course on basic statistical forecasting methods; namely, we will
focus on the exponential smoothing, autoregressive integrated moving average (ARIMA),
and regression-based methods, which are (or a combination of them) part of core tech-
niques shown to have the best performance in the M competitions mentioned above.
1.1 Background
The material presented in this paper is based on a course named Forecasting, that
the author has taught for the past 7 years within the School of Mathematical Sci-
ences at the University of Southampton, based in the UK. This is an optional course,
but which is very popular, and is taken by students from the eight MSc programmes
listed in Table 1, spanning both the School of Mathematical Sciences and the South-
ampton Business School.
The course is very practical and hands-on, designed to run for 16 h across 4 weeks,
with 2h of weekly lecture and the remaining 2h dedicated to a workshop/tutorial/
computer lab, where the students are supported to go through the Python material
to test and apply the methods on some practical data sets. The lectures focus on tak-
ing the students through the mathematical background of the methods that will be
covered here [15]. During the computer labs, students are taken through the Python
codes covered in this paper, which implement the methods that form the content
Table 1 List of MSc programmes of origin of the students that usually take the forecasting course, which
is the source of the material presented in this paper
School of Mathematical Sciences | Southampton Business School
of the lectures, and support them in using these methods to develop forecasts on
practical data sets. Note that this course can easily be expanded to cover a few more
weeks, as necessary, and the material can also be adapted to an undergraduate level
for programmes around operations research, statistics, business analytics, and man-
agement science.
It is important to mention that before the start of the course, brief material providing
a basic introduction to Python is made available to the students, in order to bring
them up to speed with some basic elements of Python, in case they have had no prior
exposure to the language. This brief material essentially covers the relevant Python
ecosystem discussed in Section 2 and an overview of the basic steps needed to get
Python up and running on their personal computers or the university machines.
Additionally, note that each of the weekly computer labs, which take place during
the course, is an opportunity for the instructors to guide the students on how to use
the different libraries needed to implement the mathematical concepts covered in the
lecture of that week.
The author has taught the course over the last 7 years, first using Excel and rele-
vant Visual Basic for Applications (VBA) codes to enhance some of the techniques.
The transition to Python was made more recently, considering the demand both from
industry and students, and also to keep up with the pace of developments in data
science more broadly. The motivation to prepare this paper came as a result of the
transition from Excel to Python, as the author was unable to find a single book or
resource relevant to prepare for a complete delivery of this course using Python.
The paper will mostly focus on the use of existing Python tools to generate forecasts,
although a bit of the background on the mathematical concepts will be provided
as necessary. Also, although prior knowledge of Python is not necessary, it will be
assumed that the reader has some level of familiarity with methods involved in the
corresponding mathematical material, as it would be required for anyone teaching
such a course. The lecture notes [15] that form the material of the course discussed
here are based on the books [1, 2].
As for the Python material, we only found the book [16] during the prepara-
tion, in 2019, of the first draft of the computing material presented here.
While preparing this paper, we came across the two new books [17, 18] on the
use of Python to generate forecasts on time series data. There are two common
denominators to these three books; the first is that they are mostly geared
towards machine learning–based techniques for time series forecasting, with the
exception of ARIMA models, which are covered in detail. Secondly, they essen-
tially focus on the use of Python tools to generate forecasts, and hence do not
specifically pay attention to the mathematical background of the methods,
on which the corresponding Python forecasting tools are based.
Clearly, there are two differences between the content of this paper and what
is covered in the books [16–18]. First, considering the page limitations of an
article such as this one, we also focus mostly on the coding side of the
methods; however, our presentation is essentially organized along the lines of
the corresponding lecture notes [15], which provide the necessary mathemati-
cal background to develop a deep understanding of all the methods covered in
this paper. Secondly, unlike in these books, we focus our attention on statistical
methods, which form the basis of most of the methods which are at the heart
of the successful practical implementations in the context of the M competition
series, as discussed at the beginning of this introduction.
It is also important to mention that our philosophy in the preparation and
delivery of the course discussed in this paper is inspired in part by the book [2];
that is, giving the reader a balanced mathematical background of the forecast-
ing methods, while accompanying them with relevant practical software tools to
use these methods on practical data sets. However, the fundamental difference is
that [2] uses R while we use Python.
The lecture notes on which this course is based (i.e. [15]), as well as all the cor-
responding codes presented here, can be accessed online via the following link:
https://github.com/abzemkoho/forecasting.
We start the next section with an overview of the main Python packages needed
to work with the tools that we will go through in this paper. Subsequently, we
present tools that can be used for basic data analysis (e.g. time, seasonal,
and scatter plots, as well as correlation analysis, just to mention a few) before
the start of any forecasting task based on the methods covered in this paper.
Section 3 is devoted to exponential smoothing methods, which are very effi-
cient on time series that involve trends and/or seasonality. Section 4 covers
ARIMA methods; and finally, Section 5 presents tools for regression analysis
and how they can be used for forecasting. Note that the exponential smooth-
ing and ARIMA methods are blackbox techniques, as they are built under
the assumption that historical patterns in the time series will keep repeating
themselves in the future. However, regression-based approaches assume that
the behaviour of the time series of interest (dependent variable) is influenced
by other variables (independent variables), and this is explored through linear
regression to possibly build more accurate forecasts.
No prior knowledge of Python is required to use the material in this paper. How-
ever, we assume that the reader/instructor who wants to use the tools presented
here has Python up and running on their device (desktop, laptop, etc.). The codes
and corresponding results are based on the use of Python under Anaconda 3
with Spyder 3.6 as editor, all running on Windows 10 Enterprise (proces-
sor: Intel(R) Core(TM) i5-6300U CPU @ 2.40 GHz). The advantage of using
Anaconda is that it installs Python with many important packages that are use-
ful for time series analysis of the type covered in this paper. This therefore helps
in part to reduce dependency issues between various packages used, and hence
ensure that key packages are set to work nicely together. Nevertheless, all the
codes presented here should be able to work smoothly on most platforms running
a version 3 of Python (see https://www.python.org/). The main packages needed
are as follows:
– SciPy;
– NumPy;
– Matplotlib;
– Pandas;
– Statsmodels.
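As a minimal illustration (this is not part of the paper's listings), these packages are typically imported as follows; the aliases np, pd, and plt are common conventions, not requirements of the material:
# Minimal import sketch (not one of the paper's listings), assuming the packages
# are installed (e.g. via Anaconda).
import numpy as np                  # numerical arrays and basic statistics
import pandas as pd                 # time series handling (Series/DataFrame, date indexing)
import matplotlib.pyplot as plt     # plotting (time plots, seasonal plots, scatter plots)
import scipy.stats as stats         # general statistical utilities
import statsmodels.api as sm        # forecasting models (exponential smoothing, ARIMA, OLS)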
In this subsection, we discuss the following five key topics, which are crucial in
the preliminary analysis of time series data sets:
– Time plots;
– Adjustments;
– Decompositions;
– Correlation analysis;
– Autocorrelation function.
There are several ways to assess the presence of seasonality, including zooming in on specific chunks of the corresponding time plots.
Also, a time plot can sometimes already give an initial indication of the presence of
seasonality in a time series; for example, intuitively, Fig. 1(b) already suggests that
peaks and troughs might be occurring at regular intervals. But some fur-
ther steps need to be taken to check this.
In this paper, we are going to mainly use the seasonal plots and the concept of
autocorrelation function (ACF) to decide whether a time series is seasonal or not.
The ACF will be defined at the end of this section. Before that, we start with the
seasonal plots, which correspond to a superposition of time plots over a succes-
sion of limited time periods (e.g. 12 months in the context of monthly observations,
which is what we have for most of the data sets used in our illustrations). Listing A.2
provides a code that can be used to build seasonal plots after having organized our
data by month over a number of years.
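The listing itself is in the appendix; as a rough sketch of the same idea (not Listing A.2), assuming the monthly observations are stored in a pandas Series y with a DatetimeIndex, seasonal plots can be built by superposing one line per year:
import pandas as pd
import matplotlib.pyplot as plt

# Sketch of a seasonal plot (not the paper's Listing A.2): one line per year,
# with months on the horizontal axis; y is assumed to be a monthly pd.Series
# indexed by a DatetimeIndex.
frame = pd.DataFrame({'value': y.values,
                      'year': y.index.year,
                      'month': y.index.month})
pivot = frame.pivot(index='month', columns='year', values='value')
pivot.plot(marker='o')
plt.xlabel('Month')
plt.ylabel('Observation')
plt.title('Seasonal plot')
plt.show()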
Clearly, there is an indication from Fig. 2 that the clay bricks and electricity data
may have seasonality, while it is unlikely to be the case for the treasury bills data.
From the time plots in Fig. 1, an initial guess could have already been made about
the electricity data, but maybe not necessarily for the clay bricks data. At the end of
this section, we will see how the ACF plots can help to further confirm seasonality
identified here.
Besides the different patterns that can be assessed using time plots, they can also
enable an assessment of the need for adjustments (e.g. mathematical transforma-
tions or calendar adjustments). Ideally, the role of a mathematical transformation is
to attempt to stabilize variance in a time series, where rapid changes in some parts of
a time plot can affect the ability of a forecasting method to generate accurate results.
For instance, the power (including the square root, as a special case) and log transfor-
mations are the most commonly used transformations in the literature; the square root
can help, in the case where the time series has the shape of a quadratic
function, to promote a "linear" shape, which can improve the predictive capacity
of some forecasting methods. On the other hand, the log (of course, applicable only
for positive time series) has an additional advantage, in terms of its interpretability.
For more details on these transformations and many other adjustments, which can
positively impact the forecasting ability of some methods, see [2, Chapter 3]. List-
ings A.3, A.4, and A.5 provide appropriate codes to generate a log, square root, and
calendar adjustments, respectively. The code in Listing A.5 runs on a special data
set, where a calendar adjustment can be useful, as in the milk production of a cow,
the difference in the observations from one month to the other can essentially be due
to the number of days in months. Hence, the calendar adjustment can help to remove
such a calendar effect before any further analysis of this time series.
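As a rough sketch of such adjustments (not the paper's Listings A.3–A.5), and assuming a positive monthly pandas Series y with a DatetimeIndex:
import numpy as np

# Sketch of common adjustments (not the paper's Listings A.3-A.5);
# y is assumed to be a positive monthly pd.Series with a DatetimeIndex.
y_log = np.log(y)      # log transformation (positive series only)
y_sqrt = np.sqrt(y)    # square root transformation

# Calendar adjustment: rescale each observation to an average month length so that
# differences due only to the number of days per month are removed.
days = y.index.days_in_month.values
y_cal = y / days * days.mean()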
For a given time series {Yt }t , it is sometimes important to look for ways to split it
by means of a decomposition function f in such a way that
Yt = f (Tt , St , Et ), (1)
where for a given t, Tt and St denote the trend-cycle and seasonal components,
respectively, and Et corresponds to the error that results from such a decomposi-
tion. Decompositions are useful in developing a better understanding of the consti-
tuting patterns in a time series, but not necessarily for generating forecasts. Stand-
ard selections for a decomposition function are f (Tt , St , Et ) ∶= Tt + St + Et (additive
decomposition) and f (Tt , St , Et ) ∶= Tt × St × Et (multiplicative decomposition).
The statsmodels function seasonal_decompose can be used to generate
these decompositions, with the option “model” suitable for indicating the nature of
the decomposition (i.e. additive or multiplicative); see Listing A.6 for an additive
decomposition code (used to generate Fig. 3, for illustrative purpose) and Listing
A.7 for a multiplicative one.
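A minimal sketch of this step (not the paper's Listings A.6/A.7), assuming a monthly pandas Series y, is given below; note that older statsmodels versions name the period argument freq:
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Sketch (not the paper's Listings A.6/A.7); y is an assumed monthly pd.Series.
res_add = seasonal_decompose(y, model='additive', period=12)        # Y = T + S + E
res_mul = seasonal_decompose(y, model='multiplicative', period=12)  # Y = T * S * E
res_add.plot()   # trend, seasonal and residual components in one figure
plt.show()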
It is important to note that in terms of the background algorithm on how a decompo-
sition is computed, one usually starts with the trend estimation, and then, depending on
the nature of f in (1), the seasonal component is estimated; interested readers are referred
to the lecture notes associated with this material [15, Section 2] and references therein.
Correlation analysis comes into play when we want to explore relationships
between variables in cross-sectional data. There are at least two possible tools to
assess correlation between variables, namely scatter plots and correlation values.
Both concepts are strongly related in the sense that the scatter plot provides a graphi-
cal representation that can demonstrate how strong the relationship between two vari-
ables is, while the correlation is a numerical value materializing the strength of
such a relationship. As an example to illustrate these two concepts, consider a data set
made of a variety of used cars and their prices (based on their mileage). For instance,
we might want to forecast the dependent variable (price) against one possible explanatory variable (mileage,
here). Running the code in Listing A.8 clearly shows that the price of a car decreases
as the mileage increases. Each point on the graph represents one specific vehicle.
Fig. 3 Additive decomposition
graphs for the clay bricks sale
time series
A scatter plot helps us to visualize the relationship and suggests that if one wants
to forecast the price of a used car, a suitable model should include mileage as an
explanatory variable. In Listing A.8, the scatter function
from matplotlib is applied with arguments being the mileage and price as sepa-
rate entries. Note that pandas also has the function scatter_matrix, which
can generate scatter plots for many variables in one go; this could be particularly
important in Section 5 when studying the regression approach to forecasting. Fig-
ure 4, for example, generated by the code Listing A.9, shows scatter plots in a matrix
form for four time series.
The correlation is a statistic corresponding to a number between −1 and 1 to
measure the level of the linear relationship for bivariate data (i.e. when there are two
variables). The corrcoef function from numpy, see Listing A.8, calculates the
correlation between the mileage and prices of the cars, as discussed above. Note that,
in principle, corrcoef returns a symmetric matrix, hence the use of cor-
relval[1,0] to extract the necessary value. In a situation where one is interested
in evaluating the relationships between various pairs of variables, the correlation
matrix enables the calculation of these values in one go, as discussed above in the
context of scatter plots, as illustrated in the left-hand-side of Fig. 4; the correspond-
ing correlation values are generated with the function corr from pandas; see the
table in the right-hand-side of Fig. 4 for an illustration with four time series.
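A compact sketch of these tools (not the paper's Listings A.8/A.9), where mileage and price are assumed one-dimensional numpy arrays and df an assumed DataFrame with one column per series:
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Sketch (not the paper's Listings A.8/A.9); mileage, price and df are assumed inputs.
plt.scatter(mileage, price)   # one point per vehicle
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()

correlval = np.corrcoef(mileage, price)   # 2x2 symmetric matrix
print(correlval[1, 0])                    # correlation between mileage and price

scatter_matrix(df, figsize=(8, 8))        # pairwise scatter plots for all columns of df
print(df.corr())                          # corresponding correlation matrix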
For a given time series Yt , the concept of correlation can be extended to the time
lags Yt and Yt−k of this same series. Hence, such a correlation is called autocorrela-
tion. The autocorrelation is used to measure the degree of correlation between differ-
ent time lags in a time series. The autocorrelation function (ACF) is crucial in assess-
ing many properties in statistics, including seasonality, white noise, and stationarity.
In this section, we limit ourselves to the use of the ACF in assessing seasonality. For
its use in assessing white noise and stationarity, see Sections 3 and 4, respectively.
Fig. 4 Left, we have the matrix of scatter plots for four times series labelled as DEOM, AAA, Tto4, and
D3to4. On the right, we have the correlation matrix, which gives the correlation value that reflects the
relationship in each pair in these four data sets. As can be seen in the scatter plots, the strongest corre-
lation is between AAA and Tto4, as confirmed by the correlation value, which is strictly larger than 0.50
Fig. 5 Left, we have the seasonal plots for most of the years involved in the time series. On the right-
hand-side, we have the ACF plot over 60 time lags
As accuracy is the first main concern when forecasting, we start here by discussing how
some standard error measures, i.e. the mean error (ME), mean absolute error (MAE),
mean square error (MSE), percentage error (PE), mean percentage error (MPE), and
the mean absolute percentage error (MAPE), can be computed using Python. To pro-
ceed, it is crucial to recall that an error measure on its own does not mean much, but
rather, it can only make sense when comparing two or more methods. Hence,
we introduce two naïve forecasting methods to illustrate how these error measures can
be used in practice. We begin with a naïve forecasting method, labelled NF1, which assumes
that for a time series {Yt}, the forecast at time point t + 1 is obtained as Ft+1 = Yt.
Next, we consider a second naïve forecasting method labelled as NF2:
Fig. 6 The results from NF1 and NF2 can be seen in the first and second graphs, respectively. As for the
corresponding error measures, see the table in the right-hand-side
$$F_{t+1} = Y_t - S_t + S_{(t-12)+1} \quad \text{with} \quad S_t = \frac{1}{m+1}\left(m\,S_{t-12} + Y_t\right),$$
where St = Yt for t = 1, … , 12 and where m is the number of complete years of data
available; for the initialization of the method, we set Ft+1 = Yt for t = 1, … , 12.
The code in Listing B.1 generates the results in Fig. 6, which show both the
NF1 and NF2 forecast plots, as well as the corresponding error measures stated
above. Note that the ME and MPE are not to be taken very seriously as their
values essentially reflect the fact that positive and negative values just cancel
each other throughout the range. Clearly, NF2 outperforms NF1 on almost all
the measures, especially, on the positive ones (MAE, MSE, and MAPE), which
are more meaningful. This is not surprising, considering the fact that NF2 contains
more structure, capturing the nature of the data set much better than NF1, which is
essentially a one-step translation of the original data set. Similar comparisons can
be done for any two or more forecasting methods.
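As an illustration of how such error measures can be computed for NF1 (a sketch, not the paper's Listing B.1), assuming the observations are stored in a one-dimensional numpy array y with no zero values:
import numpy as np

# Sketch of the error measures for NF1 (not the paper's Listing B.1);
# y is an assumed 1-D numpy array of observations with no zero values.
forecast = y[:-1]        # NF1: the forecast for t+1 is the observation at t
actual = y[1:]
e = actual - forecast    # forecast errors
pe = 100 * e / actual    # percentage errors

ME, MAE, MSE = e.mean(), np.abs(e).mean(), (e ** 2).mean()
MPE, MAPE = pe.mean(), np.abs(pe).mean()
print(ME, MAE, MSE, MPE, MAPE)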
Another tool to assess the accuracy of a forecast method is the ACF of the
errors. Basically, the expectation is that if the results of a forecasting method are
reasonably accurate, the time plot of the errors, seen as a time series, should be
purely random. Therefore, no patterns from the original data should be preserved
in the errors/residuals. Using the corresponding code in Listing B.2 on the data
used for Fig. 2, we get the graphs in Fig. 7, which clearly show that the forecasts
from NF1 preserve seasonality from the original time series, with the large spikes
appearing after every 12th time lag. Such a pattern is not clearly obvious for NF2.
Finally, providing the confidence interval for a forecast can help decision-
makers in building their management perspectives. Let Ft+1 be the forecast from a
given method; then the corresponding lower and upper bounds can be obtained as
$$LF_{t+1} := F_{t+1} - z\sqrt{MSE} \quad \text{and} \quad UF_{t+1} := F_{t+1} + z\sqrt{MSE},$$
respectively, where MSE represents the mean square error over a suitable range of
the data, while z is a quantile of the normal distribution, which is a conventional
number that determines the level of confidence of the corresponding interval. Stand-
ard values commonly used in practice for z can be seen in Section 2 of [15]. Fig-
ure 8, generated with the code in Listing B.3, provides the confidence intervals for
the data and corresponding NF1 and NF2-based results.
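A minimal sketch of this computation (not the paper's Listing B.3), assuming a numpy array of forecasts and its MSE are already available, and using z = 1.645 for a roughly 90% interval:
import numpy as np

# Sketch of the forecast bounds (not the paper's Listing B.3); `forecasts` is an
# assumed numpy array of forecasts and MSE its mean square error over a suitable range.
z = 1.645   # roughly 90% confidence
lower = forecasts - z * np.sqrt(MSE)
upper = forecasts + z * np.sqrt(MSE)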
3 Exponential Smoothing Methods
There are four main types of exponential smoothing methods, which can be
applied based on characteristics of our time series and sometimes also consider-
ing our intended purpose. Before diving into these methods, it is important to
mention that all the related Python tools that we are going to describe here are
from the statsmodels library. The first and simplest such method is the so-
called single exponential smoothing (SES) method. The SES is usually applied
only on time series that do not exhibit any specific pattern and can only produce
one-step-ahead forecasts.
To set the stage for the general process of all the forecasting methods that we
are going to present in this paper, we are going to provide a brief overview of the
mathematical background of the SES method. To proceed, let us assume that we
are given a time series Y1, ..., Yt , where data is available from time point T = 1 up
to T = t . Then, the forecast for this time series at time point T = t + 1 using the
SES method can be calculated as
$$F_{t+1} = (1-\alpha)^{t} F_1 + \alpha \sum_{j=0}^{t-1} (1-\alpha)^{j}\, Y_{t-j}, \qquad (2)$$
where the parameter 𝛼 ∈ [0, 1]. There are various ways to initialize the method; one
possibility is to select F1 = Y1. The first key observation that can be made on the
Fig. 8 The confidence intervals here are obtained with the formula Ft ± z√MSE, with z being the parame-
ter ensuring a 90% chance that the forecasts lie between the lower and upper bounds provided
formula (2), and which justifies the name of this class of methods, is the fact that
the factor (1 − α)^j decays exponentially
as the power j increases. More interestingly, by the nature of the expression, this
increase of j is associated with a decrease of the index of Yt−j. Hence, this means that
the value of Ft+1 relies more heavily on recent values of the time series Y1, ..., Yt.
This is one of the particular characteristics of exponential smoothing methods.
Additionally, being able to optimally select the value of the parameter 𝛼 is critical
for the performance of the method. The strategy commonly used in this case is the
least squares optimization approach to select its best value. It corresponds to minimiz-
ing the MSE
$$\min\; \frac{1}{t}\sum_{j=1}^{t} e_j^2 := \frac{1}{t}\sum_{j=1}^{t}\left(F_j - Y_j\right)^2 \quad \text{s.t. } \alpha \in [0, 1]. \qquad (3)$$
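In practice, this is handled by the SimpleExpSmoothing class from statsmodels.tsa.api; the following sketch (not the paper's Listing B.4; the fixed values 0.1 and 0.5 are only illustrative) shows both manually fixed and optimized choices of α, assuming a pandas Series y:
from statsmodels.tsa.api import SimpleExpSmoothing

# Sketch (not the paper's listing); y is an assumed pd.Series, and the values
# 0.1 and 0.5 are only illustrative choices of alpha.
fit1 = SimpleExpSmoothing(y).fit(smoothing_level=0.1, optimized=False)
fit2 = SimpleExpSmoothing(y).fit(smoothing_level=0.5, optimized=False)
fit3 = SimpleExpSmoothing(y).fit()        # alpha chosen by minimizing the squared errors, as in (3)
print(fit3.params['smoothing_level'])     # optimized value of alpha
print(fit3.forecast(1))                   # one-step-ahead forecast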
Fig. 9 On the left, we have the forecast plots for different values of the parameter 𝛼 , with the 3rd being
the optimal one. The table on the right provides values of the MSE for each value of the parameter
The second method, Holt's linear method, is suitable for time series involving trend without the presence of seasonality. Hence, this method involves an
estimate of the level and linear trend of the time series at a given time point. As
a consequence, the Holt linear method involves level and slope parameters 𝛼 and
𝛽 , respectively. These parameters can be optimized using the minimization of the
MSE, similarly to what is done in (3). Similarly to SES, Holt’s linear method is
applied by simply calling the function named Holt from statsmodels.tsa.
api. In the case where we want to set the parameters 𝛼 and 𝛽 manually, we can
use the options smoothing_level and smoothing_slope, respectively. To
improve the forecasting performance of the Holt linear method, the Holt function
provides an option to select the nature of the trend using the exponential or
damped option, as it can be seen in the following excerpt of the Holt forecasting
code in Listing B.5:
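The excerpt itself appears in the appendix listing; a comparable sketch (not the original code), assuming a pandas Series y and recent statsmodels argument names, is:
from statsmodels.tsa.api import Holt

# Sketch along the lines of Listing B.5 (not the original excerpt); y is an assumed pd.Series.
fit1 = Holt(y).fit()                      # default: linear (additive) trend
fit2 = Holt(y, exponential=True).fit()    # exponential trend
fit3 = Holt(y, damped_trend=True).fit()   # damped trend
# (older statsmodels versions name the last option damped=True)
print(fit1.forecast(12))                  # out-of-sample forecasts, 12 steps ahead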
Obviously, the default selection of the trend in the first model (see the first model in
this excerpt) is the linear trend. For more details on the different types of trend and
the corresponding mathematical adjustments, see https://www.statsmodels.org/stable/
generated/statsmodels.tsa.holtwinters.Holt.html.
Finally, we now present the Holt-Winter forecasting method, which is suitable for
time series involving both trend and seasonality. Hence, in addition to the level and
trend components needed in the Holt linear method (designed only for the case where
trend is present in our time series), a seasonal component is needed. The seasonal
component also comes with its own parameter, generally denoted by γ. As is
the case for the previous two methods, all the parameters are required to be real
numbers from the interval [0, 1]. Since the Holt-Winter method is more general than
the SES and LES, the corresponding function from statsmodels.tsa.api is
labelled as ExponentialSmoothing.
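The corresponding excerpt appears in Listing B.6 in the appendix; a sketch in the same spirit (not the original code), assuming a monthly pandas Series y, is:
from statsmodels.tsa.api import ExponentialSmoothing

# Sketch in the spirit of Listing B.6 (not the original excerpt); y is an assumed
# monthly pd.Series. Recent statsmodels versions use smoothing_trend for the
# parameter that the text (and older versions) call smoothing_slope.
fit1 = ExponentialSmoothing(y, trend='add', seasonal='add', seasonal_periods=12).fit(
    smoothing_level=0.3, smoothing_trend=0.1, smoothing_seasonal=0.1, optimized=False)
fit2 = ExponentialSmoothing(y, trend='mul', seasonal='mul', seasonal_periods=12).fit(
    smoothing_level=0.3, smoothing_trend=0.1, smoothing_seasonal=0.1, optimized=False)
fit3 = ExponentialSmoothing(y, trend='add', seasonal='add', seasonal_periods=12).fit()  # optimized
fit4 = ExponentialSmoothing(y, trend='mul', seasonal='mul', seasonal_periods=12).fit()  # optimized
print(fit3.forecast(24))   # out-of-sample forecasts from the third model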
As we can see from this excerpt of the corresponding code in Listing B.6, besides
the parameters 𝛼 , 𝛽 , and 𝛾 , represented here by smoothing_level, smooth-
ing_slope, and smoothing_seasonal, which can be fixed or optimized as
in the previous two exponential smoothing methods, we have the nature of the trend
and seasonality, which can be additive or multiplicative. Clearly, the term add (resp.
mul) is used for additive (resp. multiplicative) trend or seasonality. More details on
these concepts can be found in [15, Section 2].
We use the code in Listing B.6 to generate the results in Fig. 10, which clearly
show that the optimized models 3 and 4 are the best, with the 3rd one with addi-
tive trend and seasonality being slightly better. The ACF of the residuals from each
method is also included in the code, to further evaluate the performance of each
method. It is clear that the residuals for models 1 and 2 retain the seasonality present
in the original data set. On the other hand, Fig. 10(b), (f), and (g) just confirm that
residuals seem relatively random.
4 ARIMA Methods
As we have seen so far, the ACF plot can play an important role in showing that a
time series is seasonal and also in assessing the accuracy of a forecasting method
(mainly via the white noise concept). In this section, we are going to see how the
ACF can also be helpful in assessing a few other properties relevant to the ARIMA
method, namely, in assessing stationarity and the identification of an ARIMA model.
However, to strengthen the capacity of the ACF in this role, we now introduce the
concept of partial autocorrelation function (PACF), which is used to measure the
degree of association between observations at time lags t and t − k (i.e. Yt and Yt−k ,
respectively) when the effects of other time lags, 1, … , k − 1, are removed. Hence,
partial autocorrelations calculate true correlations between Yt, Yt−1, ..., Yt−k and can
therefore be obtained using a regression on these terms, while proceeding
as in the least square approach in (3) or the concept of maximum likelihood estima-
tion, which is more common in this case [2].
To get a good flavour of how the PACF can be applied, let us use it to further
illustrate white noise in combination with ACF. Similarly to the ACF, as shown in
Subsection 2.2, the PACF can be plotted by simply applying the function plot_
pacf from statsmodels.graphics.tsaplots. The code in Listing C.1
generates the ACF and PACF for an example of a white noise model. The important
thing to note when this code is run is what the ACF and PACF of a typical white
noise model look like; recall that for a model to be statistically white noise, about
95% of the values of the ACF and PACF should be within the range ±1.96/√n, where n is
the total number of observations. This range is represented by the shadow band that
appears in the graphs of both the ACF and PACF.
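As a sketch of such an experiment (not the paper's Listing C.1), white noise can be simulated with numpy and its ACF and PACF plotted as follows:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Sketch of a white noise check (not the paper's Listing C.1).
rng = np.random.default_rng(0)
wn = rng.normal(size=300)    # artificial white noise series

plot_acf(wn, lags=40)        # roughly 95% of the spikes should stay inside the shaded band
plot_pacf(wn, lags=40)
plt.show()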
We now turn our attention to the concept of stationarity, which is at the heart of
the development of ARIMA methods. Recall that a time series is stationary if the
distribution of the fluctuations is not time dependent. This is easy to say, but it can
be tricky to actually show that a time series is stationary. We try now to provide a
few tools that can be helpful in identifying stationarity in a time series. To proceed,
we start by stating the following scenarios or specific tools that we are going to rely
on to identify whether a time series is stationary or not:
– a white noise time series is stationary;
– a time series with a trend, seasonality, or other systematic changes of level (e.g. cyclical patterns) is non-stationary;
– the autocorrelations of a non-stationary time series decay to zero slowly;
– the PACF of a non-stationary time series typically has a large spike at lag 1;
– when the above checks are inconclusive, a unit root test can be used.
We have just seen how to determine whether a time series is white noise, using the
ACF and PACF, which can be plotted with Python using plot_acf and plot_
pacf, respectively. As for the second item, we already know, see Subsection 2.2,
how to identify trend and seasonality, as well as cyclical patterns, using time plots.
There is an interesting way to show that a time series is non-stationary by means of
its ACF and PACF plots. Basically, the autocorrelations of a stationary time series
drop to zero quite quickly, while those of a non-stationary one can take a significant
Fig. 11 Example of non-stationary times series (Dow Jones data from January 1956 to April 1980)
number of time lags to become zero. On the other hand, the PACF of a non-stationary
time series will typically have a large spike, possibly close to 1, at lag 1. This can
clearly be observed in Fig. 11 generated with the code in Listing C.2.
Ultimately, if the first four points above cannot help to make a definite decision
on the stationarity or non-stationarity of a time series, then we can proceed with
a unit root test. It is important to say beforehand that this is not a magic solution
to demonstrate stationarity, as there are various types of unit root tests, which can
sometimes provide contradictory results. The version of the unit root test that we
consider here is the augmented Dickey-Fuller (ADF) test [19], which assesses the
null hypothesis that a unit root is present in a time series sample.
A simple understanding of the ADF test that is relevant to us is that it generates
a number of statistics that we are going to present next. To generate these statistics,
the function adfuller from statsmodels.tsa.stattools can be applied
to our data set. This function simply takes in the values of the time series, as can be
seen in the example code in Listing C.3, which is used to generate
the results in Fig. 12 from three different scenarios. Considering some building mate-
rial production data from Australia, the first row of Fig. 12 presents the time, ACF,
and PACF plots, respectively, as well as the statistics generated by the ADF test.
The ADF test (see last column of Fig. 12) generates three key categories of sta-
tistics. First, we have the ADF statistic itself, which needs to be negative and, to
provide strong evidence of stationarity, should be less than the 1% critical value,
provided additionally that the P-value is less than the threshold value of
0.05. We can clearly see from Fig. 12 how the ADF test helps to confirm that we go
from a series whose original and first-differenced versions are non-stationary to a
stationary time series once both first and seasonal differencing are applied.
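A sketch of how these statistics can be printed (not the paper's Listing C.3), assuming a pandas Series y to which any differencing has already been applied:
from statsmodels.tsa.stattools import adfuller

# Sketch of an ADF test (not the paper's Listing C.3); y is an assumed pd.Series,
# e.g. y.diff().dropna() or y.diff().diff(12).dropna() for differenced versions.
result = adfuller(y.dropna())
print('ADF statistic:', result[0])
print('P-value:', result[1])
for level, value in result[4].items():   # 1%, 5% and 10% critical values
    print('Critical value (%s): %.3f' % (level, value))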
To fix ideas, consider a polynomial
f (x) ∶= a0 + a1 x + a2 x2 + … + ap xp ,
where p is the order of the polynomial and a0, a1, ..., ap are its coefficients. To get a
complete description of this polynomial, we need to start by identifying the order p,
which determines the number of the coefficients a0, a1, ..., ap, which can then be sub-
sequently calculated. This is approximately what is done to build an ARIMA model.
To make things a bit more precise, let us consider a non-seasonal ARIMA(p, d, q) model, which can be written as
$$\bigl(1 - \phi_1 B - \dots - \phi_p B^{p}\bigr)\,(1 - B)^{d}\, Y_t = c + \bigl(1 + \theta_1 B + \dots + \theta_q B^{q}\bigr)\, e_t, \qquad (4)$$
where Bk Yt ∶= Yt−k corresponds to the backshift notation, c is a constant, and et is a white noise error term. Here, the vector (p, d, q)
represents the order of the model, and 𝜙i , i = 1, … , p, and 𝜃j , j = 1, … , q, are param-
eters/coefficients of the model. Algorithm 1 summarizes the building process of an
ARIMA model, including the forecasting step.
Fig. 13 The first row presents the time, ACF, and PACF plots of an artificially generated autoregressive
model of order 1. The second row presents analogous graphs for an artificially generated moving average
of order 1
In Step 1 of Algorithm 1, the identification of a tentative (pure AR or MA) model can
be made on the ACF and PACF of "sufficiently differenced" (in the sense of leading to
stationarity) data. The graphs in Fig. 13 show an AR(1) and a MA(1) in the first and
second row, as generated by Listings C.4 and C.5, respectively.
Considering the fact that the approach in Step 1 can only enable the estimation
of pure AR and MA models, we need a way to check whether our series exhibits
a more general ARIMA(p, d, q) model with p > 0 and q > 0 simultaneously. The
AIC, which is a function of p and q, can help us to check whether there is a model
better than the one obtained from Step 1. The smaller the AIC, the better the model
is. To proceed, we can use the code in Listing C.6, which runs through combina-
tions of values of p, d, and q between 0 and 2 to identify the order (p, d, q) with the
best AIC. For the selection of d, it is straightforward to use the process described
above, repeating the differencing as necessary to get the best statistics from the ADF
test based on the code in Listing C.3.
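A rough sketch of such a search (not the paper's Listing C.6), assuming a pandas Series y and the ARIMA class from statsmodels.tsa.arima.model (its location in recent versions):
import itertools
from statsmodels.tsa.arima.model import ARIMA

# Sketch of an AIC-based order search (not the paper's Listing C.6); y is an assumed pd.Series.
best_aic, best_order = float('inf'), None
for p, d, q in itertools.product(range(3), range(3), range(3)):   # p, d, q in {0, 1, 2}
    try:
        res = ARIMA(y, order=(p, d, q)).fit()
        print((p, d, q), 'AIC =', res.aic)
        if res.aic < best_aic:
            best_aic, best_order = res.aic, (p, d, q)
    except Exception:
        continue   # skip orders for which the model cannot be estimated
print('Best order:', best_order, 'with AIC', best_aic)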
In terms of the content of the code in Listing C.6, its main feature is the ARIMA
function from statsmodels. This function is also going to be used for Step 4
of Algorithm 1, but one of its most interesting features is that it also generates
other important information such as the AIC of the corresponding model. How-
ever, in the context of Listing C.6, its main role is to print and compare the AIC
to identify the best model. When the most suitable values of the order (p, d, q)
have been identified, the ARIMA function can then be applied, using this order,
to generate the forecasts, as it is done for the example in Listing C.7. Running the
code generates forecast plots and some important statistics, including the AIC of
the model and the corresponding coefficients/parameters 𝜙i , i = 1, … , p and 𝜃j ,
j = 1, … , q as described in the equation in (4).
So far, we have considered only time series that are not necessarily seasonal.
In the seasonal case, the process is the same, except that the seasonal order
(P, D, Q) and periodicity s have to be provided, as indicated in the general model
Fig. 14 These graphs generated from Listing C.8 present the changes in the electricity demand time
series data in Fig. 1(b), going from the original data and its ACF and PACF plots (first row), passing by
the first difference (second row) to the graphs resulting from first and seasonal differencing (third row)
Once the non-seasonal and seasonal orders, together
with the corresponding number of time periods per season (s), have been identified,
the seasonal ARIMA function (SARIMAX), also from statsmodels (see List-
ing C.10), can be used to generate the forecasts. Running SARIMAX with the code
available in Listing C.10 applied on building material time series from 1986 to 2008
in Australia, we get the graphs in Fig. 15 together with a number of statistics assess-
ing the quality of the model and the results.
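A sketch of such a run (not the paper's Listing C.10), assuming a monthly pandas Series y; plot_diagnostics produces residual-based graphs of the kind described in the caption of Fig. 15, and get_forecast produces the out-of-sample forecasts:
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Sketch of a seasonal ARIMA run (not the paper's Listing C.10); y is an assumed monthly pd.Series.
results = SARIMAX(y, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12)).fit(disp=False)
print(results.summary())

results.plot_diagnostics(figsize=(10, 8))   # residuals, histogram, Q-Q plot, correlogram
plt.show()

fc = results.get_forecast(steps=20)         # out-of-sample forecasts over a 20-step horizon
print(fc.predicted_mean)
print(fc.conf_int())                        # confidence intervals for the forecasts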
Fig. 15 Summary of graphical results obtained by running the SARIMAX(1, 1, 1)(0, 1, 1)12 model using
the code from Listing C.10 on building material time series from 1986 to 2008 in Australia. The first four
graphs assess the accuracy of the method, with (1) the residual plot, (2) the distribution of the error (close
to a normal distribution), (3) the normal Q–Q plot, which compares randomly generated and independent
standard normal data on the vertical axis to a standard normal population on the horizontal axis (the closer
the data points are to a line, the more this suggests that the data are normally distributed), and (4) the correlogram for
checking randomness in the residual. The last row shows the one-step forecasts on a section of the data for
some visual assessment of accuracy, as well as the out-of-sample future forecasts over a 20-step horizon
5 Regression-Based Forecasting
The particularity of the method that we are going to discuss here is that it is explan-
atory, in comparison to the previous ones, which are blackbox methods. A regres-
sion model exploits potential relationships between the main (dependent) variable
and other (independent) variables. We focus our attention here on the simplest and
most commonly used relationship, which is the linear regression:
Y = b0 + b1 X1 + … + bk Xk + e, (8)
where Y is the dependent variable, X1, ..., Xk the independent variables, and b0, b1,
..., bk the coefficients/parameters, where b0 specifically is often called intercept. It
is important to start by recalling that a regression model such as (8) is not a forecasting
method by itself; there is a large number of applications of regression models in sta-
tistics and econometrics; see, e.g. [20] for a detailed analysis of regression models
and some flavour of a sample of applications.
To apply the regression model (8) to develop a forecast for a time series {Yt }, we
assume that it is influenced by other time series {Xit } for i = 1, … , n. To have some
flavour of this, we consider the mutual savings bank case study from [1], where a regres-
sion model can be built to forecast EOM while considering AAA and Tto4 as inde-
pendent variables. For some technical reasons (see [1]), our Y is the first-order dif-
ference of EOM (denoted by DEOM), and X1, X2, and X3 the AAA, Tto4, and D3to4
(first-order difference of Tto4), respectively. Note that historical time series data sets
are available for the variables DEOM, AAA, Tto4, and D3to4, and there is some
level of relationship between these variables, as can be seen from the scatter plots
and correlation matrix in Fig. 4. However, this is not enough to guarantee that the
regression model resulting from this relation would be significant. The analysis of a
regression model starts with the evaluation of its overall significance.
For the overall significance of a model, key statistics are the R2 (known as
the coefficient of determination) and the P-value, which gives the probability of
obtaining a F statistic as large as the one calculated for the data set being studied,
if in fact the true slope is zero. As the R2 is a number between 0 and 1, model (8)
would be considered significant if its R2 is greater than 0.50. Hence, the
overall significance of the model increases as R2 grows closer to the upper bound
1. Furthermore, from the perspective of the P-value, a regression model will be
said to be significant if the P-value is smaller than the conventionally set value of
0.05; and the significance improves as the P-value decreases below this threshold.
Before we expand this discussion further, let us show how the aforementioned
statistics can be obtained with Python. Our analysis of a regression model here
is based on the ols function from statsmodels, which stands for ordinary least
squares, given that the parameters in (8) are computed by the same least squares
approach introduced for the SES model in (3). As you can see in the demonstra-
tion code in Listing D.1, it is incredibly easy to use ols. For example, to build
the basic model for our above bank case study, what is needed is to start by writ-
ing the regression equation
formula = 'DEOM ~ AAA + Tto4 + D3to4',
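which is then passed to the ols function together with the data. The full code is in Listing D.1; a sketch of the remaining steps (assumed variable names, not the original listing), with df a DataFrame containing the four series, reads:
import statsmodels.formula.api as smf

# Sketch of the regression fit (not the paper's Listing D.1); df is an assumed
# DataFrame with columns DEOM, AAA, Tto4 and D3to4.
model = smf.ols(formula='DEOM ~ AAA + Tto4 + D3to4', data=df).fit()
print(model.summary())    # overall and individual significance statistics
print(model.rsquared)     # R^2 of the model
print(model.f_pvalue)     # P-value of the overall F test
print(model.pvalues)      # individual P-values of the coefficients
print(model.params)       # estimated coefficients, including the intercept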
Fig. 16 Key statistics to assess the overall and individual significance of a regression model
Clearly, the individual significance of AAA, Tto4, and D3to4 is relatively good, as their
P-values are less than the threshold value of 0.05, although that of the latter variable is weaker.
Interestingly, the green box in the table in Fig. 16 also provides the coefficients of
this example (cf. second column). After we have seen how the function ols can help
to generate the key statistics to assess the overall and individual significance of the
model, it remains to see how the forecast can actually be derived. To be able to do this,
we need the forecasts
Gi = (Gi1 , … , Gik ) of Xi = (Xi1 , … , Xik ) for i = t + 1, … , t + m.
We can then use each of these forecasts of the independent variables in the
expected value that determines the regression-based forecast for the depend-
ent variable Y using Eq. (8):
Fi = Ŷ i = Gi b̂ for i = t + 1, … , t + m, (9)
where the forecasts Gi of each independent variable can be obtained by any method
that is most suitable. Applying (9) to our example above (see Listing D.2), we obtain
the results in Fig. 17.
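A sketch of this step (not the paper's Listing D.2): the independent variables are first extrapolated, here with Holt's linear method as discussed in the caption of Fig. 17, and the fitted regression model from above is then applied to these extrapolations (variable names are assumptions):
import pandas as pd
from statsmodels.tsa.api import Holt

# Sketch of regression-based forecasting (not the paper's Listing D.2); df is the
# assumed DataFrame of historical series and `model` the fitted ols model from above.
m = 12   # forecast horizon (illustrative)
future = pd.DataFrame({
    'AAA': Holt(df['AAA']).fit().forecast(m),
    'Tto4': Holt(df['Tto4']).fit().forecast(m),
    'D3to4': Holt(df['D3to4']).fit().forecast(m),
})
deom_forecast = model.predict(future)   # forecasts of the dependent variable DEOM, as in Eq. (9)
print(deom_forecast)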
To conclude this section, some quick comments are in order. First, one of the typi-
cal preliminary steps when building a regression model is to conduct a correlation
Fig. 17 Generating forecasts for the time series involved in this model, i.e. AAA, Tto4, and D3to4 for the
independent variables and DEOM for the dependent variable, is quite challenging as none of the data sets
exhibits a clear pattern. Hence, of the exponential smoothing methods covered in Section 3, only Holt's
linear method is suitable, as it enables the calculation of out-of-sample forecasts over a number of
time points ahead. An ARIMA method could also be used to generate forecasts for AAA, Tto4, and D3to4
analysis (e.g. scatter plots, correlation matrix), which can be done using tools that
we have discussed in Subsection 2.2. This can be done here with matrix scatter plots
and correlation tables; see Fig. 4. Also, to improve an initial model as in (8) or
the resulting forecasting accuracy in (9), a careful selection of variables or features
of the data sets can be done. Finally, the term prediction is usually confused with that
of forecast. Prediction is much broader, as it includes tasks such as predicting the
result of a soccer game or an election, where only characteristics of the players of each
team (soccer) or surveys of voters (election), rather than historical data, may be
used. Further details on these topics can be found in [1, 2, 15] and references therein.
6 Conclusion
This paper puts together a set of mostly off-the-shelf Python-based tools to develop
forecasts for time series data using basic statistical forecasting methods, namely
exponential smoothing, ARIMA, and regression methods. It is important to mention,
first, that for each forecasting method and analysis tool described in this paper, there could
be multiple Python approaches available to undertake them, across different Python-
based platforms. Secondly, within many packages, there could also be various ways
to do the same thing. So, when using the material presented here, it will be useful to
have a look at the most recent updates on the corresponding packages' websites (see
the corresponding links provided in Section 2) for other possible ways to conduct spe-
cific analyses or for the most recent updates on possible improvements to these tools.
Appendix
Acknowledgements The lecture notes [15] (based on the textbooks [1, 2]), which have served as the basis for
the mathematical background of the data analysis and forecasting tools discussed in this paper, have been
developed and refined over the years thanks to contributions from many colleagues from the Southamp-
ton OR Group, in particular, I would like to mention Russell Cheng and Honora Smith for preparing and
delivering the Forecasting course for many years, until the 2013–2014 academic year. The author would
like to thank the referee and the guest editor for their constructive feedback, which led to improvements in
the presentation of the paper.
Funding This work is supported by the EPSRC grant with reference EP/V049038/1 and the Alan Turing
Institute under the EPSRC grant EP/N510129/1.
Data Availability All the data sets used for the illustrations in this paper are based on the book [1]; all
the data sets related to this book are available online: https://cloud.r-project.org/web/packages/fma/index.html.
As for the specific time series from this database used in this paper, they are available via the following
link, together with all the py files associated with the codes in the appendix: https://github.com/abzemkoho/forecasting.
Declarations
Conflict of Interest The author declares no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permis-
sion directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Makridakis S, Wheelwright SC, Hyndman RJ (2008) Forecasting methods and applications. J Wiley
& Sons
2. Hyndman RJ, Athanasopoulos G (2018) Forecasting: principles and practice. OTexts
3. Deng L (2014) A tutorial survey of architectures, algorithms, and applications for deep learning.
APSIPA Transactions on Signal and Information Processing 3
4. Hamzaçebi C, Akay D, Kutay F (2009) Comparison of direct and iterative artificial neural network
forecast approaches in multi-periodic time series forecasting. Expert Systems with Applications
36(Part 2):3839–3844
5. Robinson C, Dilkina B, Hubbs J, Zhang W, Guhathakurta S, Brown MA et al (2017) Machine learn-
ing approaches for estimating commercial building energy consumption. Appl Energy 208(Supple-
ment C):889–904
6. Salaken SM, Khosravi A, Nguyen T, Nahavandi S (2017) Extreme learning machine based transfer
learning algorithms: a survey. Neurocomputing 267:516–524
7. Voyant C, Notton G, Kalogirou S, Nivet ML, Paoli C, Motte F et al (2017) Machine learning meth-
ods for solar radiation forecasting: a review. Renew Energy 105(Supplement C):569–582
8. Zhang G, Eddy Patuwo B, Hu YM (1998) Forecasting with artificial neural networks: the state of
the art. Int J Forecast 14(1):35–62
9. Zhang L, Suganthan PN (2016) A survey of randomized algorithms for training neural networks. Inf
Sci 364-365(Supplement C):146-155
10. Adya M, Collopy F (1998) How effective are neural networks at forecasting and prediction? A
review and evaluation. J Forecast 17(56):481–495
11. Chatfield C (1993) Neural networks: forecasting breakthrough or passing fad? Int J Forecast
9(1):1–3
12. Sharda R, Patil RB (1992) Connectionist approach to time series prediction: an empirical test. J
Intell Manuf 3(1):317–323
13. Makridakis S, Spiliotis E, Assimakopoulos V (2018) Statistical and machine learning forecasting
methods: concerns and ways forward. PLoS ONE 13(3):e0194889
14. Makridakis S, Spiliotis E, Assimakopoulos V (2018) The M4 Competition: results, findings, con-
clusion and way forward. Int J Forecast 34(4):802–808
15. Zemkoho A (2021) Forecasting. School of Mathematical Sciences, University of Southampton, Lec-
ture Notes
16. Brownlee J (2018) Introduction to time series forecasting with Python. Ebook available at https://machinelearningmastery.com/introduction-to-time-series-forecasting-with-Python/ (Accessed 15 Nov 2019)
17. Korstanje J (2021) Advanced forecasting with Python. Apress
18. Lazzeri F (2021) Machine learning for time series forecasting with Python. J Wiley & Sons
19. Dickey DA, Fuller WA (1979) Distribution of the estimators for autoregressive time series with a
unit root. J Am Stat Assoc 74:427–431
20. Montgomery DC, Peck EA, Vining GG (2021) Introduction to linear regression analysis. J Wiley &
Sons
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Data Mining and Knowledge Discovery (2023) 37:788–832
https://doi.org/10.1007/s10618-022-00894-5
Forecast evaluation for data scientists: common pitfalls and best practices
Hansika Hewamalage · Klaus Ackermann · Christoph Bergmeir
Received: 4 April 2022 / Accepted: 7 November 2022 / Published online: 2 December 2022
© The Author(s) 2022
Abstract
Recent trends in the Machine Learning (ML) and in particular Deep Learning (DL)
domains have demonstrated that with the availability of massive amounts of time
series, ML and DL techniques are competitive in time series forecasting. Neverthe-
less, the different forms of non-stationarities associated with time series challenge
the capabilities of data-driven ML models. Furthermore, due to the domain of fore-
casting being fostered mainly by statisticians and econometricians over the years,
the concepts related to forecast evaluation are not the mainstream knowledge among
ML researchers. We demonstrate in our work that as a consequence, ML researchers
oftentimes adopt flawed evaluation practices, which result in spurious conclusions
suggesting methods that are not competitive in reality to be seemingly competitive.
Therefore, in this work we provide a tutorial-like compilation of the details associated
with forecast evaluation. This way, we intend to impart the information associated
with forecast evaluation to fit the context of ML, as a means of bridging the knowledge
gap between traditional methods of forecasting and adopting current state-of-the-art
ML techniques. We elaborate on the details of the different problematic characteristics of
time series such as non-normality and non-stationarities and how they are associated
with common pitfalls in forecast evaluation. Best practices in forecast evaluation are
outlined with respect to the different steps such as data partitioning, error calculation,
B Christoph Bergmeir
[email protected]
Hansika Hewamalage
[email protected]
Klaus Ackermann
[email protected]
1 School of Computer Science & Engineering, University of New South Wales, Sydney, Australia
2 SoDa Labs and Department of Econometrics & Business Statistics, Monash Business School,
Monash University, Melbourne, Australia
3 Department of Data Science and AI, Faculty of IT, Monash University, Melbourne, Australia
statistical testing, and others. Further guidelines are also provided along selecting valid
and suitable error measures depending on the specific characteristics of the dataset at
hand.
1 Introduction
In the present era of Big Data, Machine Learning (ML) and Deep Learning (DL)
based techniques are driving the automatic decision making in many domains such
as Natural Language Processing (NLP) or Time Series Classification (TSC, Bagnall
et al. 2016; Fawaz et al. 2019). Although fields such as NLP and Computer Vision
have heavily been dominated by ML and DL based techniques for decades by now, this
has hardly been the case for the field of forecasting, until very recently. Forecasting
was traditionally the field of statisticians and econometricians. However, with massive
scales of data being collected nowadays, ML and DL have now emerged as the state
of the art for many forecasting tasks. Furthermore, with many companies hiring data
scientists, often these data scientists are tasked with forecasting. Therefore, now in
many situations practitioners tasked with forecasting have a good background in ML
and data science, but are less aware of the decades of research in the forecasting space.
This involves many aspects of the process of forecasting, from the point of data pre-
processing, building models to final forecast evaluation. Due to the self-supervised
and sequential nature of forecasting tasks, it is often associated with many pitfalls that
usual ML practitioners are not aware of. The use of poor evaluation practices worsens
this problem, since such practices fail to clearly distinguish the truly competitive methods
from the inferior ones and to avoid spurious results. Evaluating the performance of
models is key to the development of concepts and practices in any domain. Hence, in
this particular work, we focus on the evaluation of point forecasts as a key step in the
overall process of forecasting.
The general process of forecast evaluation involves employing a number of models
having different characteristics, training them on a training dataset and then applying
them on a validation set afterwards. Then, model selection may be performed by eval-
uating on the validation set to select the best models. Otherwise, ensemble models
may be developed instead, by combining the forecasts from all the different models,
and usually a final evaluation is then performed on a test set (Godahewa et al. 2021).
In research areas such as classification and regression, there are well-established stan-
dard practices for evaluation. Data partitioning is performed using a standard k-fold
Cross-Validation (CV) to tune the model hyperparameters based on the error on a vali-
dation set; the model with the best hyperparameter combination is tested on the testing
set; standard error measures such as squared errors, absolute errors or precision, recall,
or area under the curve are computed; and finally the best models are selected. These
best methods may continue to deliver reasonable predictions for a certain problem
task, i.e., they generalize well, under the assumption that there are no changes of the
distribution of the underlying data, which otherwise would need to be addressed as
concept drift (Webb et al. 2016; Ghomeshi et al. 2019; Ikonomovska et al. 2010) or
non-stationarity.
In contrast, evaluating forecasting models can be a surprisingly complicated task,
already for point forecasting. Data partitioning has many different options in the con-
text of forecasting, including fixed origin, rolling origin evaluation and other CV
setups as well as controversial arguments associated with them. Due to the inherent
dependency, non-stationarity and non-normality of time series, these choices are com-
plex. Also, most error measures are susceptible to breaking down under some of these
conditions. Other considerations are whether to summarize errors across all available
time series or consider different steps of the forecast horizon separately etc. As a
consequence, we regularly come across papers in top Artificial Intelligence (AI)/ML
conferences and journals (even winning best paper awards) that use inadequate and
misleading benchmark methods for comparison (e.g., non-seasonal models for long-
term forecasting on seasonal series), others that use mean absolute percentage error
(MAPE) for evaluation with series, e.g., with values in the [−1, 1] interval because
the authors think the MAPE is a somewhat generic “time series error measure”, even
though MAPE is clearly inadequate in such settings. Other works make statements
along the lines of Auto-Regressive Integrated Moving Average (ARIMA) being able
to tackle non-stationarity whereas ML models can’t, neglecting that the only thing
ARIMA does is a differencing of the series as a pre-processing step to address non-
stationarity, a step that can easily be done as preprocessing for any ML method as
well. In other works, we see methods compared using Mean Absolute Error (MAE)
as the error measure, and only the proposed method by those authors is trained with
L1 loss, all other competitors with L2 loss, which leads to unfair comparisons as the
L1 loss optimizes towards MAE, whereas the L2 loss optimizes towards Root Mean
Squared Error (RMSE). Many other works evaluate on a handful of somewhat ran-
domly picked time series and then show plots of forecasts versus actuals as “proof” of
how well their method works, without considering simple benchmarks or meaningful
error measures, and other similar problems. Also, frequently forecasting competitions
and research works introduce new evaluation measures and methodologies, some-
times neglecting the prior research, e.g., by seemingly not understanding that dividing
a series by its mean will not solve scaling issues for many types of non-stationarities
(e.g., strong trends). Thus, there is no generally accepted standard for forecast eval-
uation in every possible scenario. This gap has harmed the evaluation practices used
along with ML methods for forecasting significantly in the past. It is damaging the
area currently, with spurious results in many papers, with researchers new to the field
not being able to distinguish between methods that work and methods that don’t, and
the associated waste of resources.
Overall, this article makes an effort in the direction of raising awareness among ML
practitioners regarding the best practices and pitfalls associated with the different steps
of the point forecast evaluation process. Similar exhaustive efforts have been taken
in the literature to review, formally define and categorize other important concepts
in the ML domain such as concept drift (Webb et al. 2016), concept drift adapta-
tion (Gama et al. 2014) and mining statistically sound patterns from data (Hämäläinen
and Webb 2019). In the time series space, there exist less comprehensive and/or systematic works
addressing certain aspects of our work. Cerqueira et al. (2020) have per-
formed empirical studies using different data partitioning and performance estimation
methods on some real-world and synthetic datasets and presented guidelines around
which methods work under different characteristics of time series. In the work by
Petropoulos (2022) as well, those authors have a section dedicated to explaining fore-
cast evaluation measures, best practices for both point and probabilistic forecasting as
well as benchmarking. Ditzler et al. (2015) have conducted a survey on existing meth-
ods for learning in non-stationary environments and the associated difficulties and
challenges. In the work by Shcherbakov et al. (2013), those authors have presented a
review of several error measures for forecast evaluation along with their drawbacks,
and also proposed a new measure specifically designed to be robust to outliers in
time series. Recommendations have also been given around selecting error measures
under a specific context. Gujarati (2021) has provided a comprehensive overview on
recent developments in econometric techniques in general using many examples.
The rest of this paper is structured as follows. Section 2 first introduces ter-
minology associated with forecast evaluation, including different forms of non-
stationarities/non-normality seen in time series data. Next, Sect. 3 details the
motivation for this article, along with common pitfalls seen in the literature related
to using sufficient datasets, selecting appropriate measures for evaluation, using com-
petitive benchmarks, visualisation of results using forecast plots and data leakage
in forecast evaluation. Then, in Sect. 4, we provide best practices and guidelines
around different aspects of forecast evaluation including how to best partition the
data for a given forecasting problem with non-stationarities involved with the series,
how to select evaluation measures depending on the characteristics of the time series
under consideration and details of popular techniques used for statistical testing for
significance of differences between models. Finally, Sect. 5 concludes the paper by
summarising the overall findings of the paper and highlighting the best practices for
forecast evaluation. The code used for this work is publicly available for reproducibil-
ity of the results.1
In this section we provide a general overview of the terminology used in the context
of forecast evaluation. This article focuses on point forecast evaluation, where the interest is to evaluate one
particular statistic (mean/median) of the overall forecast distribution. However, we
note that there are many works in the literature around predicting distributions and
evaluating accordingly. Figure 1 indicates a common forecasting scenario with the
training region of the data, the forecast origin (the last known data point from
which the forecasting begins) and the forecast horizon.
In forecast evaluation, similar to other ML tasks, validation and test sets are used
for hyperparameter tuning of the models and for testing. Evaluations on validation and
test sets are often called out-of-sample (OOS) evaluations in forecasting. The two main
setups for OOS evaluation in forecasting are fixed origin evaluation and rolling origin
evaluation (Tashman 2000). Figure 2 shows the difference between the two setups.
1 Available at https://github.com/HansikaPH/Forecast_Evaluation_Pitfalls.
Fig. 1 A forecasting scenario with training region of the data, forecast origin and the forecast horizon
In the fixed origin setup, the forecast origin is fixed as well as the training region,
and the forecasts are computed as one-step ahead or multi-step ahead depending on
the requirements. In the rolling origin setup, the size of the forecast horizon is fixed,
but the forecast origin changes over the time series (rolling origin), thus effectively
creating multiple test periods for evaluation. With every new forecast origin, new data
becomes available for the model which can be used for re-fitting of the model. The
rolling origin setup is also called time series cross-validation (tsCV) and prequential
evaluation in the literature (Hyndman and Athanasopoulos 2018; Gama et al. 2013).
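To make the two OOS setups concrete, the following minimal Python sketch (our own illustration, not code from the paper) contrasts a fixed origin split with a rolling origin scheme, using a one-step-ahead naïve forecast in the rolling case; the synthetic series and split points are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(size=200).cumsum()  # a synthetic random-walk-like series

# Fixed origin: a single training/test split, with multi-step forecasts
# produced once from the fixed origin.
origin = 150
train, test = y[:origin], y[origin:]
fixed_origin_naive = np.repeat(train[-1], len(test))  # naive forecast stays flat
fixed_rmse = np.sqrt(np.mean((test - fixed_origin_naive) ** 2))

# Rolling origin (tsCV): the origin moves forward one step at a time,
# yielding multiple one-step-ahead forecasts and evaluations.
rolling_errors = []
for t in range(origin, len(y) - 1):
    train_t = y[: t + 1]          # data available up to the current origin
    forecast_t = train_t[-1]      # one-step-ahead naive forecast
    rolling_errors.append(y[t + 1] - forecast_t)
rolling_rmse = np.sqrt(np.mean(np.square(rolling_errors)))

print(f"Fixed origin RMSE:   {fixed_rmse:.3f}")
print(f"Rolling origin RMSE: {rolling_rmse:.3f}")
```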
Time series can have different forms of non-stationarities and non-normality and
they make time series forecasting and evaluation a more difficult problem in com-
parison to other ML tasks. Listed below are some of such possibly problematic
characteristics of time series.
1. Non-stationarities
• Seasonality
• Trends (Deterministic, e.g., Linear/Exponential)
• Stochastic Trends / Unit Roots
• Heteroscedasticity
• Structural Breaks (sudden changes, often with level shifts)
2. Non-normality
• Non-symmetric distributions
• Fat tails
• Intermittency
• Outliers
3. Series with very short history
Fig. 2 Comparison of fixed origin versus rolling origin setups. The blue and orange data points represent
the training and testing sets respectively at each evaluation. The figure on the left side shows the fixed origin
setup where the forecast origin remains constant. The figure on the right shows the rolling origin setup
where the forecast origin rolls forward and the forecast horizon is constant. The red dotted lined triangle
encloses all the time steps used for testing across all the evaluations. Compared to the fixed origin setup, it
is seen that in the rolling origin setup, testing data instances in each evaluation pass on to the training set in
the next evaluation step
Non-stationarity in general means that the distribution of the data in the time series is
not constant, but it changes depending on the time (see, e.g., Salles et al. 2019). What
we refer to as non-stationarity in this work is the violation of strong stationarity defined
as in Eq. (1) (Cox and Miller 1965). Strong stationarity is defined as the distribution
of a finite window (sub-sequence) of a time series (discrete-time stochastic process)
remaining the same as we shift the window across time. In Eq. (1), yt refers to the time
series value at time step t; τ ∈ Z is the size of the shift of the window and n ∈ N is the
size of the window. FY (yt+τ , yt+1+τ , ..., yt+n+τ ) refers to the cumulative distribution
function of the joint distribution of (yt+τ , yt+1+τ , ..., yt+n+τ ). Hence, according to
Eq. (1), FY is not a function of time, it does not depend on the shift of the window.
In the rest of this paper, we refer to the violation of strong stationarity simply as
non-stationarity.
$$F_Y(y_{t+\tau}, y_{t+1+\tau}, \ldots, y_{t+n+\tau}) = F_Y(y_t, y_{t+1}, \ldots, y_{t+n}), \quad \text{for all } \tau \in \mathbb{Z} \text{ and } n \in \mathbb{N} \qquad (1)$$
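As an illustration of how one form of such non-stationarity might be detected in practice, the sketch below (not a procedure prescribed by the authors) applies the Augmented Dickey–Fuller unit-root test from statsmodels to a simulated random walk and to its first difference.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
random_walk = rng.normal(size=500).cumsum()   # unit-root (stochastic trend) series

# ADF null hypothesis: the series has a unit root (is non-stationary).
p_level = adfuller(random_walk)[1]
p_diff = adfuller(np.diff(random_walk))[1]

print(f"p-value on the level series:      {p_level:.3f}")  # typically large: unit root not rejected
print(f"p-value after lag-1 differencing: {p_diff:.3f}")   # typically tiny: stationary remainder
```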
Fig. 3 Forecasts from different models on a series with unit root based non-stationarity, with stochastic
trends. In this example, we have a continuously increasing series (increasing mean) due to the unit root.
The ML models are built as autoregressive models without any pre- or post-processing, and as such have
very limited capacity to predict values beyond the domain of the training set, seen in the second part of the
test set where predictions are considerably worse than in the first part
Neither trend nor seasonality are concepts that have precise formal definitions. They are usually
merely defined as smoothed versions of the time series, where for the seasonality
the smoothing occurs over particular seasons (e.g., in a daily series, the series of all
Mondays needs to be smooth, etc.). Heteroscedasticity changes the variance of the
series and structural breaks can change the mean or other properties of the series.
Structural break is a term used in Econometrics and Statistics in a time series context
to describe a sudden change at a certain point in the series. It therewith has considerable
overlap with the notion of sudden concept drift in an ML environment, where a sudden
change of the data distribution is observed (Webb et al. 2016).
On the other hand, data can be far from normality, for example having fat tails,
or when conditions such as outliers or intermittency are observed in the series. Non-
stationarities and non-normality are both seen quite commonly in many real-world
time series and the decisions taken during forecast evaluation depend on which of
these characteristics the series have. There is no single universal rule that applies to
every scenario.
As briefly explained in Sect. 1, there exist many ML based papers for forecasting in
the recent literature that are flawed or at least weak with regards to forecast evaluation.
This section is devoted to providing the motivation for our work by discussing the most
common problems and pitfalls associated with forecast evaluation in the recent
literature.
In much of the recent ML based forecasting literature, newly proposed algorithms are not rigorously compared against
the relevant benchmarks.
Figure 5 illustrates the behaviour of different models that have been trained with
differencing as appropriate preprocessing on a series that has a unit root based non-
stationarity. If the series has no further predictable properties above the unit root (as
in this example), i.e., it is a random walk where the innovation added to the last
observation follows a normal distribution with a mean of zero, the naïve forecast is the
theoretically best forecast, as also suggested by the RMSE values reported in Table 1.
Other, more complex forecasting methods in this scenario will have no true predictive
power beyond the naïve method, and any potential superiority, e.g., in error evaluations,
cannot be meaningfully assessed. More complex methods will in such series usually show a
behaviour where they mostly follow the series in the same way as the naïve fore-
cast, and improvements are often small percentages over the performance of the naïve
benchmark.
Fig. 5 Forecasts from different models on a series with unit root based non-stationarity, with stochastic
trends. The ML models are built as autoregressive integrated models, i.e., differencing has been done as
pre-processing. The methods show very similar behaviour to the naïve forecast, and do not add any value
over it by definition of the Data Generating Process (DGP) used
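The following sketch (our own illustration, not code from the paper) simulates a random walk and compares the RMSE of the naïve forecast with that of a purely autoregressive ML model (a random forest on lagged values); on such a series the ML model should not be able to systematically beat the naïve benchmark.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
y = rng.normal(size=1000).cumsum()          # pure random walk (unit root, no signal)

def lag_matrix(series, n_lags):
    """Build a design matrix of lagged values for an autoregressive model."""
    X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
    return X, series[n_lags:]

n_lags, split = 10, 800
X, target = lag_matrix(y, n_lags)
X_train, y_train = X[: split - n_lags], target[: split - n_lags]
X_test, y_test = X[split - n_lags:], target[split - n_lags:]

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
rmse_rf = np.sqrt(np.mean((y_test - rf.predict(X_test)) ** 2))
rmse_naive = np.sqrt(np.mean((y_test - X_test[:, -1]) ** 2))  # last lag = naive forecast

print(f"Random forest RMSE: {rmse_rf:.3f}")
print(f"Naive RMSE:         {rmse_naive:.3f}")
```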
Financial time series such as exchange rates and stock prices are particularly prob-
lematic to forecast. For example, exchange rates are a function of current economic
conditions and expectations about future valuations. Simultaneously, currencies are
traded on the futures market (e.g., a market participant commits to buying X amount
of US dollars in 1 year at a price of Y amount of Australian dollars), providing a mar-
ket expectation of future price movements. The survey by Rossi (2013) has analysed
the literature on exchange rate forecasting based on additional economic information
and concluded that the most challenging benchmark is the random walk without drift
model. Yet, ML based researchers have continued to introduce sophisticated Neural
Network (NN) models for exchange rate forecasting without proper benchmarking. In
the work by Wu et al. (2021), those authors have introduced a transformer based model
with an embedded decomposition block and an autocorrelation mechanism to address
long-term time series properties, called Autoformer. Their evaluation setup includes
an exchange rate dataset used in many recent papers of this type (Lai et al. 2018), to be
forecasted 720 days into the future. Predicting daily exchange rates based on only past
exchange rates nearly 2 years into the future may sound like an outrageous claim to
Economists already, and those authors themselves state in that paper, that the dataset
contains no obvious periodicities and thus is hard to be predicted compared to other
datasets. It is thus unclear how the decomposition mechanism used in their proposed
model should in any way make a valid contribution to predicting these series. As those
authors have not compared their model against the naïve benchmark, we experiment
using a naïve forecast on this exchange rate dataset, under the same evaluation setup
as those authors. The results are as reported in Table 3. Table 3 reports the results
for Autoformer both from our experiments as well as the experiments reported in the
paper. As seen here, the error values that we get for Autoformer are slightly different
from the error values reported in the paper, due to the randomness of the seed values
used in the experiments. Regardless, the naïve forecast beats both the results from
Autoformer across all the horizon sizes tested by a considerable margin, indicating
that the proposed method (and all comparison methods used in the original paper)
is essentially useless on this particular dataset. Also keep in mind that in this exam-
ple Autoformer takes hours to run on CPU or alternatively needs a GPU with 24GB
of memory, to finally arrive at results that are worse than trivial results that require
essentially no computation at all.
More recently, in the work by Zhou et al. (2022a), those authors have proposed
a Frequency improved Legendre Memory (FiLM) model which helps with removing
noisiness in signals and also preserves historical information for long-term forecasting.
In that paper too, those authors have experimented on the same exchange rate dataset.
According to the results reported in that paper, that model outperforms the naïve
forecast on the longest horizon size of 720 on the multivariate forecasting study of the
exchange rate dataset (the FiLM model has reported an MSE and MAE of 0.727 and
0.669, respectively, whereas the naïve forecast has an MSE of 0.817 and an MAE of
0.694 as reported in Table 3). We have attempted to reproduce the same results of the
Table 3 Results from the naïve forecast and the Autoformer model on the exchange rate dataset
forecasts for them a fundamentally flawed task. One issue is that exchange rate data
(and in particular this dataset) is based on trading days, meaning that the time series
that all the aforementioned works have dealt with do not contain weekends and are not
equally spaced, so that any comments on seasonality and cycle length in these papers
are likely wrong. However, the most important point is that data is more than input into
an algorithm. The large body of literature in economics and finance accumulated over 50 years states
that it is not sensible to forecast exchange rate time series, as such predictability would contradict the efficient
market hypothesis (Fama 1970). The nature of a market is that the price reflects all the
information publicly available, and even if it does not do it for a short period (such as
minutes or days; or milliseconds in high-frequency trading), and some investors enjoy
extra information, they will act on it, and the market price will adapt. There is a known
persistence in the return volatility of foreign exchange rate markets (Berger et al.
2009). Still, there is no evidence that it is reasonable to attempt to forecast exchange
rates 720 days into the future. The final open question of forecasting these exchange
rates, completely left out by the aforementioned literature, is why we are forecasting
exchange rates in the first place. Is the intention to trade on that information, or is it
for risk management? How does an error measure that translates to being wrong more than
50% of the time lead to anything other than the bankruptcy of the user? Would
the authors themselves be satisfied that their pension fund is using their own model
for investing their money? We guess it is fair to answer this with no.
Similar considerations hold for stock price forecasting. Some examples from the
recent ML literature in this area that benchmark on stock market related data without
comparisons against the naïve benchmark are Shen et al. (2020); Du et al. (2021); Lin
et al. (2021). Stock market data is another classical example where data is abundant,
but stock returns are deemed to be “almost unpredictable” (Engle 2003) in the classic
Economics literature, especially when using only past stock prices as inputs, as
stock prices are again assumed to not be a function of their own past but of current
market conditions and expectations about future valuations, and in an efficient mar-
ket, forecasting using only past stock price data will not yield results more accurate
than a naïve forecast. It is important to note in this context that this holds for stock
prices and returns, but not volatility, which is predictable, e.g., using autoregressive
conditional heteroskedasticity (ARCH), a finding which led to the award of the 2003
Nobel Memorial Prize in Economic Sciences to Robert F. Engle (Engle 2003).
As such, papers that claim that they can predict stock prices or returns, or exchange
rates based on historic readings of these same signals alone need to be aware that
their claims contradict some central notions in Economics and that they need to be
evaluated very rigorously, as their results are likely to be spurious.
On series that have clear seasonal patterns, models should accordingly be bench-
marked against the seasonal naïve model as the most simplistic benchmark, and also
other simple benchmarks are commonly used in forecasting. In the work by Zhou et al.
(2021) those authors have proposed a novel memory and time efficient transformer
based architecture, namely Informer for long sequence forecasting. That paper has
also won the outstanding paper award at the Association for the Advancement of Arti-
ficial Intelligence (AAAI) conference 2021. In that work several experiments have
been conducted using Electricity Transformer Temperature data (ETT), Electricity
Consumption Load (ECL)2 data and Weather data. The ETT and ECL hourly datasets
clearly show strong multiple seasonal patterns (being hourly series, daily, weekly,
and yearly patterns are to be expected). However, the Informer model has only been
benchmarked against non-seasonal ARIMA which is not capable of handling multiple
seasonalities, and is a grotesquely misspecified model that would not be used in prac-
tice. To claim its superior performance in the long horizon forecasting problems, the
proposed Informer model in this case needs to be compared against statistical standard
benchmarks that inherently handle multiple seasonalities well, such as the Dynamic
Harmonic Regression ARIMA (DHR-ARIMA) model and the TBATS model (Hynd-
man and Athanasopoulos 2018). To demonstrate this, we conduct an experiment with
a DHR-ARIMA model on the ETTh1 and the ECL datasets on their respective longest
horizon sizes (720 for the ETTh1 dataset and 960 for the ECL dataset) for the uni-
variate forecasting task. For the ETTh1 dataset, daily and yearly seasonal patterns are
incorporated, whereas for the ECL dataset, all daily, weekly and yearly seasonalities
are included using Fourier terms in the DHR-ARIMA model. The results are reported
in Table 4, along with the results for the benchmark models shown in the original
paper. The horizon size is shown within parentheses next to the dataset name in Table
4. As seen from these results, when the Fourier terms are incorporated to capture the
multiple seasonalities, the standard DHR-ARIMA can outperform ARIMA as well as
the two variants of the proposed algorithm, Informer and Informer† .
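As a rough illustration of the kind of seasonal benchmark meant here, the sketch below fits a Dynamic Harmonic Regression ARIMA in Python by passing Fourier terms for daily and weekly seasonalities of an hourly series as exogenous regressors to a SARIMAX model; the ARIMA order, Fourier orders and the synthetic data are placeholder choices, not the configuration used in the experiments reported above.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.deterministic import DeterministicProcess, Fourier
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic hourly series with daily (24) and weekly (168) seasonal patterns.
rng = np.random.default_rng(7)
idx = pd.date_range("2021-01-01", periods=24 * 7 * 20, freq="h")
t = np.arange(len(idx))
y = (10
     + 3 * np.sin(2 * np.pi * t / 24)
     + 2 * np.sin(2 * np.pi * t / 168)
     + rng.normal(scale=0.5, size=len(idx)))
y = pd.Series(y, index=idx)

horizon = 168
train, test = y[:-horizon], y[-horizon:]

# Fourier terms for the two seasonal periods act as exogenous regressors.
dp = DeterministicProcess(train.index,
                          additional_terms=[Fourier(period=24, order=3),
                                            Fourier(period=168, order=3)])
exog_train = dp.in_sample()
exog_future = dp.out_of_sample(steps=horizon)

# DHR-ARIMA: a low-order ARMA error process on top of the Fourier regression.
model = SARIMAX(train, exog=exog_train, order=(1, 0, 1), trend="c")
result = model.fit(disp=False)
forecast = result.forecast(steps=horizon, exog=exog_future)

mae = np.mean(np.abs(test.values - forecast.values))
print(f"DHR-ARIMA MAE over a {horizon}-step horizon: {mae:.3f}")
```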
Apart from that, the recent work by Zeng et al. (2022) has challenged the long-term
time series forecasting capability of transformer based models in general by comparing
against a relatively simple linear layer based NN, i.e., a set of linear models trained
for the forecasting horizon in question directly. As those authors have stated in their
work, most of the performance gains of the aforementioned transformer based models
for long-term forecasting are due to comparing their direct multi-step ahead forecasts
against iterative forecasts that are produced from more traditional methods, which
2 https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
inherently have error accumulation issues due to the recursive nature of forecasting.
This claim once again emphasises the need to perform comparisons with the right
and the most competitive established forecasting benchmarks for the relevant study,
as directly trained linear models have been shown to outperform all the considered
transformer architectures in that work.
Another common problem in the ML based forecasting literature is that many works
do not use sufficient amounts of datasets/time series in their experiments to reason-
ably claim superior performance of the proposed algorithms. While it may be
somewhat subjective what amount of series is sufficient, oftentimes papers use only
a handful of series when the authors clearly don’t seem to care about their particu-
lar application and/or when hundreds of series could be readily available for the same
application case, e.g., in notorious stock return prediction tasks. Some examples along
these lines (there are many more in the literature) are the works of Liu et al. (2021,
2020); Godfrey and Gashler (2018); Shen et al. (2020), and Zhang et al. (2021). In
particular, Zhang et al. (2021) use 3 time series in total, a simulated AR(1) process, a
bitcoin price series and an influenza-like illness series, to evaluate their non-parametric
neural network method. While the influenza-like illness series may be a good fore-
casting case study, basically the same considerations as for exchange rates and stock
prices hold for bitcoin prices, though bitcoin was presumably a less efficient market,
especially in its infancy. The best model to forecast an AR(1) process is trivially an
AR(1) model (which is not used as a benchmark in that paper), so fitting complex neu-
ral networks to this series makes very limited sense.3 The authors are here effectively
fitting a neural network to model a 2-dimensional linear relationship plus noise.
3 One could argue that not always the true data generating process (DGP) is the best forecasting model,
but this usually happens for complex DGPs where not enough data is available to estimate their parameters
correctly, so that simpler models perform better for forecasting. However, an AR(1) is already very simple,
so that compared to a considerably more complex neural network this consideration seems not relevant
here.
A variety of evaluation measures have been proposed for forecast evaluation over the
years, and thus ML based forecasting researchers often seem unable
to clearly pick the evaluation measures that best suit their requirements and the data
at hand. For example, in the work by Lai et al. (2018), those authors have used the two
measures Root Relative Squared Error (RSE) and Empirical Correlation Coefficient
(CORR) for evaluation, which both use scaling based on the mean of the time
series. While this may work as a scaling technique for time series that have minimal
or no trends, for series that contain trend based non-stationarities this does not scale
the series meaningfully. Yet, this information is only implicit and not conveyed to the
reader in their work. Consequently, there have been many other works which followed
the same evaluation setup and the measures without any attention to whether the used
series contain trends or not (examples are Guo et al. 2022; Shih et al. 2019; Wu et al.
2020; Ye et al. 2022). Although this allows for direct comparisons against previous
work, it has also caused all successive works to overlook the same issues that the used
error measures have with trended time series.
Some works also use scale-dependent measures such as Mean Squared Error (MSE),
RMSE and MAE on multivariate datasets having many time series (examples are Cui
et al. 2021; Du et al. 2021; Ye et al. 2022). While this is reasonable if all the series in
the dataset have similar scales, if the scales are different, this means that the overall
error value would be driven by particular series. Some have used the coefficient of
determination (R2) between the forecasts and the actual values as a forecast evaluation
measure as well (for example Shen et al. 2020; Zhou et al. 2022). This can be a quite
misleading evaluation measure, especially in the case of random walk time series,
where forecasts that merely follow the series one step behind may give almost perfect
R2 values (close to 1), seemingly indicating competitive performance of the model, whereas in reality
the series does not have any predictable patterns at all. MAPE is another evaluation
measure commonly applied incorrectly on series having very small values in the range
[−1, 1] (examples are Moon et al. 2022; Wu et al. 2020). Because the denominator of
the MAPE is the actual value of the time series, on series having values close
to 0, MAPE gives excessively large values irrespective of the quality of the prediction.
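Both pitfalls can be reproduced with a few lines of Python (our own illustrative sketch): on a random walk, the R2 between lag-1 "forecasts" and actuals is close to 1 even though there is nothing to predict, and the MAPE explodes on a series with values close to zero.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pitfall 1: near-perfect R^2 on an unpredictable random walk.
walk = rng.normal(size=2000).cumsum()
actuals, naive = walk[1:], walk[:-1]          # naive "forecast" just lags the series
ss_res = np.sum((actuals - naive) ** 2)
ss_tot = np.sum((actuals - actuals.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 of the naive forecast on a random walk: {r2:.4f}")

# Pitfall 2: MAPE blows up when actual values are close to zero.
small_actuals = rng.uniform(-1, 1, size=100)
forecasts = small_actuals + rng.normal(scale=0.05, size=100)  # very accurate forecasts
mape = np.mean(np.abs((small_actuals - forecasts) / small_actuals)) * 100
print(f"MAPE on a series with values in [-1, 1]: {mape:.1f}%")
```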
Plots with time series forecasting results can be quite misleading and should be used
with caution. Analysing plots of forecasts from different models along with the actuals
and concluding that they seem to fit well can lead to wrong conclusions. It is important
to use benchmarks and evaluation metrics that are right for the context. In a scenario
like a random walk series as in Fig. 5, as stated before, visually our models may look
like they achieve similar or better accuracy than the naïve method, but this is a spurious
impression. The visual appeal of a generated forecast, or the general plausibility of such a forecast,
are not good criteria to judge forecasts. However, much of the recent
forecasting literature seems to use forecast plots that do not convey much information
regarding the performance of the methods (for example Liu et al. 2021, 2020; Du et al.
2021).
Figure 6a shows a monthly time series with yearly seasonal patterns along with
forecasts from the ETS model. The figure furthermore shows the forecasts under
fixed origin and rolling origin data partitioning schemes for the naïve forecast. When
periodic re-fitting is done with new data coming in as in a rolling origin setup, the
naïve forecast gets continuously updated with the last observed value. For the fixed
origin context on the other hand, the naïve forecast remains constant as a straight line
corresponding to the last seen observation in the training series. We see that with a
rolling-origin naïve forecast, the predictions tend to look visually very appealing, as
the forecasts follow the actuals and our eyes are deceived by the smaller horizontal
distances instead of the vertical distances that are relevant for evaluation. Figure 6b
illustrates this behaviour. It is clear how the horizontal distance between the actuals
and the naïve forecast at both points A and B are much less compared to the vertical
distances which are the relevant ones for evaluation. In these situations we need to
rely on the error measures, as the plots do not give us much information. As reported
in Table 5, for this scenario the ETS forecasts have a smaller RMSE error compared
to both rolling origin and fixed origin naïve forecasts.
Figure 7 shows another series having unit root based non-stationarity and fixed
origin forecasts from several models and the naïve forecast for a forecast horizon of
60 time steps ahead. This shows another issue with using plots to determine forecast
accuracy. As explained previously, on a random walk time series, a naïve forecast is
the theoretically best forecast that can be obtained. This is also confirmed by the RMSE
values for these forecasts from the different models as reported in Table 6. However,
Fig. 7 Fixed origin forecasts from several models and the naïve forecast on a random walk time series
the naïve forecast for fixed origin is a constant. Although this does not look realistic,
and in most application domains we can be certain that the actuals will not be constant,
practitioners may mistakenly identify such behaviour as a potential problem with the
models, whereas this forecast is indeed the best possible forecast in the sense that it
minimizes the error based on the information available at present.
In summary, plots of the forecasts can be deceiving and should be used mostly for
sanity checking. Decisions should mostly be made based on evaluations with error
measures and not based on plots.
Data leakage refers to the inadvertent use of data from the test set, or more generally
data not available during inference, while training a model. It is always a potential
problem in any ML task. For example, Kaufman et al. (2012) present an extensive
review on the concept of data leakage for data mining and potential ways to avoid it.
Arnott et al. (2019) discuss this in relation to the domain of finance. Hannun et al.
(2021) propose a technique based on Fisher information that can be used to detect data
leakage of a model with respect to various subsets of the dataset. Brownlee (2020)
also provide a tutorial overview on data preparation for common ML applications
while avoiding data leakage in the process. However, in forecasting, data leakage can
happen more easily and can be harder to avoid than in other ML tasks such as classifica-
tion/regression.
Fig. 8 Forecasts from a model with leakage and no leakage on a time series having unit root based non-
stationarity
dataset can be equally affected. Therefore, if the series in the dataset are not aligned
and one series contains the future values with respect to another, when splitting the
training region, future information can be already included within the training set.
However, in real-world applications series are usually aligned, so that this is not a big
problem. On the other hand, in a competition setup such as the M3 and M4 forecasting
competitions (Makridakis and Hibon 2000; Makridakis et al. 2020), where the series
are not aligned, this can easily happen (Talagala 2020).
Data leakage can also happen simply due to using the wrong forecast horizon. This
can happen by using data that in practice will become available later. For example,
we could build a one-day-ahead model, but use summary statistics over the whole
day. This means that we cannot run the model until midnight, when we have all data
from that day available. If the relevant people who use the forecasts work only from
9am-5pm, it becomes effectively a same-day model. The other option is to set the day
to start and end at 5pm everyday, but that may lead to other problems.
In conclusion, data leakage dangers are common in self-supervised forecasting
tasks. It is important to avoid leakage problems
1) in rolling origin schemes, by being able to verify and trust the implementation, as external evaluation can be difficult;
2) during preprocessing of the data (normalising, smoothing etc.) and when extracting features such as tsfeatures, by splitting the data into training and test sets beforehand;
3) by making sure that within a set of series, one series does not contain in its training period potential information about the future of another series.
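A minimal sketch of point 2), fitting pre-processing only on the training portion, is shown below (our own illustration, assuming a simple z-score normalisation); the leaky variant fits the scaler on the full series and therefore lets test-set statistics influence the training data.

```python
import numpy as np

rng = np.random.default_rng(5)
series = rng.normal(loc=100, scale=10, size=300) + np.linspace(0, 50, 300)  # trended series
split = 250
train, test = series[:split], series[split:]

# Leaky: normalisation statistics are computed on the whole series,
# so information about the test period leaks into training.
leaky_mean, leaky_std = series.mean(), series.std()
train_leaky = (train - leaky_mean) / leaky_std

# Leak-free: statistics are computed on the training region only and
# then re-used, unchanged, to transform the test region.
train_mean, train_std = train.mean(), train.std()
train_clean = (train - train_mean) / train_std
test_clean = (test - train_mean) / train_std

print(f"Mean used by the leaky scaler:     {leaky_mean:.2f}")
print(f"Mean used by the leak-free scaler: {train_mean:.2f}")
```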
Forecast model building and evaluation typically encompasses the following steps.
• Data partitioning
• Forecasting
• Error Calculation
• Error Measure Calculation
• Statistical Tests for Significance (optional)
The process of evaluation in a usual regression problem is quite straightforward.
The best model out of a pool of fitted models is selected based on the value of a final
error measure on the validation set. The relevant error measures used etc. are standard
and established as best practices in these domains. However, when it comes to forecast
evaluation, many different options are available for each of the aforementioned steps
and no standards have been established thus far, hence all the pitfalls in the
literature explained in Sect. 3. Therefore, in this section we present a set of best
practices and guidelines for each of the aforementioned steps in forecast evaluation.
In the following we present the guidelines around data partitioning for forecast eval-
uation.
The fixed origin setup is a faster and easier to implement evaluation setup. However, with a
single series, the fixed origin setup only provides one forecast per forecast step in
the horizon. According to Tashman (2000), a preferred characteristic of OOS forecast
evaluation is to have sufficient forecasts at each forecast step. Also, having multiple
forecasts for the same forecast step allows producing a forecast distribution per
step for further analysis. Another requirement of OOS forecast evaluation is to make
the forecast error measures insensitive to specific phases of business (Tashman 2000).
However, with a fixed origin setup, the errors may be the result of particular patterns
only observable in that particular region of the horizon (Tashman 2000). Therefore,
the following multi period evaluation setups are introduced as opposed to the fixed
origin setup.
4.1.2 Rolling origin, time series cross-validation and prequential evaluation setups
Armstrong and Grohman (1972) are among the first researchers to give a descriptive
explanation of the rolling origin evaluation setup. Although the terms rolling origin
setup and tsCV are used interchangeably in the literature, in addition to the forecast
origin rolling forward, tsCV also allows to skip origins, effectively rolling forward by
more than one step at a time (analogously to the difference between a leave-one-out
CV and a k-fold CV).
Fig. 9 Comparison of Expanding Window versus Rolling Window setups. The blue and orange points
represent the training and test sets, respectively. The figure on the left side shows the Expanding Window
setup where the training set keeps expanding. The figure on the right shows the Rolling Window setup
where the size of the training set keeps constant and the first point of the training set keeps rolling forward
With such multi period evaluations, each time the forecast origin updates, the model
encounters new actual data. With new data becoming available, we have the options
to – in the terminology of Tashman (2000) – either update the model (feed in new
data as inputs) or recalibrate it (refit with new data). Although for some of the tradi-
tional models such as ETS and ARIMA, the usual practice (and the implementation
in the forecast package) in a rolling origin setup is to recalibrate the models, for
general ML models it is more common to mostly just accept new data as inputs and
only periodically retrain the model (updating). As ML methods tend to work better
with higher granularities, re-fitting at every origin is often not an option (for example, a monthly series
predicted with ETS vs. a 5-minutely series predicted with Light Gradient Boosting
Models). Therefore, retraining as the most recent data becomes available happens in
ML methods mostly only when some sort of concept drift (change of the underlying
data generating process) is encountered (Webb et al. 2016).
Rolling origin evaluation can be conducted in two ways: 1) the expanding window
setup and 2) the rolling window setup. Figure 9 illustrates the difference between the two
approaches. The expanding window method is a good setup for small datasets/short
series (Bell and Smyl 2018). On the other hand, the rolling window setup removes
the oldest data from training as new data becomes available (Cerqueira et al. 2020).
This will not make a difference with forecasting techniques that only minimally attend
to the distant past, such as ETS, but may be beneficial with pure autoregressive ML
models, that have no notion of time beyond the windows. A potential problem of the
rolling origin setup is that the first folds may not have much data available. However,
the size of the first folds is not an issue when dealing with long series, thus making
rolling origin setup a good choice with sufficient amounts of data. On the other hand,
with short series it is also possible to perform a combination of the aforementioned
two rolling origin setups where we start with an expanding window setup and then
move to a rolling window setup.
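The difference between the two window schemes boils down to how the start of the training set moves; the short sketch below (illustrative only) prints the training/test index ranges produced by each scheme for a toy series.

```python
# Illustrative index generation for the two rolling origin variants.
n_obs, initial_train, horizon, step = 20, 10, 2, 2

print("Expanding window (training start fixed at 0):")
for origin in range(initial_train, n_obs - horizon + 1, step):
    print(f"  train [0, {origin}) -> test [{origin}, {origin + horizon})")

print("Rolling window (training length fixed):")
for origin in range(initial_train, n_obs - horizon + 1, step):
    start = origin - initial_train
    print(f"  train [{start}, {origin}) -> test [{origin}, {origin + horizon})")
```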
Fig. 10 Comparison of randomized CV versus OOS evaluation. The blue and orange dots represent the
training and test sets, respectively. In the usual k-fold-CV setup the testing instances are chosen randomly
over the series. In OOS, the test set is always reserved from the end of the series
The aforementioned two techniques of data partitioning preserve the temporal order of
the time series when splitting and using the data. A common misconception is that this
is always a necessity when dealing with time series. Another form of data partitioning
is to use a common randomized CV scheme as first proposed by Stone (1974). This
scheme is visualized in Fig. 10. Compared to the aforementioned validation schemes
which preserve the temporal order of the data, this form of randomized CV strategy
can make efficient use of the data, since all the data is used for both model training
as well as evaluation in iterations (Hastie et al. 2009). This helps to make a more
informed estimation about the generalisation error of the model.
However, this form of random splitting of a time series does not preserve the
temporal order of the data, and is therefore oftentimes not used and seen as problematic.
The common points of criticism for this strategy are that
1) it can make it difficult for a model to capture serial correlation between data points (autocorrelation) properly,
2) potential non-stationarities in time series can cause problems (for example, depending on the way that the data is partitioned, if all data from Sundays happen to be in the test set but not the training set in a series with weekly seasonality, then the model will not be able to produce accurate forecasts for Sundays since it has never seen data of Sundays before),
3) the training data contains future observations and the test set contains past data due to the random splitting, and
4) since evaluation data is reserved randomly across the series, the forecasting problem shifts to a missing value imputation problem which certain time series models are not capable of handling (Petropoulos 2022).
Despite these problems, randomized CV can be applied to pure AR models without
serial correlation issues. Bergmeir et al. (2018) theoretically and empirically show that
CV performs well in a pure AR setup, as long as the models nest or approximate the
true model, as then the errors are uncorrelated, leaving no dependency between the
individual windows. To check this, it is important to estimate the serial correlation of
residuals. For this, the Ljung-Box test (Ljung and Box 1978) can be used on the OOS
residuals of the models. While for overfitting models there will be no autocorrelation
left in the residuals, if the models are underfitted, some autocorrelation will be left
in the OOS residuals. If there is autocorrelation left, then the model still does not
use all the information available in the data, which means there will be dependencies
between the separate windows. In such a scenario, CV of the time series dataset will
not be valid and will underestimate the true generalisation error. The existence of signif-
icant autocorrelations anyway means that the model should be improved to do better
on the respective series (increase the AR order to capture autocorrelation etc.), since
the model has not captured all the available information. Once the models are suffi-
ciently competent in capturing the patterns of the series, for pure AR setups (without
exogenous variables), standard k-fold CV is a valid strategy. Therefore, in situations
with short series and small amounts of training data, where it is not practically feasible
to apply the aforementioned tsCV techniques due to the initial folds involving very
small lengths of the series, the standard CV method with some control of underfitting
of the models is a better choice with efficient use of data.
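One way to carry out this check in Python is via the Ljung–Box test in statsmodels, as sketched below on the out-of-sample residuals of an (illustrative) autoregressive model; a small p-value suggests remaining autocorrelation, i.e., an underfitted model for which k-fold CV would underestimate the generalisation error.

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.ar_model import AutoReg

# Simulate an AR(2) process, then deliberately underfit with an AR(1) model.
rng = np.random.default_rng(11)
n = 1000
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

train, test = y[:800], y[800:]
model = AutoReg(train, lags=1).fit()

# One-step-ahead OOS residuals of the underfitted AR(1) model.
params = model.params                 # [intercept, AR(1) coefficient]
preds = params[0] + params[1] * y[799:-1]
residuals = test - preds

lb = acorr_ljungbox(residuals, lags=[10])
print(lb)  # a small p-value indicates autocorrelation left in the OOS residuals
```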
The aforementioned problem that the testing windows can contain future obser-
vations, is also addressed by Bergmeir et al. (2018). With the CV strategy, the past
observations not in the training data but existing in the test set can be considered
missing observations, and the task is seen more as a missing value imputation prob-
lem rather than a forecasting problem. Many forecasting models such as ETS (in its
implementation in the forecast package (Hyndman and Athanasopoulos 2018)),
which iterate throughout the whole series, cannot properly deal with missing data. For
Recurrent Neural Networks (RNN) as well, due to their internal states that are prop-
agated forward along the series, standard k-fold CV which partitions data randomly
across the series is usually not applicable. Therefore, for such models, the only feasible
validation strategy is tsCV. Models such as ETS can anyway train competitively with
minimal amounts of data (as is the case with the initial folds of the tsCV technique)
and thus, are not quite problematic with tsCV. However, for reasonably trained pure
AR models, where the forecasts for one window do not in any way depend on the
information from other windows (due to not underfitting and having no internal state),
it does not make a difference between filling the missing values in the middle of the
series and predicting future values, where both are performed OOS. Nevertheless, the
findings by Bergmeir et al. (2018) are restricted to only stationary series.
Cerqueira et al. (2020) experimented using non-stationary series, where they have
concluded that OOS validation procedures preserving the temporal order (such as
tsCV), are the right choice when non-stationarities exist in the series. However, a pos-
sible criticism of that work is the choice of models. We have seen in Sect. 3 that ML
models are oftentimes not able to address certain types of non-stationarities out of the
box. More generally speaking, ML models are non-parametric, data-driven models.
As such, the models are typically very flexible and the function fitted depends heavily
on the characteristics of the observed data. Though recently challenged (Balestriero
et al. 2021), a common notion is that ML models are typically good at interpolation
and lack extrapolation capabilities. The models used by Cerqueira et al. (2020) include
several ML models such as a Rule-based Regression (RBR) model, a RF model and
a Generalized Linear Model (GLM), without in any way explicitly tackling the non-
stationarity in the data (similar to our example in Sect. 3). Thus, if a model is poor
and not producing good forecasts, performing a validation to select hyperparameters,
using any of the aforementioned CV strategies, will be of limited value. Furthermore,
and more importantly, non-stationarity is a broad concept and both for the modelling
and the evaluation it will depend on the type of non-stationarity which procedures will
perform well. For example, with abrupt structural breaks and level shifts occurring in
the unknown future, but not in the training and test set, it will be impossible for the
models to address this change and none of the aforementioned evaluation strategies
would do so either. In this situation, even tsCV would grossly underestimate the gen-
eralisation error. For a more gradual underlying change of the DGP, a validation set
at the end of the series would be more appropriate since in that case, the data points
closer to the end of the series may be already undergoing the change of the distribu-
tion. On the other hand, if the series has deterministic trend or seasonality, which are
straightforward to forecast, they can be simply extracted from the series and predicted
separately whereas the stationary remainder can be handled using the model. In such
a setup, the k-fold CV scheme will work well for the model, since the remainder
complies with the stationarity condition. For other non-deterministic trends, there are
several data pre-processing steps mentioned in the literature such as lag-1 differencing,
logarithmic transformation (for exponential trends), Seasonal and Trend Decompo-
sition using Loess (STL Decomposition), local window normalisation (Hewamalage
et al. 2021), moving average smoothing, percentage change transform, wavelet trans-
form etc. (Salles et al. 2019). The findings of Salles et al. (2019) have concluded
that there is no single universally best transformation technique across all datasets;
rather it depends on the characteristics of the individual datasets. If appropriate data
pre-processing steps are applied to enable models to handle non-stationarities, with a
pure AR setup, the CV strategy still holds valid after the data transformation, if the
transformation achieves stationarity. As such, to conclude, for non-stationarities, tsCV
seems the most adequate as it preserves the temporal order in the data. However, there
are situations where also tsCV will be misleading, and the forecasting practitioner will
already for the modelling need to attempt to understand the type of non-stationarity
they are dealing with. This information can subsequently be used for evaluation, which
may render CV methods for stationary data applicable after transformations of the data
to make them stationary.
It is important to identify which out of the above data partitioning strategies most
closely estimates (without under/overestimation) the final error of a model for the
test set under the given scenario (subject to different non-stationarities/serial corre-
lations/amount of data of the given time series). The gist of the guidelines for data
partitioning is visualized by the flow chart in Fig. 11. If the series are not short, tsCV is
usually preferable over k-fold CV, unless there are practical considerations such as that
an implementation of an algorithm is used that is not primarily intended for time series
forecasting and that internally performs a certain type of cross-validation. If series
are short, then k-fold CV should be used, accounting adequately for non-stationarities
and autocorrelation in the residuals.
Once the predictions are obtained from models, the next requirement is to compute
errors of the predictions to assess the model performance. Bias in predictions is a
common issue: a model can be very accurate (forecasts being very
close to actuals) but still consistently produce more overestimations than underestima-
tions, which may be concerning from a business perspective. Therefore, forecast bias
is calculated with a sign, as opposed to absolute errors, so that it indicates the direc-
tion of the forecast errors, either positive or negative. For example, scale-dependent
forecast bias can be assessed with the Mean Error (ME) as defined in Equation 4.
Here, yt indicates the true value of the series, ŷt the forecast and n, the number of
all available errors. Other scale-free versions of bias can be defined by scaling with
respect to appropriate scaling factors, such as actual values of the series.
$$\mathrm{ME} = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right) \qquad (4)$$
Two other popular and simple error measures used in a usual regression context
are MSE and MAE, which are both scale-dependent measures. Depending on the
business context, it can be a valid objective to forecast more accurately the series
that have higher scales, since they may be really the objects of interest. However, the
problem with scale-dependent measures is that, as soon as the scale of the series is
changed (for example converting from one currency to another), the values of the error
measures change (Tashman 2000). On the other hand, for certain businesses, it is a
requirement to compare errors across series. For example, if we say that MAE is 10
for a particular series, we have no idea whether it is a good or a bad accuracy. For
a series with an average value of 1000, this amount of accuracy is presumably quite
good, whereas for another series with an average value of 1, it is a very bad accuracy.
For this reason, the measures need to be scaled to achieve scale-independent measures,
and it has turned out to be next to impossible to develop a scaling procedure that works
for any type of possible non-stationarity and non-normality in a time series. Hence,
a wide variety of error measures have been proposed by researchers for this purpose
over the years. Nevertheless, eventually we encounter a particular condition of the time
series in the real world, that makes the proposed error measure fail (Svetunkov 2021).
There are many options available for scaling such as per-step, per-series or per-dataset
scaling. Scaling can also be done by dividing either by in-sample or OOS values of
the time series. Apart from dividing by certain quantities, scaling can also be achieved
through log transformation of errors and ranking based on errors as well. The key to
selecting a particular error measure for forecast evaluation is that it is mathematically
and practically robust under the given data.
Different point forecast evaluation measures are targeted towards optimizing for a
specific statistic of the distribution. For example, measures with squared base errors
such as MSE and RMSE optimize for the mean whereas others with absolute value
base errors such as MAE and Mean Absolute Scaled Error (MASE) optimize for the
median. Although the mean and median are the same for a symmetric distribution, that
does not hold for skewed distributions as with intermittent series. There exist numerous
controversies in the literature regarding this. Petropoulos (2022) suggest that it is not
appropriate to evaluate the same forecasts using many different error measures, since
each one optimizes for a different statistic of the distribution. Also according to Kolassa
(2020), if different point forecast evaluation measures are considered, multiple point
forecasts for each series and time point also need to be created. Kolassa (2020) further
argues that, if the ultimate evaluation measure is, e.g., MAE which focuses on the
median of the distribution, it does not make sense to optimize the models using an
error measure like MSE (which accounts for the mean). It is then more meaningful to
consider MAE during model training as well. However, these arguments hold
only if it is not an application requirement for the same forecasts to perform generally
well under all these measures. Koutsandreas et al. (2021) have empirically shown that, when the sample size is large, a wide variety of error measures agree on which consistently dominating methods are the best for a given scenario. They have also demonstrated that using two different error measures for optimization and for final evaluation has an insignificant impact on the final accuracy of the models. Bermúdez et al. (2006) have developed a fuzzy ETS model optimized via a multi-objective function combining the three error measures MAPE, RMSE and MAE. Their empirical results demonstrate that using such a mix of error measures instead of just one in the loss function leads to overall better, more robust and more generalisable results, even when the final evaluation is performed with just one of those measures. Fry and Lichtendahl (2020) also assess the same forecasts across numerous error measures in a business context. Evaluating the same forecasts with respect to many evaluation measures is a form of sanity check, ensuring that the forecasts still perform well under measures they were not directly optimized for.
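The effect of the targeted statistic can be illustrated with a small simulation (our own sketch, not taken from the cited works): for a skewed, intermittent-like distribution, the mean of the distribution minimizes the squared error while the median, which is zero here, minimizes the absolute error.

```python
import numpy as np

rng = np.random.default_rng(42)
# Skewed, intermittent-like demand: many zeros, occasional positive counts.
y = rng.poisson(lam=0.6, size=100_000)

mean_fc = y.mean()            # point forecast rewarded by MSE/RMSE
median_fc = np.median(y)      # point forecast rewarded by MAE/MASE (0 here)

for name, fc in [("mean forecast", mean_fc), ("median forecast", median_fc)]:
    mse = np.mean((y - fc) ** 2)
    mae = np.mean(np.abs(y - fc))
    print(f"{name}: {fc:.2f}  MSE={mse:.3f}  MAE={mae:.3f}")
# The mean forecast wins under MSE, the (zero) median forecast wins under MAE,
# even though a constant zero forecast is useless for planning purposes.
```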
There are many different point forecast error measures available in the forecasting literature, categorized based on (1) whether squared or absolute errors are used, (2) the techniques used to make them scale-free, and (3) the operator, such as the mean or the median, used to summarize the errors (Koutsandreas et al. 2021). Also, different forms of base errors are involved in the various error measures. In the following base error definitions, $y_t$ indicates the true value of the series, $\hat{y}_t$ the forecast, and $T$ the number of time steps in the training region of the time series.
• Error

$e_t = y_t - \hat{y}_t$   (5)

• Percentage error

$p_t = \dfrac{100\, e_t}{y_t}$   (6)

• Percentage error (in-sample scaling) - named scaled error (sE) in the work of Petropoulos and Kourentzes (2015).

$p_t^{\dagger} = \dfrac{e_t}{\frac{1}{T}\sum_{t=1}^{T} y_t}$   (7)

$p_t^{\ddagger} = \dfrac{|e_t|}{\frac{1}{T}\sum_{t=1}^{T} y_t}$   (8)

• Relative error - $e_t^{b}$ in Eq. (9) is the scale-dependent base error of the benchmark method.

$r_t = \dfrac{e_t}{e_t^{b}}$   (9)

• Scaled error (Hyndman and Koehler 2006)

$q_t = \dfrac{e_t}{\frac{1}{T-1}\sum_{t=2}^{T} |y_t - y_{t-1}|}$   (10)

$q_t^{\dagger} = \dfrac{e_t^2}{\frac{1}{T-1}\sum_{t=2}^{T} (y_t - y_{t-1})^2}$   (11)

• Rate-based error (Kourentzes 2014)

$c_t = \hat{y}_t - \dfrac{1}{t}\sum_{i=1}^{t} y_i$   (14)
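As a quick illustration, the base errors above are straightforward to compute with NumPy; the sketch below (our own, using a naïve one-step forecast as an assumed benchmark and toy numbers) evaluates $e_t$, $p_t$, $p_t^{\dagger}$, $r_t$ and $q_t$.

```python
import numpy as np

# Toy training region and out-of-sample actuals/forecasts.
y_train = np.array([12., 15., 14., 18., 17., 20.])
y_test = np.array([21., 19., 23.])
y_hat = np.array([20., 20., 21.])
# Assumed benchmark: naive forecast repeating the last training value.
y_bench = np.full_like(y_test, y_train[-1])

e = y_test - y_hat                              # error, Eq. (5)
p = 100 * e / y_test                            # percentage error, Eq. (6)
p_dag = e / np.mean(y_train)                    # in-sample scaled error, Eq. (7)
r = e / (y_test - y_bench)                      # relative error, Eq. (9)
q = e / np.mean(np.abs(np.diff(y_train)))       # scaled error, Eq. (10)

print(e, p.round(1), p_dag.round(3), r.round(2), q.round(2), sep="\n")
```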
Table 8 contains the definitions of error measures proposed in the literature using the aforementioned base errors. In the definitions of Table 8, $n$ indicates the number of all available base errors, $m$ denotes the number of time series, $h$ indicates the number of time steps in the forecast horizon and $h_i$ the horizon size for the $i$-th series.
Depending on the characteristics of a time series, as also stated in Sect. 2, different error measures defined in Table 8 are preferable or should be avoided. Table 9 summarises this information and can be used to choose error measures under the given characteristics of the data. In Table 9, the scaling indicated alongside each group of error measures denotes the type of scaling associated with those measures; this includes no scaling, scaling based on actual values, and scaling based on benchmark errors, as well as the categorisation into per-step, per-series and all-series (per-dataset) scaling. The † sign in Table 9 indicates that the respective error measures need to be used with caution under the given circumstances.
In almost any scenario, when applying error measures that scale based on the errors of a benchmark method, the relative competence of the benchmark method over the intended forecast horizon needs to be taken into account, since otherwise the benchmark errors can unnecessarily drive the overall values of the error measure higher or lower. With seasonal series, percentage-based measures may heavily underestimate the errors at peaks, due to dividing by large actual values (Wong 2019; Kunst 2016), or overstate the errors at troughs. This can be overcome by scaling based on aggregated values (per-series, all-series). On series having trends, or structural breaks with level shifts, scale-free measures which compute their scale by aggregating values (actual values or benchmark errors) over several time steps tend to face problems. As explained by Chen et al. (2017), the error value at each time step needs to comply with the scale of the series at that point; a scale computed by aggregating over several time steps which include such level shifts may not be a good estimate of the scaling factors for all the time steps of such a series. Also, on series with exponential trends, log-transformation-based error measures greatly reduce the impact of the errors produced by the models. Unit-root series behave very similarly to trended series, except that measures which compute a per-step scaling may, as with seasonal series, fail to capture the peak points of such series. Similarly, on series with heteroscedasticity, where potential peaks and troughs may have very high and very low variances, measures such as MAPE and RMSPE may have problems capturing those points correctly; log-transformation-based errors can reduce the impact of heteroscedasticity as well. Especially on series with structural breaks, measures which scale based on benchmark errors can be problematic when those errors are computed in-sample, since they may not be representative of the errors that occur OOS when the structural breaks lie in the forecast horizon or at the forecast origin. On intermittent series, measures that optimize for the median are problematic, since they consider constant zeros as the best prediction.
Table 8  Error measure definitions in the forecasting literature

Scale-dependent measures:
• Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} e_t^2}$
• Geometric Root Mean Squared Error (GRMSE, Syntetos and Boylan 2005): $\mathrm{GRMSE} = \big(\prod_{t=1}^{n} e_t^2\big)^{\frac{1}{2n}}$
• Geometric Mean Absolute Error (GMAE): $\mathrm{GMAE} = \big(\prod_{t=1}^{n} |e_t|\big)^{\frac{1}{n}}$

Measures based on percentage errors:
• Median Absolute Percentage Error (MdAPE): $\mathrm{MdAPE} = \mathrm{median}(|p_t|)$
• Weighted Absolute Percentage Error (WAPE): $\mathrm{WAPE} = \frac{\sum_{t=T+1}^{T+h} |e_t|}{\sum_{t=T+1}^{T+h} |y_t|}$
• Symmetric Weighted Absolute Percentage Error (sWAPE)
• Weighted Root Mean Squared Percentage Error (WRMSPE)
• Relative Total Absolute Error (RTAE): the total absolute error over the forecast horizon divided by a reference quantity $C$
• Scaled Mean Error (sME, Petropoulos and Kourentzes 2015): $\mathrm{sME} = \frac{1}{n}\sum_{t=1}^{n} p_t^{\dagger}$
• Scaled Mean Squared Error (sMSE, Petropoulos and Kourentzes 2015): $\mathrm{sMSE} = \frac{1}{n}\sum_{t=1}^{n} \big(p_t^{\dagger}\big)^2$
• Scaled Mean Absolute Error (sMAE, Petropoulos and Kourentzes 2015): $\mathrm{sMAE} = \frac{1}{n}\sum_{t=1}^{n} p_t^{\ddagger}$
• Normalized Deviation (ND, Salinas et al. 2020): $\mathrm{ND} = \frac{\sum_{t=1}^{n} |e_t|}{\sum_{t=1}^{n} |y_t|}$, where the scale in the denominator is computed globally using many series
• Normalized Root Mean Squared Error (NRMSE, Salinas et al. 2020): $\mathrm{NRMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} e_t^2}\,\big/\,\frac{1}{n}\sum_{t=1}^{n} |y_t|$, with the denominator again computed globally over many series

Measures based on relative errors:
• Mean Relative Absolute Error (MRAE): $\mathrm{MRAE} = \frac{1}{n}\sum_{t=1}^{n} |r_t|$
• Median Relative Absolute Error (MdRAE): $\mathrm{MdRAE} = \mathrm{median}(|r_t|)$
• Root Mean Relative Squared Error (RMRSE): $\mathrm{RMRSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} r_t^2}$
• Geometric Mean Relative Absolute Error (GMRAE): $\mathrm{GMRAE} = \big(\prod_{t=1}^{n} |r_t|\big)^{\frac{1}{n}}$
• Relative Geometric Root Mean Squared Error (RGRMSE)

Relative measures:
• Relative Mean Absolute Error (RelMAE): $\mathrm{RelMAE} = \frac{\mathrm{MAE}}{\mathrm{MAE}_b}$, where $\mathrm{MAE}_b$ is the MAE of the benchmark method
• Relative Mean Squared Error (RelMSE): $\mathrm{RelMSE} = \frac{\mathrm{MSE}}{\mathrm{MSE}_b}$, where $\mathrm{MSE}_b$ is the MSE of the benchmark method
• Relative Root Mean Squared Error (RelRMSE): $\mathrm{RelRMSE} = \sqrt{\frac{\mathrm{MSE}}{\mathrm{MSE}_b}}$
• Root Relative Squared Error (RSE, Lai et al. 2018): $\mathrm{RSE} = \sqrt{\frac{\sum_{t=1}^{n} e_t^2}{\sum_{t=1}^{n} (y_t - \bar{y})^2}}$
• Average Relative Mean Absolute Error (AvgRelMAE, Davydenko and Fildes 2013): $\mathrm{AvgRelMAE} = \Big(\prod_{i=1}^{m} \big(\tfrac{\mathrm{MAE}_i}{\mathrm{MAE}_i^{b}}\big)^{h_i}\Big)^{1/\sum_{i=1}^{m} h_i}$, where $\mathrm{MAE}_i^{b}$ is the MAE of the benchmark method on series $i$

Measures based on scaled errors (Hyndman and Koehler 2006):
• Mean Absolute Scaled Error (MASE): $\mathrm{MASE} = \frac{1}{n}\sum_{t=1}^{n} |q_t|$
• Median Absolute Scaled Error (MdASE): $\mathrm{MdASE} = \mathrm{median}(q_t)$
• Root Mean Squared Scaled Error (RMSSE): $\mathrm{RMSSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} q_t^{\dagger}}$

Measures based on ranks/counting:
• Percentage Better (PB Score, Hyndman and Koehler 2006): $\mathrm{PB}(\mathrm{MAE}) = 100\,\mathrm{mean}(I\{\mathrm{MAE} < \mathrm{MAE}_b\})$, where $\mathrm{MAE}_b$ is the MAE of the benchmark method; counts how many times (across series and time steps) a given method is better than the benchmark and reports it as a percentage
• Percentage of Critical Event for Margin X (Wong 2019): $100\,\mathrm{mean}(I\{E > X\})$, where $E$ is the error and $X$ is the margin; measures the percentage of forecasts whose error is higher than the margin

Measures based on transformation:
• Root Mean Squared Logarithmic Error (RMSLE, Bojer and Meldgaard 2020): $\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} l_t^2}$, where $l_t$ is the logarithmic base error
• Normalized Weighted Root Mean Squared Logarithmic Error (NWRMSLE, Bojer and Meldgaard 2020): $\mathrm{NWRMSLE} = \sqrt{\frac{\sum_{t=1}^{n} w_t l_t^2}{\sum_{t=1}^{n} w_t}}$, where $w_t$ is a weight

Rate-based measures (Kourentzes 2014):
• Mean Squared Rate (MSR): $\mathrm{MSR} = \frac{1}{n}\sum_{t=1}^{n} c_t^2$
• Mean Absolute Rate (MAR): $\mathrm{MAR} = \frac{1}{n}\sum_{t=1}^{n} |c_t|$

Other error measures:
• Weighted Mean Absolute Error (WMAE, Bojer and Meldgaard 2020): $\mathrm{WMAE} = \frac{\sum_{t=1}^{n} w_t |e_t|}{\sum_{t=1}^{n} w_t}$, where $w_t$ is a weight assigned to time step $t$
• A further measure based on the squared differences between per-series means over the forecast horizon, where $\bar{y}_i$ is the mean of series $i$ and $\bar{\hat{y}}_i$ is the mean of the predictions for series $i$
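As a rough illustration of how the measures in Table 8 combine a base error with a summary operator, the sketch below (our own simplified reading of the definitions, not reference code) implements RMSE, MAPE, WAPE, MASE and a percentage-better count for a single series against a naïve benchmark.

```python
import numpy as np

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

def mape(y, f):
    return np.mean(np.abs(100 * (y - f) / y))

def wape(y, f):
    return np.sum(np.abs(y - f)) / np.sum(np.abs(y))

def mase(y, f, y_train):
    # Scaled errors q_t (Eq. 10): scale by the in-sample naive one-step errors.
    scale = np.mean(np.abs(np.diff(y_train)))
    return np.mean(np.abs(y - f) / scale)

def pb_mae(y, f, f_bench):
    # Percentage Better: share of steps where the method beats the benchmark on absolute error.
    return 100 * np.mean(np.abs(y - f) < np.abs(y - f_bench))

y_train = np.array([110., 120., 118., 130., 128., 140.])
y = np.array([142., 150., 147.])
f = np.array([141., 145., 150.])
f_naive = np.full_like(y, y_train[-1])   # naive benchmark: repeat last observed value

for name, val in [("RMSE", rmse(y, f)), ("MAPE", mape(y, f)), ("WAPE", wape(y, f)),
                  ("MASE", mase(y, f, y_train)), ("PB(MAE)", pb_mae(y, f, f_naive))]:
    print(f"{name}: {val:.3f}")
```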
Measures with per-step scaling based on actual values can also be problematic on intermittent series due to division by zero. This can be addressed by using per-series scaling, which can again have issues if all time steps have zero values. Measures that scale based on benchmark errors can be problematic on intermittent series when the benchmark produces perfect predictions (zero errors), for example when the naïve method gives exact zeros on zero actual values. With respect to outliers, some applications may be interested in capturing them, whereas others may want to be robust against them. To be robust against outliers, the geometric mean or the median can be used as the summary operator instead of the mean, and absolute base errors should be used instead of squared base errors. Measures which scale based on per-step or per-series quantities may be heavily affected by outliers. Similarly, measures that scale based on benchmark errors can be problematic if the forecast of the benchmark over the horizon is heavily affected by outliers in the training region of the series.
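The difference between the summary operators can be seen in a small sketch (our own illustration): injecting a single extreme error inflates the mean of the absolute errors, while the median and the geometric mean barely move.

```python
import numpy as np
from scipy.stats import gmean

rng = np.random.default_rng(0)
abs_errors = np.abs(rng.normal(loc=0.0, scale=1.0, size=50))
abs_errors_outlier = np.append(abs_errors, 100.0)   # one extreme error

for name, errs in [("clean", abs_errors), ("with outlier", abs_errors_outlier)]:
    print(f"{name:12s} mean={np.mean(errs):6.2f}  "
          f"median={np.median(errs):6.2f}  gmean={gmean(errs):6.2f}")
# The mean is dominated by the single outlier, while the median and the
# geometric mean change only slightly.
```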
The flow chart in Fig. 12 provides further support for selecting forecast evaluation measures based on user requirements and other characteristics of the data. In Fig. 12, the error measures suggested for time series with outliers are selected for being robust against the outliers, not for capturing them.
While forecast evaluation measures are critical for assessing the relative performance of the methods and for selecting the best ones from their rankings, they do not give information regarding the statistical significance of the differences between these methods, i.e. whether the better performance of the best method arises just by chance on this sample of series or whether it is likely to dominate the other methods significantly on other samples of the data. The selected best method could be the only one to use, or there could be other methods, not significantly different from the best, that can be used interchangeably due to other preferable properties such as simplicity or computational efficiency.
There are many ways of performing statistical significance tests reported in the literature. The Diebold-Mariano test (Diebold and Mariano 2002) and the Wilcoxon rank-sum test (Mann and Whitney 1947) are both designed for comparing only two competing forecasts, not necessarily methods or models. However, the Diebold-Mariano test is designed specifically for time series and is parametric, meaning that it assumes normality of the data, whereas the Wilcoxon test is a generic non-parametric test based on the ranks of the methods. Because the ranks of the methods are considered for each series separately, the error measures used do not necessarily have to be scale-free. The Giacomini-White test (Giacomini and White 2006) is again based on the comparison of two forecasts, with the potential to assess conditional predictive ability (CPA), a concept that refers to conditioning the choice on a potential future state of the economy, which is important for macroeconomic forecasting of a small number of series. A continuation of this line of research is the work by Li et al. (2022b), which focuses on conditional superior predictive ability with regard to a benchmark method and time series with general serial dependence.
Table 9  Checklist for selecting error measures for final forecast evaluation based on different time series characteristics. The ten characteristic columns are, in order: Stationary Count Data (>> 0) | Seasonality | Trend (Linear/Exp.) | Unit Roots | Heteroscedasticity | Structural Breaks (With Scale Differences): Forecast Horizon | Training Region | Forecast Origin | Intermittence | Outliers. A ✓ indicates that the measure is preferable, ✗ that it should be avoided, and ✓† that it needs to be used with caution under the given characteristic; the scaling associated with each group of measures is given in parentheses.

RMSE (scaling: none):                        ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✗
MAE:                                         ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✓
MAPE (scaling: OOS per-step, actual values): ✓ ✗ ✓ ✓† ✓† ✓ ✓ ✓ ✗ ✗
RMSPE:                                       ✓ ✗ ✓ ✓† ✓† ✓ ✓ ✓ ✗ ✗
sMAPE:                                       ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✓
msMAPE:                                      ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
WAPE (scaling: OOS per-series):              ✓ ✓ ✗ ✗ ✓ ✗ ✓ ✓ ✗ ✗
WRMSPE:                                      ✓ ✓ ✗ ✗ ✓ ✗ ✓ ✓ ✓† ✗
sMAE (scaling: in-sample per-series):        ✓ ✓ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗
sMSE:                                        ✓ ✓ ✗ ✗ ✓ ✗ ✗ ✗ ✓† ✗
ND (scaling: OOS all-series):                ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✓
NRMSE:                                       ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✗
MdRAE:                                       ✓† ✓† ✗ ✗ ✓ ✓ ✓† ✓ ✗ ✓†
GMRAE:                                       ✓† ✓† ✗ ✗ ✓ ✓ ✓† ✓ ✗ ✓†
RMRSE:                                       ✓† ✓† ✗ ✗ ✓ ✓ ✓† ✓ ✓† ✗
Relative Measures (scaling: OOS per-series): ✓† ✓† ✗ ✗ ✓ ✓ ✓† ✓ ✓† ✓†
MASE (scaling: in-sample per-series):        ✓† ✓† ✓ ✓ ✓ ✗ ✓† ✗ ✗ ✓
RMSSE:                                       ✓† ✓† ✓ ✓ ✓ ✗ ✓† ✗ ✓† ✗
Measures with in-sample all-series scaling:  ✓† ✓† ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Measures with Transformations (scaling: none): ✓ ✓ ✓† ✓ ✓† ✓ ✓ ✓ ✓ ✓
It should be noted that many of the mentioned comparison tests are per se designed for comparing two forecasts; multiple testing of more than two forecasts requires a correction for multiple hypothesis testing, such as, e.g., a Bonferroni correction.
There are also techniques developed to perform comparisons within a group of more than two methods. The means of the error distributions of the different methods can be used to compare their mean performance; the F-test and the t-test are statistical tests in this respect. They both carry the parametric assumption that the means of the error distributions follow a normal distribution.
Fig. 13 An example of a CD diagram to visualize the significance of the differences between a number of
competing methods. The best three methods A, B and C are not significantly different from each other. On
the other hand, methods D, E and F are significantly worse than those three methods. The amount of data
has not been enough to check whether method E is significantly better than method D or worse than method
F
Although, according to the Central Limit Theorem, this could hold for measures such as MSE and MAE for a sufficiently large random sample (of size n ≥ 30), it does not hold for, e.g., RMSE, since the square root of a normally distributed variable does not itself follow a normal distribution, even though its distribution can be close to normal. On the other hand, the Friedman test (Friedman 1937, 1939, 1940) is a non-parametric statistical test that can be used to detect significant differences between multiple competing methods, using the ranks of the methods according to their mean errors.
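As a minimal sketch (not code from any of the cited works), the Friedman test can be applied with SciPy to a matrix of per-series errors, where each column holds the errors of one method across the same set of series; the error values here are simulated and purely illustrative.

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
n_series, n_methods = 40, 4
# Hypothetical per-series error values (e.g. MASE) for four competing methods;
# method 0 is constructed to be slightly better on average.
errors = rng.gamma(shape=2.0, scale=0.5, size=(n_series, n_methods))
errors[:, 0] *= 0.8

stat, p_value = friedmanchisquare(*[errors[:, j] for j in range(n_methods)])
print(f"Friedman statistic = {stat:.2f}, p-value = {p_value:.4f}")

# Average rank per method (lower is better), as later used for CD diagrams.
ranks = np.argsort(np.argsort(errors, axis=1), axis=1) + 1
print("mean ranks:", ranks.mean(axis=0))
```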
The Friedman test is usually followed by a post-hoc test when the null hypothesis, which states that "there are no significant differences between the methods", is rejected. There are different types of post-hoc tests, for example the Hochberg procedure (Hochberg 1988), the Holm procedure (Holm 1979), the Bonferroni-Dunn procedure (Dunn 1961), the Nemenyi method (Nemenyi 1963), the Multiple Comparisons with the Best (MCB) method (practically equivalent to the Nemenyi method), the Multiple Comparisons with the Mean (ANOM) method (Halperin et al. 1955), and others. In general, the ANOM test holds less value in practice, since it is more useful to find which methods are not significantly different from the best than which are not significantly different from some averagely performing method. The Nemenyi method works by defining confidence bounds, in terms of a Critical Distance (CD), around the mean ranks of the methods, in order to identify which methods have overlapping confidence bounds and which do not. As Demšar (2006) suggests, if all comparisons are to be performed against one control method, as opposed to each method against each other, procedures such as the Bonferroni-Dunn and Hochberg procedures are preferable to the Nemenyi test. Once the quantitative results for the significance of the differences are obtained using any of the aforementioned methods, they can be visualized using CD diagrams (Demšar 2006). In these diagrams, a horizontal axis reports the average ranks of all the methods, and groups of methods that are not significantly different from each other are connected using black bars. This is illustrated in Fig. 13, which shows an example CD diagram.
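The critical distance itself is simple to compute; the sketch below (our own, following the formula in Demšar (2006), with the tabulated two-tailed q values for alpha = 0.05 treated as given constants) derives the CD from the number of methods k and the number of series n.

```python
import numpy as np

# Nemenyi critical values q_alpha for alpha = 0.05 (two-tailed), k = 2..10 methods,
# i.e. Studentized range quantiles at infinite df divided by sqrt(2), as tabulated
# in Demšar (2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def nemenyi_cd(k, n, q_table=Q_ALPHA_005):
    """Critical distance CD = q_alpha * sqrt(k * (k + 1) / (6 * n))."""
    return q_table[k] * np.sqrt(k * (k + 1) / (6 * n))

# Example: 4 methods compared on 40 series.
print(f"CD = {nemenyi_cd(k=4, n=40):.3f}")
# Methods whose average ranks differ by less than CD are joined by a bar in the
# CD diagram; with more series (larger n) the CD shrinks, so that smaller rank
# differences become significant.
```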
Fig. 14 Flow chart for selecting a statistical test to measure the significance of model differences
When performing significance testing, the amount of data included heavily impacts the results. For example, with a very high number of series, the CD is usually very small, producing significant results for even small differences between models; the results are then more reliable, in the sense that even the slightest differences between models encountered on such a large amount of data are statistically highly significant. On the other hand, the outcome also depends on the number and the relative performance of the models included in the comparison. For example, adding more and more poorly performing methods to the group tends to make the CD larger (since the CD grows with the number of compared methods), so that intermediate methods may no longer show a significant difference from the best. The flow chart in Fig. 14 summarises the decision-making process for selecting a statistical test to measure the significance of differences between models.
5 Conclusions
Compared to model building and training, forecast evaluation remains a much more complex task. The general trend in the literature has been to propose new methodologies to address the pitfalls associated with previously introduced ones. Nevertheless, taking forecast evaluation measures as an example, to the best of our knowledge all measures introduced thus far can break under certain characteristics or non-stationarities of the time series. General ML practitioners and data scientists new to the field of forecasting are often not aware of these issues. Consequently, as we demonstrate through our work, the forecast evaluation practices used by many works, even those published at top-tier venues in the ML domain, can be flawed. All of this is a consequence of the lack of established best practices and guidelines for the different steps of the forecast evaluation process. Therefore, to support the ML community in this respect, we provide a compilation of common pitfalls and best-practice guidelines related to forecast evaluation. The key guidelines that we develop are as follows.
• To claim the competitiveness of proposed methods, they need to be benchmarked on a sufficiently large number of datasets.
• It is always important to compare models against the right and the simplest benchmarks, such as the naïve and the seasonal naïve methods.
• Forecast plots can be misleading; making decisions purely based on the visual appeal of forecast plots is not advisable.
• Data leakage needs to be explicitly avoided in rolling-origin evaluation and in other data pre-processing tasks such as smoothing, decomposition and normalisation of the series.
• If enough data are available, tsCV is the procedure of choice. Also, for models with a continuous state, such as RNNs and ETS, where the temporal order of the data is important, tsCV may be the only applicable validation strategy. k-fold CV is a valid and data-efficient strategy of data partitioning for forecast model validation with purely AR-based setups, when the models do not underfit the data (which can be detected with a test for serial correlation in the residuals, such as the Ljung-Box test; see the sketch after this list). As such, we advise this procedure especially for short series, where tsCV leads to test sets that are too small. However, if the models underfit, it is advisable to improve the models first before using any CV technique.
• There is no single globally accepted evaluation measure for all scenarios. It depends
on the characteristics of the data as summarized in Table 9.
• When using statistical testing for significance of the differences between models,
balancing the diversity of the compared models against the number of data points
is important to avoid spurious statistical similarity/difference between models.
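As referenced in the validation guideline above, such a residual check can be carried out with statsmodels; the following is a minimal sketch on a simulated series, with the AR model choice being purely illustrative.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(7)
# Illustrative AR(2)-style series.
y = np.zeros(300)
for t in range(2, 300):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + rng.normal()

# Fit a purely autoregressive model and test its residuals for serial correlation.
model = AutoReg(y, lags=2).fit()
lb = acorr_ljungbox(model.resid, lags=[10], model_df=2)
print(lb)
# A small p-value indicates remaining serial correlation, i.e. the model
# underfits and k-fold CV would not be a safe validation strategy.
```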
While the literature on evaluation measures is quite extensive, the exact base errors (squared/absolute), summarisation operators (mean/median/geometric mean) and type of scaling to use (global/per-series/per-step, in-sample/OOS, relative/percentage) differ based on user expectations, business utility and the characteristics of the underlying time series. Due to the lack of proper knowledge of forecast evaluation, ML research in the literature thus far has often either struggled to demonstrate the competitiveness of its models or arrived at spurious conclusions. It is our objective that this effort encourages better and correct forecast evaluation practices within the ML community. As a potential avenue for further work, especially with respect to evaluation measures,
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Armstrong J (2001) Evaluating forecasting methods. In: Armstrong JS (ed) Principles of forecasting: a
handbook for researchers and practitioners. Kluwer Academic Publishers, Norwell, MA
Armstrong JS, Grohman MC (1972) A comparative study of methods for long-range market forecasting.
Manag Sci 19(2):211–221
Arnott R, Harvey C R, Markowitz H (2019) A backtesting protocol in the era of machine learning. J Financ
Data Sci
Bagnall A, Lines J, Bostrom A, Large J, Keogh E (2016) The great time series classification bake off: a review
and experimental evaluation of recent algorithmic advances. Data Min Knowl Disc 31(3):606–660
Balestriero R, Pesenti J, LeCun Y (2021) Learning in high dimension always amounts to extrapolation.
arXiv preprint arXiv:2110.09485
Bell F, Smyl S, (2018) Forecasting at uber: an introduction. https://eng.uber.com/forecasting-introduction/
Berger D, Chaboud A, Hjalmarsson E (2009) What drives volatility persistence in the foreign exchange
market? J Financ Econ 94(2):192–213
Bergmeir C, Hyndman RJ, Koo B (2018) A note on the validity of cross-validation for evaluating autore-
gressive time series prediction. Comput Stat Data Anal 120:70–83
Bermúdez JD, Segura JV, Vercher E (2006) A decision support system methodology for forecasting of time
series based on soft computing. Comput Stat Data Anal 51(1):177–191
Bojer C S, Meldgaard J P (2020) Kaggle forecasting competitions: an overlooked learning opportunity. Int
J Forecast
Brownlee J (2020) Data preparation for machine learning: data cleaning, feature selection, and data trans-
forms in Python. Mach Learn Mastery
Cerqueira V, Torgo L, Mozetič I (2020) Evaluating time series forecasting models: an empirical study on
performance estimation methods. Mach Learn 109(11):1997–2028
Challu C, Olivares K. G, Oreshkin B N, Garza, F, Mergenthaler-Canseco M, Dubrawski A (2022) N-hits:
neural hierarchical interpolation for time series forecasting. arXiv:2201.12886
Chen C, Twycross J, Garibaldi JM (2017) A new accuracy measure based on bounded relative error for
time series forecasting. PLoS ONE 12(3):e0174202
Cox D, Miller H (1965) The Theory of Stochastic Processes
Cui Y, Xie J, Zheng K (2021) Historical inertia: a neglected but powerful baseline for long sequence
time-series forecasting. In: Proceedings of the 30th ACM International Conference on Information &
Knowledge Management. CIKM ’21. Association for Computing Machinery, New York, NY, USA,
pp 2965-2969
Davydenko A, Fildes R (2013) Measuring forecasting accuracy: The case of judgmental adjustments to
SKU-level demand forecasts. Int J Forecast 29(3):510–522
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(1):1–30
Diebold FX, Mariano RS (2002) Comparing predictive accuracy. J Bus Econ Stat 20(1):134–144
Ditzler G, Roveri M, Alippi C, Polikar R (2015) Learning in nonstationary environments: a survey. IEEE
Comput Intell Mag 10(4):12–25
Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56(293):52–64
Du D, Su B, Wei Z (2022) Preformer: predictive transformer with multi-scale segment-wise correlations
for long-term time series forecasting. arXiv:2202.11356
Du Y, Wang J, Feng W, Pan S, Qin T, Xu R, Wang C (2021) Adarnn: adaptive learning and forecasting of
time series. In: Proceedings of the 30th ACM International Conference on Information & Knowledge
Management. CIKM ’21. Association for Computing Machinery, New York, NY, USA, pp 402-411
Engle R F (2003) Risk and volatility: econometric models and financial practice. Nobel Lect. https://www.
nobelprize.org/uploads/2018/06/engle-lecture.pdf
Fama EF (1970) Efficient capital markets: a review of theory and empirical work. J Financ 25(2):383–417
Fawaz HI, Forestier G, Weber J, Idoumghar L, Muller P-A (2019) Deep learning for time series classification:
a review. Data Min Knowl Discov 33(4):917–963
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of
variance. J Am Stat Assoc 32(200):675–701
Friedman M (1939) A correction: the use of ranks to avoid the assumption of normality implicit in the
analysis of variance. J Am Stat Assoc 34(205):109–109
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann
Math Stat 11(1):86–92
Fry C, Lichtendahl C (2020) Google practitioner session. In: 40th International Symposium on Forecasting.
https://www.youtube.com/watch?v=FoUX-muLlB4&t=3007s
Gama J, Sebastiao R, Rodrigues PP (2013) On evaluating stream learning algorithms. Mach Learn
90(3):317–346
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4)
Ghomeshi H, Gaber MM, Kovalchuk Y (2019) EACD: evolutionary adaptation to concept drifts in data
streams. Data Min Knowl Disc 33(3):663–694
Giacomini R, White H (2006) Tests of conditional predictive ability. Econometrica 74(6):1545–1578
Godahewa R, Bandara K, Webb GI, Smyl S, Bergmeir C (2021) Ensembles of localised models for time
series forecasting. Knowl Based Syst 233:107518
Godfrey LB, Gashler MS (2018) Neural decomposition of time-series data for effective generalization.
IEEE Trans Neural Netw Learn Syst 29(7):2973–2985
Gujarati DN (2021) Essentials of econometrics. Sage Publications, Christchurch, New Zealand
Guo Y, Zhang S, Yang J, Yu G, Wang Y (2022) Dual memory scale network for multi-step time series
forecasting in thermal environment of aquaculture facility: a case study of recirculating aquaculture
water temperature. Expert Syst Appl 208:118218
Halperin M, Greenhouse SW, Cornfield J, Zalokar J (1955) Tables of percentage points for the studentized
maximum absolute deviate in normal samples. J Am Stat Assoc 50(269):185–195
Hämäläinen W, Webb G I, (2019) A tutorial on statistically sound pattern discovery. Data Min Knowl
Discov 33 (2): 325–377
Hannun A, Guo C, van der Maaten L (2021) Measuring data leakage in machine-learning models with fisher
information. In: de Campos, C, Maathuis, M H (eds) Proceedings of the Thirty-Seventh Conference
on Uncertainty in Artificial Intelligence. vol 161, pp 760–770
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and
prediction. Springer, New York, NY
Hewamalage H, Bergmeir C, Bandara K (2021) Recurrent neural networks for time series forecasting:
current status and future directions. Int J Forecast 37(1):388–427
Hochberg Y (1988) A sharper bonferroni procedure for multiple tests of significance. Biometrika 75(4):800–
802
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70
Hyndman R J, Athanasopoulos G (2018) Forecasting: principles and Practice, 2nd edn. OTexts. https://
otexts.com/fpp2/
Hyndman RJ, Koehler AB (2006) Another look at measures of forecast accuracy. Int J Forecast 22(4):679–
688
Hyndman R, Kang Y, Talagala T, Wang E, Yang Y (2019) tsfeatures: time series feature extraction. R
package version 1.0.0. https://pkg.robjhyndman.com/tsfeatures/
Ikonomovska E, Gama J, Džeroski S (2010) Learning model trees from evolving data streams. Data Min
Knowl Discov 23(1):128–168
Kaufman S, Rosset S, Perlich C, Stitelman O (2012) Leakage in data mining: Formulation, detection, and
avoidance. ACM Trans Knowl Discov Data 6(4):1–21
Kim S, Kim H (2016) A new metric of absolute percentage error for intermittent demand forecasts. Int J
Forecast 32(3):669–679
Kolassa S (2020) Why the best point forecast depends on the error or accuracy measure. Int J Forecast
36(1):208–211
Kourentzes N (2014) On intermittent demand model optimisation and selection. Int J Prod Econ 156:180–
190
Koutsandreas D, Spiliotis E, Petropoulos F, Assimakopoulos V (2021) On the selection of forecasting accuracy measures. J Oper Res Soc, 1–18
Kunst R (2016) Visualization of distance measures implied by forecast evaluation criteria. In: Interna-
tional Symposium on Forecasting 2016. https://forecasters.org/wp-content/uploads/gravity_forms/7-
621289a708af3e7af65a7cd487aee6eb/2016/07/Kunst_Robert_ISF2016.pdf
Kuranga C, Pillay N (2022) A comparative study of nonlinear regression and autoregressive techniques in
hybrid with particle swarm optimization for time-series forecasting. Expert Syst Appl 190:116163
Lai G, Chang W.-C, Yang Y, Liu H (2018) Modeling long- and short-term temporal patterns with deep
neural networks. In: The 41st International ACM SIGIR Conference on Research & Development in
Information Retrieval. SIGIR ’18. Association for Computing Machinery, New York, NY, USA, pp
95-104
Li J, Liao Z, Quaedvlieg R (2022b) Conditional superior predictive ability. Rev Econ Stud 89(2):843–875
Li B, Du S, Li T, Hu J, Jia Z (2022a) Draformer: differentially reconstructed attention transformer for
time-series forecasting. arXiv:2206.05495
Lin G, Lin A, Cao J (2021) Multidimensional knn algorithm based on eemd and complexity measures in
financial time series forecasting. Expert Syst Appl 168:114443
Liu S, Ji H, Wang MC (2020) Nonpooling convolutional neural network forecasting for seasonal time series
with trends. IEEE Trans Neural Netw Learn Syst 31(8):2879–2888
Liu Q, Long L, Peng H, Wang J, Yang Q, Song X, Riscos-Núñez A, Pérez-Jiménez M J (2021) Gated
spiking neural p systems for time series forecasting. IEEE Trans Neural Netw Learn Syst, 1–10
Ljung GM, Box GEP (1978) On a measure of lack of fit in time series models. Biometrika 65(2):297–303
Lubba CH, Sethi SS, Knaute P, Schultz SR, Fulcher BD, Jones NS (2019) catch22: CAnonical time-series
CHaracteristics. Data Min Knowl Disc 33(6):1821–1852
Makridakis S (1993) Accuracy measures: theoretical and practical concerns. Int J Forecast 9(4):527–529
Makridakis S, Hibon M (2000) The m3-competition: results, conclusions and implications. Int J Forecast
16(4):451–476
Makridakis S, Spiliotis E, Assimakopoulos V (2020) The M4 Competition: 100,000 time series and 61
forecasting methods. Int J Forecast 36(1):54–74
Makridakis S, Spiliotis E, Assimakopoulos V (2022) M5 accuracy competition: results, findings, and con-
clusions. Int J Forecast 38(4):1346–1364
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger
than the other. Ann Math Stat 18(1):50–60
Moon H, Lee H, Song B (2022) Mixed pooling of seasonality for time series forecasting: an application to
pallet transport data. Expert Syst Appl 201:117195
Nemenyi P (1963) Distribution-free multiple comparisons. Ph.D. thesis, Princeton University
Petropoulos F et al (2022) Forecasting: theory and practice. Int J Forecast 38(3):705–871
Petropoulos F, Kourentzes N (2015) Forecast combinations for intermittent demand. J Oper Res Soc
66(6):914–924
Ran P, Dong K, Liu X, Wang J (2023) Short-term load forecasting based on ceemdan and transformer.
Electric Power Sys Res 214:108885
Rossi B (2013) Exchange rate predictability. J Econ Lit 51(4):1063–1119
Salinas D, Flunkert V, Gasthaus J, Januschowski T (2020) Deepar: probabilistic forecasting with autore-
gressive recurrent networks. Int J Forecast 36(3):1181–1191
Salles R, Belloze K, Porto F, Gonzalez PH, Ogasawara E (2019) Nonstationary time series transformation
methods: an experimental review. Knowl Based Syst 164:274–291
Shabani A, Abdi A, Meng L, Sylvain T (2022) Scaleformer: iterative multi-scale refining transformers for
time series forecasting. arXiv:2206.04038
Shcherbakov M, Brebels A, Shcherbakova N, Tyukov A, Janovsky T, Kamaev V (2013) A survey of forecast
error measures. World Appl Sci J 24(24):171–176
Shen Z, Zhang Y, Lu J, Xu J, Xiao G (2020) A novel time series forecasting model with deep learning.
Neurocomputing 396:302–313
Shih S-Y, Sun F-K, Lee H-Y (2019) Temporal pattern attention for multivariate time series forecasting.
Mach Learn 108(8):1421–1441
Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Ser B
Methodol 36(2):111–147
Suilin A (2017) kaggle-web-traffic. Accessed: 2018-11-19. https://github.com/Arturus/kaggle-web-traffic/
Sun F-K, Boning D S (2022) Fredo: frequency domain-based long-term time series forecasting.
arXiv:2205.12301
Svetunkov I (2021) Forecasting and analytics with adam. OpenForecast, (version: [current date]). https://
openforecast.org/adam/
Syntetos AA, Boylan JE (2005) The accuracy of intermittent demand estimates. Int J Forecast 21(2):303–314
Talagala T S (2020) A tool to detect potential data leaks in forecasting competitions. In: International
Symposium on Forecasting 2020. https://thiyanga.netlify.app/talk/isf20-talk/
Tashman LJ (2000) Out-of-sample tests of forecasting accuracy: an analysis and review. Int J Forecast
16(4):437–450
Webb GI, Hyde R, Cao H, Nguyen HL, Petitjean F (2016) Characterizing concept drift. Data Min Knowl
Discov 30(4):964–994
Wong L (2019) Error metrics in time series forecasting. In: International Symposium
on Forecasting 2019. https://isf.forecasters.org/wp-content/uploads/gravity_forms/2-
dd30f7ae09136fa695c552259bdb3f99/2019/07/ISF_2019_slides.pdf
Woo G, Liu C, Sahoo D, Kumar A, Hoi S (2022) Etsformer: exponential smoothing transformers for
time-series forecasting. arXiv:2202.01381
Wu Z, Pan S, Long G, Jiang J, Chang X, Zhang C (2020) Connecting the dots: Multivariate time series
forecasting with graph neural networks. In: Proceedings of the 26th ACM SIGKDD International Con-
ference on Knowledge Discovery & Data Mining. KDD ’20. Association for Computing Machinery,
New York, NY, USA, pp 753-763
Wu H, Xu J, Wang J, Long M (2021) Autoformer: Decomposition transformers with Auto-Correlation for
long-term series forecasting. In: Advances in Neural Information Processing Systems
Ye J, Liu Z, Du B, Sun L, Li W, Fu Y, Xiong H (2022) Learning the evolutionary and multi-scale graph
structure for multivariate time series forecasting. In: Proceedings of the 28th ACM SIGKDD Confer-
ence on Knowledge Discovery and Data Mining. KDD ’22. Association for Computing Machinery,
New York, NY, USA, pp 2296-2306
Zeng A, Chen M, Zhang L, Xu Q (2022) Are transformers effective for time series forecasting?
Zhang X, He K, Bao Y (2021) Error-feedback stochastic modeling strategy for time series forecasting with
convolutional neural networks. Neurocomputing 459:234–248
Zhou Y, Zhang M, Lin K-P (2022) Time series forecasting by the novel gaussian process wavelet self-join
adjacent-feedback loop reservoir model. Expert Syst Appl 198:116772
Zhou T, Ma Z, Wang X, Wen Q, Sun L, Yao T, Yin W, Jin R (2022a) FiLM: frequency improved Legendre
memory model for long-term time series forecasting. In: Advances in Neural Information Processing
Systems. arXiv:2205.08897
Zhou T, Ma Z, Wen Q, Wang X, Sun L, Jin R (2022b) FEDformer: Frequency enhanced decomposed
transformer for long-term series forecasting. In: Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu
G, Sabato S (eds), Proceedings of the 39th International Conference on Machine Learning. Vol. 162
of Proceedings of Machine Learning Research. PMLR, pp 27268–27286
Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W (2021) Informer: Beyond efficient transformer for
long sequence time-series forecasting. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence,
AAAI 2021, Virtual Conference. vol 35. AAAI Press, pp 11106–11115
Zhou T, Zhu J, Wang X, Ma Z, Wen Q, Sun L, Jin R (2022c) Treedrnet:a robust deep model for long term
time series forecasting. arXiv:2206.12106
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.