
Machine Learning vs Statistical Methods for Time Series Forecasting: Size Matters

Vitor Cerqueira1,2∗, Luis Torgo1,2,3 and Carlos Soares1,2
1INESC TEC, Porto, Portugal
2University of Porto
3Dalhousie University
[email protected], [email protected], [email protected]

arXiv:1909.13316v1 [stat.ML] 29 Sep 2019

Abstract
Time series forecasting is one of the most active research topics. Machine learning methods have been increasingly adopted to solve these predictive tasks. However, in a recent work, evidence was shown that these approaches systematically present a lower predictive performance relative to simple statistical methods. In this work, we counter these results. We show that they are only valid under an extremely low sample size. Using a learning curve method, our results suggest that machine learning methods improve their relative predictive performance as the sample size grows. The R code to reproduce all of our experiments is available at https://github.com/vcerqueira/MLforForecasting.

1 Introduction
Machine learning is a subset of the field of artificial intelligence, which is devoted to developing algorithms that automatically learn from data [Michalski et al., 2013]. This area has been at the centre of important advances in science and technology. This includes problems involving forecasting, such as in the domains of energy [Voyant et al., 2017], healthcare [Lee and Mark, 2010], management [Carbonneau et al., 2008], or climate [Xingjian et al., 2015].

Notwithstanding, despite gaining increasing attention, machine learning methods are still not well established in the forecasting literature, especially in the case of univariate time series. The forecasting literature is dominated by statistical methods based on linear processes, such as ARIMA [Chatfield, 2000] or exponential smoothing [Gardner Jr, 1985].

This matter is noticeable in the recent work by [Makridakis et al., 2018], where the authors present evidence that traditional statistical methods systematically outperform machine learning methods for univariate time series forecasting. This includes algorithms such as the multi-layer perceptron or Gaussian processes. Most of the machine learning methods tested by the authors fail to outperform a simple seasonal random walk model. Makridakis and his colleagues conclude the paper by pointing out the need to find the reasons behind the poor predictive performance shown by machine learning forecasting models relative to statistical methods. We address this question in this paper.

Our working hypothesis is that the study presented by [Makridakis et al., 2018] is biased in one crucial aspect: sample size. The authors draw their conclusions from a large set of 1045 monthly time series used in the well-known M3 competition [Makridakis and Hibon, 2000]. However, each of the time series is extremely small. The average, minimum, and maximum number of observations is 118, 66, and 144, respectively. We hypothesize that these datasets are too small for machine learning models to generalize properly. Machine learning methods typically assume a functional form that is more flexible than that of statistical methods. Hence, they are more prone to overfit. When the size of the data is small, the sample may not be representative of the process generating the underlying time series. In such cases, machine learning methods model the spurious behavior represented in the sample.

In this context, our goal in this paper is to compare statistical methods with machine learning methods for time series forecasting, controlling for sample size.

1.1 Our Contribution
To test our hypothesis, we present an empirical analysis of the impact of sample size on the relative performance of different forecasting methods. We split these methods into two categories: machine learning methods and statistical methods. Machine learning methods are often based on statistical techniques, so this split is somewhat artificial. However, in the interest of consistency with previous work on this topic [Makridakis et al., 2018], we use the term statistical to refer to methods developed in the statistical and forecasting literature.

In our empirical analysis, we use 90 univariate time series from several domains of application. The results of our experiments show that the conclusions drawn by [Makridakis et al., 2018] are only valid when the sample size is small. That is, with a small sample size, statistical methods show a better predictive performance compared to machine learning models. However, as the sample size grows, machine learning methods outperform the former.

The paper is organized as follows. In the next section, we provide a background to this paper. We formalize the time series forecasting task from a univariate perspective, and outline some state of the art methods to solve this problem. In Section 3, we present the experiments, which are discussed in Section 4. Finally, we conclude the paper in Section 5.

∗Contact Author
2 Background

2.1 Time Series Forecasting
Let Y = {y_1, . . . , y_n} denote a time series. Forecasting denotes the process of estimating the future values of Y, y_{n+h}, where h denotes the forecasting horizon.

Quantitative approaches to time series forecasting are split into two categories: univariate and multivariate. Univariate methods refer to approaches that model future observations of a time series according to its past observations. Multivariate approaches extend univariate ones by considering additional time series that are used as explanatory variables. We will focus on univariate approaches in this work.

The forecasting horizon is another aspect to take into account when addressing time series prediction problems. Forecasting methods usually focus on one step ahead forecasting, i.e., the prediction of the next value of a time series (y_{n+1}). Sometimes one is interested in predicting many steps into the future. These tasks are often referred to as multi-step forecasting [Taieb et al., 2012]. Higher forecasting horizons typically lead to a more difficult predictive task due to the increased uncertainty [Weigend, 2018].

2.2 Time Series Models
Several models for time series analysis have been proposed in the literature. These are not only devised to forecast the future behaviour of time series but also to help understand the underlying structure of the data. In this section, we outline a few of the most commonly used forecasting methods.

The naive method, also known as the random walk forecast, predicts the future values of the time series according to the last known observation:

ŷ_{n+h} = y_n    (1)

There is empirical evidence that this method presents a reasonable fit for financial time series data [Kilian and Taylor, 2003]. The seasonal naive model works similarly to the naive method. The difference is that the seasonal naive approach uses the previously known value from the same season of the intended forecast:

ŷ_{n+h} = y_{n+h−m}    (2)

where m denotes the seasonal period.
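The two benchmarks above can be stated in a few lines. The following Python sketch is our own illustration (the paper's experiments use R; the function names are ours):

```python
def naive_forecast(y, h):
    # Random walk forecast (eq. 1): repeat the last observed value h times.
    return [y[-1]] * h

def seasonal_naive_forecast(y, h, m):
    # Seasonal naive forecast (eq. 2): y_hat(n+h) = y(n+h-m), for h <= m.
    n = len(y)
    return [y[n + i - m] for i in range(h)]
```

For example, with a quarterly series (m = 4), the seasonal naive forecast for the next four values simply replays the last observed year.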
The ARMA (Auto-Regressive Moving Average) model is one of the most commonly used methods to model univariate time series. ARMA(p,q) combines two components: AR(p) and MA(q).

According to the AR(p) model, the value of a given time series, y_n, can be estimated using a linear combination of the p past observations, together with an error term ϵ_n and a constant term c [Box et al., 2015]:

y_n = c + Σ_{i=1}^{p} φ_i y_{n−i} + ϵ_n    (3)

where φ_i, ∀i ∈ {1, . . . , p}, denote the model parameters, and p represents the order of the model.

The AR(p) model uses the past values of the time series as explanatory variables. Similarly, the MA(q) model uses past errors as explanatory variables:

y_n = µ + Σ_{i=1}^{q} θ_i ϵ_{n−i} + ϵ_n    (4)

where µ denotes the mean of the observations, θ_i, ∀i ∈ {1, . . . , q}, represent the parameters of the model, and q denotes the order of the model. Essentially, the MA(q) method models the time series according to the random errors that occurred in the past q lags [Chatfield, 2000].

Effectively, the ARMA(p,q) model can be constructed by combining the AR(p) model with the MA(q) model:

y_n = c + Σ_{i=1}^{p} φ_i y_{n−i} + Σ_{i=1}^{q} θ_i ϵ_{n−i} + ϵ_n    (5)

The ARMA(p,q) model is defined for stationary data. However, many interesting real-world phenomena exhibit a non-stationary structure, e.g. time series with trend and seasonality. The ARIMA(p,d,q) model overcomes this limitation by including an integration parameter of order d. Essentially, ARIMA works by applying d differencing transformations to the time series (until it becomes stationary) before applying ARMA(p,q).

The exponential smoothing model [Gardner Jr, 1985] is similar to the AR(p) model in the sense that it models the future values of a time series using a linear combination of its past observations. In this case, however, exponential smoothing methods produce weighted averages of the past values, where the weights decay exponentially as the observations get older [Hyndman and Athanasopoulos, 2018]. For example, in a simple exponential smoothing method, the prediction for y_{n+1} can be defined as follows:

y_{n+1} = y_n β_0 + y_{n−1} β_1 + y_{n−2} β_2 + · · ·    (6)

where the {β_i} represent the weights of the past observations. There are several types of exponential smoothing methods. For a complete read, we refer to the work by [Hyndman and Athanasopoulos, 2018].
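Estimating the coefficients of an AR(p) model as in eq. (3) reduces to a linear least-squares problem. A bare-bones Python sketch of this idea (our own illustration; the auto.arima routine used later in the paper additionally selects the orders and handles integration automatically):

```python
import numpy as np

def fit_ar(y, p):
    # Regress y_n on [1, y_(n-1), ..., y_(n-p)] by ordinary least squares,
    # returning the constant c and the parameters phi_1..phi_p of eq. (3).
    targets = np.array(y[p:])
    design = np.array([[1.0] + [y[i - j] for j in range(1, p + 1)]
                       for i in range(p, len(y))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef[0], coef[1:]

def ar_predict(c, phi, history):
    # One-step-ahead forecast: c + sum_i phi_i * y_(n-i).
    return c + sum(w * history[-(i + 1)] for i, w in enumerate(phi))
```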
2.3 More on the AR(p) Model
From a machine learning perspective, time series forecasting is usually formalized as an auto-regressive task, i.e., based on an AR(p) model. This type of procedure projects a time series into a Euclidean space according to Takens' theorem regarding time delay embedding [Takens, 1981].

Using common terminology in the machine learning literature, a set of observations (x_i, y_i) is constructed [Michalski et al., 2013]. In each observation, the value of y_i is modelled based on the p values before it: x_i = {y_{i−1}, y_{i−2}, . . . , y_{i−p}}, where y_i ∈ Y ⊂ R represents the value we want to predict, and x_i ∈ X ⊂ R^p represents the feature vector. The objective is to construct a model f : X → Y, where f denotes the regression function. In other words, the principle behind this approach is to model the conditional distribution of the i-th value of the time series given its p past values: f(y_i | x_i). In essence, this approach leads to a multiple regression problem; the temporal dependency is modelled by having past observations as explanatory variables. Following this formulation, we can resort to any algorithm from the regression toolbox to solve the predictive task.
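The time delay embedding described above can be sketched in a few lines of Python (an illustration of the formulation, not the paper's code):

```python
def embed_time_series(y, p):
    # Project the series into a Euclidean space: each target y_i is paired
    # with the vector of its p preceding values (the feature vector x_i).
    features = [y[i - p:i] for i in range(p, len(y))]
    targets = y[p:]
    return features, targets
```

Any regression learner can then be trained on the resulting (features, targets) pairs.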
2.4 Related Work
Machine learning methods have been increasingly used to tackle univariate time series forecasting problems. However, there is relatively little work comparing their predictive performance with that of traditional statistical methods. [Hill et al., 1996] compared a multi-layer perceptron with statistical methods; the neural network is shown to perform significantly better than the latter. [Ahmed et al., 2010] present an analysis of different machine learning methods for this task using time series from the M3 competition [Makridakis and Hibon, 2000]. Their results suggest that the multi-layer perceptron and Gaussian process methods show the best predictive performance. However, the authors do not compare these methods with state of the art approaches, such as ARIMA or exponential smoothing. In a different case study, [Cerqueira et al., 2019] compare different forecasting models, including statistical methods and machine learning methods. In their analysis, the latter approaches present a better average rank (better predictive performance) relative to the former. In particular, a rule-based regression model, which is a variant of the model tree by Quinlan, presents the best average rank across 62 time series. [Makridakis et al., 2018] extend the study by [Ahmed et al., 2010] by including several statistical methods in their experimental setup. Their results suggest that most of the statistical methods systematically outperform machine learning methods for univariate time series forecasting. This effect is noticeable for both one-step and multi-step forecasting. The machine learning methods the authors analyze include different types of neural networks (e.g. a long short-term memory, multi-layer perceptron), the nearest neighbours method, a decision tree, support vector regression, and Gaussian processes. On the other hand, the statistical methods include ARIMA, naive, exponential smoothing, and theta, among others.

Despite the extent of their comparative study, we hypothesize that the experimental setup designed by [Makridakis et al., 2018] is biased in terms of sample size. They use a large set of 1045 time series from the M3 competition. However, each one of these time series is extremely small in size: the average number of observations is 116. Our working hypothesis is that, in these conditions, machine learning methods are unable to learn an adequate regression function for generalization. In the next section, we present a new study comparing traditional statistical methods with machine learning methods. Our objective is to test the hypothesis outlined above and check whether sample size has any effect on the relative predictive performance of different types of forecasting methods.

3 Empirical Experiments
Our goal in this paper is to address the following research question:

• Is sample size important in the relative predictive performance of forecasting methods?

We are interested in comparing statistical methods with machine learning methods for univariate time series forecasting tasks. Within this predictive task, we will analyse the impact of different horizons (one-step-ahead and multi-step-ahead forecasting).

3.1 Forecasting Methods
In this section, we outline the algorithms used for forecasting. We include five statistical methods and five machine learning algorithms.

Statistical Methods
The statistical methods used in the experiments are the following.

ARIMA: The Auto-Regressive Integrated Moving Average model. We use the auto.arima implementation provided in the forecast R package [Hyndman et al., 2014], which controls for several time series components, including trend or seasonality;

Naive2: A seasonal random walk forecasting benchmark, implemented using the snaive function available in the forecast R package [Hyndman et al., 2014];

Theta: The Theta method by [Assimakopoulos and Nikolopoulos, 2000], which is equivalent to simple exponential smoothing with drift;

ETS: The exponential smoothing state-space model typically used for forecasting [Gardner Jr, 1985];

Tbats: An exponential smoothing state space model with Box-Cox transformation, ARMA errors, trend and seasonal components [De Livera et al., 2011].

In order to apply these models, we use the implementations available in the forecast R package [Hyndman et al., 2014]. This package automatically tunes the methods ETS, Tbats, and ARIMA to an optimal parameter setting.

Machine Learning Methods
In turn, we applied the AR(p) model with the five following machine learning algorithms.

RBR: A rule-based model from the Cubist R package [Kuhn et al., 2014], which is a variant of the Model Tree [Quinlan, 1993];

RF: A Random Forest method, which is an ensemble of decision trees [Breiman, 2001]. We use the implementation from the ranger R package [Wright, 2015];

GP: Gaussian Process regression. We use the implementation available in the kernlab R package [Karatzoglou et al., 2004];

MARS: The multivariate adaptive regression splines [Friedman and others, 1991] method, using the earth R package implementation [Milborrow, 2016];

GLM: Generalized linear model [McCullagh, 2019] regression with a Gaussian distribution and a penalty mixing parameter. This model is implemented using the glmnet R package [Friedman et al., 2010].
These learning algorithms have been shown to present a competitive predictive performance with state of the art forecasting models [Cerqueira et al., 2019]. Other widely used machine learning methods could have been included, for example extreme gradient boosting or recurrent neural networks; the latter have been shown to be a good fit for sequential data, which is the case of time series. Finally, we optimized the parameters of these five models using a grid search, which was carried out using validation data. The list of parameter values tested is described in Table 1.
Table 1: Summary of the learning algorithms

ID    Algorithm                  Parameter        Values
MARS  Multivar. A. R. Splines    Degree           {1, 2, 3}
                                 No. terms        {2, 5, 7, 15}
                                 Method           {Forward, Backward}
RF    Random forest              No. trees        {50, 100, 250, 500}
RBR   Rule-based regr.           No. iterations   {1, 5, 10, 25, 50}
GLM   Generalised Linear Regr.   Penalty mixing   {0, 0.25, 0.5, 0.75, 1}
GP    Gaussian Processes         Kernel           {Linear, RBF, Polynomial, Laplace}
                                 Tolerance        {0.001, 0.01}
3.2 Datasets and Experimental Setup
We centre our study on univariate time series. We use a set of time series from the benchmark database tsdl [Hyndman and Yang, 2019]. From this database, we selected all the univariate time series with at least 1000 observations and which have no missing values. This query returned 55 time series. These show a varying sampling frequency (daily, monthly, etc.), and are from different domains of application (e.g. healthcare, physics, economics). For a complete description of these time series we refer to the database source [Hyndman and Yang, 2019]. We also included 35 time series used in [Cerqueira et al., 2019]. Essentially, from the set of 62 used by the authors, we selected those with at least 1000 observations and which were not originally from the tsdl database (since these were already retrieved as described above). We refer to the work in [Cerqueira et al., 2019] for a description of the time series. In summary, our analysis is based on 90 time series. We truncated the data at 1000 observations to make all the time series have the same size.

For the machine learning methods, we set the embedding size (p) to 10. Notwithstanding, this parameter can be optimized using, for example, the False Nearest Neighbours method [Kennel et al., 1992]. Regarding the statistical methods, and where it is applicable, we set p according to the respective implementation of the forecast R package [Hyndman et al., 2014].

Regarding time series pre-processing, we follow the procedure by [Makridakis et al., 2018]. First, we apply the Box-Cox transformation to the data to stabilize the variance. The transformation parameter is optimized according to [Guerrero, 1993]. Second, we account for seasonality. We consider a time series to be seasonal according to the test by [Wang et al., 2006]. If it is, we perform a multiplicative decomposition to remove seasonality. Similarly to [Makridakis et al., 2018], this process is skipped for ARIMA and ETS as they have their own automatic methods for coping with seasonality. Finally, we apply the Cox-Stuart test [Cox and Stuart, 1955] to determine if the trend component should be removed using first differences. This process was applied to both types of methods.
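As a reference for the first pre-processing step, the Box-Cox transform and its inverse can be sketched as follows (a minimal Python illustration; the selection of the transformation parameter by [Guerrero, 1993] is not shown):

```python
import math

def box_cox(y, lam):
    # Box-Cox transform for positive data: log when lambda = 0,
    # otherwise (y^lambda - 1) / lambda.
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

def inv_box_cox(z, lam):
    # Inverse transform, used to map forecasts back to the original scale.
    if lam == 0:
        return [math.exp(v) for v in z]
    return [(lam * v + 1) ** (1.0 / lam) for v in z]
```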
3.3 Methodology
In terms of estimation methodology, [Makridakis et al., 2018] perform a simple holdout. The initial n − 18 observations are used to fit the models. Then, the models are used to forecast the subsequent 18 observations.

We also set the forecasting horizon to 18 observations. However, since our goal is to control for sample size, we employ a different estimation procedure. Particularly, we use a prequential procedure to build a learning curve. A learning curve denotes a set of performance scores of a predictive model, in which the set is ordered as the sample size grows [Provost et al., 1999]. Prequential denotes an evaluation procedure in which an observation is first used to test a predictive model [Dawid, 1984]. Then, this observation becomes part of the training set and is used to update the respective predictive model. Prequential is a commonly used evaluation methodology in data stream mining [Gama, 2010].

We apply prequential in a growing fashion, which means that the training set grows as observations become available after testing. An alternative to this setting is a sliding approach, where older observations are discarded when new ones become available. We start applying the prequential procedure at the 18-th observation to match the forecasting horizon. In other words, the first iteration of prequential is to train the models using the initial 18 observations of a time series. These models are then used to forecast future observations. We elaborate on this approach in the next subsections.
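The growing-window prequential procedure can be sketched as follows (a minimal Python illustration with placeholder `fit` and `predict` callables; the paper's implementation is in R):

```python
def prequential(y, fit, predict, start=18):
    # Growing-window prequential evaluation: train on y[:i], test on y[i],
    # then absorb y[i] into the training set and move on.
    errors = []
    for i in range(start, len(y)):
        model = fit(y[:i])
        errors.append(abs(predict(model) - y[i]))
    return errors
```

The sequence of errors, ordered by training sample size, is exactly what a learning curve plots.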
Following both [Hyndman and Koehler, 2006] and [Makridakis et al., 2018], we use the mean absolute scaled error (MASE) as evaluation metric. We also investigated the use of the symmetric mean absolute percentage error (SMAPE), but we found a large number of division-by-zero problems. Notwithstanding, in our main analysis, we use the rank to compare the different models. A model with a rank of 1 in a particular time series means that this method was the best performing model (with the lowest MASE) in that time series. We use the rank to compare the different forecasting approaches because it is a non-parametric method, hence robust to outliers.
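MASE scales the mean absolute error of a forecast by the in-sample mean absolute error of the one-step naive forecast [Hyndman and Koehler, 2006]. A minimal Python illustration for the non-seasonal case:

```python
def mase(y_train, y_true, y_pred):
    # Scale: in-sample MAE of the one-step naive (random walk) forecast.
    scale = sum(abs(y_train[i] - y_train[i - 1])
                for i in range(1, len(y_train))) / (len(y_train) - 1)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / scale
```

A value below 1 means the forecast beats the naive benchmark on average.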
3.4 Results for one-step-ahead forecasting
The first set of experiments we have carried out was designed to evaluate the impact of growing the training set size on the ability of the models to forecast the next value of the time series. To emphasize how prequential was applied, we have used the following iterative experimental procedure. In the first iteration, we have learned each model using the first 18 observations of the time series. These models were then used to make a forecast for the 19-th observation, and were ranked according to the error of that prediction (quantified by MASE). In the second iteration, we grow the training set to 19 observations and repeat the forecasting exercise, this time with the goal of predicting the 20-th value. This process is iterated until the end of each time series.

Figure 1 shows the result of the procedure just described. Particularly, this plot represents a learning curve for each forecasting model, where the models are colored by model type (machine learning or statistical). The x-axis denotes the training sample size, i.e., how much data is used to fit the forecasting models. The y-axis represents the average rank of each model across all the 90 time series, which is computed at each testing observation of the prequential procedure. For visualization purposes, we smooth the average rank of each model using a moving average over 50 observations. The two bold smoothed lines represent the smoothed average rank across each type of method according to the LOESS method. Finally, the vertical black line at point 144 represents the maximum sample size used in the experiments by [Makridakis et al., 2018].

Figure 1: Learning curve using the average rank of each forecasting method, smoothed using a moving average of 50 periods. Results obtained for one-step ahead forecasting.

Figure 2: Learning curve using the average rank of each forecasting method (Naive2 excluded), smoothed using a moving average of 50 periods.

The results depicted in Figure 1 show a clear tendency: when only few observations are available, the statistical methods present a better performance. However, as the sample size grows, machine learning methods outperform them.

Figure 2 presents a similar analysis as Figure 1. The difference is that we now exclude the statistical method Naive2, whose poor performance biased the results toward machine learning methods. Despite this, in the experiments reported by [Makridakis et al., 2018], this method outperforms many machine learning methods for one-step-ahead forecasting. Our results confirm the conclusions drawn from the experiments presented by [Makridakis et al., 2018] (the vertical black line in our figures): when the sample size is that small, statistical methods show a better predictive performance than machine learning ones.

Finally, Figure 3 presents a similar analysis as before using the actual MASE values. In relative terms between both types of methods, the results are consistent with the average rank analysis. This figure suggests that both types of methods improve their MASE score as the training sample size increases.

Figure 3: Learning curve using the MASE of each forecasting method, smoothed using a moving average of 50 periods. Results obtained for one-step ahead forecasting.

Results by Individual Model
The average rank graphics depicted in Figure 1 show a score for each model at each of the testing observations of the prequential procedure. We average this value across the testing observations to determine the average of the average rank of each individual model. The results of this analysis are presented in Table 2. The method RBR presents the best score.

Table 2: Average of the average rank of each model across each testing observation in the prequential procedure (one-step ahead forecasting).

RBR   ARIMA  ETS   RF    GP    Tbats  MARS  GLM   Theta  Naive2
5.11  5.26   5.39  5.39  5.39  5.45   5.47  5.54  5.73   6.28

3.5 Results for multi-step-ahead forecasting
In order to evaluate the multi-step ahead forecasting scenario, we used a similar approach to the one-step ahead setting. The difference is that, instead of predicting only the next value, we now predict the next 18 observations. To be precise, in the first iteration of the prequential procedure, each model is fit using the first 18 observations of the time series. These models were then used to make a forecast for the next 18 observations (from the 19-th to the 36-th). As before, the models were ranked according to the error of these predictions (quantified by MASE). On the second iteration, we grow the training set to be 19 observations, and the forecasting exercise is repeated.

Table 3: Average of the average rank of each model across each testing observation in the prequential procedure (multi-step ahead forecasting).

ARIMA  RBR   GLM   Tbats  GP    ETS   MARS  RF    Theta  Naive2
4.77   4.95  4.97  5.11   5.24  5.43  5.69  5.74  6.12   6.98

Table 3 presents the results by individual model in a similar manner to Table 2, but for multi-step forecasting. In this setting, the model with the best score is ARIMA.
For machine learning models, we focus on an iterated (also known as recursive) approach to multi-step forecasting [Taieb et al., 2012]. Initially, a model is built for one-step-ahead forecasting. To forecast observations beyond one point (i.e., h > 1), the model uses its own predictions as input variables in an iterative way. We refer to the work by [Taieb et al., 2012] for an overview and analysis of different approaches to multi-step forecasting.
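The iterated strategy can be sketched as follows (a Python illustration; `one_step` stands for any one-step-ahead model):

```python
def recursive_forecast(one_step, history, p, h):
    # Iterated (recursive) multi-step forecasting: each prediction is fed
    # back into the input window to produce the next one.
    window = list(history)
    forecasts = []
    for _ in range(h):
        pred = one_step(window[-p:])
        forecasts.append(pred)
        window.append(pred)
    return forecasts
```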
Figures 4 and 5 show the average (over the 90 time series) rank of each model as we grow the training set according to the procedure we have just described. The figures are similar, but the results of the Naive2 method were excluded from the second one using the same rationale as before.

Figure 4: Learning curve using the smoothed average rank of each forecasting method for h = 18. Results for multi-step ahead forecasting.

Figure 5: Learning curve using the smoothed average rank of each forecasting method for h = 18, Naive2 excluded.

Analyzing the results by model type, the main conclusion in this scenario is similar to the one obtained in the one-step ahead setting: according to the smoothed lines, statistical methods are only better than machine learning ones when the sample size is small. In this setting, however, the two types of methods seem to even out as the training sample size grows (provided we ignore Naive2). According to Figure 6, the MASE scores of the models in this scenario are considerably higher than in the one-step ahead case. This is expected given the underlying increased uncertainty.
3.6 Computational Complexity
In the interest of completeness, we also include an analysis of the computational complexity of each method. We evaluate this according to the computational time spent by a model, which we define as the time a model takes to complete the prequential procedure outlined in Section 3.3. Similarly to [Makridakis et al., 2018], we define the computational complexity (CC) of a model m as follows:

CC = Computational Time_m / Computational Time_Naive2    (7)

Essentially, we normalize the computational time of each method by the computational time of the Naive2 method. The results are presented in Figure 7 as a bar plot. The bars in the graphic are log scaled; the original value before taking the logarithm is shown within each bar. From the figure, the method with the worst CC is ARIMA, followed by Tbats. The results are driven by the fact that the implementations of the statistical methods (except Naive2 and Theta) include comprehensive automatic parameter optimization in the forecast R package. The optimization of the machine learning methods carried out in our experiments was not as exhaustive. Therefore, the CC of these methods is not directly comparable to that of the statistical methods.

Figure 6: Learning curve using the MASE of each forecasting method for h = 18, smoothed using a moving average of 50 periods.

Figure 7: Computational complexity of each method relative to the Naive2 benchmark model (log scaled).
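Equation (7) is a simple normalization. For illustration (with made-up timings, in seconds):

```python
def computational_complexity(times, baseline="Naive2"):
    # Eq. (7): normalize each model's run time by the Naive2 run time.
    base = times[baseline]
    return {name: t / base for name, t in times.items()}
```

For example, `computational_complexity({"Naive2": 2.0, "ARIMA": 12.0})` reports a CC of 6.0 for ARIMA.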

4 Discussion
In the previous section, we carried out a simple experiment to show that sample size matters when comparing machine learning methods with statistical methods for univariate time series forecasting. In this section, we discuss these results.
R package forecastIn[Hyndman
this section,
et we
al.,discuss
2014].these
This package when comparing different forecasting models. We backed
contains implementations that search for the best parameter this claim
4.3 using 90 Size
On Sample time and
seriesthe
comprised
No Freeof 1000 observa-
settings
4.1 of the models,Setup
Experimental and is considered a software pack- tions. Theorem
We do not claim that this number is the optimal sample
age for automated time series forecasting models [Taylor size for fitting forecasting models. It is difficult to tell
and apriori
Letham, 2018]. Regarding the machine learning models, we the amount of data necessary to solve a predictive task. It
focused on building AR(p) models using these learning al- de-
gorithms. Other approaches could be analyzed, for example pends on many factors, such as the complexity of the
problem
coupling
[Diet- an AR(p) model with a recurrent architecture or the complexity of the learning
algorithm.
terich, 2002] or with summary We believe that the work by [Makridakis et al., 2018] is
statistics.
Our choice of machine learning algorithms was driven by biased towards small, low frequency, datasets. Naturally,
a the
recent study [Cerqueira et al., 2019], but other learning evidence that machine learning models are unable to
algo- general-
rithms could be applied. For example, we did not apply any ize from small datasets can be regarded as a limitation
relative
neural network, which have been successfully applied to se- to traditional statistical ones. However, machine learning
quential data problems [Chung et al., 2014]. Our goal was to can
show that machine learning in general is a valid approach to make an important impact in larger time series. Technological
time series forecasting, even without an extensive model se- advances such as the widespread adoption of sensor data en-
lection procedure nor a parameter tuning process as abled the collection of large, high frequency, time series.
This
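The learning-curve experiment behind Figures 4–6 (truncate each series, refit every model, re-evaluate as the training sample grows) can be sketched as follows. This is an illustrative Python/NumPy sketch, not the paper's R pipeline: it uses a plain least-squares AR(p) in place of the actual learning algorithms, and `embed`, `ar_fit`, and `learning_curve` are hypothetical names:

```python
import numpy as np

def embed(y, p):
    # Time-delay embedding: rows are (y[t-p], ..., y[t-1]), target is y[t].
    X = np.column_stack([y[i:len(y) - p + i] for i in range(p)])
    return X, y[p:]

def ar_fit(X, t):
    # Least-squares AR(p) coefficients with an intercept term.
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, t, rcond=None)[0]

def learning_curve(y, sizes, p=3):
    # One-step-ahead absolute error after training on the first n points.
    errors = []
    for n in sizes:
        X, t = embed(y[:n], p)
        coef = ar_fit(X, t)
        pred = coef[0] + y[n - p:n] @ coef[1:]  # forecast y[n] from last p values
        errors.append(abs(y[n] - pred))
    return errors
```

Running this loop for every method on each series, and ranking the methods at each sample size, yields curves in the spirit of Figures 4 and 5.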
Finally, we also remark that, even with a large amount of data, it is not obvious that a machine learning method would outperform a statistical method. This reasoning is in accordance with the No Free Lunch theorem [Wolpert, 1996], which states that no learning algorithm is the most appropriate in all scenarios. The same rationale can be applied to small datasets.

5 Final Remarks

Makridakis claims that machine learning practitioners "working on forecasting applications need to do something to improve the accuracy of their methods" [Makridakis et al., 2018]. We claim that these practitioners should start by collecting as much data as possible. Moreover, it is also advisable for practitioners to include both types of forecasting methods in their studies to enrich the experimental setup.

The code to reproduce the experiments carried out in this paper can be found at https://github.com/vcerqueira/MLforForecasting.

Acknowledgements

The work of V. Cerqueira was financially supported by Fundação para a Ciência e a Tecnologia (FCT), the Portuguese funding agency that supports science, technology, and innovation, through the Ph.D. grant SFRH/BD/135705/2018. The work of L. Torgo was undertaken, in part, thanks to funding from the Canada Research Chairs program.

References

[Ahmed et al., 2010] Nesreen K Ahmed, Amir F Atiya, Neamat El Gayar, and Hisham El-Shishiny. An empirical comparison of machine learning models for time series forecasting. Econometric Reviews, 29(5-6):594–621, 2010.

[Assimakopoulos and Nikolopoulos, 2000] Vassilis Assimakopoulos and Konstantinos Nikolopoulos. The theta model: a decomposition approach to forecasting. International Journal of Forecasting, 16(4):521–530, 2000.

[Box et al., 2015] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.

[Breiman, 2001] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[Carbonneau et al., 2008] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184(3):1140–1154, 2008.

[Cerqueira et al., 2019] Vitor Cerqueira, Luís Torgo, Fábio Pinto, and Carlos Soares. Arbitrage of forecasting experts. Machine Learning, 108(6):913–944, 2019.

[Chatfield, 2000] Chris Chatfield. Time-series forecasting. CRC Press, 2000.

[Chung et al., 2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[Cox and Stuart, 1955] David Roxbee Cox and Alan Stuart. Some quick sign tests for trend in location and dispersion. Biometrika, 42(1/2):80–95, 1955.

[Dawid, 1984] A Philip Dawid. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society: Series A (General), 147(2):278–290, 1984.

[De Livera et al., 2011] Alysha M De Livera, Rob J Hyndman, and Ralph D Snyder. Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496):1513–1527, 2011.

[Dietterich, 2002] Thomas G Dietterich. Machine learning for sequential data: A review. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 15–30. Springer, 2002.

[Friedman and others, 1991] Jerome H Friedman et al. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.

[Friedman et al., 2010] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[Gama, 2010] João Gama. Knowledge discovery from data streams. Chapman and Hall/CRC, 2010.

[Gardner Jr, 1985] Everette S Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985.

[Guerrero, 1993] Víctor M Guerrero. Time-series analysis supported by power transformations. Journal of Forecasting, 12(1):37–48, 1993.

[Hill et al., 1996] Tim Hill, Marcus O'Connor, and William Remus. Neural network models for time series forecasts. Management Science, 42(7):1082–1092, 1996.

[Hyndman and Athanasopoulos, 2018] Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.

[Hyndman and Koehler, 2006] Rob J Hyndman and Anne B Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688, 2006.

[Hyndman and Yang, 2019] Rob Hyndman and Yangzhuoran Yang. tsdl: Time Series Data Library, 2019. https://finyang.github.io/tsdl/, https://github.com/FinYang/tsdl.

[Hyndman et al., 2014] Rob J Hyndman, with contributions from George Athanasopoulos, Slava Razbash, Drew Schmidt, Zhenyu Zhou, Yousaf Khan, Christoph Bergmeir, and Earo Wang. forecast: Forecasting functions for time series and linear models, 2014. R package.

[Karatzoglou et al., 2004] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.

[Kennel et al., 1992] Matthew B Kennel, Reggie Brown, and Henry DI Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A, 45(6):3403, 1992.

[Kilian and Taylor, 2003] Lutz Kilian and Mark P Taylor. Why is it so difficult to beat the random walk forecast of exchange rates? Journal of International Economics, 60(1):85–107, 2003.

[Kuhn et al., 2014] Max Kuhn, Steve Weston, Chris Keefer, and Nathan Coulter. C code for Cubist by Ross Quinlan. Cubist: Rule- and Instance-Based Regression Modeling, 2014. R package version 0.0.18.

[Lee and Mark, 2010] Joon Lee and Roger G Mark. An investigation of patterns in hemodynamic data indicative of impending hypotension in intensive care. Biomedical Engineering Online, 9(1):62, 2010.

[Makridakis and Hibon, 2000] Spyros Makridakis and Michele Hibon. The M3-competition: results, conclusions and implications. International Journal of Forecasting, 16(4):451–476, 2000.

[Makridakis et al., 2018] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. Statistical and machine learning forecasting methods: Concerns and ways forward. PloS One, 13(3):e0194889, 2018.

[McCullagh, 2019] Peter McCullagh. Generalized linear models. Routledge, 2019.

[Michalski et al., 2013] Ryszard S Michalski, Jaime G Carbonell, and Tom M Mitchell. Machine learning: An artificial intelligence approach. Springer Science & Business Media, 2013.

[Milborrow, 2016] S. Milborrow. earth: Multivariate Adaptive Regression Splines, 2016. R package version 4.4.4.

[Provost et al., 1999] Foster Provost, David Jensen, and Tim Oates. Efficient progressive sampling. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 23–32. ACM, 1999.

[Quinlan, 1993] J Ross Quinlan. Combining instance-based and model-based learning. In Proceedings of the tenth international conference on machine learning, pages 236–243, 1993.

[Taieb et al., 2012] Souhaib Ben Taieb, Gianluca Bontempi, Amir F Atiya, and Antti Sorjamaa. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Systems with Applications, 39(8):7067–7083, 2012.

[Takens, 1981] Floris Takens. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, pages 366–381. Springer, 1981.

[Taylor and Letham, 2018] Sean J Taylor and Benjamin Letham. Forecasting at scale. The American Statistician, 72(1):37–45, 2018.

[Voyant et al., 2017] Cyril Voyant, Gilles Notton, Soteris Kalogirou, Marie-Laure Nivet, Christophe Paoli, Fabrice Motte, and Alexis Fouilloy. Machine learning methods for solar radiation forecasting: A review. Renewable Energy, 2017.

[Wang et al., 2006] Xiaozhe Wang, Kate Smith, and Rob Hyndman. Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13(3):335–364, 2006.

[Weigend, 2018] Andreas S Weigend. Time series prediction: forecasting the future and understanding the past. Routledge, 2018.

[Wolpert, 1996] David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.

[Wright, 2015] Marvin N. Wright. ranger: A Fast Implementation of Random Forests, 2015. R package.

[Xingjian et al., 2015] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.