Machine Learning vs Statistical Methods for Time Series Forecasting: Size Matters

Vitor Cerqueira1,2∗, Luis Torgo1,2,3 and Carlos Soares1,2
1INESC TEC, Porto, Portugal
2University of Porto
3Dalhousie University
[email protected], [email protected], [email protected]
arXiv:1909.13316v1 [stat.ML] 29 Sep 2019
i.e., the prediction of the next value of a time series (y_{n+1}). Sometimes one is interested in predicting many steps into the future. These tasks are often referred to as multi-step forecasting [Taieb et al., 2012]. Higher forecasting horizons typically lead to a more difficult predictive task due to the increased uncertainty [Weigend, 2018].

2.2 Time Series Models
Several models for time series analysis have been proposed in the literature. These are not only devised to forecast the future behaviour of time series but also to help understand the underlying structure of the data. In this section, we outline a few of the most commonly used forecasting methods.

The naive method, also known as the random walk forecast, predicts the future values of the time series according to the last known observation.

The ARMA(p,q) model is defined for stationary data. However, many interesting phenomena in the real world exhibit a non-stationary structure, e.g. time series with trend and seasonality. The ARIMA(p,d,q) model overcomes this limitation by including an integration parameter of order d. Essentially, ARIMA works by applying d differencing transformations to the time series (until it becomes stationary), before applying ARMA(p,q).

The exponential smoothing model [Gardner Jr, 1985] is similar to the AR(p) model in the sense that it models the future values of a time series using a linear combination of its past observations. In this case, however, exponential smoothing methods produce weighted averages of the past values, where the weight decays exponentially as the observations get older [Hyndman and Athanasopoulos, 2018]. For example, in a simple exponential smoothing method, the prediction for y_{n+1} is an exponentially weighted average of the available past observations.
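The three baselines above can be sketched in code. The paper's experiments rely on R packages; what follows is only an illustrative Python sketch (the function names are our own), covering the naive forecast, the d differencing transformations applied by ARIMA, and simple exponential smoothing:

```python
def naive_forecast(y):
    # random walk forecast: the next value is the last observed value
    return y[-1]

def difference(y, d=1):
    # apply d differencing transformations, as ARIMA(p, d, q) does
    # before fitting an ARMA(p, q) model
    for _ in range(d):
        y = [b - a for a, b in zip(y, y[1:])]
    return y

def ses_forecast(y, alpha=0.5):
    # simple exponential smoothing: a weighted average of past values
    # whose weights decay exponentially with the age of the observation
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level
```

For instance, with alpha = 0.5, ses_forecast([1, 2, 3]) weights the most recent observation most heavily and returns 2.25, whereas naive_forecast([1, 2, 3]) simply returns 3.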
The next value of the time series, y_n, can be estimated using a linear combination of the p past observations, together with an error term ε_n and a constant term c [Box et al., 2015]:

    y_n = c + Σ_{i=1}^{p} φ_i y_{n−i} + ε_n    (3)

…ski et al., 2013]. In each observation, the value of y_i is modelled based on the past p values before it: x_i = {y_{i−1}, y_{i−2}, …, y_{i−p}}, where y_i ∈ Y ⊂ R, which represents the vector of values we want to predict, and x_i ∈ X ⊂ R^p represents the feature vector. The objective is to construct a model f : X → Y, where f denotes the regression function. In other words, the principle behind this approach is to model the conditional distribution of the i-th value of the time series given its p past values: f(y_i | x_i). In essence, this approach leads to a multiple regression problem. The temporal dependency is modelled by having past observations as explanatory variables. Following this formulation, we can resort to any algorithm from the regression toolbox to solve the predictive task.
2.4 Related Work
Machine learning methods have been increasingly used to tackle univariate time series forecasting problems. However, there is a small amount of work comparing their predictive performance relative to traditional statistical methods. [Hill et al., 1996] compared a multi-layer perceptron with statistical methods. The neural network method is shown to perform significantly better than the latter. [Ahmed et al., 2010] present an analysis of different machine learning methods for this task using time series from the M3 competition [Makridakis and Hibon, 2000]. Their results suggest that the multi-layer perceptron and Gaussian processes methods show the best predictive performance. However, the authors do not compare these methods with state of the art approaches, such as ARIMA or exponential smoothing. In a different case study, [Cerqueira et al., 2019] compare different forecasting models, including statistical methods and machine learning methods. In their analysis, the latter approaches present a better average rank (better predictive performance) relative to the former. Particularly, a rule-based regression model, which is a variant of the model tree by Quinlan, presents the best average rank across 62 time series. [Makridakis et al., 2018] extend the study by [Ahmed et al., 2010] by including several statistical methods in their experimental setup. Their results suggest that most of the statistical methods systematically outperform machine learning methods for univariate time series forecasting. This effect is noticeable for both one-step and multi-step forecasting. The machine learning methods the authors analyze include different types of neural networks (e.g. a long short-term memory, multi-layer perceptron), the nearest neighbours method, a decision tree, support vector regression, and Gaussian processes. On the other hand, the statistical methods include ARIMA, naive, exponential smoothing, and theta, among others.

Despite the extension of their comparative study, we hypothesize that the experimental setup designed by [Makridakis et al., 2018] is biased in terms of sample size. They use a large set of 1045 time series from the M3 competition. However, each one of these time series is extremely small in size. The average number of observations is 116. Our working hypothesis is that, in these conditions, machine learning methods are unable to learn an adequate regression function for generalization. In the next section, we present a new study comparing traditional statistical methods with machine learning methods. Our objective is to test the hypothesis outlined above and check whether sample size has any effect on the relative predictive performance of different types of forecasting methods.

3 Empirical Experiments
Our goal in this paper is to address the following research question:

• Is sample size important in the relative predictive performance of forecasting methods?

We are interested in comparing statistical methods with machine learning methods for univariate time series forecasting tasks. Within this predictive task we will analyse the impact of different horizons (one-step-ahead and multi-step-ahead forecasting).

3.1 Forecasting Methods
In this section, we outline the algorithms used for forecasting. We include five statistical methods and five machine learning algorithms.

Statistical Methods
The statistical methods used in the experiments are the following.

ARIMA: The Auto-Regressive Integrated Moving Average model. We use the auto.arima implementation provided in the forecast R package [Hyndman et al., 2014], which controls for several time series components, including trend or seasonality;

Naive2: A seasonal random walk forecasting benchmark, implemented using the snaive function available in the forecast R package [Hyndman et al., 2014];

Theta: The Theta method by [Assimakopoulos and Nikolopoulos, 2000], which is equivalent to simple exponential smoothing with drift;

ETS: The exponential smoothing state-space model typically used for forecasting [Gardner Jr, 1985];

Tbats: An exponential smoothing state space model with Box-Cox transformation, ARMA errors, and trend and seasonal components [De Livera et al., 2011].

In order to apply these models, we use the implementations available in the forecast R package [Hyndman et al., 2014]. This package automatically tunes the methods ETS, Tbats, and ARIMA to an optimal parameter setting.

Machine Learning Methods
In turn, we applied the AR(p) model with the five following machine learning algorithms.

RBR: A rule-based model from the Cubist R package [Kuhn et al., 2014], which is a variant of the Model Tree [Quinlan, 1993];

RF: A Random Forest method, which is an ensemble of decision trees [Breiman, 2001]. We use the implementation from the ranger R package [Wright, 2015];

GP: Gaussian Process regression. We use the implementation available in the kernlab R package [Karatzoglou et al., 2004];

MARS: The multivariate adaptive regression splines [Friedman and others, 1991] method, using the earth R package implementation [Milborrow, 2016];

GLM: Generalized linear model [McCullagh, 2019] regression with a Gaussian distribution and a different penalty mixing. This model is implemented using the glmnet R package [Friedman et al., 2010].

These learning algorithms have been shown to present a competitive predictive performance with state of the art forecasting models [Cerqueira et al., 2019]. Other widely used machine learning methods could have been included, for example extreme gradient boosting or recurrent neural networks. The latter have been shown to be a good fit for sequential data, which is the case of time series. Finally, we optimized the parameters of these five models using a grid search, which was carried out using validation data. The list of parameter values tested is described in Table 1.

Table 1: Summary of the learning algorithms
ID    Algorithm                 Parameter       Value
MARS  Multivar. A. R. Splines   No. terms       {2, 5, 7, 15}
                                Degree          {1, 2, 3}
                                Method          {Forward, Backward}
RF    Random forest             No. trees       {50, 100, 250, 500}
RBR   Rule-based regr.          No. iterations  {1, 5, 10, 25, 50}
GLM   Generalised Linear Regr.  Penalty mixing  {0, 0.25, 0.5, 0.75, 1}
GP    Gaussian Processes        Kernel          {Linear, RBF, Polynomial, Laplace}
                                Tolerance       {0.001, 0.01}

3.2 Datasets and Experimental Setup
We centre our study on univariate time series. We use a set of time series from the benchmark database tsdl [Hyndman and Yang, 2019]. From this database, we selected all the univariate time series with at least 1000 observations and which have no missing values. This query returned 55 time series. These show a varying sampling frequency (daily, monthly, etc.), and are from different domains of application (e.g. healthcare, physics, economics). For a complete description of these time series we refer to the database source [Hyndman and Yang, 2019]. We also included 35 time series used in [Cerqueira et al., 2019]. Essentially, from the set of 62 used by the authors, we selected those with at least 1000 observations and which were not originally from the tsdl database (since these were already retrieved as described above). We refer to the work in [Cerqueira et al., 2019] for a description of these time series. In summary, our analysis is based on 90 time series. We truncated the data at 1000 observations to make all the time series have the same size.

For the machine learning methods, we set the embedding size (p) to 10. Notwithstanding, this parameter can be optimized using, for example, the False Nearest Neighbours method [Kennel et al., 1992]. Regarding the statistical methods, and where applicable, we set p according to the respective implementation of the forecast R package [Hyndman et al., 2014].

Regarding time series pre-processing, we follow the procedure by [Makridakis et al., 2018]. First, we start by applying the Box-Cox transformation to the data to stabilize the variance. The transformation parameter is optimized according to [Guerrero, 1993]. Second, we account for seasonality. We consider a time series to be seasonal according to the test by [Wang et al., 2006]. If it is, we perform a multiplicative decomposition to remove seasonality. Similarly to [Makridakis et al., 2018], this process is skipped for ARIMA and ETS as they have their own automatic methods for coping with seasonality. Finally, we apply the Cox-Stuart test [Cox and Stuart, 1955] to determine if the trend component should be removed using first differences. This process was applied to both types of methods.

3.3 Methodology
In terms of estimation methodology, [Makridakis et al., 2018] perform a simple holdout. The initial n − 18 observations are used to fit the models. Then, the models are used to forecast the subsequent 18 observations.

We also set the forecasting horizon to 18 observations. However, since our goal is to control for sample size, we employ a different estimation procedure. Particularly, we use a prequential procedure to build a learning curve. A learning curve denotes a set of performance scores of a predictive model, in which the set is ordered as the sample size grows [Provost et al., 1999]. Prequential denotes an evaluation procedure in which an observation is first used to test a predictive model [Dawid, 1984]. Then, this observation becomes part of the training set and is used to update the respective predictive model. Prequential is a commonly used evaluation methodology in data stream mining [Gama, 2010].

We apply prequential in a growing fashion, which means that the training set grows as observations become available after testing. An alternative to this setting is a sliding approach, where older observations are discarded when new ones become available. We start applying the prequential procedure at the 18-th observation to match the forecasting horizon. In other words, the first iteration of prequential is to train the models using the initial 18 observations of a time series. These models are then used to forecast future observations. We elaborate on this approach in the next subsections.

Following both [Hyndman and Koehler, 2006] and [Makridakis et al., 2018], we use the mean absolute scaled error (MASE) as the evaluation metric. We also investigated the use of the symmetric mean absolute percentage error (SMAPE), but we found a large number of division-by-zero problems. Notwithstanding, in our main analysis, we use the rank to compare different models. A model with a rank of 1 in a particular time series means that this method was the best performing model (with lowest MASE) in that time series. We use the rank to compare the different forecasting approaches because it is a non-parametric method, hence robust to outliers.

3.4 Results for one-step-ahead forecasting
The first set of experiments we have carried out was designed to evaluate the impact of growing the training set size on the ability of the models to forecast the next value of the time series. To emphasize how prequential was applied, we have used the following iterative experimental procedure. In the first iteration, we have learned each model using the first 18 observations of the time series. These models were then used to make a forecast for the 19-th observation.
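The growing-window procedure just described can be sketched as follows. This is an illustrative Python fragment under our own simplifications (the paper's experiments rely on R implementations); fit_predict stands for any one-step forecaster, and the MASE function follows the usual definition in which the forecast error is scaled by the in-sample error of the one-step naive forecast:

```python
def mase(y_true, y_pred, y_train):
    # mean absolute scaled error: forecast MAE divided by the
    # in-sample MAE of the one-step naive (random walk) forecast
    scale = sum(abs(b - a) for a, b in zip(y_train, y_train[1:])) / (len(y_train) - 1)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / scale

def prequential(series, fit_predict, start=18):
    # growing-window prequential evaluation: train on the first i
    # observations, test on the next one, then absorb it and repeat
    scores = []
    for i in range(start, len(series)):
        train, test = series[:i], series[i]
        pred = fit_predict(train)
        scores.append(abs(test - pred))
    return scores  # one error per testing observation: a learning curve

# usage: learning curve of the naive forecaster on a toy series
curve = prequential([1, 2, 4, 3, 5, 6, 8, 7] * 5, lambda history: history[-1])
```

In the actual experiments, the per-observation errors of each model feed both the MASE analysis and the per-observation ranks discussed below.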
Then, we grow the training set to 19 observations and repeat the forecasting exercise, this time with the goal of predicting the 20-th value. This process is repeated until the end of the time series. Figure 1 presents the results of the process just described. Particularly, this plot represents a learning curve for each forecasting model, where the score is the average rank across the time series, computed at each testing observation of the prequential procedure. For visualization purposes, we smooth the average rank of each model using a moving average over 50 observations. The two bold smoothed lines represent the smoothed average rank across each type of method according to the LOESS method. Finally, the vertical black line at point 144 represents the maximum sample size used in the experiments by [Makridakis et al., 2018].

[Figure 1: Average rank of each forecasting method, smoothed using a moving average of 50 periods. Results obtained for one-step ahead forecasting.]

The results depicted in this figure show a clear tendency: when only few observations are available, the statistical methods present a better performance. However, as the sample size grows, machine learning methods outperform them.

Figure 2 presents a similar analysis as Figure 1. The difference is that we now exclude the statistical method Naive2, whose poor performance biased the results toward machine learning methods. Despite this, in the experiments reported by [Makridakis et al., 2018], this method outperforms many machine learning methods for one-step-ahead forecasting. Our results confirm the conclusions drawn from the experiments presented by [Makridakis et al., 2018] (the vertical black line in our figures): namely, that statistical models outperform machine learning ones in that small-sample regime.

[Figure 2: Average rank of each forecasting method (Naive2 excluded), smoothed using a moving average of 50 periods.]

Finally, Figure 3 presents a similar analysis as before using the actual MASE values. In relative terms between both types of methods, the results are consistent with the average rank analysis. This figure suggests that both types of methods improve their MASE score as the training sample size increases.

[Figure 3: Learning curve using the MASE of each forecasting method, smoothed using a moving average of 50 periods. Results obtained for one-step ahead forecasting.]

Results by Individual Model
The average rank graphics depicted in Figure 1 show a score for each model in each of the testing observations of the prequential procedure. We average this value across the testing observations to determine the average of the average rank of each individual model. The results of this analysis are presented in Table 2. The method RBR presents the best score.
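The averaging of ranks described above can be sketched as follows, an illustrative Python fragment (the function name and data layout are our own assumptions): at each testing observation the models are ranked by their error, and the per-observation ranks are then averaged:

```python
def average_ranks(errors_by_model):
    # errors_by_model maps a model name to its list of errors,
    # one per testing observation of the prequential procedure
    models = list(errors_by_model)
    n_obs = len(next(iter(errors_by_model.values())))
    totals = dict.fromkeys(models, 0.0)
    for i in range(n_obs):
        # rank 1 goes to the model with the lowest error at observation i
        ordered = sorted(models, key=lambda m: errors_by_model[m][i])
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
    return {m: total / n_obs for m, total in totals.items()}
```

Because ranks only depend on the ordering of the errors at each observation, this comparison is insensitive to the occasional very large MASE value of a single model.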
Table 2: Average of the average rank of each model across each testing observation in the prequential procedure (one-step ahead forecasting).

RBR   ARIMA  ETS   RF    GP    Tbats  MARS  GLM   Theta  Naive2
5.11  5.26   5.39  5.39  5.39  5.45   5.47  5.54  5.73   6.28

3.5 Results for multi-step-ahead forecasting
In order to evaluate the multi-step ahead forecasting scenario, we used a similar approach to the one-step ahead setting. The difference is that, instead of predicting only the next value, we now predict the next 18 observations. To be precise, in the first iteration of the prequential procedure, each model is fit using the first 18 observations of the time series. These models were then used to make a forecast for the next 18 observations (from the 19-th to the 36-th). As before, the models were ranked according to the error of these predictions (quantified by MASE). On the second iteration, we grow the training set to 19 observations, and the forecasting exercise is repeated.

For machine learning models, we focus on an iterated (also known as recursive) approach to multi-step forecasting [Taieb et al., 2012]. Initially, a model is built for one-step-ahead forecasting. To forecast observations beyond one point (i.e., h > 1), the model uses its predictions as input variables in an iterative way. We refer to the work by [Taieb et al., 2012] for an overview and analysis of different approaches to multi-step forecasting.

Figures 4 and 5 show the average (over the 90 time series) rank of each model as we grow the training set according to the procedure we have just described. The figures are similar, but the results of the Naive2 method were excluded from the second one using the same rationale as before.

[Figure 5: Learning curve using the smoothed average rank of each forecasting method for h = 18, Naive2 excluded.]

Analyzing the results by model type, the main conclusion in this scenario is similar to the one obtained in the one-step ahead setting: according to the smoothed lines, statistical methods are only better than machine learning ones when the sample size is small. In this setting, however, the two types of methods seem to even out as the training sample size grows (provided we ignore Naive2). According to Figure 6, the MASE scores of the models in this scenario are considerably worse than in the one-step ahead setting. This is expected given the underlying increased uncertainty.

Table 3 presents the results by individual model in a similar manner to Table 2 but for multi-step forecasting. In this setting, the model with the best score is ARIMA.

Table 3: Average of the average rank of each model across each testing observation in the prequential procedure (multi-step ahead forecasting).

ARIMA  RBR   GLM   Tbats  GP    ETS   MARS  RF    Theta  Naive2
4.77   4.95  4.97  5.11   5.24  5.43  5.69  5.74  6.12   6.98

3.6 Computational Complexity
In the interest of completeness, we also include an analysis of the computational complexity of each method. We evaluate this according to the computational time spent by a model, which we define as the time a model takes to complete the prequential procedure outlined in Section 3.3. Similarly to [Makridakis et al., 2018], we define the computational complexity (CC) of a model m as follows:

    CC = Computational Time_m / Computational Time_Naive2    (7)

Essentially, we normalize the computational time of each method by the computational time of the Naive2 method. The results are presented in Figure 7 as a bar plot. The bars in the graphic are log scaled; the original value before taking the logarithm is shown within each bar.

[Figure 7: Computational complexity (CC) of each method relative to Naive2; log-scaled bars.]

From the figure, the method with the worst CC is ARIMA, followed by Tbats. The results are driven by the fact that the